Quickstart
To try SigSeekr right after installing it, you can run it on a toy dataset.
The dataset is hosted on figshare. To download and extract it, run the following command:
wget https://ndownloader.figshare.com/files/9885379 && tar xf 9885379
You should now have a folder called example-data in your present working directory. To run SigSeekr, enter the following command:
sigseekr.py -i example-data/inclusion/ -e example-data/exclusion/ -o sigseekr_output -pcr -p3
The directory specified with the -o flag can be anything: it is the name of the directory where the output files will be created.
After entering the command, you should see output similar to this:
2019-11-14 10:59:09 Creating inclusion kmer set...
2019-11-14 10:59:25 Creating exclusion kmer set...
2019-11-14 10:59:52 Subtracting exclusion kmers from inclusion kmers with cutoff 1...
2019-11-14 10:59:54 Found kmers unique to inclusion...
2019-11-14 10:59:57 Generating contiguous sequences from inclusion kmers...
2019-11-14 11:00:41 Generating PCR info...
2019-11-14 11:00:43 Finding amplicons of size 200...
2019-11-14 11:01:26 Running Primer3 on potential amplicons...
2019-11-14 11:01:31 SigSeekr run complete!
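The "Subtracting exclusion kmers from inclusion kmers" step in the log can be pictured with a small set-based sketch. This is only a conceptual illustration on toy strings, not SigSeekr's actual implementation: the function names are hypothetical, reverse complements are ignored, and requiring a kmer to occur in every inclusion sequence is an assumption made for the example.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_inclusion_kmers(inclusion, exclusion, k=31):
    """Kmers present in every inclusion sequence and in no exclusion sequence.

    Hypothetical helper for illustration only; SigSeekr itself works on
    folders of FASTA/FASTQ files and may define uniqueness differently.
    """
    if not inclusion:
        return set()
    # Keep only kmers shared by all inclusion sequences (assumption).
    shared = set.intersection(*(kmers(seq, k) for seq in inclusion))
    # Remove any kmer that also appears in an exclusion sequence.
    for seq in exclusion:
        shared -= kmers(seq, k)
    return shared


# Toy example with a short kmer size so the sets stay readable.
toy_inclusion = ["ACGTACGGA", "TTACGGAC"]
toy_exclusion = ["GTACGT"]
print(sorted(unique_inclusion_kmers(toy_inclusion, toy_exclusion, k=4)))
```

Real genomes are far too large for this naive approach to be practical, which is why the example sticks to short strings and k=4.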
The sigseekr_output folder should have six files in it:
amplicons.csv: list of primers predicted by primer3 and the sizes of their products
inclusion_kmers.fasta: lists all the kmers that are unique to the inclusion set
sigseekr_log.txt: logfile of captured STDOUT and STDERR strings
confirmed_amplicons_200.fasta: FASTA-formatted file of amplicons present in all inclusion genomes (200 refers to the amplicon size specified in the arguments. The default is 200; if multiple sizes are desired, multiple versions of this file will be created.)
potential_pcr_200.fasta: all potential amplicons based on user-specified amplicon length. This file will be further refined by the filtering of amplicon sequences present in the exclusion genomes
sigseekr_result.fasta: regions that unique kmers span
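To confirm that a run produced everything listed above, a short check along these lines may help. It is a sketch, not part of SigSeekr: the helper names are hypothetical, and it assumes that both confirmed_amplicons and potential_pcr files are written once per requested amplicon size, following the _200 naming shown here.

```python
from pathlib import Path


def expected_output_files(amplicon_sizes=(200,)):
    """File names expected from a run with -pcr and -p3 enabled."""
    names = [
        "amplicons.csv",
        "inclusion_kmers.fasta",
        "sigseekr_log.txt",
        "sigseekr_result.fasta",
    ]
    # Assumption: one confirmed/potential file per requested amplicon size.
    for size in amplicon_sizes:
        names.append(f"confirmed_amplicons_{size}.fasta")
        names.append(f"potential_pcr_{size}.fasta")
    return names


def missing_output_files(output_folder, amplicon_sizes=(200,)):
    """Return the expected files that are not present in output_folder."""
    folder = Path(output_folder)
    return [
        name
        for name in expected_output_files(amplicon_sizes)
        if not (folder / name).is_file()
    ]
```

For the quickstart run, `missing_output_files("sigseekr_output")` should return an empty list if all files were created.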
The sigseekr_result.fasta created by running SigSeekr on this toy dataset will have one unique region.
>contig1_sequence1
AACAGGCGACAGGCAGCATCACTAGCTACTA
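To inspect the result file programmatically, a minimal FASTA reader such as the following should be enough. It is a sketch using only the standard library; `read_fasta` is a hypothetical helper, not a SigSeekr function. For the toy dataset, the single region shown above is 31 bases long, which matches the default kmer size.

```python
def read_fasta(path):
    """Parse a FASTA file into a dict mapping header -> sequence."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:]
                records[header] = []
            elif header is not None:
                # Sequence lines may be wrapped; collect and join them later.
                records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```

Usage, assuming the quickstart output folder name:

```python
for name, sequence in read_fasta("sigseekr_output/sigseekr_result.fasta").items():
    print(name, len(sequence))
```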
Detailed Usage
Detailed usage options can be found by typing sigseekr.py --help, which will give the following output.
Further details on each option can be found below.
usage: sigseekr.py [-h] -i INCLUSION -e EXCLUSION -o OUTPUT_FOLDER
[-s KMER_SIZE] [-t THREADS] [-pcr] [-k]
[-p PLASMID_FILTERING] [-l] [-p3]
[-a AMPLICON_SIZE [AMPLICON_SIZE ...]]
[-m MAX_POTENTIAL_AMPLICONS]
optional arguments:
-h, --help show this help message and exit
-i INCLUSION, --inclusion INCLUSION
Path to folder containing genome(s) you want signature sequences for. Genomes can be in FASTA
or FASTQ format. FASTA-formatted files should be uncompressed, FASTQ-formatted files can be
gzip-compressed or uncompressed.
-e EXCLUSION, --exclusion EXCLUSION
Path to folder containing exclusion genome(s) - those you do not want signature sequences for.
Genomes can be in FASTA or FASTQ format. FASTA-formatted files should be uncompressed,
FASTQ-formatted files can be gzip-compressed or uncompressed.
-o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
Path to folder where you want to store output files. Folder will be created if it does not
exist.
-s KMER_SIZE, --kmer_size KMER_SIZE
Kmer size used to search for sequences unique to inclusion. Default 31.
No idea how changing this affects results. TO BE INVESTIGATED.
-t THREADS, --threads THREADS
Number of threads to run analysis on. Defaults to number of cores on your machine.
-pcr, --pcr Enable to filter out inclusion kmers that have close relatives in exclusion kmers.
-k, --keep_tmpfiles If enabled, will not clean up a bunch of (fairly) useless files at the end of a run.
-p PLASMID_FILTERING, --plasmid_filtering PLASMID_FILTERING
To ensure unique sequences are not plasmid-borne, a FASTA-formatted database can be provided
with this argument. Any unique kmers that are in the plasmid database will be filtered out.
-l, --low_memory Activate this flag to cause plasmid filtering to use substantially less RAM (and go faster),
at the cost of some sensitivity.
-p3, --primer3 If enabled, will run primer3 on your potential amplicons and generate a list of primers and the
sizes of their products. This output will be found in a file called amplicons.csv in the output
directory specified.
-a AMPLICON_SIZE [AMPLICON_SIZE ...], --amplicon_size AMPLICON_SIZE [AMPLICON_SIZE ...]
Desired size for PCR amplicons. Default 200. If you want to find more than one amplicon size,
enter multiple, separated by spaces.
-m MAX_POTENTIAL_AMPLICONS, --max_potential_amplicons MAX_POTENTIAL_AMPLICONS
If inclusion sequences are very different from exclusion sequences, amplicon generation can take
forever. Set the number of potential amplicons with this option (default 200)
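When driving SigSeekr from a script, it can be convenient to assemble the command line from the options above. The following is a sketch rather than an official wrapper: `build_sigseekr_command` is a hypothetical helper, it covers only a subset of the flags, and it assumes sigseekr.py is on your PATH.

```python
import subprocess


def build_sigseekr_command(inclusion, exclusion, output_folder,
                           amplicon_sizes=None, threads=None,
                           pcr=False, primer3=False):
    """Build an argument list for sigseekr.py from the documented flags."""
    command = [
        "sigseekr.py",
        "-i", str(inclusion),
        "-e", str(exclusion),
        "-o", str(output_folder),
    ]
    if threads is not None:
        command += ["-t", str(threads)]
    if pcr:
        command.append("-pcr")
    if primer3:
        command.append("-p3")
    if amplicon_sizes:
        # -a accepts one or more sizes separated by spaces.
        command.append("-a")
        command += [str(size) for size in amplicon_sizes]
    return command


command = build_sigseekr_command(
    "example-data/inclusion/",
    "example-data/exclusion/",
    "sigseekr_output",
    amplicon_sizes=[200, 300],
    pcr=True,
    primer3=True,
)
print(" ".join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems if folder names contain spaces.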