Quickstart

To try SigSeekr right after installing it, you can run it on a small toy dataset.

This dataset is hosted on figshare. To download it, run the following command:

You should now have a folder called example-data in your present working directory. To run SigSeekr, enter the following command:

The directory specified with the -o flag can have any name. It is where the output files are written, and it will be created if it does not already exist. After entering the command, you should see output similar to this:

2019-11-14 10:59:09 Creating inclusion kmer set...
2019-11-14 10:59:25 Creating exclusion kmer set...
2019-11-14 10:59:52 Subtracting exclusion kmers from inclusion kmers with cutoff 1...
2019-11-14 10:59:54 Found kmers unique to inclusion...
2019-11-14 10:59:57 Generating contiguous sequences from inclusion kmers...
2019-11-14 11:00:41 Generating PCR info...
2019-11-14 11:00:43 Finding amplicons of size 200...
2019-11-14 11:01:26 Running Primer3 on potential amplicons...
2019-11-14 11:01:31 SigSeekr run complete!
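
The "Subtracting exclusion kmers from inclusion kmers" step in the log above can be pictured with a toy sketch. This is an illustration of the general idea only, not SigSeekr's actual implementation: the function names are hypothetical, it assumes a signature kmer must occur in every inclusion sequence, and it ignores reverse complements and the cutoff value shown in the log.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_inclusion_kmers(inclusion_seqs, exclusion_seqs, k=31):
    """Kmers shared by all inclusion sequences and absent from every exclusion sequence.

    Hypothetical helper for illustration; assumes plain string sequences.
    """
    # Kmers common to every inclusion sequence (an assumption about the method).
    shared = set.intersection(*(kmers(seq, k) for seq in inclusion_seqs))
    # Union of kmers seen in any exclusion sequence.
    excluded = set()
    for seq in exclusion_seqs:
        excluded |= kmers(seq, k)
    # Set subtraction leaves the kmers unique to the inclusion group.
    return shared - excluded


# Tiny example with k=4 so the result is easy to check by hand.
inclusion = ["ACGTACGGA", "TTACGTACGG"]
exclusion = ["ACGTTTT"]
print(sorted(unique_inclusion_kmers(inclusion, exclusion, k=4)))
```

Real genomes are far too large for in-memory Python sets like these to be practical, so treat this purely as a mental model of what the log messages describe.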

The sigseekr_output folder should have five files in it:

The sigseekr_result.fasta file created by running SigSeekr on this toy dataset will contain one unique region:

>contig1_sequence1
AACAGGCGACAGGCAGCATCACTAGCTACTA
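
If you want to inspect the result file from a script, a minimal FASTA reader is enough. The sketch below is not part of SigSeekr; `parse_fasta` is a hypothetical helper that accepts any iterable of lines, such as an open file handle.

```python
def parse_fasta(lines):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequences may be wrapped over several lines; join them.
            records[header] += line
    return records


# The record shown above, supplied as a list of lines.
example = [">contig1_sequence1", "AACAGGCGACAGGCAGCATCACTAGCTACTA"]
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```

To read the real file, open it and pass the handle, for example `with open("sigseekr_output/sigseekr_result.fasta") as handle: records = parse_fasta(handle)`, adjusting the path to whatever you passed to -o.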

Detailed Usage

To see all usage options, run sigseekr.py --help, which prints the following output. Further details on each option can be found below.

usage: sigseekr.py [-h] -i INCLUSION -e EXCLUSION -o OUTPUT_FOLDER
                   [-s KMER_SIZE] [-t THREADS] [-pcr] [-k]
                   [-p PLASMID_FILTERING] [-l] [-p3]
                   [-a AMPLICON_SIZE [AMPLICON_SIZE ...]]
                   [-m MAX_POTENTIAL_AMPLICONS]

optional arguments:
  -h, --help            show this help message and exit
  -i INCLUSION, --inclusion INCLUSION
                        Path to folder containing genome(s) you want signature sequences for. Genomes can be in FASTA 
                        or FASTQ format. FASTA-formatted files should be uncompressed, FASTQ-formatted files can be 
                        gzip-compressed or uncompressed.
  -e EXCLUSION, --exclusion EXCLUSION
                        Path to folder containing exclusion genome(s) - those you do not want signature sequences for. 
                        Genomes can be in FASTA or FASTQ format. FASTA-formatted files should be uncompressed, 
                        FASTQ-formatted files can be gzip-compressed or uncompressed.
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Path to folder where you want to store output files. Folder will be created if it does not 
                        exist.
  -s KMER_SIZE, --kmer_size KMER_SIZE
                        Kmer size used to search for sequences unique to inclusion. Default 31. 
                        No idea how changing this affects results. TO BE INVESTIGATED.
  -t THREADS, --threads THREADS
                        Number of threads to run analysis on. Defaults to number of cores on your machine.
  -pcr, --pcr           Enable to filter out inclusion kmers that have close relatives in exclusion kmers.
  -k, --keep_tmpfiles   If enabled, will not clean up a bunch of (fairly) useless files at the end of a run.
  -p PLASMID_FILTERING, --plasmid_filtering PLASMID_FILTERING
                        To ensure unique sequences are not plasmid-borne, a FASTA-formatted database can be provided 
                        with this argument. Any unique kmers that are in the plasmid database will be filtered out.
  -l, --low_memory      Activate this flag to cause plasmid filtering to use substantially less RAM (and go faster), 
                        at the cost of some sensitivity.
  -p3, --primer3        If enabled, will run primer3 on your potential amplicons and generate a list of primers and the 
                        sizes of their products. This output will be found in a file called amplicons.csv in the output 
                        directory specified.
  -a AMPLICON_SIZE [AMPLICON_SIZE ...], --amplicon_size AMPLICON_SIZE [AMPLICON_SIZE ...]
                        Desired size for PCR amplicons. Default 200. If you want to find more than one amplicon size, 
                        enter multiple, separated by spaces.
  -m MAX_POTENTIAL_AMPLICONS, --max_potential_amplicons MAX_POTENTIAL_AMPLICONS
                        If inclusion sequences are very different from exclusion sequences, amplicon generation can take 
                        forever. Set the number of potential amplicons with this option (default 200)
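
When calling SigSeekr from another script, it can help to assemble the argument list programmatically. The sketch below uses only flags listed in the help output above; the helper names and the folder names in the example are hypothetical, and it assumes sigseekr.py is on your PATH.

```python
import subprocess


def build_sigseekr_command(inclusion, exclusion, output_folder,
                           amplicon_sizes=None, threads=None, primer3=False):
    """Assemble an argument list for sigseekr.py from the documented flags."""
    command = ["sigseekr.py", "-i", inclusion, "-e", exclusion, "-o", output_folder]
    if threads is not None:
        command += ["-t", str(threads)]
    if amplicon_sizes:
        # -a accepts one or more sizes, separated by spaces.
        command += ["-a"] + [str(size) for size in amplicon_sizes]
    if primer3:
        command.append("-p3")
    return command


def run_sigseekr(command):
    """Run the command; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)


# Hypothetical folder names, for illustration only.
command = build_sigseekr_command("inclusion_genomes", "exclusion_genomes",
                                 "sigseekr_output", amplicon_sizes=[200, 300],
                                 threads=4, primer3=True)
print(" ".join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when folder names contain spaces. Call `run_sigseekr(command)` once the paths point at real data.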