SequenceExtractor
What does it do?
SequenceExtractor extracts the sequence data from a supplied contig at designated start and stop positions
Check out the source code: github
How do I use it?
Subject
In the Subject
field, put sequence_extractor
. Spelling counts, but case sensitivity does not.
Description
There are two ways to use SequenceExtractor
You must provide a list of (one per line):
SEQID;contig name;start coordinates; stop coordinates
e.g.
2019-SEQ-0848;Contig_1_149.079_Circ;1;50
2019-SEQ-1019;Contig_1_388.862_Circ;14;77
2019-SEQ-0848;Contig_2_392.879_Circ;2;39
2019-SEQ-1019;Contig_3_52.4575;5;22
Notes:
- If you want an entire contig, supply the start and stop positions as 0
- The stop base can be lower than the start base. The script will change them, so that the start base is always lower
The above example performs four extractions:
- bases
1-50
fromContig_1_149.079_Circ
in the assembly for SEQID2019-SEQ-0848
- bases
14-77
fromContig_1_388.862_Circ
in the assembly for SEQID2019-SEQ-1019
- bases
2-39
fromContig_2_392.879_Circ
in the assembly for SEQID2019-SEQ-0848
- bases
5-22
fromContig_3_52.4575
in the assembly for SEQID2019-SEQ-1019
This list can be supplied in two possible ways:
- In the Description
- In an attached file (the name of the file doesn't matter, but please try to avoid empty lines)
You cannot supply the list both ways; if you provide an attachment, any entries in the Description will be ignored.
Example
For an example SequenceExtractor analysis with the fasta details provided as an attachment, see issue 28104.
This example has SEQIDs in the Description, but it's not necessary.
For an example SequenceExtractor analysis with the fasta details provided in the description, see issue 28181.
Interpreting Results
SequenceExtractor will output a FASTA-formatted file, extracted_sequences.fasta
,that should look like the following if you used the example list above:
>2019-SEQ-0848_Contig_1_149.079_Circ_1_50
AAAAAAAACAAATATATACTTTGATGATAACTTTCTAAATATCTACAAAA
>2019-SEQ-0848_Contig_2_392.879_Circ_2_39
AAAAAAACAATAAAAAACACCGCAAAAATGGATTGTTA
>2019-SEQ-1019_Contig_1_388.862_Circ_14_77
CAATAGTCTTATTTCCATTCAGGTATTCAATATTAATATTCATACGTTAATCCGATTTAT
CCTT
>2019-SEQ-1019_Contig_3_52.4575_5_22
ATATCATCAGATGGCTGC
How long does it take?
SequenceExtractor is pretty fast - expect it to take a few seconds per entry you give it.
What can go wrong?
- Requested SEQIDs are not available. If we can't find some of the SEQIDs that you request, you will get a warning message informing you of it.
- The contig name must be provided exactly as it is written in the assembly.
- The start or stop position provided is invalid e.g. if you ask for positions 789-882 in a contig that is only 500bp long, nothing will be returned.