Table of Contents
Translate and Reduce Alleles
This script translates allele files from Enterobase in nucleotide format to amino acid, performs content and length checks, and removes duplicates.
In order for a translated allele to pass content and length checks, it must:
- Start with a
Methionine
residue - Pass a minimum length threshold after trimming:
- The length thresholds are:
- ECs2973 (stx2B): 90 amino acid residues
- ECs2974 (stx2A): 316 amino acid residues
- ECs1205 (stx1A): 320 amino acid residues
- ECs1206 (stx1B): 88 amino acid residues
- The length thresholds are:
- Not be a duplicate of an allele already in the reduced database
Inputs
- Nucleotide allele files from Enterobase in FASTA format. Download instructions.
- Reduced profile file prepared by
profile_reduce
. Note that the allele files must contain sequence for the same genes that were used for the reduction of the profile, e.g.:- ECs2973
- ECs2974
Running the Script
stec.py allele_translate_reduce -a /path/to/allele_folder -p /path/to/profile_file -r /path/to/output/folder/aa_profile -t /path/to/output/folder/aa_alleles
An example with the nucleotide allele folder nt_alleles
, the profile file profiles.txt
in the nt_profile
folder, the desired amino acid profile output folder aa_profile
, and the desired amino acid allele output folder aa_alleles
all in the current working directory:
stec.py allele_translate_reduce -a nt_alleles -p nt_profile/profile.txt -r aa_profile -t aa_alleles
Usage
usage: stec.py allele_translate_reduce [-h] [-version] [-v verbosity]
[-a allele_path] [-p profile_file]
[-r report_path] [-t translated_path]
Translate allele files in nucleotide format to amino acid. Remove duplicates. Keep notes
optional arguments:
-h, --help show this help message and exit
-version, --version show program's version number and exit
-v verbosity, --verbosity verbosity
Set the logging level. Options are debug, info, warning, error, and critical. Default is info.
-a allele_path, --allele_path allele_path
Specify name and path of folder containing allele files. If not provided, the nt_alleles folder in the current working directory will be used by default
-p profile_file, --profile_file profile_file
Optionally specify name and path of profile file. Parse the nucleic acid profile, and create the corresponding reduced amino acid profile
-r report_path, --report_path report_path
Specify the name and path of the folder into which outputs are to be placed. If not provided, the aa_profile folder in the current working directory will be used
-t translated_path, --translated_path translated_path
Specify the name and path of the folder into which alleles are to be placed. If not provided, the aa_alleles folder in the current working directory will be used
Outputs
These represent the folder structure of the directory containing the outputs
aa_alleles
notes
gene_name_notes.txt
: text file containing the nucleotide allele, the corresponding amino acid allele, and any notes (whether it is a duplicate, trimmed, or filtered)
gene_name.fasta
: FASTA-formatted text file containing sequences of all amino acid alleles that passed quality/content filtersgene_name_filtered.txt
: FASTA-formatted text file containing sequences of all amino acid alleles that failed quality/content filters
aa_profile
profile
profile.txt
: text file containing all unique amino acid profilesreducing_notes.txt
: text file containing all amino acid sequence types, the sequence type following duplicate reduction, and any notes
aa_full_profile.txt
: text file containing all amino acid profiles before duplicate removalaa_nt_profile_links.tsv
: TSV file containing every amino acid sequence type and the corresponding nucleotide sequence type(s)gene_names.txt
: text file containing the extracted gene names from the analysisprofile.txt
: copy ofprofile.txt
from theprofile
folder. Allows for allele discovery without manually moving the file
nt_alleles
gene_name.fasta
: FASTA-formatted text file containing sequences of all nucleotide alleles that passed quality/content filtersgene_name_filtered.txt
: FASTA-formatted text file containing sequences of all nucleotide alleles that failed quality/content filters
nt_profile
profile.txt
: text file containing all unique nucleotide profiles