SchemaAnnotation - Annotate schemas

Description

The SchemaAnnotation module annotates the loci in a schema. It retrieves relevant annotation data, providing functional context when evaluating a schema and reviewing the recommendations made by other modules.

Features

Annotate schemas.
Join different annotation files.
Configurable parameters for the annotation process.
Support for parallel processing using multiple CPUs.
Option to skip intermediate file cleanup after running the module.

Dependencies

BLAST (see the BLAST manual)

Usage

The SchemaAnnotation module can be used as follows:

SR SchemaAnnotation -s /path/to/schema -o /path/to/output -ao uniprot-proteomes -pt path/to/proteome/table

Command-Line Arguments

-s, --schema-directory
    (Optional) Path to the schema's directory. Required for the 'uniprot-proteomes' and 'genbank' options.

-o, --output-directory
    (Required) Path to the output directory where to save the files.

-ao, --annotation-options
    (Required) Annotation options to run.
    Choices: uniprot-proteomes, genbank, match-schemas, consolidate.

-pt, --proteome-table
    (Optional) TSV file downloaded from UniProt that contains the list of proteomes.
    Should be used with --annotation-options uniprot-proteomes.

-gf, --genbank-files
    (Optional) Path to the directory that contains Genbank files with annotations to extract.
    Each genbank file in this folder should be named after the ID it represents.
    Should be used with --annotation-options genbank.

-ca, --chewie-annotations
    (Optional) File with the results from chewBBACA UniprotFinder module.

-ms, --matched-schema
    (Optional) Path to the tsv output file from the MatchSchemas module (Match_Schemas_Results.tsv).

-ma, --match-annotations
    (Optional) Path to the annotations file of one of the schemas used in the MatchSchemas module. This argument is needed by the Match Schemas submodule.
    Should be used with --annotation-options match-schemas and --matched-schema.

-cn, --consolidate-annotations
    (Optional) 2 or more paths to the files with the annotations that are to be consolidated.

-cc, --consolidate-cleanup
    (Optional) For the consolidate option, determines whether duplicates are removed from the final files. Advised when consolidating Match Schemas annotations.

--bsr
    (Optional) Minimum BSR value to consider aligned alleles as alleles for the same locus. This argument is optional for the Match Schemas submodule.
    Default: 0.6

-t, --threads
    (Optional) Number of threads for concurrent download.
    Default: 1

-c, --cpu
    (Optional) Number of CPU cores for multiprocessing.
    Default: 1

-r, --retry
    (Optional) Maximum number of retries when a download fails.
    Default: 7

-tt, --translation-table
    (Optional) Translation table to use for the CDS translation.
    Default: 11

-rm, --run-mode
    (Optional) Mode to run the module.
    Choices: reps, alleles.
    Default: reps

-egtc, --extra-genbank-table-columns
    (Optional) List of columns to add to annotation file.
    Default: []

-gia, --genbank-ids-to-add
    (Optional) List of GenBank IDs to add to final results.
    Default: []

-pia, --proteome-ids-to-add
    (Optional) List of Proteome IDs to add to final results.
    Default: []

--nocleanup
    (Optional) Flag to indicate whether to skip cleanup after running the module.

--debug
    (Optional) Flag to indicate whether to run the module in debug mode.
    Default: False

--logger
    (Optional) Path to the logger file.
    Default: None

Note

Always verify that the translation table (argument -tt) being used is the correct one for the species.

The proteome-table argument should be a TSV file with IDs for UniProt proteomes. The proteome list can be downloaded directly from UniProt. If it is downloaded as a ZIP archive, unzip it before passing the file to the SchemaAnnotation module. The genbank-files argument should be a folder with gbff files.
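Before running the genbank option, it can help to list which IDs the input folder will contribute. The following is a minimal sketch, not part of the module: the function name genbank_ids and the default ".gbff" suffix filter are assumptions made for illustration, and the ID is taken to be the file name without its extension.

```python
from pathlib import Path


def genbank_ids(folder, suffix=".gbff"):
    """Map each GenBank file ID (file name without extension) to its path.

    Assumption: every file is named after the ID it represents, as the
    --genbank-files description requires. Files with other suffixes are skipped.
    """
    ids = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() == suffix:
            ids[path.stem] = path
    return ids
```

Calling genbank_ids on the folder passed to -gf returns a dictionary whose keys can be compared against the IDs expected in the final results.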

Important

With the consolidate option it is important to make sure that the loci names in the different files match. Otherwise, the algorithm will not be able to link the annotations in the various files.
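A quick check of the loci names before consolidating can save a failed run. The sketch below is illustrative only: read_loci and unmatched_loci are hypothetical helpers, and the column name "Locus" is an assumption that should be replaced with the actual header of the locus column in each file.

```python
import csv


def read_loci(path, column="Locus"):
    """Collect the loci names found in one tab-separated annotation file."""
    with open(path, newline="") as handle:
        return {row[column] for row in csv.DictReader(handle, delimiter="\t")}


def unmatched_loci(loci_a, loci_b):
    """Return, sorted, the loci that appear in only one of the two sets."""
    return sorted(loci_a ^ loci_b)
```

An empty list from unmatched_loci means both files name exactly the same loci. A non-empty list shows the names that cannot be linked between the two files.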

Algorithm Explanation

The SchemaAnnotation module has four annotation options: GenBank files, UniProt proteomes, Match Schemas, and Consolidate. The following is the flowchart for the SchemaAnnotation module:

The SchemaAnnotation module can annotate by comparing the schema against UniProt proteomes:

SchemaAnnotation UniProt Proteomes Flowchart

For this process, the annotations are first separated into Swiss-Prot and TrEMBL records and then processed. BLASTp is then used to match the protein sequences from the input schema against the proteomes from UniProt.

The format of the BLASTp output files is as follows:

qseqid sseqid qlen slen qstart qend sstart send length score gaps pident
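The BLASTp result files kept with --nocleanup can be inspected with a small parser. This is a sketch under stated assumptions: the files are taken to be tab-separated with no header line (the .tsv extension suggests tabs, but this is not confirmed here), and parse_blast_line is a hypothetical helper name.

```python
# Column order as listed in the format description above.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "qlen", "slen", "qstart", "qend",
    "sstart", "send", "length", "score", "gaps", "pident",
]
INT_COLUMNS = ("qlen", "slen", "qstart", "qend", "sstart", "send", "length", "gaps")
FLOAT_COLUMNS = ("score", "pident")


def parse_blast_line(line):
    """Turn one tab-separated BLASTp result line into a typed dictionary."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(BLAST_COLUMNS):
        raise ValueError(f"expected {len(BLAST_COLUMNS)} fields, got {len(values)}")
    record = dict(zip(BLAST_COLUMNS, values))
    for name in INT_COLUMNS:
        record[name] = int(record[name])
    for name in FLOAT_COLUMNS:
        record[name] = float(record[name])
    return record
```

Each returned dictionary exposes the fields by name, for example record["score"] for the alignment score.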

The uniprot_annotations.tsv file includes the annotations determined based on the Swiss-Prot and TrEMBL records.

The SchemaAnnotation module can also annotate based on the annotations in GenBank files:

BLASTp is used to compare the schema loci against the records extracted from the GenBank files. The final output file includes the best match found for each locus.
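The "best match" reported for each locus is ranked by BLAST Score Ratio (BSR), the alignment score divided by the score of the query aligned against itself. The following sketch shows that selection under assumptions: the function names and the (locus, subject, score) input shape are hypothetical, and applying a minimum BSR here simply mirrors the --bsr default rather than documenting the module's internal logic.

```python
def blast_score_ratio(score, self_score):
    """BSR: alignment score divided by the query's self-alignment score."""
    if self_score <= 0:
        raise ValueError("self score must be positive")
    return score / self_score


def best_matches(hits, self_scores, min_bsr=0.6):
    """Keep, for each locus, the subject with the highest BSR.

    hits: iterable of (locus, subject, score) tuples (assumed shape).
    self_scores: dict mapping each locus to its self-alignment score.
    Hits with a BSR below min_bsr are ignored.
    """
    best = {}
    for locus, subject, score in hits:
        bsr = blast_score_ratio(score, self_scores[locus])
        if bsr < min_bsr:
            continue
        if locus not in best or bsr > best[locus][1]:
            best[locus] = (subject, bsr)
    return best
```

A locus whose hits all fall below the threshold is absent from the returned dictionary.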

For the options Match Schemas and Consolidate, the process merges the input files based on the IDs in the locus columns. For these modes it is not necessary to pass the path to a schema. The IDs of the loci should be in one of the first two columns of the input files.
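Merging on the locus ID can be pictured as an outer join keyed by locus. The sketch below is an illustration, not the module's implementation: merge_by_locus is a hypothetical helper, the input is assumed to be already loaded as dictionaries, column names are assumed not to clash, and filling gaps with "NA" is an assumption borrowed from the sample reports.

```python
def merge_by_locus(tables, missing="NA"):
    """Merge annotation tables on their locus IDs.

    tables: list of dicts, each mapping locus -> {column: value}.
    Returns a dict mapping locus -> merged row containing every column
    seen in any table. Absent values are filled with `missing`.
    """
    columns = []
    loci = []
    for table in tables:
        for locus, row in table.items():
            if locus not in loci:
                loci.append(locus)
            for name in row:
                if name not in columns:
                    columns.append(name)

    merged = {}
    for locus in loci:
        row = {name: missing for name in columns}
        for table in tables:
            row.update(table.get(locus, {}))
        merged[locus] = row
    return merged
```

Loci present in only one input still appear in the result, with the columns from the other inputs filled by the placeholder.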

Outputs

The directory structure of the output directory created by the SchemaAnnotation module is shown below.

OutputFolderName
├── # --nocleanup -ao genbank
├── genbank_annotations.tsv
├── genbank_annotations
|   ├── genbank_annotations.tsv
│   ├── best_annotations_all_genbank_files
│   │   └── best_genbank_annotations.tsv
│   ├── best_annotations_per_genbank_file
│   │   ├── genbank_file_x_annotations.tsv
│   │   ├── genbank_file_y_annotations.tsv
│   │   └── ...
│   ├── blast_processing
│   │   ├── selected_genbank_proteins.fasta
│   │   ├── blast_db
│   │   │   ├── blast_db_protein.pdb
│   │   │   ├── blast_db_protein.phr
│   │   │   ├── blast_db_protein.pin
│   │   │   ├── blast_db_protein.pog
│   │   │   ├── blast_db_protein.pos
│   │   │   ├── blast_db_protein.pot
│   │   │   ├── blast_db_protein.psq
│   │   │   ├── blast_db_protein.ptf
│   │   │   └── blast_db_protein.pto
│   │   ├── blastp_results
│   │   │   ├── blast_results_x.tsv
│   │   │   ├── blast_results_y.tsv
│   │   │   └── ...
│   │   └── self_score_folder
│   │       ├── blast_results_x.tsv
│   │       ├── blast_results_y.tsv
│   │       └── ...
│   └── reps_translations
│       ├── x_translation.fasta
│       ├── y_translation.fasta
│       └── ...
|
├── # --nocleanup -ao match-schemas
├── matched_annotations.tsv
|
├── # --nocleanup -ao uniprot-proteomes
├── uniprot_annotations.tsv
├── uniprot_annotations
|   ├── best_proteomes_annotations_swiss_prot.tsv
|   ├── best_proteomes_annotations_trEMBL.tsv
|   ├── proteome_matcher_output
|   │   ├── best_annotations_per_proteome_file
|   │   │   ├── Swiss-Prot
|   │   │   │   ├── proteome_file_x_Swiss-Prot_annotations.tsv
|   │   │   │   ├── proteome_file_y_Swiss-Prot_annotations.tsv
|   │   │   │   └── ...
|   │   │   └── TrEMBL
|   │   │       ├── proteome_file_x_TrEMBL_annotations.tsv
|   │   │       ├── proteome_file_y_TrEMBL_annotations.tsv
|   │   │       └── ...
|   │   ├── reps_translations
|   │   │   ├── x_translation.fasta
|   │   │   ├── y_translation.fasta
|   │   │   └── ...
|   │   ├── self_score_folder
|   │   │   ├── blast_results_x.tsv
|   │   │   ├── blast_results_y.tsv
|   │   │   └── ...
|   |   ├── swiss_prots_processing
|   |   │   ├── blast_processing
|   |   │   │   ├── blast_db
|   |   │   │   │   ├── blast_db_protein.pdb
|   |   │   │   │   ├── blast_db_protein.phr
|   |   │   │   │   ├── blast_db_protein.pin
|   |   │   │   │   ├── blast_db_protein.pog
|   |   │   │   │   ├── blast_db_protein.pos
|   |   │   │   │   ├── blast_db_protein.pot
|   |   │   │   │   ├── blast_db_protein.psq
|   |   │   │   │   ├── blast_db_protein.ptf
|   |   │   │   │   └── blast_db_protein.pto
|   |   │   │   ├── blastp_results
|   |   │   │   │   ├── blast_results_x.tsv
|   |   │   │   │   ├── blast_results_y.tsv
|   |   │   │   │   └── ...
|   |   │   │   └── swiss_prots.fasta
|   |   │   └── swiss_prots_annotations.tsv
|   |   └── trembl_prots_processing
|   |       ├── blast_processing
|   |       │   ├── blast_db
|   |       │   │   ├── blast_db_protein.pdb
|   |       │   │   ├── blast_db_protein.phr
|   |       │   │   ├── blast_db_protein.pin
|   |       │   │   ├── blast_db_protein.pog
|   |       │   │   ├── blast_db_protein.pos
|   |       │   │   ├── blast_db_protein.pot
|   |       │   │   ├── blast_db_protein.psq
|   |       │   │   ├── blast_db_protein.ptf
|   |       │   │   └── blast_db_protein.pto
|   |       │   ├── blastp_results
|   |       │   │   ├── blast_results_x.tsv
|   |       │   │   ├── blast_results_y.tsv
|   |       │   │   └── ...
|   |       │   └── trembl_prots.fasta
|   |       └── trembl_prots_annotations.tsv
|   ├── Proteomes
|   |   ├── Proteome_x.fasta.gz
|   |   ├── Proteome_y.fasta.gz
|   |   └── ...
|   └── split_proteomes
|       ├── prots_descriptions
|       ├── swiss_prots.fasta
|       └── trembl_prots.fasta
├── # --nocleanup -ao consolidate
└── consolidated_annotations.tsv

Output files and folders description for the SchemaAnnotation module

Report files description

genbank_annotations.tsv
Locus	Genbank_ID	Genbank_gene_name	Genbank_product	Genbank_BSR
x	AMD32818.1	NA	cysteine desulfurase	0.9932627526467758
y	AMD31754.1	NA	histidine triad protein	0.9891156462585035
z	AMD31913.1	rplS	50S ribosomal protein L19	1.0
…

matched_annotations.tsv if made with uniprot annotations
Query	Subject	BSR	Process	matched_Proteome_ID	matched_Proteome_product	matched_Proteome_gene_name	matched_Proteome_BSR	matched_Proteome_ID_best_proteomes_annotations_swiss_prot	matched_Proteome_product_best_proteomes_annotations_swiss_prot	matched_Proteome_gene_name_best_proteomes_annotations_swiss_prot	matched_Proteome_BSR_best_proteomes_annotations_swiss_prot
x	a	1.0	hashes_dna	tr|X5K2G1|X5K2G1_STRAG	dITP/XTP pyrophosphatase	rdgB	1.0	sp|Q8DY93|IXTPA_STRA5	dITP/XTP pyrophosphatase	SAG1599	1.0
y	b	1.0	hashes_dna	tr|A0AAW3HT12|A0AAW3HT12_STRAG	tRNA N6-adenosine threonylcarbamoyltransferase	tsaD	1.0	sp|Q8DXT9|TSAD_STRA5	tRNA N6-adenosine threonylcarbamoyltransferase	tsaD	1.0
z	c	1.0	hashes_dna	tr|A0AAE9TM16|A0AAE9TM16_STRAG	PTS fructose transporter subunit IIC	NCTC8184_00378	1.0
…

Columns description:

Locus: The locus from the query schema.
(matched_)Proteome_ID: The identifier of the TrEMBL protein.
(matched_)Proteome_product: The product of the TrEMBL protein.
(matched_)Proteome_gene_name: The gene name of the TrEMBL protein.
(matched_)Proteome_BSR: The BLAST Score Ratio for the TrEMBL protein.
(matched_)Proteome_ID_best_proteomes_annotations_swiss_prot: The identifier of the Swiss-Prot protein.
(matched_)Proteome_product_best_proteomes_annotations_swiss_prot: The product of the Swiss-Prot protein.
(matched_)Proteome_gene_name_best_proteomes_annotations_swiss_prot: The gene name of the Swiss-Prot protein.
(matched_)Proteome_BSR_best_proteomes_annotations_swiss_prot: The BLAST Score Ratio for the Swiss-Prot protein.
Genbank_ID: The identifier of the GenBank record.
Genbank_product: The product of the GenBank record.
Genbank_gene_name: The gene name of the GenBank record.
Genbank_BSR: The BSR value for the best GenBank annotation.
Query: The locus from the query schema.
Subject: The locus from the subject schema.
BSR: The BSR value for the best loci matches.
Process: Process where that match was found in MatchSchemas.
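The BSR columns make it easy to flag annotations that deserve a manual look. The following sketch reads a report as text and lists the loci whose BSR falls below a cutoff. The helper name loci_below_bsr and the 0.99 cutoff are arbitrary choices for illustration, and the first column is assumed to be headed "Locus" as in genbank_annotations.tsv.

```python
import csv
import io


def loci_below_bsr(tsv_text, bsr_column="Genbank_BSR", threshold=0.99):
    """Return the loci whose BSR in `bsr_column` is below `threshold`.

    Rows with a missing or "NA" value in that column are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged = []
    for row in reader:
        value = row.get(bsr_column)
        if value in (None, "", "NA"):
            continue
        if float(value) < threshold:
            flagged.append(row["Locus"])
    return flagged
```

Passing the contents of genbank_annotations.tsv returns the loci whose best GenBank match is weaker than the chosen cutoff.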

Note

The consolidated_annotations.tsv file contains all the annotations in the files provided by the user.

Consolidate column suffixes:

_file_x
x being the number of the file that column comes from. Which file corresponds to which number is specified in the output log file.
None
The columns that have a unique header or are the first instance of that header will not have any suffix.
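The suffix rule can be sketched as a header-renaming pass over the input files in order. This is an illustration under assumptions: suffix_headers is a hypothetical helper, and numbering the files from 1 is a guess, since the actual file-to-number mapping is the one written to the output log.

```python
def suffix_headers(headers_per_file):
    """Apply the _file_x suffix rule to repeated column headers.

    headers_per_file: list of header lists, one per input file, in order.
    The first occurrence of a header keeps its name. Later occurrences
    get "_file_<n>", where n is the (assumed 1-based) file number.
    """
    seen = set()
    renamed = []
    for number, headers in enumerate(headers_per_file, start=1):
        current = []
        for header in headers:
            if header in seen:
                current.append(f"{header}_file_{number}")
            else:
                seen.add(header)
                current.append(header)
        renamed.append(current)
    return renamed
```

With two files that both contain a Proteome_ID column, the first file keeps Proteome_ID and the second contributes Proteome_ID_file_2.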

Examples

Here are some example commands to use the SchemaAnnotation module:

# Annotate schema using default parameters
SR SchemaAnnotation -s /path/to/schema -o /path/to/output -ao uniprot-proteomes -pt path/to/proteome/table

# Consolidate annotation files with custom parameters
SR SchemaAnnotation -o /path/to/output -ao consolidate -cn 'path/to/uniprot_annotations/output' 'path/to/genbank_annotations/output' -c 4 -t 4 --bsr 0.7 -tt 1 --nocleanup

Troubleshooting

If you encounter issues while using the SchemaAnnotation module, consider the following troubleshooting tips:

Verify that the paths to the schema and output directories are correct.
Check the output directory for any error logs or messages.
Increase the number of CPUs using the -c or --cpu option if the process is slow.
If the error is related to a BLAST database, delete the BLAST folders in the output directory and run the command again. Also run the schema through the AdaptLoci module, as it checks for loci name conflicts.