Schema Refinery - Full Tutorial

Objective

This tutorial will guide you through a possible workflow of Schema Refinery, from schema creation to schema refinement.

Prerequisites

Schema Refinery installed (installation instructions here).
chewBBACA 3.3.10 or higher (chewBBACA’s installation instructions).
NCBI datasets command-line tool installed (instructions here).
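
Before starting, it can help to confirm that the three command-line tools are reachable from the terminal. The snippet below is a minimal sketch: `missing_tools` is a hypothetical helper, and it assumes the executables are named `SR`, `chewBBACA.py` and `datasets` (the first two appear in the commands of this tutorial; `datasets` is the usual name of the NCBI datasets executable). It only checks that the commands are on the PATH and does not verify versions.

```shell
# Print the name of every command that cannot be found on the PATH.
# missing_tools is a hypothetical helper written for this tutorial.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Executable names are assumptions; adjust them to match your installation.
missing=$(missing_tools SR chewBBACA.py datasets)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All required commands were found on PATH."
fi
```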

Procedure

Open a terminal window.
Download the assemblies from NCBI needed for creating the schema:

SR DownloadAssemblies -f path/to/input_tsv_file_with_taxon -db NCBI -e youremail@example.com -o path/to/DownloadAssemblies_NCBI_download -fm --download

Select the set of genome assemblies that you want to use to create a schema seed (e.g. those with the best quality or the highest completeness).
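
One possible way to keep the selected assemblies separate is to copy them into their own folder. The sketch below is not part of Schema Refinery: `select_assemblies` is a hypothetical helper, and it assumes you have written the file names of the chosen assemblies, one per line, into a plain text file.

```shell
# Copy the assemblies named in a list file (one file name per line)
# from a source folder into a destination folder.
# select_assemblies is a hypothetical helper, not a Schema Refinery command.
select_assemblies() {
    list_file="$1"
    src_dir="$2"
    dest_dir="$3"
    mkdir -p "$dest_dir" || return 1
    while IFS= read -r name || [ -n "$name" ]; do
        # Skip blank lines in the list
        [ -z "$name" ] && continue
        if [ -f "$src_dir/$name" ]; then
            cp "$src_dir/$name" "$dest_dir/"
        else
            # Report names that do not match any file in the source folder
            echo "missing: $name" >&2
        fi
    done < "$list_file"
}

# Example call (all paths are placeholders):
# select_assemblies path/to/selected.txt path/to/all_assemblies path/to/selected_assemblies
```

The destination folder could then be passed to the commands that take the assemblies as input.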

Create a schema seed using the CreateSchema module from chewBBACA:

chewBBACA.py CreateSchema -i /path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped -o /path/to/CreateSchema_chewbbaca_mySchema -t 4

Note

The schema seed contains one representative allele for each distinct locus identified in the genome assemblies.
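
A quick sanity check of the schema seed could look like the sketch below. It assumes that the schema folder holds one FASTA file per locus with a `.fasta` extension at its top level; `count_loci` and `loci_not_single_allele` are hypothetical helpers, not chewBBACA commands.

```shell
# Count the locus FASTA files at the top level of a schema folder.
count_loci() {
    n=0
    for f in "$1"/*.fasta; do
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# List the FASTA files that do not hold exactly one sequence.
# In a fresh schema seed this list is expected to be empty.
loci_not_single_allele() {
    for f in "$1"/*.fasta; do
        [ -f "$f" ] || continue
        # Each sequence in a FASTA file starts with a ">" header line
        count=$(grep -c '^>' "$f")
        if [ "$count" -ne 1 ]; then
            echo "$f: $count sequences"
        fi
    done
}

# Example calls (the path is a placeholder):
# count_loci /path/to/CreateSchema_chewbbaca_mySchema/schema_seed
# loci_not_single_allele /path/to/CreateSchema_chewbbaca_mySchema/schema_seed
```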

Perform allele calling with the AlleleCall module from chewBBACA to populate the schema with the alleles identified in those genomes.

chewBBACA.py AlleleCall -i /path/to/CreateSchema_chewbbaca_mySchema/schema_seed -g /path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped -o /path/to/AlleleCall_folder -t 4 --output-unclassified

Evaluate the CDSs that were not classified by chewBBACA using the IdentifySpuriousGenes module:

SR IdentifySpuriousGenes -s path/to/CreateSchema_chewbbaca_mySchema/schema_seed -a path/to/AlleleCall_folder -m unclassified_cds -o path/to/IdentifySpuriousGenes_uCDS_mySchema -c 6

Note

In a normal workflow, users would have to select the best loci to keep based on the recommendations of the IdentifySpuriousGenes module. Here we skip this step to show the full workflow.

Adapt the proto schema created from the unclassified CDSs:

SR AdaptLoci -i path/to/IdentifySpuriousGenes_uCDS_mySchema/temp_fastas -o path/to/AdaptLoci_unclassified

Important

Pass as input the temp_fastas folder generated by the IdentifySpuriousGenes module. Then repeat the allele calling step (AlleleCall) with this new schema to create a new AlleleCall folder.
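
Repeating the allele calling against the adapted schema could look like the sketch below. The paths are the placeholders used elsewhere in this tutorial, and the output folder name `AlleleCall_unclassified` is taken from the refinement command that follows. Whether extra options such as `--output-unclassified` are needed at this stage is not stated here, so they are left out. The command is assembled as text and printed so that it can be reviewed before it is run.

```shell
# Placeholder paths, matching the ones used elsewhere in this tutorial
schema=path/to/AdaptLoci_unclassified/schema_seed
genomes=/path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped
output=path/to/AlleleCall_unclassified

# Assemble the AlleleCall command as text and print it for review
allelecall_cmd="chewBBACA.py AlleleCall -i $schema -g $genomes -o $output -t 4"
echo "$allelecall_cmd"
```

Once the placeholders are replaced with real paths, the printed command can be copied into the terminal and executed.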

Refine the schema created based on the unclassified CDSs:

SR IdentifySpuriousGenes -s path/to/AdaptLoci_unclassified/schema_seed -a path/to/AlleleCall_unclassified -m schema -o path/to/IdentifySpuriousGenes_unclassifiedSchema -c 6

Analyse the clusters and change the “Choice” actions into “Join”, “Add” or “Drop”.
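
To see how many entries still need a decision, a simple text search may be enough. The sketch below assumes that undecided entries contain the literal text "Choice" somewhere on their line in the recommendations file; the exact column layout of the file is not described in this tutorial, so treat the match as approximate. `count_pending_choices` and `list_pending_choices` are hypothetical helpers.

```shell
# Count the lines of a recommendations file that still contain "Choice".
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
count_pending_choices() {
    grep -c 'Choice' "$1" || true
}

# Show the undecided lines together with their line numbers.
list_pending_choices() {
    grep -n 'Choice' "$1" || true
}

# Example calls (the path is a placeholder):
# count_pending_choices path/to/IdentifySpuriousGenes_unclassifiedSchema/recommendations_annotations.tsv
# list_pending_choices path/to/IdentifySpuriousGenes_unclassifiedSchema/recommendations_annotations.tsv
```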

Create a final schema using the altered recommendations_annotations.tsv file from the previous step:

SR CreateSchemaStructure -s path/to/AdaptLoci_unclassified/schema_seed -rf path/to/IdentifySpuriousGenes_unclassifiedSchema/recommendations_annotations.tsv -o path/to/CreateSchemaStructure_refined_schema -c 6

Use the altered recommendations file as the input to the CreateSchemaStructure module, as that file contains the selection of loci to be dropped or added.

Optional modules to further refine or create a schema:

Follow the MatchSchemas tutorial to find loci matches between two schemas.
Follow the SchemaAnnotation tutorial to annotate schema loci based on information retrieved from various databases.
Follow the IdentifyParalogousLoci tutorial to identify paralogous loci in a schema.

Note

The assemblies available in the NCBI databases may change, so the results may vary.

Conclusion

You have successfully completed a possible workflow of Schema Refinery, from schema creation to schema refinement.