Schema Refinery - Full Tutorial

Objective

This tutorial will guide you through a possible workflow of Schema Refinery, from schema creation to schema refinement.

Prerequisites

Procedure

  1. Open a terminal window.

  2. Download the assemblies from NCBI needed for creating the schema:

    SR DownloadAssemblies -f path/to/input_tsv_file_with_taxon -db NCBI -e youremail@example.com -o path/to/DownloadAssemblies_NCBI_download -fm --download

  1. Select the set of genome assemblies that you want to use to create a schema seed (e.g. the highest quality or most complete ones).

  2. Create a schema seed using the CreateSchema module from chewBBACA:

    chewBBACA.py CreateSchema -i /path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped -o /path/to/CreateSchema_chewbbaca_mySchema -t 4
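
One way to compare the downloaded assemblies before choosing the subset passed to CreateSchema is to look at basic contiguity statistics. The sketch below is not part of Schema Refinery or chewBBACA: `assembly_stats` is a hypothetical helper, and it assumes the assemblies are uncompressed FASTA files. It reports the contig count, total length and N50 of one file.

```shell
# Hypothetical helper (not an SR or chewBBACA command).
# Prints: <file> <contigs> <total_bp> <N50>, tab-separated.
# Assumes an uncompressed FASTA file; empty records are ignored.
assembly_stats() {
    awk -v name="$1" '
        BEGIN { n = 0; len = 0 }
        # A header line closes the previous contig
        /^>/ { if (len > 0) lens[n++] = len; len = 0; next }
        # Sequence line: strip whitespace and add its length
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END {
            if (len > 0) lens[n++] = len
            total = 0
            for (i = 0; i < n; i++) total += lens[i]
            # Simple descending sort, adequate for draft assemblies
            for (i = 0; i < n; i++)
                for (j = i + 1; j < n; j++)
                    if (lens[j] > lens[i]) { t = lens[i]; lens[i] = lens[j]; lens[j] = t }
            # N50: length of the contig at which half of the total is reached
            acc = 0; n50 = 0
            for (i = 0; i < n; i++) {
                acc += lens[i]
                if (2 * acc >= total) { n50 = lens[i]; break }
            }
            printf "%s\t%d\t%d\t%d\n", name, n, total, n50
        }
    ' "$1"
}

# Example usage (adjust the path to your download folder):
#   for f in /path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped/*; do
#       assembly_stats "$f"
#   done
```

Fewer contigs and a higher N50 generally indicate a more contiguous assembly, but which thresholds count as "best quality" depends on the species and is left to the user.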
    

Note

The schema seed contains one representative allele for each distinct locus identified in the genome assemblies.
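
The property described in this note can be checked with a short shell function. This is only a sketch: `check_schema_seed` is a hypothetical helper, and it assumes the schema seed folder stores each locus as a top-level `.fasta` file in which every allele has its own `>` header line.

```shell
# Hypothetical helper (not an SR or chewBBACA command).
# Counts the top-level locus FASTA files in a schema seed folder and
# reports how many of them do not contain exactly one allele.
check_schema_seed() {
    seed_dir="$1"
    loci=0
    bad=0
    for f in "$seed_dir"/*.fasta; do
        [ -e "$f" ] || continue                 # folder has no .fasta files
        loci=$((loci + 1))
        n=$(grep -c '^>' "$f" || true)          # number of FASTA records
        if [ "$n" -ne 1 ]; then
            bad=$((bad + 1))
            echo "unexpected: $f has $n alleles" >&2
        fi
    done
    echo "loci=$loci not_single_allele=$bad"
}

# Example usage:
#   check_schema_seed /path/to/CreateSchema_chewbbaca_mySchema/schema_seed
```

For a freshly created seed, the second number is expected to be 0. After allele calling it will normally be greater, since the schema is then populated with additional alleles.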

  1. Perform allele calling with the AlleleCall module from chewBBACA to populate the schema with the alleles identified in those genomes:

    chewBBACA.py AlleleCall -i /path/to/CreateSchema_chewbbaca_mySchema/schema_seed -g /path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped -o /path/to/AlleleCall_folder -t 4 --output-unclassified
    
  2. Evaluate the CDSs that were not classified by chewBBACA using the IdentifySpuriousGenes module:

    SR IdentifySpuriousGenes -s path/to/CreateSchema_chewbbaca_mySchema/schema_seed -a path/to/AlleleCall_folder -m unclassified_cds -o path/to/IdentifySpuriousGenes_uCDS_mySchema -c 6
    

Note

In a normal workflow, users would select the best loci to keep based on the recommendations of the IdentifySpuriousGenes module. Here we skip this step to show the full workflow.

  1. Adapt the proto schema created from the unclassified CDSs:

    SR AdaptLoci -i path/to/IdentifySpuriousGenes_uCDS_mySchema/temp_fastas -o path/to/AdaptLoci_unclassified
    

Important

Pass as input the temp_fastas folder generated by the IdentifySpuriousGenes module. Then repeat the allele calling step (AlleleCall) with this new schema to create a new AlleleCall folder.
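
The repeated allele calling run might look like the sketch below. It reuses only the AlleleCall options already shown in this tutorial; the output folder name `AlleleCall_unclassified` is taken from the refinement command in the next step, and whether any further options are needed for this run is an assumption left to the user. The command is printed rather than executed so it can be reviewed first.

```shell
# Paths mirror the ones used elsewhere in this tutorial; adjust to your layout.
schema=path/to/AdaptLoci_unclassified/schema_seed
genomes=/path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped
output=path/to/AlleleCall_unclassified

# Dry run: build and print the command. Run it with: eval "$cmd"
cmd="chewBBACA.py AlleleCall -i $schema -g $genomes -o $output -t 4"
echo "$cmd"
```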

  1. Refine the schema created based on the unclassified CDSs:

    SR IdentifySpuriousGenes -s path/to/AdaptLoci_unclassified/schema_seed -a path/to/AlleleCall_unclassified -m schema -o path/to/IdentifySpuriousGenes_unclassifiedSchema -c 6
    
  2. Analyse the clusters and change the “Choice” actions into “Join”, “Add” or “Drop”.

  3. Create a final schema using the altered recommendations_annotations.tsv file from the previous step:

    SR CreateSchemaStructure -s path/to/AdaptLoci_unclassified/schema_seed -rf path/to/IdentifySpuriousGenes_unclassifiedSchema/recommendations_annotations.tsv -o path/to/CreateSchemaStructure_refined_schema -c 6
    

Use the altered recommendations file with the CreateSchemaStructure module, as it contains the selection of loci to be dropped or added.
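
Before running CreateSchemaStructure, it can help to confirm that no “Choice” actions were left unresolved in the edited file. The following is a minimal sketch: `remaining_choices` is a hypothetical helper, and it assumes only that the file is tab-separated and that an unresolved action appears as a field starting with `Choice`. The exact column layout of recommendations_annotations.tsv is not assumed.

```shell
# Hypothetical helper (not an SR command).
# Prints "<line number><TAB><line>" for every row of a tab-separated file
# that still has a field starting with "Choice".
remaining_choices() {
    awk -F '\t' '
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^Choice/) { printf "%d\t%s\n", NR, $0; break }
        }
    ' "$1"
}

# Example usage:
#   remaining_choices path/to/IdentifySpuriousGenes_unclassifiedSchema/recommendations_annotations.tsv
```

Empty output means that every action has been changed to “Join”, “Add” or “Drop”.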

Optional modules to further refine or create a schema:

Note

The assemblies available in the NCBI databases may change, so the results may vary.
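
Because the available assemblies can change between runs, it may be useful to record exactly which files a schema was built from. The sketch below is one possible approach and not a Schema Refinery feature: `write_manifest` is a hypothetical helper, and it assumes the GNU `sha256sum` tool is installed (on macOS, `shasum -a 256` is the usual equivalent).

```shell
# Hypothetical helper (not an SR command).
# $1: folder with the assemblies, $2: manifest file to write.
# Writes one "<sha256>  <file name>" line per regular file in the folder.
# Keep the manifest outside the assemblies folder so it does not list itself.
write_manifest() {
    (
        cd "$1" || exit 1
        for f in *; do
            if [ -f "$f" ]; then sha256sum "$f"; fi
        done
    ) > "$2"
}

# Example usage:
#   write_manifest /path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped assemblies_manifest.txt
```

Running `sha256sum -c` on the manifest from inside the assemblies folder later shows whether a new download produced the same set of files.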

Conclusion

You have successfully completed a possible workflow of Schema Refinery, from schema creation to schema refinement.