Schema Refinery - Full Tutorial
===============================

Objective
---------

This tutorial guides you through one possible Schema Refinery workflow, from schema creation to schema refinement.

Prerequisites
-------------

- Schema Refinery installed (installation instructions :doc:`here `).
- chewBBACA 3.3.10 or higher (`chewBBACA's installation instructions `_).
- NCBI datasets command-line tool installed (instructions `here `_).

Procedure
---------

1. Open a terminal window.

2. Download the assemblies from NCBI needed for creating the schema:

   .. code-block:: bash

      SR DownloadAssemblies -f path/to/input_tsv_file_with_taxon -db NCBI -e youremail@example.com -o path/to/DownloadAssemblies_NCBI_download -fm --download

3. Select the set of genome assemblies that you want to use to create a schema seed (e.g. the best quality, the most complete, etc.).

4. Create a schema seed using the `CreateSchema `_ module from chewBBACA:

   .. code-block:: bash

      chewBBACA.py CreateSchema -i /path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped -o /path/to/CreateSchema_chewbbaca_mySchema -t 4

   .. Note::
      The schema seed contains one representative allele for each distinct locus identified in the genome assemblies.

5. Perform allele calling with the `AlleleCall `_ module from chewBBACA to populate the schema with the alleles identified in those genomes:

   .. code-block:: bash

      chewBBACA.py AlleleCall -i /path/to/CreateSchema_chewbbaca_mySchema/schema_seed -g /path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped -o /path/to/AlleleCall_folder -t 4 --output-unclassified

6. Evaluate the CDSs that were not classified by chewBBACA using the :doc:`IdentifySpuriousGenes ` module:

   .. code-block:: bash

      SR IdentifySpuriousGenes -s path/to/CreateSchema_chewbbaca_mySchema/schema_seed -a path/to/AlleleCall_folder -m unclassified_cds -o path/to/IdentifySpuriousGenes_uCDS_mySchema -c 6

   .. Note::
      In a normal workflow, users would select the best loci to keep based on the recommendations of the :doc:`IdentifySpuriousGenes ` module. Here we skip this step to show the full workflow.

7. Adapt the proto schema created from the unclassified CDSs:

   .. code-block:: bash

      SR AdaptLoci -i path/to/IdentifySpuriousGenes_uCDS_mySchema/temp_fastas -o path/to/AdaptLoci_unclassified

   .. Important::
      Pass as input the ``temp_fastas`` folder generated by the :doc:`IdentifySpuriousGenes ` module.

   Repeat step 5 (allele calling) with this new schema to create a new AlleleCall folder.

8. Refine the schema created from the unclassified CDSs:

   .. code-block:: bash

      SR IdentifySpuriousGenes -s path/to/AdaptLoci_unclassified/schema_seed -a path/to/AlleleCall_unclassified -m schema -o path/to/IdentifySpuriousGenes_unclassifiedSchema -c 6

9. Analyse the clusters and change the "Choice" actions into "Join", "Add" or "Drop".

10. Create a final schema using the altered **recommendations_annotations.tsv** file from the previous step:

    .. code-block:: bash

       SR CreateSchemaStructure -s path/to/AdaptLoci_unclassified/schema_seed -rf path/to/IdentifySpuriousGenes_unclassifiedSchema/recommendations_annotations.tsv -o path/to/CreateSchemaStructure_refined_schema -c 6

    Use the altered recommendations file with the :doc:`CreateSchemaStructure ` module, as it contains the selection of loci to be dropped or added.

Optional modules to further refine or create a schema:
------------------------------------------------------

- Follow the :doc:`MatchSchemas tutorial ` to find loci matches between two schemas.
- Follow the :doc:`SchemaAnnotation tutorial ` to annotate schema loci based on information retrieved from various databases.
- Follow the :doc:`IdentifyParalogousLoci tutorial ` to identify paralogous loci in a schema.

.. Note::
   The assemblies available in the NCBI databases may change, so the results may vary.

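The procedure leaves open how the assemblies for the schema seed are selected. As one possible approach, the sketch below copies a hand-picked subset of the downloaded assemblies into a separate folder. The ``select_assemblies`` function and the list-file format (one file name per line) are assumptions made for illustration; they are not part of Schema Refinery or chewBBACA.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Schema Refinery or chewBBACA):
# copy the assemblies named in a plain-text list into a separate folder.
#
# Usage: select_assemblies LIST_FILE SOURCE_DIR DEST_DIR
#   LIST_FILE   one assembly file name per line
#               (blank lines and lines starting with # are skipped)
#   SOURCE_DIR  folder holding the downloaded assemblies
#   DEST_DIR    folder that will receive the selected subset
select_assemblies() {
    local list_file="$1" source_dir="$2" dest_dir="$3"
    local name missing=0

    mkdir -p "$dest_dir" || return 1

    # The "|| [ -n ... ]" part handles a last line without a trailing newline.
    while IFS= read -r name || [ -n "$name" ]; do
        # Skip blank lines and comments.
        case "$name" in
            ''|'#'*) continue ;;
        esac
        if [ -f "$source_dir/$name" ]; then
            cp "$source_dir/$name" "$dest_dir/"
        else
            echo "warning: $name not found in $source_dir" >&2
            missing=$((missing + 1))
        fi
    done < "$list_file"

    # Return a non-zero status if any listed file was missing.
    [ "$missing" -eq 0 ]
}
```

For example, ``select_assemblies selected.txt path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped path/to/selected_assemblies`` would fill ``path/to/selected_assemblies`` (a placeholder name), which could then be passed to ``CreateSchema`` through ``-i`` instead of the full download folder.
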
Conclusion
----------

You have successfully completed a possible workflow of Schema Refinery, from schema creation to schema refinement.