Schema Refinery - Full Tutorial
===============================

Objective
---------

This tutorial will guide you through a possible workflow of Schema Refinery, from schema creation to schema refinement.

Prerequisites
-------------
- Schema Refinery installed (installation instructions :doc:`here </SchemaRefinery/Overview/Installation>`).
- chewBBACA 3.3.10 or higher (`chewBBACA's installation instructions <https://chewbbaca.readthedocs.io/en/latest/user/getting_started/installation.html>`_).
- NCBI datasets command-line tool installed (instructions `here <https://www.ncbi.nlm.nih.gov/datasets/docs/v2/command-line-tools/download-and-install/>`_).
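
A quick check that the three command-line tools are reachable can save time before starting. The sketch below is only a convenience. It assumes the tools are exposed under the names ``SR``, ``chewBBACA.py`` and ``datasets``, and the ``missing_tools`` helper is hypothetical (it is not part of Schema Refinery).

.. code-block:: bash

    # Hypothetical helper: print every command name that is not found on the PATH.
    missing_tools() {
        for tool in "$@"; do
            command -v "$tool" >/dev/null 2>&1 || echo "$tool"
        done
    }

    # Assumed executable names: SR (Schema Refinery), chewBBACA.py, datasets (NCBI).
    missing=$(missing_tools SR chewBBACA.py datasets)
    if [ -n "$missing" ]; then
        echo "Not found on PATH:" $missing
    else
        echo "All prerequisite tools were found."
    fi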

Procedure
---------

1. Open a terminal window.

2. Download the assemblies from NCBI needed for creating the schema:

.. code-block:: bash

	SR DownloadAssemblies -f path/to/input_tsv_file_with_taxon -db NCBI -e youremail@example.com -o path/to/DownloadAssemblies_NCBI_download -fm --download

3. Select the set of genome assemblies that you want to use to create a schema seed (e.g. the highest-quality or most complete assemblies).

4. Create a schema seed using the `CreateSchema <https://chewbbaca.readthedocs.io/en/latest/user/modules/CreateSchema.html>`_ module from chewBBACA:

    .. code-block:: bash

        chewBBACA.py CreateSchema -i /path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped -o /path/to/CreateSchema_chewbbaca_mySchema -t 4

.. Note::
	The schema seed contains one representative allele for each distinct locus identified in the genome assemblies.
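
Since the seed holds one representative allele per locus, counting the loci gives a quick sanity check on the result. The sketch below assumes that the schema folder stores one ``.fasta`` file per locus at its top level and that any other FASTA files sit in subfolders; the ``count_loci`` helper is hypothetical.

.. code-block:: bash

    # Hypothetical helper: count the FASTA files at the top level of a schema folder.
    # Assumes one .fasta file per locus; files inside subfolders are ignored.
    count_loci() {
        find "$1" -maxdepth 1 -type f -name '*.fasta' | wc -l | tr -d ' '
    }

    # Example call (uncomment after adjusting the path):
    # count_loci /path/to/CreateSchema_chewbbaca_mySchema/schema_seed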

5. Perform allele calling with the `AlleleCall <https://chewbbaca.readthedocs.io/en/latest/user/modules/AlleleCall.html>`_ module from chewBBACA to populate the schema with the alleles identified in those genomes:

    .. code-block:: bash

        chewBBACA.py AlleleCall -i /path/to/CreateSchema_chewbbaca_mySchema/schema_seed -g /path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped -o /path/to/AlleleCall_folder -t 4 --output-unclassified

6. Evaluate the CDSs that were not classified by chewBBACA using the :doc:`IdentifySpuriousGenes </SchemaRefinery/Modules/IdentifySpuriousGenes>` module:

    .. code-block:: bash

        SR IdentifySpuriousGenes -s path/to/CreateSchema_chewbbaca_mySchema/schema_seed -a path/to/AlleleCall_folder -m unclassified_cds -o path/to/IdentifySpuriousGenes_uCDS_mySchema -c 6

.. Note::
	In a normal workflow, users would have to select the best loci to keep based on the recommendations of the :doc:`IdentifySpuriousGenes </SchemaRefinery/Modules/IdentifySpuriousGenes>` module. Here we skip this step to show the full workflow.

7. Adapt the proto schema created from the unclassified CDSs:

    .. code-block:: bash

        SR AdaptLoci -i path/to/IdentifySpuriousGenes_uCDS_mySchema/temp_fastas -o path/to/AdaptLoci_unclassified

.. Important::
	Pass as input the ``temp_fastas`` folder generated by the :doc:`IdentifySpuriousGenes </SchemaRefinery/Modules/IdentifySpuriousGenes>` module. Repeat step 5 with this new schema to create a new AlleleCall folder.
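
The repeated allele calling run mentioned in the note above could be sketched as follows. The function name ``run_allele_call`` is hypothetical. The example paths assume that AdaptLoci writes the adapted schema to a ``schema_seed`` folder inside its output folder and that the new results go to ``path/to/AlleleCall_unclassified``, as the paths in the refinement command that follows suggest.

.. code-block:: bash

    # Hypothetical wrapper around the AlleleCall command used in step 5.
    # Arguments: schema folder, folder with genome assemblies, output folder.
    run_allele_call() {
        chewBBACA.py AlleleCall -i "$1" -g "$2" -o "$3" -t 4
    }

    # Example call for the adapted schema (uncomment after adjusting the paths):
    # run_allele_call path/to/AdaptLoci_unclassified/schema_seed \
    #     /path/to/DownloadAssemblies_NCBI_download/assemblies_ncbi_unziped \
    #     path/to/AlleleCall_unclassified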

8. Refine the schema created based on the unclassified CDSs:

    .. code-block:: bash

        SR IdentifySpuriousGenes -s path/to/AdaptLoci_unclassified/schema_seed -a path/to/AlleleCall_unclassified -m schema -o path/to/IdentifySpuriousGenes_unclassifiedSchema -c 6

9. Analyse the clusters and change the "Choice" actions into "Join", "Add" or "Drop".

10. Create a final schema using the altered **recommendations_annotations.tsv** file from the previous step:
    
    .. code-block:: bash

        SR CreateSchemaStructure -s path/to/AdaptLoci_unclassified/schema_seed -rf path/to/IdentifySpuriousGenes_unclassifiedSchema/recommendations_annotations.tsv -o path/to/CreateSchemaStructure_refined_schema -c 6

Pass the altered recommendations file to the :doc:`CreateSchemaStructure </SchemaRefinery/Modules/CreateSchemaStructureOutputDescription>` module, since that file records which loci are to be dropped or added.
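
Before running the CreateSchemaStructure command above, it may help to confirm that no "Choice" entries remain in the edited file. The sketch below assumes that unresolved actions appear as the literal word ``Choice`` in the TSV file; the ``count_choice`` helper is hypothetical.

.. code-block:: bash

    # Hypothetical helper: count the lines that still contain the word "Choice".
    # Assumes unresolved actions are written literally as "Choice" in the file.
    count_choice() {
        grep -c -w 'Choice' "$1" || true
    }

    # Example call (uncomment after adjusting the path):
    # count_choice path/to/IdentifySpuriousGenes_unclassifiedSchema/recommendations_annotations.tsv

A count of 0 suggests that every cluster has been assigned one of the final actions.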

Optional modules to further refine or create a schema
-----------------------------------------------------

- Follow the :doc:`MatchSchemas tutorial </SchemaRefinery/Tutorials/MatchSchemasTutorial>` to find loci matches between two schemas.

- Follow the :doc:`SchemaAnnotation tutorial </SchemaRefinery/Tutorials/SchemaAnnotationTutorial>` to annotate schema loci based on information retrieved from various databases.

- Follow the :doc:`IdentifyParalogousLoci tutorial </SchemaRefinery/Tutorials/IdentifyParalogousLociTutorial>` to identify paralogous loci in a schema.

.. Note::
	The assemblies available in the NCBI databases may change, so the results may vary.

Conclusion
----------

You have successfully completed a possible workflow of Schema Refinery, from schema creation to schema refinement.