Quick Usage Guide

David Gaylord edited this page Sep 28, 2021 · 19 revisions

1: Digest and ingest data

  1. Run the script bin/digest_and_ingest.sh with the FASTA proteome files you wish to digest and ingest as type Genome, e.g.:
    • Add to genomes:
    bin/digest_and_ingest.sh file1.fasta file2.fasta ...
    
  2. Run the script bin/digest_and_ingest_specialized_assembly.sh with the FASTA proteome files you wish to digest and ingest as type Specialized Assembly, e.g.:
    • Add to specialized assemblies:
    bin/digest_and_ingest_specialized_assembly.sh MAGfile1.fasta MAGfile2.fasta ...
    
  3. Run the script bin/digest_and_ingest_metagenome.sh with the FASTA proteome files you wish to digest and ingest as type Meta-omic Assembly, along with the annotation file for the assemblies, e.g.:
    bin/digest_and_ingest_metagenome.sh meta-omic-file.fasta
    

These scripts read the FASTA files and run digestions on their sequences. You should see a fair amount of output as the files are processed.
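When a whole directory of proteomes needs to be ingested, the file list can be built by a small wrapper rather than typed by hand. The following is a minimal sketch, not part of METATRYP: the function name ingest_dir and the DRY_RUN variable are hypothetical, and it assumes the files end in .fasta and that the script is run from the repository root. By default it only prints the command it would run.

```shell
#!/usr/bin/env bash
# Sketch: ingest every .fasta file in a directory as genomes.
# ingest_dir and DRY_RUN are illustrative names, not METATRYP features.
# DRY_RUN=1 (the default) prints the command; DRY_RUN=0 executes it.

ingest_dir() {
  local dir="$1" f
  local files=()
  # Collect *.fasta files, skipping the unexpanded pattern if none match.
  for f in "$dir"/*.fasta; do
    if [ -e "$f" ]; then
      files+=("$f")
    fi
  done
  if [ "${#files[@]}" -eq 0 ]; then
    echo "no FASTA files found in $dir" >&2
    return 1
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "bin/digest_and_ingest.sh ${files[*]}"
  else
    bin/digest_and_ingest.sh "${files[@]}"
  fi
}

# Run only when a directory is given on the command line.
if [ "$#" -gt 0 ]; then
  ingest_dir "$1"
fi
```

Saved as, say, ingest_dir.sh, it could be checked with `bash ingest_dir.sh proteomes/` and then run for real with `DRY_RUN=0 bash ingest_dir.sh proteomes/`. The same pattern should apply to the specialized assembly and meta-omic scripts by swapping the script name.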

2: Query a data category for a peptide:

  • Query just one data type, here the specialized assembly type:

    bin/query_by_sequence.sh --sequence LSHQAIAEAIGSTR --type sa
    
  • Query all data categories:

    bin/query_by_sequence.sh --sequence MGFPCNR --type all
    
  • Data type parameters (default is genomes): invoke with flag --type.

    • --type g - genomes
    • --type m - meta-omic assembly
    • --type sa - specialized assemblies
    • --type all - all types
  • Perform optional LCA analysis (if lineages exist in the database): invoke with flag --lca.

    bin/query_by_sequence.sh --sequence MGFPCNR --type all --lca
    
    • For more information on adding taxonomic lineages to the database, see the Taxonomic Lineages Section below.
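To query many peptides in one go, the query script can be called once per sequence from a loop. This is a sketch under assumptions: query_peptides and DRY_RUN are hypothetical names, the input is assumed to be a plain-text file with one peptide sequence per line, and the type defaults to g to mirror the documented default of genomes.

```shell
#!/usr/bin/env bash
# Sketch: run one query per peptide listed in a text file (one sequence per line).
# query_peptides and DRY_RUN are illustrative names, not METATRYP features.

query_peptides() {
  local peptide_file="$1" type="${2:-g}" seq
  # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
  while IFS= read -r seq || [ -n "$seq" ]; do
    # Skip blank lines.
    if [ -z "$seq" ]; then
      continue
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "bin/query_by_sequence.sh --sequence $seq --type $type"
    else
      # Redirect stdin so the query script cannot consume the peptide list.
      bin/query_by_sequence.sh --sequence "$seq" --type "$type" < /dev/null
    fi
  done < "$peptide_file"
}

# Run only when a peptide file is given on the command line.
if [ "$#" -gt 0 ]; then
  query_peptides "$@"
fi
```

With a file peptides.txt containing LSHQAIAEAIGSTR and MGFPCNR on separate lines, `bash query_peptides.sh peptides.txt sa` prints the two commands it would run; setting DRY_RUN=0 executes them.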

3: List the taxon ids:

  1. For the Genome data category, list all the taxa in that category:
bin/list_taxons.sh
  2. For the Specialized Assembly data category, list all the taxa:
bin/list_specialized_assemblies.sh
  3. For the Meta-omic Assemblies data category, list all the meta-omic files loaded:
bin/list_metaomic_assemblies.sh
  4. For the Meta-omic Assemblies data category, list all the taxa loaded:
bin/list_metaomic_assembly_taxons.sh

4: Generate redundancy tables

  1. Generate redundancy tables for Genomes:

    • By entering taxa in command line:
    bin/generate_redundancy_tables.sh --taxon-ids syn8102 syn7502 syn7503 --output-dir exampleRedundancyTables
    
    • By inputting a file that contains a list of taxon IDs (one taxon ID per line):
    bin/generate_redundancy_tables.sh --taxon-id-file taxon_id_list.txt --output-dir exampleRedundancyTables
    
  2. Generate redundancy tables for Specialized Assemblies:

    • By entering taxa in command line:
    bin/generate_redundancy_tables_specialized_assembly.sh --sa-ids TARA_RED_MAG_00113 TARA_SOC_MAG_00005 --output-dir exampleRedundancyTables
    
    • By inputting a file that contains a list of taxon IDs (one taxon ID per line):
    bin/generate_redundancy_tables_specialized_assembly.sh --sa-id-file sa_id_list.txt --output-dir exampleRedundancyTables
    
  3. View the resulting files in the output directory (exampleRedundancyTables in the examples above):

    • counts.csv contains counts of redundant peptides
    • union_percents.csv contains the values in counts.csv, divided by the number of unique peptides in the union of the digestions of a taxon pair.
    • individual_percents.csv contains the values in counts.csv, divided by the count of unique peptides in taxon A.
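The relationship between the three files can be illustrated on a single taxon pair. The sketch below is not METATRYP code: redundancy_pair is a hypothetical helper, the inputs are assumed to be plain-text peptide lists (one peptide per line) with at least one peptide each, and "unique peptides in taxon A" is read here as the number of distinct peptides in A's digestion. It reports plain fractions; whether METATRYP scales these to percentages is not stated above.

```shell
#!/usr/bin/env bash
# Sketch: compute the three redundancy values for one taxon pair (A, B).
# redundancy_pair is an illustrative name; the input file layout is assumed.

redundancy_pair() {
  local a_sorted b_sorted shared union unique_a
  a_sorted=$(mktemp)
  b_sorted=$(mktemp)
  # De-duplicate and sort both lists so that comm can compare them.
  LC_ALL=C sort -u "$1" > "$a_sorted"
  LC_ALL=C sort -u "$2" > "$b_sorted"
  # Peptides present in both digestions (the counts.csv value).
  shared=$(LC_ALL=C comm -12 "$a_sorted" "$b_sorted" | wc -l)
  # Distinct peptides across both digestions (the union).
  union=$(cat "$a_sorted" "$b_sorted" | LC_ALL=C sort -u | wc -l)
  # Distinct peptides in taxon A alone.
  unique_a=$(wc -l < "$a_sorted")
  rm -f "$a_sorted" "$b_sorted"
  awk -v s="$shared" -v u="$union" -v a="$unique_a" 'BEGIN {
    printf "count=%d union_fraction=%.3f individual_fraction=%.3f\n", s, s / u, s / a
  }'
}

# Run only when two peptide lists are given on the command line.
if [ "$#" -eq 2 ]; then
  redundancy_pair "$1" "$2"
fi
```

For example, if taxon A has 4 distinct peptides, taxon B has 6, and 2 are shared, the union holds 8 peptides, so the output is `count=2 union_fraction=0.250 individual_fraction=0.500`.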

5: Remove Taxa from the Database

If you wish to delete data for a given set of taxa in the database, run the command for the relevant data category:

  1. For the Genome data category, remove taxa:
bin/clear_taxon_data.sh --taxon-ids taxa_name taxa2_name
  2. For the Specialized Assembly data category, remove taxa:
bin/clear_specialized_assembly_data.sh --taxon-ids taxa_name taxa2_name
  3. For the Meta-omic Assemblies data category, remove taxa:
bin/clear_metaomic_data.sh --taxon-ids taxa_name taxa2_name

6: Taxonomic Lineages: Least Common Ancestor Calculation

  1. First, obtain the taxonomic lineage information, making sure the lineages are consistent.
    • You can enter your own lineage information following the specified .csv file/header format found here for Genomes and Specialized Assemblies, or here for Meta-omic Assemblies.
    • Or, the python script bin/NCBI_lineage.py will pull that information for you using NCBI taxon ids. First, format your input data according to the format required for each data category. For Genomes and Specialized Assemblies, use the format found here, and for Meta-omic Assemblies use the format found here.
    • If you would like to simply rename your taxa with names other than those of the filenames uploaded, you can create a mapping file to do so with the format found here.
  2. To pull the NCBI Lineage information, run the python script bin/NCBI_lineage.py.
    • You can get help information by running bin/NCBI_lineage.py --help.
    • Supply the formatted taxa info file with flag -f in order to pull the lineage info via the NCBI taxon id.
    • In addition to the formatted input file, an e-mail address must be supplied with flag -e in order to access the NCBI database.
    • You can optionally designate the output file name with flag -o.
    • For example:
      bin/NCBI_lineage.py -f genome_lookup_taxa.csv -e user@email.com -o genome_taxa_lineages.csv
      
  3. Add taxonomic lineage information to the database:
    • For Genomes, run bin/update_taxons.sh --filepath <lineages-file.csv>
      • You can either upload the lineage output file from NCBI_lineage.py or your own lineage information following the template (maintaining same header names as the template) found here.
    • For the Specialized Assembly data category, run bin/update_specialized_assembly_taxons.sh --filepath <lineages-file.csv>.
      • You can either upload the lineage output file from NCBI_lineage.py or your own lineage information following the template (maintaining same header names as the template) found here.
    • For the Meta-omic data category, you need to upload two different files.
      • An Annotation file that includes the list of the metagenome files, the ORFs within that metagenome, and the taxons assigned to those ORFs.
        • For meta-omic assemblies, there may be taxonomic assignments at two different levels. There may be an ORF level taxonomic assignment and a contig level taxonomic assignment. For any given ORF, METATRYP will preferentially pull the taxon info if there is a contig level assignment. If there is no contig level assignment, then the ORF level taxonomic assignment will be used.
        • Run bin/update_metaomic_annotations.sh --filepath <annotations-file.csv> to update the annotations. A very basic example of an annotations file can be found here. Additional annotations may require additional custom scripts.
      • A Lineage file that contains the taxonomic lineage information for the taxa included within the meta-omic file. You can either input the lineages generated from NCBI_lineage.py or upload your own lineages using the template (maintaining the same header names as the template) found here.
        • Run bin/update_metaomic_taxons.sh --filepath <lineages-file.csv> to update the lineages.
  4. Call the Least Common Ancestor Calculation
    • Once the taxonomic lineages are added to the METATRYP database, you can call the LCA function on peptide queries using the flag --lca. See the peptide query section for more information on this function.
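
The contig-over-ORF preference described in step 3 for meta-omic annotations amounts to a simple fallback rule. The sketch below shows only that rule and says nothing about how METATRYP implements it internally; resolve_taxon is a hypothetical helper and the taxon ids are arbitrary placeholders.

```shell
#!/usr/bin/env bash
# Sketch of the precedence rule only: prefer the contig-level taxon,
# fall back to the ORF-level taxon when no contig-level assignment exists.
# resolve_taxon is an illustrative name, not a METATRYP function.

resolve_taxon() {
  local contig_taxon="$1" orf_taxon="$2"
  # ${var:-fallback} expands to the fallback when var is empty or unset.
  echo "${contig_taxon:-$orf_taxon}"
}

# Run only when both values are given on the command line.
if [ "$#" -eq 2 ]; then
  resolve_taxon "$1" "$2"
fi
```

Calling `resolve_taxon 1129 59919` prints 1129 (the contig-level value), while `resolve_taxon "" 59919` prints 59919 because the contig-level value is empty.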