Collection of scripts to download NCBI taxonomy, process BOLD-downloaded sample metadata, and generate necessary files for downstream processes.

bge-barcoding/sample-processing

 
 

Collection of scripts for processing and parsing BOLD-downloaded sample metadata and generating input files for downstream processes (e.g. uploading trimmed reads to ENA, input into skim2mito and MGE, and requesting taxid creation from ENA).

  • Written by Ben Price and Dan Parsons @ NHMUK.
  • For help, see usage information and docstrings within each script.

1_combine_tsv.py

  • Merges TSV files from multiple zipped folders downloaded from BOLD.
    • usage: python combine_tsv.py <output_dir> <zip_files_dir>
    • output_dir: Directory where combined .tsv files will be stored.
    • zip_files_dir: Directory containing the zipped files to be merged.
      • Recommended directory structure:
parent_directory/
├── collection_data.tsv
├── lab.tsv
├── merged_custom_fields.tsv
├── specimen_details.tsv
├── tags.tsv
├── taxonomy.tsv
├── voucher.tsv
│
├── subdirectory.zip
│   ├── collection_data.tsv
│   ├── lab.tsv
│   ├── merged_custom_fields.tsv
│   ├── specimen_details.tsv
│   ├── tags.tsv
│   ├── taxonomy.tsv
│   └── voucher.tsv
│
└── subdirectory2.zip
    ├── collection_data.tsv
    ├── lab.tsv
    ├── merged_custom_fields.tsv
    ├── specimen_details.tsv
    ├── tags.tsv
    ├── taxonomy.tsv
    └── voucher.tsv
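
The merge step can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name combine_tsvs is hypothetical, and it assumes that same-named .tsv files share an identical header row and are UTF-8 encoded.

```python
import zipfile
from pathlib import Path


def combine_tsvs(zip_files_dir, output_dir):
    """Merge same-named .tsv files found in every .zip under zip_files_dir.

    Hypothetical helper: assumes same-named files share one header row.
    """
    zip_files_dir, output_dir = Path(zip_files_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    merged = {}  # tsv file name -> (header line, list of data lines)

    for zip_path in sorted(zip_files_dir.glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.namelist():
                if not member.endswith(".tsv"):
                    continue
                name = Path(member).name
                lines = archive.read(member).decode("utf-8").splitlines()
                if not lines:
                    continue
                header, body = lines[0], lines[1:]
                kept_header, kept_rows = merged.setdefault(name, (header, []))
                if header != kept_header:
                    raise ValueError(f"{name} in {zip_path.name} has a different header")
                # Keep the header once; append only non-empty data lines.
                kept_rows.extend(line for line in body if line.strip())

    for name, (header, rows) in merged.items():
        with open(output_dir / name, "w", encoding="utf-8", newline="") as out:
            out.write(header + "\n")
            out.writelines(row + "\n" for row in rows)
    return sorted(merged)
```

Passing the same directory for both arguments would reproduce the recommended layout above, with the combined .tsv files sitting next to the zipped downloads.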

2_download_taxonomy.sh

  • Downloads the newest NCBI taxonomy (i.e. new_taxdump).
  • If the taxdump was previously downloaded to the same directory, the script will replace that taxdump with the newest version. The tar.gz download will also be removed post-extraction.
    • usage: bash/sbatch/srun download_taxonomy.sh <output_file> <extract_dir>
    • output_file = Name of NCBI taxdump (e.g. <date>_taxdump.tar.gz)
    • extract_dir = Directory to output taxdump files to (e.g. <date>_taxdump). If given a relative or absolute path, new directories will be created.
      • Script will prompt user (Y/N) to confirm removal of directories, if applicable.
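
The script itself is a shell script; for readers who prefer Python, the same steps (download, replace any older taxdump, extract, delete the archive) might look like the sketch below. The function names are hypothetical, the URL is assumed to be the new_taxdump archive the script fetches, and the Y/N confirmation prompt is omitted.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path

# Assumption: NCBI's public new_taxdump archive is the file being fetched.
TAXDUMP_URL = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz"


def extract_taxdump(archive_path, extract_dir):
    """Extract a taxdump tar.gz into extract_dir, then delete the archive."""
    archive_path, extract_dir = Path(archive_path), Path(extract_dir)
    if extract_dir.exists():
        # Replace a previously downloaded taxdump (no confirmation prompt here).
        shutil.rmtree(extract_dir)
    extract_dir.mkdir(parents=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(extract_dir)
    archive_path.unlink()  # remove the tar.gz post-extraction
    return extract_dir


def download_taxdump(output_file, extract_dir, url=TAXDUMP_URL):
    """Download the archive to output_file and extract it into extract_dir."""
    urllib.request.urlretrieve(url, output_file)
    return extract_taxdump(output_file, extract_dir)
```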

3_local_bold_processing.py

  • Merges relevant sample metadata from BOLD .tsv files, resolves each sample's taxonomic ID using the taxonomic hierarchy, and checks for mismatches between BOLD and NCBI taxonomy. Outputs sample_metadata.csv containing the fields below:
    • Sample ID
    • Process ID
    • BOLD taxonomic ranks (phylum->species)
    • taxid
    • matched_rank (taxonomic rank the taxid corresponds to)
    • lineage (full NCBI lineage for taxid)
    • lineage_mismatch (whether the major BOLD taxonomic ranks match the fetched NCBI lineage)
    • BOLD sample metadata: Identifier (identified_by), Collection Date (collection_date), Geographic Location (Country/Ocean) (geographic_location), Exact Site (geographic_location_locality), Latitude (latitude), Longitude (longitude), Collected By (collected_by), Habitat (habitat), Tissue Descriptor (organism_part), Sex (sex), Life stage (lifestage), Museum ID (specimen_voucher), Institution Storing (collecting_institution), Type Status (type_status).
  • usage: python local_bold_processing.py <input_dir> <rankedlineage_path> <output_dir>
    • input_dir = Directory containing BOLD-downloaded sample metadata (.tsv files).
    • rankedlineage_path = Path to NCBI taxonomic hierarchy/lineage (<date>_taxdump/rankedlineage.dmp file).
    • output_dir = Directory to output sample_metadata.csv to.
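
One way to picture the hierarchical taxid resolution is the simplified sketch below. It is not the script's actual logic: the function names are hypothetical, and it assumes the first eight fields of rankedlineage.dmp are tax_id, tax_name, species, genus, family, order, class, and phylum. It tries the lowest BOLD rank first, walks upwards until a name is found in the NCBI dump, and flags a mismatch when the higher ranks disagree.

```python
RANKS = ["species", "genus", "family", "order", "class", "phylum"]  # lowest first


def parse_rankedlineage(lines):
    """Index rankedlineage.dmp records by taxon name (assumed field order)."""
    by_name = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        taxid, name = int(fields[0]), fields[1]
        lineage = dict(zip(RANKS, fields[2:8]))
        by_name.setdefault(name, []).append((taxid, lineage))
    return by_name


def _agrees(bold_ranks, lineage, higher_ranks):
    """True if every higher rank present on both sides carries the same name."""
    return all(
        not bold_ranks.get(rank) or not lineage[rank] or lineage[rank] == bold_ranks[rank]
        for rank in higher_ranks
    )


def resolve_taxid(bold_ranks, by_name):
    """Return (taxid, matched_rank, lineage_mismatch) for one BOLD sample."""
    for i, rank in enumerate(RANKS):
        candidates = by_name.get(bold_ranks.get(rank) or "", [])
        if not candidates:
            continue  # name unknown to NCBI at this rank, try the next rank up
        higher = RANKS[i + 1:]
        for taxid, lineage in candidates:
            if _agrees(bold_ranks, lineage, higher):
                return taxid, rank, False
        # Name found, but no candidate shares the BOLD higher ranks.
        return candidates[0][0], rank, True
    return None, None, None
```

Checking the higher ranks before accepting a match is one way to guard against homonyms, where the same name is used for unrelated taxa.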

If metadata was not collected for a particular sample, 'not collected' is written to the field, as required by the ToL ENA sample registration checklist.

Example samples_metadata.csv

Sample ID | Process ID | Phylum | Class | Order | Family | Subfamily | Genus | Species | taxid | matched_rank | specimen_voucher | lifestage | collection_date | geographic_location | geographic_location_locality | latitude | longitude | collected_by | habitat | identified_by | collecting_institution | organism_part | sex | type_status
BGE_0001_A01 | BSNHM001-24 | Arthropoda | Insecta | Trichoptera | Apataniidae | Apataniinae | Apatania | Apatania stylata | 177658 | genus | 'Museum ID' | adult | YYYY-MM-DD | France | not collected | lat (DD) | lon (DD) | 'Collectors' | not collected | not collected | NHMMUK | Whole | M | type
BGE_0001_A02 | BSNHM002-24 | Arthropoda | Insecta | Trichoptera | Glossosomatidae | Agapetinae | Agapetus | Agapetus iridipennis | 177627 | genus | 'Museum ID' | adult | YYYY-MM-DD | Switzerland | not collected | lat (DD) | lon (DD) | 'Collectors' | not collected | not collected | Naturalis | not collected | F | Paratype
BGE_0001_A03 | BSNHM003-24 | Arthropoda | Insecta | Trichoptera | Glossosomatidae | Hydropsychidae | Diplectrona | Diplectrona meridionalis | 177860 | genus | 'Museum ID' | adult | YYYY-MM-DD | France | not collected | lat (DD) | lon (DD) | 'Collectors' | not collected | not collected | NHMMUK | Leg | M | no

4_samples_spreadsheet.py

Script to generate the samples.csv required to run the MGE and skim2mito pipelines. The script also screens for BGE project samples using the 27 BGE project codes.

  • usage: python 4_samples_spreadsheet.py [path/to/raw/read/dir] [path/to/samples_metadata.csv]
  • path/to/raw/read/dir: Path to parent directory with a 'flat' or 'nested' structure (see below, or the docstring, for more examples) containing raw PE read files.
  • path/to/samples_metadata.csv: Path to the samples_metadata.csv file.
  • Outputs three CSV files to the current directory, each containing ID (Process ID), forward (absolute path to the R1 read, fastq.gz), reverse (absolute path to the R2 read, fastq.gz), and taxid. samples_[parent_dir_name]_types.csv also contains the type_status field, for reference.
    • samples_[parent_dir_name].csv = contains all samples containing BGE project codes in their Process IDs.
    • samples_[parent_dir_name]_nonproject.csv = contains all samples not containing BGE project codes.
    • samples_[parent_dir_name]_types.csv = contains all samples (project or non-project) that have 'type' in Type Status field.
Example Directory Structures Supported (where BSNHM593-24 is the process ID):
1. Flat structure:
   /parent_dir/
   ├── BSNHM593-24_R1.fq
   ├── BSNHM593-24_R2.fq
   └── ...

2. Nested structure:
   /parent_dir/XE-4013/
   └── 20240906_LH00179_0123_A22CKGHLT4/
       ├── Sample_XE-4013-BSNHM593-24/
       │   ├── BSNHM593-24_R1.fastq.gz
       │   └── BSNHM593-24_R2.fastq.gz
       └── ...
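
Both layouts can be handled by one recursive search, as in the rough sketch below. The function names and the file-name pattern are assumptions based on the examples above, not the script's actual code, and the project code passed to split_by_project is a placeholder because the 27 BGE project codes are not listed here.

```python
import re
from pathlib import Path

# Assumed naming convention: <Process ID>_R1 or _R2, then .fq/.fastq, optionally .gz
READ_PATTERN = re.compile(r"^(?P<pid>.+)_R(?P<mate>[12])\.(?:fastq|fq)(?:\.gz)?$")


def find_read_pairs(parent_dir):
    """Map Process ID -> (absolute R1 path, absolute R2 path), flat or nested."""
    mates = {}
    for path in sorted(Path(parent_dir).rglob("*")):
        match = READ_PATTERN.match(path.name)
        if match and path.is_file():
            mates.setdefault(match["pid"], {})[match["mate"]] = str(path.resolve())
    # Keep only samples for which both mates were found.
    return {pid: (m["1"], m["2"]) for pid, m in mates.items() if "1" in m and "2" in m}


def split_by_project(process_ids, project_codes):
    """Split Process IDs into (project, non-project) by substring match."""
    project = [pid for pid in process_ids if any(code in pid for code in project_codes)]
    nonproject = [pid for pid in process_ids if pid not in project]
    return project, nonproject
```

For example, split_by_project(["BSNHM593-24", "OTHER001-24"], ["BSNHM"]) treats "BSNHM" as a hypothetical project code and returns the first ID as a project sample and the second as non-project.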

Miro workflow

[Image: Miro workflow diagram]
