gene_fetch

A Python tool for retrieving protein and/or gene sequences from NCBI databases. The script can fetch both protein and nucleotide sequences for a given gene across multiple taxa, with support for traversing taxonomic hierarchies when sequences aren't available at the given taxonomic level (dictated by input taxid).

Fetch protein and/or nucleotide sequences from NCBI databases using taxonomic ID (taxid). Handles both direct nucleotide sequences and protein-linked nucleotide references.
Support for both protein-coding and rRNA genes.
Automatic taxonomy traversal using NCBI lineage for given taxid when sequences aren't found at input taxonomic level. BY traversing up the fetched NCBI lineage, potential homonyms are avoided.
Configurable sequence length thresholds, with minimum length requirements (can be altered).
Robust error handling, logging, and NCBI API rate limiting to comply with guidelines (10 requests/second. Requires valid NCBI API key and email for optimal performance).
Handles complex sequence features (e.g., complement strands, joined sequences) in addition to 'simple' cds extaction (if --type nucleotide/both).

Usage

python 1_gene_fetch.py <gene_name> <output_directory> <samples.csv> --type <sequence_type>
  <gene_name>: Name of gene to search for in NCBI RefSeq database (e.g., cox1/16s/rbcl).
  <output_directory>: Path to directory to save output files. The directory will be created if it does not exist.
  <samples.csv>: Path to input CSV file containing Process IDs (ID column) and TaxIDs (taxid column).
  --type: Sequence type to fetch ('protein', 'nucleotide', or 'both')
  --protein_size: Minimum protein sequence length (default: 500). Optional.
  --nucleotide_size: Minimum nucleotide sequence length (default: 1500). Optional.

'Checkpointing' available: If the script fails during a run, it can be rerun using the same commands/inputs and it will skip IDs with entries already in the sequence_references.csv and with .fasta files already present in the output directory.
./snakemakeSFU/workflow/envs/fetch.yaml contains all necessary dependencies to run the script. Can create conda env using commmand below (once conda is installed on your system).

conda env create -f fetch.yaml

samples.csv input file example

ID	forward	reverse	taxid
BSNHM002-24	abs/path/to/R1.fq.gz	abs/path/to/R2.fq.gz	177658
BSNHM038-24	abs/path/to/R1.fq.gz	abs/path/to/R2.fq.gz	177627
BSNHM046-24	abs/path/to/R1.fq.gz	abs/path/to/R2.fq.gz	3084599

Only 'ID' and 'taxid' columns are essential.

Running gene_fetch.py on a cluster

See 1_gene_fetch.sh (for running via Slurm).

Output Structure

output_dir/
├── nucleotide/             # Only populated if '--type nucleotide/both' utilised.
│   ├── SAMPLE1_dna.fasta   
│   ├── SAMPLE2_dna.fasta
│   └── ...
├── SAMPLE1.fasta           # Protein sequences
├── SAMPLE2.fasta
├── sequence_references.csv # Sequence metadata (for input into MGE snakemake pipeline)
├── failed_searches.csv     # Failed search attempts (ideally empty)
└── gene_fetch.log          # Operation log

sequence_references.csv output example

ID	taxid	protein_accession	protein_length	nucleotide_accession	nucleotide_length	matched_rank	ncbi_taxonomy	reference_name	protein_reference_path	nucleotide_reference_path
BSNHM002-24	177658	AHF21732.1	510	KF756944.1	1530	genus: Apatania	Eukaryota, ..., Apataniinae, Apatania	BSNHM002-24	abs/path/to/protein_references/BSNHM002-24.fasta	abs/path/to/protein_references/BSNHM002-24_dna.fasta

Supported targets

Script functions with other gene/protein targets than those listed below, but has hard-coded synonymns to catch name variations (of the below targets). More targets can be added into script (see 'class config').
cox1/COI
cox2/COII
cox3/COIII
cytb
nd1
rbcl
matk
16s
18s
28s
12s

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
snakemakeSFU		snakemakeSFU
1_gene_fetch.sh		1_gene_fetch.sh
LICENSE		LICENSE
README.md		README.md
gene_fetch.py		gene_fetch.py
run_snakefile.sh		run_snakefile.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

gene_fetch

Usage

samples.csv input file example

Running gene_fetch.py on a cluster

Output Structure

sequence_references.csv output example

Supported targets

About

Releases

Packages

Languages

License

SchistoDan/gene_fetch

Folders and files

Latest commit

History

Repository files navigation

gene_fetch

Usage

samples.csv input file example

Running gene_fetch.py on a cluster

Output Structure

sequence_references.csv output example

Supported targets

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages