--include none and --chromosomes all #298

Open
conchoecia opened this issue Dec 19, 2023 · 4 comments

Comments

@conchoecia

conchoecia commented Dec 19, 2023

Hello,

I would like to use the --chromosomes all option when I download a genome so that I get only the chromosomes. I noticed that using this option also automatically downloads the complete genome FASTA file (I think because --include genome appears to be the default). For example, when I run this command: datasets download genome accession GCA_940337035.1 --chromosomes all --filename TEST.zip, these are the resulting files:

Archive:  TEST.zip
  inflating: README.md
  inflating: ncbi_dataset/data/assembly_data_report.jsonl
  inflating: ncbi_dataset/data/GCA_940337035.1/GCA_940337035.1_PGI_AGRIOTES_LIN_V1_genomic.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr1.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr2.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr3.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr4.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr5.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr6.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr7.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr8.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr9.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr10.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/unplaced.scaf.fna

I do not want to download GCA_940337035.1_PGI_AGRIOTES_LIN_V1_genomic.fna.

I thought that trying --chromosomes all --include none would allow me to download the fasta files of just the scaffolds designated as chromosomes, but it doesn't download any sequence.

Do you have any suggestions on how to download just the chromosome scaffolds without having to filter based on the information in the sequence report? I am using datasets v15.29.0.

Thank you!
Darrin

@ericcox1
Collaborator

Hi @conchoecia,

Thanks for opening this issue.

> I noticed that using this option also automatically downloads the complete genome fasta file

This is a bug. We will try to fix this soon. In the meantime, I suggest that you try the following to only download the chromosome sequences:

  1. Download a dehydrated package
    datasets download genome accession GCA_940337035.1 --chromosomes all --filename TEST.zip --dehydrated
  2. Unzip the downloaded package
    unzip TEST.zip -d TEST
  3. Rehydrate the extracted package, using --match to selectively download filenames that include "chr"
    datasets rehydrate --directory TEST --match chr

Thanks again for opening this issue. I'll comment on this thread when we have a bug fix ready.

Best,
Eric

Eric Cox, PhD [Contractor] (he/him/his)
NCBI Datasets
Sequence Enhancements, Tools and Delivery (SeqPlus)
NIH/NLM/NCBI
eric.cox@nih.gov

@conchoecia
Author

conchoecia commented Dec 20, 2023

Hi @ericcox1,

This solution works well - thanks! I will adjust my scripts to do this instead of parsing the sequence report .json file.

-Darrin


Update: I found that this process also pulls scaffolds that are known to belong to a specific chromosome but are not actually placed on it.

For example, the bird genome GCA_027574665.1 has named chromosomes with the properties {"assignedMoleculeLocationType":"Chromosome", "role":"assembled-molecule"}. It also has scaffolds that are assigned to a specific chromosome but whose position on it is unknown. These scaffolds are all shorter than 1 Mbp and have the properties {"assignedMoleculeLocationType":"Chromosome", "role":"unlocalized-scaffold"}. I'm not sure yet whether I want to exclude the second type from my analysis, but this would be a good reason to parse the seq-report from datasets download genome accession GCA_027574665.1 --include seq-report.
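
If the second type does need to be excluded, one option is to filter the sequence report by role before deciding which sequences to keep. The following is only a minimal sketch, not a datasets feature: the function name filter_by_role is made up for illustration, and it assumes the report is in JSON Lines form (one record per line) with a "role" key as in the properties quoted above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the datasets CLI): keep only the
# JSON Lines records whose "role" equals the requested value.
# Assumption: one JSON object per line. This is a plain text match,
# not a real JSON parser.
filter_by_role() {
    local role=$1
    # Allow optional whitespace after the colon. "|| true" keeps the
    # exit status at zero when no record matches.
    grep -E "\"role\":[[:space:]]*\"${role}\"" || true
}
```

It reads the report on standard input, for example filter_by_role assembled-molecule < report.jsonl, where report.jsonl is a placeholder for wherever the sequence report was saved.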

@conchoecia
Author

conchoecia commented Dec 30, 2023

Hi @ericcox1,

I identified a place where this breaks - for some assemblies, rehydrating still downloads the entire genome assembly FASTA file in addition to the individual chromosome-scale scaffold files that I requested.

Here is a minimal example that uses the latest release of datasets:

#!/bin/bash

# For the genome assembly GCA_933207985.1, downloading the chromosome-scale scaffolds appears to result in two errors:
#  - The first error is that all of the chromosome-scale scaffolds downloaded twice.
#  - The second error is that all of the non-chromosome-scale scaffolds downloaded more than once.

ASSEMBLY=GCA_933207985.1

# set up datasets
curl -o datasets 'ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
chmod u+x ./datasets

# Using the example from this github issue: https://github.com/ncbi/datasets/issues/298
# The way it works is by downloading a dehydrated dataset, then rehydrating with --match to select only the chromosome-scale scaffolds
./datasets download genome accession ${ASSEMBLY} --chromosomes all --filename TEST.zip --dehydrated
unzip TEST.zip -d TEST
./datasets rehydrate --directory TEST --match chr

The resulting files are:

./TEST/ncbi_dataset/data/GCA_933207985.1/chr05.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr02.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr13.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr14.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr03.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr04.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr12.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr11.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr09.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr07.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr10.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr01.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr06.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/GCA_933207985.1_aPelCul1.1_chrom_genomic.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr08.fna

However, the file ./TEST/ncbi_dataset/data/GCA_933207985.1/GCA_933207985.1_aPelCul1.1_chrom_genomic.fna should not be present, based on how I've seen the example work with other assembly accessions. I am not sure whether this happens for more than one accession. Thanks!
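
Until this is fixed, a script can at least flag the extra file after rehydration. This is a rough sketch that does not address the underlying bug: list_unexpected_fasta is a made-up name, and it assumes that every wanted file is named chr*.fna, as in the listing above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the datasets CLI): print every .fna file
# under a directory whose file name does not start with "chr".
# Assumption: the per-chromosome files are all named chr*.fna.
list_unexpected_fasta() {
    local dir=$1
    find "$dir" -type f -name '*.fna' ! -name 'chr*' | sort
}
```

Running list_unexpected_fasta TEST/ncbi_dataset/data on the directory listed above should report only the *_chrom_genomic.fna file, which can then be removed or ignored by downstream steps.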

@conchoecia
Author

I found another place where this breaks. Some assemblies, despite having chromosome-scale scaffolds, produce the error 'Found no files for rehydration' when I run this. The assembly I found that causes this error is GCF_905220415.1.

Here is the minimal example:

#!/bin/bash

# For the record GCF_905220415.1, there is some problem where the final fasta file is empty when using this method.
# Closer inspection reveals that the database correctly identifies certain scaffolds as being chromosome-scale, but
#  they are not downloaded correctly

ASSEMBLY=GCF_905220415.1

# set up datasets
curl -o datasets 'ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
chmod u+x ./datasets

# Check if there are chromosome-scale scaffolds
./datasets summary genome accession ${ASSEMBLY} --report sequence --as-json-lines | grep 'Chromosome' | head -5

# remove old files from previous runs
rm -rf TEST/ TEST.zip
# Using the example from this github issue: https://github.com/ncbi/datasets/issues/298
# The way it works is by downloading a dehydrated dataset, then rehydrating with --match to select only the chromosome-scale scaffolds
./datasets download genome accession ${ASSEMBLY} --chromosomes all --filename TEST.zip --dehydrated
unzip TEST.zip -d TEST
./datasets rehydrate --directory TEST --match chr

Here are the results of running the above script, showing that there are chromosome-scale scaffolds, but the rehydration did not work.

 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.5M  100 17.5M    0     0  5424k      0  0:00:03  0:00:03 --:--:-- 5424k
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"1","gc_count":"5143305","gc_percent":34,"genbank_accession":"HG991959.1","length":15086434,"refseq_accession":"NC_059537.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"2","gc_count":"4472954","gc_percent":34,"genbank_accession":"HG991960.1","length":13248411,"refseq_accession":"NC_059538.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"3","gc_count":"4471753","gc_percent":34,"genbank_accession":"HG991961.1","length":13170806,"refseq_accession":"NC_059539.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"4","gc_count":"4360384","gc_percent":34,"genbank_accession":"HG991962.1","length":12846590,"refseq_accession":"NC_059540.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"5","gc_count":"4238058","gc_percent":33.5,"genbank_accession":"HG991963.1","length":12694599,"refseq_accession":"NC_059541.1","role":"assembled-molecule"}
Collecting 1 genome record [================================================] 100% 1/1
Downloading: TEST.zip    3.98kB valid zip structure -- files not checked
Validating package [================================================] 100% 4/4
Archive:  TEST.zip
  inflating: TEST/README.md
  inflating: TEST/ncbi_dataset/data/assembly_data_report.jsonl
  inflating: TEST/ncbi_dataset/fetch.txt
  inflating: TEST/ncbi_dataset/data/dataset_catalog.json
Found no files for rehydration
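
One way to narrow this down might be to check whether anything listed in the extracted TEST/ncbi_dataset/fetch.txt contains the string passed to --match. The sketch below rests on two assumptions that are not confirmed in this thread: that fetch.txt lists one downloadable file per line, and that --match is compared against those entries. The name count_matching_lines is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of the datasets CLI): count the lines of a
# text file that contain a fixed string.
count_matching_lines() {
    local file=$1 needle=$2
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi
    # grep -c prints 0 but exits non-zero when nothing matches, so mask that.
    grep -c -F -- "$needle" "$file" || true
}
```

For example, count_matching_lines TEST/ncbi_dataset/fetch.txt chr. A count of 0 would be consistent with the message above, although it would not by itself explain why no per-chromosome files are listed for this assembly.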
