Can't download viral library #221

Closed
ALFLAG opened this issue Oct 18, 2021 · 19 comments

@ALFLAG

ALFLAG commented Oct 18, 2021

Hi,
When using centrifuge-download, I couldn't download the viral library, while archaea and bacteria were OK.
Any suggestions?
I used version 1.0.4, installed using conda.

Thanks in advance.
Alex

@mperisin-lallemand

This appears to be due to line 368 of the centrifuge-download script: "cut -f "$TAXID_FIELD,$FTP_PATH_FIELD,$FTP_PATH_FIELD2" "$ASSEMBLY_SUMMARY_FILE" | ". I manually changed it to: "cut -f "$TAXID_FIELD,$FTP_PATH_FIELD" "$ASSEMBLY_SUMMARY_FILE" | ", and that solved the issue for me. Further up in that script, on line 305, the FTP_PATH_FIELD2 variable is defined with the following comment: "## Needed for wrongly formatted virus files - hopefully just a temporary fix." So I guess the viral "assembly_summary.txt" has since been reformatted, and this previously needed fix is no longer necessary.
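
The effect of the extra field can be seen on a single made-up row. This is a minimal sketch, assuming TAXID_FIELD, FTP_PATH_FIELD and FTP_PATH_FIELD2 correspond to columns 6, 20 and 21 (the numbers in the "cut -f6,20,21" command that appears later in this thread). The row is a hypothetical stand-in in which every other column is a placeholder; it is not taken from the real assembly_summary.txt.

```shell
# Hypothetical 21-column, tab-separated row: only columns 6, 20 and 21 carry
# real-looking values; every other column is the placeholder "x".
row=$(awk 'BEGIN {
  OFS = "\t"
  for (i = 1; i <= 21; i++) $i = "x"
  $6  = "10243"
  $20 = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174"
  $21 = "na"
  print
}')

# Line 368 as shipped: taxid, ftp_path, and the extra column.
three_fields=$(printf '%s\n' "$row" | cut -f 6,20,21)
# Line 368 as edited: taxid and ftp_path only.
two_fields=$(printf '%s\n' "$row" | cut -f 6,20)

printf 'with extra column:    %s\n' "$three_fields"
printf 'without extra column: %s\n' "$two_fields"
```

With the extra column, each line ends in a tab followed by "na", so anything downstream that treats the end of the line as the end of the FTP path would pick up that trailing "na" as well.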

@mourisl
Collaborator

mourisl commented Jan 14, 2022

I have added a patch to handle this formatting issue. Thank you for identifying this, @mperisin-lallemand!

@mourisl mourisl closed this as completed Jan 14, 2022
@mwylerCH

Is it possible that the same problem has come back?

@ruysan

ruysan commented Aug 5, 2024

I can't download viral sequences. Is it because my running version is outdated? The program ends at 100% progress but the library folder is empty.
My version:
/opt/nesi/CS400_centos7_bdw/Centrifuge/1.0.4-GCCcore-9.2.0/bin/centrifuge-class version 1.0.4
64-bit
Built on mahuika01
Mon Mar 28 09:16:18 NZDT 2022
Compiler: gcc version 9.2.0 (GCC)
Options: -O2 -ftree-vectorize -march=broadwell -fno-math-errno -std=c++11 -DPOPCNT_CAPABILITY

centrifuge-download -o library -d viral refseq > seqid2taxid.map
Downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt ...
basename: extra operand ‘na_genomic.fna.gz’
Try 'basename --help' for more information.
cat: library/viral/: Is a directory
Progress : [----------------------------------------] 0% 1/14527basename: extra operand ‘na_genomic.fna.gz’
Try 'basename --help' for more information.
cat: library/viral/: Is a directory
Progress : [----------------------------------------] 0% 2/14527basename: extra operand ‘na_genomic.fna.gz’
...
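
GNU basename prints "extra operand" when it receives more than two arguments, which is what happens when a path containing whitespace is passed unquoted. The sketch below reproduces the message with a hypothetical path in which a tab and a stray "na" column have ended up inside the folder name. Whether this is what happens inside the installed centrifuge-download is an assumption; it is not confirmed in this thread.

```shell
# Hypothetical path: a tab plus a stray "na" column ended up inside the name.
bad_path=$(printf 'https://example.org/dir/GCF_X\tna/GCF_X\tna_genomic.fna.gz')

# Count the words the shell produces when the value is expanded unquoted.
count_words() {
  set -- $1   # deliberately unquoted: whitespace splits the value
  echo "$#"
}

count_words "$bad_path"            # number of operands basename would receive
basename $bad_path 2>&1 || true    # unquoted: GNU basename rejects the third operand
basename "$bad_path"               # quoted: a single operand, no error
```

The last word after splitting is "na_genomic.fna.gz", the same operand named in the error messages above.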

@mourisl
Collaborator

mourisl commented Aug 5, 2024

Could you please share the assembly_summary.txt file under library/viral in your download folder?

@ruysan

ruysan commented Aug 5, 2024

The file is almost 7 MB.
Here are the first few lines:
head assembly_summary.txt

See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.

#assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name asm_submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date assembly_type group genome_size genome_size_ungapped gc_percent replicon_count scaffold_count contig_count annotation_provider annotation_name annotation_date total_gene_count protein_coding_gene_count non_coding_gene_count pubmed_id
GCF_000839185.1 PRJNA485481 na na na 10243 10243 Cowpox virus strain=Brighton Red na latest Complete Genome MajorFull 2003/05/19 ViralProj14174 Molecular Genetics and Microbiology, Duke University Medical Center GCA_000839185.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174 na ICTV species exemplar na haploid viral 224499224499 33.500000 1 1 1 NCBI RefSeq Annotation submitted by NCBI RefSeq 2018/08/13 233 233 0 6961398;8091665;2014645;2309453
GCF_014621545.1 PRJNA485481 na na na 10244 10244 Monkeypox virus na MPXV-M5312_HM12_Rivers latest Complete Genome MajorFull 2022/05/30 ASM1462154v1 NCEZID/DHCPP/PRB, CDC GCA_014621545.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/014/621/545/GCF_014621545.1_ASM1462154v1 na na na haploid viral 197209 197209 33.000000 1 1 1 NCBI RefSeq Annotation submitted by NCBI RefSeq 2022/05/30 179 179 0 30660046;34253028;32880628
GCF_000857045.1 PRJNA485481 na na na 10244 10244 Monkeypox virus strain=Zaire-96-I-16 na latest Complete Genome MajorFull 2001/12/21 ViralProj15142 Department of Molecular Biology of Genomes, SRC VB Vector GCA_000857045.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/857/045/GCF_000857045.1_ViralProj15142 na ICTV species exemplar na haploid viral 196858 19685833.000000 1 1 1 NCBI RefSeq Annotation submitted by NCBI RefSeq 2022/07/08 180 180 0 11734207;30660046;34253028

@mourisl
Collaborator

mourisl commented Aug 5, 2024

How about the assembly_summary_filtered file? There are several issues in the downloaded file, like MajorFull, which should be "Major\tFull", but I'm not sure whether this is a copy/paste error.

How about using the option "-g wget" to download the file with a different method?
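
Whether the tabs survived can be checked on the file itself rather than on pasted text. This is a minimal sketch with a hypothetical helper named column_counts, which counts the tab-separated columns on every non-comment line. The header quoted earlier lists 38 column names, so an intact file would be expected to show a single column count for all data lines; lines with fewer columns would point to lost tabs.

```shell
# Print "<number of lines> <columns per line>" for every distinct column count.
# Lines starting with "#" (the README pointer and the header) are skipped.
column_counts() {
  grep -v '^#' "$1" | awk -F '\t' '{ print NF }' | sort -n | uniq -c
}

# Usage, assuming the file sits where centrifuge-download put it:
#   column_counts library/viral/assembly_summary.txt
```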

@ruysan

ruysan commented Aug 5, 2024

Assembly_summary_filtered looks like this:
GCF_000839185.1 PRJNA485481 na na na 10243 10243 Cowpox virus strain=Brighton Red na latest Complete Genome MajorFull 2003/05/19 ViralProj14174 Molecular Genetics and Microbiology, Duke University Medical Center GCA_000839185.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174 na ICTV species exemplar na haploid viral 224499224499 33.500000 1 1 1 NCBI RefSeq Annotation submitted by NCBI RefSeq 2018/08/13 233 233 0 6961398;8091665;2014645;2309453
GCF_014621545.1 PRJNA485481 na na na 10244 10244 Monkeypox virus na MPXV-M5312_HM12_Rivers latest Complete Genome MajorFull 2022/05/30 ASM1462154v1 NCEZID/DHCPP/PRB, CDC GCA_014621545.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/014/621/545/GCF_014621545.1_ASM1462154v1 na na na haploid viral 197209 197209 33.000000 1 1 1 NCBI RefSeq Annotation submitted by NCBI RefSeq 2022/05/30 179 179 0 30660046;34253028;32880628
GCF_000857045.1 PRJNA485481 na na na 10244 10244 Monkeypox virus strain=Zaire-96-I-16 na latest Complete Genome MajorFull 2001/12/21 ViralProj15142 Department of Molecular Biology of Genomes, SRC VB Vector GCA_000857045.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/857/045/GCF_000857045.1_ViralProj15142 na ICTV species exemplar na haploid viral 196858 19685833.000000 1 1 1 NCBI RefSeq Annotation submitted by NCBI RefSeq 2022/07/08 180 180 0 11734207;30660046;34253028
GCF_000860085.1 PRJNA485481 na na na 10245 10245 Vaccinia virus strain=WR (Western Reserve) na latest Complete Genome Major Full 2005/05/19 ViralProj15241 National Center for Infectious Diseases, Centers for Disease Control and Prevention GCA_000860085.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/860/085/GCF_000860085.1_ViralProj15241 na ICTV species exemplarna haploid viral 194711 194711 33.500000 1 1 1 NCBI RefSeq Annotation submitted by NCBI RefSeq 2018/08/13 223 223 0 na

@ruysan

ruysan commented Aug 5, 2024

Using wget produces the same result. Also, seqid2taxid.map is empty.
centrifuge-download -g wget -o library -d "viral" refseq > seqid2taxid.map
Downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt ...
basename: extra operand ‘na_genomic.fna.gz’
Try 'basename --help' for more information.
cat: library/viral/: Is a directory
Progress : [----------------------------------------] 0% 1/14527basename: extra operand ‘na_genomic.fna.gz’
Try 'basename --help' for more information.
cat: library/viral/: Is a directory
Progress : [----------------------------------------] 0% 2/14527basename: extra operand ‘na_genomic.fna.gz’
Try 'basename --help' for more information.

@mourisl
Collaborator

mourisl commented Aug 5, 2024

Can you share the whole assembly_summary_filtered.txt file? Copying and pasting it here may have introduced formatting issues.

@ruysan

ruysan commented Aug 5, 2024

Here are both files.
assembly_summary.txt
assembly_summary_filtered.txt

@mourisl
Collaborator

mourisl commented Aug 5, 2024

Your file looks correct to me. Could you please run a command like "cut -f6,20,21 assembly_summary_filtered.txt | awk -F "\t" '{if ($2~/ftp/) print $1"\t"$2; if ($3~/ftp/) print $1"\t"$3}' | sed 's#([^/]*)$#\1/\1_genomic.fna.gz#' " to see whether there is strange output such as na_genomic.fna.gz on your system?

@ruysan

ruysan commented Aug 5, 2024

Sorry, I can't figure out the sed call. It returns an error:
sed: -e expression #1, char 32: invalid reference \1 on `s' command's RHS
The cut and awk calls work, to give:
...
3070917 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/029/886/195/GCF_029886195.1_ASM2988619v1
3070918 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/029/888/295/GCF_029888295.1_ASM2988829v1
3070923 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/023/156/955/GCF_023156955.1_ASM2315695v1
...

@ruysan

ruysan commented Aug 5, 2024

"genomic_fna" does not appear in the output of the cut | awk command

@mourisl
Collaborator

mourisl commented Aug 5, 2024

It might be a bash version issue. Would a command like "cut -f6,20,21 assembly_summary_filtered.txt | awk -F "\t" '{if ($2~/ftp/) print $1"\t"$2; if ($3~/ftp/) print $1"\t"$3}' | sed 's/([^/]*)$/\1/\1_genomic.fna.gz/' " work? You need the sed command to duplicate the folder name of the FTP site and append the _genomic.fna.gz extension.

@ruysan

ruysan commented Aug 5, 2024

No luck.
cut -f6,20,21 assembly_summary_filtered.txt | awk -F "\t" '{if ($2~/ftp/) print $1"\t"$2; if ($3~/ftp/) print $1"\t"$3}' | sed 's/([^/]*)$/\1/\1_genomic.fna.gz/'
sed: -e expression #1, char 15: unknown option to `s'

@ruysan

ruysan commented Aug 5, 2024

Appending "_genomic.fna.gz" to the FTP address and looking for the file returns a 404 error:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/857/045/GCF_000857045.1_ViralProj15142_genomic.fna.gz
--2024-08-05 16:07:36-- https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/857/045/GCF_000857045.1_ViralProj15142_genomic.fna.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.10, 130.14.250.12, 130.14.250.11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.10|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2024-08-05 16:07:38 ERROR 404: Not Found.

@mourisl
Collaborator

mourisl commented Aug 5, 2024

No luck.

Could you please add the "-e" or "-E" option to the sed command?

The FTP link to a file is: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/857/045/GCF_000857045.1_ViralProj15142/GCF_000857045.1_ViralProj15142_genomic.fna.gz ; the folder name is repeated.
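
The repetition can be written out without sed. This is a minimal sketch using bash parameter expansion; genome_url is a hypothetical helper name, not part of centrifuge-download.

```shell
# Turn an ftp_path directory into the URL of its genomic FASTA file by
# repeating the last path component and appending _genomic.fna.gz.
genome_url() {
  local dir=${1%/}        # drop one trailing slash, if present
  local name=${dir##*/}   # last path component (the assembly folder name)
  printf '%s/%s_genomic.fna.gz\n' "$dir" "$name"
}

genome_url "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/857/045/GCF_000857045.1_ViralProj15142"
```

For the ftp_path value shown, this prints the same link as the one given above.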

@ruysan

ruysan commented Aug 5, 2024

Same result:
cut -f6,20,21 assembly_summary_filtered.txt | awk -F "\t" '{if ($2~/ftp/) print $1"\t"$2; if ($3~/ftp/) print $1"\t"$3}' | sed -e 's/([^/]*)$/\1/\1_genomic.fna.gz/'
sed: -e expression #1, char 15: unknown option to `s'

cut -f6,20,21 assembly_summary_filtered.txt | awk -F "\t" '{if ($2~/ftp/) print $1"\t"$2; if ($3~/ftp/) print $1"\t"$3}' | sed -E 's/([^/]*)$/\1/\1_genomic.fna.gz/'
sed: -e expression #1, char 15: unknown option to `s'
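
Both sed errors in this thread follow from the commands as they were typed. In basic regular expressions, which sed uses by default, an unescaped "(" is a literal character, so no group exists for \1 to refer to ("invalid reference \1"). With "/" as the delimiter, the unescaped "/" in the replacement "\1/\1_genomic.fna.gz" ends the command after the first \1, and whatever follows is read as a flag ("unknown option to `s'"), with or without -E. The sketch below shows two forms that avoid both problems by keeping "#" as the delimiter. The function names are hypothetical, and this illustrates general sed behaviour rather than the exact line used in centrifuge-download.

```shell
# Extended regex (-E): plain parentheses form the capture group.
append_genomic_ere() {
  sed -E 's#([^/]*)$#\1/\1_genomic.fna.gz#'
}

# Basic regex (sed default): the parentheses must be backslash-escaped.
append_genomic_bre() {
  sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#'
}

# One line of the cut | awk output quoted earlier in the thread.
line=$(printf '3070917\thttps://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/029/886/195/GCF_029886195.1_ASM2988619v1')

printf '%s\n' "$line" | append_genomic_ere
printf '%s\n' "$line" | append_genomic_bre
```

Both functions read lines on standard input, so either can be placed at the end of the cut | awk pipeline in place of the failing sed call.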
