Can't download viral library #221
Comments
This appears to be due to line 368 of the centrifuge-download script: "cut -f "$TAXID_FIELD,$FTP_PATH_FIELD,$FTP_PATH_FIELD2" "$ASSEMBLY_SUMMARY_FILE" | ". I manually changed it to: "cut -f "$TAXID_FIELD,$FTP_PATH_FIELD" "$ASSEMBLY_SUMMARY_FILE" | ", and that solved the issue for me. Further up in that script, on line 305, the FTP_PATH_FIELD2 variable is defined with the following comment: "## Needed for wrongly formatted virus files - hopefully just a temporary fix." So I guess the viral "assembly_summary.txt" has been reformatted and this previously needed fix is no longer necessary. |
I have added a patch to handle this formatting issue. Thank you for identifying this, @mperisin-lallemand! |
Is it possible that the same problem has come back? |
I can't download viral sequences. Is it because the version I'm running is outdated? The program ends at 100% progress but the library folder is empty. The command I ran: centrifuge-download -o library -d viral refseq > seqid2taxid.map |
Could you please share the assembly_summary.txt file under library/viral in your download folder? |
The file is almost 7 MB. Its header lines are:
# See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
#assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name asm_submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date assembly_type group genome_size genome_size_ungapped gc_percent replicon_count scaffold_count contig_count annotation_provider annotation_name annotation_date total_gene_count protein_coding_gene_count non_coding_gene_count pubmed_id |
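The column numbers used elsewhere in this thread (`cut -f6,20,21`) can be checked against that header. What follows is only a minimal sketch: it assumes the real header line is tab-separated, and the `header` variable is rebuilt here from the first 21 column names above purely for illustration.

```shell
# First 21 column names from the header above, joined with tabs
# (printf repeats the format for each argument).
header=$(printf '%s\t' '#assembly_accession' bioproject biosample wgs_master \
  refseq_category taxid species_taxid organism_name infraspecific_name isolate \
  version_status assembly_level release_type genome_rep seq_rel_date asm_name \
  asm_submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq)

# Put one column name per line, then report the 1-based positions of the
# two columns of interest. -x matches whole lines only, so species_taxid
# is not mistaken for taxid.
columns=$(printf '%s\n' "$header" | tr '\t' '\n' | grep -n -x -e 'taxid' -e 'ftp_path')
echo "$columns"
```

If the positions reported for a real file differ from 6 and 20, the fields passed to `cut` would presumably need adjusting.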
How about the assembly_summary_filtered file? There are several issues in the downloaded file, for example MajorFull should be "Major\tFull", but I'm not sure whether this is a copy/paste error. How about using the option "-g wget" to download the file with a different method? |
Assembly_summary_filtered looks like this: |
Using wget produces the same result. Also, seqid2taxid.map is empty. |
Can you share the whole assembly_summary_filtered.txt file? There might be some formatting issue when copy/paste here. |
here are both files. |
Your file looks correct to me. Could you please run a command like "cut -f6,20,21 assembly_summary_filtered.txt | awk -F "\t" '{if ($2~/ftp/) print $1"\t"$2; if ($3~/ftp/) print $1"\t"$3}' | sed 's#([^/]*)$#\1/\1_genomic.fna.gz#' " to see whether there is strange output like na_genomic.fna.gz on your system? |
Sorry, I can't figure out the sed call. It returns an error. |
"genomic_fna" does not appear in the output of the cut | awk command |
It might be a bash version issue. Would a command like "cut -f6,20,21 assembly_summary_filtered.txt | awk -F "\t" '{if ($2~/ftp/) print $1"\t"$2; if ($3~/ftp/) print $1"\t"$3}' | sed 's/([^/]*)$/\1/\1_genomic.fna.gz/' " work? You need the sed command to duplicate the folder name of the FTP site and append the _genomic.fna.gz extension. |
No luck. |
Appending "_genomic.fna.gz" to the ftp address and looking for the file returns a 404 error: |
Could you please add the "-e" or "-E" option to the sed command? The FTP link to a file is: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/857/045/GCF_000857045.1_ViralProj15142/GCF_000857045.1_ViralProj15142_genomic.fna.gz , note that the folder name is repeated. |
Same result: cut -f6,20,21 assembly_summary_filtered.txt | awk -F "\t" '{if ($2~/ftp/) print $1"\t"$2; if ($3~/ftp/) print $1"\t"$3}' | sed -E 's/([^/]*)$/\1/\1_genomic.fna.gz/' |
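For anyone retracing these steps, two details of the sed expressions quoted above would explain the errors. Without `-E`, unescaped parentheses are literal characters in a basic regular expression, so `\1` refers to a group that does not exist (the backslashes may simply have been lost when the command was posted). With `/` as the delimiter, the unescaped `/` in the replacement ends the `s` command early. The sketch below combines `-E` with the `#` delimiter. The input row is made up: only field 6 (a placeholder taxid, 12345) and field 20 (the FTP directory quoted in this thread) are meaningful.

```shell
# One made-up row of tab-separated fields; field 6 is a placeholder taxid,
# field 20 is the FTP directory quoted in the thread, field 21 is "na".
row=$(printf '%s\t' f1 f2 f3 f4 f5 12345 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16 f17 f18 f19 \
  'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/857/045/GCF_000857045.1_ViralProj15142' na)

# cut keeps taxid plus the two candidate path columns; awk keeps whichever
# column contains "ftp"; sed duplicates the last path component and appends
# the _genomic.fna.gz suffix.
# -E turns ( ) into a capture group; '#' as delimiter lets '/' appear unescaped.
result=$(printf '%s\n' "$row" \
  | cut -f6,20,21 \
  | awk -F '\t' '{ if ($2 ~ /ftp/) print $1 "\t" $2; if ($3 ~ /ftp/) print $1 "\t" $3 }' \
  | sed -E 's#([^/]*)$#\1/\1_genomic.fna.gz#')
echo "$result"
```

On this sample row the pipeline prints the taxid, a tab, and the same file URL given earlier in the thread, with the folder name repeated before `_genomic.fna.gz`.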
Hi,
when using centrifuge-download, I couldn't download the viral library, while archaea and bacteria were OK. Any suggestions?
I used the version 1.0.4, and I installed it using conda.
Thanks in advance.
Alex