centrifuge-download error extra operand '.gz' #201

Open
oatesa opened this issue Oct 21, 2020 · 21 comments

Comments

@oatesa

oatesa commented Oct 21, 2020

We recently decided to update our index, so we started from scratch (deleting the old, outdated index, etc.).

We ran centrifuge-download -o library -m -d "archaea,bacteria,viral,fungi" refseq >> seqid2taxid.map. Archaea downloaded successfully, but we received errors with bacteria:

4247/19206basename: extra operand '.gz'
Try 'basename --help' for more information.

Error downloading na/562 na_genomic.fna.gz!
basename: extra operand '.gz'
Try 'basename --help' for more information.

Overall this affected 5 genomes (the download stopped 5 short of the total), and the run did not progress to the viral or fungi downloads. I have run these as a separate job (currently running), but wondered what this error could relate to and how to correct it.

Thanks in advance
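One plausible reading of this message, not confirmed against the script itself, is that a path containing a space (as in "na/562 na_genomic.fna.gz" above) reaches basename without quotes, so basename receives three operands instead of two. A minimal sketch of that behaviour, where strip_gz is a hypothetical helper and not part of centrifuge-download:

```shell
# Assumption: the failing path looks like the one in the error message,
# i.e. it contains a space.
path="na/562 na_genomic.fna.gz"

# Unquoted: the shell splits $path at the space, so basename is called with
# three operands (na/562, na_genomic.fna.gz, .gz) and rejects the extra one.
basename $path .gz || echo "unquoted call failed" >&2

# Quoted: basename receives exactly two operands (name and suffix).
strip_gz() {
  basename "$1" .gz
}

strip_gz "$path"
```

If this is the cause, the underlying problem is the malformed path (an "na" field where a URL was expected), and quoting would only change which step fails.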

@mourisl
Collaborator

mourisl commented Oct 22, 2020

It feels like the file assembly_summary.txt or assembly_summary_filtered.txt is wrong(missing some columns, or some tabs become spaces). Does the same issue happen to your separate job?
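A quick way to test this hypothesis is to count the tab-separated fields in each row of the summary file: rows with missing columns, or with tabs turned into spaces, show up as an unexpected field count. The sketch below uses a hypothetical helper name, and the example path assumes the layout produced by -o library.

```shell
# Hypothetical helper: print "<field count> <number of rows>" for the data
# rows of an assembly summary file, so ragged rows stand out.
summary_field_counts() {
  local file="$1"
  # Skip comment/header lines starting with '#', tally rows per field count,
  # then sort the tallies numerically by field count.
  awk -F '\t' '!/^#/ { n[NF]++ } END { for (k in n) print k, n[k] }' "$file" | sort -n
}

# Example usage (path is an assumption):
# summary_field_counts library/bacteria/assembly_summary_filtered.txt
```

A healthy file should report a single field count; any additional line in the output points at rows worth inspecting.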

@oatesa
Author

oatesa commented Oct 26, 2020

Same issue if run separately, but it's only occurring with the bacteria, and only with 5 genomes. It's working fine with viral, fungi and archaea.

@oatesa
Author

oatesa commented Nov 12, 2020

Any updates on this? Colleagues are having the same issue when trying to download bacterial genomes.

@mourisl
Collaborator

mourisl commented Nov 13, 2020

I could not reproduce this error on our server. What is the bash version on your system?

@stephaniepillay

@mourisl Hi, I have the exact same issue. It works for archaea but not for bacteria. The bash version I am using is 4.2.46. @oatesa did you manage to solve this issue?

@oatesa
Author

oatesa commented Dec 3, 2020

@stephaniepillay @mourisl No, we didn't solve the issue. The workaround was to change the order of the download so that bacteria is last in the list; the job then runs, but we accept that those few sequences won't download. For me it was 5 sequences, which didn't seem too much of an issue in the grander scheme of the bacterial sequences, but others had around 50 that failed. They repeated the download step for bacteria several times and this number went down.

@oatesa
Author

oatesa commented Dec 3, 2020

> @mourisl Hi, i have the exact same issue. it works for archaea but not for bacteria. the bash version i am using is version 4.2.46. @oatesa did you manage to solve this issue?

@mourisl bash, version 4.2.46

@oatesa oatesa closed this as completed Dec 3, 2020
@oatesa
Author

oatesa commented Dec 3, 2020

> I could not reproduce this error on our server. What is the bash version on your system?

@mourisl bash, version 4.2.46

@oatesa oatesa changed the title centrifuge-download error extra operand '.gz' centrifuge-download error extra operand '.gz' Dec 3, 2020
@oatesa oatesa reopened this Dec 3, 2020
@afkoeppeleri

afkoeppeleri commented Oct 22, 2021

I'm getting this exact same issue with make p+h+v. A handful of the bacterial downloads fail with:

"Error downloading na/654 na_genomic.fna.gz!"
"extra operand ‘.gz’ Try 'basename --help' for more information."

This then crashes the rest of the build.

Bash version: 4.2.46(2)-release
Linux version: 4.14.248-189.473.amzn2.x86_64

Did anyone ever find a solution? If not, is there a recommended workaround?

@xiaoyunguo

I have the same error and am looking for a solution.

@gbikpi

gbikpi commented Jan 12, 2022

Hi everyone,

In case this is still an issue for some of you, the problem seems to be similar to #221 which has been solved by @mourisl in commit a5c09bb29a3a828d88be49c55353cd84b6b9bbad but only for the viral database. So I solved this issue by downloading the updated centrifuge-download and changing if [[ "$DOMAIN" == "viral" ]]; then into if [[ "$DOMAIN" == "viral" || "$DOMAIN" == "bacteria" ]]; then.

@mourisl It seems that the patch actually works for all domains since it handles both cases (field 20 or 21) so the "if" condition seems unnecessary to me. By the way, the line echo "Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)" >&2 should be placed outside (before) the "if" statement.

That's all, I hope this will be helpful.

@mourisl
Collaborator

mourisl commented Jan 12, 2022

@gbikpi Thanks for testing! I will update the script and merge it to the master.

@mourisl
Collaborator

mourisl commented Jan 14, 2022

The patch is merged to the master branch. Now all the domains will use the (maybe) more robust parsing strategy.

@poursalavati

> The patch is merged to the master branch. Now all the domains will use the (maybe) more robust parsing strategy.

Thanks for updating,
but unfortunately there is still something wrong with centrifuge-download.
I tried building it from master again, but I got this for viral (bacteria works fine):

basename: extra operand ‘_genomic.fna.gz’
Try 'basename --help' for more information.
cat: ./viral/: Is a directory

@domenico-simone

Hello,

I can confirm there's still the same error for viral genomes.

@oatesa
Author

oatesa commented Feb 9, 2022

We recently went through downloading/building an index again for a new student, and a few of the bacterial genomes failed (20 didn't download). This time we also had the issue everyone else was having with the viral genomes, with that download failing completely.

@CuypersBart

I am having exactly the same issue as @oatesa describes. Is there a workaround possible?

@CuypersBart

Note: no error message is displayed when the last 20 bacterial genomes fail to download.
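Since missing genomes are not reported, one way to spot them is to compare the number of rows in the filtered assembly summary with the number of sequence files in the same directory. This is only a sketch: the helper name is hypothetical, and it assumes one output file per genome with ".fna" in its name, stored under <output dir>/<domain> next to assembly_summary_filtered.txt.

```shell
# Hypothetical helper: compare genomes listed in the filtered summary with
# the sequence files actually present in the same directory.
count_expected_vs_found() {
  local dir="$1" expected found
  # Rows that are not '#' comment lines are treated as expected genomes.
  expected=$(grep -vc '^#' "$dir/assembly_summary_filtered.txt")
  # Assumption: every downloaded genome leaves one file matching *.fna*.
  found=$(find "$dir" -type f -name '*.fna*' | wc -l | tr -d ' ')
  echo "expected=$expected found=$found"
}

# Example usage (path is an assumption):
# count_expected_vs_found library/bacteria
```

A difference between the two numbers only says that something is missing; finding which genomes failed would still require matching file names against the summary rows.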

@omrctnr

omrctnr commented Jul 21, 2022

> Hi everyone,
>
> In case this is still an issue for some of you, the problem seems to be similar to #221 which has been solved by @mourisl in commit a5c09bb29a3a828d88be49c55353cd84b6b9bbad but only for the viral database. So I solved this issue by downloading the updated centrifuge-download and changing if [[ "$DOMAIN" == "viral" ]]; then into if [[ "$DOMAIN" == "viral" || "$DOMAIN" == "bacteria" ]]; then.
>
> @mourisl It seems that the patch actually works for all domains since it handles both cases (field 20 or 21) so the "if" condition seems unnecessary to me. By the way, the line echo "Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)" >&2 should be placed outside (before) the "if" statement.
>
> That's all, I hope this will be helpful.

Hello,

I also encountered the same error, especially while downloading the viral genomes. As mentioned above, I replaced centrifuge-download with the version at https://raw.githubusercontent.com/DaehwanKimLab/centrifuge/viral_download/centrifuge-download. Now it runs correctly.

@virocamp

Hello,
I am also running into the same problem as others (Error downloading....basename: extra operand '_genomic.fna.gz') with the virus genomes, with any of these commands:
make v
make p
make p_compressed+h+v
centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map
centrifuge-download -o library -m -d "viral" refseq > seqid2taxid.map
etc

I tried the fix listed by others using the updated centrifuge-download linked above by @mourisl which apparently recently worked for @omrctnr, and also changing the line in question. In my summary files, the domain is in field 20.

curl v 7.82.0
bash v 4.4.19

Hope this is solvable.

@josemunozc

I'm having the same problem when running:

cd indices
make p_compressed+h+v
...
mkdir -p reference-sequences
[[ -d tmp_p_compressed+h+v ]] && rm -rf tmp_p_compressed+h+v; mkdir -p tmp_p_compressed+h+v
Downloading and dust-masking viral
centrifuge-download -o tmp_p_compressed+h+v  -m -a "Any" -d "viral" -P 1 refseq > \
	tmp_p_compressed+h+v/all-viral-any_level.map
Downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt ...
basename: extra operand ‘_genomic.fna.gz’
Try 'basename --help' for more information.
gzip: tmp_p_compressed+h+v/viral/.gz: unknown suffix -- ignored
....

I'm using centrifuge/1.0.4. Looking at the script centrifuge-download I can see this section:

    if [[ "$DOMAIN" == "viral" ]]; then
      ## Wrong columns in viral assembly summary files - the path is sometimes in field 20, sometimes 21
      cut -f "$TAXID_FIELD,$FTP_PATH_FIELD,$FTP_PATH_FIELD2" "$ASSEMBLY_SUMMARY_FILE" | \
       sed 's/^\(.*\)\t\(ftp:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(ftp:.*\)/\1\t\2/' | \
      sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#' |\
         tr '\n' '\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail "$@"' _ | count $N_EXPECTED
    else
      echo "Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)" >&2
      cut -f "$TAXID_FIELD,$FTP_PATH_FIELD" "$ASSEMBLY_SUMMARY_FILE" | sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#' |\
         tr '\n' '\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail "$@"' _ | count $N_EXPECTED
    fi
    echo >&2

I think the problem is in this sed command:

sed 's/^\(.*\)\t\(ftp:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(ftp:.*\)/\1\t\2/'

It is looking for a string with ftp: in the output of cut -f "$TAXID_FIELD,$FTP_PATH_FIELD,$FTP_PATH_FIELD2" "$ASSEMBLY_SUMMARY_FILE", which for me looks something like:

$ head -n 1 tmp_p_compressed+h+v/viral/assembly_summary_filtered.txt | cut -f 6,20,21
10243	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174

So I changed the command to search for https: instead, sed 's/^\(.*\)\t\(https:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(https:.*\)/\1\t\2/' and it seems to work. But I'm not sure if this would break anything else. Is there a way to sanity check the files were downloaded correctly?
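Hard-coding https: would presumably stop matching any rows that still carry ftp: URLs, so a variant that accepts either scheme may be safer. The following is a sketch of that idea and not the project's fix: keep_url_field is a hypothetical helper that reads the three-column output of the cut command above and keeps whichever of the last two fields starts with ftp: or https:.

```shell
# Hypothetical helper: reads "taxid<TAB>field20<TAB>field21" lines on stdin
# and prints "taxid<TAB>url", keeping the field that holds the URL.
keep_url_field() {
  local t
  t=$(printf '\t')   # literal tab, so the pattern does not rely on sed's \t
  sed -E \
    -e "s/^([^${t}]*)${t}((ftp|https):[^${t}]*)${t}.*\$/\\1${t}\\2/" \
    -e "s/^([^${t}]*)${t}[^${t}]*${t}((ftp|https):[^${t}]*)\$/\\1${t}\\2/"
}

# Example usage, with the field numbers from the cut command shown above:
# cut -f 6,20,21 assembly_summary_filtered.txt | keep_url_field
```

The first expression handles a URL in the second column and drops whatever follows; the second handles a URL in the third column. Rows with no recognisable URL pass through unchanged, so they would still need to be filtered out or reported separately.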
