centrifuge-download error extra operand '.gz' #201

oatesa · 2020-10-21T11:01:51Z

We recently decided to update our index, so we started from scratch (deleting the old, dated index, etc.).

We ran centrifuge-download -o library -m -d "archaea,bacteria,viral,fungi" refseq >> seqid2taxid.map. Archaea was successful, but we received errors with bacteria:

4247/19206basename: extra operand '.gz'
Try 'basename --help' for more information.

Error downloading na/562 na_genomic.fna.gz!
basename: extra operand '.gz'
Try 'basename --help' for more information.

Overall this affected 5 genomes (the download stopped 5 short of the total), and the job did not progress to the viral or fungi downloads. I have run those as a separate job (currently running), but wondered what this error could relate to and how to correct it.

Thanks in advance
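
Editor's note: one reading that is consistent with the error text above (not confirmed by the maintainers) is that some summary rows carry the placeholder "na" instead of a download path. The sketch below feeds a hypothetical row through the path-building sed expression quoted from centrifuge-download later in this thread and shows that the result has three tab-separated fields instead of two. The row contents are an assumption, and how the script itself invokes basename is not shown in the thread; the final call only illustrates that basename rejects a third operand.

```shell
#!/usr/bin/env bash
# Hypothetical summary row: taxid 562 with the placeholder "na" in the path column.
row=$(printf '562\tna')

# Path-building sed expression, copied from the centrifuge-download excerpt in this thread.
built=$(printf '%s\n' "$row" | sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#')

# A well-formed line has 2 tab-separated fields (taxid, path); count what we got.
n_fields=$(printf '%s\n' "$built" | awk -F'\t' '{ print NF }')
printf 'line after sed has %s fields\n' "$n_fields"

# basename takes a name and at most one suffix, so a third operand is rejected.
basename_status=0
basename "na/562" "na_genomic.fna.gz" ".gz" >/dev/null 2>&1 || basename_status=$?
printf 'basename exit status: %s\n' "$basename_status"
```

Because the row contains no "/", the sed pattern captures the whole line (tab included) and duplicates it, which matches the shape of "na/562 na_genomic.fna.gz" in the error message.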

mourisl · 2020-10-22T18:36:26Z

It feels like the file assembly_summary.txt or assembly_summary_filtered.txt is malformed (missing some columns, or some tabs have become spaces). Does the same issue happen in your separate job?
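
Editor's note: a quick way to test this hypothesis is to count tab-separated columns per row. The following is a minimal sketch, assuming header lines in the summary file start with "#"; the function names column_histogram and odd_rows are made up for illustration, and the expected column count is passed in rather than assumed, since the thread does not state it.

```shell
#!/usr/bin/env bash
# Print "<column count> <number of data rows>" pairs, smallest count first.
# Usage: column_histogram path/to/assembly_summary.txt
column_histogram() {
  awk -F'\t' '!/^#/ { count[NF]++ } END { for (n in count) print n, count[n] }' "$1" | sort -n
}

# List data rows whose column count differs from the expected one.
# Usage: odd_rows FILE EXPECTED_COLUMNS
odd_rows() {
  awk -F'\t' -v want="$2" '!/^#/ && NF != want { print NR ": " NF " columns" }' "$1"
}
```

If column_histogram prints more than one line, the file mixes row widths, and odd_rows with the majority width points at the line numbers to inspect.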

oatesa · 2020-10-26T08:36:59Z

Same issue if run separately, but it's only occurring with bacteria and only with 5 genomes; it works fine with virus, fungi and archaea.

oatesa · 2020-11-12T12:00:13Z

Any updates on this? Colleagues are having the same issue when trying to download bacterial genomes.

mourisl · 2020-11-13T18:25:47Z

I could not reproduce this error on our server. What is the bash version on your system?

stephaniepillay · 2020-12-03T14:31:47Z

@mourisl Hi, I have the exact same issue. It works for archaea but not for bacteria. The bash version I am using is 4.2.46. @oatesa, did you manage to solve this issue?

oatesa · 2020-12-03T15:29:19Z

@stephaniepillay @mourisl No, we didn't solve the issue. The workaround was to change the order of the download so that bacteria came last in the list; the job would then run, and we accepted that those few sequences would not download. For me it was 5 sequences, which didn't seem like much of an issue in the grander scheme of the bacterial sequences, but others had around 50 that failed. Those people repeated the download step for bacteria several times, and the number of failures went down.

oatesa · 2020-12-03T15:36:58Z

> @mourisl Hi, I have the exact same issue. It works for archaea but not for bacteria. The bash version I am using is 4.2.46. @oatesa, did you manage to solve this issue?

@mourisl bash, version 4.2.46

oatesa · 2020-12-03T15:37:26Z

> I could not reproduce this error on our server. What is the bash version on your system?

@mourisl bash, version 4.2.46

afkoeppeleri · 2021-10-22T18:46:23Z

I'm getting this exact same issue with make p+h+v. A handful of the bacterial downloads fail with:

"Error downloading na/654 na_genomic.fna.gz!"
"extra operand ‘.gz’ Try 'basename --help' for more information."

This then crashes the rest of the build.

Bash version: 4.2.46(2)-release
Linux version: 4.14.248-189.473.amzn2.x86_64

Did anyone ever find a solution? If not, is there a recommended workaround?

xiaoyunguo · 2021-12-02T23:20:59Z

I have the same error and am looking for a solution.

gbikpi · 2022-01-12T12:38:14Z

Hi everyone,

In case this is still an issue for some of you, the problem seems to be similar to #221 which has been solved by @mourisl in commit a5c09bb29a3a828d88be49c55353cd84b6b9bbad but only for the viral database. So I solved this issue by downloading the updated centrifuge-download and changing if [[ "$DOMAIN" == "viral" ]]; then into if [[ "$DOMAIN" == "viral" || "$DOMAIN" == "bacteria" ]]; then.

@mourisl It seems that the patch actually works for all domains since it handles both cases (field 20 or 21) so the "if" condition seems unnecessary to me. By the way, the line echo "Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)" >&2 should be placed outside (before) the "if" statement.

That's all, I hope this will be helpful.
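
Editor's note: the one-line change described above can be applied with sed instead of by hand. This is a minimal sketch, assuming the condition appears in the script exactly as quoted in this comment; the function name patch_domain_condition is made up for illustration. It reads the script on stdin and writes the patched text to stdout, so the original file is left untouched.

```shell
#!/usr/bin/env bash
# Rewrite the viral-only condition so that the bacteria domain takes the same branch.
# Lines that do not contain the exact condition pass through unchanged.
patch_domain_condition() {
  sed 's/if \[\[ "\$DOMAIN" == "viral" ]]; then/if [[ "$DOMAIN" == "viral" || "$DOMAIN" == "bacteria" ]]; then/'
}

# Example (paths are placeholders):
#   patch_domain_condition < centrifuge-download > centrifuge-download.patched
```

Comparing the two files with diff before swapping them in confirms that exactly one line changed.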

mourisl · 2022-01-12T15:02:13Z

@gbikpi Thanks for testing! I will update the script and merge it to the master.

mourisl · 2022-01-14T19:05:53Z

The patch is merged to the master branch. Now all the domains will use the (maybe) more robust parsing strategy.

poursalavati · 2022-01-15T17:31:29Z

> The patch is merged to the master branch. Now all the domains will use the (maybe) more robust parsing strategy.

Thanks for updating,
but unfortunately there is still something wrong with centrifuge-download.
I tried building it from master again, but I got this for viral (bacteria works fine):

basename: extra operand ‘_genomic.fna.gz’
Try 'basename --help' for more information.
cat: ./viral/: Is a directory

domenico-simone · 2022-02-09T09:16:28Z

Hello,

I can confirm there's still the same error for viral genomes.

oatesa · 2022-02-09T09:22:47Z

We recently went through downloading and building an index again for a new student, and a few of the bacterial genomes failed (20 didn't download). This time we also hit the issue everyone else was having with the viral genomes: that download failed completely.

CuypersBart · 2022-03-18T11:43:23Z

I am having exactly the same issue as @oatesa describes. Is there a workaround possible?

CuypersBart · 2022-03-18T13:16:42Z

Note: no error message is displayed when the last 20 bacterial genomes fail to download.
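
Editor's note: since failures can be silent, comparing the number of rows in the filtered summary with the number of files on disk gives a rough check. The sketch below rests on assumptions not confirmed in the thread: that each data row of assembly_summary_filtered.txt corresponds to one expected genome, that header lines start with "#", and that each genome ends up as one file in the domain directory. The function name compare_counts and the default glob '*.fna*' are made up for illustration; adjust the glob to whatever file names appear in your library.

```shell
#!/usr/bin/env bash
# Compare expected genomes (data rows in the summary) with files found on disk.
# Usage: compare_counts SUMMARY_FILE DOMAIN_DIR [GLOB]
# Prints one summary line; returns 0 when the counts match, 1 otherwise.
compare_counts() {
  local summary=$1 dir=$2 pattern=${3:-*.fna*}
  local expected found
  expected=$(awk '!/^#/ && NF { n++ } END { print n + 0 }' "$summary")
  found=$(find "$dir" -maxdepth 1 -type f -name "$pattern" | wc -l)
  found=$((found))   # normalise any padding that wc adds
  echo "expected=$expected found=$found missing=$((expected - found))"
  [ "$expected" -eq "$found" ]
}

# Example (paths are placeholders):
#   compare_counts library/bacteria/assembly_summary_filtered.txt library/bacteria
```

A non-zero "missing" value only says that something is absent, not which genome; it is a prompt to rerun the download or inspect the summary rows.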

omrctnr · 2022-07-21T18:16:09Z

> Hi everyone,
>
> In case this is still an issue for some of you, the problem seems to be similar to #221 which has been solved by @mourisl in commit a5c09bb29a3a828d88be49c55353cd84b6b9bbad but only for the viral database. So I solved this issue by downloading the updated centrifuge-download and changing if [[ "$DOMAIN" == "viral" ]]; then into if [[ "$DOMAIN" == "viral" || "$DOMAIN" == "bacteria" ]]; then.
>
> @mourisl It seems that the patch actually works for all domains since it handles both cases (field 20 or 21) so the "if" condition seems unnecessary to me. By the way, the line echo "Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)" >&2 should be placed outside (before) the "if" statement.
>
> That's all, I hope this will be helpful.

Hello,

I also encountered the same error, especially while downloading the viral genomes. As mentioned above, I replaced centrifuge-download with the version at https://raw.githubusercontent.com/DaehwanKimLab/centrifuge/viral_download/centrifuge-download. Now it runs correctly.

virocamp · 2022-07-25T13:21:46Z

Hello,
I am also running into the same problem as others (Error downloading....basename: extra operand '_genomic.fna.gz') with the virus genomes, with any of these commands:
make v
make p
make p_compressed+h+v
centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map
centrifuge-download -o library -m -d "viral" refseq > seqid2taxid.map
etc

I tried the fix listed by others using the updated centrifuge-download linked above by @mourisl which apparently recently worked for @omrctnr, and also changing the line in question. In my summary files, the domain is in field 20.
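
Editor's note: to see which field holds the download path in a given summary file, and under which URL scheme, the columns can be scanned directly. This is a minimal sketch, assuming header lines start with "#" and that paths begin with ftp://, http:// or https://; the function name url_columns is made up for illustration.

```shell
#!/usr/bin/env bash
# Report which tab-separated fields contain a URL, and in how many data rows.
# Usage: url_columns path/to/assembly_summary_filtered.txt
url_columns() {
  awk -F'\t' '!/^#/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^(ftp|https?):\/\//) hits[i]++
  } END { for (i in hits) print "field " i ": " hits[i] " rows" }' "$1" | sort -t' ' -k2,2n
}
```

If more than one field is reported, the file mixes layouts; printing a few matching values with cut -f on the reported field shows whether they start with ftp: or https:.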

curl v 7.82.0
bash v 4.4.19

Hope this is solvable.

josemunozc · 2022-11-28T15:47:03Z

I'm having the same problem when running:

cd indices
make p_compressed+h+v
...
mkdir -p reference-sequences
[[ -d tmp_p_compressed+h+v ]] && rm -rf tmp_p_compressed+h+v; mkdir -p tmp_p_compressed+h+v
Downloading and dust-masking viral
centrifuge-download -o tmp_p_compressed+h+v  -m -a "Any" -d "viral" -P 1 refseq > \
	tmp_p_compressed+h+v/all-viral-any_level.map
Downloading ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt ...
basename: extra operand ‘_genomic.fna.gz’
Try 'basename --help' for more information.
gzip: tmp_p_compressed+h+v/viral/.gz: unknown suffix -- ignored
....

I'm using centrifuge/1.0.4. Looking at the script centrifuge-download, I can see this section:

    if [[ "$DOMAIN" == "viral" ]]; then
      ## Wrong columns in viral assembly summary files - the path is sometimes in field 20, sometimes 21
      cut -f "$TAXID_FIELD,$FTP_PATH_FIELD,$FTP_PATH_FIELD2" "$ASSEMBLY_SUMMARY_FILE" | \
       sed 's/^\(.*\)\t\(ftp:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(ftp:.*\)/\1\t\2/' | \
      sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#' |\
         tr '\n' '\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail "$@"' _ | count $N_EXPECTED
    else
      echo "Downloading $N_EXPECTED $DOMAIN genomes at assembly level $ASSEMBLY_LEVEL ... (will take a while)" >&2
      cut -f "$TAXID_FIELD,$FTP_PATH_FIELD" "$ASSEMBLY_SUMMARY_FILE" | sed 's#\([^/]*\)$#\1/\1_genomic.fna.gz#' |\
         tr '\n' '\0' | xargs -0 -n1 -P $N_PROC bash -c 'download_n_process_nofail "$@"' _ | count $N_EXPECTED
    fi
    echo >&2

I think the problem is in this sed command:

sed 's/^\(.*\)\t\(ftp:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(ftp:.*\)/\1\t\2/'

It looks for a string starting with ftp: in the output of cut -f "$TAXID_FIELD,$FTP_PATH_FIELD,$FTP_PATH_FIELD2" "$ASSEMBLY_SUMMARY_FILE", which for me looks something like:

$ head -n 1 tmp_p_compressed+h+v/viral/assembly_summary_filtered.txt | cut -f 6,20,21
10243	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174

So I changed the command to search for https: instead, sed 's/^\(.*\)\t\(https:.*\)\t.*/\1\t\2/;s/^\(.*\)\t.*\t\(https:.*\)/\1\t\2/', and it seems to work. But I'm not sure if this would break anything else. Is there a way to sanity-check that the files were downloaded correctly?
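
Editor's note: because the summary files have used both ftp: and https: paths, a selection step that accepts either scheme avoids having to edit the pattern again. The sketch below is an awk alternative to the sed expression, not the project's fix: it prints the taxid plus whichever of two candidate fields holds a URL, and counts rows where neither does. The function name taxid_and_path is made up for illustration, and the field numbers are passed in as arguments (6, 20 and 21 in the cut command above).

```shell
#!/usr/bin/env bash
# Print "taxid<TAB>path" for each data row, taking the path from whichever of
# two candidate fields starts with ftp:, http: or https:.
# Usage: taxid_and_path FILE TAXID_FIELD PATH_FIELD_A PATH_FIELD_B
taxid_and_path() {
  awk -F'\t' -v OFS='\t' -v t="$2" -v a="$3" -v b="$4" '
    /^#/ { next }                                    # skip header lines
    $a ~ /^(ftp|https?):/ { print $t, $a; next }     # path in the first candidate
    $b ~ /^(ftp|https?):/ { print $t, $b; next }     # path in the second candidate
    { skipped++ }                                    # e.g. rows with "na"
    END { if (skipped) print skipped " rows without a usable path" > "/dev/stderr" }
  ' "$1"
}

# Example (path is a placeholder):
#   taxid_and_path viral/assembly_summary_filtered.txt 6 20 21
```

Rows without a usable path are dropped from the output and reported on stderr, so they never reach the step that appends _genomic.fna.gz.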

oatesa closed this as completed Dec 3, 2020

oatesa changed the title ~~centrifuge-download error extra operand '.gz'~~ centrifuge-download error extra operand '.gz' Dec 3, 2020

oatesa reopened this Dec 3, 2020

fanninpm mentioned this issue Aug 3, 2022

Database download for Centrifuge #242

Open

DAWNkKim mentioned this issue Jun 27, 2023

how to make seqid2taxid.map #259

Open
