
Failing to build custom database for HBV #9

Open

wskang1202 opened this issue May 19, 2023 · 18 comments

@wskang1202 commented May 19, 2023

Hi Sara,

I've been trying to build custom databases by following the FastViFi README. Building the databases for HCV and EBV was successful; however, building the HBV databases for k=18 and k=22 failed. The following message appeared in the log file:

scan_fasta_file.pl: unable to determine taxonomy ID for sequence hbv_ref7
No preliminary seqid/taxid mapping files found, aborting.

Is there a way to solve this problem?

Best,
Wonseok

@sara-javadzadeh (Owner)

Hi Wonseok,

It looks like the file prelim_map.txt is missing. Does the file exist in the kraken2/<your HBV db name>/taxonomy directory? If not, one reason could be that downloading the library failed. Could you please run download_custom_kraken_library.sh for HBV again and check whether the prelim_map.txt file appears in your HBV database directory?
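
For example, a quick sketch (replace the placeholder with your actual database directory name):

DB="kraken2/<your HBV db name>"         # placeholder path from the README layout
ls -l "$DB/taxonomy/prelim_map.txt"     # the file should exist ...
wc -c "$DB/taxonomy/prelim_map.txt"     # ... and report a non-zero byte count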

Please let me know if this doesn't work for you.
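
As an aside, per the Kraken2 documentation, a custom sequence can also carry its taxonomy ID directly in the FASTA header via a kraken:taxid tag, which sidesteps the seqid/taxid lookup that is failing here. A hypothetical header for the sequence named in your log (10407 is the NCBI taxonomy ID for hepatitis B virus):

>hbv_ref7|kraken:taxid|10407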

Best,
Sara

@wskang1202 (Author)

Hi Sara,

I ran download_custom_kraken_library.sh for HBV again, and I can see that there is a prelim_map.txt file in kraken2/Kraken2StandardDB_k_18_hbv/taxonomy, but the file itself is empty.

Best,
Wonseok

@sara-javadzadeh (Owner)

Hi Wonseok,

Do you get an error when running download_custom_kraken_library.sh for the HBV dataset?
Could you please check if the prelim_map.txt is present and non-empty in the HCV and EBV databases that you created successfully before?

Best,
Sara

@wskang1202 (Author)

Hi Sara,

prelim_map.txt is present and non-empty in the successfully built databases (HCV and EBV, as well as the k_25_hbv_hg database). However, the file is empty for the unsuccessful k_18_hbv and k_22_hbv databases. I've attached the log.txt file in case you want to check it out.

Thank you,
Wonseok

@sara-javadzadeh (Owner)

Hi Wonseok,

Did you try running the build_custom_kraken_index.sh script on the k_18_hbv database after running download_custom_kraken_library.sh? If so, was there any error?

@mrzResearchArena commented Jun 5, 2023

Hi Javadzadeh,

I downloaded your suggested dataset for sample-level FastViFi for the HPV virus: https://drive.google.com/file/d/1QYn5lDWjvhtIWCrwmzDc_1fy8ANrXWz1/view?usp=sharing. However, when I attempted to extract it (tar -xzvf kraken_datasets.tar.gz), it showed errors. Could you please suggest how I can fix this?

@sara-javadzadeh (Owner)

Hi Muhammod,

Thanks for reaching out.
Could you please share the error messages when running tar -xzvf kraken_datasets.tar.gz?

@mrzResearchArena

Hi Javadzadeh,

Thank you so much for your response. I was getting the errors below. The downloaded file size is 15796400321 bytes.

gzip: stdin: invalid compressed data--crc error
tar: Child returned status 1
tar: Error is not recoverable: exiting now
ls -l kraken_datasets.tar.gz 

@sara-javadzadeh (Owner)

Hi again,

Thanks! Although the output of the ls -l command is truncated in your reply, I can see the file size in your text above. The file size seems correct.

Did you try running gunzip kraken_datasets.tar.gz and then tar -xvf kraken_datasets.tar? If that fails, could you please share the error?

By the way, the uncompressed data should be about 60 GB. Have you taken that into account (e.g., enough free disk space)?
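
A minimal sketch of the two-step extraction, plus a free-space check:

gunzip kraken_datasets.tar.gz    # decompress; replaces the .gz with a .tar
tar -xvf kraken_datasets.tar     # then unpack the tar
df -h .                          # roughly 60 GB of free space is needed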

Thanks,
Sara

@mrzResearchArena commented Jun 12, 2023

Hi Javadzadeh,

Yes, I tried that, but it didn't work either.

gzip: kraken_datasets.tar.gz: invalid compressed data--crc error

@mrzResearchArena

Hi Javadzadeh, could you please provide a different download link?

@sara-javadzadeh (Owner)

I can provide another link; it'll take a couple of hours to upload the database.
In the meantime, could you please check the following?

  1. Could you please share the output of the following command: file kraken_datasets.tar.gz
  2. Check whether tar -tf kraken_datasets.tar.gz can list the files without errors. If there is an error, could you please share it? (See the sketch after this list.)
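
For example:

file kraken_datasets.tar.gz      # a healthy download typically reports "gzip compressed data"
tar -tf kraken_datasets.tar.gz   # lists the archive contents without extracting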

Sara

@mrzResearchArena

Yes, it shows errors. You can see them in the output below.

tar -tf kraken_datasets.tar.gz > errors-text.txt

Output:

kraken_datasets/
kraken_datasets/Kraken2StandardDB_k_22_hpv/
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/readme.txt
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/merged.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/taxdump.tar.gz
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/names.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/taxdump.untarflag
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/accmap.dlflag
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/delnodes.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/citations.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/nodes.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/nucl_gb.accession2taxid
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/gc.prt
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/nucl_wgs.accession2taxid
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/division.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/gencode.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/taxdump.dlflag
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/prelim_map.txt
kraken_datasets/Kraken2StandardDB_k_22_hpv/seqid2taxid.map
kraken_datasets/Kraken2StandardDB_k_22_hpv/hash.k2d
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxo.k2d
kraken_datasets/Kraken2StandardDB_k_22_hpv/library/
kraken_datasets/Kraken2StandardDB_k_22_hpv/library/added/
kraken_datasets/Kraken2StandardDB_k_22_hpv/library/added/prelim_map.txt
kraken_datasets/Kraken2StandardDB_k_22_hpv/library/added/9TbkQmfdkG.fna.masked
kraken_datasets/Kraken2StandardDB_k_22_hpv/library/added/9TbkQmfdkG.fna
kraken_datasets/Kraken2StandardDB_k_22_hpv/library/added/prelim_map_3IwJCtpJpX.txt
kraken_datasets/Kraken2StandardDB_k_22_hpv/opts.k2d
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/taxo.k2d
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/human/
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/human/prelim_map.txt
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/human/assembly_summary.txt
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/human/library.fna.masked
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/human/library.fna
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/human/manifest.txt
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/added/
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/added/prelim_map_SeYmVYHiCd.txt
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/added/prelim_map.txt
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/added/rKtNPyn11J.fna
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/added/rKtNPyn11J.fna.masked
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/opts.k2d
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/taxonomy/
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/taxonomy/gencode.dmp
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/taxonomy/nucl_wgs.accession2taxid
tar: Skipping to next header
tar: Archive contains ‘9.1\t5748’ where numeric mode_t value expected
tar: Archive contains ‘.1\t57486\t478’ where numeric time_t value expected
7486\t47861343\nAG288467\tAG288467.1\t57486\t47861344\nAG288468\tAG288468.1\t57486\t47861345\nAG288469\tAG28846
tar: Skipping to next header
tar: Archive contains ‘0672.1\t4113\t’ where numeric off_t value expected
tar: Archive contains ‘119.1\t262687’ where numeric off_t value expected
tar: Archive contains ‘1.1\t1639’ where numeric mode_t value expected
tar: Archive contains ‘.1\t1639\t1129’ where numeric time_t value expected
tar: Archive contains ‘\t1129612’ where numeric uid_t value expected
639\t112961221\nDQ844259\tDQ844259.1\t1639\t112961224\nDQ844260\tDQ844260.1\t1639\t112961227\nDQ844261\tDQ84426
tar: Skipping to next header
tar: Archive contains ‘1.1\t6253’ where numeric mode_t value expected
tar: Archive contains ‘.1\t6253\t1132’ where numeric time_t value expected
253\t113251528\nED394649\tED394649.1\t6253\t113251529\nED394650\tED394650.1\t6253\t113251530\nED394651\tED39465
tar: Skipping to next header
tar: Archive contains ‘1609\nEZ97768’ where numeric off_t value expected
tar: Archive contains ‘\tHE793950.1\t’ where numeric off_t value expected
322560303\nJG336704\tJG336704.1\t30301\t322560304\nJG336705\tJG336705.1\t30301\t322560305\nJG336706\tJG336706.
tar: Skipping to next header
tar: Archive contains ‘1759748\t’ where numeric mode_t value expected
tar: Archive contains ‘95526170’ where numeric uid_t value expected
697\nKR112558\tKR112558.1\t1387109\t955261699\nKR112559\tKR112559.1\t1690892\t955261701\nKR112560\tKR112560.1\t
tar: Skipping to next header
tar: Archive contains ‘\tLA487646.1\t’ where numeric off_t value expected
tar: Archive contains ‘29\tMC492929.’ where numeric off_t value expected
tar: Archive contains ‘31460994\nMM1’ where numeric time_t value expected
tar: Archive contains ‘993\nMM16’ where numeric uid_t value expected
0\t1531460990\nMM160627\tMM160627.1\t0\t1531460991\nMM160628\tMM160628.1\t0\t1531460992\nMM160629\tMM160629.1\t0
tar: Skipping to next header
tar: Archive contains ‘_019029293.1’ where numeric off_t value expected
tar: Archive contains ‘\t50390\t15815’ where numeric off_t value expected
tar: Archive contains ‘OC673270’ where numeric mode_t value expected
tar: Archive contains ‘\tOC673271.1\t’ where numeric time_t value expected
tar: Archive contains ‘.1\t61476’ where numeric uid_t value expected
tar: Archive contains ‘\t1946114’ where numeric gid_t value expected
61476\t1946114713\nOC673268\tOC673268.1\t61476\t1946114714\nOC673269\tOC673269.1\t61476\t1946114715\nOC673270\t
tar: Skipping to next header
tar: Archive contains ‘\tOD59341’ where numeric mode_t value expected
tar: Archive contains ‘0.1\t6147’ where numeric uid_t value expected
\t61472\t1948381426\nOD593408\tOD593408.1\t61472\t1948381428\nOD593409\tOD593409.1\t61472\t1948381430\nOD593410
tar: Skipping to next header
tar: Archive contains ‘OD855125’ where numeric mode_t value expected
tar: Archive contains ‘\tOD855126.1\t’ where numeric time_t value expected
tar: Archive contains ‘.1\t61472’ where numeric uid_t value expected
tar: Archive contains ‘\t1947471’ where numeric gid_t value expected
61472\t1947471274\nOD855123\tOD855123.1\t61472\t1947471275\nOD855124\tOD855124.1\t61472\t1947471276\nOD855125\t
tar: Skipping to next header
tar: Archive contains ‘\tOE36610’ where numeric mode_t value expected
tar: Archive contains ‘6.1\t6147’ where numeric uid_t value expected
\t61474\t1962876452\nOE366104\tOE366104.1\t61474\t1962876453\nOE366105\tOE366105.1\t61474\t1962876454\nOE366106
tar: Skipping to next header
tar: Archive contains ‘OE507501’ where numeric mode_t value expected
tar: Archive contains ‘\tOE507502.1\t’ where numeric time_t value expected
tar: Archive contains ‘.1\t61474’ where numeric uid_t value expected
tar: Archive contains ‘\t1964446’ where numeric gid_t value expected
61474\t1964446754\nOE507499\tOE507499.1\t61474\t1964446757\nOE507500\tOE507500.1\t61474\t1964446760\nOE507501\t
tar: Skipping to next header
tar: Archive contains ‘081\nOE597102’ where numeric off_t value expected
tar: Archive contains ‘\tOE60725’ where numeric mode_t value expected
tar: Archive contains ‘9\tOE607259.1’ where numeric time_t value expected
tar: Archive contains ‘8.1\t6147’ where numeric uid_t value expected
\t61474\t1965131656\nOE607256\tOE607256.1\t61474\t1965131659\nOE607257\tOE607257.1\t61474\t1965131662\nOE607258
tar: Skipping to next header
tar: Archive contains ‘03024007.1\t6’ where numeric time_t value expected
tar: Archive contains ‘\t3026648’ where numeric uid_t value expected
003024004.1\t663202\t302664848\nXM_003024005\tXM_003024005.1\t663202\t302664850\nXM_003024006\tXM_003024006.
tar: Skipping to next header
tar: Archive contains ‘008481066.2\t’ where numeric off_t value expected

gzip: stdin: invalid compressed data--crc error

gzip: stdin: invalid compressed data--length error
tar: Child returned status 1
tar: Error is not recoverable: exiting now

@sara-javadzadeh (Owner)

Thanks for checking.
I'm uploading the databases again; it'll take another couple of hours to fully upload. I'll share the link here as soon as it's done.
In the meantime, it might be worth setting up a new Conda environment, installing tar, and trying to extract the database files in that clean environment. Let me know if you still get the errors.
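
A sketch of what I mean, assuming the tar and gzip packages are available on your channels (e.g. conda-forge):

conda create -n extract-env -c conda-forge tar gzip
conda activate extract-env
tar -xzvf kraken_datasets.tar.gz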

Sara

@sara-javadzadeh (Owner)

Hi again,

Here's a second link for the same Kraken databases: https://drive.google.com/file/d/1DrKgDE7fl5Tff2bV8K9XBxLYsbTeOcgh/view?usp=sharing

I suspect this might be a tar library incompatibility rather than a file problem. I was able to list the contents of kraken_datasets.tar.gz from the first link (provided in the README file). Here's my tar version on macOS 12.1:

tar --version
bsdtar 3.5.1 - libarchive 3.5.1 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.8

That's why I recommend updating your tar package, or creating a new Conda environment and trying again as above. Let me know how it goes.

Sara

@mrzResearchArena

Thank you, Ms. Javadzadeh. It helped me a lot.

I used a Python script instead of tar, and this time there were no errors. After extracting, the total size is 61.4 GB. Is that the correct size?

import tarfile

sourcePATH = '/mnt/sdb1/kraken2/kraken_datasets.tar.gz'
destinationPATH = '/mnt/sdb1/kraken2/'

# Extract the whole archive; the "with" block closes the file automatically,
# so no explicit tar.close() is needed.
with tarfile.open(sourcePATH) as tar:
    tar.extractall(destinationPATH)
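
To double-check the extracted size against the ~60 GB mentioned above:

du -sh /mnt/sdb1/kraken2/kraken_datasets    # assumed output directory from the script above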

@sara-javadzadeh (Owner) commented Jun 14, 2023 via email

@cubense commented Dec 5, 2023

> Hi Wonseok,
>
> Did you try running the build_custom_kraken_index.sh script on the k_18_hbv database after running download_custom_kraken_library.sh? If so, was there any error?

Hi Sara,

I'm hitting the same error as Wonseok. There were no errors when running build_custom_kraken_index.sh and download_custom_kraken_library.sh for k_18 and k_22, but prelim_map.txt is empty for both k_18 and k_22. The prelim_map.txt in k_25_hg is fine. When I run the Docker image, it reports an error that the database does not contain the necessary file taxo.k2d.
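
For reference, a fully built Kraken2 database directory should contain hash.k2d, opts.k2d, and taxo.k2d at the top level, as in the archive listing above. A quick check (database name assumed from the earlier comments):

ls kraken2/Kraken2StandardDB_k_18_hbv/*.k2d    # expect hash.k2d, opts.k2d, taxo.k2d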
