Sync with RefSeq GenBank
NCBI currently updates RefSeq/GenBank in odd-numbered months. We aim to update the GTDB every 3 releases. rsync is used to mirror the genome assemblies on NCBI's FTP site.
NCBI taxonomy information should be downloaded on the same day that we sync with NCBI.
- Create the new NCBI taxonomy metadata directory:
mkdir -p /srv/db/gtdb/metadata/<release#>/ncbi/taxonomy
- Download and extract the latest NCBI taxonomy database:
cd /srv/db/gtdb/metadata/<release#>/ncbi/taxonomy
rsync ftp.ncbi.nih.gov::pub/taxonomy/taxdump.tar.gz .
mv taxdump.tar.gz taxdump_<date>.tar.gz  # e.g. taxdump_20220718.tar.gz
mkdir taxdump_<date>
tar xvzf taxdump_<date>.tar.gz -C taxdump_<date>
- In /srv/db/gtdb/metadata/<release#>/ncbi/taxonomy, download the RefSeq and GenBank assembly summary files and concatenate them per database:
rsync ftp.ncbi.nlm.nih.gov::genomes/refseq/archaea/assembly_summary.txt assembly_summary_archaea_refseq.txt
rsync ftp.ncbi.nlm.nih.gov::genomes/refseq/bacteria/assembly_summary.txt assembly_summary_bacteria_refseq.txt
rsync ftp.ncbi.nlm.nih.gov::genomes/genbank/archaea/assembly_summary.txt assembly_summary_archaea_genbank.txt
rsync ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/assembly_summary.txt assembly_summary_bacteria_genbank.txt
cat assembly_summary_archaea_genbank.txt assembly_summary_bacteria_genbank.txt > assembly_summary_genbank.txt
cat assembly_summary_archaea_refseq.txt assembly_summary_bacteria_refseq.txt > assembly_summary_refseq.txt
- Remove all genomes associated with 'large multi-isolate project':
grep -v 'large multi-isolate project' assembly_summary_archaea_genbank.txt > assembly_summary_archaea_genbank_nolargeproject.txt
grep -v 'large multi-isolate project' assembly_summary_bacteria_genbank.txt > assembly_summary_bacteria_genbank_nolargeproject.txt
grep -v 'large multi-isolate project' assembly_summary_bacteria_refseq.txt > assembly_summary_bacteria_refseq_nolargeproject.txt
grep -v 'large multi-isolate project' assembly_summary_archaea_refseq.txt > assembly_summary_archaea_refseq_nolargeproject.txt
- Remove the first 2 lines of each *_nolargeproject.txt file:
sed -i '1,2d' *_nolargeproject.txt
- For EACH of the downloaded assembly summary files, extract the 20th column (the FTP URL of each assembly):
export VERSION=95
cut -f20 assembly_summary_bacteria_refseq_nolargeproject.txt | grep 'ftp' > bac120_refseq_r$VERSION.lst
cut -f20 assembly_summary_archaea_refseq_nolargeproject.txt | grep 'ftp' > ar53_refseq_r$VERSION.lst
cut -f20 assembly_summary_bacteria_genbank_nolargeproject.txt | grep 'ftp' > bac120_genbank_r$VERSION.lst
cut -f20 assembly_summary_archaea_genbank_nolargeproject.txt | grep 'ftp' > ar53_genbank_r$VERSION.lst
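The grep, sed and cut steps above can also be expressed as one filter. The following is only a sketch of the same logic in Python, not part of gtdb_migration_tk; the function name `extract_ftp_paths` is hypothetical. One difference from the shell version: it skips lines starting with '#' instead of deleting the first 2 lines by position.

```python
def extract_ftp_paths(summary_text, exclude="large multi-isolate project"):
    """Return the 20th tab-separated field of every kept row."""
    paths = []
    for line in summary_text.splitlines():
        # Skip header lines (stands in for `sed -i '1,2d'`).
        if line.startswith("#"):
            continue
        # Same effect as `grep -v 'large multi-isolate project'`.
        if exclude in line:
            continue
        fields = line.split("\t")
        # Column 20 is index 19; keep it only if it contains 'ftp',
        # mirroring `cut -f20 ... | grep 'ftp'`.
        if len(fields) >= 20 and "ftp" in fields[19]:
            paths.append(fields[19])
    return paths
```

Rows whose 20th column does not contain 'ftp' (for example 'na') are dropped, exactly as the `grep 'ftp'` filter does.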
NCBI taxonomy information should be placed in /srv/db/gtdb/metadata/<release#>/ncbi/taxonomy:
Using the taxdump.tar.gz, assembly_summary_refseq.txt and assembly_summary_genbank.txt downloaded previously:
- Run the ncbi_taxonomy.py script to create summary files of the NCBI taxonomy file:
mkdir standardised_taxonomy
cd standardised_taxonomy
gtdb_migration_tk parse_ncbi_taxonomy -t ../20190725/ --rb ../assembly_summary_bacteria_refseq.txt --ra ../assembly_summary_archaea_refseq.txt --gb ../assembly_summary_bacteria_genbank.txt --ga ../assembly_summary_archaea_genbank.txt -p ncbi_r202
- Remove deprecated genomes from FTP directory: **Because we run the rsync step individually for each genome present in the assembly summary file, we don't know which ones have been removed. We need to write a script comparing the genome_dirs.tsv file from the previous release to the assembly summary file / *_r$VERSION.lst to know which ones to remove before starting rsync.**
gtdb_migration_tk clean_ftp --new_list_genomes assembly_summary_archaea_genbank_nolargeproject.txt,assembly_summary_archaea_refseq_nolargeproject.txt,assembly_summary_bacteria_genbank_nolargeproject.txt,assembly_summary_bacteria_refseq_nolargeproject.txt --ftp_genome_dir_file /srv/db/gtdb/metadata/release207/ncbi/genome_dirs_ftp.tsv --report_dir report_clean_ftp/ --taxonomy_file standardised_taxonomy/ncbi_r213_standardized.tsv
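The comparison described in this step is essentially a set difference between the accessions of the previous release and those in the new lists. The sketch below illustrates the idea only; it is not the clean_ftp implementation, and `accession_from_path` and `find_removed` are hypothetical names. Reading the accessions out of genome_dirs.tsv is left out because its layout is not described here.

```python
def accession_from_path(path):
    """Return e.g. 'GCF_000026325.1' from a path ending in
    '<accession>_<assembly name>'."""
    name = path.rstrip("/").split("/")[-1]
    return "_".join(name.split("_")[:2])


def find_removed(previous_accessions, new_ftp_paths):
    """Accessions present in the previous release but absent from the new lists."""
    current = {accession_from_path(p) for p in new_ftp_paths}
    return sorted(set(previous_accessions) - current)
```

Because accessions carry their version suffix, a genome that was superseded by a new version is also reported, which is the desired behaviour when the old version's directory should be cleaned up.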
- Run the rsync command for each genome listed in the .lst files (shown here for bac120_genbank; repeat for the other lists):
cd /srv/db/ncbi/new_ftp_structure/
mkdir r213_logs
cat /srv/db/gtdb/metadata/release<release_number>/ncbi/taxonomy/bac120_genbank_r<release_number>.lst |parallel --eta -j20 /srv/db/ncbi/new_ftp_structure/rsync_data.sh '<(' echo {} ')' '&&' echo finished {} '>>' bac120_gbk.log
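The contents of rsync_data.sh are not shown on this page. As an illustration only, the sketch below shows how a URL from a .lst file could be mapped to an rsync source and to a local directory under the mirror root. The function names are hypothetical, and it assumes the local mirror reproduces the remote path, which is what the /srv/db/ncbi/new_ftp_structure/genomes/all/ path used later suggests.

```python
from urllib.parse import urlsplit


def to_rsync_url(url):
    """Swap the scheme (ftp:// or https://) for rsync://, keeping host and path."""
    parts = urlsplit(url)
    return "rsync://" + parts.netloc + parts.path


def local_dir(url, mirror_root="/srv/db/ncbi/new_ftp_structure"):
    """Local directory that mirrors the remote path under mirror_root (assumed layout)."""
    return mirror_root.rstrip("/") + urlsplit(url).path
```

For a URL ending in genomes/all/GCF/000/026/325/GCF_000026325.1_ASM2632v1, `local_dir` returns the same relative path below /srv/db/ncbi/new_ftp_structure.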
Create the new <release#> folder in /srv/db/gtdb/genomes/ncbi/:
mkdir -p /srv/db/gtdb/genomes/ncbi/<release#>/refseq/
mkdir -p /srv/db/gtdb/genomes/ncbi/<release#>/genbank/
Before copying, list all records in GenBank and RefSeq FTP folders:
gtdb_migration_tk list_genomes -g /srv/db/ncbi/new_ftp_structure/genomes/all/ -o /srv/db/gtdb/metadata/<release#>/ncbi/genome_dirs_ftp.tsv
Update RefSeq
gtdb_migration_tk update_refseq --cpus 20 --ftp_refseq_directory /srv/db/ncbi/new_ftp_structure/genomes/all/ --new_refseq_directory /srv/db/gtdb/genomes/ncbi/release213/refseq/ --ftp_genome_dirs_file /srv/db/gtdb/metadata/release213/ncbi/genome_dirs_ftp.tsv --old_genome_dirs_file /srv/db/gtdb/genomes/ncbi/release207/refseq/genome_dirs.tsv --arc_assembly_summary /srv/db/gtdb/metadata/release213/ncbi/taxonomy/assembly_summary_archaea_refseq.txt --bac_assembly_summary /srv/db/gtdb/metadata/release213/ncbi/taxonomy/assembly_summary_bacteria_refseq.txt > /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/update_refseq_from_ftp_files.log
Optional Step: Manually curate conflicting genomes.
Occasionally, NCBI assembly versioning conflicts: the same accession appears under two assembly names (e.g. GCF_000026325.1_ASM2632v1 and GCF_000026325.1_ASM2632v2). In that case the log has to be updated manually.
To track which assembly is conflicting:
grep 'to_curate' report_gcf.log
Go to the genome directory, clean it (remove the duplicate assembly report, copy the proper version, etc.) and update the status (unmodified, modified, ...) in report_gcf.log.
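To see what such a conflict looks like in a directory listing, the sketch below groups assembly directory names by accession and reports any accession that has more than one name. It is an illustration with a hypothetical function name, not how the 'to_curate' entries in report_gcf.log are actually produced.

```python
from collections import defaultdict


def find_conflicts(dir_names):
    """Map each accession to its directory names when it has more than one."""
    by_accession = defaultdict(set)
    for name in dir_names:
        # 'GCF_000026325.1_ASM2632v1' -> accession 'GCF_000026325.1'
        accession = "_".join(name.split("_", 2)[:2])
        by_accession[accession].add(name)
    return {
        acc: sorted(names)
        for acc, names in by_accession.items()
        if len(names) > 1
    }
```

Given the two names quoted above, the result has a single key, GCF_000026325.1, listing both directory names.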
List all records in the new RefSeq folder
gtdb_migration_tk list_genomes -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq -o /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv
Update GenBank folder
gtdb_migration_tk update_genbank \
--ftp_genbank_directory /srv/db/ncbi/new_ftp_structure/genomes/all/ \
--new_genbank_directory /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/ \
--new_ftp_genbank_dirs_file /srv/db/gtdb/metadata/release<current_release#>/ncbi/genome_dirs_ftp.tsv \
--old_genbank_genome_dirs_file /srv/db/gtdb/genomes/ncbi/release<old_release#>/genbank/genome_dirs.tsv \
--arc_assembly_summary /srv/db/gtdb/metadata/release<current_release#>/ncbi/taxonomy/assembly_summary_archaea_genbank.txt \
--bac_assembly_summary /srv/db/gtdb/metadata/release<current_release#>/ncbi/taxonomy/assembly_summary_bacteria_genbank.txt \
--cpus 30 \
--new_refseq_genome_dirs_file /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv
List all records in the new GenBank folder:
gtdb_migration_tk list_genomes -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/ -o /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv
For the next steps of the update, go to Update database metadata
====================================================================================================================
How update_refseq_from_ftp.py works (TO REVIEW):
- For each domain:
  - List the RefSeq records present in the FTP folder (they must be flagged as latest)
  - List the RefSeq records present in the previous GTDB folder
  - If a genome is present in the FTP list but not in the old GTDB list:
    - Add the genome folder to the new GTDB folder
    - If the new genome is actually a new version of an existing genome:
      - Replace the old one with the new one
  - If a genome is not present in the FTP:
    - Delete the genome from GTDB
    - Update the lists that contain the deleted genomes
  - If the genome is present in both the FTP and the old GTDB:
    - Compare the checksum files of the 2 folders
    - If the checksums of the genomic.fna.gz and/or protein.faa.gz files are different:
      - Copy the FTP folder to the new GTDB folder
      - Unzip all gz files in the new GTDB folder
      - Update the sha sizes in the GTDB
    - If the checksums of the genomic.fna.gz and protein.faa.gz files are the same:
      - Copy the old GTDB folder to the new GTDB folder
      - Compare the genbank files between the GTDB folder and the FTP folder:
        - If there is a change:
          - Copy the genbank files from the FTP that differ from those in the GTDB folder
          - Copy the checksum files from the FTP
      - Compare the report files between the GTDB folder and the FTP folder:
        - If there is a change:
          - Copy the report files from the FTP that differ from those in the GTDB folder
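The top-level branching described above can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not the code of update_refseq_from_ftp.py: the name `plan_update`, the action labels and the input shape (accession mapped to a dict of file name to checksum) are all hypothetical, and the version-replacement, genbank-file and report-file comparisons are left out.

```python
# Files whose checksums decide whether a genome must be re-copied from the FTP.
SEQUENCE_FILES = ("genomic.fna.gz", "protein.faa.gz")


def plan_update(ftp_checksums, old_checksums):
    """Return {accession: action} for every genome seen in either input."""
    plan = {}
    for acc in ftp_checksums.keys() | old_checksums.keys():
        if acc not in old_checksums:
            # Present on the FTP only: new genome.
            plan[acc] = "add_from_ftp"
        elif acc not in ftp_checksums:
            # Present in the old GTDB folder only: dropped by NCBI.
            plan[acc] = "delete"
        else:
            changed = any(
                ftp_checksums[acc].get(f) != old_checksums[acc].get(f)
                for f in SEQUENCE_FILES
            )
            plan[acc] = "copy_from_ftp" if changed else "copy_from_old"
    return plan
```

Keeping the decision separate from the file copying makes it possible to review the planned actions before anything is moved or deleted.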