release new GTDB databases for R07-RS207 #1941

Closed
ctb opened this issue Apr 10, 2022 · 24 comments · Fixed by #2013

Comments

@ctb
Contributor

ctb commented Apr 10, 2022

per https://twitter.com/ace_gtdb/status/1512789050452692996

@ctb
Contributor Author

ctb commented Apr 10, 2022

the full genomic ones are available on farm at ~ctbrown/scratch/fromfile/gtdb, and they can be downloaded from IPFS here,

#1511 (comment)

along with updated genbank databases.

@luizirber, is it legit to make the https://dweb.link/ipfs/ URLs the standard for distributing databases?

@luizirber
Member

> @luizirber, is it legit to make the https://dweb.link/ipfs/ URLs the standard for distributing databases?

It is the preferable default (instead of https://ipfs.io) per ipfs/ipfs-companion#939

But we can also point in docs to check https://ipfs.github.io/public-gateway-checker/ if dweb.link is down?

@ctb
Contributor Author

ctb commented Apr 10, 2022

note, also need to build/provide the taxonomy spreadsheets for both genbank and GTDB.

@taylorreiter
Contributor

taylorreiter commented Apr 11, 2022

new taxonomy spreadsheets built for GTDB!

Paths on farm:

/group/ctbrowngrp/gtdb/gtdb-rs207.taxonomy.csv
/group/ctbrowngrp/gtdb/gtdb-rs207.taxonomy.representives.csv
/group/ctbrowngrp/gtdb/gtdb-rs207.taxonomy.with-strain.csv
/group/ctbrowngrp/gtdb/gtdb-rs207.taxonomy.with-strain.representives.csv
library(dplyr)
library(readr)
library(tidyr)

# download and init reformatting ------------------------------------------

bac120 <- read_tsv("https://data.gtdb.ecogenomic.org/releases/release207/207.0/bac120_taxonomy_r207.tsv.gz",
                   show_col_types = FALSE, col_names = c("ident", "lineage"))

ar53 <- read_tsv("https://data.gtdb.ecogenomic.org/releases/release207/207.0/ar53_taxonomy_r207.tsv.gz",
                 show_col_types = FALSE, col_names = c("ident", "lineage"))

gtdb207 <- bind_rows(bac120, ar53) %>%
  separate(lineage, into = c("superkingdom", "phylum", "class", "order", 
                             "family", "genus", "species"), sep = ";") %>%
  mutate(ident = gsub("^RS_", "", ident),
         ident = gsub("^GB_", "", ident))

write_csv(gtdb207, "gtdb-rs207.taxonomy.csv")

gtdb207_strain <- gtdb207 %>%
  mutate(strain = ident)

write_csv(gtdb207_strain, "gtdb-rs207.taxonomy.with-strain.csv")


# make tax for reps -------------------------------------------------------
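# download metadata to get reps info
# make sure the download directory exists before writing into it
dir.create("inputs/gtdb-rs207", recursive = TRUE, showWarnings = FALSE)
destfile <- "inputs/gtdb-rs207/bac120_metadata_rs207.tar.gz"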
url <- "https://data.gtdb.ecogenomic.org/releases/release207/207.0/bac120_metadata_r207.tar.gz"
if (!file.exists(destfile)) {
  download.file(url, destfile, method="auto") 
}
outfile <- "inputs/gtdb-rs207/bac120_metadata_r207.tsv"
if (!file.exists(outfile)){
  untar(destfile, exdir = "inputs/gtdb-rs207")
}

destfile <- "inputs/gtdb-rs207/ar53_metadata_rs207.tar.gz"
url <- "https://data.gtdb.ecogenomic.org/releases/release207/207.0/ar53_metadata_r207.tar.gz"
if (!file.exists(destfile)) {
  download.file(url, destfile, method="auto") 
}
outfile <- "inputs/gtdb-rs207/ar53_metadata_r207.tsv"
if (!file.exists(outfile)){
  untar(destfile, exdir = "inputs/gtdb-rs207")
}

# combine metadata

gtdb_metadata_reps <- read_tsv("inputs/gtdb-rs207/bac120_metadata_r207.tsv", show_col_types = FALSE) %>%
  select(ident=accession, gtdb_representative) %>%
  filter(gtdb_representative == TRUE) %>%
  mutate(ident = gsub("^RS_", "", ident),
         ident = gsub("^GB_", "", ident))

gtdb_metadata_reps <- read_tsv("inputs/gtdb-rs207/ar53_metadata_r207.tsv", show_col_types = FALSE) %>%
  select(ident=accession, gtdb_representative) %>%
  filter(gtdb_representative == TRUE) %>%
  mutate(ident = gsub("^RS_", "", ident),
         ident = gsub("^GB_", "", ident)) %>%
  bind_rows(gtdb_metadata_reps)

gtdb207_reps <- gtdb207 %>%
  filter(ident %in% gtdb_metadata_reps$ident)

write_csv(gtdb207_reps, "gtdb-rs207.taxonomy.representives.csv")

gtdb207_reps_strain <- gtdb207_strain %>%
  filter(ident %in% gtdb_metadata_reps$ident)

write_csv(gtdb207_reps_strain, "gtdb-rs207.taxonomy.with-strain.representives.csv")
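
For anyone without an R setup, the core reformatting step above (drop the `RS_`/`GB_` prefix, split the lineage on `;`) can be roughly sketched in shell. This is only an illustration, not the script used to build the files on farm: the function name `gtdb_tsv_to_csv` is made up here, the sample row is illustrative rather than copied from the release files, and the sketch assumes no field contains a comma.

```shell
#!/bin/sh
# Hypothetical helper: read GTDB taxonomy rows (ident<TAB>lineage) on stdin
# and write CSV rows with one column per rank on stdout.
gtdb_tsv_to_csv() {
    awk -F '\t' '
        BEGIN { print "ident,superkingdom,phylum,class,order,family,genus,species" }
        {
            ident = $1
            sub(/^(RS_|GB_)/, "", ident)   # strip the RefSeq/GenBank source prefix
            n = split($2, ranks, ";")      # lineage is semicolon-separated
            row = ident
            for (i = 1; i <= n; i++) row = row "," ranks[i]
            print row
        }'
}

# Illustrative input row (assumed format, not taken from the release files).
printf 'RS_GCF_000005845.2\td__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli\n' \
    | gtdb_tsv_to_csv
```

Unlike `write_csv`, this does no quoting, so it would need more care if a lineage ever contained a comma.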

@ctb
Contributor Author

ctb commented Apr 12, 2022

I love me some picklists!

for k in 21 31 51; do
    sourmash sig cat --picklist /group/ctbrowngrp/gtdb/gtdb-rs207.taxonomy.representives.csv:ident:ident \
          gtdb-rs207.genomic.dna.k$k.zip -o gtdb-rs207.genomic-reps.dna.k$k.zip;
done
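
The `--picklist` argument above has the form `file:column:coltype`. A small sanity check before launching a long run might look like the sketch below; `picklist_spec` is a hypothetical helper name, and it only does a naive check that the CSV header contains the requested column (no handling of quoted headers or Windows line endings).

```shell
#!/bin/sh
# Hypothetical helper: confirm that the CSV header has the named column,
# then print a picklist spec in the file:column:coltype form used above.
picklist_spec() {
    csv=$1
    column=$2
    coltype=$3
    header=$(head -n 1 "$csv") || return 1
    case ",$header," in
        *",$column,"*)
            printf '%s:%s:%s\n' "$csv" "$column" "$coltype"
            ;;
        *)
            echo "column '$column' not found in $csv" >&2
            return 1
            ;;
    esac
}

# Possible use with the command above (left as a comment, needs sourmash):
#   sourmash sig cat --picklist "$(picklist_spec reps.csv ident ident)" \
#       gtdb-rs207.genomic.dna.k21.zip -o gtdb-rs207.genomic-reps.dna.k21.zip
```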

running now, results will be in ~ctbrown/scratch/fromfile/gtdb soon.

@ctb
Contributor Author

ctb commented Apr 12, 2022

and I will confirm inclusion with:

sourmash sig check --picklist /group/ctbrowngrp/gtdb/gtdb-rs207.taxonomy.representives.csv:ident:ident \
           gtdb-rs207.genomic-reps.dna.k21.zip

@ctb
Contributor Author

ctb commented Apr 12, 2022

databases built!

== This is sourmash version 4.3.1.dev46+g997741a8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'gtdb-rs207.genomic-reps.dna.k51.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp2/ctbrown/fromfile/gtdb/gtdb-rs207.genomic-reps.dna.k51.zip
is database? yes
has manifest? yes
num signatures: 65703
** examining manifest...
total hashes: 212678557
summary of sketches:
   65703 sketches with DNA, k=51, scaled=1000, abund  212678557 total hashes

etc.
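
As a rough plausibility check on the k=51 summary above: with scaled=1000, each sketch keeps roughly one in every 1000 distinct k-mers, so the average number of hashes per sketch times 1000 gives a ballpark figure for distinct k-mers per genome. A quick sketch of that arithmetic, using only the numbers printed above (integer division, so the result is truncated):

```shell
#!/bin/sh
# Numbers copied from the summary output for the k=51 reps database.
num_sketches=65703
total_hashes=212678557
scaled=1000

avg_hashes=$(( total_hashes / num_sketches ))   # integer division
approx_kmers=$(( avg_hashes * scaled ))         # rough estimate only

echo "average hashes per sketch: $avg_hashes"                  # 3236
echo "approximate distinct k-mers per genome: $approx_kmers"   # 3236000
```

A few million distinct k-mers per genome is in the range one would expect for bacterial and archaeal genomes.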

@luizirber
Member

☝️ https://greyhound.sourmash.bio/ is now running with gtdb-rs207.genomic-reps.dna.k21.zip (and based on #1943, so I only had to run ./greyhound-server -k 21 --scaled 1000 ~/gtdb-rs207.genomic-reps.dna.k21.zip to start it =])

@luizirber
Member

Oh, and IPFS hashes:
gtdb-rs207.genomic-reps.dna.k51.zip bafybeiaustaeoaeja5nksc5pn2lttsge6dscilphdclcbubuvixvp6cmxa
gtdb-rs207.genomic-reps.dna.k21.zip bafybeia54il6bduuriga7ysuxpzms4fpqfi2uekqjo6h4czmumgjhjunwq
gtdb-rs207.genomic-reps.dna.k31.zip bafybeihnkkgayqxlz5yta3mu75xvft4pqlmqcwklgdh2d3v2lddjljzodm
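
To turn one of these hashes into a download link, it can be appended to the gateway URL discussed earlier in the thread. A minimal sketch, assuming the usual `<gateway>/ipfs/<CID>` path layout; `ipfs_url` is a hypothetical helper name, and the optional second argument lets you swap in another public gateway if dweb.link is down:

```shell
#!/bin/sh
# Hypothetical helper: build a gateway URL for an IPFS hash (CID).
# Defaults to the dweb.link gateway; pass a second argument to override.
ipfs_url() {
    cid=$1
    gateway=${2:-https://dweb.link}
    # ${gateway%/} drops a single trailing slash, if present
    printf '%s/ipfs/%s\n' "${gateway%/}" "$cid"
}

ipfs_url bafybeihnkkgayqxlz5yta3mu75xvft4pqlmqcwklgdh2d3v2lddjljzodm

# Possible download (left as a comment, needs network access):
#   curl -L -o gtdb-rs207.genomic-reps.dna.k31.zip \
#       "$(ipfs_url bafybeihnkkgayqxlz5yta3mu75xvft4pqlmqcwklgdh2d3v2lddjljzodm)"
```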

@ctb
Contributor Author

ctb commented Apr 12, 2022

Here's how I'm contemplating building new database releases -

https://github.com/sourmash-bio/database-releases/

idea is we have a very small repo that contains just the Snakefile and config stuff for each release version, and then every time we do a release of databases we cut a new release here => zenodo DOI, etc.

I'll flesh that out more clearly but would love any hot takes you might have :)

@ctb
Contributor Author

ctb commented Apr 13, 2022

no past decision goes unpunished. the sourmash sig check stuff I'm putting into the database-releases workflows is now going to run afoul of LCA databases that are missing identifiers because duplicate signatures are removed per #1573.

SIGH.

@ctb
Contributor Author

ctb commented Apr 13, 2022

Full databases (.zip, .sbt.zip, .lca.json.gz) now available for all GTDB:

/home/ctbrown/scratch/fromfile/database-releases/gtdb-rs207.genomic

and for just the genomic representatives:

/home/ctbrown/scratch/fromfile/database-releases/gtdb-rs207.genomic-reps

Genbank .zip databases from end of March 2022 are here:

/home/ctbrown/scratch/fromfile/genbank

I'll work on collating the tax spreadsheets etc and putting them in a single canonical place on farm.

(Still need to build tax spreadsheets for genbank.)

@ctb
Contributor Author

ctb commented Apr 13, 2022

I made a release on database-releases here, https://github.com/sourmash-bio/database-examples/releases/tag/v0.1

@ctb
Contributor Author

ctb commented Apr 14, 2022

all (?) GTDB databases linked under /group/ctbrowngrp/sourmash-db on farm, which should be the default path used from now on.

still have to update, copy, and/or link in taxonomy DBs, among other things...

@ctb
Contributor Author

ctb commented Apr 20, 2022

random question: should I use the code in https://github.com/dib-lab/2018-ncbi-lineages to build new Genbank lineages, or is there a better procedure? No problem updating code etc etc if needed, was just wondering if somewhere in our collection of issues/PRs there is a new, improved genbank lineage construction script.

@taylorreiter
Contributor

I don't know of any new and improved methods

@ctb
Contributor Author

ctb commented Apr 20, 2022

🎶 hey, ho, away we go 🎶

(envision picture of dwarf heading off to code mines with a pickaxe)

@ctb
Contributor Author

ctb commented Apr 20, 2022

actually I like how I set myself up for success in that github repo with a Snakefile and everything. yay past me!

@ctb
Contributor Author

ctb commented Apr 20, 2022

...easy conversion over to assembly_summary files as inputs: https://github.com/ctb/2022-assembly-summary-to-lineages

@ctb
Contributor Author

ctb commented Apr 20, 2022

New genbank lineages file on farm:

/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-lineages.csv.gz

@ctb
Contributor Author

ctb commented Apr 20, 2022

and now in our google drive folder, https://drive.google.com/drive/folders/1Jk5z4fQtsyqyJWCcNmtn4WyE2jZsejrZ.

I think it's time to update the docs, yah?

@taylorreiter
Contributor

I think rs207 needs to be added to the osf first?

also genbank was plugged into here, https://osf.io/wxf9z/, which is labelled as sourmash GTDB databases
and genbank was not plugged in here, https://osf.io/t3fqa/, which is labelled as just sourmash databases

@ctb
Contributor Author

ctb commented Apr 20, 2022

they're all available via other links at this point - google drive and/or IPFS.

@ctb
Contributor Author

ctb commented May 1, 2022

ref #2015. Will be closed by #2013 🎉 .
