Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] improve identifier & taxonomy parsing for lca index #1542

Closed
wants to merge 4 commits into from

Conversation

ctb
Copy link
Contributor

@ctb ctb commented May 21, 2021

As our sourmash_databases fu continues to evolve, we are doing a better job of providing versioned accessions on signature names, but this is also changing the requirements for taxonomy spreadsheets.

This PR provides a bunch of options to lca index to improve the UX and slightly improve the overall situation, while also highlighting how silly the current code and UX is :).

@codecov
Copy link

codecov bot commented May 21, 2021

Codecov Report

Merging #1542 (7e822c6) into latest (516dc53) will increase coverage by 4.94%.
The diff coverage is 53.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1542      +/-   ##
==========================================
+ Coverage   90.26%   95.21%   +4.94%     
==========================================
  Files         126       99      -27     
  Lines       21271    17584    -3687     
  Branches     1595     1600       +5     
==========================================
- Hits        19201    16743    -2458     
+ Misses       1843      611    -1232     
- Partials      227      230       +3     
Flag Coverage Δ
python 95.21% <53.33%> (-0.04%) ⬇️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/lca/command_index.py 88.44% <46.15%> (-3.05%) ⬇️
src/sourmash/cli/lca/index.py 100.00% <100.00%> (ø)
src/core/src/sketch/hyperloglog/mod.rs
src/core/src/index/sbt/mhbt.rs
src/core/src/ffi/mod.rs
src/core/src/index/sbt/mod.rs
src/core/src/ffi/utils.rs
src/core/src/ffi/hyperloglog.rs
src/core/tests/test.rs
src/core/src/ffi/cmd/compute.rs
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 516dc53...7e822c6. Read the comment docs.

@ctb
Copy link
Contributor Author

ctb commented May 21, 2021

Current command for building LCA indices 😆

sourmash lca index ../../gtdb-rs202.taxonomy.v2.csv gtdb-rs202.genomic.k31.lca.json.gz \
     gtdb-rs202.genomic-reps.k31.sbt.zip --scaled=10000 \
     --require-taxonomy --fail-on --split-identifier

@ctb
Copy link
Contributor Author

ctb commented May 22, 2021

This is working in practice, but still needs to be tested. @bluegenes if y'all end up taking this over for #1515 well that'd be just fine by me 😁

@bluegenes
Copy link
Contributor

This is working in practice, but still needs to be tested. @bluegenes if y'all end up taking this over for #1515 well that'd be just fine by me 😁

can I merge this into #1543 and add tests over there? Or would it be best to add tests here and then integrate?

@ctb
Copy link
Contributor Author

ctb commented May 25, 2021 via email

@bluegenes
Copy link
Contributor

Closing. Merged into #1543, with some tests added in b1a40a3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants