[WIP] Add "knn" and "umap" commands #710

Closed
wants to merge 138 commits into from

olgabot
Collaborator

@olgabot olgabot commented Aug 10, 2019

This depends on #925 working. Once it does, one can create a k-nearest neighbor graph and UMAP visualization directly from a sourmash SBT index.

  • Is it mergeable?
  • make test: Did it pass the tests?
  • make coverage: Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

olgabot and others added 30 commits April 8, 2019 12:58
Co-Authored-By: Luiz Irber <luizirber@users.noreply.github.com>
@luizirber
Member

Continuing this conversation:

olgabot

Is this because SBTs are dependent on the order in which they were constructed?

titus

yes, currently SBTs are order dependent.

olgabot

So there’s no guarantee that leaves sharing parents of nodes further down are more similar than leaves sharing higher-level parents?

titus

that is correct.
I think Luiz has been working on ways to build the SBT differently, as have others, but if you want to enable online building and node additions, you end up with a lot of balance challenges.

luizirber

Yup. But clustering similar nodes would also make searching (especially gather) way faster...
There is a scaffold command on the Rust side which will take a list of datasets and cluster them, but it is not online
I was thinking to do something like:

  • index collects datasets up to a threshold (~100?), and just does a linear search until then
  • when there are enough datasets, run scaffold to build a skeleton for new updates (see the Allsome SBT paper for a discussion about clustering too early)
  • modify the add_node call to search for the best position to put the new dataset. This can be the first thing to change, but without proper tree balancing it will mess up the internal SBT representation...
    SBTs behave better if they are clustered by shared content, and if the tree is dense (don't increase depth if there are positions available)
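
A rough sketch of the three steps above, in plain Python rather than against the real sourmash API. Everything here is an assumption made for illustration: `ThresholdIndex`, `_scaffold`, `_place`, and the `min_similarity` parameter are hypothetical names, hash sets stand in for MinHash signatures, and the greedy grouping is only a stand-in for the Rust-side scaffold command (whose actual algorithm is not described in this thread). The union of a cluster's hashes plays the role of an SBT internal node, so a cluster can be skipped when the query cannot reach the threshold inside it.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard similarity of two hash sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class ThresholdIndex:
    """Hypothetical index: linear scan while small, clustered once it grows."""

    def __init__(self, threshold: int = 100, min_similarity: float = 0.1):
        self.threshold = threshold            # datasets to collect before clustering
        self.min_similarity = min_similarity  # below this, start a new cluster
        self.pending: Dict[str, FrozenSet[int]] = {}
        self.clusters: List[Dict[str, FrozenSet[int]]] = []

    def _place(self, name: str, hashes: FrozenSet[int]) -> None:
        """Put a dataset into the cluster whose union it overlaps most."""
        best, best_score = None, self.min_similarity
        for cluster in self.clusters:
            union = frozenset().union(*cluster.values())
            score = jaccard(hashes, union)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            self.clusters.append({name: hashes})
        else:
            best[name] = hashes

    def _scaffold(self) -> None:
        """Stand-in for a real scaffold step: greedy grouping of pending datasets."""
        for name, hashes in self.pending.items():
            self._place(name, hashes)
        self.pending = {}

    def add(self, name: str, hashes: Iterable[int]) -> None:
        hashes = frozenset(hashes)
        if self.clusters:
            # Skeleton exists: search for the best position for the new dataset.
            self._place(name, hashes)
            return
        self.pending[name] = hashes
        if len(self.pending) >= self.threshold:
            self._scaffold()

    def search(self, query: Iterable[int], threshold: float = 0.1) -> List[Tuple[str, float]]:
        query = frozenset(query)
        # Linear scan over anything not yet clustered.
        hits = [(n, jaccard(query, h)) for n, h in self.pending.items()]
        for cluster in self.clusters:
            union = frozenset().union(*cluster.values())
            # Upper bound on Jaccard(query, member) for every member of the cluster.
            bound = len(query & union) / len(query) if query else 0.0
            if bound < threshold:
                continue  # prune the whole cluster
            hits.extend((n, jaccard(query, h)) for n, h in cluster.items())
        hits = [(n, s) for n, s in hits if s >= threshold]
        hits.sort(key=lambda item: (-item[1], item[0]))
        return hits
```

Note that this greedy placement is itself order dependent and does no rebalancing, so it only illustrates the control flow (collect, scaffold, then place), not a solution to the balance problem.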

@olgabot
Collaborator Author

olgabot commented Oct 22, 2019

The current implementation of SBTs is not properly localized, so a nearest neighbor graph built on the SBT and then fed into the UMAP implementation in this PR gives incorrect results. Instead of the correct plot on the left, which was built from an all-by-all matrix CSV from sourmash compare, one gets the plot on the right, which has mini-versions of the plot on the left.

[Image: Screen Shot 2019-10-22 at 9 20 58 AM]

This is due to the SBTs not being properly localized: the nearest neighbor algorithm assumes that leaves sharing a parent are more similar in Jaccard space than leaves that do not share a parent.
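
To make the locality assumption concrete, here is a minimal sketch in plain Python (not the code in this PR, and not the sourmash API). It computes an exact k-nearest-neighbor adjacency list from all-by-all Jaccard similarity, which is the kind of reference an all-by-all comparison gives, and then measures how often a leaf's true nearest neighbor is one of its siblings. The function names `knn_adjacency` and `sibling_recall`, the toy hash sets, and the idea of representing a tree only by its sibling groups are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard similarity of two hash sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def knn_adjacency(
    signatures: Dict[str, FrozenSet[int]], k: int
) -> Dict[str, List[Tuple[str, float]]]:
    """Exact kNN adjacency list from an all-by-all Jaccard comparison."""
    sims: Dict[str, List[Tuple[str, float]]] = {name: [] for name in signatures}
    for left, right in combinations(sorted(signatures), 2):
        score = jaccard(signatures[left], signatures[right])
        sims[left].append((right, score))
        sims[right].append((left, score))
    # Most similar first; ties broken by name so the result is deterministic.
    return {
        name: sorted(pairs, key=lambda p: (-p[1], p[0]))[:k]
        for name, pairs in sims.items()
    }


def sibling_recall(
    groups: Iterable[Iterable[str]], signatures: Dict[str, FrozenSet[int]]
) -> float:
    """Fraction of leaves whose exact nearest neighbor shares their parent."""
    siblings = {}
    for group in groups:
        members = set(group)
        for name in members:
            siblings[name] = members - {name}
    nearest = knn_adjacency(signatures, k=1)
    hits = sum(
        1
        for name, neighbors in nearest.items()
        if neighbors and neighbors[0][0] in siblings.get(name, set())
    )
    return hits / len(nearest)


# Toy data: a/b are similar to each other, and so are c/d.
toy = {
    "a": frozenset({1, 2, 3, 4}),
    "b": frozenset({1, 2, 3, 5}),
    "c": frozenset({10, 11, 12}),
    "d": frozenset({10, 11, 13}),
}
# Leaves paired in insertion order a, c, b, d versus paired by content.
by_insertion_order = [("a", "c"), ("b", "d")]
by_content = [("a", "b"), ("c", "d")]
```

With the toy data, `sibling_recall(by_insertion_order, toy)` is 0.0 while `sibling_recall(by_content, toy)` is 1.0, which is the gap a tree-local neighbor search would fall into when the tree is order dependent.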

@olgabot olgabot changed the title [WIP] Create adjacency list for K Nearest Neighbor graph from SBT [WIP] Add "knn" and "umap" commands Apr 7, 2020
@olgabot
Collaborator Author

olgabot commented Apr 7, 2020

Resurrecting this PR...

# Conflicts:
#	requirements.txt
#	sourmash/_minhash.pxd
#	sourmash/_minhash.pyx
#	sourmash/commands.py
#	sourmash/compare.py
#	sourmash/index.py
#	sourmash/kmer_min_hash.hh
#	sourmash/sbt.py
#	sourmash/sig/__main__.py
#	sourmash/signature.py
#	sourmash/signature_json.py
#	sourmash/sourmash_args.py
#	tests/conftest.py
#	tests/test__minhash.py
#	tests/test_cmd_signature.py
#	tests/test_compare.py
#	tests/test_signature_json.py
#	tests/test_sourmash.py
@sonarcloud

sonarcloud bot commented Apr 13, 2020

SonarCloud Quality Gate failed.

  • Bugs: 1 (rating E)
  • Vulnerabilities: 0 (rating A); Security Hotspots to review: 0
  • Code Smells: 13 (rating A)
  • Coverage: no coverage information
  • Duplication: 0.0%

@czbiohub-sf czbiohub-sf closed this by deleting the head repository Mar 17, 2023
5 participants