Look into clustering and clustering summarization #225

Open
ctb opened this issue May 16, 2017 · 2 comments

@ctb

ctb commented May 16, 2017

https://stats.stackexchange.com/questions/3685/where-to-cut-a-dendrogram

http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/prog/pvclust/

https://github.com/biocore/genome-subsampler/blob/master/genomesubsampler/prototypeSelection.py

things to provide code for --

  • notebooks to investigate & visualize
  • include relabeling (by editing labels.txt / lists in Python)
  • cut clusters at some distance (cophenetic?)
  • bootstrap/summarize/shrink clusters
  • histogram of distances
  • selection of samples by fishing with a query (or query set)
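
A minimal sketch of two of the items above (cutting clusters at a distance, and a histogram of distances), using only the standard library. It assumes a square distance matrix with values in [0, 1] (for example 1 minus similarity) and a matching list of labels. The function names and the toy data are hypothetical, and average linkage is an arbitrary choice made for illustration. In practice, SciPy's `scipy.cluster.hierarchy.fcluster` with `criterion="distance"` performs the same kind of cut on a linkage matrix.

```python
from itertools import combinations


def cut_average_linkage(dist, labels, cutoff):
    """Agglomerate with average linkage until the closest pair of
    clusters is farther apart than `cutoff`, then return the clusters."""
    clusters = [[i] for i in range(len(labels))]

    def avg_dist(a, b):
        # mean pairwise distance between members of clusters a and b
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # find the closest pair of current clusters (x < y always holds)
        (x, y), d = min(
            (
                ((x, y), avg_dist(clusters[x], clusters[y]))
                for x, y in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break  # every remaining merge would sit above the cut
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(labels[i] for i in c) for c in clusters]


def distance_histogram(dist, n_bins=5):
    """Count off-diagonal pairwise distances in equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for i, j in combinations(range(len(dist)), 2):
        b = min(int(dist[i][j] * n_bins), n_bins - 1)
        counts[b] += 1
    return counts


# Toy data (made up): two tight pairs that are far from each other.
labels = ["a1", "a2", "b1", "b2"]
dist = [
    [0.0, 0.1, 0.7, 0.9],
    [0.1, 0.0, 0.85, 0.7],
    [0.7, 0.85, 0.0, 0.25],
    [0.9, 0.7, 0.25, 0.0],
]

print(cut_average_linkage(dist, labels, cutoff=0.5))
print(distance_histogram(dist))
```

The histogram is a cheap way to pick a cutoff: a gap between the low-distance and high-distance bins suggests where a cut would separate tight groups from everything else.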
@ctb

ctb commented May 17, 2017

@ekg suggested looking into variational autoencoding:

> Or, if you're interested in finding your way back into the big tree, you could use VAE or similar dim. reduction and work from that repr

(we already have t-SNE working in a notebook somewhere)

@ctb

ctb commented Mar 4, 2021

see cluster and cocluster too #700

bluegenes added a commit to sourmash-bio/sourmash_plugin_branchwater that referenced this issue Feb 27, 2024
This PR adds a new command, `cluster`, that can be used to cluster the output from `pairwise` and `multisearch`.

`cluster` uses `rustworkx-core` (which internally uses `petgraph`) to build a graph, adding an edge between two nodes when their similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by `pairwise` or `multisearch`, and it adds all nodes to the graph so that singleton 'clusters' are preserved in the output.

`cluster` outputs two files: 
1. cluster identities file: `Component_X, name1;name2;name3...`
2. cluster size histogram `cluster_size, count`
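
For illustration only, the approach described above can be sketched in Python with a small union-find; this is not the plugin's Rust implementation. The `(name_a, name_b, similarity)` tuple layout, the function names, the strict `>` comparison, and numbering components from 1 are all assumptions made for this sketch.

```python
from collections import Counter


def cluster_by_threshold(pairs, threshold):
    """Connected components of the graph whose edges are the pairs
    with similarity above `threshold`. Every name seen in `pairs`
    becomes a node, so singletons are preserved."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, sim in pairs:
        ra, rb = find(a), find(b)  # registers both names, even below threshold
        if sim > threshold and ra != rb:  # strict '>' is an assumption
            parent[rb] = ra

    groups = {}
    for name in parent:
        groups.setdefault(find(name), []).append(name)
    # largest clusters first, ties broken by member names
    return sorted((sorted(m) for m in groups.values()), key=lambda m: (-len(m), m))


def identity_rows(clusters):
    """Rows shaped like `Component_X, name1;name2;name3`."""
    return [f"Component_{i}, {';'.join(c)}" for i, c in enumerate(clusters, start=1)]


def size_histogram(clusters):
    """Map cluster_size -> count."""
    return dict(Counter(len(c) for c in clusters))


# Toy input (made up): D only appears in a below-threshold pair, E only matches itself.
pairs = [("A", "B", 0.9), ("B", "C", 0.8), ("C", "D", 0.1), ("E", "E", 1.0)]
clusters = cluster_by_threshold(pairs, threshold=0.5)
for row in identity_rows(clusters):
    print(row)
print(size_histogram(clusters))
```

Registering both names before checking the threshold is what keeps `D` and `E` in the output as singleton clusters.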

context for some things I tried:
- try using petgraph directly and removing rustworkx dependency
> nope, `rustworkx-core` adds `connected_components`, which returns the connected components themselves rather than just their count. Could reimplement if `rustworkx-core` brings in a lot of deps
- try using 'extend_with_edges' instead of add_edge logic.
> nope, only in `petgraph`

**Punted Issues:**
- develop clustering visualizations (ref @mr-eyes kSpider/dbretina work). Optionally output dot file of graph? (#248)
- enable updating clusters, rather than always regenerating from scratch (#249)
- benchmark `cluster` (#247)
>  `pairwise` files can be millions of lines long. Would it be faster to read them in parallel, store them in an `edges` vector, and then add nodes/edges sequentially? Note that we would probably need to either 1. store all edges (including those that do not pass threshold) or 2. after building the graph from edges, add nodes from `names_to_node` that are not already in the graph to preserve singletons.


Related issues:

* #219
* sourmash-bio/sourmash#2271
* sourmash-bio/sourmash#700
* sourmash-bio/sourmash#225
* sourmash-bio/sourmash#274


---------

Co-authored-by: C. Titus Brown <titus@idyll.org>