Streamtrees #1902

Draft: wants to merge 22 commits into base: master

jameshadfield (Member) commented Nov 14, 2024

This is the first prototype of a long-running idea of mine (and others) for displaying big trees where our typical approach is too slow and/or we run out of pixels. This sketch is from many years ago:
[image: early sketch of the idea]

This prototype implements streams by allowing branch labels to cut the tree into monophylies / paraphylies and visualising each as a streamgraph. While the motivation was to display huge trees which Auspice can't currently display, it works and is useful (and fun) for smaller trees as well. Long-term I think this would be a good starting view into very large or diverse datasets.
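
A minimal sketch of the partitioning idea, not the code in this PR: it walks an Auspice-style tree and assigns every node to the stream founded by its nearest labelled ancestral branch. The nested layout (`branch_attrs.labels`, `children`) follows the Auspice dataset JSON; the function name `partition_into_streams`, the `"unlabelled"` fallback and the toy tree are assumptions made for illustration.

```python
from collections import defaultdict


def partition_into_streams(root, label_key, fallback="unlabelled"):
    """Group node names by the nearest ancestral branch carrying `label_key`."""
    streams = defaultdict(list)
    stack = [(root, fallback)]
    while stack:
        node, current = stack.pop()
        labels = node.get("branch_attrs", {}).get("labels", {})
        # A labelled branch founds a new stream; otherwise inherit the parent's.
        current = labels.get(label_key, current)
        streams[current].append(node["name"])
        for child in reversed(node.get("children", [])):
            stack.append((child, current))
    return dict(streams)


# Hypothetical toy tree: clade A contains a nested clade B.
tree = {
    "name": "root",
    "children": [
        {"name": "n1", "branch_attrs": {"labels": {"clade": "A"}}, "children": [
            {"name": "t1"},
            {"name": "n2", "branch_attrs": {"labels": {"clade": "B"}}, "children": [
                {"name": "t2"}, {"name": "t3"},
            ]},
        ]},
        {"name": "t4"},
    ],
}

streams = partition_into_streams(tree, "clade")
```

In this toy case stream B is monophyletic, while stream A is what remains of clade A once B has been cut out, i.e. a paraphyletic group. Note that two separate branches sharing the same label value would be merged into one stream by this sketch.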

How to use

nextstrain.org review app

Any dataset with a branch label (apart from amino acid) can have streamtrees. The dropdown in the sidebar can change the label used to partition the tree, and there's a toggle to go between streamtrees and normal tree view. My general view is that the appropriate partitioning of sample-sets will be best done in Augur, either algorithmically or manually. (There's also scope for dynamic partitioning in Auspice via genotype or color-by, but that's not implemented here.)

[image]

Suggested testing datasets

Known caveats / bugs

  • The tree must be temporal. You can change between temporal & divergence metrics, so long as temporal information exists.
  • Only rectangular trees.
  • Streams with only internal nodes cause a crash.
  • Performance hasn't been optimised at all. Specifically, toggling between streamtrees & normal trees is very slow.
  • Nested streams which aren't very ladder-like don't zoom well most of the time, and changing back to the normal view often gets the viewport completely wrong. This is fixable, but requires a rethinking & rewrite of how I calculate the vertical position of nodes.
  • Only categorical color-bys work. Genotype colorings don't work.
  • The vertical space within a single stream is consistent (i.e. half height = half the samples in that window) but it's not consistent between streams. The window size is also not consistent between streams.
  • The curves drawn around streams are an off-the-shelf curve generator and aren't quite right.
  • Dashed lines are no good.
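
To make the caveat about vertical space concrete, here is a minimal sketch of one reading of it: tips are binned into fixed-width time windows and each window's height is scaled by that stream's own busiest window, so heights are comparable within a stream but not between streams. The function names, the `(num_date, category)` tuple layout and the window width are assumptions, not the PR's implementation.

```python
from collections import Counter
from math import floor


def window_counts(tips, window):
    """Count tips per (window index, category); tips are (num_date, category) pairs."""
    start = min(date for date, _ in tips)  # raises ValueError if a stream has no tips
    counts = Counter()
    for date, category in tips:
        counts[(floor((date - start) / window), category)] += 1
    return counts


def relative_heights(counts):
    """Scale each window's total by the stream's own busiest window (0..1)."""
    totals = Counter()
    for (index, _category), n in counts.items():
        totals[index] += n
    peak = max(totals.values())
    return {index: total / peak for index, total in totals.items()}


# Hypothetical stream: decimal sampling dates and a categorical color-by.
tips = [(2020.0, "Europe"), (2020.1, "Europe"), (2020.1, "Asia"),
        (2020.2, "Asia"), (2020.6, "Asia"), (2020.7, "Europe")]
counts = window_counts(tips, window=0.5)
heights = relative_heights(counts)
```

Here the second window holds half as many samples as the first and so gets half the height. Because `peak` is computed per stream, a stream of 10 tips and a stream of 10,000 tips can both reach full height.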

Future directions

  • Release the current prototype after feedback, UI improvements & bug-fixes.
  • Auto-toggling between streamtrees & normal trees. Currently you have to manually do this and it's very slow, but the idea is that as you zoom the streams are replaced by either more fine-grained streams or the normal tree view.
  • Allow JSONs to encode streams in a more compressed format (to allow massive trees) and then the switching of streams to normal trees involves a fetch of the associated dataset-JSON for that particular stream.
  • Dynamic partitioning which isn't simple cuts in the tree. E.g. see certain mutations as individual streams with the lines between the streams representing the flows between states. (Maybe.)
  • On a recent call we discussed collapsing identical sequences (tips) in order to better view large trees. One option here is making the tip size scale with the number of samples represented by the tip (expressing multiple colours / sampling dates is harder). Another option is to use streams (but be aware that n streams will always be a lot slower than n tips).

Screenshots

[video: Screen.Recording.2024-11-14.at.8.43.25.PM.mov]
[image]
[video: Screen.Recording.2024-11-14.at.8.40.30.PM.mov]

jameshadfield added the "experiment" (PRs which may never be merged) and "preview on nextstrain.org" labels on Nov 14, 2024
nextstrain-bot temporarily deployed to auspice-james-stream-tr-37i7jd on November 14, 2024 07:46 (Inactive)

trvrb (Member) commented Nov 14, 2024

This is so cool @jameshadfield! I can give more detailed feedback soon, but I wanted to highlight just a few things on initial perusal:

trvrb (Member) commented Nov 15, 2024

(After conversation with James...)

We want to be sure this viz approach is useful for being able to read evolutionary / epidemiological stories from the genetic data. If we construct streams from the clade branch label then it's clear that we can describe evolutionary stories. For example from https://nextstrain-s-nextstrain-ginawj.herokuapp.com/ncov/gisaid/europe/all-time we can see the standard pandemic story of initial variants, then VOCs and the sweep of Delta and then the emergence of Omicron on a long branch, and so forth.

[screenshot]

And we can do things like color by S1 mutations, as we often want to understand what's driving clade success.

[screenshot]

But in general I'd be trying to think about how a streamtree view would enable proper reading of stories like:

My guess is that streamtrees would work quite well for the epidemiological cases, but only really when "clades" are created that correspond well to geographic transitions (note that tracking geography in this way was where Pango lineages got their start). This could be literally creating branch labels to mark geographic transitions from augur traits and giving each transition a unique name. I think this will probably be better than trying to label branches as "USA", etc. and having them converge to the same streams. (This doesn't have to be a primary use case of the streamtrees, but I think it's worth considering while scoping out the initial remit.)
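
A minimal sketch of the transition-labelling idea, assuming an Auspice-style tree where an inferred trait is stored as `node_attrs[trait]["value"]`. The label key `"transition"`, the `parent->child.N` naming scheme and the helper names are hypothetical; this is not an existing Augur command.

```python
from itertools import count


def label_transitions(root, trait, label_key="transition"):
    """Add a uniquely named branch label wherever `trait` changes from parent to child."""
    counter = count(1)
    added = []
    stack = [(root, None)]
    while stack:
        node, parent_value = stack.pop()
        value = node.get("node_attrs", {}).get(trait, {}).get("value")
        if parent_value is not None and value is not None and value != parent_value:
            name = f"{parent_value}->{value}.{next(counter)}"
            node.setdefault("branch_attrs", {}).setdefault("labels", {})[label_key] = name
            added.append(name)
        # Nodes without an inferred value pass the parent's value down.
        inherited = value if value is not None else parent_value
        for child in reversed(node.get("children", [])):
            stack.append((child, inherited))
    return added


def _node(name, region, children=()):
    """Build a toy node carrying a single `region` trait."""
    return {"name": name, "node_attrs": {"region": {"value": region}},
            "children": list(children)}


tree = _node("root", "Asia", [
    _node("a", "Europe", [_node("a1", "Europe"), _node("a2", "USA")]),
    _node("b", "Asia", [_node("b1", "Europe")]),
])
labels = label_transitions(tree, "region")
```

The two independent Asia-to-Europe introductions receive distinct names, so each would found its own stream rather than converging on a shared "Europe" stream.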

trvrb (Member) commented Nov 15, 2024

Related to the above, I think the biggest design decision here is to enforce creating streams from existing branch labels. This effectively pushes the problem of what streams to allow for into augur and machinery like augur clades. This is as opposed to dynamically sizing streams in Auspice based on tree size (perhaps with UI for how granular to make them). But that said, even if we at some point really wanted dynamic streams in Auspice, it's still going to be best to start with the simple branch label strategy as this is the necessary prerequisite anyway.

trvrb (Member) commented Nov 20, 2024

@jameshadfield, a follow-up thought here while on the topic of remit and structure for this feature. It would be amazing to be able to have https://nextstrain.org/nextclade/sars-cov-2/ but, rather than each Pango lineage having a single circle, each would have a stream.

[screenshot]

I think this could effectively be hacked into your current setup by creating a branch label, carrying the Pango lineage name, on the branch immediately leading to each tip, and then throwing in a set of 1 to 100 representative strains from this Pango lineage, each carrying the necessary metadata (collection date, S1 mutations, region, etc.). These 1-100 representative strains would not have any tree structure and would just be a comb / polytomy replacing the single existing tip. The number of strains per Pango lineage would be an input, so that more frequent lineages get more strains and consequently wider streams.

My main reason to bring this up: is a standard Auspice JSON with specific branch labels and polytomies of discrete strains the best way to encode this? I think so? There would be more efficient ways to encode this than a bag of strains if we had a single coloring to worry about, but if we want to allow different colorings, then treating this as a set of discrete strains with metadata is probably the way to go.
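
A minimal sketch of that encoding, assuming the standard Auspice node layout (`name`, `node_attrs`, `branch_attrs.labels`, `children`). The helper names, the `"pango"` label key, the `_stream` name suffix and the linear frequency scaling are all hypothetical choices for illustration.

```python
def expand_tip(tip, lineage, strains, label_key="pango"):
    """Turn a single lineage tip into a labelled polytomy of representative strains."""
    return {
        "name": f"{tip['name']}_stream",
        "node_attrs": dict(tip.get("node_attrs", {})),
        # The label sits on the branch leading to the polytomy, so it founds one stream.
        "branch_attrs": {"labels": {label_key: lineage}},
        # Children are plain tips: no nested structure, just a comb.
        "children": [
            {"name": s["name"], "node_attrs": dict(s["node_attrs"])} for s in strains
        ],
    }


def strains_per_lineage(frequency, max_frequency, lo=1, hi=100):
    """Scale the number of representatives (1 to 100) with lineage frequency."""
    return max(lo, min(hi, round(hi * frequency / max_frequency)))


# Hypothetical inputs: one lineage tip and two representative strains with metadata.
tip = {"name": "BA.2", "node_attrs": {"div": 60}}
strains = [
    {"name": "strain_1", "node_attrs": {"num_date": {"value": 2022.1},
                                        "region": {"value": "Europe"}}},
    {"name": "strain_2", "node_attrs": {"num_date": {"value": 2022.3},
                                        "region": {"value": "Asia"}}},
]
node = expand_tip(tip, "BA.2", strains)
n_common = strains_per_lineage(0.30, max_frequency=0.30)
n_rare = strains_per_lineage(0.001, max_frequency=0.30)
```

Keeping each representative as a discrete tip with its own `node_attrs` is what allows any coloring to be applied later, at the cost of a larger JSON than a per-lineage summary would need.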

I would imagine this scenario of collapsed tip distributions / streams to be a common one. We have the analogous issue with unique seasonal influenza HA haplotypes.

I don't know if augur clades is the right place to stuff this sort of operation? Actually... one strategy would be to take a Nextclade reference tree (like the SARS-CoV-2 lineage tree above) and decorate it with additional input sequences also using Nextclade, and this way not perform full phylogenetic / TreeTime inference.

trvrb (Member) commented Nov 20, 2024

"On a recent call we discussed collapsing identical sequences (tips) in order to better view large trees. One option here is making the tip size scale with the number of samples represented by the tip (expressing multiple colours / sampling dates is harder). Another option is to use streams (but be aware that n streams will always be a lot slower than n tips)."

As you say, if you have the logic to collapse to streams you could explore the ability to collapse to circles whose area is proportional to n and whose color uses the simple merged-color logic we use for phylogeographic uncertainty. This would allow further scaling. But I think it's fair to start with just streams, as getting this working is definitely the more complex avenue.
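
For the circle option, area proportional to n means the radius grows with the square root of n. A minimal sketch, with `tip_radius` and the base radius of 3 as illustrative assumptions (the merged-color logic is not shown):

```python
from math import pi, sqrt


def tip_radius(n_samples, base_radius=3.0):
    """Radius such that circle area grows linearly with the number of samples."""
    if n_samples < 1:
        raise ValueError("a tip must represent at least one sample")
    return base_radius * sqrt(n_samples)


single = tip_radius(1)
collapsed = tip_radius(100)
# Ratio of areas between a tip of 100 collapsed samples and a single-sample tip.
area_ratio = (pi * collapsed ** 2) / (pi * single ** 2)
```

A tip standing in for 100 identical sequences is drawn with 10 times the radius, and so roughly 100 times the area, of a single-sample tip.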

jameshadfield (Member, Author) commented Dec 18, 2024

Thanks for the feedback @trvrb - much appreciated.

I'm currently reworking this PR to both improve the code and address the shortcomings identified by Trevor and myself. If anyone has further feedback please provide it over the coming fortnight.

genehack (Contributor) commented

"If anyone has further feedback please provide it over the coming fortnight."

That's the next two weeks for those of you who only speak Merrikun! 🤣
