Validate path connections #6041
Comments
I was exploring whether we can find smarter defaults instead of trying to validate graphs: https://metabase.posthog.net/question/143

I analyzed one month of data: across all paths for PostHog, we have ~28,000 edges. It's very much a power-law distribution. The average edge weight is ~7, but the 95th percentile is 11, the 99.5th is 160, and the 99.9th is 1000. So we can't return all the data when there's no start or end point.

Out of these, there are ~7,500 starting points. Edit: ~4,000 unique starting points.

Next up, I'll look into these start points to get the same numbers for them. If that reduces the sample space enough, I think we could set a high enough default for start and end points.
Hmm, things aren't looking too great. Chose the largest ones, and then a few random ones:
This one fans out a LOT: link - there are supposed to be 374 edges starting at These are probably the worst of the worst cases^^^. But assuming a 10x bigger company, they start looking more reasonable.
Another idea: we can leverage the shape of the distribution. Instead of limiting the number of edges, how about we limit [absolute|relative] edge weights? Something like: if the number of people who did edge A->B is less than 1% / 25 people, discard the edge from the final result. It doesn't directly solve the above problem, but it gives us better initial data to work with: if we have N edges with, say, the same weight, and only half of them fit within the limit, it would be a shame to discard the other half, as they're as useful (or useless) as the first half. ... And then we can consider completing this graph^^.
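To make the weight cut-off idea concrete, here is a minimal sketch. The `Edge` type, the `filter_by_weight` name and its parameters are hypothetical and only illustrate the idea; this is not the actual paths query. It drops every edge below an absolute and/or relative floor, so edges with equal weight are always kept or dropped together.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str
    target: str
    weight: int  # number of people who went source -> target


def filter_by_weight(edges, total_people, min_abs=None, min_rel=None):
    """Keep edges whose weight clears every configured floor.

    min_abs: absolute floor, e.g. 25 people (assumed example value).
    min_rel: relative floor as a fraction of total_people, e.g. 0.01 for 1%.
    """
    floor = 0.0
    if min_abs is not None:
        floor = max(floor, min_abs)
    if min_rel is not None:
        floor = max(floor, min_rel * total_people)
    # Unlike a row limit, a weight floor never splits a group of tied edges.
    return [e for e in edges if e.weight >= floor]


edges = [Edge("a", "b", 30), Edge("a", "c", 30), Edge("b", "d", 5)]
kept = filter_by_weight(edges, total_people=1000, min_abs=25)
```

With a row limit of 1, only one of the two weight-30 edges would survive; with the floor, both survive and the weight-5 edge is dropped.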
Things turned a bit tricky. I'm exploring three different ways to solve this problem.

Complete Dangling Edges
Ensures that whatever edges make it to the cut-off, the remaining edges are added.
Pro: Paths don't look wrong.
Cons: Some very low weight edges show up, which can be hard to visualise / fill the graph with useless information.
Implementation notes: This makes things hard. Here's a failed attempt: https://metabase.posthog.net/question/150 that makes the wrong tradeoffs. We don't want to bound above the maximum edge weights. I suspect any solution that goes this way will significantly slow down our queries, since this requires some sort of graph traversal. Still trying to figure out if there's a better way to solve this.

Delete Dangling Edges
Ensures we validate edges before returning, so no dangling edges remain.
Pro: Paths don't look wrong.
Cons: Some high weight edges are removed, which might be carrying useful information.
Implementation notes: Relatively straightforward to do outside of SQL. And up to ~1000 edges, it has a negligible effect on performance.

Defer control to users
The crux between Solution (1) and (2) is the amount of information we show. Depending on the case, it can be useful to see the extra information, going further in depth, and in other cases it's better to get rid of all low weight nodes. More importantly, it's hard to get the visualisation just right a priori. This solution gives users these advanced manipulation options. There are two controls: (a) Control maximum number of edges. And
Note: The edge weight represents the number of people on that path.

Overall Solution
The overall solution I'm leaning towards right now is: based on the above calculations, provide more meaningful defaults. The 95%ile has edge weights ~10 for the more popular cases, which translates to ~100 edges. Make these the default (that's 5x more edges than right now; increase these further if steps go above 5).

This won't solve the graph looking weird in some cases, so delete dangling nodes (only when start or end point are defined), and tell the user that the graph is incomplete. And encourage using the advanced options I mentioned above for getting more in-depth information. I think I need to play with these advanced options myself to figure out whether there are heuristics we can find for even better default values.

It does make things a bit more complicated for users, but hopefully most users are happy with the default. I do think it's important to allow this customization so users can drill down and up the graph. It gives a new dimension: it allows not just manipulating the number of steps, but things like, "oh, I notice this specific segment of slightly unpopular paths (~200 edge weight) seems weird. Let me set edge weights between 100 and 300 to explore these more in depth, then find the specific people doing this, and see if I can figure out why they're doing things like this" etc. etc.

Whether it's worth doing is an open question, I guess. cc: @paolodamico @clarkus @marcushyett-ph @EDsCODE @liyiy for more input :)
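As an illustration of the "Delete Dangling Edges" option done outside of SQL, here is a rough sketch. It assumes edges arrive as `(source, target, weight)` tuples and that the start points are known; the function name and the data shape are hypothetical, not the code that ended up in PostHog. It walks the graph from the start points and keeps only edges whose source is reachable.

```python
from collections import defaultdict, deque


def drop_dangling_edges(edges, start_nodes):
    """Return the edges reachable from start_nodes, in their original order."""
    outgoing = defaultdict(list)
    for source, target, _weight in edges:
        outgoing[source].append(target)

    # Breadth-first traversal from the start points.
    reachable = set(start_nodes)
    queue = deque(reachable)
    while queue:
        node = queue.popleft()
        for nxt in outgoing[node]:
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)

    # An edge is dangling if nothing that survived the cut-off leads to its source.
    return [edge for edge in edges if edge[0] in reachable]


edges = [
    ("$pageview", "insight viewed", 50),
    ("insight viewed", "viewed dashboard", 40),
    ("signup", "viewed dashboard", 90),  # hypothetical edge with no path from the start
]
complete = drop_dangling_edges(edges, ["$pageview"])
```

The traversal visits each node and edge once, which fits the observation that the cost is negligible for graphs of around 1000 edges. When an end point is defined instead, the same walk could run over reversed edges.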
I'll let @paolodamico and @clarkus chime in as they have the most context. But generally, providing the best defaults we can sounds like a good approach to me. I have a question specifically about the terminology used (edge weight, etc.): do we have a more user-friendly term in mind to describe this? It feels pretty technical and might be hard for users to adopt.
Definitely. It's the "count of users on a path". So, min edge weight is something like: "Minimum count of users on a Path".
Hey @neilkakkar! In general, 100% agree with the overall approach of sensible defaults and advanced customization. Some questions,
Dangling edges: Link
Leaving them alone makes the graph look wrong. This is the same problem as in the link in the original issue: the start point is gone, these are intermediate edges, and thus dangling. Deleting dangling edges means getting rid of these in the final visualisation.
Edge weight is indeed the count of users on a single path item (I'm not yet sure of the right terminology to use, judging by the confusion exposed on the PR). It doesn't mean the number of people on the entire path, but the number between any two consecutive Path items.
Good question. We could remove some controls, but removing any of these feels incomplete to me. Since: (a) No. of edges controls how dense the graph gets. (b) Min-Max controls what kind of edges show up.
Say you're interested in where people drop off, and say it's a very successful product: most people convert (or vice versa; the case is identical). Most path items on the happy path then have a high weight, and these are the ones you don't want to see, since they are noise. Setting a max weight effectively removes all of them and helps you visualise where the dropoffs really go. Something similar can be achieved by excluding the popular events, but it's not the same, since you want to know if these "dropoffs" take some other route to the popular events. (Max weight would remove the popular paths, but not the small weight traversals to the popular items.) It made sense to me, but it's 100% an advanced use case, and not very obvious. But since these are advanced features anyway....
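A small sketch of how the min/max weight and edge-count controls could combine for the dropoff case described above. The `select_edges` helper, its parameters, and the `settings` event are made up for illustration; edges are assumed to be `(source, target, weight)` tuples.

```python
def select_edges(edges, min_weight=None, max_weight=None, edge_limit=None):
    """Keep edges inside the weight band, heaviest first, up to edge_limit."""
    in_band = [
        edge
        for edge in edges
        if (min_weight is None or edge[2] >= min_weight)
        and (max_weight is None or edge[2] <= max_weight)
    ]
    in_band.sort(key=lambda edge: edge[2], reverse=True)
    return in_band if edge_limit is None else in_band[:edge_limit]


edges = [
    ("$pageview", "insight viewed", 900),         # happy path, noise for this analysis
    ("insight viewed", "viewed dashboard", 850),  # happy path
    ("insight viewed", "settings", 40),           # dropoff (hypothetical event)
    ("settings", "viewed dashboard", 35),         # low-weight route back to a popular item
]
dropoffs = select_edges(edges, max_weight=100)
```

Here `dropoffs` holds only the two low-weight edges, including the one that leads back to the popular `viewed dashboard` item. Excluding the popular events outright would have hidden that second edge as well.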
As in, don't show any advanced options at all? Or just show them populated with defaults? The latter makes sense to me. The former not so much, because then users wouldn't know how to control these advanced options at all. 100% agreed on getting real data points.
The weight concept is new to me, so I'm catching up a bit on this. It seems like this might be the core reason a user would want to adjust weight for a paths insight:
Setting a maximum weight can optimize for analyzing dropoffs. Is the converse true for minimum weights? If so, that might be a good way to communicate the value of the feature to users. I think defaults make a ton of sense for this, but maybe there's some easy mode where the user just selects an "optimize for dropoffs" control or something similar?
If you want to play around with edge weights, they're behind the |
Validation is done, so I'll close this.
This issue has 1909 words. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:
Bug description
Please describe.
When calculating paths, we aggregate a lot of node and link data, often much more than would be reasonable to display. We apply a simple limit in our queries to cap the number of links we account for. As a result, we can sometimes cut off links that other links depend on. For example, there might be links that go $pageview -> insight viewed -> viewed dashboard, but our limit cuts off the data for $pageview -> insight viewed.
If this affects the front-end, screenshots would be of great help.
Expected behavior
The limited data we return should be complete so that paths aren't stranded.
How to reproduce
Internal graph link here
Notice how a start point is defined, but there are start points on the visualization that are unrelated to it. The Sankey is rendering stranded links that start at the 2nd or 3rd step but don't have a 1st step.
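The effect can be reproduced on toy data. The sketch below is only an assumption about the shape of the problem: links are `(source, target, count)` tuples with made-up counts, and the limit is a plain top-N by count rather than the real query. It flags links whose source is neither the start point nor the target of another returned link.

```python
def stranded_links(links, start_points):
    """Links whose source is neither a start point nor fed by another link.

    This is a one-hop check: it flags the head of a stranded chain,
    not every link further down that chain.
    """
    targets = {target for _source, target, _count in links}
    allowed = set(start_points) | targets
    return [link for link in links if link[0] not in allowed]


all_links = [
    ("$pageview", "insight viewed", 10),
    ("insight viewed", "viewed dashboard", 40),
    ("$pageview", "viewed dashboard", 25),
]
# A simple limit keeps the two heaviest links and cuts $pageview -> insight viewed.
limited = sorted(all_links, key=lambda link: link[2], reverse=True)[:2]
stranded = stranded_links(limited, ["$pageview"])
```

After the limit, `insight viewed -> viewed dashboard` is still returned but nothing leads into `insight viewed`, so it is reported as stranded. On the unlimited data the same check returns an empty list.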