Quantitative analysis - Diagnosing Causes #5543
Happy to, but @EDsCODE might actually be better than I at this from a "how do we build the queries" standpoint. No strong preference from my side who owns this. :) Also @marcushyett-ph might have real good input here.
Definitely! Though I think it's worth clarifying that ideally we want everyone on the team sharing thoughts and feedback; it's just that we've learned it's worth having an engineer (or engineers) owning this, to make sure we have constant active participation and to reflect that this is something that will require a time investment like any other engineering task (more on product-internal#121).
@EDsCODE I think it'd be awesome if you lead this from the eng POV - I think there are probably two components...
Trying to think this through from first principles: Let's start with our steps to diagnose causes:
Why does this make sense? Say you have a million users. It's unreasonable to check what every user did to figure out if something is wrong. Thus, we aggregate all this data into a representation that makes sense: funnels. The funnel is an easy-to-understand metric of how those million users are behaving. However, our goal is to figure out what's going wrong with specific users. This implies de-aggregating. A crude way we manage this right now is via breakdowns: they slice the aggregated funnel into smaller aggregates to figure out if one of those parts is going terribly wrong, so you can then reduce your search-for-specific-users down to that pool. There's this cycle of aggregation and de-aggregation to figure out a tractable subset, in which you can then, say, look at session recordings (a.k.a. step 3).

Thus, the goal for quantitative analysis is to quickly find the smallest set of people that have an issue in common (starting from an aggregated set).

Now, the question becomes: how do we quickly find the smallest set of people? I see two parts to this:

1. Diagnoser driven analysis
2. Backend driven analysis
2.1 Event driven analysis (name TBD)
2.2 Smart analysis (name TBD)

Solving this issue then involves (1) figuring out which of these make the most sense (or all), (2) figuring out different means of implementing them, (3) prioritising, and (4) winning. And this doesn't even begin to answer the problem of quickly figuring this out. More thoughts appreciated! Will continue thinking over this :)

Edit: changed the definition from "the goal for quantitative analysis is to quickly find the biggest-smallest set of people that have an issue in common", which was more confusing. The idea basically means max-min from math: the smallest sets, but sorted by size such that the biggest comes first. A hint for "quickly" comes from the biggest-smallest part of the definition: the bigger the slices, the more people the problem affects, the more important it is to solve, and the quicker we find these, the better.
Thanks for sharing this @neilkakkar, it makes a ton of sense to me. Although I'm not sure I fully understand the biggest-smallest point - could you explain it again?
Yeah! (My definition here is a bit sloppy, hence I removed it from consideration; it was a minor point anyway. Need to work towards improving this.) A good place to start is the minimax algorithm. Basically, what the algorithm does is minimise the possible values of the maximum loss. We want to do the same, but in reverse: maximise the set of people with the smallest set of common issues.

The key to the analogy is that it's hard for any slicing/grouping to have all users that face the same problem. When you, say, break down by browser, it may just be a particular version of the browser that's problematic. The smallest set just means that it's the best-effort minimisation of the set. Once you have a best-effort minimisation, you can be sure that you don't have a big "cushion" that unnecessarily increases the size of the slice (like having two browser values in the breakdown instead of one, when you know it only happens in a specific version of one browser). So, now you can easily take the max slice size over these, without worrying about getting wrong results. And the reason you do this is to find the "biggest issues" first (i.e. issues that, in expectation, affect the most users, and thus are the most important to find quickly / fix) - the low-hanging fruit, the issues that will drive the most value. How exactly this translates into features is still unclear to me, but given the above framework, it makes sense (to me) to keep in mind.

Edit: The reason why I felt it necessary to include this biggest-smallest point as at least an end note is that without it, we're very prone to overfitting to the smallest common sets - in the extreme, a set of 1 user who dropped off. It's hard to say if that's a problem we can fix when it gets so small, without the context of seeing other users who also faced it.
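One way to write this max-min idea down as a sketch (the notation here is purely illustrative, not from the discussion):

```latex
% S_min = the candidate slices after best-effort minimisation,
% i.e. each slice is as narrow as possible while still capturing one common issue.
% We then surface slices biggest-first:
s^{*} \;=\; \arg\max_{s \,\in\, S_{\min}} |s|
```

In words: first shrink each candidate slice as far as the data allows, then rank the survivors by size so the issues affecting the most users show up first.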
Thanks for your thoughts here @neilkakkar, really helpful to frame how we start thinking about this. Here's my perspective:
Thoughts?
@neilkakkar makes a ton of sense. There will always be trivial examples where one or two users did something weird - and that's not a pattern; we want to filter out the noise and see meaningful patterns that can be acted on. @paolodamico really aligned with this... especially "...the game changer here will be us finding the relevant slices automatically". I think it makes most sense to give a very small (manageable) number of insights which we have confidence are meaningful, as opposed to sharing every possible correlation. And if we don't have any meaningful insights - I think that's also totally okay.
The discussion here is great, yet rather high level. What are specific and actionable steps we can take towards nailing quantitative analysis? For me, the term "quantitative analysis" can contain anything from a) let's implement a property breakdown like Sentry does (screenshot below) and call it a day... to b) "ML all the things!". For example, how can we determine what is "representative data" and what is not? Can we even determine this?
@mariusandra agree here - super keen that we dive into specifics and come up with a starting point we can iterate further on... There are also some tangible and practical ideas in this issue here:
Completely agree! Think it's a matter of having an owner for this who can drive this project forward. @EDsCODE mentioned in the standup today that he's already spread pretty thin with Paths. @neilkakkar looks like you've shown a ton of interest in this, perhaps you'd like to own this? Or @macobo?
What do you mean by general and specific slices here?
Since I haven't owned a project like this before, I'd love to! (and I'd probably need lots of help from everyone :) ) We've (Core Analytics) been focusing on Paths, which I think is a subset of this. Is the goal here to figure out what to build for Diagnosing causes, and then build it in the next sprint?
Perfect! So @neilkakkar you can be the owner here. @clarkus, @marcushyett-ph & I can work tightly with you during this process to help you in whatever you need, and I'm sure we'll get input from other engineers too. Well, the goal more specifically is to figure out just the quantitative analysis component of Diagnosing Causes (Step 2 from https://github.com/PostHog/product-internal/issues/127#issuecomment-896231144); by "figure out" I mean come up with a spec'd-out solution that's ready to implement. Not sure if we'll necessarily work on this next sprint (paths might extend, we might need to prioritize some enterprise features), but I think ideally we would aim to be ready for next sprint, as intuitively this would be the next step in Diagnosing Causes. Re @neilkakkar,
A general slice I envision as something like "Chrome users", "Users from Canada", "Users coming from google.com" (one-level cuts). Specific slices would then be cross-cuts across multiple dimensions (e.g. "Chrome users from Canada who came from google.com").
@neilkakkar ready and waiting to support you in turning this from something a bit ambiguous into something really tractable - let me know if you'd like any time to chat and share context during London time.
Heya! Spent some time thinking about this, and had a chat with Marcus which helped cut down scope. Looking for feedback & pushback on these ideas: I think two broad features fit well into Quantitative Analysis: TL;DR
Tool to aid user driven exploration

Here's a specific example of this: I have a funnel I'm looking at, and it turns out that on Android Mobile, people give up a lot quicker than the rest. This is buried in the table, and like Paolo rightly said, it's hard for users to find representative samples by exploring on their own. A part of the problem is us: we don't have tools to help them figure this out. We have all the data on this breakdown already, so we could have run, say, a standard deviation calculation, and then highlighted all the rows with a 2-sigma deviation. It's a small, clean-cut feature we can build to make it easier for users to find aberrations, both good and bad.

Tools for automatic exploration

On the other side are tools we can build to surface the representative slices ourselves. An MVP I'm thinking of: Correlation Analysis. The assumption is that correlation analysis gives us decent representative slices. More concretely, what this looks like is: you're on a funnel, you reach a step, and you want to know what makes these users successful vs those who were unsuccessful. You click on a funnel step, and it gives you the option to show correlations. (Good copy, clearly explaining what the numbers mean, would be imperative here. I'm thinking maybe we don't show the correlation coefficient itself, but how much more often the dropped-off users have this property compared to the successful ones, and vice versa. Unsure though; maybe people are well versed enough in stats to just see the correlation coefficient?)

There are only 2 pieces of data we have for these users: what events they did before success/failure, and what properties these users have. So, we run a correlation analysis on both of these things: given a pool of successful & unsuccessful users, find the properties that are more often found in the successful users than the unsuccessful ones (and vice versa). Same for events. We then surface the "best" of these. Choosing the "best" requires some experimentation from my side. An example: I think we should discard anything above a 0.9 correlation - if it's too correlated, it's usually not very useful (like the pageleave event being highly inversely correlated with dropoffs. Yeah, duh, if you don't leave the page you can't drop off). A rough sketch of this follows at the end of this comment.

What the full-fledged version looks like

More broadly, this second feature is a more specific version of: Exploring a Cohort of Users.
I think we should, at a later date, allow users to explore and analyse a single cohort as well - most common breakdown values on a cohort, the events most popular in the cohort, etc. (probably right in the funnel step as well - instead of comparing success and failure, just exploring, say, the success cohort to corroborate what the correlation analysis gave us). This then feeds back into tool no. 1, where we highlight the aberrations. Of course, this means you can also compare any two cohorts and find correlations over events / properties for people in these cohorts. I'm opting not to build this right now, though, because specific to diagnosing causes, comparing two pools of users to figure out what's going wrong, all baked into the funnel, is more useful.

Next Steps

For me, before I myself am convinced this is something good to build, I want to
For you: thoughts on both these features?
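Here's a minimal sketch of how both tools described above could work, assuming we already have per-breakdown-value conversion rates (Tool 1) and, for each user, their person properties plus a converted/dropped flag (Tool 2). Everything here is illustrative - the function names, the +1/+2 smoothing, and the "dominance cutoff" standing in for the 0.9-correlation filter - and not the actual PostHog implementation:

```python
import math
from statistics import mean, stdev

def flag_breakdown_outliers(conversion_by_value, sigma=2.0):
    """Tool 1 sketch: highlight breakdown rows whose conversion rate deviates
    more than `sigma` standard deviations from the mean across all rows."""
    rates = list(conversion_by_value.values())
    if len(rates) < 2:
        return {}
    mu, sd = mean(rates), stdev(rates)
    if sd == 0:
        return {}
    return {
        value: rate
        for value, rate in conversion_by_value.items()
        if abs(rate - mu) > sigma * sd
    }

def rank_property_signals(users, dominance_cutoff=0.9):
    """Tool 2 sketch: for each person property, compare how often it appears
    among converted vs dropped-off users and rank by an odds-ratio-like score.
    `users` is a list of (properties: set[str], converted: bool) pairs."""
    converted = [props for props, ok in users if ok]
    dropped = [props for props, ok in users if not ok]
    all_props = set().union(*(props for props, _ in users)) if users else set()
    signals = []
    for prop in all_props:
        # Share of each pool that has the property, with +1/+2 smoothing so a
        # property seen in only one pool doesn't cause a division by zero.
        p_success = (sum(prop in p for p in converted) + 1) / (len(converted) + 2)
        p_failure = (sum(prop in p for p in dropped) + 1) / (len(dropped) + 2)
        odds_ratio = (p_success / (1 - p_success)) / (p_failure / (1 - p_failure))
        # Crude stand-in for "discard anything above a 0.9 correlation": skip
        # properties near-universal in one pool and near-absent in the other
        # (e.g. pageleave vs. drop-off).
        if max(p_success, p_failure) > dominance_cutoff and min(p_success, p_failure) < 1 - dominance_cutoff:
            continue
        signals.append((prop, odds_ratio))
    # Strongest signals first, in either direction (odds ratio far from 1).
    return sorted(signals, key=lambda s: abs(math.log(s[1])), reverse=True)
```

The same ranking could apply to "user did event X before the step" by treating each event name as a pseudo-property of the user.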
What you're describing here makes sense, but I don't think it would be successful. While "breakdowns" make sense for analyzing from a conceptual standpoint, hiding this feature within another just makes it undiscoverable and inflexible. Why inflexible? "User/event has prop" is not the only potential signal. It can also be "user did X". Is there a way to expose the same flow in a more clear way? Yes: don't present it as "breakdown". This is what FullStory does: https://help.fullstory.com/hc/en-us/articles/360034527793-Conversions-Introduction You have a separate flow after funnels, where you see potential signals (including some statistical analysis to rank them).
Interesting - apologies I wasn't very clear about this. In my opinion the tool to aid user driven exploration can't stand on its own. It needs some underlying data on which to highlight statistical deviations. And I'm proposing highlighting for both: "user/event has prop" via breakdowns, and "user did X" via the Cohort Exploration (the full-fledged version of feature 2). Making it into a feature of its own is interesting - combining both breakdowns and "user did X" together, and separating it from the funnel view - I didn't think about that.
This makes a lot of sense to me and I agree with the next steps.
We should probably play around with a few rules here to make sure we find the optimum precision and recall for "actionable causes". I worry the best, most obvious thing might sometimes get filtered out with a rule like this - but it probably makes sense to take a bunch of examples and tweak the variables. To help with your next steps, a few common scenarios come to mind if you'd like to set up some test funnels for your manual queries etc.
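For concreteness, one way to read precision and recall in this context (my framing, not from the comment above):

```latex
\text{precision} = \frac{|\{\text{surfaced signals that are actionable causes}\}|}{|\{\text{surfaced signals}\}|},
\qquad
\text{recall} = \frac{|\{\text{actionable causes that get surfaced}\}|}{|\{\text{actionable causes that exist}\}|}
```

A stricter filter (like the 0.9 cutoff) raises precision but risks dropping recall, which is the trade-off being described here.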
Next Steps
High-level feedback
Tactical feedback
@paolodamico sounds good. I am going to wrap up the work for paths and then I will move on to this next.
Task list for next sprint: (ref: @EDsCODE @liyiy @hazzadous for next sprint planning)
Stretch goals (also depend on user interview results)
Other things to consider depending on user feedback
Catching up here - do we have any scenarios or content cases that are illustrative of this flow? I'm specifically thinking about the cases outlined above, but with real-ish values we can plug in for user testing. I am working through a few component ideas for how we can succinctly highlight significant items, as well as a component for a long-form listing of significant events and properties.
For Tool 1: I think any funnel breakdown would be as real as it gets? (like this one) For Correlation Analysis: good point, I'll try to get some popular funnels other orgs use, run correlation on them, and get you the values. For user testing, how hard is it to replace values? I was thinking of replacing them with the actual values for each user test we do.
It's not difficult to replace values, just time consuming. If you can export as JSON I think I can use that as a value source in Figma... I'll look into that. Either way, it's totally possible. Depending on scheduling, it might be hard to get everything updated before each interview.
Thanks @neilkakkar!
Cue posthog-bot
This issue has 7299 words. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:
I've done that. Will continue to do so :). I think we can be a bit more aggressive with the timeline. I want to sort out all live user interviews by the end of this sprint, so we know exactly what we need to focus on for next sprint. We'll have the beta MVP behind a FF (feature flag) by the day after, so internal testing can begin by Friday morning, and we can do live usability tests, also behind the FF, next week. Basically, once we have interviews scheduled, we can put them on the FF right around the interview, get them to share their screen and talk through what they see, like, use, etc. Thoughts? It's a beta MVP, so it will be rough around the edges, but I think how pretty the UI is isn't a defining feature for this product, given there are no fancy controls and what's interesting is the data table itself.
Excited about that! So would we be able to have live usability tests using users' real data? I can schedule some tests for next Mon - Wed, just need your confirmation.
@clarkus some new design considerations based on recent experimentation:
We should facilitate input using the same taxonomic picker we use for filters, just scoped to person properties. Since we can specify multiple properties, we should use some form of multi-select. Is there an upper limit on the number of properties you can specify? Lower limit? Are cohorts applicable here since we're dealing with persons? The questions around scale will determine how we collect, summarize, and manage selected items. Here are some ideas for multi-selects that summarize selections across different scales.
I very much think we should be transparent here. We should aim to communicate when the funnel is in such a state that correlations aren't meaningful. Secondary to that, if there's any means for the user to recover from this state, let's make that a simple thing to do. In your example, 10k dropoffs to 100 conversions, we could indicate that the funnel isn't meaningful and prompt the user to adjust conversion windows or something else to adjust the results. It's very much a content design task. Another option would be to collapse or otherwise hide the correlation tables behind a prompt that says "hey these results might not be meaningful because the funnel is skewed... are you sure you want to see these?" and an action to show the tables. Not a fan of automagic™, but appreciate you bringing it up as an option.
Preliminary tests suggest no. I'll try working this into the MVP to look at real-life perf considerations, but for now, we can assume that a person can select all person properties. So there's no lower limit, nor any upper limit. Cohorts aren't applicable here. What do you mean by different scales? Also, another point worth noting: you can only select the property name, not the property values. That is, the total options available to select are literally all you see in the taxonomic filter for person properties: the names of the properties.

I agree with being transparent. Hmm, I was thinking about this a bit more over the weekend, and it's not that the correlations are irrelevant, but more that the quality of the input data determines the quality of the output. On the backend, the rule of thumb I've settled on right now is 1:10: if the success-to-failure ratio (or vice versa) is more extreme than 1:10, then the funnel is skewed. On the frontend, maybe we can say something like: "The number of successes (or failures) is too small to give confident signals. We recommend the ratio of successes to failures not falling below 1:10." And an optional extra advanced explanation: otherwise, even a small success signal gets magnified 10 times, since the sample size is smaller.
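As a rough sketch of that rule of thumb (the function name and threshold are illustrative, not necessarily what ships):

```python
def is_skewed(successes: int, failures: int, max_ratio: float = 10.0) -> bool:
    """Rule-of-thumb sketch: treat the funnel step as skewed when one side is
    more than `max_ratio` times the other, i.e. the ratio is beyond 1:10."""
    if successes == 0 or failures == 0:
        return True
    return max(successes, failures) / min(successes, failures) > max_ratio

# e.g. is_skewed(100, 10_000) -> True, is_skewed(900, 4_500) -> False
```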
Scale means the volume of items we're managing. Knowing that there isn't an upper or lower bound answers my questions. 👍
This is very true. I think our goal should be guiding the user as much as we can to provide enough quality input in order to output something valuable. Saying why the funnel is skewed is good, but the very next thing the user will want is information on how to recover. Short of that, some simpler content that communicates why things aren't great... @paolodamico do you have any thoughts on this?
Makes sense. As an example, @marcushyett-ph is notorious for choosing highly skewed funnels 😋. Way too many pageviews, not enough of (second event). But for either of those cases^, it's hard to judge how to recover. We need to either (1) be more specific with the first event, like choosing a specific page, or (2) choose another event that doesn't happen too often. At the very least, doing either of these should normalise the odds ratio, returning better signals. (Side note: the >100x odds ratios are directly because of this. Too few successes, so each success carries a lot of weight.)
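A worked example of why too few successes inflate the odds ratio (the numbers are invented for illustration):

```latex
\mathrm{OR}(P) = \frac{p_s/(1-p_s)}{p_f/(1-p_f)},
\qquad p_s = \Pr[\text{has } P \mid \text{converted}],\quad
p_f = \Pr[\text{has } P \mid \text{dropped off}]

% Say there are 10{,}000 drop-offs but only 10 conversions.
% If property P shows up in 9 of the 10 conversions and in 5\% of drop-offs:
\mathrm{OR} = \frac{0.9/0.1}{0.05/0.95} \approx 171

% With one conversion fewer carrying P (8 of 10), it drops by more than half:
\mathrm{OR} = \frac{0.8/0.2}{0.05/0.95} = 76
```

So with only a handful of conversions, a single user can more than double the odds ratio, which is roughly how those >100x values appear without meaning much.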
This is great. I can work with this to try to come up with something more actionable for users encountering this. Thanks!
Cool! :) Note that it can be the reverse as well: say, 10,000 first events and 9,990 successful conversions (so only 10 dropoffs) - that's equivalent to the above case, just with the recommendations reversed.
While working on paths, I came up with a workflow for an inclusive picker. I took that concept and applied it to person properties: https://www.figma.com/file/gQBj9YnNgD8YW4nBwCVLZf/PostHog-App?node-id=4160%3A26978

By default, we'd select everything. From there a user can deselect what they don't want, or deselect everything and start from a blank slate. It's not shown here, but the control summary in the properties table would summarize the property count (4 / 16 person properties).
Very interesting on the funnel skewing and how to fix it! I do agree that if we can provide more actionable guidance, users will get more insights. I propose we move this conversation, plus the one on how to surface confidence intervals (or other statsig metrics), to a separate issue and discuss for next sprint.
We now fully support correlation analysis.
Opening this issue to start the discussion around quantitative analysis (Step 2) for the conversion optimization process (more context on https://github.com/PostHog/product-internal/issues/127#issuecomment-896231144).
@macobo I think you'd be interested in being actively involved in working out the solution for this? Would be great to have an engineer from the Core Experience Team actively involved too (@liyiy @alexkim205). @clarkus too.
How do you want to start work on this? I'm thinking we could start a discussion from first principles and some research/benchmarking. Probably easier to start this in a Google doc instead to have more fluidity?