Quantitative analysis - Diagnosing Causes #5543

Closed
paolodamico opened this issue Aug 11, 2021 · 61 comments

@paolodamico
Contributor

Opening this issue to start the discussion around quantitative analysis (Step 2) for the conversion optimization process (more context on https://github.com/PostHog/product-internal/issues/127#issuecomment-896231144).

@macobo I think you'd be interested in being actively involved in working out the solution for this? Would be great to also have an engineer from Core Experience Team actively involved too (@liyiy @alexkim205). @clarkus too.

How do you want to start work on this? I'm thinking we could start a discussion from first principles and some research/benchmarking. Probably easier to start this in a Google doc instead to have more fluidity?

@macobo
Contributor

macobo commented Aug 11, 2021

@macobo I think you'd be interested in being actively involved in working out the solution for this?

Happy to, but @EDsCODE might actually be better than I at this from a "how do we build the queries" standpoint. No strong preference from my side who owns this. :) Also @marcushyett-ph might have real good input here.

@paolodamico
Contributor Author

Definitely! Though I think it's worth clarifying that ideally we want everyone in the team sharing thoughts and feedback, it's just that we've learned it's worth having an engineer(s) owning this to make sure we have constant active participation and reflecting that this is something that will require a time investment as any other engineering task would (more on product-internal#121).

@marcushyett-ph
Contributor

@EDsCODE I think it'd be awesome if you lead this from the eng POV - I think there's probably two components...

  1. How do we build an intuitive experience to interpret this kind of correlated data (@clarkus likely best to lead this)
  2. What are the technical limitations of implementing this type of feature and what is the best approach to give people insights about stat-sig correlations, filtering out the noise but building something that can return results in seconds (@EDsCODE would be awesome if you could lead this)

@neilkakkar
Collaborator

neilkakkar commented Aug 12, 2021

Trying to think this through from first principles:

Let's start with our steps to diagnose causes:

  1. Constructing a funnel (which we have now nailed), seeing something seems off
  2. Diagnosing the cause quantitatively - focus of this sprint
  3. Diagnosing the cause qualitatively given something seems off in step 2

Why does this make sense?
Aggregations.

Say you have a million users. It's unreasonable to go check what every user did to figure out if something is wrong. Thus, we aggregate all this data into a representation that makes sense: funnels. The funnel is an easy to understand metric of how those million users are behaving.

However, our goal is to figure out what's going wrong with specific users. This implies de-aggregating. A crude way we manage this right now is via breakdowns: it slices the aggregated funnel down into smaller aggregates to figure out if one of these parts is going terribly wrong, so you can then reduce your search-for-specific-users down to that pool.

There's this cycle of aggregation and de-aggregation to figure out a tractable subset, in which you can then, say, look at session recordings. (a.k.a step 3)

Thus, the goal for quantitative analysis is to quickly find the smallest set of people that have an issue in common. (starting from an aggregated set)

Now, the question becomes: How do we quickly find the smallest set of people?

I see two parts to this:

(1) Diagnoser driven analysis
This involves allowing the diagnoser (the person trying to figure out issues) to slice and dice the data however they want - use breakdowns, use ???? to figure out if some section is facing issues. (Showing users on path X, Y, Z in the middle of a funnel, where you a priori choose the path, is another way to go about this!)

(2) Backend driven analysis
Instead of the diagnoser driving what to check, it's us showing them different slices we think would be useful

(2.1) Event driven analysis (name TBD)
This involves following the users to show what they're doing. Basically, instead of the diagnoser telling us what to slice on, we re-aggregate based on what happened to these users. Paths: #5545 - fits in nicely into this usecase. (There's probably a lot more similar things we can do, like sankey-diagrams between funnel steps #5515 (comment) - this is different from Diagnoser driven analysis in the sense it aggregates and shows all* paths, for you to choose the one that looks suspicious)

(2.2) Smart analysis (name TBD)
This involves figuring out correlations ourselves: statistical analysis, using existing tools to figure out if we can automatically find a segment with issues. Or things like: given a funnel, find common properties among users who dropped off, do a PCA, and give the correlation matrix for the top N properties (across both events and persons) from the PCA. (ref: https://github.com/PostHog/product-internal/issues/127#issuecomment-896385082 )


Solving this issue then involves (1) figuring out which of these make the most sense (or all). (2) Figuring out different means of implementing it. (3) Prioritising. (4) Winning

And this doesn't even begin to answer the problem of quickly figuring this out. More thoughts appreciated! Will continue thinking over this :)

Edit: changed definition from: the goal for quantitative analysis is to quickly find the biggest - smallest set of people that have an issue in common. Was more confusing. The idea basically means max-min from math: the smallest sets, but sorted by size so that the biggest come first.

A hint for quickly comes from the biggest-smallest part of the definition. The bigger the slices, the more people the problem affects, the more important it is to solve, the quicker we find these, the better.

@marcushyett-ph
Contributor

Thanks for sharing this @neilkakkar makes a ton of sense to me.

Although I'm not sure I fully understand the biggest - smallest point - could you possibly explain again?

@neilkakkar
Collaborator

neilkakkar commented Aug 12, 2021

Yeah! (my definition here is a bit sloppy, hence removed it from consideration, it was a minor point anyway. Need to work towards improving this).

Good place to start is minimax algorithm. Basically, what the algorithm does is minimise the possible values of the maximum loss. We want to do the same, but in reverse: maximise the set of people with the smallest set of common issues.

The key to the analogy is that it's hard for any slicing/grouping to have all users that face the same problem. When you say, breakdown by browser, it may just be a particular version of the browser that's problematic. The smallest set just means that it's the best-effort minimisation of the set.

Once you have a best-effort minimization, you can be sure that you don't have a big "cushion" - that unnecessarily increases the size of the slice (like, having two browser values in the breakdown instead of one, when you know it only happens in a specific version of one browser).

So, now you can easily take the max slice size over these, without worrying about getting wrong results. And the reason you do this is to find the "biggest issues" first (i.e. issues that, in expectation, affect the most users, thus are the most important to find quickly / fix ) - or the low hanging fruit / the issues that will drive the most value.

How exactly this translates into the features is still unclear to me, but given the above framework, makes sense (to me) to keep in mind.

Edit: The reason why I felt it necessary to include this biggest-smallest point as at least an end note is because without this, we're very prone to overfitting to the smallest common sets: in the extreme, a set of 1 user who dropped off. Hard to say if it's a problem we can fix when it gets so small, without the context of seeing other users who also faced this.

@paolodamico
Contributor Author

Thanks for your thoughts here @neilkakkar, really helpful to frame how we start thinking about this. Here's my perspective:

  • Framing your approach another way: I think what we want is to find the most specific slices of representative data. The more specific, the better the signaling, but if they're not representative, then it can just be useless noisy data.
  • I think even general slices that have a large number of dropoffs can be quite useful. It can be an 80/20 problem: the more general problems will represent a larger number of users and can be a better starting point. Even if you don't get as clear a signal as you would get in specific slices, it can still help you reach good enough conclusions. After you exhaust this (or if you want to deep dive) you can use specific slices to fine tune.
  • To me, the game changer here will be us finding the relevant slices automatically. For users to find the relevant slices by themselves will require a ton of luck. Being able to slice the data can be helpful to validate hypotheses, but not great for exploration. You don't know what you don't know.
  • Just a clarification point from above, I think it's worth keeping in mind that the benefit of doing this analysis is not only that it makes analyzing millions of end-users tractable, but it is also paramount to avoid biases when doing analyses (even if you have just a couple of hundred of end-users).

Thoughts?

@marcushyett-ph
Contributor

@neilkakkar makes a ton of sense. There will always be trivial examples where one or two users did something weird - and that's not a pattern, we want to filter out the noise and see meaningful patterns that can be acted on.

@paolodamico really aligned with this... especially "...the game changer here will be us finding the relevant slices automatically".

I think it makes most sense to give a very small (manageable) number of insights which we have confidence are meaningful, as opposed to sharing every possible correlation. And if we don't have any meaningful insights - I think that's also totally okay.

@mariusandra
Collaborator

The discussion here is great, yet rather high level. What are specific and actionable steps we can take towards nailing quantitative analysis?

For me the term "quantitative analysis" can contain anything from a) let's implement a property breakdown like sentry does it (screenshot below) and call it a day... to b) "ML all the things!".

image

For example, how can we determine what is "representative data" and what is not? Can we even determine this?

@marcushyett-ph
Contributor

@mariusandra agree here - super keen that we dive into specifics and come up with a starting point we can iterate further on...

There are also some tangible and practical ideas in this issue here:
https://github.com/PostHog/product-internal/issues/92

@paolodamico
Contributor Author

Completely agree! Think it's a matter of having an owner for this that can drive this project forward. @EDsCODE mentioned in the standup today that he's already spread pretty thin with Paths. @neilkakkar looks like you've shown a ton of interest in this, perhaps you'd like to own this? Or @macobo?

@neilkakkar
Collaborator

I think even general slices that have a large number of dropoffs can be quite useful. It can be an 80/20 problem: the more general problems will represent a larger number of users and can be a better starting point. Even if you don't get as clear a signal as you would get in specific slices, it can still help you reach good enough conclusions. After you exhaust this (or if you want to deep dive) you can use specific slices to fine tune.

What do you mean by general and specific slices here?

@neilkakkar
Collaborator

Since I haven't owned a project like this before, I'd love to! (and I'd probably need lots of help from every one :) )

We've (Core Analytics) been focusing on Paths, which I think is a subset of this.

Is the goal here to figure out what to build for Diagnosing causes, and then build it in the next sprint?

@paolodamico
Contributor Author

paolodamico commented Aug 17, 2021

Perfect! So @neilkakkar you can be the owner here. @clarkus, @marcushyett-ph & can work tightly with you during this process to help you in whatever you need, and I'm sure we'll get input from other engineers too.

Well, the goal more specifically is to figure out just the quantitative analysis component of Diagnosing Causes (Step 2 from https://github.com/PostHog/product-internal/issues/127#issuecomment-896231144); by figure out I mean come up with a spec'd out solution that's ready to implement. Not sure if we'll necessarily work on this next sprint (paths might extend, we might need to prioritize some enterprise features), but I think ideally we would aim to be ready for next sprint, as intuitively this would be the next step in Diagnosing Causes.


Re @neilkakkar,

What do you mean by general and specific slices here?

A general slice I envision something like, "Chrome users", "Users from Canada", "Users coming from google.com" (one level cuts). Specific slices would then be cross cuts across multiple dimensions (e.g. "Chrome users from Canada who came from google.com").
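To make the general vs. specific distinction concrete, here is a minimal Python sketch that counts how many users fall into each slice at a given depth. The function name `slice_sizes`, the property names, and the four sample users are all made up for illustration; this is not how PostHog computes breakdowns.

```python
from collections import Counter
from itertools import combinations


def slice_sizes(users, dimensions, depth):
    """Count users in every slice that combines `depth` dimensions.

    depth=1 gives one-level cuts ("Chrome users"); a higher depth gives
    cross cuts ("Chrome users from Canada who came from google.com").
    """
    sizes = Counter()
    for dims in combinations(dimensions, depth):
        for user in users:
            # A slice is identified by its (dimension, value) pairs.
            key = tuple((dim, user[dim]) for dim in dims)
            sizes[key] += 1
    return sizes


# Hypothetical sample data, purely illustrative.
users = [
    {"browser": "Chrome", "country": "Canada", "referrer": "google.com"},
    {"browser": "Chrome", "country": "Canada", "referrer": "twitter.com"},
    {"browser": "Chrome", "country": "Germany", "referrer": "google.com"},
    {"browser": "Firefox", "country": "Canada", "referrer": "google.com"},
]
dimensions = ["browser", "country", "referrer"]

general = slice_sizes(users, dimensions, depth=1)
specific = slice_sizes(users, dimensions, depth=3)
print(general.most_common(3))
print(specific.most_common(3))
```

Even in this tiny sample the cross cuts shrink quickly (three Chrome users, but only one Chrome user from Canada via google.com), which is the representativeness concern raised earlier in the thread.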

@marcushyett-ph
Contributor

@neilkakkar ready and waiting to support you in turning this from something a bit ambiguous into something really tractable, let me know if you'd like any time to chat and share context during London time.

@clarkus
Contributor

clarkus commented Aug 19, 2021

Here to support as well. Just let me know what I can do to help.

and-my-axe

@neilkakkar
Collaborator

neilkakkar commented Aug 20, 2021

Heya! Spent some time thinking about this, and had a chat with Marcus which helped cut down scope. Looking for feedback & pushback on these ideas:

I think two broad features fit well into Quantitative Analysis:

TL;DR

  1. Improve breakdowns by highlighting which breakdown values are at a 2-sigma deviation (the aberrations)
  2. Do correlation analysis on funnel steps to find signals like: Users who did X convert better, and Users who use Chrome convert better.

Tool to aid user driven exploration

Here's a specific example of this:

image

I have a funnel I'm looking at, and it turns out on Android Mobile, people give up a lot quicker than the rest:

image

The link

This is buried in the table, and like Paolo rightly said, it's hard to find representative samples by exploring themselves. A part of the problem is us: We don't have tools to help them figure this out.

We have all the data on this breakdown already, so we could have run, say, a std deviation calculation, and then highlighted all the rows with 2 sigma deviation.

It's a small clean-cut feature we can build to help make it easier for users to find aberrations, both good and bad.
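A rough sketch of what that highlighting calculation could look like, in Python. It assumes we already have one conversion rate per breakdown value; the function name `flag_two_sigma` and the sample rates are hypothetical, and the mean here is unweighted (it ignores how many users each breakdown value has), which a real implementation may want to revisit.

```python
from statistics import mean, pstdev


def flag_two_sigma(conversion_by_value, n_sigma=2.0):
    """Return breakdown values whose conversion rate is more than
    `n_sigma` standard deviations away from the mean of all values,
    mapped to their z-score (negative = worse than average)."""
    rates = list(conversion_by_value.values())
    if len(rates) < 2:
        return {}
    mu = mean(rates)
    sigma = pstdev(rates)  # population std dev over the breakdown values
    if sigma == 0:
        return {}  # every value converts identically, nothing stands out
    return {
        value: (rate - mu) / sigma
        for value, rate in conversion_by_value.items()
        if abs(rate - mu) > n_sigma * sigma
    }


# Made-up conversion rates per breakdown value.
rates = {
    "Chrome": 0.41,
    "Firefox": 0.39,
    "Safari": 0.40,
    "Edge": 0.42,
    "Opera": 0.38,
    "Android Mobile": 0.10,
}
flagged = flag_two_sigma(rates)
print(flagged)
```

With these made-up numbers only "Android Mobile" is flagged, with a negative z-score. One caveat of this simple approach: with only a handful of breakdown values, a single outlier inflates the standard deviation enough that it may not cross the 2 sigma line at all.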

Tools for automatic exploration

On the other side are tools we can build to surface the representative slices ourselves. An MVP I'm thinking of: Correlation Analysis.

The assumption is that correlation analysis gives us decent representative slices.

More concretely, what this looks like is: You're on a funnel, you reach a step and you want to know what makes these users successful vs those who were unsuccessful. You click on a funnel step, and it gives you the option to show correlations (good copy, and clearly explaining what the numbers mean, would be imperative here. I'm thinking maybe we don't show the correlation coefficient here, but how much more often (in %) the dropped-off users have this property compared to the successful ones, and vice versa. Unsure though, maybe people are well versed enough in stats to just see the correlation coefficient?)

There are only 2 pieces of data we have for these users: what events they did before success/failure, and what properties these users have.

So, we run a correlation analysis on both these things: Given a pool of successful & unsuccessful users, find the properties that are more often found in the successful users than the unsuccessful ones (and vice versa). Same for events. We then surface the "best" of these.

Choosing the "best" requires some experimentation from my side. An example: I think we should discard anything above a 0.9 correlation - if it's too correlated, it's usually not very useful (like pageleave event being highly inversely correlated with dropoffs. Yeah, duh, if you don't leave the page you can't dropoff)
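As a sketch of one way this could be computed (an assumption, not the query that was eventually written): treat each event or property as a yes/no signal, build a 2x2 table against converted/dropped-off, and compute the phi coefficient (the Pearson correlation for two binary variables) plus an odds ratio. The function names, the sample counts, and the 0.5 cell correction are illustrative choices; the 0.9 cutoff is the one floated above.

```python
import math


def correlate(success_with, success_total, failure_with, failure_total):
    """Phi coefficient and odds ratio for one binary signal.

    Positive phi / odds ratio > 1: the signal is more common among
    users who converted than among users who dropped off.
    """
    a = success_with                  # converted, has the signal
    b = failure_with                  # dropped off, has the signal
    c = success_total - success_with  # converted, lacks the signal
    d = failure_total - failure_with  # dropped off, lacks the signal
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    phi = (a * d - b * c) / denom if denom else 0.0
    # Add 0.5 to every cell so an empty cell cannot cause a division by zero.
    odds_ratio = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
    return phi, odds_ratio


def rank_signals(counts, success_total, failure_total, max_abs_phi=0.9):
    """Rank signals by |phi|, discarding anything above the cutoff."""
    results = []
    for name, (success_with, failure_with) in counts.items():
        phi, odds = correlate(success_with, success_total, failure_with, failure_total)
        if abs(phi) > max_abs_phi:
            continue  # too correlated to be useful, e.g. pageleave vs dropoff
        results.append((name, phi, odds))
    return sorted(results, key=lambda r: abs(r[1]), reverse=True)


# Made-up counts: signal -> (converted users with it, dropped-off users with it)
counts = {
    "viewed tutorial": (600, 200),
    "uses Chrome": (500, 480),
    "$pageleave": (10, 990),
}
ranked = rank_signals(counts, success_total=1000, failure_total=1000)
for name, phi, odds in ranked:
    print(f"{name}: phi={phi:.2f}, odds ratio={odds:.2f}")
```

In this made-up example "$pageleave" is dropped by the cutoff, "viewed tutorial" ranks first, and "uses Chrome" barely registers. The odds ratio may be the friendlier number to show, since it maps onto the "x times more likely" phrasing.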

What the full-fledged version looks like

More broadly, this second feature is a more specific version of: Exploring a Cohort of Users.

As I mentioned earlier:

This implies that a necessary step to good "smart" analysis is a foundation that allows users to themselves explore and ask as many questions as they want. i.e. Diagnoser driven analysis opens up the path (pun intended) to better smart analysis.

I think we should, at a later date, allow users to explore and analyse a single cohort as well - most common breakdown values on a cohort, find events most popular in the cohort, etc. etc. (probably right in the funnel step as well - instead of comparing success and failure, just exploring, say, success Cohort to corroborate what the correlation analysis gave us). This then feeds back into tool no.1 - where we highlight the aberrations.

Of course, this means you can also compare any two cohorts and find correlation over events / properties for people in these cohorts.

I'm opting though not to make this right now, because specific to diagnosing causes, comparing two pools of users to figure out what's going wrong, all baked into the funnel, is more useful.


Next Steps

For me, before I myself am convinced this is something good to build, I want to

  1. Test the correlation analysis - write a query myself - and see if I get any good results
    1. Experiment for choosing the best cutoff values
  2. Test the performance impact
    1. We probably can't run correlation analysis on all the events / all the properties. So how do we choose the right subset?
      I think tool no.1 plays a good role here - it helps us figure out what properties lead to discovering aberrations, but we need to be collecting data there to figure out the properties.

For you: thoughts on both these features?

@macobo
Contributor

macobo commented Aug 20, 2021

Tool to aid user driven exploration

What you're describing here makes sense but I don't think it would be successful. While "breakdowns" make sense for analyzing from a conceptual standpoint, hiding this feature within another just makes it undiscoverable and inflexible

Why inflexible? "User/event has prop" is not the only potential signal. It can also be "user did X".

Is there a way to expose the same flow in a more clear way? Yes, don't present it as "breakdown".

This is what fullstory does: https://help.fullstory.com/hc/en-us/articles/360034527793-Conversions-Introduction You have a separate flow after funnels, where you see potential signals (including some statistical analysis to rank them)

@neilkakkar
Collaborator

Interesting, apologies I wasn't very clear about this:

In my opinion the tool to aid user driven exploration can't stand on its own. It needs some underlying data on which to highlight statistical deviations. And I'm proposing highlighting for both, "user/event has prop" - via breakdowns; and for "user did X" via the Cohort Exploration (full fledged version of feature 2)

Making it into a feature of its own is interesting - combining both breakdowns and user did X together, and separating from the funnel view - didn't think about that.

@marcushyett-ph
Contributor

marcushyett-ph commented Aug 20, 2021

This makes a lot of sense to me and agree with the next steps.

I think we should discard anything above a 0.9 correlation

We should probably play around with a few rules here to make sure we find the optimum precision and recall for "actionable causes". I worry the best ever most obvious thing might get filtered out sometimes with a rule like this - but probably makes sense to take a bunch of examples and tweak the variables.

To help with your next steps, a few common scenarios come to mind if you'd like to set up some test funnels for your manual queries etc.

  • Signup to insight viewed funnel (I would expect a high correlation with team first event ingested)
  • Mobile users are probably less likely to be successful in every flow
  • Would expect a high correlation with insight funnel viewed and user viewed tutorials

@paolodamico
Contributor Author

paolodamico commented Aug 20, 2021

Next Steps

  • In terms of next steps, aside from testing these things out I would suggest we whip up some quick mockups and put them in front of users to get some early feedback to make sure we're going in the right direction. CC @clarkus

High-level feedback

  • Two sigma analysis seems quite powerful and straightforward, i.e. a great starting place. However, I do think that ideally we would also surface relevant breakdowns to look at. Problem is I don't know if Browser is going to be a significant slice to look at vs. OS vs. any other given property.

Tactical feedback

  • Great point on not showing the actual value of r, I think we could get some user input on this (perhaps it's an advanced option you can enable).
  • One important consideration on doing correlation on user properties is that we don't store user properties over time, so these analyses will reflect whatever the user property is set to at the time they're run. This can give misleading results; is there a way we can address this?

@clarkus
Contributor

clarkus commented Sep 7, 2021

Want to do a quick follow-up here @clarkus, would be awesome if we could run some usability tests this week / early next week too so we're ready to start with this for the next sprint (pending prioritization).

@paolodamico sounds good. I am going to wrap up the work for paths and then I will move on to this next.

@neilkakkar
Collaborator

neilkakkar commented Sep 30, 2021

Task list for next sprint: (ref: @EDsCODE @liyiy @hazzadous for next sprint planning)

  1. Implement backend for correlation analysis (tool 2) ( https://metabase.posthog.net/collection/17 )
    1. For Events List
    2. For single Property
    3. Consider doing correlation calculation (post aggregations) in Python to speed up queries
  2. Set up user interviews for people who're interested in this. Prepare by running correlation on some of their popular funnels to see if the results sound useful to them or not. (+ Do it with UI demo as well, if ready in time). @paolodamico would love your help in setting these up!
    1. Actually, ask users about funnels that matter to them, and ask whether we can run analysis on these specific funnels before the chat
  3. Implement a basic frontend to display this information for correlation analysis (and accelerate feedback loops with @clarkus for design)
  4. Tool 1 is all frontend, basic MVP Implementation: highlight blocks in funnel breakdown with 2 sigma deviation.

Stretch goals (also depend on user interview results)

  1. Extend backend to work with multiple properties at the same time (optimising for query time by writing target people into a temp table perhaps)
  2. Pipeline to figure out which properties are important: Run correlation analysis on all saved insights in the background & shortlist 10 useful properties to run on-demand in new insights
  3. Experiment with heuristics to figure out how to remove useless information from correlation results (like removing internal events, ????, user-defined heuristics: like don't show me property X ever, etc., show warnings when signal information isn't rich enough)
  4. Iron out front-end once design is settled

Other things to consider depending on user feedback

  1. Whether to show correlation coefficient / odds ratio / p-values, or stick with "x times more users who dropped did this than users who converted"
  2. Alternative ways of calculating signals

@clarkus
Contributor

clarkus commented Sep 30, 2021

Catching up here - do we have any scenarios or content cases that are illustrative of this flow? I'm specifically thinking about the cases outlined above, but with real-ish values we can plug in for user testing. I am working through a few component ideas for how we can succinctly highlight significant items, as well as a component for a long-form listing of significant events and properties.

@neilkakkar
Collaborator

For Tool 1: I think any funnel breakdown would be as real as it gets? (like this one )

For Correlation Analysis: Good point, I'll try to get some popular funnels other orgs use, run correlation on them and get you the values. For user testing, how hard is it to replace values? I was thinking replacing them with the actual values for each user testing we do.

@clarkus
Contributor

clarkus commented Sep 30, 2021

For Correlation Analysis: Good point, I'll try to get some popular funnels other orgs use, run correlation on them and get you the values. For user testing, how hard is it to replace values? I was thinking replacing them with the actual values for each user testing we do.

It's not difficult to replace values, just time consuming. If you can export as JSON I think I can use that as a value source in Figma... I'll look into that. Either way, it's totally possible. Depending on scheduling, it might be hard to get everything updated before each interview.

@paolodamico
Contributor Author

Thanks @neilkakkar!

  • Can you update the decision log (https://github.com/PostHog/product-internal/issues/178) with any decisions you make along the way (e.g. what type of correlation / what features we're building)?
  • I have quite a few users lined up for async written feedback based on a prototype. We can start with those and internal testing and after we have a beta MVP at the end of this sprint, we can do some live usability tests. Thoughts?
  • Based on the specific goal we set for this sprint on Quant Analysis, we may want to adjust the tasks a bit. Let's discuss in the meeting today.

Cue posthog-bot

@posthog-contributions-bot
Contributor

This issue has 7299 words. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

  1. Write some code and submit a pull request! Code wins arguments
  2. Have a sync meeting to reach a conclusion
  3. Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

@neilkakkar
Collaborator

I've done that. Will continue to do so :).

I think we can be a bit more aggressive with the timeline. I want to sort out all live user interviews by the end of this sprint, so we know what exactly we need to focus on for next sprint.

We'll have the beta MVP behind a FF by day after, so internal testing can begin by Friday morning, and we can do live usability tests also behind the FF next week. Basically, once we have interviews scheduled, we can put them on the FF right around the interview, get them to share their screen and talk through what they see, like, use etc.

Thoughts? It's a beta MVP, so will be rough around the edges, but I think how pretty the UI is isn't a defining feature for this product, given there are no fancy controls, and what's interesting is the data table itself.

@paolodamico
Contributor Author

Excited about that! So would we be able to have live usability tests using user's real data? I can schedule some tests for next Mon - Wed, just need your confirmation.

@neilkakkar
Collaborator

@clarkus some new design considerations based on recent experimentation:

  1. Thanks to some technical breakthroughs, we can now consider allowing users to specify custom person properties on which to run correlation analysis. We have a default set to fall back on, but there's no way for us to intelligently choose properties that matter to users. What's a good way to show off custom property selection?

  2. There are some specific cases when the data will be very skewed: Say, 10,000 dropoffs to 100 successes. This skews results a lot, and we want to either (1) not show results here with an error. Or (2) Give a warning that the funnel is too skewed to produce meaningful correlation results. Thoughts on how to represent it?
    A third option here is to automagically fix this - but this enters slippery territory, as we end up doctoring the results.

@clarkus
Contributor

clarkus commented Oct 8, 2021

Thanks to some technical breakthroughs, we can now consider allowing users to specify custom person properties on which to run correlation analysis. We have a default set to fall back on, but there's no way for us to intelligently choose properties that matter to users. What's a good way to show off custom property selection?

We should facilitate input using the same taxonomic picker we use for filters, just scoped to person properties. Since we can specify multiple properties, we should use some form of multi-select. Is there an upper limit on the number of properties you can specify? Lower limit? Are cohorts applicable here since we're dealing with persons?

Screen Shot 2021-10-08 at 11 41 06 AM

The questions around scale will determine how we collect, summarize, and manage selected items. Here are some ideas for multi-selects that summarize selections across different scales.

Screen Shot 2021-10-08 at 11 56 37 AM

There are some specific cases when the data will be very skewed: Say, 10,000 dropoffs to 100 successes. This skews results a lot, and we want to either (1) not show results here with an error. Or (2) Give a warning that the funnel is too skewed to produce meaningful correlation results. Thoughts on how to represent it?

I very much think we should be transparent here. We should aim to communicate when the funnel is in such a state that correlations aren't meaningful. Secondary to that, if there's any means for the user to recover from this state, let's make that a simple thing to do. In your example, 10k dropoffs to 100 conversions, we could indicate that the funnel isn't meaningful and prompt the user to adjust conversion windows or something else to adjust the results. It's very much a content design task. Another option would be to collapse or otherwise hide the correlation tables behind a prompt that says "hey these results might not be meaningful because the funnel is skewed... are you sure you want to see these?" and an action to show the tables. Not a fan of automagic™, but appreciate you bringing it up as an option.

@neilkakkar
Collaborator

Is there an upper limit on the number of properties you can specify? Lower limit? Are cohorts applicable here since we're dealing with persons?

Preliminary tests suggest no. I'll try working this into the MVP to look at real-life perf considerations, but for now, we can assume that a person can select all person properties. So there's no lower, nor any upper limit.

Cohorts aren't applicable here.

What do you mean by different scales?

Also, another point worth noting: You can only select the property name, not the property values. That is, the total options available to select are literally all you see in the taxonomic filter for person properties: The names of the properties.


I agree with being transparent. Hmm, I was thinking about this a bit more over the weekend, and it's not like the correlations are irrelevant, but more like: "The quality of the input data determines the quality of the output".

On the backend, the rule of thumb I've settled on right now is 1:10. If the success-to-failure ratio (or vice versa) is more extreme than 1:10, then the funnel is skewed.

On the frontend, maybe we can say something like: "The number of successes (or failures) is too few to give confident signals. We recommend that the ratio of successes to failures (or vice versa) not fall below 1/10."

And optional extra advanced explanation: As otherwise, even a small success signal gets magnified 10 times, since the sample size is smaller.
(maybe not say this, idk)
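A minimal sketch of how the 1:10 rule of thumb could be checked, assuming we only have the total success and failure counts for the funnel. The function name `skew_warning`, the `MIN_RATIO` constant, and the exact message wording are hypothetical, not the actual backend implementation.

```python
from typing import Optional

# Hypothetical threshold taken from the 1:10 rule of thumb discussed above.
MIN_RATIO = 1 / 10


def skew_warning(successes: int, failures: int, min_ratio: float = MIN_RATIO) -> Optional[str]:
    """Return a warning message if the funnel is too skewed, else None."""
    if successes <= 0 or failures <= 0:
        # Assumption: with an empty side there is nothing to correlate against.
        return "Not enough data: the funnel needs both successes and failures."
    # Compare the smaller side to the larger one so the check is symmetric
    # (too few successes or too few failures).
    ratio = min(successes, failures) / max(successes, failures)
    if ratio < min_ratio:
        smaller = "successes" if successes < failures else "failures"
        return f"The number of {smaller} is too few to give confident signals."
    return None
```

Because the check compares the smaller count to the larger one, the same function would cover both a funnel with too few conversions and one with too few dropoffs.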

@clarkus
Contributor

clarkus commented Oct 11, 2021

Scale means the volume of items we're managing. Knowing that there isn't an upper or lower bound answers my questions. 👍

I agree with being transparent. Hmm, I was thinking about this a bit more over the weekend, and it's not like the correlations are irrelevant, but more like: "The quality of the input data determines the quality of the output".

This is very true. I think our goal should be guiding the user as much as we can to provide enough quality input in order to output something valuable. Saying why the funnel is skewed is good, but the very next thing the user will want is information on how to recover. Short of that, some simpler content that communicates why things aren't great... @paolodamico do you have any thoughts on this?

@neilkakkar
Collaborator

Makes sense. As an example, @marcushyett-ph is notorious for choosing highly skewed funnels 😋 . Wayy too many pageviews, not enough of (second event).

But, for either of those cases^, it's hard to judge how to recover. We need to either (1) be more specific with the first event, like choosing a specific page or choosing another event that doesn't happen too often, or (2) choose a more frequently happening second event.

At the very least, doing either of these should normalise the odds ratio, returning better signals.

(Side note: the > 100x odds ratios are a direct result of this. Too few successes, so each success carries a lot of weight.)
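To illustrate why a single success carries so much weight in a skewed funnel, here is a rough sketch using the textbook odds ratio from a 2x2 table (property present or absent versus success or failure). This is an assumption for illustration: the function name and arguments are hypothetical, and the real query may compute the ratio differently, for example with smoothing.

```python
def odds_ratio(success_with: int, success_total: int,
               failure_with: int, failure_total: int) -> float:
    """Textbook 2x2 odds ratio for a property among successes vs. failures.

    Assumes every cell of the table is non-zero (no smoothing applied).
    """
    success_without = success_total - success_with
    failure_without = failure_total - failure_with
    return (success_with * failure_without) / (success_without * failure_with)


# Skewed funnel: 10 successes vs. 9,990 failures.
skewed_before = odds_ratio(5, 10, 4995, 9990)    # property is uncorrelated
skewed_after = odds_ratio(6, 10, 4995, 9990)     # one more success has it

# Balanced funnel: 5,000 successes vs. 5,000 failures.
balanced_before = odds_ratio(2500, 5000, 2500, 5000)
balanced_after = odds_ratio(2501, 5000, 2500, 5000)
```

In the skewed funnel, one extra success with the property moves the odds ratio from 1.0 to 1.5, while the same change in the balanced funnel moves it by less than 0.001.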

@clarkus
Contributor

clarkus commented Oct 11, 2021

But, for either of those cases^, it's hard to judge how to recover. We need to either (1) be more specific with the first event, like choosing a specific page or choosing another event that doesn't happen too often, or (2) choose a more frequently happening second event.

This is great. I can work with this to try to come up with something more actionable for users encountering this. Thanks!

@neilkakkar
Collaborator

neilkakkar commented Oct 11, 2021

Cool! :) Note that it can be the reverse as well: say, 10,000 first events and 9,990 successful conversions (so only 10 dropoffs). That's equivalent to the above case, just with the recommendations reversed.

@clarkus
Contributor

clarkus commented Oct 11, 2021

While working on paths, I came up with a workflow for an inclusive picker. I took that concept and applied it to person properties. https://www.figma.com/file/gQBj9YnNgD8YW4nBwCVLZf/PostHog-App?node-id=4160%3A26978

Screen Shot 2021-10-11 at 1 31 53 PM

Screen Shot 2021-10-11 at 1 46 06 PM

By default, we'd select everything. From there a user can deselect what they don't want, or deselect everything and start from a blank slate. It's not shown here, but the control summary in the properties table would summarize the property count (4 / 16 person properties).
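The selection behaviour described above could be modelled roughly as follows. This is only a sketch of the state logic, not the actual UI code: the `PropertyPicker` class, its method names, and the placeholder property names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class PropertyPicker:
    """Inclusive picker: everything is selected until the user deselects it."""

    all_names: List[str]
    excluded: Set[str] = field(default_factory=set)

    def deselect(self, name: str) -> None:
        self.excluded.add(name)

    def select(self, name: str) -> None:
        self.excluded.discard(name)

    def deselect_all(self) -> None:
        # Start from a blank slate.
        self.excluded = set(self.all_names)

    def selected(self) -> List[str]:
        return [n for n in self.all_names if n not in self.excluded]

    def summary(self) -> str:
        return f"{len(self.selected())} / {len(self.all_names)} person properties"


picker = PropertyPicker([f"prop_{i}" for i in range(16)])
default_summary = picker.summary()
picker.deselect_all()
for name in ["prop_0", "prop_1", "prop_2", "prop_3"]:
    picker.select(name)
```

Tracking exclusions rather than inclusions is one way to make "select everything by default" cheap, though the opposite representation would work equally well.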

@paolodamico
Contributor Author

Very interesting on the funnel skewing and how to fix! I do agree that if we can provide more actionable guidance, users will get more insights. I propose we move this conversation plus the one on how to surface confidence intervals (or other statsig metrics) to a separate issue and discuss for next sprint.

@clarkus
Contributor

clarkus commented Oct 14, 2021

@paolodamico
Contributor Author

We now fully support correlation analysis
