Experimentation MVP Things to Do #7462
This is great @neilkakkar! Sounds like a plan. I'll work on some wireframes to discuss the overall flow. |
Are we limiting the user to what type of insight they're allowed to create? |
Experimentation is only possible on Funnels, so ideally yes (your call depending on how hard it is to implement; disabling the rest / removing them works) |
Is there a list of known constraints somewhere?
|
@clarkus Potentially things like participant selection? 🤔 |
Hmm, the FF determines the participants, so this should be ok (thanks to the rollout %). One more constraint: Each experiment can have only one insight, and only one FF set. Don't think there's anything else |
Created a (very, very ugly) proposal for what the flow can look like (see below). Some notes based on the comments I see above:
|
Here are the things I think we'd need to track / collect for creating an experiment. I have some open questions below on areas I see as needing elaboration. I think the way we determine the target metric needs more discussion, or at least consensus, for the MVP.
Experiment parameters
|
I hope you don't mind me raising this issue I raised last year regarding time-limited feature flags; I might be wrong, but it sounds like it might be related: #1903 |
@paolodamico : How does experimentation over a single event work? In this case, we don't have success & failures for the A or B side, so how do we calculate the rest of the things? imo we should restrict experimentation to funnels only. In that case, choosing the target metric = creating the funnel. The single number would be the conversion rate. I don't get how it can be, say, number of pageviews (the followup calculations get borked in this case?) |
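To make the "single number" above concrete, here is a minimal sketch of the quantity being discussed, in generic notation (not tied to any particular implementation): for a variant v with n_v participants entering the funnel and s_v completing it,

```latex
% Conversion rate for variant v (s_v conversions out of n_v entrants):
\mathrm{CR}_v = \frac{s_v}{n_v}

% Relative lift of the test variant over control:
\mathrm{lift} = \frac{\mathrm{CR}_{\text{test}} - \mathrm{CR}_{\text{control}}}{\mathrm{CR}_{\text{control}}}
```

A raw count like "number of pageviews" has no natural denominator per variant, which is why the follow-up calculations break down without a funnel.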
Person Selection is interesting! Sounds like instead of choosing an existing FF, it's creating a new FF. I was assuming that the FF would usually already exist, because there would be code changes based on this FF? Imagine you want to test a blue vs green logo for PostHog. I would make the change, release it only for PostHog users / myself, test it's OK, then update the selection & rollout criteria on the FF itself, and then move to the Experimentation part. This is very hard to do if I have to create the FF inside the experiment: my selection criteria are borking my testing flexibility. Thoughts? I feel your way is superior in terms of UX: I don't have to flip-flop around to set things up. But hopefully(?) by the time you come to run an experiment, you have your FFs & code changes already finished, ready to go. |
I am kind of confused on how feature flags work here, are we duplicating them as an "experiment feature flag" and editing that duplicate when we change its rollout percentages, etc? Also I think the FF selection should be a dropdown of existing FFs instead of entering one manually |
A reminder that we have a small data integrity issue with feature flags: posthog-js sends the first pageview before the latest feature flags have loaded. In practice: the first pageview of a user returning after 3 days may contain a very different set of flags than all the following events in that session. More context here: #6043 (comment) |
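As an illustration of the race being described, here is a minimal posthog-js sketch (the event name is made up, and the callback details may vary by library version) that waits for flags before capturing an experiment-relevant event:

```js
import posthog from 'posthog-js'

// Hypothetical setup - replace the key/host with your own project's values.
posthog.init('<project-api-key>', { api_host: 'https://app.posthog.com' })

// Events captured here, before flags have been fetched, may carry stale
// flag values (the problem described above for the automatic first $pageview).

posthog.onFeatureFlags(() => {
    // Flags have now been loaded, so the flag properties attached to this
    // event reflect the user's current assignment.
    posthog.capture('landing_page_viewed')
})
```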
I think it's a matter of comparing the baseline (control group) vs. experiment group. So for instance you target an increase in the number of Discoveries (and we could measure it over the time frame of the experiment). Or we could potentially target an increase in weekly active users (and then average weekly active users over the time frame). Though I could see the point of the MVP only targeting funnel conversion.
I think we could support "converting" an existing flag (which would basically mean reusing the key), but I think doing person selection differently than what we do for FFs makes more sense. We can have a sync conversation tomorrow to iron out these details! |
Let's chat about this today! What's worth ironing out here is that in this case, both control & experiment need another baseline with which they're being compared. (But I also found out that there's an alternative possible here, by setting exposure - i.e. varying how long A & B are active for.) Or, it can be something like: the baseline is number of pageviews over WAU, and then the A/B test focuses on, say, insights analyzed count. This is possible. (And yes, I'd like to scope down the MVP. Getting the calculations for either of these right can get hard.) |
Very short summary from our sync meeting today (feel free to edit if I missed something):
|
Alright, based on our last conversation, here's a proposal (Figma link) for what the UI could look like to define the experiment parameters in the planning. Any feedback? @clarkus @liyiy @neilkakkar Please think mostly about big UX stuff; we can figure out all the details (particularly around UI) in the next sprint. |
Minor fly-by comments
|
|
Not yet, but backend has implicit support, so once we're relatively confident things are going okay, easy to switch this on.
If the FF exists already, we'll |
This doesn't feel like it's actually a requirement/helps towards an MVP, but adds complexity we'll need to build and maintain. Suggestion: Don't allow key collisions and always create a new flag to go with the experiment. |
I think most times the flow would be to create your own FF, test things work on a small set, and then create an experiment out of it. That's the flow we intend to support with reusing an existing FF, as it allows you to run an experiment without changes to existing code. The complexity is limited to the experiment creation endpoint: to the rest of the world, the FFs look like what they're supposed to. -> Would you mind explaining more about what you're thinking about ref |
Some new technical constraints came up while implementing that I didn't think of earlier. This makes some things more annoying. cc: @liyiy @paolodamico

Fleshing this out with full context so people who aren't involved can contribute as well, if they want to (cc: @EDsCODE @macobo @hazzadous @mariusandra @marcushyett-ph )

The Problem

Experiments are very sensitive to measurement: choose the wrong way of filtering people at the end of the experiment, and it borks all your results. This is a big no-no. Even in the MVP, we want to do precise measurements - we want to set up infra such that the numbers going into the calculation are accurate. So, just having global filters on the breakdown funnel (which is used for measurement) doesn't work: person properties can change over time, and we might be incorrectly selecting the control & test group. (Also, what Marius said is a problem.)

The Solution

Set control and test variants explicitly on events, a.k.a. use multivariate FFs to accurately select your control and test group. In this way, we count precisely those events which at the time were on the control & test variant. This gets rid of both the above problems. (Users for whom FFs haven't loaded wouldn't see the test/control variant and thus should be discarded from the result analysis.)

The annoyance

Earlier, we decided to create this multivariate FF implicitly behind the scenes. This leads to complications: if you want the event coming in from …

This used to fit in well with the flow where:
Another option here was to override …

Solving the annoyance

Now, the problem is: turning the simple FF into a multivariate one behind the scenes doesn't work. The user necessarily needs to make code changes to get things working. Once an experiment has started, this is hard to do, since we set the rollout to 50-50 to start collecting experiment data, and bugs here bork the experiment again.

Put more succinctly, we want to allow users to run an experiment without making any new code changes after they've tested that things work (by, say, rolling out to only their team).

I think right now, the best way around this is to be more transparent: tell users that running experiments means using multivariate FFs. Discard support for "simple" FFs.

Support 2 variants by default: we change our UI flow a bit to allow selecting 2 variants, control and test.

And, if we're going that far, we should change the flow as well:
This is a bit more complicated in some ways than our existing flow - but our existing flow doesn't work, so I guess it's a non-comparison. Keen to hear if there are easier ways to sort this? And does this UX make sense? Are you very opposed to this? |
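To make the "users need to make code changes" point concrete, here is a rough sketch of what the client-side change could look like (the flag key 'experiment-landing-page' and the render helpers are made up; the control/test variant names follow the convention described above):

```js
import posthog from 'posthog-js'

// Made-up helpers standing in for the real UI code paths.
const renderCurrentLandingPage = () => { /* existing page */ }
const renderNewLandingPage = () => { /* page under test */ }

// For a multivariate flag, getFeatureFlag returns the variant key.
const variant = posthog.getFeatureFlag('experiment-landing-page')

if (variant === 'test') {
    renderNewLandingPage()
} else {
    // 'control', flags not loaded yet, or user not enrolled: show the
    // existing page. Events without a variant get discarded from analysis.
    renderCurrentLandingPage()
}
```

Events sent while a variant is active would then carry that variant on them, which is what lets the measurement funnel count precisely the control and test events.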
This issue has 2549 words across 23 comments. Issues this long are hard to read or contribute to, and tend to take a very long time to reach a conclusion. Instead, why not:
Is this issue intended to be sprawling? Consider adding label |
Thanks for sharing this @neilkakkar. My thoughts,
Finally, do let me know if you need any more help with mockups and/or wireframes. We should be doing ugly things for now, but mockups can still help. @liyiy |
Out of interest, may I ask what the use case is for wanting to convert (or use) an existing feature flag into an experiment? |
Absolutely @weyert! So the typical case is, for instance, that you're launching your new landing page with an A/B test to make sure it converts better; but before launching the experiment you want to make sure the new landing page looks right in production, so maybe you want to release it to your internal team for testing purposes (as a FF). Then once you've made sure the feature behind the FF works as expected, you convert it to an experiment and launch. Does that make sense? |
Given how it might be hard to gather feedback during this sprint, as people are away on holidays, we're changing things up a bit from the roadmap. cc: @paolodamico @liyiy The tactic I want to propose is doing the things we're most confident this feature would need, and pushing the rest out until we get some feedback. So, things to do for this sprint (in order of priority):
Edit: Forgot about the cleanup on the MVP
All of these tie into the goal of the sprint: (1) Users can run more kinds of experiments. (2) Users get richer results |
Excellent piece of feedback, thanks! I'll keep this in mind, and think of alternative representations :) |
This is great @neilkakkar! Can you update the roadmap PR so we have that as our up-to-date source of truth? Re Marcus's point, I would challenge that; the histogram is something we should gather feedback on before building. Not convinced it will be useful for the majority of users. Also, let's chat with @clarkus about how we want to handle the UX for "how long will experiments run" (planned) vs actual running time. |
Will do! Just to go a little deeper into what you're saying:
A few ideas from other platforms (https://vwo.com/tools/ab-test-significance-calculator/):
- Show the beta distribution of conversion rates
- Show the box plot
- ... or the histogram
- ... or ???? |
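For reference, the quantity a VWO-style beta-distribution chart is visualising can be sketched as follows - a generic Bayesian formulation with a uniform prior, not necessarily the exact model we'd ship:

```latex
% Posterior conversion rate for variant v, given s_v conversions out of n_v:
p_v \mid s_v, n_v \sim \mathrm{Beta}(s_v + 1,\; n_v - s_v + 1)

% Probability that the test variant beats control:
P(p_{\text{test}} > p_{\text{control}})
  = \int_0^1 \!\! \int_0^1 \mathbf{1}[x > y]\,
    f_{\text{test}}(x)\, f_{\text{control}}(y)\, \mathrm{d}x\, \mathrm{d}y
```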
These all look like good examples to use for customer feedback. +1 to @paolodamico's point - my hunch is that it'd be too complex, but I could be totally wrong - so worth double-checking. The simplest option I can think of is something like this, where we use color / shade to reflect confidence and a "meter" to represent magnitude (excuse my terrible UX skills): |
I have a consolidated experiment creation design at https://www.figma.com/file/gQBj9YnNgD8YW4nBwCVLZf?node-id=6012:30357#133670116. This reduces the flow into a single step for creation. I am moving on to summarizing an actively running experiment, a completed one, etc. Let me know if you have any ideas of how we might summarize progress. I am particularly thinking about the case where an experiment is running and a user needs to make the decision to let it run or end it earlier than planned. |
@neilkakkar aligned on the premise, let's just figure out the right UX here. My proposal would be to work with @clarkus on creating a few mockups for how we would display each of the 2-3 options and then show it to users. |
I have posted updated screens for the experiment summary and its various states at https://www.figma.com/file/gQBj9YnNgD8YW4nBwCVLZf/PostHog-App?node-id=6240%3A34236. Please take a look and leave any feedback on anything that looks off target. |
Adding comments on Figma |
A broad observation: it's turning out, from user interviews, that "telling users how much better the variant is than control" is perhaps not that big a deal. They can figure out a rough, good-enough estimate using just the conversion rates / absolute values they can see. We deprioritised histograms based on early feedback, and current feedback around "what do you think is missing from the results?" doesn't seem to prompt the question "how much better is the variant?". Will keep the idea in the background, but not building anything out yet. Since most things here have been achieved, closing this; will open a new issue for whatever new things come up / whatever else we decide to implement. |
Following up from #7418, here's what we need to do to get the MVP out in these 2 weeks:

Main Tasks:
- $active_feature_flags & $feature/{key} to every event (we only do it for web captures today) - @neilkakkar - deferred. Just focusing on posthog-js for now. (See the illustration at the end of this issue.)

New constraints:
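As referenced in the Main Tasks item above, here is a rough illustration of what an event enriched with these flag properties might look like; the event name and flag key are made up, and the exact property shape is an assumption:

```js
// Illustrative only - the kind of properties posthog-js would attach once
// $active_feature_flags & $feature/{key} are sent with every event.
const exampleEvent = {
    event: 'landing_page_viewed',
    properties: {
        $active_feature_flags: ['experiment-landing-page'],
        '$feature/experiment-landing-page': 'test', // variant at capture time
    },
}

console.log(exampleEvent)
```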