Ideas to tackle ~Universal Taxonomy~ Automated Insights #8261
Thanks for this @neilkakkar, really comprehensive!
Off the wall idea
Problem 1 + Business model segmentation

We should stay open to cutting across the behavior on a different axis. Looking at business types risks over-generalizing what could be very different companies: a fintech for small-business back office could be very different from a fintech for consumers. However, a fintech for small-business back office would likely have similarities to healthtech EHR management software. We would then develop workflows for effectively tracking a category of feature patterns: onboarding flows or CRM-like interfaces. (This would also address the problem in "Text matching", where companies would bucket themselves not into an industry but into feature patterns.)

Problem 3

If we develop an internal taxonomy for events, collaboration data (comments, names, descriptions) could be good passive data to eventually train models on that would produce useful insights for new users.
General thoughts
Not quite sure I follow - what would you like me to elaborate more on?
Sorry, ignore that! I erased the rest of that thought
Here's a more extensive description of point 4 from above.

Problem layers

This got me thinking about how we use PostHog right now and what would be an immediately useful insight to know. I found the problem space to have two layers.
Useful insights: Deep dive

While suggesting novel insights would be a way to discover new, possibly useful insights, suggesting insights related to the one being viewed could be a way to surface surefire, useful ones. For example, if I'm analyzing event X, it almost always benefits me to know whether the retention/churn of this event's usage is above or below the average of the other events I'm tracking. Performing more correlation analysis would be helpful too: when looking at event X, I'd like to know if there's a weirdly higher rate of stickiness for people performing the event on day A vs day B. We could also refine some of the flows to be more impactful: when I'm analyzing a three-step funnel A->B->C, we could preemptively search for a different step B that results in a really high (or really low) conversion to C. We could still apply all the dimensions mentioned above by @marcushyett-ph to decide what to surface.

Benefits
Drawbacks
Was expanding scope to cover whatever we can think of, but yeah, makes sense to rule things out now & prioritise.
Actually, the way I was thinking of this (which gives it moderately-high probability) was: there's a fixed taxonomy (2.1), and we use word embeddings to solve (2.2), i.e. the text mapping bit. I think this will have higher precision than naive string matching, because words with similar meanings map to the same fixed taxonomy word in this space. Example: `User enrolled` & `User login` can both map to our taxonomy's `USER_LOGIN`, while with text matching, we'd only get the exact match. If we use these embeddings to do everything, then agreed, precision goes down.
I have no personal experience, but does sound hard to get right.
Agree, definitely better for understanding. I'd say the best way to solve problem 3, given you've solved problem 2, is to manually generate insights based on the taxonomy, which then creates the set of possible insights. What I suggested + your ideas are definitely possible, but I'd say doing things manually first would give us a better grasp of what features are important. For MVP, I propose solving (3) manually.
Interesting, how do you imagine this will work? Or, what useful data do we get out of this? I imagine it being useful for us manually testing out ideas "oh, this team does this, which is cool, let's put it as a possible insight into our taxonomy", but don't see (yet) how this would work for training data?
Agreed! And also on the direction you're proposing.
Also, it sounds like we have slightly different ideas in our heads about what 'taxonomy' is, so I suggest we taboo the word and replace it with "what we mean" when we chat next xD
Can you share the new word / definition you come up with here please :)?
A summary of options from above conversation:
My stance here is (2) is the lowest risk as it's an extension of some of the diagnosing-causes work we've been doing. (3) is an add-on. (1) is what we should continue to validate right now: is taxonomizing a vertical/feature possible? If yes, then we build this direction and table (2). If not, we proceed with (2).
What I mean by taxonomy

There's the alphabet, and the words.

The alphabet = categories of events. The words = insights we can possibly generate using these events.

Together, they make our taxonomy: the events, and the meaningful insights we can generate from these events.
Problem Analysis

So, I chose the quickest (to me) path through problems (1), (2), and (3), to see if we can validate them quickly. Here are my findings (feel free to challenge method/conclusion). Thank you, OpenAI: it made using word embeddings and validating things very easy.

I started with clustering on all events to come up with natural categories we could use. Half of them were useable, and I cherry-picked those (the taxonomy alphabet examples above came out of this). As you'll note, these classifications are pretty generic. To categorise is to throw away information & make things legible. It's a trade-off. But, as far as taxonomies go, I think the above is reasonably okay.

This answered problem (1). Yes, categorisation is possible, and for most things you don't even need to go to a vertical: the above taxonomy is generic enough to work well with all businesses. Of course, a more specific one would be nicer for specific industries, but I wanted to quickly get something that works as a problem validation solution.

I implicitly solved problem (2.1) by cherry-picking and building "words" out of them, as mentioned in the taxonomy example above. For problem (2.2), we turn again to word embeddings: given my taxonomy definition, which events are closest to this definition (by cosine similarity of word embeddings)? This worked extraordinarily well, gathering most useful events. So, a few more heuristics here to reduce the error rate, and we're golden.

The Big Problem

The big issue comes when going from problems (1) and (2) to (3). There's no way I could figure out how the "words" in the taxonomy could lead to very meaningful insights that people hadn't thought of before. The issue here is that: a. the taxonomy is too generic.
Disregarding all the other problems with word embeddings (how to make them work with self-hosted / using an external API / aggregating data across PostHog instances), some of which are solvable I think, creating a taxonomy to solve (3) doesn't seem like the right approach. Next up, I draw conclusions from this. Lmk if you disagree. @marcushyett-ph maybe there's a better selection of "words" you can come up with that helps solve this? Sounds impossible to me, given my hypothesis, but would love to be proven wrong :D

The direction we should head in

Fleshing out this problem a bit more, what seems key is that most insights useful to a project will be generated from the data specific to the project. Put another way, any suggestions coming out of a taxonomy (no matter how specific to the industry) would necessarily be worse than analysing a team's data using specific algorithms (TBD), which tell them interesting things about their data. Coupled with the idea that we want to make this work well for self-hosted instances, we should limit our data universe to the project itself.

This implies that we want to solve (3) directly, AND that the ML approach based not on the data (i.e. insight results) but on generating filters / new insights to test would be terrible. Any ML approach that takes into consideration the results of insights seems almost impossible to compute: it basically implies you need a LOT of generated insights, which generate results, so you can then extract these results into features and run the model on those features. Might be possible, but this sounds insane: the slowest feature generation I've ever heard of, where drift is very pronounced, since metrics can easily change week over week.

This constrains the problem space well, and discards most of our initial solutions. Something like what Eric mentioned, "Automated Diagnosing Causes and Interconnectedness", is a valid approach.
It also gives us interesting new levers to pull, which also tie pretty well into collaboration (cc: @paolodamico). What if we define useful insights as "insights other members of your team have found useful, but you haven't seen"? We can recommend things powered by the analyses other users have been running, and tell people that others found a given insight useful. Then, the more powerful ones are the automated diagnosing-causes insights (for which Eric made that nice concept above). I'm pretty sure we could also do more sigma-analysis-like stuff for all insights: there's at least some low-hanging fruit here, where users are looking at almost the right thing but may gloss over it, and just nudging them in the right direction can make a hell of a difference.
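The sigma-analysis nudge mentioned above could be as simple as flagging points that sit several standard deviations away from a trailing mean. A minimal sketch; the window, threshold, and data are illustrative assumptions, not product code:

```python
# Flag days where a metric deviates by more than `threshold` standard
# deviations from its trailing mean.
from statistics import mean, stdev

def sigma_anomalies(series, window=7, threshold=3.0):
    """Return (index, value, z_score) for points outside +/- threshold
    sigma of the trailing `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        trailing = series[i - window:i]
        mu, sigma = mean(trailing), stdev(trailing)
        if sigma == 0:
            continue  # flat history: no meaningful z-score
        z = (series[i] - mu) / sigma
        if abs(z) > threshold:
            anomalies.append((i, series[i], z))
    return anomalies

# Example: a mostly-flat daily conversion count with one spike at the end.
daily_conversions = [100, 98, 102, 101, 99, 100, 97, 180]
print(sigma_anomalies(daily_conversions))
```

The same loop could run over any insight's time series to power the "nudge users toward the thing they almost saw" idea.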
@neilkakkar Can you share the examples of the generic insights you're talking about (I couldn't really find them in the notebook)? So I have three opinions on generic insights.
The direction sounds reasonable to me (we should stay really close to the collaboration folks if we take it). One concern: does this approach preclude us from solving the search problem (e.g. I can search for any insight in plain English)? Or do you think there's another way of approaching that, given what you've learned so far from clustering etc?
(Will respond to (1) and (2) separately, interesting tradeoffs here, after some quick tests on Monday. (1) is interesting, but sounds like a different problem which we can solve differently, something like a bootstrap for success.)

About (3): isn't this better solved by something like sigma-analysis / anomaly detection / correlation analysis (a.k.a. the direction proposed)? Would you rather have one of our generic suggested insights get lucky with showing you a remarkable change, vs us pointedly searching for anomalies and surfacing those, like in Eric's concept above?

About the search problem: word embeddings in and of themselves are closer to the state of the art for solving text search / document retrieval (and take a lot fewer resources, given a huge-ass pretrained model). I'm pretty sure we could use these to solve "searching for insights in plain English", and arguably better than having a taxonomy, since a taxonomy brings back problem (2.2): mapping the natural language to the taxonomy, AND mapping existing insights to the taxonomy as well.

Edit: we also haven't yet experimented with Elasticsearch/Lucene/Solr/existing open source search solutions, which have lots of sophisticated ranking algos we can experiment with. Last time I worked on search (previous company), there were several plans of attack I had in mind & tested out to make search work nicely.
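For the plain-English search idea, the rough shape of embedding-based retrieval: embed the query and each saved insight's name, then rank by cosine similarity. Tiny hand-written vectors stand in here for a real pretrained model; all names and numbers are made up:

```python
# Rank saved insights against a plain-English query by vector similarity.
import math

# Toy stand-ins for real word embeddings (which would be 200-300d,
# from a pretrained model).
TOY_EMBEDDINGS = {
    "signup":   [0.9, 0.1, 0.0],
    "register": [0.85, 0.15, 0.05],
    "churn":    [0.0, 0.9, 0.2],
    "cancel":   [0.05, 0.85, 0.25],
    "funnel":   [0.1, 0.2, 0.9],
}

def embed(text):
    """Average the vectors of known words; zero vector if none match."""
    vecs = [TOY_EMBEDDINGS[w] for w in text.lower().split() if w in TOY_EMBEDDINGS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def search(query, saved_insights):
    q = embed(query)
    return sorted(saved_insights, key=lambda s: cosine(q, embed(s)), reverse=True)

insights = ["signup funnel", "churn by cancel reason", "register conversion"]
print(search("cancel churn", insights))
```

With real embeddings, "cancelled subscription" and "churn" would land near each other even without shared words, which is the whole advantage over string matching.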
Search: sounds like a fair conclusion (but we can keep it out of scope for now). Would it be fair to say that we now have more confidence this is solvable, given the work we've done so far? Btw, I'm trying to get our session with the ex-CTO of a company in this space booked in (they're not free until the week after next); we should be able to validate some of our conclusions with them then, hopefully.
100% yes. For any of the above, I'm not saying we were wrong (at all) to consider all these approaches. I could only reach this step after experimenting a bit and solidifying my guesses via actual code. |
So, imagine we have such a taxonomy. Let's say, for the sake of discussion, that every non-stale event in PostHog for PostHog is the taxonomy: it's a taxonomy for product analytics companies that also support FFs, session recordings, and correlation. Now, this is a very specific taxonomy, and let's say there are 100 other companies which are in this space, have the same events, and use PostHog for their own product analytics because PostHog is obviously the best. How do we generate valuable insights from this taxonomy? This is an easier problem than above, since there's a perfect taxonomy mapping.
I think the discussion and findings have been very useful. Summary comment:
Great summary - what's our next step from here?
Just jotting down some ideas as I come up with them / as I read things on the internet / clarify my own thinking around this. Feel free to add your own, we'd probably need to try & mix and match a few approaches to get to something usable!
Adapting from #8094, there are 3 problems to solve:
Some things to consider:
We don't really care about (1) and (2). The goal is simply (3). It's possible to reach (3) without doing (1) and (2), by say, using a more fluid approach than a hard taxonomical classification. (Don't know how this would work yet, just something to keep in mind)
I think to effectively do (3), not only would we want to map events to a model, but also event properties. For example, a `subscribed` event would likely have a `price` or `amount` property, and showing users they can "track daily revenue" vs just the number of people who subscribed is where the magic happens. The latter is easy to figure out; the former, not so much!
Problem 1 solutions
My gut feel here is yes: most companies with the same business model look the same, do the same things, and earn money the same way. Thus, the events they track should be similar.
What's interesting to me here is that these companies can be in different industries: you can have a health subscription service, or a SaaS, both of which would have very similar events, like `subscription (started | cancelled)` with `amount` props. By contrast, a health insurance company might have things like `bought product` with `product type: A` as a property (spitballing here).

So, I propose we divide verticals by business models instead of industries. (Before going this route, actually check our data to see if we can confirm this hypothesis.)
I may be oversimplifying, and there may be other variables that are also important, but I feel figuring these out would make things a lot clearer.
Choosing the right division here is important, because it can make the next problem impossibly hard to easy.
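One cheap way to check our data for this hypothesis: compare the overlap of tracked event names across companies when grouped by business model vs by industry. A sketch with made-up companies and event sets (Jaccard similarity is just one possible overlap measure):

```python
# Compare event-name overlap under two groupings. If the business-model
# hypothesis holds, same-model pairs should overlap more than
# same-industry pairs. All companies and events below are placeholders.

def jaccard(a, b):
    """Jaccard similarity of two sets of event names."""
    return len(a & b) / len(a | b) if a | b else 0.0

companies = {
    "health_subscription": {"subscription started", "subscription cancelled", "payment"},
    "saas_subscription":   {"subscription started", "subscription cancelled", "trial started"},
    "health_insurance":    {"bought product", "filed claim", "payment"},
}

# Same business model (subscriptions), different industries:
print(jaccard(companies["health_subscription"], companies["saas_subscription"]))
# Same industry (health), different business models:
print(jaccard(companies["health_subscription"], companies["health_insurance"]))
```

Running this over real project event tables, per grouping, would give a concrete number to accept or reject the division before building on it.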
Problem 2 solutions
There are two parts to this problem. (2.1): What does our internal model for this vertical look like? And (2.2): How do we map user events to this internal model?
We need both to be distinct, since we use (2.1) as a generator for solving (3).
Generic Word Embeddings
We can represent every word by a 200-300 dimension vector. Lots of generic trained models exist. Any two events whose distance (by some measure, like Euclidean distance in this vector space) is less than epsilon map to the same thing.
So, given a representation for (2.1) (perhaps manually choosing words), we should be able to solve (2.2) using these word embeddings.
We shouldn't train our own embeddings, as (I think) that's a losing battle, hard to get right, and not worth it for the MVP.
It's easier to find generic word embeddings vs embeddings specific to a field, but I expect results to be better when we use specific embeddings for a specific field: they map domain words better.
We should try testing both kinds, to see what works.
Probability I think this will work: Moderately high
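A sketch of how this epsilon-threshold mapping for (2.2) could look, with toy 2-d vectors standing in for real 200-300 dimension embeddings (the labels, vectors, and epsilon value are all illustrative assumptions):

```python
# Map each incoming event name to the nearest taxonomy word, but only
# if it falls within `epsilon` Euclidean distance; otherwise leave it
# unmapped rather than force a bad match.
import math

TAXONOMY = {
    "USER_LOGIN": [1.0, 0.0],
    "PURCHASE":   [0.0, 1.0],
}

# Pretend these came from a pretrained embedding model.
EVENT_VECTORS = {
    "user enrolled": [0.95, 0.05],
    "user login":    [0.98, 0.02],
    "bought item":   [0.10, 0.90],
    "page scrolled": [0.50, 0.50],  # ambiguous: far from every label
}

def map_event(event, epsilon=0.3):
    """Return the taxonomy label within `epsilon` of the event's vector,
    or None if nothing is close enough."""
    vec = EVENT_VECTORS[event]
    best_label, best_dist = None, float("inf")
    for label, tvec in TAXONOMY.items():
        dist = math.dist(vec, tvec)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist < epsilon else None

for name in EVENT_VECTORS:
    print(name, "->", map_event(name))
```

The `None` branch matters: events that don't fit anywhere should stay unmapped, which is where the precision advantage over forcing every event into the taxonomy comes from.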
Automatic taxonomy creation
There's lots of interesting methods to generate taxonomies. Why not use these to generate a model (2.1), and use it to predict which custom property goes where? (2.2).
This definitely scales better than manually doing (2.1), but runs into a new problem: How do we map this model to smart insights? For example, the taxonomy created might focus instead on different disease classification, vs. events coming into PostHog.
Similar arguments can be made for ontology creation.
However, I think we can take inspiration from these techniques, and figure out something that works for us.
Probability I think this will work: Low
Text matching
There's no reason we have to solve all the hard parts via code. We could manually build a taxonomy of what events should look like for a vertical (assuming we've solved problem (1) well). And encourage companies to adhere to these guidelines: call your events like we tell you to.
This makes (2.2) very easy: we know a priori what's coming in!
(2.1) is hard though. Do we know enough about industries to do this manually?
Further, how do we tell Oura to not go with the health-industry taxonomy, but with the SaaS taxonomy?
And, mucho friction, as industries change / businesses grow / their business models change, and this feature goes to trash. Maybe.
But anyway, I think we should definitely attempt this once, just to understand the edge cases better: When/why would businesses not want to track events like so, etc. etc.
Probability I think this will work: Moderate
Text matching without training
It's like the above, but what if we assume, given we select the verticals properly, most users will call their events similarly?
This removes all the icky bits from the above method, and just keeps the easy bits.
Probability I think this will work: Moderate, if (1) is solved well. Low otherwise.
Problem 3 solutions
Given we have a model (2.1), we should be able to create all important insights manually (and thanks to ideas from companies in the same model vertical).
Not sure about the effort this will take, and whether we'll surface interesting things. But I suspect this will at least level the playing field: here are the basic things every company in this vertical looks at, which can be valuable enough.
Some really out there solutions:
Random Insights
What if, instead of doing the hard work of creating a taxonomy, we randomly suggest insights based on events & properties data coming in? Of course, there needs to be some structure, AND, we can do some pruning based on prelim results, like a chess engine / A* search algorithm (need to define the problem better for search, but you get the idea)
So, you generate random insights, and discard any for which the result is 0. Then we have heuristics to prune certain combinations, like, say, "if conversion rate below 1%, probs not useful". We'll need to play around a lot to figure these out, but idk, might do better.
(I mean, if this does better than solving (1) and (2), we know our models are pretty shitty, a.k.a the problem is very hard 😂 )
Probability I think this will work: Moderate-low
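The generate-and-prune loop could be sketched like this; `run_insight` is a made-up stand-in for actually executing an insight query, and the heuristics are placeholders:

```python
# Enumerate candidate insight configs from observed events, run them,
# and discard by cheap heuristics (chess-engine-style pruning).
import itertools

EVENTS = ["signup", "activate", "subscribe", "churn"]

def run_insight(insight):
    """Stand-in for executing the insight query; returns a fake
    conversion rate derived deterministically from the config."""
    return (sum(map(len, insight)) % 100) / 100.0

def candidate_funnels(events, max_steps=3):
    """Enumerate funnel configs of 2..max_steps ordered steps."""
    for n in range(2, max_steps + 1):
        for steps in itertools.permutations(events, n):
            yield ("funnel",) + steps

def prune(insights, min_rate=0.01, max_rate=0.99):
    """Heuristic pruning: drop dead funnels (~0% conversion) and
    trivial ones (~100%), per the "below 1%, probs not useful" idea."""
    kept = []
    for insight in insights:
        rate = run_insight(insight)
        if min_rate < rate < max_rate:
            kept.append((insight, rate))
    return kept

survivors = prune(candidate_funnels(EVENTS))
print(len(survivors), "candidate insights survived pruning")
```

The combinatorics blow up fast (permutations over all events and properties), so the real work would be in the pruning heuristics, not the generation.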
Neural Net all the things!
This is a surprisingly well-defined problem to attack via machine learning: you have a set of events with properties, persons with properties, and the output is a list of tuples: the insight type, and the events/actions in the list.
Actually, we could possibly use GPT-3 here! If it can generate code, it can generate filter objects! We just need to prompt several good examples of meaningful filter objects, given events & properties. (every filter object uniquely maps to an insight)
I think GPT-3 would definitely work better than training our own neural nets. (because training is hardddd, needs lots of data, etc. etc.)
Hmm, now that I think about it, this might be the most promising approach, barring concerns with using an external API.
Probability I think this will work: High
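A sketch of the few-shot prompting idea: assemble worked (events → filter object) examples into a prompt and ask the model to complete the next one. The filter shapes here are illustrative, not PostHog's actual filter schema, and the actual API call is omitted (it needs a key and a network round-trip):

```python
# Build a GPT-3-style few-shot prompt mapping a team's events to a
# suggested insight filter object.
import json

# Hypothetical worked examples; real ones would be curated by hand.
FEW_SHOT_EXAMPLES = [
    {
        "events": ["signup", "subscribed"],
        "filter": {"insight": "FUNNELS", "events": ["signup", "subscribed"]},
    },
    {
        "events": ["app opened"],
        "filter": {"insight": "RETENTION", "target_event": "app opened"},
    },
]

def build_prompt(new_events):
    """Assemble worked examples, then the new case, leaving the model
    to complete the final 'Filter:' line."""
    parts = ["Given a team's events, suggest a useful insight filter object.\n"]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Events: {json.dumps(ex['events'])}")
        parts.append(f"Filter: {json.dumps(ex['filter'])}\n")
    parts.append(f"Events: {json.dumps(new_events)}")
    parts.append("Filter:")
    return "\n".join(parts)

prompt = build_prompt(["purchase", "refund requested"])
print(prompt)
```

Since every filter object uniquely maps to an insight, the model's completion (once parsed as JSON and validated against the schema) can be rendered directly as a suggested insight.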
cc: @marcushyett-ph @EDsCODE