RFC: Inserter Service Requirements #68
base: main
Conversation
Not much to say here; I think this makes sense. One practical point: we've been thinking about this over at Team Pipeline as well. We currently have no bandwidth to support another service, but likely will in a couple of months with the two new joiners.
Happy for you to move fast here, hook onto the topics, and build a Cloud-specific service in a way that unblocks the Experimentation team, but I suspect this is something we'll probably take over in the medium term, since it pertains to ingestion / pipeline. We might also bring you over to the dark side instead ;)
- [ ] We will not materialize columns. We will use the new [`JSON` datatype](https://clickhouse.com/docs/en/sql-reference/data-types/json/)
- [ ] We will not use `Distributed` or sharded tables
These two things are not part of the ingestion pipeline anyway. Also, ingestion into tables with a sharded setup is the same as ingestion in a non-sharded setup. I suspect you're laying out the requirements for the ClickHouse Cloud setup here rather than for the inserter service?
The schemas will be different between hosted ClickHouse and ClickHouse Cloud. This is saying that we will not be building or inserting into `Distributed` tables as we currently do. Think specifically about how the current ingestion pipeline inserts into `posthog.events_writable`. That pattern is not a goal here.
It’s time now for us to move on from our friend the `kafka` table engine and move this functionality outside of ClickHouse.

## Why?
Agree with every "why" here, thanks for the section.
😎
I appreciate your ambition to expand the scope of your team by increasing the number of services it is responsible for. I don't believe a team should simply annex a project because the name of the team warrants it. I would point to Conway's law here:

> Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.

HBS and MIT have also found evidence to support this.

So in the end I would suggest we aim to reduce the scope of Ingestion / Pipelines into smaller pieces (events, plugins, scheduled/async tasks, etc.). This would help us develop much clearer interfaces between each piece in the inevitable event we want to untangle the bowl of noodles and replace components in place.
Which other team would own this?
Why does a team need to own this? This could be a new team of one (me) doing DB integrations?
I don't think this will work.
Let me flip the question back on you: why should this be owned by Team Pipelines? Why not Infrastructure? Where are the boundaries of what each team is responsible for?
I think there's a real cost to this project:
- makes self-hosting more costly
- for months we'll have more errors and work due to this service rather than less
- little immediate payoff for pipeline robustness; we could instead be handling typing errors better in plugin-server
@macobo I totally think your points are reasonable too. This entire thing adds complexity. I hate complexity.
This would not be shipped with self-hosted for now; it exists entirely to support the SQL interface on Cloud. All normal traffic would continue through the existing pipelines.
This would be 💯 true if we were cutting the existing pipeline over to this, but it will only support data available in the SQL interface.
I think this is actually worth doing by itself, and I consider it somewhat of a requirement for this to work. I really want the interface to be clean between the plugin server and both this service and ClickHouse (or whatever else is consuming from Kafka). Payloads on Kafka should be strongly typed and validated before producing to a topic (in most cases). This means that before the plugin service produces to Kafka for the inserter service or ClickHouse, it should enforce some sort of schema and validate that the payload complies with that schema, whether that is JSON Schema or Proto or Avro or whatever. I really believe it will make our lives easier for both development and operations.

I hope this makes you feel a bit better about this all. I don't intend for this to replace our current pipeline anytime soon. Maybe in the future it will... but not yet.
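To make the validate-before-produce idea concrete, here is a minimal sketch in Python. Everything here is illustrative: the field names, the hand-rolled schema, and the producer interface are assumptions, not the actual plugin-server schema (which would in practice be JSON Schema, Protobuf, or Avro as discussed above).

```python
# Sketch only: gate Kafka production behind payload validation.
# Field names and schema shape are hypothetical, not PostHog's real schema.
import json
from typing import Any

# A hand-rolled "schema": required field -> expected type. A real service
# would use JSON Schema, Protobuf, or Avro instead of this dict.
REQUIRED_FIELDS: dict[str, type] = {
    "uuid": str,
    "event": str,
    "timestamp": str,
    "properties": dict,
}

def validate_event(payload: dict[str, Any]) -> list[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing required field: {field}")
        elif not isinstance(payload[field], expected):
            errors.append(f"field {field!r}: expected {expected.__name__}")
    return errors

def produce_if_valid(producer: Any, topic: str, payload: dict[str, Any]) -> bool:
    """Produce only payloads that pass validation; reject the rest.

    `producer` is any object with a `send(topic, value)` method; in real use
    this would be a Kafka producer client.
    """
    if validate_event(payload):
        # In practice: route to a dead-letter topic or structured error log.
        return False
    producer.send(topic, json.dumps(payload).encode("utf-8"))
    return True
```

The point of the sketch is the shape of the contract, not the validator: anything consuming the topic (inserter service, ClickHouse, or otherwise) can then assume every message already conforms to the schema.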
Draft RFC for requirements, reasoning, and phases of development for an Inserter Service that would be used to power, among other things, the new SQL Interface.