RFC: Inserter Service Requirements #68
base: main
Conversation
Not much to say here; I think this makes sense. One practical point: we've been thinking about this over at Team Pipeline as well. We currently have no bandwidth to support another service, but likely will in a couple of months with the two new joiners.
Happy for you to move fast here, hook onto the topics, and build a Cloud-specific service in a way that unblocks the Experimentation team, but I suspect this is something we'll probably take over in the medium term, since it pertains to ingestion / pipeline. We might also bring you over to the dark side instead ;)
- [ ] We will not materialize columns. We will use the new [`JSON` datatype](https://clickhouse.com/docs/en/sql-reference/data-types/json/)
- [ ] We will not use `Distributed` or sharded tables
These two things are not part of the ingestion pipeline anyway. Also, ingestion into tables with a sharded setup is the same as ingestion in a non-sharded setup. I suspect you're laying out the requirements for the ClickHouse Cloud setup here rather than for the inserter service?
The schemas will be different between hosted ClickHouse and ClickHouse Cloud. This is saying that we will not be building or inserting into `Distributed` tables as we currently do. Think specifically about how the current ingestion pipeline inserts into `posthog.events_writable`. That pattern is not a goal here.
It’s time now for us to move on from our friend the `kafka` table engine and move this functionality outside of ClickHouse.

## Why?
Agree with every "why" here, thanks for the section.
😎
I appreciate your ambition to expand the scope of your team by increasing the number of services it is responsible for. I don't believe a team should simply annex a project because the name of the team warrants it. I would point to Conway's law here:

> Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.

HBS and MIT have also found evidence to support this.

So in the end I would suggest we aim to reduce the scope of Ingestion / Pipelines into smaller pieces (events, plugins, scheduled/async tasks, etc.). This would help us develop much clearer interfaces between each piece in the inevitable event we want to untangle the bowl of noodles and replace components in place.
Which other team would own this?
Why does a team need to own this? This could be a new team of one (me) doing DB integrations?
I don't think this will work.
Let me flip the question back on you: why should this be owned by Team Pipelines? Why not Infrastructure? Where are the boundaries of what each team is responsible for?
I think there's a real cost to this project:
- makes self-hosting more costly
- for months we'll have more errors and work due to this service rather than less
- little immediate payoff for pipeline robustness; we could instead be handling typing errors better in plugin-server
@macobo I totally think your points are reasonable too. This entire thing adds complexity. I hate complexity.
This would not be shipped with self-hosted for now; it exists entirely to support the SQL interface on Cloud. All normal traffic would continue through the existing pipelines.
This would be 💯 true if we were cutting the existing pipeline over to this, but it will only support data available in the SQL interface.
I think this is actually worth doing by itself, and I consider it somewhat of a requirement for this to work. I really want the interface to be clean between the plugin server and both this service and ClickHouse (or whatever else is consuming from Kafka). Payloads on Kafka should be strongly typed and validated before producing to a topic (in most cases). This means that before the plugin service produces to Kafka for the inserter service or ClickHouse, it should enforce some sort of schema and validate that the payload complies with that schema, whether that is JSON Schema or Proto or Avro or whatever. I really believe it will make our lives easier for both development and operations.

I hope this makes you feel a bit better about this all. I don't intend for this to replace our current pipeline anytime soon. Maybe in the future it will... but not yet.
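To make the validate-before-produce idea concrete, here is a minimal sketch in Python. Everything here is illustrative: the field names, the hand-rolled schema, and the producer interface are assumptions, not the actual plugin-server schema (which would in practice be JSON Schema, Protobuf, or Avro as discussed above).

```python
# Sketch only: gate Kafka production behind payload validation.
# Field names and schema shape are hypothetical, not PostHog's real schema.
import json
from typing import Any

# A hand-rolled "schema": required field -> expected type. A real service
# would use JSON Schema, Protobuf, or Avro instead of this dict.
REQUIRED_FIELDS: dict[str, type] = {
    "uuid": str,
    "event": str,
    "timestamp": str,
    "properties": dict,
}

def validate_event(payload: dict[str, Any]) -> list[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            errors.append(f"missing required field: {field}")
        elif not isinstance(payload[field], expected):
            errors.append(f"field {field!r}: expected {expected.__name__}")
    return errors

def produce_if_valid(producer: Any, topic: str, payload: dict[str, Any]) -> bool:
    """Produce only payloads that pass validation; reject the rest.

    `producer` is any object with a `send(topic, value)` method; in real use
    this would be a Kafka producer client.
    """
    if validate_event(payload):
        # In practice: route to a dead-letter topic or structured error log.
        return False
    producer.send(topic, json.dumps(payload).encode("utf-8"))
    return True
```

The point of the sketch is the shape of the contract, not the validator: anything consuming the topic (inserter service, ClickHouse, or otherwise) can then assume every message already conforms to the schema.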
Draft RFC for requirements, reasoning, and phases of development for an Inserter Service that would be used to power, among other things, the new SQL Interface.