Replies: 3 comments
-
A Conduit pipeline already supports multiple sources. Even if we constrain the Postgres connector to read only from one table at a time, we can have multiple Postgres sources in a pipeline, so a destination needs to be able to handle this. For that reason, I think we will have to ensure destination connectors can somehow differentiate records coming from different sources (or different tables, in case the Postgres source listens to multiple tables). That's also why I think the source connector should support reading from multiple tables; it is an optimization that achieves the same result as having multiple sources.

Now, how do we expect a destination connector to handle data coming from different sources? That depends entirely on the destination connector. First, I'd like to point out that many destinations might not even care about data coming from different sources (e.g. the S3 connector just dumps the record data into S3, regardless of where it comes from). Other destinations might care, in case they need to route records to different places (e.g. a Postgres destination might want to write record A to table A and record B to table B, or a Kafka destination might want to write records to different topics). That seems like the use case we are discussing here.

Given the format defined in the OpenCDC doc, we don't have a specific field for differentiating the record source. Instead, the doc proposes metadata fields that source connectors/transforms can populate and destination connectors can read to tweak their behavior (I see them as similar to HTTP headers: not mandatory, but useful for tweaking behavior). That looks like the right place for routing data.

So at this point, the real question in my mind is: should we standardize a metadata field that destination connectors can use to route records? What I mean is, if a connector reads from multiple sources (e.g. multiple tables, topics...), it stores the information about where the record came from in this standard metadata field, and a destination connector uses that field to route the record to the correct destination (e.g. target table, target topic...). I think there are pros and cons to introducing such a field.

Pros:
Cons:
What other option do we have? The one I see is to give the destination connector the freedom to use any metadata field it wants for routing. This means that, as a user, I would have to check the destination connector's documentation and create a transform that populates the metadata field(s) the connector needs for routing. The pros and cons of this option are essentially swapped.

I'm inclined towards introducing a standard metadata field for routing. That said, if we go with this option, we need to think more about what information it should contain (and in what format), and how a destination connector should behave if the field is not populated or contains an unexpected value.
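To make the idea concrete, here is a minimal, self-contained Go sketch of a destination routing records by such a standard metadata field. None of this is the real Conduit SDK: the Record type, the opencdc.collection key name, and the fallback behavior are assumptions for illustration only.

```go
package main

import "fmt"

// Record is a simplified stand-in for an OpenCDC record; the real type in
// the Conduit SDK also carries a payload, key, position, etc.
type Record struct {
	Metadata map[string]string
}

// collectionKey is a hypothetical standardized metadata field that a source
// connector would populate with the record's origin (table, topic, ...).
const collectionKey = "opencdc.collection"

// route picks the target table for a record. When the standard field is
// absent it falls back to a connector-level default, which is exactly the
// behavior a standard would need to spell out.
func route(r Record, defaultTable string) (string, error) {
	if table, ok := r.Metadata[collectionKey]; ok && table != "" {
		return table, nil
	}
	if defaultTable != "" {
		return defaultTable, nil
	}
	return "", fmt.Errorf("record has no %q metadata and no default table is configured", collectionKey)
}

func main() {
	r := Record{Metadata: map[string]string{collectionKey: "orders"}}
	table, err := route(r, "events")
	if err != nil {
		panic(err)
	}
	fmt.Printf("would write record to table %q\n", table) // would write record to table "orders"
}
```

The fallback branch is exactly the part a standard would need to specify: what a destination should do when the field is missing or holds an unexpected value.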
-
I agree that the larger problem of multiple possible sources per pipeline is still our root unsolved issue here. We have some established connector conventions around how to handle sub-resources like tables in a source. Given our transform support, I think connectors handling
Some more possible solutions that I see:
This is a hard question. Constraining routing information to one field will likely require additional formatting or additional reliance on transforms. Enforcing or encouraging a standard like the URL spec could create connector-compatibility issues and would further require testing connector edge cases.
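As an illustration of that formatting burden, here is a hedged Go sketch that parses a URL-style routing value. The scheme://db/schema/table layout is an invented convention, not anything OpenCDC defines, and the error branches hint at the edge cases every connector would have to test.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseRoute splits a URL-style routing value into its parts. The layout
// (scheme = connector type, host = database, path = schema/table) is an
// invented convention used purely for illustration.
func parseRoute(raw string) (kind, db, schema, table string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", "", "", fmt.Errorf("invalid routing value %q: %w", raw, err)
	}
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	if u.Scheme == "" || u.Host == "" || len(parts) != 2 {
		// Malformed values like these are exactly what every connector
		// would have to handle if a URL convention were enforced.
		return "", "", "", "", fmt.Errorf("routing value %q does not match scheme://db/schema/table", raw)
	}
	return u.Scheme, u.Host, parts[0], parts[1], nil
}

func main() {
	kind, db, schema, table, err := parseRoute("postgres://mydb/public/orders")
	if err != nil {
		panic(err)
	}
	fmt.Println(kind, db, schema, table) // postgres mydb public orders
}
```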
I think I'm almost inclined towards no at this point. If we're not decreasing reliance on documentation, if we're not able to broadly benefit connectors, if some connectors already ignore a record's source (like the S3 connector), and if enforcement is difficult at best and impossible at worst, then this seems like a ball of complexity that we can feasibly ignore.
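For completeness, the documentation-plus-transform alternative could look roughly like this: a self-contained Go sketch (not a real Conduit processor; both metadata keys are hypothetical) of a transform that copies a source-specific metadata key into whatever key a given destination documents for routing.

```go
package main

import "fmt"

// Record is a simplified stand-in for an OpenCDC record, as above.
type Record struct {
	Metadata map[string]string
}

// renameMetadata returns a transform that copies the value of one metadata
// key into another. A user would configure it per pipeline after reading
// the destination connector's documentation.
func renameMetadata(from, to string) func(Record) Record {
	return func(r Record) Record {
		if v, ok := r.Metadata[from]; ok {
			r.Metadata[to] = v
		}
		return r
	}
}

func main() {
	// Hypothetical keys: the Postgres source writes "postgres.table", the
	// Kafka destination reads "kafka.topic".
	t := renameMetadata("postgres.table", "kafka.topic")
	r := t(Record{Metadata: map[string]string{"postgres.table": "orders"}})
	fmt.Println(r.Metadata["kafka.topic"]) // orders
}
```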
-
Closing the loop on this matter. As recently announced, we now support multiple collections: https://meroxa.com/blog/conduit-0.10-comes-with-multiple-collections-support/. Our built-in PostgreSQL connector has included support for multiple tables since version 0.7.0: https://github.com/ConduitIO/conduit-connector-postgres/releases/tag/v0.7.0
-
Postgres connector issue #23 brings up the point that a Postgres connector is bound to one table and one replication slot per instance.
Postgres replication slots support listening for changes to multiple tables; however, this requires additional consideration of how destination connectors would handle data from multiple tables arriving through a single source. (A minimal sketch of the multi-table slot/publication setup follows.)
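For context, with logical replication a slot is typically paired with a publication, and a single publication can cover several tables. Here is a minimal Go sketch of that setup using database/sql; the connection string, publication name, slot name, and table names are all placeholders.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	// Connection string and object names are placeholders.
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A publication can cover several tables; a single logical replication
	// slot created with the pgoutput plugin then streams changes for all of them.
	if _, err := db.Exec(`CREATE PUBLICATION conduit_pub FOR TABLE employees, orders`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`SELECT pg_create_logical_replication_slot('conduit_slot', 'pgoutput')`); err != nil {
		log.Fatal(err)
	}
}
```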
If we wanted to support multiple tables per Postgres connector, what would that look like? What would the consequences be for destination connectors? Is this a pattern we'd like to establish for connectors that can manage multiple sub-groups of data?
If not, do we want to actively discourage this pattern? Is this something we should encourage users to solve at a different layer?