Replies: 3 comments
-
A Conduit pipeline already supports multiple sources. Even if we constrain the Postgres connector to read only from one table at a time, we can have multiple Postgres sources in a pipeline, so a destination needs to be able to handle this. For that reason, I think we will have to ensure destination connectors can somehow differentiate records coming from different sources (or different tables, in case the Postgres source listens to multiple tables). That's also why I think the source connector should support reading from multiple tables; it is an optimization that achieves the same result as having multiple sources.

Now, how do we expect a destination connector to handle data coming from different sources? That depends entirely on the destination connector. First, I'd like to point out that many destinations might not even care about data coming from different sources (e.g. the S3 connector just dumps the record data into S3, regardless of where it comes from). Other destinations might care, in case they need to route records to different places (e.g. a Postgres destination might want to write record A to table A and record B to table B, or a Kafka destination might want to write records to different topics). That seems like the use case we are discussing here.

Given the format defined in the OpenCDC doc, we don't have a specific field for differentiating the record source. Instead, the doc proposes metadata fields that source connectors/transforms can populate and destination connectors can read to tweak their behavior (I see them as similar to HTTP headers: not mandatory, but useful for tweaking behavior). That looks like the right place for routing data.

So at this point, the real question in my mind is: should we standardize a metadata field that destination connectors can use to route records? What I mean is, if a connector reads from multiple sources (e.g. multiple tables, topics...), it stores the information about where the record came from in this standard metadata field, and a destination connector uses that field to route the record to the correct destination (e.g. target table, target topic...). I think there are pros and cons to introducing such a field.

Pros:
Cons:
What other option do we have? The one I see is to give the destination connector the freedom to use any metadata field it wants for routing. This means that, as a user, I would have to check the destination connector's documentation and create a transform that populates the metadata field(s) the connector needs for routing. The pros and cons of this option are essentially swapped.

I'm inclined towards introducing a standard metadata field for routing. That said, if we go with this option, we need to think more about what information it should contain (and in what format), and how a destination connector should behave if the field is not populated or contains an unexpected value.
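To make the idea concrete, here is a minimal, self-contained Go sketch of a destination routing records by such a standard metadata field. None of this is the real Conduit SDK: the Record type, the opencdc.collection key name, and the fallback behavior are assumptions for illustration only.

```go
package main

import "fmt"

// Record is a simplified stand-in for an OpenCDC record; the real type in
// the Conduit SDK also carries a payload, key, position, etc.
type Record struct {
	Metadata map[string]string
}

// collectionKey is a hypothetical standardized metadata field that a source
// connector would populate with the record's origin (table, topic, ...).
const collectionKey = "opencdc.collection"

// route picks the target table for a record. When the standard field is
// absent it falls back to a connector-level default, which is exactly the
// behavior a standard would need to spell out.
func route(r Record, defaultTable string) (string, error) {
	if table, ok := r.Metadata[collectionKey]; ok && table != "" {
		return table, nil
	}
	if defaultTable != "" {
		return defaultTable, nil
	}
	return "", fmt.Errorf("record has no %q metadata and no default table is configured", collectionKey)
}

func main() {
	r := Record{Metadata: map[string]string{collectionKey: "orders"}}
	table, err := route(r, "events")
	if err != nil {
		panic(err)
	}
	fmt.Printf("would write record to table %q\n", table) // would write record to table "orders"
}
```

The fallback branch is exactly the part a standard would need to specify: what a destination should do when the field is missing or holds an unexpected value.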
-
I agree that the larger problem of multiple possible sources per pipeline is still our root unsolved issue here. We have some established connector conventions around how to handle sub-resources like tables in a source. Given our transform support, I think connectors handling
Some more possible solutions that I see:
This is a hard question. Constraining routing information to one field will likely require additional formatting or additional reliance on transforms. Enforcing or encouraging a standard like the URL spec could create connector-compatibility issues and would further require testing connector edge cases.
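As an illustration of that formatting burden, here is a hedged Go sketch that parses a URL-style routing value. The scheme://db/schema/table layout is an invented convention, not anything OpenCDC defines, and the error branches hint at the edge cases every connector would have to test.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseRoute splits a URL-style routing value into its parts. The layout
// (scheme = connector type, host = database, path = schema/table) is an
// invented convention used purely for illustration.
func parseRoute(raw string) (kind, db, schema, table string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", "", "", fmt.Errorf("invalid routing value %q: %w", raw, err)
	}
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	if u.Scheme == "" || u.Host == "" || len(parts) != 2 {
		// Malformed values like these are exactly what every connector
		// would have to handle if a URL convention were enforced.
		return "", "", "", "", fmt.Errorf("routing value %q does not match scheme://db/schema/table", raw)
	}
	return u.Scheme, u.Host, parts[0], parts[1], nil
}

func main() {
	kind, db, schema, table, err := parseRoute("postgres://mydb/public/orders")
	if err != nil {
		panic(err)
	}
	fmt.Println(kind, db, schema, table) // postgres mydb public orders
}
```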
I think I'm almost inclined towards no at this point. If we're not decreasing reliance on documentation, if we're not able to broadly benefit connectors, if some connectors already ignore a record's source (like the S3 connector), and if enforcement is difficult at best and impossible at worst, then this seems like a ball of complexity that we can feasibly ignore.
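For completeness, the documentation-plus-transform alternative could look roughly like this: a self-contained Go sketch (not a real Conduit processor; both metadata keys are hypothetical) of a transform that copies a source-specific metadata key into whatever key a given destination documents for routing.

```go
package main

import "fmt"

// Record is a simplified stand-in for an OpenCDC record, as above.
type Record struct {
	Metadata map[string]string
}

// renameMetadata returns a transform that copies the value of one metadata
// key into another. A user would configure it per pipeline after reading
// the destination connector's documentation.
func renameMetadata(from, to string) func(Record) Record {
	return func(r Record) Record {
		if v, ok := r.Metadata[from]; ok {
			r.Metadata[to] = v
		}
		return r
	}
}

func main() {
	// Hypothetical keys: the Postgres source writes "postgres.table", the
	// Kafka destination reads "kafka.topic".
	t := renameMetadata("postgres.table", "kafka.topic")
	r := t(Record{Metadata: map[string]string{"postgres.table": "orders"}})
	fmt.Println(r.Metadata["kafka.topic"]) // orders
}
```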
-
Closing the loop on this matter. As recently announced, we now support multiple collections: https://meroxa.com/blog/conduit-0.10-comes-with-multiple-collections-support/. Our built-in PostgreSQL connector has included support for multiple tables since version 0.7.0: https://github.com/ConduitIO/conduit-connector-postgres/releases/tag/v0.7.0
-
Postgres connector issue #23 brings up the point that a Postgres connector is bound to one table and one replication slot per instance.
Postgres replication slots support listening for changes to multiple tables; however, this requires additional consideration of how destination connectors would handle data from multiple tables arriving through a single source. (A minimal sketch of the multi-table slot/publication setup follows.)
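For context, with logical replication a slot is typically paired with a publication, and a single publication can cover several tables. Here is a minimal Go sketch of that setup using database/sql; the connection string, publication name, slot name, and table names are all placeholders.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	// Connection string and object names are placeholders.
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A publication can cover several tables; a single logical replication
	// slot created with the pgoutput plugin then streams changes for all of them.
	if _, err := db.Exec(`CREATE PUBLICATION conduit_pub FOR TABLE employees, orders`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`SELECT pg_create_logical_replication_slot('conduit_slot', 'pgoutput')`); err != nil {
		log.Fatal(err)
	}
}
```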
If we wanted to support multiple tables per Postgres connector, what would that look like? What would the consequences be for destination connectors? Is this a pattern we'd like to establish for connectors that can manage multiple sub-groups of data?
If not, do we want to actively discourage this pattern? Is this something we should encourage users to solve at a different layer?