
Namespace destination output #1921

Closed
ChristopheDuong opened this issue Feb 1, 2021 · 11 comments
Labels
type/enhancement New feature or request

Comments

@ChristopheDuong
Contributor

ChristopheDuong commented Feb 1, 2021

Tell us about the problem you're trying to solve

Problems:

1. Multiple sources can sync streams to the same destination, resulting in conflicts

2. A single source wants to sync to multiple destinations:

3. Advanced configuration to build complex sync pipelines:

  • Circular sync from DB1.public.table_A -> DB2.public.table_B -> DB1.public.table_A (see thread in Slack)
  • Salesforce -> Postgres -> Salesforce use case (see thread in Slack)

4. Constraints on naming:

5. Cleaning up and customizing names:

Current State:

  • Schema name is defined in the destination connector (should be normalized)

    • Table name conflicts can arise and are difficult to handle (Problem 1.a)
    • Writing to multiple schemas implies creating multiple destinations with the same config but different output schemas (to solve Problem 2.b)
    • One source can’t output to multiple schemas at once; if you want to output to 100 schemas, you need 100 destinations (Problem 2.c)
  • The source’s stream name is:

    • Always normalized using “StandardName” in the destination
      • “ExtendedName” is no longer applied, so quoting/special characters are not allowed; every special character is replaced by _ (to mimic the “clean_name” displayed in the UI)
      • We can’t normalize the stream name in the catalog (in the source) because the source needs to remember the original names in order to query them.
    • Database Source’s stream name is source_schema_name.table_name
    • Non-database Source’s stream name is table_name
  • Final naming is therefore:

    • destination_schema.stream_name
    • => destination_schema(.source_schema).table_name
  • _airbyte_raw_ prefix to table names
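
As a rough illustration of the current behavior described above, the naming resolution could be sketched like this (the function names are assumptions for illustration, not Airbyte's actual code):

```python
import re

def standard_name(name: str) -> str:
    # "StandardName" normalization as described above: every character
    # that is not alphanumeric or underscore is replaced by "_"
    # (mimicking the "clean_name" displayed in the UI).
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)

def current_final_name(destination_schema: str, stream_name: str) -> str:
    # Current behavior sketch: the raw table lands in the destination's
    # configured schema, with the normalized stream name prefixed by
    # "_airbyte_raw_". For database sources the stream name is already
    # "source_schema.table_name", so the dot becomes "_".
    return f"{destination_schema}._airbyte_raw_{standard_name(stream_name)}"

print(current_final_name("destination_schema", "public.table_A"))
# destination_schema._airbyte_raw_public_table_A
```

This is exactly why Problem 1 arises: two different sources that both expose a `public.table_A` stream collapse onto the same destination table.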

Describe the solution you’d like

  • Database Source defines a schema Source_schema
    • This is optional and up to the source if this information makes sense.
    • The source needs it to remember which schema each table comes from
  • Source_schema.table_name becomes the source’s stream name
  • Source Connector is defined with a name that can be used as schema src_conn_name
    • Source name should be unique (to avoid conflict Problem 1.a)
    • Either restrict Source name with a regex
    • Or name will be normalized into StandardNames (use ‘_’)
    • (Optional) Or add a choice to normalize it depending on the destination with StandardName or ExtendedName
  • For each stream, UI can define:
    • Overrides: schema_override (namespace) and table_override
    • UI proposes as override the following default values, but the user can edit at will:
      • src_conn_name + _ + Source_schema.table_name
    • UI allows overrides that may result in conflicts with other connectors (warning message? For problem 3)
  • The final naming is:
    • schema_override.table_override
    • (then it goes through name normalization from destination)
  • _airbyte_raw_ prefix to schema instead? (see Problem 4 and 5)
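
The proposed resolution above could be sketched roughly as follows (all names here are illustrative assumptions, not the actual implementation):

```python
import re

def standard_name(name: str) -> str:
    # Destination-side "StandardName" normalization: any character that
    # is not alphanumeric or underscore becomes "_".
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)

def default_override(src_conn_name: str, source_schema: str, table_name: str):
    # UI-proposed defaults, which the user can edit at will:
    # schema_override = src_conn_name + "_" + Source_schema,
    # table_override = the original table name.
    if source_schema:
        schema_override = f"{src_conn_name}_{source_schema}"
    else:
        schema_override = src_conn_name
    return schema_override, table_name

def proposed_final_name(schema_override: str, table_override: str) -> str:
    # Final naming: schema_override.table_override, then each part goes
    # through the destination's name normalization.
    return f"{standard_name(schema_override)}.{standard_name(table_override)}"

schema, table = default_override("my_postgres", "public", "table_A")
print(proposed_final_name(schema, table))  # my_postgres_public.table_A
```

Because the connector name `src_conn_name` is unique, the defaults avoid the collisions of Problem 1.a, while editable overrides cover Problems 2 and 3.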
@ChristopheDuong
Contributor Author

Questions on how to handle this change to the protocol here:
#1993 (comment)

@ChristopheDuong
Contributor Author

Open questions to discuss and act on this week:

  • Naming conventions
  • Protocol implementation/usage

See the Open-question section here:
https://docs.google.com/document/d/1EWBHuZ524K2Z9HJGlT-I9fqf0ba_pVh26NwMICHAqfs/edit?usp=sharing

@ChristopheDuong
Contributor Author

ChristopheDuong commented Feb 26, 2021

After discussions, we will be splitting the work on this topic as follows:

I will create issues or PRs and assign them to the next milestone instead of carrying this issue (which is big and might span multiple iterations/milestones)

@cgardens
Contributor

cgardens commented Mar 5, 2021

Just adding another user request for this issue. They want to sync Postgres to Postgres and don't like that public_ is prepended to table names; they just want the destination DB to have exactly the same table names as the source. @ChristopheDuong I know this is a use case you've mentioned before too. Will this be possible when this project is complete?

@ChristopheDuong
Contributor Author

Yes, we can handle it at the end of the project; I also refer to it in this comment:
#2298 (comment)

@roshan
Contributor

roshan commented Mar 30, 2021

Just a quick +1 with some additional info on the use-case I have. The primary use of Airbyte for me is to take data from various places and put it into a DB intended for OLAP use-cases. When these sources were made, we did not consider this possibility so multiple DBs contain a notion of 'user' with the name 'users'. Without namespacing on a connection (or on the source) I'm stuck with overwriting public_users each time.

What I'm currently doing

Roughly, psql -c 'SELECT * FROM users' and psql -c 'INSERT INTO TABLE catalog_users VALUES'

@ChristopheDuong
Contributor Author

> Without namespacing on a connection (or on the source) I'm stuck with overwriting public_users each time.

In the latest Airbyte versions, there is now a prefix namespace on the connection page so you can solve this kind of conflict!
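
For illustration, a per-connection prefix resolves this conflict roughly like so (hypothetical helper, not Airbyte's code):

```python
def prefixed_table_name(prefix: str, stream_name: str) -> str:
    # With a per-connection prefix, two sources that both expose a
    # "users" stream land in distinct destination tables.
    return f"{prefix}{stream_name}"

print(prefixed_table_name("catalog_", "users"))  # catalog_users
print(prefixed_table_name("billing_", "users"))  # billing_users
```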

@roshan
Contributor

roshan commented Apr 2, 2021

Thank you!

@davinchia
Contributor

@roshan with 0.21.0-alpha, supported connectors will automatically duplicate the source schema into the destination. See this documentation - should make it even easier for you!
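
A rough sketch of the two namespace strategies discussed in this thread (mirroring the source schema vs. using the destination default); the function and strategy names are assumptions for illustration, not Airbyte's actual API:

```python
from typing import Optional

def resolve_namespace(strategy: str,
                      source_namespace: Optional[str],
                      destination_default: str) -> str:
    # "mirror": keep the source's schema in the destination, so
    # public.users in the source stays in a "public" schema.
    # "destination": fall back to the schema configured on the destination.
    if strategy == "mirror" and source_namespace:
        return source_namespace
    return destination_default

print(resolve_namespace("mirror", "public", "airbyte_default"))       # public
print(resolve_namespace("destination", "public", "airbyte_default"))  # airbyte_default
```

Sources without a schema notion (e.g. API sources) have no `source_namespace`, so they fall through to the destination default under either strategy.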

@andresbravog
Contributor

Just adding another user request for this issue. We want to sync Postgres to BigQuery, and we don't want the final dataset to be named public (the default namespace is not overwritten by the BigQuery destination setting).

Is there a way to work around this?

@davinchia
Contributor

davinchia commented May 19, 2021

I'm going to close this issue now as it's too big in scope to be useful and somewhat out-of-date. Our next step for this will be #3481.

I will go through the rest of the linked issues and deal with them at a later date.
