
Namespace destination output #1921

Closed
ChristopheDuong opened this issue Feb 1, 2021 · 11 comments
Labels
type/enhancement New feature or request

Comments

@ChristopheDuong
Contributor

ChristopheDuong commented Feb 1, 2021

Tell us about the problem you're trying to solve

Problems:

1. Multiple sources can sync streams to the same destination, resulting in conflicts

2. A single source wants to sync to multiple destinations:

3. Advanced configuration to build complex sync pipelines:

  • Circular sync from DB1.public.table_A -> DB2.public.table_B -> DB1.public.table_A (see thread in Slack)
  • Salesforce -> Postgres -> Salesforce use case (see thread in Slack)

4. Constraints on naming:

5. Cleaning up and customizing names:

Current State:

  • Schema name is defined in the destination connector (should be normalized)

    • Table name conflicts can arise and are difficult to handle (Problem 1.a)
    • Writing to multiple schemas implies creating multiple destinations with the same config but different output schemas (to solve Problem 2.b)
    • One source can’t output to multiple schemas at once; if you want to output to 100 schemas, you need 100 destinations (Problem 2.c)
  • The source’s stream name is:

    • Always normalized using “StandardName” in the destination
      • “ExtendedName” is no longer applied, so quoting/special characters are not allowed; every special character is replaced by _ (to mimic the “clean_name” displayed in the UI)
      • We can’t normalize the stream name in the catalog (in the source) because the source needs to remember the original names in order to query them.
    • Database Source’s stream name is source_schema_name.table_name
    • Non-database Source’s stream name is table_name
  • Final naming is therefore:

    • destination_schema.stream_name
    • => destination_schema(.source_schema).table_name
  • _airbyte_raw_ prefix to table names
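
As a rough illustration of the current behavior described above, the naming resolution could be sketched like this (the function names are assumptions for illustration, not Airbyte's actual code):

```python
import re

def standard_name(name: str) -> str:
    # "StandardName" normalization as described above: every character
    # that is not alphanumeric or underscore is replaced by "_"
    # (mimicking the "clean_name" displayed in the UI).
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)

def current_final_name(destination_schema: str, stream_name: str) -> str:
    # Current behavior sketch: the raw table lands in the destination's
    # configured schema, with the normalized stream name prefixed by
    # "_airbyte_raw_". For database sources the stream name is already
    # "source_schema.table_name", so the dot becomes "_".
    return f"{destination_schema}._airbyte_raw_{standard_name(stream_name)}"

print(current_final_name("destination_schema", "public.table_A"))
# destination_schema._airbyte_raw_public_table_A
```

This is exactly why Problem 1 arises: two different sources that both expose a `public.table_A` stream collapse onto the same destination table.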

Describe the solution you’d like

  • Database Source defines a schema Source_schema
    • This is optional and up to the source if this information makes sense.
    • The source needs it to remember which schema each table comes from
  • Source_schema.table_name becomes the source’s stream name
  • Source Connector is defined with a name that can be used as schema src_conn_name
    • Source name should be unique (to avoid conflict Problem 1.a)
    • Either restrict Source name with a regex
    • Or name will be normalized into StandardNames (use ‘_’)
    • (Optional) Or add a choice to normalize it depending on the destination with StandardName or ExtendedName
  • For each stream, UI can define:
    • Overrides: schema_override (namespace) and table_override
    • UI proposes as override the following default values, but the user can edit at will:
      • src_conn_name + _ + Source_schema.table_name
    • UI allows overrides that may result in conflicts with other connectors (warning message? For problem 3)
  • The final naming is:
    • schema_override.table_override
    • (then it goes through name normalization from destination)
  • _airbyte_raw_ prefix to schema instead? (see Problem 4 and 5)
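
The proposed resolution above could be sketched roughly as follows (all names here are illustrative assumptions, not the actual implementation):

```python
import re

def standard_name(name: str) -> str:
    # Destination-side "StandardName" normalization: any character that
    # is not alphanumeric or underscore becomes "_".
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)

def default_override(src_conn_name: str, source_schema: str, table_name: str):
    # UI-proposed defaults, which the user can edit at will:
    # schema_override = src_conn_name + "_" + Source_schema,
    # table_override = the original table name.
    if source_schema:
        schema_override = f"{src_conn_name}_{source_schema}"
    else:
        schema_override = src_conn_name
    return schema_override, table_name

def proposed_final_name(schema_override: str, table_override: str) -> str:
    # Final naming: schema_override.table_override, then each part goes
    # through the destination's name normalization.
    return f"{standard_name(schema_override)}.{standard_name(table_override)}"

schema, table = default_override("my_postgres", "public", "table_A")
print(proposed_final_name(schema, table))  # my_postgres_public.table_A
```

Because the connector name `src_conn_name` is unique, the defaults avoid the collisions of Problem 1.a, while editable overrides cover Problems 2 and 3.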
@ChristopheDuong
Contributor Author

Questions on how to handle this change to the protocol here:
#1993 (comment)

@ChristopheDuong
Contributor Author

Open questions to discuss and act on this week:

  • Naming conventions
  • Protocol implementation/usage

See the Open-question section here:
https://docs.google.com/document/d/1EWBHuZ524K2Z9HJGlT-I9fqf0ba_pVh26NwMICHAqfs/edit?usp=sharing

@ChristopheDuong
Contributor Author

ChristopheDuong commented Feb 26, 2021

After discussions, we will be splitting the work on this topic as follows:

I will create issues or PRs and assign them to the next milestone instead of carrying this issue (which is big and might span multiple iterations/milestones)

@cgardens
Contributor

cgardens commented Mar 5, 2021

Just adding another user request for this issue. They want to sync Postgres to Postgres and don't like that public_ is prepended to table names; they just want the destination DB to have exactly the same table names as the source. @ChristopheDuong I know this is a use case you've mentioned before too. Will this be possible when this project is complete?

@ChristopheDuong
Contributor Author

Yes, we can handle it at the end of the project; I also refer to it in this comment:
#2298 (comment)

@roshan
Contributor

roshan commented Mar 30, 2021

Just a quick +1 with some additional info on the use-case I have. The primary use of Airbyte for me is to take data from various places and put it into a DB intended for OLAP use-cases. When these sources were made, we did not consider this possibility so multiple DBs contain a notion of 'user' with the name 'users'. Without namespacing on a connection (or on the source) I'm stuck with overwriting public_users each time.

What I'm currently doing

Roughly, psql -c 'SELECT * FROM users' and psql -c 'INSERT INTO TABLE catalog_users VALUES'

@ChristopheDuong
Contributor Author

> Without namespacing on a connection (or on the source) I'm stuck with overwriting public_users each time.

In the latest Airbyte versions, there is now a prefix namespace on the connection page so you can solve this kind of conflict!
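
For illustration, a per-connection prefix resolves this conflict roughly like so (hypothetical helper, not Airbyte's code):

```python
def prefixed_table_name(prefix: str, stream_name: str) -> str:
    # With a per-connection prefix, two sources that both expose a
    # "users" stream land in distinct destination tables.
    return f"{prefix}{stream_name}"

print(prefixed_table_name("catalog_", "users"))  # catalog_users
print(prefixed_table_name("billing_", "users"))  # billing_users
```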

@roshan
Contributor

roshan commented Apr 2, 2021

Thank you!

@davinchia
Contributor

@roshan with 0.21.0-alpha, supported connectors will automatically duplicate the source schema into the destination. See this documentation - should make it even easier for you!
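
A rough sketch of the two namespace strategies discussed in this thread (mirroring the source schema vs. using the destination default); the function and strategy names are assumptions for illustration, not Airbyte's actual API:

```python
from typing import Optional

def resolve_namespace(strategy: str,
                      source_namespace: Optional[str],
                      destination_default: str) -> str:
    # "mirror": keep the source's schema in the destination, so
    # public.users in the source stays in a "public" schema.
    # "destination": fall back to the schema configured on the destination.
    if strategy == "mirror" and source_namespace:
        return source_namespace
    return destination_default

print(resolve_namespace("mirror", "public", "airbyte_default"))       # public
print(resolve_namespace("destination", "public", "airbyte_default"))  # airbyte_default
```

Sources without a schema notion (e.g. API sources) have no `source_namespace`, so they fall through to the destination default under either strategy.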

@andresbravog
Contributor

Just adding another user request for this issue. We want to sync Postgres to BigQuery, and we don't want the final dataset to be named public (the default namespace is not overwritten by the BigQuery destination setting).

Is there a way to work around this?

@davinchia
Contributor

davinchia commented May 19, 2021

I'm going to close this issue now as it's too big in scope to be useful and somewhat out-of-date. Our next step for this will be #3481.

I will go through the rest of the linked issues and deal with them at a later date.
