
Naming conventions managed in destinations #1060

Merged: 14 commits, Nov 25, 2020

Conversation

@ChristopheDuong (Contributor) commented Nov 23, 2020

What

In #1048, we should go through each integration and determine what characters it can reasonably support, then reuse some of the name normalization code across destinations.

How

Replace the NamingHelper static methods with an interface and implementation classes that handle the specific naming rules of each destination.
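
To make the shape concrete, here is a minimal sketch of what such an interface could look like; the names below are illustrative only and not necessarily the ones used in this PR:

// Illustrative sketch only: one naming interface, with one implementation per destination.
public interface DestinationNamingResolver {

  // Turn an arbitrary stream or schema name into an identifier this destination accepts.
  String toIdentifier(String name);

  // Resolve the schema/dataset a stream should be written to.
  String toSchemaName(String name);
}

Each destination (Postgres, BigQuery, Snowflake, ...) would then supply its own implementation instead of calling shared static helpers.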

Here is my plan so far:

  • destinations are allowed to choose the target schema where data will be written (see Destination-postgres should write data in a table from the specified schema #1059, which makes this consistent across all destinations)
  • sources are allowed to output <schema_name>.<table_name> as stream names (as described in Change jdbc sources to discover more than standard schemas #1038)
  • in the destination's spec.json, a new option to allow (or not) sources to override the final target schema (sketched just after this list)
    • if allowed, then when a source outputs <schema_name>.<table_name>, it will be written to <schema_name>.<table_name> in the destination
    • if not allowed, then when a source outputs <schema_name>.<table_name>, it will be written to <destination_target_schema>.<schema_name>_<table_name>
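
As a rough illustration of that override rule (assuming a hypothetical boolean flag in the destination's spec.json; the class and field names here are made up for the example):

// Illustrative only: where a stream named "<schema>.<table>" would land in the destination.
public class TargetNameResolver {

  private final String destinationTargetSchema;    // schema/dataset configured on the destination
  private final boolean allowSourceSchemaOverride; // hypothetical flag from spec.json

  public TargetNameResolver(String destinationTargetSchema, boolean allowSourceSchemaOverride) {
    this.destinationTargetSchema = destinationTargetSchema;
    this.allowSourceSchemaOverride = allowSourceSchemaOverride;
  }

  // Returns {schema, table} for a stream name such as "public.users" or plain "users".
  public String[] resolve(String streamName) {
    final int dot = streamName.indexOf('.');
    if (dot < 0) {
      return new String[] {destinationTargetSchema, streamName};
    }
    final String sourceSchema = streamName.substring(0, dot);
    final String table = streamName.substring(dot + 1);
    if (allowSourceSchemaOverride) {
      // allowed: keep the source schema as the target schema
      return new String[] {sourceSchema, table};
    }
    // not allowed: write into the destination schema, prefixing the table with the source schema
    return new String[] {destinationTargetSchema, sourceSchema + "_" + table};
  }
}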

Some destinations allow extended SQL naming using delimited identifiers (names wrapped in " quotes), which allows extra special characters in the names.

Destinations that don't allow such extensions will replace those invalid characters with '_' instead.
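
For illustration, the two behaviors could look roughly like this (assumed class names, not the actual ones from this PR):

// Illustrative only: two ways a destination can deal with characters it does not support.

// Destinations with extended SQL naming can keep the original name by quoting it.
public class QuotedNameTransformer {

  public String toIdentifier(String name) {
    // Escape embedded double quotes, then wrap the whole name in delimited-identifier quotes.
    return "\"" + name.replace("\"", "\"\"") + "\"";
  }
}

// Destinations without that extension fall back to replacing unsupported characters with '_'.
public class UnderscoreNameTransformer {

  public String toIdentifier(String name) {
    return name.replaceAll("[^A-Za-z0-9_]", "_");
  }
}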

@ChristopheDuong force-pushed the chris/destination-naming branch from cf741f7 to 3a45876 on November 23, 2020 20:21
@cgardens (Contributor) left a comment

I like your approach a lot! The interface is a good idea. I think this is a nice balance of making it configurable for all destinations while also reusing the name cleaning code.


One thing I am not convinced of is this bit:

in the destination's spec.json, a new option to allow (or not) sources to override the final target schema
* if allowed, then when a source outputs <schema_name>.<table_name>, it will be written to <schema_name>.<table_name> in the destination
* if not allowed, then when a source outputs <schema_name>.<table_name>, it will be written to <destination_target_schema>.<schema_name>_<table_name>

Is the idea that in most cases the destination just picks a default schema, but if the user overrides it then we use that schema instead? That part makes sense to me, I think. That just means adding a "schema" option in all destinations? In the case of BQ we already have that with datasetId. With PG we have the option but don't respect it (which you fixed). etc? That seems fine / we seem to already do that. (Maybe we need to add it for Snowflake?)

What I don't like is that the table name changes: in some places it is <schema_name>_<table_name> and in others it is <table_name>. Can we just always do <schema_name>_<table_name>, even if someone is specifying which schema they are putting the table into? Is there a reason you feel they have to be different?

We should not try to introspect the schema names from stream names. It is okay to let the user specify the target schema for the whole sync, but we should not try to extract it from stream names.

  1. The different table names are confusing; we should always do <schema_name>_<table_name>
    1. it is not intuitive that the names change based on this configuration
    2. the case that definitely breaks if we don't keep the schema name is someone replicating from a multi-tenant database into a single schema in the destination database; if we don't namespace the table names with the schema names, they will collide
  2. This introspection on the stream names is brittle in the case of non-database sources, e.g. facebook has a stream called destinations.list. I believe the implementation you're suggesting is going to accidentally put the list stream in its own schema, which is not what we want.

I strongly think we need to treat stream names as dumb strings. We clean out characters that are not allowed, but we can't try to extract schema names.
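
To make the concern about introspection concrete, here is a small hypothetical example (not code from this PR) of how a naive split on '.' would misroute a non-database stream:

// Illustrative only: why extracting a schema from a stream name is brittle.
public class StreamNameSplitExample {

  public static void main(String[] args) {
    final String streamName = "destinations.list"; // a facebook stream, not "<schema>.<table>"
    final String[] parts = streamName.split("\\.", 2);
    // Prints schema=destinations table=list: the "list" stream would end up in its own
    // "destinations" schema, even though the source never meant that.
    System.out.println("schema=" + parts[0] + " table=" + parts[1]);
  }
}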


Please add some unit tests to make sure we cover all the strings we need to. You can steal some of Jared's test cases from here.

Just leaving a comment since this is still a work in progress / draft.

@ChristopheDuong (Contributor, Author) commented Nov 24, 2020

I want to have the option of avoiding, as much as possible, renaming the table names...

The reason is that, as a user of the data down the line, I probably have SQL queries that work on my source data system and that I want to copy-paste and use with the replicated data in the destination system. Renaming table names will prevent me from doing that easily, whereas adapting the schema name should be an easier change.

Especially when I have data in dev/production environments or in a multi-tenant setup (whether in two different databases or just in two different schemas) where the tables have the exact same names, I would like to keep it that way and have the freedom to change the destination schema/dataset name instead.

(More details are in this issue's comments: #973 (comment))

@ChristopheDuong (Contributor, Author) commented:

@michel-tricot also mentioned:

we could decide to change the struct behind the stream name and instead of a string make it an object that contains the name and the namespace

Maybe database sources could have this separate option to explicitly provide such "schema overrides" (outside of stream names?) to handle these namespaces?
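
If we went the struct route michel-tricot mentions, it might look something along these lines (the type and field names here are purely illustrative):

// Illustrative only: carrying the namespace separately instead of encoding it in the stream name.
public class StreamDescriptor {

  private final String namespace; // e.g. the source schema, or null when the source has none
  private final String name;      // the plain stream/table name, treated as a dumb string

  public StreamDescriptor(String namespace, String name) {
    this.namespace = namespace;
    this.name = name;
  }

  public String getNamespace() {
    return namespace;
  }

  public String getName() {
    return name;
  }
}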

@Upgreydd (Contributor) commented:

Hello.
In my opinion, destinations should be split into more granular pieces than they are now. An example for BigQuery:

  • configure the BigQuery destination without providing a dataset
  • configure a dataset for a specific destination configuration
  • bind a source to a specific destination dataset

So:

  • communication with the dataset will go through the BigQuery connector
  • the source will be able to replicate to the desired destination/dataset

The same applies to Redshift and other destinations; only the namespaces differ (schemas/tables and so on).
Because of that, we should have two elements for each integration: source -> destination.
But each destination will be related to a specific connector, so in reality it will be: source -> connector -> destination.
The connector should be selected by its relation to the destination.

@cgardens (Contributor) left a comment

nice! please make sure integration tests pass for all destinations. otherwise this looks great!

@ChristopheDuong marked this pull request as ready for review on November 24, 2020 19:11
@@ -75,7 +74,7 @@
   private static final JSONFormat JSON_FORMAT = new JSONFormat().recordFormat(RecordFormat.OBJECT);
   private static final Instant NOW = Instant.now();
   private static final String USERS_STREAM_NAME = "users";
-  private static final String TASKS_STREAM_NAME = "tasks";
+  private static final String TASKS_STREAM_NAME = "tasks-list";
@ChristopheDuong (Contributor, Author) commented:
I am making the test a little more challenging with a '-' character...

import os


def test_example_method():
    assert True
    assert os.path.commonpath(["/usr/lib", "/usr/local/lib"]) == "/usr"
Contributor commented:
what is this change?

@ChristopheDuong (Contributor, Author) replied:

There were no real tests, and I just added something that uses an import... not sure why I keep getting formatting changes back and forth with this empty file...
