-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Naming conventions managed in destinations #1060
Conversation
...bigquery/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryDestination.java
Outdated
Show resolved
Hide resolved
cf741f7
to
3a45876
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your approach alot! The interface is a good idea. I think this is a nice balance of making it configurable for all destinations but also reusing the name cleaning code.
One thing I am not convinced of is this bit:
in destination's spec.json, a new option to allow sources the permission to override final target schema or not
* if allowed, then when source output <schema_name>.<table_name>, it will be written in <schema_name>.<table_name> in the destination
* if not allowed, then when source output <schema_name>.<table_name>, it will be written in <destination_target_schema>.<schema_name>_<table_name>
Is the idea that in most cases the destination just picks a default schema? But if the user overrides it then we use that schema instead. That part makes sense to me I think. That just means adding a "schema" option in all destinations? In the case of BQ we already have that with datasetId. With PG we have the option but don't respect it (which you fixed). etc? That seems fine / we seem to already do that. (Maybe we need to add it for snowflake?).
What I don't like is that the table name changes. That in some places it is <schema_name>_<table_name>
and in others it is <table_name>
. can we just always do <schema_name>_<table_name>
even if someone is specifying what schema they are putting the table into? Is there a reason you feel that they have to be different?
We should not try to introspect the schema names from stream names. It is okay to let the user specify the target schema for the whole sync, but we should not try to extract it from stream names.
- The different table names are confusing, we should always do
<schema_name>_<table_name>
- one it is not intuitive that the names change based on this configuration
- The case that definite breaks if we don't keep the schema name is if someone is replicating from a multi-tenant database into a single schema in the destination database. If we don't namespace the table names with the schema names they will collide.
- This introspection on the stream names is brittle in the case of non-database sources. e.g. facebook has a stream called
destinations.list
for example. I believe the implementation you're suggesting is going to accidentally put the list stream in its own schema which is not what we want.
I strongly thing we need to treat stream names as dumb strings. We clean out characters that are now allowed but we can't try to extract schema names.
Please add some unit tests to make sure we cover all the string we need to. You can steal some of Jared's test cases from here.
Just doing comment since this still a work in progress / draft.
...tegrations/bases/base-java/src/main/java/io/airbyte/integrations/base/ExtendedSQLNaming.java
Show resolved
Hide resolved
...grations/bases/base-java/src/main/java/io/airbyte/integrations/base/SQLNamingResolvable.java
Outdated
Show resolved
Hide resolved
...tegrations/bases/base-java/src/main/java/io/airbyte/integrations/base/StandardSQLNaming.java
Outdated
Show resolved
Hide resolved
...tegrations/bases/base-java/src/main/java/io/airbyte/integrations/base/StandardSQLNaming.java
Outdated
Show resolved
Hide resolved
I want to have the option of avoiding as much as possible renaming the table names... The reason is that as a user of the data down the line, i probably have SQL queries that works on my source data system that i want to copy paste and use with the replicated data in the destination system. However, renaming table names will prevent me from doing that easily, whereas adapting the schema name should be an easier change. Especially when i have data in a dev/production environment or multitenant (whether they are in two different databases or just in two different schemas) but the tables have the exact same names, I would like to keep it that way and have the freedom to change the destination schema/dataset name instead. (here are more details in this issue's comments: #973 (comment)) |
@michel-tricot also mentioned:
Maybe database sources could have this separate option to explicitly provide such "schema overrides" (outside of stream names?) to handle these namespaces? |
Hello.
The same applies to redshift and other destinations - only namespaces are different - schemas/tables and so on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! please make sure integration tests pass for all destinations. otherwise this looks great!
...tegrations/bases/base-java/src/main/java/io/airbyte/integrations/base/ExtendedSQLNaming.java
Show resolved
Hide resolved
...tegrations/bases/base-java/src/main/java/io/airbyte/integrations/base/StandardSQLNaming.java
Outdated
Show resolved
Hide resolved
@@ -75,7 +74,7 @@ | |||
private static final JSONFormat JSON_FORMAT = new JSONFormat().recordFormat(RecordFormat.OBJECT); | |||
private static final Instant NOW = Instant.now(); | |||
private static final String USERS_STREAM_NAME = "users"; | |||
private static final String TASKS_STREAM_NAME = "tasks"; | |||
private static final String TASKS_STREAM_NAME = "tasks-list"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am making the test slightly a little more challenging with a '-' character...
def test_example_method(): | ||
assert True | ||
assert os.path.commonpath(["/usr/lib", "/usr/local/lib"]) == "/usr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was no real tests and i just add something that use an import... not sure why i keep getting formatting back and forth with this empty file...
What
In #1048, we should go through each integration, determine what characters it can reasonably support.
Then reuse some of the name normalization code across destinations.
How
Replace the
NamingHelper
static methods by an interface and some implementation classes to handle specific naming rules per destination.Here is my plan so far:
sources are allowed to output<schema_name>.<table_name>
as stream names as described in Change jdbc sources to discover more than standard schemas #1038)in destination's spec.json, a new option to allow sources the permission to override final target schema or notif allowed, then when source output<schema_name>.<table_name>
, it will be written in<schema_name>.<table_name>
in the destinationif not allowed, then when source output<schema_name>.<table_name>
, it will be written in<destination_target_schema>.<schema_name>_<table_name>
Some destinations allow extended SQL naming using delimited
"
characters in the names (which allows usage of extra special characters).Destination that don't allow such extensions will replace those invalid characters by '_' instead