
Naming conventions managed in destinations #1060

Merged: 14 commits, Nov 25, 2020

Conversation

@ChristopheDuong (Contributor) commented Nov 23, 2020

What

In #1048, we should go through each integration and determine what characters it can reasonably support, then reuse some of the name normalization code across destinations.

How

Replace the NamingHelper static methods with an interface and implementation classes that handle the specific naming rules of each destination.
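
To make the shape concrete, here is a minimal sketch of what such an interface could look like; the names below are illustrative only and not necessarily the ones used in this PR:

// Illustrative sketch only: one naming interface, with one implementation per destination.
public interface DestinationNamingResolver {

  // Turn an arbitrary stream or schema name into an identifier this destination accepts.
  String toIdentifier(String name);

  // Resolve the schema/dataset a stream should be written to.
  String toSchemaName(String name);
}

Each destination (Postgres, BigQuery, Snowflake, ...) would then supply its own implementation instead of calling shared static helpers.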

Here is my plan so far:

  • destinations are allowed to choose the target schema where data will be written (see Destination-postgres should write data in a table from the specified schema #1059, which makes this consistent across all destinations)
  • sources are allowed to output <schema_name>.<table_name> as stream names (as described in Change jdbc sources to discover more than standard schemas #1038)
  • in the destination's spec.json, a new option to allow (or not) sources to override the final target schema (sketched just after this list)
    • if allowed, then when a source outputs <schema_name>.<table_name>, it will be written to <schema_name>.<table_name> in the destination
    • if not allowed, then when a source outputs <schema_name>.<table_name>, it will be written to <destination_target_schema>.<schema_name>_<table_name>
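
As a rough illustration of that override rule (assuming a hypothetical boolean flag in the destination's spec.json; the class and field names here are made up for the example):

// Illustrative only: where a stream named "<schema>.<table>" would land in the destination.
public class TargetNameResolver {

  private final String destinationTargetSchema;    // schema/dataset configured on the destination
  private final boolean allowSourceSchemaOverride; // hypothetical flag from spec.json

  public TargetNameResolver(String destinationTargetSchema, boolean allowSourceSchemaOverride) {
    this.destinationTargetSchema = destinationTargetSchema;
    this.allowSourceSchemaOverride = allowSourceSchemaOverride;
  }

  // Returns {schema, table} for a stream name such as "public.users" or plain "users".
  public String[] resolve(String streamName) {
    final int dot = streamName.indexOf('.');
    if (dot < 0) {
      return new String[] {destinationTargetSchema, streamName};
    }
    final String sourceSchema = streamName.substring(0, dot);
    final String table = streamName.substring(dot + 1);
    if (allowSourceSchemaOverride) {
      // allowed: keep the source schema as the target schema
      return new String[] {sourceSchema, table};
    }
    // not allowed: write into the destination schema, prefixing the table with the source schema
    return new String[] {destinationTargetSchema, sourceSchema + "_" + table};
  }
}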

Some destinations allow extended SQL naming using delimited identifiers (names wrapped in " quotes), which allows extra special characters in the names.

Destinations that don't allow such extensions will replace those invalid characters with '_' instead.
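
For illustration, the two behaviors could look roughly like this (assumed class names, not the actual ones from this PR):

// Illustrative only: two ways a destination can deal with characters it does not support.

// Destinations with extended SQL naming can keep the original name by quoting it.
public class QuotedNameTransformer {

  public String toIdentifier(String name) {
    // Escape embedded double quotes, then wrap the whole name in delimited-identifier quotes.
    return "\"" + name.replace("\"", "\"\"") + "\"";
  }
}

// Destinations without that extension fall back to replacing unsupported characters with '_'.
public class UnderscoreNameTransformer {

  public String toIdentifier(String name) {
    return name.replaceAll("[^A-Za-z0-9_]", "_");
  }
}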

@ChristopheDuong force-pushed the chris/destination-naming branch from cf741f7 to 3a45876 on November 23, 2020 20:21
@cgardens (Contributor) left a comment

I like your approach a lot! The interface is a good idea. I think this is a nice balance of making it configurable for all destinations while also reusing the name cleaning code.


One thing I am not convinced of is this bit:

in the destination's spec.json, a new option to allow (or not) sources to override the final target schema
* if allowed, then when a source outputs <schema_name>.<table_name>, it will be written to <schema_name>.<table_name> in the destination
* if not allowed, then when a source outputs <schema_name>.<table_name>, it will be written to <destination_target_schema>.<schema_name>_<table_name>

Is the idea that in most cases the destination just picks a default schema, but if the user overrides it then we use that schema instead? That part makes sense to me, I think. That just means adding a "schema" option in all destinations? In the case of BQ we already have that with datasetId. With PG we have the option but don't respect it (which you fixed). etc? That seems fine / we seem to already do that. (Maybe we need to add it for Snowflake?)

What I don't like is that the table name changes: in some places it is <schema_name>_<table_name> and in others it is <table_name>. Can we just always do <schema_name>_<table_name>, even if someone is specifying which schema they are putting the table into? Is there a reason you feel they have to be different?

We should not try to introspect the schema names from stream names. It is okay to let the user specify the target schema for the whole sync, but we should not try to extract it from stream names.

  1. The different table names are confusing; we should always do <schema_name>_<table_name>
    1. it is not intuitive that the names change based on this configuration
    2. the case that definitely breaks if we don't keep the schema name is someone replicating from a multi-tenant database into a single schema in the destination database; if we don't namespace the table names with the schema names, they will collide
  2. This introspection on the stream names is brittle in the case of non-database sources, e.g. facebook has a stream called destinations.list. I believe the implementation you're suggesting is going to accidentally put the list stream in its own schema, which is not what we want.

I strongly think we need to treat stream names as dumb strings. We clean out characters that are not allowed, but we can't try to extract schema names.
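
To make the concern about introspection concrete, here is a small hypothetical example (not code from this PR) of how a naive split on '.' would misroute a non-database stream:

// Illustrative only: why extracting a schema from a stream name is brittle.
public class StreamNameSplitExample {

  public static void main(String[] args) {
    final String streamName = "destinations.list"; // a facebook stream, not "<schema>.<table>"
    final String[] parts = streamName.split("\\.", 2);
    // Prints schema=destinations table=list: the "list" stream would end up in its own
    // "destinations" schema, even though the source never meant that.
    System.out.println("schema=" + parts[0] + " table=" + parts[1]);
  }
}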


Please add some unit tests to make sure we cover all the strings we need to. You can steal some of Jared's test cases from here.

Just leaving a comment since this is still a work in progress / draft.

@ChristopheDuong (Contributor, Author) commented Nov 24, 2020

I want to have the option of avoiding, as much as possible, renaming the table names...

The reason is that, as a user of the data down the line, I probably have SQL queries that work on my source data system and that I want to copy-paste and use with the replicated data in the destination system. Renaming table names will prevent me from doing that easily, whereas adapting the schema name should be an easier change.

Especially when I have data in dev/production environments or in a multi-tenant setup (whether in two different databases or just in two different schemas) where the tables have the exact same names, I would like to keep it that way and have the freedom to change the destination schema/dataset name instead.

(More details are in this issue's comments: #973 (comment))

@ChristopheDuong (Contributor, Author) commented:

@michel-tricot also mentioned:

we could decide to change the struct behind the stream name and instead of a string make it an object that contains the name and the namespace

Maybe database sources could have this separate option to explicitly provide such "schema overrides" (outside of stream names?) to handle these namespaces?
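
If we went the struct route michel-tricot mentions, it might look something along these lines (the type and field names here are purely illustrative):

// Illustrative only: carrying the namespace separately instead of encoding it in the stream name.
public class StreamDescriptor {

  private final String namespace; // e.g. the source schema, or null when the source has none
  private final String name;      // the plain stream/table name, treated as a dumb string

  public StreamDescriptor(String namespace, String name) {
    this.namespace = namespace;
    this.name = name;
  }

  public String getNamespace() {
    return namespace;
  }

  public String getName() {
    return name;
  }
}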

@Upgreydd (Contributor) commented:

Hello.
In my opinion, destinations should be split into more granular pieces than they are now. An example for BigQuery:

  • configure the BigQuery destination without providing a dataset
  • configure a dataset for a specific destination configuration
  • bind a source to a specific destination dataset

So:

  • communication with the dataset will go through the BigQuery connector
  • the source will be able to replicate to the desired destination/dataset

The same applies to Redshift and other destinations; only the namespaces differ (schemas/tables and so on).
Because of that, we should have two elements for each integration: source -> destination.
But each destination will be related to a specific connector, so in reality it will be: source -> connector -> destination.
The connector should be selected by its relation to the destination.

@cgardens (Contributor) left a comment

nice! please make sure integration tests pass for all destinations. otherwise this looks great!

@ChristopheDuong marked this pull request as ready for review on November 24, 2020 19:11
@@ -75,7 +74,7 @@
   private static final JSONFormat JSON_FORMAT = new JSONFormat().recordFormat(RecordFormat.OBJECT);
   private static final Instant NOW = Instant.now();
   private static final String USERS_STREAM_NAME = "users";
-  private static final String TASKS_STREAM_NAME = "tasks";
+  private static final String TASKS_STREAM_NAME = "tasks-list";
@ChristopheDuong (Contributor, Author) commented:
I am making the test a little more challenging with a '-' character...

import os


def test_example_method():
    assert True
    assert os.path.commonpath(["/usr/lib", "/usr/local/lib"]) == "/usr"
Contributor commented:
what is this change?

@ChristopheDuong (Contributor, Author) replied:

There were no real tests, and I just added something that uses an import... not sure why I keep getting formatting changes back and forth with this empty file...
