[bugfix] Fix for issue #109 #111

Merged: 1 commit, Apr 15, 2019


@CStejmar CStejmar commented Apr 5, 2019

Fix for issue #109

The table_schemas and the table_records did not match for more complex, nested schemas. With this fix they match, and all the nested data I had trouble with before now enters the database as expected. One question remains: is this how we want to structure and name the tables? We could instead change the record denesting to match the schema we had before. That should also work, and it would give a slightly different structure and naming of tables and objects.

Below is an example, using data from the CATS_SCHEMA in target-postgres, of how the schema and record tables mismatched before and how they match now. Look at vaccination_type in particular.

Before:

___table_schemas:  [{'type': 'TABLE_SCHEMA', 'path': (), 'level': None, 'key_properties': ['id'], 'mappings': [], 'schema': {'type': 'object', 'additionalProperties': False, 'properties': {('id',): {'type': ['integer']}, ('name',): {'type': ['string']}, ('paw_size',): {'type': ['integer'], 'default': 314159}, ('paw_colour',): {'type': ['string'], 'default': ''}, ('flea_check_complete',): {'type': ['boolean'], 'default': False}, ('pattern',): {'type': ['null', 'string']}, ('age',): {'type': ['null', 'integer']}, ('adoption', 'adopted_on'): {'type': ['null', 'string'], 'format': 'date-time'}, ('adoption', 'was_foster'): {'type': ['boolean', 'null']}, ('_sdc_received_at',): {'type': ['null', 'string'], 'format': 'date-time'}, ('_sdc_sequence',): {'type': ['null', 'integer']}, ('_sdc_table_version',): {'type': ['null', 'integer']}, ('_sdc_batched_at',): {'type': ['null', 'string'], 'format': 'date-time'}}}}, {'type': 'TABLE_SCHEMA', 'path': ('adoption', 'immunizations'), 'level': 0, 'key_properties': ['_sdc_source_key_id'], 'mappings': [], 'schema': {'type': 'object', 'additionalProperties': False, 'properties': {('type',): {'type': ['string']}, ('date_administered',): {'type': ['string'], 'format': 'date-time'}, ('adoption', 'immunizations', 'vaccination_type', 'shot'): {'type': ['string']}, ('_sdc_source_key_id',): {'type': ['integer']}, ('_sdc_sequence',): {'type': ['null', 'integer']}, ('_sdc_level_0_id',): {'type': ['integer']}}}}]
___table_records:  {('adoption', 'immunizations'): [{('type',): ('string', 'Rabies'), ('date_administered',): ('string', '2537-09-12T13:34:00'), ('vaccination_type', 'shot'): ('string', 'Yes'), ('_sdc_source_key_id',): ('integer', 1), ('_sdc_sequence',): ('integer', 1554384634), ('_sdc_level_0_id',): ('integer', 0)}, {('type',): ('string', 'Panleukopenia'), ('date_administered',): ('string', '2889-03-01T17:18:00'), ('vaccination_type', 'shot'): ('string', 'No'), ('_sdc_source_key_id',): ('integer', 1), ('_sdc_sequence',): ('integer', 1554384634), ('_sdc_level_0_id',): ('integer', 1)}, {('type',): ('string', 'Feline Leukemia'), ('date_administered',): ('string', '2599-08-08T07:47:00'), ('vaccination_type', 'shot'): ('string', 'No'), ('_sdc_source_key_id',): ('integer', 1), ('_sdc_sequence',): ('integer', 1554384634), ('_sdc_level_0_id',): ('integer', 2)}, {('type',): ('string', 'Feline Leukemia'), ('date_administered',): ('string', '2902-04-14T01:34:00'), ('vaccination_type', 'shot'): ('string', 'No'), ('_sdc_source_key_id',): ('integer', 1), ('_sdc_sequence',): ('integer', 1554384634), ('_sdc_level_0_id',): ('integer', 3)}], (): [{('id',): ('integer', 1), ('name',): ('string', 'Morgan'), ('pattern',): ('string', 'Tortoiseshell'), ('age',): ('integer', 14), ('adoption', 'adopted_on'): ('string', '2633-01-02T00:11:00'), ('adoption', 'was_foster'): ('boolean', False), ('_sdc_batched_at',): ('string', '2019-04-05 09:00:15.4599+00:00'), ('_sdc_sequence',): ('integer', 1554384634)}]}

After:

___table_schemas:  [{'type': 'TABLE_SCHEMA', 'path': (), 'level': None, 'key_properties': ['id'], 'mappings': [], 'schema': {'type': 'object', 'additionalProperties': False, 'properties': {('id',): {'type': ['integer']}, ('name',): {'type': ['string']}, ('paw_size',): {'type': ['integer'], 'default': 314159}, ('paw_colour',): {'type': ['string'], 'default': ''}, ('flea_check_complete',): {'type': ['boolean'], 'default': False}, ('pattern',): {'type': ['null', 'string']}, ('age',): {'type': ['null', 'integer']}, ('adoption', 'adopted_on'): {'type': ['null', 'string'], 'format': 'date-time'}, ('adoption', 'was_foster'): {'type': ['boolean', 'null']}, ('_sdc_received_at',): {'type': ['null', 'string'], 'format': 'date-time'}, ('_sdc_sequence',): {'type': ['null', 'integer']}, ('_sdc_table_version',): {'type': ['null', 'integer']}, ('_sdc_batched_at',): {'type': ['null', 'string'], 'format': 'date-time'}}}}, {'type': 'TABLE_SCHEMA', 'path': ('adoption', 'immunizations'), 'level': 0, 'key_properties': ['_sdc_source_key_id'], 'mappings': [], 'schema': {'type': 'object', 'additionalProperties': False, 'properties': {('type',): {'type': ['string']}, ('date_administered',): {'type': ['string'], 'format': 'date-time'}, ('vaccination_type', 'shot'): {'type': ['null', 'string']}, ('_sdc_source_key_id',): {'type': ['integer']}, ('_sdc_sequence',): {'type': ['null', 'integer']}, ('_sdc_level_0_id',): {'type': ['integer']}}}}]
___table_records:  {('adoption', 'immunizations'): [{('type',): ('string', 'Rabies'), ('date_administered',): ('string', '2537-09-12T13:34:00'), ('vaccination_type', 'shot'): ('string', 'Yes'), ('_sdc_source_key_id',): ('integer', 1), ('_sdc_sequence',): ('integer', 1554384634), ('_sdc_level_0_id',): ('integer', 0)}, {('type',): ('string', 'Panleukopenia'), ('date_administered',): ('string', '2889-03-01T17:18:00'), ('vaccination_type', 'shot'): ('string', 'No'), ('_sdc_source_key_id',): ('integer', 1), ('_sdc_sequence',): ('integer', 1554384634), ('_sdc_level_0_id',): ('integer', 1)}, {('type',): ('string', 'Feline Leukemia'), ('date_administered',): ('string', '2599-08-08T07:47:00'), ('vaccination_type', 'shot'): ('string', 'No'), ('_sdc_source_key_id',): ('integer', 1), ('_sdc_sequence',): ('integer', 1554384634), ('_sdc_level_0_id',): ('integer', 2)}, {('type',): ('string', 'Feline Leukemia'), ('date_administered',): ('string', '2902-04-14T01:34:00'), ('vaccination_type', 'shot'): ('string', 'No'), ('_sdc_source_key_id',): ('integer', 1), ('_sdc_sequence',): ('integer', 1554384634), ('_sdc_level_0_id',): ('integer', 3)}], (): [{('id',): ('integer', 1), ('name',): ('string', 'Morgan'), ('pattern',): ('string', 'Tortoiseshell'), ('age',): ('integer', 14), ('adoption', 'adopted_on'): ('string', '2633-01-02T00:11:00'), ('adoption', 'was_foster'): ('boolean', False), ('_sdc_batched_at',): ('string', '2019-04-05 12:05:11.2344+00:00'), ('_sdc_sequence',): ('integer', 1554384634)}]}
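
The mismatch can also be checked mechanically. The following is a minimal sketch, not code from target-postgres: `unmatched_record_paths` and `immunizations_schema` are hypothetical helpers, and the inputs are trimmed copies of the dumps above (one subtable, one record). It reports, per table path, every record key that has no matching property in that table's schema.

```python
def unmatched_record_paths(table_schemas, table_records):
    """Return {table_path: record property paths missing from the schema}.

    Hypothetical helper for illustration; not part of target-postgres.
    """
    schema_props = {
        ts["path"]: set(ts["schema"]["properties"]) for ts in table_schemas
    }
    unmatched = {}
    for table_path, records in table_records.items():
        known = schema_props.get(table_path, set())
        missing = {
            prop for record in records for prop in record if prop not in known
        }
        if missing:
            unmatched[table_path] = missing
    return unmatched


def immunizations_schema(shot_path):
    """Build a trimmed table schema whose 'shot' column uses `shot_path`."""
    return [{
        "type": "TABLE_SCHEMA",
        "path": ("adoption", "immunizations"),
        "schema": {"properties": {
            ("type",): {"type": ["string"]},
            shot_path: {"type": ["null", "string"]},
        }},
    }]


# Trimmed from the dumps above: only the immunizations subtable, one record.
records = {
    ("adoption", "immunizations"): [
        {
            ("type",): ("string", "Rabies"),
            ("vaccination_type", "shot"): ("string", "Yes"),
        }
    ]
}

before = immunizations_schema(
    ("adoption", "immunizations", "vaccination_type", "shot")
)
after = immunizations_schema(("vaccination_type", "shot"))

print(unmatched_record_paths(before, records))  # the 'shot' path is reported
print(unmatched_record_paths(after, records))   # nothing is reported
```

With the "Before" schema the record key ('vaccination_type', 'shot') has no counterpart, which is why that column stayed empty. With the "After" schema every record key is covered.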

In the database for this example it now looks like this:

db=# \d
                     List of relations
 Schema |             Name              | Type  |  Owner   
--------+-------------------------------+-------+----------
 public | cats                          | table | postgres
 public | cats__adoption__immunizations | table | postgres
(2 rows)

db=# 
db=# 
db=# \d cats__adoption__immunizations 
                    Table "public.cats__adoption__immunizations"
         Column         |           Type           | Collation | Nullable | Default 
------------------------+--------------------------+-----------+----------+---------
 type                   | text                     |           | not null | 
 date_administered      | timestamp with time zone |           | not null | 
 vaccination_type__shot | text                     |           |          | 
 _sdc_source_key_id     | bigint                   |           | not null | 
 _sdc_sequence          | bigint                   |           |          | 
 _sdc_level_0_id        | bigint                   |           | not null | 

db=# select * from cats__adoption__immunizations
db-# ;
      type       |   date_administered    | vaccination_type__shot | _sdc_source_key_id | _sdc_sequence | _sdc_level_0_id 
-----------------+------------------------+------------------------+--------------------+---------------+-----------------
 Rabies          | 2537-09-12 15:34:00+02 | Yes                    |                  1 |    1554384634 |               0
 Panleukopenia   | 2889-03-01 18:18:00+01 | No                     |                  1 |    1554384634 |               1
 Feline Leukemia | 2599-08-08 09:47:00+02 | No                     |                  1 |    1554384634 |               2
 Feline Leukemia | 2902-04-14 03:34:00+02 | No                     |                  1 |    1554384634 |               3
(4 rows)

As you can see, the vaccination_type__shot column is now populated, which did not work before.
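
The relation and column names in the psql output appear to be the path tuples joined with a double underscore. A minimal sketch of that mapping follows. `SEPARATOR`, `table_name`, and `column_name` are hypothetical names chosen for illustration, and the sketch ignores any identifier canonicalization or length truncation the target may apply.

```python
SEPARATOR = "__"  # assumed from the names shown in the psql output


def table_name(stream, table_path):
    """Join the stream name and a subtable path into one relation name."""
    return SEPARATOR.join((stream,) + tuple(table_path))


def column_name(prop_path):
    """Join a property path into one column name."""
    return SEPARATOR.join(prop_path)


print(table_name("cats", ()))                             # cats
print(table_name("cats", ("adoption", "immunizations")))  # cats__adoption__immunizations
print(column_name(("vaccination_type", "shot")))          # vaccination_type__shot
```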

@CStejmar
Author

CStejmar commented Apr 5, 2019

If we do it the other way around and change the record denesting to match the schema instead, we get:

gmp_db=# \d
                     List of relations
 Schema |             Name              | Type  |  Owner   
--------+-------------------------------+-------+----------
 public | cats                          | table | postgres
 public | cats__adoption__immunizations | table | postgres
(2 rows)

gmp_db=# \d cats
                               Table "public.cats"
        Column        |           Type           | Collation | Nullable | Default 
----------------------+--------------------------+-----------+----------+---------
 id                   | bigint                   |           | not null | 
 name                 | text                     |           | not null | 
 paw_size             | bigint                   |           | not null | 
 paw_colour           | text                     |           | not null | 
 flea_check_complete  | boolean                  |           | not null | 
 pattern              | text                     |           |          | 
 age                  | bigint                   |           |          | 
 adoption__adopted_on | timestamp with time zone |           |          | 
 adoption__was_foster | boolean                  |           |          | 
 _sdc_received_at     | timestamp with time zone |           |          | 
 _sdc_sequence        | bigint                   |           |          | 
 _sdc_table_version   | bigint                   |           |          | 
 _sdc_batched_at      | timestamp with time zone |           |          | 

gmp_db=# \d cats__adoption__immunizations 
                                Table "public.cats__adoption__immunizations"
                     Column                      |           Type           | Collation | Nullable | Default 
-------------------------------------------------+--------------------------+-----------+----------+---------
 type                                            | text                     |           | not null | 
 date_administered                               | timestamp with time zone |           | not null | 
 adoption__immunizations__vaccination_type__shot | text                     |           |          | 
 _sdc_source_key_id                              | bigint                   |           | not null | 
 _sdc_sequence                                   | bigint                   |           |          | 
 _sdc_level_0_id                                 | bigint                   |           | not null | 

gmp_db=# 
gmp_db=# select * from cats__adoption__immunizations;
      type       |   date_administered    | adoption__immunizations__vaccination_type__shot | _sdc_source_key_id | _sdc_sequence | _sdc_level_0_id 
-----------------+------------------------+-------------------------------------------------+--------------------+---------------+-----------------
 Rabies          | 2537-09-12 15:34:00+02 | Yes                                             |                  1 |    1554384634 |               0
 Panleukopenia   | 2889-03-01 18:18:00+01 | No                                              |                  1 |    1554384634 |               1
 Feline Leukemia | 2599-08-08 09:47:00+02 | No                                              |                  1 |    1554384634 |               2
 Feline Leukemia | 2902-04-14 03:34:00+02 | No                                              |                  1 |    1554384634 |               3
(4 rows)

To get this change, the _denest_record function in denest.py is slightly altered and looks like this:

def _denest_record(table_path, record, records_map, key_properties, pk_fks, level):
    """"""
    """
    {...}
    """
    denested_record = {}
    for prop, value in record.items():
        """
        str : {...} | [...] | None | <literal>
        """

        if isinstance(value, dict):
            """
            {...}
            """
            _denest_subrecord(table_path + (prop,),
                              table_path + (prop,),
                              denested_record,
                              value,
                              records_map,
                              key_properties,
                              pk_fks,
                              level)

        elif isinstance(value, list):
            """
            [...]
            """
            _denest_records(table_path + (prop,),
                            value,
                            records_map,
                            key_properties,
                            pk_fks=pk_fks,
                            level=level + 1)

        elif value is None:
            """
            None
            """
            continue

        else:
            """
            <literal>
            """
            denested_record[(prop,)] = (json_schema.python_type(value), value)

    if table_path not in records_map:
        records_map[table_path] = []
    records_map[table_path].append(denested_record)

@CStejmar
Author

CStejmar commented Apr 5, 2019

I have prepared a branch for this other scenario and will push that as well, so you can decide which fix to use.

@CStejmar
Author

CStejmar commented Apr 5, 2019

Merge either this PR or #112

Collaborator

@AlexanderMann AlexanderMann left a comment


@CStejmar we definitely want this over changing the records. The way you have this implemented here is great. Nice job tracking down the problem! I think we should merge this, and then rebase my tests fork on master and merge it as well. The tests therein are beneficial since this change:

a) doesn't break any tests 😬
b) doesn't fix any tests

@awm33 I think this fixes currently broken schemas etc. One problem remains: tables that already hit this bug will still have the busted column. I'm not sure whether trying to create a migration for this is worth it. Thoughts?

(@CStejmar I'd approve but I'd like to suss out the current implications of the bug with @awm33 here first so we understand what's going to change etc. and whether we need to "fix" this for current schemas)

@CStejmar
Author

CStejmar commented Apr 8, 2019

@AlexanderMann Thank you! Glad I can help! Yes I think we should do as you suggest, merge this and then rebase your tests on top of that 👍 .

Regarding your list:
a) Great! 👍
b) Not sure I follow here, what exactly do you mean?

Yes, I understand. Discuss it with @awm33 and get back to me. Btw, what do you mean by "whether we need to "fix" this for current schemas"? The fix shouldn't break any schemas or records, only make them match. However, current databases and their tables could change if the schema used to create them is complex (nested) enough. Am I right?

@CStejmar
Author

CStejmar commented Apr 8, 2019

I noticed one thing just now when looking through my tables in a test database. The objects look fine and all data enters the database. However, the names of subtables (arrays in the schemas) are missing an array name when we have arrays within arrays: the first array name is dropped from the table name.

With denest fix for schema:

db=# \d
                                 List of relations
 Schema |                         Name                          | Type  |  Owner   
--------+-------------------------------------------------------+-------+----------
 public | campaign                                              | table | postgres
 public | campaign__external_ids                                | table | postgres
 public | campaign__media_types                                 | table | postgres
 public | tv_plan                                               | table | postgres
 public | tv_plan__actual_summary__index_percent                | table | postgres
 public | tv_plan__actual_values__index_percent                 | table | postgres
 public | tv_plan__actual_values__periods                       | table | postgres
 public | tv_plan__actual_values__periods__film_code_breakdown  | table | postgres
 public | tv_plan__actual_values__periods__index_percent        | table | postgres
 public | tv_plan__channels                                     | table | postgres
 public | tv_plan__film_codes                                   | table | postgres
 public | tv_plan__planned_summary__index_percent               | table | postgres
 public | tv_plan__planned_values__index_percent                | table | postgres
 public | tv_plan__planned_values__periods                      | table | postgres
 public | tv_plan__planned_values__periods__film_code_breakdown | table | postgres
 public | tv_plan__planned_values__periods__index_percent       | table | postgres
 public | tv_plan__regions                                      | table | postgres
 public | tv_spot                                               | table | postgres
 public | tv_spot__target_audience_values                       | table | postgres
(19 rows)

db=# \d tv_plan__channels 
                                  Table "public.tv_plan__channels"
                     Column                     |       Type       | Collation | Nullable | Default 
------------------------------------------------+------------------+-----------+----------+---------
 channel_name                                   | text             |           |          | 
 planned_values__conversion_index_to_generic_ta | double precision |           |          | 
 planned_values__discount_percent               | double precision |           |          | 
 planned_values__net                            | double precision |           |          | 
 planned_values__net_net                        | double precision |           |          | 
 planned_values__grp                            | double precision |           |          | 
 planned_values__grp30                          | double precision |           |          | 
 planned_values__spots                          | double precision |           |          | 
 planned_values__spots30                        | double precision |           |          | 
 actual_values__conversion_index_to_generic_ta  | double precision |           |          | 
 actual_values__discount_percent                | double precision |           |          | 
 actual_values__net                             | double precision |           |          | 
 actual_values__net_net                         | double precision |           |          | 
 actual_values__grp                             | double precision |           |          | 
 actual_values__grp30                           | double precision |           |          | 
 actual_values__spots                           | double precision |           |          | 
 actual_values__spots30                         | double precision |           |          | 
 _sdc_source_key_id                             | text             |           | not null | 
 _sdc_source_key_campaign_id                    | text             |           | not null | 
 _sdc_sequence                                  | bigint           |           |          | 
 _sdc_level_0_id                                | bigint           |           | not null | 

Without denest fix/fixing record instead:

db=> \d
                                     List of relations
 Schema |                              Name                               | Type  |  Owner  
--------+-----------------------------------------------------------------+-------+---------
 public | campaign                                                        | table | adverai
 public | campaign__external_ids                                          | table | adverai
 public | campaign__media_types                                           | table | adverai
 public | tv_plan                                                         | table | adverai
 public | tv_plan__actual_summary__index_percent                          | table | adverai
 public | tv_plan__channels                                               | table | adverai
 public | tv_plan__channels__actual_values__index_percent                 | table | adverai
 public | tv_plan__channels__actual_values__periods                       | table | adverai
 public | tv_plan__channels__actual_values__periods__film_code_breakdown  | table | adverai
 public | tv_plan__channels__actual_values__periods__index_percent        | table | adverai
 public | tv_plan__channels__planned_values__index_percent                | table | adverai
 public | tv_plan__channels__planned_values__periods                      | table | adverai
 public | tv_plan__channels__planned_values__periods__film_code_breakdown | table | adverai
 public | tv_plan__channels__planned_values__periods__index_percent       | table | adverai
 public | tv_plan__film_codes                                             | table | adverai
 public | tv_plan__planned_summary__index_percent                         | table | adverai
 public | tv_plan__regions                                                | table | adverai
 public | tv_spot                                                         | table | adverai
 public | tv_spot__target_audience_values                                 | table | adverai
(19 rows)

db=> \d tv_plan__channels
                                       Table "public.tv_plan__channels"
                          Column                          |       Type       | Collation | Nullable | Default 
----------------------------------------------------------+------------------+-----------+----------+---------
 channels__actual_values__net_net                         | double precision |           |          | 
 channels__planned_values__conversion_index_to_generic_ta | double precision |           |          | 
 channel_name                                             | text             |           |          | 
 channels__actual_values__grp30                           | double precision |           |          | 
 channels__actual_values__net                             | double precision |           |          | 
 _sdc_level_0_id                                          | bigint           |           | not null | 
 channels__actual_values__discount_percent                | double precision |           |          | 
 channels__actual_values__conversion_index_to_generic_ta  | double precision |           |          | 
 channels__actual_values__spots                           | double precision |           |          | 
 channels__planned_values__net_net                        | double precision |           |          | 
 channels__planned_values__grp                            | double precision |           |          | 
 channels__planned_values__spots30                        | double precision |           |          | 
 _sdc_source_key_id                                       | text             |           | not null | 
 channels__planned_values__spots                          | double precision |           |          | 
 channels__planned_values__net                            | double precision |           |          | 
 channels__planned_values__discount_percent               | double precision |           |          | 
 channels__planned_values__grp30                          | double precision |           |          | 
 _sdc_sequence                                            | bigint           |           |          | 
 channels__actual_values__grp                             | double precision |           |          | 
 channels__actual_values__spots30                         | double precision |           |          | 
 _sdc_source_key_campaign_id                              | text             |           | not null | 

The above has errors in the naming of nested objects, as you can see in tv_plan__channels. For example, channels__actual_values* and channels__planned_values* should be named actual_values* and planned_values* respectively, just as in the "fixed" example.

So my conclusion is that we need some more work with the fix regarding this denesting before merging. I will start looking into it now!

@AlexanderMann
Collaborator

Yeah @CStejmar I think you're correct here.

Looking into the code, it seems the records logic has the notion of table_path AND prop_path, while the schema logic doesn't really.

And then on top of that, the denesting logic for records doesn't do the correct thing (I think?) with unpacking the child objects into the parent.

Expected Behaviour

@awm33 should confirm this, but my recollection of how this logic should be working is based on the StitchData docs

Specifically, each value in a JSON object falls into one of three cases:

{
  scalar: 123,
  object: {...},
  array: [...]
}

For each case, we should be doing the following (schema and records alike):

  • scalars do nothing unexpected
  • objects change the property path
  • arrays change the table path

Scalar

  • Table Path: table_path (no changes)
  • Property Path: (key,)
  • Action: scalar value is placed into parent at property path, ie, {(key,): value}

Object

  • Table Path: table_path (no changes)
  • Property Path: (key,) + object_keys
  • Action: recursively operate on each of the object_keys and their associated values

Array

  • Table Path: table_path + property_path
  • Property Path: () (resets for new values being denested)
  • Action: create a subtable with new Table Path, recursively operate on all items in the array
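
The three rules above can be sketched for records as follows. This is a simplified illustration under stated assumptions, not the target-postgres implementation: `denest_record` is a hypothetical function, arrays are assumed to contain only objects, and key propagation, `_sdc_*` columns, and type tagging are left out. The `plan` record uses made-up values shaped like the tv_plan example earlier in this thread.

```python
def denest_record(table_path, record, tables, prop_path=(), row=None):
    """Flatten `record` into `tables` using the scalar/object/array rules.

    Hypothetical sketch: arrays must contain objects; no keys or _sdc_* columns.
    """
    is_new_row = row is None
    if is_new_row:
        row = {}
    for key, value in record.items():
        path = prop_path + (key,)
        if isinstance(value, dict):
            # Object: same table, longer property path.
            denest_record(table_path, value, tables, path, row)
        elif isinstance(value, list):
            # Array: subtable at table_path + property path; property path resets.
            for item in value:
                denest_record(table_path + path, item, tables)
        elif value is not None:
            # Scalar: placed into the current row at its property path.
            row[path] = value
    if is_new_row:
        tables.setdefault(table_path, []).append(row)
    return tables


cat = {
    "id": 1,
    "adoption": {
        "was_foster": False,
        "immunizations": [
            {"type": "Rabies", "vaccination_type": {"shot": "Yes"}},
        ],
    },
}
cat_tables = denest_record((), cat, {})

# Array inside an object inside an array, with made-up values.
plan = {
    "channels": [
        {
            "channel_name": "X",
            "actual_values": {"net": 1.0, "periods": [{"grp": 2.0}]},
        }
    ]
}
plan_tables = denest_record((), plan, {})

for table_path, rows in list(cat_tables.items()) + list(plan_tables.items()):
    print(table_path, rows)
```

Under these rules the inner array lands in the table path ('channels', 'actual_values', 'periods'), so the outer array name is kept, while the columns of the ('channels',) table start at actual_values rather than channels.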

Changes which are probably necessary in the denesting logic...

@CStejmar I'm curious to see where you get with this, but I'm nervous that for the current schema denesting logic we need to introduce the concept of prop_path to be able to handle the new edge case you've highlighted.

Then, for records denesting we'll need to fix the issues around object denesting changing/not-changing the table path...
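
For the schema side, one way to carry a property path through the recursion is sketched below. It is only an illustration: `denest_schema` is a hypothetical function, not the project's `_denest_schema_helper`, and it assumes `type` is a single string (the dumps in this thread use lists such as ['null', 'string']), ignores nullability and `$ref`, and skips arrays of scalars.

```python
def denest_schema(table_path, schema, tables, prop_path=(), columns=None):
    """Split a JSON-schema-like dict into per-table column maps.

    Hypothetical sketch: single-string `type`, object-valued arrays only.
    """
    if columns is None:
        # Entering a new table: start (or reuse) its column map.
        columns = tables.setdefault(table_path, {})
    for key, sub in schema.get("properties", {}).items():
        path = prop_path + (key,)
        if sub.get("type") == "object":
            # Object: same table, longer property path.
            denest_schema(table_path, sub, tables, path, columns)
        elif sub.get("type") == "array":
            # Array: new table at table_path + property path; path resets.
            items = sub.get("items", {})
            if items.get("type") == "object":
                denest_schema(table_path + path, items, tables)
        else:
            # Scalar: becomes a column of the current table.
            columns[path] = sub
    return tables


cats_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "adoption": {
            "type": "object",
            "properties": {
                "was_foster": {"type": "boolean"},
                "immunizations": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "type": {"type": "string"},
                            "vaccination_type": {
                                "type": "object",
                                "properties": {"shot": {"type": "string"}},
                            },
                        },
                    },
                },
            },
        },
    },
}

schema_tables = denest_schema((), cats_schema, {})
for table_path, columns in schema_tables.items():
    print(table_path, sorted(columns))
```

Because the property path resets at each array, the subtable column comes out as ('vaccination_type', 'shot'), which is the same key that record denesting produces for that table.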

If I get the time I'm going to update my fork/branch with the broken tests to also include tests for the additional bug you've got up top here.

@CStejmar
Author

CStejmar commented Apr 8, 2019

@AlexanderMann thanks for the breakdown of the problem! It is similar, if not identical, to the thoughts I have had this morning :). I have been working on a fix for this today and just pushed it. It produces the correct and expected tables with all data in the database, and it also passes all tests currently present on the master branch. As you wrote, I needed to introduce the prop_path to the _denest_schema_helper function and then just input the paths correctly. I think I got it right, so please take a look at the changes and test it!

issue: datamill-co#109

The table_schemas and the table_records did not match for more complex
and nested schemas.
@CStejmar CStejmar force-pushed the fix/nested-schema-issue branch from 55ef65a to 3b865c0 Compare April 12, 2019 14:28
Collaborator

@AlexanderMann AlexanderMann left a comment


@CStejmar this is 👍 from me. I also don't think that this will break/mess up any current usages of Target Postgres. Anyone who is using this now, who doesn't already have a problem, is someone who has nullable fields in their schema. Due to this, when we replicate, create new columns, and populate those, we'll have null values to populate the broken columns.

@awm33 if this gets a 👍 from you, I can merge and get Target Redshift etc. deployed.

@AlexanderMann AlexanderMann merged commit 4f7aaec into datamill-co:master Apr 15, 2019