No data in database if nested objects within array #109

Closed
CStejmar opened this issue Apr 4, 2019 · 8 comments

CStejmar commented Apr 4, 2019

Hi,
I think I have encountered a problem with nested objects in arrays. An object in an array is fine, but when I have an object nested within another object in an array (array -> object -> object), a problem arises: no data appears in that last-level object, even though I can see in the record data that values exist. If I instead have an array at the last level (array -> object -> array), it behaves as expected.

I have tested this in the test_sandbox module using a modified CATS_SCHEMA and FakeStream.generate_record_message (also modified to match my schema) to generate test data. If I make the last object's properties optional in the test, e.g. "type": ["null", "string"], everything passes because the object's properties all end up null. However, if I use "type": "string" for a property, the test fails with:

CRITICAL ('Exception writing records', IntegrityError('null value in column "adoption__immunizations__vaccination_type__shot" violates not-null constraint\nDETAIL: Failing row contains (Rabies, 2537-09-12 15:34:00+02, null, 1, 1554384634, 0).\nCONTEXT: COPY tmp_36a9e5fb_ac07_49f0_ac26_8fa9f1f69885, line 1: "Rabies,2537-09-12 13:34:00.0000+00:00,NULL,1,1554384634,0"\n'))

To me it seems like this case isn't handled, and I can only find tests for array_of_array and object_of_object in the tests directory. Does anyone have a clue and can point me in the right direction?

I will attach a snippet of what I added to the test_sandbox.py file so that you can run my examples and understand my problem more easily.

class CatStreamObject(SandboxStream):
    stream = [
        {"type": "SCHEMA",
         "stream": "cats",
         "schema": {
             "additionalProperties": False,
             "properties": {
                 "id": {"type": "integer"},
                 "name": {"type": ["string"]},
                 "paw_size": {"type": ["integer"], "default": 314159},
                 "paw_colour": {"type": "string", "default": ""},
                 "flea_check_complete": {"type": ["boolean"], "default": False},
                 "pattern": {"type": ["null", "string"]},
                 "age": {"type": ["null", "integer"]},
                 "adoption": {
                     "type": ["object", "null"],
                     "properties": {
                         "adopted_on": {"type": ["null", "string"],
                                        "format": "date-time"},
                         "was_foster": {"type": "boolean"},
                         "immunizations": {
                             "type": ["null", "array"],
                             "items": {
                                 "type": ["object"],
                                 "properties": {
                                     "type": {"type": ["null", "string"]},
                                     "date_administered": {"type": ["null", "string"],
                                                           "format": "date-time"},
                                     "vaccination_type": {
                                         "type": "object",
                                         "properties": {
                                             "shot": {"type": ["null", "string"]}}}}}}}}},
         "key_properties": ["id"]},
        {"type": "RECORD",
         "stream": "cats",
         "record": {
             "id": 1,
             "name": "Morgan",
             "pattern": "Tortoiseshell",
             "age": 14,
             "adoption": {
                 "adopted_on": "2633-01-02T00:11:00",
                 "was_foster": False,
                 "immunizations": [
                     {"type": "Rabies", "date_administered": "2537-09-12T13:34:00",
                      "vaccination_type": {"shot": "Yes"}},
                     {"type": "Panleukopenia", "date_administered": "2889-03-01T17:18:00",
                      "vaccination_type": {"shot": "No"}},
                     {"type": "Feline Leukemia", "date_administered": "2599-08-08T07:47:00",
                      "vaccination_type": {"shot": "No"}},
                     {"type": "Feline Leukemia", "date_administered": "2902-04-14T01:34:00",
                      "vaccination_type": {"shot": "No"}}]}
         },
         "sequence": 1554384634}
    ]


def test_cat_stream_object__sandbox(db_cleanup):
    config = CONFIG.copy()
    main(config, input_stream=CatStreamObject())

    with psycopg2.connect(**TEST_DB) as conn:
        with conn.cursor() as cur:
            assert_tables_equal(cur,
                                {'cats', 'cats__adoption__immunizations'})

            assert_columns_equal(cur,
                                 'cats__adoption__immunizations',
                                 {
                                     ('_sdc_level_0_id', 'bigint', 'NO'),
                                     ('_sdc_sequence', 'bigint', 'YES'),
                                     ('_sdc_source_key_id', 'bigint', 'NO'),
                                     ('date_administered', 'timestamp with time zone', 'YES'),
                                     ('type', 'text', 'YES'),
                                     ('adoption__immunizations__vaccination_type__shot', 'text', 'YES')
                                 })

class CatStreamArray(SandboxStream):
    stream = [
        {"type": "SCHEMA",
         "stream": "cats",
         "schema": {
             "additionalProperties": False,
             "properties": {
                 "id": {"type": "integer"},
                 "name": {"type": ["string"]},
                 "paw_size": {"type": ["integer"], "default": 314159},
                 "paw_colour": {"type": "string", "default": ""},
                 "flea_check_complete": {"type": ["boolean"], "default": False},
                 "pattern": {"type": ["null", "string"]},
                 "age": {"type": ["null", "integer"]},
                 "adoption": {
                     "type": ["object", "null"],
                     "properties": {
                         "adopted_on": {"type": ["null", "string"],
                                        "format": "date-time"},
                         "was_foster": {"type": "boolean"},
                         "immunizations": {
                             "type": ["null", "array"],
                             "items": {
                                 "type": ["object"],
                                 "properties": {
                                     "type": {"type": ["null", "string"]},
                                     "date_administered": {"type": ["null", "string"],
                                                           "format": "date-time"},
                                     "vaccination_type": {
                                         "type": "array",
                                         "items": {"type": "string"}}}}}}}},
         "key_properties": ["id"]},
        {"type": "RECORD",
         "stream": "cats",
         "record": {
             "id": 1,
             "name": "Morgan",
             "pattern": "Tortoiseshell",
             "age": 14,
             "adoption": {
                 "adopted_on": "2633-01-02T00:11:00",
                 "was_foster": False,
                 "immunizations": [
                     {"type": "Rabies", "date_administered": "2537-09-12T13:34:00",
                      "vaccination_type": ["shot", "Yes"]},
                     {"type": "Panleukopenia", "date_administered": "2889-03-01T17:18:00",
                      "vaccination_type": ["shot", "No"]}]}
         },
         "sequence": 1554384634}
    ]


def test_cat_stream_array__sandbox(db_cleanup):
    config = CONFIG.copy()
    main(config, input_stream=CatStreamArray())

    with psycopg2.connect(**TEST_DB) as conn:
        with conn.cursor() as cur:
            assert_tables_equal(cur,
                                {'cats', 'cats__adoption__immunizations', 'cats__adoption__immunizations__vaccination_type'})

            assert_columns_equal(cur,
                                 'cats__adoption__immunizations',
                                 {
                                     ('_sdc_level_0_id', 'bigint', 'NO'),
                                     ('_sdc_sequence', 'bigint', 'YES'),
                                     ('_sdc_source_key_id', 'bigint', 'NO'),
                                     ('date_administered', 'timestamp with time zone', 'YES'),
                                     ('type', 'text', 'YES')
                                 })
@AlexanderMann
Collaborator

@CStejmar intriguing, not the behaviour I would have expected!

So it sounds like the denesting on the Schema is actually working as expected which is great, but that the denesting on the records is broken somehow.

My first step would be to double-check the tests here. Those were thrown together somewhat hastily and could use some love.

I'll poke around with those and see if I can't get more details for you, but that's where I'd start.

@AlexanderMann
Collaborator

@CStejmar #110

It doesn't look like the problem is necessarily in the denesting logic, so it might be in the upserting logic? Very odd... I've added your tests to that branch as well to start looking into what in the heck is going on...

@AlexanderMann
Collaborator

@CStejmar I think I found the problem. It looks like the denesting logic for the schema is in fact the culprit. Your observation about the issue is exactly correct, and luckily your full-blown Sandbox tests can be reduced to a single test on the denest.py module.

So what's happening is that we are denesting the records correctly and placing values under the column path inside of the table. BUT in the schema denesting, we are not removing the table path as a prefix.

Ex.

Denested schema:
  table: (a, b)
  column: (a, b, c, d)

Denested record:
  table: (a, b)
  column: (c, d)

So yeah, we just need to figure out why that's happening and fix it. I'm not sure how breaking that change will be for users of this repo... Feel free to fork and start working on that branch. I've pushed up the simplified reproducing test.
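To make the mismatch above concrete, here is a minimal sketch of the fix direction (the helper name and path-as-tuple representation are my assumptions, not the actual denest.py API): strip the table path prefix from each schema column path so it lines up with the record side.

```python
# Hypothetical sketch, not the actual denest.py code: paths are modeled as
# tuples of keys, and strip_table_prefix is an illustrative helper name.

def strip_table_prefix(table_path, column_path):
    """Return the column path relative to the table that owns it."""
    if column_path[:len(table_path)] == table_path:
        return column_path[len(table_path):]
    return column_path

# Schema denesting produced ('a', 'b', 'c', 'd') for a column in table
# ('a', 'b'), while record denesting already produced ('c', 'd'):
print(strip_table_prefix(('a', 'b'), ('a', 'b', 'c', 'd')))  # ('c', 'd')
```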

@CStejmar
Author

CStejmar commented Apr 5, 2019

@AlexanderMann Great! Thanks for your analysis! I will fork and branch from where you left off 👍

CStejmar pushed a commit to CStejmar/target-postgres that referenced this issue Apr 5, 2019
issue: datamill-co#109

The table_schemas and the table_records did not match for more complex
and nested schemas.
@CStejmar
Author

CStejmar commented Apr 5, 2019

I think I solved the issue: #111

All tests (branched from master) pass, and I get data in the database that I didn't get before for nested objects in arrays! However, we might want it the other way around, where we instead change the record to match the schema we have now, to keep the table-naming structure we had before. What do you think @AlexanderMann?

CStejmar pushed a commit to CStejmar/target-postgres that referenced this issue Apr 5, 2019
issue: datamill-co#109

The table_schemas and the table_records did not match for more complex
and nested schemas.
@CStejmar
Author

CStejmar commented Apr 8, 2019

I noticed one thing now when looking through my tables in a test database. The objects look fine and all data enters the database. However, the naming of subtables (arrays in the schemas) misses the array name when we have arrays within arrays: the first array name is dropped from the table name.

With denest fix for schema:

db=# \d
                                 List of relations
 Schema |                         Name                          | Type  |  Owner   
--------+-------------------------------------------------------+-------+----------
 public | campaign                                              | table | postgres
 public | campaign__external_ids                                | table | postgres
 public | campaign__media_types                                 | table | postgres
 public | tv_plan                                               | table | postgres
 public | tv_plan__actual_summary__index_percent                | table | postgres
 public | tv_plan__actual_values__index_percent                 | table | postgres
 public | tv_plan__actual_values__periods                       | table | postgres
 public | tv_plan__actual_values__periods__film_code_breakdown  | table | postgres
 public | tv_plan__actual_values__periods__index_percent        | table | postgres
 public | tv_plan__channels                                     | table | postgres
 public | tv_plan__film_codes                                   | table | postgres
 public | tv_plan__planned_summary__index_percent               | table | postgres
 public | tv_plan__planned_values__index_percent                | table | postgres
 public | tv_plan__planned_values__periods                      | table | postgres
 public | tv_plan__planned_values__periods__film_code_breakdown | table | postgres
 public | tv_plan__planned_values__periods__index_percent       | table | postgres
 public | tv_plan__regions                                      | table | postgres
 public | tv_spot                                               | table | postgres
 public | tv_spot__target_audience_values                       | table | postgres
(19 rows)

db=# \d tv_plan__channels 
                                  Table "public.tv_plan__channels"
                     Column                     |       Type       | Collation | Nullable | Default 
------------------------------------------------+------------------+-----------+----------+---------
 channel_name                                   | text             |           |          | 
 planned_values__conversion_index_to_generic_ta | double precision |           |          | 
 planned_values__discount_percent               | double precision |           |          | 
 planned_values__net                            | double precision |           |          | 
 planned_values__net_net                        | double precision |           |          | 
 planned_values__grp                            | double precision |           |          | 
 planned_values__grp30                          | double precision |           |          | 
 planned_values__spots                          | double precision |           |          | 
 planned_values__spots30                        | double precision |           |          | 
 actual_values__conversion_index_to_generic_ta  | double precision |           |          | 
 actual_values__discount_percent                | double precision |           |          | 
 actual_values__net                             | double precision |           |          | 
 actual_values__net_net                         | double precision |           |          | 
 actual_values__grp                             | double precision |           |          | 
 actual_values__grp30                           | double precision |           |          | 
 actual_values__spots                           | double precision |           |          | 
 actual_values__spots30                         | double precision |           |          | 
 _sdc_source_key_id                             | text             |           | not null | 
 _sdc_source_key_campaign_id                    | text             |           | not null | 
 _sdc_sequence                                  | bigint           |           |          | 
 _sdc_level_0_id                                | bigint           |           | not null | 

Without denest fix/fixing record instead:

db=> \d
                                     List of relations
 Schema |                              Name                               | Type  |  Owner  
--------+-----------------------------------------------------------------+-------+---------
 public | campaign                                                        | table | adverai
 public | campaign__external_ids                                          | table | adverai
 public | campaign__media_types                                           | table | adverai
 public | tv_plan                                                         | table | adverai
 public | tv_plan__actual_summary__index_percent                          | table | adverai
 public | tv_plan__channels                                               | table | adverai
 public | tv_plan__channels__actual_values__index_percent                 | table | adverai
 public | tv_plan__channels__actual_values__periods                       | table | adverai
 public | tv_plan__channels__actual_values__periods__film_code_breakdown  | table | adverai
 public | tv_plan__channels__actual_values__periods__index_percent        | table | adverai
 public | tv_plan__channels__planned_values__index_percent                | table | adverai
 public | tv_plan__channels__planned_values__periods                      | table | adverai
 public | tv_plan__channels__planned_values__periods__film_code_breakdown | table | adverai
 public | tv_plan__channels__planned_values__periods__index_percent       | table | adverai
 public | tv_plan__film_codes                                             | table | adverai
 public | tv_plan__planned_summary__index_percent                         | table | adverai
 public | tv_plan__regions                                                | table | adverai
 public | tv_spot                                                         | table | adverai
 public | tv_spot__target_audience_values                                 | table | adverai
(19 rows)

db=> \d tv_plan__channels
                                       Table "public.tv_plan__channels"
                          Column                          |       Type       | Collation | Nullable | Default 
----------------------------------------------------------+------------------+-----------+----------+---------
 channels__actual_values__net_net                         | double precision |           |          | 
 channels__planned_values__conversion_index_to_generic_ta | double precision |           |          | 
 channel_name                                             | text             |           |          | 
 channels__actual_values__grp30                           | double precision |           |          | 
 channels__actual_values__net                             | double precision |           |          | 
 _sdc_level_0_id                                          | bigint           |           | not null | 
 channels__actual_values__discount_percent                | double precision |           |          | 
 channels__actual_values__conversion_index_to_generic_ta  | double precision |           |          | 
 channels__actual_values__spots                           | double precision |           |          | 
 channels__planned_values__net_net                        | double precision |           |          | 
 channels__planned_values__grp                            | double precision |           |          | 
 channels__planned_values__spots30                        | double precision |           |          | 
 _sdc_source_key_id                                       | text             |           | not null | 
 channels__planned_values__spots                          | double precision |           |          | 
 channels__planned_values__net                            | double precision |           |          | 
 channels__planned_values__discount_percent               | double precision |           |          | 
 channels__planned_values__grp30                          | double precision |           |          | 
 _sdc_sequence                                            | bigint           |           |          | 
 channels__actual_values__grp                             | double precision |           |          | 
 channels__actual_values__spots30                         | double precision |           |          | 
 _sdc_source_key_campaign_id                              | text             |           | not null | 

The above has errors in the naming of nested objects, as you can see in tv_plan__channels. For example, channels__actual_values* and channels__planned_values* should be named actual_values* and planned_values*, respectively, just as in the "fixed" example.

So my conclusion is that the denesting fix needs some more work before merging. I will start looking into it now!
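For reference, my understanding of what the corrected subtable naming should do, as a hypothetical sketch (illustrative names, not the actual target-postgres code): a subtable's name keeps the full nested path, including every array name along the way, joined by double underscores.

```python
# Hypothetical sketch of the intended child-table naming (illustrative only):
# join the root stream name with the full nested field path using '__'.

SEPARATOR = '__'

def child_table_name(root, path):
    """Compose a subtable name from the root stream and the nested field path."""
    return SEPARATOR.join([root] + list(path))

# The 'channels' array name must survive in deeper subtable names:
print(child_table_name('tv_plan', ['channels', 'actual_values', 'periods']))
# tv_plan__channels__actual_values__periods
```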

@CStejmar
Author

CStejmar commented Apr 9, 2019

Update

Pull request is updated to fix the above problem with incorrect naming of subtables: #111

The plan ahead is: first merge the tests @AlexanderMann set up in this PR: #110, then rebase the changes from #111 on top of that to see if the tests pass. If they do not pass, more work is needed.

CStejmar pushed a commit to CStejmar/target-postgres that referenced this issue Apr 12, 2019
issue: datamill-co#109

The table_schemas and the table_records did not match for more complex
and nested schemas.
CStejmar pushed a commit to CStejmar/target-postgres that referenced this issue Apr 12, 2019
issue: datamill-co#109

The table_schemas and the table_records did not match for more complex
and nested schemas.
AlexanderMann added a commit that referenced this issue Apr 15, 2019
@CStejmar
Author

Fix and tests are merged to master.
