No data in database if nested objects within array #109

Closed
CStejmar opened this issue Apr 4, 2019 · 8 comments

CStejmar commented Apr 4, 2019

Hi,
I think I have encountered a problem with nested objects in arrays. An object in an array is fine, but when I have an object nested within another object in an array (array -> object -> object), a problem arises: no data appears in that last-level object, even though I can see in the record data that values exist. If I instead have an array at the last level (array -> object -> array), it behaves as expected.

I have tested this in the test_sandbox module using a modified CATS_SCHEMA and FakeStream.generate_record_message (also modified to match my schema) to generate test data. If I make the last object's properties optional in the test, e.g. "type": ["null", "string"], everything passes because the object's properties all end up null. However, if I use "type": "string" for a property, the test fails with:

CRITICAL ('Exception writing records', IntegrityError('null value in column "adoption__immunizations__vaccination_type__shot" violates not-null constraint\nDETAIL: Failing row contains (Rabies, 2537-09-12 15:34:00+02, null, 1, 1554384634, 0).\nCONTEXT: COPY tmp_36a9e5fb_ac07_49f0_ac26_8fa9f1f69885, line 1: "Rabies,2537-09-12 13:34:00.0000+00:00,NULL,1,1554384634,0"\n'))

To me it seems like this case isn't handled, and I can only find tests for array_of_array and object_of_object in the tests directory. Does anyone have a clue and can point me in the right direction?

I will attach a snippet of what I added to the test_sandbox.py file so that you can run my examples and understand my problem more easily.

class CatStreamObject(SandboxStream):
    stream = [
        {"type": "SCHEMA",
         "stream": "cats",
         "schema": {
             "additionalProperties": False,
             "properties": {
                 "id": {"type": "integer"},
                 "name": {"type": ["string"]},
                 "paw_size": {"type": ["integer"], "default": 314159},
                 "paw_colour": {"type": "string", "default": ""},
                 "flea_check_complete": {"type": ["boolean"], "default": False},
                 "pattern": {"type": ["null", "string"]},
                 "age": {"type": ["null", "integer"]},
                 "adoption": {
                     "type": ["object", "null"],
                     "properties": {
                         "adopted_on": {"type": ["null", "string"],
                                        "format": "date-time"},
                         "was_foster": {"type": "boolean"},
                         "immunizations": {
                             "type": ["null", "array"],
                             "items": {
                                 "type": ["object"],
                                 "properties": {
                                     "type": {"type": ["null", "string"]},
                                     "date_administered": {"type": ["null", "string"],
                                                           "format": "date-time"},
                                     "vaccination_type": {
                                         "type": "object",
                                         "properties": {
                                             "shot": {"type": ["null", "string"]}}}}}}}}},
         "key_properties": ["id"]},
        {"type": "RECORD",
         "stream": "cats",
         "record": {
             "id": 1,
             "name": "Morgan",
             "pattern": "Tortoiseshell",
             "age": 14,
             "adoption": {
                 "adopted_on": "2633-01-02T00:11:00",
                 "was_foster": False,
                 "immunizations": [
                     {"type": "Rabies", "date_administered": "2537-09-12T13:34:00",
                      "vaccination_type": {"shot": "Yes"}},
                     {"type": "Panleukopenia", "date_administered": "2889-03-01T17:18:00",
                      "vaccination_type": {"shot": "No"}},
                     {"type": "Feline Leukemia", "date_administered": "2599-08-08T07:47:00",
                      "vaccination_type": {"shot": "No"}},
                     {"type": "Feline Leukemia", "date_administered": "2902-04-14T01:34:00",
                      "vaccination_type": {"shot": "No"}}]}
         },
         "sequence": 1554384634}
    ]


def test_cat_stream_object__sandbox(db_cleanup):
    config = CONFIG.copy()
    main(config, input_stream=CatStreamObject())

    with psycopg2.connect(**TEST_DB) as conn:
        with conn.cursor() as cur:
            assert_tables_equal(cur,
                                {'cats', 'cats__adoption__immunizations'})

            assert_columns_equal(cur,
                                 'cats__adoption__immunizations',
                                 {
                                     ('_sdc_level_0_id', 'bigint', 'NO'),
                                     ('_sdc_sequence', 'bigint', 'YES'),
                                     ('_sdc_source_key_id', 'bigint', 'NO'),
                                     ('date_administered', 'timestamp with time zone', 'YES'),
                                     ('type', 'text', 'YES'),
                                     ('adoption__immunizations__vaccination_type__shot', 'text', 'YES')
                                 })

class CatStreamArray(SandboxStream):
    stream = [
        {"type": "SCHEMA",
         "stream": "cats",
         "schema": {
             "additionalProperties": False,
             "properties": {
                 "id": {"type": "integer"},
                 "name": {"type": ["string"]},
                 "paw_size": {"type": ["integer"], "default": 314159},
                 "paw_colour": {"type": "string", "default": ""},
                 "flea_check_complete": {"type": ["boolean"], "default": False},
                 "pattern": {"type": ["null", "string"]},
                 "age": {"type": ["null", "integer"]},
                 "adoption": {
                     "type": ["object", "null"],
                     "properties": {
                         "adopted_on": {"type": ["null", "string"],
                                        "format": "date-time"},
                         "was_foster": {"type": "boolean"},
                         "immunizations": {
                             "type": ["null", "array"],
                             "items": {
                                 "type": ["object"],
                                 "properties": {
                                     "type": {"type": ["null", "string"]},
                                     "date_administered": {"type": ["null", "string"],
                                                           "format": "date-time"},
                                     "vaccination_type": {
                                         "type": "array",
                                         "items": {"type": "string"}}}}}}}},
         "key_properties": ["id"]},
        {"type": "RECORD",
         "stream": "cats",
         "record": {
             "id": 1,
             "name": "Morgan",
             "pattern": "Tortoiseshell",
             "age": 14,
             "adoption": {
                 "adopted_on": "2633-01-02T00:11:00",
                 "was_foster": False,
                 "immunizations": [
                     {"type": "Rabies", "date_administered": "2537-09-12T13:34:00",
                      "vaccination_type": ["shot", "Yes"]},
                     {"type": "Panleukopenia", "date_administered": "2889-03-01T17:18:00",
                      "vaccination_type": ["shot", "No"]}]}
         },
         "sequence": 1554384634}
    ]


def test_cat_stream_array__sandbox(db_cleanup):
    config = CONFIG.copy()
    main(config, input_stream=CatStreamArray())

    with psycopg2.connect(**TEST_DB) as conn:
        with conn.cursor() as cur:
            assert_tables_equal(cur,
                                {'cats', 'cats__adoption__immunizations', 'cats__adoption__immunizations__vaccination_type'})

            assert_columns_equal(cur,
                                 'cats__adoption__immunizations',
                                 {
                                     ('_sdc_level_0_id', 'bigint', 'NO'),
                                     ('_sdc_sequence', 'bigint', 'YES'),
                                     ('_sdc_source_key_id', 'bigint', 'NO'),
                                     ('date_administered', 'timestamp with time zone', 'YES'),
                                     ('type', 'text', 'YES')
                                 })
@AlexanderMann
Collaborator

@CStejmar intriguing, not the behaviour I would have expected!

So it sounds like the denesting on the Schema is actually working as expected which is great, but that the denesting on the records is broken somehow.

My first step would be to double-check the tests here. Those were thrown together somewhat hastily and could use some love.

I'll poke around with those and see if I can't get more details for you, but that's where I'd start.

@AlexanderMann
Collaborator

@CStejmar #110

It doesn't look like the problem is necessarily in the denesting logic, so it might be in the upserting logic? Very odd... I've added your tests to that branch as well to start looking into what in the heck is going on...

@AlexanderMann
Collaborator

@CStejmar I think I found the problem. It looks like the denesting logic for the schema is in fact the culprit. Your observation about the issue is exactly correct, and luckily your full-blown Sandbox tests can be reduced to a single test on the denest.py module.

So what's happening is that we are denesting the records correctly and placing values under the column path inside of the table. BUT in the schema denesting, we are not removing the table path as a prefix.

Ex.

Denested schema:
  table: (a, b)
  column: (a, b, c, d)

Denested record:
  table: (a, b)
  column: (c, d)

So yeah, we just need to figure out why that's happening and fix it. I'm not sure how breaking that change will be for users of this repo... Feel free to fork and start working on that branch. I've pushed up the simplified reproducing test.
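To make the mismatch above concrete, here is a minimal sketch of the fix direction (the helper name and path-as-tuple representation are my assumptions, not the actual denest.py API): strip the table path prefix from each schema column path so it lines up with the record side.

```python
# Hypothetical sketch, not the actual denest.py code: paths are modeled as
# tuples of keys, and strip_table_prefix is an illustrative helper name.

def strip_table_prefix(table_path, column_path):
    """Return the column path relative to the table that owns it."""
    if column_path[:len(table_path)] == table_path:
        return column_path[len(table_path):]
    return column_path

# Schema denesting produced ('a', 'b', 'c', 'd') for a column in table
# ('a', 'b'), while record denesting already produced ('c', 'd'):
print(strip_table_prefix(('a', 'b'), ('a', 'b', 'c', 'd')))  # ('c', 'd')
```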

@CStejmar
Author

CStejmar commented Apr 5, 2019

@AlexanderMann Great! Thanks for your analysis! I will fork and branch from where you left off 👍

CStejmar pushed a commit to CStejmar/target-postgres that referenced this issue Apr 5, 2019
issue: datamill-co#109

The table_schemas and the table_records did not match for more complex
and nested schemas.
@CStejmar
Author

CStejmar commented Apr 5, 2019

I think I solved the issue: #111

All tests (branched from master) pass, and I get data in the database that I didn't get before for nested objects in arrays! However, we might want it the other way around, where we instead change the record to match the schema we have now, to keep the table-naming structure we had before. What do you think @AlexanderMann?

CStejmar pushed a commit to CStejmar/target-postgres that referenced this issue Apr 5, 2019
issue: datamill-co#109

The table_schemas and the table_records did not match for more complex
and nested schemas.
@CStejmar
Author

CStejmar commented Apr 8, 2019

I noticed one thing now when looking through my tables in a test database. The objects look fine and all data enters the database. However, the naming of subtables (arrays in the schemas) misses the array name when we have arrays within arrays: the first array name is dropped from the table name.

With denest fix for schema:

db=# \d
                                 List of relations
 Schema |                         Name                          | Type  |  Owner   
--------+-------------------------------------------------------+-------+----------
 public | campaign                                              | table | postgres
 public | campaign__external_ids                                | table | postgres
 public | campaign__media_types                                 | table | postgres
 public | tv_plan                                               | table | postgres
 public | tv_plan__actual_summary__index_percent                | table | postgres
 public | tv_plan__actual_values__index_percent                 | table | postgres
 public | tv_plan__actual_values__periods                       | table | postgres
 public | tv_plan__actual_values__periods__film_code_breakdown  | table | postgres
 public | tv_plan__actual_values__periods__index_percent        | table | postgres
 public | tv_plan__channels                                     | table | postgres
 public | tv_plan__film_codes                                   | table | postgres
 public | tv_plan__planned_summary__index_percent               | table | postgres
 public | tv_plan__planned_values__index_percent                | table | postgres
 public | tv_plan__planned_values__periods                      | table | postgres
 public | tv_plan__planned_values__periods__film_code_breakdown | table | postgres
 public | tv_plan__planned_values__periods__index_percent       | table | postgres
 public | tv_plan__regions                                      | table | postgres
 public | tv_spot                                               | table | postgres
 public | tv_spot__target_audience_values                       | table | postgres
(19 rows)

db=# \d tv_plan__channels 
                                  Table "public.tv_plan__channels"
                     Column                     |       Type       | Collation | Nullable | Default 
------------------------------------------------+------------------+-----------+----------+---------
 channel_name                                   | text             |           |          | 
 planned_values__conversion_index_to_generic_ta | double precision |           |          | 
 planned_values__discount_percent               | double precision |           |          | 
 planned_values__net                            | double precision |           |          | 
 planned_values__net_net                        | double precision |           |          | 
 planned_values__grp                            | double precision |           |          | 
 planned_values__grp30                          | double precision |           |          | 
 planned_values__spots                          | double precision |           |          | 
 planned_values__spots30                        | double precision |           |          | 
 actual_values__conversion_index_to_generic_ta  | double precision |           |          | 
 actual_values__discount_percent                | double precision |           |          | 
 actual_values__net                             | double precision |           |          | 
 actual_values__net_net                         | double precision |           |          | 
 actual_values__grp                             | double precision |           |          | 
 actual_values__grp30                           | double precision |           |          | 
 actual_values__spots                           | double precision |           |          | 
 actual_values__spots30                         | double precision |           |          | 
 _sdc_source_key_id                             | text             |           | not null | 
 _sdc_source_key_campaign_id                    | text             |           | not null | 
 _sdc_sequence                                  | bigint           |           |          | 
 _sdc_level_0_id                                | bigint           |           | not null | 

Without denest fix/fixing record instead:

db=> \d
                                     List of relations
 Schema |                              Name                               | Type  |  Owner  
--------+-----------------------------------------------------------------+-------+---------
 public | campaign                                                        | table | adverai
 public | campaign__external_ids                                          | table | adverai
 public | campaign__media_types                                           | table | adverai
 public | tv_plan                                                         | table | adverai
 public | tv_plan__actual_summary__index_percent                          | table | adverai
 public | tv_plan__channels                                               | table | adverai
 public | tv_plan__channels__actual_values__index_percent                 | table | adverai
 public | tv_plan__channels__actual_values__periods                       | table | adverai
 public | tv_plan__channels__actual_values__periods__film_code_breakdown  | table | adverai
 public | tv_plan__channels__actual_values__periods__index_percent        | table | adverai
 public | tv_plan__channels__planned_values__index_percent                | table | adverai
 public | tv_plan__channels__planned_values__periods                      | table | adverai
 public | tv_plan__channels__planned_values__periods__film_code_breakdown | table | adverai
 public | tv_plan__channels__planned_values__periods__index_percent       | table | adverai
 public | tv_plan__film_codes                                             | table | adverai
 public | tv_plan__planned_summary__index_percent                         | table | adverai
 public | tv_plan__regions                                                | table | adverai
 public | tv_spot                                                         | table | adverai
 public | tv_spot__target_audience_values                                 | table | adverai
(19 rows)

db=> \d tv_plan__channels
                                       Table "public.tv_plan__channels"
                          Column                          |       Type       | Collation | Nullable | Default 
----------------------------------------------------------+------------------+-----------+----------+---------
 channels__actual_values__net_net                         | double precision |           |          | 
 channels__planned_values__conversion_index_to_generic_ta | double precision |           |          | 
 channel_name                                             | text             |           |          | 
 channels__actual_values__grp30                           | double precision |           |          | 
 channels__actual_values__net                             | double precision |           |          | 
 _sdc_level_0_id                                          | bigint           |           | not null | 
 channels__actual_values__discount_percent                | double precision |           |          | 
 channels__actual_values__conversion_index_to_generic_ta  | double precision |           |          | 
 channels__actual_values__spots                           | double precision |           |          | 
 channels__planned_values__net_net                        | double precision |           |          | 
 channels__planned_values__grp                            | double precision |           |          | 
 channels__planned_values__spots30                        | double precision |           |          | 
 _sdc_source_key_id                                       | text             |           | not null | 
 channels__planned_values__spots                          | double precision |           |          | 
 channels__planned_values__net                            | double precision |           |          | 
 channels__planned_values__discount_percent               | double precision |           |          | 
 channels__planned_values__grp30                          | double precision |           |          | 
 _sdc_sequence                                            | bigint           |           |          | 
 channels__actual_values__grp                             | double precision |           |          | 
 channels__actual_values__spots30                         | double precision |           |          | 
 _sdc_source_key_campaign_id                              | text             |           | not null | 

The above has errors in the naming of nested objects, as you can see in tv_plan__channels. For example, channels__actual_values* and channels__planned_values* should be named actual_values* and planned_values*, respectively, just as in the "fixed" example.

So my conclusion is that the denesting fix needs some more work before merging. I will start looking into it now!
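For reference, my understanding of what the corrected subtable naming should do, as a hypothetical sketch (illustrative names, not the actual target-postgres code): a subtable's name keeps the full nested path, including every array name along the way, joined by double underscores.

```python
# Hypothetical sketch of the intended child-table naming (illustrative only):
# join the root stream name with the full nested field path using '__'.

SEPARATOR = '__'

def child_table_name(root, path):
    """Compose a subtable name from the root stream and the nested field path."""
    return SEPARATOR.join([root] + list(path))

# The 'channels' array name must survive in deeper subtable names:
print(child_table_name('tv_plan', ['channels', 'actual_values', 'periods']))
# tv_plan__channels__actual_values__periods
```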

@CStejmar
Author

CStejmar commented Apr 9, 2019

Update

Pull request is updated to fix the above problem with incorrect naming of subtables: #111

The plan ahead is: first merge the tests @AlexanderMann set up in this PR: #110, then rebase the changes from #111 on top of that to see if the tests pass. If they do not pass, more work is needed.

CStejmar pushed a commit to CStejmar/target-postgres that referenced this issue Apr 12, 2019
issue: datamill-co#109

The table_schemas and the table_records did not match for more complex
and nested schemas.
CStejmar pushed a commit to CStejmar/target-postgres that referenced this issue Apr 12, 2019
issue: datamill-co#109

The table_schemas and the table_records did not match for more complex
and nested schemas.
AlexanderMann added a commit that referenced this issue Apr 15, 2019
@CStejmar
Author

Fix and tests are merged to master.
