-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TDL-13692: Missing parent objects from projection expression causes key error and TDL-16140: Fix key error and index error while applying projection expression #35
Conversation
} | ||
] | ||
|
||
def generate_items(self, num_items, start_key=0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add function comments
else: | ||
if output.get(breadcrumb[0]) is None: | ||
output[breadcrumb[0]] = {} | ||
self._apply_projection(record[breadcrumb[0]], breadcrumb[1:], output[breadcrumb[0]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain what this code will input in the Talend-Stitch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment.
output[breadcrumb_key] = [{}] | ||
self._apply_projection(record[breadcrumb_key][index], breadcrumb[1:], output[breadcrumb_key][0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain what this code will input in the Talend-Stitch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment.
else: | ||
output[breadcrumb_key] = [record[breadcrumb_key][index]] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain what this code will input in the Talend-Stitch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment
@@ -58,23 +58,31 @@ def _apply_projection(self, record, breadcrumb, output): | |||
breadcrumb_key = breadcrumb[0].split('[')[0] | |||
index = int(breadcrumb[0].split('[')[1].split(']')[0]) | |||
if output.get(breadcrumb_key): | |||
output[breadcrumb_key].append(record[breadcrumb_key][index]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain what this code will input in the Talend-Stitch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting test changes.
# get data | ||
messages_by_stream = runner.get_records_from_target_output() | ||
|
||
for stream in expected_streams: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there should be an assertion that proves there are upsert messages present. The way this is written it will pass even if messages = [] and then the test would never hit these assertions below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the code to collect records from messages and loop over every record and assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think an explicit assertion that we replicated records would give more confidence. But I think this does functionally accomplish with the double get
in record.get('map_field').get('map_entry_1')
.
state = menagerie.get_state(conn_id) | ||
state_version = menagerie.get_state_version(conn_id) | ||
|
||
# delete 'finished_shards' for every streams from the state file as we want to run 2nd sync |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familiar with dynamodb's bookmarking strategy, but I do not think it makes sense to inject a simulated state at this point in the test. Could you please explain why this bookmark key is being removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the "LOG_BASED" replication method, the Tap is forcing a "FULL_TABLE" sync for the 1st time, and from the next sync, the stream is synced in a LOG_BASED manner. Hence, in this test, when we ran the 1st sync, the tap ran in a FULL_TABLE manner, thus we have removed the finished shards from the state to re-sync the data and ran with state file to validate the projection change for the LOG_BASED replication method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. If you are confident that this mimics reality then I think this is a fine implementation. However, in the database taps we generally prefer to insert data between syncs rather than inject a simulated sync as this gives us a guaranteed copy of what an end-user would do. In the SaaS taps this is much more difficult since there is no local instance of the source to interact with. That is why you see these state injects to simulate behavior elsewhere.
.circleci/config.yml
Outdated
@@ -19,6 +19,17 @@ jobs: | |||
command: | | |||
source /usr/local/share/virtualenvs/tap-dynamodb/bin/activate | |||
make lint | |||
- run: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the config to the standards: tap-tester image, env vars and slack notification. I have recently documented the requirments for a circleci config file. See https://github.com/stitchdata/tap-tester/blob/master/reference/circleci_configs.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the config.yml file.
- The tap does not break when the parent data is not found and the user is requesting for child data | ||
- The tap does not break when the data a specific position is not found in the record |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see that both cases are being tested. There should be a multiple logical syncs taking place if we are testing multiple cases like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These both conditions are tested in the test as the is in the format:
{ 'int_id': i, 'string_field': "test string", 'test_list_2': ['list_2_data'] }
and in the projection at line no. 25, we are expecting map_field.map_entry_1
(The tap does not break when the parent data is not found and the user is requesting for child data) and test_list_2[1]
(The tap does not break when the data a specific position is not found in the record)
tap_dynamodb/deserialize.py
Outdated
# main breadcrumb = [['metadata[0]'], ['metadata[1]']] | ||
# current breadcrumb = ['metadata[1]'] | ||
# current output = {'metadata': ['test1']} | ||
# expected output = {'metadata': ['test1']} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we clean up these comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the comments
else: | ||
output[breadcrumb[0]] = record.get(breadcrumb[0]) | ||
else: | ||
if '[' in breadcrumb[0]: | ||
breadcrumb_key = breadcrumb[0].split('[')[0] | ||
index = int(breadcrumb[0].split('[')[1].split(']')[0]) | ||
if output.get(breadcrumb_key) is None: | ||
if not output.get(breadcrumb_key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be
if not output.get(breadcrumb_key): | |
if breadcrumb_key not in output: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cosimon
As seen in the figure, the particular code was kept to handle such scenarios where the dictionary contains an empty list which would return false if we change it to if breadcrumb_key not in output:
Hence, not updated the code.
# as "metadata" is not present and the current breadcrumb is expecting it as a parent | ||
|
||
# keep empty dict if the data is not found in the record | ||
if record.get(breadcrumb[0]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if record.get(breadcrumb[0]): | |
if breadcrumb[0] in record: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same comment mentioned above applies to the change suggested here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more minor change to the config. Tests are good 👍
.circleci/config.yml
Outdated
tests | ||
run-test --tap=tap-dynamodb tests | ||
- slack/notify-on-failure: | ||
only_for_branches: master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also include the tap-tester-user
context after the circleci-user
context in both workflow definitions. That context contains the desired slack webhook. If this is unclear and the tap-tester docs do not clarify, please let me know so I can update them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added tap-tester-user
context.
state = menagerie.get_state(conn_id) | ||
state_version = menagerie.get_state_version(conn_id) | ||
|
||
# delete 'finished_shards' for every streams from the state file as we want to run 2nd sync |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. If you are confident that this mimics reality then I think this is a fine implementation. However, in the database taps we generally prefer to insert data between syncs rather than inject a simulated sync as this gives us a guaranteed copy of what an end-user would do. In the SaaS taps this is much more difficult since there is no local instance of the source to interact with. That is why you see these state injects to simulate behavior elsewhere.
# get data | ||
messages_by_stream = runner.get_records_from_target_output() | ||
|
||
for stream in expected_streams: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think an explicit assertion that we replicated records would give more confidence. But I think this does functionally accomplish with the double get
in record.get('map_field').get('map_entry_1')
.
…t we replicated records
…/tap-dynamodb into fix-key-error-and-index-error
…nger-io/tap-dynamodb into fix-key-error-and-index-error
…/tap-dynamodb into fix-key-error-and-index-error
* Tdl 16328 implement request timeout (#38) * added request timeout * added test cases for the backoff * resolved pylint errors * updated pylint * fixd pylint * updated config.yml * resolved comments * resolved comments * resolved pylint errors * resolved comments * resolved comments * resolved comments * resolved slack-on-notify failure * removed scenarios * added tap-tester-user in config * TDL-16291 added code comments (#36) * added code comments * fixed typo * TDL-13692: Missing parent objects from projection expression causes key error and TDL-16140: Fix key error and index error while applying projection expression (#35) * added code change for handling key error and index error * resolve pylint error * updated the unittests * updated config.yml to run unittest to run on CCi * added unittests * updated the code to handle child dict data * added comments with example for projection * updated config.yml file and resolved PR review comments * resolve slack failure error in config file * updated version 2 to 2.1 * remoed SCENARIO from the test cases * resolved comments * reverted changes * added tap-tester-venv, tap_tester_sandbox and check for verifying that we replicated records Co-authored-by: namrata270998 <namrata.brahmbhatt@crestdatasys.com> * Tdl 15171 implement expression attributes (#37) * Initial commit for expression attributes * Updated circleci and full_table.py * resolved pylint error * Updated circleci * Updated unittest case path in circleci * Updated expression attribute * resolved pylint error * Updated prepare exprssion logic * Handled empty projection * added logic for nested expressions * added in log_based test and comments * added code comments * Updated comments * Resolved review comments * added comment * resolved comments * resolved pylint errors * updated config.yml and a typo * resolved slack-on-notify failure * removed scenarios * Changed the expressionattribute logic as suggested * resolved pylint error * fixed typo * added test case for '.' in fields * added testcase for prepare_projection() * resolved comments * added check for empty expression * handled corner case * resolved comments and updated unittests * removed unused import * resolved comments * fixed pylint failure * updated comment * changed the name of catalog's expression to expression-attributes * added tap-tester-user to the config, updated to nosetests * updated to nosetests * Changed expression-attributes to expression-attribute * Changed expression-attributes to expression-attribute * added tests for primary and replication key reserved words * renamed the file * added funciton comments * fixed pylint * changed the parameter name to attribute * changed the attribute name in new files added * expression-attribute -> expression-attributes in full table * test fixes for pk and hash key as expression-attributes * expression-attribute -> expression-attributes in one more test * fix log sync ref and log projection test name * Changed from attribute to attributes Co-authored-by: namrata270998 <namrata.brahmbhatt@crestdatasys.com> Co-authored-by: KrishnanG <kgurusamy@talend.com> Co-authored-by: kspeer <kspeer@stitchdata.com> Co-authored-by: Harsh <80324346+harshpatel4crest@users.noreply.github.com> Co-authored-by: Prijen Khokhani <88327452+prijendev@users.noreply.github.com> Co-authored-by: KrishnanG <kgurusamy@talend.com> Co-authored-by: kspeer <kspeer@stitchdata.com>
Description of change
TDL-13692: Missing parent objects from projection expression causes key error
get
function when getting data from the dictionary and setting default value{}
if data is not found.TDL-16140: Fix key error and index error while applying projection expression
Manual QA steps
Risks
Rollback steps