Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 CDK: Added support for efficient parent/child streams using cache #6057

Merged
merged 8 commits into from
Sep 22, 2021

Conversation

gaart
Copy link
Contributor

@gaart gaart commented Sep 14, 2021

What

Added the ability to use caching for efficient synchronization of nested streams.

How

The vcrpy library was used to add a caching mechanism.

HttpStream class:

  • new property use_cache, indicating whether to use caching or not
  • new property cache_filename defining the name of the cache file
  • a new method request_cache that creates and connects the cache file to the current stream
  • updated method read_records, which adds conditions for reading records from the cache file

A new class HttpSubStream.
This class should be used as the base class for "child" streams. There is a stream_slices method that gets the "parent" records from the cache files.

@github-actions github-actions bot added the CDK Connector Development Kit label Sep 14, 2021
@gaart gaart linked an issue Sep 15, 2021 that may be closed by this pull request
@gaart gaart requested a review from keu September 15, 2021 20:28
@gaart gaart changed the title Add caching CDK: parent/child streams Sep 16, 2021
@gaart gaart changed the title CDK: parent/child streams 🎉 CDK: parent/child streams Sep 16, 2021
self, sync_mode: SyncMode, cursor_field: List[str] = None, stream_state: Mapping[str, Any] = None
) -> Iterable[Optional[Mapping[str, Any]]]:
parent_stream_slices = self.parent.stream_slices(
sync_mode=sync_mode,
Copy link
Contributor

@keu keu Sep 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, we can't use the same sync_mode, because in the case of incremental we might miss records from the child stream (if updating/creating child record doesn't update cursor_field on parent record)


# iterate over all parent records with current stream_slice
for record in parent_records:
yield record
Copy link
Contributor

@keu keu Sep 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
yield record
yield {'parent': record}

in the general case, the slice is a dict, because we might want to extend slices (slice of the slice, etc):
slice read by date and by parent record:

{
   "date": "2020-10-10 03:00:00",
   "parent": <parent_record>,
}

Copy link
Contributor

@keu keu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few comments

…-parent-child-streams

# Conflicts:
#	airbyte-cdk/python/airbyte_cdk/sources/streams/http/http.py
#	airbyte-cdk/python/setup.py
#	airbyte-cdk/python/unit_tests/sources/streams/http/test_http.py
@gaart gaart requested a review from keu September 16, 2021 11:39
@gaart gaart marked this pull request as ready for review September 16, 2021 15:00
@sherifnada
Copy link
Contributor

waiting for @keu review. Also adding @Phlair since he's CDK-involved

Copy link
Contributor

@Phlair Phlair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good to see this getting added to CDK!

Couple of open questions around slicing.

Comment on lines 405 to 408
sync_mode=sync_mode,
cursor_field=cursor_field,
stream_slice=stream_slice,
stream_state=stream_state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we're making the read_records call on the parent stream using the substream's sync mode and state. Are there scenarios where this means we could miss relevant data if we're in incremental and have a recent state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, just fixed it


# iterate over all parent records with current stream_slice
for record in parent_records:
yield {"parent": record}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't guarantee time-order of slices right? We can assume that the iteration for stream_slice in parent_stream_slices: is ordered but the records within that slice aren't guaranteed to iterate in order I don't think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok now that you've changed above to full_refresh this is safe because we're going to grab all parent records on every sync.

if self.use_cache:
self.cache_file = self.request_cache()
# we need this attr to get metadata about cassettes, such as record play count, all records played, etc.
self.cass = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT?

Suggested change
self.cass = None
self.cassete = None

@gaart gaart temporarily deployed to more-secrets September 22, 2021 16:46 Inactive
@gaart
Copy link
Contributor Author

gaart commented Sep 22, 2021

/publish-cdk dry-run=true

🕑 https://github.com/airbytehq/airbyte/actions/runs/1262701535
https://github.com/airbytehq/airbyte/actions/runs/1262701535

@gaart
Copy link
Contributor Author

gaart commented Sep 22, 2021

/publish-cdk dry-run=false

🕑 https://github.com/airbytehq/airbyte/actions/runs/1262747568
https://github.com/airbytehq/airbyte/actions/runs/1262747568

@gaart gaart changed the title 🎉 CDK: parent/child streams 🎉 CDK: Added support for efficient parent/child streams using cache Sep 22, 2021
@gaart gaart merged commit 9aa5a5a into master Sep 22, 2021
@gaart gaart deleted the gaart/3380-parent-child-streams branch September 22, 2021 17:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CDK: Add support for efficient parent/child streams
5 participants