🎉 CDK: Added support for efficient parent/child streams using cache #6057

gaart · 2021-09-14T15:16:31Z

What

Added the ability to use caching for efficient synchronization of nested streams.

How

The vcrpy library was used to add a caching mechanism.

HttpStream class:

new property use_cache, indicating whether to use caching or not
new property cache_filename defining the name of the cache file
a new method request_cache that creates and connects the cache file to the current stream
updated method read_records, which adds conditions for reading records from the cache file

A new class, HttpSubStream.
This class should be used as the base class for "child" streams. Its stream_slices method gets the "parent" records from the cache files.
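
To make the description above concrete, here is a minimal sketch of the idea, not the CDK implementation. ToyStream, ToySubStream, Users and api_calls are hypothetical names, and an in-memory list stands in for the vcrpy cache file. The slices wrap each parent record under a "parent" key, the shape the review on this PR later settles on.

```python
from typing import Any, Dict, Iterator, List, Mapping, Optional


class ToyStream:
    """Hypothetical stand-in for HttpStream: no HTTP and no vcrpy involved."""

    use_cache = False  # mirrors the new use_cache property

    def __init__(self, api_records: List[Mapping[str, Any]]) -> None:
        self._api_records = api_records
        self._cache: Optional[List[Mapping[str, Any]]] = None
        self.api_calls = 0  # counts simulated requests

    def read_records(self) -> Iterator[Mapping[str, Any]]:
        # Replay from the cache when it is enabled and already filled.
        if self.use_cache and self._cache is not None:
            yield from self._cache
            return
        self.api_calls += 1  # simulated round trip to the API
        records = list(self._api_records)
        if self.use_cache:
            self._cache = records
        yield from records


class ToySubStream:
    """Hypothetical stand-in for HttpSubStream."""

    def __init__(self, parent: ToyStream) -> None:
        self.parent = parent

    def stream_slices(self) -> Iterator[Dict[str, Any]]:
        # One slice per parent record, wrapped under a "parent" key.
        for record in self.parent.read_records():
            yield {"parent": record}


class Users(ToyStream):
    use_cache = True


users = Users([{"id": 1}, {"id": 2}])
synced_users = list(users.read_records())           # first read fills the cache
slices = list(ToySubStream(users).stream_slices())  # second read replays it
```

In this sketch the parent records are fetched once even though both the parent sync and the child's stream_slices read them, which is the saving the cache is meant to provide.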

keu · 2021-09-16T06:39:10Z

airbyte-cdk/python/airbyte_cdk/sources/streams/http/http.py

+        self, sync_mode: SyncMode, cursor_field: List[str] = None, stream_state: Mapping[str, Any] = None
+    ) -> Iterable[Optional[Mapping[str, Any]]]:
+        parent_stream_slices = self.parent.stream_slices(
+                sync_mode=sync_mode,


Unfortunately, we can't use the same sync_mode here. In the incremental case we might miss records from the child stream, because creating or updating a child record doesn't necessarily update the cursor_field on the parent record.
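
A small illustration of the concern, using made-up data. The SyncMode enum below is a local stand-in, and PARENTS, CHILDREN, read_parents and child_ids are hypothetical names, none of them CDK code. Parent 1 was last updated before the saved state, but one of its children changed afterwards.

```python
from enum import Enum
from typing import Any, List, Mapping, Optional


class SyncMode(Enum):
    """Local stand-in for the CDK's SyncMode."""

    full_refresh = "full_refresh"
    incremental = "incremental"


# Made-up data: child "a" changed after its parent's last update.
PARENTS = [
    {"id": 1, "updated_at": "2021-09-01"},
    {"id": 2, "updated_at": "2021-09-15"},
]
CHILDREN = {
    1: [{"id": "a", "updated_at": "2021-09-16"}],
    2: [{"id": "b", "updated_at": "2021-09-15"}],
}


def read_parents(sync_mode: SyncMode, stream_state: Optional[Mapping[str, Any]] = None) -> List[Mapping[str, Any]]:
    # In incremental mode, only parents newer than the saved cursor are returned.
    cursor = (stream_state or {}).get("updated_at")
    if sync_mode == SyncMode.incremental and cursor:
        return [p for p in PARENTS if p["updated_at"] > cursor]
    return list(PARENTS)


def child_ids(sync_mode: SyncMode, stream_state: Optional[Mapping[str, Any]] = None) -> List[str]:
    # Children are only reachable through the parents that were read.
    parents = read_parents(sync_mode, stream_state)
    return [child["id"] for parent in parents for child in CHILDREN[parent["id"]]]


state = {"updated_at": "2021-09-10"}
missed = child_ids(SyncMode.incremental, state)  # parent 1 is filtered out, so "a" is lost
complete = child_ids(SyncMode.full_refresh)      # every parent is visited
```

Reading the parent in full refresh mode costs more requests, but it is the only mode in this sketch that reaches every child record.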

keu · 2021-09-16T06:42:39Z

airbyte-cdk/python/airbyte_cdk/sources/streams/http/http.py

+
+            # iterate over all parent records with current stream_slice
+            for record in parent_records:
+                yield record


Suggested change

-                yield record
+                yield {'parent': record}

In the general case a slice is a dict, because we might want to extend slices (a slice of a slice, etc.).
For example, a slice read by date and by parent record:

{ "date": "2020-10-10 03:00:00", "parent": <parent_record>, }
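
A sketch of how such a combined slice could be produced and consumed. The function names and the "since" and "parent_id" parameter names are assumptions made for illustration, not part of the CDK.

```python
from typing import Any, Dict, Iterator, List, Mapping


def stream_slices(dates: List[str], parent_records: List[Mapping[str, Any]]) -> Iterator[Dict[str, Any]]:
    # Each slice is a dict, so another dimension can be added as one more key.
    for date in dates:
        for record in parent_records:
            yield {"date": date, "parent": record}


def request_params(stream_slice: Mapping[str, Any]) -> Dict[str, Any]:
    # Hypothetical consumer: reads both dimensions back out of the slice.
    return {"since": stream_slice["date"], "parent_id": stream_slice["parent"]["id"]}


slices = list(stream_slices(["2020-10-10 03:00:00"], [{"id": 1}, {"id": 2}]))
params = request_params(slices[1])
```

Yielding a bare record would leave no room for the "date" key, which is the reason for wrapping the record under "parent".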

keu

A few comments.

sherifnada · 2021-09-20T03:15:28Z

Waiting for @keu's review. Also adding @Phlair since he's involved with the CDK.

Phlair

Really good to see this getting added to CDK!

Couple of open questions around slicing.

Phlair · 2021-09-21T09:17:49Z

airbyte-cdk/python/airbyte_cdk/sources/streams/http/http.py

+                sync_mode=sync_mode,
+                cursor_field=cursor_field,
+                stream_slice=stream_slice,
+                stream_state=stream_state


Here we're making the read_records call on the parent stream using the substream's sync mode and state. Are there scenarios where this means we could miss relevant data if we're in incremental mode and have a recent state?

gaart · 2021-09-21

Correct, just fixed it.

Phlair · 2021-09-21T09:21:32Z

airbyte-cdk/python/airbyte_cdk/sources/streams/http/http.py

+
+            # iterate over all parent records with current stream_slice
+            for record in parent_records:
+                yield {"parent": record}


This doesn't guarantee the time order of slices, right? We can assume that the iteration "for stream_slice in parent_stream_slices:" is ordered, but I don't think the records within a slice are guaranteed to iterate in order.

gaart · 2021-09-21

Yes, that's right.

Phlair · 2021-09-21

OK, now that you've changed the call above to full_refresh, this is safe because we're going to grab all parent records on every sync.

keu · 2021-09-21T14:56:33Z

airbyte-cdk/python/airbyte_cdk/sources/streams/http/http.py

+        if self.use_cache:
+            self.cache_file = self.request_cache()
+            # we need this attr to get metadata about cassettes, such as record play count, all records played, etc.
+            self.cass = None


WDYT?

Suggested change

-            self.cass = None
+            self.cassette = None

gaart · 2021-09-22T16:50:34Z

/publish-cdk dry-run=true

🕑 https://github.com/airbytehq/airbyte/actions/runs/1262701535
✅ https://github.com/airbytehq/airbyte/actions/runs/1262701535

gaart · 2021-09-22T17:05:14Z

/publish-cdk dry-run=false

🕑 https://github.com/airbytehq/airbyte/actions/runs/1262747568
✅ https://github.com/airbytehq/airbyte/actions/runs/1262747568

Add caching

4036d9d

github-actions bot added the CDK Connector Development Kit label Sep 14, 2021

Upd cache file handling

550f757

gaart linked an issue Sep 15, 2021 that may be closed by this pull request

CDK: Add support for efficient parent/child streams #3380

Closed

gaart requested a review from keu September 15, 2021 20:28

gaart changed the title ~~Add caching~~ CDK: parent/child streams Sep 16, 2021

gaart changed the title ~~CDK: parent/child streams~~ 🎉 CDK: parent/child streams Sep 16, 2021

keu reviewed Sep 16, 2021

keu suggested changes Sep 16, 2021

gaart added 2 commits September 16, 2021 13:35

Merge branch 'master' of github.com:airbytehq/airbyte into gaart/3380…

0fa705f

…-parent-child-streams # Conflicts: # airbyte-cdk/python/airbyte_cdk/sources/streams/http/http.py # airbyte-cdk/python/setup.py # airbyte-cdk/python/unit_tests/sources/streams/http/test_http.py

Upd slices, sync mode, docs

8398985

gaart requested a review from keu September 16, 2021 11:39

gaart marked this pull request as ready for review September 16, 2021 15:00

gaart requested review from eliziario and vitaliizazmic September 16, 2021 15:00

vitaliizazmic approved these changes Sep 17, 2021

Bump version

66b5344

gaart requested a review from sherifnada September 17, 2021 14:33

sherifnada requested a review from Phlair September 20, 2021 03:15

Phlair reviewed Sep 21, 2021

Use SyncMode.full_refresh for parent stream_slices

40e9a09

keu reviewed Sep 21, 2021

keu approved these changes Sep 21, 2021

Phlair approved these changes Sep 21, 2021

gaart added 2 commits September 22, 2021 19:42

Refactor

4376194

Merge branch 'master' into gaart/3380-parent-child-streams

5708545

gaart temporarily deployed to more-secrets September 22, 2021 16:46 Inactive

gaart changed the title ~~🎉 CDK: parent/child streams~~ 🎉 CDK: Added support for efficient parent/child streams using cache Sep 22, 2021

gaart merged commit 9aa5a5a into master Sep 22, 2021

gaart deleted the gaart/3380-parent-child-streams branch September 22, 2021 17:23

jrhizor mentioned this pull request Sep 25, 2021

Bump Airbyte version from 0.29.21-alpha to 0.29.22-alpha #6450

Merged
