feat(airbyte-cdk): Add Per Partition with Global fallback Cursor #45125

tolik0 · 2024-09-04T14:18:27Z

What

Added a new default cursor type for all substreams: Per Partition with Global Fallback. This cursor starts with per-partition management but switches to a global cursor when the number of records in the parent stream exceeds a defined threshold.

How

Implemented the Per Partition with Global Fallback cursor.
This cursor tracks the state per partition but falls back to a global cursor when the number of records in the parent stream exceeds two times the partition limit.
The fallback mechanism helps optimize performance for large datasets by reducing the state size and improving sync efficiency.
Updated substreams to use this cursor as the default, replacing the previous default per-partition cursor.
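
The switching rule described above can be sketched as follows. This is a minimal illustration and not the CDK implementation: the class name `FallbackTracker`, the `partition_limit` parameter, and the `observe_parent_record` method are hypothetical, and only the "more than two times the partition limit" threshold is taken from the description.

```python
class FallbackTracker:
    """Hypothetical helper that decides when to stop tracking state per partition."""

    def __init__(self, partition_limit: int) -> None:
        self.partition_limit = partition_limit
        self.parent_records_seen = 0
        self.use_global_cursor = False

    def observe_parent_record(self) -> None:
        # Count parent records; once the count exceeds twice the limit,
        # switch to the global cursor and stay there.
        self.parent_records_seen += 1
        if self.parent_records_seen > 2 * self.partition_limit:
            self.use_global_cursor = True


tracker = FallbackTracker(partition_limit=2)
for _ in range(5):
    tracker.observe_parent_record()
print(tracker.use_global_cursor)
```

In this sketch the switch is one-way: once the threshold is crossed, the tracker never returns to per-partition mode, which matches the "falls back" wording but is otherwise an assumption.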

Review guide

airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/per_partition_with_global.py
airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py
airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py
airbyte-cdk/python/airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

User Impact

Substreams will now use the new Per Partition with Global Fallback cursor by default, improving performance and scalability for streams with large numbers of partitions.

Can this PR be safely reverted and rolled back?

YES 💚
NO ❌

girarda · 2024-09-04T16:05:36Z

airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py

+        self._stream_cursor = stream_cursor
+        self._partition_router = partition_router
+        self._timer = Timer()
+        self._lock = threading.Lock()


@brianjlai can you confirm what the plan is for the coalescing of the cursors? Will we be using the existing low-code classes with the concurrent cdk?

context: this is adding thread safety logic in case this does get used in a concurrent context

So our current plan is that the low-code processing (the entrypoint being DeclarativeStream.read_records()) won't be managing state or cursors at all. The concurrent processing framework will be responsible for instantiating and invoking cursor methods. So right now that means using the existing ConcurrentCursor.

As it pertains to this work, this seems like it should be fine. But I think whenever we get to substream state for concurrent, we would need to port the new low-code global cursor into concurrent because there is no existing substream concurrent cursor implementation
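
For readers unfamiliar with the pattern in the diff above, here is a minimal sketch of what guarding shared cursor bookkeeping with a `threading.Lock` can look like. The `SliceCounter` class and its `close_slice` method are hypothetical stand-ins and do not reflect the actual `GlobalSubstreamCursor` code.

```python
import threading


class SliceCounter:
    """Hypothetical stand-in for cursor bookkeeping shared across threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._closed_slices = 0

    def close_slice(self) -> None:
        # The lock makes the read-modify-write below atomic across threads.
        with self._lock:
            self._closed_slices += 1

    @property
    def closed_slices(self) -> int:
        with self._lock:
            return self._closed_slices


counter = SliceCounter()
threads = [threading.Thread(target=counter.close_slice) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(counter.closed_slices)
```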

codeflash-ai · 2024-09-12T15:25:48Z

⚡️ Codeflash found optimizations for this PR

📄 `Timer.finish()` in `airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py`

📈 Performance improved by 39% (0.39x faster)

⏱️ Runtime went down from 53.5 microseconds to 38.5 microseconds

I created a new dependent PR with the suggested changes. Please review:

⚡️ Speed up method Timer.finish by 39% in PR #45125 (tolik0/airbyte-cdk/add-per-partition-with-global-fallback) #45419

If you approve, it will be merged into this PR (branch tolik0/airbyte-cdk/add-per-partition-with-global-fallback).

codeflash-ai · 2024-09-12T15:35:38Z

⚡️ Codeflash found optimizations for this PR

📄 `GlobalSubstreamCursor.set_initial_state()` in `airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py`

📈 Performance improved by 381% (3.81x faster)

⏱️ Runtime went down from 2.18 milliseconds to 454 microseconds

I created a new dependent PR with the suggested changes. Please review:

⚡️ Speed up method GlobalSubstreamCursor.set_initial_state by 381% in PR #45125 (tolik0/airbyte-cdk/add-per-partition-with-global-fallback) #45420

If you approve, it will be merged into this PR (branch tolik0/airbyte-cdk/add-per-partition-with-global-fallback).

tolik0 · 2024-09-17T09:26:26Z

/format-fix

Format-fix job started... Check job output.

✅ Changes applied successfully. (9879d3e)

codeflash-ai · 2024-09-19T14:59:14Z

⚡️ Codeflash found optimizations for this PR

📄 `Timer.finish()` in `airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py`

📈 Performance improved by 47% (0.47x faster)

⏱️ Runtime went down from 64.0 microseconds to 43.5 microseconds

I created a new dependent PR with the suggested changes. Please review:

⚡️ Speed up method Timer.finish by 47% in PR #45125 (tolik0/airbyte-cdk/add-per-partition-with-global-fallback) #45675

If you approve, it will be merged into this PR (branch tolik0/airbyte-cdk/add-per-partition-with-global-fallback).

brianjlai

Nice work, approving! Does the issue you found in source-jira for the substream dependency on the parent record prevent us from merging this in, or was that just because you were trying to bump and test this against it?

tolik0 · 2024-09-20T09:56:44Z

@brianjlai We can merge this PR, but for Jira, we need to fix the partition bug first.

maxi297

I think this is a solid PR. I just have a couple of questions, but I can approve right now.

airbyte-cdk/python/airbyte_cdk/sources/declarative/async_job/job.py

maxi297 · 2024-09-20T14:21:28Z

airbyte-cdk/python/airbyte_cdk/sources/declarative/extractors/record_filter.py

@@ -50,14 +50,12 @@ class ClientSideIncrementalRecordFilterDecorator(RecordFilter):
    def __init__(
        self,
        date_time_based_cursor: DatetimeBasedCursor,
-        per_partition_cursor: Optional[PerPartitionCursor] = None,


Probably outside of the scope of this PR, but could we eventually have just one cursor as a parameter here? I'm trying to understand why we need both cursors, and it seems like we could just have one of the interface Cursor and the filtering code would look like:

    def filter_records(
        self,
        records: Iterable[Mapping[str, Any]],
        stream_state: StreamState,
        stream_slice: Optional[StreamSlice] = None,
        next_page_token: Optional[Mapping[str, Any]] = None,
    ) -> Iterable[Mapping[str, Any]]:
        records = (
            record for record in records
            if self._cursor.should_be_synced(record)
        )
        if self.condition:
            records = super().filter_records(
                records=records, stream_state=stream_state, stream_slice=stream_slice, next_page_token=next_page_token
            )
        yield from records

If we agree that this is a path forward, I'll create an issue for that

It is not possible now. The issue is that _substream_cursor doesn't have methods to work with the cursor, for example: select_best_end_datetime, parse_date.

I'm not sure I understand: if we use if self._cursor.should_be_synced(record), select_best_end_datetime and parse_date can be private, right?

maxi297 · 2024-09-20T14:22:58Z

airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/per_partition_with_global.py

+        # Iterate through partitions and process slices
+        for partition, is_last_partition in iterate_with_last_flag(self._partition_router.stream_slices()):
+            # Generate slices for the current cursor and handle the last slice using the flag
+            if partition is None:


Why do we need this if? In which case can this happen? Why do we simply continue? Should we maybe log?

The same questions apply for if slice is None: below. Depending on the answers, it feels like it might be interesting to have unit tests about those

I refactored the iterate_with_last_flag to avoid skipping None.
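
For context, a helper of this kind can be sketched as below. Only the name `iterate_with_last_flag` and the `(item, is_last)` pairing come from the diff above; the body is an assumption written for illustration and is not the code merged in this PR. Note that it yields every item, including `None`, rather than skipping any.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def iterate_with_last_flag(items: Iterable[T]) -> Iterator[Tuple[T, bool]]:
    """Yield (item, is_last) pairs; yields nothing for an empty iterable."""
    iterator = iter(items)
    try:
        current = next(iterator)
    except StopIteration:
        return  # Empty input: nothing to yield.
    for upcoming in iterator:
        # There is at least one more item, so `current` is not the last one.
        yield current, False
        current = upcoming
    yield current, True


for partition, is_last_partition in iterate_with_last_flag(["a", "b", "c"]):
    print(partition, is_last_partition)
```

Looking one item ahead keeps the helper lazy, so it also works with generators whose length is not known up front.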

brianjlai · 2024-10-24T02:33:51Z

Latest changes make sense. I do see a failing test test_session_token_auth, do you know if this is consistently failing with the latest changes or a transient test failure?

maxi297

test_check_is_valid_session_token_unauthorized seems to be failing but I'm not sure how this is related to your changes.

I have a concern about the PerPartition/GlobalSubstream cursors bleeding into other classes. I don't fully get the reasons yet, and I would like to understand them and see if we can do it differently before approving.

maxi297 · 2024-10-24T06:15:35Z

airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/per_partition_with_global.py

+        self._use_global_cursor = stream_state.get("use_global_cursor", False)
+
+        self._global_cursor.set_initial_state(stream_state)
+        self._per_partition_cursor.set_initial_state(stream_state)


Should this be done only if not self._use_global_cursor?

Fixed. However, it doesn’t make much of a difference since we don’t save the per-partition cursor if the global cursor is being used.
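
The guarded initialization discussed in this thread might look roughly like the sketch below. It assumes the `use_global_cursor` state key shown in the diff; `RecordingCursor` and `FallbackCursorSketch` are hypothetical names used only to make the example self-contained, and the real class may differ.

```python
from typing import Any, Mapping, Optional


class RecordingCursor:
    """Hypothetical stub that remembers the state it was initialized with."""

    def __init__(self) -> None:
        self.initial_state: Optional[Mapping[str, Any]] = None

    def set_initial_state(self, stream_state: Mapping[str, Any]) -> None:
        self.initial_state = stream_state


class FallbackCursorSketch:
    def __init__(self, global_cursor: RecordingCursor, per_partition_cursor: RecordingCursor) -> None:
        self._global_cursor = global_cursor
        self._per_partition_cursor = per_partition_cursor
        self._use_global_cursor = False

    def set_initial_state(self, stream_state: Mapping[str, Any]) -> None:
        self._use_global_cursor = stream_state.get("use_global_cursor", False)
        self._global_cursor.set_initial_state(stream_state)
        # Only restore per-partition state when the global fallback is not active.
        if not self._use_global_cursor:
            self._per_partition_cursor.set_initial_state(stream_state)


global_cursor, per_partition_cursor = RecordingCursor(), RecordingCursor()
cursor = FallbackCursorSketch(global_cursor, per_partition_cursor)
cursor.set_initial_state({"use_global_cursor": True, "state": {"updated_at": "2024-01-01"}})
print(per_partition_cursor.initial_state)
```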

maxi297 · 2024-10-24T06:32:30Z

...e-cdk/python/airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py

@@ -261,7 +284,14 @@ def get_stream_state(self) -> Optional[Mapping[str, StreamState]]:
            }
        }
        """
-        return copy.deepcopy(self._parent_state)


I'm not sure I get this change: why would we return only the state for one partition when setting the parent_state? Why does it matter if it is the last partition being processed or not?

I'm a bit afraid of that because this clearly indicates that the SubstreamPartitionRouter must know about PerPartition states which makes it a circular logical dependency. This is dangerous because now, every time someone modifies one, they will need to know that there is an impact in the other one even though this is not explicit.

I agree with you. I’ve refactored the code to move all state handling to the cursor classes rather than the PartitionRouter.

For the CartesianPartitionRouter, it still needs to know which slice is last. Since the partition is a product of all possible combinations from multiple partition routers, we can’t reliably retrieve the current state of underlying streams until all partitions are processed. At that point, we can safely update the parent state for all partition routers without risking data loss. To avoid exposing this logic to the cursor classes, it’s better to pass a flag parameter to the PartitionRouter.

I discussed a bit with @brianjlai and there are a couple of points which would make us want to remove the last from this interface:

This is only needed by CartesianPartitionRouter

CartesianPartitionRouter is basically not used (there are custom connectors that re-implement it, but they could be replaced by something like PerPartition)

Having parent states with CartesianPartitionRouter can give an undetermined state, and I would prefer not to support that. For example, assuming state {cursor_field: 10} for parent 1 (whatever parent 1 is) and state {cursor_field: 20} for parent 2, one of the two streams will overwrite the cursor value of the other, and this will probably lead to data loss

Honestly, I think the concept of CartesianPartitionRouter should be deprecated and eventually removed. For those reasons, I prefer to keep the interface clean (i.e. without the last parameter) and ignore the CartesianPartitionRouter's problems for now. We should just flag this class as deprecated.

I disagree. The Cartesian product is a basic concept — the need to combine multiple lists into a product of elements. For example, when adding filters to Jira issues, projects and filters must be combined as a product.

After a sync with @tolik0, here is my understanding of the situation:

There are cases where this is used which I've missed (example). However, we think this case is scoffed and expect it to not work as the dev expected it to work

There are some hypothetical cases which @tolik0 identified with filters. For example, let's get all the comments with upvotes and downvotes (filters) for all the issues (parent stream). That being said, we don't have a prod use case in our pool of connectors for now.

The worst thing we've identified about changing the interface is that now, all the callers of get_stream_state need to track which partition is last. This is annoying because we would need to transpose this logic in the concurrent framework as well.

So as a middle ground, we've decided that:

We won't change the get_stream_state interface because of the added complexity and low value

We will not delete CartesianPartitionRouter as we see potential cases where this could be useful in the future

Because we don't know how things will evolve and we want to avoid scope creep, we will simply log a message when the CartesianProductSlicer is instantiated saying that state management will not work, and the methods that relate to state will do nothing. The reasoning is that it was scoffed anyway right now and we don't want to maintain it with the last boolean param
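
The middle ground agreed above could be sketched like this. It is an illustration under stated assumptions, not the merged code: `CartesianRouterSketch`, its constructor argument, the logger name, and the warning text are all hypothetical. The sketch only shows the shape of the decision, which is to warn at instantiation, keep producing the product of slices, and make the state methods no-ops.

```python
import itertools
import logging
from typing import Any, Dict, Iterator, List, Mapping, Optional

logger = logging.getLogger("cartesian_sketch")


class CartesianRouterSketch:
    """Hypothetical router combining slices from several underlying routers."""

    def __init__(self, slice_lists: List[List[Mapping[str, Any]]]) -> None:
        self._slice_lists = slice_lists
        # Warn once, up front, that parent state is not managed by this router.
        logger.warning("Parent state handling is not supported for this router; state methods are no-ops.")

    def stream_slices(self) -> Iterator[Mapping[str, Any]]:
        # Each combination takes one slice per underlying list and merges them.
        for combination in itertools.product(*self._slice_lists):
            merged: Dict[str, Any] = {}
            for part in combination:
                merged.update(part)
            yield merged

    def set_initial_state(self, stream_state: Mapping[str, Any]) -> None:
        # Intentionally a no-op: parent state is ignored.
        pass

    def get_stream_state(self) -> Optional[Mapping[str, Any]]:
        # Intentionally returns no parent state.
        return None


router = CartesianRouterSketch([[{"project": "A"}, {"project": "B"}], [{"filter": "x"}]])
print(list(router.stream_slices()))
```

Keeping `get_stream_state` free of a "last partition" argument means callers, including a future concurrent framework, do not need to track partition order.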

airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/per_partition_with_global.py

maxi297

After the new discussion, I'll (re)approve this PR. Thanks a lot for your hard work @tolik0

octavia-squidington-iii added area/documentation Improvements or additions to documentation CDK Connector Development Kit labels Sep 4, 2024

girarda reviewed Sep 4, 2024

vercel bot had a problem deploying to Preview September 6, 2024 11:56 Failure

tolik0 force-pushed the tolik0/airbyte-cdk/add-per-partition-with-global-fallback branch from 1ab0994 to 8daaacb Compare September 6, 2024 16:37

octavia-squidington-iii removed the area/documentation Improvements or additions to documentation label Sep 6, 2024

tolik0 added 6 commits September 10, 2024 18:34

Add PerPartitionWithGlobalCursor

49e75bf

Add switch between partitions

6874c76

Update tests

16847c1

Fix incremental tests

10c09a8

Add docs

01a55a7

Refactor per partition with global

5c90376

tolik0 force-pushed the tolik0/airbyte-cdk/add-per-partition-with-global-fallback branch from 8daaacb to 5c90376 Compare September 11, 2024 13:14

octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Sep 11, 2024

tolik0 added 3 commits September 12, 2024 14:25

Refactor

baa8af9

Fix timer

766bd72

Add new test

e18d430

codeflash-ai bot added a commit that referenced this pull request Sep 12, 2024

⚡️ Speed up method Timer.finish by 39% in PR #45125 (`tolik0/airbyt…

d2d97ca

…e-cdk/add-per-partition-with-global-fallback`) Sure, here is the optimized version of your Python program.

codeflash-ai bot mentioned this pull request Sep 12, 2024

⚡️ Speed up method Timer.finish by 39% in PR #45125 (tolik0/airbyte-cdk/add-per-partition-with-global-fallback) #45419

Closed

codeflash-ai bot mentioned this pull request Sep 12, 2024

⚡️ Speed up method GlobalSubstreamCursor.set_initial_state by 381% in PR #45125 (tolik0/airbyte-cdk/add-per-partition-with-global-fallback) #45420

Closed

tolik0 added 4 commits September 13, 2024 21:56

Refactor and update tests

bfd4a45

Update ClientSideIncrementalRecordFilterDecorator

2c60ef0

Fix formatting

506d296

Fix mypy errors

d8e27d4

Fix typo in test

0dc55e1

codeflash-ai bot mentioned this pull request Sep 19, 2024

⚡️ Speed up method Timer.finish by 47% in PR #45125 (tolik0/airbyte-cdk/add-per-partition-with-global-fallback) #45675

Closed

brianjlai approved these changes Sep 20, 2024

tolik0 requested review from a team and maxi297 September 20, 2024 09:45

maxi297 approved these changes Sep 20, 2024

tolik0 added 2 commits October 4, 2024 18:25

Simplified iterate_with_last_flag

3042d28

Merge branch 'master' into tolik0/airbyte-cdk/add-per-partition-with-…

8cfc0ba

…global-fallback

vercel bot deployed to Preview October 7, 2024 11:24 View deployment

tolik0 added 2 commits October 7, 2024 14:31

Fix formatting

bd507e0

Fix mypy error

e6976fc

bazarnov approved these changes Oct 7, 2024

lazebnyi approved these changes Oct 8, 2024

Merge branch 'master' into tolik0/airbyte-cdk/add-per-partition-with-…

c998f5c

…global-fallback

vercel bot deployed to Preview October 18, 2024 13:31 View deployment

tolik0 added 2 commits October 22, 2024 18:01

Align with new SubstreamPartitionRouter

58c96f2

Update the state management

1fb01a9

maxi297 reviewed Oct 24, 2024

airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/per_partition_with_global.py

tolik0 added 6 commits October 24, 2024 18:13

Move state management from partition router to cursor classes

2b38e0e

Fix mypy errors

e82eddc

Fix comment

0d177dc

Fix unit tests

cf06fa1

Delete old code

536f97c

Fix unit test helper function

cd81e19

maxi297 approved these changes Oct 25, 2024

tolik0 added 2 commits October 25, 2024 19:39

Delete last argument from partition router get_stream_state

e82ee3c

Add unit test for warning in CartesianPartitionRouter

365e556
