-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(airbyte-cdk): Add Per Partition with Global fallback Cursor #45125
base: master
Are you sure you want to change the base?
feat(airbyte-cdk): Add Per Partition with Global fallback Cursor #45125
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Skipped Deployment
|
self._stream_cursor = stream_cursor | ||
self._partition_router = partition_router | ||
self._timer = Timer() | ||
self._lock = threading.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brianjlai can you confirm what the plan is for the coalescing of the cursors? Will we be using the existing low-code classes with the concurrent cdk?
context: this is adding thread safety logic in case this does get use in a concurrent context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So our current plan is that the low-code processing (the entrypoint being DeclarativeStream.read_records()
won't be managing state or cursors at all. And the concurrent processing framework will be responsible for instantiating and invoking cursor methods. So right now that means using the existing ConcurrentCursor
.
As it pertains to this work, this seems like it should be fine. But I think whenever we get to substream state for concurrent, we would need to port the new low-code global cursor into concurrent because there is no existing substream concurrent cursor implementation
1ab0994
to
8daaacb
Compare
8daaacb
to
5c90376
Compare
…e-cdk/add-per-partition-with-global-fallback`) Sure, here is the optimized version of your Python program.
⚡️ Codeflash found optimizations for this PR📄
|
…in PR #45125 (`tolik0/airbyte-cdk/add-per-partition-with-global-fallback`) Here are some optimizations for the provided code. 1. Avoid repeated attribute look-ups and repeated function calls. 2. Use local variables instead of instance attributes within methods where feasible to reduce attribute access overhead. 3. Refactor any repeated logic into a more efficient place. Here's the optimized code. ### Summary of changes. 1. Combined property lookups for `step`, `cursor_granularity`, `lookback_window`, and `datetime_format` to avoid repeated access. 2. Used local variables where possible. 3. Simplified redundant logic.
⚡️ Codeflash found optimizations for this PR📄
|
/format-fix
|
⚡️ Codeflash found optimizations for this PR📄
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice work approving! does the issue you found in source-jira
for the substream dependency on the parent record prevent us from merging this in, or was that just because you were trying to bump and test this against it?
@brianjlai We can merge this PR, but for Jira, we need to fix the partition bug first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a solid PR. I just have a couple question but I can approve right now
@@ -50,14 +50,12 @@ class ClientSideIncrementalRecordFilterDecorator(RecordFilter): | |||
def __init__( | |||
self, | |||
date_time_based_cursor: DatetimeBasedCursor, | |||
per_partition_cursor: Optional[PerPartitionCursor] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably outside of the scope of this PR but could we eventually have just one cursor as a parameter here? I'm trying to understand why we need both cursor and it seems like we could just have one of the interfaice Cursor
and the filtering code would look like:
def filter_records(
self,
records: Iterable[Mapping[str, Any]],
stream_state: StreamState,
stream_slice: Optional[StreamSlice] = None,
next_page_token: Optional[Mapping[str, Any]] = None,
) -> Iterable[Mapping[str, Any]]:
records = (
record
for record in records
if self._cursor.should_be_synced(record)
)
if self.condition:
records = super().filter_records(
records=records, stream_state=stream_state, stream_slice=stream_slice, next_page_token=next_page_token
)
yield from records
If we agree that this is a path forward, I'll create an issue for that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not possible now. The issue is that _substream_cursor
doesn't have methods to work with the cursor, for example: select_best_end_datetime
, parse_date
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand: if we use if self._cursor.should_be_synced(record)
, select_best_end_datetime
and parse_date
can be private, right?
# Iterate through partitions and process slices | ||
for partition, is_last_partition in iterate_with_last_flag(self._partition_router.stream_slices()): | ||
# Generate slices for the current cursor and handle the last slice using the flag | ||
if partition is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this if? In which case can this happen? Why do we simply continue
? Should we maybe log?
The same questions apply for if slice is None:
below. Depending on the answers, it feels like it might be interesting to have unit tests about those
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored the iterate_with_last_flag
to avoid skipping None.
Latest changes make sense. I do see a failing test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_check_is_valid_session_token_unauthorized
seems to be failing but I'm not sure how this is related to your changes.
I have a concern about the PerPartition/GlobalSubstream cursors bleeding in other classes. I don't exactly get the reasons and would like to understand that and see if we can do it otherwise before approving.
@@ -50,14 +50,12 @@ class ClientSideIncrementalRecordFilterDecorator(RecordFilter): | |||
def __init__( | |||
self, | |||
date_time_based_cursor: DatetimeBasedCursor, | |||
per_partition_cursor: Optional[PerPartitionCursor] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand: if we use if self._cursor.should_be_synced(record)
, select_best_end_datetime
and parse_date
can be private, right?
self._use_global_cursor = stream_state.get("use_global_cursor", False) | ||
|
||
self._global_cursor.set_initial_state(stream_state) | ||
self._per_partition_cursor.set_initial_state(stream_state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be done only if not self._use_global_cursor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. However, it doesn’t make much of a difference since we don’t save the per-partition cursor if the global cursor is being used.
@@ -261,7 +284,14 @@ def get_stream_state(self) -> Optional[Mapping[str, StreamState]]: | |||
} | |||
} | |||
""" | |||
return copy.deepcopy(self._parent_state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I get this change: why would we return only the state for one partition when setting the parent_state? Why does it matter if it is the last partition being processed or not?
I'm a bit afraid of that because this clearly indicates that the SubstreamPartitionRouter must know about PerPartition states which makes it a circular logical dependency. This is dangerous because now, every time someone modifies one, they will need to know that there is an impact in the other one even though this is not explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. I’ve refactored the code to move all state handling to the cursor classes rather than the PartitionRouter
.
For the CartesianPartitionRouter
, it still needs to know which slice is last. Since the partition is a product of all possible combinations from multiple partition routers, we can’t reliably retrieve the current state of underlying streams until all partitions are processed. At that point, we can safely update the parent state for all partition routers without risking data loss. To avoid exposing this logic to the cursor classes, it’s better to pass a flag parameter to the PartitionRouter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed a bit with @brianjlai and there are a couple of points which would make us want to remove the last
from this interface:
- This is only needed by
CartesianPartitionRouter
CartesianPartitionRouter
is basically not used (there are custom connectors that re-implements it but they could be replaced by something like PerPartition- Having parent states with
CartesianPartitionRouter
can gives undetermined state and I would prefer not to support that. For example, assuming state {cursor_field: 10} for parent 1 (whatever parent 1 is) and state {cursor_field: 20} for parent 2, those one of the two streams will overwrite the cursor value of the other and this will probably lead to data loss
Honestly, I think the concept of CartesianPartitionRouter
should be deprecated and eventually removed. For those reasons, I prefer to keep the interface clean (i.e. without the last
parameter) and ignore the CartesianPartitionRouter's problems for now. We should just flag this class as deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree. The Cartesian product is a basic concept — the need to combine multiple lists into a product of elements. For example, when adding filters to Jira issues, projects and filters must be combined as a product.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a sync with @tolik0, here is my understanding of the situation:
- There are cases where this is used which I've missed (example). However, we think this case is scoffed and expect it to not work as the dev expected it to work
- There are some hypothetical cases which @tolik0 identified with filters. For example, let's get all the comments with upvotes and downvotes (filters) for all the issues (parent stream). That being said, we don't have a prod use case in our pool of connectors for now.
- The worst thing we've identified about changing the interface is that now, all the callers of get_stream_state needs to track which partition is last. This is annoying because we would need to transpose this logic in the concurrent framework as well.
So as a middle ground, we've decided that:
- We won't change the
get_stream_state
interface because of the added complexity and low value - We will not delete
CartesianPartitionRouter
as we see potential cases where this could be useful in the future - Because we don't know how things will evolve and we want to avoid scope creep for that, we will simply add a log saying that the state management will not work when the CartesianProductSlicer is instantiated and don't do anything in the methods that relates to state. The reasoning is that is was scoffed anyway right now and we don't want to maintain it with the
last
boolean param
airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/per_partition_with_global.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the new discussion, I'll (re)approve this PR. Thanks a lot for your hard work @tolik0
What
Added a new default cursor type for all substreams: Per Partition with Global Fallback. This cursor starts with per-partition management but switches to a global cursor when the number of records in the parent stream exceeds a defined threshold.
How
Review guide
airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/per_partition_with_global.py
airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py
airbyte-cdk/python/airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py
airbyte-cdk/python/airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
User Impact
Substreams will now use the new Per Partition with Global Fallback cursor by default, improving performance and scalability for streams with large numbers of partitions.
Can this PR be safely reverted and rolled back?