-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🎉 Source GitHub: Use CDK caching and convert PR-related streams to incremental #7250
Conversation
@marcosmarxm ping |
thanks @cjwooo we're going to review this tomorrow/monday! There is a high demand due hacktober contributions :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few questions.
Also since you converted two streams to incremental please add them to the following files:
integration_tests/abnormal_state.json
;cursor_paths
section inacceptance-test-config.yml
.
Also, could you please run integration tests locally and send the result here?
airbyte-integrations/connectors/source-github/source_github/streams.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-github/source_github/streams.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-github/source_github/streams.py
Outdated
Show resolved
Hide resolved
airbyte-integrations/connectors/source-github/source_github/streams.py
Outdated
Show resolved
Hide resolved
@@ -503,7 +423,7 @@ def is_sorted_descending(self) -> bool: | |||
""" | |||
Depending if there any state we read stream in ascending or descending order. | |||
""" | |||
return self._first_read | |||
return not self._first_read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why? Did we have an error connected to this? Did we send the wrong direction
parameter or what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it was sending the wrong direction. https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-github/source_github/streams.py#L495-L496 states we want to sort in ascending order for the first run, then descending order for subsequent runs to allow the incremental behavior in https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-github/source_github/streams.py#L701-L702. However, the current stream version is setting is_sorted_descending
to true if self._first_read
is true, which is the opposite behavior.
@Zirochkaa I always get read errors when running acceptance tests locally, regardless of which connector version I'm testing. Not sure what I'm doing wrong. Here's the output of me running the acceptance test on the master branch, using connector version |
Hello @cjwooo it's throwing a timeout error. You can increase this variable to see if helps solving the problems with tests. |
Thanks @marcosmarxm , turns out I wasn't running the tests on the correct repo as well.
|
@Zirochkaa I've addressed your comments, can you please rereview? |
I need to update this PR to reflect latest changes https://github.com/airbytehq/airbyte/pull/8030/files |
@cjwooo let me know when you need a review for this contribution. |
@marcosmarxm This is now ready for review. Log of successful acceptance test (without collaborators, events, or teams streams since my github token does not have permission for those):
|
@cjwooo @Zirochkaa will review this week your contribution! |
airbyte-integrations/connectors/source-github/source_github/streams.py
Outdated
Show resolved
Hide resolved
parent_stream_slices = list(super().stream_slices(sync_mode=sync_mode, cursor_field=cursor_field, stream_state=stream_state)) | ||
if self.parent.is_sorted_descending: | ||
parent_stream_slices.reverse() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this part of the code correctly, a list of all parent records is being created here. So if, for example, pull_requests stream has 5000 records then all of them will be placed in the list. We can't do that because it kills the idea of stream_slices being a generator function and the idea that we output one record at a time and not store all record for specific stream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to force the pull_requests stream to fetch records in ascending order via an override param in the stream_state.
|
Bump |
@cjwooo I requested to @Zirochkaa review your changes. |
Bump |
@cjwooo can you run |
@marcosmarxm Done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @cjwooo
What
Currently, GitHub source uses its own request caching logic instead of the CDK's, and the PullRequestStats and Reviews streams don't utilize their parent PullRequest stream's cache, and they run the parent stream without any start_date parameter, resulting in tons of extra Github API calls, whether in full refresh or incremental mode, that can even pull data before the configured start date. This PR fixes those issues.
How
Switch to use the CDK's request cache.
Fix PullRequests stream sorting order.
PullRequestStats and Reviews streams:
Recommended reading order
x.java
y.python
Pre-merge Checklist
airbyte_secret
./gradlew :airbyte-integrations:connectors:<name>:integrationTest
.README.md
bootstrap.md
. See description and examplesdocs/integrations/<source or destination>/<name>.md
including changelog. See changelog example