🎉 Source GitHub: Use CDK caching and convert PR-related streams to incremental #7250

Merged
merged 18 commits into from
Jan 6, 2022

cjwooo
Contributor

@cjwooo cjwooo commented Oct 21, 2021

What

Currently, the GitHub source uses its own request caching logic instead of the CDK's. The PullRequestStats and Reviews streams don't use their parent PullRequests stream's cache, and they run the parent stream without any start_date parameter. The result is many extra GitHub API calls, in both full refresh and incremental mode, and those calls can even pull data from before the configured start date. This PR fixes those issues.

How

Switch to use the CDK's request cache.
Fix PullRequests stream sorting order.
PullRequestStats and Reviews streams:

  • reuse a single parent PullRequests stream instance to utilize the request cache properly
  • support incremental mode
  • pass start_date and stream state to parent stream so that the child streams pull data for only the exact PRs that the parent PullRequests stream pulls
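
The approach in the list above can be illustrated with a minimal, self-contained sketch. It does not use the Airbyte CDK: `FakeApi`, `ParentStream`, and `ChildStream` are hypothetical names invented for this example, and the in-memory cache only stands in for the CDK's request cache. The point it shows is that two child streams sharing one parent instance trigger a single API call, and that `start_date` and stream state limit which parent records become slices.

```python
from typing import Dict, Iterator, List, Optional


class FakeApi:
    """Hypothetical stand-in for the GitHub API that counts calls."""

    def __init__(self, pull_requests: List[dict]) -> None:
        self._pull_requests = pull_requests
        self.calls = 0

    def list_pull_requests(self) -> List[dict]:
        self.calls += 1
        return list(self._pull_requests)


class ParentStream:
    """Hypothetical parent stream that caches its response and honors start_date."""

    def __init__(self, api: FakeApi, start_date: str) -> None:
        self._api = api
        self._start_date = start_date
        self._cache: Optional[List[dict]] = None

    def read_records(self, stream_state: Optional[Dict[str, str]] = None) -> Iterator[dict]:
        if self._cache is None:  # only the first reader hits the API
            self._cache = self._api.list_pull_requests()
        # Resume from saved state if present, otherwise from start_date.
        cutoff = (stream_state or {}).get("updated_at", self._start_date)
        for record in self._cache:
            if record["updated_at"] >= cutoff:
                yield record


class ChildStream:
    """Hypothetical child stream that slices on the shared parent's records."""

    def __init__(self, name: str, parent: ParentStream) -> None:
        self.name = name
        self.parent = parent

    def stream_slices(self, stream_state: Optional[Dict[str, str]] = None) -> Iterator[dict]:
        for pr in self.parent.read_records(stream_state=stream_state):
            yield {"pull_request_number": pr["number"]}


api = FakeApi([
    {"number": 1, "updated_at": "2021-01-01T00:00:00Z"},
    {"number": 2, "updated_at": "2021-06-01T00:00:00Z"},
    {"number": 3, "updated_at": "2021-10-01T00:00:00Z"},
])
parent = ParentStream(api, start_date="2021-05-01T00:00:00Z")
stats = ChildStream("pull_request_stats", parent)
reviews = ChildStream("reviews", parent)

stats_slices = list(stats.stream_slices())
reviews_slices = list(reviews.stream_slices(stream_state={"updated_at": "2021-09-01T00:00:00Z"}))
```

Creating a separate `ParentStream` per child would double the call count in this sketch, which is the waste the PR description refers to.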

Recommended reading order

  1. x.java
  2. y.python

Pre-merge Checklist

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions
  • Connector version bumped as described here

@github-actions github-actions bot added the area/connectors Connector related issues label Oct 21, 2021
@cjwooo
Contributor Author

cjwooo commented Oct 21, 2021

@marcosmarxm ping

@marcosmarxm
Member

Thanks @cjwooo, we're going to review this tomorrow or Monday! There is high demand due to Hacktoberfest contributions :D

@Zirochkaa changed the title from "Source GitHub: Use CDK caching and convert PR-related streams to incremental" to "🎉 Source GitHub: Use CDK caching and convert PR-related streams to incremental" on Oct 31, 2021
Contributor

@Zirochkaa Zirochkaa left a comment

I have a few questions.
Also since you converted two streams to incremental please add them to the following files:

  • integration_tests/abnormal_state.json;
  • cursor_paths section in acceptance-test-config.yml.

Also, could you please run integration tests locally and send the result here?

@@ -503,7 +423,7 @@ def is_sorted_descending(self) -> bool:
         """
         Depending if there any state we read stream in ascending or descending order.
         """
-        return self._first_read
+        return not self._first_read
Contributor

I don't understand why? Did we have an error connected to this? Did we send the wrong direction parameter or what?

Contributor Author

Yes, it was sending the wrong direction. https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-github/source_github/streams.py#L495-L496 states that we want to sort in ascending order for the first run, then in descending order for subsequent runs, to allow the incremental behavior in https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-github/source_github/streams.py#L701-L702. However, the current stream version sets is_sorted_descending to true if self._first_read is true, which is the opposite behavior.
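
The ordering logic described in this reply can be sketched roughly as follows. Only the names `is_sorted_descending` and `_first_read` come from the diff; the class `PullRequestsSketch`, its constructor, and the early stop on reaching the saved state are assumptions made for illustration, not the connector's actual code.

```python
from typing import Iterator, List, Optional


class PullRequestsSketch:
    """Hypothetical stream; only is_sorted_descending and _first_read mirror the diff."""

    def __init__(self, records: List[dict], state: Optional[str] = None) -> None:
        self._records = records
        self._state = state
        # No saved state means this is the first sync.
        self._first_read = state is None

    @property
    def is_sorted_descending(self) -> bool:
        # Fixed behavior: ascending on the first read, descending afterwards.
        return not self._first_read

    def read_records(self) -> Iterator[dict]:
        ordered = sorted(
            self._records,
            key=lambda r: r["updated_at"],
            reverse=self.is_sorted_descending,
        )
        for record in ordered:
            if self.is_sorted_descending and record["updated_at"] <= self._state:
                # Assumed early stop: the remaining records are not newer than the state.
                break
            yield record


records = [
    {"number": 1, "updated_at": "2021-01-01"},
    {"number": 2, "updated_at": "2021-02-01"},
    {"number": 3, "updated_at": "2021-03-01"},
]
first_sync = [r["number"] for r in PullRequestsSketch(records).read_records()]
later_sync = [r["number"] for r in PullRequestsSketch(records, state="2021-01-15").read_records()]
```

With the pre-fix return value (`self._first_read`), the first sync in this sketch would read newest-first and the later sync would read oldest-first, so the early stop would never skip already-synced records.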

@cjwooo
Contributor Author

cjwooo commented Nov 1, 2021

@Zirochkaa I always get read errors when running acceptance tests locally, regardless of which connector version I'm testing. Not sure what I'm doing wrong. Here's the output of me running the acceptance test on the master branch, using connector version airbyte/source-github:latest. https://gist.github.com/cjwooo/63d2f4d69472f1fe6bdffcb80f6dbbb6

@marcosmarxm
Member

Hello @cjwooo, it's throwing a timeout error. You can increase this variable to see if that helps solve the problems with the tests.

@cjwooo
Contributor Author

cjwooo commented Nov 9, 2021

Thanks @marcosmarxm, it turns out I also wasn't running the tests on the correct repo.
I had to disable the collaborators and teams streams locally since my GitHub token doesn't have access to that API for the integration-tests repo. After that, I get a successful test result.

~/projects/airbyte/airbyte-integrations/connectors/source-github on cwu/githubperf *3 !2 ?7  took 8s  source-github at 11:30:13
❯ bash acceptance-test-docker.sh
[+] Building 0.6s (12/12) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                                                                                                                                                                     0.0s
 => => transferring dockerfile: 37B                                                                                                                                                                                                                                                                                      0.0s
 => [internal] load .dockerignore                                                                                                                                                                                                                                                                                        0.0s
 => => transferring context: 34B                                                                                                                                                                                                                                                                                         0.0s
 => [internal] load metadata for docker.io/library/python:3.7-slim                                                                                                                                                                                                                                                       0.5s
 => [1/7] FROM docker.io/library/python:3.7-slim@sha256:c2cc09c3de140f59b3065b9518fa7beb5fbedb4414762963bfe01079ce219f2e                                                                                                                                                                                                 0.0s
 => [internal] load build context                                                                                                                                                                                                                                                                                        0.0s
 => => transferring context: 2.31kB                                                                                                                                                                                                                                                                                      0.0s
 => CACHED [2/7] RUN apt-get update && apt-get install -y bash && rm -rf /var/lib/apt/lists/*                                                                                                                                                                                                                            0.0s
 => CACHED [3/7] WORKDIR /airbyte/integration_code                                                                                                                                                                                                                                                                       0.0s
 => CACHED [4/7] COPY source_github ./source_github                                                                                                                                                                                                                                                                      0.0s
 => CACHED [5/7] COPY main.py ./                                                                                                                                                                                                                                                                                         0.0s
 => CACHED [6/7] COPY setup.py ./                                                                                                                                                                                                                                                                                        0.0s
 => CACHED [7/7] RUN pip install .                                                                                                                                                                                                                                                                                       0.0s
 => exporting to image                                                                                                                                                                                                                                                                                                   0.0s
 => => exporting layers                                                                                                                                                                                                                                                                                                  0.0s
 => => writing image sha256:1707245c72b2c5e92cc6d4cfdce809744ee149c77eecc5045cac6560605f512a                                                                                                                                                                                                                             0.0s
 => => naming to docker.io/library/source-github                                                                                                                                                                                                                                                                         0.0s

Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
latest: Pulling from airbyte/source-acceptance-test
Digest: sha256:83fd0a0947efb193b59cc7bbee4a6cd35bc34a46c8a100d9cee0d87d6c15dc92
Status: Image is up to date for airbyte/source-acceptance-test:latest
docker.io/airbyte/source-acceptance-test:latest
Test session starts (platform: linux, Python 3.7.11, pytest 6.2.5, pytest-sugar 0.9.4)
rootdir: /test_input
plugins: sugar-0.9.4, timeout-1.4.2
collecting ...
 test_core.py ✓✓✓✓✓✓✓✓✓✓✓                                                                                                                                                                                                                                                                                       79% ███████▉
 test_full_refresh.py ✓                                                                                                                                                                                                                                                                                         86% ████████▋
 test_incremental.py ✓✓                                                                                                                                                                                                                                                                                        100% ██████████

Results (88.24s):
      14 passed

@cjwooo
Contributor Author

cjwooo commented Nov 15, 2021

@Zirochkaa I've addressed your comments, can you please rereview?

@cjwooo
Contributor Author

cjwooo commented Nov 17, 2021

I need to update this PR to reflect latest changes https://github.com/airbytehq/airbyte/pull/8030/files

@marcosmarxm
Member

@cjwooo let me know when you need a review for this contribution.

@cjwooo
Contributor Author

cjwooo commented Dec 3, 2021

@marcosmarxm This is now ready for review. Log of successful acceptance test (without collaborators, events, or teams streams since my github token does not have permission for those):

~/projects/airbyte/airbyte-integrations/connectors/source-github on cwu/githubperf *3 !2 ?8  ✘ INT  source-github  14.16.0 at 18:31:24
❯ bash acceptance-test-docker.sh
[+] Building 0.7s (12/12) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                                                                                                                                                                     0.0s
 => => transferring dockerfile: 37B                                                                                                                                                                                                                                                                                      0.0s
 => [internal] load .dockerignore                                                                                                                                                                                                                                                                                        0.0s
 => => transferring context: 34B                                                                                                                                                                                                                                                                                         0.0s
 => [internal] load metadata for docker.io/library/python:3.7-slim                                                                                                                                                                                                                                                       0.5s
 => [1/7] FROM docker.io/library/python:3.7-slim@sha256:2556886c10b669a62c78726ea2f801a31e688064359bc576079c8ff443309fcb                                                                                                                                                                                                 0.0s
 => [internal] load build context                                                                                                                                                                                                                                                                                        0.0s
 => => transferring context: 2.31kB                                                                                                                                                                                                                                                                                      0.0s
 => CACHED [2/7] RUN apt-get update && apt-get install -y bash && rm -rf /var/lib/apt/lists/*                                                                                                                                                                                                                            0.0s
 => CACHED [3/7] WORKDIR /airbyte/integration_code                                                                                                                                                                                                                                                                       0.0s
 => CACHED [4/7] COPY source_github ./source_github                                                                                                                                                                                                                                                                      0.0s
 => CACHED [5/7] COPY main.py ./                                                                                                                                                                                                                                                                                         0.0s
 => CACHED [6/7] COPY setup.py ./                                                                                                                                                                                                                                                                                        0.0s
 => CACHED [7/7] RUN pip install .                                                                                                                                                                                                                                                                                       0.0s
 => exporting to image                                                                                                                                                                                                                                                                                                   0.0s
 => => exporting layers                                                                                                                                                                                                                                                                                                  0.0s
 => => writing image sha256:febbf9b32cf3b8f79b857d912237131a72f100ca75103ec5f888174bdcca0e6b                                                                                                                                                                                                                             0.0s
 => => naming to docker.io/airbyte/source-github                                                                                                                                                                                                                                                                         0.0s

Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
latest: Pulling from airbyte/source-acceptance-test
Digest: sha256:3559cb368543260761c44201dedcebce4a870447924b6e119a280830c342a67f
Status: Image is up to date for airbyte/source-acceptance-test:latest
docker.io/airbyte/source-acceptance-test:latest
Test session starts (platform: linux, Python 3.7.11, pytest 6.2.5, pytest-sugar 0.9.4)
rootdir: /test_input
plugins: sugar-0.9.4, timeout-1.4.2
collecting ...
 test_core.py ✓✓✓✓✓✓✓✓✓✓✓✓                                                                                                                                                                                                                                                                                      80% ████████
 test_full_refresh.py ✓                                                                                                                                                                                                                                                                                         87% ████████▋
 test_incremental.py ✓✓                                                                                                                                                                                                                                                                                        100% ██████████

Results (83.36s):
      15 passed

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Dec 3, 2021
@marcosmarxm
Member

@cjwooo @Zirochkaa will review your contribution this week!

@marcosmarxm marcosmarxm added the zzm label Dec 6, 2021
Comment on lines 667 to 669
parent_stream_slices = list(super().stream_slices(sync_mode=sync_mode, cursor_field=cursor_field, stream_state=stream_state))
if self.parent.is_sorted_descending:
    parent_stream_slices.reverse()
Contributor

If I understand this part of the code correctly, it builds a list of all parent records. So if, for example, the pull_requests stream has 5000 records, all of them will be placed in the list. We can't do that, because it defeats the idea of stream_slices being a generator function and the idea that we output one record at a time rather than storing all records for a specific stream.

Contributor Author

Changed to force the pull_requests stream to fetch records in ascending order via an override param in the stream_state.
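
The concern raised in this thread, and why reading the parent in ascending order removes the need for a list, can be illustrated with a small sketch. The function names `parent_records` and `stream_slices` and the `produced` tracker are hypothetical and only mimic the shape of the connector's methods; this is not the code from the PR.

```python
from itertools import islice
from typing import Iterator, List

produced: List[int] = []  # tracks which parent records were actually fetched


def parent_records(total: int) -> Iterator[dict]:
    """Hypothetical parent stream yielding records in ascending order."""
    for number in range(1, total + 1):
        produced.append(number)
        yield {"number": number}


def stream_slices(total: int) -> Iterator[dict]:
    """Yield one slice per parent record without building a list."""
    for record in parent_records(total):
        yield {"pull_request_number": record["number"]}


# Taking the first two slices only pulls two parent records.
first_two = list(islice(stream_slices(5000), 2))
```

By contrast, wrapping the parent slices in `list(...)` and calling `.reverse()` would consume all 5000 parent records before the first slice is emitted.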

@cjwooo
Contributor Author

cjwooo commented Dec 10, 2021

~/projects/airbyte/airbyte-integrations/connectors/source-github on cwu/githubperf *3 +1 !2 ?11  took 11s  source-github  14.16.0 at 14:56:40
❯ bash acceptance-test-docker.sh
[+] Building 0.6s (12/12) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                                                                     0.0s
 => => transferring dockerfile: 37B                                                                                                                                                                                      0.0s
 => [internal] load .dockerignore                                                                                                                                                                                        0.0s
 => => transferring context: 34B                                                                                                                                                                                         0.0s
 => [internal] load metadata for docker.io/library/python:3.7-slim                                                                                                                                                       0.4s
 => [1/7] FROM docker.io/library/python:3.7-slim@sha256:9e51c1a3fea7e0a2b93df2538c02f1afe31d2c69b10d6dcbd372c10c72b325aa                                                                                                 0.0s
 => [internal] load build context                                                                                                                                                                                        0.0s
 => => transferring context: 2.31kB                                                                                                                                                                                      0.0s
 => CACHED [2/7] RUN apt-get update && apt-get install -y bash && rm -rf /var/lib/apt/lists/*                                                                                                                            0.0s
 => CACHED [3/7] WORKDIR /airbyte/integration_code                                                                                                                                                                       0.0s
 => CACHED [4/7] COPY source_github ./source_github                                                                                                                                                                      0.0s
 => CACHED [5/7] COPY main.py ./                                                                                                                                                                                         0.0s
 => CACHED [6/7] COPY setup.py ./                                                                                                                                                                                        0.0s
 => CACHED [7/7] RUN pip install .                                                                                                                                                                                       0.0s
 => exporting to image                                                                                                                                                                                                   0.0s
 => => exporting layers                                                                                                                                                                                                  0.0s
 => => writing image sha256:99d30b6fdf20e15f8dab79437b125d37c56bd8865309d78a33958a0d5ea3ca9e                                                                                                                             0.0s
 => => naming to docker.io/airbyte/source-github                                                                                                                                                                         0.0s

Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
latest: Pulling from airbyte/source-acceptance-test
b4d181a07f80: Already exists
de8ecf497b75: Already exists
707b80804672: Already exists
283715715396: Already exists
8353afd48f6b: Already exists
faa659d489c1: Already exists
daa944c5cd5c: Pull complete
f4202e62f8ac: Pull complete
ff40e529434e: Pull complete
1b856eb3591e: Pull complete
5151553a3a9e: Pull complete
Digest: sha256:29744f6ad621eca4cd2658f25ebff984d0ab9b4ea51d147a5395dbe2902859cc
Status: Downloaded newer image for airbyte/source-acceptance-test:latest
docker.io/airbyte/source-acceptance-test:latest
Test session starts (platform: linux, Python 3.7.11, pytest 6.2.5, pytest-sugar 0.9.4)
rootdir: /test_input
plugins: sugar-0.9.4, timeout-1.4.2
collecting ...
 test_core.py ✓✓✓✓✓✓✓✓✓✓✓✓                                                                                                                                                                                      80% ████████
 test_full_refresh.py ✓                                                                                                                                                                                         87% ████████▋
 test_incremental.py ✓✓                                                                                                                                                                                        100% ██████████

Results (71.85s):
      15 passed

@cjwooo
Contributor Author

cjwooo commented Dec 16, 2021

Bump

@marcosmarxm
Member

@cjwooo I asked @Zirochkaa to review your changes.

@cjwooo
Contributor Author

cjwooo commented Jan 3, 2022

Bump

@marcosmarxm
Member

@cjwooo can you run ./gradlew format and build the container? There are some flake errors. After that, it's ready to merge!

@cjwooo
Contributor Author

cjwooo commented Jan 5, 2022

@marcosmarxm Done

Member

@marcosmarxm marcosmarxm left a comment

thanks @cjwooo

5 participants