🐛 Source Hubspot: Handled 10K+ search-endpoint queries #10700

Merged

lgomezm
Contributor

@lgomezm lgomezm commented Feb 28, 2022

What

Incremental syncs fail for streams that extend CRMSearchStream in the source-hubspot connector. When a search query matches more than 10K records, the Hubspot search endpoint responds with HTTP 400 as soon as the connector requests records past the 10,000th one. Hubspot search endpoint documentation for reference: https://developers.hubspot.com/docs/api/crm/search

This is the error from the logs:

2022-02-25 22:40:27 source > Giving up search(...) after 1 tries (requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api.hubapi.com/crm/v3/objects/company/search)
2022-02-25 22:40:27 source > Encountered an exception while reading stream SourceHubspot
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airbyte_cdk/sources/deprecated/base_source.py", line 67, in read
    yield from self._read_stream(logger=logger, client=client, configured_stream=configured_stream, state=total_state)
  File "/usr/local/lib/python3.7/site-packages/airbyte_cdk/sources/deprecated/base_source.py", line 86, in _read_stream
    for record in client.read_stream(configured_stream.stream):
  File "/usr/local/lib/python3.7/site-packages/airbyte_cdk/sources/deprecated/client.py", line 70, in read_stream
    for message in method(fields=fields):
  File "/airbyte/integration_code/source_hubspot/api.py", line 580, in list_records
    yield from self._flat_associations(self._filter_old_records(generator))
  File "/airbyte/integration_code/source_hubspot/api.py", line 454, in _flat_associations
    for record in records:
  File "/airbyte/integration_code/source_hubspot/api.py", line 304, in _filter_old_records
    for record in records:
  File "/airbyte/integration_code/source_hubspot/api.py", line 602, in read
    response = getter(data=payload)
  File "/usr/local/lib/python3.7/site-packages/backoff/_sync.py", line 94, in retry
    ret = target(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/backoff/_sync.py", line 94, in retry
    ret = target(*args, **kwargs)
  File "/airbyte/integration_code/source_hubspot/api.py", line 569, in search
    return self._api.post(url=url, data=data, params=params)
  File "/airbyte/integration_code/source_hubspot/api.py", line 184, in post
    return self._parse_and_handle_errors(response)
  File "/airbyte/integration_code/source_hubspot/api.py", line 170, in _parse_and_handle_errors
    response.raise_for_status()
  File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api.hubapi.com/crm/v3/objects/company/search
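
To make the limit concrete, here is a minimal sketch of the request body for two consecutive pages, one just inside the 10K window and one just past it. This is not the connector's code: `search_payload`, `is_reachable`, the page size of 100, and the `hs_lastmodifieddate` cursor property are illustrative assumptions. The body fields (`filterGroups`, `sorts`, `limit`, `after`) follow my reading of the search documentation linked above and should be checked against it. Treating `after` as a numeric offset is also an assumption.

```python
import json

SEARCH_WINDOW = 10_000  # result window described in this PR
PAGE_SIZE = 100         # assumed page size, not taken from the connector


def search_payload(cursor_field: str, since_ms: int, after: int) -> dict:
    """Build a hypothetical body for POST /crm/v3/objects/<object>/search."""
    return {
        "filterGroups": [
            {
                "filters": [
                    {
                        "propertyName": cursor_field,
                        "operator": "GTE",
                        "value": str(since_ms),
                    }
                ]
            }
        ],
        # Sort ascending so the last record read carries the newest cursor value.
        "sorts": [{"propertyName": cursor_field, "direction": "ASCENDING"}],
        "limit": PAGE_SIZE,
        "after": after,
    }


def is_reachable(after: int) -> bool:
    """True if a page starting at offset `after` is still inside the window."""
    return after < SEARCH_WINDOW


# Last page inside the window, and the first page that would be rejected.
last_ok = search_payload("hs_lastmodifieddate", 1645000000000, after=9_900)
first_bad = search_payload("hs_lastmodifieddate", 1645000000000, after=10_000)

print(json.dumps(last_ok, indent=2))
print(is_reachable(last_ok["after"]), is_reachable(first_bad["after"]))
```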

How

The stream now stops paginating when it reaches the 10K-th record of a query and starts a new search query on the fly, using the latest state collected so far as the new lower bound.
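
The approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in streams.py: `read_all`, `fake_search`, the `updated_at` cursor field, and the page size are hypothetical, and the fake endpoint only imitates the 10K limit by raising an error.

```python
from bisect import bisect_left
from typing import Callable, Dict, Iterator, List

SEARCH_WINDOW = 10_000  # limit described in this PR
PAGE_SIZE = 100         # assumed page size

Record = Dict[str, int]
# (since, after, limit) -> one page of records, sorted ascending by cursor
SearchFn = Callable[[int, int, int], List[Record]]

# Fake data set: 25,000 records whose cursor value equals their id.
RECORDS: List[Record] = [{"id": i, "updated_at": i} for i in range(25_000)]
CURSORS = [r["updated_at"] for r in RECORDS]


def fake_search(since: int, after: int, limit: int) -> List[Record]:
    """Stand-in for the search endpoint; rejects offsets past the window."""
    if after >= SEARCH_WINDOW:
        raise ValueError("400 Bad Request (simulated)")
    start = bisect_left(CURSORS, since)  # first record with cursor >= since
    return RECORDS[start + after : start + after + limit]


def read_all(search: SearchFn, since: int, cursor_field: str = "updated_at") -> Iterator[Record]:
    """Page through results, restarting the query before the window is exceeded."""
    after, latest = 0, since
    while True:
        page = search(since, after, PAGE_SIZE)
        for record in page:
            latest = max(latest, record[cursor_field])
            yield record
        after += len(page)
        if len(page) < PAGE_SIZE:
            return  # short or empty page: nothing left to read
        if after >= SEARCH_WINDOW:
            if latest == since:
                # A full window of records shares one cursor value; restarting
                # from the same value would loop forever.
                raise RuntimeError("cursor did not advance within one window")
            # Start a new query from the latest cursor value seen so far.
            since, after = latest, 0


if __name__ == "__main__":
    emitted = list(read_all(fake_search, since=0))
    print(len(emitted), len({r["id"] for r in emitted}))
```

Because the restarted query in this sketch uses an inclusive lower bound, a record at the boundary value can be emitted twice, so a consumer would need to deduplicate by primary key.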

Recommended reading order

  1. streams.py

🚨 User Impact 🚨

Are there any breaking changes? What is the end result perceived by the user? If yes, please merge this PR with the 🚨🚨 emoji so changelog authors can further highlight this if needed.

Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally, e.g. a screenshot or copy-paste of unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For Java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/SUMMARY.md
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the connector is published, connector added to connector index as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally, e.g. a screenshot or copy-paste of unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For Java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the new connector version is published, connector version bumped in the seed directory as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

Connector Generator
  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed

@github-actions github-actions bot added the area/connectors Connector related issues label Feb 28, 2022
@lgomezm lgomezm changed the title from "Source Hubspot: Handled 10K+ search-endpoint queries" to "🐛 Source Hubspot: Handled 10K+ search-endpoint queries" Feb 28, 2022
@lgomezm lgomezm force-pushed the lgomez/handled_incremental_stream_top branch from 9542725 to 3e6258a Compare March 7, 2022 12:30
@lgomezm
Contributor Author

lgomezm commented Mar 7, 2022

Here are the acceptance test results:
Screen Shot 2022-03-07 at 12 49 38 PM

@marcosmarxm marcosmarxm self-assigned this Mar 8, 2022
@marcosmarxm marcosmarxm requested review from lazebnyi and removed request for lazebnyi March 8, 2022 01:10
@misteryeo misteryeo mentioned this pull request Mar 8, 2022
21 tasks
@lgomezm
Contributor Author

lgomezm commented Mar 10, 2022

@marcosmarxm Please let me know if there's anything I should do in order to get this merged. We really need this change published 🙏

@marcosmarxm
Member

@lgomezm could you run a sync with the dev connector version and show that this works with more than 10k records for the specific stream? Our integration account doesn't have that many records.

@marcosmarxm
Member

Also, is it possible to add a unit test validating the logic when there are more than 10k records?

@lgomezm
Contributor Author

lgomezm commented Mar 11, 2022

@lgomezm could you run a sync with the dev connector version and show that this works with more than 10k records for the specific stream? Our integration account doesn't have that many records.

@marcosmarxm Sure. The following screenshots show a connection from source-hubspot to local-json. You can see that the second sync stops successfully at 10,000 records instead of failing with an error.
Screen Shot 2022-03-11 at 12 51 31 PM
Screen Shot 2022-03-11 at 12 51 11 PM

@lgomezm lgomezm force-pushed the lgomez/handled_incremental_stream_top branch from c85a512 to 0ed6e56 Compare March 11, 2022 17:58
@lgomezm
Contributor Author

lgomezm commented Mar 11, 2022

Also, is it possible to add a unit test validating the logic when there are more than 10k records?

@marcosmarxm I've also added a unit test in 85aac62.

@marcosmarxm
Member

@lgomezm is it not possible to update the query within the same sync? If someone has a connection that runs every 24h, this could lead to incorrect data replication.

@lgomezm
Contributor Author

lgomezm commented Mar 11, 2022

@lgomezm is it not possible to update the query within the same sync? If someone has a connection that runs every 24h, this could lead to incorrect data replication.

@marcosmarxm I've updated it to use a new state when it reaches the 10Kth record. PTAL again when you get a chance.

@lgomezm
Contributor Author

lgomezm commented Mar 14, 2022

@marcosmarxm These images show a connection that syncs Hubspot companies. In the second sync, you can see it now succeeds after reaching the 10,000th record:
Screen Shot 2022-03-14 at 8 35 22 AM
Screen Shot 2022-03-14 at 8 35 51 AM

@marcosmarxm
Member

marcosmarxm commented Mar 14, 2022

/test connector=connectors/source-hubspot repo=calixa-io/airbyte

🕑 connectors/source-hubspot https://github.com/airbytehq/airbyte/actions/runs/1983725402
✅ connectors/source-hubspot https://github.com/airbytehq/airbyte/actions/runs/1983725402
Python tests coverage:

Name                                                 Stmts   Miss  Cover
------------------------------------------------------------------------
source_acceptance_test/utils/__init__.py                 6      0   100%
source_acceptance_test/tests/__init__.py                 4      0   100%
source_acceptance_test/__init__.py                       2      0   100%
source_acceptance_test/tests/test_full_refresh.py       52      2    96%
source_acceptance_test/utils/asserts.py                 37      2    95%
source_acceptance_test/config.py                        74      6    92%
source_acceptance_test/utils/json_schema_helper.py     105     13    88%
source_acceptance_test/utils/common.py                  70     17    76%
source_acceptance_test/utils/compare.py                 62     23    63%
source_acceptance_test/tests/test_core.py              275    106    61%
source_acceptance_test/base.py                          10      4    60%
source_acceptance_test/utils/connector_runner.py       110     48    56%
source_acceptance_test/tests/test_incremental.py        69     38    45%
------------------------------------------------------------------------
TOTAL                                                  876    259    70%

Name                         Stmts   Miss  Cover
------------------------------------------------
source_hubspot/errors.py         6      0   100%
source_hubspot/__init__.py       2      0   100%
source_hubspot/streams.py      658    109    83%
source_hubspot/source.py        69     16    77%
------------------------------------------------
TOTAL                          735    125    83%

Member

@marcosmarxm marcosmarxm left a comment

thanks @lgomezm

@marcosmarxm marcosmarxm merged commit 710543a into airbytehq:master Mar 15, 2022
@lgomezm lgomezm deleted the lgomez/handled_incremental_stream_top branch March 15, 2022 13:32
Labels
area/connectors Connector related issues community
3 participants