
🎉 Destination Bigquery: added gcs upload option #5614

Merged

Conversation

etsybaev
Contributor

@etsybaev etsybaev commented Aug 25, 2021

What

Currently, you may see failures for big datasets and slow sources, i.e. when reading from the source takes more than 10-12 hours.
This is caused by Google BigQuery SDK client limitations. For more details, please check #3549

How

There are now two options for uploading data to BigQuery: Standard and GCS Staging.

  • Standard uploads data directly from your source into BigQuery storage. This way is faster and requires fewer resources than the GCS option.
  • GCS Uploading (CSV format). This newly introduced approach avoids the issue with big datasets mentioned above.
    As a first step, all data is uploaded to a GCS bucket; it is then moved to BigQuery in one shot, stream by stream, via a native load job (see the sketch after this list).
    The destination-gcs connector is partially reused under the hood, so you may check its documentation for more details.
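For illustration, here is a minimal, hedged sketch (not code from this PR) of the second step of GCS Staging: once the staged CSV files are in the bucket, a single native BigQuery load job moves them into the target table, so the heavy lifting runs inside BigQuery rather than through a long-lived SDK client. Dataset, table, bucket, and file pattern below are placeholders.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class GcsToBigQueryLoadSketch {

  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    TableId targetTable = TableId.of("my_dataset", "my_stream_raw");        // placeholder names
    String stagedCsvUri = "gs://my-staging-bucket/my_stream/part-*.csv";    // placeholder URI

    LoadJobConfiguration loadConfig = LoadJobConfiguration.newBuilder(targetTable, stagedCsvUri)
        .setFormatOptions(FormatOptions.csv())
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
        .build();

    // The load job executes inside BigQuery, so it is not bound by the client-side
    // streaming limits that caused failures on 10-12 hour syncs.
    Job job = bigquery.create(JobInfo.of(loadConfig)).waitFor();
    if (job == null || job.getStatus().getError() != null) {
      throw new RuntimeException("CSV load job failed: "
          + (job == null ? "job no longer exists" : job.getStatus().getError()));
    }
  }
}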


Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • docs/SUMMARY.md
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions
  • Connector added to connector index like described here

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions
  • Connector version bumped like described here

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here

Connector Generator

  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed.

@github-actions github-actions bot added the area/connectors Connector related issues label Aug 25, 2021
@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Aug 26, 2021
@jrhizor jrhizor temporarily deployed to more-secrets August 26, 2021 13:44 Inactive
@etsybaev etsybaev changed the title [DARFT_PR] Destination Bigquery: added gcs upload option [DARFT_PR] 🎉Destination Bigquery: added gcs upload option Aug 26, 2021
@etsybaev
Contributor Author

etsybaev commented Aug 26, 2021

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1171385872
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1171385872

@jrhizor jrhizor temporarily deployed to more-secrets August 26, 2021 17:26 Inactive
@etsybaev etsybaev changed the title [DARFT_PR] 🎉Destination Bigquery: added gcs upload option 🎉 Destination Bigquery: added gcs upload option Aug 27, 2021
@flagbug

flagbug commented Aug 30, 2021

@etsybaev While looking for a workaround with the BigQuery connector failing with large datasets, I tried first manually uploading our tables to GCS but soon hit a limitation with the current GCS implementation as described here: #5720

Since I'm assuming that you're reusing the same logic for the GCS writer, I just wanted to let you know that this might also impact this PR.

Contributor

@irynakruk irynakruk left a comment


Nice pr!

@etsybaev etsybaev marked this pull request as ready for review August 31, 2021 13:15
Contributor

@subodh1810 subodh1810 left a comment


There is a major problem that I don't understand. Why are we using the Amazon S3 client everywhere if we are trying to load data via GCS?

@@ -1,3 +1,13 @@
## Uploading options
There are 2 available options to upload data to bigquery `Standard` and `GCS Staging`.
- `Standard` is option to upload data directly from your source to BigQuery storage. This way is faster and requires less resources than GCS one.
Contributor


I think we need to explain further when to choose which option. It's not clear to me when I should choose Standard vs GCS Uploading (CSV format).

Contributor Author


Updated, thanks

throws IOException {
Timestamp uploadTimestamp = new Timestamp(System.currentTimeMillis());

AmazonS3 s3Client = GcsS3Helper.getGcsS3Client(gcsDestinationConfig);
Contributor


Why are we creating an AmazonS3 client if we are using Google Cloud Storage for loading data?

Contributor


GCS is compatible with the Amazon S3 client. By reusing the S3 client, we can reuse the related code.
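For context, here is a hedged sketch of what such an S3-over-GCS client setup generally looks like (the actual GcsS3Helper implementation may differ): the AWS SDK client is pointed at GCS's S3-interoperable XML API endpoint and authenticated with GCS HMAC keys.

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class GcsInteropClientSketch {

  // hmacKeyId / hmacSecret are GCS HMAC interoperability credentials, not AWS keys.
  public static AmazonS3 buildGcsClient(String hmacKeyId, String hmacSecret) {
    return AmazonS3ClientBuilder.standard()
        .withEndpointConfiguration(
            // GCS serves an S3-compatible API at this endpoint; the signing region is a placeholder.
            new AwsClientBuilder.EndpointConfiguration("https://storage.googleapis.com", "us-east-1"))
        .withCredentials(
            new AWSStaticCredentialsProvider(new BasicAWSCredentials(hmacKeyId, hmacSecret)))
        .build();
  }
}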

Contributor Author


Hi @subodh1810. I just partially re-used our already existing destination-gcs module, which had been created before I started working on this ticket. I had a conversation with @tuliren, and he confirmed that the S3 client is also used for destination-gcs, as it mostly works for both except for some minor cases.

GcsDestinationConfig gcsDestinationConfig = GcsDestinationConfig
.getGcsDestinationConfig(BigQueryUtils.getGcsJsonNodeConfig(config));
GcsCsvWriter gcsCsvWriter = initGcsWriter(gcsDestinationConfig, configStream);
gcsCsvWriter.initialize();
Contributor


In the initialize method we are using the AmazonS3 client, and I am not sure I follow why that is.

Contributor Author


Hi @subodh1810. The answer here is similar to the previous comment. I partially re-used the already existing destination-gcs module (the GCS destination CSV writer in particular). As far as I understand, destination-gcs was itself implemented by re-using existing modules from destination-s3, because the Amazon S3 client works for both GCS and S3 storage in most cases. It's better to ask @tuliren for details.
So the idea is the following (a sketch of step 1 follows this list):

  1. Re-using the existing destination-gcs CSV writer, we upload the data to a GCS bucket
  2. Using the native BigQuery writer, we create a native job to migrate the CSV-formatted data from the bucket to BigQuery
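As a rough illustration of step 1 (not the connector's actual GcsCsvWriter, which streams data in parts rather than buffering whole files), records rendered as CSV can be written to the staging bucket through the S3-compatible client from the previous sketch; bucket, object key, and column names here are placeholders.

import com.amazonaws.services.s3.AmazonS3;

public class CsvStagingSketch {

  public static void stageRecords(AmazonS3 gcsViaS3Client, String bucket, String objectKey,
                                  Iterable<String[]> records) {
    // Illustrative header; the real writer defines its own columns and handles CSV escaping.
    StringBuilder csv = new StringBuilder("ab_id,emitted_at,data\n");
    for (String[] row : records) {
      csv.append(String.join(",", row)).append('\n');
    }
    // A plain putObject is enough for a sketch; production code would use multipart or
    // streaming uploads to keep memory bounded on large syncs.
    gcsViaS3Client.putObject(bucket, objectKey, csv.toString());
  }
}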

Contributor

@subodh1810 subodh1810 left a comment


Can you just add a Javadoc comment explaining that the Amazon S3 client works with GCS as well?
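Something along these lines (illustrative wording, not necessarily the Javadoc that ended up in the code) would capture it:

/**
 * Helper for building an {@link com.amazonaws.services.s3.AmazonS3} client that talks to
 * Google Cloud Storage. GCS exposes an S3-compatible XML API, so the same S3 client (and
 * the CSV writer code from destination-gcs / destination-s3 built on top of it) can be
 * reused for GCS staging, authenticated with GCS HMAC keys.
 */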

@etsybaev
Contributor Author

etsybaev commented Sep 8, 2021

/test connector=connectors/destination-gcs

🕑 connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/1212255266
✅ connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/1212255266

@jrhizor jrhizor temporarily deployed to more-secrets September 8, 2021 06:38 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets September 8, 2021 08:55 Inactive
@etsybaev
Contributor Author

etsybaev commented Sep 8, 2021

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1212705098
❌ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1212705098
🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1212705098
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1212705098

@jrhizor jrhizor temporarily deployed to more-secrets September 8, 2021 09:13 Inactive
@etsybaev
Contributor Author

etsybaev commented Sep 8, 2021

/test connector=connectors/destination-gcs

🕑 connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/1212830443
✅ connectors/destination-gcs https://github.com/airbytehq/airbyte/actions/runs/1212830443

@jrhizor jrhizor temporarily deployed to more-secrets September 8, 2021 09:52 Inactive
@jrhizor jrhizor temporarily deployed to more-secrets September 8, 2021 10:53 Inactive
@etsybaev
Contributor Author

etsybaev commented Sep 8, 2021

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1213127186
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1213127186

@jrhizor jrhizor temporarily deployed to more-secrets September 8, 2021 11:30 Inactive
@etsybaev
Contributor Author

etsybaev commented Sep 8, 2021

/publish connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1213197602
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/1213197602

@jrhizor jrhizor temporarily deployed to more-secrets September 8, 2021 11:56 Inactive
@etsybaev etsybaev merged commit f32b14e into master Sep 8, 2021
@etsybaev etsybaev deleted the etsybaev/5296-added-gcs-upload-to-bigquery-destination branch September 8, 2021 12:21
@etsybaev etsybaev linked an issue Sep 8, 2021 that may be closed by this pull request
etsybaev added a commit that referenced this pull request Sep 9, 2021
* Fixed destination bigquery denormalized compilation error (caused by #5614)
@sherifnada
Contributor

@etsybaev see #5959 - why did we change the emitted_at column type? Is that a change we can revert?

Field.of(JavaBaseConstants.COLUMN_NAME_EMITTED_AT, StandardSQLTypeName.TIMESTAMP));
// GCS works with only date\datetime formats, so need to have it a string for a while
// https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#data_types
Field.of(JavaBaseConstants.COLUMN_NAME_EMITTED_AT, StandardSQLTypeName.STRING),
Contributor


The type of the column when creating the table schema was changed from TIMESTAMP to STRING.

But when formatting the records to insert, they are still typed as timestamps?

final String formattedEmittedAt = QueryParameterValue.timestamp(emittedAtMicroseconds).getValue();
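For context, here is one hedged sketch (an assumption about a possible direction, not what this PR or #5959 ultimately did) of keeping the column typed as TIMESTAMP while still writing CSV-friendly values: render emitted_at as a text timestamp in one of the formats BigQuery's CSV loader accepts.

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class EmittedAtCsvFormatSketch {

  // BigQuery's CSV loader accepts timestamps such as "2021-09-08 12:00:00.000000 UTC",
  // so the value can be written as text while the column itself stays TIMESTAMP.
  private static final DateTimeFormatter BQ_CSV_TIMESTAMP =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS 'UTC'").withZone(ZoneOffset.UTC);

  public static String format(long emittedAtMillis) {
    return BQ_CSV_TIMESTAMP.format(Instant.ofEpochMilli(emittedAtMillis));
  }
}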

Contributor


OK, so there was an issue filed for this: #5959

@@ -82,6 +82,10 @@ if(!System.getenv().containsKey("SUB_BUILD") || System.getenv().get("SUB_BUILD")
include ':airbyte-integrations:connectors:destination-redshift'
include ':airbyte-integrations:connectors:destination-snowflake'
include ':airbyte-integrations:connectors:destination-oracle'

//Needed by destination-bugquery
Contributor


I wish we had a destination-bigquery instead though... 😜
