Make BQ import more reliable #366

Closed
oliverchang opened this issue Apr 27, 2021 · 9 comments · Fixed by #570
Comments

@oliverchang
Contributor

oliverchang commented Apr 27, 2021

BQ imports should load as WRITE_TRUNCATE into a date partition, to prevent duplicates in case a cron runs twice in the same day.

This cannot be easily done today with the single latest.json file, as its entries can span multiple days (since a run can overlap a day boundary).

For consistency within a partition, the "Date" value should be the same for all repo results from jobs started on the same day.
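
For illustration, a minimal sketch (Go, using the cloud.google.com/go/bigquery client) of creating a results table partitioned by day on the "Date" column. The project, dataset, table, and schema names here are assumptions, not the project's actual setup:

```go
package bqimport

import (
	"context"
	"log"

	"cloud.google.com/go/bigquery"
)

func createPartitionedTable() {
	ctx := context.Background()

	// Hypothetical project/dataset/table names, for illustration only.
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Partition the results table by day on the "Date" column, so each
	// day's import can be replaced atomically with WRITE_TRUNCATE.
	meta := &bigquery.TableMetadata{
		Schema: bigquery.Schema{
			{Name: "Repo", Type: bigquery.StringFieldType},
			{Name: "Date", Type: bigquery.DateFieldType},
			// ...remaining scorecard result columns...
		},
		TimePartitioning: &bigquery.TimePartitioning{Field: "Date"},
	}
	if err := client.Dataset("scorecard").Table("scorecard").Create(ctx, meta); err != nil {
		log.Fatal(err)
	}
}
```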

@oliverchang oliverchang added the kind/enhancement New feature or request label Apr 27, 2021
@oliverchang
Contributor Author

@azeemshaikh38 this might be something to consider in the design for the split workers / GCS output.

@azeemshaikh38
Contributor

The date consistency problem should be easy to handle, since the master (which publishes to the PubSub topic) can provide the date to be used for all results.

However, duplicate entries due to the BQ data transfer running twice might be tricky. BQ should delete source files after a transfer, but it does not provide strong guarantees here. If the deletion does not happen and the transfer runs again before the files are removed, there is a risk of duplication.

@azeemshaikh38 azeemshaikh38 self-assigned this Apr 27, 2021
@oliverchang
Contributor Author

I think we can handle duplicate entries well if we load results by day with WRITE_TRUNCATE into a BigQuery partition, per https://cloud.google.com/bigquery/docs/creating-column-partitions.

If all input files have the same date, we can load and replace the existing data in table$YYYYMMDD. The input files on GCS would just need to be structured in a way that supports this.

@azeemshaikh38
Contributor

I'm skeptical of WRITE_TRUNCATE: (1) it is wasteful; (2) only some GCS files may be deleted, and it looks like WRITE_TRUNCATE overwrites the entire partition; (3) if it only overwrote selected rows, that would be best-effort de-duping, which only mitigates the problem but doesn't solve it.

One way would be: if BQ sends out notifications on delete failures, we can have a garbage-collection job running that processes these notifications and retries the deletes.

@oliverchang
Contributor Author

I think any additional cloud cost this incurs will be minimal, especially if it avoids maintenance costs. We can also just rely on GCS lifecycle policies (rather than a job we have to maintain).

One potential way to do this might look something like:

gs://bucket/YYYYMMDD/unique_run_id_0/*.json
gs://bucket/YYYYMMDD/unique_run_id_1/*.json
...

and the load job can WRITE_TRUNCATE into the corresponding YYYYMMDD partition from a single run. This way every load is idempotent for the same unique_run_id, which helps if a load job fails or needs to be retried, in addition to being resilient to multiple runs on the same day.
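
As a rough sketch of what such an idempotent per-run load could look like with the Go BigQuery client (bucket, dataset, and table names are assumptions, not the project's actual layout):

```go
package bqimport

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// loadRun loads one run's shards into that day's partition, replacing whatever
// is already there. Bucket "scorecard-results" and dataset/table "scorecard"
// are hypothetical names used only for this sketch.
func loadRun(ctx context.Context, client *bigquery.Client, day, runID string) error {
	// day is in YYYYMMDD form, matching both the GCS prefix and the partition decorator.
	uri := fmt.Sprintf("gs://scorecard-results/%s/%s/*.json", day, runID)
	gcsRef := bigquery.NewGCSReference(uri)
	gcsRef.SourceFormat = bigquery.JSON

	loader := client.Dataset("scorecard").Table("scorecard$" + day).LoaderFrom(gcsRef)
	loader.WriteDisposition = bigquery.WriteTruncate // replace the partition, don't append

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}
	return status.Err()
}
```

Because the destination uses the YYYYMMDD partition decorator with WriteTruncate, retrying the same day's load simply replaces the partition instead of appending duplicates.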

@oliverchang
Contributor Author

oliverchang commented Apr 28, 2021

Discussed this with @azeemshaikh38. Another possible (and much simpler) approach here is to not care about historical date data at all, since that does not seem to be used anyway.

I.e., each repo only ever has a single results.json file. The BQ transfer job will just WRITE_TRUNCATE every time using whatever is available at the time of writing. We would need to make sure that any races here (new/changed results.json while a load job is in progress) don't have any negative impacts, though.

We can still keep regular backups of the main BQ table in case something breaks and we need to restore results.

@azeemshaikh38
Contributor

azeemshaikh38 commented Apr 28, 2021

So I researched more, and it turns out BQ allows runtime-parameterized URIs: https://cloud.google.com/bigquery-transfer/docs/gcs-transfer-parameters. We can simply use a URI similar to {run_time-48h|"%Y-%m-%d"}/*.json :)

This should be enough to get started for now, and it avoids any race conditions or duplicates. The only drawback is that the data in BQ can be a few days stale, which I think is a far more acceptable cost to pay (for now). What do you think?

Note that I'm assuming for now that there will either (a) be a single run per day, or (b) be versioning in place to make sure only the latest run of a particular day is exposed.
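
For concreteness, with date-named folders like those described above (bucket name hypothetical), the full source data path in the transfer config would look something like:

gs://scorecard-bucket/{run_time-48h|"%Y-%m-%d"}/*.json

i.e. each transfer run reads the folder for the date 48 hours before its own run_time.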

@azeemshaikh38
Contributor

Thinking about this more, if we want to handle multiple runs per day without race conditions, we could:

  1. Make numShards deterministic. This will be set in the master.
  2. The master writes a gs://bucket/YYYYMMDD/unique_run_id_0/.shard_num file which contains the numShards value.
  3. BQ runs on gs://latest_bucket/*.json with WRITE_TRUNCATE and reports the pass/fail result to a PubSub topic.
  4. A separate worker gets notified via the BQ PubSub topic and deletes all data in gs://latest_bucket/. The latest unique_run_id folder whose .shard_num value matches the number of *.json files is then copied to gs://latest_bucket/ (see the sketch after this list).
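
A minimal sketch of the completeness check in step 4, assuming the .shard_num marker and folder layout described above (the bucket handle, prefix format, and function name are all hypothetical):

```go
package worker

import (
	"context"
	"io"
	"strconv"
	"strings"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

// runIsComplete reports whether the number of *.json shards under a run's
// prefix (e.g. "YYYYMMDD/unique_run_id_0/") matches the count recorded in
// that run's .shard_num marker written by the master.
func runIsComplete(ctx context.Context, bkt *storage.BucketHandle, prefix string) (bool, error) {
	// Read the expected shard count from the marker file.
	r, err := bkt.Object(prefix + ".shard_num").NewReader(ctx)
	if err != nil {
		return false, err
	}
	defer r.Close()
	raw, err := io.ReadAll(r)
	if err != nil {
		return false, err
	}
	want, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		return false, err
	}

	// Count the *.json result shards actually present under the prefix.
	got := 0
	it := bkt.Objects(ctx, &storage.Query{Prefix: prefix})
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			return false, err
		}
		if strings.HasSuffix(attrs.Name, ".json") {
			got++
		}
	}
	return got == want, nil
}
```

The worker would then copy the newest complete run's objects into gs://latest_bucket/ (for example with ObjectHandle.CopierFrom) before the next BQ run picks them up.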

What do you think? I'd personally prefer starting off with #366 (comment) to get the ball rolling, and we can then get the infra in place for doing this.

@oliverchang
Contributor Author

Sure, starting simple sounds good to me! We can always refine this later.

For fresher data, we can also just always load the data for the last X days into their respective partitions, and have our scorecards_latest BQ view return the latest data per repo rather than everything from a given date (as it does now).
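
To illustrate the second half of that idea, a hedged sketch of such a scorecards_latest view definition via the Go client; the Repo/Date column names, the dataset/table names, and the 7-day window are assumptions:

```go
package bqimport

import (
	"context"

	"cloud.google.com/go/bigquery"
)

// createLatestView (hypothetical helper) defines a scorecards_latest view that
// returns only the newest row per repo from the last few daily partitions.
func createLatestView(ctx context.Context, client *bigquery.Client) error {
	const viewQuery = `
SELECT * EXCEPT(rn) FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Repo ORDER BY Date DESC) AS rn
  FROM scorecard.scorecard
  WHERE Date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)
WHERE rn = 1`
	meta := &bigquery.TableMetadata{ViewQuery: viewQuery}
	return client.Dataset("scorecard").Table("scorecards_latest").Create(ctx, meta)
}
```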
