Make BQ import more reliable #366

Closed
oliverchang opened this issue Apr 27, 2021 · 9 comments · Fixed by #570
Comments

@oliverchang
Contributor

oliverchang commented Apr 27, 2021

BQ imports should load as WRITE_TRUNCATE into a date partition, to prevent duplicates in case a cron runs twice in the same day.

This cannot be easily done today with the single latest.json file, as its entries can span multiple days (since a run can overlap a day boundary).

For consistency within a partition, the "Date" value should be the same for all repo results from jobs started on the same day.
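
For illustration, a minimal sketch (Go, using the cloud.google.com/go/bigquery client) of creating a results table partitioned by day on the "Date" column. The project, dataset, table, and schema names here are assumptions, not the project's actual setup:

```go
package bqimport

import (
	"context"
	"log"

	"cloud.google.com/go/bigquery"
)

func createPartitionedTable() {
	ctx := context.Background()

	// Hypothetical project/dataset/table names, for illustration only.
	client, err := bigquery.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Partition the results table by day on the "Date" column, so each
	// day's import can be replaced atomically with WRITE_TRUNCATE.
	meta := &bigquery.TableMetadata{
		Schema: bigquery.Schema{
			{Name: "Repo", Type: bigquery.StringFieldType},
			{Name: "Date", Type: bigquery.DateFieldType},
			// ...remaining scorecard result columns...
		},
		TimePartitioning: &bigquery.TimePartitioning{Field: "Date"},
	}
	if err := client.Dataset("scorecard").Table("scorecard").Create(ctx, meta); err != nil {
		log.Fatal(err)
	}
}
```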

@oliverchang oliverchang added the kind/enhancement New feature or request label Apr 27, 2021
@oliverchang
Contributor Author

@azeemshaikh38 this might be something to consider in the design for the split workers / GCS output.

@azeemshaikh38
Contributor

The date consistency problem should be easy to handle, since the master (which publishes to the PubSub topic) can provide the date to be used for all results.

However, duplicate entries due to the BQ data transfer running twice might be tricky. BQ should delete source files after a transfer, but it does not provide strong guarantees here. If the deletion does not happen and the transfer runs again before the files are removed, there is a risk of duplication.

@azeemshaikh38 azeemshaikh38 self-assigned this Apr 27, 2021
@oliverchang
Contributor Author

I think we can handle duplicate entries well if we load results by day with WRITE_TRUNCATE into a BigQuery partition, per https://cloud.google.com/bigquery/docs/creating-column-partitions.

If all input files have the same date, we can load and replace the existing data in table$YYYYMMDD. The input files on GCS would just need to be structured in a way that supports this.

@azeemshaikh38
Contributor

I'm skeptical of WRITE_TRUNCATE: (1) it is wasteful; (2) only some GCS files may be deleted, and it looks like WRITE_TRUNCATE overwrites the entire partition; (3) if it only overwrote selected rows, that would be best-effort de-duping, which only mitigates the problem but doesn't solve it.

One way would be: if BQ sends out notifications on delete failures, we can have a garbage-collection job running that processes these notifications and retries the deletes.

@oliverchang
Contributor Author

I think any additional cloud cost this incurs will be minimal, especially if it avoids maintenance costs. We can also just rely on GCS lifecycle policies (rather than a job we have to maintain).

One potential way to do this might look something like:

gs://bucket/YYYYMMDD/unique_run_id_0/*.json
gs://bucket/YYYYMMDD/unique_run_id_1/*.json
...

and the load job can WRITE_TRUNCATE into the corresponding YYYYMMDD partition from a single run. This way every load is idempotent for the same unique_run_id, which helps if a load job fails or needs to be retried, in addition to being resilient to multiple runs on the same day.
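
As a rough sketch of what such an idempotent per-run load could look like with the Go BigQuery client (bucket, dataset, and table names are assumptions, not the project's actual layout):

```go
package bqimport

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// loadRun loads one run's shards into that day's partition, replacing whatever
// is already there. Bucket "scorecard-results" and dataset/table "scorecard"
// are hypothetical names used only for this sketch.
func loadRun(ctx context.Context, client *bigquery.Client, day, runID string) error {
	// day is in YYYYMMDD form, matching both the GCS prefix and the partition decorator.
	uri := fmt.Sprintf("gs://scorecard-results/%s/%s/*.json", day, runID)
	gcsRef := bigquery.NewGCSReference(uri)
	gcsRef.SourceFormat = bigquery.JSON

	loader := client.Dataset("scorecard").Table("scorecard$" + day).LoaderFrom(gcsRef)
	loader.WriteDisposition = bigquery.WriteTruncate // replace the partition, don't append

	job, err := loader.Run(ctx)
	if err != nil {
		return err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return err
	}
	return status.Err()
}
```

Because the destination uses the YYYYMMDD partition decorator with WriteTruncate, retrying the same day's load simply replaces the partition instead of appending duplicates.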

@oliverchang
Contributor Author

oliverchang commented Apr 28, 2021

Discussed this with @azeemshaikh38. Another possible (and much simpler) approach here is to not care about historical date data at all, since that does not seem to be used anyway.

I.e., each repo only ever has a single results.json file. The BQ transfer job will just WRITE_TRUNCATE every time using whatever is available at the time of writing. We would need to make sure that any races here (new/changed results.json while a load job is in progress) don't have any negative impacts, though.

We can still keep regular backups of the main BQ table in case something breaks and we need to restore results.

@azeemshaikh38
Contributor

azeemshaikh38 commented Apr 28, 2021

So I researched more, and it turns out BQ allows runtime-parameterized URIs: https://cloud.google.com/bigquery-transfer/docs/gcs-transfer-parameters. We can simply use a URI similar to {run_time-48h|"%Y-%m-%d"}/*.json :)

This should be enough to get started for now, and it avoids any race conditions or duplicates. The only drawback is that the data in BQ can be a few days stale, which I think is a far more acceptable cost to pay (for now). What do you think?

Note that I'm assuming for now that there will either (a) be a single run per day, or (b) be versioning in place to make sure only the latest run of a particular day is exposed.
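
For concreteness, with date-named folders like those described above (bucket name hypothetical), the full source data path in the transfer config would look something like:

gs://scorecard-bucket/{run_time-48h|"%Y-%m-%d"}/*.json

i.e. each transfer run reads the folder for the date 48 hours before its own run_time.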

@azeemshaikh38
Contributor

Thinking about this more, if we want to handle multiple runs per day without race conditions, we could:

  1. Make numShards deterministic. This will be set in the master.
  2. The master writes a gs://bucket/YYYYMMDD/unique_run_id_0/.shard_num file which contains the numShards value.
  3. BQ runs on gs://latest_bucket/*.json with WRITE_TRUNCATE and reports the pass/fail result to a PubSub topic.
  4. A separate worker gets notified via the BQ PubSub topic and deletes all data in gs://latest_bucket/. The latest unique_run_id folder whose .shard_num value matches the number of *.json files is then copied to gs://latest_bucket/ (see the sketch after this list).
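
A minimal sketch of the completeness check in step 4, assuming the .shard_num marker and folder layout described above (the bucket handle, prefix format, and function name are all hypothetical):

```go
package worker

import (
	"context"
	"io"
	"strconv"
	"strings"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

// runIsComplete reports whether the number of *.json shards under a run's
// prefix (e.g. "YYYYMMDD/unique_run_id_0/") matches the count recorded in
// that run's .shard_num marker written by the master.
func runIsComplete(ctx context.Context, bkt *storage.BucketHandle, prefix string) (bool, error) {
	// Read the expected shard count from the marker file.
	r, err := bkt.Object(prefix + ".shard_num").NewReader(ctx)
	if err != nil {
		return false, err
	}
	defer r.Close()
	raw, err := io.ReadAll(r)
	if err != nil {
		return false, err
	}
	want, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		return false, err
	}

	// Count the *.json result shards actually present under the prefix.
	got := 0
	it := bkt.Objects(ctx, &storage.Query{Prefix: prefix})
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			return false, err
		}
		if strings.HasSuffix(attrs.Name, ".json") {
			got++
		}
	}
	return got == want, nil
}
```

The worker would then copy the newest complete run's objects into gs://latest_bucket/ (for example with ObjectHandle.CopierFrom) before the next BQ run picks them up.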

What do you think? I'd personally prefer starting off with #366 (comment) to get the ball rolling, and we can then get the infra in place for doing this.

@oliverchang
Contributor Author

Sure, starting simple sounds good to me! We can always refine this later.

For fresher data, we can also just always load the data for the last X days into their respective partitions, and have our scorecards_latest BQ view return the latest data per repo rather than everything from a given date (as it does now).
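
To illustrate the second half of that idea, a hedged sketch of such a scorecards_latest view definition via the Go client; the Repo/Date column names, the dataset/table names, and the 7-day window are assumptions:

```go
package bqimport

import (
	"context"

	"cloud.google.com/go/bigquery"
)

// createLatestView (hypothetical helper) defines a scorecards_latest view that
// returns only the newest row per repo from the last few daily partitions.
func createLatestView(ctx context.Context, client *bigquery.Client) error {
	const viewQuery = `
SELECT * EXCEPT(rn) FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Repo ORDER BY Date DESC) AS rn
  FROM scorecard.scorecard
  WHERE Date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
)
WHERE rn = 1`
	meta := &bigquery.TableMetadata{ViewQuery: viewQuery}
	return client.Dataset("scorecard").Table("scorecards_latest").Create(ctx, meta)
}
```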
