
Enhancements to update stream conversion script #204

Closed
szarnyasg opened this issue Oct 31, 2022 · 2 comments · Fixed by #205
Comments

@szarnyasg
Member

A follow-up of #203:

  • Old raw data sets have hasInterest instead of TagId as the column name in the Person_hasInterest_Tag table. This is the only difference between the data sets, so re-generating everything is hardly worth it; instead, the script could/should auto-detect this.
  • The batching should be done on a daily basis (as opposed to a weekly one); otherwise, generating the updates for SF10,000 would require 1.5 TiB of memory.
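A minimal sketch of the auto-detection idea in Python; the helper names and the legacy-to-current mapping are illustrative assumptions, not part of the actual conversion script:

```python
# Hypothetical helper: map legacy raw-dataset column names to the
# current schema, so old and new data sets can be loaded identically.
LEGACY_COLUMNS = {"hasInterest": "TagId"}

def normalize_columns(columns):
    """Return the column list with any legacy names renamed."""
    return [LEGACY_COLUMNS.get(c, c) for c in columns]

def build_select_list(columns):
    """Build a SELECT list that aliases legacy columns (e.g. for DuckDB)."""
    return ", ".join(
        f"{c} AS {LEGACY_COLUMNS[c]}" if c in LEGACY_COLUMNS else c
        for c in columns
    )
```

In practice the column names would come from inspecting the Parquet schema (e.g. DuckDB's DESCRIBE), and the generated SELECT list would replace the hard-coded one in the conversion query.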
@GLaDAP
Member

GLaDAP commented Oct 31, 2022

Would it be feasible to change the column name in the old raw datasets using something like:

echo "COPY(SELECT creationDate, deletionDate, PersonId, hasInterest as TagId FROM '*.parquet') TO 'Person_hasInterest_Tag.parquet' (FORMAT 'PARQUET')" | duckdb

@szarnyasg
Member Author

They are compressed as tar.zst and then split into .tar.zst.000 parts, so yes, but at that point I'd rather just regenerate the whole thing on EMR.
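For context, split archives like these are reassembled by concatenating the numbered parts in order before decompressing. A minimal sketch with hypothetical filenames, using only coreutils (the zstd/tar step is shown as a comment since it needs the real archive):

```shell
# Create a test file and split it into numbered 000/001/... parts,
# mimicking the .tar.zst.000 naming of the split raw data sets.
printf 'example archive contents\n' > sample.tar.zst
split -b 8 -d -a 3 sample.tar.zst sample.tar.zst.

# Reassemble the parts in order and verify the result matches the original.
cat sample.tar.zst.[0-9][0-9][0-9] > reassembled.tar.zst
cmp sample.tar.zst reassembled.tar.zst && echo "parts reassemble cleanly"

# A real archive would then be extracted with:
#   zstd -dc reassembled.tar.zst | tar -x
```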
