
Enhancements to update stream conversion script #204

Closed
szarnyasg opened this issue Oct 31, 2022 · 2 comments · Fixed by #205
Comments

@szarnyasg
Member

A follow-up of #203:

  • Old raw data sets have hasInterest instead of TagId as the column name in the Person_hasInterest_Tag table. This is the only difference between the data sets, so re-generating everything is hardly worth it; instead, the script could/should auto-detect this.
  • The batching should be done on a daily basis (as opposed to a weekly one); otherwise, generating the updates for SF10,000 would require 1.5 TiB of memory.
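A minimal sketch of the auto-detection idea in Python; the helper names and the legacy-to-current mapping are illustrative assumptions, not part of the actual conversion script:

```python
# Hypothetical helper: map legacy raw-dataset column names to the
# current schema, so old and new data sets can be loaded identically.
LEGACY_COLUMNS = {"hasInterest": "TagId"}

def normalize_columns(columns):
    """Return the column list with any legacy names renamed."""
    return [LEGACY_COLUMNS.get(c, c) for c in columns]

def build_select_list(columns):
    """Build a SELECT list that aliases legacy columns (e.g. for DuckDB)."""
    return ", ".join(
        f"{c} AS {LEGACY_COLUMNS[c]}" if c in LEGACY_COLUMNS else c
        for c in columns
    )
```

In practice the column names would come from inspecting the Parquet schema (e.g. DuckDB's DESCRIBE), and the generated SELECT list would replace the hard-coded one in the conversion query.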
@GLaDAP
Member

GLaDAP commented Oct 31, 2022

Would it be feasible to change the column name in the old raw datasets using something like:

echo "COPY(SELECT creationDate, deletionDate, PersonId, hasInterest as TagId FROM '*.parquet') TO 'Person_hasInterest_Tag.parquet' (FORMAT 'PARQUET')" | duckdb

@szarnyasg
Member Author

They are compressed as tar.zst and then split into .tar.zst.000 parts, so yes, but at that point I'd rather just regenerate the whole thing on EMR.
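For context, split archives like these are reassembled by concatenating the numbered parts in order before decompressing. A minimal sketch with hypothetical filenames, using only coreutils (the zstd/tar step is shown as a comment since it needs the real archive):

```shell
# Create a test file and split it into numbered 000/001/... parts,
# mimicking the .tar.zst.000 naming of the split raw data sets.
printf 'example archive contents\n' > sample.tar.zst
split -b 8 -d -a 3 sample.tar.zst sample.tar.zst.

# Reassemble the parts in order and verify the result matches the original.
cat sample.tar.zst.[0-9][0-9][0-9] > reassembled.tar.zst
cmp sample.tar.zst reassembled.tar.zst && echo "parts reassemble cleanly"

# A real archive would then be extracted with:
#   zstd -dc reassembled.tar.zst | tar -x
```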
