embulk-output-redshift copy should be single command #224
Comments
@mjalkio
embulk-output-redshift will copy files as soon as they are ready.
Ah, okay. I guess I'm not familiar with other Embulk user workloads. It definitely isn't the case for us that our embulk-output-redshift threads finish at significantly different times. If that's the case for others, an option would help us a lot.
We are trying to use #207 (thank you for that feature!), but are hitting some issues with how Embulk copies S3 files into Redshift. Right now I'm using this feature on a table that uses `mode: replace`.

The RedshiftCopyBatchInsert copies one file at a time. This causes Redshift to sort only the contents of the first file. All other files are left unsorted, which means a `VACUUM` command needs to be run after the table is created.

Instead, all the files in S3 should be copied with a single command. This will allow Redshift to keep the entire table sorted while using `mode: replace`, and will also allow the Redshift cluster to process the files in parallel. See: https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html
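To make the request concrete, here is a rough sketch of what a single-command load could look like, assuming the plugin collected every uploaded S3 object URL and loaded them through a Redshift COPY manifest. This is not how embulk-output-redshift is implemented; `build_manifest`, `build_copy_sql`, and the bucket, table, and role names are hypothetical, and a real load would also need the format options matching the uploaded files.

```python
import json


def build_manifest(s3_urls):
    """Return the JSON text of a Redshift COPY manifest listing every file.

    Each entry is marked mandatory so the COPY fails if a file is missing.
    """
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy_sql(table, manifest_url, iam_role):
    """Return one COPY statement that loads all files named in the manifest.

    Hypothetical helper: format options (delimiter, compression, ...) are
    omitted and would have to match the files the plugin actually writes.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "MANIFEST;"
    )


if __name__ == "__main__":
    # Placeholder object URLs standing in for the files each thread uploaded.
    files = [
        "s3://example-bucket/embulk/part-000.gz",
        "s3://example-bucket/embulk/part-001.gz",
    ]
    print(build_manifest(files))
    print(
        build_copy_sql(
            "my_table",
            "s3://example-bucket/embulk/load.manifest",
            "arn:aws:iam::123456789012:role/example-role",
        )
    )
```

The manifest would have to be uploaded to S3 before the COPY runs, so this approach means waiting for all threads to finish uploading, which is the trade-off against copying each file as soon as it is ready.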