embulk-output-redshift copy should be single command #224

mjalkio · 2018-01-04T19:10:19Z

We are trying to utilize #207 (thank you for that feature!), but are hitting some issues with how Embulk copies S3 files into Redshift. Right now I'm using this feature on a table that uses mode: replace.

The RedshiftCopyBatchInsert copies one file at a time. This causes Redshift to only sort the contents of the first file. All other files are left unsorted, and that means a VACUUM command needs to be run after the table is created.

Instead, all the files in S3 should be copied as a single command. This will allow Redshift to keep the entire table sorted while using mode: replace, and will also allow the Redshift cluster to process the files in parallel. See: https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html
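
For illustration, here is a minimal sketch of what a single-command load could look like, using a Redshift manifest file that lists every staged S3 object. This is not how embulk-output-redshift works today. The bucket, table name, and IAM role ARN are hypothetical placeholders, and the file format options a real load would need are left out.

```python
import json


def build_manifest(s3_urls):
    """Return a Redshift COPY manifest (JSON text) listing every staged file."""
    # "mandatory": True makes COPY fail if a listed file is missing.
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy_sql(table, manifest_url, iam_role):
    """Return one COPY statement that loads every file named in the manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "MANIFEST;"
    )


# Hypothetical staged files; in practice these would be the files Embulk uploaded.
staged_files = [
    "s3://example-bucket/embulk/part-0000.gz",
    "s3://example-bucket/embulk/part-0001.gz",
    "s3://example-bucket/embulk/part-0002.gz",
]

manifest = build_manifest(staged_files)
copy_sql = build_copy_sql(
    table="example_table",  # hypothetical table name
    manifest_url="s3://example-bucket/embulk/load.manifest",  # hypothetical location
    iam_role="arn:aws:iam::123456789012:role/example-redshift-role",  # hypothetical role
)

print(manifest)
print(copy_sql)
```

The manifest would be uploaded to S3 once all files are staged, and the COPY statement would then be run once against the target table.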

hito4t · 2018-01-10T03:13:13Z

@mjalkio
Thank you for the information!
Copying all files at once may be slower than copying them one after another,
so we should add a new property that lets users choose between the two.

mjalkio · 2018-01-10T03:21:00Z

A single COPY command from within Redshift shouldn't be slower than copying the files one after another; it's what the docs recommend. We still want to create multiple files in S3, but they should be ingested through a single COPY command as described here.

hito4t · 2018-01-10T03:30:22Z

embulk-output-redshift copies each file as soon as it is ready.
That is, it starts copying files before all of the files have been prepared.
It might already have copied n-1 files by the time the last file is prepared.
In that case, copying only the last file will be faster than copying all files at once.

mjalkio · 2018-01-10T03:37:03Z

Ah, okay. I guess I'm not familiar with other Embulk users' workloads. It definitely isn't the case for us that our embulk-output-redshift threads finish at significantly different times. If that's the case for others, an option would help us a lot.
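
The trade-off discussed in this thread can be sketched with a toy timing model. All numbers are made up for illustration and are not measurements: `copy_seconds` is an assumed cost of one per-file COPY, and `batch_seconds` is an assumed cost of one COPY that loads all files in parallel.

```python
def incremental_finish(ready_times, copy_seconds):
    """Finish time when each file is copied as soon as it is ready, one copy at a time."""
    finish = 0.0
    for ready in sorted(ready_times):
        # A copy starts when the file is ready and the previous copy has finished.
        finish = max(finish, ready) + copy_seconds
    return finish


def single_copy_finish(ready_times, batch_seconds):
    """Finish time when one COPY runs only after the last file is ready."""
    return max(ready_times) + batch_seconds


# Hypothetical times (in seconds) at which each file becomes ready.
staggered = [10, 20, 30, 120]    # the last file arrives much later than the rest
together = [100, 101, 102, 103]  # all threads finish at about the same time

for name, ready in [("staggered", staggered), ("together", together)]:
    inc = incremental_finish(ready, copy_seconds=15)
    one = single_copy_finish(ready, batch_seconds=25)
    print(f"{name}: incremental={inc}s single={one}s")
```

Under these assumed numbers the incremental approach finishes first when files are ready at very different times, and the single COPY finishes first when they are ready together. The model only covers load time; it ignores the sorting and VACUUM cost raised at the start of the thread.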
