Allow adding TRUNCATECOLUMNS option to Redshift COPY #43

stefankeidel · 2020-05-13T10:17:07Z

This adds a configuration parameter (defaulting to False) which triggers the TRUNCATECOLUMNS option in every Redshift COPY statement sent by the target.

The use case for us is the combination with tap-intercom where some of the content can exceed 64k, but the content for those few records/fields where that happens can be safely ignored. I couldn't find another way to truncate the content before sending to Redshift.

It might be useful to at some point add a more flexible way to include other options as well, but this should work for now.
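
For readers skimming the thread, here is a minimal sketch of what a boolean flag like this could do. The function name, the `truncate_columns` parameter name, and the surrounding COPY clauses are assumptions for illustration, not the target's actual code.

```python
def build_copy_statement(table, s3_uri, truncate_columns=False):
    """Assemble a Redshift COPY statement (hypothetical helper).

    TRUNCATECOLUMNS is appended only when the flag is set, so the
    default behaviour is unchanged.
    """
    clauses = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",
        "IAM_ROLE '<role-arn>'",  # placeholder, credentials are out of scope here
        "FORMAT AS CSV",
    ]
    if truncate_columns:
        clauses.append("TRUNCATECOLUMNS")
    return " ".join(clauses)


print(build_copy_statement("cats", "s3://bucket/key", truncate_columns=True))
```

Defaulting the flag to False keeps existing configs loading exactly as before.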

AlexanderMann · 2020-05-14T12:50:00Z

@stefankeidel this seems really straightforward and practical. I think it'd make sense to add a test that creates an (insanely) large record, uses this option, and then checks that we read a truncated version back out of Redshift. Unlike some of our other configuration options, this one is pretty straightforward to actually test.

Curious if @awm33 has any opinions/thoughts here?

This test adds 100 cats with a long description and asserts that they all insert correctly (Redshift bails if the content is too long and the TRUNCATECOLUMNS option is not set) and that the longest record for that column equals the max column length. Tested using the Docker setup for this project:

    source /code/venv--target-redshift/bin/activate
    pytest tests/test_target_redshift.py -k 'test_truncate_columns'
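
The shape of that assertion can be sketched without a Redshift connection. The snippet below only simulates truncation in plain Python; the column width of 1000, the record layout, and the helper name are assumptions, and the real test runs against Redshift instead.

```python
MAX_COLUMN_LENGTH = 1000  # assumed column width in bytes, for illustration only


def simulate_truncatecolumns(value, max_bytes=MAX_COLUMN_LENGTH):
    """Roughly mimic TRUNCATECOLUMNS: cut the value to fit the column width."""
    encoded = value.encode("utf-8")
    # errors="ignore" drops a multi-byte character that was cut in half
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# 100 cats whose descriptions range from 50 to 5000 characters
cats = [{"id": i, "description": "x" * (50 * (i + 1))} for i in range(100)]
loaded = [
    dict(cat, description=simulate_truncatecolumns(cat["description"]))
    for cat in cats
]

assert len(loaded) == 100  # every record "inserts"
assert max(len(cat["description"]) for cat in loaded) == MAX_COLUMN_LENGTH
assert loaded[0]["description"] == cats[0]["description"]  # short values untouched
```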

stefankeidel · 2020-05-15T09:09:37Z

Good idea! Added a test that does roughly that. Lmk if that works

AlexanderMann · 2020-05-15T13:37:09Z

Test looks good. If we can get @awm33 to weigh in here, I think this is good to merge. Really nice work @stefankeidel!

awm33 · 2020-05-16T21:04:38Z

@AlexanderMann @stefankeidel I'm wondering if we should create a subobject for Redshift COPY options to group them?

@stefankeidel Did you try unselecting ("selected": false) the column/field in the tap catalog? Or do you actually need the column for analysis?

AlexanderMann · 2020-05-17T21:35:55Z

@awm33 that seems like a reasonable thing to do. I think a good enhancement for all of our config would be grouping all of the various things. Like, for psycopg2 we can make the connection object just a 1:1 mapping in a sub-object etc.

You're suggesting doing it here so folks don't have a bunch of work to do in the future?

stefankeidel · 2020-05-18T05:38:36Z

> @stefankeidel Did you try unselecting ("selected": false) the column/field in the tap catalog? Or do you actually need the column for analysis?

Yeah, this is for tap-intercom. We don't care about every individual message being imported in full but would still like to have most of them :)

Regarding a subgrouping: Makes sense! Wdyt about something like this?

{
	"redshift_host": "...",
	"redshift_port": 5439,
	"redshift_database": "...",
	"redshift_username": "...",
	"redshift_password": "...",
	"redshift_schema": "test_schema",
	"default_column_length": 1000,
	"max_batch_rows": 16000000,
	"max_buffer_size": 30064771072,
	"target_s3": {
		"aws_access_key_id": "...",
		"aws_secret_access_key": "....",
		"bucket": "...",
		"key_prefix": "singer-io/"
	},
	"redshift_copy_options": {
		"truncate_columns": true
	}
}

AlexanderMann · 2020-05-18T15:23:58Z

If we're going this route, I'd prefer nested values ie: redshift: { copy: ...

Also, I'm wondering if we want to simply make this an array of strings which get passed right through to the COPY statement, that way you can override anything without needing another PR.
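
A rough sketch of the pass-through idea, assuming a hypothetical `copy_options` list in the config and a hypothetical helper name. Nothing here is the target's real implementation.

```python
def append_copy_options(base_statement, copy_options=None):
    """Append user-supplied option strings verbatim to a COPY statement.

    No validation happens here: whatever the user lists is sent to Redshift.
    """
    options = [option.strip() for option in (copy_options or []) if option.strip()]
    return " ".join([base_statement, *options])


base = "COPY cats FROM 's3://bucket/key' FORMAT AS CSV"
print(append_copy_options(base, ["TRUNCATECOLUMNS", "BLANKSASNULL"]))
```

The trade-off is that a typo in an option only surfaces as a SQL error from Redshift at load time.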

stefankeidel · 2020-05-19T05:11:09Z

> If we're going this route, I'd prefer nested values ie: redshift: { copy: ...

Hmm, we already have prefixed redshift_ values for the connection params. Would you want to move those in there as well and break existing configs? Either way it should be consistent I think.

> Also, I'm wondering if we want to simply make this an array of strings which get passed right through to the COPY statement, that way you can override anything without needing another PR.

I like this idea! Not sure if we should do some verification or if we can just assume people that are using such an option know what they're doing?

This allows passing a list of options to Redshift's COPY command instead of enabling only a single option.
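
If some verification were wanted, one middle ground would be an allowlist check before the options reach the COPY statement. The sketch below is hypothetical: the helper name is invented and the allowlist is a small illustrative subset of Redshift's data conversion parameters, not a complete list.

```python
# Illustrative subset of Redshift COPY data conversion keywords (assumption:
# a real allowlist would be longer and would also cover options with arguments).
ALLOWED_COPY_OPTIONS = {
    "TRUNCATECOLUMNS",
    "BLANKSASNULL",
    "EMPTYASNULL",
    "TRIMBLANKS",
}


def validate_copy_options(options):
    """Normalise option strings and reject anything not on the allowlist."""
    cleaned = []
    for option in options:
        keyword = option.strip().upper()
        if keyword not in ALLOWED_COPY_OPTIONS:
            raise ValueError(f"Unsupported COPY option: {option!r}")
        cleaned.append(keyword)
    return cleaned


print(validate_copy_options(["truncatecolumns", " BLANKSASNULL "]))
```

This rejects typos early, at the cost of needing a code change whenever a new option should be allowed.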

stefankeidel · 2020-05-19T11:00:07Z

I implemented it using the prefix redshift_* for now, but I'm happy to change the config format obviously. I just thought this is most consistent with the existing redshift_host etc.

awm33 · 2020-05-19T17:12:40Z

@stefankeidel @AlexanderMann I kind of regret us prefixing everything with redshift :). Other targets have COPY commands (Postgres and Snowflake), and Snowflake has a TRUNCATECOLUMNS option too.

I propose something that we can use with other targets as well, since they offer something similar:

{
	"redshift_host": "...",
	"redshift_port": 5439,
	"redshift_database": "...",
	"redshift_username": "...",
	"redshift_password": "...",
	"redshift_schema": "test_schema",
	"default_column_length": 1000,
	"max_batch_rows": 16000000,
	"max_buffer_size": 30064771072,
	"target_s3": {
		"aws_access_key_id": "...",
		"aws_secret_access_key": "....",
		"bucket": "...",
		"key_prefix": "singer-io/"
	},
	"copy_options": {
		"truncate_columns": true
	}
}
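
To show how a target-agnostic `copy_options` block might be consumed, here is a small sketch that maps the generic key to each target's own spelling. The lookup table, the function name, and the Snowflake rendering (`TRUNCATECOLUMNS = TRUE`) are assumptions to check against each warehouse's documentation.

```python
# Hypothetical per-target renderings of generic copy_options keys.
COPY_OPTION_SQL = {
    "redshift": {"truncate_columns": "TRUNCATECOLUMNS"},
    "snowflake": {"truncate_columns": "TRUNCATECOLUMNS = TRUE"},  # assumed spelling
}


def render_copy_options(target, copy_options):
    """Translate enabled generic options into SQL fragments for one target."""
    known = COPY_OPTION_SQL.get(target, {})
    rendered = []
    for key, enabled in copy_options.items():
        if not enabled:
            continue  # false or missing options add nothing to the statement
        if key not in known:
            raise ValueError(f"{target} does not support copy option {key!r}")
        rendered.append(known[key])
    return rendered


config = {"copy_options": {"truncate_columns": True}}
print(render_copy_options("redshift", config["copy_options"]))
```

One keyword per option keeps configs portable across targets, while a raw list of strings avoids needing a PR for every new option.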

stefankeidel · 2020-05-20T06:39:19Z

@awm33 I like that format, but what do you think about the copy-options-as-array-of-strings idea posted by @AlexanderMann above and implemented in this latest revision? Do we want keywords for every single option we want to support or just assume users know what they're doing?

AlexanderMann · 2020-05-20T12:36:38Z

@awm33 on that note...should this really be something we put into target-postgres then and have the other targets inherit? Is this something we can assume is a SQLBase type configuration?

Tom-E · 2021-01-05T20:06:17Z

Is this ready to be merged + released? Happen to be looking for exactly this config option =)

stefankeidel added 7 commits May 13, 2020 11:20

Hardcode TRUNCATECOLUMNS to see if that helps

b2f6e67

Try to make this configurable

55bf591

Add documentation

9ce1edf

Add links

4f7df3b

🤦

ed95e8a

Not literal

7709730

At least SQL?

20514f5

stefankeidel added 3 commits May 15, 2020 11:01

Include else branch in interpolation

30ba62f

Assert minimum length as well

1372bbc

AlexanderMann self-requested a review May 15, 2020 13:37

AlexanderMann approved these changes May 15, 2020


stefankeidel added 2 commits May 19, 2020 09:37

Refactor into more generic redshift_copy_options list option

1ba00cf

This allows passing a list of options to Redshift's COPY command instead of enabling only a single option.

Update docs

f7b0b18
