Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow adding TRUNCATECOLUMNS option to Redshift COPY #43

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

stefankeidel
Copy link

This adds a configuration parameter (defaulting to False) which triggers the TRUNCATECOLUMNS option in every Redshift COPY statement sent by the target.

The use case for us is the combination with tap-intercom where some of the content can exceed 64k, but the content for those few records/fields where that happens can be safely ignored. I couldn't find another way to truncate the content before sending to Redshift.

It might be useful to at some point add a more flexible way to include other options as well, but this should work for now.

@AlexanderMann
Copy link
Collaborator

@stefankeidel this seems really straightforward and practical. I think it'd make sense to have a test which creates an (insanely) large record, uses this option, and then makes sure we read a truncated version out of Redshift since unlike some of our other configuration options, this is pretty straightforward to actually test for.

Curious if @awm33 has any opinions/thoughts here?

This test adds 100 cats with a long description and asserts that they
all insert correctly (Redshift bails if the content is too long if the
TRUNCATECOLUMNS option is not set) and that the longest record for
that column equals the max column length.

Tested using the docker setup for this project:

    source /code/venv--target-redshift/bin/activate
    pytest tests/test_target_redshift.py -k 'test_truncate_columns'
@stefankeidel
Copy link
Author

Good idea! Added a test that does roughly that. Lmk if that works

@AlexanderMann
Copy link
Collaborator

Test looks good. If we can get @awm33 to weigh in here, I think this is good to merge. Really nice work @stefankeidel!

@AlexanderMann AlexanderMann self-requested a review May 15, 2020 13:37
@awm33
Copy link
Member

awm33 commented May 16, 2020

@AlexanderMann @stefankeidel I'm wondering if we should create a subobject for Redshift COPY options to group them?

@stefankeidel Did you try unselecting ("selected": false") the column/field in the tap catalog? Or do you actually need the column for analysis?

@AlexanderMann
Copy link
Collaborator

@awm33 that seems like a reasonable thing to do. I think a good enhancement for all of our config would be grouping all of the various things. Like, for psycopg2 we can make the connection object just a 1:1 mapping in a sub-object etc.

You're suggesting doing it here so folks don't have a bunch of work to do in the future?

@stefankeidel
Copy link
Author

stefankeidel commented May 18, 2020

@stefankeidel Did you try unselecting ("selected": false") the column/field in the tap catalog? Or do you actually need the column for analysis?

Yeah, this is for tap-intercom. We don't care about every individual message being imported in full but would still like to have most of them :)

Regarding a subgrouping: Makes sense! Wdyt about something like this?

{
	"redshift_host": "...",
	"redshift_port": 5439,
	"redshift_database": "...",
	"redshift_username": "...",
	"redshift_password": "...",
	"redshift_schema": "test_schema",
	"default_column_length": 1000,
	"max_batch_rows": 16000000,
	"max_buffer_size": 30064771072,
	"target_s3": {
		"aws_access_key_id": "...",
		"aws_secret_access_key": "....",
		"bucket": "...",
		"key_prefix": "singer-io/"
	},
	"redshift_copy_options": {
		"truncate_columns": true
	}
}

@AlexanderMann
Copy link
Collaborator

If we're going this route, I'd prefer nested values ie: redshift: { copy: ...

Also, I'm wondering if we want to simply make this an array of strings which get passed right through to the COPY statement, that way you can override anything without needing another PR.

@stefankeidel
Copy link
Author

If we're going this route, I'd prefer nested values ie: redshift: { copy: ...

Hmm, we already have prefixed redshift_ values for the connection params. Would you want to move those in there as well and break existing configs? Either way it should be consistent I think.

Also, I'm wondering if we want to simply make this an array of strings which get passed right through to the COPY statement, that way you can override anything without needing another PR.

I like this idea! Not sure if we should do some verification or if we can just assume people that are using such an option know what they're doing?

This allows to pass a list of options to redshift's copy command
instead of just enabling to set a single option.
@stefankeidel
Copy link
Author

I implemented it using the prefix redshift_* for now, but I'm happy to change the config format obviously. I just thought this is most consistent with the existing redshift_host etc.

@awm33
Copy link
Member

awm33 commented May 19, 2020

@stefankeidel @AlexanderMann I kind of regret us prefixing everything with redshift :). Other targets have COPY commands (postgres and snowflake) and snowflake has a TRUNCATECOLUMNS options too.

I propose something that we can use with other targets as well, since they offer something similar:

{
	"redshift_host": "...",
	"redshift_port": 5439,
	"redshift_database": "...",
	"redshift_username": "...",
	"redshift_password": "...",
	"redshift_schema": "test_schema",
	"default_column_length": 1000,
	"max_batch_rows": 16000000,
	"max_buffer_size": 30064771072,
	"target_s3": {
		"aws_access_key_id": "...",
		"aws_secret_access_key": "....",
		"bucket": "...",
		"key_prefix": "singer-io/"
	},
	"copy_options": {
		"truncate_columns": true
	}
}

@stefankeidel
Copy link
Author

@awm33 I like that format, but what do you think about the copy-options-as-array-of-strings idea posted by @AlexanderMann above and implemented in this latest revision? Do we want keywords for every single option we want to support or just assume users know what they're doing?

@AlexanderMann
Copy link
Collaborator

@awm33 on that note...should this really be something we put into target-postgres then and have the other targets inherit? Is this something we can assume is a SQLBase type configuration?

@Tom-E
Copy link

Tom-E commented Jan 5, 2021

Is this ready to be merged + released? Happen to be looking for exactly this config option =)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants