
Redshift IAM Credential Expiration #2826

Closed
livinlefevreloca opened this issue Oct 9, 2020 · 7 comments
Labels
enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors! redshift

Comments

@livinlefevreloca

Describe the feature

The option to "autorefresh" temporary Redshift credentials acquired through the Redshift GetClusterCredentials API call.

Describe alternatives you've considered

The only alternative I can think of is splitting the models out into separate calls to dbt run. While this would technically work, it would be nice not to have to do this just to run all of our dbt models at once. If there is an existing workaround for this, I'm open to that as well.

Additional context

This feature is Redshift specific and I believe would only require changes to the Redshift plugin. My team uses dbt with Redshift and uses the IAM temporary credentials method for both convenience and security. We have several long-running dbt processes and have run into issues several times with the temporary credentials expiring in the middle of a run. This is fine for individual models, since the connection to Redshift remains alive after the credentials expire; however, as soon as dbt attempts to open another connection to Redshift with the now-expired credentials, all subsequent models in the run fail.

There was some mention of this in the original PR that introduced the IAM method as an option for authenticating to Redshift, but it seems it was ultimately decided that letting the credentials expire after the duration the user allots to them is fine. This makes sense, but the one catch is that Amazon doesn't allow an iam_duration_seconds value greater than 3600 seconds.

I would like to propose that a time-to-live (TTL) cache be used to store the temporary username and password provided by Redshift. The cache can be set to expire slightly before the credentials would. Then, when the credentials are accessed somewhere in dbt, the cache can be checked, and on a miss the Redshift GetClusterCredentials API call can be made again to replace them. We have used this solution in our own code for long-running processes that make multiple queries to Redshift, and it has worked smoothly.
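To make that concrete, here is a rough sketch of the kind of cache I have in mind (Python, using boto3; the class and parameter names are illustrative only, not existing dbt code):

import time
import boto3


class TemporaryCredentialCache:
    """Rough sketch: cache GetClusterCredentials output and refresh it
    slightly before it expires. Names here are illustrative only."""

    def __init__(self, cluster_id, db_user, db_name,
                 duration_seconds=3600, refresh_margin_seconds=120):
        self._client = boto3.client("redshift")
        self._cluster_id = cluster_id
        self._db_user = db_user
        self._db_name = db_name
        self._duration = duration_seconds
        self._margin = refresh_margin_seconds
        self._creds = None          # (username, password) tuple
        self._expires_at = 0.0      # unix timestamp

    def get(self):
        # Treat the entry as expired a little early so we never hand out
        # credentials that would die mid-query.
        if self._creds is None or time.time() >= self._expires_at - self._margin:
            resp = self._client.get_cluster_credentials(
                DbUser=self._db_user,
                DbName=self._db_name,
                ClusterIdentifier=self._cluster_id,
                DurationSeconds=self._duration,
                AutoCreate=False,
            )
            self._creds = (resp["DbUser"], resp["DbPassword"])
            self._expires_at = time.time() + self._duration
        return self._creds

The connection-opening code would then call get() instead of reading a static username/password, so a cache miss transparently triggers a new GetClusterCredentials call.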

Who will this benefit?

I believe it will benefit anyone who uses dbt with Redshift and has runs of multiple models that may take longer than an hour.

Are you interested in contributing this feature?

I have looked into the Redshift plugin and would be willing to contribute this myself. I am, however, a first-time contributor to dbt, so I would most likely be looking for some guidance on whether the solution I have in mind would be acceptable.

@livinlefevreloca livinlefevreloca added enhancement New feature or request triage labels Oct 9, 2020
@jtcohen6
Contributor

jtcohen6 commented Oct 9, 2020

Thanks for the detailed writeup @livinlefevreloca! The approach you've laid out here sounds reasonable to me.

cc @drewbanin: It's been a while since we touched this code. I don't foresee any security concerns around saving the temporary username/password in local cache, but want to confirm that with you.

@jtcohen6 jtcohen6 added good_first_issue Straightforward + self-contained changes, good for new contributors! redshift and removed triage labels Oct 9, 2020
@drewbanin
Contributor

hey @livinlefevreloca - do you have individual models that take > 1 hour to build? If so, I don't think there's anything we can really do about that. I don't think there's going to be a way to refresh a connection with new credentials without terminating the existing transaction that's open for the running model! If you have thoughts about how to do this, I'd be happy to hear them for sure.

If the problem is more that a collection of models collectively takes > 1 hour to run, then I think a good solution would be to make sure that dbt generates fresh IAM credentials with the 1 hour (or whatever is configured) TTL for each connection that dbt opens when running any given model. I thought that this was already how things worked today, but maybe we can take a deeper look and see if our connection pool is causing problems here.

If you have an example of log output from a run that failed with expired creds, it would definitely help us understand how to best address the issue! Thanks for opening this one :)

@livinlefevreloca
Author

@drewbanin @jtcohen6 thanks for the quick response guys.

It is groups of models failing. I'm not 100% sure, but I don't believe that a query to Redshift that lasts longer than 1 hour will fail once the credentials expire. I think once the connection is established, the transaction will still be allowed to commit since the connection remains open. I don't have any documentation to back that claim, but we have many queries that run longer than an hour and are submitted over connections established with temporary credentials. It may vary depending on other factors, though.

Unfortunately, I did not manage to get an example log today, but I will follow up with the team who are actually running these on Monday. Meanwhile, I started making the changes to the Redshift plugin that I think will cover it. Does it make sense for me to open a draft PR so there's a clearer view of where I think the changes need to occur, and you can give feedback on whether it seems correct / whether additional changes are needed? Let me know.

@drewbanin
Contributor

It is groups of models failing. I'm not 100% sure, but I don't believe that a query to Redshift that lasts longer than 1 hour will fail once the credentials expire. I think once the connection is established, the transaction will still be allowed to commit since the connection remains open.

This is 100% right, but dbt sometimes executes models as a series of queries. If you have a post-hook configured, for instance, the queries that dbt executes might look like:

-- start of model
begin;

-- main table materialization
create table abc.def as (

    select ...

);

-- post-hooks run here
grant select on abc.def to my_user;

-- end of model
commit;

In this case, the model is actually a couple of different queries, and they might not all be issued in the same statement. So, it's possible that if the create table statement took > 1 hour, the grant could fail! If you're able to scrape together some log output, that would definitely be helpful in diagnosing the specific failure mode you're running into.

I think that I would be more excited about a change that ensured that each model gets its own "fresh" connection with a 1 hour (or whatever is configured) TTL. I would be less excited about an approach which actively maintained a connection cache on the side. dbt already uses a connection pool, and I suspect the issue is that connection credentials are not refreshed when connections are drawn from this pool and re-used.

Happy to discuss the approach you have in mind further too - this is just my sense after doing a little bit of thinking here!
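To sketch what I mean by refreshing credentials when a connection is re-used (purely illustrative names, not dbt's actual pool code): when a connection is drawn from the pool, check whether the credentials it was opened with are close to their TTL, and reopen it if so.

import time

IAM_CREDENTIAL_TTL = 3600   # the 1 hour cap on iam_duration_seconds
REFRESH_MARGIN = 120        # refresh a bit early to be safe


def checkout(pool, open_fresh_connection):
    # Illustrative only: draw a connection from the pool, but if the IAM
    # credentials it was opened with are about to expire, close it and
    # open a new one (which re-runs GetClusterCredentials).
    conn = pool.get()
    if time.time() - conn.opened_at > IAM_CREDENTIAL_TTL - REFRESH_MARGIN:
        conn.close()
        conn = open_fresh_connection()
        conn.opened_at = time.time()
    return conn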

@livinlefevreloca
Author

@drewbanin After diving pretty deep into this, I'm actually at a loss for why the failures are occurring. From what I can tell, dbt is only pooling the custom Connection object that's being used to wrap a connection handle. When a node is done executing, the release method closes the actual connection to the DB. This means that every time a node in the graph is executed, the plugin's open method is called. Postgres's (and therefore Redshift's) open method calls get_credentials, which makes the GetClusterCredentials call if the auth method is set to iam.

If everything I've got above is correct (please correct me if it's not), there should be a fresh set of credentials for every model being executed, as you said earlier. I will need to get hold of the logs and maybe ask the team to run a job they know will fail with the debug flag set. Hopefully that will make what's actually going wrong a little clearer.
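For reference, this is roughly the flow I'm describing; a simplified sketch whose names loosely mirror the plugin structure, not the actual dbt source:

import boto3


class ConnectionLifecycleSketch:
    # Simplified sketch of the behavior described above; not the real dbt code.

    def open(self, credentials):
        # Called for every node that needs a live connection. With the iam
        # auth method, fresh temporary credentials are fetched right here.
        if credentials.method == "iam":
            user, password = self._get_tmp_credentials(credentials)
        else:
            user, password = credentials.user, credentials.password
        return self._connect(credentials, user, password)

    def release(self, connection):
        # Called when a node finishes: the underlying DB connection is closed,
        # so the next node goes back through open() and gets new credentials.
        connection.handle.close()

    @staticmethod
    def _get_tmp_credentials(credentials):
        client = boto3.client("redshift")
        resp = client.get_cluster_credentials(
            DbUser=credentials.user,
            DbName=credentials.database,
            ClusterIdentifier=credentials.cluster_id,
            DurationSeconds=credentials.iam_duration_seconds,
            AutoCreate=False,
        )
        return resp["DbUser"], resp["DbPassword"]

    def _connect(self, credentials, user, password):
        ...  # e.g. psycopg2.connect(host=credentials.host, user=user, password=password)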

@livinlefevreloca
Author

So, I'm here to eat my words. I was mistaken about how we were authenticating to Redshift. We were manually making the GetClusterCredentials call ourselves and passing the returned username and password into the profile. We're going to double-check that using the iam method correctly fixes the issue, but after that I'm going to close this. Thanks for the quick responses!

@drewbanin
Contributor

right on @livinlefevreloca! Everything you said matches my understanding as well - really appreciate you diving deep into this one!

Please do feel free to close this ticket out once you've confirmed that the IAM auth method is working as intended, and let us know if there's anything else we can help out with :)
