
Make credentials a resolver #3811

Closed
Galileo-Galilei opened this issue Apr 14, 2024 · 14 comments
Labels
- Issue: Feature Request (New feature or improvement to existing feature)
- Stage: Technical Design 🎨 (Ticket needs to undergo technical design before implementation)
- TD: should we? (Tech Design topic to discuss whether we should implement/solve a raised issue)

Comments

@Galileo-Galilei
Member

Galileo-Galilei commented Apr 14, 2024

Description

Make "credentials" an OmegaConf resolver, so that injecting credentials in the catalog would look like this:

# current injection
my_data:
    type: pandas.CSVDataset
    filepath: /path/to/data.csv
    credentials: my_credentials # a string which refers to a section in credentials.yml
# new proposal
my_data:
    type: pandas.CSVDataset
    filepath: /path/to/data.csv
    credentials: ${credentials: my_credentials}

Context

This may look like a purely technical refactoring, but I actually expect it to be a real improvement for users.

Overall, this would likely be a general improvement, with some overlap with what is described in #1646, which has never been properly addressed despite a bunch of discussions around the topic.

Possible Implementation

Move the logic of _get_credentials and _resolve_credentials out of the catalog:

import copy
from typing import Any

CREDENTIALS_KEY = "credentials"


def _get_credentials(credentials_name: str, credentials: dict[str, Any]) -> Any:
    """Return a set of credentials from the provided credentials dict.

    Args:
        credentials_name: Credentials name.
        credentials: A dictionary with all credentials.

    Returns:
        The set of requested credentials.

    Raises:
        KeyError: When a data set with the given name has not yet been
            registered.
    """
    try:
        return credentials[credentials_name]
    except KeyError as exc:
        raise KeyError(
            f"Unable to find credentials '{credentials_name}': check your data "
            "catalog and credentials configuration. See "
            "https://kedro.readthedocs.io/en/stable/kedro.io.DataCatalog.html "
            "for an example."
        ) from exc


def _resolve_credentials(
    config: dict[str, Any], credentials: dict[str, Any]
) -> dict[str, Any]:
    """Return the dataset configuration where credentials are resolved using
    the credentials dictionary provided.

    Args:
        config: Original dataset config, which may contain unresolved credentials.
        credentials: A dictionary with all credentials.

    Returns:
        The dataset config, where all the credentials are successfully resolved.
    """
    config = copy.deepcopy(config)

    def _map_value(key: str, value: Any) -> Any:
        # Replace any "credentials" string with the matching entry from the
        # credentials dict, recursing into nested dataset config sections.
        if key == CREDENTIALS_KEY and isinstance(value, str):
            return _get_credentials(value, credentials)
        if isinstance(value, dict):
            return {k: _map_value(k, v) for k, v in value.items()}
        return value

    return {k: _map_value(k, v) for k, v in config.items()}

to a resolver.
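
A minimal sketch of what that could look like, registered directly with OmegaConf (the resolver name, the registration point inside Kedro, and the error message are illustrative, not a final design):

# Sketch only: wrap the existing credentials lookup in an OmegaConf resolver so
# that ${credentials:...} is interpolated while the catalog config is loaded.
from typing import Any

from omegaconf import OmegaConf


def register_credentials_resolver(credentials: dict[str, Any]) -> None:
    """Register a "credentials" resolver closing over the loaded credentials dict."""

    def _resolve(credentials_name: str) -> Any:
        try:
            return credentials[credentials_name]
        except KeyError as exc:
            raise KeyError(
                f"Unable to find credentials '{credentials_name}': check your "
                "credentials configuration."
            ) from exc

    OmegaConf.register_new_resolver("credentials", _resolve, replace=True)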

Possible Alternatives

Do nothing and keep the situation as is.

@Galileo-Galilei Galileo-Galilei added the Issue: Feature Request label Apr 14, 2024
@idanov
Member

idanov commented Apr 14, 2024

I like the idea in principle, but what would be a way for us to prevent hardcoding of credentials into the catalog directly?

@Galileo-Galilei
Member Author

I'm not sure what you mean: for me, the default credentials resolver provided by Kedro would do exactly the same thing as now (map a string in the catalog to an entry in credentials.yml), so I don't think there is a problem here.

A user who likes playing with Kedro's limits could write their own resolver and feed raw values directly into the catalog, but I don't think it's Kedro's responsibility to guard against all the weird things users can do (and, funnily enough, they can already write a custom resolver and use it in the catalog; my proposal would simply make it easier to turn such dangerous logic into their default, but they would clearly be doing it on purpose). Do you have an example of a situation this proposal would enable that is not already possible today and that we want to avoid at all costs?
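
For illustration, here is roughly how a user can already plug a custom resolver into the config loader today; a minimal sketch assuming the custom_resolvers option of OmegaConfigLoader, configured via CONFIG_LOADER_ARGS in settings.py (the resolver name and lookup source are made up for the example):

# settings.py -- sketch of a user-defined "credentials" resolver.
from kedro.config import OmegaConfigLoader

# Illustrative only: in practice the lookup could hit a vault, environment
# variables, or a locally loaded credentials file.
_MY_CREDENTIALS = {"my_credentials": {"key": "...", "secret": "..."}}


def _lookup_credentials(name: str):
    return _MY_CREDENTIALS[name]


CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {"credentials": _lookup_credentials},
}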

@astrojuanlu
Member

I actually expect it to be a real improvement for users

Hard agree. Our YAML-first approach to credentials is weird, given that there are many other methods that are arguably more prevalent, like environment variables, secrets vaults, JSON token files...

Plus we have evidence that users struggle to understand the current approach #1646 (comment)
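
As a concrete example of the environment-variable route, OmegaConf already ships an oc.env resolver, so credential values can be pulled from the environment at resolution time; a small sketch (the variable names and dummy values are illustrative, and in a real project they would be set outside Python):

# Sketch: environment-variable-based credentials via OmegaConf's built-in
# oc.env resolver.
import os

from omegaconf import OmegaConf

os.environ.setdefault("AWS_ACCESS_KEY_ID", "dummy-key")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "dummy-secret")

credentials = OmegaConf.create(
    {
        "dev_s3": {
            "key": "${oc.env:AWS_ACCESS_KEY_ID}",
            "secret": "${oc.env:AWS_SECRET_ACCESS_KEY}",
        }
    }
)

print(OmegaConf.to_container(credentials, resolve=True))
# {'dev_s3': {'key': 'dummy-key', 'secret': 'dummy-secret'}}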

@datajoely
Contributor

I think this is really elegant, and there is probably a way to make it non-breaking, just an extension.

@idanov
Member

idanov commented Apr 17, 2024

@Galileo-Galilei The way OmegaConf resolvers work is that they eventually render the value into the dictionary you have provided. That means that in the end result, you get the credentials as part of the config there. The consumer (in this case Kedro) would see no difference between this:

# new proposal
my_data:
    type: pandas.CSVDataset
    filepath: /path/to/data.csv
    credentials: ${credentials: my_credentials}

and this:

# hardcoded secret
my_data:
    type: pandas.CSVDataset
    filepath: /path/to/data.csv
    credentials: s3cre7!

Currently, the latter will not be accepted by Kedro and won't work, because we do the resolving after loading the config. This way we prevent users from accidentally checking in secrets. If we go down the resolver path (which I personally like), we need to make sure that my second example errors out and only your example works.

@Galileo-Galilei
Member Author

Galileo-Galilei commented Apr 23, 2024

This is a totally valid concern; I did not think about this edge case.

It seems it should be possible to check whether the credentials subkey is an interpolation with the OmegaConf.is_interpolation method (https://omegaconf.readthedocs.io/en/2.3_branch/usage.html#omegaconf-is-interpolation) and raise an error if it isn't, but it will likely require #2973 to be completed first (and given the number of references to that issue, it feels like it will become a blocker at some point).
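
For concreteness, a minimal sketch of that check, done on the raw (unresolved) config; the dataset names, values, and error message are illustrative:

# Sketch of the guard idanov asked for: reject catalog entries whose
# "credentials" key is a plain value rather than an interpolation.
from omegaconf import OmegaConf

catalog = OmegaConf.create(
    {
        "my_data": {
            "type": "pandas.CSVDataset",
            "filepath": "/path/to/data.csv",
            "credentials": "${credentials:my_credentials}",
        },
        "leaky_data": {
            "type": "pandas.CSVDataset",
            "filepath": "/path/to/other.csv",
            "credentials": "s3cre7!",  # hardcoded secret -> should error out
        },
    }
)

for name, entry in catalog.items():
    if "credentials" in entry and not OmegaConf.is_interpolation(entry, "credentials"):
        raise ValueError(
            f"Dataset '{name}' hardcodes credentials in the catalog; "
            "use the ${credentials:...} resolver instead."
        )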

@astrojuanlu
Member

That’s exactly what I always do: I use credentials as env vars. However, it would still be nice to be able to inject them into a dataset (think DB credentials for SQL datasets) in some way.

Originally posted by @MatthiasRoels in #1936 (comment)

@datajoely
Contributor

I also think a resolver-based way of doing creds would be amazing; I had to hack something together to make things work for my SQLFrame experiment.

@m-gris

m-gris commented Sep 9, 2024

Hi Everyone,

As always, a lot of great thoughts & comments. 👍

If I may, I'd like to quickly add some poorly articulated concerns / suggestions that connect this issue to #2829:

AWS Secrets Manager bills on reads ⚠️ 💵, making it crucial that only the credentials that are actually needed are loaded.

Our problem is compounded by the fact that, in the @after_context_created hook, there is unfortunately no access to tags, pipeline name, namespace, etc., i.e. data that would help filter things out and minimise AWS billing.

Looking very forward to reading your thoughts on this.
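
One way a resolver could help with that billing concern: fetch each secret lazily, only when a catalog entry that references it is actually resolved, and cache it so it is read at most once. A rough sketch (the resolver name, secret layout, and caching policy are assumptions, not a design decision):

# Sketch: lazy AWS Secrets Manager lookups behind an OmegaConf resolver, so a
# secret is only read when a dataset config that references it gets resolved.
import functools
import json

import boto3
from omegaconf import OmegaConf


@functools.lru_cache(maxsize=None)  # each secret is fetched at most once
def _aws_secret(secret_id: str) -> dict:
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])  # assumes a JSON-encoded secret


OmegaConf.register_new_resolver("aws_secret", _aws_secret)

# Catalog usage would then look like:
#   credentials: ${aws_secret:my-project/dev-s3}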

@noklam
Contributor

noklam commented Sep 9, 2024

The suggestion makes sense. We agree this is something useful; at the moment we are doing some re-design of the DataCatalog, and we will start looking at this once we finish the refactoring.

@astrojuanlu
Member

More evidence that our users are just using environment variables for credentials https://kedro.hall.community/support-lY6wDVhxGXNY/passing-credentials-to-a-node-in-kedro-DrfL3JQdGpPb

(Related to #3979 cc @lordsoffallen)

@merelcht merelcht added the Stage: Technical Design 🎨 and TD: should we? labels Oct 7, 2024
@astrojuanlu
Member

Credential management in Polars is changing 👀; see:

lf = pl.scan_parquet(
    "s3://...",
    credential_provider=pl.CredentialProviderAWS(profile_name="..."),
)

@astrojuanlu
Member

Moving this to Discussions to continue the conversation there.

@kedro-org kedro-org locked and limited conversation to collaborators Nov 11, 2024
@astrojuanlu astrojuanlu converted this issue into discussion #4320 Nov 11, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
