-
Notifications
You must be signed in to change notification settings - Fork 915
/
Copy pathadvanced_configuration.md
316 lines (240 loc) · 13.6 KB
/
advanced_configuration.md
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
# Advanced configuration
The documentation on [configuration](./configuration_basics.md) describes how to satisfy most common requirements of standard Kedro project configuration:
By default, Kedro is set up to use the [ConfigLoader](/kedro.config.ConfigLoader) class. Kedro also provides two additional configuration loaders with more advanced functionality: the [TemplatedConfigLoader](/kedro.config.TemplatedConfigLoader) and the [OmegaConfigLoader](/kedro.config.OmegaConfigLoader).
Each of these classes are alternatives for the default `ConfigLoader` and have different features. The following sections describe each of these classes and their specific functionality in more detail.
## TemplatedConfigLoader
Kedro provides an extension [TemplatedConfigLoader](/kedro.config.TemplatedConfigLoader) class that allows you to template values in configuration files. To apply templating in your project, set the `CONFIG_LOADER_CLASS` constant in your [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md):
```python
from kedro.config import TemplatedConfigLoader # new import
CONFIG_LOADER_CLASS = TemplatedConfigLoader
```
### Provide template values through globals
When using the `TemplatedConfigLoader` you can provide values in the configuration template through a `globals` file or dictionary.
Let's assume the project contains a `conf/base/globals.yml` file with the following contents:
```yaml
bucket_name: "my_s3_bucket"
key_prefix: "my/key/prefix/"
datasets:
csv: "pandas.CSVDataSet"
spark: "spark.SparkDataSet"
folders:
raw: "01_raw"
int: "02_intermediate"
pri: "03_primary"
fea: "04_feature"
```
To point your `TemplatedConfigLoader` to the globals file, add it to the the `CONFIG_LOADER_ARGS` variable in [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md):
```python
CONFIG_LOADER_ARGS = {"globals_pattern": "*globals.yml"}
```
Now the templating can be applied to the configuration. Here is an example of a templated `conf/base/catalog.yml` file:
```yaml
raw_boat_data:
type: "${datasets.spark}" # nested paths into global dict are allowed
filepath: "s3a://${bucket_name}/${key_prefix}/${folders.raw}/boats.csv"
file_format: parquet
raw_car_data:
type: "${datasets.csv}"
filepath: "s3://${bucket_name}/data/${key_prefix}/${folders.raw}/${filename|cars.csv}" # default to 'cars.csv' if the 'filename' key is not found in the global dict
```
Under the hood, `TemplatedConfigLoader` uses [`JMESPath` syntax](https://github.com/jmespath/jmespath.py) to extract elements from the globals dictionary.
Alternatively, you can declare which values to fill in the template through a dictionary. This dictionary could look like the following:
```python
{
"bucket_name": "another_bucket_name",
"non_string_key": 10,
"key_prefix": "my/key/prefix",
"datasets": {"csv": "pandas.CSVDataSet", "spark": "spark.SparkDataSet"},
"folders": {
"raw": "01_raw",
"int": "02_intermediate",
"pri": "03_primary",
"fea": "04_feature",
},
}
```
To point your `TemplatedConfigLoader` to the globals dictionary, add it to the `CONFIG_LOADER_ARGS` variable in [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md):
```python
CONFIG_LOADER_ARGS = {
"globals_dict": {
"bucket_name": "another_bucket_name",
"non_string_key": 10,
"key_prefix": "my/key/prefix",
"datasets": {"csv": "pandas.CSVDataSet", "spark": "spark.SparkDataSet"},
"folders": {
"raw": "01_raw",
"int": "02_intermediate",
"pri": "03_primary",
"fea": "04_feature",
},
}
}
```
If you specify both `globals_pattern` and `globals_dict` in `CONFIG_LOADER_ARGS`, the contents of the dictionary resulting from `globals_pattern` are merged with the `globals_dict` dictionary. In case of conflicts, the keys from the `globals_dict` dictionary take precedence.
## OmegaConfigLoader
[OmegaConf](https://omegaconf.readthedocs.io/) is a Python library designed for configuration. It is a YAML-based hierarchical configuration system with support for merging configurations from multiple sources.
From Kedro 0.18.5 you can use the [`OmegaConfigLoader`](/kedro.config.OmegaConfigLoader) which uses `OmegaConf` under the hood to load data.
```{note}
`OmegaConfigLoader` is under active development. It was first available from Kedro 0.18.5 with additional features due in later releases. Let us know if you have any feedback about the `OmegaConfigLoader`.
```
`OmegaConfigLoader` can load `YAML` and `JSON` files. Acceptable file extensions are `.yml`, `.yaml`, and `.json`. By default, any configuration files used by the config loaders in Kedro are `.yml` files.
To use `OmegaConfigLoader` in your project, set the `CONFIG_LOADER_CLASS` constant in your [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md):
```python
from kedro.config import OmegaConfigLoader # new import
CONFIG_LOADER_CLASS = OmegaConfigLoader
```
## Advanced Kedro configuration
This section contains a set of guidance for advanced configuration requirements of standard Kedro projects:
* [How to change which configuration files are loaded](#how-to-change-which-configuration-files-are-loaded)
* [How to ensure non default configuration files get loaded](#how-to-ensure-non-default-configuration-files-get-loaded)
* [How to bypass the configuration loading rules](#how-to-bypass-the-configuration-loading-rules)
* [How to use Jinja2 syntax in configuration](#how-to-use-jinja2-syntax-in-configuration)
* [How to do templating with the `OmegaConfigLoader`](#how-to-do-templating-with-the-omegaconfigloader)
* [How to use custom resolvers in the `OmegaConfigLoader`](#how-to-use-custom-resolvers-in-the-omegaconfigloader)
* [How to load credentials through environment variables](#how-to-load-credentials-through-environment-variables)
### How to change which configuration files are loaded
If you want to change the patterns that the configuration loader uses to find the files to load you need to set the `CONFIG_LOADER_ARGS` variable in [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md).
For example, if your `parameters` files are using a `params` naming convention instead of `parameters` (e.g. `params.yml`) you need to update `CONFIG_LOADER_ARGS` as follows:
```python
CONFIG_LOADER_ARGS = {
"config_patterns": {
"parameters": ["params*", "params*/**", "**/params*"],
}
}
```
By changing this setting, the default behaviour for loading parameters will be replaced, while the other configuration patterns will remain in their default state.
### How to ensure non default configuration files get loaded
You can add configuration patterns to match files other than `parameters`, `credentials`, `logging`, and `catalog` by setting the `CONFIG_LOADER_ARGS` variable in [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md).
For example, if you want to load Spark configuration files you need to update `CONFIG_LOADER_ARGS` as follows:
```python
CONFIG_LOADER_ARGS = {
"config_patterns": {
"spark": ["spark*/"],
}
}
```
### How to bypass the configuration loading rules
You can bypass the configuration patterns and set configuration directly on the instance of a config loader class. You can bypass the default configuration (catalog, parameters, credentials, and logging) as well as additional configuration.
```{code-block} python
:lineno-start: 10
:emphasize-lines: 8
from kedro.config import ConfigLoader
from kedro.framework.project import settings
conf_path = str(project_path / settings.CONF_SOURCE)
conf_loader = ConfigLoader(conf_source=conf_path)
# Bypass configuration patterns by setting the key and values directly on the config loader instance.
conf_loader["catalog"] = {"catalog_config": "something_new"}
```
### How to use Jinja2 syntax in configuration
From version 0.17.0, `TemplatedConfigLoader` also supports the [Jinja2](https://palletsprojects.com/p/jinja/) template engine alongside the original template syntax. Below is an example of a `catalog.yml` file that uses both features:
```
{% for speed in ['fast', 'slow'] %}
{{ speed }}-trains:
type: MemoryDataSet
{{ speed }}-cars:
type: pandas.CSVDataSet
filepath: s3://${bucket_name}/{{ speed }}-cars.csv
save_args:
index: true
{% endfor %}
```
When parsing this configuration file, `TemplatedConfigLoader` will:
1. Read the `catalog.yml` and compile it using Jinja2
2. Use a YAML parser to parse the compiled config into a Python dictionary
3. Expand `${bucket_name}` in `filepath` using the `globals_pattern` and `globals_dict` arguments for the `TemplatedConfigLoader` instance, as in the previous examples
The output Python dictionary will look as follows:
```python
{
"fast-trains": {"type": "MemoryDataSet"},
"fast-cars": {
"type": "pandas.CSVDataSet",
"filepath": "s3://my_s3_bucket/fast-cars.csv",
"save_args": {"index": True},
},
"slow-trains": {"type": "MemoryDataSet"},
"slow-cars": {
"type": "pandas.CSVDataSet",
"filepath": "s3://my_s3_bucket/slow-cars.csv",
"save_args": {"index": True},
},
}
```
```{warning}
Although Jinja2 is a very powerful and extremely flexible template engine, which comes with a wide range of features, we do not recommend using it to template your configuration unless absolutely necessary. The flexibility of dynamic configuration comes at a cost of significantly reduced readability and much higher maintenance overhead. We believe that, for the majority of analytics projects, dynamically compiled configuration does more harm than good.
```
### How to do templating with the `OmegaConfigLoader`
Templating or [variable interpolation](https://omegaconf.readthedocs.io/en/2.3_branch/usage.html#variable-interpolation), as it's called in `OmegaConf`, for parameters works out of the box if the template values are within the parameter files or the name of the file that contains the template values follows the same config pattern specified for parameters.
By default, the config pattern for parameters is: `["parameters*", "parameters*/**", "**/parameters*"]`.
Suppose you have one parameters file called `parameters.yml` containing parameters with `omegaconf` placeholders like this:
```yaml
model_options:
test_size: ${data.size}
random_state: 3
```
and a file containing the template values called `parameters_globals.yml`:
```yaml
data:
size: 0.2
```
Since both of the file names (`parameters.yml` and `parameters_globals.yml`) match the config pattern for parameters, the `OmegaConfigLoader` will load the files and resolve the placeholders correctly.
```{note}
Templating currently only works for parameter files, but not for catalog files.
```
### How to use custom resolvers in the `OmegaConfigLoader`
`Omegaconf` provides functionality to [register custom resolvers](https://omegaconf.readthedocs.io/en/2.3_branch/usage.html#resolvers) for templated values. You can use these custom resolves within Kedro by extending the [`OmegaConfigLoader`](/kedro.config.OmegaConfigLoader) class.
The example below illustrates this:
```python
from kedro.config import OmegaConfigLoader
from omegaconf import OmegaConf
from typing import Any, Dict
class CustomOmegaConfigLoader(OmegaConfigLoader):
def __init__(
self,
conf_source: str,
env: str = None,
runtime_params: Dict[str, Any] = None,
):
super().__init__(
conf_source=conf_source, env=env, runtime_params=runtime_params
)
# Register a customer resolver that adds up numbers.
self.register_custom_resolver("add", lambda *numbers: sum(numbers))
@staticmethod
def register_custom_resolver(name, function):
"""
Helper method that checks if the resolver has already been registered and registers the
resolver if it's new. The check is needed, because omegaconf will throw an error
if a resolver with the same name is registered twice.
Alternatively, you can call `register_new_resolver()` with `replace=True`.
"""
if not OmegaConf.has_resolver(name):
OmegaConf.register_new_resolver(name, function)
```
In order to use this custom configuration loader, you will need to set it as the project configuration loader in `src/<package_name>/settings.py`:
```python
from package_name.custom_configloader import CustomOmegaConfigLoader
CONFIG_LOADER_CLASS = CustomOmegaConfigLoader
```
You can then use the custom "add" resolver in your `parameters.yml` as follows:
```yaml
model_options:
test_size: ${add:1,2,3}
random_state: 3
```
### How to load credentials through environment variables
The [`OmegaConfigLoader`](/kedro.config.OmegaConfigLoader) enables you to load credentials from environment variables. To achieve this you have to use the `OmegaConfigLoader` and the `omegaconf` [`oc.env` resolver](https://omegaconf.readthedocs.io/en/2.3_branch/custom_resolvers.html#oc-env).
To use the `OmegaConfigLoader` in your project, set the `CONFIG_LOADER_CLASS` constant in your [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md):
```python
from kedro.config import OmegaConfigLoader # new import
CONFIG_LOADER_CLASS = OmegaConfigLoader
```
Now you can use the `oc.env` resolver to access credentials from environment variables in your `credentials.yml`, as demonstrated in the following example:
```yaml
dev_s3:
client_kwargs:
aws_access_key_id: ${oc.env:AWS_ACCESS_KEY_ID}
aws_secret_access_key: ${oc.env:AWS_SECRET_ACCESS_KEY}
```
```{note}
Note that you can only use the resolver in `credentials.yml` and not in catalog or parameter files. This is because we do not encourage the usage of environment variables for anything other than credentials.
```