Investigate performance of config loading for big projects #3893

Closed
astrojuanlu opened this issue May 25, 2024 · 22 comments

@astrojuanlu
Member

Description

Earlier this week a user reached out to me in private saying that it was taking 3 minutes for Kedro to load their configuration (KedroContext._get_catalog).

Today another user mentioned that "Looking at the logs, it gets stuck at the kedro.config.module for more than 50% of the pipeline run duration, but we do have a lot of inputs and outputs"

I still don't have specific reproducers, but I'm noticing enough qualitative evidence to open an issue about it.

@datajoely
Contributor

I'd like to see us add a CLI command which users can run to produce a flamegraph. It would massively reduce the guesswork here.

kedro profile {kedro command} -> .html / .bin
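
A hypothetical sketch of what such a command could do, wrapping a normal KedroSession run in pyinstrument and dumping an HTML flamegraph (nothing here exists today; the function name and output path are made up):

    # Hypothetical sketch only: there is no "kedro profile" command today. This wraps
    # a normal session run in pyinstrument and writes an HTML flamegraph.
    from pathlib import Path

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project
    from pyinstrument import Profiler


    def profile_run(project_path: str = ".", output: str = "kedro-profile.html") -> None:
        bootstrap_project(Path(project_path))
        profiler = Profiler()
        profiler.start()
        with KedroSession.create(project_path=project_path) as session:
            session.run()  # config loading, catalog creation and node execution all show up
        profiler.stop()
        Path(output).write_text(profiler.output_html())


    if __name__ == "__main__":
        profile_run()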

@yury-fedotov
Contributor

I'd like to see us add a CLI command which users can run to produce a flamegraph. It would massively reduce the guesswork here.

kedro profile {kedro command} -> .html / .bin

@datajoely flamegraph for the entire pipeline run (how much time each node takes) or just the config resolution / pipeline initialization?

@datajoely
Contributor

In my mind, it would run the whole command as normal, but also generate the profiling data.

Perhaps if we were to take this seriously, a full-on memray integration would be incredible.

@astrojuanlu
Member Author

Continuing the discussion on creating custom commands here #3908

@astrojuanlu
Member Author

astrojuanlu commented Jul 2, 2024

Many users have been complaining about the slowness of Kedro with big projects, and that can be attributed to many different causes. However, one of the most prevalent causes is big parameter files that get expanded into hundreds of datasets on their own. That process takes a lot of time, and if the files become too big (a couple of MB), it shows up as a significant slowdown.

Originally posted by @idanov in #3732 (comment)
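
To make that expansion concrete, here is a rough illustration of the idea (not Kedro's actual implementation): every nested key in the parameters file becomes its own params:... entry.

    # Rough illustration only, not Kedro's actual code: each nested key in the
    # parameters file fans out into its own "params:..." entry for the catalog.
    def expand_params(params: dict, prefix: str = "params:") -> dict:
        flat = {}
        for key, value in params.items():
            name = f"{prefix}{key}"
            flat[name] = value
            if isinstance(value, dict):
                flat.update(expand_params(value, prefix=f"{name}."))
        return flat


    parameters = {"model": {"alpha": 0.1, "layers": {"hidden": 64, "dropout": 0.2}}}
    print(list(expand_params(parameters)))
    # ['params:model', 'params:model.alpha', 'params:model.layers',
    #  'params:model.layers.hidden', 'params:model.layers.dropout']

A parameters file of a couple of MB multiplies this out into hundreds of such entries, which is where the slowdown described above comes from.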

The solution works, but it couples the DataCatalog with OmegaConf and is still under review.

From the discussion in the PR:

Shouldn't we redesign the DataCatalog API instead so that parameters are first class citizens, and not fake datasets?

There were a few thumbs up to the idea, and it was brought up again in #3973 (@datajoely please do confirm that this is what you had in mind 😄)

@merelcht pointed out that there's a pending research item on how users use parameters and for what #2240

@ElenaKhaustova agreed that this is relevant in the context of the ongoing DataCatalog API redesign #3934.

Ideally, if there's a way we can tackle this issue without blocking it on #2240, the time to look at it would be now. But I have very little visibility on what the implications are, or whether we would actually solve the performance problem at all. So, leaving the decision to the team.

@merelcht
Member

merelcht commented Jul 2, 2024

The solution works, but couples the DataCatalog with OmegaConf

Would you really call this coupling? The way I read it is that it uses omegaconf to parse the parameters config. We already have a dependency on omegaconf anyway, and I actually quite like that we can leverage it in more places than just the OmegaConfigLoader itself. I would have called it coupling if it used the actual OmegaConfigLoader class, but this just imports the library.

@astrojuanlu
Member Author

Sorry to keep moving the conversation, but I'd rather not discuss the specifics of a particular solution outside the corresponding PR; I addressed your question in context at #3732 (comment).

@astrojuanlu
Member Author

Now that we're working on this, some goals for this ticket:

  • understand under what circumstances omegaconf becomes the dominant bottleneck of loading configuration
  • find where the hotspots are, in terms of functions called (function profiling) but also where in the Kedro code are they being called (line profiling)
  • understand the scaling properties, for example
    • what happens with 1 config file with 10, 100, 1 000 variables?
    • what happens with 1 config file with 10, 100, 1 000 dataset factories?
    • what happens with 10 config files with a given pattern, 100, 1 000?
      • in other words: catalog1.yml, catalog2.yml, ..., catalog1000.yml when the pattern is catalog*.yml
    • are all the scaling laws above linear? are some of them superlinear?
  • in the absence of additional context from the original reporter (reached out to them by email), how many files, variables etc are needed to reach 3 minutes of config loading on modern hardware?
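
To make these experiments concrete, a minimal sketch of how the synthetic configuration could be generated; the file counts, dataset counts and paths are placeholders, not an agreed benchmark setup:

    # Minimal sketch for generating synthetic catalogs for the scaling experiments
    # above; file counts, dataset counts and paths are placeholders.
    from pathlib import Path

    import yaml


    def generate_catalogs(conf_dir: Path, n_files: int, n_datasets: int) -> None:
        """Write catalog1.yml ... catalogN.yml, each containing n_datasets entries."""
        conf_dir.mkdir(parents=True, exist_ok=True)
        for file_idx in range(1, n_files + 1):
            entries = {
                f"dataset_{file_idx}_{i}": {
                    "type": "pandas.CSVDataset",
                    "filepath": f"data/01_raw/file_{file_idx}_{i}.csv",
                }
                for i in range(n_datasets)
            }
            (conf_dir / f"catalog{file_idx}.yml").write_text(yaml.safe_dump(entries))


    for n_files in (10, 100, 1000):
        generate_catalogs(Path(f"scaling_experiments/{n_files}_files/base"), n_files, 10)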

The outcomes should be

  • an analysis of what functions dominate with "big" configuration
  • an analysis of what lines of code in the Kedro codebase dominate with "big" configuration
  • scatter plots that showcase the scaling properties of configuration loading with respect to different properties as outlined above, see Low performance of pipeline sums #3167 (comment) for an example

Hopefully, by the end of the analysis we will have either

  • a clear recommendation of what part of the code we can optimise, or
  • a mandate to look for something faster than omegaconf 😬

@noklam
Contributor

noklam commented Oct 30, 2024

@astrojuanlu My worry is that we will likely find nothing actionable unless there is a project that is actually slow with OmegaConfigLoader. With the benchmark results, it seems that it is reasonably fast.

We can still do the exponential scaling (not necessarily a combination of all of them) to better understand the performance of the config loader (this should probably move into the benchmark once done; see the sketch after this list):

  1. Scaling with the number of files, i.e. file1, file2, ..., fileN: maybe there are issues with globbing or with many small files.
  2. Scaling with the number of entries (the benchmark already covers this); we can increase the number of entries and see how well it scales.
  3. Scaling with the number of dataset factories.

The result should be a table (one axis being the thing under test, the other the number of entries), plus profiling.
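
A rough sketch of how that matrix could be expressed as an asv-style parametrized benchmark; the class name, parameter values and catalog contents are illustrative and not part of the existing suite:

    # Rough sketch of an asv-style parametrized benchmark; names, parameter values
    # and catalog contents are illustrative, not the existing benchmark suite.
    import shutil
    import tempfile
    from pathlib import Path

    import yaml

    from kedro.config import OmegaConfigLoader


    class TimeOmegaConfigLoaderScaling:
        # asv runs the benchmark once per combination of these parameters
        params = ([1, 10, 100], [10, 100, 1000])
        param_names = ["n_files", "n_entries_per_file"]

        def setup(self, n_files, n_entries_per_file):
            self.conf_source = Path(tempfile.mkdtemp())
            (self.conf_source / "local").mkdir()
            base = self.conf_source / "base"
            base.mkdir()
            for f in range(n_files):
                entries = {
                    f"dataset_{f}_{i}": {
                        "type": "pandas.CSVDataset",
                        "filepath": f"data/{f}_{i}.csv",
                    }
                    for i in range(n_entries_per_file)
                }
                (base / f"catalog{f}.yml").write_text(yaml.safe_dump(entries))

        def teardown(self, n_files, n_entries_per_file):
            shutil.rmtree(self.conf_source)

        def time_loading_catalog(self, n_files, n_entries_per_file):
            OmegaConfigLoader(
                conf_source=str(self.conf_source),
                base_env="base",
                default_run_env="local",
            )["catalog"]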

@deepyaman
Member

  • a mandate to look for something faster than omegaconf 😬

I had started working on something for fun a few months ago to solve this (potential) problem. So if you can find cases where omegaconf is slow, I'd be very interested. 😉

@datajoely
Contributor

Did someone hear a 🦀 walking?

@ravi-kumar-pilla
Contributor

ravi-kumar-pilla commented Nov 6, 2024

Hi Team,

As suggested by @astrojuanlu and @noklam, I tried creating stress-test scripts to analyze how OmegaConfigLoader scales. You can find the test scripts here, under kedro/kedro_benchmarks/temp_investigate_ocl. I used line_profiler and kernprof for the analysis, and matplotlib.pyplot for plotting.
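
For reference, a minimal sketch of how the loader can also be line-profiled programmatically instead of via kernprof; the conf_source path and environment names assume a standard Kedro project layout:

    # Minimal sketch of programmatic line profiling of the config loader; the
    # conf_source path and environment names assume a standard Kedro project layout.
    from line_profiler import LineProfiler

    from kedro.config import OmegaConfigLoader


    def load_catalog(conf_source: str) -> dict:
        loader = OmegaConfigLoader(
            conf_source=conf_source, base_env="base", default_run_env="local"
        )
        return loader["catalog"]


    if __name__ == "__main__":
        lp = LineProfiler()
        lp.add_function(OmegaConfigLoader.load_and_merge_dir_config)  # the hotspot shown below
        lp(load_catalog)("conf")
        lp.print_stats(output_unit=1)  # report timings in seconds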

Machine used: [screenshot of machine specifications not captured]

1. Single Catalog file with increasing variable interpolations -

[Plot: OmegaConfigLoader scaling properties with variable interpolations (example: catalog with 10 datasets with variable interpolation)]

Line Profiler - kernprof -lvr --unit 1 ocl_plot_variables.py
Total time: 265.218 s
File: /KedroOrg/kedro/kedro/config/omegaconf_config.py
Function: load_and_merge_dir_config at line 272

Line #   Hits   Time (s)   Per Hit (s)   % Time   Line Contents
   326     16      103.0           6.4     38.8   config = OmegaConf.load(tmp_fo)
   353     32      134.1           4.2     50.6   for k, v in OmegaConf.to_container(
   354     16       28.0           1.8     10.6       OmegaConf.merge(*aggregate_config), resolve=True

2. Single Catalog file without variable interpolations -

[Plots: catalogs with 10 and 1000 datasets without variable interpolation]

Line profiler - kernprof -lvr --unit 1 ocl_plot_datasets.py

Total time: 50.2196 s
File: /KedroOrg/kedro/kedro/config/omegaconf_config.py
Function: load_and_merge_dir_config at line 272

Line #   Hits   Time (s)   Per Hit (s)   % Time   Line Contents
   326     16       37.6           2.3     74.9   config = OmegaConf.load(tmp_fo)
   353     32        2.4           0.1      4.8   for k, v in OmegaConf.to_container(
   354     16       10.1           0.6     20.1       OmegaConf.merge(*aggregate_config), resolve=True

3. Multiple catalog files following catalog* pattern -

[Plot: conf source with 10 catalog files]

Line Profiler - kernprof -lvr --unit 1 ocl_plot_multifile.py

Total time: 106.144 s
File: /KedroOrg/kedro/kedro/config/omegaconf_config.py
Function: load_and_merge_dir_config at line 272

Line #   Hits   Time (s)   Per Hit (s)   % Time   Line Contents
   322   3615       16.8           0.0     15.8   with self._fs.open(str(config_filepath.as_posix())) as open_config:
   326   3615       58.3           0.0     54.9   config = OmegaConf.load(tmp_fo)
   354     10       19.7           2.0     18.5       OmegaConf.merge(*aggregate_config), resolve=True

Summary: below are the methods that take most of the time when resolving the catalog. All of them are called inside the load_and_merge_dir_config function.

  1. OmegaConf.load
  2. OmegaConf.to_container
  3. OmegaConf.merge

All of these come from the OmegaConf library, which we use under the hood. So, based on the above analysis, we could try alternatives to OmegaConf for better performance (I am not sure there are any better alternatives; I found Hydra, which itself uses OmegaConf under the hood).
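
As a sanity check on the relative cost of those three calls, a rough micro-benchmark that can be run in isolation; the generated catalog is synthetic and the sizes are arbitrary:

    # Rough micro-benchmark of the three OmegaConf calls flagged above; the catalog
    # content is synthetic and only meant to compare their relative cost.
    import time

    from omegaconf import OmegaConf

    entries = {
        f"dataset_{i}": {"type": "pandas.CSVDataset", "filepath": f"data/{i}.csv"}
        for i in range(1000)
    }
    OmegaConf.save(OmegaConf.create(entries), "big_catalog.yml")

    t0 = time.perf_counter()
    cfg = OmegaConf.load("big_catalog.yml")              # parse the YAML file
    t1 = time.perf_counter()
    merged = OmegaConf.merge(cfg, OmegaConf.create({}))  # merge with an empty config
    t2 = time.perf_counter()
    OmegaConf.to_container(merged, resolve=True)         # resolve and convert to plain dicts
    t3 = time.perf_counter()

    print(f"load: {t1 - t0:.3f}s  merge: {t2 - t1:.3f}s  to_container: {t3 - t2:.3f}s")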

Thank you !

@datajoely
Contributor

I'm not sure if it's worth the engineering overhead, but it's cool:

https://github.com/SergioBenitez/Figment

@datajoely
Contributor

@astrojuanlu
Member Author

Thanks a lot for the analysis @ravi-kumar-pilla, it looks good. Looks like everything scales linearly.

We're still waiting for feedback from a user that was struggling with high latency.

@noklam
Contributor

noklam commented Nov 7, 2024

Thanks @ravi-kumar-pilla , this aligns with the result here https://kedro-org.github.io/kedro-benchmark-results/#benchmark_ocl.TimeOmegaConfigLoader.time_loading_catalog

This looks reasonably fast (1000 datasets in 1 second). Let's wait for feedback from the user.

@merelcht
Member

merelcht commented Nov 7, 2024

Great analysis, thanks @ravi-kumar-pilla! Are there by any chance any parts of the Kedro code that are also introducing latency, or is it only the omegaconf calls?

And for number 3, multiple catalog files following the catalog* pattern: did these catalogs have variable interpolation or not? This one is clearly the slowest, but any project with close to 100 catalogs sounds a bit crazy 😅 I think it's reasonable to assume that's not a very realistic setup, and the timings for around 10 catalogs are still reasonable IMO.

@merelcht
Member

merelcht commented Nov 7, 2024

I'm also curious to hear how you think we can add this kind of profiling to the QA/benchmarking tests @ravi-kumar-pilla ? And did you find the newly added benchmarking setup useful for this kind of testing at all?

@ravi-kumar-pilla
Contributor

Is there by any chance any parts of the Kedro code that are also introducing latency or is it only the omegaconf stuff?

I did use pyinstrument and speedscope to analyze the overall behavior. Most of the time was spent in the load_and_merge_dir_config function. I will attach those results shortly.

And for number 3 Multiple catalog files following catalog* pattern, did these catalogs have variable interpolation or not?

Number 3 was without variable interpolations (I think it would take more time with variable interpolations based on the individual test).

This one is clearly the slowest, but any project with close to 100 catalogs sounds a bit crazy 😅 I think it's reasonable to assume that's not a very realistic setup and the timings for around 10 catalogs are still reasonable IMO.

Yes, 100 catalogs is a crazy simulation that might not happen in real projects. I think the overall behavior of OmegaConf was reasonable.

Thank you

@ravi-kumar-pilla
Contributor

I'm also curious to hear how you think we can add this kind of profiling to the QA/benchmarking tests @ravi-kumar-pilla ? And did you find the newly added benchmarking setup useful for this kind of testing at all?

The benchmark setup does show a similar performance trajectory and is useful for testing these cases. We can definitely iterate and add more catalog-generation use cases.

@ravi-kumar-pilla
Contributor

Hi Team,

Based on the materials we received, it is evident that the bottleneck is OmegaConf usage (the to_container method). Testing this locally, we observed that the method spends its time resolving variable interpolations: if the catalog contains global or variable references and the referenced file has a complex hierarchical structure, then get_node_value in omegaconf/basecontainer.py, which is called inside to_container, takes a considerably long time. Please find the observations below:

[Profiler screenshots not captured in this export]

These observations are in line with the benchmark plots above. OmegaConf seems to be the bottleneck when resolving complex variable references. Happy to hear any suggestions. Thank you!
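
As an addendum, a small stand-alone reproduction of the effect; it uses plain OmegaConf node interpolation rather than Kedro's globals resolver, and the nesting depth and entry counts are arbitrary:

    # Stand-alone reproduction sketch: many interpolations into a deeply nested
    # "globals"-like structure exercise the node-lookup path inside
    # to_container(resolve=True). Uses plain node interpolation, not Kedro's
    # custom globals resolver.
    import time

    from omegaconf import OmegaConf

    # Build a deeply nested globals-like section
    globals_cfg: dict = {}
    node = globals_cfg
    for depth in range(20):
        node[f"level{depth}"] = {"value": "s3://bucket/path"}
        node = node[f"level{depth}"]

    deep_key = ".".join(f"level{d}" for d in range(20)) + ".value"

    # Every catalog entry interpolates into the deepest node
    catalog = {
        f"dataset_{i}": {"filepath": "${globals." + deep_key + "}"} for i in range(2000)
    }
    cfg = OmegaConf.create({"globals": globals_cfg, **catalog})

    start = time.perf_counter()
    OmegaConf.to_container(cfg, resolve=True)
    print(f"resolving {len(catalog)} interpolations took {time.perf_counter() - start:.2f}s")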

@ravi-kumar-pilla
Contributor

Closing this as the investigation is complete. Opened a follow-up issue to improve the time taken by OmegaConf to resolve global interpolations here.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Kedro Framework Nov 12, 2024