Lazy Loading of Catalog Items #2829
I asked the user to comment out any SQLDataSet to see how it impacts performance. It showed that they account for a significant amount of the time.
To scope the discussion better.
Hey @m-gris! Which dataset were you facing an issue with on #2829? I think I've "fixed" it in kedro-org/kedro-plugins#281. To test, I modified Spaceflights to make `companies` a SQL dataset:

    # conf/base/catalog.yml
    companies:
      type: pandas.SQLTableDataSet
      credentials: db_credentials
      table_name: shuttles
      load_args:
        schema: dwschema
      save_args:
        schema: dwschema
        if_exists: replace

    # conf/local/credentials.yml
    db_credentials:
      con: postgresql://scott:tiger@localhost/test

This dataset doesn't exist, so …
Confirmed offline with @m-gris that kedro-org/kedro-plugins#281 works for him, so I'll get it ready for review when I get a chance! :)
Hi everyone,

Apologies for my silence / lack of reactivity, and thanks to Deepyaman for reaching out and coming up with a fix for this issue. I'm happy to confirm that it does work, in the sense that I can run pipeline A even if some SQL datasets needed for pipeline B can't be loaded because I'm offline / without access to the DB.

However... with further tests I noticed something that, to me at least, seems not ideal: … I know… I'm being a bit picky. Instead of putting garbage in the …

Thanks in advance,
Hi again @m-gris! Based on just my evaluation, I think what you're requesting is feasible and looks reasonably straightforward. When constructing …

However... I do think the change could have a significant effect, potentially breaking things (e.g. plugins) that parse the data catalog. I'm almost certain this would be a breaking change that would have to go in 0.19 or a future minor release. Additionally, users may benefit from the eager error reporting, but perhaps the lazy loading could be enabled as an option (reminiscent of the discussion about …).

Let me request some others' views on this, especially given that a lot of people have been looking at the data catalog lately.

P.S. Re "I could hack my way around by passing a dummy …"
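To make the "lazy loading as an option" idea concrete, here is a minimal, hypothetical sketch (illustrative only, not the real `DataCatalog` API): the catalog holds per-dataset factories and only builds a dataset object the first time it is actually used.

```python
# Hypothetical sketch of opt-in lazy dataset construction; names are illustrative
# and this is not the real DataCatalog implementation.
from typing import Any, Callable


class LazyCatalog:
    """Holds zero-argument dataset factories and builds each dataset on first use."""

    def __init__(self, factories: dict[str, Callable[[], Any]]) -> None:
        self._factories = factories            # name -> constructor
        self._instances: dict[str, Any] = {}   # built lazily, cached here

    def _get(self, name: str) -> Any:
        if name not in self._instances:
            # Connections, file handles, etc. are only created here, the first
            # time the dataset is actually needed by a node.
            self._instances[name] = self._factories[name]()
        return self._instances[name]

    def load(self, name: str) -> Any:
        return self._get(name).load()

    def save(self, name: str, data: Any) -> None:
        self._get(name).save(data)
```

With something along these lines, a catalog entry that points at an unreachable database would only fail when (and if) a node actually touches it.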
Thanks for your answer. Good point regarding the potential for breaking changes.
As far as I understand, the whole problem boils down to a connection being constructed (for some SQL or DB-related datasets) when the data catalog is materialised. This looks like a dataset issue to me, and we should fix the dataset instead of making … Would this be enough?
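For the dataset-level fix, the general pattern (sketched here from the outside; this is not the actual code of kedro-org/kedro-plugins#281) is to keep only the connection string at construction time and create the engine on first load/save:

```python
# Sketch of a SQL-backed dataset with a lazily created engine; illustrative only.
from typing import Optional

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.engine import Engine


class LazySQLTableDataset:
    def __init__(self, table_name: str, con: str) -> None:
        self._table_name = table_name
        self._con = con              # just a string; nothing is connected yet
        self._engine: Optional[Engine] = None

    @property
    def engine(self) -> Engine:
        # The engine is created on first use, so merely declaring the dataset
        # in the catalog (e.g. while offline) costs nothing.
        if self._engine is None:
            self._engine = create_engine(self._con)
        return self._engine

    def load(self) -> pd.DataFrame:
        return pd.read_sql_table(self._table_name, con=self.engine)

    def save(self, data: pd.DataFrame) -> None:
        data.to_sql(self._table_name, con=self.engine, if_exists="replace", index=False)
```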
This would be addressed by my open PR. I think there are definite benefits to doing this (e.g. not unnecessarily creating connections well before they need to be used, if at all).
Nit: there's also a block further up, to validate the …

But, more broadly, I don't know how I feel about this. I feel like providing …
We are trying to isolate catalogs within Kedro based on pipelines. What this means is: pipeline A should use catalog A, pipeline B should use catalog B. I have used Kedro with multiple catalogs, but it seems like all catalog files (with the prefix `catalog`) get read in regardless of which pipeline is actually run.

More context/background: the DE pipeline is set up to read from an Oracle database and create data extracts. The only way to go ahead is with data extracts, as we have read-only access on the database and the DE pipeline needs to do some transformations before the data is ready for DS.

Posting at the request of @astrojuanlu for feature prioritization (if this is actually determined to be a needed feature).
Thanks a lot @m-gris for opening the issue and @rishabhmulani-mck for sharing your use case. One quick comment: is using environments a decent workaround? https://docs.kedro.org/en/stable/configuration/configuration_basics.html#how-to-specify-additional-configuration-environments One environment would contain the "problematic"/"heavy" datasets, and another one would be isolated from it. Then one could do …
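For instance, one hypothetical layout (the environment, dataset and pipeline names here are invented for illustration) would keep the DB-backed entries in an extra environment, so that a plain `kedro run` never materialises them:

```yaml
# conf/with_db/catalog.yml -- hypothetical extra environment holding the "heavy" datasets
companies:
  type: pandas.SQLTableDataSet
  credentials: db_credentials
  table_name: shuttles
```

Pipelines that need these datasets would then be run with `kedro run --pipeline <name> --env with_db`, while unrelated pipelines keep using the default environments and never touch the database.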
@rishabhmulani-mck I think we all agree that it's a problem to connect to all of the databases upon parsing the catalog, which would be resolved by something like kedro-org/kedro-plugins#281. I can finish this PR. There is a second question of …

@astrojuanlu @noklam my opinion is to split this up, finish fleshing out kedro-org/kedro-plugins#281 (I believe it's an improvement regardless, and can be added for all the datasets with a DB connection), and decide in the meantime whether to support anything further. Unless it's just me, I think making …
That's fine with me, and it is an improvement regardless. We can keep this ticket open but merge the PR you made.
I don't think we have spent enough time thinking about how to solve this in a generic way for all datasets beyond database connections. The reason I'm saying this is because if we come up with a generic solution (like "let's not instantiate the dataset classes until they're needed"?) then maybe kedro-org/kedro-plugins#281 is not even needed. I'd like to hear @merelcht's perspective when she's back in a few days.
Since my Slack message was quoted above, I'll add my 2 cents for what it's worth. In the end I also solved it using a templated URI string, but was left slightly unsatisfied, as there is just "no need" to construct everything (or there doesn't seem to be), and it impacts both performance and ease of use. While having the whole configuration valid is perhaps desirable, it is kind of annoying not to be able to run an unrelated pipeline, perhaps during development. "Why would X break Y???" was the reaction I had personally. So in this case I would also love to see lazy loading of only the necessary config components, but I am not that knowledgeable about Kedro and what sort of things it'd break.
Great! I will unassign myself from this issue (and its larger scope), but finish the other PR when I get a chance. :)
Linking kedro-org/kedro-plugins#281, which is a partial solution to this issue.
Currently the catalog is lazily instantiated on a session run: `kedro/kedro/framework/session/session.py`, lines 417 to 420 at 0fd1cac.

(Not sure why this is not using `context.catalog` and instead reaches for an internal method, but this is a side question.)

It should be possible to add an argument to the method at `kedro/kedro/framework/context/context.py`, line 269 at 0fd1cac.
Something like this:
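A rough illustration of the shape this could take (the new parameter name and the exact wiring are assumptions, not settled API):

```python
# Illustrative sketch of the proposed change; parameter names are not final.
# (method on KedroContext; DataCatalog comes from kedro.io)
def _get_catalog(
    self,
    save_version: str = None,
    load_versions: dict = None,
    needed_datasets: set = None,  # new: dataset names used by the selected pipeline
) -> DataCatalog:
    conf_catalog = self.config_loader["catalog"]
    if needed_datasets is not None:
        # Drop catalog entries the pipeline does not reference, so they are
        # never materialised (and never try to open a connection).
        conf_catalog = {
            name: cfg for name, cfg in conf_catalog.items() if name in needed_datasets
        }
    conf_creds = self._get_config_credentials()
    return DataCatalog.from_config(
        catalog=conf_catalog,
        credentials=conf_creds,
        load_versions=load_versions,
        save_version=save_version,
    )
```

The caller in the session would then pass the set of dataset names used by the selected pipeline.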
Parameter names and points of interjection should be more carefully examined though, since this is just a sketch solution. The good news is that it's a non-breaking one, if we implement it.
@idanov I think that's because …

Noted @ankatiyar added the …
This is really something needed when the codebase grows, or when we do a monorepo with many pipelines that need/could be orchestrated independently with different scheduling contexts (for example: training, evaluation and inference pipelines).

Currently, when our codebase grows, we create multiple packages inside the same Kedro project (src/package1, src/package2, ...), which means multiple Kedro contexts, one context per "deployable" pipeline. So we have a "space" to manage configs that belong to the same execution context. However, this increases our cognitive context switching and duplicates boilerplate code.

I wonder if the proposed solution entirely solves the problem, as it's not just about initializing datasets lazily, but somehow about materializing the datasets lazily. The catalog will still be entirely materialized from YAML to Python objects by the config_loader, and Kedro will still ask for some globals that are not filled but not really needed by the selected pipeline (from the user's perspective). Maybe pushing the namespace concept a little further could solve this. But for now, the proposed solution below (lazily instantiating datasets) is a big step toward pipeline modularity and monorepos, and has the merit of being non-breaking. I'm looking forward to it.
kedro-org/kedro-plugins#281 addressed lazy loading of database connections. Should we keep this issue open for future work on lazy loading of catalog items in general? @merelcht

Yeah, we should; I'm not really sure why this got closed in #281, as that was linked to #366 and not this...
We're hitting this issue again, as we tried to deploy two pipelines that have slightly different configs. When we run pipeline A, Kedro keeps asking for a credential that is only used in pipeline B. Currently we're giving fake credentials to the target orchestration platform, so the pipeline can run. But this introduces some debt in our credentials management in the orchestration platform.
Adding this to the "Redesign the API for IO (catalog)" for visibility.

Noting that this was also reported as #3804.
I believe the original issue was already resolved by fixing the lazy connection in the SQL-related datasets. There seems to be some other issue raised by @takikadiri.
It wasn't; generic datasets are still eagerly loaded.
@takikadiri We also run a monorepo, with several overlapping projects/pipelines. We segment the projects by subfolder, both in src code and in parameters, pipelines and data catalog. We control this via env vars that are read in …

However, we have the opposite problem when it comes to testing: we cannot iterate through different sessions to test the makeup of the catalogs, because subsequent sessions created in code (confirmed different session ids, etc.) for some reason still contain the first context. See: #4087
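For reference, the env-var plumbing can stay quite small; a sketch (the variable name and folder layout are invented for illustration) of the kind of settings.py wiring this implies:

```python
# settings.py -- sketch of selecting a per-project config "slice" via an env var.
# KEDRO_SLICE and the folder layout are invented for illustration.
import os

from kedro.config import OmegaConfigLoader

_slice = os.environ.get("KEDRO_SLICE", "project_a")

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "config_patterns": {
        # Only pick up catalog/parameter files that live under the selected slice.
        "catalog": [f"{_slice}/catalog*", f"{_slice}/**/catalog*"],
        "parameters": [f"{_slice}/parameters*", f"{_slice}/**/parameters*"],
    }
}
```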
Closing this in favour of #3935, which will address the issues raised here.
Hi everyone,
Currently, running modular pipelines requires loading all datasets before pipeline execution, which, to me, does not seem ideal...
Let's assume a simple project, with 2 modular pipelines, A and B.
If B requires datasets created via API calls or interactions with remote databases, then, when working offline, one cannot do a simple `kedro run --pipeline A`, since Kedro will fail at loading B's datasets... which are not needed to run A! A bit of a pity...
A first, impulsive reaction might lead to commenting out those unneeded datasets in the local conf, but since commented-out entries are simply ignored (they don't override anything in base), this would 'do nothing'...
Granted, one could simply comment out those datasets in the base config...
But "messing around" with base to handle a temporary / local need seems, to me at least, to deviate from the "software engineering best practices" that Kedro aims to foster.
To avoid this, one would therefore have to create some sort of dummy dataset in the local conf... Totally do-able, but not super convenient and definitely not frictionless...
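Such a local override could look something like this (a sketch; the dataset name is purely illustrative, and the in-memory type just shadows the DB-backed entry for local runs):

```yaml
# conf/local/catalog.yml -- shadow the DB-backed entry so no connection is attempted
companies:
  type: MemoryDataSet
```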
I guess that such "troubles" could be avoided if datasets were loaded lazily, only when needed, or by building the DAG first and then filtering datasets that are not actually needed...
I hope that this suggestion both sounds relevant and won't represent much of a technical challenge.
Best Regards
Marc