Stored in memory and saved data are not the same #3236
Hi @philmar1, thanks a lot for reporting, and sorry you're experiencing trouble. The full traceback will help us understand the issue, but is there a way you can narrow the problem down a bit and share a toy input dataset we can use?
I'm noting this by the way:
Can you try addressing those warnings and run the pipeline again?
It would need to be `adata.obs.loc[adata.obs['infected'] == False, "perturbation"] = "not infected"` rather than the chained-indexing form.
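The warning being addressed here is most likely pandas' SettingWithCopyWarning. A minimal sketch of the difference, using a toy stand-in for `adata.obs` (the column names are taken from the comment above; the DataFrame contents are made up for illustration):

```python
import pandas as pd

# Toy stand-in for adata.obs (in the real case this is an AnnData .obs frame).
obs = pd.DataFrame({
    "infected": [True, False, True, False],
    "perturbation": ["drugA", None, "drugB", None],
})

# Chained indexing: pandas may operate on a temporary copy, so the write
# can silently fail and emits SettingWithCopyWarning:
# obs[obs["infected"] == False]["perturbation"] = "not infected"

# Single .loc call: one indexing operation, guaranteed to write back.
obs.loc[obs["infected"] == False, "perturbation"] = "not infected"
print(obs)
```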
Hi @philmar1, are you still facing issues after the suggestions @astrojuanlu made?
I'm closing this for now. If you, @philmar1, or anyone else need help solving the above issue, feel free to re-open it.
Hi Merelcht,
I have not worked on this for a while and somehow managed to deal with that issue. I still have a general question, though: I'm running several successive nodes, and sometimes the run is simply killed. I know it comes from RAM management, because the pipeline runs fine on a subsample of the data.
I believe the intermediate outputs are stored in MemoryDatasets until the end of the kedro run. Do you agree with that assumption, or can you confirm that intermediate outputs stored in MemoryDatasets are dynamically removed once they are no longer useful? For instance, when "outputN" is required only by node N+1, will it be removed once node N+1 is finished?
Thanks a lot for your answer
Datasets that aren't required by downstream nodes are released from memory as soon as possible.
Hi @philmar1! Glad to hear your issue was solved 🙂 What @noklam says is correct. In our runners we have logic to release a dataset as soon as it's not needed anymore in the rest of the pipeline. See e.g.: https://github.com/kedro-org/kedro/blob/main/kedro/runner/sequential_runner.py#L81-L88
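A minimal sketch of that idea (hypothetical names and interfaces; the real implementation lives in the linked `sequential_runner.py`): count how many remaining nodes still need each dataset, and release it the moment that count hits zero.

```python
from collections import Counter

def run_sequentially(nodes, catalog):
    """Hypothetical simplification of the release logic linked above:
    track how many future nodes still consume each dataset, and free
    it from the catalog as soon as that count reaches zero."""
    # How many times each dataset is still needed as an input.
    load_counts = Counter(name for node in nodes for name in node.inputs)

    for node in nodes:
        inputs = {name: catalog.load(name) for name in node.inputs}
        for name, data in node.run(inputs).items():
            catalog.save(name, data)
        for name in node.inputs:
            load_counts[name] -= 1
            if load_counts[name] == 0:
                # No remaining node consumes this dataset; release it now,
                # not at the end of the whole kedro run.
                catalog.release(name)
```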
To clarify, the runner code above only affects datasets that implement the `_release` hook.
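For background, a rough sketch of that pattern (a simplified illustration, not the actual kedro source): an in-memory dataset can free its payload when released, while a persistent dataset may treat release as a no-op or a cache flush.

```python
class MemoryDatasetSketch:
    """Simplified illustration of a dataset whose release hook actually
    frees memory; datasets backed by disk can safely ignore release."""

    _EMPTY = object()  # sentinel meaning "nothing stored"

    def __init__(self):
        self._data = self._EMPTY

    def save(self, data):
        self._data = data

    def load(self):
        if self._data is self._EMPTY:
            raise ValueError("Data has been released or was never saved.")
        return self._data

    def release(self):
        # Drop the reference so Python's garbage collector can reclaim it.
        self._data = self._EMPTY
```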
Hello,
I am working with an AnnData object (https://anndata.readthedocs.io/en/latest/). I realized that when I execute nodes without saving intermediate data, I get a mismatch between the shape of the output data of node i and the shape of the input data of node i+1, even though the output of node i is the input of node i+1.
It generates an error while processing data in node i+1. However, the issue disappears when I register the intermediate output (`sc_filtered`) in the catalog.
I attached the kedro pipeline. The error appears at node `add_phase`, where the dataset `sc_labeled` should have a length of 23819, just like `sc_filtered`, but it in fact has 25060 rows, even though the function between these two, `add_cell_perturbation_type`, doesn't change the data shape (see the function below).
Here is the complete error:
ValueError: Observations annot. `obs` must have number of rows of `X` (23819), but has 25060 rows.