Stored in memory and saved data are not the same #3236
Hi @philmar1, thanks a lot for reporting, and sorry you're experiencing trouble. The full traceback will help us understand the issue, but is there a way you can narrow the problem down a bit and share a toy input dataset we can use?
I'm noting this by the way:
Can you try addressing those warnings and run the pipeline again?
It would need to be `adata.obs.loc[adata.obs['infected'] == False, "perturbation"] = "not infected"` rather than the chained-indexing form.
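The warning being addressed here is most likely pandas' SettingWithCopyWarning. A minimal sketch of the difference, using a toy stand-in for `adata.obs` (the column names are taken from the comment above; the DataFrame contents are made up for illustration):

```python
import pandas as pd

# Toy stand-in for adata.obs (in the real case this is an AnnData .obs frame).
obs = pd.DataFrame({
    "infected": [True, False, True, False],
    "perturbation": ["drugA", None, "drugB", None],
})

# Chained indexing: pandas may operate on a temporary copy, so the write
# can silently fail and emits SettingWithCopyWarning:
# obs[obs["infected"] == False]["perturbation"] = "not infected"

# Single .loc call: one indexing operation, guaranteed to write back.
obs.loc[obs["infected"] == False, "perturbation"] = "not infected"
print(obs)
```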
Hi @philmar1, are you still facing issues after the suggestions @astrojuanlu made?
I'm closing this for now. If you, @philmar1, or anyone else need help solving the above issue, feel free to re-open it.
Hi Merelcht,
I have not worked on this for a while and somehow managed to deal with that issue. I still have a general question, though: I'm running several successive nodes, and sometimes the run is simply killed. I know it comes from RAM management, because the pipeline runs fine on a subsample of the data.
I believe the intermediate outputs are stored in MemoryDatasets until the end of the kedro run. Do you agree with that assumption, or can you confirm that intermediate outputs stored in MemoryDatasets are dynamically removed once they are no longer useful? For instance, when "outputN" is required only by node N+1, will it be removed once node N+1 is finished?
Thanks a lot for your answer
Datasets that aren't required by downstream nodes are released from memory as soon as possible.
Hi @philmar1! Glad to hear your issue was solved 🙂 What @noklam says is correct. In our runners we have logic to release a dataset as soon as it's not needed anymore in the rest of the pipeline. See e.g.: https://github.com/kedro-org/kedro/blob/main/kedro/runner/sequential_runner.py#L81-L88
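A minimal sketch of that idea (hypothetical names and interfaces; the real implementation lives in the linked `sequential_runner.py`): count how many remaining nodes still need each dataset, and release it the moment that count hits zero.

```python
from collections import Counter

def run_sequentially(nodes, catalog):
    """Hypothetical simplification of the release logic linked above:
    track how many future nodes still consume each dataset, and free
    it from the catalog as soon as that count reaches zero."""
    # How many times each dataset is still needed as an input.
    load_counts = Counter(name for node in nodes for name in node.inputs)

    for node in nodes:
        inputs = {name: catalog.load(name) for name in node.inputs}
        for name, data in node.run(inputs).items():
            catalog.save(name, data)
        for name in node.inputs:
            load_counts[name] -= 1
            if load_counts[name] == 0:
                # No remaining node consumes this dataset; release it now,
                # not at the end of the whole kedro run.
                catalog.release(name)
```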
To clarify, the runner code above only affects datasets that implement the `_release` hook.
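For background, a rough sketch of that pattern (a simplified illustration, not the actual kedro source): an in-memory dataset can free its payload when released, while a persistent dataset may treat release as a no-op or a cache flush.

```python
class MemoryDatasetSketch:
    """Simplified illustration of a dataset whose release hook actually
    frees memory; datasets backed by disk can safely ignore release."""

    _EMPTY = object()  # sentinel meaning "nothing stored"

    def __init__(self):
        self._data = self._EMPTY

    def save(self, data):
        self._data = data

    def load(self):
        if self._data is self._EMPTY:
            raise ValueError("Data has been released or was never saved.")
        return self._data

    def release(self):
        # Drop the reference so Python's garbage collector can reclaim it.
        self._data = self._EMPTY
```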
Hello,
I am working with an AnnData object (https://anndata.readthedocs.io/en/latest/). I realized that when I execute nodes without saving intermediate data, I get a mismatch between the shape of the output data of node i and the shape of the input data of node i+1, even though the output of node i is the input of node i+1.
It generates an error while processing data in node i+1. However, the issue disappears when I register the intermediate output (`sc_filtered`) in the catalog.
I attached the kedro pipeline. The error appears at node `add_phase`, where the dataset `sc_labeled` should have a length of 23819, just like `sc_filtered`, but it in fact has 25060 rows, even though the function between these two, `add_cell_perturbation_type`, doesn't change the data shape (see the function below).
Here is the complete error:
ValueError: Observations annot. `obs` must have number of rows of `X` (23819), but has 25060 rows.