Easier CustomDataset Creation #1936

lancechua opened this issue Oct 16, 2022 · 32 comments


@lancechua

Description

IO read/write functions typically follow these signatures:

from pathlib import Path
from typing import Any, Union

def load(path: Union[Path, str], **kwargs) -> Any:
    ...

def save(obj: Any, path: Union[Path, str], **kwargs) -> None:
    ...

Creating custom datasets should ideally be as easy as supplying load / save function(s) that follow this function signature.

Context

kedro supports an extensive range of datasets, but it is not exhaustive. Popular libraries used by relatively niche communities like xarray and arviz aren't currently supported.

Beyond these examples, this would also make it easier to add unofficial support for more obscure datasets.

Initially, I was looking to implement something like this and asked in the Discord chat if this pattern made sense.
Then, @datajoely suggested I open a feature request.

Possible Implementation

We can consider a Dataset class factory. Maybe GenericDataset with a class method .create_custom_dataset(name: str, load: callable, save: callable).

Usage would look something like xarrayNetCDF = GenericDataset("xarrayNetCDF", xarray.open_dataset, lambda x, path, **kwargs: x.to_netcdf(path, **kwargs)).
Entries can be added to the data catalog yaml just as with any other custom dataset implementation.
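To make this concrete, here is a rough sketch of what such a factory could look like, written against Kedro's AbstractDataSet interface at the time (_load, _save, _describe); create_custom_dataset and its argument names are hypothetical, not an existing Kedro API:

from typing import Any, Callable, Dict, Type

import xarray
from kedro.io import AbstractDataSet


def create_custom_dataset(
    name: str,
    load: Callable[..., Any],
    save: Callable[..., None],
) -> Type[AbstractDataSet]:
    """Builds a dataset class around plain load/save callables."""

    class _GenericDataset(AbstractDataSet):
        def __init__(self, filepath: str, load_args: Dict = None, save_args: Dict = None):
            self._filepath = filepath
            self._load_args = load_args or {}
            self._save_args = save_args or {}

        def _load(self) -> Any:
            # delegate to the injected load callable
            return load(self._filepath, **self._load_args)

        def _save(self, data: Any) -> None:
            # delegate to the injected save callable
            save(data, self._filepath, **self._save_args)

        def _describe(self) -> Dict[str, Any]:
            return {"filepath": self._filepath}

    _GenericDataset.__name__ = name
    return _GenericDataset


# usage, mirroring the example above; the resulting class would then be
# referenced from the data catalog like any other custom dataset
xarrayNetCDF = create_custom_dataset(
    "xarrayNetCDF",
    xarray.open_dataset,
    lambda x, path, **kwargs: x.to_netcdf(path, **kwargs),
)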

Possible Alternatives

  • LambdaDataset is very similar, but the load and save callables are hard-coded in the implementation and, as far as I'm aware, cannot be parameterized in the data catalog (see the brief sketch after this list)
  • Subclassing AbstractDataset is an option, but this feature request seeks to reduce boilerplate when defining new datasets
  • Adding xarray support #1346 officially requires implementing nuances like cloud file storage, partitioned datasets, lazy loading, etc.
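For comparison, a brief sketch of the LambdaDataset approach from the first bullet (signature from memory of the 0.18-era kedro.io.LambdaDataSet, so treat it as approximate): the callables are fixed in Python code, leaving nothing to parameterize from catalog.yml.

import pandas as pd
from kedro.io import LambdaDataSet

# load/save behaviour is hard-coded here, not configured in the catalog
companies = LambdaDataSet(
    load=lambda: pd.read_csv("data/01_raw/companies.csv"),
    save=lambda df: df.to_csv("data/01_raw/companies.csv", index=False),
)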
@lancechua lancechua added the Issue: Feature Request New feature or improvement to existing feature label Oct 16, 2022
@merelcht merelcht added the Community Issue/PR opened by the open-source community label Oct 17, 2022
@datajoely
Contributor

Hi @lancechua thank you for raising this! We have discussed this as a team and have had a good conversation regarding ways we can proceed.

You will see that this issue has now been referenced in the #1778 super issue which is tracking all of the ways we plan to improve the DataCatalog at a low level.

Your last point:

Adding xarray support kedro-org/kedro-plugins#165 officially requires implementing nuances like cloud file storage, partitioned datasets, lazy loading, etc.

is the most interesting. A lot of the pain points you mention have emerged organically: versioning, fsspec, etc. have all arrived in Kedro well after the original AbstractDataSet class definition, so implementing a new DataSet today requires (probably too much) copying and pasting.

The team are going to have a go at prototyping what this could look like, so you may be interested in subscribing to the parent issue to contribute to these design decisions. The initial thinking reflected on the rather flat structure we have today, where everything inherits from AbstractDataSet (or AbstractVersionedDataSet); we could utilise OO concepts much better and make implementation/testing/overriding much simpler.

@deepyaman
Member

deepyaman commented Oct 31, 2022

is the most interesting. A lot of the pain points you mention have emerged organically: versioning, fsspec, etc. have all arrived in Kedro well after the original AbstractDataSet class definition, so implementing a new DataSet today requires (probably too much) copying and pasting.

I feel like too much copy-pasting has been an issue for a really long time. 😅

I would be keen to break down the points of duplication (e.g. default argument handling, fsspec-related boilerplate, etc.) and add mixins for each (as in #15), because that way we can make progress on resolving these pain points without having to limit ourselves to a universal solution that works for every possible dataset.

@merelcht merelcht added Stage: Technical Design 🎨 Ticket needs to undergo technical design before implementation and removed Community Issue/PR opened by the open-source community labels Apr 12, 2023
@astrojuanlu
Member

Collecting some thoughts from our latest Tech Design session about this topic:


Current issues

  • Lots of copy-pasting
  • Unclear how to support "lazy" datasets
    • Case in point: HDF, FITS, others
    • Datasets are eager, so if the data has to be lazily loaded, users have to jump through some hoops
  • Somewhat unclear separation of "format" and "transport"
    • Not really an issue of the current design, but we could do a better job at communicating it, or just assume that people will do whatever works for them
  • Unclear if full reproducibility is encouraged
    • Case in point: Kaggle
    • Another: "My experience loading COVID data https://github.com/deepyaman/inauditus/blob/develop/refresh-data I just write a script to download the data from an external source, and then treat it as a pandas.CSVDataSet. There are other problems with doing the “expensive” data download in a load function, e.g. that we’re loading from a remote source each time when we really should just have it in a downloaded source."
  • Unclear interaction with versioning
    • "And it is more subtle about versioning - it actually works only if you use a specific function. The newly ManagedTableDataset which use the built-in versioning actually doesn’t work like this."
  • Creating a custom dataset requires a full Kedro installation, which some users consider too heavy


@astrojuanlu
Member

A user who quite likely got confused when defining their own dataset, most likely because they aren't copy-pasting enough: https://www.linen.dev/s/kedro/t/12646523/hi-folks-i-m-trying-to-create-a-custom-kedro-dataset-inherit#ac22d4cb-dd72-4072-905a-f9c5cb352205

@yetudada
Contributor

yetudada commented Aug 7, 2023

I'm also just dropping data from some of our surveys here; the question is "What is the hardest part of working with the Kedro framework?" (Kedro - User Research Study)

  • IO abstraction, I found its bit tricky if I want to implement any new IO module
  • when I have to work on non-popular data sets such as Excel
  • Writing custom IO datasets. Specially Incremental Data Ingestion. Finding rationale behind different kedro layer (raw, stage, int etc) specifically when to use which and when to skip.
  • Custom datasets perhaps

@astrojuanlu
Member

Interesting use case: a user trying to perform data access with a highly customized wget command https://linen-slack.kedro.org/t/14156294/i-m-migrating-to-kedro-for-a-bunch-of-my-projects-for-one-of#a3d857e1-7e31-48f9-98de-439a936786c8

This is not too different from my KaggleDataset attempt. Maybe there should be a "generic dataset" that allows people to just type a subprocess that will take care of leaving some files in a directory.
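A hypothetical sketch of what such a "generic dataset" could look like: it runs a user-supplied command that leaves files in a directory and returns that directory. None of these names exist in Kedro.

import subprocess
from pathlib import Path
from typing import Any


class ShellCommandDataset:
    """Runs an arbitrary command (e.g. a highly customised wget) and returns
    the directory it populated; purely illustrative, not a real Kedro dataset."""

    def __init__(self, command: str, output_dir: str) -> None:
        self._command = command
        self._output_dir = Path(output_dir)

    def load(self) -> Path:
        self._output_dir.mkdir(parents=True, exist_ok=True)
        # e.g. command = "wget -r -np -P {output_dir} https://example.org/data/"
        subprocess.run(
            self._command.format(output_dir=self._output_dir),
            shell=True,
            check=True,
        )
        return self._output_dir

    def save(self, data: Any) -> None:
        raise NotImplementedError("This dataset only downloads data.")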

@datajoely
Contributor

Channelling my inner @idanov: this does harm reproducibility. Personally I think it's a good thing anyway; the same goes for conditional nodes, another thing we've held off doing for the same reason.

@astrojuanlu
Member

This is a bit of a tangent, but I think it's worth addressing.

The reality is that people who need "weird" things will do them, whether they are supported by Kedro or not. This doesn't only apply to datasets: I've seen people do very weird things with Kedro.

Speaking of datasets, the options our users have are:

  • Put the "non-reproducible" part in a separate step, and build the rest as a Kedro project (what @deepyaman did in his project, and what the user I referred to in my previous comment said they would do). I contend this is bad DX.
  • Hack together their own brittle dataset that does this, which is a difficult thing to do in any case (see this issue). So not only is this bad DX, but they're also probably shooting themselves in the foot and we're not helping.
  • Find an alternative way, assuming it's possible. This is good for our users' enlightenment, but it still screams "you're doing it wrong", which is something that should be handled with care from the Kedro side (with, at the very least, good docs that are easily discoverable). I'm on the fence on whether this is bad DX or not - frameworks definitely impose some constraints. But should they have escape hatches?
  • Say "Kedro is not for me" and try something else. This is bad for Kedro.

There are 2 stances we can take:

  1. Consider the bad DX a feature. Friction means the users will try to do something else. Or, put another way, what they're trying to do is out of scope for Kedro so we refuse to implement it ourselves.
  2. Consider the bad DX a bug. Friction means we should improve the situation for these users, helping them do the right thing as much as we can, while still offering options to do all their data processing within Kedro.

I'm firmly in camp (2).

@datajoely
Contributor

Completely aligned on (2). I think it's related to the wider rigidity questions we have in the framework vs. library debate.

FYI - how Prefect does conditionals looks neat:
PrefectHQ/prefect#3545 (comment)

@astrojuanlu
Member

Another user who wants to read a TSV file contained in a .tar.gz https://stackoverflow.com/q/76994906/554319

pandas does not support compressed archives with more than one file, so this requires an intermediate step.
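For reference, that intermediate step can be as small as this sketch (standard library plus pandas; the archive and member names are made up):

import tarfile

import pandas as pd

with tarfile.open("data/01_raw/archive.tar.gz", mode="r:gz") as tar:
    member = tar.extractfile("inner/data.tsv")  # pick the single file we need
    df = pd.read_csv(member, sep="\t")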

@astrojuanlu
Member

astrojuanlu commented Sep 13, 2023

After implementing some new datasets that will hopefully be open sourced soon, I'm adding some more reflections here.

I think that the difficult part lies in three specific points:

  1. Having to override private methods of an obscure class (which then don't get documented; see any dataset in https://docs.kedro.org/en/0.18.13/kedro_datasets.html),
  2. The fsspec boilerplate, and
  3. The versioning stuff (How can we improve dataset versioning? #1979, Configurable versioning #2355)

(1) was hinted by @lancechua's original comment at the top of this issue, and I agree that we should move towards an architecture that makes load and save public methods. Moreover, I'm advocating to favor composition over inheritance.

(2) and (3) are trickier though. What's interesting is that one can envision datasets that don't refer to specific files on disk and that are not versioned - and these ones are actually quite easy to implement. As such, adding the path to load as @lancechua suggested might be too strong an assumption.

Another reason why this is hard is the naming itself: some things we name "datasets" in Kedro are not really a /ˈdā-tə-ˌset/ in the strict sense of the word, but more like "artifacts" or "connectors". I think there is an opportunity here to collect some of the "weird" use cases we already have and come up with better terminology.

@lancechua
Author

lancechua commented Sep 16, 2023

(1) was hinted by @lancechua's original comment at the top of this issue, and I agree that we should move towards an architecture that makes load and save public methods. Moreover, I'm advocating to favor composition over inheritance.

Can't remember who linked it in the issues here somewhere, but after reading the Python subclassing redux article, I also think composition is the better way to go by injecting load and save callables as dependencies.

(2) and (3) are trickier though. What's interesting is that one can envision datasets that don't refer to specific files on disk and that are not versioned - and these ones are actually quite easy to implement. As such, adding the path to load as @lancechua suggested might be too strong an assumption.

Also agree. The callables can be fully specified by load_kwargs (and potentially load_args) so there isn't a hard requirement for the function signature.

From what I remember about the implementation of versioning, the filename becomes a directory, with timestamp sub-directories. A separate versioner dependency can be injected to redirect the load and save callables to the path of the correct version. This would make the path parameter a hard requirement, but we can always just pass None for "pathless" datasets.
The no versioning case would just return the path as is.
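To illustrate, a minimal sketch of such an injectable versioner under the assumptions above (timestamped sub-directories, None for "pathless" datasets); all names are hypothetical:

from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


class TimestampVersioner:
    """Redirects a base path to a timestamped sub-directory."""

    def save_path(self, path: Optional[Path]) -> Optional[Path]:
        if path is None:  # "pathless" datasets pass through unchanged
            return None
        version = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")
        return Path(path) / version / Path(path).name

    def load_path(self, path: Optional[Path]) -> Optional[Path]:
        if path is None:
            return None
        versions = sorted(p.name for p in Path(path).iterdir() if p.is_dir())
        return Path(path) / versions[-1] / Path(path).name  # latest version


class NoVersioner:
    """The no-versioning case: return the path as is."""

    def save_path(self, path: Optional[Path]) -> Optional[Path]:
        return path

    def load_path(self, path: Optional[Path]) -> Optional[Path]:
        return path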

A use case that would be good, but potentially annoying to implement, is a cache for expensive "load" operations like web pulls or API requests. Instead of repeating the expensive request, the flow with subsequent reads would be:

expensive first request -> in memory -> save to disk -> read cache from disk -> in memory -> ...

The cache can be a dispatcher that takes both the original load callable and a "load cache" callable; the cache itself must still be a valid load callable. Perhaps we namespace the original and cache load params, or create a separate load_cache_kwargs parameter?

The load callable dispatched would depend on:

  • dataset parameters (e.g. use_cache=True)
  • cache validity (e.g. based on hash of params).

The no caching case would just always dispatch to the original load callable.
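A rough sketch of such a dispatcher, which is itself a valid load callable; all names and parameters are hypothetical:

from pathlib import Path
from typing import Any, Callable


class CachedLoadDispatcher:
    """Dispatches to the cache loader when the cache is usable, otherwise runs
    the expensive original load and writes the result to the cache."""

    def __init__(
        self,
        load: Callable[..., Any],           # expensive original load (e.g. API pull)
        load_cache: Callable[..., Any],     # cheap load from the on-disk cache
        save_cache: Callable[[Any], None],  # writes the cache after a fresh load
        cache_path: Path,
        use_cache: bool = True,
    ) -> None:
        self._load = load
        self._load_cache = load_cache
        self._save_cache = save_cache
        self._cache_path = cache_path
        self._use_cache = use_cache

    def __call__(self, **kwargs: Any) -> Any:
        # dispatch based on dataset parameters (use_cache) and cache validity
        if self._use_cache and self._cache_path.exists():
            return self._load_cache(**kwargs)
        data = self._load(**kwargs)
        self._save_cache(data)
        return data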

P.S. Haven't worked on kedro datasets for a while now, but hopefully the above still makes sense.

@astrojuanlu
Member

Thanks a lot for getting back @lancechua! Yes, Hynek has influenced me a lot on this as well. Hope we can prioritize this in 2024.

@astrojuanlu
Member

More evidence of different artifact types: Kubeflow https://www.kubeflow.org/docs/components/pipelines/v2/data-types/artifacts/#artifact-types (via @deepyaman)

@astrojuanlu
Member

Another problem with the current design: when doing post-mortem debugging on datasets, one gets thrown to AbstractDataset.load:

kedro/kedro/io/core.py

Lines 191 to 201 in 4d186ef

try:
    return self._load()
except DatasetError:
    raise
except Exception as exc:
    # This exception handling is by design as the composed data sets
    # can throw any type of exception.
    message = (
        f"Failed while loading data from data set {str(self)}.\n{str(exc)}"
    )
    raise DatasetError(message) from exc

@datajoely
Contributor

Now that Python 3.7 is EOL, is there any advantage to using Protocol types as an alternative/addition to the abstract classes?

@astrojuanlu
Member

I proposed it in another issue and @antonymilne made some good remarks: #2409 (comment). I could be wrong here, and in any case I defer the implementation details to the engineering team, but I don't think this can be done without "breaking" the current catalog + AbstractDataset model. This would essentially be a "catalog 2.0", which is exciting but also kind of scary.

@datajoely
Contributor

On that last part - we're well overdue a native PyTorch dataset

@astrojuanlu
Member

Another thing to note: some of our underlying datasets have their own mechanisms for reading remote data that don't rely on fsspec (pola-rs/polars#13044). Plus, sometimes our "copy paste" fsspec logic is really obscure and easy to get wrong (kedro-org/kedro-plugins#360 (comment)).

Making our "fsspec adapter" (not a mixin please) more explicit will make life easier for custom dataset authors and probably for us.
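As a sketch of what a more explicit adapter used via composition could look like (the class and its methods are hypothetical; fsspec.core.url_to_fs is the only real call):

from typing import Any

import fsspec


class FSSpecAdapter:
    """Owns the filesystem handling so a dataset only deals with file objects."""

    def __init__(self, filepath: str, **storage_options: Any) -> None:
        # resolves e.g. "s3://bucket/key.csv" to an S3FileSystem plus a path
        self._fs, self._path = fsspec.core.url_to_fs(filepath, **storage_options)

    def open_for_read(self):
        return self._fs.open(self._path, mode="rb")

    def open_for_write(self):
        return self._fs.open(self._path, mode="wb")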

@astrojuanlu
Member

Progress towards making load and save public:

Edit: Actually, thinking about it more, should just make load and save abstract instead of _load and _save, to make the new interface clear. Even if just _load and _save is being implemented, you mentioned it's going into __init_subclass__, and load and save will get created there.

Originally posted by @deepyaman in #3920 (comment)

@astrojuanlu
Member

Also, interesting perspective on object stores vs filesystems: https://docs.rs/object_store/latest/object_store/#why-not-a-filesystem-interface

This design provides the following advantages:

  • All operations are atomic, and readers cannot observe partial and/or failed writes
  • Methods map directly to object store APIs, providing both efficiency and predictability
  • Abstracts away filesystem and operating system specific quirks, ensuring portability
  • Allows for functionality not native to filesystems, such as operation preconditions and atomic multipart uploads

Originally shared by @MatthiasRoels in kedro-org/kedro-plugins#625 (comment)

@deepyaman
Member

Now that Python 3.7 is EOL, is there any advantage to using Protocol types as an alternative/addition to the abstract classes?

I proposed it in another issue and @antonymilne made some good remarks: #2409 (comment). I could be wrong here, and in any case I defer the implementation details to the engineering team, but I don't think this can be done without "breaking" the current catalog + AbstractDataset model. This would essentially be a "catalog 2.0", which is exciting but also kind of scary.

@datajoely @astrojuanlu @antonymilne I feel a lot less like a trailblazer after coming across these comments again, but I've been doing a lot more work/thinking about datasets and the data catalog, and many of my thoughts are along these lines.

I agree protocols make a lot of sense for datasets. This would get us 90% of the way there to enabling use of the data catalog without depending on Kedro, and for datasets to be implemented without necessarily needing to subclass anything. The biggest barrier is the wrapping of load and save functionality; I haven't found any way to wrap a method with stuff like error handling and path modification code like we do.

However, through my recent work, I question more whether wrapping is even a good design (as I think some of you may have, too). For one, there are a lot of instances in the tests and dataset code where you access the _load or _save attribute directly, because you want that "unwrapped" functionality (especially common for meta-datasets). I think @Galileo-Galilei has done something similar in Kedro-MLflow. It may be a bad design if we're having to frequently unwrap wrapped methods.

How can we avoid this? Rather than modifying the load and save methods, why not just use them from another method? For example, load_dataset(dataset, versioned=True). Without having to define and call an abstract method, you can call load_dataset or save_dataset on anything that properly implements the protocol/interface (similar to how scikit-learn estimators work; also doesn't rely on a protocol currently).
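A minimal sketch of that idea, assuming only that a dataset exposes public load()/save(); load_dataset and the DatasetError shown here are illustrative rather than Kedro's actual API:

from typing import Any


class DatasetError(Exception):
    pass


def load_dataset(dataset: Any, versioned: bool = False) -> Any:
    """Wraps dataset.load() with opinionated error handling without touching
    the dataset's own methods; a versioning strategy could also plug in here."""
    try:
        return dataset.load()
    except DatasetError:
        raise
    except Exception as exc:
        raise DatasetError(f"Failed while loading data from {dataset!r}.") from exc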

This approach has numerous other benefits:

  1. It's much more natural for somebody to call the "unwrapped" dataset.load() if they need it. Maybe you have your own error handling and don't need/want Kedro's opinionated wrapper. It's not like the load and save functions contain code that doesn't fundamentally work without the wrapper.
  2. Definition of the load_dataset and save_dataset methods can be customized (e.g. potentially by different catalog implementations). This is especially helpful if you want to do something like use a different versioning strategy; you're not tied to an implementation upstream in AbstractVersionedDataset dictated by Kedro.

Of course, catalogs, too, will need to follow their own protocol, but that's a separate (perhaps simpler?) story. :)

In any case, I think all of this is definitely worth revisiting, given that Data Catalog 2.0 is actually a thing people are (doing more than) talking about now!

@astrojuanlu
Member

Almost 1 year after I wrote #1936 (comment), and in line with what @deepyaman is saying, I think I can articulate better what the problem is: there's a double wrapping.

DataCatalog.load wraps AbstractDataset.load:

result = dataset.load()

and AbstractDataset.load wraps AbstractDataset._load:

kedro/kedro/io/core.py

Lines 192 to 193 in 011e3a8

try:
    return self._load()

(pasting the pre-0.19.7 logic because the current one is not very easy to follow)

I agree with @deepyaman that this is bad design. To me, it's clear that one of those wrappings is unnecessary. Datasets should implement a PEP 544 DatasetProtocol, that protocol should mandate public load and save, it should live in kedro.io, and DataCatalog "2.0" should just call dataset.load/save. Hence I'm advocating for AbstractDataset to disappear.

How to do this transition in an orderly fashion is another story.

@astrojuanlu
Member

Also I came here to say two more things:

@MatthiasRoels

That's exactly what I always do: I use credentials as env vars. However, it would still be nice to be able to inject them into a dataset (think DB credentials for SQL datasets) in some way.

As for the other comments, I really like the ideas and I think it would dramatically simplify the setup. I would even go as far as stripping down catalog/datasets to the bare minimum:

  • versioning is solved by tools such as Delta/Iceberg or even just Hive partitioning (you can select a partition using globals/env vars)
  • Credential management is decoupled from kedro (perhaps a few small kedro-plugins can provide patterns?)
  • AbstractDataset is just an abstract class to provide the basic structure
  • perhaps the catalog load/save can contain the necessary try-except with specific error messages so that the custom classes don’t need to implement it?

This will certainly be a series of breaking changes but should hopefully simplify catalog setup and perhaps decouple catalog from kedro altogether. Perhaps a separate package with no dependency on kedro, but core kedro depends on the catalog package?

@Galileo-Galilei
Contributor

Galileo-Galilei commented Aug 18, 2024

I've read the (very interesting) article provided by @antonymilne above, and I am not absolutely sure about this sentence:

I agree with @deepyaman that this is bad design. To me, it's clear that one of those wrappings is unnecessary. Datasets should implement a PEP 544 DatasetProtocol, that protocol should mandate public load and save, it should live in kedro.io, and DataCatalog "2.0" should just call dataset.load/save.

I guess that, contrary to the sentence in bold (I definitely agree with the rest of the sentence, though), the article suggests creating an extra layer of abstraction with a class, to split the AbstractDataset between what is "public" and what is encapsulated logic. If I understand correctly, using a protocol would roughly go like this:

Disclaimer: please ignore the terrible naming.

  1. We declare a public class, called ConnectorProtocol hereafter, which shows the schema that should be used to create a custom Connector. People must declare load and save:
# public_dataset_protocol.py
# https://hynek.me/articles/python-subclassing-redux/#case-study
from typing import Any, Protocol, runtime_checkable

# belong to kedro, it is the core interface
# but we want people to be able to declare it without importing kedro
# The public interface is no longer AbstractDataset or Dataset
# runtime_checkable decorator is mandatory to be able to check with "isinstance"


@runtime_checkable
class ConnectorProtocol(Protocol):
    def load(self) -> Any: ...
    def save(self, data: Any) -> None: ...
  2. We have two internal classes, Dataset and DataCatalog, which behave as they do currently, except that Dataset is instantiated with a Connector:
# internal_dataset_class.py
from logging import getLogger
from typing import Any, Dict

from dataset_protocol.public_dataset_protocol import ConnectorProtocol


# belong to kedro, it is the core interface
# but this is intended to be used internally
# all our internal wrapping with logging and other stuff should be invisible to the end user
class Dataset:
    def __init__(self, connector: ConnectorProtocol) -> None:
        # we must check on framework side that the argument is indeed a connector
        # that's why we needed to declare the protocol "runtime_checkable"
        if not isinstance(connector, ConnectorProtocol):
            raise ValueError("connector must be a ConnectorProtocol")
        self._connector = connector
        self._logger = getLogger(__name__)

    def load(self) -> Any:
        # this method wraps the public "load" and adds all custom logic
        self._logger.warning("Loading data...")
        return self._connector.load()

    def save(self, data: Any) -> None:
        # this method wraps the public "save" and adds all custom logic
        self._logger.warning("Saving data...")
        return self._connector.save(data)


# belong to kedro, it is the core interface
# but this is intended to be used internally
class DataCatalog:
    def __init__(self, datasets: Dict[str, Dataset]) -> None:
        self.datasets = datasets

    def load(self, name: str) -> Any:
        # like @astrojuanlu, I am not convinced that DataCatalog needs to redefine load and save
        # if one can call catalog.datasets[name].load() directly, but this is another topic
        return self.datasets[name].load()

    def save(self, name: str, data: Any) -> None:
        return self.datasets[name].save(data)
  3. One can create a Connector without inheriting from the ConnectorProtocol, i.e. without importing kedro:
# external_custom_connector.py
import pandas as pd

class GoodCustomConnector:
    def __init__(self, filepath: str) -> None:
        self._filepath = filepath

    def load(self) -> pd.DataFrame:
        return pd.read_csv(filepath_or_buffer=self._filepath)

    def save(self, data: pd.DataFrame) -> None:
        return data.to_csv(path_or_buf=self._filepath)

class BadCustomConnector:
    # no save method declared!
    def __init__(self, filepath: str) -> None:
        self._filepath = filepath

    def load(self) -> pd.DataFrame:
        return pd.read_csv(filepath_or_buffer=self._filepath)
  4. Finally, instantiating the datasets would look like this:
from dataset_protocol.external_custom_connector import (
    BadCustomConnector,
    GoodCustomConnector,
)
from dataset_protocol.internal_dataset_class import Dataset

good_ds = Dataset(
    GoodCustomConnector(
        filepath=r"data\01_raw\companies.csv"
    )
)

good_ds.load() # Loading data...

bad_ds = Dataset(
    BadCustomConnector(
        filepath=r"data\01_raw\companies.csv"
    )
) # ValueError: connector must be a ConnectorProtocol  -> The error message is quite bad, we likely want something more insightful

The experience from the catalog.yml would be unchanged.

motorbikes:
  type: pandas.CSVDataset # except that it is a ConnectorProtocol, not a Dataset. The associated dataset will be created internally at parsing time
  filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv
  credentials: dev_s3

This would enable much simpler composability, because I could create a connector that wraps another connector (like MlflowAbstractDataset, or like @deepyaman's load and save wrapping), and still have the associated Dataset work properly.

@astrojuanlu
Member

I guess my immediate question is what does Dataset add over the ConnectorProtocol implementor. But as long as this is an internal detail of how Kedro works and users don't have to interact with it, I'm happy to move forward.

@Galileo-Galilei
Contributor

Galileo-Galilei commented Aug 21, 2024

Dataset has a lot of private methods that we don't necessarily want to expose to the end user (because it's confusing to understand that they should override _load and _save but call load and save, or because they could override load by mistake before @deepyaman's recent change). End users should only know about the public methods, not our internal cooking. The other goal is to enable a third-party plugin to act "between" the Dataset and the ConnectorProtocol to compose logic, e.g. (untested pseudo-code below):

import mlflow
import pandas as pd

from dataset_protocol.public_dataset_protocol import ConnectorProtocol


class MlflowArtifactConnector:
    def __init__(self, connector: ConnectorProtocol) -> None:
        self.connector = connector

    def save(self, data: pd.DataFrame) -> None:
        self.connector.save(data)  # delegate to the wrapped connector first
        mlflow.log_artifact(self.connector._filepath)

Which is very hard to do currently because we need to fake inheritance.

@MatthiasRoels

@Galileo-Galilei: this still doesn't answer @astrojuanlu's question. With a complete redesign of datasets and catalog, why would we need a third series of classes? In your examples, it looks like it is sufficient to have the ConnectorProtocol classes and the Catalog class. What particular use case requires the intermediate Dataset class?

@astrojuanlu
Member

To better describe my proposal:

class DataCatalog2:
    def load(self, name: str, version: str | None = None) -> Any:
        """Loads a registered data set."""
        ...
        try:
            result = dataset.load()
        except Exception as exc:
            message = (
                f"Failed while loading data from data set {str(self)}.\n{str(exc)}"
            )
            raise DatasetError(message) from exc
  • Datasets are simple, and are just expected to adhere to a DatasetProtocol like this:
import typing as t
from dataclasses import dataclass
from typing import Protocol

import polars as pl


class DatasetProtocol(Protocol):
    # metadata: dict[str, t.Any]        # No prescribed properties, maybe only metadata?

    def load(self) -> t.Any: ...
    def save(self, data: t.Any) -> None: ...

...

@dataclass
class MyPolarsDataset:  # Structural subtyping means no need for inheritance
    filepath: str
    metadata: dict[str, t.Any]

    def load(self) -> pl.DataFrame:
        ...
        return df

Am I missing something?

@astrojuanlu
Member

xref #3995 (comment)
