Stage 1 (optional): Partitioned datasets #29

shaunc · 2022-02-24T14:43:03Z

shaunc
Feb 24, 2022
Maintainer

Representation

Kedro supports partitioned datasets (see the API documentation). These allow large and/or growing datasets to be stored in chunks.

On the DVC side, we can represent the data as a remote dependency, such as the one created by dvc import-url. We can either create a single object representing the whole directory (see the .dvc dependency doc), or create individual .dvc files for each partition.
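To make the two options concrete, here is a minimal sketch that only builds the `dvc import-url` command lines each option would need, without running them. The helper names, the bucket URL and the partition file names are hypothetical and chosen for illustration only.

```python
from typing import List


def import_directory_cmd(remote_url: str, out_dir: str) -> List[str]:
    """Option 1: a single `dvc import-url` call that tracks the whole directory."""
    return ["dvc", "import-url", remote_url, out_dir]


def import_partition_cmds(
    remote_url: str, partitions: List[str], out_dir: str
) -> List[List[str]]:
    """Option 2: one `dvc import-url` call (and so one .dvc file) per partition."""
    base = remote_url.rstrip("/")
    return [
        ["dvc", "import-url", f"{base}/{name}", f"{out_dir}/{name}"]
        for name in partitions
    ]


if __name__ == "__main__":
    # Hypothetical remote location and partition names.
    url = "s3://example-bucket/sales"
    print(import_directory_cmd(url, "data/sales"))
    for cmd in import_partition_cmds(url, ["2022-01.csv", "2022-02.csv"], "data/sales"):
        print(cmd)
```

The per-partition variant gives finer-grained change tracking at the cost of one .dvc file per chunk, which has to be kept in sync as partitions are added.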

Implementation

The simplest approach is to use a directory to represent the partitioned dataset. The downside is that, if anything changes in the directory, dvc will see the whole directory as changed. Also, the directory needn't necessarily contain only the partitioned data (although this is typically the case).
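The granularity problem can be illustrated with a small sketch: a single digest over the directory changes whenever any one partition changes, while per-file digests show which partition it was. This is an illustration only and is not how dvc computes its own directory hashes; all function names here are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, Set


def partition_hashes(directory: Path) -> Dict[str, str]:
    """Map each file's path (relative to `directory`) to a digest of its contents."""
    return {
        path.relative_to(directory).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(directory.rglob("*"))
        if path.is_file()
    }


def directory_hash(hashes: Dict[str, str]) -> str:
    """One digest for the whole directory; it changes if any partition changes."""
    digest = hashlib.sha256()
    for name in sorted(hashes):
        digest.update(f"{name}:{hashes[name]}\n".encode())
    return digest.hexdigest()


def changed_partitions(old: Dict[str, str], new: Dict[str, str]) -> Set[str]:
    """Partitions added, removed, or modified between two snapshots."""
    return {name for name in old.keys() | new.keys() if old.get(name) != new.get(name)}
```

With only `directory_hash` available, a consumer knows that something changed; with `partition_hashes` kept from the previous run, it can also tell what changed.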

Nevertheless, we will start out this way. The kedro hooks, which have more information, can choose to skip processing even if dvc notes a change, provided they can determine which part of the partition a node operation needs. This may require an additional kedro-dvc hook.
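The skip decision itself could be as simple as the sketch below, assuming the hook can obtain both the set of partitions that changed and the set a node actually reads. `can_skip_node` is a hypothetical helper, not part of kedro or kedro-dvc.

```python
from typing import Iterable


def can_skip_node(changed: Iterable[str], needed: Iterable[str]) -> bool:
    """Return True when none of the partitions the node reads have changed."""
    return set(changed).isdisjoint(needed)


if __name__ == "__main__":
    # dvc reports the directory as changed, but only one partition differs.
    changed = {"2022-02.csv"}
    print(can_skip_node(changed, needed={"2022-01.csv"}))  # node reads an untouched partition
    print(can_skip_node(changed, needed={"2022-02.csv"}))  # node reads the changed partition
```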

Note as well: dvc has added support for "append only" directories, which is an important case for partitioned datasets. Using these facilities may be important to avoid getting snarled up in git merge conflicts.
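As a rough illustration of what "append only" means for a partitioned dataset, the following sketch compares two snapshots (partition name to content digest) and checks that nothing already present was modified or removed. It does not use dvc's own facility; `is_append_only` and the snapshot format are assumptions made for this example.

```python
from typing import Dict


def is_append_only(old: Dict[str, str], new: Dict[str, str]) -> bool:
    """True if every partition in `old` is still in `new` with the same digest."""
    return all(new.get(name) == digest for name, digest in old.items())


if __name__ == "__main__":
    old = {"2022-01.csv": "aaa"}
    print(is_append_only(old, {"2022-01.csv": "aaa", "2022-02.csv": "bbb"}))  # new partition added
    print(is_append_only(old, {"2022-01.csv": "ccc"}))  # existing partition rewritten
```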

