Representation
Kedro supports partitioned datasets (api), which allow large or growing datasets to be stored in chunks.
On the DVC side, we can represent the data as a remote dependency, such as the one created by dvc import-url. We can either create a single .dvc file that represents the whole directory (see the .dvc dependency doc) or create an individual .dvc file for each partition.
Implementation
The simplest approach is to use a single directory to represent the partitioned dataset. The downside is that if anything in the directory changes, DVC sees the whole directory as changed. Also, the directory need not contain only the partitioned data, although it typically does.
Nevertheless, we will start out this way. The Kedro hooks, which have more information, can choose to skip processing even when DVC reports a change, provided they can determine which partitions a node operation needs and that those partitions are unchanged. This may require an additional kedro-dvc hook.
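As a rough illustration of the skip decision described above, here is a minimal sketch in plain Python. It assumes each partition is one file in the directory and that a snapshot of per-file hashes from the previous run is available. The function names (partition_hashes, changed_partitions, node_can_skip) are hypothetical and are not part of Kedro, DVC, or kedro-dvc.

```python
import hashlib
from pathlib import Path


def partition_hashes(directory):
    """Map each file under `directory` (relative POSIX path) to its MD5 hex digest."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): hashlib.md5(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_partitions(old, new):
    """Return the partition ids that were added, removed, or modified."""
    return {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}


def node_can_skip(old, new, needed):
    """Return True if none of the partitions the node reads have changed."""
    return changed_partitions(old, new).isdisjoint(needed)
```

A hook could take one snapshot after a successful run and another before the next run, then call node_can_skip with the set of partitions the node actually reads. How the hook learns that set is left open here.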
Note as well: DVC has added support for "append only" directories, which is an important case for partitioned datasets. Using these facilities may be important for avoiding Git merge conflicts.
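To make the "append only" case concrete, the sketch below checks whether a change to a partitioned directory only added new partitions. It does not use DVC's own facility. It assumes two hypothetical snapshots that map partition file names to content hashes, and the function names are illustrative only.

```python
def is_append_only(old, new):
    """Return True if every partition in `old` is present and unchanged in `new`."""
    return all(new.get(key) == digest for key, digest in old.items())


def appended_partitions(old, new):
    """Return the partition ids that exist only in `new`."""
    return set(new) - set(old)


# Hypothetical snapshots: partition file name -> content hash.
before = {"2021-01.csv": "aaa", "2021-02.csv": "bbb"}
after = {"2021-01.csv": "aaa", "2021-02.csv": "bbb", "2021-03.csv": "ccc"}
rewritten = {"2021-01.csv": "zzz", "2021-02.csv": "bbb"}
```

With these snapshots, is_append_only(before, after) holds and only 2021-03.csv would need processing, while is_append_only(before, rewritten) does not hold because an existing partition was modified.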