How can we improve dataset versioning? #1979

merelcht · 2022-10-26T13:32:08Z

Introduction

The catalog and dataset APIs and code need to be improved. One of the most complex things related to datasets and the catalog is versioning. As a team we need to get a better understanding of how versioning works, so we can refactor it.

Background

#1778

What's in scope

Understand how versioning works: this can be a group exercise, or one person can do a deep dive and teach the team.
A proposal for how to improve versioning.

noklam · 2022-10-28T10:26:06Z

Linked an interesting PR opened earlier by our community: Configurable Versioning. I think it is interesting because we haven't given much thought to sub-classing Version; at the moment this class is basically a namedtuple.
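To make the sub-classing idea concrete, here is a rough sketch. It does not import Kedro: `Version` below is a stand-in that assumes the class has two fields, `load` and `save`, and `TimestampVersion` is a hypothetical name, not something taken from the linked PR. The timestamp format is also just an assumption for illustration.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Stand-in for the class discussed above (assumed fields: load, save).
Version = namedtuple("Version", ["load", "save"])


class TimestampVersion(Version):
    """Hypothetical subclass: fills in a UTC timestamp when no save version is given."""

    __slots__ = ()  # keep instances as lightweight as the plain namedtuple

    def __new__(cls, load=None, save=None):
        if save is None:
            # Assumed format, chosen only for this sketch.
            save = datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")
        return super().__new__(cls, load, save)


v = TimestampVersion()
print(v.load, v.save)
```

Because the subclass is still a tuple, code that unpacks `load, save = version` keeps working, which is probably the main constraint any configurable versioning would have to respect.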

astrojuanlu · 2023-08-22T13:40:02Z

xref #2355 (the evolution of the PR above)

astrojuanlu · 2023-10-30T08:54:31Z

xref also #2703, #2691, ~~# 1731~~: these are all related, one way or another, to versioning.

astrojuanlu · 2024-06-06T05:49:17Z

Also partially related #1654

astrojuanlu · 2024-08-02T10:11:34Z

See what I found: an issue from 2019 where @Galileo-Galilei said

A huge thanks for the framework, which is really useful. My team decided to use it for most of its projects, especially to ensure collaboration. Data abstraction is really an important feature. However, we have a major disagreement of how data versioning is implemented in kedro. We decided to move on and to develop our own layer of versioning above your framework.

#113

astrojuanlu · 2024-08-02T10:11:55Z

And a bit of interesting historical insight from @/lorenabalan:

The current versioning behaviour actually follows the Spark notation - it's modelling exactly what Spark does under the hood when writing a file to multiple partitions, which is why we prefer the current implementation.

#371
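For anyone who has not looked at the layout on disk, this is roughly what the Spark-style notation amounts to, as far as I understand it: the configured file path becomes a directory, with one sub-directory per version containing a file of the same name. The helper name `versioned_path` and the example path and version string are made up for illustration.

```python
from pathlib import PurePosixPath


def versioned_path(filepath: str, version: str) -> PurePosixPath:
    """Hypothetical helper: build <filepath>/<version>/<filename>."""
    path = PurePosixPath(filepath)
    # The original file name is repeated as the last path component.
    return path / version / path.name


print(versioned_path("data/01_raw/cars.csv", "2024-08-02T10.11.55.000Z"))
# data/01_raw/cars.csv/2024-08-02T10.11.55.000Z/cars.csv
```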

merelcht added this to the Redesign Catalog and Datasets milestone Oct 26, 2022

merelcht added the Component: IO Issue/PR addresses data loading/saving/versioning and validation, the DataCatalog and DataSets label Oct 26, 2022

merelcht mentioned this issue Oct 26, 2022

Re-design io.core and io.data_catalog #1778

Open

astrojuanlu mentioned this issue Sep 13, 2023

Easier CustomDataset Creation #1936

Open

merelcht modified the milestones: Redesign the API for io.datacatalog and io.core, Dataset Versioning Feb 2, 2024

merelcht mentioned this issue Mar 14, 2024

Timing issue with executing the same pipeline with versionned dataset #2694

Closed

merelcht added Type: Parent Issue roadmap and removed Type: Parent Issue labels Mar 28, 2024
