How can we improve dataset versioning? #1979

Open
merelcht opened this issue Oct 26, 2022 · 6 comments
Labels
Component: IO Issue/PR addresses data loading/saving/versioning and validation, the DataCatalog and DataSets roadmap

Comments

@merelcht
Member

Introduction

The catalog and dataset APIs and code need to be improved. One of the most complex aspects of datasets and the catalog is versioning. As a team, we need a better understanding of how versioning works so that we can refactor it.

Background

#1778

What's in scope

  • Understand how versioning works: this can be a group exercise, or one person can do a deep dive and teach the team.
  • A proposal for how to improve versioning.
@merelcht merelcht added the Component: IO Issue/PR addresses data loading/saving/versioning and validation, the DataCatalog and DataSets label Oct 26, 2022
@noklam
Contributor

noklam commented Oct 28, 2022

I've linked an interesting PR that our community opened earlier: Configurable Versioning. It is interesting because we haven't given much thought to sub-classing Version; at the moment this class is basically a namedtuple.
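
For readers unfamiliar with the class, here is a minimal stand-in sketch rather than Kedro's actual source. It assumes `Version` is a two-field named tuple with `load` and `save` fields. `TimestampVersion` and its timestamp format are hypothetical, invented here only to show what sub-classing could look like.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Stand-in for the Version class (assumption: two fields, load and save).
Version = namedtuple("Version", ["load", "save"])


class TimestampVersion(Version):
    """Hypothetical subclass: fills in a save version when none is given."""

    __slots__ = ()  # keep the subclass as lightweight as the tuple itself

    def __new__(cls, load=None, save=None):
        if save is None:
            # Assumed timestamp format, chosen for illustration only.
            save = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")
        return super().__new__(cls, load, save)


# A plain Version carries no behaviour; both fields are set by the caller.
plain = Version(load=None, save="2022-10-28T00.00.00.000Z")

# The subclass can add behaviour while still being an ordinary tuple.
auto = TimestampVersion()
```

Because the subclass is still a tuple, code that unpacks `load, save = version` would keep working, which is one reason sub-classing might be a low-friction extension point.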

@astrojuanlu
Member

xref #2355 (the evolution of the PR above)

@astrojuanlu
Member

astrojuanlu commented Oct 30, 2023

xref also #2703, #2691 and #1731; these are all related, one way or another, to versioning.

@astrojuanlu
Member

Also partially related #1654

@astrojuanlu
Member

See what I found: an issue from 2019 where @Galileo-Galilei said

A huge thanks for the framework, which is really useful. My team decided to use it for most of its projects, especially to ensure collaboration. Data abstraction is really an important feature. However, we have a major disagreement of how data versioning is implemented in kedro. We decided to move on and to develop our own layer of versioning above your framework.

#113

@astrojuanlu
Member

And a bit of interesting historical insight from @/lorenabalan:

The current versioning behaviour actually follows the Spark notation - it's modelling exactly what Spark does under the hood when writing a file to multiple partitions, which is why we prefer the current implementation.

#371
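
To make the layout in that quote concrete, here is a minimal sketch. It assumes the versioned path has the form `<filepath>/<version>/<filename>`, so the dataset path becomes a directory, much as Spark writes its part files under a directory named after the output path. The helper name `versioned_path` and the example path and timestamp are hypothetical.

```python
from pathlib import PurePosixPath


def versioned_path(filepath: str, version: str) -> PurePosixPath:
    """Hypothetical helper: nest the file under <filepath>/<version>/."""
    path = PurePosixPath(filepath)
    # The original filename is repeated as the last path component.
    return path / version / path.name


example = versioned_path("data/01_raw/cars.csv", "2023-10-30T12.00.00.000Z")
print(example)
# prints: data/01_raw/cars.csv/2023-10-30T12.00.00.000Z/cars.csv
```

Under this assumption, every saved version sits side by side under one directory named after the dataset, which may be part of what makes the scheme feel unusual outside Spark.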
