
Research summary of insights for redesigning Kedro's versioning #4129

Open
iamelijahko opened this issue Aug 29, 2024 · 7 comments


@iamelijahko commented Aug 29, 2024

What Should We Do Next?

| Area of focus | Suggested actions |
| --- | --- |
| Integration with other leading tools | **Integration with established tools and interoperability:** This research aims to explore how Kedro can integrate with existing tools to manage complexity rather than reinvent the wheel. Kedro should prioritise interoperability, leveraging industry-standard tools to enhance its capabilities.<br><br>**Integration with leading tools:** Consider integrating with tools like Delta Lake, Apache Hudi, and Apache Iceberg for data management, Git for code versioning, and MLflow and DVC for model versioning. Users report that Kedro's dataset versioning has compatibility issues on platforms such as Databricks and Palantir Foundry, reducing its versatility and duplicating features of more mature platforms. Refer to the market research for how other tools support versioning of data, code, and models.<br><br>**Alignment with data lakehouse concepts:** The industry's enthusiasm for the data lakehouse, with features like versioning and time travel, doesn't fully align with Kedro's current design, creating challenges for integration and complementarity. |
| Versioning method 1: explore individual artefact versioning | **Granular versioning solutions for different data types:** Consider implementing granular solutions tailored to different data types, such as code, models, tabular data, semi-structured data, and unstructured data. This could offer advanced features, in contrast with Kedro's current method, which indiscriminately creates new copies without understanding the data being versioned. Full reproducibility requires capturing the exact code, data, and parameters used in a run; Kedro's current solution fails to capture all parameters and the state of the code, and cannot ensure consistent upstream data, making full reproducibility and experiment tracking difficult.<br><br>**Interaction between code and data versions:** Consider how code and data versions interact, potentially creating non-linear branches. This would enable better tracing and auditing by identifying which code version produced which data version, and allow branching from specific points in time, addressing the multidimensional aspects of versioning.<br><br>• Refer to this Miro board for versioning of the various artefacts. |
| Versioning method 2: explore versioning the entire pipeline to avoid duplicating massive files | • The goal was comprehensive versioning, potentially tied into experiment tracking: version everything within the Kedro pipeline that might change, rather than only individual elements like parameters or catalog settings.<br><br>• E.g. the PMPx team implemented GitHub-based versioning in Kedro to track entire pipelines, leveraging GitHub branches for comprehensive version control. GitHub improves versioning efficiency by tracking and storing only the changes made, rather than complete copies of files, saving significant storage space. |
| Identification | **Use of a unique ID for versioning:** Consider a unique identifier instead of a DateTime format. Although timestamps have advantages, particularly on file systems, they can be problematic, and displaying numerous parameters in a table is impractical. Users report that the DateTime format currently used in Kedro is hard to manage, making it difficult to reference versions across different software and programming languages.<br><br>**Single version number tracking:** Users need a single version number that maps to the corresponding versions of the model, data, and code. This simplifies tracking, ensures compatibility, and eliminates the complexity of managing multiple version numbers.<br><br>**Customised version names:** Consider allowing users to set customised version names, e.g. incorporating specific parameters. |
| Storage, logging, & retrieval | **Centralised session store:** Consider storing logs and versioning information in a centralised session store for easy access and reference.<br><br>**Automatic logging:** Automatically log key parameters and metrics with each version to maintain a complete historical context.<br><br>**Detailed metadata logging:** Include detailed metadata with each version, such as data size and key parameters, to provide a comprehensive record.<br><br>**Maintain historical files:** Consider keeping all historical versions of files with attributes for easy lookup, without needing additional functions.<br><br>• Refer to this Miro board for the user journey on Kedro versioning. |
| Documentation | **Enhanced documentation and use cases:** Documentation with example use cases and usage patterns would give users detailed guidance and options to maximise the tool's value, giving them more control.<br><br>**Clear documentation:** Document what changed in each version, including any parameter adjustments or data modifications. |
| Accessibility | **API access for versioning in Kedro:** Making versioning easily searchable and accessible via an API would let other applications build on Kedro and leverage this versioning information. |
| Collaborative (sharing) | **Versioning in managed analytics:** Ensure multiple users can access versioned outcomes easily, avoid local-machine conflicts, and use platforms like GitHub for effective collaborative versioning. |
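The "single version number tracking" idea above could be sketched as a run manifest: one ID that maps to every artefact version used in a run. All field names below are illustrative, not an existing Kedro structure.

```python
import json

# Hypothetical sketch of "single version number tracking": one run ID that
# maps to the code, data, and model versions of a run. The field names are
# illustrative and do not correspond to an existing Kedro feature.
run_manifest = {
    "run_id": "0042",
    "code": {"git_sha": "abc1234"},
    "data": {"model_input_table": "2024-08-29T10.00.00.000Z"},
    "model": {"regressor": "2024-08-29T10.05.00.000Z"},
    "params": {"test_size": 0.2},
}

# Referencing the single run ID is enough to recover every artefact version.
print(json.dumps(run_manifest, indent=2))
```

The point of the sketch is that a user quotes one identifier ("run 0042") instead of juggling separate timestamps for each dataset, model, and commit.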

Priority matrix (Miro board)

Artefacts: What to Track?

Reproducing runs in Kedro is challenging due to incomplete capture of code, parameters, and data, hindering full reproducibility. Granular versioning across data types could improve this, despite Kedro's limitations.
Miro link: https://miro.com/app/board/uXjVK9U8mVo=/?moveToWidget=3458764597910279898&cot=14

User journey

Miro link: https://miro.com/app/board/uXjVK9U8mVo=/?moveToWidget=3458764596155374065&cot=14

Data

From the user interviews, data versioning involves tracking and managing different versions of datasets over time, allowing for consistent results even when code remains unchanged. It typically includes handling large tables and unstructured data by storing snapshots or slices at specific points, enabling historical analysis. While unstructured data is often versioned by copying versions, semi-structured data may require specialized algorithms, and large datasets demand careful management due to their complexity.

Pain points: Data

| Pain point | Opportunity |
| --- | --- |
| **1. Inconsistency of upstream data in pipeline runs:** There is no guarantee that pipeline input data remains consistent across runs, because upstream systems change; snapshotting all states in a lightweight, deployable framework is nearly impossible. | --- |
| **2. Excessive and redundant dataset versioning:** Frequent pipeline runs generate numerous, often unnecessary, dataset versions, leading to excessive storage use and a cluttered version history that is difficult to manage and navigate. | Introduce e.g. a `--disable-versioning` flag in Kedro's CLI to prevent unnecessary version creation, tag important outputs, and simplify storage-engine selection with compatible options like Apache Hudi or Delta. |
| **3. Challenges in retrieving and managing data versions in Kedro and Jupyter notebooks:** Engineers struggle to retrieve specific dataset versions in Jupyter and Kedro, often needing manual inspection and custom logic to locate and load the desired timestamps, which is cumbersome and time-consuming. | Implement a feature to load dataset versions by order, or automatically load the most recent version (e.g. `catalog.load("df", version="last")`), enhancing workflow efficiency and intuitive version management. |
| **4. Difficulty with transcoding:** Users face failures with Kedro's versioning during transcoding. They struggle to retrieve recent dataset timestamps easily, requiring manual AWS checks and extra pipeline steps. | --- |
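Pain point 3 could be addressed with a resolver along these lines. Note that `version="last"` is a proposed API, not an existing Kedro feature; the sketch relies on the fact that Kedro's default version strings sort lexicographically in chronological order.

```python
# Hypothetical sketch of a version="last" resolver -- NOT an existing Kedro
# API. Kedro's default version strings (e.g. "2024-08-29T10.00.00.000Z")
# sort lexicographically in chronological order, so "last" reduces to max().
def resolve_version(available: list[str], spec: str) -> str:
    if spec == "last":
        return max(available)  # latest timestamp wins
    if spec not in available:
        raise ValueError(f"version {spec!r} not found")
    return spec

versions = ["2024-08-28T09.15.00.000Z", "2024-08-29T10.00.00.000Z"]
print(resolve_version(versions, "last"))  # -> 2024-08-29T10.00.00.000Z
```

A real implementation would list the version directories under the dataset path instead of taking them as an argument, but the resolution logic would be the same.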
@iamelijahko added this to the Dataset Versioning milestone Aug 29, 2024
@noklam (Contributor) commented Aug 29, 2024

> leveraging GitHub branches for comprehensive version control. GitHub improves versioning efficiency by tracking and storing only the changes made, rather than creating complete copies of files, thus saving significant storage space.

Can you clarify this? How is it possible to use GitHub to version data?

@idanov (Member) commented Aug 30, 2024

Just to keep this on record as a potential solution to the challenges of using timestamps: maybe we can use ULID as the version number format?

@astrojuanlu (Member)

Thanks a lot for the summary @iamelijahko! Could you update it with:

  • A clearer description of how the next actions were prioritised (how "frequency" or "user impact" was measured)
  • A clearer separation between actions and insights pertaining to the current Kedro versioning vs. things that can be done outside of how the current versioning works?

@astrojuanlu (Member)

I would like to add a bit more color to the synthesis @iamelijahko has already provided.

Usage of versioned datasets is low

  • There seems to be a very low prevalence of `versioned: true` datasets in open-source projects.

https://github.com/kedro-org/kedro/network/dependents shows 2 439 repositories, and this query shows 154 files. That's an upper bound of roughly 6 % of open repositories using versioned datasets, and that is without discarding those that are mostly copy-pastes of the spaceflights tutorial.
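For reference, this is the catalog syntax the query above searches for. The dataset name and path here are illustrative, and the type name assumes a recent kedro-datasets release:

```yaml
# conf/base/catalog.yml -- opting a dataset into Kedro's versioning
model_input_table:
  type: pandas.ParquetDataset
  filepath: data/03_primary/model_input_table.parquet
  versioned: true
```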

  • There seems to be a very low prevalence of `--load-versions` in our telemetry.

Out of 3 537 184 total `kedro run` commands, only 1 644 included `--load-versions`, i.e. ~0.05 %.

SQL query

```sql
SELECT
  COUNT(*)
FROM HEAP_FRAMEWORK_VIZ_PRODUCTION.HEAP.ANY_COMMAND_RUN
WHERE
  COMMAND LIKE 'kedro run %'
  AND COMMAND LIKE '%--load-version%'
```

Interest in versioned datasets in our support channels is low

https://linen-slack.kedro.org/?threads%5Bquery%5D=%22versioned%3A%20true%22 shows 35 results. It's difficult to assess how many of these are distinct questions (threads), but for reference, "dataset" yields 877 results, "plugin" yields 270, "node" yields 731, and "*" yields 3 773. That puts versioning at roughly 1 % of the messages.

Users are finding workarounds to their pain points within versioning

For example, #4028 (comment) states

> just for your UX research transparency, we now completely moved away from versioning and instead have a RUN_ID env variable that we pick up in globals.yaml and prefix all pipeline paths with that. we found this approach (all data of a version bundled under one path) to be preferable.
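That workaround could look roughly like this. This is a hypothetical sketch, assuming the project's `OmegaConfigLoader` exposes the `oc.env` and `globals` resolvers; all names and paths are illustrative:

```yaml
# conf/base/globals.yml -- pick up RUN_ID from the environment
run_id: ${oc.env:RUN_ID,local-run}

# conf/base/catalog.yml -- prefix every output path with the run ID,
# so all data of one run is bundled under one path
model_output:
  type: pandas.ParquetDataset
  filepath: data/07_model_output/${globals:run_id}/model_output.parquet
```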

Feature-rich alternatives exist

Read @iamelijahko's thorough market analysis in https://github.com/kedro-org/kedro/wiki/Market-research-on-versioning-tools

Maintenance of versioned datasets delays resolution of unrelated user pain points

For example, these 5 issues with the Polars datasets:

kedro-org/kedro-plugins#789
kedro-org/kedro-plugins#702
kedro-org/kedro-plugins#625
kedro-org/kedro-plugins#590
kedro-org/kedro-plugins#444

are all blocked because of how we use fsspec for our versioning.

(Disclaimer: I opened 3 of them, but 1 comes directly from a user question and the other one has supporting evidence that other users are affected)

@astrojuanlu (Member)

Given the above, the comparatively large number of recommendations coming out of this research, and the need to allocate our limited resources efficiently, it becomes crucial not only to prioritise which ones to tackle, but more importantly to form a coherent vision of how we want the versioning workflow in Kedro to look.

As such, I would like us to pick between these two strategies:

  1. De-emphasize the current versioning functionality, and explore alternatives and integrations, or
  2. Bet on improving and extending our current versioning functionality for existing and new users

Under one of these two optics, I believe it will be easier to interpret the recommendations at the top of this thread. And it might also inform how we approach the last big part of the "Kedro I/O redesign": custom dataset creation, #1936

@merelcht assigned astrojuanlu and unassigned iamelijahko Sep 16, 2024
@astrojuanlu (Member)

We've had extensive discussions about this in the past weeks. Here is a summary of where we're at and what the proposed next steps are.

Proposal

Draft a path towards the deprecation of AbstractVersionedDataset and its replacement by something leaner and better.

Motivation

See summary at #4129 (comment)

Some extra points on top of what I already wrote:

Implications

Several concerns were raised:

  • Risk of not finding a better alternative. @idanov's main point was that we should sit and re-think versioning from scratch, without closing the door to the possibility that maybe our current AbstractVersionedDataset is the best we can come up with.
  • Relationship with Experiment Tracking. Our Experiment Tracking functionality relies on AbstractVersionedDataset.
  • Cost of the deprecation itself. AbstractVersionedDataset is ingrained in almost every layer of Kedro, so if we ever decide to actually get rid of it, it will be a considerable amount of work.
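For context on why the current scheme is so deeply baked in: Kedro's versioned datasets store each version under a timestamped subdirectory of the dataset path, and every versioned dataset's load/save path resolution goes through this layout. A minimal sketch of the path construction:

```python
from pathlib import PurePosixPath

def versioned_path(filepath: str, version: str) -> str:
    """Kedro's versioned layout: <filepath>/<version>/<basename>."""
    p = PurePosixPath(filepath)
    return str(p / version / p.name)

print(versioned_path("data/01_raw/cars.csv", "2024-08-29T10.00.00.000Z"))
# -> data/01_raw/cars.csv/2024-08-29T10.00.00.000Z/cars.csv
```

Any replacement scheme has to either keep reading this layout or provide a migration path for existing versioned data, which is part of the deprecation cost.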

Next steps

  • Before starting work on a new design, and given that @iamelijahko has already ranked the pain points in terms of user value, we should estimate the engineering effort of addressing those pain points, to produce an "effort-return" matrix that helps us prioritise.
  • In parallel, we need to assess what's next for Experiment Tracking (see [Proposal] Remove experiment tracking from Kedro-Viz, kedro-viz#1831). For that, @lid-rs and I will:
    • Come up with a definition of "Kedro-Viz active user" that will act as a baseline
    • Come up with a definition of "Experiment Tracking active user" that will act as a target
    • Decide, on the basis of absolute and relative user counts, what the way forward should be

@astrojuanlu removed their assignment Sep 18, 2024
@yetudada (Contributor)

I love the way you’ve written out what we should do next! I completely agree with the idea of proposing a workstream to track and better understand who exactly is using Kedro-Viz. This could offer valuable insights into how adoption plays out across different user segments, particularly when narrowing down experiment tracking.

That being said, I do have some concerns about how we're identifying a Kedro-Viz user, especially on Heap. I was reviewing some download data to get a better sense of usage, and there seems to be a significant discrepancy that might be worth investigating further. For context, Kedro-Viz has around 4 million downloads, whereas the Kedro framework itself is sitting at roughly 17 million downloads. That puts Kedro-Viz usage at about 23-24% relative to the core framework, which feels more aligned with what we’d expect.

However, with the changes we've made to telemetry, Kedro-Viz is now being reported as used by only 0.7% of total Kedro users—this just doesn’t add up. I know @rashidakanchwala has looked into this previously, but there’s definitely something odd going on here that still needs to be clarified. It could be a propagation issue affecting data from Heap all the way to our Snowflake instance.

Also, adding to the complexity, I find it less likely that Kedro-Viz is being heavily used in production environments, while Kedro’s usage numbers might be inflated by CI/CD pipelines. So, while the download figures likely reflect more accurate user numbers, the current telemetry data seems to be painting an unclear picture.
