Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): Added the ExternalTableDataset for Databricks #827

Open
wants to merge 40 commits into
base: main
Choose a base branch
from

Conversation

MinuraPunchihewa
Copy link

@MinuraPunchihewa MinuraPunchihewa commented Sep 8, 2024

Description

This PR adds the ExternalTableDataset to support interactions with external tables in Databricks (Unity Catalog).

Fixes #817

Development notes

In the process of working on this PR, it has been identified that certain operations that are performed on external and managed tables are somewhat similar, for instance, the loading and saving of data. As a result, a base class that contains the common code for interacting with both types of tables called BaseTableDataset has been created. This is not meant to be used directly, but has been used as the parent class for both ExternalTableDataset and the already existing ManagedTableDataset.

The BaseTableDataset was created by making slight changes to the code from ManagedTableDataset. A few small improvements were also made in the process such as supporting partitioning when the data is saved.

These changes have been tested,

  1. Manually, by running the code against a Unity Catalog enabled workspace.
    2. Via the existing and newly added unit tests.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
…xist

Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
@MinuraPunchihewa MinuraPunchihewa force-pushed the feature/databricks_external_dataset branch from 880f599 to dc3550e Compare September 8, 2024 09:16
@MinuraPunchihewa MinuraPunchihewa changed the title Added the ExternalTableDataset for Databricks feat(datasets): Added the ExternalTableDataset for Databricks Sep 8, 2024
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
@MinuraPunchihewa MinuraPunchihewa marked this pull request as ready for review September 24, 2024 05:59
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
@MinuraPunchihewa MinuraPunchihewa force-pushed the feature/databricks_external_dataset branch from 9924351 to e7ba0e3 Compare September 24, 2024 06:05
@MinuraPunchihewa
Copy link
Author

Hey @astrojuanlu,
As I have mentioned in the description while working on adding the ExternalTableDataset, due to some of the many similarities of how the operations for both managed and external tables work, I created a BaseTableDataset that contains this common logic and then extended both the ExternalTableDataset and the existing ManagedTableDataset from it. I made a few slight improvements to the existing implementation of the ManagedTableDataset as well.

I have tested everything manually it seems to work fine, but I was late in reading the contributing guidelines and I realize that I should have added all of this under kedro_datasets_experimental. If you want me to update this PR by placing my code there, please let me know or if it can be reviewed like this, I will work on adding the unit tests and solving the CI issues to complete the PR.

I apologize for not reading the contribution guidelines beforehand. That was my mistake.

P.S I am more than happy to maintain this dataset moving forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kedro-datasets: ExternalTableDataset for Databricks
1 participant