feat(datasets): move default mode of ManagedTableDataSet to read-only (kedro-org#303)

* feat: move default mode of ManagedTableDataSet to read-only

The default of `write_mode` is now `None`, preventing `save` by default.
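With this change, a catalog entry must opt in to writing; without `write_mode`, `save` raises a `DataSetError`. A hypothetical `catalog.yml` entry (the dataset name, database, and table below are placeholders, not taken from this commit) might look like:

```yaml
# Writable: write_mode must be set explicitly to allow save().
my_managed_table:
  type: databricks.ManagedTableDataSet
  database: default
  table: my_table
  write_mode: overwrite  # one of: overwrite, append, upsert

# Read-only (the new default): no write_mode, so save() raises DataSetError.
my_readonly_table:
  type: databricks.ManagedTableDataSet
  database: default
  table: my_table
```

Note that `upsert` additionally requires the `primary_key` field to be populated.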

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* fix linting

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* fix linting

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* fix(datasets): Correct pyproject.toml syntax for optional dependencies (kedro-org#302)

* Fix pyproject.toml syntax for optional dependencies

Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>

* refactor out the base dependencies

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

* add comments

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

* format pyproject.toml

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

* Reorder pandas dependencies

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

* reorder spark dependencies

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

* remove polars-base and delta-base

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

---------

Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Nok <nok.lam.chan@quantumblack.com>
Co-authored-by: Nok <nok.lam.chan@quantumblack.com>
Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Joel <35801847+datajoely@users.noreply.github.com>
Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* added entry to RELEASE.md

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* docs: Fix broken link to datasets docs in README.md (kedro-org#304)

fix broken link to datasets docs

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* ci: Add docs rtd check on `kedro-datasets` (kedro-org#299)

* Try adding docs rtd check on kedro datasets

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Add Read the Docs configuration for kedro-datasets

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* update docstring

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* Merge branch 'main' into managed-table-dataset-read-only-by-default

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* fix linting

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* fix linting

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

---------

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Nok <nok.lam.chan@quantumblack.com>
Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Co-authored-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Co-authored-by: Nok <nok.lam.chan@quantumblack.com>
Co-authored-by: Joel <35801847+datajoely@users.noreply.github.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
6 people authored and Peter Bludau committed Aug 27, 2023
1 parent da7dcc5 commit f408623
Showing 3 changed files with 34 additions and 15 deletions.
2 changes: 2 additions & 0 deletions kedro-datasets/RELEASE.md
@@ -2,6 +2,8 @@
## Major features and improvements

## Bug fixes and other changes
+* Made `databricks.ManagedTableDataSet` read-only by default.
+* The user needs to specify `write_mode` to allow `save` on the data set.
* Fixed an issue on `api.APIDataSet` where the sent data was doubly converted to json
string (once by us and once by the `requests` library).

19 changes: 14 additions & 5 deletions kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
@@ -29,7 +29,7 @@ class ManagedTable:
database: str
catalog: Optional[str]
table: str
-    write_mode: str
+    write_mode: Union[str, None]
dataframe_type: str
primary_key: Optional[str]
owner_group: str
@@ -82,7 +82,10 @@ def _validate_write_mode(self):
Raises:
DataSetError: If an invalid `write_mode` is passed.
"""
-        if self.write_mode not in self._VALID_WRITE_MODES:
+        if (
+            self.write_mode is not None
+            and self.write_mode not in self._VALID_WRITE_MODES
+        ):
valid_modes = ", ".join(self._VALID_WRITE_MODES)
raise DataSetError(
f"Invalid `write_mode` provided: {self.write_mode}. "
@@ -196,7 +199,7 @@ def __init__( # pylint: disable=R0913
table: str,
catalog: str = None,
database: str = "default",
-        write_mode: str = "overwrite",
+        write_mode: Union[str, None] = None,
dataframe_type: str = "spark",
primary_key: Optional[Union[str, List[str]]] = None,
version: Version = None,
@@ -215,10 +218,11 @@
Defaults to None.
database: the name of the database.
(also referred to as schema). Defaults to "default".
-            write_mode: the mode to write the data into the table.
+            write_mode: the mode to write the data into the table. If not
+                present, the data set is read-only.
Options are:["overwrite", "append", "upsert"].
"upsert" mode requires primary_key field to be populated.
-                Defaults to "overwrite".
+                Defaults to None.
dataframe_type: "pandas" or "spark" dataframe.
Defaults to "spark".
primary_key: the primary key of the table.
@@ -365,6 +369,11 @@ def _save(self, data: Union[DataFrame, pd.DataFrame]) -> None:
Args:
data (Any): Spark or pandas dataframe to save to the table location
"""
+        if self._table.write_mode is None:
+            raise DataSetError(
+                "'save' can not be used in read-only mode. "
+                "Change 'write_mode' value to `overwrite`, `upsert` or `append`."
+            )
# filter columns specified in schema and match their ordering
if self._table.schema():
cols = self._table.schema().fieldNames()
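The guard added to `_save` above can be illustrated with a minimal, self-contained sketch. This is not the actual `kedro_datasets` code (`ManagedTableSketch` and the simplified `DataSetError` are stand-ins); it only mirrors the validation and read-only logic of the hunks above:

```python
# Minimal sketch of the read-only guard introduced in this commit.
# Illustrative only -- the real implementation lives in
# kedro_datasets.databricks.managed_table_dataset.
class DataSetError(Exception):
    """Stand-in for kedro's DataSetError."""


class ManagedTableSketch:
    _VALID_WRITE_MODES = ("overwrite", "upsert", "append")

    def __init__(self, write_mode=None):
        # None means read-only; anything else must be a known mode.
        if write_mode is not None and write_mode not in self._VALID_WRITE_MODES:
            valid_modes = ", ".join(self._VALID_WRITE_MODES)
            raise DataSetError(
                f"Invalid `write_mode` provided: {write_mode}. "
                f"`write_mode` must be one of: {valid_modes}"
            )
        self.write_mode = write_mode

    def save(self, data):
        # The new guard: saving is refused unless a write mode was chosen.
        if self.write_mode is None:
            raise DataSetError(
                "'save' can not be used in read-only mode. "
                "Change 'write_mode' value to `overwrite`, `upsert` or `append`."
            )
        return f"saved with mode={self.write_mode}"
```

Under the new default, constructing without `write_mode` yields a read-only object whose `save` always raises, which is the behaviour the updated tests assert.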
28 changes: 18 additions & 10 deletions kedro-datasets/tests/databricks/test_managed_table_dataset.py
@@ -195,7 +195,7 @@ def test_describe(self):
"catalog": None,
"database": "default",
"table": "test",
-            "write_mode": "overwrite",
+            "write_mode": None,
"dataframe_type": "spark",
"primary_key": None,
"version": "None",
@@ -282,11 +282,8 @@ def test_table_does_not_exist(self):

def test_save_default(self, sample_spark_df: DataFrame):
unity_ds = ManagedTableDataSet(database="test", table="test_save")
-        unity_ds.save(sample_spark_df)
-        saved_table = unity_ds.load()
-        assert (
-            unity_ds._exists() and sample_spark_df.exceptAll(saved_table).count() == 0
-        )
+        with pytest.raises(DataSetError):
+            unity_ds.save(sample_spark_df)

def test_save_schema_spark(
self, subset_spark_df: DataFrame, subset_expected_df: DataFrame
@@ -311,6 +308,7 @@
],
"type": "struct",
},
+            write_mode="overwrite",
)
unity_ds.save(subset_spark_df)
saved_table = unity_ds.load()
@@ -339,6 +337,7 @@ def test_save_schema_pandas(
],
"type": "struct",
},
+            write_mode="overwrite",
dataframe_type="pandas",
)
unity_ds.save(subset_pandas_df)
@@ -352,7 +351,9 @@
def test_save_overwrite(
self, sample_spark_df: DataFrame, append_spark_df: DataFrame
):
-        unity_ds = ManagedTableDataSet(database="test", table="test_save")
+        unity_ds = ManagedTableDataSet(
+            database="test", table="test_save", write_mode="overwrite"
+        )
unity_ds.save(sample_spark_df)
unity_ds.save(append_spark_df)

@@ -433,7 +434,9 @@ def test_save_upsert_mismatched_columns(
unity_ds.save(mismatched_upsert_spark_df)

def test_load_spark(self, sample_spark_df: DataFrame):
-        unity_ds = ManagedTableDataSet(database="test", table="test_load_spark")
+        unity_ds = ManagedTableDataSet(
+            database="test", table="test_load_spark", write_mode="overwrite"
+        )
unity_ds.save(sample_spark_df)

delta_ds = ManagedTableDataSet(database="test", table="test_load_spark")
@@ -445,7 +448,9 @@ def test_load_spark(self, sample_spark_df: DataFrame):
)

def test_load_spark_no_version(self, sample_spark_df: DataFrame):
-        unity_ds = ManagedTableDataSet(database="test", table="test_load_spark")
+        unity_ds = ManagedTableDataSet(
+            database="test", table="test_load_spark", write_mode="overwrite"
+        )
unity_ds.save(sample_spark_df)

delta_ds = ManagedTableDataSet(
@@ -470,7 +475,10 @@ def test_load_version(self, sample_spark_df: DataFrame, append_spark_df: DataFra

def test_load_pandas(self, sample_pandas_df: pd.DataFrame):
unity_ds = ManagedTableDataSet(
-            database="test", table="test_load_pandas", dataframe_type="pandas"
+            database="test",
+            table="test_load_pandas",
+            dataframe_type="pandas",
+            write_mode="overwrite",
)
unity_ds.save(sample_pandas_df)
