feat(datasets): move default mode of ManagedTableDataSet to read-only (kedro-org#303)

* feat: move default mode of ManagedTableDataSet to read-only

The default of `write_mode` is now `None`, preventing `save` by default.
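With this change, a catalog entry must opt in to writing; without `write_mode`, `save` raises a `DataSetError`. A hypothetical `catalog.yml` entry (the dataset name, database, and table below are placeholders, not taken from this commit) might look like:

```yaml
# Writable: write_mode must be set explicitly to allow save().
my_managed_table:
  type: databricks.ManagedTableDataSet
  database: default
  table: my_table
  write_mode: overwrite  # one of: overwrite, append, upsert

# Read-only (the new default): no write_mode, so save() raises DataSetError.
my_readonly_table:
  type: databricks.ManagedTableDataSet
  database: default
  table: my_table
```

Note that `upsert` additionally requires the `primary_key` field to be populated.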

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* fix linting

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* fix linting

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* fix(datasets): Correct pyproject.toml syntax for optional dependencies (kedro-org#302)

* Fix pyproject.toml syntax for optional dependencies

Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>

* refactor out the base dependencies

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

* add comments

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

* format pyproject.toml

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

* Reorder pandas dependencies

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

* reorder spark dependencies

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

* remove polars-base and delta-base

Signed-off-by: Nok <nok.lam.chan@quantumblack.com>

---------

Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Nok <nok.lam.chan@quantumblack.com>
Co-authored-by: Nok <nok.lam.chan@quantumblack.com>
Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Joel <35801847+datajoely@users.noreply.github.com>
Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* added entry to RELEASE.md

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* docs: Fix broken link to datasets docs in README.md (kedro-org#304)

fix broken link to datasets docs

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* ci: Add docs rtd check on `kedro-datasets` (kedro-org#299)

* Try adding docs rtd check on kedro datasets

Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com>

* Add Read the Docs configuration for kedro-datasets

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* update docstring

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* Merge branch 'main' into managed-table-dataset-read-only-by-default

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* fix linting

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

* fix linting

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>

---------

Signed-off-by: Flavien Lambert <PetitLepton@users.noreply.github.com>
Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com>
Signed-off-by: Nok <nok.lam.chan@quantumblack.com>
Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Co-authored-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Co-authored-by: Nok <nok.lam.chan@quantumblack.com>
Co-authored-by: Joel <35801847+datajoely@users.noreply.github.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com>
6 people authored and Peter Bludau committed Aug 27, 2023
1 parent da7dcc5 commit f408623
Showing 3 changed files with 34 additions and 15 deletions.
2 changes: 2 additions & 0 deletions kedro-datasets/RELEASE.md
@@ -2,6 +2,8 @@
## Major features and improvements

## Bug fixes and other changes
+* Made `databricks.ManagedTableDataSet` read-only by default.
+* The user needs to specify `write_mode` to allow `save` on the data set.
* Fixed an issue on `api.APIDataSet` where the sent data was doubly converted to json
string (once by us and once by the `requests` library).

19 changes: 14 additions & 5 deletions kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py
@@ -29,7 +29,7 @@ class ManagedTable:
database: str
catalog: Optional[str]
table: str
-    write_mode: str
+    write_mode: Union[str, None]
dataframe_type: str
primary_key: Optional[str]
owner_group: str
@@ -82,7 +82,10 @@ def _validate_write_mode(self):
Raises:
DataSetError: If an invalid `write_mode` is passed.
"""
-        if self.write_mode not in self._VALID_WRITE_MODES:
+        if (
+            self.write_mode is not None
+            and self.write_mode not in self._VALID_WRITE_MODES
+        ):
valid_modes = ", ".join(self._VALID_WRITE_MODES)
raise DataSetError(
f"Invalid `write_mode` provided: {self.write_mode}. "
@@ -196,7 +199,7 @@ def __init__( # pylint: disable=R0913
table: str,
catalog: str = None,
database: str = "default",
-        write_mode: str = "overwrite",
+        write_mode: Union[str, None] = None,
dataframe_type: str = "spark",
primary_key: Optional[Union[str, List[str]]] = None,
version: Version = None,
@@ -215,10 +218,11 @@
Defaults to None.
database: the name of the database.
(also referred to as schema). Defaults to "default".
-            write_mode: the mode to write the data into the table.
+            write_mode: the mode to write the data into the table. If not
+                present, the data set is read-only.
Options are:["overwrite", "append", "upsert"].
"upsert" mode requires primary_key field to be populated.
-                Defaults to "overwrite".
+                Defaults to None.
dataframe_type: "pandas" or "spark" dataframe.
Defaults to "spark".
primary_key: the primary key of the table.
@@ -365,6 +369,11 @@ def _save(self, data: Union[DataFrame, pd.DataFrame]) -> None:
Args:
data (Any): Spark or pandas dataframe to save to the table location
"""
+        if self._table.write_mode is None:
+            raise DataSetError(
+                "'save' can not be used in read-only mode. "
+                "Change 'write_mode' value to `overwrite`, `upsert` or `append`."
+            )
# filter columns specified in schema and match their ordering
if self._table.schema():
cols = self._table.schema().fieldNames()
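The guard added to `_save` above can be illustrated with a minimal, self-contained sketch. This is not the actual `kedro_datasets` code (`ManagedTableSketch` and the simplified `DataSetError` are stand-ins); it only mirrors the validation and read-only logic of the hunks above:

```python
# Minimal sketch of the read-only guard introduced in this commit.
# Illustrative only -- the real implementation lives in
# kedro_datasets.databricks.managed_table_dataset.
class DataSetError(Exception):
    """Stand-in for kedro's DataSetError."""


class ManagedTableSketch:
    _VALID_WRITE_MODES = ("overwrite", "upsert", "append")

    def __init__(self, write_mode=None):
        # None means read-only; anything else must be a known mode.
        if write_mode is not None and write_mode not in self._VALID_WRITE_MODES:
            valid_modes = ", ".join(self._VALID_WRITE_MODES)
            raise DataSetError(
                f"Invalid `write_mode` provided: {write_mode}. "
                f"`write_mode` must be one of: {valid_modes}"
            )
        self.write_mode = write_mode

    def save(self, data):
        # The new guard: saving is refused unless a write mode was chosen.
        if self.write_mode is None:
            raise DataSetError(
                "'save' can not be used in read-only mode. "
                "Change 'write_mode' value to `overwrite`, `upsert` or `append`."
            )
        return f"saved with mode={self.write_mode}"
```

Under the new default, constructing without `write_mode` yields a read-only object whose `save` always raises, which is the behaviour the updated tests assert.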
28 changes: 18 additions & 10 deletions kedro-datasets/tests/databricks/test_managed_table_dataset.py
@@ -195,7 +195,7 @@ def test_describe(self):
"catalog": None,
"database": "default",
"table": "test",
-            "write_mode": "overwrite",
+            "write_mode": None,
"dataframe_type": "spark",
"primary_key": None,
"version": "None",
@@ -282,11 +282,8 @@ def test_table_does_not_exist(self):

def test_save_default(self, sample_spark_df: DataFrame):
unity_ds = ManagedTableDataSet(database="test", table="test_save")
-        unity_ds.save(sample_spark_df)
-        saved_table = unity_ds.load()
-        assert (
-            unity_ds._exists() and sample_spark_df.exceptAll(saved_table).count() == 0
-        )
+        with pytest.raises(DataSetError):
+            unity_ds.save(sample_spark_df)

def test_save_schema_spark(
self, subset_spark_df: DataFrame, subset_expected_df: DataFrame
@@ -311,6 +308,7 @@
],
"type": "struct",
},
+            write_mode="overwrite",
)
unity_ds.save(subset_spark_df)
saved_table = unity_ds.load()
@@ -339,6 +337,7 @@ def test_save_schema_pandas(
],
"type": "struct",
},
+            write_mode="overwrite",
dataframe_type="pandas",
)
unity_ds.save(subset_pandas_df)
@@ -352,7 +351,9 @@
def test_save_overwrite(
self, sample_spark_df: DataFrame, append_spark_df: DataFrame
):
-        unity_ds = ManagedTableDataSet(database="test", table="test_save")
+        unity_ds = ManagedTableDataSet(
+            database="test", table="test_save", write_mode="overwrite"
+        )
unity_ds.save(sample_spark_df)
unity_ds.save(append_spark_df)

@@ -433,7 +434,9 @@ def test_save_upsert_mismatched_columns(
unity_ds.save(mismatched_upsert_spark_df)

def test_load_spark(self, sample_spark_df: DataFrame):
-        unity_ds = ManagedTableDataSet(database="test", table="test_load_spark")
+        unity_ds = ManagedTableDataSet(
+            database="test", table="test_load_spark", write_mode="overwrite"
+        )
unity_ds.save(sample_spark_df)

delta_ds = ManagedTableDataSet(database="test", table="test_load_spark")
@@ -445,7 +448,9 @@ def test_load_spark(self, sample_spark_df: DataFrame):
)

def test_load_spark_no_version(self, sample_spark_df: DataFrame):
-        unity_ds = ManagedTableDataSet(database="test", table="test_load_spark")
+        unity_ds = ManagedTableDataSet(
+            database="test", table="test_load_spark", write_mode="overwrite"
+        )
unity_ds.save(sample_spark_df)

delta_ds = ManagedTableDataSet(
@@ -470,7 +475,10 @@ def test_load_version(self, sample_spark_df: DataFrame, append_spark_df: DataFra

def test_load_pandas(self, sample_pandas_df: pd.DataFrame):
unity_ds = ManagedTableDataSet(
-            database="test", table="test_load_pandas", dataframe_type="pandas"
+            database="test",
+            table="test_load_pandas",
+            dataframe_type="pandas",
+            write_mode="overwrite",
)
unity_ds.save(sample_pandas_df)
