Feature upgrade to Spark 3.2.1 (#111)
* work in progress

* work in progress

* work in progress

* changed spaces for PEP compatibility - minor, no functionality change

* changed library dependencies for build

* additional code coverage for version checks

* additional code coverage for version checks

* Removed lines of code badge as lines of code server is down

* additional test coverage for text generators

* spark version check

* reverted unchanged_files

* updates in response to reviews

* converted version tests to pytest

* updated version due to change in baseline Python requirements

* updated version due to change in baseline Python requirements
ronanstokes-db authored Nov 28, 2022
1 parent 109707e commit 981a5a4
Showing 17 changed files with 307 additions and 48 deletions.
26 changes: 25 additions & 1 deletion CHANGELOG.md
@@ -23,6 +23,30 @@ See the contents of the file `python/require.txt` to see the Python package dependencies
* renamed packaging to `dbldatagen`
* Releases now available at https://github.com/databrickslabs/dbldatagen/releases
* code tidy up and rename of options
* added text generation plugin support for python functions and 3rd party libraries
* Use of data generator to generate static and streaming data sources in Databricks Delta Live Tables
* added support for install from PyPi

### version 0.3.0

The code for the Databricks Data Generator has the following dependencies:

* Requires Databricks runtime 9.1 LTS or later
* Requires Spark 3.1.2 or later
* Requires Python 3.8.10 or later

While the data generator framework does not require all libraries used by the runtimes, where a library from
the Databricks runtime is used, it will use the version found in the Databricks runtime for 9.1 LTS or later.
You can use older versions of the Databricks Labs Data Generator by referring to that explicit version.

To use an older DB runtime version, install the compatible version of the library with the following command in your notebook:

```commandline
%pip install git+https://github.com/databrickslabs/dbldatagen@dbr_7_3_LTS_compat
```

See the [Databricks runtime release notes](https://docs.databricks.com/release-notes/runtime/releases.html)
for the full list of dependencies.


49 changes: 43 additions & 6 deletions CONTRIBUTING.md
@@ -15,7 +15,10 @@ warrant that you have the legal authority to do so.

## Python compatibility

The code has been tested with Python 3.8.10 and later.

Older releases were tested with Python 3.7.5, but as of this release the library requires Databricks runtime 9.1 LTS
or later, which relies on Python 3.8.10.

## Checking your code for common issues

@@ -77,10 +80,21 @@ Run `make clean dist` from the main project directory.

# Testing

## Developing new tests

New tests should be created using `pytest`, with classes combining multiple `pytest` tests.

Existing test code contains tests based on Python's `unittest` framework, but these are
run on `pytest` rather than `unittest`.

To get a `spark` instance for test purposes, use the following code:

```python
import dbldatagen as dg

spark = dg.SparkSingleton.getLocalInstance("<name to flag spark instance>")
```

The name used to flag the spark instance should be the test module or test class name.
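
For illustration, a new test class following these conventions might look like the following minimal sketch
(the class, fixture, and test names are illustrative only, not part of the project):

```python
import pytest

import dbldatagen as dg


class TestExampleFeature:
    """Illustrative group of related tests sharing one spark fixture."""

    @pytest.fixture(scope="class")
    def spark(self):
        # flag the spark instance with the test class name
        return dg.SparkSingleton.getLocalInstance("TestExampleFeature")

    def test_row_count(self, spark):
        df = spark.range(10)
        assert df.count() == 10
```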

## Running unit / integration tests

@@ -100,9 +114,32 @@ To run the tests using a `pipenv` environment:
- Run `make test-with-html-report` to generate a test coverage report in `htmlcov/index.html`

# Using the Databricks Labs data generator
The recommended method for installation is to install from the PyPi package.

You can install the library as a notebook-scoped library when working within the Databricks
notebook environment through the use of a `%pip` cell in your notebook.

To install as a notebook-scoped library, create and execute a notebook cell with the following text:

> `%pip install dbldatagen`

This installs from the PyPi package.

You can also install from release binaries or directly from the Github sources.

The release binaries can be accessed at:
- Databricks Labs Github Data Generator releases - https://github.com/databrickslabs/dbldatagen/releases

The `%pip install` method also works on the Databricks Community Edition.

Alternatively, you can download a wheel file and use the Databricks install mechanism to install the wheel-based
library into your workspace.

The `%pip install` method can also download a specific binary release.
For example, the following command installs the release v0.2.1:

> `%pip install https://github.com/databrickslabs/dbldatagen/releases/download/v021/dbldatagen-0.2.1-py3-none-any.whl`

Once the library has been installed, you can use it to generate a test data frame.
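
A minimal sketch of doing so, assuming a `spark` session is available as in a Databricks notebook
(the column names and values here are illustrative, not prescribed by the library):

```python
import dbldatagen as dg

# specification for a simple 1000-row, 4-partition test data frame
testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set", rows=1000, partitions=4)
    .withColumn("code", "integer", minValue=1, maxValue=20)
    .withColumn("status", "string", values=["active", "inactive"], random=True)
)

testDataDf = testDataSpec.build()
testDataDf.show()
```
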
# Coding Style

9 changes: 5 additions & 4 deletions README.md
@@ -45,7 +45,7 @@ used in other computations
* Generating values to conform to a schema or independent of an existing schema
* use of SQL expressions in test data generation
* plugin mechanism to allow use of 3rd party libraries such as Faker
* Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source

Details of these features can be found in the online documentation -
[online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).
@@ -57,7 +57,7 @@ details of use and many examples.

Release notes and details of the latest changes for this specific release
can be found in the Github repository
[here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.3.0/CHANGELOG.md)

# Installation

@@ -75,9 +75,10 @@ The documentation [installation notes](https://databrickslabs.github.io/dbldatag
contains details of installation using alternative mechanisms.

## Compatibility
The Databricks Labs data generator framework can be used with Pyspark 3.1.2 and Python 3.8 or later. These are
compatible with the Databricks runtime 9.1 LTS and later releases.

Older prebuilt releases are tested against Pyspark 3.0.1 (compatible with the Databricks runtime 7.3 LTS
or later) and built with Python 3.7.5

For full library compatibility for a specific Databricks Spark release, see the Databricks
18 changes: 13 additions & 5 deletions dbldatagen/__init__.py
@@ -24,7 +24,8 @@
"""

from .data_generator import DataGenerator
from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_RANDOM, RANDOM_SEED_FIXED, \
    RANDOM_SEED_HASH_FIELD_NAME, MIN_PYTHON_VERSION, MIN_SPARK_VERSION
from .utils import ensure, topologicalSort, mkBoundsList, coalesce_values, \
deprecated, parse_time_interval, DataGenError
from ._version import __version__
@@ -46,12 +47,19 @@
"text_generator_plugins"
]

def python_version_check(python_version_expected):
    """Check against Python version

    Allows the minimum version to be passed in to facilitate unit testing

    :param python_version_expected: minimum version of Python to support, as a tuple, e.g. (3, 6)
    :return: True if passed
    """
    import sys
    return sys.version_info >= python_version_expected


# let's check for a correct Python version or raise an exception
if not python_version_check(MIN_PYTHON_VERSION):
    raise RuntimeError(f"Minimum version of Python supported is {MIN_PYTHON_VERSION[0]}.{MIN_PYTHON_VERSION[1]}")
15 changes: 14 additions & 1 deletion dbldatagen/_version.py
@@ -33,5 +33,18 @@ def get_version(version):
return version_info


__version__ = "0.2.1" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
__version__ = "0.3.0" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
__version_info__ = get_version(__version__)


def _get_spark_version(sparkVersion):
    try:
        r = re.compile(r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?P<release>.*)')
        major, minor, patch, release = r.match(sparkVersion).groups()
        spark_version_info = VersionInfo(int(major), int(minor), int(patch), release, build="0")
    except (RuntimeError, AttributeError):
        spark_version_info = VersionInfo(major=3, minor=0, patch=1, release="unknown", build="0")
        logging.warning("Could not parse spark version - using assumed Spark Version : %s", spark_version_info)

    return spark_version_info
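
A minimal sketch of the parsing and fallback behaviour (`_get_spark_version` is an internal helper, shown here
only to illustrate the logic; the input strings are arbitrary examples):

```python
from dbldatagen._version import _get_spark_version

# a well-formed version string parses into its components
v = _get_spark_version("3.2.1")
assert (v.major, v.minor, v.patch) == (3, 2, 1)

# a malformed string falls back to the assumed default version and logs a warning
fallback = _get_spark_version("not-a-version")
assert (fallback.major, fallback.minor, fallback.patch) == (3, 0, 1)
```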

42 changes: 41 additions & 1 deletion dbldatagen/data_generator.py
@@ -12,8 +12,9 @@
from pyspark.sql.types import LongType, IntegerType, StringType, StructType, StructField, DataType
from .spark_singleton import SparkSingleton
from .column_generation_spec import ColumnGenerationSpec
from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_FIXED, RANDOM_SEED_HASH_FIELD_NAME, MIN_SPARK_VERSION
from .utils import ensure, topologicalSort, DataGenError, deprecated
from ._version import _get_spark_version

_OLD_MIN_OPTION = 'min'
_OLD_MAX_OPTION = 'max'
@@ -131,9 +132,48 @@ def __init__(self, sparkSession=None, name=None, randomSeedMethod=None,
        self.withColumn(ColumnGenerationSpec.SEED_COLUMN, LongType(), nullable=False, implicit=True, omit=True)
        self._batchSize = batchSize

        # set up spark session
        self._setupSparkSession(sparkSession)

        # set up use of pandas udfs
        self._setupPandas(batchSize)

    @classmethod
    def _checkSparkVersion(cls, sparkVersion, minSparkVersion):
        """
        Check the Spark version

        The layout of the version string must be compatible with "xx.xx.xx.patch"

        :param sparkVersion: Spark version string
        :param minSparkVersion: minimum Spark version as a tuple
        :return: True if the version passes the minimum version check
        """
        sparkVersionInfo = _get_spark_version(sparkVersion)

        if sparkVersionInfo < minSparkVersion:
            logging.warning("*** Minimum version of Spark supported is %s - found version %s",
                            minSparkVersion, sparkVersionInfo)
            return False

        return True

    def _setupSparkSession(self, sparkSession):
        """
        Set up the Spark session

        :param sparkSession: Spark session to use
        :return: nothing
        """
        if sparkSession is None:
            sparkSession = SparkSingleton.getInstance()

        assert sparkSession is not None, "Spark session not initialized"

        self.sparkSession = sparkSession

        # check if the Spark version meets the minimum requirements and warn if not
        sparkVersion = sparkSession.version
        self._checkSparkVersion(sparkVersion, MIN_SPARK_VERSION)

    def _setupPandas(self, pandasBatchSize):
        """
        Set up pandas
4 changes: 4 additions & 0 deletions dbldatagen/datagen_constants.py
@@ -25,3 +25,7 @@
RANDOM_SEED_RANDOM_FLOAT = -1.0
RANDOM_SEED_FIXED = "fixed"
RANDOM_SEED_HASH_FIELD_NAME = "hash_fieldname"

# minimum versions for version checks
MIN_PYTHON_VERSION = (3, 8)
MIN_SPARK_VERSION = (3, 1, 2)
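
These minimums are plain tuples, so they compare element-wise against other version tuples. A minimal sketch of
the comparison semantics (the Spark version string here is an arbitrary example):

```python
import sys

from dbldatagen.datagen_constants import MIN_PYTHON_VERSION, MIN_SPARK_VERSION

# sys.version_info compares element-wise against a plain tuple
assert sys.version_info >= MIN_PYTHON_VERSION

# a Spark version string can be compared the same way once split into integers
spark_tuple = tuple(int(part) for part in "3.2.1".split("."))
assert spark_tuple >= MIN_SPARK_VERSION
```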
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -28,7 +28,7 @@
author = 'Databricks Inc'

# The full version, including alpha/beta/rc tags
release = "0.2.1" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
release = "0.3.0" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion


# -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion makefile
@@ -27,7 +27,7 @@ prepare: clean

create-dev-env:
@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
conda create -n $(ENV_NAME) python=3.8.10

create-github-build-env:
@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
2 changes: 1 addition & 1 deletion python/.bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.3.0
commit = False
tag = False
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)\-{0,1}(?P<release>\D*)(?P<build>\d*)
16 changes: 8 additions & 8 deletions python/dev_require.txt
@@ -1,17 +1,17 @@
# The following packages are used in building the test data generator framework.
# All packages used are already installed in the Databricks runtime environment for version 6.5 or later
numpy==1.19.2
pandas==1.2.4
pickleshare==0.7.5
py4j==0.10.9
pyarrow==4.0.0
pyspark>=3.1.2
python-dateutil==2.8.1
six==1.15.0

# The following packages are required for development only
wheel==0.36.2
setuptools==52.0.0
bumpversion
pytest
pytest-cov
@@ -25,7 +25,7 @@ sphinx_rtd_theme
nbsphinx
numpydoc==0.8
pypandoc
ipython==7.22.0
recommonmark
sphinx-markdown-builder
rst2pdf==0.98
14 changes: 7 additions & 7 deletions python/require.txt
@@ -1,17 +1,17 @@
# The following packages are used in building the test data generator framework.
# All packages used are already installed in the Databricks runtime environment for version 6.5 or later
numpy==1.22.0
pandas==1.2.5
pickleshare==0.7.5
py4j==0.10.9
pyarrow==4.0.0
pyspark>=3.1.2
python-dateutil==2.8.1
six==1.15.0

# The following packages are required for development only
wheel==0.36.2
setuptools==52.0.0
bumpversion
pytest
pytest-cov
@@ -25,7 +25,7 @@ sphinx_rtd_theme
nbsphinx
numpydoc==0.8
pypandoc
ipython==7.22.0
recommonmark
sphinx-markdown-builder
rst2pdf==0.98
6 changes: 3 additions & 3 deletions setup.py
@@ -31,13 +31,13 @@

setuptools.setup(
    name="dbldatagen",
    version="0.3.0",
    author="Ronan Stokes, Databricks",
    description="Databricks Labs - PySpark Synthetic Data Generator",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/databrickslabs/data-generator",
    project_urls={
        "Databricks Labs": "https://www.databricks.com/learn/labs",
        "Documentation": "https://databrickslabs.github.io/dbldatagen/public_docs/index.html"
    },
@@ -52,5 +52,5 @@
"Intended Audience :: Developers",
"Intended Audience :: System Administrators"
],
python_requires='>=3.7.5',
python_requires='>=3.8.10',
)
