Feature upgrade to Spark 3.2.1 (#111)
* work in progress

* work in progress

* work in progress

* changed spaces for PEP compatibility - minor, no functionality change

* changed library dependencies for build

* additional code coverage for version checks

* additional code coverage for version checks

* Removed lines of code badge as lines of code server is down

* additional test coverage for text generators

* spark version check

* reverted unchanged_files

* updates in response to reviews

* converted version tests to pytest

* updated version due to change in baseline Python requirements

* updated version due to change in baseline Python requirements
ronanstokes-db authored Nov 28, 2022
1 parent 109707e commit 981a5a4
Showing 17 changed files with 307 additions and 48 deletions.
26 changes: 25 additions & 1 deletion CHANGELOG.md
@@ -23,6 +23,30 @@ See the contents of the file `python/require.txt` to see the Python package dependencies
* renamed packaging to `dbldatagen`
* Releases now available at https://github.com/databrickslabs/dbldatagen/releases
* code tidy up and rename of options
* added text generation plugin support for python functions and 3rd party libraries
* Use of data generator to generate static and streaming data sources in Databricks Delta Live Tables
* added support for install from PyPi

### version 0.3.0

The code for the Databricks Data Generator has the following dependencies:

* Requires Databricks runtime 9.1 LTS or later
* Requires Spark 3.1.2 or later
* Requires Python 3.8.10 or later

While the data generator framework does not require all libraries used by the runtimes, where a library from
the Databricks runtime is used, it will use the version found in the Databricks runtime for 9.1 LTS or later.
You can use older versions of the Databricks Labs Data Generator by referring to that explicit version.

To use an older DB runtime version, install the compatible version of the library with the following command in your notebook:

```commandline
%pip install git+https://github.com/databrickslabs/dbldatagen@dbr_7_3_LTS_compat
```

See the [Databricks runtime release notes](https://docs.databricks.com/release-notes/runtime/releases.html)
for the full list of dependencies.


49 changes: 43 additions & 6 deletions CONTRIBUTING.md
@@ -15,7 +15,10 @@ warrant that you have the legal authority to do so.

## Python compatibility

The code has been tested with Python 3.8.10 and later.

Older releases were tested with Python 3.7.5, but as of this release the library requires Databricks runtime 9.1 LTS
or later, which relies on Python 3.8.10.

## Checking your code for common issues

@@ -77,10 +80,21 @@ Run `make clean dist` from the main project directory.

# Testing

## Developing new tests

New tests should be created using `pytest`, with classes combining multiple `pytest` tests.

Existing test code contains tests based on Python's `unittest` framework, but these are
run on `pytest` rather than `unittest`.

To get a `spark` instance for test purposes, use the following code:

```python
import dbldatagen as dg

spark = dg.SparkSingleton.getLocalInstance("<name to flag spark instance>")
```

The name used to flag the spark instance should be the test module or test class name.
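
For illustration, a new test class following these conventions might look like the following minimal sketch
(the class, fixture, and test names are illustrative only, not part of the project):

```python
import pytest

import dbldatagen as dg


class TestExampleFeature:
    """Illustrative group of related tests sharing one spark fixture."""

    @pytest.fixture(scope="class")
    def spark(self):
        # flag the spark instance with the test class name
        return dg.SparkSingleton.getLocalInstance("TestExampleFeature")

    def test_row_count(self, spark):
        df = spark.range(10)
        assert df.count() == 10
```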

## Running unit / integration tests

@@ -100,9 +114,32 @@ To run the tests using a `pipenv` environment:
- Run `make test-with-html-report` to generate a test coverage report in `htmlcov/index.html`

# Using the Databricks Labs data generator
The recommended method for installation is to install from the PyPi package.

You can install the library as a notebook-scoped library when working within the Databricks
notebook environment through the use of a `%pip` cell in your notebook.

To install as a notebook-scoped library, create and execute a notebook cell with the following text:

> `%pip install dbldatagen`

This installs from the PyPi package.

You can also install from release binaries or directly from the Github sources.

The release binaries can be accessed at:
- Databricks Labs Github Data Generator releases - https://github.com/databrickslabs/dbldatagen/releases

The `%pip install` method also works on the Databricks Community Edition.

Alternatively, you can download a wheel file and use the Databricks install mechanism to install the wheel-based
library into your workspace.

The `%pip install` method can also download a specific binary release.
For example, the following command installs the release v0.2.1:

> `%pip install https://github.com/databrickslabs/dbldatagen/releases/download/v021/dbldatagen-0.2.1-py3-none-any.whl`

Once the library has been installed, you can use it to generate a test data frame.
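
A minimal sketch of doing so, assuming a `spark` session is available as in a Databricks notebook
(the column names and values here are illustrative, not prescribed by the library):

```python
import dbldatagen as dg

# specification for a simple 1000-row, 4-partition test data frame
testDataSpec = (
    dg.DataGenerator(spark, name="test_data_set", rows=1000, partitions=4)
    .withColumn("code", "integer", minValue=1, maxValue=20)
    .withColumn("status", "string", values=["active", "inactive"], random=True)
)

testDataDf = testDataSpec.build()
testDataDf.show()
```
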
# Coding Style

9 changes: 5 additions & 4 deletions README.md
@@ -45,7 +45,7 @@ used in other computations
* Generating values to conform to a schema or independent of an existing schema
* use of SQL expressions in test data generation
* plugin mechanism to allow use of 3rd party libraries such as Faker
* Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source

Details of these features can be found in the online documentation -
[online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).
@@ -57,7 +57,7 @@ details of use and many examples.

Release notes and details of the latest changes for this specific release
can be found in the Github repository
[here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.3.0/CHANGELOG.md)

# Installation

@@ -75,9 +75,10 @@ The documentation [installation notes](https://databrickslabs.github.io/dbldatag
contains details of installation using alternative mechanisms.

## Compatibility
The Databricks Labs data generator framework can be used with Pyspark 3.1.2 and Python 3.8 or later. These are
compatible with the Databricks runtime 9.1 LTS and later releases.

Older prebuilt releases are tested against Pyspark 3.0.1 (compatible with the Databricks runtime 7.3 LTS
or later) and built with Python 3.7.5

For full library compatibility for a specific Databricks Spark release, see the Databricks
18 changes: 13 additions & 5 deletions dbldatagen/__init__.py
@@ -24,7 +24,8 @@
"""

from .data_generator import DataGenerator
from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_RANDOM, RANDOM_SEED_FIXED, \
    RANDOM_SEED_HASH_FIELD_NAME, MIN_PYTHON_VERSION, MIN_SPARK_VERSION
from .utils import ensure, topologicalSort, mkBoundsList, coalesce_values, \
deprecated, parse_time_interval, DataGenError
from ._version import __version__
@@ -46,12 +47,19 @@
"text_generator_plugins"
]

def python_version_check(python_version_expected):
    """Check against Python version

    Allows the minimum version to be passed in to facilitate unit testing

    :param python_version_expected: minimum version of Python to support, as a tuple, e.g. (3, 6)
    :return: True if passed
    """
    import sys
    return sys.version_info >= python_version_expected


# let's check for a correct Python version or raise an exception
if not python_version_check(MIN_PYTHON_VERSION):
    raise RuntimeError(f"Minimum version of Python supported is {MIN_PYTHON_VERSION[0]}.{MIN_PYTHON_VERSION[1]}")
15 changes: 14 additions & 1 deletion dbldatagen/_version.py
@@ -33,5 +33,18 @@ def get_version(version):
return version_info


__version__ = "0.2.1" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
__version__ = "0.3.0" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
__version_info__ = get_version(__version__)


def _get_spark_version(sparkVersion):
    try:
        r = re.compile(r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?P<release>.*)')
        major, minor, patch, release = r.match(sparkVersion).groups()
        spark_version_info = VersionInfo(int(major), int(minor), int(patch), release, build="0")
    except (RuntimeError, AttributeError):
        spark_version_info = VersionInfo(major=3, minor=0, patch=1, release="unknown", build="0")
        logging.warning("Could not parse spark version - using assumed Spark Version : %s", spark_version_info)

    return spark_version_info
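
A minimal sketch of the parsing and fallback behaviour (`_get_spark_version` is an internal helper, shown here
only to illustrate the logic; the input strings are arbitrary examples):

```python
from dbldatagen._version import _get_spark_version

# a well-formed version string parses into its components
v = _get_spark_version("3.2.1")
assert (v.major, v.minor, v.patch) == (3, 2, 1)

# a malformed string falls back to the assumed default version and logs a warning
fallback = _get_spark_version("not-a-version")
assert (fallback.major, fallback.minor, fallback.patch) == (3, 0, 1)
```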

42 changes: 41 additions & 1 deletion dbldatagen/data_generator.py
@@ -12,8 +12,9 @@
from pyspark.sql.types import LongType, IntegerType, StringType, StructType, StructField, DataType
from .spark_singleton import SparkSingleton
from .column_generation_spec import ColumnGenerationSpec
from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_FIXED, RANDOM_SEED_HASH_FIELD_NAME, MIN_SPARK_VERSION
from .utils import ensure, topologicalSort, DataGenError, deprecated
from ._version import _get_spark_version

_OLD_MIN_OPTION = 'min'
_OLD_MAX_OPTION = 'max'
@@ -131,9 +132,48 @@ def __init__(self, sparkSession=None, name=None, randomSeedMethod=None,
        self.withColumn(ColumnGenerationSpec.SEED_COLUMN, LongType(), nullable=False, implicit=True, omit=True)
        self._batchSize = batchSize

        # set up spark session
        self._setupSparkSession(sparkSession)

        # set up use of pandas udfs
        self._setupPandas(batchSize)

    @classmethod
    def _checkSparkVersion(cls, sparkVersion, minSparkVersion):
        """
        Check the Spark version

        The layout of the version string must be compatible with "xx.xx.xx.patch"

        :param sparkVersion: Spark version string
        :param minSparkVersion: minimum Spark version as a tuple
        :return: True if the version passes the minimum version check
        """
        sparkVersionInfo = _get_spark_version(sparkVersion)

        if sparkVersionInfo < minSparkVersion:
            logging.warning("*** Minimum version of Spark supported is %s - found version %s",
                            minSparkVersion, sparkVersionInfo)
            return False

        return True

    def _setupSparkSession(self, sparkSession):
        """
        Set up the Spark session

        :param sparkSession: Spark session to use
        :return: nothing
        """
        if sparkSession is None:
            sparkSession = SparkSingleton.getInstance()

        assert sparkSession is not None, "Spark session not initialized"

        self.sparkSession = sparkSession

        # check if the Spark version meets the minimum requirements and warn if not
        sparkVersion = sparkSession.version
        self._checkSparkVersion(sparkVersion, MIN_SPARK_VERSION)

    def _setupPandas(self, pandasBatchSize):
        """
        Set up pandas
4 changes: 4 additions & 0 deletions dbldatagen/datagen_constants.py
@@ -25,3 +25,7 @@
RANDOM_SEED_RANDOM_FLOAT = -1.0
RANDOM_SEED_FIXED = "fixed"
RANDOM_SEED_HASH_FIELD_NAME = "hash_fieldname"

# minimum versions for version checks
MIN_PYTHON_VERSION = (3, 8)
MIN_SPARK_VERSION = (3, 1, 2)
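
These minimums are plain tuples, so they compare element-wise against other version tuples. A minimal sketch of
the comparison semantics (the Spark version string here is an arbitrary example):

```python
import sys

from dbldatagen.datagen_constants import MIN_PYTHON_VERSION, MIN_SPARK_VERSION

# sys.version_info compares element-wise against a plain tuple
assert sys.version_info >= MIN_PYTHON_VERSION

# a Spark version string can be compared the same way once split into integers
spark_tuple = tuple(int(part) for part in "3.2.1".split("."))
assert spark_tuple >= MIN_SPARK_VERSION
```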
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -28,7 +28,7 @@
author = 'Databricks Inc'

# The full version, including alpha/beta/rc tags
release = "0.2.1" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
release = "0.3.0" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion


# -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion makefile
@@ -27,7 +27,7 @@ prepare: clean

create-dev-env:
@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
conda create -n $(ENV_NAME) python=3.8.10

create-github-build-env:
@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
2 changes: 1 addition & 1 deletion python/.bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.3.0
commit = False
tag = False
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)\-{0,1}(?P<release>\D*)(?P<build>\d*)
16 changes: 8 additions & 8 deletions python/dev_require.txt
@@ -1,17 +1,17 @@
# The following packages are used in building the test data generator framework.
# All packages used are already installed in the Databricks runtime environment for version 6.5 or later
numpy==1.19.2
pandas==1.2.4
pickleshare==0.7.5
py4j==0.10.9
pyarrow==4.0.0
pyspark>=3.1.2
python-dateutil==2.8.1
six==1.15.0

# The following packages are required for development only
wheel==0.36.2
setuptools==52.0.0
bumpversion
pytest
pytest-cov
@@ -25,7 +25,7 @@ sphinx_rtd_theme
nbsphinx
numpydoc==0.8
pypandoc
ipython==7.22.0
recommonmark
sphinx-markdown-builder
rst2pdf==0.98
14 changes: 7 additions & 7 deletions python/require.txt
@@ -1,17 +1,17 @@
# The following packages are used in building the test data generator framework.
# All packages used are already installed in the Databricks runtime environment for version 6.5 or later
numpy==1.22.0
pandas==1.2.5
pickleshare==0.7.5
py4j==0.10.9
pyarrow==4.0.0
pyspark>=3.1.2
python-dateutil==2.8.1
six==1.15.0

# The following packages are required for development only
wheel==0.36.2
setuptools==52.0.0
bumpversion
pytest
pytest-cov
@@ -25,7 +25,7 @@ sphinx_rtd_theme
nbsphinx
numpydoc==0.8
pypandoc
ipython==7.22.0
recommonmark
sphinx-markdown-builder
rst2pdf==0.98
6 changes: 3 additions & 3 deletions setup.py
@@ -31,13 +31,13 @@

setuptools.setup(
    name="dbldatagen",
    version="0.3.0",
    author="Ronan Stokes, Databricks",
    description="Databricks Labs - PySpark Synthetic Data Generator",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/databrickslabs/data-generator",
    project_urls={
        "Databricks Labs": "https://www.databricks.com/learn/labs",
        "Documentation": "https://databrickslabs.github.io/dbldatagen/public_docs/index.html"
    },
@@ -52,5 +52,5 @@
"Intended Audience :: Developers",
"Intended Audience :: System Administrators"
],
python_requires='>=3.7.5',
python_requires='>=3.8.10',
)
