Feature upgrade to Spark 3.2.1 #111

Merged
merged 22 commits into from
Nov 28, 2022
Changes from 19 commits
dbc2ad5
work in progress
ronanstokes-db Oct 4, 2022
163ed73
work in progress
ronanstokes-db Oct 4, 2022
70af189
work in progress
ronanstokes-db Oct 4, 2022
7a3f604
changed spaces for PEP compatibility - minor no functionality change
ronanstokes-db Oct 4, 2022
a23017d
changed library dependencies for build
ronanstokes-db Oct 4, 2022
03e751d
additional code coverage for version checks
ronanstokes-db Oct 4, 2022
1499900
additional code coverage for version checks
ronanstokes-db Oct 4, 2022
3f192e8
Removed lines of code badge as lines of code server is down
ronanstokes-db Oct 4, 2022
b8e53b8
additional test coverage for text generators
ronanstokes-db Oct 4, 2022
18b05be
spark version check
ronanstokes-db Oct 4, 2022
53148f2
reverted unchanged_files
ronanstokes-db Oct 4, 2022
f40f26c
Merge branch 'master' into feature-upgrade-to-spark-3_1_2
ronanstokes-db Oct 5, 2022
31fb56f
updates in response to reviews
ronanstokes-db Oct 7, 2022
ab61f2b
Merge branch 'feature-upgrade-to-spark-3_1_2' of https://github.com/d…
ronanstokes-db Oct 7, 2022
3447200
converted version tests to pytest
ronanstokes-db Oct 7, 2022
47b903d
Merge branch 'master' into feature-upgrade-to-spark-3_1_2
ronanstokes-db Oct 7, 2022
d3ebcf2
merged changes from master
ronanstokes-db Nov 23, 2022
85478f8
updated version due to change in baseline Python requirements
ronanstokes-db Nov 23, 2022
3b04db4
updated version due to change in baseline Python requirements
ronanstokes-db Nov 23, 2022
44939aa
updated version due to change in baseline Python requirements
ronanstokes-db Nov 23, 2022
d7cdfb8
updated version due to change in baseline Python requirements
ronanstokes-db Nov 23, 2022
e8eb833
updated version due to change in baseline Python requirements
ronanstokes-db Nov 24, 2022
26 changes: 25 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,30 @@ See the contents of the file `python/require.txt` to see the Python package depe
* renamed packaging to `dbldatagen`
* Releases now available at https://github.com/databrickslabs/dbldatagen/releases
* code tidy up and rename of options
* added text generation plugin support for python functions and 3rd party libraries such as Faker
* added text generation plugin support for python functions and 3rd party libraries
* Use of data generator to generate static and streaming data sources in Databricks Delta Live Tables
* added support for install from PyPi

### version 0.3.0

The code for the Databricks Data Generator has the following dependencies

* Requires Databricks runtime 9.1 LTS or later
* Requires Spark 3.1.2 or later
* Requires Python 3.8.10 or later

While the data generator framework does not require all libraries used by the runtimes, where a library from
the Databricks runtime is used, it will use the version found in the Databricks runtime for 9.1 LTS or later.
You can use older versions of the Databricks Labs Data Generator by referring to that explicit version.

To use the data generator with an older Databricks runtime version, run the following in your notebook:

```commandline
%pip install git+https://github.com/databrickslabs/dbldatagen@dbr_7_3_LTS_compat
```

See the [Databricks runtime release notes](https://docs.databricks.com/release-notes/runtime/releases.html)
for the full list of dependencies.

The release notes can be found at: https://docs.databricks.com/release-notes/runtime/releases.html

49 changes: 43 additions & 6 deletions CONTRIBUTING.md
Expand Up @@ -15,7 +15,10 @@ warrant that you have the legal authority to do so.

## Python compatibility

The code has been tested with Python 3.7.5 and 3.8
The code has been tested with Python 3.8.10 and later.

Older releases were tested with Python 3.7.5, but this release requires the Databricks runtime 9.1 LTS or later,
which relies on Python 3.8.10.

## Checking your code for common issues

Expand Down Expand Up @@ -77,10 +80,21 @@ Run `make clean dist` from the main project directory.

# Testing

## Creating tests
Preferred style is to use pytest rather than unittest but some unittest based code is used in compatibility mode.
## Developing new tests
New tests should be created using pytest, with classes grouping multiple pytest tests.

Existing test code contains tests based on Python's `unittest` framework, but these are
run with `pytest` rather than `unittest`.

To get a `spark` instance for test purposes, use the following code:

```python
import dbldatagen as dg

Any new tests should be written as pytest compatible test classes.
spark = dg.SparkSingleton.getLocalInstance("<name to flag spark instance>")
```

The name used to flag the spark instance should be the test module or test class name.
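
As an illustration of the pytest class style described above, here is a minimal sketch. It assumes a hypothetical helper, `version_at_least`, as a stand-in for the code under test (it is not part of dbldatagen) so that the example runs without a Spark session.

```python
import sys


def version_at_least(found, minimum):
    """Hypothetical helper for illustration: True if `found` is at least `minimum`."""
    return tuple(found) >= tuple(minimum)


class TestVersionChecks:
    """pytest collects classes named Test* and methods named test_*.

    No unittest.TestCase base class is needed; plain assert statements are used.
    """

    def test_current_python_passes(self):
        # any Python 3 interpreter satisfies a (3, 0) minimum
        assert version_at_least(sys.version_info[:2], (3, 0))

    def test_future_version_fails(self):
        # (3, 8) is below a deliberately unrealistic (10, 0) minimum
        assert not version_at_least((3, 8), (10, 0))
```

In a real test module, a Spark-dependent test would obtain its session with `dg.SparkSingleton.getLocalInstance(...)` as shown in the snippet above, typically once per module or class.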

## Running unit / integration tests

Expand All @@ -100,9 +114,32 @@ To run the tests using a `pipenv` environment:
- Run `make test-with-html-report` to generate test coverage report in `htmlcov/index.html`

# Using the Databricks Labs data generator
To use the project, the generated wheel should be installed in your Python notebook as a wheel based library
The recommended installation method is to install from the PyPI package.

You can install the library as a notebook scoped library when working within the Databricks
notebook environment through the use of a `%pip` cell in your notebook.

To install as a notebook-scoped library, create and execute a notebook cell with the following text:

> `%pip install dbldatagen`

This installs from the PyPI package.

You can also install from release binaries or directly from the Github sources.

The release binaries can be accessed at:
- Databricks Labs Github Data Generator releases - https://github.com/databrickslabs/dbldatagen/releases


The `%pip install` method also works on the Databricks Community Edition.

Alternatively, you can download a wheel file and use the Databricks library install mechanism to install it as a
wheel-based library in your workspace.

The `%pip install` method can also download a specific binary release.
For example, the following code downloads the release V0.2.1:

Once the library has been installed, you can use it to generate a test data frame.
> `%pip install https://github.com/databrickslabs/dbldatagen/releases/download/v021/dbldatagen-0.2.1-py3-none-any.whl`

# Coding Style

9 changes: 5 additions & 4 deletions README.md
Expand Up @@ -45,7 +45,7 @@ used in other computations
* Generating values to conform to a schema or independent of an existing schema
* use of SQL expressions in test data generation
* plugin mechanism to allow use of 3rd party libraries such as Faker
* Use of data generator to generate data sources in Databricks Delta Live Tables
* Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source

Details of these features can be found in the online documentation -
[online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html).
Expand All @@ -57,7 +57,7 @@ details of use and many examples.

Release notes and details of the latest changes for this specific release
can be found in the Github repository
[here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.2.1/CHANGELOG.md)
[here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.3.0/CHANGELOG.md)

# Installation

Expand All @@ -75,9 +75,10 @@ The documentation [installation notes](https://databrickslabs.github.io/dbldatag
contains details of installation using alternative mechanisms.

## Compatibility
The Databricks Labs data generator framework can be used with Pyspark 3.x and Python 3.6 or later
The Databricks Labs data generator framework can be used with Pyspark 3.1.2 and Python 3.8 or later. These are
compatible with the Databricks runtime 9.1 LTS and later releases.

However prebuilt releases are tested against Pyspark 3.0.1 (compatible with the Databricks runtime 7.3 LTS
Older prebuilt releases were tested against Pyspark 3.0.1 (compatible with the Databricks runtime 7.3 LTS
or later) and built with Python 3.7.5.

For full library compatibility for a specific Databricks Spark release, see the Databricks
18 changes: 13 additions & 5 deletions dbldatagen/__init__.py
Expand Up @@ -24,7 +24,8 @@
"""

from .data_generator import DataGenerator
from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_RANDOM, RANDOM_SEED_FIXED, RANDOM_SEED_HASH_FIELD_NAME
from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_RANDOM, RANDOM_SEED_FIXED, \
RANDOM_SEED_HASH_FIELD_NAME, MIN_PYTHON_VERSION, MIN_SPARK_VERSION
from .utils import ensure, topologicalSort, mkBoundsList, coalesce_values, \
deprecated, parse_time_interval, DataGenError
from ._version import __version__
Expand All @@ -46,12 +47,19 @@
"text_generator_plugins"
]

def python_version_check(python_version_expected):
"""Check against Python version

def python_version_check():
Allows minimum version to be passed in to facilitate unit testing

:param python_version_expected: minimum version of python to support as tuple e.g (3,6)
:return: True if passed

"""
import sys
if not sys.version_info >= (3, 6):
raise RuntimeError("Minimum version of Python supported is 3.6")
return sys.version_info >= python_version_expected


# let's check for a correct python version or raise an exception
python_version_check()
if not python_version_check(MIN_PYTHON_VERSION):
raise RuntimeError(f"Minimum version of Python supported is {MIN_PYTHON_VERSION[0]}.{MIN_PYTHON_VERSION[1]}")
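
The refactored check relies on `sys.version_info` comparing element-wise against a plain tuple, which is what lets the minimum version be passed in for testing. A minimal standalone sketch of that behaviour follows; the constant is copied here for illustration rather than imported from `dbldatagen`, and the `(10, 0)` value is a deliberately unrealistic minimum used only to show the failing case.

```python
import sys

# Copied for illustration; in the PR this lives in datagen_constants.py
MIN_PYTHON_VERSION = (3, 8)


def python_version_check(python_version_expected):
    """Return True if the running interpreter is at least the expected version."""
    # sys.version_info compares element-wise, like a tuple
    return sys.version_info >= python_version_expected


print("meets (3, 0):", python_version_check((3, 0)))
print("meets (10, 0):", python_version_check((10, 0)))
print("meets project minimum:", python_version_check(MIN_PYTHON_VERSION))
```

Returning a boolean instead of raising inside the function keeps the check testable with arbitrary minimums, while the module-level `if not ...: raise RuntimeError(...)` still enforces the real minimum at import time.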
15 changes: 14 additions & 1 deletion dbldatagen/_version.py
Expand Up @@ -33,5 +33,18 @@ def get_version(version):
return version_info


__version__ = "0.2.1" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
__version__ = "0.3.0" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
__version_info__ = get_version(__version__)


def _get_spark_version(sparkVersion):
    try:
        r = re.compile(r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?P<release>.*)')
        major, minor, patch, release = r.match(sparkVersion).groups()
        spark_version_info = VersionInfo(int(major), int(minor), int(patch), release, build="0")
    except (RuntimeError, AttributeError):
        spark_version_info = VersionInfo(major=3, minor=0, patch=1, release="unknown", build="0")
        logging.warning("Could not parse spark version - using assumed Spark Version : %s", spark_version_info)

    return spark_version_info
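
To make the parsing and the later comparison against `MIN_SPARK_VERSION` concrete, here is a standalone sketch. It assumes `VersionInfo` is a named tuple with the fields `major, minor, patch, release, build` (its definition is outside this hunk), and `parse_spark_version` is a hypothetical stand-in that returns `None` on failure instead of logging and falling back to 3.0.1.

```python
import re
from collections import namedtuple

# Assumed shape of VersionInfo; the real definition is earlier in _version.py
VersionInfo = namedtuple("VersionInfo", ["major", "minor", "patch", "release", "build"])

_SPARK_VERSION_RE = re.compile(r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?P<release>.*)')

MIN_SPARK_VERSION = (3, 1, 2)


def parse_spark_version(version_string):
    """Hypothetical stand-in for _get_spark_version, without the logging fallback."""
    match = _SPARK_VERSION_RE.match(version_string)
    if match is None:
        return None
    major, minor, patch, release = match.groups()
    return VersionInfo(int(major), int(minor), int(patch), release, build="0")


for candidate in ["3.0.1", "3.1.2", "3.2.1", "3.3.0-SNAPSHOT"]:
    info = parse_spark_version(candidate)
    # A named tuple compares element-wise with a plain tuple; when the first
    # three fields are equal, the longer tuple is treated as the greater one
    print(candidate, "below minimum:", info < MIN_SPARK_VERSION)
```

With these inputs only `3.0.1` is reported as below the minimum. Because the comparison is decided by the three integer fields, or by tuple length when they are equal, the string `release` field is never compared against an integer.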

49 changes: 48 additions & 1 deletion dbldatagen/data_generator.py
Expand Up @@ -12,8 +12,9 @@
from pyspark.sql.types import LongType, IntegerType, StringType, StructType, StructField, DataType
from .spark_singleton import SparkSingleton
from .column_generation_spec import ColumnGenerationSpec
from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_FIXED, RANDOM_SEED_HASH_FIELD_NAME
from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_FIXED, RANDOM_SEED_HASH_FIELD_NAME, MIN_SPARK_VERSION
from .utils import ensure, topologicalSort, DataGenError, deprecated
from . _version import _get_spark_version

_OLD_MIN_OPTION = 'min'
_OLD_MAX_OPTION = 'max'
Expand Down Expand Up @@ -131,9 +132,55 @@ def __init__(self, sparkSession=None, name=None, randomSeedMethod=None,
self.withColumn(ColumnGenerationSpec.SEED_COLUMN, LongType(), nullable=False, implicit=True, omit=True)
self._batchSize = batchSize

# set up spark session
self._setupSparkSession(sparkSession)

# set up use of pandas udfs
self._setupPandas(batchSize)

    @classmethod
    def _checkSparkVersion(cls, sparkVersion, minSparkVersion):
        """
        check spark version
        :param sparkVersion: spark version string
        :param minSparkVersion: min spark version as tuple
        :return: True if version passes minVersion

        Layout of version string must be compatible "xx.xx.xx.patch"
        """
        sparkVersionInfo = _get_spark_version(sparkVersion)

        if sparkVersionInfo < minSparkVersion:
            # logging.warn is deprecated; the message refers to Spark, not Python
            logging.warning("*** Minimum version of Spark supported is %s - found version %s ",
                            minSparkVersion, sparkVersionInfo)
            return False

        return True

def _setupSparkSession(self, sparkSession):
"""
Set up spark session
:param sparkSession: spark session to use
:return: nothing
"""
if sparkSession is None:
sparkSession = SparkSingleton.getInstance()

assert sparkSession is not None, "The spark session attribute must be initialized"

self.sparkSession = sparkSession
if sparkSession is None:
raise DataGenError("""Spark session not initialized

Collaborator:

Added line #L171 was not covered by tests

and this statement is unreachable :) lgtm.com is soo good at this.

Contributor Author:

> and this statement is unreachable :) lgtm.com is soo good at this.

except the fix for this was checked in 20 hours earlier

Contributor Author:

for some reason lgtm is taking 1 day to update results - perhaps due to the transition to new code scanning?

The spark session attribute must be initialized in the DataGenerator initialization

i.e DataGenerator(sparkSession=spark, name="test", ...)
""")

# check if the spark version meets the minimum requirements and warn if not
sparkVersion = sparkSession.version
self._checkSparkVersion(sparkVersion, MIN_SPARK_VERSION)

def _setupPandas(self, pandasBatchSize):
"""
Set up pandas
4 changes: 4 additions & 0 deletions dbldatagen/datagen_constants.py
Expand Up @@ -25,3 +25,7 @@
RANDOM_SEED_RANDOM_FLOAT = -1.0
RANDOM_SEED_FIXED = "fixed"
RANDOM_SEED_HASH_FIELD_NAME = "hash_fieldname"

# minimum versions for version checks
MIN_PYTHON_VERSION = (3, 8)
MIN_SPARK_VERSION = (3, 1, 2)
2 changes: 1 addition & 1 deletion docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@
author = 'Databricks Inc'

# The full version, including alpha/beta/rc tags
release = "0.2.1" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion
release = "0.3.0" # DO NOT EDIT THIS DIRECTLY! It is managed by bumpversion


# -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion makefile
Expand Up @@ -27,7 +27,7 @@ prepare: clean

create-dev-env:
@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
conda create -n $(ENV_NAME) python=3.8
conda create -n $(ENV_NAME) python=3.8.10

create-github-build-env:
@echo "$(OK_COLOR)=> making conda dev environment$(NO_COLOR)"
2 changes: 1 addition & 1 deletion python/.bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.2.1
current_version = 0.3.0
commit = False
tag = False
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)\-{0,1}(?P<release>\D*)(?P<build>\d*)
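
For readers unfamiliar with the `parse` setting, the sketch below applies the same pattern with Python's `re` module to show which named groups it produces. The version string `0.3.0-rc1` is a hypothetical example, not a release of this project, and the sketch assumes the pattern is evaluated as an ordinary Python regular expression.

```python
import re

# The parse pattern from python/.bumpversion.cfg, copied verbatim
PARSE = r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)\-{0,1}(?P<release>\D*)(?P<build>\d*)'

# "0.3.0" is the version set in this PR; "0.3.0-rc1" is a hypothetical pre-release
for version in ["0.3.0", "0.3.0-rc1"]:
    parts = re.match(PARSE, version).groupdict()
    print(version, parts)
```

For a plain version such as `0.3.0`, the `release` and `build` groups are empty strings; for the hypothetical `0.3.0-rc1`, `release` captures `rc` and `build` captures `1`.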
16 changes: 8 additions & 8 deletions python/dev_require.txt
@@ -1,17 +1,17 @@
# The following packages are used in building the test data generator framework.
# All packages used are already installed in the Databricks runtime environment for version 6.5 or later
numpy==1.22.0
pandas==1.0.1
numpy==1.19.2
pandas==1.2.4
pickleshare==0.7.5
py4j==0.10.9
pyarrow==1.0.1
pyspark>=3.0.1
pyarrow==4.0.0
pyspark>=3.1.2
python-dateutil==2.8.1
six==1.14.0
six==1.15.0

# The following packages are required for development only
wheel==0.34.2
setuptools==45.2.0
wheel==0.36.2
setuptools==52.0.0
bumpversion
pytest
pytest-cov
Expand All @@ -25,7 +25,7 @@ sphinx_rtd_theme
nbsphinx
numpydoc==0.8
pypandoc
ipython==7.16.3
ipython==7.22.0
recommonmark
sphinx-markdown-builder
rst2pdf==0.98
14 changes: 7 additions & 7 deletions python/require.txt
@@ -1,17 +1,17 @@
# The following packages are used in building the test data generator framework.
# All packages used are already installed in the Databricks runtime environment for version 6.5 or later
numpy==1.22.0
pandas==1.0.1
pandas==1.2.5
pickleshare==0.7.5
py4j==0.10.9
pyarrow==1.0.1
pyspark>=3.0.1
pyarrow==4.0.0
pyspark>=3.1.2
python-dateutil==2.8.1
six==1.14.0
six==1.15.0

# The following packages are required for development only
wheel==0.34.2
setuptools==45.2.0
wheel==0.36.2
setuptools==52.0.0
bumpversion
pytest
pytest-cov
Expand All @@ -25,7 +25,7 @@ sphinx_rtd_theme
nbsphinx
numpydoc==0.8
pypandoc
ipython==7.16.3
ipython==7.22.0
recommonmark
sphinx-markdown-builder
rst2pdf==0.98
6 changes: 3 additions & 3 deletions setup.py
Expand Up @@ -31,13 +31,13 @@

setuptools.setup(
name="dbldatagen",
version="0.2.1",
version="0.3.0",
author="Ronan Stokes, Databricks",
description="Databricks Labs - PySpark Synthetic Data Generator",
long_description=long_description,
long_description_content_type="text/markdown",
url="https://github.com/databrickslabs/data-generator",
project_urls = {
project_urls={
"Databricks Labs": "https://www.databricks.com/learn/labs",
"Documentation": "https://databrickslabs.github.io/dbldatagen/public_docs/index.html"
},
Expand All @@ -52,5 +52,5 @@
"Intended Audience :: Developers",
"Intended Audience :: System Administrators"
],
python_requires='>=3.7.5',
python_requires='>=3.8.10',
)
4 changes: 4 additions & 0 deletions tests/test_quick_tests.py
Expand Up @@ -698,6 +698,10 @@ def test_strings_from_numeric_string_field4(self):
rowCount = nullRowsDF.count()
self.assertEqual(rowCount, 0)

def test_version_info(self):
# test access to version info without explicit import
print("Data generator version", dg.__version__)


# run the tests
# if __name__ == '__main__':