DataOps TestGen

DataOps TestGen delivers simple, fast data quality test generation and execution through data profiling, new dataset screening and hygiene review, algorithmic generation of data quality validation tests, ongoing production testing of new data refreshes, and continuous anomaly monitoring of datasets. DataOps TestGen is part of DataKitchen's Open Source Data Observability.

DataKitchen Open Source Data Observability

Installation

Using dk-installer (recommended)

Install with a single command using dk-installer.

python3 dk-installer.py tg install

Using docker compose

You can also install using the provided docker-compose.yml.

Make a local copy of the compose file.

curl -o docker-compose.yml 'https://raw.githubusercontent.com/DataKitchen/dataops-testgen/main/deploy/docker-compose.yml'

If you are interested in integrating TestGen with the DataKitchen Observability platform, edit the compose file and set values for the environment variables OBSERVABILITY_API_URL and OBSERVABILITY_API_KEY.

Before running docker compose, create an env file named testgen.env to hold the secrets needed to run Testgen.

touch testgen.env

The following variables are required:

TESTGEN_USERNAME=
TESTGEN_PASSWORD=
TG_DECRYPT_SALT=
TG_DECRYPT_PASSWORD=

You can learn how each variable is used in the Configuration section.
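As an alternative to filling in the file by hand, the following is a minimal sketch that writes testgen.env with randomly generated secrets. The username `admin` and the `rand_hex` helper are illustrative assumptions, not part of TestGen, and the sketch assumes `/dev/urandom`, `head`, `od`, and `tr` are available. It overwrites any existing testgen.env in the current directory.

```shell
# Sketch: write testgen.env with randomly generated secrets.
# "admin" and rand_hex are illustrative assumptions; pick your own values.
rand_hex() {
  # Print $1 random bytes as lowercase hex (ASCII only, 2 characters per byte).
  head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

(
  umask 077  # keep the secrets file readable by the current user only
  cat > testgen.env <<EOF
TESTGEN_USERNAME=admin
TESTGEN_PASSWORD=$(rand_hex 12)
TG_DECRYPT_SALT=$(rand_hex 16)
TG_DECRYPT_PASSWORD=$(rand_hex 16)
EOF
)
```

The 16 random bytes used for the salt produce 32 hexadecimal characters, which is ASCII-only and above the 16-character length recommended for TG_DECRYPT_SALT.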

Then, run docker compose to start the services:

docker compose --env-file testgen.env up --detach

This will spin up a postgres service and a startup service, which runs once to set up the database, and make the Testgen UI available at http://localhost:8501.

After verifying that Testgen is running, follow the steps for the quick start to start getting familiar with the tool.

Quick start

Testgen includes a basic data set for you to play around with.

Using dk-installer (recommended)

Once Testgen is running, you can use dk-installer to generate the demo data:

python3 dk-installer.py tg run-demo

If you are integrating Testgen with the DataKitchen Observability platform, pass the --export flag:

python3 dk-installer.py tg run-demo --export

Using docker compose

You can also generate the demo data if you installed using docker compose. Set it up by using the Testgen CLI to run the quick start command:

docker compose --env-file testgen.env exec engine testgen quick-start

It also supports setting up the integration with DataKitchen Observability:

docker compose --env-file testgen.env exec engine testgen quick-start --observability-api-url <url> --observability-api-key <key>

NOTE: You don't need to pass the Observability URL and key as arguments if you set them up as environment variables in your compose file.

After you have the demo data from the quick-start command, complete the quick start with the following steps:

  1. Run profiling against the target demo database
docker compose --env-file testgen.env exec engine testgen run-profile --table-group-id 0ea85e17-acbe-47fe-8394-9970725ad37d
  2. Generate test cases for all columns in the target demo database
docker compose --env-file testgen.env exec engine testgen run-test-generation --table-group-id 0ea85e17-acbe-47fe-8394-9970725ad37d
  3. Run the generated tests
docker compose --env-file testgen.env exec engine testgen run-tests --project-key DEFAULT --test-suite-key default-suite-1
  4. Export the test results to Observability
docker compose --env-file testgen.env exec engine testgen export-observability --project-key DEFAULT --test-suite-key default-suite-1
  5. Simulate changes to the demo data
docker compose --env-file testgen.env exec engine testgen quick-start --simulate-fast-forward
  6. Export the test results for the simulated changes to Observability
docker compose --env-file testgen.env exec engine testgen export-observability --project-key DEFAULT --test-suite-key default-suite-1
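
The steps above can be chained so that a failure stops the sequence. The following is a minimal sketch, not an official script: the `tg` and `quick_start_steps` helpers and the `DRY_RUN` variable are hypothetical names introduced here, and the export steps assume the Observability integration is already configured.

```shell
# Sketch: run the quick-start steps in order, stopping at the first failure.
# tg, quick_start_steps and DRY_RUN are illustrative, not part of TestGen.
TABLE_GROUP_ID=0ea85e17-acbe-47fe-8394-9970725ad37d

tg() {
  # Run a testgen CLI command inside the engine container,
  # or only print the command when DRY_RUN=1.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "docker compose --env-file testgen.env exec engine testgen $*"
  else
    docker compose --env-file testgen.env exec engine testgen "$@"
  fi
}

quick_start_steps() {
  # Each step runs only if the previous one succeeded.
  tg run-profile --table-group-id "$TABLE_GROUP_ID" &&
  tg run-test-generation --table-group-id "$TABLE_GROUP_ID" &&
  tg run-tests --project-key DEFAULT --test-suite-key default-suite-1 &&
  tg export-observability --project-key DEFAULT --test-suite-key default-suite-1 &&
  tg quick-start --simulate-fast-forward &&
  tg export-observability --project-key DEFAULT --test-suite-key default-suite-1
}
```

Calling `DRY_RUN=1 quick_start_steps` prints the six commands without running them, which is a cheap way to review them before calling `quick_start_steps` for real.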

Configuration

TESTGEN_DEBUG

Invalidates the cache that holds the bootstrapped application, so that changes to the routing and plugins take effect on every render.

Also changes the logging level for the testgen.ui logger from INFO to DEBUG.

default: no

TESTGEN_LOG_TO_FILE

Set it to yes to enable rotating file logs to be written under /var/log/testgen/.

default: no

TG_DECRYPT_SALT

Salt used to encrypt and decrypt user secrets. Only allows ascii characters.

A minimum length of 16 characters is recommended.

TG_DECRYPT_PASSWORD

Secret passcode used in combination with TG_DECRYPT_SALT to encrypt and decrypt user secrets. Only allows ascii characters.
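
Before starting the services, the two constraints above (ASCII-only, and a recommended salt length of 16) can be checked locally. This is a minimal sketch: `check_secret` is a hypothetical helper, not a TestGen command, and the length check enforces what the text only recommends.

```shell
# Sketch: verify that a secret is ASCII-only and at least a given length.
# check_secret is a hypothetical helper, not part of TestGen.
check_secret() {
  # $1 = variable name, $2 = value, $3 = minimum length
  # Delete every ASCII byte; anything left over is non-ASCII.
  non_ascii=$(printf '%s' "$2" | LC_ALL=C tr -d '\001-\177')
  if [ -n "$non_ascii" ]; then
    echo "$1: contains non-ASCII characters" >&2
    return 1
  fi
  if [ "${#2}" -lt "$3" ]; then
    echo "$1: shorter than $3 characters" >&2
    return 1
  fi
}
```

For example, `check_secret TG_DECRYPT_SALT "$TG_DECRYPT_SALT" 16` returns a non-zero status and prints a message to standard error when the salt is too short or contains non-ASCII characters.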

TESTGEN_USERNAME

Username to log into the web application.

TESTGEN_PASSWORD

Password to log into the web application.

TG_METADATA_DB_USER

User to connect to the testgen application postgres database.

default: os.environ["TESTGEN_USERNAME"]

TG_METADATA_DB_PASSWORD

Password to connect to the testgen application postgres database.

default: os.environ["TESTGEN_PASSWORD"]

DATABASE_ADMIN_USER

User with admin privileges in the testgen application postgres database used to create roles, users, database and schema. Required if the user in TG_METADATA_DB_USER does not have the required privileges.

default: os.environ["TG_METADATA_DB_USER"]

DATABASE_ADMIN_PASSWORD

Password for the admin user to connect to the testgen application postgres database.

default: os.environ["TG_METADATA_DB_PASSWORD"]
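
The defaults above form a fallback chain: the admin credentials fall back to the metadata database credentials, which in turn fall back to the web application credentials. The sketch below only illustrates that resolution order with shell parameter expansion; `resolve_db_users` is a hypothetical helper, and treating an empty value the same as an unset one is an assumption of this sketch, not documented TestGen behavior.

```shell
# Sketch of the documented fallback order, not TestGen's own settings code.
# resolve_db_users is a hypothetical helper.
resolve_db_users() {
  # ${VAR:-fallback} uses the fallback when VAR is unset or empty.
  TG_METADATA_DB_USER=${TG_METADATA_DB_USER:-$TESTGEN_USERNAME}
  TG_METADATA_DB_PASSWORD=${TG_METADATA_DB_PASSWORD:-$TESTGEN_PASSWORD}
  DATABASE_ADMIN_USER=${DATABASE_ADMIN_USER:-$TG_METADATA_DB_USER}
  DATABASE_ADMIN_PASSWORD=${DATABASE_ADMIN_PASSWORD:-$TG_METADATA_DB_PASSWORD}
}
```

With only TESTGEN_USERNAME and TESTGEN_PASSWORD set, all four variables resolve to the web application credentials, which is why the minimal testgen.env needs no database variables.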

DATABASE_EXECUTE_USER

User to be created in the testgen application postgres database.

Will be granted:

  • read/write to tables test_results, test_suites and test_definitions
  • read only to all other tables.

default: testgen_execute

DATABASE_REPORT_USER

User to be created in the testgen application postgres database. Will be granted read-only access to all tables.

default: testgen_report

TG_METADATA_DB_HOST

Hostname where the testgen application postgres database is running.

default: localhost

TG_METADATA_DB_PORT

Port at which the testgen application postgres database is exposed by the host.

default: 5432

TG_METADATA_DB_NAME

Name of the database in postgres on which to store testgen metadata.

default: datakitchen

TG_METADATA_DB_SCHEMA

Name of the schema inside the postgres database on which to store testgen metadata.

default: testgen

PROJECT_KEY

Code used to uniquely identify the auto generated project.

default: DEFAULT

DEFAULT_PROJECT_NAME

Name to assign to the auto generated project.

default: Demo

PROJECT_SQL_FLAVOR

SQL flavor of the database the auto generated project will run tests against.

Supported flavors:

  • redshift
  • snowflake
  • mssql
  • postgresql

default: postgresql

PROJECT_CONNECTION_NAME

Name assigned to identify the connection to the project database.

default: default

PROJECT_CONNECTION_MAX_THREADS

Maximum number of concurrent queries executed when fetching data from the project database.

default: 4

PROJECT_CONNECTION_MAX_QUERY_CHAR

Determines how many tests are grouped together in a single query. Increase for better performance or decrease to better isolate test failures. Accepted values are 500 to 14000.

default: 5000

PROJECT_QC_SCHEMA

Name of the schema to be created in the project database.

default: qc

PROJECT_DATABASE_NAME

Name of the database the auto generated project will run tests against.

default: demo_db

PROJECT_DATABASE_SCHEMA

Name of the schema inside the project database the tests will be run against.

default: demo

PROJECT_DATABASE_USER

User to be used by the auto generated project to connect to the database under testing.

default: os.environ["TG_METADATA_DB_USER"]

PROJECT_DATABASE_PASSWORD

Password to be used by the auto generated project to connect to the database under testing.

default: os.environ["TG_METADATA_DB_PASSWORD"]

PROJECT_DATABASE_HOST

Hostname where the database under testing is running.

default: os.environ["TG_METADATA_DB_HOST"]

PROJECT_DATABASE_PORT

Port at which the database under testing is exposed by the host.

default: os.environ["TG_METADATA_DB_PORT"]

TG_TARGET_DB_TRUST_SERVER_CERTIFICATE

For supported SQL flavors, set up the SQLAlchemy connection to trust the database server certificate.

default: no

DEFAULT_TABLE_GROUPS_NAME

Name assigned to the auto generated table group.

default: default

DEFAULT_TEST_SUITE_NAME

Key to be assigned to the auto generated test suite.

default: default-suite-1

DEFAULT_TEST_SUITE_DESCRIPTION

Description for the auto generated test suite.

default: default_suite_desc

DEFAULT_PROFILING_TABLE_SET

Comma separated list of specific table names to include when running profiling for the project database.

DEFAULT_PROFILING_INCLUDE_MASK

A SQL filter supported by the project database's LIKE operator for table names to include.

default: %%

DEFAULT_PROFILING_EXCLUDE_MASK

A SQL filter supported by the project database's LIKE operator for table names to exclude.

default: tmp%%

DEFAULT_PROFILING_ID_COLUMN_MASK

A SQL filter supported by the project database's LIKE operator representing ID columns.

default: %%id

DEFAULT_PROFILING_SK_COLUMN_MASK

A SQL filter supported by the project database's LIKE operator representing surrogate key columns.

default: %%sk

DEFAULT_PROFILING_USE_SAMPLING

Toggle on to base profiling on a sample of records instead of the full table. Accepts Y or N.

default: N

OBSERVABILITY_API_URL

API URL of your instance of Observability, to which events for the project are sent.

OBSERVABILITY_API_KEY

Authentication key, created in your instance of Observability, with permission to send events.

TG_EXPORT_TO_OBSERVABILITY_VERIFY_SSL

Whether to verify the SSL certificate when exporting events to your instance of Observability.

default: yes

TG_OBSERVABILITY_EXPORT_MAX_QTY

When exporting to your instance of Observability, the maximum number of events that will be sent to the events API in a single export.

default: 5000

OBSERVABILITY_DEFAULT_COMPONENT_TYPE

When exporting to your instance of Observability, the type of event that will be sent to the events API.

default: dataset

OBSERVABILITY_DEFAULT_COMPONENT_KEY

When exporting to your instance of Observability, the key sent to the events API to identify the components.

default: default

TG_DOCKER_RELEASE_CHECK_ENABLED

Enables calling Docker Hub API to fetch the latest released image tag. The fetched tag is displayed in the UI menu.

default: yes

Community

Getting Started Guide

We recommend you start by going through the Data Observability Overview Demo.

Support

For support requests, join the Data Observability Slack and post in the #support channel.

Connect

Talk and learn with other data practitioners who are building with DataKitchen. Share knowledge, get help, and contribute to our open-source project.

Join our community here:

Contributing

For details on contributing or running the project for development, check out our contributing guide.

License

DataKitchen DataOps TestGen is Apache 2.0 licensed.