Merge branch 'develop'
kupferk committed Apr 28, 2022
2 parents a8f4a76 + b742d49 commit ff137b7
Showing 67 changed files with 863 additions and 312 deletions.
2 changes: 1 addition & 1 deletion .editorconfig
@@ -17,7 +17,7 @@ trim_trailing_whitespace = true
insert_final_newline = true

[*.scala]
indent_size = 2
indent_size = 4

[*.yml]
indent_size = 2
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,11 @@
# Version 0.24.1 - 2022-04-28

* github-175: '--jobs' parameter starts way too many parallel jobs
* github-176: start-/end-date in report should not be the same
* github-177: Implement generic SQL schema check
* github-179: Update DeltaLake dependency to 1.2.0


# Version 0.24.0 - 2022-04-05

* github-168: Support optional filters in data quality checks
33 changes: 23 additions & 10 deletions QUICKSTART.md
@@ -2,8 +2,7 @@

This quickstart guide will walk you through the installation of Apache Spark and Flowman on your local Linux box. If you
are using Windows, you will find some hints for setting up the required "Hadoop WinUtils", but we generally recommend
to use Linux. You can also run a [Flowman Docker image](docs/setup/docker.md), which is the simplest way to get up to
speed.
to use Linux. You can also run a [Flowman Docker image](setup/docker.md), which is the simplest way to get up to speed.


## 1. Install Spark
@@ -25,7 +24,9 @@ homepage. So we download the appropriate Spark distribution from the Apache archive.
mkdir playground
cd playground

# Download and unpack Spark & Hadoop

curl -L https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz | tar xvzf -# Create a nice link
curl -L https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz | tar xvzf -

# Create a nice link
ln -snf spark-3.2.1-bin-hadoop3.2 spark
```
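
If other tools on your machine should pick up this Spark installation as well, you can additionally export the
conventional `SPARK_HOME` environment variable pointing at the link just created (a small hedged sketch; whether your
setup needs this depends on how you start Flowman):
```
export SPARK_HOME=$(pwd)/spark
```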

@@ -104,11 +105,11 @@ Now we can inspect some of the relations defined in the project. First we list all relations:
flowman:weather> relation list
```

Now we can peek inside the relations `stations-raw` and `measurements-raw`. Since the second relation is partitioned
Now we can peek inside the relations `stations_raw` and `measurements_raw`. Since the second relation is partitioned
by years, we explicitly specify the year via the option `-p year=2011`
```
flowman:weather> relation show stations-raw
flowman:weather> relation show measurements-raw -p year=2011
flowman:weather> relation show stations_raw
flowman:weather> relation show measurements_raw -p year=2011
```

### Running a Job
@@ -130,9 +131,9 @@ flowman:weather> job enter main year=2011
Note how the prompt has changed and will now include the job name. Now we can inspect some mappings:
```
flowman:weather/main> mapping list
flowman:weather/main> mapping show measurements-raw
flowman:weather/main> mapping show measurements_raw
flowman:weather/main> mapping show measurements-extracted
flowman:weather/main> mapping show stations-raw
flowman:weather/main> mapping show stations_raw
```
Finally, we'd like to leave the job context again.
```
@@ -158,9 +159,21 @@ flowman:weather> history job search
flowman:weather> history target search -J 1
```


### Generating Documentation

Flowman can not only execute all the data transformations specified in the example project, it can also generate
documentation, which will be stored as an HTML file:
```
flowman:weather> documentation generate
```
This will create the file `examples/weather/generated-documentation/project.html`, which can be viewed in any web
browser of your choice.


### Quitting

Finally we quit the Flowman shell via the `quit` command.
Finally, we quit the Flowman shell via the `quit` command.
```
flowman:weather> quit
```
@@ -169,7 +182,7 @@ flowman:weather> quit
## 4. Flowman Batch Execution

So far we have only used the Flowman shell for interactive work with projects. Actually, the shell was developed as a
second step to help analyzing problems and debugging data flows. The primary command for working with Flowman projects
second step to help analyze problems and debug data flows. The primary command for working with Flowman projects
is `flowexec` which is used for non-interactive batch execution, for example within cron-jobs.
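
As a hedged sketch (the project path, job name and argument are assumptions carried over from the interactive session
above), a complete build could be started non-interactively like this:
```
flowexec -f examples/weather job build main year=2011
```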

It shares a lot of code with the Flowman shell, so the commands are often exactly the same. The main difference is
2 changes: 1 addition & 1 deletion docker/pom.xml
@@ -10,7 +10,7 @@
<parent>
<groupId>com.dimajix.flowman</groupId>
<artifactId>flowman-root</artifactId>
<version>0.24.0</version>
<version>0.24.1</version>
<relativePath>../pom.xml</relativePath>
</parent>

8 changes: 6 additions & 2 deletions docs/cli/index.md
@@ -1,6 +1,10 @@
# Flowman Executables
# Flowman CLI Tools

Flowman provides a small set of executables for working with projects.
![Flowman Shell in Action](../images/console-01.png)

Flowman provides a small set of command line interface (CLI) executables for working with projects. These are used
to execute projects as batch jobs, to investigate intermediate results of mappings, etc. Moreover, the Flowman History
Server provides a powerful web UI for keeping track of past runs.

```eval_rst
.. toctree::
2 changes: 1 addition & 1 deletion docs/concepts/concepts.md → docs/concepts/entities.md
@@ -1,4 +1,4 @@
# Core Concepts
# Core Entities

Flowman is a *data build tool* which uses a declarative syntax to specify what needs to be built. The main difference
from classical build tools like `make` or `maven` is that Flowman builds *data* instead of *applications* or *libraries*.
13 changes: 12 additions & 1 deletion docs/concepts/index.md
@@ -1,6 +1,17 @@
# Core Concepts

Flowman provides a small set of executables for working with projects.
Flowman reduces the development effort for creating robust and scalable data processing applications. At the heart
of Flowman are a few basic concepts which provide simple building blocks that can be used to build even complex
data transformations.

In order to appreciate the elegance and power of Flowman, it is important to understand the [core entities](entities.md),
which are used to model all the aspects of a data flow, like relations (which describe
data sources and sinks), mappings (which describe data transformations) and targets (which describe the actual work
to be performed).

In addition to understanding the core entities, it is also important to understand Flowman's execution model, which is
described in the [lifecycle documentation](lifecycle.md).


```eval_rst
.. toctree::
17 changes: 15 additions & 2 deletions docs/concepts/lifecycle.md
@@ -3,7 +3,7 @@
Flowman sees data as artifacts with a common lifecycle, from creation until deletion. The lifecycle itself consists of
multiple different phases, each of them representing one stage of the whole lifecycle.

## Lifecycle Phases
## Execution Phases

The full lifecycle consists of specific execution phases, as follows:

@@ -35,7 +35,7 @@ definitions, views and directories. It performs the opposite operation to the *CREATE* phase.

## Built In Lifecycles

Some of the execution phases can be performed in a meaningful way one after the other. Such a sequence of phases is
Some execution phases can be performed in a meaningful way one after the other. Such a sequence of phases is
called *lifecycle*. Flowman has the following lifecycles built in:

### Build
@@ -49,3 +49,16 @@ The second lifecycle contains only the single phase *TRUNCATE*
### Destroy

The last lifecycle contains only the single phase *DESTROY*


## Targets & Lifecycles

Each [target](../spec/target/index.md) supports a certain subset of execution phases; not all targets support all
phases. For example, the widely used [`relation` target](../spec/target/relation.md), which is used for creating data
sinks and writing new data into them, supports the phases `CREATE`, `BUILD`, `VERIFY`, `TRUNCATE` and `DESTROY`. On
the other hand, the [`measure` target](../spec/target/measure.md), which collects some data-dependent metrics, is only
executed during the `VERIFY` phase.

Of course, when a specific target participates in multiple execution phases, it will perform different actions in the
different phases. The documentation of each target describes the supported phases and the action performed in each of
them.
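
For illustration, here is a minimal sketch of a `relation` target (the target, mapping and relation names are
assumptions, loosely following the weather example):
```yaml
targets:
  measurements:
    kind: relation
    mapping: measurements_extracted
    relation: measurements
```
During the `CREATE` phase such a target would create the data sink, during `BUILD` it would write the output of the
mapping into it, and during `DESTROY` it would remove the sink again.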
123 changes: 122 additions & 1 deletion docs/documenting/checks.md
@@ -76,7 +76,8 @@ mappings:
- ["87600"]
```
## Available Column Checks
## Column Checks
Flowman implements a couple of different check types on a per-column basis.
@@ -101,6 +102,23 @@ so in many cases you might want to specify both `notNull` and `unique`.
to exclude records with known quality issues.


### Foreign Key

A `foreignKey` column check is used to ensure that all not-`NULL` values refer to existing entries in a different
mapping or relation.

* `kind` **(mandatory)** *(string)*: `foreignKey`
* `filter` **(optional)** *(string)*:
Optional SQL expression applied as a filter to select only a subset of all records for quality check. This is useful
to exclude records with known quality issues.
* `mapping` **(optional)** *(string)*: Name of mapping the foreign key refers to. You need to specify either the
`mapping` or the `relation` property.
* `relation` **(optional)** *(string)*: Name of relation the foreign key refers to. You need to specify either the
`mapping` or the `relation` property.
* `column` **(optional)** *(string)*: Name of the column in the referenced entity (either mapping or relation). If
this property is not set, then the same column name will be assumed.
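
A hedged example of such a check (the mapping and column names are assumptions, loosely following the weather example):
```yaml
kind: foreignKey
mapping: stations_raw
column: usaf
```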


### Specific Values

In order to test if a column only contains specific values, you can use the `values` test. Note that this test will
@@ -152,3 +170,106 @@ A very flexible test is provided with the SQL expression test. This test allows you to specify any simple SQL expression, which should evaluate to `TRUE` for all records passing the test.
* `filter` **(optional)** *(string)*:
Optional SQL expression applied as a filter to select only a subset of all records for quality check. This is useful
to exclude records with known quality issues.


## Schema Checks

In addition to checks for individual columns, Flowman also supports schema checks, which may refer to multiple columns.

### Primary Key
A `primaryKey` schema check is used to ensure that the given columns form a valid primary key, i.e. that the
combination of their values is unique across all records.

* `kind` **(mandatory)** *(string)*: `primaryKey`
* `filter` **(optional)** *(string)*:
Optional SQL expression applied as a filter to select only a subset of all records for quality check. This is useful
to exclude records with known quality issues.
* `columns` **(optional)** *(list:string)*: Names of the assumed primary key columns in the model
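
A hedged example (the column names are assumptions, loosely following the weather example):
```yaml
kind: primaryKey
columns:
  - usaf
  - wban
```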


### Foreign Key
A `foreignKey` schema check is used to ensure that all not-`NULL` values (which may span multiple columns) refer to
existing entries in a different mapping or relation.

* `kind` **(mandatory)** *(string)*: `foreignKey`
* `filter` **(optional)** *(string)*:
Optional SQL expression applied as a filter to select only a subset of all records for quality check. This is useful
to exclude records with known quality issues.
* `mapping` **(optional)** *(string)*: Name of mapping the foreign key refers to. You need to specify either the
`mapping` or the `relation` property.
* `relation` **(optional)** *(string)*: Name of relation the foreign key refers to. You need to specify either the
`mapping` or the `relation` property.
* `columns` **(optional)** *(list:string)*: Names of the columns in the model
* `references` **(optional)** *(list:string)*: Names of the corresponding columns in the referenced entity
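
A hedged example of a composite foreign key (the relation and column names are assumptions, loosely following the
weather example):
```yaml
kind: foreignKey
relation: stations
columns:
  - usaf
  - wban
references:
  - usaf
  - wban
```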


### SQL Expression
A very flexible test is provided with the SQL expression test. This test allows you to specify any simple SQL expression
(which may also use different columns), which should evaluate to `TRUE` for all records passing the test.

* `kind` **(mandatory)** *(string)*: `expression`
* `expression` **(mandatory)** *(string)*: Boolean SQL Expression
* `filter` **(optional)** *(string)*:
Optional SQL expression applied as a filter to select only a subset of all records for quality check. This is useful
to exclude records with known quality issues.
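
A hedged example (the column names are assumptions):
```yaml
kind: expression
expression: "start_date <= end_date"
```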


### SQL Query
A very flexible test is provided with the SQL query test. This test allows you to specify an arbitrary SQL `SELECT`
statement (which may also refer to different mappings). The current entity is provided as `__THIS__`. The check
supports two different variants of queries, which differ in the interpretation of the result.

#### Grouped Query
The first type of supported SQL query returns multiple records, each having two columns whose names are irrelevant. The
first column should be a boolean indicating whether the test succeeded, while the second column should be an integer
containing the number of records.

| Column | Data Type | Remark |
|--------|-----------|-------------------------------------------------------|
| 1.     | `BOOL`    | Either `TRUE` or `FALSE` indicating success or failure |
| 2. | `LONG` | Number of records with `TRUE` or `FALSE` test result |

Typically, a result set would contain two records: one with `TRUE` in the first column and the number of records which
passed the test in the second column, and one with `FALSE` in the first column and the number of failed records in the
second column.

The following example will check for duplicate values of the column `transaction_id`:
```yaml
kind: sql
query: |
WITH dups AS (
SELECT
tx.transaction_id,
COUNT(*) AS cnt
FROM __this__ tx
GROUP BY transaction_id
)
SELECT
cnt = 1,
COUNT(*)
FROM dups
GROUP BY 1
```

#### One-Record Query
The second type of supported SQL query must return a single row that includes one boolean column called `success`.
The other columns are not interpreted by Flowman and serve only as informational columns.

The following query will compare the number of records in two mappings `raw_transactions` and `processed_transactions`.
The check succeeds if the numbers match, otherwise it fails. The number of records of each mapping is provided as
additional values which will be shown in the documentation.
```yaml
kind: sql
query: |
SELECT
(SELECT COUNT(*) FROM raw_transactions) AS original_tx_count,
(SELECT COUNT(*) FROM processed_transactions) AS final_tx_count,
(SELECT COUNT(*) FROM raw_transactions) = (SELECT COUNT(*) FROM processed_transactions) AS success
```


* `kind` **(mandatory)** *(string)*: `sql`
* `query` **(mandatory)** *(string)*: SQL `SELECT` query
* `filter` **(optional)** *(string)*:
Optional SQL expression applied as a filter to select only a subset of all records for quality check. This is useful
to exclude records with known quality issues.
Binary file added docs/images/console-01.png
2 changes: 1 addition & 1 deletion docs/index.md
@@ -97,10 +97,10 @@ Flowman also provides optional plugins which extend functionality. You can find
quickstart
concepts/index
tutorial/index
cli/index
spec/index
testing/index
documenting/index
cli/index
setup/index
connectors/index
plugins/index