Merge branch 'develop'
kupferk committed Apr 28, 2022
2 parents a8f4a76 + b742d49 commit ff137b7
Showing 67 changed files with 863 additions and 312 deletions.
2 changes: 1 addition & 1 deletion .editorconfig
@@ -17,7 +17,7 @@ trim_trailing_whitespace = true
insert_final_newline = true

[*.scala]
indent_size = 2
indent_size = 4

[*.yml]
indent_size = 2
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,11 @@
# Version 0.24.1 - 2022-04-28

* github-175: '--jobs' parameter starts way too many parallel jobs
* github-176: start-/end-date in report should not be the same
* github-177: Implement generic SQL schema check
* github-179: Update DeltaLake dependency to 1.2.0


# Version 0.24.0 - 2022-04-05

* github-168: Support optional filters in data quality checks
33 changes: 23 additions & 10 deletions QUICKSTART.md
@@ -2,8 +2,7 @@

This quickstart guide will walk you through the installation of Apache Spark and Flowman on your local Linux box. If you
are using Windows, you will find some hints for setting up the required "Hadoop WinUtils", but we generally recommend
to use Linux. You can also run a [Flowman Docker image](docs/setup/docker.md), which is the simplest way to get up to
speed.
to use Linux. You can also run a [Flowman Docker image](setup/docker.md), which is the simplest way to get up to speed.


## 1. Install Spark
@@ -25,7 +24,9 @@ homepage. So we download the appropriate Spark distribution from the Apache archive.
mkdir playground
cd playground

# Download and unpack Spark & Hadoop

curl -L https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz | tar xvzf -# Create a nice link
curl -L https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz | tar xvzf -

# Create a nice link
ln -snf spark-3.2.1-bin-hadoop3.2 spark
```
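
If other tools on your machine should pick up this Spark installation as well, you can additionally export the
conventional `SPARK_HOME` environment variable pointing at the link just created (a small hedged sketch; whether your
setup needs this depends on how you start Flowman):
```
export SPARK_HOME=$(pwd)/spark
```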

@@ -104,11 +105,11 @@ Now we can inspect some of the relations defined in the project. First we list all relations:
flowman:weather> relation list
```

Now we can peek inside the relations `stations-raw` and `measurements-raw`. Since the second relation is partitioned
Now we can peek inside the relations `stations_raw` and `measurements_raw`. Since the second relation is partitioned
by years, we explicitly specify the year via the option `-p year=2011`
```
flowman:weather> relation show stations-raw
flowman:weather> relation show measurements-raw -p year=2011
flowman:weather> relation show stations_raw
flowman:weather> relation show measurements_raw -p year=2011
```

### Running a Job
@@ -130,9 +131,9 @@ flowman:weather> job enter main year=2011
Note how the prompt has changed and will now include the job name. Now we can inspect some mappings:
```
flowman:weather/main> mapping list
flowman:weather/main> mapping show measurements-raw
flowman:weather/main> mapping show measurements_raw
flowman:weather/main> mapping show measurements-extracted
flowman:weather/main> mapping show stations-raw
flowman:weather/main> mapping show stations_raw
```
Finally, we'd like to leave the job context again.
```
@@ -158,9 +159,21 @@ flowman:weather> history job search
flowman:weather> history target search -J 1
```


### Generating Documentation

Flowman can not only execute all the data transformations specified in the example project, it can also generate
documentation, which will be stored as an HTML file:
```
flowman:weather> documentation generate
```
This will create the file `examples/weather/generated-documentation/project.html`, which can be viewed in any web
browser of your choice.


### Quitting

Finally we quit the Flowman shell via the `quit` command.
Finally, we quit the Flowman shell via the `quit` command.
```
flowman:weather> quit
```
@@ -169,7 +182,7 @@ flowman:weather> quit
## 4. Flowman Batch Execution

So far we have only used the Flowman shell for interactive work with projects. Actually, the shell was developed as a
second step to help analyzing problems and debugging data flows. The primary command for working with Flowman projects
second step to help analyze problems and debug data flows. The primary command for working with Flowman projects
is `flowexec` which is used for non-interactive batch execution, for example within cron-jobs.
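
As a hedged sketch (the project path, job name and argument are assumptions carried over from the interactive session
above), a complete build could be started non-interactively like this:
```
flowexec -f examples/weather job build main year=2011
```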

It shares a lot of code with the Flowman shell, so the commands are often exactly the same. The main difference is
2 changes: 1 addition & 1 deletion docker/pom.xml
@@ -10,7 +10,7 @@
<parent>
<groupId>com.dimajix.flowman</groupId>
<artifactId>flowman-root</artifactId>
<version>0.24.0</version>
<version>0.24.1</version>
<relativePath>../pom.xml</relativePath>
</parent>

8 changes: 6 additions & 2 deletions docs/cli/index.md
@@ -1,6 +1,10 @@
# Flowman Executables
# Flowman CLI Tools

Flowman provides a small set of executables for working with projects.
![Flowman Shell in Action](../images/console-01.png)

Flowman provides a small set of command line interface (CLI) executables for working with projects. These are used
to execute projects as batch jobs, to investigate intermediate results of mappings, etc. Moreover, the Flowman History
Server provides a powerful web UI for keeping track of past runs.

```eval_rst
.. toctree::
2 changes: 1 addition & 1 deletion docs/concepts/concepts.md → docs/concepts/entities.md
@@ -1,4 +1,4 @@
# Core Concepts
# Core Entities

Flowman is a *data build tool* which uses a declarative syntax to specify what needs to be built. The main difference
from classical build tools like `make` or `maven` is that Flowman builds *data* instead of *applications* or *libraries*.
13 changes: 12 additions & 1 deletion docs/concepts/index.md
@@ -1,6 +1,17 @@
# Core Concepts

Flowman provides a small set of executables for working with projects.
Flowman reduces the development effort for creating robust and scalable data processing applications. At the heart
of Flowman are a few basic concepts which provide simple building blocks that can be used to build even complex
data transformations.

In order to appreciate the elegance and power of Flowman, it is important to understand the [core entities](entities.md),
which are used to model all the aspects of a data flow, like relations (which describe
data sources and sinks), mappings (which describe data transformations) and targets (which describe the actual work
to be performed).

In addition to understanding the core entities, it is also important to understand Flowman's execution model, which is
described in the [lifecycle documentation](lifecycle.md).


```eval_rst
.. toctree::
17 changes: 15 additions & 2 deletions docs/concepts/lifecycle.md
@@ -3,7 +3,7 @@
Flowman sees data as artifacts with a common lifecycle, from creation until deletion. The lifecycle itself consists of
multiple different phases, each of them representing one stage of the whole lifecycle.

## Lifecycle Phases
## Execution Phases

The full lifecycle consists of specific execution phases, as follows:

@@ -35,7 +35,7 @@ definitions, views and directories. It performs the opposite operation to the *CREATE* phase.

## Built In Lifecycles

Some of the execution phases can be performed in a meaningful way one after the other. Such a sequence of phases is
Some execution phases can be performed in a meaningful way one after the other. Such a sequence of phases is
called *lifecycle*. Flowman has the following lifecycles built in:

### Build
@@ -49,3 +49,16 @@ The second lifecycle contains only the single phase *TRUNCATE*
### Destroy

The last lifecycle contains only the single phase *DESTROY*


## Targets & Lifecycles

Each [target](../spec/target/index.md) supports a certain subset of execution phases; not all targets support all
phases. For example, the widely used [`relation` target](../spec/target/relation.md), which is used for creating data
sinks and writing new data into them, supports the phases `CREATE`, `BUILD`, `VERIFY`, `TRUNCATE` and `DESTROY`. On
the other hand, the [`measure` target](../spec/target/measure.md), which collects some data-dependent metrics, is only
executed during the `VERIFY` phase.

Of course, when a specific target participates in multiple execution phases, it will perform different actions in the
different phases. The documentation of each target describes the supported phases and the action performed in each of
them.
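
For illustration, here is a minimal sketch of a `relation` target (the target, mapping and relation names are
assumptions, loosely following the weather example):
```yaml
targets:
  measurements:
    kind: relation
    mapping: measurements_extracted
    relation: measurements
```
During the `CREATE` phase such a target would create the data sink, during `BUILD` it would write the output of the
mapping into it, and during `DESTROY` it would remove the sink again.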
123 changes: 122 additions & 1 deletion docs/documenting/checks.md
@@ -76,7 +76,8 @@ mappings:
- ["87600"]
```
## Available Column Checks
## Column Checks
Flowman implements a couple of different check types on a per-column basis.
@@ -101,6 +102,23 @@ so in many cases you might want to specify both `notNull` and `unique`.
to exclude records with known quality issues.


### Foreign Key

A `foreignKey` column check is used to ensure that all not-`NULL` values refer to existing entries in a different
mapping or relation.

* `kind` **(mandatory)** *(string)*: `foreignKey`
* `filter` **(optional)** *(string)*:
Optional SQL expression applied as a filter to select only a subset of all records for quality check. This is useful
to exclude records with known quality issues.
* `mapping` **(optional)** *(string)*: Name of mapping the foreign key refers to. You need to specify either the
`mapping` or the `relation` property.
* `relation` **(optional)** *(string)*: Name of relation the foreign key refers to. You need to specify either the
`mapping` or the `relation` property.
* `column` **(optional)** *(string)*: Name of the column in the referenced entity (either mapping or relation). If
this property is not set, then the same column name will be assumed.
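
A hedged example of such a check (the mapping and column names are assumptions, loosely following the weather example):
```yaml
kind: foreignKey
mapping: stations_raw
column: usaf
```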


### Specific Values

In order to test if a column only contains specific values, you can use the `values` test. Note that this test will
@@ -152,3 +170,106 @@ A very flexible test is provided with the SQL expression test. This test allows you to specify any simple SQL expression, which should evaluate to `TRUE` for all records passing the test.
* `filter` **(optional)** *(string)*:
Optional SQL expression applied as a filter to select only a subset of all records for quality check. This is useful
to exclude records with known quality issues.


## Schema Checks

In addition to checks for individual columns, Flowman also supports schema checks, which may refer to multiple columns.

### Primary Key
A `primaryKey` schema check is used to ensure that the given columns form a valid primary key, i.e. that the
combination of their values is unique across all records.

* `kind` **(mandatory)** *(string)*: `primaryKey`
* `filter` **(optional)** *(string)*:
Optional SQL expression applied as a filter to select only a subset of all records for quality check. This is useful
to exclude records with known quality issues.
* `columns` **(optional)** *(list:string)*: Names of the assumed primary key columns in the model
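
A hedged example (the column names are assumptions, loosely following the weather example):
```yaml
kind: primaryKey
columns:
  - usaf
  - wban
```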


### Foreign Key
A `foreignKey` schema check is used to ensure that all not-`NULL` values (which may span multiple columns) refer to
existing entries in a different mapping or relation.

* `kind` **(mandatory)** *(string)*: `foreignKey`
* `filter` **(optional)** *(string)*:
Optional SQL expression applied as a filter to select only a subset of all records for quality check. This is useful
to exclude records with known quality issues.
* `mapping` **(optional)** *(string)*: Name of mapping the foreign key refers to. You need to specify either the
`mapping` or the `relation` property.
* `relation` **(optional)** *(string)*: Name of relation the foreign key refers to. You need to specify either the
`mapping` or the `relation` property.
* `columns` **(optional)** *(list:string)*: Names of the columns in the model
* `references` **(optional)** *(list:string)*: Names of the corresponding columns in the referenced entity
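
A hedged example of a composite foreign key (the relation and column names are assumptions, loosely following the
weather example):
```yaml
kind: foreignKey
relation: stations
columns:
  - usaf
  - wban
references:
  - usaf
  - wban
```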


### SQL Expression
A very flexible test is provided with the SQL expression test. This test allows you to specify any simple SQL expression
(which may also use different columns), which should evaluate to `TRUE` for all records passing the test.

* `kind` **(mandatory)** *(string)*: `expression`
* `expression` **(mandatory)** *(string)*: Boolean SQL Expression
* `filter` **(optional)** *(string)*:
Optional SQL expression applied as a filter to select only a subset of all records for quality check. This is useful
to exclude records with known quality issues.
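
A hedged example (the column names are assumptions):
```yaml
kind: expression
expression: "start_date <= end_date"
```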


### SQL Query
A very flexible test is provided with the SQL query test. This test allows you to specify an arbitrary SQL `SELECT`
statement (which may also refer to different mappings). The current entity is provided as `__THIS__`. The check
supports two different variants of queries, which differ in the interpretation of the result.

#### Grouped Query
The first type of supported SQL query returns multiple records, each having two columns whose names are irrelevant. The
first column should be a boolean indicating whether the test succeeded, while the second column should be an integer
containing the number of records.

| Column | Data Type | Remark |
|--------|-----------|-------------------------------------------------------|
| 1.     | `BOOL`    | Either `TRUE` or `FALSE` indicating success or failure |
| 2. | `LONG` | Number of records with `TRUE` or `FALSE` test result |

Typically, a result set would contain two records: one with `TRUE` in the first column and the number of records which
passed the test in the second column, and one with `FALSE` in the first column and the number of failed records in the
second column.

The following example will check for duplicate values of the column `transaction_id`:
```yaml
kind: sql
query: |
WITH dups AS (
SELECT
tx.transaction_id,
COUNT(*) AS cnt
FROM __this__ tx
GROUP BY transaction_id
)
SELECT
cnt = 1,
COUNT(*)
FROM dups
GROUP BY 1
```

#### One-Record Query
The second type of supported SQL query must return a single row that includes one boolean column called `success`.
The other columns are not interpreted by Flowman and serve only as informational columns.

The following query will compare the number of records in two mappings `raw_transactions` and `processed_transactions`.
The check succeeds if the numbers match, otherwise it fails. The number of records of each mapping is provided as
additional values which will be shown in the documentation.
```yaml
kind: sql
query: |
SELECT
(SELECT COUNT(*) FROM raw_transactions) AS original_tx_count,
(SELECT COUNT(*) FROM processed_transactions) AS final_tx_count,
(SELECT COUNT(*) FROM raw_transactions) = (SELECT COUNT(*) FROM processed_transactions) AS success
```


* `kind` **(mandatory)** *(string)*: `sql`
* `query` **(mandatory)** *(string)*: SQL `SELECT` query
* `filter` **(optional)** *(string)*:
Optional SQL expression applied as a filter to select only a subset of all records for quality check. This is useful
to exclude records with known quality issues.
Binary file added docs/images/console-01.png
2 changes: 1 addition & 1 deletion docs/index.md
@@ -97,10 +97,10 @@ Flowman also provides optional plugins which extend functionality. You can find
quickstart
concepts/index
tutorial/index
cli/index
spec/index
testing/index
documenting/index
cli/index
setup/index
connectors/index
plugins/index