Commit

Merge branch 'develop'
kupferk committed Mar 18, 2022
2 parents 80a9ec4 + eaed485 commit 55199f2
Showing 395 changed files with 6,646 additions and 1,764 deletions.
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,13 @@
# Version 0.23.0 - 2022-03-18

* github-148: Support staging table for all JDBC relations
* github-120: Use staging tables for UPSERT and MERGE operations in JDBC relations
* github-147: Add support for PostgreSQL
* github-151: Implement column level lineage in documentation
* github-121: Correctly apply documentation, before/after and other common attributes to templates
* github-152: Implement new 'cast' mapping


# Version 0.22.0 - 2022-03-01

* Add new `sqlserver` relation
1 change: 1 addition & 0 deletions docker/conf/default-namespace.yml
@@ -53,6 +53,7 @@ plugins:
- flowman-mariadb
- flowman-mysql
- flowman-mssqlserver
- flowman-postgresql
- flowman-swagger
- flowman-openapi
- flowman-json
2 changes: 1 addition & 1 deletion docker/pom.xml
@@ -10,7 +10,7 @@
<parent>
<groupId>com.dimajix.flowman</groupId>
<artifactId>flowman-root</artifactId>
<version>0.22.0</version>
<version>0.23.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

12 changes: 6 additions & 6 deletions docs/cli/flowexec.md
@@ -30,7 +30,7 @@ or for inspecting individual entities.

## Project Commands
The most important command group is for executing a specific lifecycle or an individual phase for the whole project.
```shell script
```shell
flowexec project <create|build|verify|truncate|destroy> <args>
```
This will execute the whole job by executing the desired lifecycle for the `main` job. Additional parameters are
@@ -70,12 +70,12 @@ Similar to the project commands, individual jobs with different names than `main

### List Jobs
The following command will list all jobs defined in a project
```shell script
```shell
flowexec job list
```

### Execute Job phase
```shell script
```shell
flowexec job <create|build|verify|truncate|destroy> <job_name> <args>
```
This will execute the whole job by executing the desired lifecycle for the `main` job. Additional parameters are
@@ -125,12 +125,12 @@ inferior to using the `job` interface above, since typical jobs will also define
which might be required by targets.

### List Targets
```shell script
```shell
flowexec target list
```

### Execute Target phase
```shell script
```shell
flowexec target <create|build|verify|truncate|destroy> <target_name>
```
This will execute an individual target by executing the desired lifecycle for the `main` job. Additional parameters are
@@ -144,6 +144,6 @@ the whole lifecycle for `verify` includes the phases `create` and `build` and th
## Info Command
As a small debugging utility, Flowman also provides an `info` command, which simply shows all environment variables
and configuration settings.
```shell script
```shell
flowexec info
```
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -51,7 +51,7 @@

# General information about the project.
project = 'Flowman'
copyright = '2021, Kaya Kupferschmidt'
copyright = '2022, Kaya Kupferschmidt'
author = 'Kaya Kupferschmidt'

github_doc_root = 'https://github.com/dimajix/flowman/tree/master/docs/'
@@ -63,7 +63,7 @@
# The short X.Y version.
version = '0.22'
# The full version, including alpha/beta/rc tags.
release = '0.22.0'
release = '0.22.1'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
31 changes: 31 additions & 0 deletions docs/connectors/index.md
@@ -0,0 +1,31 @@
# Connectors

Flowman supports a broad range of data sources and sinks. Some are available directly in Flowman while others are
contained in [plugins](../plugins/index.md) to decrease code bloat when not required.

## Overview

The following table gives an overview of all currently supported data sources and sinks:

| Data Source | Supports Read | Supports Write | Plugin | Relation |
|----------------|----------------|----------------|----------------------------------------------|--------------------------------------------------------------------------------------------------|
| AWS S3 | yes | yes | [AWS](../plugins/aws.html) | [`file`](../spec/relation/file.html) |
| Avro files | yes | yes | N/A | [`file`](../spec/relation/file.html) |
| Azure ABS | yes | yes | [Azure](../plugins/azure.html) | [`file`](../spec/relation/file.html) |
| Azure SQL | yes | yes | [MS SQL Server](../plugins/mssqlserver.html) | [`sqlserver`](../spec/relation/sqlserver.html) |
| CSV files | yes | yes | N/A | [`file`](../spec/relation/file.html) |
| Delta Lake | yes | yes | [Delta](../plugins/delta.html) | [`deltaFile`](../spec/relation/deltaFile.html), [`deltaTable`](../spec/relation/deltaTable.html) |
| HDFS | yes | yes | N/A | [`file`](../spec/relation/file.html) |
| Hive | yes | yes | N/A | [`hiveTable`](../spec/relation/hiveTable.html), [`hiveView`](../spec/relation/hiveView.html) |
| Impala | yes (via Hive) | yes (via Hive) | [Impala](../plugins/impala.html) | [`hiveTable`](../spec/relation/hiveTable.html), [`hiveView`](../spec/relation/hiveView.html) |
| JSON files | yes | yes | N/A | [`file`](../spec/relation/file.html) |
| Kafka | yes | yes | [Kafka](../plugins/kafka.html) | [`kafka`](../spec/relation/kafka.html) |
| Local files | yes | yes | N/A | [`local`](../spec/relation/local.html) |
| MariaDB | yes | yes | [MariaDB](../plugins/mariadb.html) | [`jdbc`](../spec/relation/jdbcTable.html) |
| MySQL | yes | yes | [MySQL](../plugins/mysql.html) | [`jdbc`](../spec/relation/jdbcTable.html) |
| ORC files | yes | yes | N/A | [`file`](../spec/relation/file.html) |
| Parquet files | yes | yes | N/A | [`file`](../spec/relation/file.html) |
| PostgreSQL | yes | yes | N/A | [`jdbc`](../spec/relation/jdbcTable.html) |
| SQL Server | yes | yes | [MS SQL Server](../plugins/mssqlserver.html) | [`sqlserver`](../spec/relation/sqlserver.html) |
| Sequence files | yes | yes | N/A | [`file`](../spec/relation/file.html) |
| Text files | yes | yes | N/A | [`file`](../spec/relation/file.html) |
2 changes: 1 addition & 1 deletion docs/cookbook/data-quality.md
@@ -66,7 +66,7 @@ targets:
```

This example will publish two metrics, `record_count` and `column_sum`, which then can be sent to a
[metric sink](../spec/metric) configured in the [namespace](../spec/namespace.md).
[metric sink](../spec/metric/index.md) configured in the [namespace](../spec/namespace.md).


## When to use what
2 changes: 1 addition & 1 deletion docs/cookbook/validation.md
@@ -60,4 +60,4 @@ difference that it is executed after the `BUILD` phase.

Note that when you are concerned about the quality of the data produced by your Flowman job, the `verify` target
is only one of multiple possibilities to implement meaningful checks. Read more in the
[data quality cookbook](data-qualioty.md) about available options.
[data quality cookbook](data-quality.md) about available options.
20 changes: 12 additions & 8 deletions docs/documenting/checks.md
@@ -28,6 +28,7 @@ relations:
- name: year
description: "The year of the measurement, used for partitioning the data"
checks:
# Check that the column does not contain NULL values
- kind: notNull
- name: usaf
checks:
@@ -38,24 +39,27 @@ relations:
- name: air_temperature_qual
checks:
- kind: notNull
# Check that the column only contains the specified values
- kind: values
values: [0,1,2,3,4,5,6,7,8,9]
- name: air_temperature
checks:
# Perform an arbitrary check on the column, you can also access other columns
- kind: expression
expression: "air_temperature >= -100 OR air_temperature_qual <> 1"
- kind: expression
expression: "air_temperature <= 100 OR air_temperature_qual <> 1"
# Schema tests, which might involve multiple columns
checks:
kind: foreignKey
relation: stations
columns:
- usaf
- wban
references:
- usaf
- wban
# Check that each usaf/wban combination is a foreign key referring to the "stations" relation
kind: foreignKey
relation: stations
columns:
- usaf
- wban
references:
- usaf
- wban
```
## Available Column Checks
2 changes: 2 additions & 0 deletions docs/documenting/config.md
@@ -15,6 +15,8 @@ collectors:
- kind: mappings
# Collect documentation of build targets
- kind: targets
# Collect column level lineage
- kind: lineage
# Execute all checks
- kind: checks

2 changes: 2 additions & 0 deletions docs/documenting/mappings.md
@@ -24,8 +24,10 @@ mappings:
air_temperature: "CAST(SUBSTR(raw_data,88,5) AS FLOAT)/10"
air_temperature_qual: "SUBSTR(raw_data,93,1)"

# Explicit documentation section for annotating columns from above
documentation:
columns:
# You can document any column you like, you don't have to provide a description for all of them
- name: usaf
description: "The USAF (US Air Force) id of the weather station"
- name: wban
2 changes: 2 additions & 0 deletions docs/documenting/relations.md
@@ -20,9 +20,11 @@ relations:
type: integer
granularity: 1

# Explicit documentation section for annotating columns of the relation
documentation:
description: "The table contains all aggregated measurements"
columns:
# You can document any column you like, you don't have to provide a description for all of them
- name: country
description: "Country of the weather station"
- name: min_temperature
4 changes: 3 additions & 1 deletion docs/documenting/targets.md
@@ -1,6 +1,8 @@
# Documenting Targets

Flowman also supports documenting build targets.
Flowman also supports documenting build targets. There aren't many options or properties, since targets do not represent
any data or transformations themselves. Documenting them mainly serves to complete a technical reference for
developers.

## Example

9 changes: 5 additions & 4 deletions docs/index.md
@@ -25,14 +25,14 @@ application.

### Notable Features

* Declarative syntax in [YAML files](spec)
* Declarative syntax in [YAML files](spec/index.md)
* Full lifecycle management of data models (create, migrate and destroy Hive tables, JDBC tables or file based storage)
* Flexible expression language
* Jobs for managing build targets (like copying files or uploading data via sftp)
* Automatic dependency analysis to build targets in the correct order
* Powerful yet simple [command line tool for batch execution](cli/flowexec.md)
* Powerful [Command line tool for interactive data flow analysis](cli/flowshell.md)
* [History server](cli/flowman-server.md) that provides an overview of past jobs and targets including lineage
* [History server](history-server/index.md) that provides an overview of past jobs and targets including lineage
* [Metric system](cookbook/metrics.md) with the ability to publish these to servers like Prometheus
* Extendable via Plugins

@@ -55,7 +55,7 @@ following sections:

* [Flowman Executor](cli/flowexec.md): Documentation of the Flowman Executor CLI
* [Flowman Shell](cli/flowshell.md): Documentation of the Flowman Shell CLI
* [Flowman Server](cli/flowserver.md): Documentation of the Flowman Server CLI
* [Flowman Server](history-server/index.md): Documentation of the Flowman Server CLI


### Specification Documentation
@@ -74,7 +74,7 @@ More detail on all these items is described in the following sections:

### Cookbooks

* [Testing](cookbook/testing.md) How to implement tests in Flowman
* [Testing](testing/index.md) How to implement tests in Flowman
* [Kerberos](cookbook/kerberos.md) How to use Flowman in Kerberized environments


@@ -95,6 +95,7 @@ Flowman also provides optional plugins which extend functionality. You can find
concepts
installation
lifecycle
connectors/index
spec/index
testing/index
documenting/index
8 changes: 4 additions & 4 deletions docs/installation.md
@@ -80,7 +80,7 @@ As an alternative to downloading a pre-built distribution of Flowman, you might

Flowman is distributed as a `tar.gz` file, which simply needs to be extracted at some location on your computer or
server. This can be done via
```shell script
```shell
tar xvzf flowman-dist-X.Y.Z-bin.tar.gz
```

@@ -134,7 +134,7 @@ and Hadoop properties can be configured, like for example
* Generic Java options like http proxy and more

#### Example
```shell script
```shell
#!/usr/bin/env bash

# Specify Java home (just in case)
@@ -281,6 +281,6 @@ Please have a look at [Running Flowman on Windows](cookbook/windows.md) for deta
Please have a look at [Kerberos](cookbook/kerberos.md) for detailed information.


### Running in Docker
It is also possible to run Flowman inside Docker. We now also provide some images at
## 7. Running in Docker
It is also possible to [run Flowman inside Docker](cookbook/docker.md). We now also provide some images at
[Docker Hub](https://hub.docker.com/repository/docker/dimajix/flowman)
2 changes: 1 addition & 1 deletion docs/plugins/delta.md
@@ -11,7 +11,7 @@ move to Spark 3.0+.
## Provided Entities
* [`deltaTable` relation](../spec/relation/deltaTable.md)
* [`deltaFile` relation](../spec/relation/deltaFile.md)
* ['deltaVacuum' target](../spec/target/deltaVacuum.md)
* ['deltaVacuum' target](../spec/target/delta-vacuum.md)


## Activation
29 changes: 28 additions & 1 deletion docs/plugins/mariadb.md
@@ -1,6 +1,6 @@
# MariaDB Plugin

The MariaDB plugin mainly provides a JDBC driver to access MariaDB databases via the [JDBC relation](../spec/relation/jdbc.md)
The MariaDB plugin mainly provides a JDBC driver to access MariaDB databases via the [JDBC relation](../spec/relation/jdbcTable.md)


## Activation
@@ -10,3 +10,30 @@ The plugin can be easily activated by adding the following section to the [defau
plugins:
- flowman-mariadb
```
## Usage
In order to connect to a MariaDB database, you need to specify a [JDBC connection](../spec/connection/jdbc.md)
and use that one in a [JDBC relation](../spec/relation/jdbcTable.md) as follows:
```yaml
# First specify a connection. This can be used by multiple JDBC relations
connections:
  frontend:
    kind: jdbc
    driver: "org.mariadb.jdbc.Driver"
    url: "jdbc:mariadb://my-mariadb-database.domain.com"
    username: "my_username"
    password: "secret!password"

relations:
  frontend_users:
    kind: jdbcTable
    # Specify the name of the connection to use
    connection: frontend
    # Specify database
    database: "frontend"
    # Specify the table
    table: "users"
```
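Since a connection can be used by multiple JDBC relations, a second relation can refer to the same `frontend` connection. The following is only a minimal sketch that reuses the properties from the example above; the relation name `frontend_orders` and the table name `orders` are hypothetical and chosen purely for illustration:

```yaml
relations:
  # Hypothetical second relation, not part of the original example
  frontend_orders:
    kind: jdbcTable
    # Reuse the connection that `frontend_users` also refers to
    connection: frontend
    # Specify database
    database: "frontend"
    # Hypothetical table name, for illustration only
    table: "orders"
```

Defining the connection once keeps the driver, URL and credentials in a single place, so each additional relation only needs to name the connection.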