Flowman

Flowman is a Spark-based data build tool that simplifies writing data transformations as part of ETL processes. The main idea is that users write purely declarative specifications that describe all details of the data sources, sinks and data transformations, instead of writing Spark jobs in Scala or Python. The main advantage of this approach is that many technical details of a correct and robust implementation are encapsulated, so the user can concentrate on the data transformations themselves.

In addition to writing and executing data transformations, Flowman can also be used for managing physical data models, i.e. Hive or JDBC tables. Flowman can create such tables from a specification with the correct schema. This helps to keep all aspects (like transformations and schema information) in a single place managed by a single program.

Notable Features

Semantics of a build tool like Maven, but for data instead of applications
Declarative syntax in YAML files
Data model management (create, migrate and destroy Hive tables, JDBC tables or file-based storage)
Generation of meaningful documentation
Flexible expression language
Jobs for managing build targets (like copying files or uploading data via sftp)
Automatic data dependency management within the execution of individual jobs
Meaningful logging output & rich set of execution metrics
Powerful yet simple command line tools
Extensible via plugins

Documentation

You can find the official homepage at Flowman.io and comprehensive documentation at Read the Docs.

Installation

You can either grab an appropriate pre-built package from GitHub or build your own version with Maven, as described in the Building section.

Installing the Packed Distribution

The packed distribution file is called flowman-{version}-bin.tar.gz and can be extracted at any location using

tar xvzf flowman-{version}-bin.tar.gz

Apache Spark

Flowman does not bring its own Spark libraries, but relies on a correctly installed Spark distribution. You can download appropriate packages directly from the Spark homepage (https://spark.apache.org).
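
Because a missing Spark installation is an easy thing to overlook, a small check before the first run can help. The following is only a sketch: the helper name find_spark_submit is hypothetical, and it assumes that Spark is discoverable either through the conventional SPARK_HOME environment variable or through PATH. How Flowman itself locates Spark is not covered here.

```shell
# Hypothetical helper, not part of Flowman.
# Print the path of a usable spark-submit, or fail if none is found.
find_spark_submit() {
  if [ -n "${SPARK_HOME:-}" ] && [ -x "$SPARK_HOME/bin/spark-submit" ]; then
    # Assumption: SPARK_HOME points to the root of a Spark distribution.
    echo "$SPARK_HOME/bin/spark-submit"
  elif command -v spark-submit >/dev/null 2>&1; then
    # Fall back to whatever spark-submit is reachable via PATH.
    command -v spark-submit
  else
    echo "no Spark installation found; set SPARK_HOME or extend PATH" >&2
    return 1
  fi
}
```

If the function succeeds, running the printed executable with --version shows which Spark release would be used.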

Hadoop Utils for Windows

If you are trying to run the application on Windows, you also need the Hadoop Winutils, a set of DLLs that the Hadoop libraries require in order to work. You can get a copy at https://github.com/kontext-tech/winutils. Once you have downloaded the appropriate version, place the DLLs into a directory $HADOOP_HOME/bin, where HADOOP_HOME refers to some location on your Windows PC. You also need to set the following environment variables:

HADOOP_HOME should point to the parent directory of the bin directory
PATH should also contain $HADOOP_HOME/bin
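
For a Bash-like shell on Windows (for example Git Bash), the two variables could be set as sketched below. The location /c/hadoop is only a placeholder assumption; substitute the directory whose bin subfolder actually holds the Winutils DLLs.

```shell
# Placeholder location (assumption): the parent directory of the bin folder
# that contains the Winutils DLLs. An already set HADOOP_HOME is kept.
HADOOP_HOME="${HADOOP_HOME:-/c/hadoop}"
export HADOOP_HOME

# Append $HADOOP_HOME/bin to PATH only if it is not already present.
case ":$PATH:" in
  *":$HADOOP_HOME/bin:"*) ;;
  *) PATH="$PATH:$HADOOP_HOME/bin" ;;
esac
export PATH
```

These settings only last for the current shell session. To make them permanent, add them to your shell profile or define them in the Windows environment variable settings.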

Command Line Utils

The primary tool provided by Flowman is called flowexec and is located in the bin folder of the installation directory.

General Usage

The flowexec tool has several subcommands for working with objects and projects. The general pattern looks as follows

flowexec [generic options] <cmd> <subcommand> [specific options and arguments]

For working with flowexec, either your current working directory needs to contain a Flowman project with a file project.yml or you need to specify the path to a valid project via

flowexec -f /path/to/project/folder <cmd>
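
The project lookup described above can be checked before flowexec is invoked. The sketch below uses a hypothetical helper named resolve_project, which is not part of Flowman. It relies only on the rule stated above that a project folder contains a file project.yml.

```shell
# Hypothetical helper, not part of Flowman.
# Print the project directory to use, or fail if it holds no project.yml.
# Usage: resolve_project [dir]   (defaults to the current directory)
resolve_project() {
  dir="${1:-.}"
  if [ -f "$dir/project.yml" ]; then
    echo "$dir"
  else
    echo "no project.yml in $dir" >&2
    return 1
  fi
}
```

The result can then be passed to the -f option, for instance by storing the output of resolve_project /path/to/project/folder in a variable and handing that variable to flowexec -f.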

Interactive Shell

With version 0.14.0, Flowman also introduced a new interactive shell for executing data flows. The shell can be started via

flowshell -f <project>

Within the shell, you can interactively build targets and inspect intermediate mappings.

Building

You can build your own version via Maven with

mvn clean install

Please also read BUILDING.md for detailed instructions, specifically on build profiles.

Contributing

You want to contribute to Flowman? Welcome! Please read CONTRIBUTING.md to understand what you can do.
