Merge branch 'main' into feature/web/inputs-outputs
wslulciuc authored Sep 25, 2023
2 parents ab2addf + 6003af6 commit 170e158
Showing 22 changed files with 87 additions and 148 deletions.
2 changes: 1 addition & 1 deletion .circleci/api-load-test.sh
@@ -14,7 +14,7 @@
set -e

# Build version of Marquez
-readonly MARQUEZ_VERSION=0.41.0-SNAPSHOT
+readonly MARQUEZ_VERSION=0.42.0-SNAPSHOT
# Fully qualified path to marquez.jar
readonly MARQUEZ_JAR="api/build/libs/marquez-api-${MARQUEZ_VERSION}.jar"

2 changes: 1 addition & 1 deletion .circleci/db-migration.sh
@@ -13,7 +13,7 @@
# Version of PostgreSQL
readonly POSTGRES_VERSION="14"
# Version of Marquez
-readonly MARQUEZ_VERSION=0.40.0
+readonly MARQUEZ_VERSION=0.41.0
# Build version of Marquez
readonly MARQUEZ_BUILD_VERSION="$(git log --pretty=format:'%h' -n 1)" # SHA1

2 changes: 1 addition & 1 deletion .env.example
@@ -1,4 +1,4 @@
API_PORT=5000
API_ADMIN_PORT=5001
WEB_PORT=3000
-TAG=0.40.0
+TAG=0.41.0
33 changes: 32 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,37 @@
# Changelog

-## [Unreleased](https://github.com/MarquezProject/marquez/compare/0.40.0...HEAD)
+## [Unreleased](https://github.com/MarquezProject/marquez/compare/0.41.0...HEAD)
+
+## [0.41.0](https://github.com/MarquezProject/marquez/compare/0.40.0...0.41.0) - 2023-09-20
+### Added
+* API: add support for the following parameters in the `SearchDao` [`#2556`](https://github.com/MarquezProject/marquez/pull/2556) [@tati](https://github.com/tati) [@wslulciuc](https://github.com/wslulciuc)
+*This PR updates the search endpoint to enforce `YYYY-MM-DD` for query params, use `YYYY-MM-DD` as `LocalDate`, and support the following query params:*
+- *`namespace` - matches jobs or datasets within the given namespace.*
+- *`before` - matches jobs or datasets before `YYYY-MM-DD`.*
+- *`after` - matches jobs or datasets after `YYYY-MM-DD`.*
+* Web: add paging on jobs and datasets [`#2614`](https://github.com/MarquezProject/marquez/pull/2614) [@phixme](https://github.com/phixMe)
+*Adds paging to jobs and datasets just like we already have on the lineage events page.*
+* Web: add tag descriptions to tooltips [`#2612`](https://github.com/MarquezProject/marquez/pull/2612) [@davidsharp7](https://github.com/davidsharp7)
+*Get the tag descriptions from the tags endpoint and when a column has a tag display the corresponding description on hover over. Context can be found [here](https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue).*
+* Web: add available column-level tags [`#2606`](https://github.com/MarquezProject/marquez/pull/2606) [@davidsharp7](https://github.com/davidsharp7)
+*Adds a new column called "tags" to the dataset column view along with the tags associated with the dataset column.*
+* Web: add HTML Tool Tip [`#2601`](https://github.com/MarquezProject/marquez/pull/2601) [@davidsharp7](https://github.com/davidsharp7)
+*Adds a Tool Tip to display basic node details.*
+
+### Fixed
+* Web: fix dataset saga for paging [`#2615`](https://github.com/MarquezProject/marquez/pull/2615) [@phixme](https://github.com/phixMe)
+*Updates the saga, changes the default page size.*
+* API: perf/improve `jobdao` query [`#2609`](https://github.com/MarquezProject/marquez/pull/2609) [@algorithmy1](https://github.com/algorithmy1)
+*Optimizes the query to make use of Common Table Expressions to fetch the required data more efficiently and before the join, fixing a significant bottleneck.*
+
+### Changed
+* Docker: Postgres `14` [`#2607`](https://github.com/MarquezProject/marquez/pull/2607) [@wslulciuc](https://github.com/wslulciuc)
+*Bumps the recommended version of Postgres to 14.*
+*When deploying locally, you might need to run `./docker/down.sh` to clean existing volumes.*
+
+### Removed
+* Client: tolerate null transformation attrs in field model [`#2600`](https://github.com/MarquezProject/marquez/pull/2600) [@davidjgoss](https://github.com/davidjgoss)
+*Removes the @NonNull annotation from the client class and the @NotNull from the model class.*

## [0.40.0](https://github.com/MarquezProject/marquez/compare/0.39.0...0.40.0) - 2023-08-15
### Added
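The 0.41.0 changelog entry above describes `namespace`, `before`, and `after` query params, with dates enforced as `YYYY-MM-DD`. The following is a minimal client-side sketch of that validation, not the server implementation. The helper names `parse_day` and `build_search_query`, and the `q` search-term parameter, are assumptions made for illustration and are not taken from this commit.

```python
import re
from datetime import date
from typing import Optional
from urllib.parse import urlencode

# Strict YYYY-MM-DD shape; the calendar check is done by date.fromisoformat.
_DAY_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_day(value: str) -> date:
    """Parse a YYYY-MM-DD string into a date, rejecting any other format."""
    if not _DAY_RE.fullmatch(value):
        raise ValueError(f"expected YYYY-MM-DD, got {value!r}")
    return date.fromisoformat(value)  # raises ValueError for e.g. 2023-02-30


def build_search_query(
    q: str,
    namespace: Optional[str] = None,
    before: Optional[str] = None,
    after: Optional[str] = None,
) -> str:
    """Build a URL query string; `q` is a hypothetical search-term param."""
    params = {"q": q}
    if namespace is not None:
        params["namespace"] = namespace
    if before is not None:
        params["before"] = parse_day(before).isoformat()
    if after is not None:
        params["after"] = parse_day(after).isoformat()
    return urlencode(params)


if __name__ == "__main__":
    print(build_search_query("orders", namespace="food_delivery", before="2023-09-20"))
```

Validating on the client side only gives earlier feedback; the server remains the authority on which values it accepts.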
2 changes: 1 addition & 1 deletion CODE_QUALITY_AND_SECURITY.md
@@ -26,7 +26,7 @@ The specific security and analysis methodologies that we employ include but are

For more information about our approach to quality and security, feel free to reach out to the Marquez development team:

-- Slack: [Marquezproject.slack.com](http://bit.ly/MarquezSlack)
+- Slack: [Marquezproject.slack.com](http://bit.ly/Marquez_invite)
- Twitter: [@MarquezProject](https://twitter.com/MarquezProject)

----
6 changes: 3 additions & 3 deletions CONTRIBUTING.md
@@ -3,14 +3,14 @@
We're excited you're interested in contributing to Marquez! We'd love your help, and there are plenty of ways to contribute:

* Give the repo a star
-* Join our [slack](http://bit.ly/MqzSlack) channel and leave us feedback or help with answering questions from the community
+* Join our [slack](http://bit.ly/Marquez_invite) channel and leave us feedback or help with answering questions from the community
* Fix or [report](https://github.com/MarquezProject/marquez/issues/new) a bug
* Fix or improve documentation
* For newcomers, pick up a ["good first issue"](https://github.com/MarquezProject/marquez/labels/good%20first%20issue), then send a pull request our way (see the [resources](#resources) section below for helpful links to get started)

-We feel that a welcoming community is important and we ask that you follow the [Contributor Covenant Code of Conduct](https://github.com/MarquezProject/marquez/blob/main/CODE_OF_CONDUCT.md) in all interactions with the community.
+We feel that a welcoming community is important and we ask that you follow the [Contributor Covenant Code of Conduct](https://github.com/MarquezProject/marquez/blob/main/CODE_OF_CONDUCT.md) in all interactions with the community.

-If you’re interested in using or learning more about Marquez, reach out to us on our [slack](http://bit.ly/MqzSlack) channel and follow [@MarquezProject](https://twitter.com/MarquezProject) for updates. We also encourage new comers to [join](https://lists.lfaidata.foundation/g/marquez-technical-discuss/ics/invite.ics?repeatid=32038) our monthly community meeting!
+If you’re interested in using or learning more about Marquez, reach out to us on our [slack](http://bit.ly/Marquez_invite) channel and follow [@MarquezProject](https://twitter.com/MarquezProject) for updates. We also encourage new comers to [join](https://lists.lfaidata.foundation/g/marquez-technical-discuss/ics/invite.ics?repeatid=32038) our monthly community meeting!

# Getting Your Changes Approved

2 changes: 1 addition & 1 deletion GOVERNANCE.md
@@ -100,7 +100,7 @@ Or a meeting may be at an organization's offices that are required to maintain a

## Marquez on Slack

-Marquez uses [a Slack community](http://bit.ly/MarquezSlack) to provide an ongoing dialogue between members.
+Marquez uses [a Slack community](https://bit.ly/Marquez_invite) to provide an ongoing dialogue between members.
This creates a recorded discussion of design decisions and discussions that complement the project meetings.

Follow the link above and register with the Slack service using your email address.
4 changes: 2 additions & 2 deletions README.md
@@ -12,7 +12,7 @@ Marquez is an open source **metadata service** for the **collection**, **aggrega
[![CircleCI](https://circleci.com/gh/MarquezProject/marquez/tree/main.svg?style=shield)](https://circleci.com/gh/MarquezProject/marquez/tree/main)
[![codecov](https://codecov.io/gh/MarquezProject/marquez/branch/main/graph/badge.svg)](https://codecov.io/gh/MarquezProject/marquez/branch/main)
[![status](https://img.shields.io/badge/status-active-brightgreen.svg)](#status)
-[![Slack](https://img.shields.io/badge/slack-chat-blue.svg)](http://bit.ly/MqzSlack)
+[![Slack](https://img.shields.io/badge/slack-chat-blue.svg)](https://bit.ly/Marquez_invite)
[![license](https://img.shields.io/badge/license-Apache_2.0-blue.svg)](https://raw.githubusercontent.com/MarquezProject/marquez/main/LICENSE)
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg)](CODE_OF_CONDUCT.md)
[![maven](https://img.shields.io/maven-central/v/io.github.marquezproject/marquez-api.svg)](https://search.maven.org/search?q=g:io.github.marquezproject)
@@ -160,7 +160,7 @@ Marquez listens on port `8080` for all API calls and port `8081` for the admin i

* Website: https://marquezproject.ai
* Source: https://github.com/MarquezProject/marquez
-* Chat: [MarquezProject Slack](https://bit.ly/MqzSlackInvite)
+* Chat: [MarquezProject Slack](https://bit.ly/Marquez_invite)
* Twitter: [@MarquezProject](https://twitter.com/MarquezProject)

## Contributing
2 changes: 1 addition & 1 deletion chart/Chart.yaml
@@ -29,4 +29,4 @@ name: marquez
sources:
- https://github.com/MarquezProject/marquez
- https://marquezproject.github.io/marquez/
-version: 0.40.0
+version: 0.41.0
4 changes: 2 additions & 2 deletions chart/values.yaml
@@ -17,7 +17,7 @@ marquez:
image:
registry: docker.io
repository: marquezproject/marquez
-tag: 0.40.0
+tag: 0.41.0
pullPolicy: IfNotPresent
## Name of the existing secret containing credentials for the Marquez installation.
## When this is specified, it will take precedence over the values configured in the 'db' section.
@@ -75,7 +75,7 @@ web:
image:
registry: docker.io
repository: marquezproject/marquez-web
-tag: 0.40.0
+tag: 0.41.0
pullPolicy: IfNotPresent
## Marquez website will run on this port
##
4 changes: 2 additions & 2 deletions clients/java/README.md
@@ -10,14 +10,14 @@ Maven:
<dependency>
<groupId>io.github.marquezproject</groupId>
<artifactId>marquez-java</artifactId>
-<version>0.40.0</version>
+<version>0.41.0</version>
</dependency>
```

or Gradle:

```groovy
-implementation 'io.github.marquezproject:marquez-java:0.40.0
+implementation 'io.github.marquezproject:marquez-java:0.41.0
```

## Usage
2 changes: 1 addition & 1 deletion clients/python/marquez_client/__init__.py
@@ -4,7 +4,7 @@
# -*- coding: utf-8 -*-

__author__ = """Marquez Project"""
-__version__ = "0.41.0"
+__version__ = "0.42.0"

from marquez_client.client import MarquezClient # noqa: F401
from marquez_client.clients import Clients # noqa: F401
2 changes: 1 addition & 1 deletion clients/python/setup.cfg
@@ -1,5 +1,5 @@
[bumpversion]
-current_version = 0.41.0
+current_version = 0.42.0
commit = False
tag = False
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?P<rc>.*)
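The `parse` setting in `clients/python/setup.cfg` above is a regular expression with named groups. Below is a minimal sketch of how that pattern splits a version string. It uses only Python's standard `re` module rather than bumpversion itself, and `split_version` is a hypothetical helper name.

```python
import re

# Pattern copied verbatim from the `parse` line of setup.cfg.
PARSE = re.compile(r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(?P<rc>.*)")


def split_version(version: str) -> dict:
    """Return the named parts of a version string as a dict of strings."""
    match = PARSE.fullmatch(version)
    if match is None:
        raise ValueError(f"not a major.minor.patch version: {version!r}")
    return match.groupdict()


if __name__ == "__main__":
    # `rc` captures whatever trails the patch number; it is empty for a plain release.
    print(split_version("0.42.0"))
    print(split_version("0.42.0rc1"))
```

Because the trailing `rc` group is `.*`, any suffix after the patch number is accepted, so the pattern alone does not constrain what a pre-release label looks like.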
2 changes: 1 addition & 1 deletion clients/python/setup.py
@@ -24,7 +24,7 @@

setup(
name="marquez-python",
-version="0.41.0",
+version="0.42.0",
description="Marquez Python Client",
long_description=readme,
long_description_content_type="text/markdown",
4 changes: 2 additions & 2 deletions docker/up.sh
@@ -8,9 +8,9 @@
set -e

# Version of Marquez
-readonly VERSION=0.40.0
+readonly VERSION=0.41.0
# Build version of Marquez
-readonly BUILD_VERSION=0.40.0
+readonly BUILD_VERSION=0.41.0

title() {
echo -e "\033[1m${1}\033[0m"
108 changes: 5 additions & 103 deletions docs/index.md
@@ -2,109 +2,11 @@
layout: index
---

-## Overview
-
-Marquez is an open source **metadata service** for the **collection**, **aggregation**, and **visualization** of a data ecosystem's metadata. It maintains the [provenance](https://en.wikipedia.org/wiki/Provenance#Data_provenance) of how datasets are consumed and produced, provides global visibility into job runtime and frequency of dataset access, centralization of dataset lifecycle management, and much more. Marquez was released and open sourced by [WeWork](https://www.wework.com).
-
-#### FEATURES
-
-* A reference implementation of the [OpenLineage](https://openlineage.io) standard
-* Centralized [metadata management](https://en.wikipedia.org/wiki/Metadata_management) powering:
-* Data lineage
-* [Data governance](https://en.wikipedia.org/wiki/Data_governance)
-* Data health
-* Data discovery **+** exploration
-* Precise and highly dimensional [data model](#data-model)
-* Datasets
-* Jobs
-* Runs
-* Easily collect metadata as [OpenLineage](https://openlineage.io) events via the [LineageAPI](https://marquezproject.github.io/marquez/openapi.html#tag/Lineage/paths/~1lineage/post)
-* **Datasets** as first-class values
-* **Enforcement** of _job_ and _dataset_ ownership
-* Simple operation and design with minimal dependencies
-* [RESTful API](./openapi.html) enabling sophisticated integrations with other systems:
-* [Airflow](https://airflow.apache.org)
-* [Amundsen](https://www.amundsen.io)
-* [dbt](https://www.getdbt.com)
-* [Spark](https://spark.apache.org/docs/latest/index.html)
-* Designed to promote a **healthy** data ecosystem where teams within an organization can seamlessly _share_ and _safely_ depend on one another's datasets with confidence
-
-## Why Marquez?
-
-Marquez enables highly flexible [data lineage](https://en.wikipedia.org/wiki/Data_lineage) queries across _all datasets_, while reliably and efficiently associating (_upstream_, _downstream_) dependencies between jobs and the datasets they produce and consume.
-
-<figure align="center">
-<img src="./assets/images/lineage.png">
-</figure>
-
-## Why manage and utilize metadata?
-
-<figure align="center">
-<img src="./assets/images/ecosystem.png">
-</figure>
-
-## Design
-
-Marquez is a modular system and has been designed as a highly scalable, highly extensible platform-agnostic solution for metadata management. It consists of the following system components:
-
-* **Metadata Repository**: Stores all job and dataset metadata, including a complete history of job runs and job-level statistics (i.e. total runs, average runtimes, success/failures, etc).
-* **Metadata API**: RESTful API enabling a diverse set of clients to begin interacting with metadata around dataset production and consumption.
-* **Metadata UI**: Used for dataset discovery, connecting multiple datasets and exploring their dependency graph.
-
-<br/>
-
-<figure align="center">
-<img src="./assets/images/ol-stack.svg">
-</figure>
-
-To ease adoption and enable a diverse set of data processing applications to build metadata collection as a core requirement into their design, Marquez implements the OpenLineage [specification](https://github.com/OpenLineage/OpenLineage/blob/main/spec/OpenLineage.yml). OpenLineage provides support for [Java](https://github.com/OpenLineage/OpenLineage/tree/main/client/java) and [Python](https://github.com/OpenLineage/OpenLineage/tree/main/client/python) as well as many [integrations](https://openlineage.io/integration).
-
-The Metadata API is an abstraction for recording information around the production and consumption of datasets. It's a low-latency, highly-available stateless layer responsible for encapsulating both metadata persistence and aggregation of lineage information. The API allows clients to collect and/or obtain dataset information to/from the [Metadata Repository](https://www.lucidchart.com/documents/view/f918ce01-9eb4-4900-b266-49935da271b8/0).
-
-Metadata needs to be collected, organized, and stored in a way to allow for rich exploratory queries via the [Metadata UI](https://github.com/MarquezProject/marquez/tree/main/web). The Metadata Repository serves as a catalog of dataset information encapsulated and cleanly abstracted away by the Metadata API.
-
-## Data Model
-
-Marquez's data model emphasizes immutability and timely processing of datasets. Datasets are first-class values produced by job runs. A job run is linked to _versioned_ code, and produces one or more immutable _versioned_ outputs. Dataset changes are recorded at different points in job execution via lightweight API calls, including the success or failure of the run itself.
-
-The diagram below shows the metadata collected and cataloged for a given job over multiple runs, and the time-ordered sequence of changes applied to its input dataset.
-
-<figure align="center">
-<img src="./assets/images/versioning.png">
-</figure>
-
-**Job**: A job has an `owner`, unique `name`, `version`, and optional `description`. A job will define one or more _versioned_ inputs as dependencies, and one or more _versioned_ outputs as artifacts. Note that it's possible for a job to have only input, or only output datasets defined.
-
-**Job Version:** A read-only _immutable_ `version` of a job, with a unique referenceable `link` to code preserving the reproducibility of builds from source. A job version associates one or more input and output datasets to a job definition (important for lineage information as data moves through various jobs over time). Such associations catalog provenance links and provide powerful visualizations of the flow of data.
-
-**Dataset:** A dataset has an `owner`, unique `name`, `schema`, `version`, and optional `description`. A dataset is contained within a datasource. A `datasource` enables the grouping of physical datasets to their physical source. A version `pointer` into the historical set of changes is present for each dataset and maintained by Marquez. When a dataset change is committed back to Marquez, a distinct version ID is generated, stored, then set to `current` with the pointer updated internally.
-
-**Dataset Version:** A read-only _immutable_ `version` of a dataset. Each version can be read independently and has a unique ID mapped to a dataset change preserving its state at some given point in time. The _latest_ version ID is updated only when a change to the dataset has been recorded. To compute a distinct version ID, Marquez applies a versioning function to a set of properties corresponding to the datasets underlying datasource.
-
-## Deployment
-
-To deploy and manage Marquez in a cloud environment, please follow our [deployment](deployment-overview.html) guide.
-
-## Contributing
-
-We're excited you're interested in contributing to Marquez! We'd love your help, and there are plenty of ways to contribute:
-
-* Fix or [report](https://github.com/MarquezProject/marquez/issues/new) a bug
-* Fix or improve documentation
-* Pick up a ["good first issue"](https://github.com/MarquezProject/marquez/labels/good%20first%20issue), then send a pull request our way
-
-We feel that a welcoming community is important and we ask that you follow the [Contributor Covenant Code of Conduct](https://github.com/MarquezProject/marquez/blob/main/CODE_OF_CONDUCT.md) in all interactions with the community.
-
-If you’re interested in using or learning more about Marquez, reach out to us on our [slack](http://bit.ly/MarquezSlack) channel and follow [@MarquezProject](https://twitter.com/MarquezProject) for updates. We also encourage new comers to [join](https://lists.lfaidata.foundation/g/marquez-technical-discuss/ics/invite.ics?repeatid=32038) our monthly community meeting!
-
-## Marquez Talks
-
-* [Data Lineage with Apache Airflow using OpenLineage](https://www.youtube.com/watch?v=qQAdpbNhxl8) by Julien Le Dem, Willy Lulciuc at Airflow Summit '21
-* [Data Lineage with Apache Airflow](https://www.datacouncil.ai/talks/data-lineage-with-apache-airflow) by Willy Lulciuc at Data Council SF '20
-* [Solving Data Lineage Tracking And Data Discovery At WeWork](https://www.dataengineeringpodcast.com/marquez-data-lineage-episode-111) on [The Data Engineering Podcast](https://www.dataengineeringpodcast.com/)
-* [Data Lineage with Apache Airflow using Marquez](https://www.youtube.com/watch?v=BIVUXruv5io) by Willy Lulciuc at CRUNCH '19
-* [Marquez: An Open Source Metadata Service for ML Platforms](https://www.slideshare.net/WillyLulciuc/marquez-an-open-source-metadata-service-for-ml-platforms) by Willy Lulciuc, Shawn Shah at AI NEXTCon SF '19
-* [Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-based Triggers](https://www.datacouncil.ai/speaker/marquez-a-metadata-service-for-data-abstraction-data-lineage-and-event-based-triggers) by Willy Lulciuc at DataEngConf NYC '18
+<!DOCTYPE html>
+<meta charset="utf-8">
+<title>Redirecting you to our new website!</title>
+<meta http-equiv="refresh" content="0; URL=https://marquezproject.ai/">
+<link rel="canonical" href="https://marquezproject.ai/">

----
SPDX-License-Identifier: Apache-2.0