Changelog

Added

  • Spark: integration now emits intermediate, application level events wrapping entire job execution #1672 @mobuchowski
    Previously, the Spark event model described only single actions, potentially linked only to some parent run.
  • Flink: support multi topic Kafka Sink. #2372 @pawel-big-lebowski
    Supports multi-topic Kafka sinks. Limitation: the recordSerializer needs to implement KafkaTopicsDescriptor. Please refer to the limitations section in the documentation.
  • Spark: support built-in lineage within DataSourceV2Relation #2394 @pawel-big-lebowski
    Enables built-in lineage extraction from DataSourceV2Relation lineage nodes.
  • Spark: Add support for JobTypeJobFacet properties. #2410 @mattiabertorello
    Support job type properties within the Spark Job facet.
  • DBT: Add support for JobTypeJobFacet properties. #2411 @mattiabertorello
    Support job type properties within the DBT Job facet.

1.8.0 - 2024-01-19

  • Flink: support Flink 1.18 #2366 @HuangZhenQiu
    Adds support for the latest Flink version with 1.17 used for Iceberg Flink runtime and Cassandra Connector as these do not yet support 1.18.
  • Spark: add Gradle plugins to simplify the build process to support Scala 2.13 #2376 @d-m-h
    Defines a set of Gradle plugins to configure the modules and reduce duplication.
  • Spark: support multiple Scala versions LogicalPlan implementation #2361 @mattiabertorello
    In the LogicalPlanSerializerTest class, the implementation of the LogicalPlan interface is different between Scala 2.12 and Scala 2.13. In detail, the IndexedSeq changes package from the scala.collection to scala.collection.immutable. This implements both of the methods necessary in the two versions.
  • Spark: Use ScalaConversionUtils to convert Scala and Java collections #2357 @mattiabertorello
    This initial step is to start supporting compilation for Scala 2.13 in the 3.2+ Spark versions. Scala 2.13 changed the default collection to immutable, the methods to create an empty collection, and the conversion between Java and Scala. This causes the code to not compile between 2.12 and 2.13. This replaces the usage of direct Scala collection methods (like creating an empty object) and conversions utils with ScalaConversionUtils methods that will support cross-compilation.
  • Spark: support MERGE INTO queries on Databricks #2348 @pawel-big-lebowski
    Supports custom plan nodes used when running MERGE INTO queries on Databricks runtime.
  • Spark: Support Glue catalog in iceberg #2283 @nataliezeller1
    Adds support for the Glue catalog based on the 'catalog-impl' property (in this case we will not have a 'type' property).

Changed

  • Spark: Move Spark 3.1 code from the spark3 project #2365 @mattiabertorello
    Moves the Spark 3.1-related code to a specific project, spark31, so the spark3 project can be compiled with any Spark 3.x version.

Fixed

  • Airflow: add database information to SnowflakeExtractor #2364 @kacpermuda
    Fixes missing database information in SnowflakeExtractor.
  • Airflow: add dag_id to task_run_id to avoid duplicates #2358 @kacpermuda
    The lack of dag_id in task_run_id can cause duplicates in run_id across different dags.
  • Airflow: Add tests for column lineage facet and sql parser #2373 @kacpermuda
    Improves naming (database.schema.table) in SQLExtractor's column lineage facet and adds some unit tests.
  • Spark: fix removePathPattern behaviour #2350 @pawel-big-lebowski
    The removePathPattern feature was not always applied, because the method was called only when constructing a DatasetIdentifier through PathUtils, which is not always the case. This moves removePattern to another place in the codebase that is always run.
  • Spark: fix a type incompatibility in RddExecutionContext between Scala 2.12 and 2.13 #2360 @mattiabertorello
    The type of the function returned by ResultStage.func() differs in Spark between Scala 2.12 and 2.13, which makes the compilation fail. This avoids getting the function with an explicit type; instead, it gets it every time it is needed from the ResultStage object. This PR is part of the effort to support Scala 2.13 in the Spark integration.
  • Spark: Fix removePathPattern feature #2350 @pawel-big-lebowski
    Refactors code to make sure that all datasets sent are processed through removePathPattern if configured to do so.
  • Spark: Clean up the individual build.gradle files in preparation for Scala 2.13 support #2377 @d-m-h
    Cleans up the build.gradle files, consolidating the custom plugin and removing unused and unnecessary configuration.
  • Spark: refactor the Gradle plugins to make it easier to define Scala variants per module #2383 @d-m-h
    The third of several PRs to support producing Scala 2.12 and Scala 2.13 variants of the OpenLineage Spark integration. This PR refactors the custom Gradle plugins in order to make supporting multiple variants per module easier. This is necessary because the shared module fails its tests when consuming the Scala 2.13 variants of Apache Spark.

1.7.0 - 2023-12-21

Added

  • Airflow: add parent run facet to COMPLETE and FAIL events in Airflow integration #2320 @kacpermuda
    Adds a parent run facet to all events in the Airflow integration.

Fixed

  • Airflow: repair up.sh for MacOS #2316 #2318 @kacpermuda
    Some scripts were not working well on MacOS. This adjusts them.
  • Airflow: repair run_id for FAIL event in Airflow 2.6+ #2305 @kacpermuda
    The run_id in a FAIL event was different than in the START event for Airflow 2.6+.
  • Flink: open Iceberg TableLoader before loading a table #2314 @pawel-big-lebowski
    Fixes a potential NullPointerException in 1.17 when dealing with Iceberg sinks.
  • Flink: name Kafka datasets according to the naming convention #2321 @pawel-big-lebowski
    Adds a kafka:// prefix to Kafka topic datasets' namespaces.
  • Flink: fix properties within JobTypeJobFacet #2325 @pawel-big-lebowski
    Fixes properties assignment in the Flink visitor.
  • Spark: fix commons-logging relocate in target jar #2319 @pawel-big-lebowski
    Avoids relocating a dependency that was getting excluded from the jar.
  • Spec: fix inconsistency with Redshift authority format #2315 @davidjgoss
    Amends the Authority format for consistency with other references in the same section.

Removed

  • Airflow: remove Airflow 2.8+ support #2330 @kacpermuda
    To encourage use of the Provider, this removes the listener from the plugin if the Airflow version is <2.3.0 or >=2.8.0.

1.6.2 - 2023-12-07

Added

  • Dagster: support Dagster 1.5.x #2220 @tsungchih
    Gets event records for each target Dagster event type to support Dagster version 0.15.0+.
  • Dbt: add a new command dbt-ol send-events to send metadata of the last run without running the job #2285 @sophiely
    Adds a new command to send events to OpenLineage according to the latest metadata generated without running any dbt command.
  • Flink: add option for Flink job listener to read from Flink conf #2229 @ensctom
    Adds option for the Flink job listener to read jobnames and namespaces from Flink conf.
  • Spark: get column-level lineage from JDBC dbtable option #2284 @mobuchowski
    Adds support for dbtable, enables lineage in the case of single input columns, and improves dataset naming.
  • Spec: introduce JobTypeJobFacet to contain additional job related information #2241 @pawel-big-lebowski
    New JobTypeJobFacet contains the processing type such as BATCH|STREAMING, integration via SPARK|FLINK|... and job type in QUERY|COMMAND|DAG|....
  • SQL: add quote information from sqlparser-rs #2259 @JDarDagran
    Adds quote information from sqlparser-rs.
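
As a rough illustration of the JobTypeJobFacet entry above, the facet could be pictured as a plain Python dictionary. The field names `processingType`, `integration` and `jobType`, the `jobType` facet key and the surrounding `job` structure are assumptions inferred from the description; the spec remains the authoritative schema, and required metadata fields are left out of this sketch.

```python
import json


def build_job_type_facet(processing_type, integration, job_type):
    """Return a dict shaped like the JobTypeJobFacet described above.

    Field names are assumptions for illustration, not copied from the spec.
    """
    return {
        "processingType": processing_type,  # e.g. BATCH or STREAMING
        "integration": integration,         # e.g. SPARK or FLINK
        "jobType": job_type,                # e.g. QUERY, COMMAND or DAG
    }


# Illustrative job section; namespace and name are made-up values.
job = {
    "namespace": "example-namespace",
    "name": "example-job",
    "facets": {"jobType": build_job_type_facet("BATCH", "SPARK", "QUERY")},
}

print(json.dumps(job, indent=2))
```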

Fixed

  • Spark: update Jackson dependency to resolve CVE-2022-1471 #2185 @pawel-big-lebowski
    Updates Gradle for Spark and Flink to 8.1.1. Upgrades Jackson to 2.15.3.
  • Flink: avoid relying on Guava which can be missing during production runtime #2296 @pawel-big-lebowski
    Removes usage of Guava ImmutableList.
  • Spark: exclude commons-logging transitive dependency from published jar #2297 @pawel-big-lebowski
    Ensures commons-logging is not shipped as this can lead to a version mismatch on the user's side.

1.5.0 - 2023-11-01

Added

  • Flink: add Flink lineage for Cassandra Connectors #2175 @HuangZhenQiu
    Adds Flink Cassandra source and sink visitors and Flink Cassandra Integration test.
  • Spark: support rdd and toDF operations available in Spark Scala API #2188 @pawel-big-lebowski
    Includes the first Scala integration test, fixes ExternalRddVisitor and adds support for extracting inputs from MapPartitionsRDD and ParallelCollectionRDD plan nodes.
  • Spark: support Databricks Runtime 13.3 #2185 @pawel-big-lebowski
    Modifies the Spark integration to support the latest Databricks Runtime version.

Changed

  • Airflow: loosen attrs and requests versions #2107 @JDarDagran
    Lowers the version requirements for attrs and requests and removes an unnecessary dependency.
  • dbt: render yaml configs lazily #2221 @JDarDagran
    Don't render each entry in yaml files at start.

Fixed

  • Airflow/Athena: change dataset name to its location #2167 @sophiely
    Replaces the dataset and namespace with the data's physical location for more complete lineage across integrations.
  • Python client: skip redaction in column lineage facet #2177 @JDarDagran
    Redacted fields in ColumnLineageDatasetFacetFieldsAdditionalInputFields are now skipped.
  • Spark: unify dataset naming for RDD jobs and Spark SQL #2181 @pawel-big-lebowski
    Use the same mechanism for RDD jobs to extract dataset identifier as used for Spark SQL.
  • Spark: ensure a single START and a single COMPLETE event are sent #2103 @pawel-big-lebowski
    For Spark SQL, at least four events are sent, triggered by different SparkListener methods. Each of them is required and used to collect facets unavailable elsewhere. However, only one START and one COMPLETE event should be emitted; other events should be sent as RUNNING. Please keep in mind that the Spark integration remains stateless to limit the memory footprint, and it is the backend's responsibility to merge several OpenLineage events into a meaningful snapshot of metadata changes.

1.4.1 - 2023-10-09

Added

  • Client: allow setting client's endpoint via environment variable #2151 @mars-lan
    Enables setting this endpoint via environment variable because creating the client manually in Airflow is not possible.
  • Flink: expand Iceberg source types #2149 @HuangZhenQiu
    Adds support for FlinkIcebergSource and FlinkIcebergTableSource for Flink Iceberg lineage.
  • Spark: add debug facet #2147 @pawel-big-lebowski
    Adds an extra run facet to emitted events containing some system details (e.g., OS, Java, Scala version), classpath (e.g., package versions, jars included in the Spark job), SparkConf (like openlineage entries except auth, specified extensions, etc.) and LogicalPlan details (execution tree nodes' names). The SparkConf setting spark.openlineage.debugFacet=enabled needs to be set to include the facet. By default, the debug facet is disabled.
  • Spark: enable Nessie REST catalog #2165 @julwin
    Adds support for Nessie catalog in Spark.
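
One possible way to switch on the debug facet described above is to pass the setting as a `--conf` argument to `spark-submit`. The sketch below only renders the command line; the helper `to_spark_submit_args` and the application file `my_job.py` are hypothetical, and only the `spark.openlineage.debugFacet=enabled` setting comes from the entry itself.

```python
import shlex

# Setting named in the entry above; the debug facet stays off unless it is set.
DEBUG_FACET_CONF = {"spark.openlineage.debugFacet": "enabled"}


def to_spark_submit_args(conf):
    """Render a conf mapping as `--conf key=value` arguments (hypothetical helper)."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


args = to_spark_submit_args(DEBUG_FACET_CONF)

# my_job.py is a placeholder application file used only for illustration.
print(shlex.join(["spark-submit", *args, "my_job.py"]))
```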

1.3.1 - 2023-10-03

Added

  • Airflow: add some basic stats to the Airflow integration #1845 @harels
    Uses the statsd component that already exists in the Airflow codebase and wraps the section that emits to event with a timer, as well as emitting a counter for exceptions in sending the event.
  • Airflow: add columns as schema facet for airflow.lineage.Table (if defined) #2138 @erikalfthan
    Adds columns (if set) from airflow.lineage.Table inlets/outlets to the OpenLineage Dataset.
  • DBT: add SQLSERVER to supported dbt profile types #2136 @erikalfthan
    Adds support for dbt-sqlserver, solving #2129.
  • Spark: support for latest 3.5 #2118 @pawel-big-lebowski
    Integration tests are now run on Spark 3.5. Also upgrades 3.3 branch to 3.3.3. Please note that delta and iceberg are not supported for Spark 3.5 at this time.
  • Flink: expand Iceberg source types #2149 @HuangZhenQiu
    Adds Iceberg Source and Iceberg Table Source for Flink lineage.

Fixed

  • Airflow: fix find-links path in tox #2139 @JDarDagran
    Fixes a broken link.
  • Airflow: add more graceful logging when no OpenLineage provider installed #2141 @JDarDagran
    Recognizes a failed import of airflow.providers.openlineage and adds more graceful logging to fix a corner case.
  • Spark: fix bug in PathUtils' prepareDatasetIdentifierFromDefaultTablePath(CatalogTable) to correctly preserve scheme from CatalogTable's location #2142 @d-m-h
    Previously, the prepareDatasetIdentifierFromDefaultTablePath method would override the scheme with the value of "file" when constructing a dataset identifier. It now uses the scheme of the CatalogTable's URI for this. Thank you @pawel-big-lebowski for the quick triage and suggested fix.

1.2.2 - 2023-09-19

Added

  • Spark: publish the ProcessingEngineRunFacet as part of the normal operation of the OpenLineageSparkEventListener #2089 @d-m-h
    Publishes the spec-defined ProcessingEngineRunFacet alongside the custom SparkVersionFacet (for now). The SparkVersionFacet is deprecated and will be removed in a future release.
  • Spark: capture and emit spark.databricks.clusterUsageTags.clusterAllTags variable from databricks environment #2099 @Anirudh181001
    Adds spark.databricks.clusterUsageTags.clusterAllTags to the list of environment variables captured from databricks.

Fixed

  • Common: support parsing dbt_project.yml without target-path #2106 @tatiana
    As of dbt v1.5, usage of target-path in the dbt_project.yml file has been deprecated, now preferring a CLI flag or env var. It will be removed in a future version. This allows users to run DbtLocalArtifactProcessor in dbt projects that do not declare target-path.
  • Proxy: fix Proxy chart #2091 @harels
    Includes the proper image to deploy in the helm chart.
  • Python: fix serde filtering #2044 @xli-1026
    Fixes the bug causing values in list objects to be filtered accidentally.
  • Python: use non-deprecated apiKey if loading it from env variables #2029 @mobuchowski
    Changes api_key to apiKey in create_token_provider.
  • Spark: Improve RDDs on S3 integration. #2039 @pawel-big-lebowski
    Prepares integration test to access S3, fixes input dataset duplicates and includes other minor fixes.
  • Flink: prevent sending running events after job completes #2075 @pawel-big-lebowski
    Flink checkpoint tracking thread was not getting stopped properly on job complete.
  • Spark & Flink: Unify dataset naming from URI objects #2083 @pawel-big-lebowski
    Makes sure Spark and Flink generate same dataset identifiers for the same datasets by having a single implementation to generate dataset namespace and name.
  • Spark: Databricks improvements #2076 @pawel-big-lebowski
    Filters unwanted events on databricks and adds an integration test to verify this. Adds integration tests to verify dataset naming on databricks runtime is correct when table location is specified. Adds integration test for wide transformation on delta tables.

Removed

  • SQL: remove sqlparser dependency from iface-java and iface-py #2090 @JDarDagran
    Removes the dependency due to a breaking change in the latest release of the parser.

1.1.0 - 2023-08-23

Added

  • Flink: create Openlineage configuration based on Flink configuration #2033 @pawel-big-lebowski
    Flink configuration entries starting with openlineage.* are passed to the Openlineage client.
  • Java: add Javadocs to the Java client #2004 @julienledem
    The client was missing some Javadocs.
  • Spark: append output dataset name to a job name #2036 @pawel-big-lebowski
    Solves the problem of multiple jobs writing to different datasets while sharing the same job name. The feature is enabled by default, which results in different job names, and can be disabled by setting spark.openlineage.jobName.appendDatasetName to false. Unifies job names generated on the Databricks platform (using a dot job part separator instead of an underscore). The default behaviour can be altered with spark.openlineage.jobName.replaceDotWithUnderscore.
  • Spark: support Spark 3.4.1 #2057 @pawel-big-lebowski
    Bumps the latest Spark version to be covered in integration tests.
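
The Flink entry above says that configuration entries starting with openlineage.* are passed to the client. A minimal sketch of that kind of prefix selection is shown below, assuming the configuration is available as a flat key/value mapping; the helper name and the sample keys are hypothetical, and whether the real integration strips or keeps the prefix is not stated in the entry.

```python
PREFIX = "openlineage."


def extract_openlineage_entries(flink_conf):
    """Keep only the entries whose key starts with the openlineage. prefix.

    Keys are left unchanged in this sketch; the real integration may differ.
    """
    return {key: value for key, value in flink_conf.items() if key.startswith(PREFIX)}


# Sample configuration with made-up keys and values.
flink_conf = {
    "openlineage.transport.type": "http",
    "openlineage.transport.url": "http://localhost:5000",
    "parallelism.default": "2",
}

print(extract_openlineage_entries(flink_conf))
```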

Fixed

  • Airflow: do not use database as fallback when no schema parsed #2023 @mobuchowski
    Sets the schema to None in TablesHierarchy to skip filtering on the schema level in the information schema query.
  • Flink: fix a bug when getting schema for KafkaSink #2042 @pentium3
    Fixes the incomplete schema from KafkaSinkVisitor by changing the KafkaSinkWrapper to catch schemas of type AvroSerializationSchema.
  • Spark: filter CreateView events #1968 #1987 @pawel-big-lebowski
    Clears events generated by logical plans having CreateView nodes as root.
  • Spark: fix MERGE INTO for delta tables identified by physical locations #2026
    Delta tables identified by physical locations were not properly recognized.
  • Spark: fix incorrect naming of JDBC datasets #2035 @mobuchowski
    Makes the namespace generated by the JDBC/Spark connector conform to the naming schema in the spec.
  • Spark: fix ignored event adaptive_spark_plan in Databricks #2061 @algorithmy1
    Removes adaptive_spark_plan from the excludedNodes in DatabricksEventFilter.

1.0.0 - 2023-08-01

Added

  • Airflow: convert lineage from legacy File definition #2006 @mobuchowski
    Adds coverage for File entity definition to enhance backwards compatibility.

Removed

  • Spec: remove facet ref from core #1997 @JDarDagran
    Removes references to facets from the core spec that broke compatibility with JSON schema specification.

Changed

  • Airflow: change log level to DEBUG when extractor isn't found #2012 @kaxil
    Changes log level from WARNING to DEBUG when an extractor is not available.
  • Airflow: make sure we cannot fail in thread despite direct execution #2010 @mobuchowski
    Ensures the listener is not failing tasks, even in unlikely scenarios.

Fixed

  • Airflow: stop using reusable session by default, do not send full event on Snowflake complete #2025 @mobuchowski
    Fixes the issue of the Snowflake connector clashing with HttpTransport by disabling automatic requests session reuse and not running SnowflakeExtractor again on job completion.
  • Client: fix error message to avoid confusion #2001 @mars-lan
    Fixes the error message in HttpTransport in the case of a null URL.

0.30.1 - 2023-07-25

Added

  • Flink: support Iceberg sinks #1960 @pawel-big-lebowski
    Detects output datasets when using an Iceberg table as a sink.
  • Spark: column-level lineage for merge into on delta tables #1958 @pawel-big-lebowski
    Makes column-level lineage support merge into on Delta tables. Also refactors column-level lineage to deal with multiple Spark versions.
  • Spark: column-level lineage for merge into on Iceberg tables #1971 @pawel-big-lebowski
    Makes column-level lineage support merge into on Iceberg tables.
  • Spark: add support for Iceberg REST catalog #1963 @juancappi
    Adds rest to the existing options of hive and hadoop in IcebergHandler.getDatasetIdentifier() to add support for Iceberg's RestCatalog.
  • Airflow: add possibility to force direct-execution based on environment variable #1934 @mobuchowski
    Adds the option to use the direct-execution method on the Airflow listener when the existence of a non-SQLAlchemy-based Airflow event mechanism is confirmed. This happens when using Airflow 2.6 or when the OPENLINEAGE_AIRFLOW_ENABLE_DIRECT_EXECUTION environment variable exists.
  • SQL: add support for Apple Silicon to openlineage-sql-java #1981 @davidjgoss
    Expands the OS/architecture checks when compiling to produce a specific file for Apple Silicon. Also expands the corresponding OS/architecture checks when loading the binary at runtime from Java code.
  • Spec: add facet deletion #1975 @julienledem
    In order to add a mechanism for deleting job and dataset facets, adds a { _deleted: true } object that can take the place of any job or dataset facet (but not run or input/output facets, which are valid only for a specific run).
  • Client: add a file transport #1891 @Alexkuva
    Creates a FileTransport and its configuration classes supporting append mode or write-new-file mode, which is especially useful when an object store does not support append mode, e.g. in the case of Databricks DBFS FUSE.
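
The facet deletion entry above can be pictured with a small sketch: a `{ _deleted: true }` object takes the place of a job or dataset facet. The facet names `documentation` and `ownership`, their contents, and the helper `mark_deleted` are illustrative assumptions; only the marker object itself comes from the entry.

```python
import json

# Marker object described in the entry above.
DELETED = {"_deleted": True}


def mark_deleted(facets, name):
    """Return a copy of `facets` with facet `name` replaced by the deletion marker."""
    updated = dict(facets)
    updated[name] = dict(DELETED)
    return updated


# Illustrative job facets; the keys and values are made up for this sketch.
job_facets = {
    "documentation": {"description": "Nightly load"},
    "ownership": {"owners": [{"name": "team-data"}]},
}

print(json.dumps(mark_deleted(job_facets, "ownership"), indent=2))
```

The original mapping is left untouched, so the same helper can be applied to dataset facets in the same way.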

Changed

  • Airflow: do not run plugin if OpenLineage provider is installed #1999 @JDarDagran
    Sets OPENLINEAGE_DISABLED to true if the provider is installed.
  • Python: rename config to config_class #1998 @mobuchowski
    Renames the config class variable to config_class to avoid potential conflict with the config instance.

Fixed

  • Airflow: add workaround for airflow-sqlalchemy event mechanism bug #1959 @mobuchowski
    Due to known issues with the fork and thread model in the Airflow-SQLAlchemy-based event-delivery mechanism, a Kafka producer left alone does not emit a COMPLETE event. This creates a producer for each event when we detect that we're under Airflow 2.3 - 2.5.
  • Spark: fix custom environment variables facet #1973 @pawel-big-lebowski
    Enables sending the Spark environment variables facet in a non-deterministic way.
  • Spark: filter unwanted Delta events #1968 @pawel-big-lebowski
    Clears events generated by logical plans having Project node as root.
  • Python: allow modification of openlineage.* logging levels via environment variables #1974 @JDarDagran
    Adds OPENLINEAGE_{CLIENT/AIRFLOW/DBT}_LOGGING environment variables that can be set according to module logging levels and cleans up some logging calls in openlineage-airflow.
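
A minimal sketch of how environment variables following the OPENLINEAGE_{CLIENT/AIRFLOW/DBT}_LOGGING pattern above might be mapped onto Python logging levels. The logger names and the `configure_logging` helper are assumptions for illustration and are not taken from the client's source.

```python
import logging
import os

# Variable names follow the pattern in the entry above.
# The logger names on the right are assumptions for this sketch.
ENV_TO_LOGGER = {
    "OPENLINEAGE_CLIENT_LOGGING": "openlineage.client",
    "OPENLINEAGE_AIRFLOW_LOGGING": "openlineage.airflow",
    "OPENLINEAGE_DBT_LOGGING": "openlineage.dbt",
}


def configure_logging(environ=None):
    """Apply levels found in the environment and return what was applied."""
    environ = os.environ if environ is None else environ
    applied = {}
    for var, logger_name in ENV_TO_LOGGER.items():
        level_name = environ.get(var)
        if not level_name:
            continue
        # getLevelName returns an int for known level names, a string otherwise.
        level = logging.getLevelName(level_name.upper())
        if isinstance(level, int):
            logging.getLogger(logger_name).setLevel(level)
            applied[logger_name] = level
    return applied


print(configure_logging({"OPENLINEAGE_CLIENT_LOGGING": "debug"}))
```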

0.29.2 - 2023-06-30

Added

  • Flink: support Flink version 1.17.1 #1947 @pawel-big-lebowski
    Adds support for Flink versions 1.15.4, 1.16.2 and 1.17.1.
  • Spark: support Spark 3.4 #1790 @pawel-big-lebowski
    Introduces support for latest Spark version 3.4.0, along with 3.2.4 and 3.3.2.
  • Spark: add Databricks platform integration test #1928 @pawel-big-lebowski
    Adds a Spark integration test to verify behavior on Databricks to be run manually in CircleCI when needed.
  • Spec: add static lineage event types #1880 @pawel-big-lebowski
    As a first step in implementing static lineage, this adds new DatasetEvent and JobEvent types to the spec, along with support for the new types in the Python client.

Removed

  • Proxy: remove unused Golang client approach #1926 @mobuchowski
    Removes the unused Golang proxy, rendered redundant by the fluentd proxy.
  • Req: bump minimum supported Python version to 3.8 #1950 @mobuchowski
    Python 3.7 is at EOL. This bumps the minimum supported version to 3.8 to keep the project aligned with the Python EOL schedule.

Fixed

  • Flink: fix KafkaSource with GenericRecord #1944 @pawel-big-lebowski
    Extracts the dataset schema from KafkaSource when a GenericRecord deserializer is used.
  • dbt: fix security vulnerabilities #1945 @JDarDagran
    Fixes vulnerabilities in the dbt integration and integration tests.

0.28.0 - 2023-06-12

Added

  • dbt: add Databricks compatibility #1829 @Ines70
    Enables launching OpenLineage with a Databricks profile.

Fixed

  • Fix type-checked marker and packaging #1913 @gaborbernat
    The client was not marking itself as type-annotated.
  • Python client: add schemaURL to run event #1917 @gaborbernat
    Adds the missing schemaURL to the client's RunState class.

0.27.2 - 2023-06-06

Fixed

  • Python client: deprecate client.from_environment, do not skip loading config #1908 @mobuchowski
    Deprecates the OpenLineage.from_environment method and recommends using the constructor instead.

0.27.1 - 2023-06-05

Added

  • Python client: add emission filtering mechanism and exact, regex filters #1878 @mobuchowski
    Adds configurable job-name filtering to the Python client. Filters can be exact-match- or regex-based. Events will not be sent in the case of matches.
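
The filtering behaviour described above (exact-match or regex-based filters on the job name, with matching events not being sent) can be sketched in plain Python. The filter dictionaries and the `should_skip` function are hypothetical and do not reproduce the client's actual configuration format or API.

```python
import re

# Hypothetical filter configuration: each filter is exact-match or regex-based.
FILTERS = [
    {"type": "exact", "match": "tmp_cleanup"},
    {"type": "regex", "regex": r"^scratch_.*"},
]


def should_skip(job_name, filters):
    """Return True when any filter matches, meaning the event would not be sent."""
    for f in filters:
        if f["type"] == "exact" and job_name == f["match"]:
            return True
        if f["type"] == "regex" and re.match(f["regex"], job_name):
            return True
    return False


for name in ("tmp_cleanup", "scratch_2023", "daily_sales"):
    print(name, "skipped" if should_skip(name, FILTERS) else "sent")
```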

Fixed

  • Spark: fix column lineage for aggregate queries on databricks #1867 @pawel-big-lebowski
    Aggregate queries on databricks did not return column lineage.
  • Airflow: fix unquoted [ and ] in Snowflake URIs #1883 @JDarDagran
    Snowflake connections containing one of [ or ] were causing urllib.parse.urlparse to fail.
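
The failure mentioned above can be reproduced with the standard library alone: an unbalanced `[` in the authority part makes `urllib.parse.urlparse` raise a `ValueError`. Percent-encoding the offending component is one way around it; this is a sketch with made-up credentials and host names, not necessarily how the extractor was fixed.

```python
from urllib.parse import quote, unquote, urlparse

raw_password = "pa[ss"  # illustrative value containing an unquoted bracket
bad_uri = f"snowflake://user:{raw_password}@account/db"

try:
    urlparse(bad_uri)
    failed = False
except ValueError:
    # urlparse treats a lone bracket in the netloc as a malformed IPv6 host.
    failed = True

# Percent-encode the password before building the URI.
safe_uri = f"snowflake://user:{quote(raw_password, safe='')}@account/db"
parsed = urlparse(safe_uri)

print(failed, parsed.hostname, unquote(parsed.password))
```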

0.26.0 - 2023-05-18

Added

  • Proxy: Fluentd proxy support (experimental) #1757 @pawel-big-lebowski
    Adds a Fluentd data collector as a proxy to buffer Openlineage events and send them to multiple backends (among many other purposes). Also implements a Fluentd Openlineage parser to validate incoming HTTP events at the beginning of the pipeline. See the readme file for more details.

Changed

  • Python client: use Hatchling over setuptools to orchestrate Python env setup #1856 @gaborbernat
    Replaces setuptools with Hatchling for building the backend. Also includes a number of fixes, including to type definitions in transport and elsewhere.

Fixed

  • Spark: support single file datasets #1855 @pawel-big-lebowski
    Fixes the naming of single file datasets so they are no longer named using the parent directory's path: spark.read.csv('file.csv').
  • Spark: fix logicalPlan serialization issue on Databricks #1858 @pawel-big-lebowski
    Disables the spark_unknown facet by default to turn off serialization of logicalPlan.

0.25.0 - 2023-05-15

Added

Fixed

  • Spark: fix JDBC query handling #1808 @nataliezeller1
    Makes query handling more tolerant of variations in syntax and formatting.
  • Spark: filter Delta adaptive plan events #1830 @pawel-big-lebowski
    Extends the DeltaEventFilter class to filter events in cases where rewritten queries in adaptive Spark plans generate extra events.
  • Spark: fix Java class cast exception #1844 @Anirudh181001
    Fixes the error caused by the OpenLineageRunEventBuilder when it cast the Spark scheduler's ShuffleMapStage to boolean.
  • Flink: include missing fields of Openlineage events #1840 @pawel-big-lebowski
    Enriches Flink events so that missing eventTime, runId and job elements no longer produce errors.

0.24.0 - 2023-05-02

Added

  • Support custom transport types #1795 @nataliezeller1
    Adds a new interface, TransportBuilder, for creating custom transport types without having to modify core components of OpenLineage.
  • Airflow: dbt Cloud integration #1418 @howardyoo
    Adds a new OpenLineage extractor for dbt Cloud that uses the dbt Cloud hook provided by Airflow to communicate with dbt Cloud via its API.
  • Spark: support dataset name modification using regex #1796 @pawel-big-lebowski
    It is a common scenario to write Spark output datasets with a location path ending with /year=2023/month=04. The Spark parameter spark.openlineage.dataset.removePath.pattern introduced here allows for removing certain elements from a path with a regex pattern.
  • Spark: filter adaptive plan events #1830 @pawel-big-lebowski
    When a Spark plan is optimized, it is rewritten into an adaptive plan, which leads to duplicate OpenLineage events: one per normal plan and one per adaptive plan. This change filters out the latter.
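
To illustrate the idea behind spark.openlineage.dataset.removePath.pattern, the sketch below strips trailing partition directories such as /year=2023/month=04 from a path using a regular expression. The pattern, the named group `remove` and the helper `strip_partitions` are assumptions for illustration; the integration takes a Java regex, and its exact matching rules are not described in the entry.

```python
import re

# Hypothetical pattern: the named group "remove" captures trailing key=value directories.
# Python spells named groups (?P<name>...), whereas Java regex uses (?<name>...).
PATTERN = re.compile(r"(?P<remove>(/[^/=]+=[^/]+)+)$")


def strip_partitions(path):
    """Return `path` without the part captured by the "remove" group, if any."""
    match = PATTERN.search(path)
    if match is None:
        return path
    start, end = match.span("remove")
    return path[:start] + path[end:]


print(strip_partitions("s3://bucket/warehouse/table/year=2023/month=04"))
print(strip_partitions("s3://bucket/warehouse/table"))
```

With a pattern like this, runs that write different partitions would report the same dataset name.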

Fixed

  • Spark: catch exception when trying to obtain details of non-existing table. #1798 @pawel-big-lebowski
    This mostly happens when getting table details on START event while the table is still not created.
  • Spark: LogicalPlanSerializer #1792 @pawel-big-lebowski
    Changes LogicalPlanSerializer to make use of non-shaded Jackson classes in order to serialize LogicalPlans. Note: class names are no longer serialized.
  • Flink: fix Flink CI #1801 @pawel-big-lebowski
    Specifies an older image version that succeeds on CI in order to fix the Flink integration.

0.23.0 - 2023-04-20

Added

  • SQL: parser improvements to support: copy into, create stage, pivot #1742 @pawel-big-lebowski
    Adds support for additional syntax available in sqlparser-rs.
  • dbt: add support for snapshots #1787 @JDarDagran
    Adds support for this special kind of table representing type-2 Slowly Changing Dimensions.

Changed

  • Spark: change custom column lineage visitors #1788 @pawel-big-lebowski
    Makes the CustomColumnLineageVisitor interface public to support custom column lineage.

Fixed

  • Spark: fix null pointer in JobMetricsHolder #1786 @pawel-big-lebowski
    Adds a null check before running put to fix an NPE occurring in JobMetricsHolder.
  • SQL: fix query with table generator #1783 @pawel-big-lebowski
    Allows TableFactor::TableFunction to support queries containing table functions.
  • SQL: fix rust code style bug #1785 @pawel-big-lebowski
    Fixes a minor style issue in visitor.rs.

Removed

  • Airflow: Remove explicit pass from several extract_on_complete methods #1771 @JDarDagran
    Removes the code from three extractors.

0.22.0 - 2023-04-03

Added

  • Spark: properties facet #1717 @tnazarew
    Adds a new facet to capture specified Spark properties.
  • SQL: SQLParser supports alter, truncate and drop statements #1695 @pawel-big-lebowski
    Adds support for the statements to the parser.
  • Common/SQL: provide public interface for openlineage_sql package #1727 @JDarDagran
    Provides a .pyi public interface file for providing typing hints.
  • Java client: add configurable headers to HTTP transport #1718 @tnazarew
    Adds custom header handling to HttpTransport and the Spark integration.
  • Python client: create client from dictionary #1745 @JDarDagran
    Adds a new from_dict method to the Python client to support creating it from a dictionary.

Changed

  • Spark: remove URL parameters for JDBC namespaces #1708 @tnazarew
    Makes the namespace value from an event conform to the naming convention specified in Naming.md.
  • Airflow: make OPENLINEAGE_DISABLED case-insensitive #1705 @jedcunningham
    Makes the environment variable for disabling OpenLineage in the Python client and Airflow integration case-insensitive.

Fixed

  • Spark: fix missing BigQuery class in column lineage #1698 @pawel-big-lebowski
    The Spark integration now checks if the BigQuery classes are available on the classpath before attempting to use them.
  • DBT: throw UnsupportedDbtCommand when finding unsupported entry in args.which #1724 @JDarDagran
    Adjusts the dbt-ol script to detect DBT commands in run_results.json only.

Removed

  • Spark: remove unnecessary warnings for column lineage #1700 @pawel-big-lebowski
    Removes the warnings about OneRowRelation and LocalRelation nodes.
  • Spark: remove deprecated configs #1711 @tnazarew
    Removes support for deprecated configs.

0.21.1 - 2023-03-02

Added

  • Clients: add DEBUG logging of events to transports #1633 @mobuchowski
    Ensures that the DEBUG loglevel on properly configured loggers will always log events, regardless of the chosen transport.
  • Spark: add CustomEnvironmentFacetBuilder class #1545 New contributor @Anirudh181001
    Enables the capture of custom environment variables from Spark.
  • Spark: introduce the new output visitors AlterTableAddPartitionCommandVisitor and AlterTableSetLocationCommandVisitor #1629 New contributor @nataliezeller1
    Adds visitors for extracting table names from the Spark commands AlterTableAddPartitionCommand and AlterTableSetLocationCommand. The intended use case is a custom transport for the OpenMetadata lineage API.
  • Spark: add column lineage for JDBC relations #1636 @tnazarew
    Adds column lineage information to JDBC events with data extracted from query by the SQL parser.
  • SQL: add linux-aarch64 native library to Java SQL parser #1664 @mobuchowski
    Adds a Linux-ARM version of the native library. The Java SQL parser interface had only Linux-x64 and MacOS universal binary variants previously.

Changed

  • Airflow: get table database in Athena extractor #1631 New contributor @rinzool
    Changes the extractor to get a table's database from the table.schema field or the operator default if the field is None.

Fixed

  • dbt: add dbt seed to the list of dbt-ol events #1649 New contributor @pohek321
    Ensures that dbt-ol test no longer fails when run against an event seed.
  • Spark: make column lineage extraction in Spark support caching #1634 @pawel-big-lebowski
    Collects column lineage from Spark logical plans that contain cached datasets.
  • Spark: add support for a deprecated config #1586 @tnazarew
    Maps the deprecated spark.openlineage.url to spark.openlineage.transport.url.
  • Spark: add error message in case of null in url #1590 @tnazarew
    Improves error logging in the case of undefined URLs.
  • Spark: collect complete event for really quick Spark jobs #1650 @pawel-big-lebowski
    Improves the collecting of OpenLineage events on SQL complete in the case of quick operations.
  • Spark: fix input/outputs for one node LogicalRelation plans #1668 @pawel-big-lebowski
    For simple queries like select col1, col2 from my_db.my_table that do not write output, the Spark plan contained just a single node, which was wrongly treated as both an input and output dataset.
  • SQL: fix file existence check in build script for openlineage-sql-java #1613 @sekikn
    Ensures that the build script works if the library is compiled solely for Linux.
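The deprecated-config mapping mentioned above (`spark.openlineage.url` to `spark.openlineage.transport.url`) can be pictured as a key rename over a config dictionary. This is a Python sketch only; the integration itself runs on the JVM, and the helper name and the rule that an explicitly set new key wins are assumptions.

```python
# Mapping taken from the changelog entry; everything else here is hypothetical.
DEPRECATED_KEYS = {"spark.openlineage.url": "spark.openlineage.transport.url"}


def migrate_deprecated_keys(conf: dict) -> dict:
    """Return a copy of conf with deprecated keys renamed to their replacements."""
    migrated = dict(conf)
    for old_key, new_key in DEPRECATED_KEYS.items():
        if old_key in migrated:
            value = migrated.pop(old_key)
            # Assumption: a value already set under the new key takes precedence.
            migrated.setdefault(new_key, value)
    return migrated
```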

Removed

  • Airflow: remove JobIdMapping and update macros to better support Airflow version 2+ #1645 @JDarDagran
    Updates macros to use OpenLineageAdapter's method to generate deterministic run UUIDs because using the JobIdMapping utility is incompatible with Airflow 2+.

0.20.6 - 2023-02-10

Added

  • Airflow: add new extractor for FTPFileTransmitOperator #1603 @sekikn
    Adds a new extractor for this Airflow operator serving legacy systems.

Changed

  • Airflow: make extractors for async operators work #1601 @JDarDagran
    Sends a deterministic Run UUID for Airflow runs.

Fixed

  • dbt: render actual profile only in profiles.yml #1599 @mobuchowski
    Adds an include_section argument for the Jinja render method to include only one profile if needed.
  • dbt: make compiled_code optional #1595 @JDarDagran
    Makes compiled_code optional for manifest > v7.

0.20.4 - 2023-02-07

Added

  • Airflow: add new extractor for GCSToGCSOperator #1495 @sekikn
    Adds a new extractor for this operator.
  • Flink: resolve topic names from regex, support 1.16.0 #1522 @pawel-big-lebowski
    Adds support for Flink 1.16.0 and makes the integration resolve topic names from Kafka topic patterns.
  • Proxy: implement lineage event validator for client proxy #1469 @fm100
    Implements logic in the proxy (which is still in development) for validating and handling lineage events.

Changed

  • CI: use ruff instead of flake8, isort, etc., for linting and formatting #1526 @mobuchowski
    Adopts the ruff package, which combines several linters and formatters into one fast binary.

Fixed

  • Airflow: make the Trino catalog non-mandatory #1572 @JDarDagran
    Makes the Trino catalog optional in the Trino extractor.
  • Common: add explicit SQL dependency #1532 @mobuchowski
    Addresses a 0.19.2 breaking change to the GE integration by including the SQL dependency explicitly.
  • DBT: adjust tqdm logging in dbt-ol #1549 @JDarDagran
    Adjusts tqdm to show the correct number of iterations and adds START events for parent runs.
  • DBT: fix typo in log output #1493 @denimalpaca
    Fixes 'emittled' typo in log output.
  • Great Expectations/Airflow: follow Snowflake dataset naming rules #1527 @mobuchowski
    Normalizes Snowflake dataset and datasource naming rules among DBT/Airflow/GE; canonicalizes old Snowflake account paths by expanding them to the full form with account, region and cloud names.
  • Java and Python Clients: Kafka does not initialize properties if they are empty; check and notify about Confluent-Kafka requirement #1556 @mobuchowski
    Fixes the failure to initialize KafkaTransport in the Java client and adds an exception if the required confluent-kafka module is missing from the Python client.
  • Spark: add square brackets for list-based Spark configs #1507 @Varunvaruns9
    Adds a condition to treat configs with [] as lists. Note: [] will be required for list-based configs starting with 0.21.0.
  • Spark: fix several Spark/BigQuery-related issues #1557 @mobuchowski
    Fixes the assumption that a version is always a number; adds support for HadoopMapReduceWriteConfigUtil; makes the integration access BigQueryUtil and getTableId using reflection, which supports all BigQuery versions; makes logs provide the full serialized LogicalPlan on debug.
  • SQL: only report partial failures #1479 @mobuchowski
    Changes the parser so it reports partial failures instead of failing the whole extraction.
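To illustrate the square-bracket rule for list-based Spark configs noted above, here is a small sketch of how such a value might be interpreted. The function name and the `;` separator are assumptions; the changelog only states that `[]` marks a list.

```python
def parse_config_value(raw: str, separator: str = ";"):
    """Treat a value wrapped in [] as a list; return any other value unchanged."""
    text = raw.strip()
    if text.startswith("[") and text.endswith("]"):
        inner = text[1:-1]
        # Drop empty items so that "[]" yields an empty list.
        return [item.strip() for item in inner.split(separator) if item.strip()]
    return raw
```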

0.19.2 - 2023-01-04

Added

  • Airflow: add Trino extractor #1288 @sekikn
    Adds a Trino extractor to the Airflow integration.
  • Airflow: add S3FileTransformOperator extractor #1450 @sekikn
    Adds an S3FileTransformOperator extractor to the Airflow integration.
  • Airflow: add standardized run facet #1413 @JDarDagran
    Creates one standardized run facet for the Airflow integration.
  • Airflow: add NominalTimeRunFacet and OwnershipJobFacet #1410 @JDarDagran
    Adds nominalEndTime and OwnershipJobFacet fields to the Airflow integration.
  • dbt: add support for postgres datasources #1417 @julienledem
    Adds the previously unsupported postgres datasource type.
  • Proxy: add client-side proxy (skeletal version) #1439 #1420 @fm100
    Implements a skeletal version of a client-side proxy.
  • Proxy: add CI job to publish Docker image #1086 @wslulciuc
    Includes a script to build and tag the image plus jobs to verify the build on every CI run and publish to Docker Hub.
  • SQL: add ExtractionErrorRunFacet #1442 @mobuchowski
    Adds a facet to the spec to reflect internal processing errors, especially failed or incomplete parsing of SQL jobs.
  • SQL: add column-level lineage to SQL parser #1432 #1461 @mobuchowski @StarostaGit
    Adds support for extracting column-level lineage from SQL statements in the parser, including adjustments to Rust-Python and Rust-Java interfaces and the Airflow integration's SQL extractor to make use of the feature. Also includes more tests, removal of the old parser, and removal of the common-build cache in CI (which was breaking the parser).
  • Spark: pass config parameters to the OL client #1383 @tnazarew
    Adds a mechanism for making new lineage consumers transparent to the integration, easing the process of setting up new types of consumers.

Fixed

  • Airflow: fix collect_ignore, add flags to Pytest for cleaner output #1437 @JDarDagran
    Removes the extractors directory from the ignored list, improving unit testing.
  • Spark & Java client: fix README typos @versaurabh
    Fixes typos in the SPDX license headers.

0.18.0 - 2022-12-08

Added

  • Airflow: support SQLExecuteQueryOperator #1379 @JDarDagran
    Changes the SQLExtractor and adds support for the dynamic assignment of extractors based on conn_type.
  • Airflow: introduce a new extractor for SFTPOperator #1263 @sekikn
    Adds an extractor for tracing file transfers between local file systems.
  • Airflow: add Sagemaker extractors #1136 @fhoda
    Creates extractors for SagemakerProcessingOperator and SagemakerTransformOperator.
  • Airflow: add S3 extractor for Airflow operators #1166 @fhoda
    Creates an extractor for the S3CopyObject in the Airflow integration.
  • Airflow: implement DagRun listener #1286 @mobuchowski
    The OpenLineage integration now explicitly emits DagRun start and DagRun complete or DagRun failed events, which allows precise tracking of individual DAGs.
  • Spec: add spec file for ExternalQueryRunFacet #1262 @howardyoo
    Adds a spec file to make this facet available for the Java client. Includes a README.
  • Docs: add a TSC doc #1303 @merobi-hub
    Adds a document listing the members of the Technical Steering Committee.

Changed

  • Spark: enable usage of other Transports via Spark configuration #1383 @tnazarew
    Moves OpenLineage client argument parsing from the Spark integration to the Java client.

Fixed

  • Spark: improve Databricks to send better events #1330 @pawel-big-lebowski
    Filters unwanted events and provides a meaningful job name.
  • Spark-Bigquery: fix a few of the common errors #1377 @mobuchowski
    Fixes a few common issues with the Spark-BigQuery integration, adds an integration test, and configures CI.
  • Python: validate eventTime field in Python client #1355 @pawel-big-lebowski
    Validates the eventTime of a RunEvent within the client library.
  • Databricks: Handle Databricks Runtime 11.3 changes to DbFsUtils constructor #1351 @wjohnson
    Recaptures lost mount point information from the DatabricksEnvironmentFacetBuilder and environment-properties facet by looking at the number of parameters in the DbFsUtils constructor to determine the runtime version.
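For the `eventTime` validation entry above, a client-side check could resemble the following sketch. The changelog does not state the exact rules the Python client enforces; the assumption here is that the field must parse as an ISO 8601 timestamp, and the function name is hypothetical.

```python
from datetime import datetime


def validate_event_time(event_time: str) -> str:
    """Raise ValueError unless event_time parses as an ISO 8601 timestamp."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    if event_time.endswith("Z"):
        normalized = event_time[:-1] + "+00:00"
    else:
        normalized = event_time
    datetime.fromisoformat(normalized)  # raises ValueError on malformed input
    return event_time
```

Validating at construction time surfaces a malformed timestamp in the producer rather than at the backend.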

0.17.0 - 2022-11-16

Added

  • Spark: support latest Spark 3.3.1 #1183 @pawel-big-lebowski
    Adds support for the latest Spark 3.3.1 version.
  • Spark: add Kinesis Transport and support config Kinesis in Spark integration #1200 @yogayang
    Adds support for sending to Kinesis from the Spark integration.
  • Spark: Disable specified facets #1271 @pawel-big-lebowski
    Adds the ability to disable specified facets from generated OpenLineage events.
  • Python: add facets implementation to Python client #1233 @pawel-big-lebowski
    Adds missing facets to the Python client.
  • SQL: add Rust parser interface #1172 @StarostaGit @mobuchowski
    Implements a Java interface in the Rust SQL parser, including a build script, native library loading mechanism, CI support and build fixes.
  • Proxy: add helm chart for the proxy backend #1068 @wslulciuc
    Adds a helm chart for deploying the proxy backend on Kubernetes.
  • Spec: include possible facets usage in spec #1249 @pawel-big-lebowski
    Extends the facets definition with a list of available facets.
  • Website: publish YML version of spec to website #1300 @rossturk
    Adds configuration necessary to make the OpenLineage website auto-generate openAPI docs when the spec is published there.
  • Docs: update language on nominating new committers #1270 @rossturk
    Updates the governance language to reflect the new policy on nominating committers.
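The "Disable specified facets" entry above amounts to filtering named facets out of an event before it is sent. The sketch below shows that idea on a plain dictionary; the function name and the event layout are illustrative assumptions, and the Spark integration's real implementation is not shown in this changelog.

```python
def drop_disabled_facets(event: dict, disabled: set) -> dict:
    """Return a copy of event whose run and job facets omit the disabled names."""
    cleaned = dict(event)
    for section in ("run", "job"):
        if section in cleaned:
            part = dict(cleaned[section])
            facets = part.get("facets", {})
            # Keep only facets whose names were not disabled.
            part["facets"] = {k: v for k, v in facets.items() if k not in disabled}
            cleaned[section] = part
    return cleaned
```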

Changed

  • Website: publish spec into new website repo location #1295 @rossturk
    Creates a new deploy key, adds it to CircleCI & GitHub, and makes the necessary changes to the release.sh script.
  • Airflow: change how pip installs packages in tox environments #1302 @JDarDagran
    Uses the deprecated resolver and the constraints files provided by Airflow to avoid potential issues caused by pip's new resolver.

Fixed

  • Airflow: fix README for running integration test #1238 @sekikn
    Updates the README for consistency with supported Airflow versions.
  • Airflow: add task_instance argument to get_openlineage_facets_on_complete #1269 @JDarDagran
    Adds the task_instance argument to DefaultExtractor.
  • Java client: fix up all artifactory paths #1290 @harels
    Fixes the artifactory paths in the CI build script that a previous PR left unchanged.
  • Python client: fix Mypy errors and adjust to PEP 484 #1264 @JDarDagran
    Adds a --no-namespace-packages argument to the Mypy command and adjusts code to PEP 484.
  • Website: release all specs since last_spec_commit_id, not just HEAD~1 #1298 @rossturk
    The script now ships all specs that have changed since .last_spec_commit_id.

Removed

  • Deprecate HttpTransport.Builder in favor of HttpConfig #1287 @collado-mike
    Deprecates the Builder in favor of HttpConfig only and replaces the existing Builder implementation by delegating to the HttpConfig.

0.16.1 - 2022-11-03

Added

  • Airflow: add dag_run information to Airflow version run facet #1133 @fm100
    Adds the Airflow DAG run ID to the taskInfo facet, making this additional information available to the integration.
  • Airflow: add LoggingMixin to extractors #1149 @JDarDagran
    Adds a LoggingMixin class to the custom extractor to make the output consistent with general Airflow and OpenLineage logging settings.
  • Airflow: add default extractor #1162 @mobuchowski
    Adds a DefaultExtractor to support the default implementation of OpenLineage for external operators without the need for custom extractors.
  • Airflow: add on_complete argument in DefaultExtractor #1188 @JDarDagran
    Adds support for running another method on extract_on_complete.
  • SQL: reorganize the library into multiple packages #1167 @StarostaGit @mobuchowski
    Splits the SQL library into a Rust implementation and foreign language bindings, easing the process of adding language interfaces. Also contains a CI fix.

Changed

  • Airflow: move get_connection_uri as extractor's classmethod #1169 @JDarDagran
    The get_connection_uri method allowed for too many params, resulting in unnecessarily long URIs. This changes the logic to whitelisting per extractor.
  • Airflow: change get_openlineage_facets_on_start/complete behavior #1201 @JDarDagran
    Splits up the method for greater legibility and easier maintenance.
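The per-extractor whitelisting described for `get_connection_uri` can be sketched with the standard library as below. The helper name and the example parameter names are hypothetical; only the idea of keeping an allowed set of query parameters comes from the entry above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def filter_query_params(uri: str, allowed: set) -> str:
    """Keep only the query parameters whose names are in allowed."""
    parts = urlsplit(uri)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed
    ]
    # Rebuild the URI with the reduced query string.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

An allow-list fails closed: a parameter that nobody anticipated is dropped rather than leaked into the URI.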

Fixed

  • Airflow: always send SQL in SqlJobFacet as a string #1143 @mobuchowski
    Changes the data type of query from array to string to fix an error in the RedshiftSQLOperator.
  • Airflow: include __extra__ case when filtering URI query params #1144 @JDarDagran
    Includes the conn.EXTRA_KEY in the get_connection_uri method to avoid exposing secrets in URIs via the __extra__ key.
  • Airflow: enforce column casing in SQLCheckExtractors #1159 @denimalpaca
    Uses the parent extractor's _is_uppercase_names property to determine if the column should be upper cased in the SQLColumnCheckExtractor's _get_input_facets() method.
  • Spark: prevent exception when no schema provided #1180 @pawel-big-lebowski
    Prevents evaluation of column lineage when the schema facet is null.
  • Great Expectations: add V3 API compatibility #1194 @denimalpaca
    Fixes the Pandas datasource to make it V3 API-compatible.

Removed

  • Airflow: remove support for Airflow 1.10 #1128 @mobuchowski
    Removes the code structures and tests enabling support for Airflow 1.10.

0.15.1 - 2022-10-05

Added

  • Airflow: improve development experience #1101 @JDarDagran
    Adds an interactive development environment to the Airflow integration and improves integration testing.
  • Spark: add description for URL parameters in readme, change overwriteName to appName #1130 @tnazarew
    Adds more information about passing arguments with spark.openlineage.url and changes overwriteName to appName for clarity.
  • Documentation: update issue templates for proposal & add new integration template #1116 @rossturk
    Adds a YAML issue template for new integrations and fixes a bug in the proposal template.

Changed

  • Airflow: lazy load BigQuery client #1119 @mobuchowski
    Moves import of the BigQuery client from top level to local level to decrease DAG import time.
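The lazy-loading pattern referred to above moves an import from module level into the function that needs it. In this sketch `sqlite3` stands in for a heavier client library, and the function name is hypothetical.

```python
def ping_database(db_path: str = ":memory:") -> int:
    """Run a trivial query, importing the driver only when first called."""
    # Local import: the cost is paid on the first call, not when this module
    # (in Airflow's case, the DAG file) is imported.
    import sqlite3  # stand-in for a heavier client library

    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT 1").fetchone()[0]
    finally:
        conn.close()
```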

Fixed

  • Airflow: fix UUID generation conflict for Airflow DAGs with same name #1056 @collado-mike
    Adds a namespace to the UUID calculation to avoid conflicts caused by DAGs having the same name in different namespaces in Airflow deployments.
  • Spark/BigQuery: fix issue with spark-bigquery-connector >=0.25.0 #1111 @pawel-big-lebowski
    Makes the Spark integration compatible with the latest connector.
  • Spark: fix column lineage #1069 @pawel-big-lebowski
    Fixes a null pointer exception error and an error when openlineage.timeout is not provided.
  • Spark: set log level of Init OpenLineageContext to DEBUG #1064 @varuntestaz
    Prevents sensitive information from being logged unless debug mode is used.
  • Java client: update version of SnakeYAML #1090 @TheSpeedding
    Bumps the SnakeYAML library version to include a key bug fix.
  • dbt: remove requirement for OPENLINEAGE_URL to be set #1107 @mobuchowski
    Removes erroneous check for OPENLINEAGE_URL in the dbt integration.
  • Python client: remove potentially cyclic import #1126 @mobuchowski
    Hides imports to remove potentially cyclic import.
  • CI: build macos release package on medium resource class #1131 @mobuchowski
    Fixes failing build due to resource class being too large.
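The UUID conflict fix listed above rests on including the namespace in the input of a deterministic UUID. A rough illustration follows; the function name, the field layout, and the choice of `uuid3` with `NAMESPACE_URL` are assumptions rather than the integration's actual scheme.

```python
import uuid


def build_run_id(namespace: str, dag_id: str, execution_date: str) -> str:
    """Derive a deterministic run UUID that includes the namespace."""
    # Same inputs always hash to the same UUID; a different namespace
    # yields a different UUID even when the DAG name matches.
    name = f"{namespace}.{dag_id}.{execution_date}"
    return str(uuid.uuid3(uuid.NAMESPACE_URL, name))
```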

0.14.1 - 2022-09-07

Fixed

  • Fix Spark integration issues including error when no openlineage.timeout #1069 @pawel-big-lebowski
    OpenlineageSparkListener was failing when no openlineage.timeout was provided.

0.14.0 - 2022-09-06

Added

  • Support ABFSS and Hadoop Logical Relation in Column-level lineage #1008 @wjohnson
    Introduces an extractDatasetIdentifier that uses similar logic to InsertIntoHadoopFsRelationVisitor to pull out the path on the HDFS compliant file system; tested on ABFSS and DBFS (Databricks FileSystem) to prove that lineage could be extracted using non-SQL commands.
  • Add Kusto relation visitor #939 @hmoazam
    Implements a KustoRelationVisitor to support lineage for Azure Kusto's Spark connector.
  • Add ColumnLevelLineage facet doc #1020 @julienledem
    Adds documentation for the Column-level lineage facet.
  • Include symlinks dataset facet #935 @pawel-big-lebowski
    Includes the recently introduced SymlinkDatasetFacet in generated OpenLineage events.
  • Add support for dbt 1.3 beta's metadata changes #1051 @mobuchowski
    Makes projects that are composed of only SQL models work on 1.3 beta (dbt 1.3 renamed the compiled_sql field to compiled_code to support Python models). Does not provide support for dbt's Python models.
  • Support Flink 1.15 #1009 @mzareba382
    Adds support for Flink 1.15.
  • Add Redshift dialect to the SQL integration #1066 @mobuchowski
    Adds support for Redshift's SQL dialect in OpenLineage's SQL parser, including quirks such as the use of square brackets in JSON paths. (Note, this does not add support for all of Redshift's custom syntax.)

Changed

  • Make the timeout configurable in the Spark integration #1050 @tnazarew
    Makes timeout configurable by the user. (In some cases, the time needed to send events was longer than 5 seconds, which exceeded the timeout value.)

Fixed

  • Add a dialect parameter to Great Expectations SQL parser calls #1049 @collado-mike
    Specifies the dialect name from the SQL engine.
  • Fix Delta 2.1.0 with Spark 3.3.0 #1065 @pawel-big-lebowski
    Allows Delta support for Spark 3.3 and fixes potential issues. (The OpenLineage integration for Spark 3.3 was turned on without Delta support, as Delta did not support Spark 3.3 at that time.)

0.13.1 - 2022-08-25

Fixed

  • Rename all parentRun occurrences to parent in Airflow integration #1037 @fm100
    Changes the parentRun property name to parent in the Airflow integration to match the spec.
  • Do not change task instance during on_running event #1028 @JDarDagran
    Fixes an issue in the Airflow integration with the on_running hook, which was changing the TaskInstance object along with the task attribute.

0.13.0 - 2022-08-22

Added

  • Add BigQuery check support #960 @denimalpaca
    Adds logic and support for proper dynamic class inheritance for BigQuery-style operators. (BigQuery's extractor needed additional logic to support the forthcoming BigQueryColumnCheckOperator and BigQueryTableCheckOperator.)
  • Add RUNNING EventType in spec and Python client #972 @mzareba382
    Introduces a RUNNING event state in the OpenLineage spec to indicate a running task and adds a RUNNING event type in the Python API.
  • Use databases & schemas in SQL Extractors #974 @JDarDagran
    Allows the Airflow integration to differentiate between databases and schemas. (There was no notion of databases and schemas when querying and parsing results from information_schema tables.)
  • Implement Event forwarding feature via HTTP protocol #995 @howardyoo
    Adds HttpLineageStream to forward a given OpenLineage event to any HTTP endpoint.
  • Introduce SymlinksDatasetFacet to spec #936 @pawel-big-lebowski
    Creates a new facet, the SymlinksDatasetFacet, to support the storing of alternative dataset names.
  • Add Azure Cosmos Handler to Spark integration #983 @hmoazam
    Defines a new interface, the RelationHandler, to support Spark data sources that do not have TableCatalog, Identifier, or TableProperties set, as is the case with the Azure Cosmos DB Spark connector.
  • Support OL Datasets in manual lineage inputs/outputs #1015 @conorbev
    Allows Airflow users to create OpenLineage Dataset classes directly in DAGs with no conversion necessary. (Manual lineage definition required users to create an airflow.lineage.entities.Table, which was then converted to an OpenLineage Dataset.)
  • Create ownership facets #996 @julienledem
    Adds an ownership facet to both Dataset and Job in the OpenLineage spec to capture ownership of jobs and datasets.

Changed

  • Use RUNNING EventType in Flink integration for currently running jobs #985 @mzareba382
    Makes use of the new RUNNING event type in the Flink integration, changing events sent by Flink jobs from OTHER to this new type.
  • Convert task objects to JSON-encodable objects when creating custom Airflow version facets #1018 @fm100
    Implements a to_json_encodable function in the Airflow integration to make task objects JSON-encodable.
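As an illustration of what making task objects JSON-encodable can involve, the sketch below recursively converts arbitrary values into types that `json.dumps` accepts. It reuses the name `to_json_encodable` from the entry above, but the body is an assumption and not the Airflow integration's implementation; it also does not guard against cyclic references.

```python
from datetime import date, datetime


def to_json_encodable(obj):
    """Recursively convert obj into values that json.dumps accepts."""
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, dict):
        return {str(key): to_json_encodable(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple, set, frozenset)):
        return [to_json_encodable(value) for value in obj]
    if hasattr(obj, "__dict__"):
        # Plain objects are encoded through their attribute dictionary.
        return to_json_encodable(vars(obj))
    # Fallback for anything else: use its string representation.
    return str(obj)
```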

Fixed

  • Add support for custom SQL queries in v3 Great Expectations API #1025 @collado-mike
    Fixes support for custom SQL statements in the Great Expectations provider. (The Great Expectations custom SQL datasource was not applied to the support for the V3 checkpoints API.)

0.12.0 - 2022-08-01

Added

Changed

Fixed

0.11.0 - 2022-07-07

Added

Changed

  • When testing extractors in the Airflow integration, make the extractor length assertion dynamic #882 @denimalpaca
  • Render templates at the start of integration tests for TaskListener in the Airflow integration #870 @mobuchowski

Fixed

0.10.0 - 2022-06-24

Added

Changed

Fixed

0.9.0 - 2022-06-03

Added

Fixed

0.8.2 - 2022-05-19

Added

Fixed

  • PostgresOperator fails to retrieve host and conn during extraction (#705) @sekikn
  • SQL parser accepts lists of sql statements (#734) @mobuchowski
  • Missing schema when writing to Delta tables in Databricks (#748) @collado-mike

0.8.1 - 2022-04-29

Added

Fixed

  • GreatExpectations: Fixed bug when invoking GreatExpectations using v3 API (#683) @collado-mike

0.7.1 - 2022-04-19

Added

Fixed

0.6.2 - 2022-03-16

Added

Fixed

0.6.1 - 2022-03-07

Fixed

  • Catch possible failures when emitting events and log them @mobuchowski
  • dbt: jinja2 code using do extensions does not crash @mobuchowski

0.6.0 - 2022-03-04

Added

  • Extract source code of PythonOperator code similar to SQL facet @mobuchowski
  • Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski
  • Airflow: extract source code from BashOperator @mobuchowski
  • Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune
  • OpenLineage sensor for OpenLineage-Dagster integration @dalinkim
  • Java-client: make generator generate enums as well @pawel-big-lebowski
  • Added UnknownOperatorAttributeRunFacet to Airflow integration to record operators that don't produce lineage @collado-mike

Fixed

  • Airflow: increase import timeout in tests, fix exit from integration @mobuchowski
  • Reduce logging level for import errors to info @rossturk
  • Remove AWS secret keys and extraneous Snowflake parameters from connection uri @collado-mike
  • Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski

0.5.2 - 2022-02-10

Added

  • Proxy backend example using Kafka @wslulciuc
  • Support Databricks Delta Catalog naming convention with DatabricksDeltaHandler @wjohnson
  • Add javadoc as part of build task @mobuchowski
  • Include TableStateChangeFacet in non V2 commands for Spark @mr-yusupov
  • Support for SqlDWRelation on Databricks' Azure Synapse/SQL DW Connector @wjohnson
  • Implement input visitors for v2 commands @pawel-big-lebowski
  • Enabled SparkListenerJobStart events to trigger open lineage events @collado-mike

Fixed

  • dbt: job namespaces for given dbt run match each other @mobuchowski
  • Fix Breaking SnowflakeOperator Changes from OSS Airflow @denimalpaca
  • Made corrections to account for DeltaDataSource handling @collado-mike

0.5.1 - 2022-01-18

Added

Fixed

  • airflow: fix import failures when dependencies for bigquery, dbt, great_expectations extractors are missing @lukaszlaszko
  • Fixed openlineage-spark jar to correctly rename bundled dependencies @collado-mike

0.4.0 - 2021-12-13

Added

Fixed

  • dbt: column descriptions are properly filled from metadata.json @mobuchowski
  • dbt: allow parsing artifacts with version higher than officially supported @mobuchowski
  • dbt: dbt build command is supported @mobuchowski
  • dbt: fix crash when build command is used with seeds in dbt 1.0.0rc3 @mobuchowski
  • spark: increase logical plan visitor coverage @mobuchowski
  • spark: fix logical serialization recursion issue @OleksandrDvornik
  • Use URL#getFile to fix build on Windows @mobuchowski

0.3.1 - 2021-10-21

Fixed

0.3.0 - 2021-10-21

Added

Fixed

0.2.3 - 2021-10-07

Fixed

0.2.2 - 2021-09-08

Added

  • Implement OpenLineageValidationAction for Great Expectations @collado-mike
  • facet: add expectations assertions facet @mobuchowski

Fixed

  • airflow: pendulum formatting fix, add tests @mobuchowski
  • dbt: do not emit events if run_result file was not updated @mobuchowski

0.2.1 - 2021-08-27

Fixed

  • Default --project-dir argument to current directory in dbt-ol script @mobuchowski

0.2.0 - 2021-08-23

Added

  • Parse dbt command line arguments when invoking dbt-ol @mobuchowski. For example:

    $ dbt-ol run --project-dir path/to/dir
    
  • Set UnknownFacet for spark (captures metadata about unvisited nodes from spark plan not yet supported) @OleksandrDvornik

Changed

Fixed

  • Remove instance references to extractors from DAG and avoid copying log property for serializability @collado-mike

0.1.0 - 2021-08-12

OpenLineage is an Open Standard for lineage metadata collection designed to record metadata for a job in execution. The initial public release includes:

  • An initial specification. The initial version 1-0-0 of the OpenLineage specification defines the core model and facets.
  • Integrations that collect lineage metadata as OpenLineage events:
  • Clients that send OpenLineage events to an HTTP backend. Both Java and Python are initially supported.