OpenLineage cannot ingest a BigQuery table because of a column type error #19567

Open
dariotr opened this issue Jan 28, 2025 · 0 comments

Affected module
Ingestion Framework

Describe the bug
The OpenLineage pipeline ingestion fails when an OpenLineage event describes a BigQuery table. The error raised is:

Encountered exception running step [<metadata.ingestion.source.pipeline.openlineage.metadata.OpenlineageSource object at 0x7f0a83da44c0>]: [1 validation error for Column
dataType
Input should be 'NUMBER', 'TINYINT', 'SMALLINT', 'INT', 'BIGINT', 'BYTEINT', 'BYTES', 'FLOAT', 'DOUBLE', 'DECIMAL', 'NUMERIC', 'TIMESTAMP', 'TIMESTAMPZ', 'TIME', 'DATE', 'DATETIME', 'INTERVAL', 'STRING', 'MEDIUMTEXT', 'TEXT', 'CHAR', 'LONG', 'VARCHAR', 'BOOLEAN', 'BINARY', 'VARBINARY', 'ARRAY', 'BLOB', 'LONGBLOB', 'MEDIUMBLOB', 'MAP', 'STRUCT', 'UNION', 'SET', 'GEOGRAPHY', 'ENUM', 'JSON', 'UUID', 'VARIANT', 'GEOMETRY', 'BYTEA', 'AGGREGATEFUNCTION', 'ERROR', 'FIXED', 'RECORD', 'NULL', 'SUPER', 'HLLSKETCH', 'PG_LSN', 'PG_SNAPSHOT', 'TSQUERY', 'TXID_SNAPSHOT', 'XML', 'MACADDR', 'TSVECTOR', 'UNKNOWN', 'CIDR', 'INET', 'CLOB', 'ROWID', 'LOWCARDINALITY', 'YEAR', 'POINT', 'POLYGON', 'TUPLE', 'SPATIAL', 'TABLE', 'NTEXT', 'IMAGE', 'IPV4', 'IPV6', 'DATETIMERANGE', 'HLL', 'LARGEINT', 'QUANTILE_STATE', 'AGG_STATE', 'BITMAP', 'UINT', 'BIT' or 'MONEY' [type=enum, input_value='INT64', input_type=str]
For further information visit https://errors.pydantic.dev/2.7/v/enum]

The problem is in the method _get_om_table_columns() in OpenMetadata/ingestion/src/metadata/ingestion/source/pipeline/openlineage/metadata.py: the column dataType is taken directly from the OpenLineage message, but the Column model accepts only the data types defined by OpenMetadata, not those of the source system.
OpenLineage sends INT64, which is not a valid value for the OM dataType defined in metadata.generated.schema.entity.data.table.
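The failure mode can be illustrated without OpenMetadata installed. The sketch below is not the real generated model: `DataType` is a stand-in holding only a few of the values listed in the error above, and `is_valid_om_type` is a hypothetical helper. It shows why a raw source type such as INT64 is rejected by an enum lookup, which is roughly what the pydantic enum validation does.

```python
from enum import Enum


class DataType(Enum):
    """Stand-in for a small subset of the OpenMetadata DataType enum."""

    INT = "INT"
    BIGINT = "BIGINT"
    FLOAT = "FLOAT"
    DOUBLE = "DOUBLE"
    STRING = "STRING"
    UNKNOWN = "UNKNOWN"


def is_valid_om_type(raw_type: str) -> bool:
    """Return True if raw_type is one of the enum values (hypothetical helper)."""
    try:
        DataType(raw_type)
        return True
    except ValueError:
        # Enum lookup by value fails for anything not declared in the enum.
        return False


if __name__ == "__main__":
    for raw in ("STRING", "INT64", "FLOAT64"):
        print(raw, is_valid_om_type(raw))
```

STRING passes because it happens to exist in both vocabularies; INT64 and FLOAT64 do not, so any event carrying them fails validation.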

To Reproduce
The error can be reproduced with any OpenLineage message that defines facets.schema with columns of a type such as INT64 or FLOAT64.

Screenshots or steps to reproduce
Send a message like the following to the Kafka topic used for the OpenLineage integration:

{ "eventTime": "2025-01-28T10:05:45.991109Z", "eventType": "COMPLETE", "inputs": [ { "facets": { "dataSource": { "_deleted": null, "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.25.0/client/python", "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet", "name": "bigquery", "uri": "bigquery" }, "documentation": { "_deleted": null, "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.25.0/client/python", "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/DocumentationDatasetFacet.json#/$defs/DocumentationDatasetFacet", "description": "" }, "schema": { "_deleted": null, "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.25.0/client/python", "_schemaURL": "https://openlineage.io/spec/facets/1-1-1/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet", "fields": [ { "fields": [], "name": "id", "type": "INT64" }, { "fields": [], "name": "column1", "type": "STRING" }, { "fields": [], "name": "column2", "type": "INT64" }, { "fields": [], "name": "column3", "type": "INT64" }, { "fields": [], "name": "column4", "type": "FLOAT64" }, { "fields": [], "name": "column5", "type": "FLOAT64" } ] } }, "inputFacets": null, "name": "project-private.dataset-private.mytable", "namespace": "bigquery" } ], "job": { "facets": { "jobType": { "_deleted": null, "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.25.0/integration/dbt", "_schemaURL": "https://openlineage.io/spec/facets/2-0-3/JobTypeJobFacet.json#/$defs/JobTypeJobFacet", "integration": "DBT", "jobType": "MODEL", "processingType": "BATCH" }, "sql": { "_deleted": null, "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.25.0/client/python", "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/SQLJobFacet.json#/$defs/SQLJobFacet", "query": "\nSELECT\n id,\n column1,\n column2,\n column3,\n column4,\n column5\nFROM project-private.dataset-private.metrics" } }, "name": 
"project-private.outputdataset.mytable", "namespace": "dbt" }, "outputs": [ { "facets": { "columnLineage": { "_deleted": null, "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.25.0/client/python", "_schemaURL": "https://openlineage.io/spec/facets/1-2-0/ColumnLineageDatasetFacet.json#/$defs/ColumnLineageDatasetFacet", "dataset": [], "fields": { "column1": { "inputFields": [ { "field": "column1", "name": "project-private.dataset-private.metrics", "namespace": "bigquery", "transformations": [] } ], "transformationDescription": "", "transformationType": "" }, "column4": { "inputFields": [ { "field": "column4", "name": "project-private.dataset-private.metrics", "namespace": "bigquery", "transformations": [] } ], "transformationDescription": "", "transformationType": "" }, "id": { "inputFields": [ { "field": "id", "name": "project-private.dataset-private.metrics", "namespace": "bigquery", "transformations": [] } ], "transformationDescription": "", "transformationType": "" }, "column3": { "inputFields": [ { "field": "column3", "name": "project-private.dataset-private.metrics", "namespace": "bigquery", "transformations": [] } ], "transformationDescription": "", "transformationType": "" }, "column5": { "inputFields": [ { "field": "column5", "name": "project-private.dataset-private.metrics", "namespace": "bigquery", "transformations": [] } ], "transformationDescription": "", "transformationType": "" }, "column2": { "inputFields": [ { "field": "column2", "name": "project-private.dataset-private.metrics", "namespace": "bigquery", "transformations": [] } ], "transformationDescription": "", "transformationType": "" } } }, "dataSource": { "_deleted": null, "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.25.0/client/python", "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/DatasourceDatasetFacet.json#/$defs/DatasourceDatasetFacet", "name": "bigquery", "uri": "bigquery" }, "documentation": { "_deleted": null, "_producer": 
"https://github.com/OpenLineage/OpenLineage/tree/1.25.0/client/python", "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/DocumentationDatasetFacet.json#/$defs/DocumentationDatasetFacet", "description": "" }, "schema": { "_deleted": null, "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.25.0/client/python", "_schemaURL": "https://openlineage.io/spec/facets/1-1-1/SchemaDatasetFacet.json#/$defs/SchemaDatasetFacet", "fields": [ { "description": "", "fields": [], "name": "id", "type": "INT64" }, { "fields": [], "name": "column1", "type": "STRING" }, { "fields": [], "name": "column2", "type": "INT64" }, { "fields": [], "name": "column3", "type": "INT64" }, { "fields": [], "name": "column4", "type": "FLOAT64" }, { "fields": [], "name": "column5", "type": "FLOAT64" } ] } }, "name": "project-private.outputdataset.dmns_mytable", "namespace": "bigquery", "outputFacets": { "outputStatistics": { "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.25.0/client/python", "_schemaURL": "https://openlineage.io/spec/facets/1-0-2/OutputStatisticsOutputDatasetFacet.json#/$defs/OutputStatisticsOutputDatasetFacet", "rowCount": 151884140, "size": 8386690822 } } } ], "producer": "https://github.com/OpenLineage/OpenLineage/tree/1.25.0/integration/dbt", "run": { "facets": { "dbt_version": { "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.25.0/client/python", "_schemaURL": "https://github.com/OpenLineage/OpenLineage/tree/main/integration/common/openlineage/schema/dbt-version-run-facet.json", "version": "1.8.9" }, "parent": { "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.25.0/client/python", "_schemaURL": "https://openlineage.io/spec/facets/1-0-1/ParentRunFacet.json#/$defs/ParentRunFacet", "job": { "name": "dbt-run-outputdataset", "namespace": "dbt" }, "run": { "runId": "0194ac5f-9bc5-73d8-818f-7b2143745544" } } }, "runId": "0194ac68-7069-79c4-b30b-60664a1c7142" }, "schemaURL": 
"https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent" }
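To check which column types an event like the one above carries, the schema facets of its inputs and outputs can be scanned. This is a minimal sketch using only the standard library; `collect_schema_types` is a hypothetical helper, and the sample event is a trimmed-down version of the message above, not the full payload.

```python
import json


def collect_schema_types(event: dict) -> set:
    """Collect every column type found in the schema facets of an event."""
    types = set()
    datasets = (event.get("inputs") or []) + (event.get("outputs") or [])
    for dataset in datasets:
        schema = (dataset.get("facets") or {}).get("schema") or {}
        for field in schema.get("fields") or []:
            if "type" in field:
                types.add(field["type"])
    return types


# Trimmed-down event with the same structure as the reported message.
SAMPLE_EVENT = json.loads(
    """
    {
      "eventType": "COMPLETE",
      "inputs": [
        {"namespace": "bigquery", "name": "project-private.dataset-private.mytable",
         "facets": {"schema": {"fields": [
           {"name": "id", "type": "INT64"},
           {"name": "column1", "type": "STRING"},
           {"name": "column4", "type": "FLOAT64"}
         ]}}}
      ],
      "outputs": []
    }
    """
)

if __name__ == "__main__":
    print(sorted(collect_schema_types(SAMPLE_EVENT)))
```

Any type in the result that is not in the list from the validation error is a candidate for the failure described here.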

Expected behavior
I expected tables with source-specific column types to be ingested correctly through the OpenLineage integration.

In the method _get_om_table_columns() in OpenMetadata/ingestion/src/metadata/ingestion/source/pipeline/openlineage/metadata.py, the raw type should be translated first, for example with a static method like ColumnTypeParser.get_column_type(), so that a valid OpenMetadata dataType is used.
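The kind of translation proposed here could look roughly like the sketch below. It does not call the real ColumnTypeParser; `SOURCE_TO_OM`, `OM_TYPES`, and `to_om_type` are hypothetical names, and the specific mappings (INT64 to BIGINT, FLOAT64 to DOUBLE) are assumptions that may differ from what OpenMetadata actually does. Unrecognised types fall back to UNKNOWN, which is one of the accepted values in the error message.

```python
# Subset of the accepted OpenMetadata dataType values (from the error message).
OM_TYPES = {"INT", "BIGINT", "FLOAT", "DOUBLE", "STRING", "BOOLEAN", "UNKNOWN"}

# Assumed mapping from BigQuery type names to OpenMetadata types.
# These pairs are illustrative, not taken from the OpenMetadata code base.
SOURCE_TO_OM = {
    "INT64": "BIGINT",
    "FLOAT64": "DOUBLE",
    "BOOL": "BOOLEAN",
}


def to_om_type(raw_type: str) -> str:
    """Translate a source column type into an accepted OpenMetadata type."""
    normalized = (raw_type or "").strip().upper()
    if normalized in OM_TYPES:
        return normalized
    # Fall back to UNKNOWN rather than raising, so ingestion can continue.
    return SOURCE_TO_OM.get(normalized, "UNKNOWN")


if __name__ == "__main__":
    for raw in ("INT64", "float64", "STRING", "SOME_NEW_TYPE"):
        print(raw, "->", to_om_type(raw))
```

Falling back to UNKNOWN trades precision for robustness: the table is still ingested and lineage is preserved even when a type has no known mapping.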

Version:

  • Python version: 3.10
  • OpenMetadata version: 1.6.2
  • OpenMetadata Ingestion package version: [e.g. openmetadata-ingestion[docker]==XYZ]