Replies: 3 comments 2 replies
-
By concurrency, did you mean concurrency within sqllineage, or with your code in concurrently calling sqllineage. sqllineage itself does parsing and lineage analysis sequentially. If it's your code where concurrency happens, can you try multi-process instead of multi-thread to see if the same issue still happens? Some class in sqllineage might not be well-designed with thread-safety in mind. |
Beta Was this translation helpful? Give feedback.
-
Here is the smallest example of how to reproduce the problem: metadata.json:
test.sql: CREATE OR REPLACE TEMP TABLE `tmp_table` AS
SELECT
Column_1
FROM `dataset.table_A`
;
CREATE OR REPLACE TABLE `dataset.table_C` AS
SELECT *
FROM `tmp_table`
WHERE
Column_1 IN (
SELECT Column_W
FROM `dataset.table_B`
)
; test.py: import json
from sqllineage.core.metadata.dummy import DummyMetaDataProvider
from sqllineage.runner import LineageRunner
with open('metadata_test.json', 'r') as file:
metadata = json.load(file)
with open('test.sql', 'r') as f:
body = f.read()
provider = DummyMetaDataProvider(metadata)
lineage = LineageRunner(body, dialect="bigquery", metadata_provider=provider, verbose=True)
lineage.print_column_lineage() Now executing the following command multiple will output something different often: Out of 12 tries I got 9 times this:
and 3 times this: From my testings, the problem seems heavily relying on the subquery + temp table. Delete one or the other and the result is consistent. Edit: Interestingly enough, if I put all the Python code in a loop instead of executing the code multiple time from the console then the output is always the same at each iteration of the loop. |
Beta Was this translation helpful? Give feedback.
-
I noticed a similar thing and traced it back to the wildcard resolution logic. I didn't fix it properly, but made it consistent at least: #661 |
Beta Was this translation helpful? Give feedback.
-
Hi!
I can't give a small example of how to reproduce the bug so I won't create an issue yet.
I have this kind of BigQuery code where I created temporary tables, each using the previous one:
I wrote a program in Python to get the column lineage and provide metadata with the DummyMetadataProvider.
PROBLEM: About half the time, when I run the code, the lineage outputs this:
Meanwhile the other half is working correctly (provides the full column lineage).
My theory based on nothing:
Maybe it is a concurrency problem where the parsing is parallelized or something and the last statement is processed before the previous one could finish so it does not have the temporary tables metadata yet ?
Beta Was this translation helpful? Give feedback.
All reactions