Expose complete SQL queries to be utilised by Apache Beam #2977
Comments
Hi @colin-ho, how are you? Is there an update? Cheers,
Hey @jaehyeon-kim! Sorry for the late update. I think the simplest way for us to do this is to serialize the physical plan into JSON; then you would be able to extract the SQL queries from the JSON. PR is here: #3023
Hi @colin-ho, thank you so much for the update. Looking forward to having the feature available soon! Cheers,
#2977 Allows the physical plan to be serialized to JSON strings. The script below shows how you can extract the SQL queries from `read_sql`. (Note: the original script was missing the `create_engine` import from SQLAlchemy, added here.)

```
import json

import daft
from sqlalchemy import create_engine

def create_conn():
    return create_engine('sqlite:///test_database.db').connect()

df = daft.read_sql("SELECT * FROM test_data", create_conn, partition_col="id", num_partitions=3)

physical_plan_scheduler = df._builder.to_physical_plan_scheduler(
    daft.context.get_context().daft_execution_config
)
physical_plan_dict = json.loads(physical_plan_scheduler.to_json_string())

# collect all sql queries
sql_queries = []
for task in physical_plan_dict['TabularScan']['scan_tasks']:
    sql_queries.append(task['file_format_config']['Database']['sql'])

print(sql_queries)
# ['SELECT * FROM (SELECT * FROM test_data) AS subquery WHERE id >= 1 AND id < 34',
#  'SELECT * FROM (SELECT * FROM test_data) AS subquery WHERE id >= 34 AND id < 67',
#  'SELECT * FROM (SELECT * FROM test_data) AS subquery WHERE id >= 67 AND id <= 100']
```
---------
Co-authored-by: Colin Ho <colinho@Colins-MBP.localdomain>
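For what it's worth, the partition boundaries in the queries above appear to follow an even integer split of the `partition_col` range, with the last partition using `<=` so the maximum value is included. A minimal sketch of that splitting logic (a hypothetical helper for illustration, not daft's actual implementation):

```python
def partition_queries(base_sql, col, lo, hi, n):
    """Generate one bounded SQL query per partition.

    Assumes an even integer split of [lo, hi] into n ranges,
    mirroring the queries printed by the script above.
    """
    # n + 1 evenly spaced integer boundaries over [lo, hi]
    bounds = [lo + (hi - lo) * i // n for i in range(n + 1)]
    queries = []
    for i in range(n):
        # last partition is closed on the right so `hi` is included
        op = "<=" if i == n - 1 else "<"
        queries.append(
            f"SELECT * FROM ({base_sql}) AS subquery "
            f"WHERE {col} >= {bounds[i]} AND {col} {op} {bounds[i + 1]}"
        )
    return queries
```

With `partition_queries("SELECT * FROM test_data", "id", 1, 100, 3)` this reproduces the three queries shown in the output above.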
The feature has been merged in, and should be available in the next release!
Is your feature request related to a problem? Please describe.
Apache Beam is an open-source, unified programming model for batch and streaming data processing pipelines that simplifies large-scale data processing. I'm developing Apache Beam Python I/O connectors that utilise your integration features for SQL and open table format sources (and possibly sinks). Following the I/O connector development guide, I'll be building Splittable DoFn objects, and they require a complete set of SQL queries as per the partition conditions (e.g. partition_col, num_partitions, ...). Those queries are then executed in multiple DoFn instances, and the results are combined before being ingested into subsequent tasks.

For SQL sources, as an example, the physical plan scheduler prints the SQL queries, but they don't seem to be available in Python. I might be able to obtain the queries using the SQL scan operator, but that looks incomplete because the pushdowns object is not directly obtainable in Python. In this regard, it would be good if the complete SQL queries could be obtained in Python.
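The workflow described above, executing each per-partition query in a separate worker and combining the results, can be sketched without Beam using only `sqlite3` from the standard library (`run_partition_queries` is a hypothetical helper for illustration; in the real connector each query would run inside its own DoFn instance):

```python
import sqlite3

def run_partition_queries(db_path, queries):
    """Execute each per-partition SQL query and concatenate the results.

    Mimics what parallel DoFn instances followed by a combine step
    would produce, but runs the queries sequentially on one connection.
    """
    rows = []
    with sqlite3.connect(db_path) as conn:
        for q in queries:
            rows.extend(conn.execute(q).fetchall())
    return rows
```

Because the partition predicates are disjoint and cover the whole range, the concatenated rows are exactly the rows of the original unpartitioned query.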
Describe the solution you'd like
I guess the physical plan scheduler is available for both SQL and open table format sources. Although it includes the SQL queries, they cannot be obtained in Python. It would be good if there were a to_dict() method or similar that supports obtaining the complete SQL queries.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.