Expose complete SQL queries to be utilised by Apache Beam #2977
Comments
Hi @colin-ho, how are you? Is there an update? Cheers,
Hey @jaehyeon-kim! Sorry for the late update. I think the simplest way for us to do this is to serialize the physical plan into JSON; then you would be able to extract the SQL queries from the JSON. PR is here: #3023
Hi @colin-ho, thank you so much for the update. Looking forward to having the feature available soon! Cheers,
#2977 Allows the physical plan to be serialized to JSON strings. The script below shows how you can extract the SQL queries from `read_sql`. (Note: the original script was missing the `create_engine` import from SQLAlchemy, added here.)

```
import json

import daft
from sqlalchemy import create_engine

def create_conn():
    return create_engine('sqlite:///test_database.db').connect()

df = daft.read_sql("SELECT * FROM test_data", create_conn, partition_col="id", num_partitions=3)

physical_plan_scheduler = df._builder.to_physical_plan_scheduler(
    daft.context.get_context().daft_execution_config
)
physical_plan_dict = json.loads(physical_plan_scheduler.to_json_string())

# collect all sql queries
sql_queries = []
for task in physical_plan_dict['TabularScan']['scan_tasks']:
    sql_queries.append(task['file_format_config']['Database']['sql'])

print(sql_queries)
# ['SELECT * FROM (SELECT * FROM test_data) AS subquery WHERE id >= 1 AND id < 34',
#  'SELECT * FROM (SELECT * FROM test_data) AS subquery WHERE id >= 34 AND id < 67',
#  'SELECT * FROM (SELECT * FROM test_data) AS subquery WHERE id >= 67 AND id <= 100']
```
---------
Co-authored-by: Colin Ho <colinho@Colins-MBP.localdomain>
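For what it's worth, the partition boundaries in the queries above appear to follow an even integer split of the `partition_col` range, with the last partition using `<=` so the maximum value is included. A minimal sketch of that splitting logic (a hypothetical helper for illustration, not daft's actual implementation):

```python
def partition_queries(base_sql, col, lo, hi, n):
    """Generate one bounded SQL query per partition.

    Assumes an even integer split of [lo, hi] into n ranges,
    mirroring the queries printed by the script above.
    """
    # n + 1 evenly spaced integer boundaries over [lo, hi]
    bounds = [lo + (hi - lo) * i // n for i in range(n + 1)]
    queries = []
    for i in range(n):
        # last partition is closed on the right so `hi` is included
        op = "<=" if i == n - 1 else "<"
        queries.append(
            f"SELECT * FROM ({base_sql}) AS subquery "
            f"WHERE {col} >= {bounds[i]} AND {col} {op} {bounds[i + 1]}"
        )
    return queries
```

With `partition_queries("SELECT * FROM test_data", "id", 1, 100, 3)` this reproduces the three queries shown in the output above.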
The feature has been merged in, and should be available in the next release!
Is your feature request related to a problem? Please describe.
Apache Beam is an open-source, unified programming model for batch and streaming data processing pipelines that simplifies large-scale data processing. I'm developing Apache Beam Python I/O connectors that utilise your integration features for SQL and open table format sources (and possibly sinks). Following the I/O connector development guide, I'll be building Splittable DoFn objects, and they require a complete set of SQL queries as per the partition conditions (e.g. partition_col, num_partitions, ...). Those queries are then executed in multiple DoFn instances, and the results are combined before being ingested into subsequent tasks.

For SQL sources, as an example, the physical plan scheduler prints the SQL queries, but they don't seem to be available in Python. I might be able to obtain the queries using the SQL scan operator, but that looks incomplete because the pushdowns object is not directly obtainable in Python. In this regard, it would be good if the complete SQL queries could be obtained in Python.
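The workflow described above, executing each per-partition query in a separate worker and combining the results, can be sketched without Beam using only `sqlite3` from the standard library (`run_partition_queries` is a hypothetical helper for illustration; in the real connector each query would run inside its own DoFn instance):

```python
import sqlite3

def run_partition_queries(db_path, queries):
    """Execute each per-partition SQL query and concatenate the results.

    Mimics what parallel DoFn instances followed by a combine step
    would produce, but runs the queries sequentially on one connection.
    """
    rows = []
    with sqlite3.connect(db_path) as conn:
        for q in queries:
            rows.extend(conn.execute(q).fetchall())
    return rows
```

Because the partition predicates are disjoint and cover the whole range, the concatenated rows are exactly the rows of the original unpartitioned query.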
Describe the solution you'd like
I guess the physical plan scheduler is available for both SQL and open table format sources. Although it includes the SQL queries, they cannot be obtained in Python. It would be good if there were a to_dict() method or similar that supports obtaining the complete SQL queries.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.