Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two level query plan execution #10

Open
alxmrs opened this issue Feb 13, 2024 · 4 comments
Open

Two level query plan execution #10

alxmrs opened this issue Feb 13, 2024 · 4 comments

Comments

@alxmrs
Copy link
Owner

alxmrs commented Feb 13, 2024

One level, the fallback, would be the prototype in #8. This should always work, but is expensive since it requires compact Xarray datasets to be unraveled.

The other level would be more like xql today. It does as much pre processing on the Dataset with xr operations as possible, then trivially unravels at the end. This implies that the SQL-on-Xarray layer should have clean interface boundaries.

@alxmrs
Copy link
Owner Author

alxmrs commented Feb 13, 2024

Some notes on how we could do this:

  • Control the sql parsing step
  • https://dask-sql.readthedocs.io/en/latest/how_does_it_work.html
  • from the sql plan, produce a refined xr.ds. The process of producing this should be a good enough effort while maintaining correctness. It might have a notion of bailing out due to ambiguity.
  • at last step, apply sql-dask engine on the converted, refined xr.ds.
  • leave open the possibility of using other sql engines on dfs via fuegue.

@alxmrs
Copy link
Owner Author

alxmrs commented Feb 17, 2024

#2 would become one level of execution.

@alxmrs
Copy link
Owner Author

alxmrs commented Feb 17, 2024

How will we integrate the distributed execution between the two levels? For example, the Xarray executor level would use xbeam on Dataflow, whereas the Dataframe executor would use Dask on Dataproc. Is there some way we can get both sides execution on the same context? Or, in the distributed case, would we hand off the tasks via IO, like how Cubed breaks up each step by writing to Zarr?

@alxmrs
Copy link
Owner Author

alxmrs commented Feb 17, 2024

Hmmm... it looks like Beam supports Pandas-like Dataframes.

https://beam.apache.org/documentation/dsls/dataframes/overview/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant