Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculate partition boundaries for read_sql using sampling #3245

Open
colin-ho opened this issue Nov 7, 2024 · 0 comments
Open

Calculate partition boundaries for read_sql using sampling #3245

colin-ho opened this issue Nov 7, 2024 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@colin-ho
Copy link
Contributor

colin-ho commented Nov 7, 2024

Is your feature request related to a problem?

Currently, read_sql calculates partition boundaries using PERCENTILE_DISC, falling back to using min - max if needed. PERCENTILE_DISC is not scalable, as it involves expensive operations such as windowing and sorting.

Describe the solution you'd like

We should instead calculate percentiles by taking samples from the input table. This will allows trade off btw accuracy and computational complexity (both time and space complexities). The parameter to control this trade off is the sampling size. The large the sampling size, the more accurate, but it will be more computational complex. If smaller sampling size will be more likely to have uneven sized partition (skew).

Describe alternatives you've considered

No response

Additional Context

No response

Would you like to implement a fix?

No

@colin-ho colin-ho added enhancement New feature or request needs triage labels Nov 7, 2024
colin-ho added a commit that referenced this issue Nov 13, 2024
…ax (#3246)

Currently, read_sql calculates partition bounds using the
`PERCENTILE_DISC` function. However, this function does not scale well
to large tables, as it is an expensive window + sort function. A better
alternative is to take samples, then estimate partition bounds, as
described in this issue:
#3245.

In the meantime, we should default to using the min-max calculations
instead, which was previously the fallback option.

---------

Co-authored-by: Colin Ho <colinho@Colins-MacBook-Pro.local>
Co-authored-by: Colin Ho <colinho@Colins-MBP.localdomain>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants