Calculate partition boundaries for `read_sql` using sampling #3245

colin-ho · 2024-11-07T19:07:07Z

Is your feature request related to a problem?

Currently, `read_sql` calculates partition boundaries using `PERCENTILE_DISC`, falling back to min-max if needed. `PERCENTILE_DISC` does not scale well, as it involves expensive operations such as windowing and sorting.

Describe the solution you'd like

We should instead calculate percentiles by taking samples from the input table. This allows a trade-off between accuracy and computational complexity (both time and space). The parameter that controls this trade-off is the sample size: the larger the sample, the more accurate the boundaries, but the more expensive the computation. A smaller sample is more likely to produce unevenly sized partitions (skew).
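
To make the idea concrete, here is a minimal sketch of sampling-based boundary estimation. It works on an in-memory list as a stand-in for the partition column of a SQL table, and the names `estimate_partition_bounds`, `partition_sizes`, and `sample_size` are hypothetical, not part of the existing `read_sql` implementation. Fetching the sample from a real database efficiently is a separate question that this sketch does not address.

```python
import bisect
import random
from typing import List, Sequence


def estimate_partition_bounds(
    values: Sequence[float],
    num_partitions: int,
    sample_size: int,
    seed: int = 0,
) -> List[float]:
    """Estimate num_partitions + 1 boundaries from a random sample (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if sample_size < 1 or len(values) == 0:
        raise ValueError("need a non-empty input and a positive sample_size")

    rng = random.Random(seed)
    k = min(sample_size, len(values))
    # Only the sample is sorted, so cost scales with k rather than the table size.
    sample = sorted(rng.sample(values, k))

    # Take the sample element at each evenly spaced quantile.
    return [sample[(i * (k - 1)) // num_partitions] for i in range(num_partitions + 1)]


def partition_sizes(values: Sequence[float], bounds: List[float]) -> List[int]:
    """Count how many values land in each partition, to inspect skew."""
    sizes = [0] * (len(bounds) - 1)
    for v in values:
        idx = bisect.bisect_right(bounds, v) - 1
        # Values outside the sampled range are clamped into the first or last partition.
        idx = max(0, min(idx, len(sizes) - 1))
        sizes[idx] += 1
    return sizes
```

Comparing `partition_sizes` for a small and a large `sample_size` on the same data is one way to see the accuracy versus cost trade-off. Because the sample may miss the true minimum and maximum, a real implementation would presumably need the first and last partitions to be open-ended so that no rows are dropped.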

Describe alternatives you've considered

No response

Additional Context

No response

Would you like to implement a fix?

No


…ax (#3246)

Currently, read_sql calculates partition bounds using the `PERCENTILE_DISC` function. However, this function does not scale well to large tables, as it is an expensive window + sort function. A better alternative is to take samples, then estimate partition bounds, as described in this issue: #3245. In the meantime, we should default to using the min-max calculations instead, which was previously the fallback option.

Co-authored-by: Colin Ho <colinho@Colins-MacBook-Pro.local>
Co-authored-by: Colin Ho <colinho@Colins-MBP.localdomain>

colin-ho added enhancement New feature or request needs triage labels Nov 7, 2024

colin-ho mentioned this issue Nov 11, 2024

[CHORE] Expose read_sql partition bound strategy and default to min-max #3246

Merged

desmondcheongzx removed the needs triage label Nov 26, 2024

desmondcheongzx assigned colin-ho Nov 26, 2024
