[CHORE] Expose read_sql partition bound strategy and default to min-max #3246

Merged
merged 3 commits into from
Nov 13, 2024


colin-ho
Contributor

@colin-ho colin-ho commented Nov 7, 2024

Currently, read_sql calculates partition bounds using the PERCENTILE_DISC function. However, this function does not scale well to large tables, as it is an expensive window + sort function. A better alternative is to take samples, then estimate partition bounds, as described in this issue: #3245.

In the meantime, we should default to the min-max calculation, which was previously the fallback option.
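To illustrate the min-max idea, here is a minimal sketch of how evenly spaced partition bounds could be derived from a column's minimum and maximum, which only needs a cheap `MIN`/`MAX` aggregate rather than a sort. The helper names `min_max_bounds` and `partition_predicates` are hypothetical and are not taken from `daft/sql/sql_scan.py`; the real implementation may differ.

```python
from typing import List


def min_max_bounds(lo: float, hi: float, num_partitions: int) -> List[float]:
    """Return num_partitions + 1 evenly spaced bounds covering [lo, hi].

    Hypothetical helper: lo and hi are assumed to come from a single
    `SELECT MIN(col), MAX(col)` query against the source table.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    step = (hi - lo) / num_partitions
    bounds = [lo + i * step for i in range(num_partitions)]
    bounds.append(hi)  # use hi itself so the last bound has no float drift
    return bounds


def partition_predicates(column: str, bounds: List[float]) -> List[str]:
    """Turn consecutive bounds into WHERE-clause predicates, one per partition.

    Every range is half-open except the last, which includes the upper bound
    so that the row holding the maximum value is not dropped.
    """
    predicates = []
    last = len(bounds) - 2
    for i in range(len(bounds) - 1):
        upper_op = "<=" if i == last else "<"
        predicates.append(
            f"{column} >= {bounds[i]} AND {column} {upper_op} {bounds[i + 1]}"
        )
    return predicates
```

The trade-off is that equal-width ranges can produce unevenly sized partitions when the column values are skewed, which is the case percentile-based or sample-based bounds are meant to handle.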

@github-actions github-actions bot added the chore label Nov 7, 2024

codspeed-hq bot commented Nov 7, 2024

CodSpeed Performance Report

Merging #3246 will improve performances by 50.05%

Comparing colin/fix-readsql-partition-bounds (4fea9d7) with main (6e28b3f)

Summary

⚡ 1 improvements
✅ 16 untouched benchmarks

Benchmarks breakdown

| Benchmark | main | colin/fix-readsql-partition-bounds | Change |
| --- | --- | --- | --- |
| test_show[100 Small Files] | 50.1 ms | 33.4 ms | +50.05% |


codecov bot commented Nov 7, 2024

Codecov Report

Attention: Patch coverage is 16.66667% with 35 lines in your changes missing coverage. Please review.

Project coverage is 78.52%. Comparing base (2b71ffb) to head (4fea9d7).
Report is 19 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| daft/sql/sql_scan.py | 14.63% | 35 Missing ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3246      +/-   ##
==========================================
- Coverage   79.13%   78.52%   -0.62%     
==========================================
  Files         640      641       +1     
  Lines       77983    79095    +1112     
==========================================
+ Hits        61715    62107     +392     
- Misses      16268    16988     +720     
| Files with missing lines | Coverage Δ |
| --- | --- |
| daft/io/_sql.py | 51.85% <100.00%> (ø) |
| daft/sql/sql_scan.py | 25.54% <14.63%> (-0.02%) ⬇️ |

... and 34 files with indirect coverage changes

@colin-ho colin-ho marked this pull request as ready for review November 12, 2024 23:06
@colin-ho
Contributor Author

Oops, I forgot to mark this as ready @desmondcheongzx :P

Contributor

@desmondcheongzx desmondcheongzx left a comment


Looks good!

@colin-ho colin-ho merged commit bd4e944 into main Nov 13, 2024
45 of 46 checks passed
@colin-ho colin-ho deleted the colin/fix-readsql-partition-bounds branch November 13, 2024 03:31