Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

date_trunc range optimization should apply also for BETWEEN predicate #14293

Open
findinpath opened this issue Sep 26, 2022 · 0 comments · May be fixed by #14451
Open

date_trunc range optimization should apply also for BETWEEN predicate #14293

findinpath opened this issue Sep 26, 2022 · 0 comments · May be fixed by #14451
Assignees
Labels
enhancement New feature or request performance

Comments

@findinpath
Copy link
Contributor

Related commit:

80c079f9

When doing range searches with date_trunc it seems to me that we could improve a bit the filter

filterPredicate = (date_trunc('day', "tpep_dropoff_datetime") BETWEEN TIMESTAMP '2021-10-01 00:00:00.000000' AND TIMESTAMP '2021-10-15 00:00:00.000000')

to look like

filterPredicate = ("tpep_dropoff_datetime" BETWEEN TIMESTAMP '2021-10-01 00:00:00.000000' AND TIMESTAMP '2021-10-15 23:59:59.999999')

See the detailed EXPLAIN ANALYZE output for the query:

explain analyze select count(*) from iceberg.default.nyc_yellow_taxi_trip_data where date_trunc('day', tpep_dropoff_datetime) BETWEEN DATE '2021-10-01' AND DATE '2021-10-15';
                                                                                                                           Query Plan                                                                                                                         >
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
 Fragment 1 [SINGLE]                                                                                                                                                                                                                                          >
     CPU: 14.32ms, Scheduled: 17.63ms, Blocked 8.66s (Input: 7.19s, Output: 0.00ns), Input: 109 rows (981B); per task: avg.: 109.00 std.dev.: 0.00, Output: 1 row (9B)                                                                                        >
     Output layout: [count]                                                                                                                                                                                                                                   >
     Output partitioning: SINGLE []                                                                                                                                                                                                                           >
     Aggregate[type = FINAL]                                                                                                                                                                                                                                  >
     │   Layout: [count:bigint]                                                                                                                                                                                                                               >
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                                                                                                                                                                              >
     │   CPU: 2.00ms (0.05%), Scheduled: 2.00ms (0.01%), Blocked: 0.00ns (0.00%), Output: 1 row (9B)                                                                                                                                                          >
     │   Input avg.: 109.00 rows, Input std.dev.: 0.00%                                                                                                                                                                                                       >
     │   count := count("count_0")                                                                                                                                                                                                                            >
     └─ LocalExchange[partitioning = SINGLE]                                                                                                                                                                                                                  >
        │   Layout: [count_0:bigint]                                                                                                                                                                                                                          >
        │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}                                                                                                                                                                                           >
        │   CPU: 3.00ms (0.08%), Scheduled: 3.00ms (0.01%), Blocked: 1.47s (16.97%), Output: 109 rows (981B)                                                                                                                                                  >
        │   Input avg.: 13.63 rows, Input std.dev.: 174.30%                                                                                                                                                                                                   >
        └─ RemoteSource[sourceFragmentIds = [2]]                                                                                                                                                                                                              >
               Layout: [count_0:bigint]                                                                                                                                                                                                                       >
               Estimates:                                                                                                                                                                                                                                     >
               CPU: 2.00ms (0.05%), Scheduled: 3.00ms (0.01%), Blocked: 7.19s (83.03%), Output: 109 rows (981B)                                                                                                                                               >
               Input avg.: 13.63 rows, Input std.dev.: 174.30%                                                                                                                                                                                                >
                                                                                                                                                                                                                                                              >
 Fragment 2 [SOURCE]                                                                                                                                                                                                                                          >
     CPU: 3.65s, Scheduled: 39.35s, Blocked 0.00ns (Input: 0.00ns, Output: 0.00ns), Input: 30903532 rows (265.25MB); per task: avg.: 30903532.00 std.dev.: 0.00, Output: 109 rows (981B)                                                                      >
     Output layout: [count_0]                                                                                                                                                                                                                                 >
     Output partitioning: SINGLE []                                                                                                                                                                                                                           >
     Aggregate[type = PARTIAL]                                                                                                                                                                                                                                >
     │   Layout: [count_0:bigint]                                                                                                                                                                                                                             >
     │   Estimates:                                                                                                                                                                                                                                           >
     │   CPU: 16.00ms (0.44%), Scheduled: 15.00ms (0.04%), Blocked: 0.00ns (0.00%), Output: 109 rows (981B)                                                                                                                                                   >
     │   Input avg.: 15069.17 rows, Input std.dev.: 251.85%                                                                                                                                                                                                   >
     │   count_0 := count(*)                                                                                                                                                                                                                                  >
     └─ ScanFilterProject[table = iceberg:default.nyc_yellow_taxi_trip_data$data@7744941593534498146, filterPredicate = (date_trunc('day', "tpep_dropoff_datetime") BETWEEN TIMESTAMP '2021-10-01 00:00:00.000000' AND TIMESTAMP '2021-10-15 00:00:00.000000')>
            Layout: []                                                                                                                                                                                                                                        >
            Estimates: {rows: 30903532 (0B), cpu: 265.25M, memory: 0B, network: 0B}/{rows: ? (0B), cpu: 530.49M, memory: 0B, network: 0B}/{rows: ? (0B), cpu: 530.49M, memory: 0B, network: 0B}                                                               >
            CPU: 3.63s (99.37%), Scheduled: 39.34s (99.94%), Blocked: 0.00ns (0.00%), Output: 1642540 rows (0B)                                                                                                                                               >
            Input avg.: 283518.64 rows, Input std.dev.: 222.57%                                                                                                                                                                                               >
            tpep_dropoff_datetime := 3:tpep_dropoff_datetime:timestamp(6)                                                                                                                                                                                     >
            Input: 30903532 rows (265.25MB), Filtered: 94.68%       
@findepi findepi changed the title date_trunc range optimization should apply also for date/timestamp ranges date_trunc range optimization should apply also for BETWEEN predicate Sep 26, 2022
@findepi findepi added enhancement New feature or request performance labels Sep 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
2 participants