Perform range optimization for BETWEEN predicate on date_trunc and temporal casts #14390

Closed

findinpath
Contributor

@findinpath findinpath commented Sep 30, 2022

Description

This change allows the engine to infer that, for instance,
given t::timestamp(6)

    date_trunc('day', t) BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 00:00:00'

or

    cast(t as date) BETWEEN DATE '2022-01-01' AND DATE '2022-01-02'

can be rewritten as

    t BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 23:59:59.999999'
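
A minimal Java sketch of the arithmetic behind this rewrite, assuming both BETWEEN bounds are already aligned to the truncation unit and the column has microsecond precision. The class and method names are hypothetical and rely on java.time rather than the engine's internal types, so this is an illustration and not the implementation in this PR.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical illustration only; not Trino code.
final class DateTruncBetweenSketch
{
    private DateTruncBetweenSketch() {}

    // Last timestamp(6) value that still truncates to alignedHigh:
    // alignedHigh + 1 unit - 1 microsecond.
    static LocalDateTime rangeEndInclusive(LocalDateTime alignedHigh, ChronoUnit unit)
    {
        if (!alignedHigh.truncatedTo(unit).equals(alignedHigh)) {
            throw new IllegalArgumentException("bound is not aligned to " + unit);
        }
        return alignedHigh.plus(1, unit).minus(1, ChronoUnit.MICROS);
    }

    // date_trunc(unit, t) BETWEEN low AND high, evaluated directly.
    static boolean originalPredicate(LocalDateTime t, LocalDateTime low, LocalDateTime high, ChronoUnit unit)
    {
        LocalDateTime truncated = t.truncatedTo(unit);
        return !truncated.isBefore(low) && !truncated.isAfter(high);
    }

    // t BETWEEN low AND rangeEndInclusive(high), valid when low and high are aligned to unit.
    static boolean rewrittenPredicate(LocalDateTime t, LocalDateTime low, LocalDateTime high, ChronoUnit unit)
    {
        return !t.isBefore(low) && !t.isAfter(rangeEndInclusive(high, unit));
    }
}
```

For the example above, `rangeEndInclusive` applied to `2022-01-02 00:00:00` with `ChronoUnit.DAYS` yields `2022-01-02 23:59:59.999999`, and the two predicates agree for any `t` with microsecond precision.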

The change applies for the temporal types:

  • date
  • timestamp
  • timestamp with time zone

A range predicate such as BetweenPredicate can be transformed into a TupleDomain,
which helps with predicate pushdown.
A range-based TupleDomain representation is critical for connectors
that have min/max-based metadata (such as Iceberg data files, or Iceberg manifest
lists, which play a key role in partition pruning), because ranges allow
for intersection tests, something that is hard
to do in a generic manner for a ConnectorExpression.
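
To illustrate why ranges matter for min/max metadata, here is a small Java sketch of an intersection test between a predicate range and per-file statistics. The `MicrosRange` type, the file names, and `filesToScan` are hypothetical stand-ins; the real TupleDomain API and the Iceberg metadata structures are considerably richer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the Trino SPI or the Iceberg API.
final class MinMaxPruningSketch
{
    private MinMaxPruningSketch() {}

    // Closed interval [low, high] of epoch microseconds.
    static final class MicrosRange
    {
        final long low;
        final long high;

        MicrosRange(long low, long high)
        {
            if (low > high) {
                throw new IllegalArgumentException("low must not exceed high");
            }
            this.low = low;
            this.high = high;
        }

        // Two closed intervals intersect when each one starts before the other ends.
        boolean overlaps(MicrosRange other)
        {
            return low <= other.high && other.low <= high;
        }
    }

    // Keeps only the files whose [min, max] statistics intersect the predicate range.
    static List<String> filesToScan(MicrosRange predicate, Map<String, MicrosRange> minMaxByFile)
    {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, MicrosRange> entry : minMaxByFile.entrySet()) {
            if (predicate.overlaps(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

An opaque expression such as `date_trunc('day', t) BETWEEN ...` offers no comparable shortcut, since deciding whether it can match a file would require reasoning about the function itself.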

Fixes #14293

Non-technical explanation

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Main
* Improve partition and data pruning when comparing temporal casts with ranges

@cla-bot cla-bot bot added the cla-signed label Sep 30, 2022
@findinpath findinpath changed the title Rewrite temporal casts comparation on ranges Perform range optimization for BETWEEN predicate on date_trunc and temporal casts Sep 30, 2022
@findinpath findinpath force-pushed the rewrite-between-for-temporal-casts branch 2 times, most recently from c01d17e to 4226391 Compare September 30, 2022 14:58
@findinpath findinpath marked this pull request as ready for review September 30, 2022 15:00
@findinpath findinpath force-pushed the rewrite-between-for-temporal-casts branch 3 times, most recently from a487c99 to aa7abc0 Compare October 2, 2022 07:19
@findinpath
Contributor Author

CI hit #11140

@findinpath findinpath requested review from martint and findepi October 2, 2022 07:21
@findinpath findinpath force-pushed the rewrite-between-for-temporal-casts branch from aa7abc0 to 5317143 Compare October 2, 2022 07:23
…expression

This change allows the engine to infer that, for instance,
given t::timestamp(6)

    date_trunc('day', t) BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 00:00:00'

can be rewritten as

    t BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 23:59:59.999999'

The change applies for the temporal types:
- date
- timestamp
- timestamp with time zone

A range predicate such as BetweenPredicate can be transformed into a `TupleDomain`,
which helps with predicate pushdown.
A range-based `TupleDomain` representation is critical for connectors
that have min/max-based metadata (such as Iceberg data files, or Iceberg manifest
lists, which play a key role in partition pruning), because ranges allow
for intersection tests, something that is hard
to do in a generic manner for a `ConnectorExpression`.
This change allows the engine to infer that, for instance,
given t::timestamp(6)

    cast(t as date) BETWEEN DATE '2022-01-01' AND DATE '2022-01-02'

can be rewritten as

    t BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2022-01-02 23:59:59.999999'

@findinpath findinpath force-pushed the rewrite-between-for-temporal-casts branch from 5317143 to a7ab471 Compare October 3, 2022 08:14
  }
  LongTimestamp longTimestamp = (LongTimestamp) rangeStart;
  verify(longTimestamp.getPicosOfMicro() == 0, "Unexpected picos in %s, value not rounded to %s", rangeStart, rangeUnit);
- long endInclusiveMicros = (long) calculateRangeEndInclusive(longTimestamp.getEpochMicros(), createTimestampType(6), rangeUnit);
- return new LongTimestamp(endInclusiveMicros, toIntExact(PICOSECONDS_PER_MICROSECOND - scaleFactor(timestampType.getPrecision(), 12)));
+ long endInclusiveMicros = (long) calculateRangeEndInclusive(longTimestamp.getEpochMicros(), createTimestampType(TimestampType.MAX_SHORT_PRECISION), rangeUnit);
Member

the variable name is "endInclusiveMicros"
the code used 6, and it's known that 10^(-6) s is a microsecond.

after the change, the code uses TimestampType.MAX_SHORT_PRECISION. it's not obvious that this is correct (is short precision actually microseconds?), so this change actually decreases readability

- long endInclusiveMicros = (long) calculateRangeEndInclusive(longTimestamp.getEpochMicros(), createTimestampType(6), rangeUnit);
- return new LongTimestamp(endInclusiveMicros, toIntExact(PICOSECONDS_PER_MICROSECOND - scaleFactor(timestampType.getPrecision(), 12)));
+ long endInclusiveMicros = (long) calculateRangeEndInclusive(longTimestamp.getEpochMicros(), createTimestampType(TimestampType.MAX_SHORT_PRECISION), rangeUnit);
+ return new LongTimestamp(endInclusiveMicros, toIntExact(PICOSECONDS_PER_MICROSECOND - scaleFactor(timestampType.getPrecision(), TimestampType.MAX_PRECISION)));
Member

similar here. the use of PICOSECONDS_PER_MICROSECOND means we know we're dealing with picoseconds, i.e. 10^(-12) s, so it matched the corresponding 12 on this line

after the change, we invoke the "max precision" constant, but we still rely on it having an actual value of 12
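
As a rough sketch of the arithmetic both comments refer to, the following Java snippet computes the last picos-of-micro value for a long timestamp of a given precision. The constant names `MICROS_PRECISION` and `PICOS_PRECISION` are hypothetical, and `scaleFactor` is a local stand-in assumed to return 10^(to - from); neither is taken from the Trino codebase.

```java
// Hypothetical illustration only; names and the scaleFactor semantics are assumptions.
final class LongTimestampEndSketch
{
    private LongTimestampEndSketch() {}

    static final int PICOSECONDS_PER_MICROSECOND = 1_000_000;
    static final int MICROS_PRECISION = 6;  // 10^(-6) s is a microsecond
    static final int PICOS_PRECISION = 12;  // 10^(-12) s is a picosecond

    // Assumed semantics: 10^(toPrecision - fromPrecision).
    static long scaleFactor(int fromPrecision, int toPrecision)
    {
        if (fromPrecision > toPrecision) {
            throw new IllegalArgumentException("fromPrecision must not exceed toPrecision");
        }
        long factor = 1;
        for (int i = fromPrecision; i < toPrecision; i++) {
            factor *= 10;
        }
        return factor;
    }

    // Largest picos-of-micro value representable at the given precision (7 to 12),
    // i.e. one step of that precision below a full microsecond.
    static int lastPicosOfMicro(int precision)
    {
        if (precision <= MICROS_PRECISION || precision > PICOS_PRECISION) {
            throw new IllegalArgumentException("expected a long timestamp precision between 7 and 12");
        }
        return Math.toIntExact(PICOSECONDS_PER_MICROSECOND - scaleFactor(precision, PICOS_PRECISION));
    }
}
```

With constants named after the unit, the 12 visibly pairs with `PICOSECONDS_PER_MICROSECOND`, which is the dependency the comment says a "max precision" constant hides.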

@findepi
Member

findepi commented Oct 3, 2022

@findinpath let's have unwrapping of CASTs and date_trunc as separate PRs.
I'd like to focus on casts first.

@findinpath
Contributor Author

Continuing the work on #14451 and #14452

Development

Successfully merging this pull request may close these issues.

date_trunc range optimization should apply also for BETWEEN predicate
2 participants