Derive predicate over base column for `CAST(a_varchar AS date) > a_date_constant` condition #12925

ubyyj · 2022-06-22T06:50:12Z

For iceberg table, if partition key in an expression (cast) in the where clause, then it's not pushed down to table scan. while hive table does.

below is the explain output of a query, which the partition key dt is casted to date (this is required by Tableau for incremental refresh), the predicate is not pushed down to table scan.

CREATE EXTERNAL TABLE t1(
  name string COMMENT 'from deserializer',
  age int COMMENT 'from deserializer',
  dt string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.iceberg.mr.hive.HiveIcebergSerDe'
STORED BY
  'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION
  'hdfs://namenode/user/hive/warehouse/t1'
TBLPROPERTIES (
  'bucketing_version'='1',
  'engine.hive.enabled'='true',
  'external.table.purge'='TRUE',
  'format-version'='2',
  'last_modified_by'='root',
  'last_modified_time'='1655557984',
  'metadata_location'='...',
  'previous_metadata_location'='..',
  'table_type'='ICEBERG',
  'transient_lastDdlTime'='1655557984',
  'uuid'='81f55231-f6ed-4a22-97ea-3fe792cb2927',
  'write.delete.mode'='copy-on-write')

explain select cast(dt as date) as dt, count(name) as cnt from iceberg.default.t1 where cast(dt as date) > date_add('day', -3, current_date)  and cast(dt as date)  <= date_add('day', 3, current_date)   group by dt;
                                                                                               
 Fragment 0 [SINGLE]
     Output layout: [expr, count]
     Output partitioning: SINGLE []
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Output[dt, cnt]
     │   Layout: [expr:date, count:bigint]
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
     │   dt := expr
     │   cnt := count
     └─ RemoteSource[1]
            Layout: [count:bigint, expr:date]

 Fragment 1 [HASH]
     Output layout: [count, expr]
     Output partitioning: SINGLE []
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Project[]
     │   Layout: [count:bigint, expr:date]
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
     │   expr := CAST("dt" AS date)
     └─ Aggregate(FINAL)[dt][$hashvalue]
        │   Layout: [dt:varchar, $hashvalue:bigint, count:bigint]
        │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
        │   count := count("count_0")
        └─ LocalExchange[HASH][$hashvalue] ("dt")
           │   Layout: [dt:varchar, count_0:bigint, $hashvalue:bigint]
           │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
           └─ RemoteSource[2]
                  Layout: [dt:varchar, count_0:bigint, $hashvalue_1:bigint]

 Fragment 2 [SOURCE]
     Output layout: [dt, count_0, $hashvalue_2]
     Output partitioning: HASH [dt][$hashvalue_2]
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Aggregate(PARTIAL)[dt][$hashvalue_2]
     │   Layout: [dt:varchar, $hashvalue_2:bigint, count_0:bigint]
     │   count_0 := count("name")
     └─ ScanFilterProject[table = iceberg:default.t1$data@3690406149830550716, grouped = false, filterPredicate = ((CAST("dt" AS date) > DATE '2022-06-19') AND (CAST("dt" AS date) <= DATE '2022-06-25'))]
            Layout: [name:varchar, dt:varchar, $hashvalue_2:bigint]
            Estimates: {rows: 2 (264B), cpu: 246, memory: 0B, network: 0B}/{rows: 1 (66B), cpu: 492, memory: 0B, network: 0B}/{rows: 1 (66B), cpu: 558, memory: 0B, network: 0B}
            $hashvalue_2 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("dt"), 0))
            dt := 3:dt:varchar
            name := 1:name:varchar`

for a similar HIVE table , the condition is pushed down.

CREATE EXTERNAL TABLE actions(
  user_id bigint)
PARTITIONED BY (
  dt string,
  hh string,
  action string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://namenode/user/hive/warehouse/actions'
TBLPROPERTIES (
  'transient_lastDdlTime'='1655858884')

explain select cast(dt as date) as dt, count(user_id) as cnt from hive.default.actions where cast(dt as date) > date_add('day', -3, current_date)  and cast(dt as date)  <= date_add('day', 3, current_date)  group by dt;

 Fragment 0 [SINGLE]
     Output layout: [expr, count]
     Output partitioning: SINGLE []
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Output[dt, cnt]
     │   Layout: [expr:date, count:bigint]
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
     │   dt := expr
     │   cnt := count
     └─ RemoteSource[1]
            Layout: [count:bigint, expr:date]

 Fragment 1 [HASH]
     Output layout: [count, expr]
     Output partitioning: SINGLE []
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Project[]
     │   Layout: [count:bigint, expr:date]
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
     │   expr := CAST("dt" AS date)
     └─ Aggregate(FINAL)[dt][$hashvalue]
        │   Layout: [dt:varchar, $hashvalue:bigint, count:bigint]
        │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
        │   count := count("count_0")
        └─ LocalExchange[HASH][$hashvalue] ("dt")
           │   Layout: [dt:varchar, count_0:bigint, $hashvalue:bigint]
           │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
           └─ RemoteSource[2]
                  Layout: [dt:varchar, count_0:bigint, $hashvalue_1:bigint]

 Fragment 2 [SOURCE]
     Output layout: [dt, count_0, $hashvalue_2]
     Output partitioning: HASH [dt][$hashvalue_2]
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Aggregate(PARTIAL)[dt][$hashvalue_2]
     │   Layout: [dt:varchar, $hashvalue_2:bigint, count_0:bigint]
     │   count_0 := count("user_id")
     └─ ScanFilterProject[table = hive:default:actions, grouped = false, filterPredicate = ((CAST("dt" AS date) > DATE '2022-06-19') AND (CAST("dt" AS date) <= DATE '2022-06-25'))]
            Layout: [user_id:bigint, dt:varchar, $hashvalue_2:bigint]
            Estimates: {rows: 1 (33B), cpu: 24, memory: 0B, network: 0B}/{rows: 0 (8B), cpu: 48, memory: 0B, network: 0B}/{rows: 0 (8B), cpu: 56, memory: 0B, network: 0B}
            $hashvalue_2 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("dt"), 0))
            dt := dt:string:PARTITION_KEY
                :: [[2022-06-20]]
            user_id := user_id:bigint:REGULAR
            hh:string:PARTITION_KEY
                :: [[00]]
            action:string:PARTITION_KEY
                :: [[post]]

The text was updated successfully, but these errors were encountered:

findepi · 2022-06-22T11:25:10Z

For the record, #12795 #12918 improve Iceberg pushdown capabilities, but only when the partition key is of a temporal type.

In this issue, your Iceberg table doesn't seem partitioned and Hive table is partitioned.
Also, can you elaborate on the actual and expected behavior?
Thanks for sharing the EXPLAIN plans, but I don't know what detail of the plans you're looking at.

ubyyj · 2022-06-23T00:50:49Z

The iceberg table t1 is partitioned, just the show create table result can't tell it. Below is what I got from the metadata file:
"partition-specs" : [ { "spec-id" : 0, "fields" : [ { "name" : "dt", "transform" : "identity", "source-id" : 3, "field-id" : 1000 } ] } ],
I want (I guess) the ScanFilterProject should tell which dates will be scanned, like what hive plan tells, the [[2022-06-20]] part:
dt := dt:string:PARTITION_KEY :: [[2022-06-20]]

ubyyj · 2022-06-23T10:02:43Z

as far as I can understand, the push down does not succeed because the dt is a varchar type, while timestamp(6) is expected, at here

findepi · 2022-06-23T21:12:45Z

I want (I guess) the ScanFilterProject should tell which dates will be scanned, like what hive plan tells, the [[2022-06-20]] part:
dt := dt:string:PARTITION_KEY :: [[2022-06-20]]

the actual way to test this is to count number of source splits

I think your needs should be addressed by @homar 's #9830

ubyyj · 2022-06-24T00:11:30Z

I am not sure it's addressed by #9830. As for the latest release 387, such query will keep coordinator busy for long time, and then timeout (the table is relative big, with millions of parquet files, and millions of partitions)

findepi · 2022-06-24T11:51:51Z

and then timeout

do you have a stacktrace?

ubyyj · 2022-06-25T00:25:11Z

here is one

Query-20220625_001742_00041_njuwj-43241 RUNNABLE #43241 
  at java.net.SocketInputStream.socketRead0(SocketInputStream.java:-2)
  at java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
  at java.net.SocketInputStream.read(SocketInputStream.java:168)
  at java.net.SocketInputStream.read(SocketInputStream.java:140)
  at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:478)
  at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:472)
  at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:160)
  at sun.security.ssl.SSLTransport.decode(SSLTransport.java:111)
  at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1506)
  at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1416)
  at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:456)
  at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:427)
  at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
  at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
  at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:142)
  at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
  at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
  at jdk.internal.reflect.GeneratedMethodAccessor840.invoke(:-1)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:566)
  at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
  at com.amazonaws.http.conn.$Proxy634.connect(:-1)
  at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
  at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
  at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
  at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
  at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1343)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5453)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5400)
  at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1372)
  at io.trino.plugin.hive.s3.TrinoS3FileSystem.lambda$getS3ObjectMetadata$6(TrinoS3FileSystem.java:785)
  at io.trino.plugin.hive.s3.TrinoS3FileSystem$$Lambda$10174/0x0000ffc62c1fb508.call(:-1)
  at io.trino.plugin.hive.util.RetryDriver.run(RetryDriver.java:130)
  at io.trino.plugin.hive.s3.TrinoS3FileSystem.getS3ObjectMetadata(TrinoS3FileSystem.java:782)
  at io.trino.plugin.hive.s3.TrinoS3FileSystem.getS3ObjectMetadata(TrinoS3FileSystem.java:766)
  at io.trino.plugin.hive.s3.TrinoS3FileSystem.getFileStatus(TrinoS3FileSystem.java:446)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
  at org.apache.iceberg.hadoop.HadoopInputFile.lazyStat(HadoopInputFile.java:156)
  at org.apache.iceberg.hadoop.HadoopInputFile.getLength(HadoopInputFile.java:167)
  at io.trino.plugin.iceberg.HdfsInputFile$$Lambda$10172/0x0000ffc62c1fc8b0.run(:-1)
  at io.trino.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
  at io.trino.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:97)
  at io.trino.plugin.iceberg.HdfsInputFile.getLength(HdfsInputFile.java:55)
  at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:101)
  at org.apache.iceberg.avro.AvroIterable.getMetadata(AvroIterable.java:66)
  at org.apache.iceberg.ManifestReader.<init>(ManifestReader.java:104)
  at org.apache.iceberg.ManifestFiles.read(ManifestFiles.java:87)
  at org.apache.iceberg.ManifestGroup.lambda$entries$15(ManifestGroup.java:245)
  at org.apache.iceberg.ManifestGroup$$Lambda$10203/0x0000ffc62bff4920.apply(:-1)
  at org.apache.iceberg.relocated.com.google.common.collect.Iterators$6.transform(Iterators.java:783)
  at org.apache.iceberg.relocated.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
  at org.apache.iceberg.relocated.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
  at org.apache.iceberg.util.ParallelIterable$ParallelIterator.submitNextTask(ParallelIterable.java:131)
  at org.apache.iceberg.util.ParallelIterable$ParallelIterator.checkTasks(ParallelIterable.java:118)
  at org.apache.iceberg.util.ParallelIterable$ParallelIterator.hasNext(ParallelIterable.java:155)
  at java.lang.Iterable.forEach(Iterable.java:74)
  at io.trino.plugin.iceberg.TableStatisticsMaker.makeTableStatistics(TableStatisticsMaker.java:109)
  at io.trino.plugin.iceberg.TableStatisticsMaker.getTableStatistics(TableStatisticsMaker.java:60)
  at io.trino.plugin.iceberg.IcebergMetadata.getTableStatistics(IcebergMetadata.java:1916)
  at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.getTableStatistics(ClassLoaderSafeConnectorMetadata.java:326)
  at io.trino.metadata.MetadataManager.getTableStatistics(MetadataManager.java:445)
  at io.trino.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:64)
  at io.trino.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:38)
  at io.trino.cost.SimpleStatsRule.calculate(SimpleStatsRule.java:39)
  at io.trino.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:85)
  at io.trino.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:73)
  at io.trino.cost.CachingStatsProvider.getStats(CachingStatsProvider.java:80)
  at io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource.lambda$getSourceTablesRowCount$0(UseNonPartitionedJoinLookupSource.java:153)
  at io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource$$Lambda$10465/0x0000ffc9489db0b0.applyAsDouble(:-1)
  at java.util.stream.ReferencePipeline$6$1.accept(ReferencePipeline.java:246)
  at java.util.Collections$2.tryAdvance(Collections.java:4747)
  at java.util.Collections$2.forEachRemaining(Collections.java:4755)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  at java.util.stream.DoublePipeline.collect(DoublePipeline.java:516)
  at java.util.stream.DoublePipeline.sum(DoublePipeline.java:425)
  at io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource.getSourceTablesRowCount(UseNonPartitionedJoinLookupSource.java:154)
  at io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource.getSourceTablesRowCount(UseNonPartitionedJoinLookupSource.java:135)
  at io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource.apply(UseNonPartitionedJoinLookupSource.java:89)
  at io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource.apply(UseNonPartitionedJoinLookupSource.java:65)
  at io.trino.sql.planner.iterative.IterativeOptimizer.transform(IterativeOptimizer.java:202)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreNode(IterativeOptimizer.java:169)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:132)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.optimize(IterativeOptimizer.java:117)
  at io.trino.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:240)
  at io.trino.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:217)
  at io.trino.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:212)
  at io.trino.execution.SqlQueryExecution.doPlanQuery(SqlQueryExecution.java:477)
  at io.trino.execution.SqlQueryExecution.planQuery(SqlQueryExecution.java:458)
  at io.trino.execution.SqlQueryExecution.start(SqlQueryExecution.java:399)
  at io.trino.execution.SqlQueryManager.createQuery(SqlQueryManager.java:243)
  at io.trino.dispatcher.LocalDispatchQueryFactory$$Lambda$8076/0x0000ffc641ff1cb0.accept(:-1)
  at io.trino.dispatcher.LocalDispatchQuery.lambda$startExecution$7(LocalDispatchQuery.java:143)
  at io.trino.dispatcher.LocalDispatchQuery$$Lambda$8174/0x0000ffc641d2ecb0.run(:-1)
  at io.trino.$gen.Trino_387____20220623_071609_2.run(:-1)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.lang.Thread.run(Thread.java:829)

ubyyj · 2022-06-25T00:29:25Z

and in Tableau, we got following error message:
[Simba][Presto] (1060) Presto Query Error: The optimizer exhausted the time limit of 180000 ms: Top rules: { io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource@7fc3a7eb: 269393 ms, 1 invocations, 0 ...[see log] (65556

findepi · 2022-06-27T20:35:11Z

and in Tableau, we got following error message: The optimizer exhausted the time limit of 180000 ms

~~This is very disturbing.~~
~~Can you please file a separate issue about this? Please include your query reproducing the problem, the table schema and partitioning (SHOW CREATE TABLE output).~~

Edit: disregard.

findepi · 2022-06-28T10:11:12Z

When dt is a varchar column, let's think what can be the value of dt for cast(dt as date) >= DATE '2022-06-29' to be true.
Obviously, the dt can be '2022-06-29', '2022-06-30', '2022-07-....
However, the CAST accepts various different other forms of dates (see #13011).
So dt can start with a whitespace (any whitespace character, not just U+0020 SPACE), eg ' \t\n\t\r 2022-06-29'.
The year can be zero-padded, eg '0002022-06-29',

We still could derive a predicate directly over dt which would be very efficient in practice, in the spirit of #3618.
We just need to think about a few edge cases when doing that.

findepi · 2022-06-28T15:13:30Z

BTW, for Iceberg i do strongly recommend to switch your column data type to be date.
You will be able to leverage partitioning transforms later, if you need to. cc @alexjo2144

the issue may still be useful for existing data sets with suboptimal data type, and also useful for Hive.

ubyyj · 2022-06-29T00:22:28Z

the table is a successor of a hive table, so we kept the data type.
For hive table, everything worked as expected, wondering the difference between HiveMetadata and IcebergMetadata.

alexjo2144 · 2022-06-29T07:56:36Z

It will work okay but you loose the ability to use any date transforms: https://iceberg.apache.org/spec/#partition-transforms

ubyyj · 2022-06-30T02:12:03Z

can we treat identity transform separately, so iceberg can match hive's perf in this scenario

findepi changed the title ~~iceberg partition key in cast expression is not pushed down~~ Derive predicate over base column for CAST(a_varchar AS date) > a_date_constant condition Jun 28, 2022

This was referenced Aug 9, 2022

Improve comparison predicate pushdown when varchar column cast to date #13567

Merged

Add a test for varchar cast to date predicate #13605

Closed

findepi closed this as completed in #13567 Aug 29, 2022

findepi mentioned this issue Aug 29, 2022

Release notes for 394 #13723

Closed

okayhooni mentioned this issue Dec 15, 2023

Add config to reject Iceberg queries without partition pruning #20118

Closed

okayhooni mentioned this issue Jul 10, 2024

Regression on handling CAST(varchar_partition_field AS date) > date_type_filter_value on query planning #22643

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Derive predicate over base column for `CAST(a_varchar AS date) > a_date_constant` condition #12925

Derive predicate over base column for `CAST(a_varchar AS date) > a_date_constant` condition #12925

ubyyj commented Jun 22, 2022 •

edited by findepi

Loading

findepi commented Jun 22, 2022

ubyyj commented Jun 23, 2022

ubyyj commented Jun 23, 2022 •

edited

Loading

findepi commented Jun 23, 2022

ubyyj commented Jun 24, 2022

findepi commented Jun 24, 2022

ubyyj commented Jun 25, 2022 •

edited by findepi

Loading

ubyyj commented Jun 25, 2022

findepi commented Jun 27, 2022 •

edited

Loading

findepi commented Jun 28, 2022 •

edited

Loading

findepi commented Jun 28, 2022

ubyyj commented Jun 29, 2022

alexjo2144 commented Jun 29, 2022

ubyyj commented Jun 30, 2022

Derive predicate over base column for CAST(a_varchar AS date) > a_date_constant condition #12925

Derive predicate over base column for CAST(a_varchar AS date) > a_date_constant condition #12925

Comments

ubyyj commented Jun 22, 2022 • edited by findepi Loading

findepi commented Jun 22, 2022

ubyyj commented Jun 23, 2022

ubyyj commented Jun 23, 2022 • edited Loading

findepi commented Jun 23, 2022

ubyyj commented Jun 24, 2022

findepi commented Jun 24, 2022

ubyyj commented Jun 25, 2022 • edited by findepi Loading

ubyyj commented Jun 25, 2022

findepi commented Jun 27, 2022 • edited Loading

findepi commented Jun 28, 2022 • edited Loading

findepi commented Jun 28, 2022

ubyyj commented Jun 29, 2022

alexjo2144 commented Jun 29, 2022

ubyyj commented Jun 30, 2022

Derive predicate over base column for `CAST(a_varchar AS date) > a_date_constant` condition #12925

Derive predicate over base column for `CAST(a_varchar AS date) > a_date_constant` condition #12925

ubyyj commented Jun 22, 2022 •

edited by findepi

Loading

ubyyj commented Jun 23, 2022 •

edited

Loading

ubyyj commented Jun 25, 2022 •

edited by findepi

Loading

findepi commented Jun 27, 2022 •

edited

Loading

findepi commented Jun 28, 2022 •

edited

Loading