Fix 'Multiple entries with same key' error with dynamic filters #20917

Merged

Contributor

@lxqfy lxqfy commented Mar 4, 2024

Description

Trino throws a "Multiple entries with same key" exception when the same dynamic filter is applied to the same partition key column more than once under different conditions, e.g.:

SELECT *
FROM tbl_a a
JOIN tbl_b b ON b.partition_key_date >= a.start_date AND b.partition_key_date <= a.start_date
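
As a rough illustration of why this fails (not Trino's actual code): a map collector that rejects duplicate keys breaks as soon as two conditions point at the same column. The sketch below uses the JDK's `Collectors.toMap`, which throws `IllegalStateException` on a duplicate key, as a stand-in for Guava's `toImmutableMap`. The `Descriptor` record and the `byColumn` helper are hypothetical names introduced only for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class DuplicateKeyDemo
{
    // Hypothetical stand-in for a dynamic filter descriptor: one comparison
    // against a probe-side column. This is not Trino's DynamicFilters.Descriptor.
    record Descriptor(String column, String operator) {}

    // Builds a column -> operator map. Collectors.toMap without a merge
    // function throws IllegalStateException when two entries share a key.
    static Map<String, String> byColumn(List<Descriptor> descriptors)
    {
        return descriptors.stream()
                .collect(Collectors.toMap(Descriptor::column, Descriptor::operator));
    }

    public static void main(String[] args)
    {
        // Both join conditions from the query above target the same column.
        List<Descriptor> descriptors = List.of(
                new Descriptor("partition_key_date", ">="),
                new Descriptor("partition_key_date", "<="));
        try {
            byColumn(descriptors);
        }
        catch (IllegalStateException e) {
            System.out.println("Collector rejected duplicate key: " + e.getMessage());
        }
    }
}
```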

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# General
* Fix query failures with "Multiple entries with same key" error encountered with joins on partitioned tables. ({issue}`20917`)

cla-bot bot commented Mar 4, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Li Xiangqun.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check if your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using: git config --global user.email email@example.com
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@lxqfy lxqfy force-pushed the trinodb_dynamic_filter_duplicate_key branch from caa3ed7 to 223d557 Compare March 4, 2024 10:33
@cla-bot cla-bot bot added the cla-signed label Mar 4, 2024
@wendigo wendigo requested a review from raunaqmorarka March 4, 2024 16:31
@@ -440,19 +441,22 @@ private TupleDomain<ColumnHandle> translateSummaryToTupleDomain(
     {
         Collection<DynamicFilters.Descriptor> descriptors = descriptorMultimap.get(filterId);
         return TupleDomain.withColumnDomains(descriptors.stream()
-                .collect(toImmutableMap(
+                .collect(Collectors.groupingBy(
                         descriptor -> {
                             Symbol probeSymbol = Symbol.from(descriptor.getInput());
                             return requireNonNull(columnHandles.get(probeSymbol), () -> format("Missing probe column for %s", probeSymbol));
                         },
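
The hunk above appears to replace the `toImmutableMap` collector with `Collectors.groupingBy`, so that several descriptors for one column are grouped instead of colliding. A minimal standalone sketch of that idea follows. It assumes the grouped domains are then combined by intersection (reasonable because the join conditions are AND-ed, but the rest of the change is not shown here), and `Range`, `ColumnDomain`, and `merge` are hypothetical names, not Trino types.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class GroupedDomainsDemo
{
    // Hypothetical closed range of long values standing in for a column domain.
    record Range(long low, long high)
    {
        // Keeps only the values that both ranges allow.
        Range intersect(Range other)
        {
            return new Range(Math.max(low, other.low), Math.min(high, other.high));
        }
    }

    // Hypothetical pairing of a probe column with the domain one comparison implies.
    record ColumnDomain(String column, Range range) {}

    // Groups the domains per column, then intersects each group into a single
    // domain, so a repeated column yields one entry instead of a duplicate key.
    static Map<String, Range> merge(List<ColumnDomain> domains)
    {
        return domains.stream()
                .collect(Collectors.groupingBy(
                        ColumnDomain::column,
                        Collectors.reducing(
                                new Range(Long.MIN_VALUE, Long.MAX_VALUE),
                                ColumnDomain::range,
                                Range::intersect)));
    }

    public static void main(String[] args)
    {
        // Assume the build side reported values between 10 and 20:
        // "col >= build" implies col >= 10, and "col <= build" implies col <= 20.
        List<ColumnDomain> domains = List.of(
                new ColumnDomain("partition_key_date", new Range(10, Long.MAX_VALUE)),
                new ColumnDomain("partition_key_date", new Range(Long.MIN_VALUE, 20)));
        System.out.println(merge(domains));
    }
}
```

Intersection is used here because every grouped condition must hold at once; a different combination rule would need a different reducer.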
Member


Please add a test in BaseDynamicPartitionPruningTest

Contributor Author


Noted, will do.

@raunaqmorarka raunaqmorarka requested a review from sopel39 March 6, 2024 10:03
@raunaqmorarka raunaqmorarka added the bug Something isn't working label Mar 6, 2024
@sopel39
Member

sopel39 commented Mar 6, 2024

lgtm % @raunaqmorarka comments.

SELECT *
FROM tbl_a a
JOIN tbl_b b ON b.partition_key_date >= a.start_date AND b.partition_key_date <= a.start_date

that query really simplifies to:

SELECT *
FROM tbl_a a
JOIN tbl_b b ON b.partition_key_date = a.start_date

and an inequality join is pretty inefficient.

Is there a better example where it fails?

@@ -71,6 +71,7 @@
import java.util.concurrent.atomic.AtomicLong;
Member


nit: commit message too long

@lxqfy
Contributor Author

lxqfy commented Mar 7, 2024

> lgtm % @raunaqmorarka comments.
>
> SELECT *
> FROM tbl_a a
> JOIN tbl_b b ON b.partition_key_date >= a.start_date AND b.partition_key_date <= a.start_date
>
> that query really simplifies to:
>
> SELECT *
> FROM tbl_a a
> JOIN tbl_b b ON b.partition_key_date = a.start_date
>
> and an inequality join is pretty inefficient.
>
> Is there a better example where it fails?

This is just a simplified query to reproduce the issue.

@lxqfy lxqfy force-pushed the trinodb_dynamic_filter_duplicate_key branch from 223d557 to ccb0192 Compare March 7, 2024 03:00
@raunaqmorarka raunaqmorarka changed the title Fixed 'Multiple entries with same key' issues for dynamic filters. Fix 'Multiple entries with same key' error with dynamic filters Mar 7, 2024
@raunaqmorarka raunaqmorarka force-pushed the trinodb_dynamic_filter_duplicate_key branch from ccb0192 to 9492455 Compare March 7, 2024 04:21
@raunaqmorarka raunaqmorarka merged commit 8052969 into trinodb:master Mar 7, 2024
2 of 13 checks passed
@github-actions github-actions bot added this to the 440 milestone Mar 7, 2024
@sopel39
Member

sopel39 commented Mar 7, 2024

> This is just a simplified query to reproduce the issue.

I understand. However, the issue this PR fixes is rather basic, yet it wasn't discovered before. Hence I wonder what practical query could trigger it, because queries like ON b.partition_key_date >= a.start_date AND b.partition_key_date <= a.start_date are bad for performance anyway.

@lxqfy What kind of production query hit the issue?

@lxqfy
Contributor Author

lxqfy commented Mar 7, 2024

> > This is just a simplified query to reproduce the issue.
>
> I understand. However, the issue this PR fixes is rather basic, yet it wasn't discovered before. Hence I wonder what practical query could trigger it, because queries like ON b.partition_key_date >= a.start_date AND b.partition_key_date <= a.start_date are bad for performance anyway.
>
> @lxqfy What kind of production query hit the issue?

The actual query spans 200 lines and is overly complex. In general, the query owner builds a CTE that UNIONs data at several granularities ("Daily", "Weekly", and "Monthly"), each with a start_date and an end_date. For "Daily", they simply SELECT report_date AS start_date, report_date AS end_date, so both bounds are the same value for those rows. The CTE is then JOINed to another table ON event_date BETWEEN start_date AND end_date.

@sopel39
Member

sopel39 commented Mar 7, 2024

@lxqfy but does that prod query simplify to ON b.partition_key_date >= a.start_date AND b.partition_key_date <= a.start_date?

The reason I'm asking is maybe we should have a rule that transforms ON b.partition_key_date >= a.start_date AND b.partition_key_date <= a.start_date into ON b.partition_key_date = a.start_date?
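
A minimal sketch of what such a rewrite could look like, on a toy expression model rather than Trino's planner. The `Operator` enum, the `Comparison` record, and the `simplify` method are hypothetical. The sketch only handles conjuncts written with identical operand order, and it ignores type-specific comparison semantics that a real rule would have to consider.

```java
import java.util.Optional;

final class RangeToEqualityDemo
{
    enum Operator { GREATER_THAN_OR_EQUAL, LESS_THAN_OR_EQUAL, EQUAL }

    // Hypothetical comparison "left <operator> right" over column names.
    record Comparison(String left, Operator operator, String right) {}

    // Returns "left = right" when the two conjuncts are "left >= right" and
    // "left <= right" (in either order) over the same operands; otherwise empty.
    static Optional<Comparison> simplify(Comparison first, Comparison second)
    {
        boolean sameOperands = first.left().equals(second.left())
                && first.right().equals(second.right());
        boolean oppositeBounds =
                (first.operator() == Operator.GREATER_THAN_OR_EQUAL
                        && second.operator() == Operator.LESS_THAN_OR_EQUAL)
                || (first.operator() == Operator.LESS_THAN_OR_EQUAL
                        && second.operator() == Operator.GREATER_THAN_OR_EQUAL);
        if (sameOperands && oppositeBounds) {
            return Optional.of(new Comparison(first.left(), Operator.EQUAL, first.right()));
        }
        return Optional.empty();
    }

    public static void main(String[] args)
    {
        Comparison lower = new Comparison(
                "b.partition_key_date", Operator.GREATER_THAN_OR_EQUAL, "a.start_date");
        Comparison upper = new Comparison(
                "b.partition_key_date", Operator.LESS_THAN_OR_EQUAL, "a.start_date");
        System.out.println(simplify(lower, upper));
    }
}
```

Conjuncts that compare against different columns, such as start_date and end_date in the production query described earlier, are left untouched by this sketch.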
