Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix nested column range index range computation #13297

Merged

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Nov 2, 2022

Description

Fixes a bug with nested column range index value range computation that can lead to incorrect values matching filters when the range is not present in the nested fields local dictionary.

This PR fixes the issue by checking that the computed local start and end index values global id mapping does not violate the computed original global id range whenever the actual global ids are not present in the local dictionary. That's a lot of words thats probably hard to follow, but its basically an off by 1 error when mapping the global range to the local range in certain cases.

This wrong behavior was accidentally encoded in a test I wrote which would have caught the issue if the test had been expecting the correct answer, so I went through and manually verified that the tests are actually expecting the correct thing this time, as well as tried to get 100% coverage on everything which matters in practice, which is now pretty high
Screen Shot 2022-11-02 at 2 08 41 AM

The remaining uncovered lines are largely things like safety checks that are not hit in regular operation such as calling next on an iterator before calling hasNext, etc, which I cannot hit during normal usage of the index supplier.

Release note

Fixes a bug with nested columns when using filters that use range indexes such as greater/less than or like filters.


This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

localStartIndex = -(localFound + 1);
// if the computed local start index violates the global range, shift up by 1
int actualGlobalStartIndex = localDictionary.get(localStartIndex);
if (actualGlobalStartIndex < globalStartIndex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be done in a loop or can we be sure that the global index at the next localStartIndex would be valid?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so i looked a lot closer in debugger and working things out by hand, and this code on the start index is pointless and only can be hit in cases where the range will effectively be empty anyway so can be dropped. it also uncovered a missing bounds check on FixedIndexed get method, which I have fixed and added tests for 😅

I've re-arranged some things so the logic is simplified, it was really only the end index that needed adjustment when it was missing from the local dictionary, and that was because I was doing the shifting to account for the end index being exclusive in a funny manner. I think its clearer now. Thanks for asking this question!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Simpler to follow now.

I also wonder if the last value returned from FixedIndex.indexOf() should always be -(minIndex + 1).
My concern is that if we haven't found the value until now, then the insertion point could either be before or after the currIndex that we are evaluating. I think we should incorporate that fact into the returned value.

Say at the last step, we found that our value is bigger than the currValue, so minIndex becomes currIndex + 1, then we break out of the loop and then we return -(minIndex + 1) = -(currIndex + 2).

Doesn't this mean that we have actually skipped the insertion point? Or maybe I am missing something.
(Although, I guess this should be okay because the Indexed interface doesn't make any promises about the returned negative value.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FixedIndexed.indexOf is the same-ish implementation as GenericIndexed.indexOf, both of which are basically the same as Arrays.binarySearch, so I think it is ok. Btw, range finder method is used for both FixedIndexed and GenericIndexed global dictionaries depending on if it is making LexicographicalRangeIndex for nested strings or NumericRangeIndex for nested numbers, though the local dictionary is always a FixedIndexed.

Or did you mean something about how the range calculation is using indexOf to compute the ranges?

btw, I did recently strengthen the contract of indexOf to be (-(insertion point) - 1) for missing values if isSorted() returns true, since these index things basically require the (-(insertion point) - 1) behavior to work correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for the clarification!

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor query, otherwise looks good.

@clintropolis clintropolis merged commit 018f984 into apache:master Nov 3, 2022
@clintropolis clintropolis deleted the fix-nested-column-range-filter branch November 3, 2022 04:37
clintropolis added a commit to clintropolis/druid that referenced this pull request Nov 3, 2022
* fix nested column range index range computation

* simplify, add missing bounds check for FixedIndexed
clintropolis added a commit to clintropolis/druid that referenced this pull request Nov 3, 2022
* fix nested column range index range computation

* simplify, add missing bounds check for FixedIndexed
@kfaraz kfaraz added this to the 24.0.1 milestone Nov 3, 2022
kfaraz pushed a commit that referenced this pull request Nov 3, 2022
* fix nested column range index range computation

* simplify, add missing bounds check for FixedIndexed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants