Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

31.0.0 not filter is always false for missing column #17528

Open
pmmag opened this issue Dec 2, 2024 · 2 comments
Open

31.0.0 not filter is always false for missing column #17528

pmmag opened this issue Dec 2, 2024 · 2 comments

Comments

@pmmag
Copy link

pmmag commented Dec 2, 2024

Affected Version

31.0.0

Description

Hello,

We have a use case where we use string-based schemaless ingestion for a datasource where a particular column may or may not be present. We were querying that data source with a condition to filter out rows where that column is NOT equal to a particular value ({"type": "not", "field": {"type": "selector", "dimension": "our-column", "value": "our-value"}}). However, against our expectations, this filter was not returning rows that were missing this column entirely.

We were able to reproduce this by simply querying on a completely bogus column:

{
  "queryType": "scan",
  "dataSource": {
    "type": "table",
    "name": "some-table"
  },
  "filter": {
    "type": "not",
    "field": {
      "type": "selector",
      "dimension": "kfjhdskgjshdk-this-column-obviously-does-not-exist",
      "value": "should-not-matter"
    }
  },
  "intervals": [
    "0/2025"
  ]
}

Expectation: this filter should be equivalent to no filter at all, because there are no rows with this column and consequently all rows should match the condition that this row is not equal to some value.

Reality: no rows are returned. The same happens with an in-filter and even with a regex filter.

We are now able to work around this by instead filtering on a virtual column (with nvl(our-column, 'default_value')), but we are concerned that this behavior my silently filter out data in some other cases. Or if this behavior is by design, we would at least expect to find it in the documentation because it is very surprising.

EDIT: as an update to this, these queries were working as expected in Druid version 24.0.1 which we recently upgraded from.

@gianm
Copy link
Contributor

gianm commented Dec 4, 2024

This sounds like expected behavior to me, given SQL compatible null handling and 3-valued logic. The nonexistent column kfjhdskgjshdk-this-column-obviously-does-not-exist is treated as a column composed entirely of nulls. When we compare a null to a constant like "should-not-matter", the result is UNKNOWN. The NOT of UNKNOWN is UNKNOWN itself. And when a top-level filter is UNKNOWN, it's treated as FALSE.

Some time between Druid 24 and 31, SQL compatible NULL handling became the default behavior, so that's likely why upgrading changed behavior here.

If you want the top level filter to be TRUE (rather than UNKNOWN) in this case, you can replace it with:

"filter": {
    "type": "not",
    "field": {
      "type": "istrue",
      "field": {
        "type": "selector",
        "dimension": "kfjhdskgjshdk-this-column-obviously-does-not-exist",
        "value": "should-not-matter"
      }
    }
  }

The istrue between not and selector changes the UNKNOWN from the selector to FALSE, which can then be negated by the not to TRUE.

For the time being, you can also set the NULL handling back to legacy mode to get the old behavior. This legacy mode will however be removed in the future, so that's not a long term solution.

@pmmag
Copy link
Author

pmmag commented Dec 5, 2024

Ok, thanks for the clarification. I had read about the SQL compatible NULL changes but had not considered they would impact something like this (especially in native queries). I'm not sure this type of "side-effect" was immediately obvious though.

This makes a bit more sense if you know exactly which columns exist but when you have some columns that may or may not be present I guess you need to be more careful (SQL compliance really makes much less sense intuitively when there is no strict schema).

Perhaps it could be added to the documentation of the selector filter that it treats UNKNOWN values as false. That does sound reasonable when you spell it out, but it may surprise less experienced users when the rows returned by the filter "not(a = b)" isn't the logical complement of the rows returned by "a = b". (EDIT: well, in one sentence on https://druid.apache.org/docs/latest/querying/filters there is a statement about the 3-valued logic, very easy to miss it and fail to understand all the consequences though).

We can live with this now by educating our team and changing some wrapper code to generate logical not filters in the form of not(istrue(filter))) instead of just not(filter). I already had to fix this in at least four separate places in our application after upgrading to 31.0.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants