-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enable sql compatible null handling mode by default #14792
enable sql compatible null handling mode by default #14792
Conversation
"earliest_user":null, | ||
"latest_user":null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one seems kind of sad, though i guess legal based on our current implementations of native first/last which allow nulls - investigating to make sure this is the right result though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it turns out this is actually a bug in the string first/last aggregators, which are incorrectly checking timeSelector.isNull prior to doing the 'fold' check. The timeSelector isn't really used when merging pairs from the selector because the time is embedded in the pair of selector in that case, so timeSelector here can spit out nulls and incorrectly aggregate nothing when it should be aggregating pairs.
I'll do a short term fix, but longer term we should really probably split out the agg implementations that handle the raw values ("build") from the aggs that handle pairs ("merge") to avoid these mistakes and simplify the code.
There are also some flaws in the vector aggregator for string last, which is not checking for nulls at all on the time column, which while avoiding this bug, does have other bugs in cases where the time column could legitimately be null, such as from a virtual column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#14195 which I forgot about does the 'longer term' thing i mentioned in my previous comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The production code changes look fine to me, I just have comments about the docs.
In addition to the line comments, we need an update to configuration/index.md
.
docs/design/segments.md
Outdated
@@ -82,10 +82,13 @@ For each row in the list of column data, there is only a single bitmap that has | |||
|
|||
## Handling null values | |||
|
|||
By default, Druid string dimension columns use the values `''` and `null` interchangeably. Numeric and metric columns cannot represent `null` but use nulls to mean `0`. However, Druid provides a SQL compatible null handling mode, which you can enable at the system level through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, allows Druid to create segments _at ingestion time_ in which the following occurs: | |||
By default, Druid runs in a SQL compatible null handling mode, which allows Druid to create segments _at ingestion time_ in which the following occurs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section could use some rewording to make it sound more like the SQL-compatible mode is the normal one. Various phrasings throughout the section are written as if legacy mode is normal.
For example, the data structures in SQL-compatible mode are described as "additional" over legacy. It'd be better to describe legacy as missing certain structures rather than SQL-compatible as adding them.
This is a nit, and it doesn't need to block this PR, but I think it would be good to do it the rewords in a follow-on PR.
docs/ingestion/schema-design.md
Outdated
@@ -261,7 +261,7 @@ native boolean types, Druid ingests these values as strings if `druid.expression | |||
the [array functions](../querying/sql-array-functions.md) or [UNNEST](../querying/sql-functions.md#unnest). Nested | |||
columns can be queried with the [JSON functions](../querying/sql-json-functions.md). | |||
|
|||
We also highly recommend setting `druid.generic.useDefaultValueForNull=false` when using these columns since it also enables out of the box `ARRAY` type filtering. If not set to `false`, setting `sqlUseBoundsAndSelectors` to `false` on the [SQL query context](../querying/sql-query-context.md) can enable `ARRAY` filtering instead. | |||
We also highly recommend setting `druid.generic.useDefaultValueForNull=false` (the default) when using these columns since it also enables out of the box `ARRAY` type filtering. If not set to `false`, setting `sqlUseBoundsAndSelectors` to `false` on the [SQL query context](../querying/sql-query-context.md) can enable `ARRAY` filtering instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. This is ripe for confusion, since it's the only place sqlUseBoundsAndSelectors
is mentioned, and it doesn't say what the setting does. It also makes it sound like there is no real downside to setting it to false
always. But there is a downside, right?
I am not sure mentioning the setting is worth the added confusion. How about we don't mention it at all, and have the official position in the docs be that you need SQL-compatible null handling in order to get ARRAY
filtering?
Btw, I recognize this patch isn't really changing this section meaningfully. But… still.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uh oh, this must have gotten lost in some document merging, since it was also documented on the query-context docs in #14760, but i no longer see it there...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wait, i guess it is still there, but is a typo here, it should be sqlUseBoundAndSelectors
not sqlUseBoundsAndSelectors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anyway, yeah i can just remove mention of it i suppose, but it is documented at least
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking a bit more about this, since it is the default now I think i can just remove this entirely, and doing some testing it isn't even true. If this flag isn't set, then the filters plan into expression filters, which do produce the correct results, just a lot less efficiently than if the flag is set.
@@ -71,8 +71,8 @@ Casts between two SQL types with the same Druid runtime type have no effect othe | |||
Casts between two SQL types that have different Druid runtime types generate a runtime cast in Druid. | |||
|
|||
If a value cannot be cast to the target type, as in `CAST('foo' AS BIGINT)`, Druid either substitutes a default | |||
value (when `druid.generic.useDefaultValueForNull = true`, the default mode), or substitutes [NULL](#null-values) (when |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swap these so the default (SQL-compatible) behavior is first, and explicitly mention that true
is a legacy mode.
docs/querying/sql-data-types.md
Outdated
|
||
When `druid.generic.useDefaultValueForNull = true` (the default mode), Druid treats NULLs and empty strings | ||
When `druid.generic.useDefaultValueForNull = true`, Druid treats NULLs and empty strings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explicitly call this a legacy mode.
Partially satisfies #14154.
Description
I think #14142 was the only real roadblock to making SQL compatible mode be the default mode, so with that merged, I think it is finally time. I think this should be merged as a pair in the same release that adds #14734, so we do the behavior changes all at once, but the end result will be a much more well behaved Druid SQL queries out of the box.
Release note
SQL compatible null handling mode is now enabled by default in Druid 28.0.0! This setting,
druid.generic.useDefaultValueForNull
, is now set tofalse
by default in the code, so if not explicitly configured in runtime.properties, upon upgrade clusters will take on new behavior with how Druid handles null values during ingestion and query processing. If you wish to retain the existing behavior, you must explicitly configure this totrue
. Any segments written in the new default mode can still be read correctly in the classic mode, at query time the null values will be ignored or coerced to zeros as appropriate.This PR has: