
enable sql compatible null handling mode by default #14792

Merged

Conversation


@clintropolis clintropolis commented Aug 9, 2023

Partially satisfies #14154.

Description

I think #14142 was the only real roadblock to making SQL compatible mode the default, so with that merged, I think it is finally time. I think this should be merged as a pair in the same release that adds #14734, so we do the behavior changes all at once, but the end result will be much better behaved Druid SQL queries out of the box.

Release note

SQL compatible null handling mode is now enabled by default in Druid 28.0.0! This setting, druid.generic.useDefaultValueForNull, is now set to false by default in the code, so if it is not explicitly configured in runtime.properties, clusters will take on the new behavior for how Druid handles null values during ingestion and query processing upon upgrade. If you wish to retain the existing behavior, you must explicitly configure this setting to true. Any segments written in the new default mode can still be read correctly in the classic mode; at query time the null values will be ignored or coerced to zeros as appropriate.
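For operators who want to keep the legacy behavior across the upgrade, a minimal sketch of the override described above (assuming the setting is placed in runtime.properties, as the paragraph above mentions):

```properties
# Opt back into legacy (pre-28.0.0) null handling after upgrading.
# Omit this line to use the new SQL compatible default (false).
druid.generic.useDefaultValueForNull=true
```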


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • been tested in a test Druid cluster.

Comment on lines 37 to 38
"earliest_user":null,
"latest_user":null
clintropolis (Member, Author) commented:

this one seems kind of sad, though I guess it is legal based on our current implementations of native first/last, which allow nulls - investigating to make sure this is the right result though

clintropolis (Member, Author) commented:

it turns out this is actually a bug in the string first/last aggregators, which are incorrectly checking timeSelector.isNull prior to doing the 'fold' check. The timeSelector isn't really used when merging pairs from the selector because the time is embedded in the pair from the selector in that case, so timeSelector here can spit out nulls and incorrectly aggregate nothing when it should be aggregating pairs.

I'll do a short term fix, but longer term we should really probably split out the agg implementations that handle the raw values ("build") from the aggs that handle pairs ("merge") to avoid these mistakes and simplify the code.

There are also some flaws in the vector aggregator for string last, which does not check for nulls on the time column at all. While that avoids this particular bug, it has other bugs in cases where the time column could legitimately be null, such as when it comes from a virtual column.
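To make the ordering issue concrete, here is a minimal, hypothetical sketch; the selector interfaces, pair type, and class below are simplified stand-ins rather than the actual Druid string first/last aggregator code, and only illustrate why a null-time check placed before the 'fold' check drops pre-aggregated pairs:

```java
// Hypothetical, simplified stand-ins for illustration only; these are not the
// actual Druid selector interfaces or aggregator classes.
interface TimeSelector { boolean isNull(); long getLong(); }
interface ValueSelector { Object getObject(); }

// Stands in for the (timestamp, value) pair produced by partial aggregation
// (roughly the role SerializablePairLongString plays in the real aggregators).
record TimeAndValue(long time, String value) {}

class StringFirstSketch {
  private final TimeSelector timeSelector;
  private final ValueSelector valueSelector;
  private long firstTime = Long.MAX_VALUE;
  private String firstValue = null;

  StringFirstSketch(TimeSelector timeSelector, ValueSelector valueSelector) {
    this.timeSelector = timeSelector;
    this.valueSelector = valueSelector;
  }

  // Buggy ordering: a null from timeSelector short-circuits the whole row,
  // even when the row is a pre-aggregated pair carrying its own timestamp,
  // so those pairs are silently dropped.
  void aggregateBuggy() {
    if (timeSelector.isNull()) {
      return;
    }
    Object input = valueSelector.getObject();
    if (input instanceof TimeAndValue pair) {
      update(pair.time(), pair.value());
    } else {
      update(timeSelector.getLong(), input == null ? null : input.toString());
    }
  }

  // Fixed ordering: check for the "fold" (pair) case first; only consult
  // timeSelector for raw input rows, where the timestamp actually comes from it.
  void aggregateFixed() {
    Object input = valueSelector.getObject();
    if (input instanceof TimeAndValue pair) {
      update(pair.time(), pair.value());
    } else if (!timeSelector.isNull()) {
      update(timeSelector.getLong(), input == null ? null : input.toString());
    }
  }

  TimeAndValue result() {
    return new TimeAndValue(firstTime, firstValue);
  }

  private void update(long time, String value) {
    if (time < firstTime) {
      firstTime = time;
      firstValue = value;
    }
  }
}
```

This is also why the longer term "build"/"merge" split mentioned above is appealing: an aggregator that only ever sees raw rows and one that only ever sees pairs cannot end up with the null-time check in the wrong branch.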

clintropolis (Member, Author) commented:

#14195, which I forgot about, does the 'longer term' thing I mentioned in my previous comment

@gianm (Contributor) left a comment:

The production code changes look fine to me; I just have comments about the docs.

In addition to the line comments, we need an update to configuration/index.md.

@@ -82,10 +82,13 @@ For each row in the list of column data, there is only a single bitmap that has

## Handling null values

By default, Druid string dimension columns use the values `''` and `null` interchangeably. Numeric and metric columns cannot represent `null` but use nulls to mean `0`. However, Druid provides a SQL compatible null handling mode, which you can enable at the system level through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, allows Druid to create segments _at ingestion time_ in which the following occurs:
By default, Druid runs in a SQL compatible null handling mode, which allows Druid to create segments _at ingestion time_ in which the following occurs:
gianm (Contributor) commented:

This section could use some rewording to make it sound more like the SQL-compatible mode is the normal one. Various phrasings throughout the section are written as if legacy mode is normal.

For example, the data structures in SQL-compatible mode are described as "additional" over legacy. It'd be better to describe legacy as missing certain structures rather than SQL-compatible as adding them.

This is a nit, and it doesn't need to block this PR, but I think it would be good to do the rewording in a follow-on PR.

@@ -261,7 +261,7 @@ native boolean types, Druid ingests these values as strings if `druid.expression
the [array functions](../querying/sql-array-functions.md) or [UNNEST](../querying/sql-functions.md#unnest). Nested
columns can be queried with the [JSON functions](../querying/sql-json-functions.md).

We also highly recommend setting `druid.generic.useDefaultValueForNull=false` when using these columns since it also enables out of the box `ARRAY` type filtering. If not set to `false`, setting `sqlUseBoundsAndSelectors` to `false` on the [SQL query context](../querying/sql-query-context.md) can enable `ARRAY` filtering instead.
We also highly recommend setting `druid.generic.useDefaultValueForNull=false` (the default) when using these columns since it also enables out of the box `ARRAY` type filtering. If not set to `false`, setting `sqlUseBoundsAndSelectors` to `false` on the [SQL query context](../querying/sql-query-context.md) can enable `ARRAY` filtering instead.
gianm (Contributor) commented:

Hmm. This is ripe for confusion, since it's the only place sqlUseBoundsAndSelectors is mentioned, and it doesn't say what the setting does. It also makes it sound like there is no real downside to setting it to false always. But there is a downside, right?

I am not sure mentioning the setting is worth the added confusion. How about we don't mention it at all, and have the official position in the docs be that you need SQL-compatible null handling in order to get ARRAY filtering?

Btw, I recognize this patch isn't really changing this section meaningfully. But… still.

clintropolis (Member, Author) commented:

uh oh, this must have gotten lost in some document merging, since it was also documented in the query-context docs in #14760, but I no longer see it there...

clintropolis (Member, Author) commented:

oh wait, I guess it is still there, but there is a typo here: it should be sqlUseBoundAndSelectors, not sqlUseBoundsAndSelectors

clintropolis (Member, Author) commented:

anyway, yeah i can just remove mention of it i suppose, but it is documented at least

clintropolis (Member, Author) commented:

thinking a bit more about this, since it is the default now I think I can just remove this sentence entirely, and after doing some testing, it isn't even true: if this flag isn't set, then the filters plan into expression filters, which do produce the correct results, just a lot less efficiently than if the flag is set.

@@ -71,8 +71,8 @@ Casts between two SQL types with the same Druid runtime type have no effect othe
Casts between two SQL types that have different Druid runtime types generate a runtime cast in Druid.

If a value cannot be cast to the target type, as in `CAST('foo' AS BIGINT)`, Druid either substitutes a default
value (when `druid.generic.useDefaultValueForNull = true`, the default mode), or substitutes [NULL](#null-values) (when
gianm (Contributor) commented:

Swap these so the default (SQL-compatible) behavior is first, and explicitly mention that true is a legacy mode.


When `druid.generic.useDefaultValueForNull = true` (the default mode), Druid treats NULLs and empty strings
When `druid.generic.useDefaultValueForNull = true`, Druid treats NULLs and empty strings
gianm (Contributor) commented:

Explicitly call this a legacy mode.

@clintropolis clintropolis merged commit 5d14129 into apache:master Aug 22, 2023
@clintropolis clintropolis deleted the sql-compatible-mode-by-default branch August 22, 2023 03:07
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023