
enable sql compatible null handling mode by default #14792

Merged

Conversation


@clintropolis clintropolis commented Aug 9, 2023

Partially satisfies #14154.

Description

I think #14142 was the only real roadblock to making SQL compatible mode the default, so with that merged, I think it is finally time. I think this should be merged as a pair in the same release that adds #14734, so we do the behavior changes all at once, but the end result will be much better behaved Druid SQL queries out of the box.

Release note

SQL compatible null handling mode is now enabled by default in Druid 28.0.0! This setting, druid.generic.useDefaultValueForNull, is now set to false by default in the code, so if it is not explicitly configured in runtime.properties, clusters will take on the new behavior for how Druid handles null values during ingestion and query processing upon upgrade. If you wish to retain the existing behavior, you must explicitly configure this setting to true. Any segments written in the new default mode can still be read correctly in the classic mode; at query time the null values will be ignored or coerced to zeros as appropriate.
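For operators who want to keep the legacy behavior across the upgrade, a minimal sketch of the override described above (assuming the setting is placed in runtime.properties, as the paragraph above mentions):

```properties
# Opt back into legacy (pre-28.0.0) null handling after upgrading.
# Omit this line to use the new SQL compatible default (false).
druid.generic.useDefaultValueForNull=true
```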


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • been tested in a test Druid cluster.

Comment on lines 37 to 38
"earliest_user":null,
"latest_user":null
clintropolis (Member, Author) commented:

this one seems kind of sad, though I guess it is legal based on our current implementations of native first/last, which allow nulls - investigating to make sure this is the right result though

clintropolis (Member, Author) commented:

it turns out this is actually a bug in the string first/last aggregators, which are incorrectly checking timeSelector.isNull prior to doing the 'fold' check. The timeSelector isn't really used when merging pairs from the selector because the time is embedded in the pair from the selector in that case, so timeSelector here can spit out nulls and incorrectly aggregate nothing when it should be aggregating pairs.

I'll do a short term fix, but longer term we should really probably split out the agg implementations that handle the raw values ("build") from the aggs that handle pairs ("merge") to avoid these mistakes and simplify the code.

There are also some flaws in the vector aggregator for string last, which does not check for nulls on the time column at all. While that avoids this particular bug, it has other bugs in cases where the time column could legitimately be null, such as when it comes from a virtual column.
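To make the ordering issue concrete, here is a minimal, hypothetical sketch; the selector interfaces, pair type, and class below are simplified stand-ins rather than the actual Druid string first/last aggregator code, and only illustrate why a null-time check placed before the 'fold' check drops pre-aggregated pairs:

```java
// Hypothetical, simplified stand-ins for illustration only; these are not the
// actual Druid selector interfaces or aggregator classes.
interface TimeSelector { boolean isNull(); long getLong(); }
interface ValueSelector { Object getObject(); }

// Stands in for the (timestamp, value) pair produced by partial aggregation
// (roughly the role SerializablePairLongString plays in the real aggregators).
record TimeAndValue(long time, String value) {}

class StringFirstSketch {
  private final TimeSelector timeSelector;
  private final ValueSelector valueSelector;
  private long firstTime = Long.MAX_VALUE;
  private String firstValue = null;

  StringFirstSketch(TimeSelector timeSelector, ValueSelector valueSelector) {
    this.timeSelector = timeSelector;
    this.valueSelector = valueSelector;
  }

  // Buggy ordering: a null from timeSelector short-circuits the whole row,
  // even when the row is a pre-aggregated pair carrying its own timestamp,
  // so those pairs are silently dropped.
  void aggregateBuggy() {
    if (timeSelector.isNull()) {
      return;
    }
    Object input = valueSelector.getObject();
    if (input instanceof TimeAndValue pair) {
      update(pair.time(), pair.value());
    } else {
      update(timeSelector.getLong(), input == null ? null : input.toString());
    }
  }

  // Fixed ordering: check for the "fold" (pair) case first; only consult
  // timeSelector for raw input rows, where the timestamp actually comes from it.
  void aggregateFixed() {
    Object input = valueSelector.getObject();
    if (input instanceof TimeAndValue pair) {
      update(pair.time(), pair.value());
    } else if (!timeSelector.isNull()) {
      update(timeSelector.getLong(), input == null ? null : input.toString());
    }
  }

  TimeAndValue result() {
    return new TimeAndValue(firstTime, firstValue);
  }

  private void update(long time, String value) {
    if (time < firstTime) {
      firstTime = time;
      firstValue = value;
    }
  }
}
```

This is also why the longer term "build"/"merge" split mentioned above is appealing: an aggregator that only ever sees raw rows and one that only ever sees pairs cannot end up with the null-time check in the wrong branch.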

clintropolis (Member, Author) commented:

#14195, which I forgot about, does the 'longer term' thing I mentioned in my previous comment

@gianm (Contributor) left a comment:

The production code changes look fine to me; I just have comments about the docs.

In addition to the line comments, we need an update to configuration/index.md.

@@ -82,10 +82,13 @@ For each row in the list of column data, there is only a single bitmap that has

## Handling null values

By default, Druid string dimension columns use the values `''` and `null` interchangeably. Numeric and metric columns cannot represent `null` but use nulls to mean `0`. However, Druid provides a SQL compatible null handling mode, which you can enable at the system level through `druid.generic.useDefaultValueForNull`. This setting, when set to `false`, allows Druid to create segments _at ingestion time_ in which the following occurs:
By default, Druid runs in a SQL compatible null handling mode, which allows Druid to create segments _at ingestion time_ in which the following occurs:
gianm (Contributor) commented:

This section could use some rewording to make it sound more like the SQL-compatible mode is the normal one. Various phrasings throughout the section are written as if legacy mode is normal.

For example, the data structures in SQL-compatible mode are described as "additional" over legacy. It'd be better to describe legacy as missing certain structures rather than SQL-compatible as adding them.

This is a nit, and it doesn't need to block this PR, but I think it would be good to do the rewording in a follow-on PR.

@@ -261,7 +261,7 @@ native boolean types, Druid ingests these values as strings if `druid.expression
the [array functions](../querying/sql-array-functions.md) or [UNNEST](../querying/sql-functions.md#unnest). Nested
columns can be queried with the [JSON functions](../querying/sql-json-functions.md).

We also highly recommend setting `druid.generic.useDefaultValueForNull=false` when using these columns since it also enables out of the box `ARRAY` type filtering. If not set to `false`, setting `sqlUseBoundsAndSelectors` to `false` on the [SQL query context](../querying/sql-query-context.md) can enable `ARRAY` filtering instead.
We also highly recommend setting `druid.generic.useDefaultValueForNull=false` (the default) when using these columns since it also enables out of the box `ARRAY` type filtering. If not set to `false`, setting `sqlUseBoundsAndSelectors` to `false` on the [SQL query context](../querying/sql-query-context.md) can enable `ARRAY` filtering instead.
gianm (Contributor) commented:

Hmm. This is ripe for confusion, since it's the only place sqlUseBoundsAndSelectors is mentioned, and it doesn't say what the setting does. It also makes it sound like there is no real downside to setting it to false always. But there is a downside, right?

I am not sure mentioning the setting is worth the added confusion. How about we don't mention it at all, and have the official position in the docs be that you need SQL-compatible null handling in order to get ARRAY filtering?

Btw, I recognize this patch isn't really changing this section meaningfully. But… still.

clintropolis (Member, Author) commented:

uh oh, this must have gotten lost in some document merging, since it was also documented in the query-context docs in #14760, but I no longer see it there...

clintropolis (Member, Author) commented:

oh wait, I guess it is still there, but there is a typo here: it should be sqlUseBoundAndSelectors, not sqlUseBoundsAndSelectors

clintropolis (Member, Author) commented:

anyway, yeah i can just remove mention of it i suppose, but it is documented at least

clintropolis (Member, Author) commented:

thinking a bit more about this, since it is the default now I think I can just remove this sentence entirely, and after doing some testing, it isn't even true: if this flag isn't set, then the filters plan into expression filters, which do produce the correct results, just a lot less efficiently than if the flag is set.

@@ -71,8 +71,8 @@ Casts between two SQL types with the same Druid runtime type have no effect othe
Casts between two SQL types that have different Druid runtime types generate a runtime cast in Druid.

If a value cannot be cast to the target type, as in `CAST('foo' AS BIGINT)`, Druid either substitutes a default
value (when `druid.generic.useDefaultValueForNull = true`, the default mode), or substitutes [NULL](#null-values) (when
gianm (Contributor) commented:

Swap these so the default (SQL-compatible) behavior is first, and explicitly mention that true is a legacy mode.


When `druid.generic.useDefaultValueForNull = true` (the default mode), Druid treats NULLs and empty strings
When `druid.generic.useDefaultValueForNull = true`, Druid treats NULLs and empty strings
gianm (Contributor) commented:

Explicitly call this a legacy mode.

@clintropolis clintropolis merged commit 5d14129 into apache:master Aug 22, 2023
@clintropolis clintropolis deleted the sql-compatible-mode-by-default branch August 22, 2023 03:07
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023