nested columns + arrays = array columns! #13803

Merged 24 commits into apache:master from clintropolis:nested-array-columns on Mar 27, 2023

Conversation

clintropolis (Member) commented Feb 14, 2023

Description

This PR adds support for storing arrays of primitive values (ARRAY<STRING>, ARRAY<LONG>, and ARRAY<DOUBLE>) as specialized nested columns, instead of the current strategy, which stores a separate nested column for each array element.

Basically, it works by adding an additional value dictionary for all array values to our current STRING -> LONG -> DOUBLE dictionary stack. The array dictionary stores int[] values, whose elements are globalIds pointing to values in the STRING/LONG/DOUBLE value dictionaries.

[diagram: the array dictionary sits atop the STRING -> LONG -> DOUBLE value dictionaries; each array entry is an int[] of global ids]

So instead of the 2-phase lookup that scalar nested columns have, nested array values have a 3-phase lookup: first localId -> globalId, which yields an int[] of global ids, which can then be translated into the actual values.
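A minimal sketch of that lookup path, with assumed names (localToGlobal, arrayDictionary, and globalDictionary are illustrative stand-ins, not the actual classes):

// Illustrative only: the three phases of resolving a nested array value.
Object[] lookupArrayValue(int localId)
{
  final int globalArrayId = localToGlobal[localId];            // phase 1: localId -> globalId
  final int[] elementIds = arrayDictionary.get(globalArrayId); // phase 2: globalId -> int[] of element global ids
  final Object[] values = new Object[elementIds.length];
  for (int i = 0; i < elementIds.length; i++) {
    values[i] = globalDictionary.lookup(elementIds[i]);        // phase 3: element ids -> STRING/LONG/DOUBLE values
  }
  return values;
}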

This dictionary is stored with a newly added FrontCodedIntArrayIndexed, which is quite similar to the FrontCodedIndexed added in #12277, but stores int[] instead of byte[]. There is probably more code that could be shared between the two implementations (currently there is none; FrontCodedIntArrayIndexed and its writer are complete copies).
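For intuition, front coding stores each sorted value as the length of the prefix it shares with the preceding entry in its bucket, plus the remaining suffix. A rough sketch of the prefix computation for int[] values (the actual bucket layout may differ):

// Illustrative only: shared-prefix length between adjacent sorted int arrays.
static int commonPrefixLength(int[] previous, int[] next)
{
  int i = 0;
  while (i < previous.length && i < next.length && previous[i] == next[i]) {
    i++;
  }
  return i;
}
// each bucket entry after the first can then be written as the prefix length
// followed by the ints of 'next' that come after the shared prefix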

additional description TBD

changes:

  • add support for storing nested arrays of string, long, and double values as specialized nested columns instead of breaking them into separate element columns
  • nested column type mimic behavior means that columns ingested with only root arrays of primitive values will be ARRAY typed columns
  • neat test stuff


Release note


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

if (o instanceof Object[]) {
  Object[] array = (Object[]) o;
  if (elementNumber < array.length) {
    return array[elementNumber];
  }
}
Check failure (Code scanning / CodeQL): Improper validation of user-provided array index

This index depends on a user-provided value, which can cause an ArrayIndexOutOfBoundsException.
clintropolis (Member Author):

I want to get #13809 finished first and pull those changes into this PR to make things consistent with null value coercion

imply-cheddar (Contributor) left a comment:

Only got through a few of the files so far; I need to step away and come back to it.

- Assert.assertEquals(ImmutableList.of("dim1", "metric1", "timestamp"), rows.get(0).getDimensions());
+ Assert.assertEquals(ImmutableList.of("dim1", "metric1"), rows.get(0).getDimensions());
Contributor:

Why the change in expectation?

Member Author:

This is related to the change in the dimension filter in MapBasedInputRow to always filter out the timestamp spec column.

In 'normal' production code the timestamp spec is added to the dimension exclusions, so the code in MapBasedInputRow that computes dimensions ensures time is not in the dimensions list. In test code, however, especially in processing, which doesn't have access to the methods that translate a DataSchema into an InputRowSchema, it's pretty easy to forget to explicitly add the timestamp column to the exclusion list. As a result the timestamp column would end up in the dimensions list in schema discovery modes, as a string (or nested column, depending on config), which in rollup tests means it becomes part of the rollup key, and so on. (Again, this doesn't happen in current production code because it goes through that translator utility method.)

I changed MapBasedInputRow to always filter the timestamp spec column from the dimensions list, to make it harder to write wrong tests for schema discovery mode; that caused the change here and in other places.

Comment on lines 46 to 56
public static DelimitedInputFormat ofColumns(String... columns)
{
  return new DelimitedInputFormat(
      Arrays.asList(columns),
      null,
      null,
      false,
      false,
      0
  );
}
Contributor:

Location nit: does this need to be in main rather than test?

Member Author:

removed since there was only one caller

Comment on lines 44 to 51
public static final JsonInputFormat DEFAULT = new JsonInputFormat(
    JSONPathSpec.DEFAULT,
    null,
    null,
    null,
    null
);

Contributor:

Location nit: does this need to be in main rather than test?

Member Author:

removed

    DimensionsSpec dimensionsSpec,
    Map<String, Object> rawInputRow
)
{
  if (dimensionsSpec.isIncludeAllDimensions()) {
    LinkedHashSet<String> dimensions = new LinkedHashSet<>(dimensionsSpec.getDimensionNames());
    dimensions.addAll(Sets.difference(rawInputRow.keySet(), dimensionsSpec.getDimensionExclusions()));
    for (String field : rawInputRow.keySet()) {
      if (timestampSpec.getTimestampColumn().equals(field) || dimensionsSpec.getDimensionExclusions().contains(field)) {
Contributor:

can we just get these two things once?

Member Author:

Yeah, can adjust, though after an earlier discussion I just want to remove getDimensions from InputRow; I haven't decided whether to do that in this PR or some later change.

    @Nullable Object maybeArrayOfLiterals
)
{
  ExprEval<?> eval = ExprEval.bestEffortOf(maybeArrayOfLiterals);
Contributor:

Would it make sense to break ExprEval.bestEffortOf into a bunch of checks for different groups of expected types (i.e. ExprEval.maybeLiteral(), ExprEval.maybeArray(), etc.)? Calls to bestEffortOf could cascade through them, but places like this, which already know some of what they expect, could call the one that most aligns with their expectations.

Member Author:

I split out bestEffortArray(@Nullable List<?> theList) and changed the StructuredDataProcessor method to always pass a List when attempting to process arrays.
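A hypothetical sketch of the resulting call shape (evalField is illustrative, and the real bestEffortArray behavior may differ in detail):

static ExprEval<?> evalField(@Nullable Object value)
{
  if (value instanceof List) {
    // the caller already suspects an array, so try the array-specific path first
    return ExprEval.bestEffortArray((List<?>) value);
  }
  // everything else falls through to the general best-effort coercion
  return ExprEval.bestEffortOf(value);
}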

Comment on lines 90 to 101
final ExprEval<?> maybeLiteralArray = ExprEval.bestEffortOf(maybeArrayOfLiterals);
if (maybeLiteralArray.type().isArray() && maybeLiteralArray.type().getElementType().isPrimitive()) {
  final String fieldName = NestedPathFinder.toNormalizedJsonPath(fieldPath);
  LiteralFieldIndexer fieldIndexer = fieldIndexers.get(fieldName);
  if (fieldIndexer == null) {
    estimatedFieldKeySize += StructuredDataProcessor.estimateStringSize(fieldName);
    fieldIndexer = new LiteralFieldIndexer(globalDictionary);
    fieldIndexers.put(fieldName, fieldIndexer);
  }
  return fieldIndexer.processValue(maybeLiteralArray);
}
return null;
Contributor:

This looks a lot like code in NestedDataExpressions, except this doesn't return the NULL_LITERAL. I find myself wondering if the NestedDataExpressions code shouldn't look like this?

Member Author:

The contract of processArrayOfLiteralsField is to return a ProcessedLiteral if and only if the value was an array of literals (it is marked @Nullable). processLiteralField is not nullable and must always return a ProcessedLiteral.

When the StructuredDataProcessor encounters an array while processing some input, it first attempts processArrayOfLiteralsField; if that returns something, the value was an array of literals, otherwise it must instead process the array elements recursively. processLiteralField is called on everything that isn't a map or array.

I'll see if I can clarify it better.

Member Author:

I modified this stuff a bit and updated the javadocs so it is hopefully clearer. The new abstract method names are processField and processArrayField; for the latter, returning a non-null value halts further processing of the array, otherwise the processor continues for each element of the array.

Comment on lines 178 to 181
final LiteralFieldIndexer root = fieldIndexers.get(NestedPathFinder.JSON_PATH_ROOT);
if (root.getTypes().getSingleType().isArray()) {
  throw new UnsupportedOperationException("Not supported");
}
Contributor:

I'm reading this as "if all we have are root-level entries and they are always arrays, then throw a UOE". I'm pretty sure I'm reading it wrong, but I wish the error message gave me more context without my feeling like I need to expand lines on the review to know what this is validating.

Member Author:

Heh, you are reading it correctly. Currently the 'single typed root' dimension selector should only be used for scalar string columns; everything else should use the column value selector instead. Will try to clarify with comments.

Contributor:

Or put more words into the exception message so it's clear what's not supported?

Member Author:

clarified exception

@@ -411,7 +445,7 @@ public Object getObject()
   if (0 <= dimIndex && dimIndex < dims.length) {
     final StructuredData data = (StructuredData) dims[dimIndex];
     if (data != null) {
-      return data.getValue();
+      return ExprEval.bestEffortOf(data.getValue()).value();
Contributor:

Is this counting on coercion or something?

Member Author:

Yeah, I did this to make it consistent with the value that is stored in the dictionary (since the selector is used for merging/persisting).

case ARRAY:
  // skip empty arrays for now, they will always be called 'string' arrays, which isn't very helpful here
  // since it will pollute the type set
  Preconditions.checkNotNull(columnType.getElementType(), "Array element type must not be null");
Contributor:

If this were to ever happen, I think I'd want to know which field was the bad one.

Comment on lines 488 to 489
// skip empty arrays for now, they will always be called 'string' arrays, which isn't very helpful here since
// it will pollute the type set
Contributor:

How does it do the skipping of empties?

Member Author:

It was missing some code to do that; updated, and added tests for nulls, empties, and arrays of nulls.

} else if (adapter instanceof QueryableIndexIndexableAdapter) {
  dimValues = getSortedIndexesFromQueryableAdapter((QueryableIndexIndexableAdapter) adapter, mergedFields);
} else {
  throw new ISE("Unable to merge columns of unsupported adapter %s", adapter.getClass());
Contributor:

Brackets around the format arg, please: [%s].

Comment on lines 120 to 123
boolean allNulls = dimValues == null || allNull(dimValues.getSortedStrings()) &&
    allNull(dimValues.getSortedLongs()) &&
    allNull(dimValues.getSortedDoubles()) &&
    dimValues.getArrayCardinality() == 0;
Contributor:

This seems like a nice check to delegate to GlobalDictionarySortedCollector instead of implementing here?

Member Author:

moved to GlobalDictionarySortedCollector
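A sketch of the delegated check, mirroring the caller-side expression above; the method name and accessors are assumptions, and the caller would keep its own dimValues == null short-circuit:

public boolean allNull()
{
  return allNull(getSortedStrings())
      && allNull(getSortedLongs())
      && allNull(getSortedDoubles())
      && getArrayCardinality() == 0;
}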

Comment on lines 164 to 167
defaultSerializer.serializeArrayDictionary(() -> new ArrayDictionaryMergingIterator(
    sortedArrayLookups,
    defaultSerializer.getGlobalLookup()
));
Contributor:

Why can't this one just be sortedLookup.getSortedArrays() like the other 3?

Comment on lines +276 to +277
column.getArraysIterable(),
column.getArrayDictionary().size()
Contributor:

Why the 2 argument set of Iterable() and size() instead of a single collection-style object like the others?

Member Author:

It felt strange to wire this iterable up in an Indexed wrapper when all we need to do is iterate over it.

Contributor:

Sure, it doesn't have to be an Indexed; it just also seemed weird to be passing in fully encapsulated objects above and then suddenly start passing in 2 arguments to wrap a new thing that seems similar to the ones above.

Comment on lines 420 to 436
if (next[i] == null) {
  newIdsWhoDis[i] = 0;
} else if (next[i] instanceof String) {
  newIdsWhoDis[i] = idLookup.lookupString((String) next[i]);
} else if (next[i] instanceof Long) {
  newIdsWhoDis[i] = idLookup.lookupLong((Long) next[i]);
} else if (next[i] instanceof Double) {
  newIdsWhoDis[i] = idLookup.lookupDouble((Double) next[i]);
} else {
  newIdsWhoDis[i] = -1;
}
Preconditions.checkArgument(
    newIdsWhoDis[i] >= 0,
    "unknown global id [%s] for value [%s]",
    newIdsWhoDis[i],
    next[i]
);
Contributor:

Given that the global dictionaries, once merged, will be in type-sorted order, do we really need to convert back into the actual values instead of just converting the dictionary id?

Member Author:

So, we deal with them like this here so we can look up the new global id from the newly merged lower scalar value dictionaries. Otherwise we would need the mappings of old ids to new ids, which we don't currently have anywhere, and that is a lot more complicated since it is per segment. This way we just look up the old values, and after the lower dictionaries are merged, look up the array elements for the newly sorted values.

Member Author:

I think, since this stuff is all working, I'd like to save further changes/optimizations for a follow-up.

Preconditions.checkNotNull(inputFormat, "inputFormat");
Preconditions.checkNotNull(inputSourceTmpDir, "inputSourceTmpDir");

TransformSpec tranformer = transformSpec != null ? transformSpec : TransformSpec.NONE;
Contributor:

spell-check: transformer

.setIndexSchema(schema)
.setMaxRowCount(maxRows)
.build();
TransformSpec tranformer = transformSpec != null ? transformSpec : TransformSpec.NONE;
Contributor:

spell-check

InputSourceReader transformingReader = tranformer.decorate(reader);
try (CloseableIterator<InputRow> rowIterator = transformingReader.read()) {
  while (rowIterator.hasNext()) {
    incrementalIndex.add(rowIterator.next());
Contributor:

When maxRows is hit, are we expecting an exception? Generally speaking, setting maxRows in tests is done to force running queries against multiple segments, so I expected to see a check of numRows and incremental persists in anything that takes maxRows.

Member Author:

Hmm, good question. This looks wired up to the maxRowCount, but callers are not checking canAppendRow or anything like that, and buildIncrementalIndex can only return a single IncrementalIndex, so I think all it can do is explode. There is also intermediatePersistSize, which can be used to force a bunch of incremental indexes to be written when mergeIndexes is called, to make sure that segment merging happens.

Nothing much seems to be explicitly setting either of these things, so maybe they could be removed or reworked in a follow-up.
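For reference, a rough sketch of the multi-segment pattern the reviewer describes, under the assumption that a persistedIndexes list and a buildNewIndex helper exist (both hypothetical here):

while (rowIterator.hasNext()) {
  if (!incrementalIndex.canAppendRow()) {
    persistedIndexes.add(incrementalIndex);            // hand off the full index
    incrementalIndex = buildNewIndex(schema, maxRows); // hypothetical helper: start a fresh one
  }
  incrementalIndex.add(rowIterator.next());
}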

@@ -95,7 +95,7 @@ public DimFilter toDruidFilter(
   final DruidExpression leftExpr = druidExpressions.get(0);
   final DruidExpression rightExpr = druidExpressions.get(1);

-  if (leftExpr.isSimpleExtraction()) {
+  if (leftExpr.isSimpleExtraction() && !(leftExpr.getDruidType() != null && leftExpr.getDruidType().isArray())) {
Contributor:

What if it's an array_contains() over just a normal single-valued String column? Shouldn't that also match the filter, pretending that each row contains an array of size 1?

Member Author:

The type shouldn't show up as an array here if it isn't an array column.

Comment on lines 708 to 723
public void testQuery(
    final String sql,
    final Map<String, Object> queryContext,
    final List<Query<?>> expectedQueries,
    final List<Object[]> expectedResults,
    final RowSignature expectedResultSignature
)
{
  testBuilder()
      .sql(sql)
      .queryContext(queryContext)
      .expectedQueries(expectedQueries)
      .expectedResults(expectedResults)
      .expectedSignature(expectedResultSignature)
      .run();
}
Contributor:

Part of me wonders if, instead of adding more functions with extra arguments to their signatures, we should switch the call-sites to using the builder directly?

Member Author:

Removed in favor of just calling testBuilder. This should probably just be done everywhere, but it's kind of tedious to switch.

* FrontCodedIntArrayIndexed is now like v1 of FrontCodedIndexed with totally incremental buckets
* add tests for unnest and array columns
* fix unnest column value selector cursor handling of null and empty arrays
* refactor a bunch of stuff per review
clintropolis (Member Author):

I've fixed the behavior of UNNEST with regard to how it handles null and [] values for ARRAY typed inputs. Per experiments with postgres, null and empty arrays [] are supposed to output nothing when fed into UNNEST:

[screenshots: postgres experiments showing that UNNEST of null and empty arrays emits no rows]

But currently we are treating them as [null], which is unfortunately probably what we have to keep doing for multi-value STRING types, because we have trouble distinguishing null, [], and [null] there; with actual arrays there is no need to do this. So I have fixed UnnestColumnValueSelectorCursor to consume null and [] values, fixing the behavior, and added some tests.
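A minimal sketch of the fixed cursor behavior, with assumed names (baseCursor and readArrayFromSelector are illustrative): null and empty arrays are consumed without emitting any unnested rows, matching the postgres behavior above.

private void advancePastNullAndEmptyArrays()
{
  while (!baseCursor.isDone()
      && (unnestListForCurrentRow == null || unnestListForCurrentRow.isEmpty())) {
    baseCursor.advance();
    unnestListForCurrentRow = readArrayFromSelector(); // hypothetical helper
  }
}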

  }
} else {
-  index++;
+  if (++index >= unnestListForCurrentRow.size()) {

Check failure (Code scanning / CodeQL): User-controlled data in arithmetic expression

This arithmetic expression depends on a user-provided value, potentially causing an overflow.
clintropolis removed the WIP label Mar 23, 2023
// empty
NonnullPair<ExpressionType, Object[]> coerced = coerceListToArray(theList, false);
if (coerced == null) {
  return bestEffortOf(null);
Contributor:

Calling bestEffortOf(null) looks like it's going to go through about 15 if statements, failing them all, before just returning new StringExprEval(null). Why not just return the right thing here, given that we already know what it should be, and avoid the potential branches?

Member Author:

I think it would probably be best to have null handled first in bestEffortOf, so that if we ever decide to represent null as something more sensible (like introducing a null type) we won't be making a 'string' here, not to mention the savings of not running through a bunch of checks for a null value.

Contributor:

Sure. I actually expected you to create a method like makeNullNode() or something on ExprEval instead of newing it up directly here; just moving around the ordering of the statements is 🤷. Though checking for null first is probably better than last.
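A sketch combining both suggestions; makeNullNode is the hypothetical factory named above, not an existing ExprEval method:

public static ExprEval<?> bestEffortOf(@Nullable Object val)
{
  if (val == null) {
    return makeNullNode(); // today this would be new StringExprEval(null)
  }
  // ... the remaining type checks run only for non-null values
  return new StringExprEval(String.valueOf(val));
}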

Comment on lines 500 to 505
Preconditions.checkNotNull(
    columnType.getElementType(),
    "Array type [%s] for value [%s] missing element type, how did this possibly happen?",
    eval.type(),
    eval.valueOrDefault()
);
Contributor:

Nit: I think this would be better as an if statement. If this ever gets thrown, someone is going to want to attach a debugger to the point it is thrown from, and they will need to convert it to an if statement to do that without setting conditions on their debug breakpoint.

Member Author:

This shouldn't ever happen, but I guess I can wrap it in an if.

imply-cheddar (Contributor), Mar 27, 2023:

Just for clarity: I was suggesting that you use an if statement instead of Preconditions.

  sizeEstimate = globalDimensionDictionary.addStringArray(stringArray);
  return new StructuredDataProcessor.ProcessedValue<>(stringArray, sizeEstimate);
default:
  throw new IAE("Unhandled type: %s", columnType);
Contributor:

Just double checking, is this where we expect an array-of-arrays or an array-of-objects to end up?

Member Author:

We don't really expect this to happen, because the StructuredDataProcessor should have done its job correctly and only defined fields for primitives and arrays of primitives; this is more of a sanity check.

Contributor:

Ah, so array-of-arrays should be handled in StructuredDataProcessor instead?

Even after this PR, the handling of array-of-arrays and array-of-objects is still a bit muddy, right? (It's falling back to the "an array is just an object with numbers for field names" behavior?)

throw new IAE("Unhandled type: %s", columnType);
}
case STRING:
default:
Contributor:

What's an example of when we hit the default case? I'm legitimately asking, because I don't know the answer, and I find myself wondering whether believing it's a String is really the right thing to do versus generating a parsing error or something of that nature.

Member Author:

I don't think we should hit the default case; I'm not entirely sure why I wrote it like this originally.

cardinality,
System.currentTimeMillis() - dimStartTime
);
catch (Throwable ioe) {
Contributor:

Catching Throwable is dangerous; why cast such a wide net?

Member Author:

Hmm, I don't remember why I added this. I think I was debugging something and needed an easy place to catch and tell me what was failing for which column, and catching here seemed the best way to see what was messed up. It could probably be narrowed down to IOException and just rethrown.


Comment on lines 153 to 154
// due to vbyte encoding, the null value is not actually stored in the bucket (no negative values), so we adjust
// the index
Contributor:

If I understood correctly, can I suggest updating the comment to:

// For arrays, there is a conundrum of how to encode null differently from the empty array. In this code,
// we take the following approach:
//   1) the 0 entry in our dictionary is assumed to indicate a completely null entry
//   2) the 1 entry in our dictionary is assumed to be the empty array
//   3) instead of storing the 0 value in this dictionary, we employ an index-shifting technique, where the
//      dictionary never stores null and starts with the empty array (index 1), shifted by 1 such that what
//      is persisted on disk is actually 0 instead of 1.
// adjustIndex represents whether the null existed and thus whether we actually need to adjust the value
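A tiny illustration of the shift the comment describes (method names assumed); adjustIndex is true when the dictionary contains the null entry, which is never written into the bucket:

static int toStoredIndex(int dictionaryId, boolean adjustIndex)
{
  return dictionaryId - (adjustIndex ? 1 : 0); // persisting: shift ids down past null
}

static int toDictionaryId(int storedIndex, boolean adjustIndex)
{
  return storedIndex + (adjustIndex ? 1 : 0); // reading: shift back so id 0 can mean null
}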

  return ProcessedValue.NULL_LITERAL;
}
catch (IOException e) {
  throw new RE(e, "Failed to write field [%s] value [%s]", fieldPath, array);
Contributor:

This exception leaks the content of the data if it ever gets built. I think the best we can do here is mention which field we were trying to write.

Member Author:

Yeah, that's fair; I can change it. I was thinking in terms of helping myself debug what wasn't handled correctly, but it is leaky.

Comment on lines +48 to +52
private static final byte STRING_ARRAY_MASK = 1 << 4;

private static final byte LONG_ARRAY_MASK = 1 << 5;

private static final byte DOUBLE_ARRAY_MASK = 1 << 6;
Contributor:

I don't know if I'm being stingy, but couldn't we have an array mask that is applied, and then check the type mask after that? That would mean we could add array support and only consume 1 more bit instead of 3.

Member Author:

That would work if we only ever set 1 type, or if any row being an array meant all rows are arrays, but neither is currently true, so we sort of need to know which type of array, at least the way things currently work.

Contributor:

What about composite arrays? Like if the array has a String and a double? Or a String and an Object?

I had interpreted these as "this type exists in the column", not as "every row is this type". In which case, if there is an array, then it is true that the array type exists in the column, no? And then if the only scalar type is a long, you would know that the arrays are all longs, right?
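An illustration of the "this type exists in the column" reading, using the array masks from the snippet above; the scalar STRING_MASK and LONG_MASK constants are assumed here:

byte flags = 0;
flags |= STRING_MASK;      // column contains scalar string rows (assumed constant)
flags |= LONG_ARRAY_MASK;  // column also contains ARRAY<LONG> rows
// per-array-type bits keep mixed columns distinguishable:
boolean hasLongArrays = (flags & LONG_ARRAY_MASK) != 0;     // true
boolean hasStringArrays = (flags & STRING_ARRAY_MASK) != 0; // false
// a single shared "array" bit could not tell ARRAY<LONG> from ARRAY<STRING> apart
// in a column that also contains both long and string scalar rows.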

@@ -190,6 +190,7 @@ public static void setupNullValues()

 public static final Map<String, Object> QUERY_CONTEXT_NO_STRINGIFY_ARRAY =
     DEFAULT_QUERY_CONTEXT_BUILDER.put(QueryContexts.CTX_SQL_STRINGIFY_ARRAYS, false)
+        .put(PlannerContext.CTX_ENABLE_UNNEST, true)
Contributor:

Not sure how much I care about this, but I expected this to be put on the Unnest/Array tests instead of across all of the tests in this base class.

Member Author:

QUERY_CONTEXT_NO_STRINGIFY_ARRAY is AFAIK only used in CalciteArrayQueryTest and CalciteNestedDataQueryTest, so I think this is fine.

clintropolis merged commit d5b1b5b into apache:master Mar 27, 2023
clintropolis deleted the nested-array-columns branch March 27, 2023 19:42
clintropolis added a commit to clintropolis/druid that referenced this pull request Apr 2, 2023
changes:
* introduce ColumnFormat to separate physical storage format from logical type. ColumnFormat is now used instead of ColumnCapabilities to get column handlers for segment creation
* introduce new 'standard type' indexer, merger, and family of serializers, which is the next logical iteration of the nested column stuff. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json>.
* revert NestedDataColumnIndexer, NestedDataColumnMerger, NestedDataColumnSerializer to their version pre apache#13803 behavior (v4) for backwards compatibility
* fix a bug in RoaringBitmapSerdeFactory if anything actually ever wrote out an empty bitmap using toBytes and then later tried to read it (the nerve!)
clintropolis mentioned this pull request Apr 2, 2023
clintropolis added a commit that referenced this pull request Apr 5, 2023
changes:
* introduce ColumnFormat to separate physical storage format from logical type. ColumnFormat is now used instead of ColumnCapabilities to get column handlers for segment creation
* introduce new 'auto' type indexer and merger which produces a new common nested format of columns, which is the next logical iteration of the nested column stuff. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json>.
* revert NestedDataColumnIndexer, NestedDataColumnMerger, NestedDataColumnSerializer to their version pre #13803 behavior (v4) for backwards compatibility
* fix a bug in RoaringBitmapSerdeFactory if anything actually ever wrote out an empty bitmap using toBytes and then later tried to read it (the nerve!)
clintropolis added this to the 26.0 milestone Apr 10, 2023