nested columns + arrays = array columns! #13803

Merged 24 commits into apache:master from clintropolis:nested-array-columns on Mar 27, 2023

Conversation

clintropolis (Member) commented Feb 14, 2023

Description

This PR adds support for storing arrays of primitive values (ARRAY<STRING>, ARRAY<LONG>, and ARRAY<DOUBLE>) as specialized nested columns, instead of the current strategy, which stores a separate nested column for each array element.

Basically, it works by adding an additional value dictionary for all array values to our current STRING -> LONG -> DOUBLE dictionary stack. The array dictionary stores int[] values, whose elements are globalIds pointing to values in the STRING/LONG/DOUBLE value dictionaries.

[diagram: the array dictionary sits atop the STRING -> LONG -> DOUBLE value dictionaries; each array entry is an int[] of global ids]

So instead of the 2-phase lookup that scalar nested columns have, nested array values have a 3-phase lookup: first localId -> globalId, which yields an int[] of global ids, which can then be translated into the actual values.
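A minimal sketch of that lookup path, with assumed names (localToGlobal, arrayDictionary, and globalDictionary are illustrative stand-ins, not the actual classes):

// Illustrative only: the three phases of resolving a nested array value.
Object[] lookupArrayValue(int localId)
{
  final int globalArrayId = localToGlobal[localId];            // phase 1: localId -> globalId
  final int[] elementIds = arrayDictionary.get(globalArrayId); // phase 2: globalId -> int[] of element global ids
  final Object[] values = new Object[elementIds.length];
  for (int i = 0; i < elementIds.length; i++) {
    values[i] = globalDictionary.lookup(elementIds[i]);        // phase 3: element ids -> STRING/LONG/DOUBLE values
  }
  return values;
}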

This dictionary is stored with a newly added FrontCodedIntArrayIndexed, which is quite similar to the FrontCodedIndexed added in #12277, but stores int[] instead of byte[]. There is probably more code that could be shared between the two implementations (currently there is none; FrontCodedIntArrayIndexed and its writer are complete copies).
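For intuition, front coding stores each sorted value as the length of the prefix it shares with the preceding entry in its bucket, plus the remaining suffix. A rough sketch of the prefix computation for int[] values (the actual bucket layout may differ):

// Illustrative only: shared-prefix length between adjacent sorted int arrays.
static int commonPrefixLength(int[] previous, int[] next)
{
  int i = 0;
  while (i < previous.length && i < next.length && previous[i] == next[i]) {
    i++;
  }
  return i;
}
// each bucket entry after the first can then be written as the prefix length
// followed by the ints of 'next' that come after the shared prefix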

additional description TBD

changes:

  • add support for storing nested arrays of string, long, and double values as specialized nested columns instead of breaking them into separate element columns
  • nested column type mimic behavior means that columns ingested with only root arrays of primitive values will be ARRAY typed columns
  • neat test stuff


Release note


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

if (o instanceof Object[]) {
  Object[] array = (Object[]) o;
  if (elementNumber < array.length) {
    return array[elementNumber];
  }
}
Check failure (Code scanning / CodeQL): Improper validation of user-provided array index

This index depends on a user-provided value, which can cause an ArrayIndexOutOfBoundsException.
clintropolis (Member Author):

I want to get #13809 finished first and pull those changes into this PR to make things consistent with null value coercion

imply-cheddar (Contributor) left a comment:

Only got through a few of the files so far; I need to step away and come back to it.

- Assert.assertEquals(ImmutableList.of("dim1", "metric1", "timestamp"), rows.get(0).getDimensions());
+ Assert.assertEquals(ImmutableList.of("dim1", "metric1"), rows.get(0).getDimensions());
Contributor:

Why the change in expectation?

Member Author:

This is related to the change in the dimension filter in MapBasedInputRow to always filter out the timestamp spec column.

In 'normal' production code the timestamp spec is added to the dimension exclusions, so the code in MapBasedInputRow that computes dimensions ensures time is not in the dimensions list. In test code, however, especially in processing, which doesn't have access to the methods that translate a DataSchema into an InputRowSchema, it's pretty easy to forget to explicitly add the timestamp column to the exclusion list. As a result the timestamp column would end up in the dimensions list in schema discovery modes, as a string (or nested column, depending on config), which in rollup tests means it becomes part of the rollup key, and so on. (Again, this doesn't happen in current production code because it goes through that translator utility method.)

I changed MapBasedInputRow to always filter the timestamp spec column from the dimensions list, to make it harder to write wrong tests for schema discovery mode; that caused the change here and in other places.

Comment on lines 46 to 56
public static DelimitedInputFormat ofColumns(String... columns)
{
  return new DelimitedInputFormat(
      Arrays.asList(columns),
      null,
      null,
      false,
      false,
      0
  );
}
Contributor:

Location nit: does this need to be in main rather than test?

Member Author:

removed since there was only one caller

Comment on lines 44 to 51
public static final JsonInputFormat DEFAULT = new JsonInputFormat(
    JSONPathSpec.DEFAULT,
    null,
    null,
    null,
    null
);

Contributor:

Location nit: does this need to be in main rather than test?

Member Author:

removed

    DimensionsSpec dimensionsSpec,
    Map<String, Object> rawInputRow
)
{
  if (dimensionsSpec.isIncludeAllDimensions()) {
    LinkedHashSet<String> dimensions = new LinkedHashSet<>(dimensionsSpec.getDimensionNames());
    dimensions.addAll(Sets.difference(rawInputRow.keySet(), dimensionsSpec.getDimensionExclusions()));
    for (String field : rawInputRow.keySet()) {
      if (timestampSpec.getTimestampColumn().equals(field) || dimensionsSpec.getDimensionExclusions().contains(field)) {
Contributor:

can we just get these two things once?

Member Author:

Yeah, can adjust, though after an earlier discussion I just want to remove getDimensions from InputRow; I haven't decided whether to do that in this PR or some later change.

    @Nullable Object maybeArrayOfLiterals
)
{
  ExprEval<?> eval = ExprEval.bestEffortOf(maybeArrayOfLiterals);
Contributor:

Would it make sense to break ExprEval.bestEffortOf into a bunch of checks for different groups of expected types (i.e. ExprEval.maybeLiteral(), ExprEval.maybeArray(), etc.)? Calls to bestEffortOf could cascade through them, but places like this, which already know some of what they expect, could call the one that most aligns with their expectations.

Member Author:

I split out bestEffortArray(@Nullable List<?> theList) and changed the StructuredDataProcessor method to always pass a List when attempting to process arrays.
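A hypothetical sketch of the resulting call shape (evalField is illustrative, and the real bestEffortArray behavior may differ in detail):

static ExprEval<?> evalField(@Nullable Object value)
{
  if (value instanceof List) {
    // the caller already suspects an array, so try the array-specific path first
    return ExprEval.bestEffortArray((List<?>) value);
  }
  // everything else falls through to the general best-effort coercion
  return ExprEval.bestEffortOf(value);
}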

Comment on lines 90 to 101
final ExprEval<?> maybeLiteralArray = ExprEval.bestEffortOf(maybeArrayOfLiterals);
if (maybeLiteralArray.type().isArray() && maybeLiteralArray.type().getElementType().isPrimitive()) {
  final String fieldName = NestedPathFinder.toNormalizedJsonPath(fieldPath);
  LiteralFieldIndexer fieldIndexer = fieldIndexers.get(fieldName);
  if (fieldIndexer == null) {
    estimatedFieldKeySize += StructuredDataProcessor.estimateStringSize(fieldName);
    fieldIndexer = new LiteralFieldIndexer(globalDictionary);
    fieldIndexers.put(fieldName, fieldIndexer);
  }
  return fieldIndexer.processValue(maybeLiteralArray);
}
return null;
Contributor:

This looks a lot like code in NestedDataExpressions, except this doesn't return the NULL_LITERAL. I find myself wondering if the NestedDataExpressions code shouldn't look like this?

Member Author:

The contract of processArrayOfLiteralsField is to return a ProcessedLiteral if and only if the value was an array of literals (it is marked @Nullable). processLiteralField is not nullable and must always return a ProcessedLiteral.

When the StructuredDataProcessor encounters an array while processing some input, it first attempts processArrayOfLiteralsField; if that returns something, the value was an array of literals, otherwise it must instead process the array elements recursively. processLiteralField is called on everything that isn't a map or array.

I'll see if I can clarify it better.

Member Author:

I modified this stuff a bit and updated the javadocs so it is hopefully clearer. The new abstract method names are processField and processArrayField; for the latter, returning a non-null value halts further processing of the array, otherwise the processor continues for each element of the array.

Comment on lines 178 to 181
final LiteralFieldIndexer root = fieldIndexers.get(NestedPathFinder.JSON_PATH_ROOT);
if (root.getTypes().getSingleType().isArray()) {
  throw new UnsupportedOperationException("Not supported");
}
Contributor:

I'm reading this as "if all we have are root-level entries and they are always arrays, then throw a UOE". I'm pretty sure I'm reading it wrong, but I wish the error message gave me more context without my feeling like I need to expand lines on the review to know what this is validating.

Member Author:

Heh, you are reading it correctly. Currently the 'single typed root' dimension selector should only be used for scalar string columns; everything else should use the column value selector instead. Will try to clarify with comments.

Contributor:

Or put more words into the exception message so it's clear what's not supported?

Member Author:

clarified exception

@@ -411,7 +445,7 @@ public Object getObject()
   if (0 <= dimIndex && dimIndex < dims.length) {
     final StructuredData data = (StructuredData) dims[dimIndex];
     if (data != null) {
-      return data.getValue();
+      return ExprEval.bestEffortOf(data.getValue()).value();
Contributor:

Is this counting on coercion or something?

Member Author:

Yeah, I did this to make it consistent with the value that is stored in the dictionary (since the selector is used for merging/persisting).

case ARRAY:
  // skip empty arrays for now, they will always be called 'string' arrays, which isn't very helpful here
  // since it will pollute the type set
  Preconditions.checkNotNull(columnType.getElementType(), "Array element type must not be null");
Contributor:

If this were to ever happen, I think I'd want to know which field was the bad one.

Comment on lines 488 to 489
// skip empty arrays for now, they will always be called 'string' arrays, which isn't very helpful here since
// it will pollute the type set
Contributor:

How does it do the skipping of empties?

Member Author:

It was missing some code to do that; updated, and added tests for nulls, empties, and arrays of nulls.

} else if (adapter instanceof QueryableIndexIndexableAdapter) {
  dimValues = getSortedIndexesFromQueryableAdapter((QueryableIndexIndexableAdapter) adapter, mergedFields);
} else {
  throw new ISE("Unable to merge columns of unsupported adapter %s", adapter.getClass());
Contributor:

Brackets around the format arg, please: [%s].

Comment on lines 120 to 123
boolean allNulls = dimValues == null || allNull(dimValues.getSortedStrings()) &&
    allNull(dimValues.getSortedLongs()) &&
    allNull(dimValues.getSortedDoubles()) &&
    dimValues.getArrayCardinality() == 0;
Contributor:

This seems like a nice check to delegate to GlobalDictionarySortedCollector instead of implementing here?

Member Author:

moved to GlobalDictionarySortedCollector
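A sketch of the delegated check, mirroring the caller-side expression above; the method name and accessors are assumptions, and the caller would keep its own dimValues == null short-circuit:

public boolean allNull()
{
  return allNull(getSortedStrings())
      && allNull(getSortedLongs())
      && allNull(getSortedDoubles())
      && getArrayCardinality() == 0;
}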

Comment on lines 164 to 167
defaultSerializer.serializeArrayDictionary(() -> new ArrayDictionaryMergingIterator(
    sortedArrayLookups,
    defaultSerializer.getGlobalLookup()
));
Contributor:

Why can't this one just be sortedLookup.getSortedArrays() like the other 3?

Comment on lines +276 to +277
column.getArraysIterable(),
column.getArrayDictionary().size()
Contributor:

Why the 2 argument set of Iterable() and size() instead of a single collection-style object like the others?

Member Author:

It felt strange to wire this iterable up in an Indexed wrapper when all we need to do is iterate over it.

Contributor:

Sure, it doesn't have to be an Indexed; it just also seemed weird to be passing in fully encapsulated objects above and then suddenly start passing in 2 arguments to wrap a new thing that seems similar to the ones above.

Comment on lines 420 to 436
if (next[i] == null) {
  newIdsWhoDis[i] = 0;
} else if (next[i] instanceof String) {
  newIdsWhoDis[i] = idLookup.lookupString((String) next[i]);
} else if (next[i] instanceof Long) {
  newIdsWhoDis[i] = idLookup.lookupLong((Long) next[i]);
} else if (next[i] instanceof Double) {
  newIdsWhoDis[i] = idLookup.lookupDouble((Double) next[i]);
} else {
  newIdsWhoDis[i] = -1;
}
Preconditions.checkArgument(
    newIdsWhoDis[i] >= 0,
    "unknown global id [%s] for value [%s]",
    newIdsWhoDis[i],
    next[i]
);
Contributor:

Given that the global dictionaries, once merged, will be in type-sorted order, do we really need to convert back into the actual values instead of just converting the dictionary id?

Member Author:

So, we deal with them like this here so we can look up the new global id from the newly merged lower scalar value dictionaries. Otherwise we would need the mappings of old ids to new ids, which we don't currently have anywhere, and that is a lot more complicated since it is per segment. This way we just look up the old values, and after the lower dictionaries are merged, look up the array elements for the newly sorted values.

Member Author:

I think, since this stuff is all working, I'd like to save further changes/optimizations for a follow-up.

Preconditions.checkNotNull(inputFormat, "inputFormat");
Preconditions.checkNotNull(inputSourceTmpDir, "inputSourceTmpDir");

TransformSpec tranformer = transformSpec != null ? transformSpec : TransformSpec.NONE;
Contributor:

spell-check: transformer

.setIndexSchema(schema)
.setMaxRowCount(maxRows)
.build();
TransformSpec tranformer = transformSpec != null ? transformSpec : TransformSpec.NONE;
Contributor:

spell-check

InputSourceReader transformingReader = tranformer.decorate(reader);
try (CloseableIterator<InputRow> rowIterator = transformingReader.read()) {
  while (rowIterator.hasNext()) {
    incrementalIndex.add(rowIterator.next());
Contributor:

When maxRows is hit, are we expecting an exception? Generally speaking, setting maxRows in tests is done to force running queries against multiple segments, so I expected to see a check of numRows and incremental persists in anything that takes maxRows.

Member Author:

Hmm, good question. This looks wired up to the maxRowCount, but callers are not checking canAppendRow or anything like that, and buildIncrementalIndex can only return a single IncrementalIndex, so I think all it can do is explode. There is also intermediatePersistSize, which can be used to force a bunch of incremental indexes to be written when mergeIndexes is called, to make sure that segment merging happens.

Nothing much seems to be explicitly setting either of these things, so maybe they could be removed or reworked in a follow-up.
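For reference, a rough sketch of the multi-segment pattern the reviewer describes, under the assumption that a persistedIndexes list and a buildNewIndex helper exist (both hypothetical here):

while (rowIterator.hasNext()) {
  if (!incrementalIndex.canAppendRow()) {
    persistedIndexes.add(incrementalIndex);            // hand off the full index
    incrementalIndex = buildNewIndex(schema, maxRows); // hypothetical helper: start a fresh one
  }
  incrementalIndex.add(rowIterator.next());
}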

@@ -95,7 +95,7 @@ public DimFilter toDruidFilter(
   final DruidExpression leftExpr = druidExpressions.get(0);
   final DruidExpression rightExpr = druidExpressions.get(1);

-  if (leftExpr.isSimpleExtraction()) {
+  if (leftExpr.isSimpleExtraction() && !(leftExpr.getDruidType() != null && leftExpr.getDruidType().isArray())) {
Contributor:

What if it's an array_contains() over just a normal single-valued String column? Shouldn't that also match the filter, pretending that each row contains an array of size 1?

Member Author:

The type shouldn't show up as an array here if it isn't an array column.

Comment on lines 708 to 723
public void testQuery(
    final String sql,
    final Map<String, Object> queryContext,
    final List<Query<?>> expectedQueries,
    final List<Object[]> expectedResults,
    final RowSignature expectedResultSignature
)
{
  testBuilder()
      .sql(sql)
      .queryContext(queryContext)
      .expectedQueries(expectedQueries)
      .expectedResults(expectedResults)
      .expectedSignature(expectedResultSignature)
      .run();
}
Contributor:

Part of me wonders if, instead of adding more functions with extra arguments to their signatures, we should switch the call-sites to using the builder directly?

Member Author:

Removed in favor of just calling testBuilder. This should probably just be done everywhere, but it's kind of tedious to switch.

* FrontCodedIntArrayIndexed is now like v1 of FrontCodedIndexed with totally incremental buckets
* add tests for unnest and array columns
* fix unnest column value selector cursor handling of null and empty arrays
* refactor a bunch of stuff per review
clintropolis (Member Author):

I've fixed the behavior of UNNEST with regard to how it handles null and [] values for ARRAY typed inputs. Per experiments with postgres, null and empty arrays [] are supposed to output nothing when fed into UNNEST:

[screenshots: postgres experiments showing that UNNEST of null and empty arrays emits no rows]

But currently we are treating them as [null], which is unfortunately probably what we have to keep doing for multi-value STRING types, because we have trouble distinguishing null, [], and [null] there; with actual arrays there is no need to do this. So I have fixed UnnestColumnValueSelectorCursor to consume null and [] values, fixing the behavior, and added some tests.
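A minimal sketch of the fixed cursor behavior, with assumed names (baseCursor and readArrayFromSelector are illustrative): null and empty arrays are consumed without emitting any unnested rows, matching the postgres behavior above.

private void advancePastNullAndEmptyArrays()
{
  while (!baseCursor.isDone()
      && (unnestListForCurrentRow == null || unnestListForCurrentRow.isEmpty())) {
    baseCursor.advance();
    unnestListForCurrentRow = readArrayFromSelector(); // hypothetical helper
  }
}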

  }
} else {
-  index++;
+  if (++index >= unnestListForCurrentRow.size()) {

Check failure (Code scanning / CodeQL): User-controlled data in arithmetic expression

This arithmetic expression depends on a user-provided value, potentially causing an overflow.
clintropolis removed the WIP label Mar 23, 2023
// empty
NonnullPair<ExpressionType, Object[]> coerced = coerceListToArray(theList, false);
if (coerced == null) {
  return bestEffortOf(null);
Contributor:

Calling bestEffortOf(null) looks like it's going to go through about 15 if statements, failing them all, before just returning new StringExprEval(null). Why not just return the right thing here, given that we already know what it should be, and avoid the potential branches?

Member Author:

I think it would probably be best to have null handled first in bestEffortOf, so that if we ever decide to represent null as something more sensible (like introducing a null type) we won't be making a 'string' here, not to mention the savings of not running through a bunch of checks for a null value.

Contributor:

Sure. I actually expected you to create a method like makeNullNode() or something on ExprEval instead of newing it up directly here; just moving around the ordering of the statements is 🤷. Though checking for null first is probably better than last.
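A sketch combining both suggestions; makeNullNode is the hypothetical factory named above, not an existing ExprEval method:

public static ExprEval<?> bestEffortOf(@Nullable Object val)
{
  if (val == null) {
    return makeNullNode(); // today this would be new StringExprEval(null)
  }
  // ... the remaining type checks run only for non-null values
  return new StringExprEval(String.valueOf(val));
}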

Comment on lines 500 to 505
Preconditions.checkNotNull(
    columnType.getElementType(),
    "Array type [%s] for value [%s] missing element type, how did this possibly happen?",
    eval.type(),
    eval.valueOrDefault()
);
Contributor:

Nit: I think this would be better as an if statement. If this ever gets thrown, someone is going to want to attach a debugger to the point it is thrown from, and they will need to convert it to an if statement to do that without setting conditions on their debug breakpoint.

Member Author:

This shouldn't ever happen, but I guess I can wrap it in an if.

imply-cheddar (Contributor), Mar 27, 2023:

Just for clarity: I was suggesting that you use an if statement instead of Preconditions.

  sizeEstimate = globalDimensionDictionary.addStringArray(stringArray);
  return new StructuredDataProcessor.ProcessedValue<>(stringArray, sizeEstimate);
default:
  throw new IAE("Unhandled type: %s", columnType);
Contributor:

Just double checking, is this where we expect an array-of-arrays or an array-of-objects to end up?

Member Author:

We don't really expect this to happen, because the StructuredDataProcessor should have done its job correctly and only defined fields for primitives and arrays of primitives; this is more of a sanity check.

Contributor:

Ah, so array-of-arrays should be handled in StructuredDataProcessor instead?

Even after this PR, the handling of array-of-arrays and array-of-objects is still a bit muddy, right? (It's falling back to the "an array is just an object with numbers for field names" behavior?)

throw new IAE("Unhandled type: %s", columnType);
}
case STRING:
default:
Contributor:

What's an example of when we hit the default case? I'm legitimately asking, because I don't know the answer, and I find myself wondering whether believing it's a String is really the right thing to do versus generating a parsing error or something of that nature.

Member Author:

I don't think we should hit the default case; I'm not entirely sure why I wrote it like this originally.

cardinality,
System.currentTimeMillis() - dimStartTime
);
catch (Throwable ioe) {
Contributor:

Catching Throwable is dangerous; why cast such a wide net?

Member Author:

Hmm, I don't remember why I added this. I think I was debugging something and needed an easy place to catch and tell me what was failing for which column, and catching here seemed the best way to see what was messed up. It could probably be narrowed down to IOException and just rethrown.


Comment on lines 153 to 154
// due to vbyte encoding, the null value is not actually stored in the bucket (no negative values), so we adjust
// the index
Contributor:

If I understood correctly, can I suggest updating the comment to:

// For arrays, there is a conundrum of how to encode null differently from the empty array. In this code,
// we take the following approach:
//   1) the 0 entry in our dictionary is assumed to indicate a completely null entry
//   2) the 1 entry in our dictionary is assumed to be the empty array
//   3) instead of storing the 0 value in this dictionary, we employ an index-shifting technique, where the
//      dictionary never stores null and starts with the empty array (index 1), shifted by 1 such that what
//      is persisted on disk is actually 0 instead of 1.
// adjustIndex represents whether the null existed and thus whether we actually need to adjust the value
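A tiny illustration of the shift the comment describes (method names assumed); adjustIndex is true when the dictionary contains the null entry, which is never written into the bucket:

static int toStoredIndex(int dictionaryId, boolean adjustIndex)
{
  return dictionaryId - (adjustIndex ? 1 : 0); // persisting: shift ids down past null
}

static int toDictionaryId(int storedIndex, boolean adjustIndex)
{
  return storedIndex + (adjustIndex ? 1 : 0); // reading: shift back so id 0 can mean null
}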

  return ProcessedValue.NULL_LITERAL;
}
catch (IOException e) {
  throw new RE(e, "Failed to write field [%s] value [%s]", fieldPath, array);
Contributor:

This exception leaks the content of the data if it ever gets built. I think the best we can do here is mention which field we were trying to write.

Member Author:

Yeah, that's fair; I can change it. I was thinking in terms of helping myself debug what wasn't handled correctly, but it is leaky.

Comment on lines +48 to +52
private static final byte STRING_ARRAY_MASK = 1 << 4;

private static final byte LONG_ARRAY_MASK = 1 << 5;

private static final byte DOUBLE_ARRAY_MASK = 1 << 6;
Contributor:

I don't know if I'm being stingy, but couldn't we have an array mask that is applied, and then check the type mask after that? That would mean we could add array support and only consume 1 more bit instead of 3.

Member Author:

That would work if we only ever set 1 type, or if any row being an array meant all rows are arrays, but neither is currently true, so we sort of need to know which type of array, at least the way things currently work.

Contributor:

What about composite arrays? Like if the array has a String and a double? Or a String and an Object?

I had interpreted these as "this type exists in the column", not as "every row is this type". In which case, if there is an array, then it is true that the array type exists in the column, no? And then if the only scalar type is a long, you would know that the arrays are all longs, right?
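An illustration of the "this type exists in the column" reading, using the array masks from the snippet above; the scalar STRING_MASK and LONG_MASK constants are assumed here:

byte flags = 0;
flags |= STRING_MASK;      // column contains scalar string rows (assumed constant)
flags |= LONG_ARRAY_MASK;  // column also contains ARRAY<LONG> rows
// per-array-type bits keep mixed columns distinguishable:
boolean hasLongArrays = (flags & LONG_ARRAY_MASK) != 0;     // true
boolean hasStringArrays = (flags & STRING_ARRAY_MASK) != 0; // false
// a single shared "array" bit could not tell ARRAY<LONG> from ARRAY<STRING> apart
// in a column that also contains both long and string scalar rows.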

@@ -190,6 +190,7 @@ public static void setupNullValues()

 public static final Map<String, Object> QUERY_CONTEXT_NO_STRINGIFY_ARRAY =
     DEFAULT_QUERY_CONTEXT_BUILDER.put(QueryContexts.CTX_SQL_STRINGIFY_ARRAYS, false)
+        .put(PlannerContext.CTX_ENABLE_UNNEST, true)
Contributor:

Not sure how much I care about this, but I expected this to be put on the Unnest/Array tests instead of across all of the tests in this base class.

Member Author:

QUERY_CONTEXT_NO_STRINGIFY_ARRAY is AFAIK only used in CalciteArrayQueryTest and CalciteNestedDataQueryTest, so I think this is fine.

clintropolis merged commit d5b1b5b into apache:master Mar 27, 2023
clintropolis deleted the nested-array-columns branch March 27, 2023 19:42
clintropolis added a commit to clintropolis/druid that referenced this pull request Apr 2, 2023
changes:
* introduce ColumnFormat to separate physical storage format from logical type. ColumnFormat is now used instead of ColumnCapabilities to get column handlers for segment creation
* introduce new 'standard type' indexer, merger, and family of serializers, which is the next logical iteration of the nested column stuff. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json>.
* revert NestedDataColumnIndexer, NestedDataColumnMerger, NestedDataColumnSerializer to their version pre apache#13803 behavior (v4) for backwards compatibility
* fix a bug in RoaringBitmapSerdeFactory if anything actually ever wrote out an empty bitmap using toBytes and then later tried to read it (the nerve!)
clintropolis mentioned this pull request Apr 2, 2023
clintropolis added a commit that referenced this pull request Apr 5, 2023
changes:
* introduce ColumnFormat to separate physical storage format from logical type. ColumnFormat is now used instead of ColumnCapabilities to get column handlers for segment creation
* introduce new 'auto' type indexer and merger which produces a new common nested format of columns, which is the next logical iteration of the nested column stuff. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either STRING, ARRAY<STRING>, LONG, ARRAY<LONG>, DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json>.
* revert NestedDataColumnIndexer, NestedDataColumnMerger, NestedDataColumnSerializer to their version pre #13803 behavior (v4) for backwards compatibility
* fix a bug in RoaringBitmapSerdeFactory if anything actually ever wrote out an empty bitmap using toBytes and then later tried to read it (the nerve!)
clintropolis added this to the 26.0 milestone Apr 10, 2023