Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Further improvements to parquet page materializaition #5670

Merged
merged 7 commits into from
Jun 26, 2024

Conversation

malhotrashivam
Copy link
Contributor

Extension of #5582 and #5638

This PR fixes a bug related to codecs which was introduced in #5638 and does further code reorganization to reduce chances of such bugs in the future by moving the decisions for page materializer and toPage in the same place rather than the two decisions being made independently.

@malhotrashivam malhotrashivam added bug Something isn't working parquet Related to the Parquet integration NoDocumentationNeeded NoReleaseNotesNeeded No release notes are needed. labels Jun 26, 2024
@malhotrashivam malhotrashivam added this to the June 2024 milestone Jun 26, 2024
@malhotrashivam malhotrashivam self-assigned this Jun 26, 2024
Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't review your earlier PR, but I see other usages of org.apache.parquet.io.api.Binary that could potentially be improved without needing to copy the bytes.

Comment on lines +116 to +119
/**
* @return the factory to create the materializers for this column.
*/
PageMaterializerFactory getPageMaterializerFactory();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not immediately obvious to me that this belongs as part of ToPage. No doubt it's convenient (and, we might argue is in the same style as the other "helpers" ToPage#getChunkType, ToPage#getNativeType). Generally, I'm trying to minimize the expansion of "public" interfaces. That said, this is already an implementation exposing interface, and I don't think this additional exposure is egregious in any way; just that I wish we were better at having a separation between public interfaces and implementation details. Oh well. (I don't think you need to necessarily make any changes here, just venting a bit.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can safely consider ToPage to be internal, rather than part of our API. It's literally just the layer that takes Parquet data and formats it for the engine. The materializer is intrinsically tied to the ToPage implementation, in the sense that the two must be mutually compatible. I strongly encouraged Shivam to make this refactoring to avoid future bugs like the one that triggered this work, and to allow for further optimization in the same manner as his last several PRs, by removing unnecessary layering and data copying.

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial code review.

Comment on lines +116 to +119
/**
* @return the factory to create the materializers for this column.
*/
PageMaterializerFactory getPageMaterializerFactory();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can safely consider ToPage to be internal, rather than part of our API. It's literally just the layer that takes Parquet data and formats it for the engine. The materializer is intrinsically tied to the ToPage implementation, in the sense that the two must be mutually compatible. I strongly encouraged Shivam to make this refactoring to avoid future bugs like the one that triggered this work, and to allow for further optimization in the same manner as his last several PRs, by removing unnecessary layering and data copying.

@@ -26,15 +28,14 @@ public static <ATTR extends Any> ToPage<ATTR, String[]> create(
new ChunkDictionary<>(
(dictionary, key) -> dictionary.decodeToBinary(key).toStringUsingUTF8(),
dictionarySupplier),
INSTANCE::convertResult);
INSTANCE::convertResult,
INSTANCE.getPageMaterializerFactory());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's possible to defer this, but it's not obvious that it matters even a little bit.

/**
* @return the factory to create the materializers for this column.
*/
PageMaterializerFactory getPageMaterializerFactory();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be cleaner if instead of having a PageMaterializerFactory, ToPages implemented PageMaterializerFactory themselves? That seems more aligned with the intent, here.

@@ -20,16 +21,15 @@

public class ToBigDecimalPage<ATTR extends Any> implements ToPage<ATTR, BigDecimal[]> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if we should delete ToBigDecimalPage and ToBigIntegerPage in order to just use ToObjectPage.

@rcaudy rcaudy dismissed their stale review June 26, 2024 17:35

Accidentally approved.

rcaudy
rcaudy previously approved these changes Jun 26, 2024
Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@malhotrashivam malhotrashivam merged commit f165357 into deephaven:main Jun 26, 2024
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jun 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working NoDocumentationNeeded NoReleaseNotesNeeded No release notes are needed. parquet Related to the Parquet integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants