Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41728:[Java][Vector] Add ExtensionWriter and fix StructVector construction with ExtensionTypes #41731

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

FiV0
Copy link
Contributor

@FiV0 FiV0 commented May 20, 2024

Rationale for this change

For me personally I mainly want to fix apache/arrow-java#87, but this change also includes some larger changes to the writer interface. This will likely include some design work and some back and forth as I can't make these decisions.

What changes are included in this PR?

A commit that extends the writer interfaces with an ExtensionWriter as well as dealing with the MinorType.EXTENSIONTYPE in various places. The ExtensionWriter was left rather skeleton like as this needs some design decisions to be made before implementing the concrete details. I will leave some comments throughout the code of where I think stuff needs to be decided.

This next bit are my 2cents (hence taken with a grain of salt) of how I would do things as I have the impression that one will eventually hit a wall with the current writer interfaces. For example the PromotableWriter currently makes the assumption that the parent vector is a struct or a list vector, making it difficult to extend this to composite extension types.
I would just create one single writer interface (VectorWriter or whatever you want to call) implemented by all concrete writers. If a writer doesn't support an operation just keep it abstract or throw an UOE.

Are these changes tested?

There are two tests that test the issue of apache/arrow-java#87. I have not tested (because of the lack of a concrete implementation) the ExtensionWriter interface.

Are there any user-facing changes?

Yes. ExtensionWriter will be exposed for the first time. This is only an additive change and shouldn't break any existing API.

@FiV0 FiV0 requested a review from lidavidm as a code owner May 20, 2024 11:45
Copy link

⚠️ GitHub issue apache/arrow-java#87 has been automatically assigned in GitHub to PR creator.

import org.apache.arrow.vector.ExtensionTypeVector;
import org.apache.arrow.vector.types.pojo.Field;

public class UnionExtensionWriter extends AbstractFieldWriter {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is currently more skeleton like as the concrete details of the ExtensionWriter interface will need to be nailed down.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 20, 2024
@FiV0 FiV0 force-pushed the structs-and-extensions branch from 934466e to b4bda84 Compare May 20, 2024 11:48
@@ -61,20 +61,25 @@ public interface StructWriter extends BaseWriter {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where some design decisions have to be made of how to actually expose the ExtensionWriter to the user and how the user should actually extend some ExtensionWriter.

As mentioned above I would make this one interface BaswWriter exposing all the possible writers and leaving them abstract in the concrete implementations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds reasonable to me.

@@ -342,6 +347,34 @@ public void set(int index, UUID uuid) {
bb.putLong(uuid.getLeastSignificantBits());
getUnderlyingVector().set(index, bb.array());
}

@Override
public TransferPair makeTransferPair(ValueVector to) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TransferPair is likely a very common pattern for extension types, so one could create a concrete ExtensionWriter which actually implements this and something like UuidWriter could inherit from.

We do something similar for our writers: https://github.com/xtdb/xtdb/blob/e9a15ee12244367c1d3d1a75c00e4f5c0e10008f/core/src/main/kotlin/xtdb/vector/extensions/XtExtensionVector.kt#L30-L40

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be fine, let's get this one working and further evaluate.

@lidavidm
Copy link
Member

CC @vibhatha @danepitkin

@FiV0 FiV0 force-pushed the structs-and-extensions branch from b4bda84 to f4f0e4d Compare May 21, 2024 10:06
@vibhatha
Copy link
Collaborator

@FiV0 I will take a look at this PR later today or earlier tomorrow.

@FiV0 FiV0 force-pushed the structs-and-extensions branch from f4f0e4d to 7cb7fd5 Compare May 21, 2024 14:25
@@ -85,21 +88,21 @@ public void roundtripUuid() throws IOException {
final ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {
reader.loadNextBatch();
final VectorSchemaRoot readerRoot = reader.getVectorSchemaRoot();
Assert.assertEquals(root.getSchema(), readerRoot.getSchema());
assertEquals(root.getSchema(), readerRoot.getSchema());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought: since we are updating test cases, shall we use JUnit5? juputer.api.?

@@ -17,12 +17,17 @@

package org.apache.arrow.vector;

import static org.junit.Assert.*;
import static org.junit.Assert.assertEquals;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use JUnit5?

final StructVector toVector = (StructVector) tp.getTo();
assertEquals(s1.getField(), toVector.getField());
assertEquals(s1.getField().getChildren().get(0), toVector.getField().getChildren().get(0));
// also fails but probably another issue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also debugged the code. It seems the copy didn't work properly. I guess you're still not finished with this work, right?

Copy link
Collaborator

@vibhatha vibhatha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall idea looks fine to me, but need to run another pass.

And the added test case seems to need some work as we don't get the expected results from the transferPair AFAIU.

@FiV0
Copy link
Contributor Author

FiV0 commented May 23, 2024

I think the main part of moving this forward is the design work on the writer or more specifically ExtensionWriter interface. I think it would be good to have some rough idea of how this will look in the future for the user, as otherwise we will need to change this again when we realize that the current API doesn't work. For example.

  • How do you expect people to write to a UuidVector?
  • How do you expect people to write to a composite extension vector ( for example a SetVector as an extension to a ListVector)?
  • How do you expect people to write to a Union with an extension leg? How should a union with two extensions legs of different type be dealt with? etc..

I think all of this should have a clear answer before moving forward with this proposal. Maybe have a look at what other language implementations do in this case.

@lidavidm
Copy link
Member

They don't really have answers for this.

Does it work to just ask users to build the underlying type directly for now?

@FiV0
Copy link
Contributor Author

FiV0 commented May 24, 2024

@lidavidm Ok, I will then create a Test for ExtensionWriter (when I find the time) with the cases outlined above, so that we have at least some idea of what it will look like.

@lidavidm
Copy link
Member

Thanks @FiV0!

Copy link

⚠️ GitHub issue #41728 has no components, please add labels for components.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Java] StructVector throws with ExtensionType
3 participants