Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support endpoints with different URLs than initial connect #42

Conversation

avantgardnerio
Copy link

@avantgardnerio avantgardnerio commented Jul 25, 2022

Clearly this is not ready for merge, but I tested it and it does allow me to use the JDBC driver through DataGrip to Apache Arrow Ballista:

image

I thought I would post this to get the discussion started about the ways in which a real fix would differ from what I have here.

@jduo
Copy link
Collaborator

jduo commented Jul 25, 2022

Thanks for looking into this @avantgardnerio .
I'd imagine we would want to try to create a pool of FlightClients keyed by the endpoint. Another goal is to parallelize retrieval on different endpoints.
We also should handle the case where the list of endpoints returned is empty (in which case we are supposed to call getStream() on the same node that getFlightInfo() was called on)

try {
ArrayList<FlightStream> streams = new ArrayList<>();
for (FlightEndpoint ep : flightInfo.getEndpoints()) {
for (Location loc : ep.getLocations()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not supposed to iterate over locations. For a given endpoint/ticket, multiple locations can be returned for redundancy and load balancing reasons and the caller can arbitrarily choose one.

It's also legal for no location to be returned, in which case the caller should just use the same node that getFlightInfo was called on. Note that if at least one location is returned, the caller cannot assume that getStream() can be called on the same node as getFlightInfo().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this info! I'll put together a more serious PR and undraft it, taking this into account.

@jduo
Copy link
Collaborator

jduo commented Jul 26, 2022

I should point out that if you're interested in ODBC as well, the ODBC driver has a similar issue:
https://github.com/dremio/flightsql-odbc/blob/ca2c6c71e037a0e3b71bf43194ac92a803696417/flight_sql/flight_sql_stream_chunk_iterator.cc#L24

@avantgardnerio
Copy link
Author

pool of FlightClients

How does the latest code look?

@avantgardnerio avantgardnerio marked this pull request as ready for review July 26, 2022 17:11
@avantgardnerio avantgardnerio requested a review from jduo July 26, 2022 17:11
for (FlightEndpoint ep : flightInfo.getEndpoints()) {
URI uri = ep.getLocations().size() > 0 ? ep.getLocations().get(0).getUri() : null;
FlightSqlClient sqlClient = this.connectionManager.getConnection(uri);
streams.add(sqlClient.getStream(ep.getTicket(), getOptions()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not really apparent how the sqlClient gets returned to the pool. It should get returned after the rest of the driver is done with the stream.

Copy link
Author

@avantgardnerio avantgardnerio Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without trying to excuse the behavior, I was following the existing pattern: the builder created a FlightSqlClient and passed ownership of it to ArrowFlightSqlClientHandler, which closes it when it is closed itself.

This PR changes it so that: the builder creates a ConnectionManager instead of a FlightSqlClient, but all else remains the same.

I see that the resulting streams are AutoClosable. I think we could wrap them in an AOP proxy and return the client to the pool when the stream is closed. To be honest, calling what is there a "pool" is pretty strong. Presently there's no notion of outstanding leases, return, etc. If this behavior is desired, I can look into an apache-commons Pool or similar. (It's probably more of a factory / cache?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started to make a FlightStreamWrapper that could track the number of open FlightStreams for a given client, but I'm at a bit of a loss as to the intended life-cycle:

  1. Do we want to open FlightClients to each of these endpoints and keep them open?
  2. Should do we want to close the FlightClient to an endpoint as soon as it's stream is closed?
  3. Do we want to put in some kind of timeout where we close it after some amount of inactivity?

Just let me know what's desired and I can make it happen.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late response. I've been thinking about this and I think the right thing to do is to actually change the protocol such that when a Flight Server responds to getFlightInfo, it can optionally add an optimization hint telling the caller that endpoints are long-lived and potentially re-usable. Perhaps indicating as well if contacting an endpoint can just re-use information from the originating request (such as cookies or authentication tokens).

For this PR, let's pessimistically assume that we should not reuse FlightClients and close them with the stream. And discuss how/if the server should provide lifetime information about endpoints in the dev mailing list.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the PR shortly. I think the easiest way to do this is:

  1. Update ConnectionManager.acquireConnection() to create a new FlightSqlClient each time and return ownership
  2. Create a FlightStreamWrapper class that intercepts the close() method
  3. Update ArrowFlightSqlClientHandler.getStreams() to wrap the returned streams in FlightStreamWrappers
  4. Have FlightStreamWrapper call close on the FlightSqlClient when close() is called on the stream

Sound good?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this sounds like a good plan.

@jayhomn-bitquill
Copy link

@avantgardnerio would it be possible to make a PR to this branch here instead https://github.com/Bit-Quill/arrow/tree/BQ-flight-jdbc-driver thank you!

@avantgardnerio
Copy link
Author

make a PR to this branch

Yep, I have one in the works.

@avantgardnerio
Copy link
Author

Wait @jayhomn-bitquill you caught me off guard.. I was expecting https://github.com/apache/arrow/tree/flight-sql-jdbc ?

@jayhomn-bitquill
Copy link

Wait @jayhomn-bitquill you caught me off guard.. I was expecting https://github.com/apache/arrow/tree/flight-sql-jdbc ?

Ahh okay that is fine then!

@avantgardnerio
Copy link
Author

This has been moved to apache#13900

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants