Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gRPC Client Implementation #8423

Merged
merged 38 commits into from
May 30, 2024
Merged

gRPC Client Implementation #8423

merged 38 commits into from
May 30, 2024

Conversation

spericas
Copy link
Member

@spericas spericas commented Feb 26, 2024

Description

New gRPC Client API and implementation for SE. Issue #7060.

@spericas spericas self-assigned this Feb 26, 2024
@spericas spericas marked this pull request as draft February 26, 2024 18:58
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Feb 26, 2024
@spericas spericas changed the title gRPC Client Implementation [draft} gRPC Client Implementation Feb 26, 2024
@spericas spericas changed the title [draft} gRPC Client Implementation [draft] gRPC Client Implementation Feb 26, 2024
Copy link
Member

@tomas-langer tomas-langer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

There are some error handling and logging issues (this being a draft), but the core of the work seems the right thing to do, thanks!

all/pom.xml Show resolved Hide resolved
grpc/core/src/main/java/module-info.java Outdated Show resolved Hide resolved
grpc/core/src/main/java/module-info.java Outdated Show resolved Hide resolved
webclient/grpc/src/main/java/module-info.java Outdated Show resolved Hide resolved
webclient/grpc/src/main/java/module-info.java Outdated Show resolved Hide resolved
@thegridman
Copy link
Collaborator

thegridman commented Apr 22, 2024

After plugging the client into Coherence I am seeing a lot of severe level log messages due to timeouts. We use a bidirectional channel for Coherence events. The client makes the bi-di request which creates the bi-di
channel, then the client can send messages up that channel to the server and the server can periodically send event messages back down the channel. It appears there is a 100ms timeout in the code that constantly logs the error. But, even though the error is logged a lot, everything appears to work, as the tests for events all pass.

The code is in GrpcClientCall line 152

                    try {
                        frameData = clientStream().readOne(WAIT_TIME_MILLIS_DURATION);
                    } catch (StreamTimeoutException e) {
                        socket().log(LOGGER, ERROR, "[Reading thread] read timeout");
                        continue;
                    }

@spericas
Copy link
Member Author

After plugging the client into Coherence I am seeing a lot of severe level log messages due to timeouts. We use a bidirectional channel for Coherence events. The client makes the bi-di request which creates the bi-di channel, then the client can send messages up that channel to the server and the server can periodically send event messages back down the channel. It appears there is a 100ms timeout in the code that constantly logs the error. But, even though the error is logged a lot, everything appears to work, as the tests for events all pass.

The code is in GrpcClientCall line 152

                    try {
                        frameData = clientStream().readOne(WAIT_TIME_MILLIS_DURATION);
                    } catch (StreamTimeoutException e) {
                        socket().log(LOGGER, ERROR, "[Reading thread] read timeout");
                        continue;
                    }

That timeout was too aggressive and has now been set to 2 seconds. We shall make this configurable eventually.

@thegridman
Copy link
Collaborator

thegridman commented Apr 24, 2024

That timeout was too aggressive and has now been set to 2 seconds. We shall make this configurable eventually.

Will there be an option to configure it to never time out? We use a bi-directional channel to receive events related to a cache lifecycle on the client. There may be zero events from the server for the entire lifetime of the client application.

@thegridman
Copy link
Collaborator

After running the latest build through our gRPC client tests we have the following issues.

  • Still seeing constant timeout messages (which you mentioned above).
  • Large Protobuf messages do not work, it appears they cannot be read on the server.
  • The fix for streams that do not send a response and just close the observer seems to work now. But what is not working is where the server throws an exception, this seems to just get lost and the call never ends on the client. To test this I added a method that specifically throws an exception and this fails.

For the latest issues I updated the GrpcStubTest to add some reproducers. I have attached a patch file I exported from IntelliJ that you should be able to apply and reproduce the issues. All three additional tests will just hang as the client never completes the request.
gRPC_Tests.patch

@spericas spericas changed the title [draft] gRPC Client Implementation gRPC Client Implementation May 7, 2024
@spericas spericas marked this pull request as ready for review May 7, 2024 17:36
@thegridman
Copy link
Collaborator

thegridman commented May 17, 2024

I have been running our gRPC client tests using a build from this PR. The last time I ran them a few weeks ago they were ok. Now I have a failure in our fail-over tests.

In this scenario the client has a bidirectional stream open to the server when the server is killed. When running the tests with Netty the client's StreamObserver onError method is called with a StatusRuntimeException with a Status of Unavailable. When using Helidon this does not happen, it appears that Helidon tries to reconnect to the server the next time a messages is sent up the stream by the client. This does not work, in our case because the server is gone and we need to fail over to a new server on another port. Helidon does not allow us to configure multiple server addresses, so this constant reconnect attempt fails until the client eventually gets a timeout exception.

We need the clients StreamObserver onError to be called, because basically that stream is dead. In our case we rely on state on the server that was initialised by previous messages, we cannot have a bidirectional stream transparently fail over behind our backs.

In our client after the stream receives an error then the our client code can deal with reconnection. In our case we look up the new set of server endpoints and have to create a new Helidon client as the previous client only knows the single address we gave it previously. Obviously if our server was behind a load balancer that would be OK, but in a lot of cases there is no LB. With ManagedChannel in grpc-java there are various ways to pass multiple addresses or to do address lookup whenever the channel is about to connect, but Helidon does not have this. We have been able to work around this in our Helidon client code but it relies on the bidirectional (and presumably any other "stream" type calls) to be properly killed if the server disconnects.

@thegridman
Copy link
Collaborator

thegridman commented May 17, 2024

Further to the above, I copied one of the bidirectional tests in GrpcTest and modified it to stop the WebServer between two calls to send a message. At no time does my StreamObserver receive an error, even when sending more messages, they presumably fail and I see a sack trace in the log. I could fix this by changing GrpcClientCall line 128 that currently looks like this

            } catch (Throwable e) {
                socket().log(LOGGER, ERROR, e.getMessage(), e);
            }
            socket().log(LOGGER, DEBUG, "[Writing thread] exiting");

so that it calls the response listener like this:

            } catch (Throwable e) {
                socket().log(LOGGER, ERROR, e.getMessage(), e);
                responseListener().onClose(Status.UNKNOWN.withDescription(e.getMessage()).withCause(e), EMPTY_METADATA);
            }
            socket().log(LOGGER, DEBUG, "[Writing thread] exiting");

At least now my client StreamObserver onError is called when the send fails. But I still have not worked out how to call it when the socket closes.

@spericas spericas force-pushed the grpc-update branch 3 times, most recently from d9a51ba to 3e19765 Compare May 20, 2024 21:14
@thegridman
Copy link
Collaborator

After the latest changes I have run our 5,500 gRPC client tests using the Helidon client from this PR and they all pass (including the fail-over tests).

Signed-off-by: Santiago Pericasgeertsen <santiago.pericasgeertsen@oracle.com>
Signed-off-by: Santiago Pericasgeertsen <santiago.pericasgeertsen@oracle.com>
Signed-off-by: Santiago Pericasgeertsen <santiago.pericasgeertsen@oracle.com>
Signed-off-by: Santiago Pericasgeertsen <santiago.pericasgeertsen@oracle.com>
Signed-off-by: Santiago Pericasgeertsen <santiago.pericasgeertsen@oracle.com>
Signed-off-by: Santiago Pericasgeertsen <santiago.pericasgeertsen@oracle.com>
- Cleans all module-info.java files
- Switches to System.Logger for logging
Signed-off-by: Santiago Pericasgeertsen <santiago.pericasgeertsen@oracle.com>
…tStream to allow transitions from HEADERS to END. New tests.
…xceptions thrown in gRPC methods and (2) larger payloads.
Signed-off-by: Santiago Pericas-Geertsen <santiago.pericasgeertsen@oracle.com>
Signed-off-by: Santiago Pericas-Geertsen <santiago.pericasgeertsen@oracle.com>
…the abort flag is not set, attempts to send a PING frame to verify the connection's health.
…er has died. The period to check for that is configurable and set to 5 seconds by default.
Signed-off-by: Santiago Pericas-Geertsen <santiago.pericasgeertsen@oracle.com>
…HTTP/2 logging instead.

Signed-off-by: Santiago Pericas-Geertsen <santiago.pericasgeertsen@oracle.com>
Signed-off-by: Santiago Pericas-Geertsen <santiago.pericasgeertsen@oracle.com>
@spericas spericas merged commit 62adcbd into helidon-io:main May 30, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants