
grpc-netty breaks with Netty 4.1.111.Final #11284

Closed
lhotari opened this issue Jun 13, 2024 · 17 comments · Fixed by #11294

@lhotari commented Jun 13, 2024

Added by ejona86: CORRUPTION. See #11284 (comment)


What version of gRPC-Java are you using?

1.60.2, 1.64.0

What is your environment?

❯ uname -a
Darwin Laris-MBP.lan 23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:17:33 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6031 arm64 arm Darwin
❯ java -version
openjdk version "17.0.11" 2024-04-16
OpenJDK Runtime Environment Temurin-17.0.11+9 (build 17.0.11+9)
OpenJDK 64-Bit Server VM Temurin-17.0.11+9 (build 17.0.11+9, mixed mode)

What did you expect to see?

I'd expect grpc-netty to work with Netty 4.1.111.Final . I understand that grpc-java currently supports Netty 4.1.100.Final, however many libraries depend on grpc-netty and cannot switch to use grpc-netty-shaded without relocating all classes to use the packages where grpc-netty-shaded contains shaded version of Netty. Examples of such popular libraries are io.etcd:jetcd-core and io.vertx:vertx-grpc.

What did you see instead?

There weren't any log messages or exceptions; communications just timed out. There was some debug logging about attempts. I'll share more details later.

Steps to reproduce the bug

I'll share more details later. I maintain the Apache Pulsar and Apache BookKeeper projects, and this problem showed up there when Netty was upgraded from 4.1.108.Final to 4.1.111.Final. We use an older gRPC version, but I also tested with 1.60.2 and 1.64.0 and the problem remained. Upgrading to 4.1.110.Final instead works fine with the jetcd-core test that fails on 4.1.111.Final.
This means that the Netty 4.1.111.Final release introduces a change that breaks grpc-netty.
These are the 4.1.111.Final release notes and changes:

@ejona86 (Member) commented Jun 13, 2024

You wrote a lot, but very little about what we actually need to be told. The tests in grpc-java run fine with 4.1.111. We could run the wider tests that we do when upgrading Netty, but right now this report is essentially useless to us.

Could you at least tell us what you were doing when it hung, whether the hang was reliable or random, and how long the hang lasts (1s? 100s?)? Since you control the Netty version, it'd also be helpful to know whether 4.1.110 had the issue.

As a hint, in the past when we've seen random hangs on upgrade, it generally meant you need a newer netty-tcnative or it was a netty-tcnative bug.

@ejona86 (Member) commented Jun 13, 2024

Note: I was testing on Linux, so maybe this is a Mac problem.

@lhotari (Author) commented Jun 13, 2024

You wrote a lot, but very little about what we actually need to be told. The tests in grpc-java run fine with 4.1.111. We could run the wider tests that we do when upgrading Netty, but right now this report is essentially useless to us.

Sure, it doesn't have relevant details. You can ignore it in the meantime. :)

I created the issue also to see if others are running into similar issues in the upgrade. I'll share more relevant details when I get back to debugging the issue.

@lhotari (Author) commented Jun 14, 2024

@ejona86 Here's the reproducer with more details: https://github.com/lhotari/pulsar/blob/lh-grpc-netty-4.1.111/README.md

There are 2 cases which fail:

The libraries used under the covers are jetcd 0.8.2, vertx-grpc 4.5.8, grpc-java 1.64.0 (versions aligned with grpc-bom) and Netty 4.1.111.Final (versions aligned with netty-bom, which also manages other libs such as netty-tcnative, at 2.0.65.Final).

I also tested on my Linux laptop (x86_64) and the problem reproduces in the same way as on macOS (arm64).

@XN137 commented Jun 14, 2024

another datapoint:

in our maven build we have this in the root pom:

<dependencyManagement>
(...)
     <dependency>
        <groupId>io.netty</groupId>
        <artifactId>netty-bom</artifactId>
        <version>4.1.109.Final</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
      <dependency>
        <groupId>io.grpc</groupId>
        <artifactId>grpc-bom</artifactId>
        <version>1.64.0</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
(...)
</dependencyManagement>

when upgrading to 4.1.111.Final, many tests (running on a Linux box) show errors like:

[2024-06-12T13:11:15.494Z] 13:11:15.361 [grpc-default-worker-ELG-3-22] WARN  io.grpc.netty.NettyServerStream - Exception processing message
[2024-06-12T13:11:15.494Z] io.grpc.StatusRuntimeException: INTERNAL: Encountered end-of-stream mid-frame
[2024-06-12T13:11:15.494Z] 	at io.grpc.Status.asRuntimeException(Status.java:525)
[2024-06-12T13:11:15.494Z] 	at io.grpc.internal.AbstractServerStream$TransportState.deframerClosed(AbstractServerStream.java:247)
[2024-06-12T13:11:15.494Z] 	at io.grpc.netty.NettyServerStream$TransportState.deframerClosed(NettyServerStream.java:139)
[2024-06-12T13:11:15.494Z] 	at io.grpc.internal.MessageDeframer.close(MessageDeframer.java:234)
[2024-06-12T13:11:15.494Z] 	at io.grpc.internal.MessageDeframer.deliver(MessageDeframer.java:301)
[2024-06-12T13:11:15.494Z] 	at io.grpc.internal.MessageDeframer.request(MessageDeframer.java:162)
[2024-06-12T13:11:15.494Z] 	at io.grpc.internal.AbstractStream$TransportState$1RequestRunnable.run(AbstractStream.java:264)
[2024-06-12T13:11:15.495Z] 	at io.grpc.netty.NettyServerStream$TransportState$1.run(NettyServerStream.java:174)
[2024-06-12T13:11:15.495Z] 	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
[2024-06-12T13:11:15.495Z] 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
[2024-06-12T13:11:15.495Z] 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
[2024-06-12T13:11:15.495Z] 	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:405)
[2024-06-12T13:11:15.495Z] 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
[2024-06-12T13:11:15.495Z] 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
[2024-06-12T13:11:15.495Z] 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[2024-06-12T13:11:15.495Z] 	at java.base/java.lang.Thread.run(Thread.java:829)

and the tests eventually time out...

upgrading to 4.1.110.Final does not run into these problems.

@ejona86 (Member) commented Jun 14, 2024

DO NOT USE NETTY 4.1.111 WITH GRPC. It appears there is corruption, and there is no guarantee the corruption will be detected.


@lhotari, thanks for the reproducer. Unfortunately I can't easily run it because it requires docker. I've got podman and don't have time to fight getting rootless docker or spinning up a new VM at the moment.

Based on the logs (thanks!), I'm worried this is corruption. 00000201040000009d is 00 00020104 0000009d which means a 0x20104 byte message. And protobuf wouldn't start a message with a bunch of zeros (compare to the previous response 00 0000001c 0a1a08b2, and the 0a1a part). It is even more severe in the second log with 00 00020104 0000001b88c000018300000000001b000000017. That's a lot of zeros, which could make sense for protobuf bytes, but it wouldn't be the very start of the message.

000002 01 04 0000001b 88c0 looks like the response headers HTTP/2 frame that came immediately before. LENGTH=2, HEADERS (0x1), flags=END_HEADERS, streamid=27, indexed header :status 200, indexed dynamic header at position 3 (64 - 61 static headers).
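To sanity-check byte analyses like the one above, a small standalone decoder helps. The sketch below (plain Java 17, independent of gRPC/Netty internals, names my own) decodes the gRPC length-prefixed message header (1-byte compressed flag + 4-byte big-endian length, per the gRPC-over-HTTP/2 spec) and the 9-byte HTTP/2 frame header (24-bit length, 8-bit type, 8-bit flags, 31-bit stream id, per RFC 9113 §4.1):

```java
import java.util.HexFormat;

// Standalone decoders (a sketch, not gRPC/Netty internals) for the two
// byte layouts analyzed above.
public class FrameDecode {
    // gRPC length-prefixed message header: compressed flag + 4-byte length.
    static String grpcPrefix(byte[] b) {
        int compressed = b[0] & 0xff;
        int length = ((b[1] & 0xff) << 24) | ((b[2] & 0xff) << 16)
                | ((b[3] & 0xff) << 8) | (b[4] & 0xff);
        return String.format("compressed=%d length=0x%x", compressed, length);
    }

    // HTTP/2 frame header: 24-bit length, type, flags, 31-bit stream id.
    static String http2Header(byte[] b) {
        int length = ((b[0] & 0xff) << 16) | ((b[1] & 0xff) << 8) | (b[2] & 0xff);
        int streamId = ((b[5] & 0x7f) << 24) | ((b[6] & 0xff) << 16)
                | ((b[7] & 0xff) << 8) | (b[8] & 0xff);
        return String.format("LENGTH=%d TYPE=0x%x FLAGS=0x%x STREAM=%d",
                length, b[3] & 0xff, b[4] & 0xff, streamId);
    }

    public static void main(String[] args) {
        HexFormat hex = HexFormat.of();
        // First 5 bytes of the corrupted stream: an implausible 0x20104-byte message.
        System.out.println(grpcPrefix(hex.parseHex("0000020104")));
        // The same bytes read as an HTTP/2 frame header: a HEADERS frame
        // (LENGTH=2, type 0x1, END_HEADERS, stream 27) that leaked mid-stream.
        System.out.println(http2Header(hex.parseHex("00000201040000001b")));
    }
}
```

Reading the same 9 bytes both ways is what makes the corruption visible: they are nonsense as a gRPC message prefix but decode cleanly as the preceding HEADERS frame.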

@ejona86 (Member) commented Jun 14, 2024

Given the recent changes to ByteBufs and there's now corruption, I'm strongly suspicious of gRPC's NettyAdaptiveCumulator being a contributing factor. Once I can reproduce this, the first thing to try is to disable the cumulator and see if the problem resolves.

@ejona86 (Member) commented Jun 14, 2024

I've posted a warning to grpc-io, updated the 1.64.0 release notes, and notified Netty folks.

@sergiitk (Member):

I suspect NettyAdaptiveCumulator got broken by netty/netty#14093.
Because of netty/netty#12844, we had a temporary workaround to correct r/w indexes, which never got resolved:

// TODO(https://github.com/netty/netty/issues/12844): remove when we use Netty with
// the issue fixed.
// In certain cases, removing the CompositeByteBuf component, and then adding it back
// isn't idempotent. An example is provided in https://github.com/netty/netty/issues/12844.
// This happens because the buffer returned by composite.component() has out-of-sync
// indexes. Under the hood the CompositeByteBuf returns a duplicate() of the underlying
// buffer, but doesn't set the indexes.
//
// To get the right indexes we use the fact that composite.internalComponent() returns
// the slice() into the readable portion of the underlying buffer.
// We use this implementation detail (internalComponent() returning a *SlicedByteBuf),
// and combine it with the fact that SlicedByteBuf duplicates have their indexes
// adjusted so they correspond to the readable portion of the slice.
//
// Hence composite.internalComponent().duplicate() returns a buffer with the
// indexes that should've been on the composite.component() in the first place.
// Until the issue is fixed, we manually adjust the indexes of the removed component.

Without this fix, we saw a rare, but persistent HTTP2 frame corruption in certain situations. Similar to this one.

We should consider how to make it work with pre-4.1.111 and post-4.1.111 Netty versions. The NettyAdaptiveCumulator itself was introduced in #9558, and released in v1.51.0. At the moment unshaded Netty 4.1.111.Final can't be used with gRPC >= v1.51.0.
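One way to support both pre- and post-4.1.111 Netty in the same binary is a version gate. This is only a sketch of the idea (the actual fix in #11294 may take a different approach, e.g. behavior probing); the class and method names are my own, and at runtime the version string could plausibly be obtained from `io.netty.util.Version.identify()`:

```java
// Sketch: decide from a "4.1.111.Final"-style version string whether the
// index-correcting workaround (needed before netty/netty#14093 landed)
// should still be applied. Not the actual grpc-java fix.
public class NettyVersionGate {
    static boolean needsIndexWorkaround(String nettyVersion) {
        String[] parts = nettyVersion.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        int micro = Integer.parseInt(parts[2]);
        // The slice/duplicate semantics changed in 4.1.111.Final; assume
        // anything other than 4.x already has the new behavior.
        if (major != 4) {
            return false;
        }
        return minor < 1 || (minor == 1 && micro < 111);
    }

    public static void main(String[] args) {
        System.out.println(needsIndexWorkaround("4.1.110.Final")); // true
        System.out.println(needsIndexWorkaround("4.1.111.Final")); // false
    }
}
```

Version sniffing like this is brittle (it breaks on backports and forks), which is why probing the actual buffer behavior, as suggested later in this thread, may be the more robust design.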

@sergiitk (Member):

netty/netty#12844 has a very detailed description of what's wrong with CompositeByteBuf, and of the nature of our workaround. component(int) is returning slice(), which provides an incorrect view into the data. Our fix relied on the fact that slice().duplicate() returned correct indexes (IIRC because duplicate() did proper unwrapping). Our workaround corrects the ByteBuf indexes by getting them from the slice duplicate:

// Hence composite.internalComponent().duplicate() returns a buffer with the
// indexes that should've been on the composite.component() in the first place.
// Until the issue is fixed, we manually adjust the indexes of the removed component.
ByteBuf sliceDuplicate = composite.internalComponent(tailComponentIndex).duplicate();
newTail.setIndex(sliceDuplicate.readerIndex(), sliceDuplicate.writerIndex());

netty/netty#14093 changes that, and apparently the indexes we're getting from the slice duplicate are now wrong. Probably because the unwrapping was removed, but it's too early for conclusions. I won't be surprised if the root cause was actually fixed by that PR.
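To make the index mismatch concrete, here is a toy model (plain Java, NOT Netty; names my own) of the two duplicate() behaviors described above, assuming the scenario from the hexdumps below: an 8-byte underlying buffer with 3 bytes already consumed (ridx=3, widx=8), so the readable slice covers 5 bytes:

```java
// Toy model of slice().duplicate() index semantics, not Netty itself.
public class SliceDupModel {
    record Indexes(int readerIndex, int writerIndex) {}

    // Pre-#14093 behavior the workaround relied on: duplicating the internal
    // slice "unwrapped" to the full buffer, so the returned indexes were
    // offset into the underlying storage: (3, 8) in this scenario.
    static Indexes oldSliceDuplicate(int offset, int readableBytes) {
        return new Indexes(offset, offset + readableBytes);
    }

    // Post-#14093 behavior: duplicating a slice returns another slice view,
    // so the indexes are relative to the slice itself: (0, 5).
    static Indexes newSliceDuplicate(int offset, int readableBytes) {
        return new Indexes(0, readableBytes);
    }

    public static void main(String[] args) {
        // The workaround does newTail.setIndex(dup.readerIndex(), dup.writerIndex()).
        // Under the old behavior that restored (3, 8); under the new behavior it
        // would set (0, 5) on a buffer whose readable data actually lives at
        // offsets 3..8 -- plausibly how the frame corruption above arises.
        System.out.println(oldSliceDuplicate(3, 5)); // Indexes[readerIndex=3, writerIndex=8]
        System.out.println(newSliceDuplicate(3, 5)); // Indexes[readerIndex=0, writerIndex=5]
    }
}
```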

Relevant excerpt from netty/netty#12844 - read the issue description for the full context.


Let's explore the result of CompositeByteBuf.component(0):

###### component
Full:
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 2d 2d 2d 30 31 32 33 34                         |---01234        |
+--------+-------------------------------------------------+----------------+
Readable:
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 2d 2d 2d 30 31 32 33 34                         |---01234        |
+--------+-------------------------------------------------+----------------+
UnpooledDuplicatedByteBuf(ridx: 0, widx: 8, cap: 32, unwrapped: PooledUnsafeHeapByteBuf(ridx: 0, widx: 8, cap: 32))

Note that the component still contains the bytes discarded by the CompositeByteBuf. This contradicts what's returned by internalComponent(), which returns a slice() under the hood:

###### internalComponent
Full:
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 30 31 32 33 34                                  |01234           |
+--------+-------------------------------------------------+----------------+
Readable:
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 30 31 32 33 34                                  |01234           |
+--------+-------------------------------------------------+----------------+
UnpooledSlicedByteBuf(ridx: 0, widx: 5, cap: 5/5, unwrapped: PooledUnsafeHeapByteBuf(ridx: 0, widx: 8, cap: 32))

What's even more interesting: if we call .duplicate() on the slice, it returns the full view of the buffer, with correctly offset indexes!

###### internalComponentDup
Full:
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 2d 2d 2d 30 31 32 33 34                         |---01234        |
+--------+-------------------------------------------------+----------------+
Readable:
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 30 31 32 33 34                                  |01234           |
+--------+-------------------------------------------------+----------------+
UnpooledDuplicatedByteBuf(ridx: 3, widx: 8, cap: 32, unwrapped: PooledUnsafeHeapByteBuf(ridx: 0, widx: 8, cap: 32))

@ejona86 (Member) commented Jun 14, 2024

@larry-safran informed me unit tests did fail with 4.1.111. I re-tried (twice; third time was the charm) and now I'm seeing the failures. And they are all in NettyAdaptiveCumulatorTest. I've also been able to see a RESOURCE_EXHAUSTED: gRPC message exceeds maximum size 16777216: 39256064 error from our interop tests, which is clear corruption. For those wondering, the corruption happens in the reading code path.

So this really seems triggered by netty/netty#14093, but only because of the workaround for netty/netty#12844. We're probably going to need Netty version-detection logic to avoid going through a corrupting code path. And it seems well past time to fix netty/netty#12844.

@normanmaurer:

@ejona86 So just to summarize: you agree that what we did in Netty is correct? I think it is, but wanted to double-check :)

@ludoch commented Jun 17, 2024

If you want another internal datapoint, you can look at failure in https://screenshot.googleplex.com/59NDiYjVoeWcd6N
solved with this rollback: cr/643373868

@ejona86 (Member) commented Jun 18, 2024

@normanmaurer, I think Netty's new "duplicating a slice returns a slice" behavior is appropriate. The previous behavior confused me. I'll need to add a comment to netty/netty#12844. I think the failure is that a short-term workaround became a permanent fix (until it broke).

@chrisvest:

@ejona86 @larry-safran What are the release and back-port plans for the fix?

@cfredri4 (Contributor):

We're probably going to need Netty version-detection logic to avoid going through a corrupting code path.

@ejona86 Not fully following all the details, but instead of detecting specific Netty versions, would it be possible to look at the slice and/or duplicate buffer indexes to detect the old behaviour and not apply the workaround?

github-merge-queue bot pushed a commit to camunda/camunda that referenced this issue Jun 26, 2024
## Description

Netty 4.1.111.Final is incompatible with grpc-java >= 1.51.0 and < 1.65.0. It can introduce buffer corruption at any point, which if you're lucky is detected, but may also go unnoticed and result in garbage and wasted time.

Wait until grpc-java releases a compatible version before updating to 4.1.111.Final. Note that this also blocks Spring Boot 3.3.1, which depends on Netty 4.1.111.Final.
See:

- grpc/grpc-java#11284 (comment)
- https://github.com/grpc/grpc-java/releases/tag/v1.64.0

## Related issues

blocks #19581
blocks #19265
blocks #19264
blocks #19259
blocks #19258
@ejona86
Copy link
Member

ejona86 commented Jun 28, 2024

v1.65.0 is available with the Netty 4.1.111 fix. The other releases are still in-progress.

alluxio-bot pushed a commit to Alluxio/alluxio that referenced this issue Jul 1, 2024
### What changes are proposed in this pull request?
add back netty dependency within grpc


### Why are the changes needed?
previously we excluded the netty dependency since we already had netty-all outside (#9942)

But due to grpc/grpc-java#11284, we sometimes have incompatibilities between grpc and netty, and it's better to use the shaded netty within grpc so we can be sure they are compatible.

### Does this PR introduce any user facing changes?

na

			pr-link: #18642
			change-id: cid-65d86f315e023592060b6a9f864bfe2a972dfe68