-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix TLS stability issues with V2 protocol that caused data corruption #4404
Fix TLS stability issues with V2 protocol that caused data corruption #4404
Conversation
abba085
to
912217c
Compare
Thank you for the detailed PR. Since Netty is an active project, do you think it would be possible to get confirmation from the Netty project team? |
What would you think that should be confirmed? The javadoc already provides a recommendation:
I could test another change where I the SslHandler is added after the FlushConsolidationHandler instead of removing the handler during the handshake. As I mentioned in the description, the actual root cause might still be hidden, but this change definitely improves the situation since the reproducer https://github.com/lhotari/pulsar-playground/tree/master/issues/issue22601/standalone_env won't reproduce with this change. |
- add the TLS handler after the FlushConsolidationHandler - This makes TLS connections from Pulsar Broker to Bookkeeper stable when bookkeeperUseV2WireProtocol=true is used - Fix test TestTLS for V2 - Fix inconsistency in client configuration in BookKeeperClusterTestCase
912217c
to
f05d8ae
Compare
@shoothzj I verified that the solution works also without removing and re-adding the handler. I changed the solution to add the TLS handler (SslHandler) after the FlushConsolidationHandler in the pipeline. That fixes the issue with V2 protocol stability. PTAL. |
- revert changes to enable certain tests for V2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
great catch
…apache#4404) * Fix TLS stability issues with V2 protocol that caused data corruption - add the TLS handler after the FlushConsolidationHandler - This makes TLS connections from Pulsar Broker to Bookkeeper stable when bookkeeperUseV2WireProtocol=true is used - Fix test TestTLS for V2 - Fix inconsistency in client configuration in BookKeeperClusterTestCase (cherry picked from commit 5f73147)
…#4404) * Fix TLS stability issues with V2 protocol that caused data corruption - add the TLS handler after the FlushConsolidationHandler - This makes TLS connections from Pulsar Broker to Bookkeeper stable when bookkeeperUseV2WireProtocol=true is used - Fix test TestTLS for V2 - Fix inconsistency in client configuration in BookKeeperClusterTestCase (cherry picked from commit 5f73147)
…#4404) * Fix TLS stability issues with V2 protocol that caused data corruption - add the TLS handler after the FlushConsolidationHandler - This makes TLS connections from Pulsar Broker to Bookkeeper stable when bookkeeperUseV2WireProtocol=true is used - Fix test TestTLS for V2 - Fix inconsistency in client configuration in BookKeeperClusterTestCase (cherry picked from commit 5f73147)
I can now confirm that this PR didn't fix the root cause. |
I can confirm that the problem that I was able to reproduce previously is fixed without making any changes in Pulsar or Bookkeeper. The only required fix was to switch to use Netty built from netty/netty#14086 . This means that this PR #4404 didn't address the root cause although it made the problem go away for some unknown reason. |
…apache#4404) * Fix TLS stability issues with V2 protocol that caused data corruption - add the TLS handler after the FlushConsolidationHandler - This makes TLS connections from Pulsar Broker to Bookkeeper stable when bookkeeperUseV2WireProtocol=true is used - Fix test TestTLS for V2 - Fix inconsistency in client configuration in BookKeeperClusterTestCase
Motivation
In Pulsar, there are multiple reports (for example apache/pulsar#22601, apache/pulsar#21421) about data corruption when TLS is enabled between Pulsar Brokers and Bookkeeper Bookies and when V2 protocol is enabled in broker.conf with the default configuration setting
bookkeeperUseV2WireProtocol=true
.The reason to change the order of FlushConsolidationHandler compared to SslHandler is due to the fact that Netty SslHandler's STARTTLS support uses the flush event. My assumption is that having FlushConsolidationHandler in "front" of the flushes could cause problems.
- See the Netty SslHandler code in: https://github.com/netty/netty/blob/5085cef149134951c94a02a743ed70025c8cdad4/handler/src/main/java/io/netty/handler/ssl/SslHandler.java#L782-L806 .
The FlushConsolidationHandler javadoc also provides a recommendation:
However, it's possible that the root cause is different, but having FlushConsolidationHandler as the first handler (to only impact network socket flushes) after the TLS handshake could have some other side-effect which makes the problem go away. SslHandler will coalesce small buffers into a larger buffer and release them so the lifetime of small buffers is shorter.
It's also a mystery why the problem appears only when V2 protocol is used. That could hint to the direction that it's a release issue on a small buffer that gets merged into larger buffers and it reduces the occurrence of the hidden root cause issue.
Regardless of this concern, this PR contains changes that improve the current state of V2 protocol with TLS and it doesn't contain a negative impact.
Changes
when bookkeeperUseV2WireProtocol=true is used