Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TLS stability issues with V2 protocol that caused data corruption #4404

Merged
merged 3 commits into from
May 29, 2024

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented May 28, 2024

Motivation

In Pulsar, there are multiple reports (for example apache/pulsar#22601, apache/pulsar#21421) about data corruption when TLS is enabled between Pulsar Brokers and Bookkeeper Bookies and when V2 protocol is enabled in broker.conf with the default configuration setting bookkeeperUseV2WireProtocol=true.

The reason to change the order of FlushConsolidationHandler compared to SslHandler is due to the fact that Netty SslHandler's STARTTLS support uses the flush event. My assumption is that having FlushConsolidationHandler in "front" of the flushes could cause problems.
- See the Netty SslHandler code in: https://github.com/netty/netty/blob/5085cef149134951c94a02a743ed70025c8cdad4/handler/src/main/java/io/netty/handler/ssl/SslHandler.java#L782-L806 .

The FlushConsolidationHandler javadoc also provides a recommendation:

The FlushConsolidationHandler should be put as first ChannelHandler in the ChannelPipeline to have the best effect.

However, it's possible that the root cause is different, but having FlushConsolidationHandler as the first handler (to only impact network socket flushes) after the TLS handshake could have some other side-effect which makes the problem go away. SslHandler will coalesce small buffers into a larger buffer and release them so the lifetime of small buffers is shorter.

It's also a mystery why the problem appears only when V2 protocol is used. That could hint to the direction that it's a release issue on a small buffer that gets merged into larger buffers and it reduces the occurrence of the hidden root cause issue.
Regardless of this concern, this PR contains changes that improve the current state of V2 protocol with TLS and it doesn't contain a negative impact.

Changes

  • add the TLS handler after the FlushConsolidationHandler
    • This makes TLS connections from Pulsar Broker to Bookkeeper stable
      when bookkeeperUseV2WireProtocol=true is used
  • Fix test TestTLS for V2
  • Fix inconsistency in client configuration in BookKeeperClusterTestCase

@shoothzj
Copy link
Member

Thank you for the detailed PR. Since Netty is an active project, do you think it would be possible to get confirmation from the Netty project team?

@lhotari
Copy link
Member Author

lhotari commented May 29, 2024

Thank you for the detailed PR. Since Netty is an active project, do you think it would be possible to get confirmation from the Netty project team?

What would you think that should be confirmed?

The javadoc already provides a recommendation:

The FlushConsolidationHandler should be put as first ChannelHandler in the ChannelPipeline to have the best effect.

I could test another change where I the SslHandler is added after the FlushConsolidationHandler instead of removing the handler during the handshake. As I mentioned in the description, the actual root cause might still be hidden, but this change definitely improves the situation since the reproducer https://github.com/lhotari/pulsar-playground/tree/master/issues/issue22601/standalone_env won't reproduce with this change.

- add the TLS handler after the FlushConsolidationHandler
  - This makes TLS connections from Pulsar Broker to Bookkeeper stable
    when bookkeeperUseV2WireProtocol=true is used
- Fix test TestTLS for V2
- Fix inconsistency in client configuration in BookKeeperClusterTestCase
@lhotari lhotari force-pushed the lh-fix-tls-stability-with-V2 branch from 912217c to f05d8ae Compare May 29, 2024 00:33
@lhotari
Copy link
Member Author

lhotari commented May 29, 2024

@shoothzj I verified that the solution works also without removing and re-adding the handler. I changed the solution to add the TLS handler (SslHandler) after the FlushConsolidationHandler in the pipeline. That fixes the issue with V2 protocol stability. PTAL.

- revert changes to enable certain tests for V2
Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

great catch

@eolivelli eolivelli merged commit 5f73147 into apache:master May 29, 2024
23 checks passed
lhotari added a commit to lhotari/bookkeeper that referenced this pull request May 29, 2024
…apache#4404)

* Fix TLS stability issues with V2 protocol that caused data corruption
- add the TLS handler after the FlushConsolidationHandler
  - This makes TLS connections from Pulsar Broker to Bookkeeper stable
    when bookkeeperUseV2WireProtocol=true is used
- Fix test TestTLS for V2
- Fix inconsistency in client configuration in BookKeeperClusterTestCase

(cherry picked from commit 5f73147)
shoothzj pushed a commit that referenced this pull request May 30, 2024
…#4404)

* Fix TLS stability issues with V2 protocol that caused data corruption
- add the TLS handler after the FlushConsolidationHandler
  - This makes TLS connections from Pulsar Broker to Bookkeeper stable
    when bookkeeperUseV2WireProtocol=true is used
- Fix test TestTLS for V2
- Fix inconsistency in client configuration in BookKeeperClusterTestCase

(cherry picked from commit 5f73147)
shoothzj pushed a commit that referenced this pull request May 30, 2024
…#4404)

* Fix TLS stability issues with V2 protocol that caused data corruption
- add the TLS handler after the FlushConsolidationHandler
  - This makes TLS connections from Pulsar Broker to Bookkeeper stable
    when bookkeeperUseV2WireProtocol=true is used
- Fix test TestTLS for V2
- Fix inconsistency in client configuration in BookKeeperClusterTestCase

(cherry picked from commit 5f73147)
@lhotari
Copy link
Member Author

lhotari commented May 30, 2024

I can now confirm that this PR didn't fix the root cause.
The root cause is explained in this Netty PR: netty/netty#14086 . One of the unit test failures in the DUPLICATE_DIRECT_BUFFER scenario verifies that a duplicate input buffer could get mutated. The way to prevent this is to do .retainedSlice instead of using .retainedDuplicate.
In the Pulsar side, I have created PR apache/pulsar#22810 to mitigate issues and workaround the problems. I'll also create a similar PR to Bookkeeper to workaround the SslHandler behavior.

@lhotari
Copy link
Member Author

lhotari commented May 30, 2024

I can confirm that the problem that I was able to reproduce previously is fixed without making any changes in Pulsar or Bookkeeper. The only required fix was to switch to use Netty built from netty/netty#14086 .
(more details in apache/pulsar#22810 (comment))

This means that this PR #4404 didn't address the root cause although it made the problem go away for some unknown reason.

Ghatage pushed a commit to sijie/bookkeeper that referenced this pull request Jul 12, 2024
…apache#4404)

* Fix TLS stability issues with V2 protocol that caused data corruption
- add the TLS handler after the FlushConsolidationHandler
  - This makes TLS connections from Pulsar Broker to Bookkeeper stable
    when bookkeeperUseV2WireProtocol=true is used
- Fix test TestTLS for V2
- Fix inconsistency in client configuration in BookKeeperClusterTestCase
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants