Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Messages seems broken when got SSL exception in the producer #21933

Closed
1 of 2 tasks
pqab opened this issue Jan 19, 2024 · 18 comments
Closed
1 of 2 tasks

[Bug] Messages seems broken when got SSL exception in the producer #21933

pqab opened this issue Jan 19, 2024 · 18 comments

Comments

@pqab
Copy link

pqab commented Jan 19, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Version

2.10.5

Minimal reproduce step

  1. Create topic
bin/pulsar-admin tenants create tenant1
bin/pulsar-admin namespaces create tenant1/namespace1
bin/pulsar-admin namespaces set-persistence --bookkeeper-ack-quorum 2 --bookkeeper-ensemble 3 --bookkeeper-write-quorum 3 --ml-mark-delete-max-rate 0 tenant1/namespace1
bin/pulsar-admin namespaces set-retention tenant1/namespace1 --size -1 --time 3d
bin/pulsar-admin namespaces set-message-ttl tenant1/namespace1 --messageTTL 604800
bin/pulsar-admin topics create-partitioned-topic tenant1/namespace1/topic1 -p 3
  1. Produce large payload & batch from the admin tool with tls
bin/pulsar-perf produce persistent://tenant1/namespace1/topic1 -mk autoIncrement -bb 5242880 -r 5000 -s 5242 -bm 1000 -threads 30 --auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationTls --auth-params '{"tlsCertFile":"conf/user.cer","tlsKeyFile":"conf/user.key.pem"}'
  1. Stop until it produced around 1 million messages

  2. Wait until all the messages goes to BookKeeper backlog

  3. Start consumer to consume all the messages with tls

bin/pulsar-perf  consume persistent://tenant1/namespace1/topic1 --auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationTls --auth-params '{"tlsCertFile":"conf/user.cer","tlsKeyFile":"conf/user.key.pem"}' -sp Earliest -ss sub1

What did you expect to see?

Able to consume all produced messages properly from the consumer

What did you see instead?

Consumer stopped receiving msg in the middle, and could see some error from the broker logs like

2024-01-19T14:05:39,899+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] ERROR org.apache.bookkeeper.proto.checksum.DigestManager - Mac mismatch for ledger-id: 852, entry-id: 35932
2024-01-19T14:05:39,902+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] ERROR org.apache.bookkeeper.proto.checksum.DigestManager - Mac mismatch for ledger-id: 852, entry-id: 35932
2024-01-19T14:05:39,916+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] ERROR org.apache.bookkeeper.proto.checksum.DigestManager - Mac mismatch for ledger-id: 852, entry-id: 35932
2024-01-19T14:05:39,916+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L852 E35899-E35998, Sent to [100.87.157.209:3181, 100.111.147.236:3181, 100.96.184.253:3181], Heard from [100.87.157.209:3181, 100.111.147.236:3181, 100.96.184.253:3181] : bitset = {0, 1, 2}, Error = 'Entry digest does not match'. First unread entry is (35973, rc = 0)
2024-01-19T14:05:39,916+0000 [broker-topic-workers-OrderedExecutor-15-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer - [persistent://tenant1/namespace1/topic1-0 / sub1-Consumer{subscription=PersistentSubscription{topic=persistent://tenant1/namespace1/topic1-0, name=sub1}, consumerId=0, consumerName=383fd, address=/100.96.184.253:50090}] Error reading entries at 852:35899 : Entry digest does not match - Retrying to read in 15.0 seconds

Anything else?

Seems only happening when there is SSL exception in the middle of the produce like

2024-01-19T13:39:13,450+0000 [pulsar-client-io-12-1] WARN  org.apache.pulsar.client.impl.ClientCnx - Got exception io.netty.handler.codec.DecoderException: io.netty.handler.ssl.ReferenceCountedOpenSslEngine$OpenSslException: error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:499)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.netty.handler.ssl.ReferenceCountedOpenSslEngine$OpenSslException: error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.newSSLExceptionForError(ReferenceCountedOpenSslEngine.java:1377)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.shutdownWithError(ReferenceCountedOpenSslEngine.java:1089)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.sslReadErrorResult(ReferenceCountedOpenSslEngine.java:1399)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1325)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1426)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1469)
	at io.netty.handler.ssl.SslHandler$SslEngineType$1.unwrap(SslHandler.java:223)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1353)
	at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1257)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1297)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
	... 15 more

or

2024-01-19T14:01:02,532+0000 [pulsar-client-io-6-1] WARN  org.apache.pulsar.client.impl.ClientCnx - Got exception io.netty.handler.codec.DecoderException: io.netty.handler.ssl.ReferenceCountedOpenSslEngine$OpenSslException: error:10000438:SSL routines:OPENSSL_internal:TLSV1_ALERT_INTERNAL_ERROR
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:499)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.netty.handler.ssl.ReferenceCountedOpenSslEngine$OpenSslException: error:10000438:SSL routines:OPENSSL_internal:TLSV1_ALERT_INTERNAL_ERROR
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.newSSLExceptionForError(ReferenceCountedOpenSslEngine.java:1377)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.shutdownWithError(ReferenceCountedOpenSslEngine.java:1089)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.sslReadErrorResult(ReferenceCountedOpenSslEngine.java:1399)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1325)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1426)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1469)
	at io.netty.handler.ssl.SslHandler$SslEngineType$1.unwrap(SslHandler.java:223)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1353)
	at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1257)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1297)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
	... 15 more

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@lhotari
Copy link
Member

lhotari commented Jan 19, 2024

Not an answer to this problem, but noticed --ml-mark-delete-max-rate 0 in your persistence parameters. Could you share the reason for using that? I'd be interesting to know if there's an underlying requirement that could be covered with a better solution in Pulsar.
UPDATE: I now noticed that it's also in the examples: https://pulsar.apache.org/docs/next/admin-api-namespaces/#set-persistence-policies .

@lhotari
Copy link
Member

lhotari commented Jan 19, 2024

-bb 5242880

Does the problem reproduce with lower sizes? let's say 4000000 or 5000000.

@lhotari
Copy link
Member

lhotari commented Jan 19, 2024

Which OS and platform is this (uname -a)? Is this Linux x86_64? Which Java version (java -version)?

@pqab
Copy link
Author

pqab commented Jan 20, 2024

Not an answer to this problem, but noticed --ml-mark-delete-max-rate 0 in your persistence parameters. Could you share the reason for using that? I'd be interesting to know if there's an underlying requirement that could be covered with a better solution in Pulsar. UPDATE: I now noticed that it's also in the examples: https://pulsar.apache.org/docs/next/admin-api-namespaces/#set-persistence-policies .

yes, I believe we are just following the examples to create the topics

@pqab
Copy link
Author

pqab commented Jan 20, 2024

java -version

yes

Linux 4.20.11-1.el7.elrepo.x86_64 #1 SMP Wed Feb 20 09:00:53 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

openjdk version "11.0.12" 2021-07-20 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.12+7-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.12+7-LTS, mixed mode, sharing)

@pqab
Copy link
Author

pqab commented Jan 20, 2024

-bb 5242880

Does the problem reproduce with lower sizes? let's say 4000000 or 5000000.

I was trying to re-create the topic in order to clean the environment before re-running the test, but somehow the topic couldn't be deleted, and I keep seeing some error in the broker like

2024-01-20T01:07:32,897+0000 [pulsar-web-39-8] ERROR org.apache.pulsar.broker.admin.impl.PersistentTopicsBase - [superuser] Failed to delete topic persistent://tenant1/namespace1/topic1-partition-0
org.apache.bookkeeper.mledger.ManagedLedgerException$MetadataNotFoundException: Managed ledger not found

If I try to delete the topic, it's saying Topic not found, but if I try to re-create the same topic, it's saying Topic already exists.

I tried to create another new topic & publish with 3242880, but the polling rate is very slow and stopped receiving new message at some time.
seems like the server is kind of very busy now, we will need some time to check why the topic couldn't be deleted first

@lhotari
Copy link
Member

lhotari commented Jan 20, 2024

Version

2.10.5

Do you have a specific reason to use 2.10.x ? 2.10.x is very close to end of support since according to our release policy, we should have already dropped support (decision is pending, mailing list discussion). It would be recommended to use and test the latest LTS release which is 3.0.2 at the moment.

In any case, it would be helpful to test with 3.0.2 to see if the problem can be reproduced there.

@pqab
Copy link
Author

pqab commented Jan 21, 2024

Version

2.10.5

Do you have a specific reason to use 2.10.x ? 2.10.x is very close to end of support since according to our release policy, we should have already dropped support (decision is pending, mailing list discussion). It would be recommended to use and test the latest LTS release which is 3.0.2 at the moment.

In any case, it would be helpful to test with 3.0.2 to see if the problem can be reproduced there.

It's our current production version and we found this issues when we tried to run the load test to our application

sure, thank you
we will upgrade to 3.0.2 and run the same test again to see if we are still getting the same issues
I will update the result here after

@pqab
Copy link
Author

pqab commented Jan 23, 2024

After we upgraded to 3.1.2, we are still having the same issues

It looks similar to this ticket, but we are already using bookkeeperUseV2WireProtocol=true
#21421
would you have any idea? would it be something related to the SSL?

now we have another workaround to set managedLedgerDigestType=MAC to bypass it, and so far it looks fine, we don't see error and it doesn't stop consuming

@lhotari
Copy link
Member

lhotari commented Jan 29, 2024

After we upgraded to 3.1.2, we are still having the same issues

It looks similar to this ticket, but we are already using bookkeeperUseV2WireProtocol=true #21421 would you have any idea? would it be something related to the SSL?

now we have another workaround to set managedLedgerDigestType=MAC to bypass it, and so far it looks fine, we don't see error and it doesn't stop consuming

This is surprising since I have been assuming that #21421 impacts the case when the libcirce-checksum.so library cannot be loaded and Bookkeeper falls back to use JVM based CRC32 calculations with com.scurrilous.circe.checksum.Java9IntHash class.

@lhotari
Copy link
Member

lhotari commented Jan 29, 2024

Just wondering if you might be impacted by #21684 ? do you happen to have haProxyProtocolEnabled=true?

@lhotari
Copy link
Member

lhotari commented Jan 29, 2024

It's possible that this is a bug in Bookkeeper's checksum calculation. I'll follow up later after investigation. Thanks for the good issue reports!

@pqab
Copy link
Author

pqab commented Jan 30, 2024

Just wondering if you might be impacted by #21684 ? do you happen to have haProxyProtocolEnabled=true?

we are using haProxyProtocolEnabled=false

@lhotari
Copy link
Member

lhotari commented May 31, 2024

This is possibly related to #22601 / #22810 if this isn't fixed by Pulsar 3.0.5 / 3.2.3 which includes Bookkeeper 4.16.5 . @pqab Do you have a chance to test with Pulsar 3.0.5 ?

@pqab
Copy link
Author

pqab commented May 31, 2024

I believe @semistone is testing with Pulsar 3.2.2, and this error is gone

@lhotari
Copy link
Member

lhotari commented May 31, 2024

I believe @semistone is testing with Pulsar 3.2.2, and this error is gone

Sounds good. btw. It would be worth to test Pulsar 3.2.3 since there's over 100 commits since 3.2.2. Pulsar 3.2.3 release notes: https://github.com/apache/pulsar/releases/tag/v3.2.3 . Pulsar 3.2.3 also includes Bookkeeper 4.16.5 with important fixes.

@pqab
Copy link
Author

pqab commented May 31, 2024

I think so, we will test with the latest version and see if all similar issues are gone

@pqab
Copy link
Author

pqab commented Jun 7, 2024

We tested with 3.2.3 broker & client, and the issue is gone now

@lhotari lhotari closed this as completed Jun 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants