Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sometimes SetFailed() is called with errno=0 in InputMessenger::OnNewMessages() #1860

Open
wudisheng opened this issue Jul 26, 2022 · 11 comments
Labels
bug the code does not work as expected

Comments

@wudisheng
Copy link

Describe the bug (描述bug)
m->SetFailed() is called with saved_errno == 0 in InputMessenger::OnNewMessages(), which then triggers a hard error at CHECK(false) << "error_code is 0".

To Reproduce (复现方法)
In our environment we can constantly reproduce this issue in LTO (-flto=full) and ThinLTO (-flto=thin) modes, but everything runs without error in normal (-fno-lto) mode.

I surfed a bit into the source code, and it turns out that

By adding several pieces of logging codes with cares taken not to unintentionally change errno, I can confirm the behavior described above, see the attached screenshot for reference.

A potential fix / workaround is to change if (errno == EINTR) to if (error == 0 || errno == EINTR), but I'm not familiar with BRPC code (nor did I figure out why link-time optimization triggers such an issue), so I'd prefer BRPC maintainers taking a look at it. Thanks!

Expected behavior (期望行为)
Consistent behavior regardless of link-time optimization options.

Versions (各种版本)
OS: Linux Ubuntu 18.04 in Docker
Compiler: Clang 12.0.1
brpc: 1.1.0 Release
protobuf: 21.1

Additional context/screenshots (更多上下文/截图)

20220727045020

@wudisheng
Copy link
Author

Some further investigation shows that there might be another scenario --- readv() returned -1 with a non-zero errno, but errno got changed to 0 in return_cached_blocks().

Another piece of debugging log supports this scenario.

@wwbmmm
Copy link
Contributor

wwbmmm commented Jul 28, 2022

Another piece of debugging log supports this scenario.

Could you show this log?

@wwbmmm
Copy link
Contributor

wwbmmm commented Jul 28, 2022

This issue may be related to #1693. Although we have removed the __const__ attribute of __errno_location(), but powerful link time optimization may still cache the tls errno location ?

@wudisheng
Copy link
Author

Another piece of debugging log supports this scenario.

Could you show this log?

This is a piece of debugging log showing such a scenario (readv() returned -1 with errno==11), but outer callsite sees 0.

image

@wudisheng
Copy link
Author

wudisheng commented Jul 28, 2022

This issue may be related to #1693. Although we have removed the __const__ attribute of __errno_location(), but powerful link time optimization may still cache the tls errno location ?

BTW-1: I also observed once that before return nr in pappend_from_file_descriptor(), errno is non-zero, but immediately in the outer callsite errno become zero. I couldn't find the corresponding debugging log at this time, but it sounds like LTO does collapse something around errno.

BTW-2: During my investigation, I also tried using Clang-15 (instead of Clang-12) with exactly the same building configurations, then I observed a different coredump saying something like "bthread sched_to itself" or so, so it seems LTO may really break something in BRPC code base.

BTW-3: About half a year ago, when we are using 0.9.7, I roughly tested LTO capabilities once and BRPC wasn't a blocker at that time (the service behaves correctly online at a benefited performance).

Currently I have switched to a building mode that everything but BRPC in the entire dependency graph is built with ThinLTO, and the binary can be started without an immediate coredump (whether it behaves correctly may need a longer verification procedure).

Let me know if you need any further context from my scenario.

@wwbmmm
Copy link
Contributor

wwbmmm commented Jul 28, 2022

I observed a different coredump saying something like "bthread sched_to itself"

I encountered this error before, that is related to tls cache. Compiler options: -fno-gcse, -fno-cse-follow-jumps, -fno-move-loop-invariants may fix this.

@wudisheng
Copy link
Author

Thanks for the information, I'll give it a shoot later this week.

@wudisheng
Copy link
Author

-fno-gcse, -fno-cse-follow-jumps, -fno-move-loop-invariants

Unfortunately I could not easily find corresponding switches in Clang, it seems that optimization switches of Gcc and Clang at such a detailed level differ a lot.

@wwbmmm
Copy link
Contributor

wwbmmm commented Jul 28, 2022

BTW-3: About half a year ago, when we are using 0.9.7, I roughly tested LTO capabilities once and BRPC wasn't a blocker at that time (the service behaves correctly online at a benefited performance).

What compiler did you use in this case?

@wudisheng
Copy link
Author

wudisheng commented Jul 28, 2022

BTW-3: About half a year ago, when we are using 0.9.7, I roughly tested LTO capabilities once and BRPC wasn't a blocker at that time (the service behaves correctly online at a benefited performance).

What compiler did you use in this case?

Clang 8 at that time, with GLIBC 2.27 and GLIBCXX 3.4.26.

@wwbmmm
Copy link
Contributor

wwbmmm commented Jul 28, 2022

We are using gcc 8 and gcc 10 with lto to compile brpc and everything works well.
It seems that some new optimization in clang is not compatible with bthread, that is, it may cache the tls address between bthread context switch.

@wwbmmm wwbmmm added the bug the code does not work as expected label Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug the code does not work as expected
Projects
None yet
Development

No branches or pull requests

2 participants