Sometimes SetFailed() is called with errno=0 in InputMessenger::OnNewMessages() #1860

wudisheng · 2022-07-26T20:57:09Z

Describe the bug
m->SetFailed() is called with saved_errno == 0 in InputMessenger::OnNewMessages(), which then triggers a hard error at CHECK(false) << "error_code is 0".

To Reproduce
In our environment we can consistently reproduce this issue in LTO (-flto=full) and ThinLTO (-flto=thin) modes, but everything runs without error in normal (-fno-lto) mode.

I dug into the source code a bit, and it turns out that:

readv() may return -1 with errno == 0 in pappend_from_file_descriptor(), which then causes
m->DoRead() to return nr == -1 with errno == 0 in InputMessenger::OnNewMessages(), which then runs into
the else if (errno != EAGAIN) branch with errno == 0 and triggers the CHECK failure.
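
The chain above can be illustrated with a minimal sketch. It is not brpc code: `ReadOutcome` and `classify_read()` are hypothetical names introduced here only to show why a read result of -1 paired with errno == 0 ends up in the failure branch, since 0 is neither EAGAIN nor EINTR.

```cpp
#include <cerrno>
#include <sys/types.h>

// Hypothetical outcome categories for a single read attempt (not brpc types).
enum class ReadOutcome { kData, kEof, kRetryLater, kInterrupted, kFailed };

// Hypothetical helper mirroring the branch structure described in the report.
// `saved_errno` is the value of errno captured right after the read call.
inline ReadOutcome classify_read(ssize_t nr, int saved_errno) {
    if (nr > 0) return ReadOutcome::kData;
    if (nr == 0) return ReadOutcome::kEof;
    if (saved_errno == EAGAIN) return ReadOutcome::kRetryLater;
    if (saved_errno == EINTR) return ReadOutcome::kInterrupted;
    // Any other value is treated as a hard failure. errno == 0 lands here
    // too, which is how a failure can be reported with error code 0.
    return ReadOutcome::kFailed;
}
```

With this shape, `classify_read(-1, 0)` yields `kFailed`, which corresponds to SetFailed() being called with an error code of 0.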

By adding several pieces of logging code, taking care not to change errno unintentionally, I can confirm the behavior described above; see the attached screenshot for reference.

A potential fix / workaround is to change if (errno == EINTR) to if (errno == 0 || errno == EINTR), but I'm not familiar with the BRPC code (nor did I figure out why link-time optimization triggers such an issue), so I'd prefer that the BRPC maintainers take a look at it. Thanks!

Expected behavior
Consistent behavior regardless of link-time optimization options.

Versions
OS: Linux Ubuntu 18.04 in Docker
Compiler: Clang 12.0.1
brpc: 1.1.0 Release
protobuf: 21.1

Additional context/screenshots


wudisheng · 2022-07-27T05:53:43Z

Some further investigation shows that there might be another scenario: readv() returned -1 with a non-zero errno, but errno was changed to 0 in return_cached_blocks().

Another piece of debugging log supports this scenario.
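
If a cleanup routine really does overwrite errno between the failing call and the caller's check, one common defensive pattern is to save errno and restore it when the scope ends. The sketch below assumes that scenario; `ErrnoGuard`, `cleanup_that_clobbers_errno()` and `fail_with_cleanup()` are hypothetical names, not brpc APIs, and this would not help if the root cause is the compiler reading errno from a stale location.

```cpp
#include <cerrno>

// Hypothetical RAII helper: remembers errno on construction and writes the
// remembered value back on destruction.
class ErrnoGuard {
public:
    ErrnoGuard() : saved_(errno) {}
    ~ErrnoGuard() { errno = saved_; }
    ErrnoGuard(const ErrnoGuard&) = delete;
    ErrnoGuard& operator=(const ErrnoGuard&) = delete;

private:
    const int saved_;
};

// Hypothetical stand-in for a cleanup routine that resets errno as a side
// effect of the calls it makes.
inline void cleanup_that_clobbers_errno() { errno = 0; }

// Simulates a failing read path: errno is set, cleanup runs, -1 is returned.
// The guard restores errno after the cleanup, so the caller still sees `err`.
inline long fail_with_cleanup(int err) {
    errno = err;
    ErrnoGuard guard;
    cleanup_that_clobbers_errno();
    return -1;
}
```

After `fail_with_cleanup(EAGAIN)` returns, errno is EAGAIN again even though the cleanup step set it to 0.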

wwbmmm · 2022-07-28T04:27:29Z

> Another piece of debugging log supports this scenario.

Could you show this log?

wwbmmm · 2022-07-28T04:39:28Z

This issue may be related to #1693. Although we have removed the __const__ attribute of __errno_location(), aggressive link-time optimization may still cache the TLS errno location?

wudisheng · 2022-07-28T05:05:12Z

> > Another piece of debugging log supports this scenario.
>
> Could you show this log?

This is a piece of debugging log showing such a scenario: readv() returned -1 with errno == 11 (EAGAIN on Linux), but the outer call site sees 0.

wudisheng · 2022-07-28T05:19:15Z

> This issue may be related to #1693. Although we have removed the __const__ attribute of __errno_location(), aggressive link-time optimization may still cache the TLS errno location?

BTW-1: I also observed once that errno is non-zero right before return nr in pappend_from_file_descriptor(), but becomes zero immediately afterwards at the outer call site. I couldn't find the corresponding debugging log this time, but it sounds like LTO does collapse something around errno.

BTW-2: During my investigation, I also tried Clang 15 (instead of Clang 12) with exactly the same build configuration and observed a different coredump, saying something like "bthread sched_to itself", so it seems LTO may really break something in the BRPC code base.

BTW-3: About half a year ago, when we were using 0.9.7, I roughly tested LTO once and BRPC wasn't a blocker at that time (the service behaved correctly online, with improved performance).

Currently I have switched to a build mode in which everything in the dependency graph except BRPC is built with ThinLTO, and the binary starts without an immediate coredump (whether it behaves correctly may need a longer verification).

Let me know if you need any further context from my scenario.

wwbmmm · 2022-07-28T05:56:41Z

> I observed a different coredump saying something like "bthread sched_to itself"

I have encountered this error before; it is related to TLS caching. The compiler options -fno-gcse, -fno-cse-follow-jumps, and -fno-move-loop-invariants may fix it.

wudisheng · 2022-07-28T05:58:30Z

Thanks for the information, I'll give it a shot later this week.

wudisheng · 2022-07-28T10:20:55Z

> -fno-gcse, -fno-cse-follow-jumps, -fno-move-loop-invariants

Unfortunately, I could not easily find corresponding switches in Clang; the optimization switches of GCC and Clang seem to differ a lot at this level of detail.

wwbmmm · 2022-07-28T10:47:12Z

> BTW-3: About half a year ago, when we were using 0.9.7, I roughly tested LTO once and BRPC wasn't a blocker at that time (the service behaved correctly online, with improved performance).

What compiler did you use in this case?

wudisheng · 2022-07-28T10:48:11Z

> > BTW-3: About half a year ago, when we were using 0.9.7, I roughly tested LTO once and BRPC wasn't a blocker at that time (the service behaved correctly online, with improved performance).
>
> What compiler did you use in this case?

Clang 8 at that time, with GLIBC 2.27 and GLIBCXX 3.4.26.

wwbmmm · 2022-07-28T10:55:34Z

We use GCC 8 and GCC 10 with LTO to compile brpc, and everything works well.
It seems that some newer optimization in Clang is not compatible with bthread: it may cache the TLS address across a bthread context switch.
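
To make the suspected mechanism concrete: a bthread can be resumed on a different worker pthread than the one it was suspended on, so a thread-local address computed before the switch may afterwards refer to another thread's variable (errno included). A general mitigation, sketched below for GCC/Clang only, is to fetch the address through a function the optimizer cannot inline or fold. `tls_value` and `tls_value_address()` are hypothetical names; this is not taken from brpc, and whether it is sufficient under a given LTO mode is an assumption that would need to be verified.

```cpp
// Hypothetical thread-local variable standing in for per-thread state.
thread_local int tls_value = 0;

// Returns the address of tls_value for the thread that is running *now*.
// noinline keeps the address computation out of the caller's body, and the
// empty asm statement makes the result opaque to the optimizer, so a caller
// that invokes this again after a context switch gets a fresh address
// instead of a cached one.
__attribute__((noinline)) int* tls_value_address() {
    int* p = &tls_value;
    asm volatile("" : "+r"(p));
    return p;
}
```

Callers would then write `*tls_value_address()` at each use instead of holding on to `&tls_value` across a point where the running thread may change.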

wwbmmm added the "bug" label (the code does not work as expected) on Nov 23, 2022

ehds mentioned this issue Mar 7, 2023

fix compiler optimize thread local variable access #2156

Merged
