
fix(network): use multi io_services in asio #1016

Merged: 12 commits, Jan 26, 2022
Conversation

@Smityz Smityz commented Jan 10, 2022

Background

Related issue: apache/incubator-pegasus#307
That issue reports a bug where a core dump can happen in certain scenarios. It is caused by a race condition when multiple threads read, write, or close the same socket.

How to fix this bug

Original way

Epoll
+-------------------------------+
| socket1  socket2  socket3     |
|  +--+     +--+     +--+       |
|  +--+     +--+     +--+       |
|                               |
+------------^------------------+
             |
io_service   |polling
+------------+------------------+
| task_queue                    |
| +-----------+--------------+  |
| | epollwait | call_back    |  |
| +-----+-----+--+--------+--+  |
|       |        |        |     |
+-------+--------+--------+-----+
Thread1 |  2     |     3  |
      +-v-+    +-v-+    +-v-+
      |   |    |   |    |   |
      +---+    +---+    +---+

In the past, we used multiple threads to execute polling and callbacks in a single event loop. But as the coredump information shows, a use-after-free can happen under high traffic. It is hard to add a mutex to prevent this problem, so this PR changes the way we use ASIO.

New way

+-----------------------------------------------+
|Linux kernel                                   |
| +-----------+   +-----------+   +-----------+ |
| |   Epoll1  |   |   Epoll2  |   |   Epoll3  | |
| +-----^-----+   +-----^-----+   +-----^-----+ |
+-------|---------------|---------------|-------+
  +-----------+   +-----------+   +-----------+
  |  polling  |   |  polling  |   |  polling  |
  | +-------+ |   | +-------+ |   | +-------+ |
  | |Thread1| |   | |Thread2| |   | |Thread3| |
  | +-------+ |   | +-------+ |   | +-------+ |
  |io_service1|   |io_service2|   |io_service3|
  +-----------+   +-----------+   +-----------+

This PR uses a one-loop-per-thread model in the network service: all operations on a given socket are executed in a single thread, so we no longer need to worry about race conditions.
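
For illustration, here is a minimal sketch of the one-loop-per-thread model (not the exact PR code; the class and member names are placeholders): each io_service is driven by exactly one thread, and every socket is bound to a single io_service, so all of its handlers run on that one thread.

#include <atomic>
#include <cstdint>
#include <memory>
#include <thread>
#include <vector>

#include <boost/asio.hpp>

// Minimal io_service pool sketch: one io_service per thread, so all handlers
// of a socket bound to a given io_service run on exactly one thread.
class io_service_pool
{
public:
    explicit io_service_pool(int worker_count)
    {
        for (int i = 0; i < worker_count; ++i) {
            auto ios = std::make_shared<boost::asio::io_service>();
            // The work guard keeps run() from returning while the queue is empty.
            _works.push_back(std::make_shared<boost::asio::io_service::work>(*ios));
            _io_services.push_back(ios);
            _threads.emplace_back([ios]() { ios->run(); });
        }
    }

    // Pick an io_service for a new socket; every operation on that socket is
    // then posted to this same io_service, i.e. executed by one thread only.
    boost::asio::io_service &get_io_service()
    {
        uint32_t idx = _next.fetch_add(1, std::memory_order_relaxed);
        return *_io_services[idx % _io_services.size()];
    }

    ~io_service_pool()
    {
        _works.clear(); // let run() return once pending handlers are done
        for (auto &t : _threads) {
            t.join();
        }
    }

private:
    std::vector<std::shared_ptr<boost::asio::io_service>> _io_services;
    std::vector<std::shared_ptr<boost::asio::io_service::work>> _works;
    std::vector<std::thread> _threads;
    std::atomic<uint32_t> _next{0};
};

Each RPC session would then keep a reference to the io_service it was created on and issue all of its async reads, writes, and closes through it, which is what removes the cross-thread races on a single socket.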

Benchmark

Original benchmark

+--------------------------+----------+------------+--------+------------+--------------+----------------------------------------------+-----------------------------------------------+
|      operation_case      | run_time | throughput | length | read_write | thread_count |     read(qps|ave|min|max|95|99|999|9999)     |     write(qps|ave|min|max|95|99|999|9999)     |
+--------------------------+----------+------------+--------+------------+--------------+----------------------------------------------+-----------------------------------------------+
| write=single,read=single | 8284     | 36210      | 1000   | 0 : 1      | 15           | {0 0 0 0 0 0 0 0}                            | {36212 1240 365 236287 2386 5349 12119 18929} |
| write=single,read=single | 3124     | 289183     | 1000   | 1 : 0      | 50           | {289233 518 116 163924 853 1500 15120 21369} | {0 0 0 0 0 0 0 0}                             |
| write=single,read=single | 3151     | 76148      | 1000   | 1 : 1      | 30           | {38078 712 116 179583 1892 8137 38959 51305} | {38075 1643 368 322815 3995 6908 18257 32436} |
| write=single,read=single | 3317     | 45208      | 1000   | 1 : 3      | 15           | {11306 568 121 148287 1191 5044 31439 39956} | {33909 1133 378 253524 1992 4940 16777 27567} |
| write=single,read=single | 2348     | 38309      | 1000   | 1 : 30     | 15           | {1234 588 148 99583 1204 4399 29561 38324}   | {37083 1190 374 242303 2189 5061 12209 22660} |
| write=single,read=single | 2498     | 120131     | 1000   | 3 : 1      | 30           | {90117 535 115 149247 1083 4484 29825 45375} | {30040 1379 375 303273 2717 5896 30233 50068} |
| write=single,read=single | 3393     | 267123     | 1000   | 30 : 1     | 50           | {258529 543 113 177535 957 2168 16620 25764} | {8615 1121 418 266537 1585 4980 23660 48585}  |
+--------------------------+----------+------------+--------+------------+--------------+----------------------------------------------+-----------------------------------------------+

After change

+--------------------------+----------+------------+--------+------------+--------------+-----------------------------------------------+-----------------------------------------------+
|      operation_case      | run_time | throughput | length | read_write | thread_count |     read(qps|ave|min|max|95|99|999|9999)      |     write(qps|ave|min|max|95|99|999|9999)     |
+--------------------------+----------+------------+--------+------------+--------------+-----------------------------------------------+-----------------------------------------------+
| write=single,read=single | 8000     | 37495      | 1000   | 0 : 1      | 15           | {0 0 0 0 0 0 0 0}                             | {37497 1197 359 210644 2199 5193 16247 23100} |
| write=single,read=single | 3076     | 293209     | 1000   | 1 : 0      | 50           | {293270 510 112 176489 848 1346 16665 24001}  | {0 0 0 0 0 0 0 0}                             |
| write=single,read=single | 3056     | 78516      | 1000   | 1 : 1      | 30           | {39266 720 118 170367 1954 8243 38601 51241}  | {39262 1564 353 253609 3555 6564 17687 34548} |
| write=single,read=single | 3213     | 46676      | 1000   | 1 : 3      | 15           | {11671 561 121 157055 1211 5136 30281 38335}  | {35013 1093 357 215679 1844 4887 11745 22825} |
| write=single,read=single | 2292     | 39252      | 1000   | 1 : 30     | 15           | {1266 600 150 119081 1258 4555 29892 38025}   | {37994 1161 367 242260 2003 4985 10463 19873} |
| write=single,read=single | 2422     | 123899     | 1000   | 3 : 1      | 30           | {92937 527 114 127593 1072 4433 23495 37716}  | {30979 1310 364 227199 2375 5739 17865 35332} |
| write=single,read=single | 3808     | 241281     | 1000   | 30 : 1     | 50           | {233521 613 110 186708 1264 2553 16615 24703} | {7784 1148 400 244820 1914 5060 24692 47156}  |
+--------------------------+----------+------------+--------+------------+--------------+-----------------------------------------------+-----------------------------------------------+

The change has no significant performance impact.

Actual performance

This change has fixed the bug in our production environment.

@Smityz Smityz changed the title from "fix(network): fix asio core dump" to "refactor(network): use multi io_services in asio" on Jan 24, 2022
@neverchanje neverchanje left a comment


Have you reproduced the bug? And how do you ensure the new code will fix it?

Smityz commented Jan 24, 2022

Have you reproduced the bug? And how do you ensure the new code will fix it?

I did a grayscale (canary) test in the production environment: machines running the old version still had this problem, but the machines running the new version did not.

The cause of the coredump is the race condition, and in the new design a single socket is never accessed from multiple threads.

@acelyc111 (Member) commented:

If this is a bugfix PR, explicitly use fix instead of refactor in the PR title.

@neverchanje neverchanje changed the title from "refactor(network): use multi io_services in asio" to "fix(network): use multi io_services in asio" on Jan 25, 2022
levy5307 previously approved these changes Jan 25, 2022

private:
friend class asio_rpc_session;
friend class asio_network_provider_test;

std::shared_ptr<boost::asio::ip::tcp::acceptor> _acceptor;
boost::asio::io_service _io_service;
int _next_io_service = 0;

Reviewer comment (Contributor):

This number must be atomic since multiple threads may concurrently modify it. I would recommend randomly choosing the io_service so that we can get rid of the concurrent conflict entirely.

@Smityz Smityz Jan 25, 2022

  1. Operations on an int are naturally atomic.
  2. A random function costs a lot of time.

@neverchanje neverchanje Jan 25, 2022

  1. This statement really disappoints me, but the overall idea of keeping it efficient may work. I won't give it a +1; others might.

https://stackoverflow.com/questions/54188/are-c-reads-and-writes-of-an-int-atomic

@Smityz Smityz replied:

Yes, you are right. And I'm sorry for my irresponsible statement.

[screenshot: Godbolt assembly output]
I did an experiment on Godbolt before, and I found that i++ is a single line in assembly:
add eax, 1. So I thought this operation was thread-safe.
But optimizations on modern processors may make it more complex, and I learned a lot from this answer.

I think modifying an int from multiple threads is safe in the sense that it won't core dump, but it cannot keep coherence across multiple processors.

Anyway, it is undefined behavior to access non-atomic variables from multiple threads in C++, and I have changed my code. You can continue reviewing it now.

Comment on lines 412 to 420
++_next_io_service;
if (_next_io_service >= FLAGS_io_service_worker_count) {
    _next_io_service = 0;
}

int tmp = _next_io_service;
if (tmp >= FLAGS_io_service_worker_count) {
    tmp = 0;
}

Reviewer comment (Member):

How about ensuring FLAGS_io_service_worker_count is 2^N and io_service_worker_mask = FLAGS_io_service_worker_count - 1? Then the code can be simplified as:

uint32_t idx = _next_io_service.fetch_add(1);
return *_io_services[idx & io_service_worker_mask];

@Smityz Smityz replied:

Good idea, I have thought about it too. But limiting FLAGS_io_service_worker_count to a power of two is rather strict. I'll do a speed test later to check whether it's faster than the two addition operations.
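
For reference, a rough sketch of the two selection variants discussed above, assuming an std::atomic<uint32_t> counter and a vector of io_service pointers; the declarations below are illustrative stand-ins, not the actual PR code:

#include <atomic>
#include <cstdint>
#include <memory>
#include <vector>

#include <boost/asio.hpp>

// Illustrative stand-ins mirroring the names used in the PR.
static int32_t FLAGS_io_service_worker_count = 8;
static std::vector<std::unique_ptr<boost::asio::io_service>> _io_services;
static std::atomic<uint32_t> _next_io_service{0};

// Variant 1: plain modulo, works for any worker count.
boost::asio::io_service &get_io_service_mod()
{
    uint32_t idx = _next_io_service.fetch_add(1, std::memory_order_relaxed);
    return *_io_services[idx % static_cast<uint32_t>(FLAGS_io_service_worker_count)];
}

// Variant 2: bit-mask modulo; requires FLAGS_io_service_worker_count to be 2^N.
boost::asio::io_service &get_io_service_mask()
{
    const uint32_t io_service_worker_mask =
        static_cast<uint32_t>(FLAGS_io_service_worker_count) - 1;
    uint32_t idx = _next_io_service.fetch_add(1, std::memory_order_relaxed);
    return *_io_services[idx & io_service_worker_mask];
}

Both variants avoid the data race on the counter; the mask only saves the cost of an integer modulo, at the price of constraining the worker count to a power of two.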

Smityz commented Jan 26, 2022

I did a benchmark in a multi-threaded environment again and found that random is the quickest way, so I decided to adopt it.

round_robin_1(add):
time = 825987.000 ns
round_robin_2(MOD):
time = 421049.000 ns
round_robin_3(BIT MOD):
time = 504586.000 ns
round_robin_4(dsn::rand):
time = 420370.000 ns

CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

https://gist.github.com/Smityz/00426f49544348676d4ddd8b0b0eb253
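
For context, this is roughly what the adopted random strategy looks like; the thread_local generator below is a generic stand-in for the project's rand utility (an assumption for illustration, not the gist's exact code):

#include <memory>
#include <random>
#include <vector>

#include <boost/asio.hpp>

// Randomly pick an io_service for a new socket. With a thread_local generator
// there is no shared counter at all, so threads never contend on selection.
boost::asio::io_service &
pick_io_service(std::vector<std::unique_ptr<boost::asio::io_service>> &io_services)
{
    thread_local std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<size_t> dist(0, io_services.size() - 1);
    return *io_services[dist(gen)];
}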
