-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
occasional syncd crash at initialization time #8177
occasional syncd crash at initialization time #8177
Comments
We believe that this is likely due to multi-threaded access to maps in swss::Logger, specifically m_currentPrios and m_currentOutputs |
Possibly caused by this commit: http://github.com/Azure/sonic-sairedis/commit/d458a11acb81b6d6cd1f00e029a6fa7c74a5af24#diff-0f282bb4fdc70639e695cdafd322730eaa1c141b3b4e87ec0fda6e317f687017 This moved linkToDb after linkToDbNative, which (per the API) looks to be invalid (http://github.com/Azure/sonic-swss-common/blob/master/common/logger.h#L81) and may result in data race conditions. @kcudnik |
@kcudnik please help to double check the suggestion from Mike. Thanks. |
@qiluo-msft hey, can you explain why in this case order matters https://github.com/Azure/sonic-swss-common/blob/master/common/logger.h#L81 ?? |
Created issue for this sonic-net/sonic-swss-common#507 |
i added some debug logs to swss logger, and i can see concurrent access from settingThread and linkToDbWithOutput functions to m_current variables, i put that into small program in a loop but i can't hit that crash locally to have a repro, but problem exists and needs to be fixed |
i created a fix for that sonic-net/sonic-swss-common@bf02298 but imho looks ugly as hell :/ any suggestions? |
If we were to enforce the call order (and that linkToDbNative must only be called exactly once) and move linkToDbNative to just before http://github.com/Azure/sonic-sairedis/blob/master/syncd/syncd_main.cpp#L71 would that be cleaner? It wouldn't make Logger thread-safe (multiple threads calling linkToDb would still be unsafe), but that might be fine. This would be more along the lines of "avoid the problem" than "fix the problem"... |
i would not recommend to move this to line 71, since that would indicate that you are aware of whats in Syncd::ctr, which we should not, and it could still happen in other applications, so i would rather want to fix the problem instead of avoiding. If we fix that in logger, then it would be fixed everywhere for free. Avoiding will always get you at some point. Also if you look inside linkToDbNative, then inside it calls linkToDb, so why there is still this comment in logger.h about order ? make no sense :/ @qiluo-msft maybe you will have some insights? |
ping, on thread and fix PR |
@marian-pritsak Could you answer @kcudnik 's question on #8177 (comment) |
Description
We're seeing occasional syncd crashes during initialization that appear to be related to updating the API log levels. These crashes are infrequent, but always occur during system startup with a similar trace (see example below). We do not have a reliable way to reproduce this issue, but it has been seen multiple times.
Steps to reproduce the issue:
Describe the results you received:
syncd crash
Describe the results you expected:
no syncd crash
Output of
show version
:Image based on 202012 release
Output of
show techsupport
:Additional information you deem important (e.g. issue happens only occasionally):
(gdb) bt
#0 0x00007fc18d3168b0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fc18d63be18 in std::__cxx11::basic_string<char, std::char_traits, std::allocator >::compare(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) const ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x000056471dfb53a3 in std::operator< <char, std::char_traits, std::allocator > (__rhs="SAI_API_FDB", __lhs=<error: Cannot access memory at address 0x3>) at /usr/include/c++/8/bits/basic_string.h:6133
#3 std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >::operator() (this=0x7fc19e4a4590 swss::Logger::getInstance()::m_logger+48, __y="SAI_API_FDB",
__x=<error: Cannot access memory at address 0x3>) at /usr/include/c++/8/bits/stl_function.h:386
#4 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >::_M_get_insert_hint_unique_pos (this=this@entry=0x7fc19e4a4590 swss::Logger::getInstance()::m_logger+48, Python Exception <class 'gdb.error'> No type named std::_Rb_tree_node<struct std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >>.:
__position=..., __k="SAI_API_FDB")
at /usr/include/c++/8/bits/stl_tree.h:2168
#5 0x000056471dfb5678 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > >, std::piecewise_construct_t const&, std::tuple<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&>&&, std::tuple<>&&) (
this=0x7fc19e4a4590 swss::Logger::getInstance()::m_logger+48, Python Exception <class 'gdb.error'> No type named std::_Rb_tree_node<struct std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >>.:
__pos=..., __args#0=..., __args#1=..., __args#2=...) at /usr/include/c++/8/bits/char_traits.h:287
#6 0x00007fc19e42cb6e in std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >::operator[] (__k="SAI_API_FDB", this=0x7fc19e4a4590 swss::Logger::getInstance()::m_logger+48) at /usr/include/c++/8/bits/stl_tree.h:348
#7 swss::Logger::linkToDbWithOutput(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >)> const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >)> const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) (
dbName="SAI_API_FDB", prioNotify=..., defPrio=..., outputNotify=..., defOutput=...) at logger.cpp:133
#8 0x00007fc19e42d5f0 in swss::Logger::linkToDb(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >)> const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) (dbName="SAI_API_FDB", prioNotify=...,
defPrio="SAI_LOG_LEVEL_NOTICE") at /usr/include/c++/8/new:169
#9 0x000056471dfa4394 in syncd::Syncd::setSaiApiLogLevel (this=0x56471f4234a0) at /usr/include/c++/8/bits/basic_string.h:252
#10 0x000056471dfab894 in syncd::Syncd::Syncd (this=0x56471f4234a0, vendorSai=std::shared_ptr (use count 3, weak count 0) = {...}, cmd=..., isWarmStart=) at Syncd.cpp:57
#11 0x000056471df9d046 in __gnu_cxx::new_allocatorsyncd::Syncd::construct<syncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__p=0x56471f4234a0, this=)
at /usr/include/c++/8/new:169
#12 std::allocator_traits<std::allocatorsyncd::Syncd >::construct<syncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__p=0x56471f4234a0, __a=...)
at /usr/include/c++/8/bits/alloc_traits.h:475
#13 std::_Sp_counted_ptr_inplace<syncd::Syncd, std::allocatorsyncd::Syncd, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__a=...,
this=0x56471f423490) at /usr/include/c++/8/bits/shared_ptr_base.h:545
#14 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<syncd::Syncd, std::allocatorsyncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__a=...,
__p=: , this=) at /usr/include/c++/8/bits/shared_ptr_base.h:677
#15 std::__shared_ptr<syncd::Syncd, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocatorsyncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__tag=...,
this=) at /usr/include/c++/8/bits/shared_ptr_base.h:1342
#16 std::shared_ptrsyncd::Syncd::shared_ptr<std::allocatorsyncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__tag=..., this=)
at /usr/include/c++/8/bits/shared_ptr.h:359
#17 std::allocate_shared<syncd::Syncd, std::allocatorsyncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__a=...) at /usr/include/c++/8/bits/shared_ptr.h:706
#18 std::make_shared<syncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> () at /usr/include/c++/8/bits/shared_ptr.h:722
#19 syncd_main (argc=argc@entry=6, argv=argv@entry=0x7ffedda7bf18) at syncd.cpp:74
#20 0x000056471df9b51e in main (argc=6, argv=0x7ffedda7bf18) at main.cpp:9
The text was updated successfully, but these errors were encountered: