Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

occasional syncd crash at initialization time #8177

Closed
mikeberesford opened this issue Jul 14, 2021 · 11 comments · Fixed by sonic-net/sonic-swss-common#510
Closed

occasional syncd crash at initialization time #8177

mikeberesford opened this issue Jul 14, 2021 · 11 comments · Fixed by sonic-net/sonic-swss-common#510
Labels
Triaged this issue has been triaged

Comments

@mikeberesford
Copy link

Description

We're seeing occasional syncd crashes during initialization that appear to be related to updating the API log levels. These crashes are infrequent, but always occur during system startup with a similar trace (see example below). We do not have a reliable way to reproduce this issue, but it has been seen multiple times.

Steps to reproduce the issue:

  1. Restart the system many times.
  2. ???
  3. Find a corefile

Describe the results you received:

syncd crash

Describe the results you expected:

no syncd crash

Output of show version:

Image based on 202012 release

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

(gdb) bt
#0 0x00007fc18d3168b0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fc18d63be18 in std::__cxx11::basic_string<char, std::char_traits, std::allocator >::compare(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) const ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x000056471dfb53a3 in std::operator< <char, std::char_traits, std::allocator > (__rhs="SAI_API_FDB", __lhs=<error: Cannot access memory at address 0x3>) at /usr/include/c++/8/bits/basic_string.h:6133
#3 std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >::operator() (this=0x7fc19e4a4590 swss::Logger::getInstance()::m_logger+48, __y="SAI_API_FDB",
__x=<error: Cannot access memory at address 0x3>) at /usr/include/c++/8/bits/stl_function.h:386
#4 std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >::_M_get_insert_hint_unique_pos (this=this@entry=0x7fc19e4a4590 swss::Logger::getInstance()::m_logger+48, Python Exception <class 'gdb.error'> No type named std::_Rb_tree_node<struct std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >>.:
__position=..., __k="SAI_API_FDB")
at /usr/include/c++/8/bits/stl_tree.h:2168
#5 0x000056471dfb5678 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > >, std::piecewise_construct_t const&, std::tuple<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&>&&, std::tuple<>&&) (
this=0x7fc19e4a4590 swss::Logger::getInstance()::m_logger+48, Python Exception <class 'gdb.error'> No type named std::_Rb_tree_node<struct std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >>.:
__pos=..., __args#0=..., __args#1=..., __args#2=...) at /usr/include/c++/8/bits/char_traits.h:287
#6 0x00007fc19e42cb6e in std::map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::less<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >::operator[] (__k="SAI_API_FDB", this=0x7fc19e4a4590 swss::Logger::getInstance()::m_logger+48) at /usr/include/c++/8/bits/stl_tree.h:348
#7 swss::Logger::linkToDbWithOutput(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >)> const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >)> const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) (
dbName="SAI_API_FDB", prioNotify=..., defPrio=..., outputNotify=..., defOutput=...) at logger.cpp:133
#8 0x00007fc19e42d5f0 in swss::Logger::linkToDb(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >)> const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) (dbName="SAI_API_FDB", prioNotify=...,
defPrio="SAI_LOG_LEVEL_NOTICE") at /usr/include/c++/8/new:169
#9 0x000056471dfa4394 in syncd::Syncd::setSaiApiLogLevel (this=0x56471f4234a0) at /usr/include/c++/8/bits/basic_string.h:252
#10 0x000056471dfab894 in syncd::Syncd::Syncd (this=0x56471f4234a0, vendorSai=std::shared_ptr (use count 3, weak count 0) = {...}, cmd=..., isWarmStart=) at Syncd.cpp:57
#11 0x000056471df9d046 in __gnu_cxx::new_allocatorsyncd::Syncd::construct<syncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__p=0x56471f4234a0, this=)
at /usr/include/c++/8/new:169
#12 std::allocator_traits<std::allocatorsyncd::Syncd >::construct<syncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__p=0x56471f4234a0, __a=...)
at /usr/include/c++/8/bits/alloc_traits.h:475
#13 std::_Sp_counted_ptr_inplace<syncd::Syncd, std::allocatorsyncd::Syncd, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__a=...,
this=0x56471f423490) at /usr/include/c++/8/bits/shared_ptr_base.h:545
#14 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<syncd::Syncd, std::allocatorsyncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__a=...,
__p=: , this=) at /usr/include/c++/8/bits/shared_ptr_base.h:677
#15 std::__shared_ptr<syncd::Syncd, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocatorsyncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__tag=...,
this=) at /usr/include/c++/8/bits/shared_ptr_base.h:1342
#16 std::shared_ptrsyncd::Syncd::shared_ptr<std::allocatorsyncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__tag=..., this=)
at /usr/include/c++/8/bits/shared_ptr.h:359
#17 std::allocate_shared<syncd::Syncd, std::allocatorsyncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> (__a=...) at /usr/include/c++/8/bits/shared_ptr.h:706
#18 std::make_shared<syncd::Syncd, std::shared_ptrsyncd::VendorSai&, std::shared_ptrsyncd::CommandLineOptions&, bool&> () at /usr/include/c++/8/bits/shared_ptr.h:722
#19 syncd_main (argc=argc@entry=6, argv=argv@entry=0x7ffedda7bf18) at syncd.cpp:74
#20 0x000056471df9b51e in main (argc=6, argv=0x7ffedda7bf18) at main.cpp:9

@mikeberesford
Copy link
Author

We believe that this is likely due to multi-threaded access to maps in swss::Logger, specifically m_currentPrios and m_currentOutputs

@mikeberesford
Copy link
Author

Possibly caused by this commit: http://github.com/Azure/sonic-sairedis/commit/d458a11acb81b6d6cd1f00e029a6fa7c74a5af24#diff-0f282bb4fdc70639e695cdafd322730eaa1c141b3b4e87ec0fda6e317f687017

This moved linkToDb after linkToDbNative, which (per the API) looks to be invalid (http://github.com/Azure/sonic-swss-common/blob/master/common/logger.h#L81) and may result in data race conditions. @kcudnik

@zhangyanzhao
Copy link
Collaborator

@kcudnik please help to double check the suggestion from Mike. Thanks.

@zhangyanzhao zhangyanzhao added the Triaged this issue has been triaged label Jul 21, 2021
@kcudnik
Copy link
Contributor

kcudnik commented Jul 21, 2021

@qiluo-msft hey, can you explain why in this case order matters https://github.com/Azure/sonic-swss-common/blob/master/common/logger.h#L81 ??

@kcudnik
Copy link
Contributor

kcudnik commented Jul 22, 2021

Created issue for this sonic-net/sonic-swss-common#507

@kcudnik
Copy link
Contributor

kcudnik commented Jul 22, 2021

i added some debug logs to swss logger, and i can see concurrent access from settingThread and linkToDbWithOutput functions to m_current variables, i put that into small program in a loop but i can't hit that crash locally to have a repro, but problem exists and needs to be fixed

@kcudnik
Copy link
Contributor

kcudnik commented Jul 23, 2021

i created a fix for that sonic-net/sonic-swss-common@bf02298 but imho looks ugly as hell :/ any suggestions?

@mikeberesford
Copy link
Author

If we were to enforce the call order (and that linkToDbNative must only be called exactly once) and move linkToDbNative to just before http://github.com/Azure/sonic-sairedis/blob/master/syncd/syncd_main.cpp#L71 would that be cleaner? It wouldn't make Logger thread-safe (multiple threads calling linkToDb would still be unsafe), but that might be fine. This would be more along the lines of "avoid the problem" than "fix the problem"...

@kcudnik
Copy link
Contributor

kcudnik commented Jul 23, 2021

i would not recommend to move this to line 71, since that would indicate that you are aware of whats in Syncd::ctr, which we should not, and it could still happen in other applications, so i would rather want to fix the problem instead of avoiding. If we fix that in logger, then it would be fixed everywhere for free. Avoiding will always get you at some point.

Also if you look inside linkToDbNative, then inside it calls linkToDb, so why there is still this comment in logger.h about order ? make no sense :/ @qiluo-msft maybe you will have some insights?

@kcudnik
Copy link
Contributor

kcudnik commented Aug 9, 2021

ping, on thread and fix PR

@qiluo-msft
Copy link
Collaborator

@marian-pritsak Could you answer @kcudnik 's question on #8177 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Triaged this issue has been triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants