[docker-wait-any] End child threads when main thread is signalled to end #10812

Closed
wants to merge 1 commit into from

vaibhavhd
Contributor

Why I did it

This fixes a bug in the docker-wait-any script that causes exceptions on the shutdown path for services.
Reason for this error:

  1. During startup, some services use docker-wait-any as part of the systemctl wait function.
  2. docker-wait-any creates several threads that wait as long as the dependent services are running.
  3. All the threads share the same docker_client instance, which is owned by the main thread.
  4. When one of the child threads signals the main thread to exit, the child thread still proceeds to the next iteration of its loop.
  5. This creates a problem when the main thread starts cleaning up the docker_client instance while child threads are still running and using that same instance.
  6. This leads to a NoneType error.
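
The sequence above can be illustrated with a minimal sketch. It is not the docker-wait-any code: FakeClient, buggy_worker, and the teardown_done event are hypothetical names introduced here. The extra event only makes the ordering deterministic, and the loop is bounded so the demo terminates.

```python
import threading


class FakeClient:
    """Hypothetical stand-in for the shared docker client."""

    def __init__(self):
        self.closed = False
        self.calls_after_close = 0

    def wait(self, container_name):
        # Record any use that happens after the main thread has torn down.
        if self.closed:
            self.calls_after_close += 1


def buggy_worker(client, exit_event, teardown_done, iterations=2):
    for _ in range(iterations):      # stands in for the script's infinite loop
        client.wait("swss")          # use the shared client
        exit_event.set()             # signal the main thread to exit
        teardown_done.wait()         # demo only: let the main thread clean up first
        # No break here, so the next iteration touches the client again.


client = FakeClient()
exit_event = threading.Event()
teardown_done = threading.Event()

worker = threading.Thread(
    target=buggy_worker, args=(client, exit_event, teardown_done)
)
worker.start()

exit_event.wait()        # main thread is unblocked by the child
client.closed = True     # main thread begins cleaning up the shared client
teardown_done.set()
worker.join()

print(client.calls_after_close)
```

In this sketch the counter ends at 1: the child's second iteration uses the client after the main thread has already cleaned it up, which is the window in which the real script hits the NoneType errors.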

The error traces show that the error always originates while a child thread is accessing docker_client, which is being freed by the main thread:

Instance 1:

May  4 08:25:34.249137 str-msn2700-01 INFO swss.sh[3838]: Exception in thread Thread-1:
May  4 08:25:34.250864 str-msn2700-01 INFO swss.sh[3838]: Traceback (most recent call last):
May  4 08:25:34.251987 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
May  4 08:25:34.253123 str-msn2700-01 INFO swss.sh[3838]:     self.run()
May  4 08:25:34.254245 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/lib/python2.7/threading.py", line 754, in run
May  4 08:25:34.255094 str-msn2700-01 INFO swss.sh[3838]:     self.__target(*self.__args, **self.__kwargs)
May  4 08:25:34.256085 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/bin/docker-wait-any", line 49, in wait_for_container
May  4 08:25:34.257340 str-msn2700-01 INFO swss.sh[3838]:     while docker_client.inspect_container(container_name)['State']['Status'] != "running":
May  4 08:25:34.258840 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/local/lib/python2.7/dist-packages/docker/utils/decorators.py", line 21, in wrapped
May  4 08:25:34.259924 str-msn2700-01 INFO swss.sh[3838]:     return f(self, resource_id, *args, **kwargs)
May  4 08:25:34.261800 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/local/lib/python2.7/dist-packages/docker/api/container.py", line 173, in inspect_container
May  4 08:25:34.263452 str-msn2700-01 INFO swss.sh[3838]:     self._get(self._url("/containers/{0}/json", container)), True
May  4 08:25:34.264960 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/local/lib/python2.7/dist-packages/docker/client.py", line 110, in _get
May  4 08:25:34.266356 str-msn2700-01 INFO swss.sh[3838]:     return self.get(url, **self._set_request_timeout(kwargs))
May  4 08:25:34.267707 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 542, in get
May  4 08:25:34.269129 str-msn2700-01 INFO swss.sh[3838]:     return self.request('GET', url, **kwargs)
May  4 08:25:34.270841 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 515, in request
May  4 08:25:34.272377 str-msn2700-01 INFO swss.sh[3838]:     prep = self.prepare_request(req)
May  4 08:25:34.274015 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 449, in prepare_request
May  4 08:25:34.275624 str-msn2700-01 INFO swss.sh[3838]:     headers=merge_setting(request.headers, self.headers, dict_class=CaseInsensitiveDict),
May  4 08:25:34.277034 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 64, in merge_setting
May  4 08:25:34.278583 str-msn2700-01 INFO swss.sh[3838]:     isinstance(session_setting, Mapping) and
May  4 08:25:34.279902 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/lib/python2.7/abc.py", line 144, in __instancecheck__
May  4 08:25:34.281770 str-msn2700-01 INFO swss.sh[3838]:     return cls.__subclasscheck__(subtype)
May  4 08:25:34.283544 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/lib/python2.7/abc.py", line 171, in __subclasscheck__
May  4 08:25:34.284746 str-msn2700-01 INFO swss.sh[3838]:     cls._abc_cache.add(subclass)
May  4 08:25:34.286218 str-msn2700-01 INFO swss.sh[3838]:   File "/usr/lib/python2.7/_weakrefset.py", line 86, in add
May  4 08:25:34.291449 str-msn2700-01 INFO swss.sh[3838]:     self.data.add(ref(item, self._remove))
May  4 08:25:34.291679 str-msn2700-01 INFO swss.sh[3838]: TypeError: 'NoneType' object is not callable

Instance 2:

Apr 25 16:02:11.122539 str-dcfx-t0-1-04 INFO swss.sh[4575]: Exception in thread Thread-1:
Apr 25 16:02:11.123217 str-dcfx-t0-1-04 INFO swss.sh[4575]: Traceback (most recent call last):
Apr 25 16:02:11.123788 str-dcfx-t0-1-04 INFO swss.sh[4575]:   File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
Apr 25 16:02:11.124292 str-dcfx-t0-1-04 INFO swss.sh[4575]:     self.run()
Apr 25 16:02:11.124792 str-dcfx-t0-1-04 INFO swss.sh[4575]:   File "/usr/lib/python2.7/threading.py", line 754, in run
Apr 25 16:02:11.125278 str-dcfx-t0-1-04 INFO swss.sh[4575]:     self.__target(*self.__args, **self.__kwargs)
Apr 25 16:02:11.125782 str-dcfx-t0-1-04 INFO swss.sh[4575]:   File "/usr/bin/docker-wait-any", line 49, in wait_for_container
Apr 25 16:02:11.126274 str-dcfx-t0-1-04 INFO swss.sh[4575]:     while docker_client.inspect_container(container_name)['State']['Status'] != "running":
Apr 25 16:02:11.126759 str-dcfx-t0-1-04 INFO swss.sh[4575]:   File "/usr/local/lib/python2.7/dist-packages/docker/utils/decorators.py", line 21, in wrapped
Apr 25 16:02:11.127247 str-dcfx-t0-1-04 INFO swss.sh[4575]:     return f(self, resource_id, *args, **kwargs)
Apr 25 16:02:11.129110 str-dcfx-t0-1-04 INFO swss.sh[4575]:   File "/usr/local/lib/python2.7/dist-packages/docker/api/container.py", line 173, in inspect_container
Apr 25 16:02:11.130943 str-dcfx-t0-1-04 INFO swss.sh[4575]:     self._get(self._url("/containers/{0}/json", container)), True
Apr 25 16:02:11.133140 str-dcfx-t0-1-04 INFO swss.sh[4575]:   File "/usr/local/lib/python2.7/dist-packages/docker/client.py", line 120, in _url
Apr 25 16:02:11.136573 str-dcfx-t0-1-04 INFO swss.sh[4575]:     if not isinstance(arg, six.string_types):
Apr 25 16:02:11.138394 str-dcfx-t0-1-04 INFO swss.sh[4575]: AttributeError: 'NoneType' object has no attribute 'string_types'

Instance 3:

Apr 25 16:38:26.009619 str-dcfx-t0-1-04 INFO swss.sh[4594]: Exception in thread Thread-1:
Apr 25 16:38:26.010321 str-dcfx-t0-1-04 INFO swss.sh[4594]: Traceback (most recent call last):
Apr 25 16:38:26.010878 str-dcfx-t0-1-04 INFO swss.sh[4594]:   File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
Apr 25 16:38:26.011440 str-dcfx-t0-1-04 INFO swss.sh[4594]:     self.run()
Apr 25 16:38:26.011946 str-dcfx-t0-1-04 INFO swss.sh[4594]:   File "/usr/lib/python2.7/threading.py", line 754, in run
Apr 25 16:38:26.012469 str-dcfx-t0-1-04 INFO swss.sh[4594]:     self.__target(*self.__args, **self.__kwargs)
Apr 25 16:38:26.012982 str-dcfx-t0-1-04 INFO swss.sh[4594]:   File "/usr/bin/docker-wait-any", line 49, in wait_for_container
Apr 25 16:38:26.013479 str-dcfx-t0-1-04 INFO swss.sh[4594]:     while docker_client.inspect_container(container_name)['State']['Status'] != "running":
Apr 25 16:38:26.014010 str-dcfx-t0-1-04 INFO swss.sh[4594]:   File "/usr/local/lib/python2.7/dist-packages/docker/utils/decorators.py", line 21, in wrapped
Apr 25 16:38:26.014526 str-dcfx-t0-1-04 INFO swss.sh[4594]:     return f(self, resource_id, *args, **kwargs)
Apr 25 16:38:26.015021 str-dcfx-t0-1-04 INFO swss.sh[4594]:   File "/usr/local/lib/python2.7/dist-packages/docker/api/container.py", line 173, in inspect_container
Apr 25 16:38:26.017533 str-dcfx-t0-1-04 INFO swss.sh[4594]:     self._get(self._url("/containers/{0}/json", container)), True
Apr 25 16:38:26.020224 str-dcfx-t0-1-04 INFO swss.sh[4594]:   File "/usr/local/lib/python2.7/dist-packages/docker/client.py", line 120, in _url
Apr 25 16:38:26.023942 str-dcfx-t0-1-04 INFO swss.sh[4594]:     if not isinstance(arg, six.string_types):
Apr 25 16:38:26.027494 str-dcfx-t0-1-04 INFO swss.sh[4594]: AttributeError: 'NoneType' object has no attribute 'string_types'

How I did it

The child thread should break out of its infinite loop after signaling the main thread to unblock and exit.
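
A minimal sketch of that idea, assuming a hypothetical FakeClient in place of the real docker client and an illustrative wait_for_container signature (the actual diff is not shown here and may differ in detail):

```python
import threading


class FakeClient:
    """Hypothetical stand-in for the shared docker client; counts wait() calls."""

    def __init__(self):
        self.wait_calls = 0

    def wait(self, container_name):
        self.wait_calls += 1


def wait_for_container(client, container_name, exit_event):
    while True:
        client.wait(container_name)  # returns once the container stops
        exit_event.set()             # unblock the main thread
        break                        # leave the loop; never touch client again


client = FakeClient()
exit_event = threading.Event()

worker = threading.Thread(
    target=wait_for_container, args=(client, "swss", exit_event)
)
worker.start()

exit_event.wait()        # main thread wakes up and may now clean up
worker.join(timeout=5)   # the child has already left its loop
```

Because the child leaves the loop right after setting the event, it makes no further calls on the shared client once the main thread starts tearing it down.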

How to verify it

Tried multiple times on a physical testbed; the issue is not seen with this fix.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@vaibhavhd vaibhavhd requested a review from prsunny May 11, 2022 17:44
@vaibhavhd vaibhavhd requested a review from lguohan as a code owner May 11, 2022 17:44
@prsunny prsunny requested a review from prabhataravind May 20, 2024 16:03
lguohan pushed a commit that referenced this pull request May 22, 2024
Fix docker-wait-any script crash issue.

Why I did it
The docker-wait-any script creates waiting threads for multiple containers.
When any container thread exits, g_thread_exit_event is set and the main thread exits.
However, when this happens, some threads may still be waiting on a container with the following code:
docker_client.wait(container_name)
Because docker_client is destroyed when the main thread exits, the wait method sometimes throws a TypeError. This causes swss.sh to crash, and the swss container then can't start:

<30>May 13 07:11:22 DEVICE_NAME swss.sh[13603]: Traceback (most recent call last):
<30>May 13 07:11:22 DEVICE_NAME swss.sh[13603]: Exception in thread Thread-1:
...
<30>May 13 07:11:22 DEVICE_NAME swss.sh[13603]: while docker_client.inspect_container(container_name)['State']['Status'] != "running":
...
<30>May 13 07:11:23 DEVICE_NAME swss.sh[13603]: TypeError: 'NoneType' object is not callable
<13>May 13 07:11:23 DEVICE_NAME root: Stopping swss service...

This PR is based on the analysis in the following PR: #10812

Microsoft ADO: 28052815
@vaibhavhd
Contributor Author

This change was merged as part of #19009

@vaibhavhd vaibhavhd closed this Jun 3, 2024
@vaibhavhd vaibhavhd deleted the fix-docker-wait-any branch June 3, 2024 23:57