[hostcfgd] Configure service auto-restart in hostcfgd. #5744
retest mellanox please
retest vsimage please
Force-pushed from e729226 to e4447ee.
Before this change, a process running inside every SONiC container dealt with the FEATURE table 'auto_restart' field and, depending on the value, decided whether the container had to be killed or not. If it was killed, the service auto-restart mechanism restarted the container. This change moves the logic from the container to the host daemon - hostcfgd.
* hostcfgd refactoring - move feature handling into another class.
* override systemd service Restart= setting from hostcfgd.
* remove code that deals with FEATURE table from supervisor-proc-exit-listener.
* remove default systemd Restart=always.
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
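The "override systemd service Restart= setting" step in the commit message above can be illustrated with a systemd drop-in file. The sketch below is not the code from this PR: the drop-in file name auto_restart.conf, the helper names, and the mapping of 'enabled' to Restart=always and anything else to Restart=no are assumptions made for illustration.

```python
import os

# Assumption: unit overrides live under the standard systemd admin directory.
SYSTEMD_DIR = "/etc/systemd/system"


def render_restart_dropin(auto_restart):
    """Return drop-in contents for a FEATURE 'auto_restart' value (hypothetical helper)."""
    # Assumed mapping: 'enabled' -> always, anything else -> no.
    restart = "always" if auto_restart == "enabled" else "no"
    return "[Service]\nRestart={}\n".format(restart)


def write_restart_dropin(unit, auto_restart, systemd_dir=SYSTEMD_DIR):
    """Write <unit>.service.d/auto_restart.conf and return its path (hypothetical helper)."""
    dropin_dir = os.path.join(systemd_dir, "{}.service.d".format(unit))
    os.makedirs(dropin_dir, exist_ok=True)
    path = os.path.join(dropin_dir, "auto_restart.conf")
    tmp_path = path + ".tmp"
    # Write to a temporary file first, then rename, so a reader never sees a partial file.
    with open(tmp_path, "w") as f:
        f.write(render_restart_dropin(auto_restart))
    os.replace(tmp_path, path)
    # systemd only picks up the new drop-in after 'systemctl daemon-reload'.
    return path
```

A caller would typically follow write_restart_dropin("lldp", "enabled") with a systemctl daemon-reload so that the new Restart= value takes effect.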
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
Force-pushed from e4447ee to 6f47365.
This pull request introduces 1 alert when merging 6f47365 into 261a81d - view on LGTM.com new alerts:
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
…restart_cfg Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
…restart_cfg Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
start_cmds.append("sudo systemctl start {}.{}".format(feature_name_suffix, feature_suffixes[-1]))
for cmd in start_cmds:
    syslog.syslog(syslog.LOG_INFO, "Running cmd: '{}'".format(cmd))
    try:
Can we enhance run_cmd() and use it here, so that it returns the error code and also logs the error read from the exception?
Thanks, yes, please check the enhanced run_cmd()
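For readers following the thread, a run_cmd() that both returns the error code and logs the error taken from the exception might look roughly like the sketch below. It is not the enhanced function from this PR: the (returncode, output) return shape, the raise_exception parameter, and the use of shell=True are assumptions.

```python
import subprocess
import syslog


def run_cmd(cmd, log_err=True, raise_exception=False):
    """Run a shell command and return (returncode, output). Illustrative sketch only."""
    try:
        # check_output captures stdout; stderr is merged in so the log shows the real error.
        output = subprocess.check_output(
            cmd, shell=True, stderr=subprocess.STDOUT, text=True)
        return 0, output
    except subprocess.CalledProcessError as err:
        if log_err:
            syslog.syslog(syslog.LOG_ERR,
                          "{} - failed: return code - {}, output:\n{}"
                          .format(err.cmd, err.returncode, err.output))
        if raise_exception:
            raise
        return err.returncode, err.output
```

With a helper of this shape, the start and stop loops in the feature handler could call run_cmd(cmd) and branch on the returned code instead of wrapping each call in its own try block.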
stop_cmds.append("sudo systemctl mask {}.{}".format(feature_name_suffix, suffix))
for cmd in stop_cmds:
    syslog.syslog(syslog.LOG_INFO, "Running cmd: '{}'".format(cmd))
    try:
Same comment as before to use run_cmd()
…it in feature handler code Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
@stepanblyschak: This PR appears to change the container behavior upon critical process exit. Currently, |
@jleveque |
…restart_cfg Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
Force-pushed from d918e64 to 79c2866.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@lguohan: A few Azure Pipelines check builds are stuck in the "Expected — Waiting for status to be reported" state, and I cannot re-trigger these tests. There are a few PRs like this and I've even tried closing/reopening the PRs to no avail. Can you please help here? |
/AzurePipleines run
Closing and reopening PR in hopes of getting stuck Azure Pipelines jobs running.
syslog.syslog(syslog.LOG_INFO, "Feature '{}' service is '{}'"
              .format(feature_name, invariant_state))
entry = self.config_db.get_entry('FEATURE', feature_name)
entry['state'] = invariant_state
@stepanblyschak Following my previous comment: if the state here is always_disabled and invariant_state is always_enabled, the code here will update the state field of the feature to invariant_state. However, the code at lines 761-764 will disable this feature. So I think the code at line 758 should be entry['state'] = state, right?
Not sure I get this comment. Are you saying there is a bug in the original code?
Could you please point me to a document describing feature state transitions?
@yozhao101 Does this comment still apply?
@yozhao101 could you please check if the last commit addresses your comment?
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Force-pushed from a6f9c94 to 774781d.
if cached_feature.state is None:
    enable = feature.state in ("always_enabled", "enabled")
    disable = feature.state in ("always_disabled", "disabled")
elif cached_feature.state == ("always_enabled", "always_disabled"):
Use in instead of ==: elif cached_feature.state in ("always_enabled", "always_disabled"):
Thanks for noticing this!
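To make the point of this review comment concrete: comparing a string to a tuple with == never matches, so the elif branch in the snippet above could never be taken. The small example below is only an illustration; the function names and the INVARIANT_STATES constant are hypothetical and do not come from hostcfgd.

```python
INVARIANT_STATES = ("always_enabled", "always_disabled")


def is_invariant_buggy(state):
    # A str is never equal to a tuple, so this is False for every state string.
    return state == INVARIANT_STATES


def is_invariant(state):
    # Membership test: True when state is one of the tuple's elements.
    return state in INVARIANT_STATES
```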
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@tahmed-dev can you please provide your approval?
@yozhao101 can you please provide your review/approval ASAP?
@jleveque, can you please provide your review or approval?
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Looks good from my perspective. @yozhao101: Can you please review again to make sure all your concerns have been addressed?
Thanks, Joe and Renuka ! I am checking ... |
except Exception as err:
    if log_err:
        syslog.syslog(syslog.LOG_ERR, "{} - failed: return code - {}, output:\n{}"
                      .format(err.cmd, err.returncode, err.output))
It looks like we only have output from the child process if it was captured by run() or check_output(). Otherwise, it is None. Please see: https://docs.python.org/3/library/subprocess.html#subprocess.CalledProcessError.output
This issue applies to the existing code as well - https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-host-services/scripts/hostcfgd#L66. This PR does not intend to fix it.
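The behavior discussed in this thread can be reproduced with a short standalone example. It is only a demonstration of subprocess.CalledProcessError.output and is unrelated to the hostcfgd code; the helper names and the failing child command are made up for illustration.

```python
import subprocess
import sys

# A child process that prints to stdout and then exits with status 2.
CMD = [sys.executable, "-c", "print('oops'); raise SystemExit(2)"]


def failure_without_capture():
    """check_call() does not capture stdout, so err.output is None."""
    try:
        subprocess.check_call(CMD, stdout=subprocess.DEVNULL)
    except subprocess.CalledProcessError as err:
        return err.returncode, err.output


def failure_with_capture():
    """check_output() captures stdout, so err.output holds the child's output."""
    try:
        subprocess.check_output(CMD, text=True)
    except subprocess.CalledProcessError as err:
        return err.returncode, err.output
```

A log line that formats err.output therefore prints None unless the command was run through a capturing call such as check_output() or run() with stdout captured.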
@yozhao101 can you please resolve/provide your reviews and approval?
Waiting for the build to succeed. @stepanblyschak, if you can, please ping me when the build succeeds.
Same error appears again - https://dev.azure.com/mssonic/build/_build/results?buildId=19697&view=logs&j=88ce9a53-729c-5fa9-7b6e-3d98f2488e3f&t=8d99be27-49d0-54d0-99b1-cfc0d47f0318&l=527
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Before this change, a process running inside every SONiC container dealt with the FEATURE table 'auto_restart' field and, depending on the value, decided whether the container had to be killed or not. If the container is killed, the service auto-restart mechanism restarts it.
This change moves the logic from the container to the host daemon - hostcfgd.
The 'auto_restart' handling is kept in supervisor-proc-exit-listener, but it is no longer required for a container that wants to support the auto-restart feature.
* hostcfgd refactoring - move feature handling into another class.
* override systemd service Restart= setting from hostcfgd.
* remove default systemd Restart=always.
Signed-off-by: Stepan Blyshchak stepanb@nvidia.com
- Why I did it
Remove the need to deal with container orchestration logic from the container itself. Leave this logic to the orchestrator - host OS.
- How I did it
hostcfgd configures 'Restart=' value for systemd service.
- How to verify it
root@r-tigon-11:/home/admin# sudo config feature autorestart lldp enabled
root@r-tigon-11:/home/admin# show feature status | grep lldp
lldp enabled enabled
root@r-tigon-11:/home/admin# docker exec -it lldp pkill -9 lldpd
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c docker-lldp:latest "/usr/bin/docker-lld…" 2 days ago Exited (0) 20 seconds ago lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c docker-lldp:latest "/usr/bin/docker-lld…" 2 days ago Up 5 seconds lldp
root@r-tigon-11:/home/admin# sudo config feature autorestart lldp disabled
root@r-tigon-11:/home/admin# docker exec -it lldp pkill -9 lldpd
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c docker-lldp:latest "/usr/bin/docker-lld…" 2 days ago Up 35 seconds lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c docker-lldp:latest "/usr/bin/docker-lld…" 2 days ago Exited (0) 3 seconds ago lldp
root@r-tigon-11:/home/admin# docker ps -a | grep lldp
65058396277c docker-lldp:latest "/usr/bin/docker-lld…" 2 days ago Exited (0) 39 seconds ago lldp
root@r-tigon-11:/home/admin#
- Which release branch to backport (provide reason below if selected)
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)