[teamsyncd][teammgrd] Graceful exit after receiving SIGTERM #1407

Merged
merged 3 commits into from
Sep 4, 2020

Conversation

Sabareesh-Kumar-Anandan
Contributor

Signed-off-by: Sabareesh Kumar Anandan <sanandan@marvell.com>

What I did
When SIGTERM is received, gracefully exit teamsyncd and teammgrd after cleaning up teamd processes and resources.

Why I did it

The following errors are seen after config reload:

  1. portchannel interfaces in kernel were not cleaned up.
  2. teamsyncd gets netlink messages with old ifIndex.
  3. Error - "TeamPortSync: Failed to initialize team handler".

How I verified it
I ran multiple config reloads and `docker stop teamd`.
Verified that all portchannel interfaces are cleaned up in the kernel and that all teamd processes exit cleanly.

Details if related

Signed-off-by: Sabareesh Kumar Anandan <sanandan@marvell.com>
@pavel-shirshov
Contributor

pavel-shirshov commented Aug 20, 2020

Found this comment explaining why it wasn't done
#1159 (comment)

Since we need signal handlers in both teammgrd and teamsyncd, and both have "interdependent" resources to clean up, I found it right to let the processes continue and get killed by SIGKILL later. If we add explicit process exits, they exit at their own pace, and in testing I found some of the PortChannel interfaces remaining in the kernel without being cleaned up.

@Sabareesh-Kumar-Anandan Can you please check that after your changes all PortChannel interfaces are being removed?

@pavel-shirshov
Contributor

pavel-shirshov commented Aug 20, 2020

Also, as a suggestion:
we have

while (true)
{
  if (received_signal)
    break;
}

I'd rather write it as

while (!received_signal)
{
}
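
For illustration, here is a minimal, self-contained sketch of the suggested loop form. It assumes the flag is a `volatile std::sig_atomic_t` set by a SIGTERM handler; the names `on_sigterm`, `install_handler`, and `run_until_signal` are hypothetical and not taken from teammgrd.cpp or teamsyncd.cpp. The `std::raise` call only simulates an external SIGTERM so the example terminates.

```cpp
#include <csignal>

// Written by the handler, read by the loop. volatile std::sig_atomic_t is the
// type that is safe to set from a signal handler.
static volatile std::sig_atomic_t received_signal = 0;

// Hypothetical handler: it only records that the signal arrived.
static void on_sigterm(int)
{
    received_signal = 1;
}

// Clears the flag and installs the handler for SIGTERM.
void install_handler()
{
    received_signal = 0;
    std::signal(SIGTERM, on_sigterm);
}

// Runs the loop until the flag is set and returns the number of iterations.
// After `raise_after` iterations (must be >= 1) the function sends SIGTERM to
// its own process, purely to simulate an external stop request.
int run_until_signal(int raise_after)
{
    int iterations = 0;
    while (!received_signal)
    {
        ++iterations;
        if (iterations == raise_after)
        {
            std::raise(SIGTERM);
        }
    }
    return iterations;
}
```

Calling `install_handler()` followed by `run_until_signal(3)` from `main` leaves the loop on the third iteration, because the loop condition itself checks the flag and no separate `break` is needed.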

Contributor

@pavel-shirshov pavel-shirshov left a comment

Can you please check my comments?

@judyjoseph
Contributor

judyjoseph commented Aug 20, 2020

Thanks @pavel-shirshov. Yes, that matches what I observed during earlier testing.

@Sabareesh-Kumar-Anandan, you mention that you still get the errors listed in the PR description after config reload?

@Sabareesh-Kumar-Anandan
Contributor Author

> Thanks @pavel-shirshov. Yes, that matches what I observed during earlier testing.
>
> @Sabareesh-Kumar-Anandan, you mention that you still get the errors listed in the PR description after config reload?

Yes. The errors mentioned in the PR are still seen after config reload.
I have observed that teammgrd receives SIGTERM only after teamsyncd exits. Since teamsyncd keeps running until SIGKILL, the teammgrd cleanup never happens, so I added explicit process exits.
With this fix, all portchannel interfaces in the kernel are cleaned up correctly. I tested multiple times and don't see any issues.
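
The shape of "exit explicitly once cleanup is done" can be sketched roughly as follows. This is not the code in this PR: `run_daemon`, `poll_once`, and `cleanup` are hypothetical names, where `poll_once` stands in for one `select` round and `cleanup` stands in for `cleanTeamSync()` or `cleanTeamProcesses()`.

```cpp
#include <csignal>
#include <functional>
#include <string>
#include <vector>

// Set by the SIGTERM handler and polled by the main loop.
static volatile std::sig_atomic_t g_stop = 0;

static void handle_sigterm(int)
{
    g_stop = 1;
}

// Hypothetical daemon skeleton. It loops until SIGTERM is seen, runs the
// cleanup callback exactly once, and then returns 0 so the process ends on
// its own instead of waiting to be killed with SIGKILL.
int run_daemon(const std::function<void()>& poll_once,
               const std::function<void()>& cleanup,
               std::vector<std::string>& events)
{
    g_stop = 0;
    std::signal(SIGTERM, handle_sigterm);

    while (!g_stop)
    {
        poll_once();              // stand-in for one select() round
    }

    events.push_back("cleanup");  // cleanup runs outside the signal handler
    cleanup();
    events.push_back("Exiting");  // mirrors the "main: Exiting" log line
    return 0;                     // explicit, clean process exit
}
```

Returning from the loop, rather than continuing to poll after cleanup, is what lets the next process in the shutdown sequence receive its own SIGTERM before the container's stop timeout expires.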

@pavel-shirshov
Contributor

@Sabareesh-Kumar-Anandan Can you please change your code from
while (true) to while (!received_signal)? Then I'll approve your PR.

@judyjoseph
Contributor

Hi @Sabareesh-Kumar-Anandan, I still don't follow what you mean by "all portchannel interfaces in kernel are cleaned up correctly", since we already call TeamMgr::cleanTeamProcesses() and sync.cleanTeamSync() in the respective signal handlers. They should be cleaned up, unless you are creating portchannels during the config reload itself!

Please share your logs, the steps you followed, and the image you used. We would need to test a scale scenario as well.

After #1159, we stopped seeing the issues you mentioned with teamsyncd/teammgrd, and we see no such logs in our production environment. We need to make sure we don't introduce any issues again!

@judyjoseph
Contributor

judyjoseph commented Aug 21, 2020

I did a quick check on one of the devices. I see the interfaces getting cleaned up in the kernel and recreated on config reload. I need more info before proceeding!

admin@str-s6000-acs-8:~$ ip link | grep Port
6: PortChannel0002: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc noqueue state UP mode DEFAULT group default qlen 1000
7: PortChannel0005: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc noqueue state UP mode DEFAULT group default qlen 1000
8: PortChannel0008: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc noqueue state UP mode DEFAULT group default qlen 1000
9: PortChannel0011: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc noqueue state UP mode DEFAULT group default qlen 1000
10: PortChannel0014: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc noqueue state UP mode DEFAULT group default qlen 1000
11: PortChannel0017: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc noqueue state UP mode DEFAULT group default qlen 1000
12: PortChannel0020: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc noqueue state UP mode DEFAULT group default qlen 1000
13: PortChannel0023: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc noqueue state UP mode DEFAULT group default qlen 1000
admin@str-s6000-acs-8:~$ ip link | grep Port
admin@str-s6000-acs-8:~$ ip link | grep Port
admin@str-s6000-acs-8:~$ ip link | grep Port
admin@str-s6000-acs-8:~$ ip link | grep Port
admin@str-s6000-acs-8:~$ ip link | grep Port
admin@str-s6000-acs-8:~$ ip link | grep Port
admin@str-s6000-acs-8:~$ ip link | grep Port
admin@str-s6000-acs-8:~$ ip link | grep Port
.......
admin@str-s6000-acs-8:~$ ip link | grep Port
admin@str-s6000-acs-8:~$ ip link | grep Port
admin@str-s6000-acs-8:~$ ip link | grep Port
admin@str-s6000-acs-8:~$ ip link | grep Port
50: PortChannel0002: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
admin@str-s6000-acs-8:~$ ip link | grep Port
50: PortChannel0002: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
51: PortChannel0005: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
52: PortChannel0008: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
.....
admin@str-s6000-acs-8:~$ ip link | grep Port
50: PortChannel0002: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
51: PortChannel0005: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
52: PortChannel0008: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
53: PortChannel0011: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
54: PortChannel0014: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
55: PortChannel0017: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
56: PortChannel0020: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
57: PortChannel0023: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000

@Sabareesh-Kumar-Anandan
Contributor Author

> Hi @Sabareesh-Kumar-Anandan, I still don't follow what you mean by "all portchannel interfaces in kernel are cleaned up correctly", since we already call TeamMgr::cleanTeamProcesses() and sync.cleanTeamSync() in the respective signal handlers. They should be cleaned up, unless you are creating portchannels during the config reload itself!
>
> Please share your logs, the steps you followed, and the image you used. We would need to test a scale scenario as well.
>
> After #1159, we stopped seeing the issues you mentioned with teamsyncd/teammgrd, and we see no such logs in our production environment. We need to make sure we don't introduce any issues again!

@judyjoseph I am using the commit below:

Branch - 202006
Commit - 96fedf1ae9ebcc6604daced6b7dd577eaeb26883

Steps:

  1. config portchannel add PortChannelTest
  2. config reload
  3. ip link show PortChannelTest
    58: PortChannelTest: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 9100 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:50:b6:50:51:86 brd ff:ff:ff:ff:ff:ff

Logs:

Feb 14 10:21:00.670764 sonic INFO systemd[1]: Stopping TEAMD container...
Feb 14 10:21:01.010035 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:21:01.010035 sonic DEBUG teamd#teammgrd: :> select: enter
Feb 14 10:21:01.829681 sonic DEBUG teamd#teamsyncd: :< select: exit
Feb 14 10:21:01.830231 sonic DEBUG teamd#teamsyncd: :> cleanTeamSync: enter
Feb 14 10:21:01.830674 sonic NOTICE teamd#teamsyncd: :- cleanTeamSync: Cleaning up LAG teamd resources ...
Feb 14 10:21:01.831069 sonic INFO teamd#teamsyncd: :- removeLag: Remove LAG PortChannelTest
Feb 14 10:21:01.831646 sonic WARNING teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannelTest' hasn't been added. Can't remove it
Feb 14 10:21:01.831957 sonic DEBUG teamd#teamsyncd: :< cleanTeamSync: exit
Feb 14 10:21:01.832259 sonic DEBUG teamd#teamsyncd: :> select: enter
Feb 14 10:21:02.011017 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:21:02.011387 sonic DEBUG teamd#teammgrd: :> select: enter
Feb 14 10:21:02.832807 sonic DEBUG teamd#teamsyncd: :< select: exit
Feb 14 10:21:02.833653 sonic DEBUG teamd#teamsyncd: :> select: enter
Feb 14 10:21:03.012152 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:21:03.012152 sonic DEBUG teamd#teammgrd: :> select: enter
Feb 14 10:21:03.834819 sonic DEBUG teamd#teamsyncd: :< select: exit
Feb 14 10:21:03.834819 sonic DEBUG teamd#teamsyncd: :> select: enter
Feb 14 10:21:04.013244 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:21:04.013244 sonic DEBUG teamd#teammgrd: :> select: enter
Feb 14 10:21:04.836057 sonic DEBUG teamd#teamsyncd: :< select: exit
Feb 14 10:21:04.836057 sonic DEBUG teamd#teamsyncd: :> select: enter
Feb 14 10:21:05.014351 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:21:05.014351 sonic DEBUG teamd#teammgrd: :> select: enter
Feb 14 10:21:05.836634 sonic DEBUG teamd#teamsyncd: :< select: exit
Feb 14 10:21:05.836634 sonic DEBUG teamd#teamsyncd: :> select: enter
Feb 14 10:21:06.015413 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:21:06.015413 sonic DEBUG teamd#teammgrd: :> select: enter
Feb 14 10:21:06.837675 sonic DEBUG teamd#teamsyncd: :< select: exit
Feb 14 10:21:06.837675 sonic DEBUG teamd#teamsyncd: :> select: enter
Feb 14 10:21:07.016521 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:21:07.016521 sonic DEBUG teamd#teammgrd: :> select: enter
Feb 14 10:21:07.026884 sonic INFO teamd#supervisord 2019-02-14 10:21:00,826 WARN received SIGTERM indicating exit request
Feb 14 10:21:07.026884 sonic INFO teamd#supervisord 2019-02-14 10:21:00,827 INFO waiting for teammgrd, tlm_teamd, teamsyncd, supervisor-proc-exit-listener, rsyslogd to die
Feb 14 10:21:07.026884 sonic INFO teamd#supervisord 2019-02-14 10:21:03,832 INFO waiting for teammgrd, tlm_teamd, teamsyncd, supervisor-proc-exit-listener, rsyslogd to die
Feb 14 10:21:07.026884 sonic INFO teamd#supervisord 2019-02-14 10:21:06,836 INFO waiting for teammgrd, tlm_teamd, teamsyncd, supervisor-proc-exit-listener, rsyslogd to die
Feb 14 10:21:07.838953 sonic DEBUG teamd#teamsyncd: :< select: exit
Feb 14 10:21:07.838953 sonic DEBUG teamd#teamsyncd: :> select: enter
Feb 14 10:21:08.016616 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:21:08.016616 sonic DEBUG teamd#teammgrd: :> select: enter
Feb 14 10:21:08.840184 sonic DEBUG teamd#teamsyncd: :< select: exit
Feb 14 10:21:08.840184 sonic DEBUG teamd#teamsyncd: :> select: enter
Feb 14 10:21:09.017699 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:21:09.017699 sonic DEBUG teamd#teammgrd: :> select: enter
Feb 14 10:21:09.841431 sonic DEBUG teamd#teamsyncd: :< select: exit
Feb 14 10:21:09.841431 sonic DEBUG teamd#teamsyncd: :> select: enter
Feb 14 10:21:10.018783 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:21:10.019377 sonic DEBUG teamd#teammgrd: :> select: enter
Feb 14 10:21:10.832451 sonic INFO dockerd[356]: time="2019-02-14T10:21:10.831099840Z" level=info msg="Container f62fb7cc3e7ff9f9b33a86ec3285f03b70beeb6335ccd826c6a5e6b89bd24618 failed to exit within 10 seconds of signal 15 - using the force"
Feb 14 10:21:10.842843 sonic DEBUG teamd#teamsyncd: :< select: exit
Feb 14 10:21:10.842843 sonic DEBUG teamd#teamsyncd: :> select: enter
Feb 14 10:21:11.087475 sonic INFO containerd[355]: time="2019-02-14T10:21:11.086529680Z" level=info msg="shim reaped" id=f62fb7cc3e7ff9f9b33a86ec3285f03b70beeb6335ccd826c6a5e6b89bd24618
Feb 14 10:21:11.106441 sonic INFO dockerd[356]: time="2019-02-14T10:21:11.106166840Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 14 10:21:11.117101 sonic INFO systemd[1]: var-lib-docker-containers-f62fb7cc3e7ff9f9b33a86ec3285f03b70beeb6335ccd826c6a5e6b89bd24618-mounts-shm.mount: Succeeded.
Feb 14 10:21:11.140720 sonic INFO systemd[1]: var-lib-docker-overlay2-3d3b60be6d83d846c6e633987688796faf43e9a43c737b48ac97c94d5c1333f8-merged.mount: Succeeded.
Feb 14 10:21:11.225420 sonic INFO teamd.sh[1637]: 137
Feb 14 10:21:11.227266 sonic INFO teamd.sh[4676]: teamd
Feb 14 10:21:11.234251 sonic WARNING systemd[1]: teamd.service: Main process exited, code=killed, status=15/TERM
Feb 14 10:21:11.234635 sonic INFO systemd[1]: teamd.service: Succeeded.
Feb 14 10:21:11.234877 sonic INFO systemd[1]: Stopped TEAMD container.

In the above logs, teammgrd doesn't get SIGTERM. After 10 seconds teammgrd is killed by SIGKILL, so the teammgrd cleanup never happens.

Logs with this PR:

Feb 14 10:18:34.675285 sonic INFO systemd[1]: Stopping TEAMD container...
Feb 14 10:18:34.813691 sonic DEBUG teamd#teamsyncd: :< select: exit
Feb 14 10:18:34.813691 sonic DEBUG teamd#teamsyncd: :> select: enter
Feb 14 10:18:34.916287 sonic INFO teamd#supervisord 2019-02-14 10:18:34,830 WARN received SIGTERM indicating exit request
Feb 14 10:18:34.916287 sonic INFO teamd#supervisord 2019-02-14 10:18:34,831 INFO waiting for teammgrd, tlm_teamd, teamsyncd, supervisor-proc-exit-listener, rsyslogd to die
Feb 14 10:18:35.149972 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:18:35.149972 sonic DEBUG teamd#teammgrd: :> select: enter
Feb 14 10:18:35.833639 sonic DEBUG teamd#teamsyncd: :< select: exit
Feb 14 10:18:35.834700 sonic DEBUG teamd#teamsyncd: :> cleanTeamSync: enter
Feb 14 10:18:35.838287 sonic NOTICE teamd#teamsyncd: :- cleanTeamSync: Cleaning up LAG teamd resources ...
Feb 14 10:18:35.839923 sonic INFO teamd#teamsyncd: :- removeLag: Remove LAG PortChannelTest
Feb 14 10:18:35.839923 sonic WARNING teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannelTest' hasn't been added. Can't remove it
Feb 14 10:18:35.839923 sonic DEBUG teamd#teamsyncd: :< cleanTeamSync: exit
Feb 14 10:18:35.839923 sonic NOTICE teamd#teamsyncd: :- main: Exiting
Feb 14 10:18:36.150894 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:18:36.150894 sonic DEBUG teamd#teammgrd: :> select: enter
Feb 14 10:18:36.842303 sonic NOTICE teamd#tlm_teamd: :- main: Exiting
Feb 14 10:18:37.847867 sonic DEBUG teamd#teammgrd: :< select: exit
Feb 14 10:18:37.847867 sonic DEBUG teamd#teammgrd: :> cleanTeamProcesses: enter
Feb 14 10:18:37.847867 sonic NOTICE teamd#teammgrd: :- cleanTeamProcesses: Cleaning up LAGs during shutdown...
Feb 14 10:18:37.847867 sonic INFO teamd#teammgrd: :- cleanTeamProcesses: Sending TERM Signal to (PID: 39) for LaG PortChannelTest
Feb 14 10:18:37.847867 sonic DEBUG teamd#teammgrd: :< cleanTeamProcesses: exit
Feb 14 10:18:37.847867 sonic NOTICE teamd#teammgrd: :- main: Exiting
Feb 14 10:18:37.850779 sonic DEBUG teamd#teammgrd: :< main: exit
Feb 14 10:18:38.178367 sonic INFO containerd[357]: time="2019-02-14T10:18:38.177459800Z" level=info msg="shim reaped" id=f62fb7cc3e7ff9f9b33a86ec3285f03b70beeb6335ccd826c6a5e6b89bd24618
Feb 14 10:18:38.192468 sonic INFO dockerd[363]: time="2019-02-14T10:18:38.191512200Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Feb 14 10:18:38.212984 sonic INFO systemd[1]: var-lib-docker-containers-f62fb7cc3e7ff9f9b33a86ec3285f03b70beeb6335ccd826c6a5e6b89bd24618-mounts-shm.mount: Succeeded.
Feb 14 10:18:38.232351 sonic INFO systemd[1]: var-lib-docker-overlay2-3d3b60be6d83d846c6e633987688796faf43e9a43c737b48ac97c94d5c1333f8-merged.mount: Succeeded.
Feb 14 10:18:38.315219 sonic INFO teamd.sh[4524]: teamd
Feb 14 10:18:38.316247 sonic INFO teamd.sh[1634]: 0
Feb 14 10:18:38.325018 sonic INFO systemd[1]: teamd.service: Succeeded.
Feb 14 10:18:38.327410 sonic INFO systemd[1]: Stopped TEAMD container.
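
The two runs also differ in the number printed by teamd.sh: `137` without the fix and `0` with it. Assuming that number is the container's exit status, it follows the usual shell and Docker convention where a status above 128 means "killed by signal (status - 128)", so 137 corresponds to signal 9 (SIGKILL). A small sketch of that arithmetic, with the hypothetical helper name `describe_exit_status`:

```cpp
#include <string>

// Decodes an exit status using the common shell/Docker convention:
// values above 128 mean the process was killed by signal (status - 128).
std::string describe_exit_status(int status)
{
    if (status > 128)
    {
        return "killed by signal " + std::to_string(status - 128);
    }
    return "exited with code " + std::to_string(status);
}
```

Under that convention, `describe_exit_status(137)` reports signal 9, which matches the "using the force" message from dockerd in the first log, while `describe_exit_status(0)` reports a normal exit.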

@judyjoseph
Contributor

judyjoseph commented Aug 31, 2020

@Sabareesh-Kumar-Anandan, I see this behavior on the master branch too. It looks like the behavior changed with commit sonic-net/sonic-buildimage@7158ccd, in which the start/exit of teammgrd is sequenced after teamsyncd. So please go ahead with this fix after you address Pavel's comments.

The 201811 and 201911 branches (until 2 weeks ago) behave correctly, without the behavior change seen on the master branch: teammgrd also gets the SIGTERM signal and cleans up the PortChannel kernel resources correctly.

judyjoseph
judyjoseph previously approved these changes Sep 3, 2020
Contributor

@pavel-shirshov pavel-shirshov left a comment

LGTM

@judyjoseph judyjoseph merged commit e54948a into sonic-net:master Sep 4, 2020
abdosi pushed a commit that referenced this pull request Sep 4, 2020
* [teamsyncd][teammgrd] Graceful exit after receiving SIGTERM

Signed-off-by: Sabareesh Kumar Anandan <sanandan@marvell.com>
* Update teammgrd.cpp
* Update teamsyncd.cpp

Co-authored-by: pavel-shirshov <pavelsh@microsoft.com>
abdosi added a commit to abdosi/sonic-build-tools that referenced this pull request Sep 8, 2020
Test make sure cleanup happens of Port-channel Kernel devices.
This test case track the fixes done by PR:
sonic-net/sonic-swss#1407
sonic-net/sonic-swss#1159

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
abdosi added a commit to Azure/sonic-build-tools that referenced this pull request Sep 9, 2020
* Added the test case for Port Channel cleanup.
Test make sure cleanup happens of Port-channel Kernel devices.
This test case track the fixes done by PR:
sonic-net/sonic-swss#1407
sonic-net/sonic-swss#1159

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>

* Address Review Comments

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
…nic-net#1407)

This reverts commit b10622e.

**What I did**
revert changes to call sdkdump and replace with old call to mstdump

**How I did it**
reverting a previous commit [Mellanox] Add FW dump with new SAI implementation and remove mst dump sonic-net#1338

**How to verify it**
run techsupport