Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCU or caclmgrd] GCU CACL update might broke caclmgrd #2049

Closed
wen587 opened this issue Jan 29, 2022 · 3 comments · Fixed by sonic-net/sonic-buildimage#10248
Closed

[GCU or caclmgrd] GCU CACL update might broke caclmgrd #2049

wen587 opened this issue Jan 29, 2022 · 3 comments · Fixed by sonic-net/sonic-buildimage#10248

Comments

@wen587
Copy link
Contributor

wen587 commented Jan 29, 2022

Description

Case 1(described below): During continuous cacl test, caclmgrd may be broken and iptabels won't update.
Case 2(during kvm test): There is a possibility that iptables is not immediately updated cuz caclmgrd UPDATE_DELAY_SECS=0.5.
Refer test here: https://github.com/Azure/sonic-mgmt/blob/master/tests/generic_config_updater/test_cacl.py#L467
It will have a chance to fail which is already found in kvm nightly test. It check immediately after apply-patch success, while caclmgrd might upadte iptables in 0.5 seconds.
Example syslog:

vlab-01 INFO caclmgrd[2235692]: ACL config not stable for namespace '': 2 changes detected in the past 0.5 seconds. Skipping update ...
///////@If we check iptables between these two move, the iptables check will fail as it have not been updated.@
vlab-01 INFO caclmgrd[2235692]: ACL config for namespace '' has not changed for 0.5 seconds. Applying updates ...

Two thoughts I am thinking:

  1. GCU only make sure configDB's table is updated. Caclmgrd subscribe configDB tables. Maybe we can make it to wait(sleep) 1 sec inside GCU to make sure caclmgrd does take effect and update iptables. Or we will need some communication between GCU and caclmgrd. I don't know if that is feasible.
  2. Make GCU apply the whole cacl change in one step. Just like config load loads a whole json file at one time. This way, caclmgrd may not have the update delay. In this way, Case 1 may also be fixed?

Steps to reproduce the issue

  1. Provide add and remove cacl config json file
admin@vlab-01:~/cacl/test$ cat test11.json
[
        {
            "op": "add",
            "path": "/ACL_RULE",
            "value": {
                "SSH_ONLY|TEST_DROP": {
                 "IP_TYPE": "IP",
                 "PRIORITY": "9998",
                 "PACKET_ACTION": "DROP",
                 "SRC_IP": "9.9.9.9/32"
                }
            }
        }
    ]
admin@vlab-01:~/cacl/test$ cat test22.json
[
        {
            "op": "remove",
            "path": "/ACL_RULE"
        }
    ]

  1. Continuous run python script
admin@vlab-01:~/cacl/test$ cat bug_repro.py

import subprocess
import time
import datetime
def run_script():
    # add cacl config
    cmd = "sudo config apply-patch -i /FEATURE -i /QUEUE -i /SCHEDULER test11.json"
    result = subprocess.Popen(cmd, shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    _, _= result.communicate()

    cmd = "sudo iptables -S"
    result = subprocess.Popen(cmd, shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    text, err = result.communicate()

    if "9.9.9.9" not in text:
        x = datetime.datetime.now()
        print(x + " : fail to add")
        print(text)
        return "fail"

    # remove cacl config
    cmd = "sudo config apply-patch -i /FEATURE -i /QUEUE -i /SCHEDULER test22.json"
    result = subprocess.Popen(cmd, shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    _, _= result.communicate()

    cmd = "sudo iptables -S"
    result = subprocess.Popen(cmd, shell=True, text=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    text, err = result.communicate()

    if "9.9.9.9" in text:
        x = datetime.datetime.now()
        print(x + " : fail to remove")
        print(text)
        return "fail"


for i in range(1, 100):
    print("iteration"+str(i))
    ret= run_script()
    if ret== "fail":
        break

  1. caclmgrd being broke and iptables won't be updated
admin@vlab-01:~/cacl/test$ python verify.py
iteration1
iteration2
...
iteration25
2022-01-29 03:58:09.802610 : fail to remove   <======
-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-A INPUT -s 127.0.0.1/32 -i lo -j ACCEPT
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -m icmp --icmp-type 8 -j ACCEPT
-A INPUT -p icmp -m icmp --icmp-type 0 -j ACCEPT
-A INPUT -p icmp -m icmp --icmp-type 3 -j ACCEPT
-A INPUT -p icmp -m icmp --icmp-type 11 -j ACCEPT
-A INPUT -p udp -m udp --dport 67:68 -j ACCEPT
-A INPUT -p udp -m udp --dport 546:547 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 179 -j ACCEPT
-A INPUT -p tcp -m tcp --sport 179 -j ACCEPT
-A INPUT -s 9.9.9.9/32 -p tcp -m tcp --dport 22 -j DROP   <=====
-A INPUT -d 10.1.0.32/32 -j DROP
-A INPUT -d 10.250.0.0/32 -j DROP
-A INPUT -d 192.168.0.1/32 -j DROP
-A INPUT -d 10.0.0.56/32 -j DROP
-A INPUT -d 10.0.0.58/32 -j DROP
-A INPUT -d 10.0.0.60/32 -j DROP
-A INPUT -d 10.0.0.62/32 -j DROP
-A INPUT -m ttl --ttl-lt 2 -j ACCEPT
-A INPUT -j DROP

Describe the results you received

Check syslog. And iptables won't be updated due to caclmgrd's failure

admin@vlab-01:~/cacl/test$ sudo cat /var/log/syslog | less
Jan 29 03:58:04.375320 vlab-01 NOTICE config: Patch Applier: The patch was sorted into 5 changes:
Jan 29 03:58:04.376123 vlab-01 NOTICE config: Patch Applier:   * [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PACKET_ACTION"}]
Jan 29 03:58:04.376229 vlab-01 NOTICE config: Patch Applier:   * [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PRIORITY"}]
Jan 29 03:58:04.376325 vlab-01 NOTICE config: Patch Applier:   * [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/SRC_IP"}]
Jan 29 03:58:04.376399 vlab-01 NOTICE config: Patch Applier:   * [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/IP_TYPE"}]
Jan 29 03:58:04.376473 vlab-01 NOTICE config: Patch Applier:   * [{"op": "remove", "path": "/ACL_RULE"}]
Jan 29 03:58:04.376601 vlab-01 NOTICE config: Patch Applier: Applying 5 changes in order:
Jan 29 03:58:04.376695 vlab-01 NOTICE config: Patch Applier:   * [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PACKET_ACTION"}]
Jan 29 03:58:04.705395 vlab-01 INFO caclmgrd[2235692]: ACL change detected for namespace ''
Jan 29 03:58:04.705636 vlab-01 INFO caclmgrd[2235692]: Spawning ACL update thread for namepsace '' ...
Jan 29 03:58:05.056248 vlab-01 NOTICE config: Patch Applier:   * [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PRIORITY"}]
Jan 29 03:58:05.206276 vlab-01 INFO caclmgrd[2235692]: ACL config not stable for namespace '': 1 changes detected in the past 0.5 seconds. Skipping update ...
Jan 29 03:58:05.707468 vlab-01 INFO caclmgrd[2235692]: ACL config not stable for namespace '': 2 changes detected in the past 0.5 seconds. Skipping update ...
Jan 29 03:58:05.847169 vlab-01 NOTICE config: Patch Applier:   * [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/SRC_IP"}]
Jan 29 03:58:06.207341 vlab-01 INFO caclmgrd[2235692]: ACL config for namespace '' has not changed for 0.5 seconds. Applying updates ...
Jan 29 03:58:06.217784 vlab-01 INFO caclmgrd[2235692]: Translating ACL rules for control plane ACL 'NTP_ACL' (service: 'NTP')
Jan 29 03:58:06.217864 vlab-01 WARNING caclmgrd[2235692]: Unable to determine if ACL table 'NTP_ACL' contains IPv4 or IPv6 rules. Skipping table...
Jan 29 03:58:06.217925 vlab-01 INFO caclmgrd[2235692]: Translating ACL rules for control plane ACL 'SNMP_ACL' (service: 'SNMP')
Jan 29 03:58:06.218007 vlab-01 WARNING caclmgrd[2235692]: Unable to determine if ACL table 'SNMP_ACL' contains IPv4 or IPv6 rules. Skipping table...
Jan 29 03:58:06.218104 vlab-01 INFO caclmgrd[2235692]: Translating ACL rules for control plane ACL 'SSH_ONLY' (service: 'SSH')
Jan 29 03:58:06.218187 vlab-01 ERR caclmgrd[2235692]: rule_props for rule_id TEST_DROP does not have key 'PRIORITY'!
Jan 29 03:58:06.218295 vlab-01 WARNING caclmgrd[2235692]: Unable to determine if ACL table 'SSH_ONLY' contains IPv4 or IPv6 rules. Skipping table...
Jan 29 03:58:06.225719 vlab-01 INFO caclmgrd[2235692]: Traceback (most recent call last):
Jan 29 03:58:06.225812 vlab-01 INFO caclmgrd[2235692]:   File "/usr/local/bin/caclmgrd", line 866, in <module>
Jan 29 03:58:06.226357 vlab-01 INFO caclmgrd[2235692]:     main() <=========================
Jan 29 03:58:06.226406 vlab-01 INFO caclmgrd[2235692]:   File "/usr/local/bin/caclmgrd", line 862, in main
Jan 29 03:58:06.226692 vlab-01 INFO caclmgrd[2235692]:     caclmgr.run() <=========================
Jan 29 03:58:06.226734 vlab-01 INFO caclmgrd[2235692]:   File "/usr/local/bin/caclmgrd", line 832, in run
Jan 29 03:58:06.226985 vlab-01 INFO caclmgrd[2235692]:     if self.config_db_map[namespace].get_table(self.ACL_TABLE)[acl_table]["type"] == self.ACL_TABLE_TYPE_CTRLPLANE:  <=========================
Jan 29 03:58:06.227040 vlab-01 INFO caclmgrd[2235692]: KeyError: 'SSH_ONLY'
Jan 29 03:58:06.613009 vlab-01 NOTICE config: Patch Applier:   * [{"op": "remove", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/IP_TYPE"}]
Jan 29 03:58:07.304973 vlab-01 NOTICE config: Patch Applier:   * [{"op": "remove", "path": "/ACL_RULE"}]
Jan 29 03:58:08.016759 vlab-01 NOTICE config: Patch Applier: Verifying patch updates are reflected on ConfigDB.
Jan 29 03:58:08.909793 vlab-01 NOTICE config: Patch Applier: Patch application completed.

Describe the results you expected

Should not broke caclmgrd.

Additional information you deem important (e.g. issue happens only occasionally)

Output of show version

(paste your output here)
@ghooo
Copy link
Contributor

ghooo commented Mar 8, 2022

ran the script in the description, found the error happened on the first patch step:

Mar  8 19:05:55.610196 vlab-01 NOTICE config: Patch Applier: The patch was sorted into 4 changes:
Mar  8 19:05:55.610673 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE", "value": {"SSH_ONLY|TEST_DROP": {"IP_TYPE": "IP"}}}]
Mar  8 19:05:55.610760 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PRIORITY", "value": "9998"}]
Mar  8 19:05:55.611657 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PACKET_ACTION", "value": "DROP"}]
Mar  8 19:05:55.611757 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/SRC_IP", "value": "9.9.9.9/32"}]
Mar  8 19:05:55.611865 vlab-01 NOTICE config: Patch Applier: Applying 4 changes in order:
Mar  8 19:05:55.611949 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE", "value": {"SSH_ONLY|TEST_DROP": {"IP_TYPE": "IP"}}}]
Mar  8 19:05:55.975080 vlab-01 INFO caclmgrd[130303]: **** ACL TABLE has 0 entries. 
Mar  8 19:05:55.977561 vlab-01 INFO caclmgrd[130303]: **** SSH_ONLY not found. Try again.
Mar  8 19:05:55.984697 vlab-01 INFO caclmgrd[130303]: Traceback (most recent call last):
Mar  8 19:05:55.988131 vlab-01 INFO caclmgrd[130303]:   File "/usr/local/bin/caclmgrd", line 836, in run
Mar  8 19:05:55.988991 vlab-01 INFO caclmgrd[130303]:     if acl_table_content[acl_table]["type"] == self.ACL_TABLE_TYPE_CTRLPLANE:
Mar  8 19:05:55.989041 vlab-01 INFO caclmgrd[130303]: KeyError: 'SSH_ONLY'
Mar  8 19:05:55.989085 vlab-01 INFO caclmgrd[130303]: During handling of the above exception, another exception occurred:
Mar  8 19:05:55.989109 vlab-01 INFO caclmgrd[130303]: Traceback (most recent call last):
Mar  8 19:05:55.989149 vlab-01 INFO caclmgrd[130303]:   File "/usr/local/bin/caclmgrd", line 876, in <module>
Mar  8 19:05:55.992773 vlab-01 INFO caclmgrd[130303]:     main()
Mar  8 19:05:55.993474 vlab-01 INFO caclmgrd[130303]:   File "/usr/local/bin/caclmgrd", line 872, in main
Mar  8 19:05:55.994981 vlab-01 INFO caclmgrd[130303]:     caclmgr.run()
Mar  8 19:05:55.995057 vlab-01 INFO caclmgrd[130303]:   File "/usr/local/bin/caclmgrd", line 840, in run
Mar  8 19:05:55.995105 vlab-01 INFO caclmgrd[130303]:     if self.config_db_map[namespace].get_table(self.ACL_TABLE)[acl_table]["type"] == self.ACL_TABLE_TYPE_CTRLPLANE:
Mar  8 19:05:55.995155 vlab-01 INFO caclmgrd[130303]: KeyError: 'SSH_ONLY'
Mar  8 19:05:56.318380 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PRIORITY", "value": "9998"}]
Mar  8 19:05:57.136246 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/PACKET_ACTION", "value": "DROP"}]
Mar  8 19:05:57.790636 vlab-01 NOTICE config: Patch Applier:   * [{"op": "add", "path": "/ACL_RULE/SSH_ONLY|TEST_DROP/SRC_IP", "value": "9.9.9.9/32"}]

@ghooo
Copy link
Contributor

ghooo commented Mar 8, 2022

From the logs

Jan 29 03:58:06.218187 vlab-01 ERR caclmgrd[2235692]: rule_props for rule_id TEST_DROP does not have key 'PRIORITY'!

It seems PRIORITY is required.

@ghooo
Copy link
Contributor

ghooo commented Mar 9, 2022

from caclmgr:

  • PRIORITY is a required field code

  • PACKET_ACTION is a required field code

  • V4 ACL cannot have v6 rules, and vice versa code

  • PRIORITY is unique for the same ACL

    In the example below we add 3 ACL rules 2 of them have the same priority 9997, we see only 1 of them is not applied. This is verified by sudo iptables -S

    admin@vlab-01:~$ cat add-acl-rule2.json 
    [
        {
            "op": "add",
            "path": "/ACL_RULE",
            "value": {
                "SSH_ONLY|TEST_DROP1": {
                    "L4_DST_PORT": "22",
                    "IP_PROTOCOL": "17",
                    "IP_TYPE": "ANY",
                    "PACKET_ACTION": "DROP",
                    "PRIORITY": "9998",
                    "SRC_IP": "9.9.9.9/32"
                },
                "SSH_ONLY|TEST_DROP2": {
                    "L4_DST_PORT": "22",
                    "IP_PROTOCOL": "17",
                    "IP_TYPE": "ANY",
                    "PACKET_ACTION": "DROP",
                    "PRIORITY": "9997",
                    "SRC_IP": "10.10.10.10/32"
                },
                "SSH_ONLY|TEST_DROP3": {
                    "L4_DST_PORT": "22",
                    "IP_PROTOCOL": "17",
                    "IP_TYPE": "ANY",
                    "PACKET_ACTION": "DROP",
                    "PRIORITY": "9997",
                    "SRC_IP": "11.11.11.11/32"
                }
            }
        }
    ]
    admin@vlab-01:~$ 
    admin@vlab-01:~$ sudo iptables -S
    -P INPUT ACCEPT
    -P FORWARD ACCEPT
    -P OUTPUT ACCEPT
    -A INPUT -s 127.0.0.1/32 -i lo -j ACCEPT
    -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
    -A INPUT -p icmp -m icmp --icmp-type 8 -j ACCEPT
    -A INPUT -p icmp -m icmp --icmp-type 0 -j ACCEPT
    -A INPUT -p icmp -m icmp --icmp-type 3 -j ACCEPT
    -A INPUT -p icmp -m icmp --icmp-type 11 -j ACCEPT
    -A INPUT -p udp -m udp --dport 67:68 -j ACCEPT
    -A INPUT -p udp -m udp --dport 546:547 -j ACCEPT
    -A INPUT -p tcp -m tcp --dport 179 -j ACCEPT
    -A INPUT -p tcp -m tcp --sport 179 -j ACCEPT
    -A INPUT -d 10.1.0.32/32 -j DROP
    -A INPUT -d 10.250.0.0/32 -j DROP
    -A INPUT -d 192.168.0.1/32 -j DROP
    -A INPUT -d 10.0.0.56/32 -j DROP
    -A INPUT -d 10.0.0.58/32 -j DROP
    -A INPUT -d 10.0.0.60/32 -j DROP
    -A INPUT -d 10.0.0.62/32 -j DROP
    -A INPUT -m ttl --ttl-lt 2 -j ACCEPT
    admin@vlab-01:~$ sudo config apply-patch add-acl-rule2.json -i ''
    Patch Applier: Patch application starting.
    ...
    Patch Applier: Verifying patch updates are reflected on ConfigDB.
    Patch Applier: Patch application completed.
    Patch applied successfully.
    admin@vlab-01:~$ sudo iptables -S
    -P INPUT ACCEPT
    -P FORWARD ACCEPT
    -P OUTPUT ACCEPT
    -A INPUT -s 127.0.0.1/32 -i lo -j ACCEPT
    -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
    -A INPUT -p icmp -m icmp --icmp-type 8 -j ACCEPT
    -A INPUT -p icmp -m icmp --icmp-type 0 -j ACCEPT
    -A INPUT -p icmp -m icmp --icmp-type 3 -j ACCEPT
    -A INPUT -p icmp -m icmp --icmp-type 11 -j ACCEPT
    -A INPUT -p udp -m udp --dport 67:68 -j ACCEPT
    -A INPUT -p udp -m udp --dport 546:547 -j ACCEPT
    -A INPUT -p tcp -m tcp --dport 179 -j ACCEPT
    -A INPUT -p tcp -m tcp --sport 179 -j ACCEPT
    -A INPUT -s 9.9.9.9/32 -p tcp -m tcp --dport 22 -j DROP     <====
    -A INPUT -s 11.11.11.11/32 -p tcp -m tcp --dport 22 -j DROP  <=====
    -A INPUT -d 10.1.0.32/32 -j DROP
    -A INPUT -d 10.250.0.0/32 -j DROP
    -A INPUT -d 192.168.0.1/32 -j DROP
    -A INPUT -d 10.0.0.56/32 -j DROP
    -A INPUT -d 10.0.0.58/32 -j DROP
    -A INPUT -d 10.0.0.60/32 -j DROP
    -A INPUT -d 10.0.0.62/32 -j DROP
    -A INPUT -m ttl --ttl-lt 2 -j ACCEPT
    -A INPUT -j DROP
    admin@vlab-01:~$ 

Why caclmgr crashes?

It crashes because during the update was divided into smaller steps, the first step was adding the IP_TYPE without the PRIORITY nor the PACKET_ACTION.

Why caclmgr only crashes after a lot of tries, and does not crash immediately?

Because caclmgr waits for 0.5s after each update for other updates. During this time we usually manage to also push PRIORITY and PACKET_ACTION.
We have to try multiple times to get the failure, because we need a case where the time between pushing IP_TYPE and the other properties is > 0.5s.

ghooo added a commit to sonic-net/sonic-buildimage that referenced this issue Mar 23, 2022
…ANE ACLs (#10248)

#### Why I did it
Fixes sonic-net/sonic-utilities#2049

from caclmgr:
- PRIORITY is a required field [code](https://github.com/Azure/sonic-buildimage/blob/3fa18d18d4c06fe38164ebfefdc3187820fc7496/src/sonic-host-services/scripts/caclmgrd#L548)
- PACKET_ACTION is a required field [code](https://github.com/Azure/sonic-buildimage/blob/3fa18d18d4c06fe38164ebfefdc3187820fc7496/src/sonic-host-services/scripts/caclmgrd#L581)

I think PRIORITY is a required field for ACLs not only CTRLPLANE ACLs

#### How I did it
Check code.

#### How to verify it
Unit-test

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/SONiC/wiki/Configuration.
-->

#### A picture of a cute animal (not mandatory but encouraged)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants