Fix deadlock when accessing dirtyRules in fqdn controller #5566

Merged: 1 commit into antrea-io:main on Oct 16, 2023

Conversation

@Dyanngg (Contributor) commented Oct 10, 2023

Fixes #5565

As described in the issue above, a deadlock could occur when:

  1. Rule sync fails due to a realization error in the dataplane, and the controller
     retries it later even though there are no IP changes for the FQDN.
  2. A single Antrea-native policy rule refers to multiple FQDNs, and the fqdn
     controller receives updated records for these FQDNs at around the same time.
     One of the FQDNs has address updates, so the rule needs to be resynced and is
     thus marked dirty. Another FQDN does not have address updates, but the
     controller still reads the dirtyRules to make sure that, if the rule previously
     failed to sync, it gets re-queued even when there are no address updates.

This PR addresses the deadlock by explicitly putting the lock around the
get-dirty-rules operation itself, as sketched below.
Additional UTs for these potential rule sync scenarios are also added.
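To make the fix pattern concrete, here is a minimal Go sketch. The names (controller, getDirtyRulesLocked, onFQDNUpdate) are hypothetical, not the actual Antrea types or functions; the point is only that the mutex is acquired explicitly once, the dirty rules are read while it is held, and it is released before any further work, so no helper ever re-acquires the same non-reentrant mutex:

package main

import (
	"fmt"
	"sync"
)

type controller struct {
	mutex      sync.Mutex
	dirtyRules map[string]struct{}
}

// getDirtyRulesLocked assumes the caller already holds c.mutex.
func (c *controller) getDirtyRulesLocked() []string {
	rules := make([]string, 0, len(c.dirtyRules))
	for r := range c.dirtyRules {
		rules = append(rules, r)
	}
	return rules
}

// onFQDNUpdate locks once, marks the rule dirty if its addresses changed,
// reads the dirty rules while still holding the lock, and releases the lock
// before requeueing, so the same goroutine never re-acquires the mutex.
func (c *controller) onFQDNUpdate(addressChanged bool, rule string) {
	c.mutex.Lock()
	if addressChanged {
		c.dirtyRules[rule] = struct{}{}
	}
	dirty := c.getDirtyRulesLocked()
	c.mutex.Unlock()

	for _, r := range dirty {
		fmt.Println("requeue rule:", r)
	}
}

func main() {
	c := &controller{dirtyRules: map[string]struct{}{}}
	c.onFQDNUpdate(true, "rule-1")  // address change: rule marked dirty and requeued
	c.onFQDNUpdate(false, "rule-1") // no address change: still requeued because it is already dirty
}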

@Dyanngg added the kind/bug (Categorizes issue or PR as related to a bug) and action/backport (Indicates a PR that requires backports) labels on Oct 10, 2023
@Dyanngg added this to the Antrea v1.14 release milestone on Oct 10, 2023
@tnqn (Member) commented Oct 11, 2023

@Dyanngg have you managed to reproduce the issue? You mentioned "a deadlock could occur when rule sync fails due to a realization error in the dataplane", but I didn't see such an error when investigating the issue and thought it could be triggered even when everything works fine.

@Dyanngg (Contributor, Author) commented Oct 11, 2023

@Dyanngg have you managed to reproduce the issue? You mentioned "a deadlock could occur when rule sync fails due to a realization error in the dataplane", but I didn't see such an error when investigating the issue and thought it could be triggered even when everything works fine.

Thanks for the reminder. I reproduced the deadlock in UT when an OVS error is simulated. Now that you mention it, I realize the same issue could also happen if a single rule has multiple FQDNs: during the handling of the proactive record updates for these FQDNs, one FQDN response could have marked the rule dirty (since the agent has not finished the rule sync yet), while the other FQDN response tries to add a subscriber for the same rule, causing a deadlock. I will add a new UT testcase validating this theory, and verify that the deadlock occurs before the fix and does not occur after the fix.
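For context, a tiny hypothetical Go example (again, not the Antrea code) of why this pattern hangs: Go's sync.Mutex is not reentrant, so a goroutine that already holds the lock while handling one FQDN's update and then calls a getter that locks the same mutex for the other FQDN blocks forever:

package main

import "sync"

type tracker struct {
	mutex      sync.Mutex
	dirtyRules map[string]struct{}
}

// getDirtyRules locks internally, which is only safe if the caller
// does not already hold t.mutex.
func (t *tracker) getDirtyRules() map[string]struct{} {
	t.mutex.Lock()
	defer t.mutex.Unlock()
	return t.dirtyRules
}

func main() {
	t := &tracker{dirtyRules: map[string]struct{}{"rule-1": {}}}
	t.mutex.Lock()        // e.g. while handling the first FQDN's update for the rule
	_ = t.getDirtyRules() // handling the second FQDN re-locks the mutex: hangs here forever
	t.mutex.Unlock()      // never reached; the Go runtime reports "all goroutines are asleep - deadlock!"
}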

@Dyanngg (Contributor, Author) commented Oct 11, 2023

@tnqn Please check the updated PR description and the latest added testcase. While the specific scenario is tricky to reproduce in a real setup (it requires two concurrent DNS record updates, one with an address change and one without), the last UT testcase reproduces the deadlock very reliably: without the change, it hangs every single time after the syncDirtyRule() calls due to the deadlock.
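As a sketch of how such a test can detect the hang deterministically (a hypothetical helper, not the actual code in fqdn_test.go): run the call under test in a goroutine and fail if it does not return within a timeout.

package fqdn_test

import (
	"testing"
	"time"
)

// requireReturnsWithin fails the test if fn does not return within timeout,
// which is how a re-entrant-lock deadlock typically shows up in a unit test.
func requireReturnsWithin(t *testing.T, timeout time.Duration, fn func()) {
	t.Helper()
	done := make(chan struct{})
	go func() {
		defer close(done)
		fn() // e.g. the second FQDN update for an already-dirty rule
	}()
	select {
	case <-done:
		// fn returned; no deadlock
	case <-time.After(timeout):
		t.Fatal("call did not return; likely deadlocked on the controller mutex")
	}
}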

@tnqn added the action/release-note (Indicates a PR that should be included in release notes) label on Oct 11, 2023
@tnqn (Member) commented Oct 11, 2023

@Dyanngg the code and unit test look good to me. However, could we still try to reproduce the issue in a cluster and see whether this completely fixes it, in case there is something else preventing it from working? We have met several issues related to NetworkPolicy (especially when FQDN rules are used) in the last few months, so we need to be more cautious before delivering one more patch release. I think in theory it can be reproduced by generating two concurrent DNS resolutions towards the same steady FQDN: the first resolution's addressUpdate will be true while the second one's will be false.

And I just got an update from the reporter: restarting a workload pod alone could fix the issue, though I haven't figured out why.

@luolanzone (Contributor) left a review comment:

a nit

Review comment on pkg/agent/controller/networkpolicy/fqdn_test.go (outdated, resolved)
@tnqn (Member) commented Oct 11, 2023

And I just got an update from the reporter: restarting a workload pod alone could fix the issue, though I haven't figured out why.

I figured this out: it didn't really recover. The reason the new Pod can resolve domains should be that the realization of the FQDN rule for the new Pod was stuck in r.fqdnController.addFQDNRule due to the same deadlock.

@Dyanngg (Contributor, Author) commented Oct 11, 2023

Trying to reproduce the issue but no luck so far. Domains that I remember having dynamic IP ranges seem to return pretty steady resolved DNS addresses.

Test YAMLs used:

apiVersion: v1
kind: ReplicationController
metadata:
  name: client
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: client
    spec:
      containers:
      - name: client-worker
        image: praqma/network-multitool
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 443
        command:
        - /bin/sh
        - -c
        - |
          while true; do
          dig bay.camera;
          done
---
apiVersion: v1
kind: ReplicationController
metadata:
  name: client2
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: client
    spec:
      containers:
      - name: client-worker2
        image: praqma/network-multitool
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 443
        command:
        - /bin/sh
        - -c
        - |
          while true; do
          dig google.com;
          done
---
apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: acnp-multiple-fqdn-per-rule
spec:
  priority: 1
  appliedTo:
  - podSelector:
      matchLabels:
        app: client
  egress:
  - action: Allow
    to:
    - fqdn: "wayfair.com"
    - fqdn: "google.com"
    - fqdn: "bay.camera"
    - fqdn: "amazon.com"
    - fqdn: "medium.com"

@Dyanngg closed this on Oct 11, 2023
@Dyanngg reopened this on Oct 11, 2023
Signed-off-by: Dyanngg <dingyang@vmware.com>
@Dyanngg (Contributor, Author) commented Oct 11, 2023

/test-all

@tnqn (Member) commented Oct 12, 2023

Trying to reproduce the issue but no luck so far. Domains that I remember having dynamic IP ranges seem to return pretty steady resolved DNS addresses.

The test commands you used trigger DNS lookups sequentially, so I think they can't lead to the deadlock.

While trying to construct a test case to reproduce the issue, I found that even concurrent DNS lookups can't trigger it, because packetin events are processed sequentially. And given that there is no error about NetworkPolicy realization, I think the issue is probably more complicated and this patch will likely not fix it completely. I will share my hypothesis in #5565.

@tnqn (Member) left a review:

LGTM

@tnqn merged commit 62a440a into antrea-io:main on Oct 16, 2023
50 of 57 checks passed
@tnqn (Member) commented Oct 16, 2023

@Dyanngg please backport it to 1.11-1.13.

@tnqn (Member) commented Oct 16, 2023

This commit introduces a data race. #5583 will fix it. The cherry-picking PRs need to include the latter commit as well.

Labels: action/backport, action/release-note, kind/bug

Successfully merging this pull request may close these issues:

A dead lock in DNS response handler

4 participants