Fix missing call of removing groupDb cache when deleting OVS group #4592

Merged
merged 1 commit into from
Mar 10, 2023

Conversation

ceclinux
Copy link
Contributor

@ceclinux ceclinux commented Feb 1, 2023

Fix missing call of removing groupDb cache when deleting OVS group

The old group is reused unexpectedly if the groupDb cache managed
by OFBridge is not cleared for that group. As a result, a new group
created with a different group type ends up with the old group's type.

Fixes #4575

@ceclinux
Copy link
Contributor Author

ceclinux commented Feb 1, 2023

/test-multicast-e2e

@codecov
Copy link

codecov bot commented Feb 1, 2023

Codecov Report

Merging #4592 (a9a1292) into main (482fc93) will increase coverage by 0.12%.
The diff coverage is 71.42%.


@@            Coverage Diff             @@
##             main    #4592      +/-   ##
==========================================
+ Coverage   69.75%   69.88%   +0.12%     
==========================================
  Files         400      403       +3     
  Lines       59450    60115     +665     
==========================================
+ Hits        41472    42010     +538     
- Misses      15178    15283     +105     
- Partials     2800     2822      +22     
Flag | Coverage Δ | *Carryforward flag
e2e-tests | 38.34% <ø> (+<0.01%) ⬆️ | Carriedforward from 482fc93
integration-tests | 34.40% <7.14%> (-0.15%) ⬇️ |
kind-e2e-tests | 47.64% <7.14%> (+1.35%) ⬆️ |
unit-tests | 59.96% <71.42%> (+0.21%) ⬆️ |

*This pull request uses carry forward flags. Click here to find out more.

Impacted Files | Coverage Δ
pkg/agent/openflow/client.go | 88.74% <50.00%> (-0.05%) ⬇️
pkg/ovs/openflow/ofctrl_bridge.go | 80.07% <100.00%> (+2.52%) ⬆️
pkg/agent/proxy/topology.go | 72.72% <0.00%> (-9.10%) ⬇️
pkg/agent/multicluster/mc_route_controller.go | 47.00% <0.00%> (-8.55%) ⬇️
...nt/apiserver/handlers/serviceexternalip/handler.go | 29.62% <0.00%> (-7.41%) ⬇️
...agent/flowexporter/connections/deny_connections.go | 84.94% <0.00%> (-5.38%) ⬇️
pkg/agent/route/route_linux.go | 66.19% <0.00%> (-5.30%) ⬇️
pkg/agent/controller/egress/egress_controller.go | 82.37% <0.00%> (-2.86%) ⬇️
...rs/multicluster/member/serviceexport_controller.go | 76.33% <0.00%> (-2.80%) ⬇️
pkg/ovs/ovsctl/ofctl.go | 52.05% <0.00%> (-2.06%) ⬇️
... and 32 more

@ceclinux
Copy link
Contributor Author

ceclinux commented Feb 1, 2023

/test-multicast-e2e

@ceclinux ceclinux changed the title fix missing call of removing groupDb cache for OVS group when deletin… Fix missing call of removing groupDb cache when deleting OVS group Feb 2, 2023
@@ -229,10 +229,6 @@ func (b *OFBridge) DeleteGroup(id GroupIDType) error {
if ofctrlGroup == nil {
return nil
}
g := &ofGroup{bridge: b, ofctrl: ofctrlGroup}
Copy link
Contributor

Why remove lines 232 to 235 instead of calling OFBridge.DeleteGroup(gID) directly in the caller?

With this change, the group might still exist in OVS after OFBridge.DeleteGroup is called, unless a separate OF message is sent to remove it.

Copy link
Contributor Author

Updated.

OFBridge.DeleteGroup is not called here directly because DeleteOFEntries calls metrics.OVSFlowOpsErrorCount.WithLabelValues and metrics.OVSFlowOpsCount.WithLabelValues for Prometheus integration, while DeleteGroup doesn't (and it shouldn't). I have changed DeleteGroup to let callers decide whether to do the flow deletion, and both DeleteOFEntries and DeleteGroup are now called here.
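
A minimal sketch of the two-step deletion described in this reply, for readers following along. Everything in it is a hypothetical stand-in (`opsCounter`, `fakeBridge`, `deleteOFEntries`, `uninstallGroup`); it does not use the real Antrea or Prometheus types, and only illustrates the split between a metrics-aware OVS deletion and a cache-only deletion.

```go
package main

import (
	"errors"
	"fmt"
)

// opsCounter is a hypothetical stand-in for the Prometheus counters named above.
type opsCounter struct {
	ok, failed int
}

// fakeBridge is a hypothetical stand-in for the bridge:
// ovs models groups installed on the switch, cache models groupDb.
type fakeBridge struct {
	ovs   map[uint32]bool
	cache map[uint32]bool
}

// deleteOFEntries models the metrics-aware path: it removes the group
// from the switch and records the outcome in the counter.
func deleteOFEntries(b *fakeBridge, m *opsCounter, id uint32) error {
	if !b.ovs[id] {
		m.failed++
		return errors.New("group not installed")
	}
	delete(b.ovs, id)
	m.ok++
	return nil
}

// deleteGroup models the cache-only path: no switch message, no metrics.
func (b *fakeBridge) deleteGroup(id uint32) {
	delete(b.cache, id)
}

// uninstallGroup runs both steps in the order described in the reply.
func uninstallGroup(b *fakeBridge, m *opsCounter, id uint32) error {
	if err := deleteOFEntries(b, m, id); err != nil {
		return fmt.Errorf("error when deleting Openflow entries for group %d: %w", id, err)
	}
	b.deleteGroup(id)
	return nil
}

func main() {
	b := &fakeBridge{ovs: map[uint32]bool{20: true}, cache: map[uint32]bool{20: true}}
	m := &opsCounter{}
	err := uninstallGroup(b, m, 20)
	fmt.Println(err, m.ok, len(b.ovs), len(b.cache))
}
```

In this sketch the counter is touched only by the switch-facing step, which is the reason given above for not folding the OVS deletion into DeleteGroup.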

@ceclinux
Copy link
Contributor Author

ceclinux commented Feb 2, 2023

/test-multicast-e2e

Copy link
Contributor

@wenyingd wenyingd left a comment

The code change looks good to me, but it looks like the UT coverage does not satisfy the requirement.

@ceclinux ceclinux force-pushed the fix-delete-multicast-group branch 7 times, most recently from 778bfc4 to bb275da Compare February 15, 2023 06:38
@ceclinux
Copy link
Contributor Author

ceclinux commented Feb 16, 2023

Updated. UT coverage requirement satisfied @wenyingd

-return fmt.Errorf("error when deleting Service Endpoints Group %d: %w", groupID, err)
+return fmt.Errorf("error when deleting Openflow entries for Service Endpoints Group %d: %w", groupID, err)
 }
+if err := c.featureService.bridge.DeleteGroup(groupID, false); err != nil {
Copy link
Contributor

Suggested change
-if err := c.featureService.bridge.DeleteGroup(groupID, false); err != nil {
+if err := c.bridge.DeleteGroup(groupID, false); err != nil {

Copy link
Contributor Author

updated. Thanks

-return fmt.Errorf("error when deleting Multicast receiver Group %d: %w", groupID, err)
+return fmt.Errorf("error when deleting Openflow entries for Multicast receiver Group %d: %w", groupID, err)
 }
+if err := c.featureMulticast.bridge.DeleteGroup(groupID, false); err != nil {
Copy link
Contributor

Suggested change
-if err := c.featureMulticast.bridge.DeleteGroup(groupID, false); err != nil {
+if err := c.bridge.DeleteGroup(groupID, false); err != nil {

Copy link
Contributor Author

updated. Thanks

wenyingd
wenyingd previously approved these changes Feb 28, 2023
Copy link
Contributor

@wenyingd wenyingd left a comment

LGTM

@wenyingd
Copy link
Contributor

/test-all
/test-windows-all

pkg/agent/openflow/client_test.go Outdated Show resolved Hide resolved
@@ -224,14 +224,16 @@ func (b *OFBridge) createGroupWithType(id GroupIDType, groupType ofctrl.GroupTyp
return g
}

-func (b *OFBridge) DeleteGroup(id GroupIDType) error {
+func (b *OFBridge) DeleteGroup(id GroupIDType, deleteFlows bool) error {
Copy link
Contributor

Is it used anywhere that deleteFlows is true? Another question: why name it deleteFlows? I didn't see that this is related to any flows.

Copy link
Contributor Author

It is related to flows. When deleteFlows is true, g.Delete() is called, which calls AddOFEntriesInBundle.

deleteFlows = true is not used anywhere else. I am keeping it as a parameter because I think it is reasonable to keep flow transactions at this level. Users who do not want the flow transaction can then disable it explicitly rather than implicitly.

Copy link
Contributor

How about 'deleteGroupOnOVS'? This parameter is used to decide whether to delete the group on OVS.

Copy link
Contributor Author

updated

g := &ofGroup{bridge: b, ofctrl: ofctrlGroup}
if err := g.Delete(); err != nil {
return fmt.Errorf("failed to delete the group: %w", err)
if deleteFlows {
Copy link
Contributor

Could you add more code comments or expand the PR description? I don't see why bypassing this part fixes the issue.

Copy link
Contributor Author

updated

@hongliangl
Copy link
Contributor

I synced with @wenyingd offline about issue #4575. The root cause is:

  • A group is created for AntreaProxy with CreateGroup, which returns a group object and also caches it in groupDB in ofnet. Note that the group type is select.
  • The group is synced to OVS with a bundle.
  • The group is deleted in OVS with a bundle, but the method DeleteGroup is not called, so the cached group object in groupDB is not deleted.
  • If the group ID is then reallocated to Multicast, the cached group object is reused, and its group type is still select, not all.
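
The sequence above can be reproduced with a small model. This is only a sketch under stated assumptions: `groupCache`, `create`, `remove`, and `reallocate` are hypothetical names, not the real ofnet or OFBridge API, and the "return the cached object if the ID already exists" behavior is taken from the description in this comment.

```go
package main

import "fmt"

// GroupType mirrors the two OpenFlow group types mentioned in the discussion.
type GroupType string

const (
	GroupSelect GroupType = "select"
	GroupAll    GroupType = "all"
)

// group is a hypothetical stand-in for the cached group object.
type group struct {
	id  uint32
	typ GroupType
}

// groupCache is a hypothetical stand-in for the groupDb cache.
type groupCache struct {
	groups map[uint32]*group
}

func newGroupCache() *groupCache {
	return &groupCache{groups: make(map[uint32]*group)}
}

// create returns the cached group when the ID is already present and
// ignores the requested type, which models the unexpected reuse.
func (c *groupCache) create(id uint32, typ GroupType) *group {
	if g, ok := c.groups[id]; ok {
		return g
	}
	g := &group{id: id, typ: typ}
	c.groups[id] = g
	return g
}

// remove clears the cache entry; skipping it leaves a stale group behind.
func (c *groupCache) remove(id uint32) {
	delete(c.groups, id)
}

// reallocate creates a select group, optionally clears the cache, then
// requests the same ID with type all and reports the resulting type.
func reallocate(clearCache bool) GroupType {
	c := newGroupCache()
	c.create(20, GroupSelect) // AntreaProxy-style group
	if clearCache {
		c.remove(20)
	}
	return c.create(20, GroupAll).typ // Multicast-style group
}

func main() {
	fmt.Println("without cache cleanup:", reallocate(false)) // select
	fmt.Println("with cache cleanup:", reallocate(true))     // all
}
```

Without the cleanup step the second request comes back as select, which matches the symptom reported in the issue.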

I'm OK with this fix, but could you add more comments explaining why the parameter deleteFlows was added to DeleteGroup?

@ceclinux
Copy link
Contributor Author

ceclinux commented Mar 1, 2023

Thanks for the review. Your understanding is correct. The fix looks contrived because of the Prometheus metrics counting. Please check #4592 (comment) and the updated code comment for details, and let me know whether it seems reasonable to you.

@wenyingd wenyingd added this to the Antrea v1.11 release milestone Mar 2, 2023
hongliangl
hongliangl previously approved these changes Mar 7, 2023
Copy link
Member

@tnqn tnqn left a comment

This seems to be a serious issue when Multicast is enabled. Please backport it after it is merged.

Comment on lines 227 to 230
// DeleteGroup deletes a specified group in groupDb. If deleteGroupOnOVS sets to true, the group deletion transaction
// will be synced to OVS. Note that to record OVS operation in Prometheus, openFlowClient.DeleteOFEntries([]binding.OFEntry{groupCache})
// should be called in advance, and deleteGroupOnOVS is unnecessary.
func (b *OFBridge) DeleteGroup(id GroupIDType, deleteGroupOnOVS bool) error {
Copy link
Member

I think the methods are a little confusing, as CreateGroup only constructs the group in memory while DeleteGroup could delete the group in OVS. Methods should be symmetrical so that clients can use similar methods to manage the same resource.
Since there isn't even an actual case that needs deleteGroupOnOVS to be true, I think we should just make CreateGroup and DeleteGroup symmetrical.
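
One way to picture the symmetry being asked for, as a sketch only: both methods act purely on in-memory state, and neither sends anything to OVS. `groupStore` and its fields are hypothetical and are not the real OFBridge type. Returning nil for an unknown ID follows the `if ofctrlGroup == nil { return nil }` branch quoted earlier in this PR.

```go
package main

import "fmt"

// groupStore is a hypothetical in-memory store; neither method below
// sends anything to OVS.
type groupStore struct {
	groups map[uint32]string // group ID -> group type
}

func newGroupStore() *groupStore {
	return &groupStore{groups: make(map[uint32]string)}
}

// CreateGroup only records the group in memory.
func (s *groupStore) CreateGroup(id uint32, groupType string) {
	s.groups[id] = groupType
}

// DeleteGroup only forgets the group in memory. Deleting an unknown ID
// is a no-op and returns nil.
func (s *groupStore) DeleteGroup(id uint32) error {
	delete(s.groups, id)
	return nil
}

func main() {
	s := newGroupStore()
	s.CreateGroup(20, "select")
	if err := s.DeleteGroup(20); err != nil {
		fmt.Println("unexpected error:", err)
	}
	s.CreateGroup(20, "all")
	fmt.Println(s.groups[20])
}
```

With this shape, installing or removing the group on the switch stays a separate step owned by the caller.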

Copy link
Contributor Author

updated

pkg/ovs/openflow/ofctrl_bridge.go Outdated Show resolved Hide resolved
pkg/ovs/openflow/ofctrl_bridge.go Outdated Show resolved Hide resolved
@@ -224,15 +224,13 @@ func (b *OFBridge) createGroupWithType(id GroupIDType, groupType ofctrl.GroupTyp
return g
}

// DeleteGroup deletes a specified group in groupDb. Note that to record OVS operations in Prometheus, openFlowClient.DeleteOFEntries([]binding.OFEntry{groupCache})
Copy link
Member

update comment

Copy link
Contributor Author

updated.

Comment on lines 100 to 101
name: "delete existed group without flow",
existedGroupID: 20,
Copy link
Member

Suggested change
-name: "delete existed group without flow",
-existedGroupID: 20,
+name: "delete existing group without flow",
+existingGroupID: 20,

Copy link
Contributor Author

updated

err: nil,
},
{
name: "delete non-existed group",
Copy link
Member

ditto

Copy link
Contributor Author

updated

} {
t.Run(m.name, func(t *testing.T) {
b.ofSwitch = newFakeOFSwitch(b)
b.ofSwitch.NewGroup(uint32(m.existedGroupID), ofctrl.GroupAll)
Copy link
Member

Why doesn't this use b.CreateGroup, to be symmetrical?

Copy link
Contributor Author

updated

@ceclinux ceclinux force-pushed the fix-delete-multicast-group branch 2 times, most recently from 912b9b5 to e514b98 Compare March 9, 2023 16:29
The old group will be reused unexpectedly if groupDb cache managed
by OFBridge is not cleared for that group, which causes a new group
claims to have a different group type acquiring the old group type.

Fixes antrea-io#4575

Signed-off-by: ceclinux <src655@gmail.com>
Copy link
Member

@tnqn tnqn left a comment

LGTM

@tnqn
Copy link
Member

tnqn commented Mar 10, 2023

/test-all

@tnqn tnqn merged commit 9b6319d into antrea-io:main Mar 10, 2023
@tnqn tnqn added action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels Mar 10, 2023
@tnqn
Copy link
Member

tnqn commented Mar 15, 2023

@ceclinux could you backport the PR to applicable releases up to 1.7?

jainpulkit22 pushed a commit to urharshitha/antrea that referenced this pull request Apr 28, 2023
…ntrea-io#4592)

The old group will be reused unexpectedly if groupDb cache managed
by OFBridge is not cleared for that group, which causes a new group
claims to have a different group type acquiring the old group type.

Fixes antrea-io#4575

Signed-off-by: ceclinux <src655@gmail.com>
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multicast receiver Pods won't get traffic after service is created and deleted
5 participants