Fix rollback invocation after CmdAdd failure in CNI server #5548

Merged

antoninbas
Contributor

  • When performing configuration rollback after an error in CmdAdd, we do not invoke CmdDel directly. Instead, we invoke an internal version of it which does not log a "Received CmdDel request" message (the message is confusing otherwise as it implies that we received a new CNI DEL command from the container runtime), and which does not process the network config again (as it was already processed at the beginning of CmdAdd). By not processing the config a second time, we ensure that there are no duplicate CIDRs in the IPAMConfig.
  • Migrate klog calls in server.go to use structured logging.
  • Improve unit tests for the CNI server to validate this fix.

Fixes #5547
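
A minimal sketch of the rollback pattern described above, assuming a heavily simplified server. The `request`, `config`, `fakeIPAM`, and `server` types and the `parse` and `logf` helpers are hypothetical stand-ins, not Antrea's actual code. It only shows the shape of the change: the public `CmdDel` logs and parses, while the internal `cmdDel` reuses the config that `CmdAdd` already parsed.

```go
package main

import (
	"errors"
	"fmt"
)

// request stands in for the gRPC CNI request (hypothetical, simplified).
type request struct{ containerID string }

// config stands in for the parsed network config (hypothetical, simplified).
type config struct{ containerID string }

// fakeIPAM records which containers currently hold an address.
type fakeIPAM struct{ allocated map[string]bool }

type server struct {
	ipam       *fakeIPAM
	log        []string // captured log messages, for inspection
	parseCount int      // number of times a request was parsed
	failAdd    bool     // inject an error after IPAM allocation
}

func newServer(failAdd bool) *server {
	return &server{ipam: &fakeIPAM{allocated: map[string]bool{}}, failAdd: failAdd}
}

func (s *server) logf(format string, args ...any) {
	s.log = append(s.log, fmt.Sprintf(format, args...))
}

// parse models request validation, which should run once per incoming request.
func (s *server) parse(req *request) *config {
	s.parseCount++
	return &config{containerID: req.containerID}
}

// CmdAdd is the entry point for an ADD request from the container runtime.
func (s *server) CmdAdd(req *request) (err error) {
	s.logf("Received CmdAdd request")
	cfg := s.parse(req)
	defer func() {
		if err != nil {
			s.logf("CmdAdd for container failed, trying to rollback")
			// Reuse the parsed config: no second parse, no "Received CmdDel request".
			if delErr := s.cmdDel(cfg); delErr != nil {
				s.logf("Failed to roll back after CNI add failure: %v", delErr)
			}
		}
	}()
	s.ipam.allocated[cfg.containerID] = true
	if s.failAdd {
		return errors.New("injected interface configuration error")
	}
	return nil
}

// CmdDel is the entry point for a DEL request from the container runtime.
func (s *server) CmdDel(req *request) error {
	s.logf("Received CmdDel request")
	return s.cmdDel(s.parse(req))
}

// cmdDel is the internal helper shared by CmdDel and the CmdAdd rollback.
func (s *server) cmdDel(cfg *config) error {
	delete(s.ipam.allocated, cfg.containerID)
	s.logf("CmdDel for container succeeded")
	return nil
}

func main() {
	s := newServer(true)
	err := s.CmdAdd(&request{containerID: "c1"})
	fmt.Println("add error:", err)
	fmt.Println("address still allocated:", s.ipam.allocated["c1"])
	fmt.Println("parse count:", s.parseCount)
	for _, m := range s.log {
		fmt.Println(m)
	}
}
```

In this sketch the failed ADD releases the address immediately and the request is parsed exactly once, which is the behavior the fix is aiming for.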

@antoninbas
Contributor Author

/test-all

@antoninbas antoninbas added the action/backport (indicates a PR that requires backports) and action/release-note (indicates a PR that should be included in release notes) labels Oct 5, 2023
@antoninbas
Contributor Author

Also tested manually on a K8s cluster, by injecting an error in the CNI server during interface configuration (after IPAM):

I1006 01:47:20.047630       1 server.go:425] "Received CmdAdd request" request="cni_args:{container_id:\"8cba418aee4d4dc7b3f8a842fc5e8194f7708e54995afa9518eceed834b51597\" netns:\"/var/run/netns/cni-5003539e-60c5-d59b-3bbb-98a94f5eadda\" ifname:\"eth0\" args:\"K8S_POD_NAME=toolbox-pttdm;K8S_POD_INFRA_CONTAINER_ID=8cba418aee4d4dc7b3f8a842fc5e8194f7708e54995afa9518eceed834b51597;K8S_POD_UID=42681e73-8601-4d74-b983-6e1035355e55;IgnoreUnknown=1;K8S_POD_NAMESPACE=default\" path:\"/opt/cni/bin\" network_configuration:\"{\\\"cniVersion\\\":\\\"0.3.0\\\",\\\"ipam\\\":{\\\"type\\\":\\\"host-local\\\"},\\\"name\\\":\\\"antrea\\\",\\\"type\\\":\\\"antrea\\\"}\"}"
I1006 01:47:20.053556       1 server.go:494] "Allocated IP addresses" container="8cba418aee4d4dc7b3f8a842fc5e8194f7708e54995afa9518eceed834b51597" result={"cniVersion":"1.0.0","ips":[{"address":"10.10.1.3/24","gateway":"10.10.1.1"}],"dns":{},"VLANID":0}
E1006 01:47:20.053623       1 server.go:503] "FAKING ERROR"
I1006 01:47:20.053643       1 server.go:456] "CmdAdd for container failed, trying to rollback" container="8cba418aee4d4dc7b3f8a842fc5e8194f7708e54995afa9518eceed834b51597"
I1006 01:47:20.058455       1 server.go:560] "Deleted IP addresses for container" container="8cba418aee4d4dc7b3f8a842fc5e8194f7708e54995afa9518eceed834b51597"
I1006 01:47:20.058505       1 server.go:566] "CmdDel for container succeeded" container="8cba418aee4d4dc7b3f8a842fc5e8194f7708e54995afa9518eceed834b51597"
I1006 01:47:20.083692       1 server.go:581] "Received CmdDel request" request="cni_args:{container_id:\"8cba418aee4d4dc7b3f8a842fc5e8194f7708e54995afa9518eceed834b51597\" netns:\"/var/run/netns/cni-5003539e-60c5-d59b-3bbb-98a94f5eadda\" ifname:\"eth0\" args:\"IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=toolbox-pttdm;K8S_POD_INFRA_CONTAINER_ID=8cba418aee4d4dc7b3f8a842fc5e8194f7708e54995afa9518eceed834b51597;K8S_POD_UID=42681e73-8601-4d74-b983-6e1035355e55\" path:\"/opt/cni/bin\" network_configuration:\"{\\\"cniVersion\\\":\\\"0.3.0\\\",\\\"ipam\\\":{\\\"type\\\":\\\"host-local\\\"},\\\"name\\\":\\\"antrea\\\",\\\"type\\\":\\\"antrea\\\"}\"}"
I1006 01:47:20.087014       1 server.go:560] "Deleted IP addresses for container" container="8cba418aee4d4dc7b3f8a842fc5e8194f7708e54995afa9518eceed834b51597"
I1006 01:47:20.087059       1 server.go:566] "CmdDel for container succeeded" container="8cba418aee4d4dc7b3f8a842fc5e8194f7708e54995afa9518eceed834b51597"

Notice how the rollback is now working (we release the IP address right away).
We still log the "CmdDel for container succeeded" message for the rollback. This is convenient to check for rollback success. If reviewers find it confusing, I can use a different success message for rollback vs. actual request.

-if _, err := s.CmdDel(ctx, request); err != nil {
-	klog.Warningf("Failed to rollback after CNI add failure: %v", err)
+klog.InfoS("CmdAdd for container failed, trying to rollback", "container", cniConfig.ContainerId)
+if _, err := s.cmdDel(ctx, cniConfig); err != nil {
Contributor

I agree that it is an improvement to use the existing "cniConfig" directly in cmdDel and avoid an unnecessary call to loadNetworkConfig.

However, I don't understand how the original code could append a duplicate PodCIDR range to the IPAM configuration. When we call loadNetworkConfig in CmdDel/CmdAdd/CmdCheck, it generates a new CNIConfig object, and cniConfig.NetworkConfig is unmarshaled from the JSON string request.CniArgs.NetworkConfiguration. So even if we call CmdDel in the CmdAdd rollback, the cniConfig passed to the IPAM plugin from a new call to loadNetworkConfig should equal the values that already exist in CmdAdd.

Contributor Author

The original code was calling validateRequestMessage for the rollback (because CmdDel calls validateRequestMessage), using the same request object. validateRequestMessage actually mutates the request, but this is not obvious when looking at the code:

  • loadNetworkConfig has the following assignment:
    cniConfig.CniCmdArgs = request.CniArgs
    It means that mutating the contents of cniConfig.CniCmdArgs (a pointer to a protobuf struct) will mutate the request.
  • updateLocalIPAMSubnet has the following assignment:
    cniConfig.NetworkConfiguration, _ = json.Marshal(cniConfig.NetworkConfig)
    cniConfig.NetworkConfiguration is actually the same as cniConfig.CniCmdArgs.NetworkConfiguration (it is a byte slice). So at this point we have mutated the request. The next time we call validateRequestMessage (which we no longer do with my patch), the network configuration of the request already includes the Node subnets in the IPAM section.

I confirmed all of this with my unit test.
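
To make the aliasing concrete, here is a minimal sketch under simplifying assumptions. The `cmdArgs`, `request`, `networkConfig`, and `cniConfig` types and the `validate` function are hypothetical stand-ins for the protobuf message and Antrea's real helpers, and the IPAM ranges are flattened to a list of strings. It only illustrates how a pointer copy plus a write-back through a promoted field causes a second validation of the same request to see the subnet twice.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cmdArgs stands in for the protobuf CniCmdArgs message (hypothetical, simplified).
type cmdArgs struct {
	NetworkConfiguration []byte
}

// request stands in for the gRPC request wrapping the args.
type request struct {
	CniArgs *cmdArgs
}

type ipamConfig struct {
	Type   string   `json:"type"`
	Ranges []string `json:"ranges,omitempty"` // simplified: real IPAM ranges are richer
}

type networkConfig struct {
	Name string     `json:"name"`
	IPAM ipamConfig `json:"ipam"`
}

// cniConfig embeds both structs, so their fields are promoted.
type cniConfig struct {
	*networkConfig
	*cmdArgs
}

// validate parses the request, appends the Node subnet, and writes the
// updated JSON back through the shared pointer, mutating the request.
func validate(req *request, nodeCIDR string) (*cniConfig, error) {
	cfg := &cniConfig{networkConfig: &networkConfig{}}
	cfg.cmdArgs = req.CniArgs // pointer copy: cfg and req now share one struct
	if err := json.Unmarshal(cfg.NetworkConfiguration, cfg.networkConfig); err != nil {
		return nil, err
	}
	cfg.IPAM.Ranges = append(cfg.IPAM.Ranges, nodeCIDR)
	updated, err := json.Marshal(cfg.networkConfig)
	if err != nil {
		return nil, err
	}
	// Promoted field: this assignment overwrites req.CniArgs.NetworkConfiguration.
	cfg.NetworkConfiguration = updated
	return cfg, nil
}

func main() {
	req := &request{CniArgs: &cmdArgs{
		NetworkConfiguration: []byte(`{"name":"antrea","ipam":{"type":"host-local"}}`),
	}}
	first, _ := validate(req, "10.10.1.0/24")
	second, _ := validate(req, "10.10.1.0/24") // models the old rollback path
	fmt.Println("first ranges: ", first.IPAM.Ranges)
	fmt.Println("second ranges:", second.IPAM.Ranges)
	fmt.Println("request now:  ", string(req.CniArgs.NetworkConfiguration))
}
```

In this sketch the second call ends up with the same subnet listed twice, because it parses JSON that the first call already rewrote. Skipping the second validation, as the patch does, avoids the duplicate without touching the existing mutation.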

Note that I don't know why we are mutating the request in the first place; I don't know if it's intentional, but I think we have been doing it for 4 years. Since using the existing cniConfig seemed like the right thing to do, that's what I did in this patch. We could consider a follow-up patch to clean up existing logic, but I didn't want to risk introducing a new bug.

Contributor Author

To clarify the statement above:

cniConfig.NetworkConfiguration is actually the same as cniConfig.CniCmdArgs.NetworkConfiguration

This is because CniCmdArgs is an embedded field:

type CNIConfig struct {
	*types.NetworkConfig
	// AntreaIPAM for an interface not managed by Antrea CNI.
	secondaryNetworkIPAM bool
	// CniCmdArgs received from the CNI plugin. IPAM data in CniCmdArgs can be updated with the
	// Node's Pod CIDRs for NodeIPAM.
	*cnipb.CniCmdArgs
	// K8s CNI_ARGS passed to the CNI plugin.
	*types.K8sArgs
}
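
As a small illustration of the field promotion at play, here is a sketch with hypothetical, cut-down types (`cmdArgs` and `cniConfig` are stand-ins, not the real Antrea or protobuf definitions). With an embedded pointer, the short selector and the fully qualified selector name the same field, so a write through one is visible through the other and through the original pointer.

```go
package main

import "fmt"

// cmdArgs stands in for cnipb.CniCmdArgs (hypothetical, simplified).
type cmdArgs struct {
	NetworkConfiguration []byte
}

// cniConfig embeds *cmdArgs, so NetworkConfiguration is promoted.
type cniConfig struct {
	*cmdArgs
}

// sameField reports whether the promoted selector and the explicit
// selector refer to the same memory location.
func sameField(c *cniConfig) bool {
	return &c.NetworkConfiguration == &c.cmdArgs.NetworkConfiguration
}

func main() {
	args := &cmdArgs{NetworkConfiguration: []byte("original")}
	cfg := &cniConfig{cmdArgs: args}

	// Writing through the promoted field changes the embedded struct,
	// which is also the struct that args points to.
	cfg.NetworkConfiguration = []byte("updated")

	fmt.Println(string(cfg.cmdArgs.NetworkConfiguration))
	fmt.Println(string(args.NetworkConfiguration))
	fmt.Println(sameField(cfg))
}
```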

Contributor

Got it, thanks for the clarification.

wenyingd
wenyingd previously approved these changes Oct 8, 2023
Contributor

@wenyingd wenyingd left a comment

LGTM

jianjuns
jianjuns previously approved these changes Oct 9, 2023
Contributor

@jianjuns jianjuns left a comment

Thanks for the fix.

@@ -453,12 +453,12 @@ func (s *CNIServer) CmdAdd(ctx context.Context, request *cnipb.CniCmdRequest) (*
// Rollback to delete configurations once ADD is failure.
Contributor

Not your code, but change "is failure" to "fails"?

Contributor Author

Done

Signed-off-by: Antonin Bas <abas@vmware.com>
@antoninbas antoninbas dismissed stale reviews from jianjuns and wenyingd via 6ab005d October 9, 2023 18:31
@antoninbas antoninbas force-pushed the fix-rollback-after-cni-add-failure branch from 904a6b5 to 6ab005d October 9, 2023 18:31
@antoninbas
Contributor Author

/test-all

@antoninbas
Contributor Author

/test-e2e
/test-conformance

@antoninbas antoninbas merged commit f43477f into antrea-io:main Oct 9, 2023
50 of 57 checks passed
@antoninbas antoninbas deleted the fix-rollback-after-cni-add-failure branch October 9, 2023 23:10
antoninbas added a commit that referenced this pull request Oct 10, 2023
…failure in CNI (#5558)
antoninbas added a commit that referenced this pull request Oct 11, 2023
…failure in CNI (#5559)
tnqn pushed a commit that referenced this pull request Oct 16, 2023
…failure in CNI server (#5560)
luolanzone added a commit that referenced this pull request Oct 20, 2023
Development

Successfully merging this pull request may close these issues.

Invalid CmdAdd rollback in antrea-agent