
OCPBUGS-13190: Avoid spurious updates for internalTrafficPolicy #927

Conversation

Miciah
Contributor

@Miciah Miciah commented May 5, 2023

Avoid spurious updates for internalTrafficPolicy

Specify spec.internalTrafficPolicy on NodePort- and ClusterIP-type services that the operator manages. Also, ignore updates to the spec.ipFamilies and spec.ipFamilyPolicy fields.

Before this PR, the update logic for NodePort- and ClusterIP-type services would try to revert the default values that the API set for these fields.

  • assets/router/service-cloud.yaml:
  • assets/router/service-internal.yaml: Specify internalTrafficPolicy: Cluster.
  • pkg/manifests/bindata.go: Regenerate.
  • pkg/operator/controller/ingress/internal_service.go (internalServiceChanged): Ignore spec.ipFamilies and spec.ipFamilyPolicy.
  • pkg/operator/controller/ingress/internal_service_test.go (Test_desiredInternalIngressControllerService): Verify that spec.internalTrafficPolicy is set to "Cluster".
    (Test_internalServiceChanged): Verify that changes to spec.internalTrafficPolicy are detected and that changes to spec.ipFamilies and spec.ipFamilyPolicy are ignored.
  • pkg/operator/controller/ingress/nodeport_service.go (desiredNodePortService): Set spec.internalTrafficPolicy to "Cluster".
    (nodePortServiceChanged): Ignore spec.ipFamilies and spec.ipFamilyPolicy.
  • pkg/operator/controller/ingress/nodeport_service_test.go (TestDesiredNodePortService): Verify that spec.internalTrafficPolicy is set to "Cluster".
    (TestNodePortServiceChanged): Verify that changes to spec.internalTrafficPolicy are detected and that changes to spec.ipFamilies and spec.ipFamilyPolicy are ignored.
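
For readers unfamiliar with how this class of spurious update is usually avoided, here is a minimal, self-contained Go sketch of the approach described above. It is illustrative rather than the operator's actual code: the helper names desiredNodePortServiceSpec and serviceSpecChanged are made up for this example. The desired service sets spec.internalTrafficPolicy explicitly, and the changed-detection ignores the API-defaulted spec.ipFamilies and spec.ipFamilyPolicy fields while still catching real changes.

package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
	"github.com/google/go-cmp/cmp/cmpopts"
	corev1 "k8s.io/api/core/v1"
)

// desiredNodePortServiceSpec (illustrative) sets InternalTrafficPolicy
// explicitly so that the API server does not need to default it and the
// desired and current objects agree on the field.
func desiredNodePortServiceSpec() corev1.ServiceSpec {
	trafficPolicy := corev1.ServiceInternalTrafficPolicyCluster
	return corev1.ServiceSpec{
		Type:                  corev1.ServiceTypeNodePort,
		InternalTrafficPolicy: &trafficPolicy,
	}
}

// serviceSpecChanged (illustrative) ignores spec.ipFamilies and
// spec.ipFamilyPolicy, which the API server defaults and the operator does
// not manage, while still detecting changes to spec.internalTrafficPolicy.
func serviceSpecChanged(current, expected *corev1.ServiceSpec) bool {
	return !cmp.Equal(current, expected,
		cmpopts.IgnoreFields(corev1.ServiceSpec{}, "IPFamilies", "IPFamilyPolicy"),
		cmpopts.EquateEmpty(),
	)
}

func main() {
	expected := desiredNodePortServiceSpec()

	// Simulate the API server defaulting fields that the operator leaves unset.
	current := desiredNodePortServiceSpec()
	ipFamilyPolicy := corev1.IPFamilyPolicySingleStack
	current.IPFamilies = []corev1.IPFamily{corev1.IPv4Protocol}
	current.IPFamilyPolicy = &ipFamilyPolicy

	fmt.Println(serviceSpecChanged(&current, &expected)) // false: defaulted fields are ignored

	local := corev1.ServiceInternalTrafficPolicyLocal
	current.InternalTrafficPolicy = &local
	fmt.Println(serviceSpecChanged(&current, &expected)) // true: a real policy change is detected
}

Setting the field in the manifests and skipping the defaulted fields in the comparison are two sides of the same fix: either way, the operator stops seeing a difference between what it wants and what the API server stored.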

Ignore updates for null versus empty annotations

Ignore updates to annotations when the update is from null to empty or vice versa.

  • pkg/operator/controller/ingress/internal_service.go (internalServiceChanged): Use EquateEmpty when comparing annotations.
  • pkg/operator/controller/ingress/internal_service_test.go (TestInternalServiceChangedEmptyAnnotations): New test to verify that internalServiceChanged treats empty and null annotations as equal.
  • pkg/operator/controller/ingress/load_balancer_service_test.go (TestLoadBalancerServiceChangedEmptyAnnotations): New test to verify that loadBalancerServiceChanged treats empty and null annotations as equal.
  • pkg/operator/controller/ingress/nodeport_service.go (nodePortServiceChanged): Use EquateEmpty when comparing annotations.
  • pkg/operator/controller/ingress/nodeport_service_test.go (TestNodePortServiceChangedEmptyAnnotations): New test to verify that nodePortServiceChanged treats empty and null annotations as equal.
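
As with the first commit, a minimal sketch may help. The snippet below is illustrative rather than the operator's actual code, and the annotation key is hypothetical; it shows how cmpopts.EquateEmpty makes a nil annotations map and an empty annotations map compare as equal, so that difference no longer registers as a change.

package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
	"github.com/google/go-cmp/cmp/cmpopts"
)

// annotationsChanged (illustrative) reports whether the managed annotations
// differ. With cmpopts.EquateEmpty, nil and empty maps are treated as equal,
// so null-versus-empty annotations no longer trigger an update.
func annotationsChanged(current, expected map[string]string) bool {
	return !cmp.Equal(current, expected, cmpopts.EquateEmpty())
}

func main() {
	var current map[string]string   // nil: object returned with no annotations
	expected := map[string]string{} // empty: manifest rendered with an empty annotations block

	fmt.Println(annotationsChanged(current, expected)) // false: nil and empty are equated

	expected["example.openshift.io/hypothetical-annotation"] = "value"
	fmt.Println(annotationsChanged(current, expected)) // true: a real annotation change is detected
}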

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 5, 2023
@openshift-ci-robot
Contributor

@Miciah: This pull request references Jira Issue OCPBUGS-13190, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.


@gcs278
Contributor

gcs278 commented May 5, 2023

/assign
I'll take a look at this early next week

@Miciah
Contributor Author

Miciah commented May 8, 2023

This appears to be another spurious update in the ingress-operator logs from the last e2e-aws-operator job run:

2023-05-05T22:55:16.629Z  INFO  operator.ingress_controller  ingress/nodeport_service.go:83  updated NodePort service  {"namespace": "openshift-ingress", "name": "router-nodeport-local-with-fallback", "diff":
  &v1.Service{
  	TypeMeta:   {},
  	ObjectMeta: {Name: "router-nodeport-local-with-fallback", Namespace: "openshift-ingress", UID: "c646ae30-900d-46ed-9562-a62edebe24ee", ResourceVersion: "35457", ...},
  	Spec: v1.ServiceSpec{
  		Ports:                    {{Name: "http", Protocol: "TCP", Port: 80, TargetPort: {Type: 1, StrVal: "http"}, ...}, {Name: "https", Protocol: "TCP", Port: 443, TargetPort: {Type: 1, StrVal: "https"}, ...}, {Name: "metrics", Protocol: "TCP", Port: 1936, TargetPort: {Type: 1, StrVal: "metrics"}, ...}},
  		Selector:                 {"ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "local-with-fallback"},
  		ClusterIP:                "172.30.38.74",
- 		ClusterIPs:               []string{"172.30.38.74"},
+ 		ClusterIPs:               nil,
  		Type:                     "NodePort",
  		ExternalIPs:              nil,
- 		SessionAffinity:          "None",
+ 		SessionAffinity:          "",
  		LoadBalancerIP:           "",
  		LoadBalancerSourceRanges: nil,
  		... // 3 identical fields
  		PublishNotReadyAddresses:      false,
  		SessionAffinityConfig:         nil,
- 		IPFamilies:                    []v1.IPFamily{"IPv4"},
+ 		IPFamilies:                    nil,
- 		IPFamilyPolicy:                &"SingleStack",
+ 		IPFamilyPolicy:                nil,
  		AllocateLoadBalancerNodePorts: nil,
  		LoadBalancerClass:             nil,
  		InternalTrafficPolicy:         &"Cluster",
  	},
  	Status: {},
  }
}

With this PR, the operator ignores all the fields that the diff shows changed, so my guess is that the problem is with null versus empty annotations. I've pushed a3ec26a to treat null and empty annotations as equal.
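
For context, a nil map and an empty map are distinct values to a plain deep comparison, which is all it takes for the update logic to think the service changed. A tiny standalone illustration (not the operator's code):

package main

import (
	"fmt"
	"reflect"
)

func main() {
	var nilAnnotations map[string]string    // what a client returns when no annotations are set
	emptyAnnotations := map[string]string{} // what a manifest with an empty annotations block produces

	// A plain deep comparison distinguishes the two, so the operator would
	// keep trying to "fix" the difference and log a spurious update.
	fmt.Println(reflect.DeepEqual(nilAnnotations, emptyAnnotations)) // false
}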

@openshift-ci-robot
Contributor

@Miciah: This pull request references Jira Issue OCPBUGS-13190, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan


@Miciah Miciah force-pushed the OCPBUGS-13190-avoid-spurious-updates-for-internalTrafficPolicy branch from a3ec26a to 9c5669c on May 9, 2023 07:07
@Miciah
Contributor Author

Miciah commented May 9, 2023

@gcs278
Contributor

gcs278 commented May 9, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 9, 2023
@Miciah
Contributor Author

Miciah commented May 9, 2023

e2e-aws-operator failed on TestUnmanagedDNSToManagedDNSInternalIngressController. The ingress-operator logs from that CI run show no spurious "updated NodePort service" or "updated internal service" messages. I also checked the e2e-aws-operator run from before the rebase, and there too, the ingress-operator logs show no spurious updates.
/test e2e-aws-operator

e2e-aws-ovn failed because [bz-kube-apiserver][invariant] alert/KubeAPIErrorBudgetBurn should not be at or above info failed:

{  KubeAPIErrorBudgetBurn was at or above info for at least 50s on platformidentification.JobType{Release:"4.14", FromRelease:"", Platform:"aws", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 18m28s, firing for 50s:

May 09 07:57:00.375 - 50s   E alert/KubeAPIErrorBudgetBurn ns/openshift-kube-apiserver ALERTS{alertname="KubeAPIErrorBudgetBurn", alertstate="firing", long="6h", namespace="openshift-kube-apiserver", prometheus="openshift-monitoring/k8s", severity="critical", short="30m"}}

/test e2e-aws-ovn

@Miciah
Contributor Author

Miciah commented May 9, 2023

e2e-gcp-ovn failed:

INFO[2023-05-09T07:50:09Z] Running multi-stage phase test               
INFO[2023-05-09T07:50:09Z] Running step e2e-gcp-ovn-openshift-e2e-test. 
WARN[2023-05-09T08:21:01Z] Pod e2e-gcp-ovn-openshift-e2e-test is being unexpectedly deleted 
WARN[2023-05-09T08:21:01Z] Pod e2e-gcp-ovn-openshift-e2e-test is being unexpectedly deleted 
INFO[2023-05-09T08:31:50Z] Logs for container test in pod e2e-gcp-ovn-openshift-e2e-test: 
INFO[2023-05-09T08:31:50Z] failed to try resolving symlinks in path "/var/log/pods/ci-op-mwibll67_e2e-gcp-ovn-openshift-e2e-test_c378d414-fe0e-41ec-977f-dc43aa6c822f/test/0.log": lstat /var/log/pods/ci-op-mwibll67_e2e-gcp-ovn-openshift-e2e-test_c378d414-fe0e-41ec-977f-dc43aa6c822f/test/0.log: no such file or directory 
WARN[2023-05-09T08:31:50Z] Pod e2e-gcp-ovn-openshift-e2e-test is being unexpectedly deleted 
WARN[2023-05-09T08:31:50Z] failed to get object after finishing watch    error=pods "e2e-gcp-ovn-openshift-e2e-test" not found
INFO[2023-05-09T08:31:50Z] Step e2e-gcp-ovn-openshift-e2e-test failed after 41m40s. 
INFO[2023-05-09T08:31:50Z] Step phase test failed after 41m40s.         

/test e2e-gcp-ovn

e2e-aws-operator failed because of TestUnmanagedDNSToManagedDNSInternalIngressController again. That failure has been observed on other PRs and reported as OCPBUGS-10983.
/override ci/prow/e2e-aws-operator

@openshift-ci
Contributor

openshift-ci bot commented May 9, 2023

@Miciah: Overrode contexts on behalf of Miciah: ci/prow/e2e-aws-operator


@gcs278
Contributor

gcs278 commented May 22, 2023

I reviewed again, still looks good to me. Thanks for the fix.
/approve

@openshift-ci
Contributor

openshift-ci bot commented May 22, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gcs278


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 22, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 960d410 and 2 for PR HEAD 9c5669c in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 108acb3 and 1 for PR HEAD 9c5669c in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 90d2fea and 0 for PR HEAD 9c5669c in total

@openshift-ci-robot
Contributor

/hold

Revision 9c5669c was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 24, 2023
@Miciah
Contributor Author

Miciah commented May 27, 2023

e2e-hypershift failed because the "e2e-hypershift-hypershift-install" step timed out "Waiting for operator deployment to be observed".

I don't know where to start with diagnosing e2e-hypershift failures. The CI errors are not meaningful to me. @enxebre, how can my team diagnose failures with this new job?

I did notice that https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/927/pull-ci-openshift-cluster-ingress-operator-master-e2e-hypershift/1661465639627264000/artifacts/e2e-hypershift/dump-management-cluster/artifacts/namespaces/hypershift/core/pods/logs/operator-d98677dc8-dk7ks-operator.log has thousands of error messages, but most of them are "manifest unknown" or "unauthorized: authentication required" errors when attempting to pull images from registry.ci.openshift.org.

Hoping this is a transient failure...
/test e2e-hypershift

@Miciah
Contributor Author

Miciah commented Jun 1, 2023

e2e-aws-ovn failed because of OCPBUGS-14321, which should be fixed by openshift/origin#27955.
/test e2e-aws-ovn

e2e-aws-ovn-upgrade failed because [sig-arch] events should not repeat pathologically for namespace openshift-monitoring failed:

{  5 events happened too frequently

event happened 21 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:19Z To: 15:40:20Z result=reject 
event happened 22 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:20Z To: 15:40:21Z result=reject 
event happened 23 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:21Z To: 15:40:22Z result=reject 
event happened 24 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:22Z To: 15:40:23Z result=reject 
event happened 25 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:24Z To: 15:40:25Z result=reject }

I saw a couple similar failures in search.ci, so I filed OCPBUGS-14400 for these failures.
/test e2e-aws-ovn-upgrade

@Miciah Miciah force-pushed the OCPBUGS-13190-avoid-spurious-updates-for-internalTrafficPolicy branch from 9c5669c to 6cd3b67 on June 6, 2023 12:55
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 6, 2023
@Miciah
Contributor Author

Miciah commented Jun 6, 2023

Rebased for #939 and #906.
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 6, 2023
@frobware
Contributor

frobware commented Jun 6, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 6, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD e068d04 and 2 for PR HEAD 6cd3b67 in total

@Miciah
Contributor Author

Miciah commented Jun 6, 2023

/hold
We're going to try to get #928 in next.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 6, 2023
@Miciah
Contributor Author

Miciah commented Jun 8, 2023

The e2e-hypershift job for this PR passed earlier, but the same job has been failing on #928. I want to see whether e2e-hypershift is still passing on this PR (indicating that #928 is causing a failure) or whether it fails now (indicating that something else broke e2e-hypershift).
/test e2e-hypershift

@Miciah
Contributor Author

Miciah commented Jun 12, 2023

/hold cancel
/test all
now that #928 has merged.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 12, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 0e500e6 and 2 for PR HEAD 6cd3b67 in total

@Miciah
Contributor Author

Miciah commented Jun 13, 2023

e2e-aws-ovn-serial failed on [sig-storage] PersistentVolumes-local Stress with local volumes [Serial] should be able to process many pods and reuse local volumes:

{  fail [test/e2e/storage/persistent_volumes-local.go:522]: persistentvolumes "local-pvlh4qq" not found
Error: exit with code 1
Ginkgo exit error 1: exit with code 1}

Search.ci shows several similar failures for other jobs, so I filed OCPBUGS-14930 to track the issue. However, the failure rate appears to be low.
/test e2e-aws-ovn-serial

e2e-aws-operator failed on TestAWSELBConnectionIdleTimeout, which is known to be flaky (see OCPBUGS-13810), and on TestRouterCompressionOperation:

=== RUN   TestAll/serial/TestRouterCompressionOperation
    router_compression_test.go:102: failed to update ingress controller, retrying...
    router_compression_test.go:147: failed to apply the required MIME type for test: failed to update ingress controller: Put "https://api.ci-op-gsdssil4-43abb.origin-ci-int-aws.dev.rhcloud.com:6443/apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default": read tcp 10.128.250.45:36442->13.56.185.125:6443: read: connection reset by peer

This error appears to be related to API server flakiness and probably does not represent flakiness on the part of the test itself or any problem with the changes in this PR. In this CI job run, several operators failed to roll out all their pods because the kubelet of one of the control-plane nodes stopped responding; see nodes.json.
/test e2e-aws-operator

@Miciah
Contributor Author

Miciah commented Jun 15, 2023

e2e-aws-operator failed again because TestAWSELBConnectionIdleTimeout failed and because several operators again failed to roll out all of their pods. While investigating the failures, I happened to notice a couple other bugs in the ingress operator, for which I filed bug reports: OCPBUGS-14994 and OCPBUGS-14995. However, these bugs only cause spurious reconciliations and are probably not causing any test failures. I'll continue looking into the CI failures tomorrow, but let's give CI another try in the meantime.
/test e2e-aws-operator

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD b1a6bb5 and 1 for PR HEAD 6cd3b67 in total

@Miciah
Contributor Author

Miciah commented Jun 20, 2023

/test all
since #944 has merged.

@Miciah
Contributor Author

Miciah commented Jun 20, 2023

e2e-aws-operator failed because must-gather failed.
/test e2e-aws-operator

e2e-aws-ovn-single-node failed because [sig-scheduling][Early] The HAProxy router pods [apigroup:route.openshift.io] should be scheduled on different nodes failed:

{  fail [github.com/openshift/origin/test/extended/scheduling/pods.go:102]: ns/openshift-ingress pod router-default-58ffdd7f97-7drtv and pod router-default-58ffdd7f97-44cps are running on the same node: ip-10-0-151-26.us-west-1.compute.internal
Ginkgo exit error 1: exit with code 1}

Having two pods is unexpected because this job creates a single-node cluster, and the deployment and replicaset have replicas: 1. The events show that both pods were scheduled at the same time, after the node became ready, but one of the pods failed the NodeAffinity predicate:

% jq -rc < events.json '.items|sort_by(.metadata.creationTimestamp)|.[]|select((.involvedObject.name//""|contains("router-default-58ffdd7f97")) or .reason=="NodeReady")|.metadata.creationTimestamp+" ("+(.count//1|tostring)+"x) "+.reason+": "+.message'  
2023-06-20T13:23:14Z (1x) NodeReady: Node ip-10-0-151-26.us-west-1.compute.internal status is now: NodeReady
2023-06-20T13:23:56Z (1x) FailedScheduling: 0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
2023-06-20T13:23:56Z (1x) SuccessfulCreate: Created pod: router-default-58ffdd7f97-7drtv
2023-06-20T13:24:41Z (1x) Scheduled: Successfully assigned openshift-ingress/router-default-58ffdd7f97-44cps to ip-10-0-151-26.us-west-1.compute.internal
2023-06-20T13:24:41Z (1x) Scheduled: Successfully assigned openshift-ingress/router-default-58ffdd7f97-7drtv to ip-10-0-151-26.us-west-1.compute.internal
2023-06-20T13:24:41Z (1x) NodeAffinity: Predicate NodeAffinity failed
2023-06-20T13:24:41Z (1x) SuccessfulCreate: Created pod: router-default-58ffdd7f97-44cps
2023-06-20T13:24:49Z (1x) AddedInterface: Add eth0 [10.128.0.60/23] from ovn-kubernetes
2023-06-20T13:24:50Z (1x) Pulling: Pulling image "registry.build01.ci.openshift.org/ci-op-cq7jwkqw/stable@sha256:117cb3f5b2dc2e3b9d7a1c66334ff6e861e62915af095a8ffdca32f1f1d60aa8"
2023-06-20T13:25:15Z (1x) Pulled: Successfully pulled image "registry.build01.ci.openshift.org/ci-op-cq7jwkqw/stable@sha256:117cb3f5b2dc2e3b9d7a1c66334ff6e861e62915af095a8ffdca32f1f1d60aa8" in 25.733799773s (25.7338107s including waiting)
2023-06-20T13:25:16Z (1x) Created: Created container router
2023-06-20T13:25:16Z (1x) Started: Started container router
2023-06-20T13:25:17Z (11x) ProbeError: Startup probe error: HTTP probe failed with statuscode: 500
body: [-]backend-proxy-http failed: reason withheld
[-]has-synced failed: reason withheld
[+]process-running ok
healthz check failed


2023-06-20T13:25:17Z (10x) Unhealthy: Startup probe failed: HTTP probe failed with statuscode: 500

The ingress operator does use the OnePodPerNodeController controller to delete misscheduled pods, but this controller ignores not-ready pods.

The node predicate only excludes remote nodes:

                            "nodeAffinity": {
                                "requiredDuringSchedulingIgnoredDuringExecution": {
                                    "nodeSelectorTerms": [
                                        {
                                            "matchExpressions": [
                                                {
                                                    "key": "node.openshift.io/remote-worker",
                                                    "operator": "NotIn",
                                                    "values": [
                                                        ""
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            }

I don't know why the scheduler scheduled two pods, or why one of them failed the node predicate. Maybe this failure was a fluke.

Also, [sig-arch] Only known images used by tests failed:

{  Cluster accessed images that were not mirrored to the testing repository or already part of the cluster, see test/extended/util/image/README.md in the openshift/origin repo:

registry.redhat.io/rhel8/support-tools from pods:
  ns/e2e-test-egress-router-cni-e2e-vpksf pod/ip-10-0-151-26us-west-1computeinternal-debug-xr9wj node/ip-10-0-151-26.us-west-1.compute.internal
}

Let's see whether it's an eventual consistency issue with image mirroring.
/test e2e-aws-ovn-single-node

@Miciah
Contributor Author

Miciah commented Jun 21, 2023

e2e-aws-operator failed again because must-gather failed.
/test e2e-aws-operator

@openshift-ci
Contributor

openshift-ci bot commented Jun 21, 2023

@Miciah: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-aws-ovn-single-node (commit 6cd3b67, required: false)
Rerun command: /test e2e-aws-ovn-single-node


@openshift-merge-robot openshift-merge-robot merged commit 038b86c into openshift:master Jun 21, 2023
@openshift-ci-robot
Contributor

@Miciah: Jira Issue OCPBUGS-13190: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-13190 has been moved to the MODIFIED state.


@candita
Contributor

candita commented May 20, 2024

/cherrypick release-4.13 release 4.12

@openshift-cherrypick-robot

@candita: #927 failed to apply on top of branch "release-4.13":

Applying: Avoid spurious updates for internalTrafficPolicy
Using index info to reconstruct a base tree...
M	pkg/manifests/bindata.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/manifests/bindata.go
CONFLICT (content): Merge conflict in pkg/manifests/bindata.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Avoid spurious updates for internalTrafficPolicy
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


@openshift-merge-robot
Contributor

Fix included in accepted release 4.13.0-0.nightly-2024-05-31-202348
