
OCPBUGS-13190: Avoid spurious updates for internalTrafficPolicy #927

Conversation

Miciah
Contributor

@Miciah Miciah commented May 5, 2023

Avoid spurious updates for internalTrafficPolicy

Specify spec.internalTrafficPolicy on NodePort- and ClusterIP-type services that the operator manages. Also, ignore updates to the spec.ipFamilies and spec.ipFamilyPolicy fields.

Before this PR, the update logic for NodePort- and ClusterIP-type services would try to revert the default values that the API set for these fields.

  • assets/router/service-cloud.yaml:
  • assets/router/service-internal.yaml: Specify internalTrafficPolicy: Cluster.
  • pkg/manifests/bindata.go: Regenerate.
  • pkg/operator/controller/ingress/internal_service.go (internalServiceChanged): Ignore spec.ipFamilies and spec.ipFamilyPolicy.
  • pkg/operator/controller/ingress/internal_service_test.go (Test_desiredInternalIngressControllerService): Verify that spec.internalTrafficPolicy is set to "Cluster".
    (Test_internalServiceChanged): Verify that changes to spec.internalTrafficPolicy are detected and that changes to spec.ipFamilies and spec.ipFamilyPolicy are ignored.
  • pkg/operator/controller/ingress/nodeport_service.go (desiredNodePortService): Set spec.internalTrafficPolicy to "Cluster".
    (nodePortServiceChanged): Ignore spec.ipFamilies and spec.ipFamilyPolicy.
  • pkg/operator/controller/ingress/nodeport_service_test.go (TestDesiredNodePortService): Verify that spec.internalTrafficPolicy is set to "Cluster".
    (TestNodePortServiceChanged): Verify that changes to spec.internalTrafficPolicy are detected and that changes to spec.ipFamilies and spec.ipFamilyPolicy are ignored.
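
For readers unfamiliar with how this class of spurious update is usually avoided, here is a minimal, self-contained Go sketch of the approach described above. It is illustrative rather than the operator's actual code: the helper names desiredNodePortServiceSpec and serviceSpecChanged are made up for this example. The desired service sets spec.internalTrafficPolicy explicitly, and the changed-detection ignores the API-defaulted spec.ipFamilies and spec.ipFamilyPolicy fields while still catching real changes.

package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
	"github.com/google/go-cmp/cmp/cmpopts"
	corev1 "k8s.io/api/core/v1"
)

// desiredNodePortServiceSpec (illustrative) sets InternalTrafficPolicy
// explicitly so that the API server does not need to default it and the
// desired and current objects agree on the field.
func desiredNodePortServiceSpec() corev1.ServiceSpec {
	trafficPolicy := corev1.ServiceInternalTrafficPolicyCluster
	return corev1.ServiceSpec{
		Type:                  corev1.ServiceTypeNodePort,
		InternalTrafficPolicy: &trafficPolicy,
	}
}

// serviceSpecChanged (illustrative) ignores spec.ipFamilies and
// spec.ipFamilyPolicy, which the API server defaults and the operator does
// not manage, while still detecting changes to spec.internalTrafficPolicy.
func serviceSpecChanged(current, expected *corev1.ServiceSpec) bool {
	return !cmp.Equal(current, expected,
		cmpopts.IgnoreFields(corev1.ServiceSpec{}, "IPFamilies", "IPFamilyPolicy"),
		cmpopts.EquateEmpty(),
	)
}

func main() {
	expected := desiredNodePortServiceSpec()

	// Simulate the API server defaulting fields that the operator leaves unset.
	current := desiredNodePortServiceSpec()
	ipFamilyPolicy := corev1.IPFamilyPolicySingleStack
	current.IPFamilies = []corev1.IPFamily{corev1.IPv4Protocol}
	current.IPFamilyPolicy = &ipFamilyPolicy

	fmt.Println(serviceSpecChanged(&current, &expected)) // false: defaulted fields are ignored

	local := corev1.ServiceInternalTrafficPolicyLocal
	current.InternalTrafficPolicy = &local
	fmt.Println(serviceSpecChanged(&current, &expected)) // true: a real policy change is detected
}

Setting the field in the manifests and skipping the defaulted fields in the comparison are two sides of the same fix: either way, the operator stops seeing a difference between what it wants and what the API server stored.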

Ignore updates for null versus empty annotations

Ignore updates to annotations when the update is from null to empty or vice versa.

  • pkg/operator/controller/ingress/internal_service.go (internalServiceChanged): Use EquateEmpty when comparing annotations.
  • pkg/operator/controller/ingress/internal_service_test.go (TestInternalServiceChangedEmptyAnnotations): New test to verify that internalServiceChanged treats empty and null annotations as equal.
  • pkg/operator/controller/ingress/load_balancer_service_test.go (TestLoadBalancerServiceChangedEmptyAnnotations): New test to verify that loadBalancerServiceChanged treats empty and null annotations as equal.
  • pkg/operator/controller/ingress/nodeport_service.go (nodePortServiceChanged): Use EquateEmpty when comparing annotations.
  • pkg/operator/controller/ingress/nodeport_service_test.go (TestNodePortServiceChangedEmptyAnnotations): New test to verify that nodePortServiceChanged treats empty and null annotations as equal.
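
As with the first commit, a minimal sketch may help. The snippet below is illustrative rather than the operator's actual code, and the annotation key is hypothetical; it shows how cmpopts.EquateEmpty makes a nil annotations map and an empty annotations map compare as equal, so that difference no longer registers as a change.

package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
	"github.com/google/go-cmp/cmp/cmpopts"
)

// annotationsChanged (illustrative) reports whether the managed annotations
// differ. With cmpopts.EquateEmpty, nil and empty maps are treated as equal,
// so null-versus-empty annotations no longer trigger an update.
func annotationsChanged(current, expected map[string]string) bool {
	return !cmp.Equal(current, expected, cmpopts.EquateEmpty())
}

func main() {
	var current map[string]string   // nil: object returned with no annotations
	expected := map[string]string{} // empty: manifest rendered with an empty annotations block

	fmt.Println(annotationsChanged(current, expected)) // false: nil and empty are equated

	expected["example.openshift.io/hypothetical-annotation"] = "value"
	fmt.Println(annotationsChanged(current, expected)) // true: a real annotation change is detected
}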

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 5, 2023
@openshift-ci-robot
Contributor

@Miciah: This pull request references Jira Issue OCPBUGS-13190, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.


@gcs278
Contributor

gcs278 commented May 5, 2023

/assign
I'll take a look at this early next week

@Miciah
Contributor Author

Miciah commented May 8, 2023

This appears to be another spurious update in the ingress-operator logs from the last e2e-aws-operator job run:

2023-05-05T22:55:16.629Z  INFO  operator.ingress_controller  ingress/nodeport_service.go:83  updated NodePort service  {"namespace": "openshift-ingress", "name": "router-nodeport-local-with-fallback", "diff":
  &v1.Service{
  	TypeMeta:   {},
  	ObjectMeta: {Name: "router-nodeport-local-with-fallback", Namespace: "openshift-ingress", UID: "c646ae30-900d-46ed-9562-a62edebe24ee", ResourceVersion: "35457", ...},
  	Spec: v1.ServiceSpec{
  		Ports:                    {{Name: "http", Protocol: "TCP", Port: 80, TargetPort: {Type: 1, StrVal: "http"}, ...}, {Name: "https", Protocol: "TCP", Port: 443, TargetPort: {Type: 1, StrVal: "https"}, ...}, {Name: "metrics", Protocol: "TCP", Port: 1936, TargetPort: {Type: 1, StrVal: "metrics"}, ...}},
  		Selector:                 {"ingresscontroller.operator.openshift.io/deployment-ingresscontroller": "local-with-fallback"},
  		ClusterIP:                "172.30.38.74",
- 		ClusterIPs:               []string{"172.30.38.74"},
+ 		ClusterIPs:               nil,
  		Type:                     "NodePort",
  		ExternalIPs:              nil,
- 		SessionAffinity:          "None",
+ 		SessionAffinity:          "",
  		LoadBalancerIP:           "",
  		LoadBalancerSourceRanges: nil,
  		... // 3 identical fields
  		PublishNotReadyAddresses:      false,
  		SessionAffinityConfig:         nil,
- 		IPFamilies:                    []v1.IPFamily{"IPv4"},
+ 		IPFamilies:                    nil,
- 		IPFamilyPolicy:                &"SingleStack",
+ 		IPFamilyPolicy:                nil,
  		AllocateLoadBalancerNodePorts: nil,
  		LoadBalancerClass:             nil,
  		InternalTrafficPolicy:         &"Cluster",
  	},
  	Status: {},
  }
}

With this PR, the operator ignores all the fields that the diff shows changed, so my guess is that the problem is with null versus empty annotations. I've pushed a3ec26a to treat null and empty annotations as equal.
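
For context, a nil map and an empty map are distinct values to a plain deep comparison, which is all it takes for the update logic to think the service changed. A tiny standalone illustration (not the operator's code):

package main

import (
	"fmt"
	"reflect"
)

func main() {
	var nilAnnotations map[string]string    // what a client returns when no annotations are set
	emptyAnnotations := map[string]string{} // what a manifest with an empty annotations block produces

	// A plain deep comparison distinguishes the two, so the operator would
	// keep trying to "fix" the difference and log a spurious update.
	fmt.Println(reflect.DeepEqual(nilAnnotations, emptyAnnotations)) // false
}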

@openshift-ci-robot
Contributor

@Miciah: This pull request references Jira Issue OCPBUGS-13190, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan


@Miciah Miciah force-pushed the OCPBUGS-13190-avoid-spurious-updates-for-internalTrafficPolicy branch from a3ec26a to 9c5669c on May 9, 2023 07:07
@Miciah
Contributor Author

Miciah commented May 9, 2023

@gcs278
Contributor

gcs278 commented May 9, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 9, 2023
@Miciah
Contributor Author

Miciah commented May 9, 2023

e2e-aws-operator failed on TestUnmanagedDNSToManagedDNSInternalIngressController. The ingress-operator logs from that CI run show no spurious "updated NodePort service" or "updated internal service" messages. I also checked the e2e-aws-operator run from before the rebase, and there too, the ingress-operator logs show no spurious updates.
/test e2e-aws-operator

e2e-aws-ovn failed because [bz-kube-apiserver][invariant] alert/KubeAPIErrorBudgetBurn should not be at or above info failed:

{  KubeAPIErrorBudgetBurn was at or above info for at least 50s on platformidentification.JobType{Release:"4.14", FromRelease:"", Platform:"aws", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 18m28s, firing for 50s:

May 09 07:57:00.375 - 50s   E alert/KubeAPIErrorBudgetBurn ns/openshift-kube-apiserver ALERTS{alertname="KubeAPIErrorBudgetBurn", alertstate="firing", long="6h", namespace="openshift-kube-apiserver", prometheus="openshift-monitoring/k8s", severity="critical", short="30m"}}

/test e2e-aws-ovn

@Miciah
Contributor Author

Miciah commented May 9, 2023

e2e-gcp-ovn failed:

INFO[2023-05-09T07:50:09Z] Running multi-stage phase test               
INFO[2023-05-09T07:50:09Z] Running step e2e-gcp-ovn-openshift-e2e-test. 
WARN[2023-05-09T08:21:01Z] Pod e2e-gcp-ovn-openshift-e2e-test is being unexpectedly deleted 
WARN[2023-05-09T08:21:01Z] Pod e2e-gcp-ovn-openshift-e2e-test is being unexpectedly deleted 
INFO[2023-05-09T08:31:50Z] Logs for container test in pod e2e-gcp-ovn-openshift-e2e-test: 
INFO[2023-05-09T08:31:50Z] failed to try resolving symlinks in path "/var/log/pods/ci-op-mwibll67_e2e-gcp-ovn-openshift-e2e-test_c378d414-fe0e-41ec-977f-dc43aa6c822f/test/0.log": lstat /var/log/pods/ci-op-mwibll67_e2e-gcp-ovn-openshift-e2e-test_c378d414-fe0e-41ec-977f-dc43aa6c822f/test/0.log: no such file or directory 
WARN[2023-05-09T08:31:50Z] Pod e2e-gcp-ovn-openshift-e2e-test is being unexpectedly deleted 
WARN[2023-05-09T08:31:50Z] failed to get object after finishing watch    error=pods "e2e-gcp-ovn-openshift-e2e-test" not found
INFO[2023-05-09T08:31:50Z] Step e2e-gcp-ovn-openshift-e2e-test failed after 41m40s. 
INFO[2023-05-09T08:31:50Z] Step phase test failed after 41m40s.         

/test e2e-gcp-ovn

e2e-aws-operator failed because of TestUnmanagedDNSToManagedDNSInternalIngressController again. That failure has been observed on other PRs and reported as OCPBUGS-10983.
/override ci/prow/e2e-aws-operator

@openshift-ci
Contributor

openshift-ci bot commented May 9, 2023

@Miciah: Overrode contexts on behalf of Miciah: ci/prow/e2e-aws-operator


@gcs278
Contributor

gcs278 commented May 22, 2023

I reviewed again, still looks good to me. Thanks for the fix.
/approve

@openshift-ci
Contributor

openshift-ci bot commented May 22, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gcs278


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 22, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 960d410 and 2 for PR HEAD 9c5669c in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 108acb3 and 1 for PR HEAD 9c5669c in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 90d2fea and 0 for PR HEAD 9c5669c in total

@openshift-ci-robot
Contributor

/hold

Revision 9c5669c was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 24, 2023
@Miciah
Contributor Author

Miciah commented May 27, 2023

e2e-hypershift failed because the "e2e-hypershift-hypershift-install" step timed out "Waiting for operator deployment to be observed".

I don't know where to start with diagnosing e2e-hypershift failures. The CI errors are not meaningful to me. @enxebre, how can my team diagnose failures with this new job?

I did notice that https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/927/pull-ci-openshift-cluster-ingress-operator-master-e2e-hypershift/1661465639627264000/artifacts/e2e-hypershift/dump-management-cluster/artifacts/namespaces/hypershift/core/pods/logs/operator-d98677dc8-dk7ks-operator.log has thousands of error messages, but most of them are "manifest unknown" or "unauthorized: authentication required" errors when attempting to pull images from registry.ci.openshift.org.

Hoping this is a transient failure...
/test e2e-hypershift

@Miciah
Contributor Author

Miciah commented Jun 1, 2023

e2e-aws-ovn failed because of OCPBUGS-14321, which should be fixed by openshift/origin#27955.
/test e2e-aws-ovn

e2e-aws-ovn-upgrade failed because [sig-arch] events should not repeat pathologically for namespace openshift-monitoring failed:

{  5 events happened too frequently

event happened 21 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:19Z To: 15:40:20Z result=reject 
event happened 22 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:20Z To: 15:40:21Z result=reject 
event happened 23 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:21Z To: 15:40:22Z result=reject 
event happened 24 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:22Z To: 15:40:23Z result=reject 
event happened 25 times, something is wrong: ns/openshift-monitoring pod/prometheus-k8s-1 hmsg/0a0f2e79b2 - pathological/true reason/FailedAttachVolume AttachVolume.Attach failed for volume "pvc-cf969581-a629-48c8-b8f5-6fe872e3b416" : rpc error: code = FailedPrecondition desc =  From: 15:40:24Z To: 15:40:25Z result=reject }

I saw a couple similar failures in search.ci, so I filed OCPBUGS-14400 for these failures.
/test e2e-aws-ovn-upgrade

@Miciah Miciah force-pushed the OCPBUGS-13190-avoid-spurious-updates-for-internalTrafficPolicy branch from 9c5669c to 6cd3b67 on June 6, 2023 12:55
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 6, 2023
@Miciah
Contributor Author

Miciah commented Jun 6, 2023

Rebased for #939 and #906.
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 6, 2023
@frobware
Contributor

frobware commented Jun 6, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 6, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD e068d04 and 2 for PR HEAD 6cd3b67 in total

@Miciah
Contributor Author

Miciah commented Jun 6, 2023

/hold
We're going to try to get #928 in next.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 6, 2023
@Miciah
Contributor Author

Miciah commented Jun 8, 2023

The e2e-hypershift job for this PR passed earlier, but the same job has been failing on #928. I want to see whether e2e-hypershift is still passing on this PR (indicating that #928 is causing a failure) or whether it fails now (indicating that something else broke e2e-hypershift).
/test e2e-hypershift

@Miciah
Contributor Author

Miciah commented Jun 12, 2023

/hold cancel
/test all
now that #928 has merged.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 12, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 0e500e6 and 2 for PR HEAD 6cd3b67 in total

@Miciah
Contributor Author

Miciah commented Jun 13, 2023

e2e-aws-ovn-serial failed on [sig-storage] PersistentVolumes-local Stress with local volumes [Serial] should be able to process many pods and reuse local volumes:

{  fail [test/e2e/storage/persistent_volumes-local.go:522]: persistentvolumes "local-pvlh4qq" not found
Error: exit with code 1
Ginkgo exit error 1: exit with code 1}

Search.ci shows several similar failures for other jobs, so I filed OCPBUGS-14930 to track the issue. However, the failure rate appears to be low.
/test e2e-aws-ovn-serial

e2e-aws-operator failed on TestAWSELBConnectionIdleTimeout, which is known to be flaky (see OCPBUGS-13810), and on TestRouterCompressionOperation:

=== RUN   TestAll/serial/TestRouterCompressionOperation
    router_compression_test.go:102: failed to update ingress controller, retrying...
    router_compression_test.go:147: failed to apply the required MIME type for test: failed to update ingress controller: Put "https://api.ci-op-gsdssil4-43abb.origin-ci-int-aws.dev.rhcloud.com:6443/apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default": read tcp 10.128.250.45:36442->13.56.185.125:6443: read: connection reset by peer

This error appears to be related to API server flakiness and probably does not represent flakiness on the part of the test itself or any problem with the changes in this PR. In this CI job run, several operators failed to roll out all their pods because the kubelet of one of the control-plane nodes stopped responding; see nodes.json.
/test e2e-aws-operator

@Miciah
Contributor Author

Miciah commented Jun 15, 2023

e2e-aws-operator failed again because TestAWSELBConnectionIdleTimeout failed and because several operators again failed to roll out all of their pods. While investigating the failures, I happened to notice a couple other bugs in the ingress operator, for which I filed bug reports: OCPBUGS-14994 and OCPBUGS-14995. However, these bugs only cause spurious reconciliations and are probably not causing any test failures. I'll continue looking into the CI failures tomorrow, but let's give CI another try in the meantime.
/test e2e-aws-operator

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD b1a6bb5 and 1 for PR HEAD 6cd3b67 in total

@Miciah
Contributor Author

Miciah commented Jun 20, 2023

/test all
since #944 has merged.

@Miciah
Contributor Author

Miciah commented Jun 20, 2023

e2e-aws-operator failed because must-gather failed.
/test e2e-aws-operator

e2e-aws-ovn-single-node failed because [sig-scheduling][Early] The HAProxy router pods [apigroup:route.openshift.io] should be scheduled on different nodes failed:

{  fail [github.com/openshift/origin/test/extended/scheduling/pods.go:102]: ns/openshift-ingress pod router-default-58ffdd7f97-7drtv and pod router-default-58ffdd7f97-44cps are running on the same node: ip-10-0-151-26.us-west-1.compute.internal
Ginkgo exit error 1: exit with code 1}

Having two pods is unexpected because this job creates a single-node cluster, and the deployment and replicaset have replicas: 1. The events show that both pods were scheduled at the same time, after the node became ready, but one of the pods failed the NodeAffinity predicate:

% jq -rc < events.json '.items|sort_by(.metadata.creationTimestamp)|.[]|select((.involvedObject.name//""|contains("router-default-58ffdd7f97")) or .reason=="NodeReady")|.metadata.creationTimestamp+" ("+(.count//1|tostring)+"x) "+.reason+": "+.message'  
2023-06-20T13:23:14Z (1x) NodeReady: Node ip-10-0-151-26.us-west-1.compute.internal status is now: NodeReady
2023-06-20T13:23:56Z (1x) FailedScheduling: 0/1 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
2023-06-20T13:23:56Z (1x) SuccessfulCreate: Created pod: router-default-58ffdd7f97-7drtv
2023-06-20T13:24:41Z (1x) Scheduled: Successfully assigned openshift-ingress/router-default-58ffdd7f97-44cps to ip-10-0-151-26.us-west-1.compute.internal
2023-06-20T13:24:41Z (1x) Scheduled: Successfully assigned openshift-ingress/router-default-58ffdd7f97-7drtv to ip-10-0-151-26.us-west-1.compute.internal
2023-06-20T13:24:41Z (1x) NodeAffinity: Predicate NodeAffinity failed
2023-06-20T13:24:41Z (1x) SuccessfulCreate: Created pod: router-default-58ffdd7f97-44cps
2023-06-20T13:24:49Z (1x) AddedInterface: Add eth0 [10.128.0.60/23] from ovn-kubernetes
2023-06-20T13:24:50Z (1x) Pulling: Pulling image "registry.build01.ci.openshift.org/ci-op-cq7jwkqw/stable@sha256:117cb3f5b2dc2e3b9d7a1c66334ff6e861e62915af095a8ffdca32f1f1d60aa8"
2023-06-20T13:25:15Z (1x) Pulled: Successfully pulled image "registry.build01.ci.openshift.org/ci-op-cq7jwkqw/stable@sha256:117cb3f5b2dc2e3b9d7a1c66334ff6e861e62915af095a8ffdca32f1f1d60aa8" in 25.733799773s (25.7338107s including waiting)
2023-06-20T13:25:16Z (1x) Created: Created container router
2023-06-20T13:25:16Z (1x) Started: Started container router
2023-06-20T13:25:17Z (11x) ProbeError: Startup probe error: HTTP probe failed with statuscode: 500
body: [-]backend-proxy-http failed: reason withheld
[-]has-synced failed: reason withheld
[+]process-running ok
healthz check failed


2023-06-20T13:25:17Z (10x) Unhealthy: Startup probe failed: HTTP probe failed with statuscode: 500

The ingress operator does use the OnePodPerNodeController controller to delete misscheduled pods, but this controller ignores not-ready pods.

The node predicate only excludes remote nodes:

                            "nodeAffinity": {
                                "requiredDuringSchedulingIgnoredDuringExecution": {
                                    "nodeSelectorTerms": [
                                        {
                                            "matchExpressions": [
                                                {
                                                    "key": "node.openshift.io/remote-worker",
                                                    "operator": "NotIn",
                                                    "values": [
                                                        ""
                                                    ]
                                                }
                                            ]
                                        }
                                    ]
                                }
                            }

I don't know why the scheduler scheduled two pods, or why one of them failed the node predicate. Maybe this failure was a fluke.

Also, [sig-arch] Only known images used by tests failed:

{  Cluster accessed images that were not mirrored to the testing repository or already part of the cluster, see test/extended/util/image/README.md in the openshift/origin repo:

registry.redhat.io/rhel8/support-tools from pods:
  ns/e2e-test-egress-router-cni-e2e-vpksf pod/ip-10-0-151-26us-west-1computeinternal-debug-xr9wj node/ip-10-0-151-26.us-west-1.compute.internal
}

Let's see whether it's an eventual consistency issue with image mirroring.
/test e2e-aws-ovn-single-node

@Miciah
Contributor Author

Miciah commented Jun 21, 2023

e2e-aws-operator failed again because must-gather failed.
/test e2e-aws-operator

@openshift-ci
Contributor

openshift-ci bot commented Jun 21, 2023

@Miciah: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-aws-ovn-single-node (commit 6cd3b67, required: false)
Rerun command: /test e2e-aws-ovn-single-node


@openshift-merge-robot openshift-merge-robot merged commit 038b86c into openshift:master Jun 21, 2023
@openshift-ci-robot
Contributor

@Miciah: Jira Issue OCPBUGS-13190: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-13190 has been moved to the MODIFIED state.


@candita
Contributor

candita commented May 20, 2024

/cherrypick release-4.13 release 4.12

@openshift-cherrypick-robot

@candita: #927 failed to apply on top of branch "release-4.13":

Applying: Avoid spurious updates for internalTrafficPolicy
Using index info to reconstruct a base tree...
M	pkg/manifests/bindata.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/manifests/bindata.go
CONFLICT (content): Merge conflict in pkg/manifests/bindata.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Avoid spurious updates for internalTrafficPolicy
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


@openshift-merge-robot
Contributor

Fix included in accepted release 4.13.0-0.nightly-2024-05-31-202348
