service IPs and ports are not released when deleting a service via a finalizer-removing update #87603

Closed
chrischdi opened this issue Jan 28, 2020 · 27 comments · Fixed by #96684
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@chrischdi
Member

chrischdi commented Jan 28, 2020

What happened:

  • Created a service including a finalizer
  • Triggered deletion of the service
  • Removed finalizer
  • Service got deleted by apiserver (not visible anymore via kubectl)
  • Tried to create service again
  • Creation got denied: The Service "foo" is invalid: spec.ports[0].nodePort: Invalid value: 30003: provided port is already allocated
    The apiserver logs the following lines prior to this happening:
    E0128 08:15:36.920788       1 repair.go:145] the node port 30003 for service foo/default is not allocated; repairing
    E0128 08:15:36.920837       1 repair.go:237] the cluster IP 10.0.0.81 for service foo/default is not allocated; repairing
    
  • After about 10 minutes I'm able to create the service again; the apiserver shows the following log lines when it repairs it:
    E0128 08:28:51.429642       1 repair.go:184] the node port 30003 appears to have leaked: cleaning up
    E0128 08:28:51.436350       1 repair.go:311] the cluster IP 10.0.0.81 appears to have leaked: cleaning up
    

What you expected to happen:

  • The service can be created again a few seconds after deletion

How to reproduce it (as minimally and precisely as possible):

cd $(mktemp -d)
mkdir etcd
docker run -d -p 2379:2379 --name=kube-etcd -v $(pwd)/etcd:/tmp/ --rm k8s.gcr.io/etcd:3.3.15 /usr/local/bin/etcd --data-dir /tmp/etcd --advertise-client-urls=http://0.0.0.0:2379 --listen-client-urls=http://0.0.0.0:2379
docker run -d --net=host --name=kube-apiserver --rm k8s.gcr.io/kube-apiserver:v1.17.2 kube-apiserver --etcd-servers http://127.0.0.1:2379 --insecure-port 8080 --authorization-mode=RBAC

export KUBECONFIG=$(pwd)/kubeconfig
touch $KUBECONFIG
kubectl config set-cluster etcd-local --server=http://localhost:8080
kubectl config set-context etcd-local --cluster=etcd-local
kubectl config use-context etcd-local

cat <<EOF > service.yaml
apiVersion: v1
kind: Service
metadata:
  name: foo
  finalizers:
  - foo.bar/some-finalizer
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 8080
    nodePort: 30003
  selector:
    app: kuard
  type: NodePort

EOF

for i in {1..200}; do
  echo "[$(date +%Y-%m-%d-%H:%M:%S)] # $i"
  kubectl apply -f service.yaml
  kubectl delete svc foo --wait=false
  sleep 1
  kubectl patch svc foo --type='json' -p='[{"op":"remove","path":"/metadata/finalizers"}]'
  kubectl delete svc foo --ignore-not-found
  sleep 1
done

Example output:

[2020-01-28-08:55:25] # 1                                                                        
service/foo unchanged                                                                                         
service "foo" deleted                                 
service/foo patched                                                      
...
[2020-01-28-08:58:21] # 77
service/foo created
service "foo" deleted
service/foo patched
[2020-01-28-08:58:23] # 78
service/foo created
service "foo" deleted
service/foo patched
[2020-01-28-08:58:26] # 79
The Service "foo" is invalid: spec.ports[0].nodePort: Invalid value: 30003: provided port is already allocated
Error from server (NotFound): services "foo" not found
Error from server (NotFound): services "foo" not found
[2020-01-28-08:58:28] # 80
The Service "foo" is invalid: spec.ports[0].nodePort: Invalid value: 30003: provided port is already allocated
Error from server (NotFound): services "foo" not found
Error from server (NotFound): services "foo" not found
[2020-01-28-08:58:30] # 81
The Service "foo" is invalid: spec.ports[0].nodePort: Invalid value: 30003: provided port is already allocated
Error from server (NotFound): services "foo" not found
Error from server (NotFound): services "foo" not found
...
[2020-01-28-09:07:23] # 5
The Service "foo" is invalid: spec.ports[0].nodePort: Invalid value: 30003: provided port is already allocated
Error from server (NotFound): services "foo" not found
Error from server (NotFound): services "foo" not found
[2020-01-28-09:07:26] # 6
service/foo created
service "foo" deleted
service/foo patched

Anything else we need to know?:

  • This does not always happen; it is flaky.

  • The problem also gets auto-resolved by the apiserver after some time (but this can take about 10 minutes):
    E0128 08:01:24.562044 1 repair.go:300] the cluster IP 10.0.0.215 may have leaked: flagging for later clean up

  • Background for us: we want to run a custom controller for services of type LoadBalancer and want to use a finalizer. We occasionally hit this issue during development.

Environment:

  • Kubernetes version (use kubectl version):
    $ kubectl version
    Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.6", GitCommit:"72c30166b2105cd7d3350f2c28a219e6abcd79eb", GitTreeState:"clean", BuildDate:"2020-01-18T23:31:31Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:22:30Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
    
  • Cloud provider or hardware configuration: none / locally reproducible / all are affected
  • OS (e.g: cat /etc/os-release):
    $ cat /etc/os-release
    NAME="Ubuntu"
    VERSION="18.04.3 LTS (Bionic Beaver)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.04.3 LTS"
    VERSION_ID="18.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=bionic
    UBUNTU_CODENAME=bionic
    
  • Kernel (e.g. uname -a):
    $ uname -a
    Linux 5.3.0-26-generic #28~18.04.1-Ubuntu SMP Wed Dec 18 16:40:14 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:
@chrischdi chrischdi added the kind/bug Categorizes issue or PR as related to a bug. label Jan 28, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 28, 2020
@chrischdi
Member Author

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 28, 2020
@chrischdi
Member Author

chrischdi commented Jan 28, 2020

Maybe also helpful: the output of kubectl get events:

$ k get events -o wide
LAST SEEN   TYPE      REASON                  OBJECT        SUBOBJECT   SOURCE                            MESSAGE                                             FIRST SEEN   COUNT   NAME
42m         Warning   ClusterIPNotAllocated   service/foo               ipallocator-repair-controller     Cluster IP 10.0.0.100 is not allocated; repairing   42m          1       foo.15edfdd270043df5
42m         Warning   PortNotAllocated        service/foo               portallocator-repair-controller   Port 30003 is not allocated; repairing              42m          1       foo.15edfdd270314403
39m         Warning   ClusterIPNotAllocated   service/foo               ipallocator-repair-controller     Cluster IP 10.0.0.215 is not allocated; repairing   39m          1       foo.15edfdfc5961dbfc
39m         Warning   PortNotAllocated        service/foo               portallocator-repair-controller   Port 30003 is not allocated; repairing              39m          1       foo.15edfdfc59620b05
22m         Warning   PortNotAllocated        service/foo               portallocator-repair-controller   Port 30003 is not allocated; repairing              22m          1       foo.15edfeecb777d9c1
22m         Warning   ClusterIPNotAllocated   service/foo               ipallocator-repair-controller     Cluster IP 10.0.0.81 is not allocated; repairing    22m          1       foo.15edfeecb778ae2c
6m24s       Warning   PortNotAllocated        service/foo               portallocator-repair-controller   Port 30003 is not allocated; repairing              6m24s        1       foo.15edffcf9d0dae46
6m24s       Warning   ClusterIPNotAllocated   service/foo               ipallocator-repair-controller     Cluster IP 10.0.0.136 is not allocated; repairing   6m24s        1       foo.15edffcf9d9e2f30

When these events are emitted, the problem occurs - so the repair loop seems to be what breaks things in this case.
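
For context on why the self-heal takes roughly 10 minutes: the allocator repair loops only release an allocation after it has looked leaked for several consecutive runs. Below is a rough, self-contained Go sketch of that behaviour; the threshold, interval, and all names here are assumptions inferred from the log messages above, not the actual kube-apiserver code.

package main

import "fmt"

// Illustrative sketch of the leak handling in the allocator repair loops.
const repairsBeforeLeakCleanup = 3 // assumed: freed only after several consecutive repair runs (~3 min apart)

var leaked = map[string]int{} // allocation -> consecutive repair runs with no owning Service

func runRepair(allocated []string, inUse map[string]bool) {
    for _, ip := range allocated {
        if inUse[ip] {
            delete(leaked, ip) // still referenced by a Service: not a leak
            continue
        }
        leaked[ip]++
        if leaked[ip] >= repairsBeforeLeakCleanup {
            fmt.Printf("the cluster IP %s appears to have leaked: cleaning up\n", ip)
            delete(leaked, ip) // release the allocation so it can be reused
        } else {
            fmt.Printf("the cluster IP %s may have leaked: flagging for later clean up\n", ip)
        }
    }
}

func main() {
    // The IP stays allocated after the finalizer-removing delete even though the
    // Service object is gone, so several repair runs pass before it is released.
    for run := 0; run < 3; run++ {
        runRepair([]string{"10.0.0.81"}, map[string]bool{})
    }
}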

@liggitt
Member

liggitt commented Jan 28, 2020

The service registry overrides the Delete implementation to free allocated IPs/ports after a Delete API request:

func (rs *REST) Delete(ctx context.Context, id string, deleteValidation rest.ValidateObjectFunc, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
    // TODO: handle graceful
    obj, _, err := rs.services.Delete(ctx, id, deleteValidation, options)
    if err != nil {
        return nil, false, err
    }
    svc := obj.(*api.Service)
    // Only perform the cleanup if this is a non-dryrun deletion
    if !dryrun.IsDryRun(options.DryRun) {
        // TODO: can leave dangling endpoints, and potentially return incorrect
        // endpoints if a new service is created with the same name
        _, _, err = rs.endpoints.Delete(ctx, id, rest.ValidateAllObjectFunc, &metav1.DeleteOptions{})
        if err != nil && !errors.IsNotFound(err) {
            return nil, false, err
        }
        rs.releaseAllocatedResources(svc)
    }

but it does not implement an AfterDelete hook in its strategy, so it does not participate in the object deletion that happens when an Update API request removes the last finalizer from a service that already has a deletionTimestamp set:

    // Check the default delete-during-update conditions, and store-specific conditions if provided
    if ShouldDeleteDuringUpdate(ctx, key, obj, existing) &&
        (e.ShouldDeleteDuringUpdate == nil || e.ShouldDeleteDuringUpdate(ctx, key, obj, existing)) {
        deleteObj = obj
        return nil, nil, errEmptiedFinalizers
    }
    ttl, err := e.calculateTTL(obj, res.TTL, true)
    if err != nil {
        return nil, nil, err
    }
    if int64(ttl) != res.TTL {
        return obj, &ttl, nil
    }
    return obj, nil, nil
}, dryrun.IsDryRun(options.DryRun))

if err != nil {
    // delete the object
    if err == errEmptiedFinalizers {
        return e.deleteWithoutFinalizers(ctx, name, key, deleteObj, storagePreconditions, dryrun.IsDryRun(options.DryRun))
    }

// deleteWithoutFinalizers handles deleting an object ignoring its finalizer list.
// Used for objects that are either been finalized or have never initialized.
func (e *Store) deleteWithoutFinalizers(ctx context.Context, name, key string, obj runtime.Object, preconditions *storage.Preconditions, dryRun bool) (runtime.Object, bool, error) {
    out := e.NewFunc()
    klog.V(6).Infof("going to delete %s from registry, triggered by update", name)
    // Using the rest.ValidateAllObjectFunc because the request is an UPDATE request and has already passed the admission for the UPDATE verb.
    if err := e.Storage.Delete(ctx, key, out, preconditions, rest.ValidateAllObjectFunc, dryRun); err != nil {
        // Deletion is racy, i.e., there could be multiple update
        // requests to remove all finalizers from the object, so we
        // ignore the NotFound error.
        if storage.IsNotFound(err) {
            _, err := e.finalizeDelete(ctx, obj, true)
            // clients are expecting an updated object if a PUT succeeded,
            // but finalizeDelete returns a metav1.Status, so return
            // the object in the request instead.
            return obj, false, err
        }
        return nil, false, storeerr.InterpretDeleteError(err, e.qualifiedResourceFromContext(ctx), name)
    }
    _, err := e.finalizeDelete(ctx, out, true)

// finalizeDelete runs the Store's AfterDelete hook if runHooks is set and
// returns the decorated deleted object if appropriate.
func (e *Store) finalizeDelete(ctx context.Context, obj runtime.Object, runHooks bool) (runtime.Object, error) {
    if runHooks && e.AfterDelete != nil {
        if err := e.AfterDelete(obj); err != nil {
            return nil, err
        }
    }
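
As a minimal, illustrative sketch (not the actual change that later landed in #96684), a hook with the AfterDelete signature shown above could release the allocations the same way the Delete override does. serviceStore is a placeholder name and releaseAllocatedResources is the existing helper from the first excerpt:

// Illustrative only: release the cluster IP and node ports when the object is
// deleted via a finalizer-removing update, mirroring the Delete override above.
serviceStore.AfterDelete = func(obj runtime.Object) error {
    svc, ok := obj.(*api.Service)
    if !ok {
        return fmt.Errorf("unexpected object type %T", obj)
    }
    rs.releaseAllocatedResources(svc)
    return nil
}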

/sig network
/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 28, 2020
@k8s-ci-robot
Contributor

@liggitt: Those labels are not set on the issue: sig/api-machinery

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@athenabot

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by vllry. 👩‍🔬

@k8s-ci-robot k8s-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Jan 28, 2020
@liggitt liggitt changed the title kube-apiserver sometimes fails to cleanup services when using finalizers and nodePorts service IPs and ports are not released when deleting a service via a finalizer-removing update Jan 28, 2020
@danwinship
Contributor

/remove-triage unresolved

@k8s-ci-robot k8s-ci-robot removed the triage/unresolved Indicates an issue that can not or will not be resolved. label Feb 6, 2020
@thockin thockin added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 2, 2020
@thockin
Member

thockin commented Apr 2, 2020

@MrHohn this one seems important.

@MrHohn
Member

MrHohn commented Apr 2, 2020

Thanks for the great analysis. My understanding is that we need to implement an AfterDelete hook for Service - I will take a stab :)
/assign

@sparkoo

sparkoo commented Apr 23, 2020

@MrHohn hello, any updates on this?

@MrHohn
Member

MrHohn commented Apr 23, 2020

@sparkoo Sorry for the delay, I implemented a fix locally and am working on a test at the moment. Looking to have a PR out this week.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 22, 2020
@BenTheElder
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 28, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 26, 2020
@aojea
Member

aojea commented Oct 26, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 26, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2021
@chrischdi
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2021
@thockin
Member

thockin commented Mar 4, 2021

We need to revisit this soon

@tomerleib

tomerleib commented Apr 11, 2021

I've seen this as well:

  • K8s 1.17.6
  • Calico
  • Ingress-Nginx 0.45 (chart 3.29)

After deleting the chart from the system, I'm left with the controller service intact.
The events for this service show the same output as above:

11m         Normal    EnsuringLoadBalancer      service/nginx-ingress-ingress-nginx-controller                                   Ensuring load balancer
11m         Normal    EnsuredLoadBalancer       service/nginx-ingress-ingress-nginx-controller                                   Ensured load balancer
3m19s       Normal    DeletingLoadBalancer      service/nginx-ingress-ingress-nginx-controller                                   Deleting load balancer
3m18s       Warning   FailedToCreateEndpoint    endpoints/nginx-ingress-ingress-nginx-controller                                 Failed to create endpoint for service default/nginx-ingress-ingress-nginx-controller: endpoints "nginx-ingress-ingress-nginx-controller" already exists
2m37s       Warning   PortNotAllocated          service/nginx-ingress-ingress-nginx-controller                                   Port 31861 is not allocated; repairing
2m37s       Warning   PortNotAllocated          service/nginx-ingress-ingress-nginx-controller                                   Port 32175 is not allocated; repairing
2m37s       Warning   ClusterIPNotAllocated     service/nginx-ingress-ingress-nginx-controller                                   Cluster IP 10.233.15.55 is not allocated; repairing

@lkoniecz

lkoniecz commented Jun 12, 2021

Any updates on this?
I just got hit by it; a service got stuck in deletion:

  ----     ------                 ----                ----                             -------
  Warning  PortNotAllocated       21m (x2 over 16h)   portallocator-repair-controller  Port 31704 is not allocated; repairing
  Warning  PortNotAllocated       21m (x2 over 16h)   portallocator-repair-controller  Port 30651 is not allocated; repairing
  Warning  PortNotAllocated       21m (x2 over 16h)   portallocator-repair-controller  Port 32757 is not allocated; repairing
  Warning  PortNotAllocated       21m (x2 over 16h)   portallocator-repair-controller  Port 31217 is not allocated; repairing
  Warning  PortNotAllocated       21m (x2 over 16h)   portallocator-repair-controller  Port 31105 is not allocated; repairing
  Normal   EnsuringLoadBalancer   10m (x236 over 8d)  service-controller               Ensuring load balancer
  Warning  PortNotAllocated       10m (x10 over 18h)  portallocator-repair-controller  Port 30651 is not allocated; repairing
  Normal   Type                   5m26s               service-controller               LoadBalancer -> NodePort
  Warning  PortNotAllocated       67s (x12 over 18h)  portallocator-repair-controller  Port 32757 is not allocated; repairing
  Warning  PortNotAllocated       67s (x12 over 18h)  portallocator-repair-controller  Port 31217 is not allocated; repairing
  Warning  PortNotAllocated       67s (x12 over 18h)  portallocator-repair-controller  Port 31105 is not allocated; repairing
  Warning  PortNotAllocated       67s (x12 over 18h)  portallocator-repair-controller  Port 31704 is not allocated; repairing
  Warning  ClusterIPNotAllocated  53s (x14 over 18h)  ipallocator-repair-controller    Cluster IP 172.20.39.20 is not allocated; repairing

@rvillane

I was impacted by this issue today as well; I got stuck trying to delete a service in a Kubernetes 1.18 cluster:

Warning ClusterIPNotAllocated service/myservice-stage-internal Cluster IP 10.32.19.1 is not allocated; repairing

@lkoniecz

lkoniecz commented Jun 16, 2021

For those who are only interested in deleting the service:

kubectl delete svc <your_service>
kubectl patch service/<your_service> --type json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'
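
For reference, the same patch issued programmatically with client-go (a minimal sketch: removeServiceFinalizers is a hypothetical helper, clientset construction and retries are omitted, and it assumes a client-go release with a context-aware Patch, i.e. v0.18+):

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

// removeServiceFinalizers sends the same JSON patch as the kubectl command above,
// clearing all finalizers so the pending delete can complete.
func removeServiceFinalizers(ctx context.Context, c kubernetes.Interface, namespace, name string) error {
    patch := []byte(`[{"op":"remove","path":"/metadata/finalizers"}]`)
    _, err := c.CoreV1().Services(namespace).Patch(ctx, name, types.JSONPatchType, patch, metav1.PatchOptions{})
    return err
}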

@aojea
Member

aojea commented Jun 16, 2021

For those who are only interested in deleting the service

This bug is a "temporary" problem when using finalizers on services, but it doesn't cause the service to get stuck on deletion. If you have to remove a finalizer manually, you should check which controller is supposed to remove it and why it is failing to do so.

@aojea
Member

aojea commented Jun 17, 2021

heh, I managed to reproduce it in #102955
the key to making it deterministic was to wait for the repair loop; it is hardcoded to 3 mins ...

@aojea
Member

aojea commented Jun 17, 2021

/assign

@aojea
Member

aojea commented Jul 1, 2021

/unassign
/assign @thockin
/milestone 1.23
This will be fixed as part of Tim's PR #96684,
but not in 1.22 for sure, sorry

@k8s-ci-robot k8s-ci-robot assigned thockin and unassigned aojea Jul 1, 2021
@k8s-ci-robot
Contributor

@aojea: You must be a member of the kubernetes/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your and have them propose you as an additional delegate for this responsibility.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@collinjlesko

Thought I would add to this:

Having the exact same issue with v1.18.16 on EKS. Initially, namespaces would be stuck in a "Terminating" state. Further research led me to find that services (load balancers) in Kubernetes were hanging around or would persist for 10+ minutes. If you left one alone for a while, it would go away... eventually, although this deletion is normally instant. When running describe, I noticed it would initially say:

Normal DeletingLoadBalancer 102s service-controller Deleting load balancer

Followed by the below about 60 seconds later:

Warning PortNotAllocated 19s portallocator-repair-controller Port 31015 is not allocated; repairing
Warning ClusterIPNotAllocated 19s ipallocator-repair-controller Cluster IP 172.20.131.10 is not allocated; repairing

Then the service would just sit there... until it got deleted ~10 minutes later.

@lkoniecz's solution of:

kubectl patch service/<your_service> --type json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'

works perfectly in the meantime, but we're going to have to overhaul a lot of automation.

Going to downgrade to 1.17, as the issue seems to appear in any version above 1.17.
