fix panic in portallocator and deallocate endpoints when update with none EndpointSpec #1481

Merged
merged 1 commit into from
Sep 13, 2016


allencloud
Contributor

@allencloud allencloud commented Sep 1, 2016

fixes #1480

This PR does the following:

  1. In a service update request, if the service spec has no EndpointSpec but the service still has an allocated endpoint, the user no longer wants that endpoint, so swarmkit needs to deallocate it.

Signed-off-by: allencloud <allen.sun@daocloud.io>

@codecov-io

codecov-io commented Sep 1, 2016

Current coverage is 55.40% (diff: 28.57%)

Merging #1481 into master will decrease coverage by 0.14%

@@             master      #1481   diff @@
==========================================
  Files            82         82          
  Lines         12914      12920     +6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits           7173       7158    -15   
- Misses         4757       4771    +14   
- Partials        984        991     +7   

Powered by Codecov. Last update 3cf210e...33ce23b

@dperny
Collaborator

dperny commented Sep 1, 2016

Ooo, ouch. Good catch. LGTM.

As an aside, can you reproduce this behavior? I'm wondering how a panic went this long without being noticed.

@allencloud
Contributor Author

Actually, I cannot reproduce this issue 100% of the time. My colleague sent me the log; I just read the code and tried to fix it. @dperny

@wrfly

wrfly commented Sep 1, 2016

It reproduces 100% of the time when creating a service with exposed ports.
screenshot from 2016-09-01 15-10-05

And the docker's version and info:
version

info

@dperny

@wrfly

wrfly commented Sep 1, 2016

And the logs:
screenshot from 2016-09-01 15-19-50

@stevvooe
Contributor

stevvooe commented Sep 1, 2016

This doesn't look right. If Spec is nil, then it seems we have no allocation to do, so it should return true. @mrjana

Also, let's make sure there is a unit test for this.

@mrjana
Contributor

mrjana commented Sep 1, 2016

PortAllocator is a little different, since it may always need to reconcile the spec with the actual Endpoint state, specifically for a service update, which can change the port configs. The way serviceAllocatePorts works is that it compares the Spec and the Endpoint PortConfigs, deallocates any old port that is no longer in the Spec, and allocates any new PortConfigs that are in the Spec but not in the Endpoint state. So even if Spec is nil, we have to return false here, because we may have to release ports allocated for the previous version of the ServiceSpec.

But the current crash can happen only if there was a non-nil EndpointSpec before, i.e. there were exposed ports that got removed in a service update. I am not sure how we can trigger this crash during a service create with exposed ports, in which case the Spec cannot be nil.

Even for the case of a service update that removes a previously exposed port, this does not completely fix the problem, since we have to do similar checks in serviceAllocatePorts.
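
The reconciliation described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not swarmkit's actual code: the types `PortConfig`, `EndpointSpec`, and `Endpoint` are simplified stand-ins, and `reconcilePorts` is a hypothetical helper name. It also ignores dynamically assigned published ports, which the real allocator has to handle.

```go
package main

import "fmt"

// PortConfig is a simplified, hypothetical stand-in for a port config.
type PortConfig struct {
	Protocol      string
	TargetPort    uint32
	PublishedPort uint32
}

// EndpointSpec is the desired state; Endpoint is the allocated state.
// Both are illustrative types, not the real swarmkit API.
type EndpointSpec struct{ Ports []PortConfig }
type Endpoint struct{ Ports []PortConfig }

// reconcilePorts returns the ports to allocate (in the spec but not yet
// allocated) and the ports to release (allocated but no longer in the spec).
// A nil spec is treated as "no ports wanted", so every allocated port is
// released instead of the nil pointer being dereferenced.
func reconcilePorts(spec *EndpointSpec, ep *Endpoint) (toAllocate, toRelease []PortConfig) {
	wanted := map[PortConfig]bool{}
	if spec != nil {
		for _, p := range spec.Ports {
			wanted[p] = true
		}
	}
	have := map[PortConfig]bool{}
	if ep != nil {
		for _, p := range ep.Ports {
			have[p] = true
			if !wanted[p] {
				toRelease = append(toRelease, p)
			}
		}
	}
	if spec != nil {
		for _, p := range spec.Ports {
			if !have[p] {
				toAllocate = append(toAllocate, p)
			}
		}
	}
	return toAllocate, toRelease
}

func main() {
	ep := &Endpoint{Ports: []PortConfig{{Protocol: "tcp", TargetPort: 80, PublishedPort: 30000}}}
	// Update arrives with no EndpointSpec: nothing to allocate, one port to release.
	alloc, release := reconcilePorts(nil, ep)
	fmt.Println(len(alloc), len(release)) // prints "0 1"
}
```

Treating a nil spec as an empty port list keeps the create and update paths on the same code, which matches the point that the check cannot simply report "allocated" when the spec is missing.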

@aboch

aboch commented Sep 1, 2016

@allencloud

Actually I can not reproduce this issue 100%.

In order to reproduce, create a service specifying a publish port.
Then, via a REST client, post an update to that service with no EndpointSpec in the JSON.

Your change fixes the panic, but I am wondering whether we should instead reject a request where EndpointSpec is not specified, because currently the CLI client always feeds that in the request.

Also, the CLI allows adding or removing one exposed port at a time, and the CLI client takes care of passing the proper list of ports.

[Note] I did not see @mrjana's reply when I sent this one.
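
To illustrate why an update body without EndpointSpec ends up as a nil pointer on the server side, here is a minimal sketch using `encoding/json`. The struct definitions are simplified assumptions for illustration, not the real API types, and `decodeSpec` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified, hypothetical request types; the real API types have more fields.
type PortConfig struct {
	Protocol   string
	TargetPort uint32
}

type EndpointSpec struct {
	Ports []PortConfig `json:",omitempty"`
}

type ServiceSpec struct {
	Name         string
	EndpointSpec *EndpointSpec `json:",omitempty"`
}

// decodeSpec parses a JSON request body into a ServiceSpec.
func decodeSpec(body string) (ServiceSpec, error) {
	var spec ServiceSpec
	err := json.Unmarshal([]byte(body), &spec)
	return spec, err
}

func main() {
	withPorts, _ := decodeSpec(`{"Name":"web","EndpointSpec":{"Ports":[{"Protocol":"tcp","TargetPort":80}]}}`)
	without, _ := decodeSpec(`{"Name":"web"}`)
	// A pointer field that is absent from the JSON stays nil after decoding.
	fmt.Println(withPorts.EndpointSpec != nil, without.EndpointSpec != nil) // prints "true false"
}
```

Any code that later reads `spec.EndpointSpec.Ports` without a nil check would panic on the second request, which is consistent with the reproduction steps given here.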

@aaronlehmann
Collaborator

Your change fixes the panic, but I am wondering whether we should instead reject a request where EndpointSpec is not specified, because currently the CLI client always feeds that in the request.

I'd prefer not to do this. I think it's better to be more flexible in what we allow. If we always require an EndpointSpec, that's cumbersome for someone using the API if they don't want to specify any endpoints. Also, going down that path leads to making assumptions that objects which already exist have been previously validated, but that's not necessarily the case (i.e. objects created by older versions).

@aboch

aboch commented Sep 1, 2016

@aaronlehmann I agree with you. If a service had multiple published ports and an update with no spec is received, and the allocator cannot remove them all at once (my guess, given that the CLI provides flags to remove only one port at a time), then it should either return a proper error or be changed to autonomously remove all the ports one by one.

@mrjana
Contributor

mrjana commented Sep 1, 2016

@aboch @aaronlehmann I agree with @aaronlehmann. We should handle an update to ServiceSpec where the EndpointSpec is nil but there was one previously. Even if the previous EndpointSpec had multiple PortConfigs, we should handle them all, i.e. deallocate them all. This should be doable in the current implementation.

BTW, even with the CLI, I believe you can remove all ports at once by providing multiple --publish-rm options.

@stevvooe
Contributor

stevvooe commented Sep 1, 2016

We should handle an update to ServiceSpec where the EndpointSpec is nil but there was one previously. Even if the previous EndpointSpec had multiple PortConfigs, we should handle them all, i.e. deallocate them all. This should be doable in the current implementation.

This is simply the right fix. A nil EndpointSpec means that the service no longer has endpoints and those ports should be deallocated.
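
A nil-safe "is everything already allocated?" check along these lines might look like the sketch below. It is an assumption-laden illustration rather than the code merged in this PR: `Service`, `EndpointSpec`, `Endpoint`, and `portsSettled` are hypothetical, simplified names.

```go
package main

import "fmt"

// Simplified, hypothetical types for illustration only.
type PortConfig struct {
	Protocol   string
	TargetPort uint32
}

type EndpointSpec struct{ Ports []PortConfig }
type Endpoint struct{ Ports []PortConfig }

type Service struct {
	Spec     *EndpointSpec // desired ports; nil means "no ports wanted"
	Endpoint *Endpoint     // currently allocated ports; may be nil
}

// portsSettled reports whether the allocated ports already match the spec.
// With a nil spec it returns false as long as any port is still allocated,
// so the caller knows a deallocation pass is required.
func portsSettled(s *Service) bool {
	wanted := map[PortConfig]bool{}
	if s.Spec != nil {
		for _, p := range s.Spec.Ports {
			wanted[p] = true
		}
	}
	have := map[PortConfig]bool{}
	if s.Endpoint != nil {
		for _, p := range s.Endpoint.Ports {
			have[p] = true
		}
	}
	if len(wanted) != len(have) {
		return false
	}
	for p := range wanted {
		if !have[p] {
			return false
		}
	}
	return true
}

func main() {
	ports := []PortConfig{{Protocol: "tcp", TargetPort: 80}}
	// Spec removed, but a port is still allocated: not settled yet.
	fmt.Println(portsSettled(&Service{Spec: nil, Endpoint: &Endpoint{Ports: ports}})) // prints "false"
}
```

Returning true for a nil spec unconditionally would skip the release of previously allocated ports, which is why the check compares both sides instead.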

@yongtang
Member

yongtang commented Sep 1, 2016

I wrote a test case to reproduce the panic:

func (s *DockerSwarmSuite) TestApiSwarmServicesUpdateRemovePorts(c *check.C) {
       d := s.AddDaemon(c, true, true)

       // Create a service that exposes one port through an EndpointSpec.
       id := d.createService(c, simpleTestService, func(s *swarm.Service) {
               s.Spec.EndpointSpec = &swarm.EndpointSpec{
                       Ports: []swarm.PortConfig{
                               {
                                       Protocol:   "tcp",
                                       TargetPort: 80,
                               },
                       },
               }
       })
       waitAndAssert(c, defaultReconciliationTimeout, d.checkActiveContainerCount, checker.Equals, 1)
       service := d.getService(c, id)

       // Post an update whose spec carries no EndpointSpec at all.
       var serviceWithoutPorts swarm.Service
       simpleTestService(&serviceWithoutPorts)
       url := fmt.Sprintf("/services/%s/update?version=%d", service.ID, service.Version.Index)
       status, out, err := d.SockRequest("POST", url, serviceWithoutPorts.Spec)
       c.Assert(err, checker.IsNil)
       c.Assert(status, checker.Equals, http.StatusOK, check.Commentf("output: %q", string(out)))
       // The service should keep running instead of the manager panicking.
       waitAndAssert(c, defaultReconciliationTimeout, d.checkActiveContainerCount, checker.Equals, 1)
}

@allencloud
Contributor Author

@dperny @stevvooe @mrjana @aaronlehmann @yongtang @aboch
Thanks a lot to all of you. I think I understand the background better now, and I am in favor of making service updates work when no EndpointSpec is given.
So I was wondering what the best next step is: a new PR, or a more specific proposal?

@stevvooe
Contributor

stevvooe commented Sep 2, 2016

@allencloud You can continue with this PR, starting by integrating the unit test suggested by @yongtang. The fix should be straightforward, involving a little bit of work to ensure that the deallocations happen.

@allencloud allencloud changed the title from "fix panic in portallocator" to "fix panic in portallocator and deallocate endpoints when update with none EndpointSpec" Sep 5, 2016
@allencloud
Contributor Author

PR updated, PTAL.
Actually, the test case @yongtang provided is for the docker/docker project. I think that when swarmkit is updated in docker engine, it would be good to add the test there.

@stevvooe
Contributor

stevvooe commented Sep 6, 2016

@allencloud It should be trivial to convert the test case to swarmkit. If you need help, let me know.

…on-endpointSpec

Signed-off-by: allencloud <allen.sun@daocloud.io>
@allencloud
Contributor Author

allencloud commented Sep 7, 2016

@stevvooe
Thanks a lot for your advice. I have now added a test case in allocator_test.go.
I tried the modified allocator_test.go on branch master, and it failed with the same error as above:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x10 pc=0x3849a6]

goroutine 26 [running]:
panic(0x9f19a0, 0xc82000e0f0)
        /usr/local/go/src/runtime/panic.go:481 +0x3ff
github.com/docker/swarmkit/manager/allocator/networkallocator.(*portAllocator).isPortsAllocated(0xc82012e000, 0xc8201bd380, 0xafa9e0)
        /Users/AllenSun/gocode/src/github.com/docker/swarmkit/manager/allocator/networkallocator/portallocator.go:164 +0xb6
github.com/docker/swarmkit/manager/allocator/networkallocator.(*NetworkAllocator).IsServiceAllocated(0xc820158900, 0xc8201bd380, 0xc8201e2d00)
        /Users/AllenSun/gocode/src/github.com/docker/swarmkit/manager/allocator/networkallocator/networkallocator.go:315 +0x201
github.com/docker/swarmkit/manager/allocator.(*Allocator).doNetworkAlloc(0xc8201dd1d0, 0x1dc01d8, 0xc82006ce40, 0xa2bda0, 0xc8201e2d00)
        /Users/AllenSun/gocode/src/github.com/docker/swarmkit/manager/allocator/network.go:318 +0x3a8
github.com/docker/swarmkit/manager/allocator.(*Allocator).(github.com/docker/swarmkit/manager/allocator.doNetworkAlloc)-fm(0x1dc01d8, 0xc82006ce40, 0xa2bda0, 0xc8201e2d00)
        /Users/AllenSun/gocode/src/github.com/docker/swarmkit/manager/allocator/allocator.go:117 +0x56
github.com/docker/swarmkit/manager/allocator.(*Allocator).run(0xc8201dd1d0, 0x1dc01d8, 0xc82006ce40, 0xc82006a0c0, 0xc820128a20, 0xadff50, 0x7, 0xc820126a60, 0xc820126a50)
        /Users/AllenSun/gocode/src/github.com/docker/swarmkit/manager/allocator/allocator.go:175 +0x12e
github.com/docker/swarmkit/manager/allocator.(*Allocator).Run.func2.1(0xc820126a20, 0xc8201dd1d0, 0xc820126a00, 0xc82006a0c0, 0xc820128a20, 0xadff50, 0x7, 0xc820126a60, 0xc820126a50)
        /Users/AllenSun/gocode/src/github.com/docker/swarmkit/manager/allocator/allocator.go:142 +0xc9
created by github.com/docker/swarmkit/manager/allocator.(*Allocator).Run.func2
        /Users/AllenSun/gocode/src/github.com/docker/swarmkit/manager/allocator/allocator.go:143 +0x19d

With the changes in this PR, the modified allocator_test.go passes on this branch, and CircleCI passes for all cases.

PTAL

@xiaods

xiaods commented Sep 13, 2016

any update?

@mrjana
Contributor

mrjana commented Sep 13, 2016

LGTM

@stevvooe
Contributor

LGTM

@stevvooe stevvooe merged commit 4fb2797 into moby:master Sep 13, 2016

Successfully merging this pull request may close these issues.

panic in portallocator when updating service