cAdvisor e2e failing 100% on CoreOS #1344

Open
timstclair opened this issue Jun 21, 2016 · 20 comments

timstclair commented Jun 21, 2016

Error:

F0620 11:09:44.381011   18340 runner.go:290] Error 0: error on host e2e-cadvisor-coreos-beta: command "godep" ["go" "test" "--timeout" "15m0s" "github.com/google/cadvisor/integration/tests/..." "--host" "e2e-cadvisor-coreos-beta" "--port" "8080" "--ssh-options" "-i /home/jenkins/.ssh/google_compute_engine -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no"] failed with error: exit status 1 and output: godep: WARNING: Godep workspaces (./Godeps/_workspace) are deprecated and support for them will be removed when go1.8 is released.
godep: WARNING: Go version (go1.6) & $GO15VENDOREXPERIMENT= wants to enable the vendor experiment, but disabling because a Godep workspace (Godeps/_workspace) exists
--- FAIL: TestDockerContainerSpec (0.98s)
    assertions.go:150: 

    Location:   docker_test.go:225

    Error:      Not equal: "0" (expected)
                    != "" (actual)

    Messages:   Cpu mask should be "0", but is ""

FAIL
FAIL    github.com/google/cadvisor/integration/tests/api    29.639s
ok      github.com/google/cadvisor/integration/tests/healthz    0.012s
godep: go exit status 1

failing line
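
For context, the failing assertion (docker_test.go:225) expects a container created with a CPU mask of "0" to report that mask back through cAdvisor. A rough by-hand equivalent of what the test exercises (hypothetical commands, not the test's actual code; the cgroup paths depend on the cgroup driver in use):

$ docker run -d --name cpuset-check --cpuset-cpus 0 busybox sleep 300
$ CID=$(docker inspect -f '{{.Id}}' cpuset-check)
$ # with the systemd cgroup driver the container's cpuset cgroup lives under system.slice:
$ cat /sys/fs/cgroup/cpuset/system.slice/docker-$CID.scope/cpuset.cpus
$ # with the cgroupfs driver it lives under /docker instead:
$ cat /sys/fs/cgroup/cpuset/docker/$CID/cpuset.cpus

On a healthy host either of those prints "0"; in the failing runs cAdvisor ends up reporting an empty mask instead.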

This has been failing 100% on e2e-cadvisor-coreos-beta since #1333 was merged.

@Random-Liu @pwittrock @euank

NOTE: the cadvisor-pull-build-test-e2e coreos-beta VM is disabled -- re-enable it once this issue is resolved.

@timstclair

@pwittrock do you know when you restarted the jenkins CI VMs? At first glance, it doesn't look like the PR it started failing on would have caused this.

@timstclair

/cc @dchen1107

timstclair commented Jun 21, 2016

Failing since kubekins build #4096 (sorry, Google internal only)

Actually, ignore my comment about #1333 - many builds succeeded on that revision prior to the failures starting.

I'm guessing the CoreOS VM auto-updated and is no longer setting the CPU mask. I'll ping the pr-builder jenkins to see if it's having the same issue.

vishh commented Jun 22, 2016

@timstclair Were you able to root-cause this issue?

@timstclair

No, I haven't tracked it down yet. It looks like this has started affecting the build jobs as well (it didn't yesterday). I think this lends credence to it being caused by a CoreOS update.

@timstclair

Let me see if I can reproduce on a new GCE coreos instance.

vishh commented Jun 22, 2016

We can disable the CoreOS node in the e2e suite until this issue is resolved.
WDYT?

@timstclair

I'll disable it for the builder.

euank commented Jun 23, 2016

Do you have more information on how these images are set up and so on (sorry, I'm not familiar with the cadvisor test setup)? For reference, node_e2e disables updates now with this.

If the cadvisor tests fail on newer CoreOS versions post-update, that sounds like something to triage and fix.

@timstclair

Ok, I was able to reproduce this on a fresh CoreOS beta image (VERSION_ID=1068.2.0). I just did:

  1. Install Go to the home directory
  2. go get godep & cadvisor
  3. I had trouble building cadvisor, so I copied the binary built on another machine
  4. Run build/integration.sh, which surfaced the same error:
$ build/integration.sh 
>> starting cAdvisor locally
Waiting for cAdvisor to start ...
>> running integration tests against local cAdvisor
godep: WARNING: Godep workspaces (./Godeps/_workspace) are deprecated and support for them will be removed when go1.8 is released.
godep: WARNING: Go version (go1.6) & $GO15VENDOREXPERIMENT= wants to enable the vendor experiment, but disabling because a Godep workspace (Godeps/_workspace) exists
--- FAIL: TestDockerContainerSpec (0.51s)
        Location:       docker_test.go:225
    Error:      Not equal: "0" (expected)
                    != "" (actual)
    Messages:   Cpu mask should be "0", but is ""

FAIL
FAIL    github.com/google/cadvisor/integration/tests/api    29.321s
ok      github.com/google/cadvisor/integration/tests/healthz    0.004s
godep: go exit status 1
Integration tests failed
>> stopping cAdvisor
build/integration.sh: line 45:  5341 Killed                  sudo ./cadvisor --docker_env_metadata_whitelist=TEST_VAR

@euank can you take it from here?
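
For anyone trying to repeat this, steps 1-3 above amount to roughly the following (a sketch: the Go version, directory layout, and import paths are assumptions, not the exact commands used):

$ # 1. Install Go into the home directory (Go 1.6 was current at the time)
$ curl -sSL https://storage.googleapis.com/golang/go1.6.2.linux-amd64.tar.gz | tar -C "$HOME" -xzf -
$ export GOROOT=$HOME/go GOPATH=$HOME/gopath PATH=$GOROOT/bin:$GOPATH/bin:$PATH
$ # 2. go get godep and the cadvisor sources
$ go get github.com/tools/godep
$ go get -d github.com/google/cadvisor
$ # 3. Building on the CoreOS host failed, so copy in a cadvisor binary built elsewhere
$ cd $GOPATH/src/github.com/google/cadvisor
$ # 4. Run the integration script against the local binary
$ build/integration.sh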

euank commented Jun 27, 2016

I'll take a guess that this is related to systemd >= 226's change in cgroup hierarchy, but not sure yet.

I was able to reproduce on a machine with systemd 226 and docker 1.11 launched with --exec-opt native.cgroupdriver=systemd.

The machine in question is Gentoo, but I expect it'll reproduce broadly in that configuration. I'll dig further...
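
For reference, the reproducing configuration is just the daemon running with the systemd cgroup driver on a systemd >= 226 host, roughly (a sketch for docker 1.11, where the daemon was still started via "docker daemon"):

$ sudo docker daemon --exec-opt native.cgroupdriver=systemd &
$ # any container with a cpuset restriction then shows the problem through cAdvisor
$ docker run -d --cpuset-cpus 0 busybox sleep 300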

euank commented Jun 27, 2016

It finds a cpuset root at both /sys/fs/cgroup/cpuset/system.slice/docker-de6460b5039fa64f505cf383c15dc96d515bbc507c76ec0f7c06a00a5115002f.scope and /sys/fs/cgroup/cpuset/init.scope/system.slice/docker-de6460b5039fa64f505cf383c15dc96d515bbc507c76ec0f7c06a00a5115002f.scope (note the init.scope bit), then tries to use the latter, doesn't find the files it expects, and returns "". The first one would be correct. I think this is moby/moby#16256 (comment). I'm taking a quick shot at patching this along the lines suggested there, to be sure.
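
A quick way to see which of those two candidate paths is actually populated (container ID copied from above; the exact output will vary with the runc/systemd combination, so treat this as a sketch):

CID=de6460b5039fa64f505cf383c15dc96d515bbc507c76ec0f7c06a00a5115002f
for d in /sys/fs/cgroup/cpuset/system.slice/docker-$CID.scope \
         /sys/fs/cgroup/cpuset/init.scope/system.slice/docker-$CID.scope; do
  echo "== $d"
  ls "$d" 2>/dev/null | grep cpuset || echo "   (no cpuset files here)"
done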

euank commented Jun 30, 2016

Upstream bug to point to as well: opencontainers/runc#931

Our options are, I think:

  1. Wait for runc + docker-on-coreos to be fixed (we might backport the fix)
  2. Switch the CoreOS+docker test setup to use the cgroupfs driver (non-default; see the sketch below)
  3. Switch to CoreOS stable, where it's still using systemd 225 and thus shouldn't be affected
  4. Modify cadvisor to handle runc's bad behaviour and do the "right thing" when there are multiple cgroup paths for one container

My preference is 3, to put off having to find a better solution, and hope that 1 happens in the meantime. Sound reasonable?
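
If option 2 is chosen, the CoreOS side would be a small systemd drop-in for docker.service; a sketch, assuming the CoreOS docker unit honours DOCKER_OPTS (the customization hook documented for those images at the time):

sudo mkdir -p /etc/systemd/system/docker.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/docker.service.d/10-cgroupfs.conf
[Service]
Environment="DOCKER_OPTS=--exec-opt native.cgroupdriver=cgroupfs"
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker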

@timstclair

(3) sounds reasonable to me, and I think we should do it anyway (filed #1361). Once the jenkins jobs are updated to use the cAdvisor jenkins script (kubernetes/test-infra#248) I'll add a coreos-stable VM.

euank commented Sep 8, 2016

We should be able to update the CoreOS node, if no one has already, now that we've switched CoreOS to use cgroupfs by default.

I don't have access to the images referenced by ./build/jenkins_e2e.sh to update them appropriately, but any version of CoreOS right now should include that change. If it's possible to use unmodified CoreOS images, as I updated the node e2e to do, that might also help.
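
Launching such an instance by hand would look roughly like this (a sketch: the instance name, zone, and user-data filename are placeholders, and --image-family coreos-stable assumes the public coreos-cloud images):

gcloud compute instances create e2e-cadvisor-coreos-stable \
  --zone us-central1-f \
  --image-project coreos-cloud \
  --image-family coreos-stable \
  --metadata-from-file user-data=coreos-init.ign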

@timstclair

I'm not sure what (if anything) needs to be changed from the unmodified coreos image, so it might just work. If you're up for trying it and figuring out what (if anything) needs to be added, I'd certainly welcome the help :)

You can see the command used to run the tests here.

@dchen1107

@euank I am assigning this one to you for delegation. We need better support for CoreOS as one of our basic images. Re-assign it back to us or ask for help if you need to. Thanks!

euank commented Sep 9, 2016

I don't expect we'll need to do more than is done for the node_e2e stuff (basically the user-data of https://github.com/euank/kubernetes/blob/5a5ba51b24c9e62aa775de1f568d365c2761aeb5/test/e2e_node/jenkins/coreos-init.json).

I'm on vacation for the next couple of weeks, so I won't be able to verify that's true. Regardless, someone with access to the jenkins account where these test instances run will need to start one up, unless we switch to a node_e2e-type model where instances are launched and specified as part of the code in this repository rather than totally out of band (#1361 could perhaps fix that).

cc @yifan-gu @crawford to help with or delegate further on this one, thanks!
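
For illustration only (this is not the contents of the linked file): an Ignition config that just masks the auto-update units would look something like this, and could be passed as the user-data in a gcloud invocation like the one sketched earlier.

cat > coreos-init.ign <<'EOF'
{
  "ignition": { "version": "2.0.0" },
  "systemd": {
    "units": [
      { "name": "update-engine.service", "mask": true },
      { "name": "locksmithd.service", "mask": true }
    ]
  }
}
EOF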

crawford commented Sep 9, 2016

Minor nit: the correct extension for that file should be .ign or .ignition, per the IANA MIME registration.

@yifan-gu

subscribe
