cAdvisor e2e failing 100% on core OS #1344

timstclair · 2016-06-21T20:06:46Z

Error:

F0620 11:09:44.381011   18340 runner.go:290] Error 0: error on host e2e-cadvisor-coreos-beta: command "godep" ["go" "test" "--timeout" "15m0s" "github.com/google/cadvisor/integration/tests/..." "--host" "e2e-cadvisor-coreos-beta" "--port" "8080" "--ssh-options" "-i /home/jenkins/.ssh/google_compute_engine -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o CheckHostIP=no -o StrictHostKeyChecking=no"] failed with error: exit status 1 and output: godep: WARNING: Godep workspaces (./Godeps/_workspace) are deprecated and support for them will be removed when go1.8 is released.
godep: WARNING: Go version (go1.6) & $GO15VENDOREXPERIMENT= wants to enable the vendor experiment, but disabling because a Godep workspace (Godeps/_workspace) exists
--- FAIL: TestDockerContainerSpec (0.98s)
    assertions.go:150: 

    Location:   docker_test.go:225

    Error:      Not equal: "0" (expected)
                    != "" (actual)

    Messages:   Cpu mask should be "0", but is ""

FAIL
FAIL    github.com/google/cadvisor/integration/tests/api    29.639s
ok      github.com/google/cadvisor/integration/tests/healthz    0.012s
godep: go exit status 1

failing line

This has been failing 100% on e2e-cadvisor-coreos-beta since #1333 was merged.

@Random-Liu @pwittrock @euank

NOTE: cadvisor-pull-build-test-e2e coreos-beta VM disabled -- re-enable once this issue is resolved.

timstclair · 2016-06-21T20:07:49Z

@pwittrock do you know when you restarted the jenkins CI VMs? At first glance this doesn't look like the PR it started failing at would have caused it.

timstclair · 2016-06-21T20:12:25Z

/cc @dchen1107

timstclair · 2016-06-21T20:22:11Z

Failing since kubekins build #4096 (sorry, Google internal only)

Actually, ignore my comment about #1333 - many builds succeeded on that revision prior to the failures starting.

I'm guessing the CoreOS VM auto-updated and is no longer setting the CPU mask. I'll ping the pr-builder jenkins to see if it's having the same issue.

vishh · 2016-06-22T17:37:26Z

@timstclair Were you able to root-cause this issue?

timstclair · 2016-06-22T17:40:21Z

No, I haven't tracked it down yet. It looks like this has started affecting the build jobs as well (it didn't yesterday). I think this lends credence to it being caused by a CoreOS update.

timstclair · 2016-06-22T17:42:03Z

Let me see if I can reproduce on a new GCE coreos instance.

vishh · 2016-06-22T17:43:52Z

We can disable CoreOS node from the e2e suite until this issue is resolved.
WDYT?

timstclair · 2016-06-22T17:53:14Z

I'll disable it for the builder.

euank · 2016-06-23T21:38:56Z

Do you have more information on how these images are set up (sorry, I'm not familiar with the cadvisor test setup)? For reference, node_e2e now disables updates with this.

If the cadvisor tests fail on newer CoreOS versions post-update, that sounds like something to triage and fix.

timstclair · 2016-06-24T00:02:57Z

Ok, I was able to reproduce this on a fresh CoreOS beta image (VERSION_ID=1068.2.0). I just did:

1. Install go to home directory
2. go get godep & cadvisor
3. I had trouble building cadvisor, so I copied the binary built on another machine
4. run build/integration.sh, which surfaced the same error:

$ build/integration.sh 
>> starting cAdvisor locally
Waiting for cAdvisor to start ...
>> running integration tests against local cAdvisor
godep: WARNING: Godep workspaces (./Godeps/_workspace) are deprecated and support for them will be removed when go1.8 is released.
godep: WARNING: Go version (go1.6) & $GO15VENDOREXPERIMENT= wants to enable the vendor experiment, but disabling because a Godep workspace (Godeps/_workspace) exists
--- FAIL: TestDockerContainerSpec (0.51s)
        Location:       docker_test.go:225
    Error:      Not equal: "0" (expected)
                    != "" (actual)
    Messages:   Cpu mask should be "0", but is ""

FAIL
FAIL    github.com/google/cadvisor/integration/tests/api    29.321s
ok      github.com/google/cadvisor/integration/tests/healthz    0.004s
godep: go exit status 1
Integration tests failed
>> stopping cAdvisor
build/integration.sh: line 45:  5341 Killed                  sudo ./cadvisor --docker_env_metadata_whitelist=TEST_VAR

@euank can you take it from here?

euank · 2016-06-27T19:18:01Z

I'll take a guess that this is related to systemd >= 226's change in cgroup hierarchy, but not sure yet.

I was able to reproduce on a machine with systemd 226 and docker 1.11 launched with --exec-opt native.cgroupdriver=systemd.

The machine in question is gentoo, but I expect it'll reproduce broadly in that configuration. I'll dig further...
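For anyone triaging other machines, the suspected combination (systemd 226 or newer together with the systemd cgroup driver) can be checked with a small helper. This is only a sketch of the guess above, not a confirmed diagnosis: `systemdMajor` and `mayBeAffected` are hypothetical names, and the parsing assumes the first line of `systemctl --version` has the form `systemd <number>`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// systemdMajor extracts the version number from `systemctl --version`
// output, whose first line is assumed to look like "systemd 226".
func systemdMajor(out string) (int, error) {
	fields := strings.Fields(out)
	if len(fields) < 2 || fields[0] != "systemd" {
		return 0, fmt.Errorf("unexpected systemctl output: %q", out)
	}
	return strconv.Atoi(fields[1])
}

// mayBeAffected encodes the hypothesis from this thread: the failure shows up
// with systemd >= 226 when docker uses the systemd cgroup driver.
func mayBeAffected(systemdVersion int, cgroupDriver string) bool {
	return systemdVersion >= 226 && cgroupDriver == "systemd"
}

func main() {
	// Sample input; on a real host this string would come from running
	// `systemctl --version`.
	v, err := systemdMajor("systemd 226\n+PAM -AUDIT")
	if err != nil {
		panic(err)
	}
	fmt.Println(v, mayBeAffected(v, "systemd"), mayBeAffected(v, "cgroupfs"))
}
```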

euank · 2016-06-27T21:49:08Z

It finds a cpuset root at both /sys/fs/cgroup/cpuset/system.slice/docker-de6460b5039fa64f505cf383c15dc96d515bbc507c76ec0f7c06a00a5115002f.scope and /sys/fs/cgroup/cpuset/init.scope/system.slice/docker-de6460b5039fa64f505cf383c15dc96d515bbc507c76ec0f7c06a00a5115002f.scope (note the init.scope bit). It then tries to use the latter, doesn't find the files it expects, and returns "". The first one would be correct. I think this is moby/moby#16256 (comment). I'm making a quick attempt at patching this along the lines suggested there, to be sure.

euank · 2016-06-30T21:06:01Z

Upstream bug to point to as well: opencontainers/runc#931

Our options are, I think:

1. Wait for runc + docker-on-coreos to be fixed (we might backport the fix)
2. Switch the CoreOS+docker test setup to use cgroupfs driver (non-default)
3. Switch to CoreOS stable where it's still using systemd 225 and thus shouldn't be affected
4. Modify cadvisor to handle runc's bad behaviour and do the "right thing" when there are multiple cgroup paths for one container

My preference is 3 to put off having to get a better solution, and hope that 1 happens in the meanwhile. Sound reasonable?

timstclair · 2016-07-01T00:05:41Z

(3) sounds reasonable to me, and I think we should do it anyway (filed #1361). Once the jenkins jobs are updated to use the cAdvisor jenkins script (kubernetes/test-infra#248) I'll add a coreos-stable VM.

euank · 2016-09-08T23:28:38Z

We should be able to update the CoreOS node now, if no one has already, since we've switched CoreOS to use cgroupfs by default.

I don't have access to the images referenced by ./build/jenkins_e2e.sh to update appropriately, but any version of CoreOS right now should include that change. If it's possible to use unmodified coreos images as I updated the node e2e to do, that might also help.

timstclair · 2016-09-09T18:25:30Z

I'm not sure what (if anything) needs to be changed from the unmodified coreos image, so it might just work. If you're up for trying it and figuring out what (if anything) needs to be added, I'd certainly welcome the help :)

You can see the command used to run the tests here.

dchen1107 · 2016-09-09T21:33:09Z

@euank I am assigning this one to you to delegate. We need better support for CoreOS as one of our basic images. Re-assign it back to us or ask for help if you need to. Thanks!

euank · 2016-09-09T23:01:15Z

I don't expect we'll need to do more than is done for the node_e2e stuff (user-data of https://github.com/euank/kubernetes/blob/5a5ba51b24c9e62aa775de1f568d365c2761aeb5/test/e2e_node/jenkins/coreos-init.json basically).

I'm on vacation for the next couple weeks, so I won't be able to verify that's true, and regardless someone with access to the jenkins account where these test instances run will need to start one up, unless we switch to a node_e2e type model where instances are launched and specified as part of code in this repository, not totally out of band. (#1361 could fix that perhaps).

cc @yifan-gu @crawford to help with or delegate further on this one, thanks!

crawford · 2016-09-09T23:24:58Z

Minor nit, the correct extension for that file should be .ign or .ignition per the IANA MIME registration.

yifan-gu · 2016-09-14T16:43:50Z

subscribe

timstclair added the area/testing label Jun 21, 2016

timstclair mentioned this issue Jun 22, 2016

fallback to /dev/mapper device if metadata device is not set in docker info #1343

Merged

timstclair self-assigned this Jul 1, 2016

euank mentioned this issue Jul 1, 2016

docker cgroup driver discussion - cgroupfs or systemd coreos/bugs#1435

Closed

euank mentioned this issue Sep 8, 2016

Heapster cpu/usage_rate returns incorrect cpu/usage_rate caused by overflow? kubernetes/kubernetes#30939

Closed

dchen1107 unassigned timstclair Sep 9, 2016

dashpole mentioned this issue Jan 27, 2017

Missing container metrics in kubelet (cAdvisor) in v1.5.1 kubernetes/kubernetes#39812

Closed

dashpole assigned euank Mar 16, 2017
