flake in e2e test: Got value for metric problem_counter with label map[reason:Ext4Error]: 3, expect at most 2. #403

xueweiz · 2019-12-12T00:09:01Z

• Failure [52.528 seconds]
NPD should export Prometheus metrics.
/home/prow/go/src/k8s.io/node-problem-detector/test/e2e/metriconly/metrics_test.go:36
  When ext4 filesystem error happens
  /home/prow/go/src/k8s.io/node-problem-detector/test/e2e/metriconly/metrics_test.go:104
    NPD should update problem_counter{reason:Ext4Error} and problem_gauge{type:ReadonlyFilesystem} [It]
    /home/prow/go/src/k8s.io/node-problem-detector/test/e2e/metriconly/metrics_test.go:113
    Got value for metric problem_counter with label map[reason:Ext4Error]: 3, expect at most 2.
      
    Expected
        <float64>: 3
    to be <=
        <float64>: 2
    /home/prow/go/src/k8s.io/node-problem-detector/test/e2e/metriconly/metrics_test.go:190

Looking at the kernel log in the test artifacts, we see:

Dec 11 23:29:38 npd-metrics-cos-73-11647-348-0-0d055070 kernel: Bridge firewalling registered
Dec 11 23:29:38 npd-metrics-cos-73-11647-348-0-0d055070 kernel: IPv6: ADDRCONF(NETDEV_UP): docker0: link is not ready
Dec 11 23:30:02 npd-metrics-cos-73-11647-348-0-0d055070 kernel: EXT4-fs (sda1): re-mounted. Opts: commit=30,data=ordered
Dec 11 23:30:03 npd-metrics-cos-73-11647-348-0-0d055070 kernel: EXT4-fs error (device sda1): trigger_test_error:124: comm problem-maker: fake filesystem error from problem-maker
Dec 11 23:30:03 npd-metrics-cos-73-11647-348-0-0d055070 kernel: Aborting journal on device sda1-8.
Dec 11 23:30:03 npd-metrics-cos-73-11647-348-0-0d055070 kernel: EXT4-fs error (device sda1): ext4_journal_check_start:61: Detected aborted journal
Dec 11 23:30:03 npd-metrics-cos-73-11647-348-0-0d055070 kernel: EXT4-fs (sda1): Remounting filesystem read-only
Dec 11 23:30:03 npd-metrics-cos-73-11647-348-0-0d055070 kernel: EXT4-fs error (device sda1) in ext4_init_inode_table:1451: IO failure
Dec 11 23:30:03 npd-metrics-cos-73-11647-348-0-0d055070 kernel: EXT4-fs (sda1): Remounting filesystem read-only

The test expects only two filesystem errors, from these log lines:

EXT4-fs error (device sda1): trigger_test_error:124: comm problem-maker: fake filesystem error from problem-maker
EXT4-fs error (device sda1): ext4_journal_check_start:61: Detected aborted journal

However, in this case the filesystem failure also triggered an IO failure, which the test does not expect:

EXT4-fs error (device sda1) in ext4_init_inode_table:1451: IO failure

We should relax the assertion slightly so that this case is handled properly.
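One possible shape for a relaxed check is sketched below in plain Go. The helper name `checkCounterInRange`, the lower bound of 1, and the new upper bound of 3 are all assumptions chosen for illustration; the actual test uses its own assertion at metrics_test.go:190, and the right bounds are for the maintainers to decide.

```go
package main

import "fmt"

// checkCounterInRange is a hypothetical helper, not part of the NPD test suite.
// It reports an error when got falls outside the inclusive range [min, max].
func checkCounterInRange(metric string, got, min, max float64) error {
	if got < min || got > max {
		return fmt.Errorf("got value for metric %s: %v, expect between %v and %v",
			metric, got, min, max)
	}
	return nil
}

func main() {
	// With the current upper bound of 2, the observed value 3 is rejected.
	fmt.Println(checkCounterInRange("problem_counter", 3, 1, 2) != nil) // prints true

	// With an assumed upper bound of 3, the extra IO failure is tolerated.
	fmt.Println(checkCounterInRange("problem_counter", 3, 1, 3) == nil) // prints true
}
```

Keeping a lower bound means the check still fails when no Ext4Error is reported at all, while the wider upper bound absorbs the occasional follow-on IO failure.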


fejta-bot · 2020-03-11T00:35:32Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2020-04-10T01:18:42Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot · 2020-05-10T02:02:29Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot · 2020-05-10T02:02:41Z

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

stmcginnis · 2021-02-08T18:58:04Z

Looks like this may still occasionally fail:

#529

xueweiz mentioned this issue Dec 12, 2019

update the testing example in README.md #385

Closed

k8s-ci-robot added the lifecycle/stale label Mar 11, 2020

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Apr 10, 2020

k8s-ci-robot closed this as completed May 10, 2020

stmcginnis mentioned this issue Feb 8, 2021

Allow building on macOS #529

Merged
