
Failing Test: [sig-storage] Flexvolume expansion tests are flaky #71470

Closed
gnufied opened this issue Nov 27, 2018 · 21 comments
Labels
kind/flake: Categorizes issue or PR as related to a flaky test.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments

@gnufied
Member

gnufied commented Nov 27, 2018

This is a child issue of #71434

For example - https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-slow/21996

/sig storage

@aniket-s-kulkarni will you be able to look into this?

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Nov 27, 2018
@AishSundar
Contributor

/milestone v1.13
/kind failing-test
/priority critical-urgent

I am adding this to the 1.13 milestone as critical-urgent until it is investigated and triaged otherwise. The code freeze deadline for the fix is EOD tomorrow, 11/28.

/cc @msau42 @jberkus @mariantalla @mortent

@k8s-ci-robot k8s-ci-robot added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Nov 27, 2018
@k8s-ci-robot k8s-ci-robot added this to the v1.13 milestone Nov 27, 2018
@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 27, 2018
@aniket-s-kulkarni
Contributor

@gnufied yes, I will look at it. At first glance it seems to be a timing issue.

This is what the test does:

		By("Waiting for file system resize to finish")
		pvc, err = waitForFSResize(pvc, c)
		Expect(err).NotTo(HaveOccurred(), "while waiting for fs resize to finish")

		pvcConditions := pvc.Status.Conditions
		Expect(len(pvcConditions)).To(Equal(0), "pvc should not have conditions")

waitForFSResize checks that .spec.resources.requests and .status.capacity reflect a successful resize, and that check passed. But the very next step, which expects the PVC conditions to be cleared, did not pass.

Unfortunately we don't log what condition was left on the PVC, so it's unclear what was not cleared. The kubelet updates .status.capacity and clears the conditions in the same patch, so I wouldn't expect the test check to land between those two updates.
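For reference, here is a minimal sketch of what a waitForFSResize-style poll does. This is not the actual e2e helper, just an illustration assuming a recent client-go (context-taking Get) with made-up polling intervals and names:

package e2esketch

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForFSResizeSketch polls the PVC until .status.capacity has caught up
// with .spec.resources.requests, which is the signal the e2e test waits on.
func waitForFSResizeSketch(c kubernetes.Interface, pvc *v1.PersistentVolumeClaim) (*v1.PersistentVolumeClaim, error) {
	var updated *v1.PersistentVolumeClaim
	err := wait.PollImmediate(2*time.Second, 5*time.Minute, func() (bool, error) {
		var getErr error
		updated, getErr = c.CoreV1().PersistentVolumeClaims(pvc.Namespace).Get(context.TODO(), pvc.Name, metav1.GetOptions{})
		if getErr != nil {
			return false, getErr
		}
		requested := updated.Spec.Resources.Requests[v1.ResourceStorage]
		reported := updated.Status.Capacity[v1.ResourceStorage]
		// Done once the reported capacity is at least the requested size.
		return reported.Cmp(requested) >= 0, nil
	})
	if err != nil {
		return nil, fmt.Errorf("waiting for fs resize: %v", err)
	}
	return updated, nil
}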

@gnufied
Member Author

gnufied commented Nov 27, 2018

@aniket-s-kulkarni I am wondering if the kubelet logs have any clues about why the condition update failed. I agree that if pvc.Status.Capacity had the correct value then the condition should have been removed too, but apparently something else happened.

@gnufied
Member Author

gnufied commented Nov 27, 2018

I think the root cause of this issue is that the same PVC is considered twice for resizing while the fs resize is still pending on the node. The flow is:

  1. The user edits the PVC and requests a larger size (this adds the PVC resize-in-progress condition).
  2. The flexvolume plugin's controller-side resize is instant, so it returns success immediately; the PV size is updated and the PVC gets the fs-resize-pending condition.
  3. On the node, the kubelet resizes the filesystem and removes the condition.
  4. While we wait for the resize on the node, the same PVC is considered again for resizing by the expand-controller loop, because pvc.status.capacity is still smaller than the requested size in pvc.spec.resources.requests. The operation generator will reject the second resize request (https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/operationexecutor/operation_generator.go#L1266) because it compares against the PV size, but what we are seeing is that the resize-in-progress condition is added at least twice.

Here is how we know the volume was considered twice for resizing:

I1126 13:08:09.979859       1 volume_resize_map.go:115] Adding pvc e2e-tests-mounted-flexvolume-expand-6c7wp/pvc-k7lj2 with Size 6Gi/2Gi for resizing
I1126 13:08:12.090042       1 volume_resize_map.go:115] Adding pvc e2e-tests-mounted-flexvolume-expand-6c7wp/pvc-k7lj2 with Size 6Gi/2Gi for resizing

As far as I can tell, this flake was not introduced by the flexvolume change; it existed before, and the timing of flex resize just triggers it more often. All resize operations are idempotent and the expand controller rejects the duplicate request by comparing against the PV size, so it should not be too bad. In the worst case it WILL cause a condition to be added to the PVC unnecessarily (and that is what is happening here).
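To make the mismatch concrete, here is a minimal sketch of the two size checks described above. It is illustrative only, not the actual volume_resize_map or operation_generator code, and both function names are made up:

package expandsketch

import v1 "k8s.io/api/core/v1"

// The expand controller re-queues the PVC whenever the reported status
// capacity lags the requested spec size; this stays true until the node-side
// fs resize finally updates status.capacity.
func needsControllerResize(pvc *v1.PersistentVolumeClaim) bool {
	requested := pvc.Spec.Resources.Requests[v1.ResourceStorage]
	reported := pvc.Status.Capacity[v1.ResourceStorage]
	return reported.Cmp(requested) < 0
}

// The operation generator rejects the duplicate request with a PV-size check
// along these lines, but by that point the resize-in-progress condition has
// already been re-added to the PVC.
func pvAlreadyResized(pv *v1.PersistentVolume, pvc *v1.PersistentVolumeClaim) bool {
	requested := pvc.Spec.Resources.Requests[v1.ResourceStorage]
	pvSize := pv.Spec.Capacity[v1.ResourceStorage]
	return pvSize.Cmp(requested) >= 0
}

Because the first check keys off the PVC while the second keys off the PV, a flex PVC whose PV has already been grown but whose filesystem resize is still pending on the node passes the first check, gets re-queued, and has the condition re-added before the second check rejects the duplicate.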

I will make a fix for this, but IMO this should not be a release blocker.

@saad-ali
Member

Thanks for digging in to this @gnufied.

Based on #71434 (comment) we will move this to v1.14 and document it as a known issue for Flex resize.

/milestone v1.14

@AishSundar
Contributor

@gnufied @saad-ali is there any user-visible effect if an extra condition gets added to the PVC? Is it just ignored?

@jberkus

jberkus commented Nov 28, 2018

Fix tags:

/kind flake
/remove-kind failing-test

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. and removed kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. labels Nov 28, 2018
@gnufied
Member Author

gnufied commented Nov 28, 2018

The condition exists as messaging to the user about the stage of volume resizing; no actions are taken based on it. So the user will not have any usage issues with the PVC, but the messaging will be a bit confusing: when the resize operation is over, to the end user it will appear as if it isn't.
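For anyone wondering what that leftover messaging looks like, it is just an extra entry in pvc.Status.Conditions. A minimal sketch (hypothetical helper, real core/v1 condition constants) that would surface it:

package conditionsketch

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// printResizeConditions dumps any resize-related conditions still present on
// the PVC. The condition types below are the core/v1 constants; which one is
// actually left behind in the flake is not recorded in the test output.
func printResizeConditions(pvc *v1.PersistentVolumeClaim) {
	for _, cond := range pvc.Status.Conditions {
		switch cond.Type {
		case v1.PersistentVolumeClaimResizing, v1.PersistentVolumeClaimFileSystemResizePending:
			fmt.Printf("%s: status=%s reason=%q message=%q\n",
				cond.Type, cond.Status, cond.Reason, cond.Message)
		}
	}
}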

Also, the issue itself is rare. For example, this test flaked only twice between 11/13 and 11/27, so most users will most likely never run into it. Nonetheless, I am working on a fix.

@AishSundar
Contributor

Thanks @gnufied, this is now moved to 1.14.

@gnufied
Member Author

gnufied commented Nov 30, 2018

/assign

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 28, 2019
@spiffxp
Member

spiffxp commented Mar 5, 2019

/remove-lifecycle stale
This is still failing (it's unclear to me whether it's a flake or a continuous failure from this display) https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=Flexvolume

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 5, 2019
@gnufied
Member Author

gnufied commented Mar 5, 2019

Yeah, there is actually a fix available for this: #71611. I need to sync with @jsafrane and see how we can get the PR merged.

@nikopen
Contributor

nikopen commented Mar 11, 2019

Hey @gnufied, in terms of the v1.14 release, how close is the fix for this, and what's the current status?

@gnufied
Member Author

gnufied commented May 13, 2019

We missed fixing this in v1.14. We are working on a fix that will land in 1.15.

@spiffxp
Member

spiffxp commented May 14, 2019

/remove-priority critical-urgent
/priority important-longterm
More representative of how it's actually being treated

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels May 14, 2019
@gnufied
Member Author

gnufied commented May 14, 2019

/milestone v1.15

@k8s-ci-robot
Contributor

@gnufied: You must be a member of the kubernetes/kubernetes-milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v1.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

@mariantalla: You must be a member of the kubernetes/kubernetes-milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v1.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gnufied
Member Author

gnufied commented Jul 18, 2019

This should be fixed now..

/close

@k8s-ci-robot
Contributor

@gnufied: Closing this issue.

In response to this:

This should be fixed now..

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
