Failling Test: [sig-storage] Flexvolume expansion tests are flaky #71470

gnufied · 2018-11-27T17:59:48Z

This is a child issue of #71434

For example - https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-slow/21996

/sig storage

@aniket-s-kulkarni will you be able to look into this?

AishSundar · 2018-11-27T19:29:32Z

/milestone v1.13
/kind failing-test
/priority critical-urgent

I am addind this to 1.13 milestone as critical-urgent until investigated and triaged otherwise. The Code freeze deadline for the fix is EOD tomorrow, 11/28

/cc @msau42 @jberkus @mariantalla @mortent

aniket-s-kulkarni · 2018-11-27T19:36:52Z

@gnufied yes I will look at it. At first glance it seems to be timing

This is what the test does

		By("Waiting for file system resize to finish")
		pvc, err = waitForFSResize(pvc, c)
		Expect(err).NotTo(HaveOccurred(), "while waiting for fs resize to finish")

		pvcConditions := pvc.Status.Conditions
		Expect(len(pvcConditions)).To(Equal(0), "pvc should not have conditions")

The waitForFSResize actually checks for .spec.resources.requests and .spec.status.capacity to reflect a successful resize. That one did pass. But the very next step is to expect the PVC conditions to be cleared away didn't

Unfortunately we don't log what condition was there on the PVC, so unsure what was left uncleared. The kubelet updates .spec.status.capacity and clears the conditions in the same patch command so I wouldn't expect the test check getting in between those two.

gnufied · 2018-11-27T20:05:14Z

@aniket-s-kulkarni I am wondering if kubelet logs have some clues about why condition update failed. I do agree that if pvc.Status.Capacity had correct value then condition should be removed too but apparently something happened.

gnufied · 2018-11-27T22:50:14Z

I think the root cause of this issue is - same PVC is being considered twice for resizing while fs resizing was pending on the node. The flow is like this:

User edits the PVC and requests more size (this add pvc resize progress condition)
flexvolume plugin controller side resize is instant and hence it returns success immediately and pv size is updated and PVC gets fs resize condition.
On node, kubelet resizes the volume and removes the condition.
Now while we wait for resize on the node, same PVC is being considerd again for resizing by the expand-controller loop, because pvc.status.capacity < pvc.spec.capacity . The actual operation generator code actually will reject the second resize request - https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/operationexecutor/operation_generator.go#L1266 because it considers PV size but what we are seeing is, resize-in-progress condition is being at least added twice.

Here is how we know that volume was considered twice for resize:

1126 13:08:09.979859       1 volume_resize_map.go:115] Adding pvc e2e-tests-mounted-flexvolume-expand-6c7wp/pvc-k7lj2 with Size 6Gi/2Gi for resizing
I1126 13:08:12.090042       1 volume_resize_map.go:115] Adding pvc e2e-tests-mounted-flexvolume-expand-6c7wp/pvc-k7lj2 with Size 6Gi/2Gi for resizing

This flake was not introduced by flexvolume change afaict and existed before. I think timing issue with flex is just triggering it. All resize operation are idempotent and expand controller anyways rejects the resize request by comparing PV size and hence it should not be too bad. In worst case, this WILL cause a condition to be added to the PVC unnecessarily (and that is what is happening here).

I will make a fix for this. But IMO - this should not be release blocker.

saad-ali · 2018-11-28T00:16:27Z

Thanks for digging in to this @gnufied.

Based on #71434 (comment) we will move this to v1.14 and document it as a known issue for Flex resize.

/milestone v1.14

AishSundar · 2018-11-28T00:31:00Z

@gnufied @saad-ali is there any perceived user effect if an extra condition gets added to the PVC ? is it just ignored?

jberkus · 2018-11-28T02:36:14Z

Fix tags:

/kind flake
/remove-kind failing-test

gnufied · 2018-11-28T02:47:10Z

The condition exists as a messaging to the user about stage of volume resizing, but there aren't any actions taken based on that. So - user will not have any usage issues with PVC or anything, guess just the messaging will be kinda confusing(when resizing operation is over, to end user it will appear as it isn't).

Also - the issue itself is kinda rare. For example, this test flaked only twice between 11/13 and 11/27, so most users will most likely not run into this. But nontheless - I am working on a fix.

AishSundar · 2018-11-28T05:44:58Z

Thanks @gnufied this is now moved to 1.14

gnufied · 2018-11-30T18:32:59Z

/assign

fejta-bot · 2019-02-28T19:33:10Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

spiffxp · 2019-03-05T19:23:35Z

/remove-lifecycle stale
This is still failing (it's unclear to me whether it's a flake or a continuous failure from this display) https://storage.googleapis.com/k8s-gubernator/triage/index.html?test=Flexvolume

gnufied · 2019-03-05T20:24:18Z

Yeah, there is a fix for this available actually - #71611 . I need to sync with @jsafrane and see how we can merge the PR.

nikopen · 2019-03-11T08:46:18Z

Hey @gnufied , in terms of v1.14 release, how close is the fix for this / what's the current status?

gnufied · 2019-05-13T22:22:35Z

We missed fixing this in v1.14. We are working on a fix that will land in 1.15.

spiffxp · 2019-05-14T00:36:58Z

/remove-priority critical-urgent
/priority important-longterm
More representative of how it's actually being treated

gnufied · 2019-05-14T01:20:49Z

/milestone v1.15

k8s-ci-robot · 2019-05-14T01:20:50Z

@gnufied: You must be a member of the kubernetes/kubernetes-milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v1.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2019-05-14T13:20:16Z

@mariantalla: You must be a member of the kubernetes/kubernetes-milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v1.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gnufied · 2019-07-18T17:14:37Z

This should be fixed now..

/close

k8s-ci-robot · 2019-07-18T17:14:38Z

@gnufied: Closing this issue.

In response to this:

This should be fixed now..

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Nov 27, 2018

k8s-ci-robot added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Nov 27, 2018

k8s-ci-robot added this to the v1.13 milestone Nov 27, 2018

k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 27, 2018

gnufied mentioned this issue Nov 27, 2018

Tracking: Complete list of Storage-related flakes for 1.13 #71434

Closed

k8s-ci-robot modified the milestones: v1.13, v1.14 Nov 28, 2018

saad-ali mentioned this issue Nov 28, 2018

1.13 Release Notes: "Known Issues" #70955

Closed

k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. and removed kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. labels Nov 28, 2018

saad-ali mentioned this issue Nov 28, 2018

Final 1.13 release notes, copied in from Google Doc kubernetes/sig-release#386

Merged

k8s-ci-robot assigned gnufied Nov 30, 2018

gnufied mentioned this issue Nov 30, 2018

Fix stale pvc patch #71611

Closed

mlmhl mentioned this issue Jan 8, 2019

Ensure resizer handles double processing of PVCs without problems. kubernetes-csi/external-resizer#4

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 28, 2019

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 5, 2019

k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels May 14, 2019

gnufied mentioned this issue May 14, 2019

Use workqueues for volume expansion #77883

Merged

k8s-ci-robot closed this as completed Jul 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failling Test: [sig-storage] Flexvolume expansion tests are flaky #71470

Failling Test: [sig-storage] Flexvolume expansion tests are flaky #71470

gnufied commented Nov 27, 2018

AishSundar commented Nov 27, 2018

aniket-s-kulkarni commented Nov 27, 2018

gnufied commented Nov 27, 2018 •

edited

Loading

gnufied commented Nov 27, 2018 •

edited

Loading

saad-ali commented Nov 28, 2018

AishSundar commented Nov 28, 2018

jberkus commented Nov 28, 2018

gnufied commented Nov 28, 2018 •

edited

Loading

AishSundar commented Nov 28, 2018

gnufied commented Nov 30, 2018

fejta-bot commented Feb 28, 2019

spiffxp commented Mar 5, 2019

gnufied commented Mar 5, 2019

nikopen commented Mar 11, 2019

gnufied commented May 13, 2019

spiffxp commented May 14, 2019

gnufied commented May 14, 2019

k8s-ci-robot commented May 14, 2019

k8s-ci-robot commented May 14, 2019

gnufied commented Jul 18, 2019

k8s-ci-robot commented Jul 18, 2019

Failling Test: [sig-storage] Flexvolume expansion tests are flaky #71470

Failling Test: [sig-storage] Flexvolume expansion tests are flaky #71470

Comments

gnufied commented Nov 27, 2018

AishSundar commented Nov 27, 2018

aniket-s-kulkarni commented Nov 27, 2018

gnufied commented Nov 27, 2018 • edited Loading

gnufied commented Nov 27, 2018 • edited Loading

saad-ali commented Nov 28, 2018

AishSundar commented Nov 28, 2018

jberkus commented Nov 28, 2018

gnufied commented Nov 28, 2018 • edited Loading

AishSundar commented Nov 28, 2018

gnufied commented Nov 30, 2018

fejta-bot commented Feb 28, 2019

spiffxp commented Mar 5, 2019

gnufied commented Mar 5, 2019

nikopen commented Mar 11, 2019

gnufied commented May 13, 2019

spiffxp commented May 14, 2019

gnufied commented May 14, 2019

k8s-ci-robot commented May 14, 2019

k8s-ci-robot commented May 14, 2019

gnufied commented Jul 18, 2019

k8s-ci-robot commented Jul 18, 2019

gnufied commented Nov 27, 2018 •

edited

Loading

gnufied commented Nov 27, 2018 •

edited

Loading

gnufied commented Nov 28, 2018 •

edited

Loading