Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix cloud event flaky unit tests by adding waitgroup to fakeclient #5690

Merged
merged 1 commit into from
Nov 17, 2022

Conversation

Yongxuanzhang
Copy link
Member

@Yongxuanzhang Yongxuanzhang commented Oct 26, 2022

Changes

This commit adds waitgroup to fakeclient to avoid that some goroutines are not done when we want to collect the events.
The tests are flaky because the cloud events are sent with goroutine
but we don't wait until all goroutines done to check the events. So it
is possible that some events are not collected. The waitGroup will count when each goroutine is created and decrease the count when the goroutine is done.
This change has no impact on current code.

/kind flake

Signed-off-by: Yongxuanzhang yongxuanzhang@google.com

related issues:
#5160
to reproduce this issue, adding time.Sleep(100 * time.Millisecond) to

func (c FakeClient) Send(ctx context.Context, event cloudevents.Event) protocol.Result {

related PR:
#5313
Recent failing tests:

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Fix cloud event flacky unit tests by adding EventSender

@tekton-robot
Copy link
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@tekton-robot tekton-robot added release-note-none Denotes a PR that doesnt merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 26, 2022
@Yongxuanzhang Yongxuanzhang marked this pull request as ready for review October 26, 2022 20:48
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2022
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloud_event_controller.go Do not exist 88.2%
pkg/reconciler/events/cloudevent.go Do not exist 91.0%
pkg/reconciler/events/cloudeventclient.go Do not exist 38.9%
pkg/reconciler/events/cloudeventsfakeclient.go Do not exist 54.5%
pkg/reconciler/events/eventhelper.go Do not exist 76.9%

@lbernick
Copy link
Member

based on this comment from @vdemeester it seems like we're thinking of moving cloudevents functionality out of pipelines, so it might not make sense to combine these two packages into one, but I'm not sure if this comment is still true. @afrittoli can you comment on this?

@Yongxuanzhang
Copy link
Member Author

/hold need to make changes

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 28, 2022
@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesnt merit a release note. labels Nov 2, 2022
@Yongxuanzhang Yongxuanzhang changed the title refactor events and cloudevents into the same package fix cloud event flacky unit tests by adding eventsender Nov 2, 2022
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.2% 88.6% 0.3

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.2% 88.6% 0.3

t.Helper()
// Sleep 50ms to make sure events have delivered
time.Sleep(50 * time.Millisecond)
e.Wait()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the time.Sleep be removed now? Here, and in CheckEventsUnordered

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if the fix can 100% replace the sleep here (Though I believe it is). Maybe if we could wait the fix is merged and then remove the sleep in another PR?

@@ -144,7 +145,9 @@ func SendCloudEventWithRetries(ctx context.Context, object runtime.Object) error
_, isRun := object.(*v1alpha1.Run)

wasIn := make(chan error)
e.WaitGroup.Add(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that I'm noticing here is that the interface is really just a wrapper around a waitGroup, i.e. it is pretty similar to the original implementation that just used a plain waitGroup. I'm sorry if I led you down a rabbit hole with my suggestions about how this interface should be implemented, but now that I'm seeing it in a PR we might just want a wrapper around the CloudEventClient. Something like

type OurOwnCloudEventClient struct {
   client *cloudEvents.Client
   eventCount int
}

func initialize(ce *cloudEvents.Client) *OurOwnCloudEventClient {
   client = ce
   eventCount = 0
}

func (c *OurOwnCloudEventClient) SendCloudEventWithRetries(ctx, object) {
  <do some stuff>
  eventCount += 1
  go func() {
    c.client.Send()
  }
}

func (c *OurOwnCloudEventClient) checkEventsOrdered(t *testing.T, <some other args>) {
    eventsSent := []string{}
    for i := 0; i < c.eventCount, i ++ {
      e <- c.client.Events
      eventsSent = append(eventsSent, e)
    }
   // Compare expected list of events with actual list of events
}

In this example, eventCount can also be replaced with a waitgroup. WDYT?

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.2% 89.0% 0.8
pkg/reconciler/events/cloudevent/cloudeventsfakeclient.go 54.5% 61.5% 7.0

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.2% 88.6% 0.3
pkg/reconciler/events/cloudevent/cloudeventclient.go 38.9% 36.8% -2.0
pkg/reconciler/events/cloudevent/cloudeventsfakeclient.go 54.5% 61.5% 7.0

@tekton-robot tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 2, 2022
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.2% 88.6% 0.3
pkg/reconciler/events/cloudevent/cloudeventclient.go 38.9% 36.8% -2.0
pkg/reconciler/events/cloudevent/cloudeventsfakeclient.go 54.5% 61.5% 7.0

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.2% 88.6% 0.3
pkg/reconciler/events/cloudevent/cloudeventclient.go 38.9% 36.8% -2.0
pkg/reconciler/events/cloudevent/cloudeventsfakeclient.go 54.5% 57.1% 2.6

@Yongxuanzhang
Copy link
Member Author

/retest

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.4% 88.7% 0.3
pkg/reconciler/events/cloudevent/cloudeventclient.go 38.9% 31.8% -7.1
pkg/reconciler/events/cloudevent/cloudeventsfakeclient.go 54.5% 74.4% 19.8

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.4% 88.7% 0.3
pkg/reconciler/events/cloudevent/cloudeventclient.go 38.9% 31.8% -7.1
pkg/reconciler/events/cloudevent/cloudeventsfakeclient.go 54.5% 73.2% 18.6

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.4% 88.7% 0.3
pkg/reconciler/events/cloudevent/cloudeventclient.go 38.9% 31.8% -7.1
pkg/reconciler/events/cloudevent/cloudeventsfakeclient.go 54.5% 72.1% 17.5

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.4% 88.7% 0.3
pkg/reconciler/events/cloudevent/cloudeventclient.go 38.9% 31.8% -7.1
pkg/reconciler/events/cloudevent/cloudeventsfakeclient.go 54.5% 71.1% 16.6

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.4% 88.7% 0.3
pkg/reconciler/events/cloudevent/cloudeventclient.go 38.9% 31.8% -7.1
pkg/reconciler/events/cloudevent/cloudeventsfakeclient.go 54.5% 72.1% 17.5

@lbernick lbernick mentioned this pull request Nov 15, 2022
7 tasks
@lbernick
Copy link
Member

/hold cancel
since this is ready for review

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 15, 2022
@lbernick
Copy link
Member

@Yongxuanzhang could you please update your commit message title + body to match the PR description?

@Yongxuanzhang
Copy link
Member Author

@Yongxuanzhang could you please update your commit message title + body to match the PR description?

Oh thanks! I just updated it. Sorry I made some changes after @afrittoli's reviews. you may want to remove your approval and review again?

  1. I added a check in Send so if the channel is full we will return error, this should prevent the extra events.
  2. I loop the channel size not the len(expectedEvents), this will avoid blocking on reading more events when we send less events. This should work in this PR since we will wait to ensure no events are missed

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.4% 88.7% 0.3
pkg/reconciler/events/cloudevent/cloudeventclient.go 38.9% 31.8% -7.1
pkg/reconciler/events/cloudevent/cloudeventsfakeclient.go 54.5% 72.1% 17.5

// eventsFromChannelUnordered takes a chan of string and a list of events that a test
// expects to receive. The events can be received in any order. Any extra or too few
// events are both considered errors.
func eventsFromChannelUnordered(c chan string, wantEvents []string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might make sense to merge this function with CheckCloudEventsUnordered, since it's not used anywhere else as far as I can tell, and it needs the call to waitgroup.Wait in order to work correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

func TestSend_Error(t *testing.T) {
sendEvents := []event.Event{{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this doesn't need to be a slice

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

func eventsFromChannelUnordered(c chan string, wantEvents []string) error {
expected := append([]string{}, wantEvents...)
channelEvents := len(c)
// fakeclient's channel buffersize equals to the size of wantEvents, so no extra events will be sent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is slightly misleading. Buffered channels can send once events have been pulled off of them. The reason no extra events will be sent is because the fake client returns an error in this case, but that logic lives far away from here.

I'd suggest adding commentary to the fake client where you return an error (to explain why this prevents extra events from being sent) and some commentary here about why this block detects the case where too few events are sent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

This commit adds waitgroup to fakeclient to avoid that some goroutines are not done when we want to collect the events.
The tests are flaky because the cloud events are sent with goroutine
but we don't wait until all goroutines done to check the events. So it
is possible that some events are not collected. The waitGroup will count when each goroutine is created and decrease the count when the goroutine is done.
This change has no impact on current code.

Signed-off-by: Yongxuan Zhang yongxuanzhang@google.com
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.4% 88.7% 0.3
pkg/reconciler/events/cloudevent/cloudeventclient.go 38.9% 31.8% -7.1
pkg/reconciler/events/cloudevent/cloudeventsfakeclient.go 54.5% 71.8% 17.2

Comment on lines +69 to +72
err := fakeClient.Send(ctx, sendEvent)
if err == nil {
t.Fatalf("want err but got nil")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you 🙏

}

// SetupFakeCloudClientContext sets up the fakeclient to context
func SetupFakeCloudClientContext(ctx context.Context, expectedEventCount int) context.Context {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Why does this need to be a separate function?

@@ -35,19 +35,6 @@ func CheckEventsOrdered(t *testing.T, eventChan chan string, testName string, wa
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: It's a bit inconsistent now that the unordered check has been moved inside to the fake client but the ordered one is in this test module. It might we worth keeping this logic in the same place, for maintenance and to help test developers. BTW, do you think the ordered check suffers from the same flake as the other one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we could fix as a follow-up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I'm looking into the k8s event as well, but I don't think they are the same root cause.
Some prs:
#5752
#5770

Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!
/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 17, 2022
@tekton-robot tekton-robot merged commit fff3fcd into tektoncd:main Nov 17, 2022
@jerop jerop added the kind/flake Categorizes issue or PR as related to a flakey test label Nov 30, 2022
@Yongxuanzhang Yongxuanzhang mentioned this pull request Jan 31, 2023
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/flake Categorizes issue or PR as related to a flakey test lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants