Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorganise CI to have a sane number of statuses and reduce queueing #1401

Open
dave-tucker opened this issue Jun 9, 2020 · 16 comments
Open

Comments

@dave-tucker
Copy link
Collaborator

dave-tucker commented Jun 9, 2020

PR #1397 goes some way towards this, but I think we need to do more, especially given that #1343 is on the horizon.

Currently the test matrix includes these suites:

  • K8s Network and Network Policy Tests (4 shards)
  • Control Plane Tests

We then run these 5 jobs against the product of this matrix for every pull request (assuming with v6/dual-stack on the horizon)

  • HA, nonHA
  • Shared Gateway Mode, Local Gateway Mode
  • v4, v6, dual-stack

Assuming all of these variants, that's 60 jobs per run 😱 Without v4/v6/dual-stack it's still 20.

Given how CI is setup now, we will continue to run even if the initial build/lint fails.
Usually authors are quick to respond and force push a fix, but the old build isn't cancelled.

Proposal 1: Block running e2e tests until the initial Build has passed

As for the e2e tests, we need to simplify the matrix that we have.

Proposal 2: Move test orchestration outside of GitHub Actions
This would allow for us to have, one (or at a push 2, if we can assume that dual-stack might prove out the v4/v6 only paths) top level matrix items - those that dictate the dimensions of the cluster required.
Either ginkgo a bash script or make could then be responsible for running the tests with the necessary features enabled on the cluster, assuming this can be done by configuration only.

If that's not possible, and all matrix elements require to do things at install time with the cluster then we should...

Proposal 3: Move cluster creation outside of GitHub Actions
This would require using something (e.g terraform, or maybe a remote driver for KIND) to prepare cloud instances that can be used for testing. We might also require using an image registry for images that are made during the PR build.
The benefit here is that we're not capped with how many parallel machines we can create for testing.
The drawback being that we'd have to create additional rigging for log streaming, test reporting and result aggregation.

Proposal 4: Move some testing to post-merge

There may be other options here I haven't thought of so I'd be interested in others thoughts or opinions on how we could improve the current situation.

@dave-tucker
Copy link
Collaborator Author

As a follow up, the following from GitHub may help illustrate the boundaries we're working within re: concurrency.

GitHub plan Total concurrent jobs Maximum concurrent macOS jobs
Free 20 5
Pro 40 5
Team 60 5
Enterprise 180 50

Also note that a single workflow can spawn up to 256 jobs maximum.

@squeed
Copy link
Contributor

squeed commented Jun 9, 2020

Perhaps a dumb question: Can we get rid of non-HA? What does that offer us?

@squeed
Copy link
Contributor

squeed commented Jun 9, 2020

Another dumb question? Do we need v6? What about just v4 and dual-stack?

@Billy99
Copy link
Contributor

Billy99 commented Jun 9, 2020

Another dumb question? Do we need v6? What about just v4 and dual-stack?

@squeed Per request from @trozet, I'm working on IPv6 only in CI so that we can verify that dual-stack doesn't break IPv6 only (which was actually already broken upstream and we didn't know because it is not tested).

@trozet
Copy link
Contributor

trozet commented Jun 9, 2020

Perhaps a dumb question: Can we get rid of non-HA? What does that offer us?

It helps us differentiate if there is a failure due to HA functionality. If we see something consistently fail in HA, but not in noHA, we know the root cause is something to do with the HA path.

@squeed
Copy link
Contributor

squeed commented Jun 9, 2020

Perhaps a dumb question: Can we get rid of non-HA? What does that offer us?

It helps us differentiate if there is a failure due to HA functionality. If we see something consistently fail in HA, but not in noHA, we know the root cause is something to do with the HA path.

I see. That's a useful signal, but perhaps we can skip noHA by default?

@trozet
Copy link
Contributor

trozet commented Jun 9, 2020

Perhaps a dumb question: Can we get rid of non-HA? What does that offer us?

It helps us differentiate if there is a failure due to HA functionality. If we see something consistently fail in HA, but not in noHA, we know the root cause is something to do with the HA path.

I see. That's a useful signal, but perhaps we can skip noHA by default?

sure. Now that @dave-tucker has given us some comments to use to trigger things, we could make that a trigger /run noha or something. Is that possible @dave-tucker ? Also, to cut down on retest, would it be possible to say like /retest "test case name" and have a job that fires that only runs that single test case?

@squeed
Copy link
Contributor

squeed commented Jun 9, 2020

We could also reduce matrix size by not doing a full cross join. For example, we probably don't need the full multiplication of (ha, noha) x (local, shared) x (v4, v6, dual-stack)

What if we did, by default,

  • ha, local, (v4, v6, dual-stack)
  • ha, shared, dual-stack

With the (possibly erroneous) assumption that if dual-stack works, v4 and v6 work for shared gateway

Is it possible to run more tests if /pkg/node is touched?

@trozet
Copy link
Contributor

trozet commented Jun 9, 2020

We could also reduce matrix size by not doing a full cross join. For example, we probably don't need the full multiplication of (ha, noha) x (local, shared) x (v4, v6, dual-stack)

What if we did, by default,

  • ha, local, (v4, v6, dual-stack)
  • ha, shared, dual-stack

With the (possibly erroneous) assumption that if dual-stack works, v4 and v6 work for shared gateway

Is it possible to run more tests if /pkg/node is touched?

There's a possibility we may be moving to shared gw mode in the future, in which case we could drop all but some sanity on local. That would help some.

@Billy99
Copy link
Contributor

Billy99 commented Jun 9, 2020

My IPv6 CI PR will add some "exclude:" examples. For example I added a 'shard-ipv4' and then exclude it in IPv6 mode. Not sure how to add more tests if a particular file is touched.

@danwinship
Copy link
Contributor

ha, shared, dual-stack

shared gateway doesn't even fully support single-stack IPv6 yet (#1141, though there have been drive-by fixes for some of that).

It will still be a while before dual-stack is even worth running automatically. And once dual-stack is working, we won't really need to run single-stack IPv6 regularly.

So I'd say for now:

  • noha, local, v4
  • ha, local, v6
  • ha, shared, v4

and once dual-stack is ready, replace 'ha, local, v6' with 'ha, local, dual-stack'

@dave-tucker
Copy link
Collaborator Author

dave-tucker commented Jun 9, 2020

@trozet there is no api to re-run jobs, just the entire worfklow. so /retest can't take arguments.
Adding /test noHA is possible.

Someone would have to:

  1. Create a new workflow .github/workflows/test-no-ha.yml from test.yml and just include the noHA stuff
  2. Copy the code that we have for /retest and make a new /test action
  3. Adjust the dispatch code to trigger the new workflow and extend it to post a comment with pass/fail back to the PR where it was requested

@trozet
Copy link
Contributor

trozet commented Jun 9, 2020

@dave-tucker couldnt we then simulate retesting a single e2e test by adding the same workflow as /test noHA, except /test "some ginkgo focus" and make a workflow that takes that in as an arg?

@dave-tucker
Copy link
Collaborator Author

dave-tucker commented Jun 9, 2020

The problem is that a workflow set up in that way won't replace the github statuses on the PR 😢
It writes new ones, because the trigger is no longer pull_request, it's repository_dispatch or similar.

@trozet
Copy link
Contributor

trozet commented Jun 17, 2020

The problem is that a workflow set up in that way won't replace the github statuses on the PR
It writes new ones, because the trigger is no longer pull_request, it's repository_dispatch or similar.

That is kind of lame, but this would still be useful to be able to run another test and see aresult.

@aojea
Copy link
Contributor

aojea commented Jun 17, 2020

maybe it is not necessary to run all jobs in premerge, just a subset of them, the most significant. Then, run ALL of them nightly, the risks of regressions is higher but I don't see a big number of PR per day so it would be easy to catch regressions with only one day of delay.
https://jasonet.co/posts/scheduled-actions/

Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jun 19, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jun 20, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 9, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 10, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 10, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 13, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 13, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 13, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 14, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 14, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 15, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 16, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 16, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 17, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 17, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 19, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 20, 2020
Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 22, 2020
Per comment in ovn-org#1401, limit the test cases. This does not fix the root cause
of the issue, but addresses a comment in the issue:
 ovn-org#1401 (comment)

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 22, 2020
Per comment in ovn-org#1401, limit the test cases. This does not fix the root cause
of the issue, but addresses a comment in the issue:
 ovn-org#1401 (comment)

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 23, 2020
Per comment in ovn-org#1401, limit the test cases. This does not fix the root cause
of the issue, but addresses a comment in the issue:
 ovn-org#1401 (comment)

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 24, 2020
Per comment in ovn-org#1401, limit the test cases. This does not fix the root cause
of the issue, but addresses a comment in the issue:
 ovn-org#1401 (comment)

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 24, 2020
Per comment in ovn-org#1401, limit the test cases. This does not fix the root cause
of the issue, but addresses a comment in the issue:
 ovn-org#1401 (comment)

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 24, 2020
Per comment in ovn-org#1401, limit the test cases. This does not fix the root cause
of the issue, but addresses a comment in the issue:
 ovn-org#1401 (comment)

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 24, 2020
Per comment in ovn-org#1401, limit the test cases. This does not fix the root cause
of the issue, but addresses a comment in the issue:
 ovn-org#1401 (comment)

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 25, 2020
Per comment in ovn-org#1401, limit the test cases. This does not fix the root cause
of the issue, but addresses a comment in the issue:
 ovn-org#1401 (comment)

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 25, 2020
Per comment in ovn-org#1401, limit the test cases. This does not fix the root cause
of the issue, but addresses a comment in the issue:
 ovn-org#1401 (comment)

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Billy99 added a commit to Billy99/ovn-kubernetes-1 that referenced this issue Jul 27, 2020
Per comment in ovn-org#1401, limit the test cases. This does not fix the root cause
of the issue, but addresses a comment in the issue:
 ovn-org#1401 (comment)

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants