Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Gateway status updates #502

Merged
merged 9 commits into from
Apr 22, 2020

Conversation

mangelajo
Copy link
Contributor

@mangelajo mangelajo commented Apr 13, 2020

No description provided.

@submariner-bot
Copy link
Contributor

🤖 Created branch: z_pr502/mangelajo/implement-status-updates

@mangelajo mangelajo changed the base branch from master to z_pr501/skitt/expose-status-types April 13, 2020 17:06
main.go Outdated
if err != nil {
klog.Fatalf("Fatal error occurred creating engine: %v", err)
}
cableEngine, err := cableengine.NewEngine(localSubnets, localCluster, localEndpoint, submarinerClient,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We move this creation ahead in time, so ... the loop reconciling the gateway status will also
work for the PASSIVE gateway.

The driver initialization does not happen until "StartEngine" though.

@submariner-bot
Copy link
Contributor

🤖 Updated branch: z_pr502/mangelajo/implement-status-updates

@submariner-bot
Copy link
Contributor

🤖 the base of this PR has been updated to master
please rebase this branch and remove #502 related commits

@submariner-bot submariner-bot changed the base branch from z_pr501/skitt/expose-status-types to master April 13, 2020 19:41
@submariner-bot
Copy link
Contributor

🤖 Updated branch: z_pr502/mangelajo/implement-status-updates

@mangelajo mangelajo changed the base branch from master to release-0.1 April 14, 2020 07:12
@mangelajo mangelajo changed the base branch from release-0.1 to master April 14, 2020 07:12
@submariner-bot
Copy link
Contributor

🤖 Updated branch: z_pr502/mangelajo/implement-status-updates

@mangelajo mangelajo changed the title Implement status updates Implement Gateway status updates Apr 14, 2020
@submariner-bot
Copy link
Contributor

🤖 Updated branch: z_pr502/mangelajo/implement-status-updates

@submariner-bot
Copy link
Contributor

🤖 Updated branch: z_pr502/mangelajo/implement-status-updates

1 similar comment
@submariner-bot
Copy link
Contributor

🤖 Updated branch: z_pr502/mangelajo/implement-status-updates

@submariner-bot
Copy link
Contributor

🤖 Updated branch: z_pr502/mangelajo/implement-status-updates

@mangelajo
Copy link
Contributor Author

ok, removed the workaround after #506 was merged

@mangelajo
Copy link
Contributor Author

E2E tests being handled here: #508

pkg/cableengine/syncgatewaystatus.go Outdated Show resolved Hide resolved
pkg/cableengine/syncgatewaystatus.go Outdated Show resolved Hide resolved
Copy link
Member

@sridhargaddam sridhargaddam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small nits, otherwise it looks good. I even tried it :)
Thanks for working on this @mangelajo

pkg/cableengine/syncgatewaystatus.go Outdated Show resolved Hide resolved
pkg/cableengine/syncgatewaystatus.go Outdated Show resolved Hide resolved
pkg/cableengine/syncgatewaystatus.go Outdated Show resolved Hide resolved
pkg/cableengine/syncgatewaystatus.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@mangelajo mangelajo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I handled those comments (need to update the patch) Also, I will move this gateway checking together with the redundancy part which is almost the same.

test/e2e/framework/framework.go Outdated Show resolved Hide resolved
test/e2e/framework/framework.go Outdated Show resolved Hide resolved

func (f *Framework) AwaitForGatewayGone(cluster framework.ClusterIndex, name string) {
gwClient := SubmarinerClients[cluster].SubmarinerV1().Gateways(framework.TestContext.SubmarinerNamespace)
framework.AwaitUntil("await gateway gone",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

test/e2e/framework/framework.go Outdated Show resolved Hide resolved
test/e2e/framework/framework.go Show resolved Hide resolved
test/e2e/gateway/gateway_reporting.go Outdated Show resolved Hide resolved
The Host field was redudant with the Gateway name, also the localEndpoint
information was missing when looking at the Gateway via describe for more
context.

Modified as discussed over slack channel.
The active gateway handles the role of cleaning up stale gateway entries,
if a gateway has not reported for x3 the reporting interval the entry will
be deleted. We handle this via an annotation since k8s API does not include
modification timestamp in the object metadata.
In this way, when we use some of those also as labels it's less weird
since the unwritten norm in k8s is to use lowcase values in the labels
instead of upcase.
@submariner-bot
Copy link
Contributor

🤖 Updated branch: z_pr502/mangelajo/implement-status-updates

@submariner-bot
Copy link
Contributor

🤖 Updated branch: z_pr502/mangelajo/implement-status-updates

@mangelajo
Copy link
Contributor Author

Unrelated globalnet failure, I re-ran that test.

@mangelajo
Copy link
Contributor Author

Thanks for the review @skitt

@skitt
Copy link
Member

skitt commented Apr 22, 2020

Thanks for the review @skitt

Not much of a review, but it works for me, and my concerns have been raised and addressed already!

pkg/cableengine/syncer/syncer.go Show resolved Hide resolved
pkg/cableengine/syncer/syncer.go Show resolved Hide resolved
return gw.(*submarinerv1.Gateway)
}

func (f *Framework) AwaitForGatewayRemoved(cluster framework.ClusterIndex, name string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (f *Framework) AwaitForGatewayRemoved(cluster framework.ClusterIndex, name string) {
func (f *Framework) AwaitGatewayRemoved(cluster framework.ClusterIndex, name string) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

func testBasicGatewayReporting(f *subFramework.Framework) {
clusterAName := framework.TestContext.ClusterIDs[framework.ClusterA]

By(fmt.Sprintf("Ensuring that one gateway reports as active %q", clusterAName))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
By(fmt.Sprintf("Ensuring that one gateway reports as active %q", clusterAName))
By(fmt.Sprintf("Ensuring that one gateway reports as active on %q", clusterAName))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


By(fmt.Sprintf("Ensuring that one gateway reports as active %q", clusterAName))
activeGateways := f.GetGatewaysWithHAStatus(subv1.HAStatusActive)
Expect(activeGateways).To(HaveLen(1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be wrapped with framework.AwaitUntil to avoid intermittent failures if the active was not yet running or written it's status. Perhaps AwaitActiveGateway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there's already AwaitGatewayWithStatus although it takes the gateway name which could be obtained by finding the active gateway node. But I realize we're actually duplicating what testEnginePodRestartScenario does so why do we need this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For independence, I wanted an indication that the basic reporting is working if the dataplane is broken. When we have stats then we could identify if it's the dataplane breaking, or the reporting breaking. The reporting would always affect the dataplane tests though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do AwaitActiveGateway at least to avoid possible intermittent failures as you said.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh AwaitGateway does not work, because we don't know the name, but we know that we want one. I will add another function that works with any.

activeGateways := f.GetGatewaysWithHAStatus(subv1.HAStatusActive)
Expect(activeGateways).To(HaveLen(1))

gw := activeGateways[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
gw := activeGateways[0]
activeGateway := activeGateways[0]

For consistency ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

By(fmt.Sprintf("Ensuring that the gateway reports as active %q", clusterAName))
activeGateways := f.GetGatewaysWithHAStatus(subv1.HAStatusActive)
Expect(activeGateways).To(HaveLen(1))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should verify that the found active Gateway's name matches that of the gateway node found above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or better yet, call AwaitGatewayWithStatus.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I want to verify there's only "1". I can't do that with Await, I can call Await just before this, which would make sense, also would remove any glitches if it's starts as passive and then moves to active.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I did it in a slightly different way, let me know what you think

gw := activeGateways[0]
By(fmt.Sprintf("Ensuring that the gateway %q is reporting connections", gw.Name))
Expect(gw.Status.Connections).NotTo(BeEmpty())
Expect(gw.Status.Connections[0].Status).To(Equal(subv1.Connected))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could cause an intermittent failure if active not yet reporting connections - it seems this should call AwaitGatewayFullyConnected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

By(fmt.Sprintf("Deleting submariner engine pod %q", enginePod.Name))
f.DeletePod(framework.ClusterA, enginePod.Name, framework.TestContext.SubmarinerNamespace)

newEnginePod := f.AwaitSubmarinerEnginePod(framework.ClusterA)
By(fmt.Sprintf("Found new submariner engine pod %q", newEnginePod.Name))

By(fmt.Sprintf("Waiting for the gateway to be up and connected %q", newEnginePod.Name))
f.AwaitGatewayFullyConnected(framework.ClusterA, gw.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could report fully connected as the last state from the previous engine pod. Perhaps we should move this after the connectivity tests below - that way we know the gateway is re-connected and should be reporting the new state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that is true, I was actually wanting to make sure we had connections before starting the dataplane tests.

This will work better/faster when we I implement the cleanup of state at pod exit, I didn't dig into that, but I will do right after 0.3.0-rc1
#518

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I was thinking that too - delete on engine shutdown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a DeleteGateway function for now, right after the pod deletion, which will ensure that at least the gateway status is re-created, and that it eventually moves to "fully connected"

When I do #518 then we remove this, and probably do a different check (I guess an informer to make sure it's deleted and re-created).


gwActive := f.AwaitGatewayWithStatus(framework.ClusterA, initialGatewayNode.Name, subv1.HAStatusActive)
gwPassive := f.AwaitGatewayWithStatus(framework.ClusterA, initialNonGatewayNode.Name, subv1.HAStatusPassive)
Expect(gwActive.Status.Connections).ToNot(BeEmpty(), "The active gateway must have active connections")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, it would be safest to call AwaitGatewayFullyConnected for the active gateway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@mangelajo
Copy link
Contributor Author

mangelajo commented Apr 22, 2020

@tpantelis I opened this to handle comments as a follow-up #519 , and #518 if that's ok

Currently lighthouse pushed a new ":latest" image, and this will fail when I submit any new changes.

In such way we are able to release 0.3.0-rc1 and at the same time test the changes you're proposing here (changes that make sense and I agree with).

@mangelajo
Copy link
Contributor Author

Wait, may be there's no such chicken-egg, since we don't depend on lighthouse here.. ':D
testing locally and I will send a PR update if it works.

@mangelajo
Copy link
Contributor Author

The technical difficulty with the lighthouse images does not exist, so it should be fine :) anyway we are talking on slack about merging this and do the follow up PR so we can cut RC1 now

@tpantelis
Copy link
Contributor

My comments aren't critical - let's merge this and address in the follow-ups @mangelajo created.

@tpantelis tpantelis merged commit 4bec940 into submariner-io:master Apr 22, 2020
@submariner-bot
Copy link
Contributor

🤖 Closed branches: [z_pr502/mangelajo/implement-status-updates]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants