Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding capability for HA control plane testing with kops #6113

Merged
merged 1 commit into from
Feb 13, 2018

Conversation

chrislovecnm
Copy link
Contributor

Added a new flag for kops-ha. Also, since we now may run in regions that
may not have m3 instances, updated the master type to c4 instances.

In order to test HA control plain a minimum of 3 AZs must exist.
Implemented a func that queries all regions, and then queries the list
of that regions AZ. A random region is then selected. If running in ha
mode, a region with a minimum of 3 AZ is selected. The new zone
capability can be run in both ha and non-ha modes. If --kops-zones is
not set, and the test run is on AWS, the new code is used. You can
override the new code by setting the zone as done previously.

Added a unit test, that requires a live AWS account for testing the new
random region selection. These tests are skipped since they require a
live AWS account.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 31, 2017
@chrislovecnm
Copy link
Contributor Author

/cc @geojaz @justinsb
/assign @krzyzacy

@chrislovecnm
Copy link
Contributor Author

/hold

Need to do e2e test with this

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 31, 2017
@chrislovecnm chrislovecnm mentioned this pull request Jan 4, 2018
@chrislovecnm chrislovecnm force-pushed the kops-regions branch 2 times, most recently from 28d98c2 to 442df92 Compare January 6, 2018 07:06
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 8, 2018
@chrislovecnm chrislovecnm force-pushed the kops-regions branch 2 times, most recently from b722f2f to 37f743c Compare January 9, 2018 01:07
kubetest/kops.go Outdated
var zones []string

// if zones is set to zero and gcp project is not set then pick random aws zone
if *kopsZones == "" && gcpProject == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why check for gcpProject? gce vs aws? maybe check for provider here

kubetest/kops.go Outdated
}

if len(zones) == 0 {
return nil, fmt.Errorf("no zones found")
Copy link
Member

@krzyzacy krzyzacy Jan 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

errors.New, same as below

kubetest/kops.go Outdated
"--zones", strings.Join(k.zones, ","),
"--override", "'cluster.spec.etcdClusters[*].version=3.1.10'",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we pinning etcd version for all the kops job now? (does etcd 3.1 work with older releases?)

Copy link
Contributor Author

@chrislovecnm chrislovecnm Jan 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, what versions are we testing?? Do I have access to the version number for the test run?

kubetest/kops.go Outdated
}
}

// TODO get the number of ec2 instances in the region and ensure that there are not too many running
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: TODO(chrislovecnm): blah blah

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh - that would be cool. Having something like boskos that acts as a mutex. I wonder if we could "fake it" with tags or some cheap resource somehow...

@krzyzacy
Copy link
Member

krzyzacy commented Jan 9, 2018

/lint

Copy link
Contributor

@k8s-ci-robot k8s-ci-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krzyzacy: 0 warnings.

In response to this:

/lint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 12, 2018
kubetest/kops.go Outdated
kopsDiskSize = flag.Int("kops-disk-size", 48, "Disk size to use for nodes and masters")
kopsHA = flag.Bool("kops-ha", false, "(kops only) Run HA control plane")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make thiskops-master-count and have it pass through to --master-count ?

Or kops-multizone - as that's closer to what it is doing - picking distinct zones in the same region?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is kops-master-count, not multizone :) Yes I can do that

kubetest/kops.go Outdated
kopsDiskSize = flag.Int("kops-disk-size", 48, "Disk size to use for nodes and masters")
kopsHA = flag.Bool("kops-ha", false, "(kops only) Run HA control plane")
kopsEtcd = flag.String("kops-etcd", "", "(kops only) Etcd Version")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed in this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And perhaps rename to kops-etcd-version or should we just expose the override flag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can move this to another PR if you like, but we need it for HA testing in general

kubetest/kops.go Outdated
ha bool

// kopsEtcd is the etcd version to run
kopsEtcd string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for kops prefix here (in struct kops). And I suggest etcdVersion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

kubetest/kops.go Outdated
zones = *z
} else {
zones = strings.Split(*kopsZones, ",")
if err := os.Setenv("ZONE", zones[0]); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it this is no longer needed, as it isn't done in the other branch of the if statement. Remove (or add to the other if statement branch)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I frankly have no bloody idea wth this is doing, I refactored it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then the problem is that it has to be done in both branches. Pull it out of the if statement?

kubetest/kops.go Outdated

if len(zones) == 0 {
return nil, errors.New("no zones found")
} else if zones[0] == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validate all zones in slice, not just the first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is refactoring of old code. I can do that in another PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see it in the diff? Can you point me to it?

kubetest/kops.go Outdated
}
zones = *z
} else {
zones = strings.Split(*kopsZones, ",")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the validation error message should be "kops-zones must be specified when not running on AWS"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again old code, we can follow-up with another PR if you like.

kubetest/kops.go Outdated
}
}

// TODO get the number of ec2 instances in the region and ensure that there are not too many running
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh - that would be cool. Having something like boskos that acts as a mutex. I wonder if we could "fake it" with tags or some cheap resource somehow...

kubetest/kops.go Outdated
// getRandomAWSZones looks up all regions, and the availability zones for those regions. A random
// region is then chosen and the AZ's for that region is returned. If HA is required a regions with 3
// or more zones is returned.
func getRandomAWSZones(isHA bool) (*[]string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return []string, error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? Would love to understand more.

kubetest/kops.go Outdated
if err != nil {
return nil, err
}
zones = *z
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have getRandomAWSZones return []string, and just check that if len(zones) == 0) { return nil, errors.New("could not pick zones") } or similar

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hrmmm I get what you are saying, but I would rather check for an error. That is why I made it a pointer. Can you live with it?

)

func Test_GetHARandomRegion(t *testing.T) {
t.Skip("this test requires a valid aws account to run")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wanted to write a test, you could split getRandomAWSZones into two functions - one that builds the map, the other that returns a list of zones with at least N members

kubetest/kops.go Outdated
// TODO get the number of ec2 instances in the region and ensure that there are not too many running
log.Printf("Launching cluster in region: %q", selectedRegion)

return &selectedZones, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if selectedZones > 3 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we use all of them.

kubetest/kops.go Outdated
}
if len(overrides) != 0 {
featureFlags = append(featureFlags, "SpecOverrideFlag")
createArgs = append(createArgs, "--override", strings.Join(overrides, ","))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't join I don't think - not sure it's valid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need context

@justinsb
Copy link
Member

So I will counterpose something. How about we have a blacklist of regions, so that when US-Colorado opens we blacklist it? Also, when a region gets its third AZ it will automatically start running HA tests. I would rather know which regions are having issues. Just a thought.

I'm fine with that. There will be a lag when they announce a new region, but that's not a big deal.

@justinsb
Copy link
Member

For the --overrides: I'm not sure we accept --override=a,b in kops; I think you have to do --override=a --override=b. Am I wrong?

kubetest/kops.go Outdated

if len(zones) == 0 {
return nil, errors.New("no zones found")
} else if zones[0] == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see it in the diff? Can you point me to it?

kubetest/kops.go Outdated
zones = *z
} else {
zones = strings.Split(*kopsZones, ",")
if err := os.Setenv("ZONE", zones[0]); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, then the problem is that it has to be done in both branches. Pull it out of the if statement?

kubetest/kops.go Outdated
@@ -273,17 +309,19 @@ func (k kops) Up() error {
"--node-count", strconv.Itoa(k.nodes),
"--node-volume-size", strconv.Itoa(k.diskSize),
"--master-volume-size", strconv.Itoa(k.diskSize),
// TODO chrislovecnm make this configurable
"--master-size", "c4.large",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We override this in some tests (the ENA test IIRC), so this should be a flag if you want to default it to c4.large

kubetest/kops.go Outdated
// az for each of the regions. AWS Go API does not allow us to make a single call
zonesResults, err := ec2Session.DescribeAvailabilityZones(&ec2.DescribeAvailabilityZonesInput{})
if err != nil {
return nil, fmt.Errorf("unable to call aws api DescribeAvailabilityZones: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think building the session is going to fail, it's going to be on your first "real" API call - i.e. here

kubetest/kops.go Outdated
}
// build a map of region and the regions az so that we can lookup AZ easily
for _, z := range zonesResults.AvailabilityZones {
existing, ok := regionMap[*z.RegionName]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can replace the whole block of the for loop

kubetest/kops.go Outdated
}

// initialize global pseudo random generator
rand.Seed(time.Now().Unix())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You normally only call rand.Seed once, at the very start of your program (i.e. in main). Many languages don't even require you to call seed at all. It's unusual to seed repeatedly.

So - why are you seeding repeatedly is more the question :-)

/*
Per @justinsb
TODO @chrislovecnm
If you wanted to write a test, you could split getRandomAWSZones into two functions - one that builds the map, the other that returns a list of zones with at least N members
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just don't really see the point of these skipped test functions. We're going to find out PDQ if it doesn't work...

kubetest/kops.go Outdated
kopsDiskSize = flag.Int("kops-disk-size", 48, "Disk size to use for nodes and masters")
kopsMasterCount = flag.Int("kops-master-count", 1, "(kops only) Number of masters to run")
kopsEtcdVersion = flag.String("kops-etcd", "", "(kops only) Etcd Version")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change flag to kops-etcd-version?

kubetest/kops.go Outdated
// getRandomAWSZones looks up all regions, and the availability zones for those regions. A random
// region is then chosen and the AZ's for that region is returned. If HA is required a regions with 3
// or more zones is returned.
func getRandomAWSZones(masterCount int) (*[]string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unclear before - I suggest this should return ([]string, error), because []string is already a ref type, so it's already nullable.

kubetest/kops.go Outdated
// initialize global pseudo random generator
rand.Seed(time.Now().Unix())

// select a random region and zones
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You asked for more context on using rand.Perm.

Instead of looping through and removing the regions one by one, instead just use rand.Perm to get a list of indexes. Those are the random order in which you will consider the regions in the slice. Then you don't need to do the removal, and you only need one rand call.

for _, i := rand.Perm(n) {
  region := regions[i]
...
}
return nil, fmt.Errorf("unable to find region with %d zones", masterCount)

@chrislovecnm chrislovecnm force-pushed the kops-regions branch 4 times, most recently from 5d444a7 to 66d96e9 Compare February 2, 2018 22:18
@chrislovecnm
Copy link
Contributor Author

For the --overrides: I'm not sure we accept --override=a,b in kops; I think you have to do --override=a --override=b. Am I wrong?

I will test, but I think it is ok either way

Otherwise @justinsb PTAL

@chrislovecnm
Copy link
Contributor Author

We cannot merge this until we get the AWS quotas addressed.

kubetest/kops.go Outdated
kopsDiskSize = flag.Int("kops-disk-size", 48, "Disk size to use for nodes and masters")
kopsPublish = flag.String("kops-publish", "", "Publish kops version to the specified gs:// path on success")
kopsMasterType = flag.String("kops-master-type", "c4.large", "(kops only) master instance type")
kopsMasterCount = flag.Int("kops-master-count", 1, "(kops only) Number of masters to run")
kopsEtcdVersion = flag.String("kops-etcd", "", "(kops only) Etcd Version")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flag name should be kops-etcd-version I think

kubetest/kops.go Outdated
if *kopsZones == "" && provider == "aws" {
// Initialize global pseudo random generator.
// We do not need to intialize it if kopsZones is passed in or we are running the other tests.
// That is why I did not put it in main func.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should go into the main func, it's something that should be done once in every program and then we forget about it

if len(zones) == 0 {
return nil, errors.New("no zones found")
} else if zones[0] == "" {
return nil, errors.New("zone cannot be a empty string")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we validate all the zones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A really good idea, but do you mind if we do it in another PR? I am wondering if that work is outside the scope of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, my plan is to not pass in zones long term. Always pick a random zone or zones with the new func. Because of this zones would be chosen programmatically.

kubetest/kops.go Outdated
kopsDiskSize = flag.Int("kops-disk-size", 48, "Disk size to use for nodes and masters")
kopsPublish = flag.String("kops-publish", "", "Publish kops version to the specified gs:// path on success")
kopsMasterType = flag.String("kops-master-type", "c4.large", "(kops only) master instance type")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kops-master-size, for consistency with the master-size kops flag

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think default should be empty, because otherwise you're changing the size that all tests run with.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah I understand that we are changing all tests, and it is a big change. My overall plan is to always select a random region and at least one random AZ. Because of that plan, we are going to hit far more regions that do not support m3 instance types. If we default it to c4.large it may help us with lack of m3 instances. What are your thoughts?

kubetest/kops.go Outdated
etcdVersion string

// masterType is the EC2 instance type for the master
masterType string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

masterSize, for consistency with the kops flag

kubetest/kops.go Outdated

// getRandomAWSZones looks up all regions, and the availability zones for those regions. A random
// region is then chosen and the AZ's for that region is returned. If HA is required a regions with 3
// or more zones is returned.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe replace " If HA is required a regions with 3 or more zones is returned." with "At least masterCount zones will be returned, all in the same region" or something like that?

kubetest/kops.go Outdated
for _, i := range rand.Perm(len(awsRegions)) {
ec2Session, err := getAWSEC2Session(awsRegions[i])
if err != nil {
return selectedZones, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return nil, err would be less confusing

kubetest/kops.go Outdated
// az for a region. AWS Go API does not allow us to make a single call
zoneResults, err := ec2Session.DescribeAvailabilityZones(&ec2.DescribeAvailabilityZonesInput{})
if err != nil {
return nil, fmt.Errorf("unable to call aws api DescribeAvailabilityZones: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include region in message

kubetest/kops.go Outdated
}
}

return selectedZones, fmt.Errorf("unable to find region with %d zones", masterCount)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nil, fmt.Error... I think

import "testing"

///
// TODO:(@chrislovecnm) we should mock out the aws interface so we do not make a aws API call
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do one of (1) just remove the skipped tests, (2) implement the mock, (3) make the function testable by splitting it up. I'm not sure there's much to test in option 3, TBH.

@BenTheElder
Copy link
Member

/retest

@chrislovecnm
Copy link
Contributor Author

@justinsb PTAL

I think there are two points that we may disagree on.

  1. changing the default instance size to c4.large. m3.large are not available in a lot of regions and az, even one of our az in us-east-1. With the new code that selects from all regions, I think we need to change
  2. validating the zones. I do not think we will need that anymore. My plan is to not pass in the zones for the most part and have the code choose random zones. I now have the capability of just choosing a single random zone in a random region. Also, I think it is kinda outside the scope of this PR.

I should have done all of your other updates, please let me know if I missed anything.

@justinsb
Copy link
Member

justinsb commented Feb 8, 2018

LGTM

@chrislovecnm
Copy link
Contributor Author

@krzyzacy do you have time to deploy this on Tuesday?

@krzyzacy
Copy link
Member

@chrislovecnm can you squash?
/lgtm
/hold
/lint

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrislovecnm, krzyzacy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these OWNERS Files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 13, 2018
Copy link
Contributor

@k8s-ci-robot k8s-ci-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krzyzacy: 0 warnings.

In response to this:

@chrislovecnm can you squash?
/lgtm
/hold
/lint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Added a new flag for kops-ha. Also, since we now may run in regions that
may not have m3 instances, updated the master type to c4 instances.

In order to test HA control plain a minimum of 3 AZs must exist.
Implemented a func that queries all regions, and then queries the list
of that regions AZ. A random region is then selected. If running in ha
mode, a region with a minimum of 3 AZ is selected. The new zone
capability can be run in both ha and non-ha modes. If --kops-zones is
not set, and the test run is on AWS, the new code is used. You can
override the new code by setting the zone as done previously.

Added a unit test, that requires a live AWS account for testing the new
random region selection. These tests are skipped since they require a
live AWS account.
@chrislovecnm
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 13, 2018
@k8s-ci-robot k8s-ci-robot merged commit ae4dc70 into kubernetes:master Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants