
aws: Increase the default master instance size to reduce etcd timeouts #1069

Merged · 1 commit merged into openshift:master on Jan 17, 2019

Conversation

@smarterclayton (Contributor) commented Jan 15, 2019

After experimenting with disks, we determined that we were CPU starved, not IO starved (see the later comments in this thread). This PR sets the default master instance size to m4.xlarge, which performs more consistently against 1.11 kube.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/3155/ is a representative run, with many timeouts.

The CPU use in 1.11 kube is excessive due to two bugs that will be fixed in the rebase, after which we can try stepping back down to m4.large.
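A minimal sketch (not the PR's literal diff) of the shape of the change, using the defaultAWSMachinePoolPlatform and awstypes names that appear in the review diffs later in this thread; the import path and the two pool helpers are hypothetical illustrations:

package machines

import (
	awstypes "github.com/openshift/installer/pkg/types/aws" // assumed import path
)

// defaultAWSMachinePoolPlatform carries no shared instance-type default
// anymore (see the emptied return in the review thread below).
func defaultAWSMachinePoolPlatform() awstypes.MachinePool {
	return awstypes.MachinePool{}
}

// masterPoolDefaults (hypothetical helper) sketches the master side:
// bump to m4.xlarge (4 vCPUs, up from m4.large's 2) so etcd and the
// 1.11 kube control plane are no longer CPU starved.
func masterPoolDefaults() awstypes.MachinePool {
	pool := defaultAWSMachinePoolPlatform()
	pool.InstanceType = "m4.xlarge"
	return pool
}

// workerPoolDefaults (hypothetical helper) sketches the worker side,
// which stays on m4.large.
func workerPoolDefaults() awstypes.MachinePool {
	pool := defaultAWSMachinePoolPlatform()
	pool.InstanceType = "m4.large"
	return pool
}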

@openshift-ci-robot added the size/XS (0-9 lines changed) and approved labels on Jan 15, 2019
@smarterclayton (Contributor Author)

/assign @wking

@wking (Member) commented Jan 15, 2019

Previous bump was #737. I'd floated 500 GiB in #844.

/lgtm
/hold

Do we need an exception for this, or is it a bugfix? Also, do we need to stuff this into 0.10.0, or can it land in follow-up work?

@openshift-ci-robot added the do-not-merge/hold and lgtm labels on Jan 15, 2019
@smarterclayton (Contributor Author)

Exception granted; it's a bug fix and a release blocker. I don't think it has to be in 0.10.0 unless everything is magically fixed, in which case we'll rebuild 0.10.0.

@smarterclayton (Contributor Author) commented Jan 15, 2019

Ok, here's the latency distribution of etcd "took too long" error messages in the logs:

[chart: latency distribution of etcd "took too long" messages]

Red is another run, blue is this PR; the x-axis is in milliseconds.

So we're still seeing substantially the same curve, but the outliers are a bit better.

@openshift-ci-robot removed the lgtm label on Jan 15, 2019
@smarterclayton (Contributor Author)

Retrying with 120 GB / 400 IOPS io1 drives.

@smarterclayton changed the title from "aws: Increase the default master disk size to 160gb for more io" to "aws: Increase the default master disk size for more io" on Jan 15, 2019
@wking (Member) commented Jan 15, 2019

All nice and green :). Can you plot a pretty graph with the io1 results too, so we can ooh and ahh?

@smarterclayton (Contributor Author)

Guess what?!

io1 did nothing!!!!

[chart: io1 latency distribution, essentially unchanged from baseline]

@smarterclayton (Contributor Author)

I love data

@smarterclayton (Contributor Author)

For the curious, here's the one-liner I used on the etcd logs; it extracts each "took too long (…)" duration, normalizes values reported in seconds to milliseconds, strips the unit, and sorts numerically:

cat ~/Downloads/kube-system_etcd-member-ip-10-0-* | sed -E -e 's/.*took too long \(([^\)]+)\).*/\1/;t;d' | sed -E -e 's/^([0-9]+)\.([0-9]{3})([0-9])*s/\1\2.\3ms/' | sed -e 's/ms//' | sort -n > /tmp/times3.txt
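As a companion to that pipeline, here's a minimal Go sketch (mine, not from the PR) that summarizes the resulting file: it reads the one-millisecond-value-per-line output at /tmp/times3.txt and prints a few tail percentiles.

package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"sort"
	"strconv"
)

func main() {
	// /tmp/times3.txt is the output of the sed/sort pipeline above:
	// one latency in milliseconds per line.
	f, err := os.Open("/tmp/times3.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var ms []float64
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		v, err := strconv.ParseFloat(scanner.Text(), 64)
		if err != nil {
			continue // skip any stray non-numeric lines
		}
		ms = append(ms, v)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	if len(ms) == 0 {
		log.Fatal("no samples found")
	}
	sort.Float64s(ms) // already sorted by `sort -n`, but cheap insurance

	// Nearest-rank percentiles over the sorted samples.
	for _, p := range []float64{0.50, 0.90, 0.99} {
		idx := int(p * float64(len(ms)-1))
		fmt.Printf("p%2.0f: %8.1f ms\n", p*100, ms[idx])
	}
	fmt.Printf("max: %8.1f ms (n=%d)\n", ms[len(ms)-1], len(ms))
}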

@smarterclayton (Contributor Author)

Trying with m4.xlarge

@openshift-ci-robot added the size/XL (500-999 lines changed) label and removed the size/XS label on Jan 15, 2019
.openshift_install.log — review thread (outdated, resolved)
@openshift-ci-robot added the size/XS label and removed the size/XL label on Jan 16, 2019
@smarterclayton (Contributor Author) commented Jan 16, 2019

m4.xlarge (4 CPUs, up from 2) definitely improved it:

[chart: latency distribution with m4.xlarge masters]

I think we should bump the instance size for now until the rebase lands, at which point we can look at reducing it. The two known CPU issues should be addressed in the rebase.

(This is a log-scale y-axis, so roughly a 10x improvement in tail latency.)

@openshift-ci-robot added the size/S (10-29 lines changed) label and removed the size/XS label on Jan 16, 2019
@@ -81,7 +82,7 @@ func (m *Master) Generate(dependencies asset.Parents) error {
 		if err != nil {
 			return errors.Wrap(err, "failed to create master machine objects")
 		}
-		aws.ConfigMasters(machines, ic.ObjectMeta.Name)
+		aws.ConfigMasters(machines, ic.ObjectMeta.Name, mpool.InstanceType)
Reviewer (Contributor):

why is this required?

Author (@smarterclayton):

@wking ^ this is what I get :)

@wking (Member):

> why is this required?

Ah, sorry about that. Yeah, we can probably drop this, since aws.provider should be applying this for us.

@wking (Member):

@smarterclayton, can you unwind this change with:

diff --git a/pkg/asset/machines/aws/machines.go b/pkg/asset/machines/aws/machines.go
index 865d40a..7e04a99 100644
--- a/pkg/asset/machines/aws/machines.go
+++ b/pkg/asset/machines/aws/machines.go
@@ -120,12 +120,10 @@ func tagsFromUserTags(clusterID, clusterName string, usertags map[string]string)
 	return tags, nil
 }
 
-// ConfigMasters sets the PublicIP flag, bumps the instance type, and
-// assigns a set of load balancers to the given machines
-func ConfigMasters(machines []clusterapi.Machine, clusterName string, instanceType string) {
+// ConfigMasters sets the PublicIP flag and assigns a set of load balancers to the given machines
+func ConfigMasters(machines []clusterapi.Machine, clusterName string) {
 	for _, machine := range machines {
 		providerSpec := machine.Spec.ProviderSpec.Value.Object.(*awsprovider.AWSMachineProviderConfig)
-		providerSpec.InstanceType = instanceType
 		providerSpec.PublicIP = pointer.BoolPtr(true)
 		providerSpec.LoadBalancers = []awsprovider.LoadBalancerReference{
 			{
diff --git a/pkg/asset/machines/master.go b/pkg/asset/machines/master.go
index f4c3806..2c782ac 100644
--- a/pkg/asset/machines/master.go
+++ b/pkg/asset/machines/master.go
@@ -82,7 +82,7 @@ func (m *Master) Generate(dependencies asset.Parents) error {
 		if err != nil {
 			return errors.Wrap(err, "failed to create master machine objects")
 		}
-		aws.ConfigMasters(machines, ic.ObjectMeta.Name, mpool.InstanceType)
+		aws.ConfigMasters(machines, ic.ObjectMeta.Name)
 
 		list := listFromMachines(machines)
 		raw, err := yaml.Marshal(list)

I'm testing that now to make sure it still works...

-	return awstypes.MachinePool{
-		InstanceType: "m4.large",
-	}
+	return awstypes.MachinePool{}
Reviewer (Contributor):

why is this emptied?

Author (@smarterclayton):

@wking ^

@wking (Member):

> why is this emptied?

Because we no longer have any default AWS configuration shared by both masters and workers. They each have their own defaults for instance types (and, with #1079, for volumes), which get applied directly after the defaultAWSMachinePoolPlatform() calls. In fact, my preference would be to drop defaultAWSMachinePoolPlatform entirely in favor of an explicit awstypes.MachinePool{} when initializing AWS machine pools.
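A tiny hypothetical illustration of that preference — initializing the pool explicitly instead of going through the helper (package and import as in the sketch earlier in this thread; the function name is made up):

func masterMachinePool() awstypes.MachinePool {
	// Explicit zero value instead of defaultAWSMachinePoolPlatform().
	mpool := awstypes.MachinePool{}
	// Role-specific default applied right after, per this PR.
	mpool.InstanceType = "m4.xlarge"
	return mpool
}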

@wking (Member):

Ah, maybe this was working around #1076.

@smarterclayton (Contributor Author)

/retest

@smarterclayton (Contributor Author)

I had a run with proper resource constraints set on etcd, and it didn't move the needle enough (a 4x improvement). We need this as well.

[chart: baseline vs. etcd CPU request vs. doubled cores]

Blue is baseline, red is moving etcd to a 300m CPU request (openshift/machine-config-operator#316), and yellow is doubling cores. So it's a good tail improvement.

I've updated the PR per Trevor's comments; let me know what else needs to be done.

@smarterclayton (Contributor Author)

/retest

@wking (Member) commented Jan 16, 2019

nit: can we squash down to one commit?

@@ -120,7 +120,8 @@ func tagsFromUserTags(clusterID, clusterName string, usertags map[string]string)
 	return tags, nil
 }
 
-// ConfigMasters sets the PublicIP flag and assigns a set of load balancers to the given machines
+// ConfigMasters sets the PublicIP flag, bumps the instance type, and
+// assigns a set of load balancers to the given machines
@wking (Member):

No need to update this comment, since ConfigMasters no longer bumps the instance type.

The squashed commit message:

We were seeing frequent long requests from etcd. After increasing
CPU (2 -> 4 cores) those pauses dropped significantly. Increase
the limit until the rebase lands and we can deliver the CPU perf
improvements to the control plane.

Make a set of changes to connect machine set size to the values
passed as input. Update the docs.
@smarterclayton (Contributor Author)

de-nitified

@crawford (Contributor)

/lgtm

@openshift-ci-robot added the lgtm label on Jan 16, 2019
@wking (Member) commented Jan 16, 2019

/lgtm
/hold

@smarterclayton, I'm fine if you want to pull the hold yourself, or if you want to wait for @abhinavdahiya and/or @crawford.

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [crawford,smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking (Member) commented Jan 16, 2019

/hold cancel

I see @crawford is already here ;).

@openshift-ci-robot removed the do-not-merge/hold label on Jan 16, 2019
@abhinavdahiya (Contributor)

IMO this change is not required for this PR: https://github.com/openshift/installer/pull/1069/files#diff-df3202277ffea159153362ec095a8762

@smarterclayton (Contributor Author)

I can nuke that

@wking (Member) commented Jan 16, 2019

> IMO this change is not required for this PR: https://github.com/openshift/installer/pull/1069/files#diff-df3202277ffea159153362ec095a8762

I think we want that change, but I'm fine going either way to get this PR landed and hashing it out later.

@wking (Member) commented Jan 16, 2019

e2e-aws:

fail [k8s.io/kubernetes/test/e2e/framework/util.go:2355]: Expected error:
    <*errors.errorString | 0xc4215d4d30>: {
        s: "failed to get logs from pod-configmaps-679ec92b-19da-11e9-9cbd-0a58ac1041b1 for configmap-volume-test: an error on the server (\"unknown\") has prevented the request from succeeding (get pods pod-configmaps-679ec92b-19da-11e9-9cbd-0a58ac1041b1)",
    }
    failed to get logs from pod-configmaps-679ec92b-19da-11e9-9cbd-0a58ac1041b1 for configmap-volume-test: an error on the server ("unknown") has prevented the request from succeeding (get pods pod-configmaps-679ec92b-19da-11e9-9cbd-0a58ac1041b1)
not to have occurred

Jan 16 22:02:27.651 W persistentvolume=volume-idempotent-delete-6svns Error deleting EBS volume "vol-037c7a92290037b82" since volume is in "deleting" state count(1)

failed: (37.9s) 2019-01-16T22:02:43 "[sig-storage] ConfigMap should be consumable from pods in volume as non-root with defaultMode and fsGroup set [NodeFeature:FSGroup] [Suite:openshift/conformance/parallel] [Suite:k8s]"

and more (generally involving Error from server: error dialing backend: remote error: tls: internal error).

/retest

@wking (Member) commented Jan 17, 2019

images:

2019/01/17 01:07:20 Copied 0.20Mi of artifacts from release-latest to /logs/artifacts/release-latest
2019/01/17 01:07:26 Ran for 9m46s
error: could not run steps: test "release-latest" failed: pod release-latest was already deleted

/retest

@openshift-merge-robot openshift-merge-robot merged commit b63074b into openshift:master Jan 17, 2019
@jeremyeder
related: #1087

@mykaul (Contributor) commented Jan 19, 2019

Any idea if we need something similar for libvirt-based installations? I'm keeping this small diff locally (since it doesn't seem it would be accepted):

[ykaul@ykaul installer]$ git diff
diff --git a/pkg/asset/machines/libvirt/machines.go b/pkg/asset/machines/libvirt/machines.go
index 909e08a51..defa4ab36 100644
--- a/pkg/asset/machines/libvirt/machines.go
+++ b/pkg/asset/machines/libvirt/machines.go
@@ -64,8 +64,8 @@ func provider(clusterName string, networkInterfaceAddress string, platform *libv
                        APIVersion: "libvirtproviderconfig.k8s.io/v1alpha1",
                        Kind:       "LibvirtMachineProviderConfig",
                },
-               DomainMemory: 2048,
-               DomainVcpu:   2,
+               DomainMemory: 4096,
+               DomainVcpu:   4,
                Ignition: &libvirtprovider.Ignition{
                        UserDataSecret: userDataSecret,
                },

But it seems the above kinda proves my point? (Note: I did not test for etcd timeouts specifically, but I am frequently seeing deployments fail on timeouts...)

@wking (Member) commented Jan 19, 2019

> I'm keeping (since it doesn't seem to be accepted) this small diff...

No need for patching, since #785 set these up to allow environment-variable overrides. I can't speak for the other maintainers, but personally I'd rather punt further tuning until after the origin rebase lands, since that's rumored to have some memory-usage improvements.
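For reference, a sketch of that kind of override, assuming #785 exposed these as Terraform variables (the exact variable names here are an assumption; check the installer's libvirt documentation):

# Assumed variable names; verify against the libvirt HOWTO before relying on them.
export TF_VAR_libvirt_master_memory=4096
export TF_VAR_libvirt_master_vcpu=4
openshift-install create cluster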

Labels
approved · lgtm · size/S
9 participants