RFC: Coordinated dynamic storage version #1201

Closed

Conversation

@caesarxuchao (Member) commented Aug 6, 2019

Superseded by #1220

An alternative (hopefully a better one) to #1176.

We propose a mechanism for HA API servers to vote on the storage version to use, then show the selected storage version in the discovery document and switch to using the selected version atomically.

/assign @yliaog @wojtek-t @lavalamp
Thank you for your previous comments in #1176.

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 6, 2019
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 6, 2019
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Aug 6, 2019

### Curating a list of participating API servers in HA master

Currently, API servers maintain such a list in the "kubernetes" endpoints. See

Member:

We have a way to do that - this doesn't mean we always do.
For example, in GKE this endpoint contains only the IP of the LB in front of the apiservers (whether that's good or bad is a separate discussion, but I guess we may not be the only ones doing something like that).

Member Author (@caesarxuchao), Aug 6, 2019:

I agree that other operators might have other forms of deployments that also hide the individual apiserver endpoints, so we can't rely on the current "kubernetes" service endpoints to get a reliable list of apiservers.

As I mentioned in the KEP, I will send another KEP to formalize the procedure of apiserver census. Operators will need to configure an ID for each apiserver to make it work.
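
For illustration, listing that endpoints object with client-go looks like the sketch below. Note, per the discussion above, this only enumerates individual apiservers in deployments that do not hide them behind a load balancer:

```golang
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// The "kubernetes" endpoints in the "default" namespace list the API
	// server addresses, unless the deployment substitutes an LB address.
	ep, err := cs.CoreV1().Endpoints("default").Get(context.TODO(), "kubernetes", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, subset := range ep.Subsets {
		for _, addr := range subset.Addresses {
			fmt.Println(addr.IP)
		}
	}
}
```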

participating API server. This means the API server has gone and the entry
is stale. Removes such entries.
* checks if all participating API servers vote for the same version. If so,
sets the version as the SelectedVersion.

Member:

What if not all of them are voting for the same version?

Member Author:

The preferred version in the first vote becomes the first selected version.

Afterwards, a new version becomes the selected version only after all apiservers prefer it.
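
To make that voting rule concrete, here is a minimal sketch; the VersionCandidate type mirrors the one quoted later in this thread, and the helper name is hypothetical:

```golang
// VersionCandidate mirrors the type proposed later in this thread.
type VersionCandidate struct {
	APIServerID      string
	PreferredVersion string
}

// nextSelectedVersion is a hypothetical helper illustrating the rule above:
// the selection moves to a new version only once every participating API
// server prefers that same version; otherwise the current selection stands.
func nextSelectedVersion(current string, candidates []VersionCandidate) string {
	if len(candidates) == 0 {
		return current
	}
	agreed := candidates[0].PreferredVersion
	for _, c := range candidates[1:] {
		if c.PreferredVersion != agreed {
			return current // no unanimity yet; keep the current selection
		}
	}
	return agreed
}
```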

checks if key does not already exist before committing a create operation. As
another example, kube-apiserver checks that the resourceVersion is the latest
before committing an update operation. Adding another transaction condition does
not add an extra RPC call.

Member:

It would be good to run a benchmark to understand the impact. I hope it will be relatively small, but I don't know etcd internals well enough to say exactly what would happen.
@jpbetz - FYI
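
For context, a minimal sketch of the kind of conditional transaction the excerpt describes, using etcd's clientv3 API (the key and value are placeholders):

```golang
package main

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// createIfAbsent commits a create only when the key does not already exist.
// The extra Compare clause travels in the same Txn RPC as the put, which is
// why adding another condition does not cost an additional round trip.
func createIfAbsent(ctx context.Context, cli *clientv3.Client, key, val string) (bool, error) {
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision(key), "=", 0)). // key absent
		Then(clientv3.OpPut(key, val)).
		Commit()
	if err != nil {
		return false, err
	}
	return resp.Succeeded, nil
}
```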

version. The storage version in the CRD spec is the "selected storage version".

The current implementation of the apiextension-apiserver already provides the
"loose guarantee".

Member:

Do we have any signals from users that it's not enough?
If not, maybe it's not worth changing that (and maybe that also means it should be good enough for built-in resources).

Member Author:

No, I haven't heard any complaints.

I'm willing to pursue the "loose guarantee" path if all reviewers think it's good enough.

* gets the StorageVersion object for this resource,
* if the object does not exist, creates one, setting this kube-apiserver's
preferred storage version as the SelectedVersion and adding its preferred
storage version to the CandidateVersions list. Then jumps to the last step.

Reviewer:

Which step is the last step? It's probably better to number the steps.

* gets the list of all participating API servers,
* gets all the StorageVersion objects,
* removes the StorageVersion object if none of its CandidateVersions.APIServerID
is a participating API server.

Reviewer:

Usually API servers are independent, unaware of the existence of the other API servers. What about using a keep-alive to have each API server report that it is still alive and participating?

Member Author:

Not sure what you meant by "keep-alive"; is it a formal Kubernetes API?

The way Kubernetes curates the "kubernetes" endpoints today uses a TTL mechanism (design, code); is that what you are suggesting?

Reviewer:

Yes, that is it.
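
For reference, the TTL pattern under discussion can be sketched with an etcd lease; this is an illustration of the mechanism, not the actual endpoint-reconciler code:

```golang
package main

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// registerWithTTL writes this server's registry entry under a lease; if the
// server stops heartbeating, the lease expires and etcd drops the entry.
func registerWithTTL(ctx context.Context, cli *clientv3.Client, key, val string, ttlSeconds int64) error {
	lease, err := cli.Grant(ctx, ttlSeconds)
	if err != nil {
		return err
	}
	if _, err := cli.Put(ctx, key, val, clientv3.WithLease(lease.ID)); err != nil {
		return err
	}
	// Heartbeat: keep the lease (and the entry) alive while the server runs.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		return err
	}
	go func() {
		for range ch { // drain keep-alive acks until ctx is cancelled
		}
	}()
	return nil
}
```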

exists in all parts of Kubernetes API. Client-side coordination is necessary
to overcome the race. Let's take the storage migrator as an example. For the
storage migrator to work properly, the required client-side coordination is
that at most one master version change can happen in any 5-minute window.

Reviewer:

I don't think this is client-side coordination; this is an assumption/guarantee made on the server side, i.e., cluster admins do not do a version upgrade or downgrade more than once every 5 minutes.

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: caesarxuchao
To complete the pull request process, please assign lavalamp
You can assign the PR to them by writing /assign @lavalamp in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@caesarxuchao (Member Author):

/assign @deads2k

server will change the selected version to the new storage version, and all API
servers start to use the new version.

### The selected storage version is supported by all API servers during upgrades/downgrades

Member:

This describes selecting which version to use for a type which all API servers support, but at different versions.

Would the same mechanism correctly handle a type which some API servers do not support at all (e.g., avoid serving a new beta type while there are joined API servers that do not make that type available; xref kubernetes/kubernetes#81286)?

Member Author:

I think so. It's mentioned in the "future work" section: "For example,
kube-apiservers can vote on what API is served, and only serve the API that is
supported by all servers."


During the rolling update of an HA master, API server instances *i)* use
different storage versions for a built-in API resource, and *ii)* show different
storageVersionHash in the discovery documents. These facts make the storage

Contributor:

Elaborate here. I suspect it's because etcd may contain records written in different storage versions.

[storageVersionHash]:https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L979
[details]:https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/35-storage-version-hash.md#ha-masters

We propose a mechanism for HA API servers to vote on the storage version to use,

Contributor:

I don't think we need to have all apiservers allow configurable storage. Instead, an apiserver can indicate which storage version it writes (hardcoded like today), and the entity driving the storage migrator can decide to wait until all apiservers agree. That builds one system on another instead of changing the underlying system.


```golang
// Storage version for a specific resource.
type StorageVersion struct {
	// ... (fields quoted later in this thread)
}
```

Contributor:

I would like to see a spec/status split here. apiservers write the status of what is

  1. groupversions this server can read
  2. groupversions this server will write

I don't see any fields for spec yet.

Member Author:

Is "groupversions this server can read" essential to this API? That information is beyond "storageVersion". I'll need to come up with a new API name.

### Storage Version API

We introduce a new API `StorageVersion`, in a new API group
`kube-apiserver.internal.k8s.io/v1alpha1`.

Contributor:

this isn't specific to kube-apiserver. This is generic for any aggregated apiserver as well.


type VersionCandidate struct {
// The ID of the reporting API server.
APIServerID string

Contributor:

How will old/outdated servers be removed? A server can crash and be replaced by a different server with a different ID. Imagine using an autoscaling deployment, for instance.

this [design doc][] for details.

We will formalize the API in another KEP. For the purpose of this KEP, all we
need to know is that HA master is able to maintain a list of participating API

Contributor:

Ah, neat. This is how you handle pruning. Combined with visibility into APIServices, this is sufficient to selectively prune apiservers.
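
A hypothetical pruning pass, using the StorageVersion and VersionCandidate shapes quoted in this thread plus the participating-server list described above:

```golang
// pruneCandidates drops votes from API servers that are no longer in the
// participating set, so entries from crashed or replaced servers decay.
func pruneCandidates(sv *StorageVersion, participating map[string]bool) {
	kept := sv.CandidateVersions[:0] // reuse the backing array
	for _, c := range sv.CandidateVersions {
		if participating[c.APIServerID] {
			kept = append(kept, c)
		}
	}
	sv.CandidateVersions = kept
}
```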


[design doc]:https://github.com/kubernetes/community/pull/939

### Voting for Storage Version

Contributor:

I'm not seeing why voting is necessary for the storage migrator to function correctly. It seems like being able to see the current state, and knowing whether it is consistent across the cluster, is sufficient to make a choice about migrating that cluster. That, in combination with a resource version, seems fully sufficient to know whether storage migration should commence and whether it was successful.

@sanchezl left a comment:

Overall I think this proposal should be about discovery, not configuration, enabling other processes to drive the storage migration.


We propose a mechanism for HA API servers to vote on the storage version to use,

sanchezl:

The StorageVersionHash problem linked above is about being able to detect when it's safe to trigger the storage migrator, whereas here you are suggesting you intend to actually configure which storage version to use. Please clarify the intended use. I would suggest this proposal should only be about discovering which storage versions are in use.

ObjectMeta
// The selected storage version. All API server instances encode the resource
// objects in this version when committing them to etcd.
SelectedVersion string

sanchezl:

If we are not configuring storage (I don't think this is what we want), this field is unnecessary.

APIServerID string
// The preferred storage version of the reporting API server. The preferred
// storage version is hardcoded in each Kubernetes release.
PreferredVersion string

sanchezl:

This should simply be the storage version in use by this api server.
