Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge master into layering #3060

Closed
wants to merge 54 commits into from
Closed

Conversation

cgwalters
Copy link
Member

Keeping up with things

mkenigs and others added 30 commits December 23, 2021 14:25
The call to TempDir a few lines above already created this directory, so
this call to MkdirAll is completely unecessary
When we added the nodeip-configuration service for None platform
deployments, we broke some existing users who were relying on the
(largely undefined) previous behavior Kubelet used to select its
node ip. While it is possible to work around this by overriding the
node ip selection logic, that's very cumbersome and not an acceptable
user experience.

This change adds a KUBELET_NODEIP_HINT env variable that can be used
to override the default behavior of runtimecfg when selecting a node
ip. When the variable is unset, the old behavior of selecting an
address on the interface of the default route will take effect. When
the variable is set, its value will be passed to runtimecfg like a
VIP for the IPI platforms. This will cause runtimecfg to prefer an
address in the same subnet as the one provided in
KUBELET_NODEIP_HINT. If no such address is found, it will fall back
to the default route logic as before.

KUBELET_NODEIP_HINT can be set using a systemd environment file.
The file must be named /etc/default/nodeip-configuration
with contents such as (replacing the IP as appropriate):

KUBELET_NODEIP_HINT=192.0.2.1

This file should be created using a machine-config manifest that is
passed to the installer so it will take effect on initial deployment.
The node ip cannot be changed after the node registers initially so
this cannot be done as a day 2 operation.

Note that the IP specified in the hint does not necessarily need to
exist in the environment, it just needs to be in the correct subnet.
No traffic will be sent to this address.

Co-authored-by: Dan Winship <danwinship@redhat.com>
The machine config controller did not previously have a metrics handler
so one must be added in order for us to do any alerting/metrics work.
This requires setting up:
- Cluster Roles
- Cluster Role Bindings
- ServiceMonitor for metrics
- Service for metrics
- oauth-proxy sidecar to deploymentfor machine-config-controller
- mcc-proxy-tls secret for machine-config-controller
- metrics handler function in machine-config-controller common
- Cluster Roles
- Cluster Role Bindings
- ServiceMonitor for metrics
- Service for metrics
- oauth-proxy sidecar to deploymentfor machine-config-controller
- mcc-proxy-tls secret for machine-config-controller
- metrics handler function in machine-config-controller common

I cribbed off of: 557303f
And then to add oauth: 3ab692f
Adds certificate helper functions to:
- extract certificates from PEM bundles
- find the certificate that has the latest expiry date
when provided a list
Adds functionality to the node controller such that:
1.) when a paused machine config pool attempts to sync
2.) if the kubelet-ca has been updated in
the pool's 'spec' config
3.) the MCC will set metric to the NotAfter date of the
kube-apiserver-to-kubelet-signer certificate
5.) once the pool is unpaused, that metric will be
reset to zero
Testutil package from the prometheus client used in the
node_controller tests, needed to add as dependency.

Commands run:
```
$ go mod tidy
$ go mod vendor
$ make verify
```
Adds an e2e test that steps through the rotation of the
kubelet-apiserver-to-kubelet-signer by:
- pausing a pool
- rotating the certificate
- checking that the proper metric is emitted
- unpausing the pool
- checking that the metric stops being emitted
Node controller now requires a
MachineConfigInformer as part of its New() function,
updates bootstrap_tests to match
As we now tear down and reconfigure br-ex on every reboot, we must
provide a means to stabilize interface selection in scenarios with
multiple default route interfaces.

Signed-off-by: Andreas Karis <ak.karis@gmail.com>
Signed-off-by: Andreas Karis <ak.karis@gmail.com>
Update controllerconfig CRD and relevant switch statements
in pkg to handle Nutanix platform. Also Update
install/0000_80_machine-config-operator_00_namespace.yaml

Add `openshift-nutanix-infra` to list of namespaces.
Right now Fedora doesn't ship Go 1.17, only Go 1.18beta.  That
version emits a different error message for incompatible TLS
versions.  Adjust our unit test to handle both.

(Also, a motivation for me is to cross-check the new CI configuration
 after openshift/release#27015 )
server/api_test: Adjust expected error message for Go 1.18
Created MCONamespace constant and used in all *.go files except for
test/helpers/utils.go which would create a cyclic import
…-certificate

Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool
Remove the restriction on the runtime-request-timeout
option in the kubeletconfig.

Signed-off-by: Urvashi Mohnani <umohnani@redhat.com>
…-74-controller-alert-certificate

Revert "Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool"
…2802-mco-74-controller-alert-certificate"

This reverts commit b80e6a1, reversing
changes made to 57267b7.

This "un-reverts" the reversion so we can put PR 2802 back in with the
fix to resourcemerge.
Resourcemerge did not previously merge a container's Resources.Requests
in ensureContainer(), which meant that during upgrade cases where we update
the container object directly with changes (instead of applying/re-applying
the manifests), Resources.Requests changes would not propagate to the
updated object.

This makes ensureContainer update Resources.Requests if it has changed,
which keeps that structure from getting scraped off when we update. ( Which
will keep us from failing tests, since at least cpu and memory in that
structure are required fields )
Make our resourcemerge fork update a container's Resources.Requests, un-revert openshift#2802
This will keep layered and non-layered update logging consistent
bootstrap_test.go: remove unused constants
The main motivation here is to work around
coreos/rpm-ostree#3523
(Which is itself a workaround for a RHEL8 systemd bug)

Basically this e2e is invoking `rpm-ostree kargs` in a pretty
tight loop which triggers that bug.

To read the kernel command line, we can just read `/proc/cmdline`
instead.  (Now, this is the *actual* cmdline instead of just rpm-ostree's
view of it, but it should be fine)
…latform

Add Nutanix Platform to Machine Config Operator
Today, typing `make` does nothing, which is not very useful.  By listing this rule first,
`make` will default to `make binaries`.
Fix description typo in osImageURL CRD parameter
e2e: Use `/proc/cmdline` instead of `rpm-ostree kargs`
@openshift-ci openshift-ci bot requested a review from sinnykumari April 5, 2022 16:46
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 5, 2022
@cgwalters cgwalters changed the base branch from master to layering April 5, 2022 16:47
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 5, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kikisdeliveryservice
Copy link
Contributor

retesting all of these (single node didnt bootstrap)
/retest

@kikisdeliveryservice
Copy link
Contributor

ok ignore the single node comment above bc they apparently never worked in this branch, but the gcp-op failures seem.. not like flakes? for ex:

        	Messages:   	failed to exec cmd [cat /rootfs/etc/crio/crio.conf.d/01-ctrcfg-logLevel] on node ci-op-5zk9gyv0-b7394-v48gf-worker-c-kgchx: 
....

@kikisdeliveryservice
Copy link
Contributor

also that bot report above seems wrong?
/skip

@cgwalters
Copy link
Member Author

/retest

1 similar comment
@cgwalters
Copy link
Member Author

/retest

@cgwalters
Copy link
Member Author

That's...weird, it's like that job somehow lost our override adding rhel-coreos in CI. I wonder if there's some confusion based on the git merges causing the CI setup to be reused across master/layering branches?
/test images

@cgwalters
Copy link
Member Author

/retest

1 similar comment
@cgwalters
Copy link
Member Author

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 13, 2022

@cgwalters: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-op-single-node b78d77a link false /test e2e-gcp-op-single-node
ci/prow/okd-images fbcbb7e2cb35fcd3e3e5e79304fc8a82442d77d5 link false /test okd-images
ci/prow/verify fbcbb7e2cb35fcd3e3e5e79304fc8a82442d77d5 link true /test verify
ci/prow/unit fbcbb7e2cb35fcd3e3e5e79304fc8a82442d77d5 link true /test unit
ci/prow/bootstrap-unit fbcbb7e2cb35fcd3e3e5e79304fc8a82442d77d5 link false /test bootstrap-unit
ci/prow/e2e-agnostic-upgrade fbcbb7e2cb35fcd3e3e5e79304fc8a82442d77d5 link true /test e2e-agnostic-upgrade
ci/prow/e2e-gcp-single-node b78d77a link false /test e2e-gcp-single-node
ci/prow/e2e-gcp-op b78d77a link true /test e2e-gcp-op

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Copy link
Member Author

OK let's do #3126 first, then I think it may make sense to instead force-rebase layering on master.

@cgwalters cgwalters marked this pull request as draft April 29, 2022 16:42
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 29, 2022
@cgwalters
Copy link
Member Author

Trying a rebase of layering on current master, there are some conflicts to work through. Split out one bit in #3133

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 1, 2022
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 1, 2022
@openshift-merge-robot
Copy link
Contributor

@cgwalters: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 1, 2022
@openshift-bot
Copy link
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Oct 1, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 1, 2022

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. layering lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet