KEP: Dynamic Audit Configuration #2188
Conversation
/cc @kubernetes/sig-api-machinery-feature-requests @kubernetes/sig-auth-feature-requests
After sending out https://lists.cncf.io/g/cncf-k8s-conformance/message/417
/ok-to-test
/assign @ericchiang @liggitt @tallclair
/cc @loburm
Nice, I agree these are important features to have.
keps/0007-dynamic-audit-control.md
Outdated
- group: <group>
  resources: <resources>
```
In kube-apiserver today this is only accepted as a file, this KEP would allow it to either be configured as a file on the master node or applied to the `kube-system` namespace as an object and be dynamically configured. This will require the audit API to be registered with the apiserver. If both file and object exist the dynamic object will take precedence.
What if I want to enforce that the audit policy cannot be overwritten, even by someone who has the ClusterAdmin role (root in cluster)?
I suspected this would be controversial; if we want to switch it around, that works for me. I was going for observability first, but there may be security concerns that are more important.
Rather than introduce a new API object, why don't we just watch the audit policy file and load policy from it dynamically?
Reasons:
- The audit policy object should be owned by the cluster admin. There is no need to expose it through the API.
- There is only one audit policy. We usually create multiple objects when there is an API in kube-apiserver.
Advantages:
- We can still configure it dynamically.
- Easy to implement. (I think...)
- Removes one additional threat. (One cannot disable audit events by POSTing a policy to the API.)
Disadvantage:
- The cluster admin has to check the audit log or apiserver log to ensure that their modification has been accepted by the apiserver. If there is an error in the config file, the cluster admin cannot get a timely and obvious error message; they have to check the log.
We can add a new option to kube-apiserver:
`--audit-policy-check-frequency duration` — Duration between checks of the audit policy file for new policy rules (default 0s). If set to 0s, kube-apiserver will not dynamically load new policy when the file is modified.
This would require write access to the master node, which is highly privileged. The main goal of the proposal is to provide drop-in components that can utilize the audit data, which in turn enhances the security and conformance of the cluster.
Based on the conversations, we're looking to provide a root policy that cannot be tampered with, but enable the ability to dynamically configure secondary policies and webhooks.
> This would require write access to the master node which is highly privileged.
Agree. This is a reason.
keps/0007-dynamic-audit-control.md
Outdated
apiVersion: audit.k8s.io/v1alpha1
kind: AuditConfiguration
metadata:
  name: <name>
Does this mean multiple webhooks can be setup at once?
Use cases for multiple webhooks / policies:
- There is an ongoing permanent audit policy, but a temporary audit with a different policy (and likely webhook endpoint) needs to be added, e.g. enabling APISnoop for a specific application.
- Joint auditing with the same policy but dual webhooks to independent auditors.
I was shooting for a single webhook initially just to match parity with the current system, but if we think there are sufficient use cases for multiples, I'm happy to change direction.
keps/0007-dynamic-audit-control.md
Outdated
### Risks and Mitigations

This does open up the attack surface of the audit mechanisms. Having them strictly configured through the api server has the advantage of limiting the access of those configurations to those that have access to the master node. However, Dynamic Admission has shown that this can be done in a safe manner.
I would like to see this section expanded. I can think of a few additional threats that are opened up with this:
- privileged user changes audit policy to hide (not audit) malicious actions
- privileged user changes audit policy to DoS audit endpoint (with malintent, or ignorance)
- privileged user changes webhook configuration to hide malicious actions
As a potential mitigation strategy, what if we said that the flag-defined configuration always takes precedence, but can be combined with the dynamic config? This would let the cluster operator define a policy subset that must be abided by (e.g. ALWAYS log secrets access, at metadata level; NEVER log events, ALWAYS log dynamic audit log changes). This could easily be done by just appending the dynamic config to the static config.
Similarly, the operator could enforce that audit logs are always sent to the flag-configured webhook. WDYT?
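To make that concrete, a flag-defined baseline along these lines could pin down the operator's invariants while leaving everything else to dynamic configuration. This is only an illustrative sketch using the existing audit.k8s.io policy format; the rule contents are hypothetical examples of such invariants:

```yaml
apiVersion: audit.k8s.io/v1beta1
kind: Policy
rules:
# ALWAYS log secrets access, at metadata level
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets"]
# NEVER log events
- level: None
  resources:
  - group: ""
    resources: ["events"]
```

Since audit policy rules are first-match, appending a dynamic policy after these rules would let it affect only requests not already matched by the static baseline.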
> As a potential mitigation strategy, what if we said that the flag-defined configuration always takes precedence, but can be combined with the dynamic config? This would let the cluster operator define a policy subset that must be abided by (e.g. ALWAYS log secrets access, at metadata level; NEVER log events, ALWAYS log dynamic audit log changes). This could easily be done by just appending the dynamic config to the static config.
that's tricky... I can see valid use cases both ways:
- flag-defined config that limits what dynamic configs can do
- flag-defined config for operating in normal mode, dynamic config that allows temporary verbose logging to debug a particular issue
Temporary verbose logging is an interesting use case. Could we settle on assuming that if you want that feature you would need to leave your static config open to it, and configure the rest dynamically?
Having dynamic audit config overriding flag-defined config as defined is very problematic, IMHO. Dynamic policies, e.g. defined as an API object, would be letting the API monitor itself. It would remove the ability to provide any guarantee of the integrity of an audit.
The only tenable way to allow some dynamism to audit policy would be to have flag-defined config be the ultimate authority and let it specify the bounds and policy for applying any dynamic configuration.
> As a potential mitigation strategy, what if we said that the flag-defined configuration always takes precedence, but can be combined with the dynamic config? This would let the cluster operator define a policy subset that must be abided by (e.g. ALWAYS log secrets access, at metadata level; NEVER log events, ALWAYS log dynamic audit log changes). This could easily be done by just appending the dynamic config to the static config.
Agreed on making the flag-defined configuration read-only.
The dynamic audit config will send audits to its own backends; they should not affect each other.
keps/0007-dynamic-audit-control.md
Outdated
throttleEnabled: <true>
throttleQPS: <10>
```
The webhook would share the model of validating and mutating admission control plugins, leveraging a shared informer to provide online configuration. If runtime flags are also provided, the dynamic configuration will override any parameters provided there.
If we allowed multiple webhooks to be configured simultaneously, I think it might make sense to allow different policies to be configured for different webhooks.
If we go that route I agree; we could add a `policy` field to the `AuditConfiguration` object that would take the name of the policy object.
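As a sketch of that direction (the `policy` field, its placement, and the object names here are hypothetical, not part of the current proposal):

```yaml
apiVersion: audit.k8s.io/v1alpha1
kind: AuditConfiguration
metadata:
  name: security-audit
spec:
  # hypothetical field: name of a Policy object to apply to this webhook
  policy: security-audit-policy
  webhook:
    throttleQPS: 10
```

Each webhook would then filter events through its named policy before sending them, allowing different verbosity per backend.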
Please read the API conventions. In particular:
https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#naming-conventions
keps/0007-dynamic-audit-control.md
Outdated

- 05/18/2018: initial design

## Alternatives
As an extension, do you think it would ever make sense to allow per-namespace policies? We would probably want to enforce that the namespace-level policies are additive only (i.e. can only increase data logged, not decrease), or maybe appended. Out of scope for this proposal, but something that could be a follow up.
I like this idea; I think inherently you'll want stricter policy on certain namespaces while others cause bloat. We'll track this as a later feature.
> As an extension, do you think it would ever make sense to allow per-namespace policies?
This is what my boss wants. I'm happy to implement it.
I think a quota for each namespace's webhooks is necessary.
keps/0007-dynamic-audit-control.md
Outdated
As a cluster admin I will easily be able to enable the internal auditing features of an existing cluster, and tweak the configurations as necessary.

#### Story 2
As a Kubernetes extension developer, I will be able to provide drop-in extensions that utilize audit data.
Related: kubernetes/kubernetes#53455
@hh this is a good use case for multiple webhooks, I'll update to reflect. @tallclair do we see this as working in tandem with kubernetes/kubernetes#53455 ?
kubernetes/kubernetes#53455 is for a pull model (at least for setting up the initial connection), whereas multiple webhooks would be for a push model. There's definitely overlap, but I think there's a use case for both. One thing to note is that if there is a policy-per-webhook, that gets confusing with kubernetes/kubernetes#53455: which policy does it use?
keps/0007-dynamic-audit-control.md
Outdated
- group: <group>
  resources: <resources>
```
In kube-apiserver today this is only accepted as a file, this KEP would allow it to either be configured as a file on the master node or applied to the `kube-system` namespace as an object and be dynamically configured. This will require the audit API to be registered with the apiserver. If both file and object exist the dynamic object will take precedence.
> This will require the audit API to be registered with the apiserver.
I would like there to be additional controls to enable/disable the dynamic policy and/or dynamic webhook config.
agreed, updating
keps/NEXT_KEP_NUMBER
Outdated
@@ -1 +1 @@
7
8
I think this is supposed to be a separate PR, so that you don't need to keep renaming the file (and losing the comments).
ah ok, misunderstood that part, updating
keps/0007-dynamic-audit-control.md
Outdated
```yaml
apiVersion: audit.k8s.io/v1beta1
kind: Policy
omitStages:
I suppose omitStages should be under the rules: https://github.com/kubernetes/apiserver/blob/master/pkg/apis/audit/types.go#L228
`omitStages` can also be global: https://github.com/kubernetes/apiserver/blob/master/pkg/apis/audit/types.go#L168
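For reference, the linked v1beta1 types allow `omitStages` at both levels. A policy mixing the two might look like this (the rule contents are illustrative, not from the KEP):

```yaml
apiVersion: audit.k8s.io/v1beta1
kind: Policy
# global: these stages are omitted for all rules
omitStages:
- RequestReceived
rules:
- level: Metadata
  resources:
  - group: ""
    resources: ["pods"]
  # per-rule: additionally omitted for requests matching this rule
  omitStages:
  - ResponseStarted
```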
/cc
keps/0007-dynamic-audit-control.md
Outdated

### Dynamic Configuration of Flags
There are a number of configuration options that are currently set through runtime flags. This KEP would provide a new API for the dynamic configuration of the webhook parameters.
```yaml
+1 to combining configuration flags into an object.
We have 30 audit-related flags in kube-apiserver now:
$ kube-apiserver --help | grep audit | wc -l
30
I can't wait to see this happen.
keps/0007-dynamic-audit-control.md
Outdated
- group: <group>
  resources: <resources>
```
In kube-apiserver today this is only accepted as a file, this KEP would allow it to either be configured as a file on the master node or applied to the `kube-system` namespace as an object and be dynamically configured. This will require the audit API to be registered with the apiserver. If both file and object exist the dynamic object will take precedence.
> applied to the `kube-system` namespace as an object and be dynamically configured
Why is the policy not a cluster-scoped object?
good catch, updating
The doc still says `kube-system`.
@CaoShuFeng I plan on beginning the implementation this week, and will tweak things as feedback comes in

const (
	// WebhookModeBatch sends requests in batches
	// bloking mode is currently disable but will be discussed at a later date
What does it mean? When dynamic config is enabled in the cluster, will it disable the ability to use blocking mode?
There was a concern raised about dynamically configuring a webhook to be blocking with a bad configuration which kills the apiserver. You could still configure a blocking webhook through flags.
Ack. So, is the idea for now to keep one "static" configuration via flags, as it exists today, and in addition to that, provide any number of additional dynamic configurations that effectively enable multiple backends? If yes, can you state that explicitly in the proposal?
yep, that's the current proposal. It's discussed in more detail here: https://github.com/kubernetes/community/pull/2188/files#diff-ed9ccb4e76c92c65792d7180a7f0bd63R153 Let me know if that needs further explanation
@@ -0,0 +1,231 @@
---
kep-number: 7
Don't forget to update this.
- "@liggitt"
approvers:
- "@tallclair"
- "@liggitt"
I'd like to have someone from apimachinery on this list to approve the webhook parts. In particular, the ability to create multiple webhooks and dynamically configure the performance characteristics.
added @yliaog
- Provide an api and set of objects to configure the advanced auditing kube-apiserver configuration dynamically

### Non-Goals
- Provide a generic interface to configure all kube-apiserver flags
non-goals is an opportunity to define the scope of this feature / proposal. Can you be more specific about what is out of scope here? Examples might include:
- composable audit policies per-endpoint
- configuring non-webhook backends
- configuring audit output (format or per-field filtering)
- authorization of audit output
type Backends struct {
	// Webhooks holds the webhook backends
	Webhooks []*WebhookBackend `json:"webhooks,omitempty"`
}
What's the motivation for specifying multiple backends per configuration? I think it might simplify things to have 1:1
So the original idea was that you may want to define one policy and have multiple backends work off of that. This may only really become relevant if we allow more policies, so I'm fine removing it for the time being.
// WebhookBackend holds the configuration of the webhooks
type WebhookBackend struct {
	// Mode for webhook backend
	Mode WebhookMode `json:"mode"`
Can we just omit the field initially if it only has a single allowed value?
}

// BatchConfig holds the configuration for batch webhooks
type BatchConfig struct {
Yeah, I feel fine about having global options for these. If we ever decide we need them to be dynamically configurable, it's easy to add back later (with the global settings acting as the default values).

### Implementation Details/Notes/Constraints

When an update to a configuration is issued, the server will cache the current working configuration. If the update fails to apply, the system will roll back to the previous configuration and log that action, if configured through the api a 400 will be returned. If no previous configuration exists the application of the config will fail, and log the result. Any actions to the audit configuration objects will be hard coded to log at the `level=RequestResponse` to the previous backend and the new backend. If multiple API servers exist, the configuration will be rolled out in increments.
Sorry, I still don't understand this. What specifically does this mean?
> If the update fails to apply, ...
Maybe it would help if you describe the implementation of that condition a bit more?
I think this only applies to policy; since we're not handling that here, I'll remove it.
Almost there... thanks for scoping down so much.

### Implementation Details/Notes/Constraints

Any actions to the audit configuration objects will be hard coded to log at the `level=RequestResponse` to the previous backend and the new backend. If multiple API servers exist, the configuration will be rolled out in increments.
> If multiple API servers exist, the configuration will be rolled out in increments.
What does this mean? Is this talking about HA masters, or aggregated api servers?

Sucess will be determined by stability of the provided mechanisms and ease of understanding for the end user.

* alpha: Policy and api server flags can be dynamically configured, known issues are tested and resolved.
drop policy
@tallclair @liggitt @yliaog I think we are in a good place with this here, is there anything else you would like to see? Or can this move to next steps? Thanks
## Proposal

### Dynamic Configuration
Dynamic configuration of the advanced auditing features will be provided through use of a shared informer.
I'm not sure what this means... what do informers have to do with configuration? Do you mean the audit pipeline will get the configuration from an informer?
If so, how will it be used? I assume event handlers to add, remove & update backends to sync with the configuration?
Suggestion:
Introduce a new "dynamic" audit backend type. The dynamic audit backend looks like the union backend, but keeps a map of config name -> BE, and keeps the list in sync using the informer event hooks.
The dynamic backend is enabled through a commandline flag. It is always configured with a wrapping truncate & batch backend. The implications of this are that options to the truncate & batch backends should not be included in the dynamic configuration (I think that's already the case). IMO the performance benefits of sharing those are worth it.
👍 I think we're on the same page here; I updated to make it more clear. Let me know if there's anything else we're missing
v1.ObjectMeta

// Backends to send events
Backends *Backends `json:"backends"`
nit: I think this should be singular. Update the comments too.
// Backends holds the configuration for the various event backends
type Backends struct {
	// Webhook holds the webhook backends
	Webhook *WebhookBackend `json:"webhooks,omitempty"`
nit: json doesn't match. I suggest just removing the tags from the KEP - they can be dealt with in the API PR review.
ThrottleQPS *float32 `json:"thottleQPS,omitempty"`

// ClientConfig holds the connection parameters for the webhook
ClientConfig *ClientConfig `json:clientConfig"`
nit: ClientConfig *WebhookClientConfig
ThrottleQPS *float32 `json:"thottleQPS,omitempty"`

// ClientConfig holds the connection parameters for the webhook
ClientConfig *ClientConfig `json:clientConfig"`
If this is required it shouldn't be a pointer (I think)
Almost there, just one more comment
}
```

This configuration object will be accepted both as the existing runtime flags, and a dynamic api object. Multiple definitions can exist as independent solutions. These updates will require the audit API to be registered with the apiserver. Dynamic configuration will be enabled by a feature gate. If existing flags are provided to configure the audit backend they will be taken as a separate configuration object.
Multiple definitions can exist as independent solutions. These updates will require the audit API to be registered with the apiserver. The dynamic configurations will be wrapped by truncate and batch options, which are set statically through existing flags. Dynamic configuration will be enabled by a feature gate for pre-stable releases, as well as a runtime flag `--audit-dynamic-enabled`. If existing flags are provided to configure the audit backend they will be taken as a separate backend configuration.
I think the feature gate is sufficient... I don't think we also need the `--audit-dynamic-enabled` flag.
LGTM. Please squash your commits. Any last thoughts from other reviewers? @liggitt @caesarxuchao @yliaog

## Graduation Criteria

Sucess will be determined by stability of the provided mechanisms and ease of understanding for the end user.
s/sucess/success/
/lgtm
9f01e0f to 332f76e
ok @tallclair I fixed up the spelling error and updated the next KEP number due to an issue with #2331, should be cleared up 👍
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jbeda. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Proposal to allow for the dynamic configuration of advanced auditing features.