
KEP-2732 NetworkPolicy Versioning #2806

Closed

Conversation


@tfherbert tfherbert commented Jun 30, 2021

KEP-2732: NetworkPolicy Versioning

Adds a new KEP for Network Policy Versioning

  • This is modified from the original KEP-2136 PR: KEP-2136
    -- The old PR, submitted in Nov 2020, was closed: KEP-2136
  • Other comments:
    -- This KEP template isn't fully filled out yet. Additional feedback would be helpful.

/cc @jayunit100 @aojea @rikatz @thockin @astoycos @rpkatz @danwinship @abhiraut

Signed-off-by: Thomas F Herbert <therbert@redhat.com>
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 30, 2021
@k8s-ci-robot
Contributor

Hi @tfherbert. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tfherbert
To complete the pull request process, please assign dcbw after the PR has been reviewed.
You can assign the PR to them by writing /assign @dcbw in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Jun 30, 2021
@k8s-ci-robot
Contributor

Welcome @tfherbert!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 30, 2021
@tfherbert
Author

The LF Contributor agreement has been signed as required.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 30, 2021
@rikatz
Contributor

rikatz commented Jun 30, 2021

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 30, 2021
Signed-off-by: Thomas F Herbert <therbert@redhat.com>
@tfherbert
Author

/retest

@rikatz
Contributor

rikatz commented Jul 2, 2021

@tfherbert try ./hack/update-toc.sh (If you didn't :) )

@tfherbert
Author

/retest

Comment on lines +106 to +109
Just how special/unusual is NetworkPolicy in this regard? Have any
other SIGs already dealt with similar problems? Or if not are there
any other APIs that are similarly failing to deal with the same
problems?
Contributor

Other broadly similar challenges (in my opinion):

  • Ingress, and what features the controller supports. Partly solved by IngressClass.
  • PodSecurityPolicy (control mechanisms depend both on container runtime and on node operating system features; Windows and Linux have obviously different security mechanisms)

PodSecurityPolicy is already deprecated.

Author

@sftim The language was cribbed from the earlier KEP proposal, KEP-2136. Will clean up.

Member

I think this problem exists for any aspect of the system where the API and the implementation are not (roughly) lock-step. Ingress yes, but also Gateway. I am sure there are and will be more. We should be looking for a general pattern.

## Proposal

### User Stories

Contributor

I would add a first “story 0”:
as a developer, I want to find out if NetworkPolicy applies at all in my cluster. It's plausible that the API is present but the cluster network pays it no heed at all.

Author

@sftim That is a good question. The objective of this KEP is to unify the community around network policy in a future version, since the features and scope of network policy differ across the various CNI plugins. The implied issue is whether there would be an opt-out, so that "not asserted" would be an option in the API.

Contributor

I would add a first “story 0”:
as a developer, I want to find out if NetworkPolicy applies at all in my cluster.

That's sort of a special case of Story 3; for when the plugin doesn't implement any NetworkPolicy features

Comment on lines +219 to +226
If a user is waiting to see the status of a newly-created NetworkPolicy,
there is no entirely-reliable way to distinguish "the plugin has not yet
set `status` but will soon" from "the plugin doesn't know about `status`
and is never going to set it".

It's not clear how big a problem this is, especially if we suggest that
implementations should create an "empty" `status` right away if it's
going to take them a while to determine the final `status`.
Contributor

Kubernetes has a deprecated ComponentStatus API.

If we brought back something similar but as an official CRD, it could allow CNI plugins to report their status including what policy is supported.

fully in effect.

### Risks and Mitigations

Contributor

Imagine that I perform a rolling replacement of all nodes, intending to apply what I think is a routine OS upgrade. After that upgrade, NetworkPolicy is no longer enforced. What happens to the .status of existing NetworkPolicies?

My thoughts on addressing this: maybe a heartbeat mechanism using Lease, that activates when the first Lease appears in a relevant namespace.

Author

@sftim That is a good question. We intentionally left out enforcing, so status without enforcing does seem strange. Maybe status should be in the future KEP that will deal with enforcing. See here for discussion: network-policy-api-thread

- Define rules for dealing with feature gates and alpha APIs in
NetworkPolicy that work well with the 3-way versioning split.

- Allow network plugins to indicate when a NetworkPolicy has been fully

why not put those 2 items related to status into different KEP? It looks like not related to versioning.

@thockin thockin self-assigned this Nov 9, 2021

<<[/UNRESOLVED]>>
```

Somewhat related to this, there is currently no way for a
Member

Overlap with #2947 ?

Should these be split or merged?

NetworkPolicy that work well with the 3-way versioning split.

- Allow network plugins to indicate when a NetworkPolicy has been fully
"programmed" into the cluster network, to allow clients (including
Member

As before - I am not keen on this idea. Any sort of "all nodes agree" logic is super brittle. All it takes is one node going out to lunch, or worse one node to join the cluster and the consensus is broken.

Contributor

yes, and this was part of the "Status" part of the original KEP, not the "Versioning" part anyway, so it doesn't really belong here

// conforms to the specified version. If it is not specified, the apiserver
// will fill in the correct minVersion based on the features used by the policy.
// +optional
MinVersion NetworkPolicyVersion `json:"minVersion,omitempty" protobuf:"bytes,5,name=minVersion"`
Member

This implies that version is a linear feature-accumulator. I'm not convinced that is 100% accurate.

Suppose we start with version X. Then version Y adds "foo" support. Then version Z adds "bar" support.

If an implementation supports "bar" and not "foo" - is that allowed? At least you can assert that if it knows about "bar" it should be aware of "foo". But are we intending to mandate 100% implementation? How can we enforce that?

An alternative might be a list of feature flags. E.g. "this policy uses: [ FeatureFoo, FeatureBar ]". Then controllers can warn "I support Bar, but not Foo". Of course, we have to get all policy imps to support the features-required field...
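The feature-flags alternative described above could look roughly like this. All names here (`Feature`, `FeatureFoo`, `unsupportedFeatures`) are hypothetical illustrations, not anything from the KEP; the point is only that a controller can diff a policy's declared feature list against its own support set and warn about the gap.

```go
package main

import "fmt"

// Feature is a hypothetical name for a single NetworkPolicy capability.
type Feature string

const (
	FeatureFoo Feature = "FeatureFoo"
	FeatureBar Feature = "FeatureBar"
)

// unsupportedFeatures returns the features a policy declares that this
// implementation does not support, so a controller can warn
// "I support Bar, but not Foo".
func unsupportedFeatures(required []Feature, supported map[Feature]bool) []Feature {
	var missing []Feature
	for _, f := range required {
		if !supported[f] {
			missing = append(missing, f)
		}
	}
	return missing
}

func main() {
	// An implementation that supports "bar" but not "foo".
	supported := map[Feature]bool{FeatureBar: true}
	fmt.Println(unsupportedFeatures([]Feature{FeatureFoo, FeatureBar}, supported))
}
```

Unlike a linear version number, this shape does not force an implementation to accumulate features in any particular order.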

Member

I'd also say this is NOT something a user may set themselves.

Contributor

Suppose we start with version X. Then version Y adds "foo" support. Then version Z adds "bar" support.

If an implementation supports "bar" and not "foo" - is that allowed? At least you can assert that if it knows about "bar" it should be aware of "foo". But are we intending to mandate 100% implementation? How can we enforce that?

The intention in my original KEP is that a network policy implementation has to recognize every feature up to the latest feature that it implements. So an implementation can support "bar" and not "foo", but it would be required to recognize when a policy was using "foo", and fail in an appropriate way. This is discussed more later, in the NetworkPolicyVersion descriptions and the section "The Supported Condition"

When a user creates a NetworkPolicy, and is using a network plugin that
implements this specification:

- If the NetworkPolicy uses API fields which are not known to the
Member

In general this isn't true. kubectl does some validation, but that can be bypassed, and the apiserver drops unknown fields.

Contributor

yeah, this was the big problem that led to the original KEP being abandoned

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 22, 2022
Comment on lines +162 to +163
- Defining "metric-like" NetworkPolicy status information (eg, how long
a particular rule took to implement).
Contributor

again, this is Status, not Versioning

of which plugins (and which versions of which plugins) are expected to
correctly implement which NetworkPolicy features.

#### Story 2 - More Reliable NetworkPolicy Test Cases
Contributor

this is status, not versioning

Comment on lines +194 to +197
#### Story 4 - Reporting NetworkPolicy Version

As a network plugin developer, I want to verify that a CNI implements a specific
NetworkPolicy version.
Contributor

I don't understand what this means. Related: "A CNI" is not a thing. There is only one CNI, and it is this. (However, assuming you meant "a network plugin", the story still does not make sense.)

implementations should create an "empty" `status` right away if it's
going to take them a while to determine the final `status`.

#### Distributed and Delegating NetworkPolicy Implementations
Contributor

Most of this is about status, not versioning. There's still the issue of "who actually updates the status to indicate what was supported", but the answer for the versioning-only case is pretty clearly "some centralized part of the network plugin, even if that component isn't actually involved in enforcing the network policy".


// +optional
// +patchMergeKey=type
// +patchStrategy=merge
Conditions []NetworkPolicyStatusCondition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type" protobuf:"bytes,3,rep,name=conditions"`
Contributor

FTR I used Conditions in the combined Status+Versioning KEP, but it's not necessarily clear that Conditions are right for just Versioning.
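For illustration, the merge-by-type semantics implied by `+patchMergeKey=type` might be exercised as below. The `NetworkPolicyStatusCondition` shape here is a simplified hypothetical stand-in modeled loosely on `metav1.Condition`, not the type from the KEP, and `setCondition` is an invented helper.

```go
package main

import "fmt"

// NetworkPolicyStatusCondition is a simplified stand-in for a
// condition type, modeled loosely on metav1.Condition.
type NetworkPolicyStatusCondition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// setCondition replaces the condition with the same Type, or appends a new
// one, mirroring the merge-by-type patch strategy on the Conditions field.
func setCondition(conds []NetworkPolicyStatusCondition, c NetworkPolicyStatusCondition) []NetworkPolicyStatusCondition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	var conds []NetworkPolicyStatusCondition
	// A plugin first reports the policy as unsupported, then updates the
	// same condition in place once all features are implemented.
	conds = setCondition(conds, NetworkPolicyStatusCondition{Type: "Supported", Status: "False", Reason: "UnrecognizedFeature"})
	conds = setCondition(conds, NetworkPolicyStatusCondition{Type: "Supported", Status: "True", Reason: "AllFeaturesImplemented"})
	fmt.Println(len(conds), conds[0].Status)
}
```

Whether a condition list is the right vehicle for version information alone is exactly the open question in the comment above.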


#### Other Conditions

```
<<[UNRESOLVED matching-conditions ]>>
Contributor

ignoring the question of whether this is a good idea, it feels more like Status than Versioning

@danwinship
Contributor

Not sure why this is claiming to be "KEP-2732", as #2732 is something totally unrelated.

@danwinship
Contributor

#2136 was the original "NetworkPolicy versioning and status" enhancement so maybe this could take over that number, since the original PR for that was abandoned. But then, it's not clear if this is still in progress either...

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 11, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
