
user-defined-network-segmentation: Add CRDs for managing networks #1638

Merged 1 commit into openshift:master on Jul 26, 2024

Conversation

@ormergi (Contributor) commented Jun 9, 2024

Following on #1623, this PR updates the user-defined-network-segmentation enhancement and adds content related to CRDs for managing OVN networks.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2024
openshift-ci bot commented Jun 9, 2024

Hi @ormergi. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 9, 2024
@ormergi (Contributor, Author) commented Jun 9, 2024

/cc @tssurya @trozet

@openshift-ci openshift-ci bot requested review from trozet and tssurya June 9, 2024 17:49
@ormergi ormergi marked this pull request as ready for review June 9, 2024 17:51
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2024
@openshift-ci openshift-ci bot requested review from abhat and dougbtv June 9, 2024 17:51
@ormergi ormergi force-pushed the ovn-k-network-api branch 2 times, most recently from 0ab6aee to 2ca9dea Compare June 9, 2024 18:00
@tssurya (Contributor) commented Jun 9, 2024

cc @kyrtapz assigned SDN reviewer
@danwinship @dougbtv : PTAL since you both have lots of experience around potential k8s ideas for MultiNetworking APIs

@ormergi (Contributor, Author) commented Jun 10, 2024

/cc @maiqueb

enhancements/network/user-defined-network-segmentation.md (outdated review threads, resolved)
| Topology | The topological configuration for the network. Must be one of `layer2` or `layer3`. | No |
| Role | Select the network role in the pod, either `primary` or `secondary`. | No |
| MTU | The maximum transmission unit (MTU).<br/>The default value, 1300, is automatically set by the kernel | Yes |
| Subnets | The subnet to use for the network across the cluster.<br/>E.g. 10.100.200.0/24.<br/>IPv6 (2001:DBB::/64) and dual-stack (192.168.100.0/24,2001:DBB::/64) subnets are supported.<br/>When omitted, the logical switch implementing the network only provides layer 2 communication, and users must configure IP addresses.<br/>Port security only prevents MAC spoofing if the subnets are omitted. | Yes |
Contributor:

Do we also want to expose configuring the join subnet via this API as well as the pod subnet? @tssurya FYI

Contributor Author:

I think the CR should expose a minimal set of options. Although it sounds mandatory, do you think we can avoid exposing the join subnet?

Could there be a way for the backing CRDs controller to realize what the join subnet should be?

Contributor:

OH! Having the join subnet via the CR would be better; it definitely belongs here, especially if the pod subnet is set here.

Perhaps then for the NAD config approach we can do the same and add joinSubnet as a config field to the NAD? That will save us from having to do a new API change in ocp/api for join subnets...

great idea tim..

Contributor Author:

Summary of offline discussion about the join subnet: for primary networks the join subnet can be an optional field; in case no join subnet is set, a default one should be set by the controller.
Advanced users can choose the CIDR to use for the join subnet by specifying the field.

Contributor Author:

Added text about the join subnet in regard to primary networks.

Are join subnets mandatory for secondary networks as well?
If so, does it make sense to use a similar strategy for secondary networks,
where the controller picks an arbitrary join subnet in case it's not specified?

Contributor:

A few things we need to make clear about join subnet in the enhancement:

  1. Join subnet is not applicable to secondary networks. Join subnet is used for north/south traffic, and secondary networks will not support north/south at this time. We don't need to stop a user from configuring the join subnet for a secondary network CR, but it won't do anything.
  2. The join subnet specified in the CR cannot overlap with the cluster default network join subnet. This subnet is configured in OVN-Kubernetes via: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/config/config.go#L441
    with defaults here: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/config/config.go#L138
    You can simply add to this enhancement:
    "The join subnet shall not overlap with the configured cluster default network join subnet. If there is overlap, OVN-Kubernetes will report an error status to the network CR"
  3. Join subnet specified in the CR may overlap with other UDNs. There is no conflict there. According to Surya's PR, the default subnet chosen for a UDN will be a constant declared at:
    ovn-org/ovn-kubernetes@05e1305#diff-ee7e8cb2c18f89a20eaaf4c10d14e404559cc7c7dbaabe1801566ba57e2b5ed7R113
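To make this discussion concrete, here is a hedged sketch of how an optional join subnet might look on a primary network CR. The `joinSubnets` field name is an assumption for illustration only; the CR kind and other fields follow the examples later in this conversation:

```yaml
# Illustrative sketch only: the joinSubnets field name is an assumption.
kind: OVNNetwork
metadata:
  name: db-network
  namespace: demo
spec:
  topology: layer3
  role: primary
  subnets: ["10.0.0.0/24"]
  # Optional. When omitted, the controller picks a default join subnet;
  # whatever is used must not overlap the cluster default network join
  # subnet (100.64.0.0/16 by default in OVN-Kubernetes).
  joinSubnets: ["100.65.0.0/24"]
```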

Contributor Author:

Thanks for clarifying, added text about join subnet validation.


##### status
The CRD status should reflect the NAD state through conditions.
For example, when the NAD is created the condition should be set with status `True`.
Contributor:

Should we say when the NAD is created and OVN-Kubernetes has programmed this network into OVN, OVN-Kubernetes will update the status?

Contributor Author:

Since the suggested solution utilizes the CNI model, the underlying OVN network is set up by the ovn-k CNI plugin when a pod is created, and deleted when the pod is deleted.
With the suggested solution, the status is actually updated when the NAD is created; the underlying OVN network is created later on, once the consumer pod is created.

Contributor:

Since the suggested solution utilize the CNI model where the underlying ovn network is set by ovn-k CNI plugin when pod is created, and deleted when pod is deleted. With the suggested solution, the status is actually updated when the NAD is created,

so what does a "true" status indicate? and when would the status be false?

the underlying ovn network is created later on, once the consumer pod is created.

Contributor Author:

The suggested status condition indicates whether a corresponding NAD has been created in the target namespace.

Contributor:

I would advocate for 2 conditions then - NetworkReady (is a NAD available), and NetworkCreated (is the network actually created with matching / valid configurations and in use).

The names can surely be improved.

Contributor Author:

Regarding the network actually created part, do you mean the NAD object in the cluster?
If so, a single condition could reflect this state.

Contributor:

I think he means NetworkCreated == the network topology has been created in OVN. NetworkReady == the NAD has been created.

I think this is a good idea.

Contributor Author:

The suggested solution utilizes a CNI plugin and NADs, so the NetworkCreated condition ends up exclusively related to whether there is at least one running pod that uses the NAD
(as CNI ADD establishes the network, CNI DEL removes it).

Given this architecture, what value do you see it'll give a user who requested a network using the CRDs to know the OVN topology is created?
Wouldn't NetworkReady be sufficient?

Since the suggested controller should be part of the ovnkube-control-plane pod, and, IIUC, only ovnkube-node pods can be aware of OVN network topology changes:

Is ovnkube-CP aware of such low-level details so it can report them in the CRD status?

If not, from what I understand I can think of these two options:

  1. ovnkube-CP should subscribe to ovnkube-nodes to have an indication about relevant OVN network-topology changes.
  2. ovnkube-node watches the CRDs, monitors the CRDs' related OVN network topologies, and updates the CRD status accordingly.

For both options, ovnkube-node should change to actively monitor OVN network topologies (if it doesn't already).
For option (1) it should expose some kind of interface for notifying subscribers about OVN network topology changes,
and have ovnkube-CP act on relevant OVN network topology changes.
For option (2) ovnkube-node would need to act on relevant OVN network topology changes and update the relevant CRD status "NetworkCreated" condition.

Please note that in both options, we could end up having N (ovnkube pods) entities updating the CRDs' status.
It could run into conflicts on status updates often, and quickly become racy.

(It's possible that I got everything wrong and it's a super simple thing to have, please correct me if I am wrong)

This topic may require substantial changes; I think it should be discussed thoroughly and settled in a follow-up proposal, instead of adding more to this one.
Makes sense?

Contributor Author:

Following offline discussion, it was concluded that the condition should reflect the underlying OVN network state; reflecting the NAD state is not sufficient.

Thanks for the discussion and explaining 🙂
Added text to describe that the status should reflect the underlying OVN network state, and updated the examples.
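For illustration, a minimal sketch of what a condition reflecting the underlying OVN network state might look like; the type, reason and message values here are made up for the example, not taken from the enhancement:

```yaml
status:
  conditions:
    - type: NetworkReady
      status: "True"
      reason: NetworkProvisioned          # illustrative reason
      message: "NAD created and OVN network configured in namespace demo"
      lastTransitionTime: "2024-07-01T00:00:00Z"
```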


Creating a cluster scoped CRD instance should trigger creation of the corresponding NAD in each namespace selected by `spec.selector`.
Contributor:

nice examples. What happens if:

  1. a user has already created a namespaced scope CR for "theirnamespace"
  2. admin creates the below CR which selects "theirnamespace"

Should OVNk refuse to apply the cluster scoped CR? Also, wondering if we should have a webhook to validate these things at API creation time.

Contributor Author:

For the namespaced CR there would be a NAD with the generated network-name 8301b3c4-ff07-4c6d-9786-232e02b87794 (line 520),
while for the cluster scoped CR there would be a NAD with a different generated network-name ef1df435-ef21-4850-a4a2-5fd8a8a8f991 (line 549).
Basically, these are two different underlying OVN networks, so the cluster scoped CR should be valid.

Regarding the webhook, the suggested solution utilizes the CNI model (i.e.: the underlying OVN network is created and wired to the pod by the CNI) and the eventual consistency principle (i.e.: the desired state should eventually be applied by the reconcile).
IIUC, such validation would require querying the nodes' OVN components to verify the underlying OVN network state, right?
If so, the subject component is basically an external component the webhook would have to interact with.
Note that having a webhook that depends on external component feedback could introduce blockage in a scenario where interacting with the external component fails or hangs, causing requests to hang and pile up.


The CRD spec is defined as follows:

| Field name | Description | optional |
Contributor:

Take a look at the other enhancement we have for 4.17:
https://github.com/openshift/enhancements/pull/1636/files#diff-5261006a76db7654704ca30952f26ca2debd4d860d829770a3de6fed26a30678R276

It will also need another field here for "Transport" to decide if it should be Geneve or none.

Contributor Author:

I wasn't aware of the BGP enhancement, and that the CRD is expected to account for it; I will review the enhancement to have a better understanding of the expectations from the suggested CRD.

Eventually the suggested CRD spec translates to a NAD pointing to the ovn-k8s-cni-overlay plugin; does the CNI have such an option?
Is it in scope of the BGP enhancement to extend the CNI to expose such a field?

Contributor:

yeah it's in scope of the BGP enhancement to expose the new field. We can take care of adding the new field in that PR as this one will most likely merge before the BGP one does.

@ormergi (Contributor, Author) commented Jun 19, 2024

Changes:

  • Fixed MTU description
  • Rephrased CRDs finalizer explanation

@@ -67,6 +68,7 @@ tenant to isolate traffic.
are natively isolated from other tenants.
* As a user, I want to be able to request new private networks for my namespace, without having to get administrator
permission.
* As a user, I want to be able to request secondary networks for my namespace, without having to get administrator permission.
Contributor:

for my own understanding, I also see @trozet having written something similar in CRDs for managing networks section (2. It requires cluster admin RBAC in order to create the NAD.)

NAD is just a CRD right? depending on what RBAC is given by cluster admin in OCP this can either be created by an end user or by the admin right?

What am I missing as to how the NAD approach itself restricts end users from creating networks? (it seems like a devops config issue rather than a NAD issue itself; meaning do I really need a CRD specifically for this case?)

Contributor:

If you give a tenant the ability to modify NADs, they could potentially modify someone else's NAD for another network/namespace. How do we resolve this without a namespace scoped CRD? Maybe a webhook?

Contributor Author:

@tssurya the suggested namespaced CRD enables non-admin users to create a network without the risk of granting such users permission to create a plain NAD with no restriction on its spec.

Having such a CRD, which can be granted to non-admin users, answers the user stories that describe a non-admin user being able to create networks (primary and secondary) with no admin intervention.

enhancements/network/user-defined-network-segmentation.md (outdated review thread, resolved)
#### CRDs

Following the [CRDs for Managing Networks](#crds-for-managing-networks), two CRDs shall be introduced:
- Namespace scoped CRD - represents a user request for creating a namespace scoped OVN network.
Contributor:

why do we need a namespace scoped CRD? all we need is a cluster scoped CRD because this request of connecting only a single namespace to a network can be expressed via that. I think it's just namespaceSelector at the end of the day, I'd rather just have 1 CRD unless I am missing some reasons behind this

Contributor Author:

The cluster scoped CRD is targeted at cluster-admins only, whereas the namespace scoped CRD is targeted at admin and non-admin users.
Having both CRDs, together with the RBAC mechanism, cluster admins could grant non-admin users permissions for the namespace-scoped CRD; with that, those users can create an OVN network in namespaces they are permitted to, without an admin.
Meanwhile, only cluster-admins would have permissions for the cluster-scoped CRD, to create cross-namespace networks.

Contributor:

Isn't an admin able to do the same with NetworkAttachmentDefinitions? Like allowing certain users to create them in certain namespaces.

Contributor Author:

They can, but the problem is that a non-admin user could write anything in the net conf and has the potential to destabilize the cluster nodes or break the cluster network.
Having the namespace scoped CRD enables control over the NAD net conf, at least for OVN networks.

| Field name | Description | optional |
|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|
| Topology | The topological configuration for the network. Must be one of `layer2` or `layer3`. | No |
| Role | Select the network role in the pod, either `primary` or `secondary`. | No |
Contributor:

I see some other things like vlanID etc that NAD config allows for, shouldn't we add those too?

Contributor:

We are missing:

  • vlan (that's just for role=secondary)
  • allowPersistentIPs (just for layer2)

Contributor Author:

There are two options: expose all the settings the NAD does, or expose a minimal set of options providing an abstraction on top of the ovn-k8s-cni-overlay CNI plugin settings.

From what I understand, the vlanID is specifically related to the localnet topology, right?
If so, and if there is no strict requirement to have it in phase one, can we support it in a follow-up phase?
Same goes for allowPersistentIPs; can it be done in a follow-up phase?

Contributor:

@ormergi allowPersistentIPs is important for the KubeVirt scenario to support restart and live migration, I think we should add it to this enhancement.



| Field name | Description | optional |
|------------|-------------------------------------------------------------------------------------------------------|----------|
| Selector | Key-value list of labels used as a selector for which namespace the network should be available for. | No |
Contributor:

exactly why I vote to not have a second CR that is namespace scoped, doesn't add any value AFAICT

excludeSubnets: ["10.0.0.100"]
status:
conditions:
- type: "NetworkAttachmentDefinitionReady"
Contributor:

nit: "NetworkReady"; let's not mix NAD here.

Contributor Author:

Done

Contributor:

I still see this being referred to as NetworkAttachmentDefinitionReady. I agree w/ @tssurya the NAD detail should not leak here.

Contributor Author:

Fixed.



- type: "NetworkAttachmentDefinitionReady"
status: "False"
reason: "NetworkAttachmentDefinitionNotExist"
message: "Not all desired namespaces provisioned with NetworkAttachmentDefinition: ['theirnamespace', 'namespace not exist']"
Contributor:

that sounds wrong to me, that a cluster-scoped CR depends on namespaces being created beforehand?
I think a CR asking for a network between namespaces a,b,c,d should be a legit request even if the namespaces don't exist.
If they exist then yes, we must do the necessary validation;
if they don't exist yet, then when they are created we must do the necessary plumbing. That's the way I see it.

Contributor Author:

The backing controller should reconcile the request in a way that, once the missing namespace is created, it will provision it with a corresponding NAD.

Following the mentioned example: the CR specifies namespaces A, B and C, but only namespaces A and B exist.
The controller will provision namespaces A and B with the corresponding NAD, and reflect in the status that the desired spec is not fulfilled yet due to the missing namespace C.
Once namespace C is created, the controller will provision it with the corresponding NAD.
WDYT?

Contributor:

While the reconciliation path is clearly the way forward, I do not fully understand how / why this would affect the network.

Taking into account your example @ormergi - let's say I provision workloads on namespaces A and B. What would happen to them ? AFAIU, they will be scheduled and allowed to start (since their respective NADs are present), and still they're "connected" to a network that is not ready. It's weird.

Still, I'd say it would be a lot weirder / hard to reason about if they were prevented from starting because a namespace that has nothing to do with them does not exist.

Trying to state this availability concept at cluster-scoped level is a bit weird, and easy to get wrong.

Contributor Author:

Note that the other way around, where the network is available in namespaces "A" and "B" and the status says the network is ready/available, is also weird, since the network is not ready in some namespaces - "C" 🙂

The intention is to have the CRDs represent a user request for creating a network (in a namespace, or shared across a few); if you request a shared network with multiple namespaces, then the desired state is having the network available in all namespaces.
If it's not available in all namespaces, the request has not been fulfilled yet, and having the CRD status say the network is not ready/available is how the controller communicates this intermediate state.

Contributor:

I think that:

  1. if the namespaces do not exist, that should not stop the CR from reporting NetworkReady: true
  2. If the namespaces exist, but their NAD has not/cannot be created, NetworkReady: false should be reported

I don't think a reason should be "namespace does not exist". A selector is just selecting. A namespace that does not exist cannot be selected, which means it does not currently apply to the Network CR and cannot force a failure status.

Contributor:

I share @trozet 's opinion.

Contributor Author:

What if we have a request that specifies namespaces A, B and C, but none of them exists, should it report NetworkReady=true?

About using selectors, the more I think about it the more I realize how it complicates things, and I see less value in it. Maybe using plain namespace names is better.

Contributor Author:

Following offline discussion, it was concluded that in case all, or at least one, of the namespaces does not exist, the condition should say the network is ready.

Thanks for the discussion and explaining 🙂
The mentioned example has been fixed.

@@ -391,6 +488,98 @@ guarantee uniqueness and eliminates the ability to falsify access.
After creating the CRD, check the status to ensure OVN-Kubernetes has accepted this network to serve the namespaces
selected. Now tenants may go ahead and be provisioned their namespace.

#### NetworkAttachmentDefinition rendering

The NAD should be created with an owner-reference to the CRD; using the owner-reference mechanism should prevent deletion of the NAD before the corresponding CRD instance is deleted.
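As a sketch of the owner-reference wiring described above; the owning group/version and UID below are placeholders, and the kind follows the CR examples elsewhere in this conversation:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: db-network
  namespace: demo
  ownerReferences:
    - apiVersion: k8s.ovn.org/v1   # placeholder group/version
      kind: OVNNetwork             # kind taken from the CR examples in this conversation
      name: db-network
      uid: <owner CR uid>          # placeholder
      controller: true
      blockOwnerDeletion: true
spec:
  config: |
    # rendered CNI config, as in the examples elsewhere in this PR
```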
Contributor:

Here you mean the controller will create the NAD with an owner reference, right? Could you please clarify the bit that the controller will watch for CRs and then create the corresponding NADs in each namespace?

Contributor Author:

Added text for describing the controller business logic.

@ormergi (Contributor, Author) commented Jun 27, 2024

Rebased on the latest version of #1623

@trozet (Contributor) commented Jun 27, 2024

@ormergi @tssurya I'll try to summarize our slack discussion here and we can discuss further:

  1. We agree that we either need 2 CRDs (one namespace scoped and one cluster scoped) or we need a single CRD with a webhook admission controller. For now it seems like the simpler path is 2 CRDs.
  2. Having the global CR creating namespaced scope CRs seems vulnerable to bugs. It might make more sense to have them be independent of each other and let the OVNK controller create NADs from them.
  3. We want a user to be able to specify their network name, but we don't want them to be able to guess the name of another network and attach to it when they shouldn't be allowed. We initially discussed using CEL to do this, but with point number 2 I don't think it is necessary anymore. For namespace scoped CRDs we can say the name of the network attachment definition->name will always be namespace-name. Global CRDs will have no namespace prefix. So to put it into an example:

A namespace scoped CR of:

kind: OVNNetwork
metadata:
  name: db-network
  namespace: demo
  finalizers:
  - connected-network.k8s.ovn.org
spec:
  topology: layer3
  role: primary
  mtu: 9000
  subnets: ["10.0.0.0/24"]
  excludeSubnets: ["10.0.0.100"]

would result in this NAD:

kind: NetworkAttachmentDefinition
metadata:
  name: db-network
  namespace: demo
spec:
  config: |2
    {
            "cniVersion": "0.3.1",
            "name": "demo-db-network",
            "type": "ovn-k8s-cni-overlay",
            "topology":"layer3",
            "subnets": "10.0.0.0/24",
            "mtu": 9000,
            "netAttachDefName": "demo/db-network",
            "primaryNetwork": true

Meanwhile a cluster scoped CR of:

kind: ClusterOVNNetwork
metadata:
  name: db-network
  finalizers:
  - connected-network.k8s.ovn.org
spec:
  topology: layer2
  role: primary
  selector:
  - kubernetes.io/metadata.name: "demo"
  mtu: 9000
  subnets: ["10.0.0.0/24"]

Would result in the following NAD:

kind: NetworkAttachmentDefinition
metadata:
  name: db-network
  namespace: demo
spec:
  config: |2
    {
            "cniVersion": "0.3.1",
            "name": "db-network",
            "type": "ovn-k8s-cni-overlay",
            "topology":"layer3",
            "subnets": "10.0.0.0/24",
            "mtu": 9000,
            "netAttachDefName": "demo/db-network",
            "primaryNetwork": true

With this design, a user cannot connect to another namespace's network, even if they know the name of it. They also cannot connect to a global CR network, unless the admin is dumb enough to prefix the global network name with a namespace (not our problem). Thoughts?

The main API extension here will be a namespace scoped network CRD as well as a cluster scoped network CRD.
These CRDs will be registered by Cluster Network Operator (CNO).

#### CRDs
Member:

Is it fair to say that these new CRDs are meant to be exposed directly to OpenShift users, as the main way to define their primary networks? Or are we planning to wrap it in another CRD?

If the former, I think it's critical to have the OpenShift API team review and accept the CRD API. Otherwise we risk that we will not be allowed to use these new APIs as a core part of OpenShift.

I'm saying this because we are now facing a similar issue with the kubernetes-nmstate API.

Contributor:

directly exposed to OpenShift users, but they exist as part of the OVN-Kubernetes API and not the OpenShift API. Therefore, there won't be any OpenShift API change for this. The CRDs for downstream are stored in the CNO repo and registered by CNO. That being said, we still should have the API team give us their feedback. We plan to do this after we internally agree on what we think the API should look like, then we can ask for their feedback.

Member:

Understood, thanks

| Role | Select the network role in the pod, either `primary` or `secondary`. | No |
| MTU | The maximum transmission unit (MTU).<br/>The default is 1400. | Yes |
| Subnets | The subnet to use for the network across the cluster.<br/>E.g. 10.100.200.0/24.<br/>IPv6 (2001:DBB::/64) and dual-stack (192.168.100.0/24,2001:DBB::/64) subnets are supported.<br/>When omitted, the logical switch implementing the network only provides layer 2 communication, and users must configure IP addresses.<br/>Port security only prevents MAC spoofing if the subnets are omitted. | Yes |
| ExcludeSubnets | List of CIDRs and IP addresses.<br/>IP addresses are removed from the assignable IP address pool and are never passed to the pods. | Yes |
Contributor:

Do we have any other API fields where we allow both CIDRs and IP addresses? Normally in that case we only allow CIDRs, and you have to specify /32 or /128 CIDRs if you want to include individual IP addresses.

Contributor:

+1 for consistency. I think we should limit it to CIDRs.

Contributor Author:

Done
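To illustrate the CIDR-only convention settled on here, a small sketch with illustrative values:

```yaml
spec:
  subnets: ["10.100.200.0/24"]
  # Individual addresses are expressed as /32 (or /128 for IPv6) CIDRs.
  excludeSubnets: ["10.100.200.100/32"]
```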


| Field name | Description | optional |
|------------|-------------------------------------------------------------------------------------------------------|----------|
| Selector | Key-value list of labels used as a selector for which namespace the network should be available for. | No |
Contributor:

This should be a standard metav1.LabelSelector, for better compatibility with tooling, etc

Contributor Author:

I meant the native one as you mention, sorry for the confusion as it's not mentioned in the doc.

Now that I think about it again, I think it's better to have it simplified and use namespace names instead of a selector, WDYT? @tssurya @trozet

Contributor:

I think namespaceSelector is better, allowing users to select using label selectors and match expressions, thus following the pattern of normal k8s standards in an API.
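To make that concrete, a sketch of a standard label selector on the cluster-scoped CR spec; the `namespaceSelector` field name follows the later discussion in this PR, and the labels and values are illustrative:

```yaml
spec:
  namespaceSelector:
    matchLabels:
      team: database
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values: ["demo", "demo-staging"]
```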

Comment on lines 448 to 459
topology: layer2
role: primary
Contributor:

"enum" values in k8s APIs are normally capitalized. (Layer2, Primary)

Contributor Author:

Thanks, fixed.

@ormergi (Contributor, Author) commented Jun 30, 2024

Rebased

// and users must configure IP addresses for the pods.
// Port security only prevents MAC spoofing.
// +optional
Subnets []string `json:"subnets,omitempty"`
Member:

I think subnets have the same format as NAD subnets right? If so, should we explain here an L3 format of IP/clusterSubnet/hostSubnet? Also do we want to require hostSubnet for L3 or leave it to default as in NAD?

Member:

also do we want to limit the number of allowed subnets?

Contributor Author:

The intention was to streamline the CNI subnet format by using a Go native string slice.
Instead of specifying "10.0.0.0/16/24", the spec should specify ["10.0.0.0/16","10.0.0.0/24"].
The spec should translate to a comma-separated subnets list "10.0.0.0/16,10.0.0.0/24".
WDYT?

Member:

I completely missed where this enhancement says anything about the difference in subnet formats for NAD vs UDN and how it should be converted.
Also, won't NAD treat "10.0.0.0/16,10.0.0.0/24" as 2 separate networks?

Contributor:

is this just a bad example? those 2 separate networks actually overlap and should be invalid.

Contributor:

I modified it slightly:

// The subnet to use for the pod network across the cluster.
// Dualstack clusters may set 2 subnets (one for each ip family), otherwise only 1 subnet is allowed.
// When topology is `Layer3`, the given subnet is split into smaller subnets for every node.
// To specify how the subnet should be split, the following format is supported for `Layer3` network:
// `10.128.0.0/16/24`, which means that every host will get a `/24` subnet.
// If host subnet is not set (for example, `10.128.0.0/16`), it will be assigned automatically.
// This field may be omitted for `Layer2` and `Localnet` topologies, in that case
// the logical switch implementing the network only provides layer 2 communication,
// and users must configure IP addresses for the pods. Port security only prevents MAC spoofing

I noticed the above doesn't mention the behavior if the host mask is provided for a layer 2 network. What should be the behavior?

Contributor Author:

Done

Member:

I noticed the above doesn't mention the behavior if the host mask is provided for a layer 2 network. What should be the behavior?

It should throw an error. I hoped from the comment it is obvious that the host subnet is only supported for Layer3; do you think we can make it more obvious?

Contributor:

Maybe just spell out the format for layer 2 and localnet. Like this:

// The subnet to use for the pod network across the cluster.
// Dualstack clusters may set 2 subnets (one for each ip family), otherwise only 1 subnet is allowed.
// When topology is `Layer3`, the given subnet is split into smaller subnets for every node.
// To specify how the subnet should be split, the following format is supported for `Layer3` network:
// `10.128.0.0/16/24`, which means that every host will get a `/24` subnet.
// If host subnet mask is not set (for example, `10.128.0.0/16`), it will be assigned automatically.
// For `Layer2` and `Localnet` topology types, the format should match standard CIDR notation, without
// providing any host subnet mask.
// This field may be omitted for `Layer2` and `Localnet` topologies. In that case
// the logical switch implementing the network only provides layer 2 communication,
// and users must configure IP addresses for the pods. Port security only prevents MAC spoofing
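A hedged sketch of how the two subnet formats described in the comment above might appear in CR specs; the values are illustrative:

```yaml
# Layer3: cluster subnet with an optional per-node host subnet mask.
spec:
  topology: Layer3
  subnets: ["10.128.0.0/16/24"]   # every node gets a /24 carved from 10.128.0.0/16
---
# Layer2 / Localnet: plain CIDR notation, no host subnet mask.
spec:
  topology: Layer2
  subnets: ["10.0.0.0/24"]
```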

Contributor Author:

Done

@ormergi (Contributor, Author) left a comment:

Changes

  • Resolve review comments

In addition:

  • CRD Definitions
    • Cluster-scope CRD status - added field to reflect on which namespace network is available - as requested in previous review cycle.
    • Added suggestion for CRDs short-name
  • General flow section
    • Added text to describe what happens when a foreign NAD exists at the target namespace.
    • Changed text for malformed NAD scenario -> should be reconciled to desired state instead of raising an error.

@trozet thank you so much for the feedback, please have another look 🙂

enhancements/network/user-defined-network-segmentation.md (outdated review threads, resolved)
- Validates the CRD spec.
- Generates a NAD manifest from the CRD spec.
- Checks that the desired NAD does not already exist. If not, creates the NAD and returns.
- Otherwise, verifies the existing NAD corresponds to the desired spec; if so, returns.
Contributor Author (@ormergi, Jul 25, 2024):

The controller should add an owner-reference to NADs it creates, so when the CR is deleted and all finalizers are gone, the managed NAD will be disposed.

There is text describing that the NAD is created with an owner-reference and finalizer in the Controller section here.

Please let me know if it makes sense 🙂

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
name: shared-db-network <--- name starts with "shared"
Contributor Author:

Yes, added a text example for generating the NAD name of a cluster-scoped CR.

Please let me know if it makes sense 🙂


##### Validations
Contributor Author (@ormergi, Jul 25, 2024):

Added text to this section to have the controller cover it as well.

There are also CEL validation suggestions on the CRDs to enforce that Subnets are specified in case of Layer3 topology, in the CRD Definitions section here.

Please let me know if it makes sense and whether additional validations are required for the Subnets field 🙂

@ormergi ormergi force-pushed the ovn-k-network-api branch 2 times, most recently from 66502de to 7860ae5 Compare July 25, 2024 09:43
@ormergi (Contributor, Author) commented Jul 25, 2024

Changes: resolve comment #1638 (comment)

@danwinship (Contributor):
/retitle user-defined-network-segmentation: Add CRDs for managing networks
(sorry, the typo kept annoying me)

@openshift-ci openshift-ci bot changed the title user-define-network-segemetation: Add CRDs for managing networks user-defined-network-segmentation: Add CRDs for managing networks Jul 25, 2024
@ormergi (Contributor, Author) commented Jul 25, 2024

Changes: resolve #1638 (comment)

@ormergi ormergi force-pushed the ovn-k-network-api branch 2 times, most recently from 2acbace to 790ad80 Compare July 26, 2024 13:59
Add content related to CRDs for managing OVN networks.

Signed-off-by: Or Mergi <ormergi@redhat.com>
openshift-ci bot commented Jul 26, 2024

@ormergi: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@trozet (Contributor) left a comment:

/approve

openshift-ci bot commented Jul 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: qinqon, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 26, 2024
@trozet (Contributor) commented Jul 26, 2024

thanks for all of your patience and hard work on this @ormergi !

@npinaeva (Member):
/lgtm
thanks @ormergi !

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 26, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit e97e677 into openshift:master Jul 26, 2024
2 checks passed
@JoelSpeed (Contributor):
Since this enhancement is pretty much exclusively about API extensions, it really should have had an API reviewer glance over it before it merged. If I leave feedback here, can we create a follow up to resolve any issues?

// The maximum transmission unit (MTU).
// The default value is 1400.
// +optional
Mtu uint `json:"mtu,omitempty"`
Contributor:

Use either int32 or int64, and set a minimum and maximum value. Any validations you add should be explained within the godoc for the field
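A sketch of how that suggestion might surface in the generated CRD schema; the bounds are illustrative and would need to be agreed on, they are not values from this enhancement:

```yaml
mtu:
  type: integer
  format: int32
  minimum: 576      # illustrative lower bound
  maximum: 65536    # illustrative upper bound
```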

// and users must configure IP addresses for the pods.
// Port security only prevents MAC spoofing.
// +optional
Subnets []string `json:"subnets,omitempty"`
Contributor:

CEL can be used to validate subnets now, you should use this.

That said, what versions of Kube/OCP is this extension expected to work against?

Member:

we are updating to 1.30 now, so either 1.29 or 1.30 would be nice. Is there a way to enable isCIDR for these earlier versions?

Contributor:

No, it works in OCP 4.16 and up, but for upstream Kube, it's 1.31 and up

Contributor Author:

Validation rules for CIDR were added to the CRD definition on D/S.
Once we have 1.31 on U/S we will update the U/S CRD as well, tracking issue:
ovn-org/ovn-kubernetes#4607
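For reference, a sketch of what a CEL-based subnet validation might look like in the CRD schema, assuming the CEL CIDR library discussed above is available; the rule and message are illustrative, and the Layer3 `/16/24` host-subnet form would need extra handling beyond a plain CIDR check:

```yaml
subnets:
  type: array
  items:
    type: string
  x-kubernetes-validations:
    - rule: "self.all(s, isCIDR(s))"
      message: "each entry in subnets must be a valid CIDR"
```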


// ClusterUserDefinedNetwork defines the desired state of ClusterUserDefinedNetwork.
type ClusterUserDefinedNetwork struct {
// NamespaceSelector Label selector for which namespace network should be available for.
Contributor:

Is this expected to allow multiple namespaces?

Contributor Author:

Yes, the backing controller should realize the desired namespace names from the selector.

@@ -75,6 +76,8 @@ tenant to isolate traffic.
are natively isolated from other tenants.
* As a user, I want to be able to request a unique, primary network for my namespace without having to get administrator
Contributor:

How was this user story solved and why does it not extend to secondary networks?

Contributor Author:

The suggested CRDs represent a request for creating a user-defined network according to spec; they enable users to create primary networks.
The namespace scoped CRD represents a network request for a specific namespace; it can be used by cluster-admin and non cluster-admin users (such as project admins, according to RBAC).
The cluster scoped CRD represents a cross-namespace network, making the network available for workloads that reside in namespaces matching the specified namespace label-selector.

This user story doesn't cover the user-defined secondary network (not primary), therefore I added the following user story (line 79).

@@ -75,6 +76,8 @@ tenant to isolate traffic.
are natively isolated from other tenants.
* As a user, I want to be able to request a unique, primary network for my namespace without having to get administrator
permission.
* As a user, I want to be able to request new secondary networks for my namespace, without having to get administrator
Contributor:

Why would a user want secondary networks? It's not clear from the user story what they would use a secondary network for, an example of this would be helpful

Contributor Author:

Creating NetworkAttachmentDefinitions (NADs) is possible for cluster-admins only, due to the possibility of a non-admin user creating a NAD that could break the cluster network (like creating something on br-ex or an underlying bond), or worse, cause a disruption for cluster nodes: since there is no restriction on the NAD spec, one could specify a bad spec that exceeds the node resources.

Support migrating from OpenStack to Kubernetes, where tenants (project admins) should be able to create tenant networks in the namespaces they manage.
OVN-K enables creating such networks, but since it requires creating a NAD, only a cluster admin can create them.

Enable a sysadmin of one org to manage their project in OpenShift and manage the network for the VMs they manage.
Similarly, a sysadmin of another org should be able to do the same for their own project (the namespace they manage).

In general, secondary networks can be used, among other things, to segregate networks and manage which workloads can communicate.
For example:
I have a legacy app that runs inside a VM (OCP Virt); it should be accessible by a subset of workloads (pods).
I can achieve that by creating an OVN overlay network and connecting the legacy app and the permitted workloads to this network.

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
name: cluster-db-network <--- name starts with "cluster"
Contributor:

What if the name of the cluster scoped UDN is long enough that you can't extend it without overflowing the character limit of this field?

Contributor Author:

If NAD creation fails due to a too-long name, the cluster UDN status will reflect that the NAD creation in namespace "X" failed due to the name length.

"cniVersion":"0.3.1",
"excludeSubnets":"10.0.0.100/24",
"mtu":1500,
"name":"demo.network", <--- generated unique network-name
Contributor:

Does OVN add any restrictions to this field or is it genuinely free form text? Any chars that aren't allowed? Any length limits?

Contributor Author:

@trozet @tssurya hey could you please advise on OVN internals? 🙏


##### Validations
Contributor:

CEL now has IPs as a native library; I'd encourage using that over implementing regexes for validation


In addition, the following scenarios should be validated:
- The join subnet shall not overlap with the configured cluster default network join subnet.
If there is overlap, OVN-Kubernetes should report an error in the request CR status.
Contributor:

Does this cluster default exist in the API somewhere? If so, you could create a ValidatingAdmissionPolicy that takes information from the existing source and validates your new CRs against that value

Member:

as it is an ovn-kubernetes and not openshift API, we don't have this info

@@ -438,6 +942,14 @@ The biggest risk with this feature is hitting scale limitations. With many names
internal OVN objects will multiply, as well as internal kernel devices, rules, VRFs. There will need to be a large-scale
effort to determine how many networks we can comfortably support.

Following the introduction of a CRD that lets non-admin users create OVN networks, there is a risk that non-admin users could
cause node / OVN resource starvation by creating too many OVN networks.
Contributor:

Do we have an understanding of those limits? That would be good to have documented

@ormergi (Contributor, Author) left a comment:

@JoelSpeed thanks for feedback!
We will address review comments in a follow up PR.
Please see my inline comments.

@@ -75,6 +76,8 @@ tenant to isolate traffic.
are natively isolated from other tenants.
* As a user, I want to be able to request a unique, primary network for my namespace without having to get administrator
Contributor Author:

The suggested CRDs represent a request to create a user-defined network according to a spec, and they enable users to create a primary network.
The namespace-scoped CRD represents a network request for a specific namespace; it can be used by cluster-admin and non-cluster-admin users (such as project admins, according to RBAC).
The cluster-scoped CRD represents a cross-namespace network, making the network available to workloads residing in namespaces that match the specified namespace label selector.

This user story doesn't cover user-defined secondary (non-primary) networks, therefore I added the following user story (line 79).
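
For illustration, a namespace-scoped request could look roughly like the sketch below; the group/version and exact field casing are assumptions based on the field tables in this enhancement:

```yaml
apiVersion: k8s.ovn.org/v1              # assumed group/version
kind: UserDefinedNetwork
metadata:
  name: tenant-net
  namespace: tenant-a                   # namespace-scoped: created by the project admin
spec:
  topology: Layer2
  role: Primary                         # primary network for the namespace
  subnets:
    - 10.100.200.0/24
  mtu: 1400
```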

| Field name | Description | optional |
|-------------------|---------------------------------------------------------------------------------------------------------------|----------|
| NamespaceSelector | List of the standard `metav1.LabelSelector` selector for which namespace the network should be available for. | No |
| Template | The user defined network spec. | No |
Contributor Author:

Not quite, it actually creates the NADs directly.
The cluster-scoped CRD gets a template (UDN spec) and a namespace label selector; it should create a NAD according to the spec in each target namespace that matches the specified namespace label selector.

The template type should be the namespace-scoped CRD spec.
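
For illustration, a cluster-scoped request could look roughly like this (kind, group/version, labels, and field casing are assumptions); the controller would then create a NAD in every namespace matching the selector:

```yaml
apiVersion: k8s.ovn.org/v1              # assumed group/version
kind: ClusterUserDefinedNetwork
metadata:
  name: shared-net
spec:
  namespaceSelector:
    matchLabels:
      network.example.com/shared: "true"   # hypothetical label
  template:                                # the namespace-scoped UDN spec
    topology: Layer2
    role: Secondary
    subnets:
      - 192.168.100.0/24
```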

> **Note:** The spec should be extended with care and strive to have minimal set of fields to provide nice abstraction for the NAD spec.
Contributor Author:

NAD is an acronym for NetworkAttachmentDefinition. In general it is a CRD that describes networks; it integrates with Multus, enabling the creation of secondary networks for workloads.
OVN-K utilizes NADs and Multus to enable creating secondary OVN networks (OVN-K multi-homing); please see the following doc for more details: https://github.com/ovn-org/ovn-kubernetes/blob/master/docs/features/multiple-networks/multi-homing.md
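
For reference, a typical OVN-Kubernetes secondary-network NAD (layer2 overlay) along the lines of the multi-homing doc linked above; the names and values are illustrative:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: l2-network
  namespace: tenant-a
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "l2-network",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer2",
      "subnets": "10.100.200.0/24",
      "mtu": 1300,
      "netAttachDefName": "tenant-a/l2-network"
    }
```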

The CRD status should reflect the NAD state through conditions.
For example, when the NAD is created, the condition should be set with status `True`.

For cluster scoped networks, the condition should be true once all desired namespaces are provisioned with the corresponding NAD.
Contributor Author:

The cluster-scoped CRD represents a single OVN network; in order to connect to the network, the workload should have access to the relevant NAD.
The cluster-scoped CRD should be followed by a NAD in each relevant namespace (according to the spec).
Since it generates the NADs with the same OVN network-name (represented by the NAD spec.config "name" field), there is a single underlying OVN network.
Depending on the network role, workloads can have access to each other (primary - isolated, secondary - shared).

metav1.ObjectMeta `json:"metadata,omitempty"`
// +kubebuilder:validation:Required
// +kubebuilder:validation:XValidation:rule="self == oldSelf", message="Spec is immutable"
// +kubebuilder:validation:XValidation:rule=" self.topology != 'Layer3' || (self.topology == 'Layer3' && size(self.subnets) > 0)", message="Subnets is required for Layer3 topology"
Contributor Author:

Subnets are optional for the Layer2 and Localnet topologies, and mandatory for Layer3.


##### Validations
Contributor Author:

@JoelSpeed is the new library available on kube 1.30?

| Subnets | The subnet to use for the network across the cluster.<br/>E.g. `10.100.200.0/24`.<br/>IPv6 `2001:DBB::/64` and dual-stack `192.168.100.0/24`,`2001:DBB::/64` subnets are supported.<br/>When omitted, the logical switch implementing the network only provides layer 2 communication, and users must configure IP addresses.<br/>Port security only prevents MAC spoofing if the subnets are omitted. | Yes |
| ExcludeSubnets | List of CIDRs.<br/>IP addresses are removed from the assignable IP address pool and are never passed to the pods. | Yes |
| JoinSubnets | Subnet used inside the OVN network topology. When omitted, this means no opinion and the platform is left to choose a reasonable default which is subject to change over time. | Yes |
| IPAMLifecycle | Controls the IP address management lifecycle. When `Persistent` is specified, workloads can have persistent IP addresses. For example: Virtual Machines will keep the same IP addresses across their lifecycle (stop, start, migration, reboots). Supported by Topology `Layer2` & `Localnet`. | Yes |
Contributor Author:

Since the IPAMLifecycle behavior is the same for both the layer2 and localnet topologies, I didn't add an additional union that specifies the topology. Does that make sense?

| Subnets | The subnet to use for the network across the cluster.<br/>E.g. `10.100.200.0/24`.<br/>IPv6 `2001:DBB::/64` and dual-stack `192.168.100.0/24`,`2001:DBB::/64` subnets are supported.<br/>When omitted, the logical switch implementing the network only provides layer 2 communication, and users must configure IP addresses.<br/>Port security only prevents MAC spoofing if the subnets are omitted. | Yes |
| ExcludeSubnets | List of CIDRs.<br/>IP addresses are removed from the assignable IP address pool and are never passed to the pods. | Yes |
| JoinSubnets | Subnet used inside the OVN network topology. When omitted, this means no opinion and the platform is left to choose a reasonable default which is subject to change over time. | Yes |
| IPAMLifecycle | Controls the IP address management lifecycle. When `Persistent` is specified, workloads can have persistent IP addresses. For example: Virtual Machines will keep the same IP addresses across their lifecycle (stop, start, migration, reboots). Supported by Topology `Layer2` & `Localnet`. | Yes |
Contributor Author:

Currently the only option is Persistent.

Some context:
This field's value translates to the boolean CNI argument `allowPersistentIPs: true`.
I tried to avoid having a boolean field in the spec.
IPAMLifecycle describes the domain of the allowPersistentIPs CNI argument, which relates to how IPAM is done.
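
A sketch of that mapping (field and network names are illustrative, not the final API): a UDN with `ipamLifecycle: Persistent` would be rendered into a generated NAD whose config enables the boolean CNI argument:

```yaml
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "tenant-a.tenant-net",
      "type": "ovn-k8s-cni-overlay",
      "topology": "layer2",
      "subnets": "10.100.200.0/24",
      "allowPersistentIPs": true,
      "netAttachDefName": "tenant-a/tenant-net"
    }
```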

- type: NetworkReady
status: "True"
reason: NetworkAttachmentDefinitionReady
message: NetworkAttachmentDefinition has been created
Contributor Author:

Please note that the NAD type describes a network configuration; it doesn't hold any state and has no status.

@@ -75,6 +76,8 @@ tenant to isolate traffic.
are natively isolated from other tenants.
* As a user, I want to be able to request a unique, primary network for my namespace without having to get administrator
permission.
* As a user, I want to be able to request new secondary networks for my namespace, without having to get administrator
Contributor Author:

Creating NetworkAttachmentDefinitions (NADs) is currently possible for cluster-admins only, because a non-admin user could create a NAD that breaks the cluster network (for example, something on br-ex or an underlying bond), or worse, disrupts cluster nodes: there is no restriction on the NAD spec, so one could specify a bad spec that exhausts node resources.

This user story supports migrating from OpenStack to Kubernetes, where tenants (project admins) should be able to create tenant networks in the namespaces they manage.
OVN-K enables creating such networks, but since it requires creating a NAD, only a cluster admin can create them.

It also enables the sysadmin of one org to manage their project in OpenShift and manage the network for the VMs they run, and similarly lets the sysadmin of another org do the same for their own project (the namespace they manage).

In general, secondary networks can be used, among other things, to segregate networks and control which workloads can communicate.
For example:
I have a legacy app running inside a VM (OCP Virtualization); it should be accessible to a subset of workloads (pods).
I can achieve that by creating an OVN overlay network and connecting the legacy app and the permitted workloads to it.

@npinaeva (Member) commented Aug 1, 2024:

@JoelSpeed thank you for your feedback!
We will address API-related comments in the implementation PR on the ovn-kubernetes side and will make a follow-up update for this enhancement after that.
