Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add proposal: communication ingress flows matrix #1588

Merged

Conversation

sabinaaledort
Copy link
Contributor

Communication ingress flows matrix of OpenShift and Operators: This enhancement allows automatically generate an accurate and up-to-date communication flows matrix that can be delivered to customers as part of product documentation for all ingress flows of OpenShift (multi-node and single node deployments) and Operators.

@openshift-ci openshift-ci bot requested review from dougbtv and trozet March 7, 2024 09:06
@sabinaaledort sabinaaledort force-pushed the commatrix_ingress_proposal branch 3 times, most recently from 643355b to 231662e Compare March 7, 2024 10:49
@sabinaaledort
Copy link
Contributor Author

/cc @cybertron @trozet @wking @danwinship @msherif1234
Can you please review?


The communication matrix can be generated on single-node deployments.

MicroShift is out of scope for this proposal.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MicroShift has the EndpointSlices API. Is there any reason the tool wouldn't work with MicroShift?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuvalk what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like customers will eventually definitely want this functionality with MicroShift

enhancements/network/communication-flows-matrix-ingress.md Outdated Show resolved Hide resolved
N/A

## Version Skew Strategy
N/A
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the new command work with older clusters?

Copy link
Contributor Author

@sabinaaledort sabinaaledort Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should

@yuvalk what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it'll work on any version with endpointslice (which are stable since 1.21)
problem is with all the exceptions (ie listening ports that are not covered by an endpointslices)

listening ports in a running cluster, `oc adm communication-matrix generate`.

- A new option will be added to OpenShift web console to generate an up-to-date
communication matrix.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will that be different from what the network observability features provide?

cc @stleerh

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with this... this seems to be the right approach to use. (Unless it is using that to gather the data and then this is just to make it easier to use in the UI?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are targeting a declarative approach whereas observability matrix is looking into what is actually running.
IMHO - what is not declared is "wrong" and should be blocked.
observability feature might be used to detect those and correct them but still need also a tool to extract the declared matrix


- The admin uses the OpenShift command-line interface (CLI) to generate an up-to-date
communication matrix using the following command `oc adm communication-matrix generate`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even from CLI pov netobserv adding cli capability to dump flows (ingress and egress)
https://github.com/netobserv/network-observability-cli/blob/main/img/flow-table.png
so its not clear me either what additional info will communication-matrix option will bring ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know the network observability features require installing an operator while we would like to introduce a lighter tool to focus on generating a communication matrix that can be easily applied in firewall rules, as well as delivering it as part of the product documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the cli mode we don't use operator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see that now, thank! checking

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sabinaaledort I think we should collaborate and see if this can be used to extend the current netobserv solution or not best to discuss this over slack in #forum-ocp-network-observability

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even from CLI pov netobserv adding cli capability to dump flows (ingress and egress)
so its not clear me either what additional info will communication-matrix option will bring ?

AIUI, the netobserv tool shows current traffic, while the communication-matrix tool shows expected traffic.

You can't necessarily reliably generate a firewall based on the netobserv output, since you can't guarantee that all of the required-to-be-allowed flows will actually get used during any particular time period.

Also, the netobserv tool can't tell you what ports will need to be open in the next release. (Well, OK, neither can the communcation-matrix tool, but you can at least look at the "generic"/documented communication-matrix to find that.)

```
direction Data flow direction (currently ingress only)
protocol IP protocol (TCP/UDP/etc)
port Flow port number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the service port right not the endpoint , also do u plan to suport SCTP` protocol ?
also for ingress firewall setting the rules might also require knowing the srcIP

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the endpoint port
@liornoy can/do we support SCTP protocol? also you have a test to apply firewall rules right? does it require the srcIP?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so actually 3 questions here:

  1. service port vs endpoint port - it's the endpoint / the one seen on the host network side
  2. SCTP - theoretically can be included too, especially in the proposal. impl will focus on TCP/UDP in the first iteration. can add later
  3. srcIP - there might be cases where you want to limit a certain service/flow to be accessible only from specific ip ranges. but that's mostly up to the customer. so short answer is for now it's binary yes/no

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supporting SCTP would only matter if we think OCP will ever use SCTP for internal communication (as opposed to merely supporting customer workloads that use SCTP). This seems extremely unlikely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the tool is to be used by customers to generate their matrix, they would like it to include their SCTP ports too.

for the one we are generating for OCP, you are right.

```
direction Data flow direction (currently ingress only)
protocol IP protocol (TCP/UDP/SCTP/etc)
port Flow port number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this ingress only why do you have direction in the config ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might add egress in the future, for now we think it is still a valuable field for a user analyzing the matrix

"protocol": "TCP",
"port": 51035,
"service": "rpc.statd",
"nodeRole": "master",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will be the nodeRole for the SNO case ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has significant overlap with Network Observability (as several people have noted). But also with ACS. I would like to make sure we are not duplicating work here.

@sabinaaledort
Copy link
Contributor Author

I think this has significant overlap with Network Observability (as several people have noted). But also with ACS. I would like to make sure we are not duplicating work here.

We met with the Network Observability team and discussed the possibility to integrate our module into the Network Observability CLI. This can be the next step. Currently our module is stored in openshift-kni https://github.com/openshift-kni/commatrix and we plan to release the first documented communication matrix in 4.16 openshift/openshift-docs#69720

@msherif1234
Copy link
Contributor

I think this has significant overlap with Network Observability (as several people have noted). But also with ACS. I would like to make sure we are not duplicating work here.

We met with the Network Observability team and discussed the possibility to integrate our module into the Network Observability CLI. This can be the next step. Currently our module is stored in openshift-kni https://github.com/openshift-kni/commatrix and we plan to release the first documented communication matrix in 4.16 openshift/openshift-docs#69720

correct we have a meeting to demonstrate the netobserv cli project and provided info about the repo and how to use it as possible future intg step

@yuvalk
Copy link
Contributor

yuvalk commented Jun 12, 2024

I think this has significant overlap with Network Observability (as several people have noted). But also with ACS. I would like to make sure we are not duplicating work here.

I dont think there's any overlap at the moment
and if there will be in the future, we can converge.

as you can see in the proposal, our plan is to use declarative information from EndpointSlices to create the matrix, which is very different from NetObserve. only for our CI, to make sure we are not missing anything, we are using evidence based, currently with ss commands, can certainly move to netobserve in the future there. especially if/when we'll start handling egress traffic too.

- A communication matrix describing the expected flows of incoming traffic will
be included in every OpenShift release documentation.

- A new `oc` command will be added to generate a current snapshot of known
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this really be part of oc? And don't people want more than just a list of listening ports? Shouldn't we be saying why the port is open, and what it is used for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oc so customers can generate a matrix from a running cluster, including their workloads.

it is a great idea, to extend EndpointSlices API so that they include a description too (with guidelines to document why port is open and what it is used for)

but meanwhile, we can cover that in the official doc

@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 23, 2024
@yuvalk
Copy link
Contributor

yuvalk commented Jul 23, 2024

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 23, 2024
@openshift-bot
Copy link

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 20, 2024
@sabinaaledort
Copy link
Contributor Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 20, 2024
Copy link
Contributor

@danwinship danwinship left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed this earlier...


## Motivation

Security-conscious customers need OpenShift flows matrix for regulatory reasons
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're providing this for customers to use "for regulatory reasons", then we need more information about exactly what that entails, so we can be sure this meets the regulations.


Security-conscious customers need OpenShift flows matrix for regulatory reasons
and/or to implement firewall rules to restrict traffic to the minimum set of
required flows only, on-node firewall or external.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the overall goal is to enable customers to implement restrictive firewalls, then this is probably only one part of the full solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not exactly sure what you call "this" and what, in your mind is the full solution.
my vision is that we have a node level firewall that is enabled by default and adaptively configure itself based on the declerations that are applied. obviously that will require more machinery then what is described in this proposal.
we are incrementally building toward that future

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that if we are going to support customers who are doing aggressive firewalling, then we need to put a bunch more work into making sure that we don't break them. eg, this comment


### Non-Goals

- Egress traffic.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a non-goal? Do customers not care about firewalling egress traffic? Is this a future goal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's non-goal in this proposal, indeed a future goal and we are actively working on a research how egress traffic flows can be added to communication matrix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe better add it to the open questions, topics.
but we are actively working and thinking now about egress too (it's much newer work than this proposal, that been lingering for a long while)


- The admin uses the OpenShift command-line interface (CLI) to generate an up-to-date
communication matrix using the following command `oc adm communication-matrix generate`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even from CLI pov netobserv adding cli capability to dump flows (ingress and egress)
so its not clear me either what additional info will communication-matrix option will bring ?

AIUI, the netobserv tool shows current traffic, while the communication-matrix tool shows expected traffic.

You can't necessarily reliably generate a firewall based on the netobserv output, since you can't guarantee that all of the required-to-be-allowed flows will actually get used during any particular time period.

Also, the netobserv tool can't tell you what ports will need to be open in the next release. (Well, OK, neither can the communcation-matrix tool, but you can at least look at the "generic"/documented communication-matrix to find that.)


The communication matrix can be generated on single-node deployments.

MicroShift is out of scope for this proposal.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like customers will eventually definitely want this functionality with MicroShift


Another test will be added to validate the ports in a generated communication
matrix match a snapshot of the node's listening ports (created with the Linux
`ss` utility).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need a periodic e2e job that brings up a cluster, firewalls everything that isn't required by the communications matrix, and then does an upgrade; if the upgrade fails, that warns us that the matrix is failing to detect something.


A user will be able to run `openshift-tests` or a new `oc` command,
`oc adm communication-matrix validate`, to validate the `EndpointSlices`
in the cluster match a current snapshot of the node's listening ports.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also important that a user be able to validate "the cluster won't break if I upgrade". IOW, they need to be able to prove that their current firewall rules are acceptable according to the matrix of the OCP release they want to upgrade to.

`EndpointSlices` stand better for ingress traffic than for egress, in ingress traffic the
services are less dynamic and mostly stay up during the entire cluster lifetime.
Supporting egress traffic might require changes in the API that should also be reviewed
and agreed with the upstream Kubernetes community.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand what you're saying here.

Are you just talking about the fact that, in general, we do not have Services/EndpointSlices pointing to the external IPs that we connect to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will rephrase


3. The node hosted services such as sshd, rpc, etc. are currently missing
`EndpointSlices`. We believe it should be created by the Machine API as
part of adding/removing nodes. This can be addressed in later enhancement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to try out the "firewall everything that isn't listed in the matrix and see what breaks" test, and that should help to answer these questions...

Note that the installer has a bunch of hardcoded rules for setting up networking on cloud platforms (eg, aws) which ideally would be autogenerated from some official list of requirements...

Comment on lines 218 to 222
- Use the communication matrix to create and apply firewall rules, and
run E2E `openshift-tests`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't expect openshift-tests to work on a firewalled cluster; many of the networking test cases in particular create HostNetwork pods and expect to be able to connect to them.

We could potentially run a subset of tests that are "firewall-safe". But I think doing an OCP version upgrade is probably a better overall test, since an OCP upgrade exercises a substantial amount of core functionality, and because upgrades are something we explicitly want to guarantee will work even with a firewall, so it's a good test case anyway.

Comment on lines +249 to +257
- Use the communication matrix to create and apply firewall rules, and
run E2E `openshift-tests` ([implemented](https://github.com/openshift-kni/commatrix/blob/main/test/e2e/commatrix_suite_test.go#L100))
- Use the communication matrix to create and apply firewall rules, and upgrade
OpenShift version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't things that would be "added to openshift-tests"; they're both new jobs rather than additions to openshift-tests. (So they should be a new top-level bullet point, not under the openshift-tests bullet point.)

### Drawbacks
N/A

## Open Questions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add: "Right now the installer has its own hard-coded list of ports to open on different clouds. Can we eventually auto-generate this as well?"

@danwinship
Copy link
Contributor

/approve
just two nitpicks; someone else can lgtm after that

Copy link
Contributor

openshift-ci bot commented Oct 14, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 14, 2024
Communication ingress flows matrix of OpenShift and Operators:
This enhancement allows automatically generate an accurate and up-to-date
communication flows matrix that can be delivered to customers as part of
product documentation for all ingress flows of OpenShift (multi-node and
single node deployments) and Operators.
Copy link
Contributor

openshift-ci bot commented Oct 15, 2024

@sabinaaledort: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@yuvalk
Copy link
Contributor

yuvalk commented Oct 15, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 15, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 96d8a57 into openshift:master Oct 15, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants