Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabling NetworkPolicy logging may cause packet drops when massive connections hit the policy #5018

Closed
tnqn opened this issue May 22, 2023 · 2 comments · Fixed by #5061
Closed
Assignees
Labels
area/network-policy Issues or PRs related to network policies. kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. reported-by/end-user Issues reported by end users.

Comments

@tnqn
Copy link
Member

tnqn commented May 22, 2023

Describe the bug

As described in the title, packet could be dropped unexpectedly if massive connections hit a NetworkPolicy with logging enabled, having an impact to the performance of applications.

Quoting some discussions had in other channel:

Antonin Bas:
I have a couple of follow up questions on this

  1. do you believe that this is the best implementation: we are ensuring correctness of the audit logs but can impact legitimate > traffic (for Allow rules with audit logging enabled). Correctness of audit logs seem like a valid rationale, but maybe the > documentation should mention the possible impact on traffic and make recommendations.
  2. the 100 packets/s limit may be a bit low with the current implementation. I think it was set this way to be very conservative, but I feel like we could test with 1000? maybe users should have a way to set it as well (if they are willing to increase CPU usage, they can go with a higher limit)

Vicky Liu:
Yes, Antonin, it seems that ANP logging is a "dangerous" function which will cause heavy overhead. Before we make big performance enhancement, we need to let customers aware of the impact on system. +Wenying to share the plan of performance enhancement.

Wenying Dong:
In my opinion, the meter usage is mainly for securing agent on logging purpose, but it is not supposed to impact on the traffic. So it is better to change the meter usage to work only on the logging path (OF action of controller) first. As for the limited value with 100 or 1000 packets, maybe we need to give the final decision after stress tests on agent.

Antonin Bas:

So it is better to change the meter usage to work only on the logging path (OF action of controller) first.

I think that was the original intention, but looks like the implementation didn’t work out this way. Is it possible to meter the PacketIn message without metering the user packet?
I think we have to choose between:

  • audit log accuracy
  • no impact on user traffic

I think we should privilege the second one here.
The rationale for adding the meter was to prevent denial of service on the Agent. However, right now there is a possibility of denial of service on user traffic (if the user enables audit logging on Allow rules).

Wenying Dong:
A solution in my thought is using a type all OF group instead of using direct OF actions in a ANP rule which also requires logging. Then we could have two copies of the packet firstly, then one is to continue with the user traffic control, and the other is for logging. For the second one, we could apply meter control if necessary.

To Reproduce

  1. Create a NetworkPolicy with allow action and logging enabled.
  2. Use netperf -t TCP_CRR to generate connections that hit the above NetworkPolicy. The throughput will be limited to 100 trans/s when logging is enabled, and will be increased when logging is disabled.

Expected

Enabling logging should have no impact on user traffic.

Versions:

  • Antrea version (Docker image tag). ?~v1.11.1
@tnqn tnqn added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/network-policy Issues or PRs related to network policies. labels May 22, 2023
@tnqn tnqn added this to the Antrea v1.13 release milestone May 22, 2023
@Dyanngg
Copy link
Contributor

Dyanngg commented May 23, 2023

/cc @Dyanngg @qiyueyao

@wenyingd
Copy link
Contributor

wenyingd commented Jun 2, 2023

The root cause of the issue is Meter action must be the first action in an OpenFlow action set, which leads to both the actions "resubmit the packet to another table" ( for ANP security traffic purpose ) and "send the packet to Antrea controller" ( for logging purpose) are impacted by the rate limit from Meter.

The major idea of the solution is to use Meter for rate limit only on the "send to controller" action for logging purpose. To implement the idea, we should have two copies of the packet firstly, and then two flows are applied to these packets separately for the different purposes. Hence, a type all group is suggested to make the copies when logging is enabled.

After reading the code, there are 4 functions related with ANP logging: conjunctionActionFlow, conjunctionDenyFlow, conjunctionActionPassFlow, defaultDropFlow, and there two kinds of the flow actions, 1) drop the current packet in OVS pipeline and send to controller; and 2) resubmit the packet to a "next" table and send to controller. For 1), we could have one group with a single bucket to resubmit the packet to a table a flow with "send to controller" action locates. An alternative is we can use a flow action to resubmit the packet to the table directly instead of a group because only one copy for packet is needed. For 2), there may be one group per "next" table since the table ID is changable according to the ANP rule. As for other marks set in the packet which is needed by antrea-agent with a packetIn packet, we can set it in the original flow before applying "group" action, in this way the group and the logging purpose flow can only focus on how to send the packet to antrea-agent itself.

I would suggest to place the flow for "sending the packet to controller" to L2ForwardingOutTable. Then another chanllenge is introduced that how differentiate the packet for logging purpose from the one for ANP traffic purpose. Here I would suggest to extend the original OFPortFoundRegMark to a Regfield as OutputDispositionField, and OFPortFoundRegMark and SendToControllerRegMark are two different marks with this field. In this way, the two copies of packets ( for ANP traffic purpose and for logging purpose ) are differentiated with diffent marks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/network-policy Issues or PRs related to network policies. kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. reported-by/end-user Issues reported by end users.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants