Enabling NetworkPolicy logging may cause packet drops when massive connections hit the policy #5018

tnqn · 2023-05-22T14:49:51Z

Describe the bug

As described in the title, packet could be dropped unexpectedly if massive connections hit a NetworkPolicy with logging enabled, having an impact to the performance of applications.

Quoting some discussions had in other channel:

Antonin Bas:
I have a couple of follow up questions on this

do you believe that this is the best implementation: we are ensuring correctness of the audit logs but can impact legitimate > traffic (for Allow rules with audit logging enabled). Correctness of audit logs seem like a valid rationale, but maybe the > documentation should mention the possible impact on traffic and make recommendations.

the 100 packets/s limit may be a bit low with the current implementation. I think it was set this way to be very conservative, but I feel like we could test with 1000? maybe users should have a way to set it as well (if they are willing to increase CPU usage, they can go with a higher limit)

Vicky Liu:
Yes, Antonin, it seems that ANP logging is a "dangerous" function which will cause heavy overhead. Before we make big performance enhancement, we need to let customers aware of the impact on system. +Wenying to share the plan of performance enhancement.

Wenying Dong:
In my opinion, the meter usage is mainly for securing agent on logging purpose, but it is not supposed to impact on the traffic. So it is better to change the meter usage to work only on the logging path (OF action of controller) first. As for the limited value with 100 or 1000 packets, maybe we need to give the final decision after stress tests on agent.

Antonin Bas:

So it is better to change the meter usage to work only on the logging path (OF action of controller) first.

I think that was the original intention, but looks like the implementation didn’t work out this way. Is it possible to meter the PacketIn message without metering the user packet?
I think we have to choose between:

audit log accuracy

no impact on user traffic

I think we should privilege the second one here.
The rationale for adding the meter was to prevent denial of service on the Agent. However, right now there is a possibility of denial of service on user traffic (if the user enables audit logging on Allow rules).

Wenying Dong:
A solution in my thought is using a type all OF group instead of using direct OF actions in a ANP rule which also requires logging. Then we could have two copies of the packet firstly, then one is to continue with the user traffic control, and the other is for logging. For the second one, we could apply meter control if necessary.

To Reproduce

Create a NetworkPolicy with allow action and logging enabled.
Use netperf -t TCP_CRR to generate connections that hit the above NetworkPolicy. The throughput will be limited to 100 trans/s when logging is enabled, and will be increased when logging is disabled.

Expected

Enabling logging should have no impact on user traffic.

Versions:

Antrea version (Docker image tag). ?~v1.11.1

The text was updated successfully, but these errors were encountered:

Dyanngg · 2023-05-23T05:05:40Z

/cc @Dyanngg @qiyueyao

wenyingd · 2023-06-02T04:56:32Z

The root cause of the issue is Meter action must be the first action in an OpenFlow action set, which leads to both the actions "resubmit the packet to another table" ( for ANP security traffic purpose ) and "send the packet to Antrea controller" ( for logging purpose) are impacted by the rate limit from Meter.

The major idea of the solution is to use Meter for rate limit only on the "send to controller" action for logging purpose. To implement the idea, we should have two copies of the packet firstly, and then two flows are applied to these packets separately for the different purposes. Hence, a type all group is suggested to make the copies when logging is enabled.

After reading the code, there are 4 functions related with ANP logging: conjunctionActionFlow, conjunctionDenyFlow, conjunctionActionPassFlow, defaultDropFlow, and there two kinds of the flow actions, 1) drop the current packet in OVS pipeline and send to controller; and 2) resubmit the packet to a "next" table and send to controller. For 1), we could have one group with a single bucket to resubmit the packet to a table a flow with "send to controller" action locates. An alternative is we can use a flow action to resubmit the packet to the table directly instead of a group because only one copy for packet is needed. For 2), there may be one group per "next" table since the table ID is changable according to the ANP rule. As for other marks set in the packet which is needed by antrea-agent with a packetIn packet, we can set it in the original flow before applying "group" action, in this way the group and the logging purpose flow can only focus on how to send the packet to antrea-agent itself.

I would suggest to place the flow for "sending the packet to controller" to L2ForwardingOutTable. Then another chanllenge is introduced that how differentiate the packet for logging purpose from the one for ANP traffic purpose. Here I would suggest to extend the original OFPortFoundRegMark to a Regfield as OutputDispositionField, and OFPortFoundRegMark and SendToControllerRegMark are two different marks with this field. In this way, the two copies of packets ( for ANP traffic purpose and for logging purpose ) are differentiated with diffent marks.

tnqn added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. area/network-policy Issues or PRs related to network policies. labels May 22, 2023

tnqn added this to the Antrea v1.13 release milestone May 22, 2023

tnqn assigned wenyingd May 23, 2023

wenyingd mentioned this issue Jul 12, 2023

Use OpenFlow group for Network Policy logging #5061

Merged

tnqn closed this as completed in #5061 Jul 21, 2023

tnqn added the reported-by/end-user Issues reported by end users. label Sep 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enabling NetworkPolicy logging may cause packet drops when massive connections hit the policy #5018

Enabling NetworkPolicy logging may cause packet drops when massive connections hit the policy #5018

tnqn commented May 22, 2023

Dyanngg commented May 23, 2023

wenyingd commented Jun 2, 2023 •

edited

Loading

Enabling NetworkPolicy logging may cause packet drops when massive connections hit the policy #5018

Enabling NetworkPolicy logging may cause packet drops when massive connections hit the policy #5018

Comments

tnqn commented May 22, 2023

Dyanngg commented May 23, 2023

wenyingd commented Jun 2, 2023 • edited Loading

wenyingd commented Jun 2, 2023 •

edited

Loading