
New implementation of the learning bridge app #638

Merged (2 commits) on Nov 5, 2015

Conversation

alexandergall (Contributor)

Well, this one took me a while. My previous implementation based on Bloom filters had two major drawbacks:

  • Performance: insertions and lookups were relatively costly, at up to 150-200 cycles per operation.
  • By the nature of Bloom filters, it was not possible to dump the MAC table, which can be crucial for debugging operational issues.

My L2VPN system requires full multi-port and split-horizon semantics, so I had to keep those features.

My approach is to use a custom-built hash table to store MAC address-to-port mappings, using lib.hash.murmur for hashing. I didn't try to use Lua hash tables for this; maybe I should have, to compare performance. With this implementation, at least, I can understand and control all aspects.
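
To make the idea concrete, here is a minimal, self-contained sketch of the general shape (this is not the PR's mac_table.lua; the sizes are made up and a toy hash stands in for lib.hash.murmur). MAC keys hash into fixed-size buckets, and both learning and lookup walk the few slots of one bucket:

local ffi = require("ffi")
local bit = require("bit")

ffi.cdef[[
typedef struct { uint8_t mac[6]; uint16_t port; } mac_entry_t; /* port 0 = empty slot */
]]

local BUCKETS, SLOTS = 1024, 4
local entries = ffi.new("mac_entry_t[?]", BUCKETS * SLOTS)  -- zero-initialized

local function hash_mac (mac)  -- toy hash; the real code uses lib.hash.murmur
   local h = 0
   for i = 0, 5 do h = bit.bxor(bit.rol(h, 5), mac[i]) end
   return h % BUCKETS
end

local function mac_eq (a, b)   -- tiny inner loop over the 6 key bytes
   for i = 0, 5 do
      if a[i] ~= b[i] then return false end
   end
   return true
end

local function learn (mac, port)
   local base = hash_mac(mac) * SLOTS
   for s = base, base + SLOTS - 1 do
      local e = entries[s]
      if e.port == 0 or mac_eq(e.mac, mac) then
         ffi.copy(e.mac, mac, 6); e.port = port
         return true
      end
   end
   return false  -- bucket full; the real table grows instead
end

local function lookup (mac)
   local base = hash_mac(mac) * SLOTS
   for s = base, base + SLOTS - 1 do
      local e = entries[s]
      if e.port ~= 0 and mac_eq(e.mac, mac) then return e.port end
   end
   return nil    -- unknown destination: flood
end

-- toy usage
local src = ffi.new("uint8_t[6]", 0x02, 0, 0, 0, 0, 0x01)
learn(src, 1)
assert(lookup(src) == 1)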

This would all have been straightforward, except that it wasn't. The crux is the branchy nature of MAC table lookups and the tiny loop over the slots in a hash bucket. The comments at the top of apps/bridge/learning.lua and apps/bridge/mac_table.lua explain the issues and my solution to them. I won't reproduce those arguments here, but I'm happy to discuss them.

The upshot is that I had to hide a small portion of the code from the compiler inside a C function. I'm fairly satisfied with the performance, a little less so with the kludginess of the code. Suggestions for enhancements are very welcome.
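
For a flavour of the technique (again just a hedged illustration, not the PR's actual C helper): the per-byte comparison loop from the sketch above can be replaced by a single FFI call, so the trace compiler sees one opaque call instead of a nested, data-dependent loop. Here libc's memcmp stands in for the purpose-built C function the PR uses, reusing hash_mac, SLOTS, and entries from the sketch above:

local ffi = require("ffi")
ffi.cdef[[ int memcmp(const void *a, const void *b, size_t n); ]]

-- Drop-in variant of lookup() from the sketch above: the per-byte
-- comparison loop (mac_eq) is replaced by a single opaque C call.
local function lookup_ffi (mac)
   local base = hash_mac(mac) * SLOTS
   for s = base, base + SLOTS - 1 do
      local e = entries[s]
      if e.port ~= 0 and ffi.C.memcmp(e.mac, mac, 6) == 0 then
         return e.port
      end
   end
   return nil
end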

As a goodie, the code provides auto-scaling of the table and should be essentially maintenance-free in operation (this also ensures that the hash table always operates at a load factor low enough that performance is only marginally affected by the depth of the buckets).
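
A hedged illustration of the auto-scaling idea (the threshold and growth factor below are made-up numbers, not values from mac_table.lua): grow the table whenever the load factor creeps above a limit, so bucket depth stays small and lookups stay cheap.

local max_load = 0.4  -- made-up threshold

local function maybe_resize (t)
   local load = t.entries / (t.buckets * t.slots_per_bucket)
   if load > max_load then
      t.buckets = t.buckets * 2  -- double the bucket count
      t.needs_rehash = true      -- caller rebuilds the table at a safe point
   end
end

-- toy usage
local t = { entries = 900, buckets = 512, slots_per_bucket = 4 }
maybe_resize(t)
print(t.buckets, t.needs_rehash)  --> 1024   true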

apps/bridge has been re-written to use a custom-built hash table to
store MAC addresses instead of a bloom filter for performance and
management reasons.  Performance considerations have made it necessary
to implement part of the mac_table module in C.
@eugeneia (Member) commented Oct 5, 2015

Very cool to follow up on this, especially since I had my own attempt at making an L2 switch fast. :)

@alexandergall Do you think it would make sense to add a benchmark based on the learning bridge app? I feel like it would be a great addition to our basic1 benchmark. (Maybe revive the benchmark we used in #555?)

@alexandergall (Contributor, Author)

A benchmark would certainly make sense. I'm not sure what would be most useful for capturing the performance of the actual bridge, though; for example, whether we should run the packet generator in the same process. Do we want separate benchmarks for flooding and unicast performance? It would also be useful to benchmark the bridge when the MAC table is full.

@lukego (Member) commented Oct 6, 2015

(Haven't read the exciting code yet - more feedback to follow :-))

Great idea to make benchmarks for regression testing!

I see a lot of potential for interesting work on traffic matching/dispatching mechanisms over time, and it will be excellent to have enough tests for people to evaluate whether new mechanisms are suitable to succeed/replace existing ones (without breaking important performance characteristics of somebody's product).

From a software maintenance perspective I think the most important purpose of the benchmarks is to compare the relative performance of two software versions and to cover enough scenarios to represent the realistic performance characteristics that people care about in programs that are using the apps. This would make it possible to rewrite dispatching code over time and be confident when new versions are suitable/unsuitable to merge onto master. (That would also make the optimization work accessible to people who are masters of CPUs and data structures but don't have a background in networking and want to treat the benchmark setup as a black box that takes in code and outputs a score.)

Also, add a note that the method post_config() must be called after
engine.configure() to complete the initialization of the bridge.
@lukego (Member) commented Oct 20, 2015

This code is super interesting! I have still not been all the way through it but you have implemented multiple ideas here that I have been imagining somebody would tackle one day:

  1. To write an efficient lookup table for an application that is lookup-intensive. This seems extremely valuable as a reference benchmark. Then people can experiment with, on the one hand, optimizations to further increase performance and, on the other hand, cutting code while keeping performance steady. Fun times ahead! I am sure that our table lookup use cases will grow over time :).
  2. The programming technique of using FFI calls to hide small inner loops from the JIT so that they don't impact trace behavior. I have also been anticipating that this will be important if you have a loop in a trace and then want to introduce (say) a small hash code calculation that would need to iterate. (In a sense we are already doing this with memcpy.)
  3. The programming technique of processing packets in two steps: first sorting them into queues (flood, unicast, drop) and then using branch-free code to process each queue (a tiny illustrative sketch follows this list). I can really imagine LuaJIT being happy with the branch-free code, and perhaps that benefits the CPU itself too. (Hard to know when you are contorting yourself to satisfy a cruel JIT compiler vs fundamentally matching your problem to modern CPU architecture! ref Tracing JITs and modern CPUs: double trouble, or, a problem shared is a problem halved? lukego/blog#5).
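
A hedged, self-contained toy of that two-pass idea (illustrative only; not the PR's learning.lua): classify packets into per-action queues first, then run a separate tight loop per queue so each loop body has no data-dependent branches.

local mac_to_port = { ["02:00:00:00:00:01"] = 1 }  -- toy MAC table

local unicast, flood, drop = {}, {}, {}

local function classify (packets)
   for i = 1, #packets do
      local p = packets[i]
      local out = mac_to_port[p.dst]
      if out == nil then
         flood[#flood+1] = p       -- unknown destination: flood
      elseif out == p.ingress then
         drop[#drop+1] = p         -- would go back out the ingress port
      else
         p.egress = out
         unicast[#unicast+1] = p   -- known destination: unicast
      end
   end
end

local function process ()
   for i = 1, #unicast do print("unicast", unicast[i].dst, "->", unicast[i].egress) end
   for i = 1, #flood   do print("flood", flood[i].dst) end
   for i = 1, #drop    do print("drop", drop[i].dst) end
   unicast, flood, drop = {}, {}, {}
end

classify{
   { dst = "02:00:00:00:00:01", ingress = 2 },  -- unicast to port 1
   { dst = "02:00:00:00:00:01", ingress = 1 },  -- same port: drop
   { dst = "02:00:00:00:00:99", ingress = 3 },  -- unknown: flood
}
process()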

Very cool.

This also makes me think it would be really valuable to have a comprehensive benchmark in the style of SPECint that benchmarks the entire code base and says whether it is improving or degrading. The CI benchmarks are moving in this direction but I wonder if we need to make it easier to grow them somehow. Just thinking how wonderful it would be if the benchmarks were sufficient that people could independently work on problems like "Can I make this code run faster?" or "Can I halve the size of this module without reducing performance?" without actually having to understand the applications themselves (treat them as black boxes that either speed up or slow down as a result of changes to the source code).

Looking forward to digging a bit deeper into the implementation to see what more I can learn from your investigations :)

lukego added a commit to lukego/snabb that referenced this pull request Oct 31, 2015
@lukego (Member) commented Oct 31, 2015

Merged onto next.

How are you measuring performance of this code today? Can we make a performance regression test out of that somehow?

@alexandergall (Contributor, Author)

Currently, my primitive benchmark is to run a two-port bridge with a slightly modified Source app that lets me create synthetic packets with pre-defined headers. On my system (Xeon E5-2620v2, 2.4 GHz), I get around 11.5-12 Mpps with 64-byte packets. I guess this should at least be translated into some metric like cycles per packet before it could be used as a regression test in the CI.
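
(For a rough conversion, assuming the core runs at the nominal 2.4 GHz: 2.4e9 cycles/s divided by 12e6 packets/s is 200 cycles per packet, and 11.5 Mpps works out to about 209 cycles per packet.)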

One important variant of this check is to fill the MAC table with random data to simulate a busy system and exercise the code paths associated with that. With the C wrapper for the hash table, performance remains basically unchanged (that's where all of my attempts to write it in pure Lua failed miserably). But that's just some ad-hoc code right now.

@eugeneia (Member) commented Nov 2, 2015

I guess this should at least be translated into some metric like cycles per packet before it could be used as a regression test in the CI.

I don't think this translation is necessary for performance regression tests, as they are always measured against a previous result on the same hardware. E.g. in the future, once we have multiple SnabbBot instances running, we will effectively be able to tell how a change affects performance on a given CPU architecture / NIC. The effective “benchmark score” is computed by the simple formula HEAD_RESULT / BASE_RESULT.
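
(With made-up numbers for illustration: if the base commit measures 12.0 Mpps and the head commit 11.5 Mpps on the same machine, the score is 11.5 / 12.0 ≈ 0.96, i.e. roughly a 4% regression.)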

@lukego (Member) commented Nov 2, 2015

@alexandergall Ad-hoc code sounds fine. The purpose of this test case will be to estimate whether a new change is going to make you happy/sad/indifferent. So it only needs to test whatever you are already testing.

I wonder if this could be a nice basic template for making regression tests:

perf stat -e cycles ./snabb snsh -e 'require("mymodule").selfbenchmark()'

where selfbenchmark() does a fixed amount of work (e.g. 100M hashtable operations) and then returns.
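
A hedged sketch of a selfbenchmark() in this spirit ("mymodule" and the workload below are placeholders, not existing Snabb code): perform a fixed number of operations so the cycle count reported by perf stat divides into a stable cycles-per-operation figure.

local ffi = require("ffi")
local bit = require("bit")

local M = {}

function M.selfbenchmark ()
   local iterations = 100e6          -- fixed amount of work
   local key = ffi.new("uint8_t[6]")
   local acc = 0
   for i = 1, iterations do
      key[0], key[5] = bit.band(i, 255), bit.band(i * 7, 255)
      -- stand-in for a MAC table insert/lookup pair
      acc = bit.bxor(acc, key[0] + key[5])
   end
   return acc  -- return a value so the work cannot be optimized away
end

return M

Dividing the cycle count from a perf stat run of the command above by 100e6 then gives a cycles-per-operation figure that can be tracked across software versions.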

I suppose that time would work instead of perf stat but that might be a little less dependable if "Turbo Boost" is enabled on the server (CPU clock speed will change based on activity on other cores).

@lukego (Member) commented Nov 2, 2015

aside:

I would love to make it as easy as possible to write performance regression tests.

One idea is to support a selfbenchmark() method that runs performance regression unit tests in the same streamlined way as selftest(). Then we could have a make benchmark target that outputs a CSV file with all the unit benchmarks for a software version. This could be used for testing and tracking the performance of individual subsystems over time.

One other idea would be to stop throwing out new ideas when we already have a working mechanism and only need to use it more... :-)

@eugeneia (Member) commented Nov 2, 2015

#626 introduces make benchmarks, which is somewhat related.
