New implementation of the learning bridge app #638
Conversation
apps/bridge has been rewritten to use a custom-built hash table instead of a Bloom filter to store MAC addresses, for performance and management reasons. Performance considerations have made it necessary to implement part of the mac_table module in C.
Very cool to follow up on this, especially since I had my own attempt at making an L2 switch fast. :) @alexandergall Do you think it would make sense to add a benchmark based on the learning bridge app? I feel like it would be a great addition to our basic1 benchmark. (Maybe revive the benchmark we used in #555?)
A benchmark would certainly make sense. I'm not sure what setup would best capture the performance of the actual bridge (if we run the packet generator in the same process, for example). Do we want separate benchmarks for flooding and unicast performance? It would also be useful to benchmark the bridge when the MAC table is full.
(Haven't read the exciting code yet; more feedback to follow :-)) Great idea to make benchmarks for regression testing! I see a lot of potential for interesting work on traffic matching/dispatching mechanisms over time, and it will be excellent to have enough tests for people to evaluate whether new mechanisms are suitable to succeed/replace existing ones (without breaking important performance characteristics of somebody's product). From a software maintenance perspective, I think the most important purpose of the benchmarks is to compare the relative performance of two software versions and to cover enough scenarios to represent the realistic performance characteristics that people care about in programs that use the apps. This would make it possible to rewrite dispatching code over time and be confident about when new versions are suitable or unsuitable to merge onto master. (That would also make the optimization work accessible to people who are masters of CPUs and data structures but don't have a background in networking and want to treat the benchmark setup as a black box that takes in code and outputs a score.)
Also, add a note that the method post_config() must be called after engine.configure() to complete the initialization of the bridge.
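For illustration, a minimal sketch of that ordering (the app name, port names and wiring below are placeholders, not taken from this PR):

```lua
-- Minimal sketch of the required initialization order.  The app name,
-- port names and wiring are illustrative placeholders.
local config   = require("core.config")
local engine   = require("core.app")
local learning = require("apps.bridge.learning")

local c = config.new()
config.app(c, "bridge", learning.bridge, { ports = { "p1", "p2" } })
-- ... config.link() the bridge ports to other apps here ...
engine.configure(c)

-- post_config() must run only after engine.configure() has
-- instantiated the app, to complete the initialization of the bridge.
engine.app_table.bridge:post_config()
```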
This code is super interesting! I have still not been all the way through it, but you have implemented multiple ideas here that I have been imagining somebody would tackle one day.
Very cool. This also makes me think it would be really valuable to have a comprehensive benchmark in the style of SPECint that benchmarks the entire code base and says whether it is improving or degrading. The CI benchmarks are moving in this direction, but I wonder if we need to make it easier to grow them somehow. Just thinking how wonderful it would be if the benchmarks were sufficient for people to independently work on problems like "Can I make this code run faster?" or "Can I halve the size of this module without reducing performance?" without actually having to understand the applications themselves (treating them as black boxes that either speed up or slow down as a result of changes to the source code). Looking forward to digging a bit deeper into the implementation to see what more I can learn from your investigations :)
Merged onto next. How are you measuring performance of this code today? Can we make a performance regression test out of that somehow?
Currently, my primitive benchmark is to run a two-port bridge with a slightly modified Source app that lets me create synthetic packets with pre-defined headers. On my system (Xeon E5-2620v2, 2.4GHz), I get around 11.5-12 Mpps with 64-byte packets. I guess this should at least be translated into some metric like cycles per packet before it could be used as a regression test in the CI. One important variant of this check is to fill the MAC table with random data to simulate a busy system and exercise the code paths associated with that. With the C wrapper for the hash table, performance remains basically unchanged (that's where all of my attempts to write it in pure Lua failed miserably). But that's just some ad hoc code right now.
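As a rough conversion (assuming a single core running at the nominal 2.4 GHz clock): 2.4e9 cycles/s ÷ 12e6 packets/s ≈ 200 cycles per packet, or ≈ 209 cycles per packet at 11.5 Mpps.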
I don't think this translation is necessary for performance regression tests, as they are always measured against a previous result on the same hardware. E.g. in the future, once we have multiple SnabbBot instances running, we will effectively be able to tell how a change affects performance on a given CPU architecture / NIC. The effective “benchmark score” is computed by a simple formula.
@alexandergall Ad hoc code sounds fine. The purpose of this test case will be to estimate whether a new change is going to make you happy, sad, or indifferent. So it only needs to test whatever you are already testing. I wonder if this could be a nice basic template for making regression tests:
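Something roughly like the following, perhaps (a hedged sketch: the Source/Sink wiring, the one-second duration, and reading the score off the link report are placeholder choices, not an existing harness):

```lua
-- Hypothetical regression-test template: pump synthetic packets
-- through a two-port learning bridge for a fixed duration and use the
-- measured packet rate as the benchmark score.  All wiring and
-- parameters here are placeholders.
local config   = require("core.config")
local engine   = require("core.app")
local basic    = require("apps.basic.basic_apps")
local learning = require("apps.bridge.learning")

local c = config.new()
config.app(c, "source", basic.Source)
config.app(c, "sink",   basic.Sink)
config.app(c, "bridge", learning.bridge, { ports = { "p1", "p2" } })
config.link(c, "source.output -> bridge.p1")
config.link(c, "bridge.p2 -> sink.input")
engine.configure(c)
engine.app_table.bridge:post_config()

-- Run for a fixed duration and print per-link packet counts; the CI
-- would compare the resulting rate against the previous run on the
-- same hardware.
engine.main({ duration = 1, report = { showlinks = true } })
```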
where I suppose that …
Aside: I would love to make it as easy as possible to write performance regression tests. One idea is to support a … One other idea would be to stop throwing out new ideas when we already have a working mechanism and only need to use it more... :-)
#626 introduces …
Well, this one took me a while. My previous implementation based on Bloom filters had two major drawbacks, one relating to performance and one to manageability.
My L2VPN system requires full multi-port and split-horizon semantics, so I had to keep those features.
My approach is to use a custom-built hash table to store MAC address-to-port mappings using lib.hash.murmur. I didn't try to use Lua hash tables for this; maybe I should have, to compare performance :/ With this implementation, at least, I can understand and control all aspects.

This would all have been straightforward, except that it wasn't. The crux is the branchy nature of MAC table lookups and the tiny loop over the slots in a hash bucket. The comments at the top of apps/bridge/learning.lua and apps/bridge/mac_table.lua explain the issues and my solution for them. I don't reproduce those arguments here, but I'm happy to discuss them.

The upshot is that I had to hide a small portion of the code from the compiler inside a C function. I'm fairly satisfied concerning performance, a little less so with the kludginess of the code. Suggestions for enhancements are very welcome.
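To illustrate the pattern (not the actual mac_table interface; the helper name and entry layout are made up), the branchy bucket scan lives in C and is reached through the LuaJIT FFI, so the trace compiler sees one opaque call instead of an unpredictable inner loop:

```lua
-- Illustrative only: hiding a tiny, branchy bucket scan from the JIT
-- behind a C call via the LuaJIT FFI.  bucket_lookup() is a made-up
-- helper, not the real mac_table interface.
local ffi = require("ffi")

ffi.cdef[[
typedef struct { uint8_t mac[6]; uint16_t port; } mac_entry_t;
/* Scan up to <depth> slots of a bucket for <mac>; return the port
   index, or -1 if the address is unknown (caller then floods). */
int bucket_lookup(const mac_entry_t *bucket, int depth,
                  const uint8_t *mac);
]]

-- Lua fast path: LuaJIT compiles this to a plain C call, keeping the
-- unbiased branches of the slot scan out of the compiled trace.
local function lookup_port (t, hash, mac)
   local bucket = t.buckets + (hash % t.nbuckets) * t.bucket_depth
   return ffi.C.bucket_lookup(bucket, t.bucket_depth, mac)
end
```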
As a goodie, the code provides auto-scaling of the table and should be essentially maintenance-free in operation (this also makes sure that the hash table always operates below a load factor where performance is only marginally affected by the depth of the buckets).
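A rough sketch of that auto-scaling logic (the 0.5 threshold and all names are invented for illustration, not taken from mac_table):

```lua
-- Invented names and threshold, for illustration only: double the
-- table whenever the load factor creeps too high, so bucket depth
-- (and hence lookup cost) stays bounded without manual maintenance.
local MAX_LOAD = 0.5

local function maybe_grow (t, rehash)
   if t.entries / t.size > MAX_LOAD then
      t.size = t.size * 2
      rehash(t)  -- caller re-inserts all entries into the bigger table
   end
end
```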