
New implementation of the learning bridge app #638

Merged (2 commits) on Nov 5, 2015

Conversation

alexandergall (Contributor)

Well, this one took me a while. My previous implementation based on Bloom filters had two major drawbacks:

  • Performance: insertions and lookups were relatively costly, at up to 150-200 cycles per operation.
  • By the nature of Bloom filters, it was not possible to dump the MAC table, which can be crucial for debugging operational issues.

My L2VPN system requires full multi-port and split-horizon semantics, so I had to keep those features.

My approach is to use a custom-built hash table to store MAC address-to-port mappings, using lib.hash.murmur for hashing. I didn't try to use Lua hash tables for this; maybe I should have, to compare performance. With this implementation, at least, I can understand and control all aspects.
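
To make the idea concrete, here is a minimal, self-contained sketch of the general shape (this is not the PR's mac_table.lua; the sizes are made up and a toy hash stands in for lib.hash.murmur). MAC keys hash into fixed-size buckets, and both learning and lookup walk the few slots of one bucket:

local ffi = require("ffi")
local bit = require("bit")

ffi.cdef[[
typedef struct { uint8_t mac[6]; uint16_t port; } mac_entry_t; /* port 0 = empty slot */
]]

local BUCKETS, SLOTS = 1024, 4
local entries = ffi.new("mac_entry_t[?]", BUCKETS * SLOTS)  -- zero-initialized

local function hash_mac (mac)  -- toy hash; the real code uses lib.hash.murmur
   local h = 0
   for i = 0, 5 do h = bit.bxor(bit.rol(h, 5), mac[i]) end
   return h % BUCKETS
end

local function mac_eq (a, b)   -- tiny inner loop over the 6 key bytes
   for i = 0, 5 do
      if a[i] ~= b[i] then return false end
   end
   return true
end

local function learn (mac, port)
   local base = hash_mac(mac) * SLOTS
   for s = base, base + SLOTS - 1 do
      local e = entries[s]
      if e.port == 0 or mac_eq(e.mac, mac) then
         ffi.copy(e.mac, mac, 6); e.port = port
         return true
      end
   end
   return false  -- bucket full; the real table grows instead
end

local function lookup (mac)
   local base = hash_mac(mac) * SLOTS
   for s = base, base + SLOTS - 1 do
      local e = entries[s]
      if e.port ~= 0 and mac_eq(e.mac, mac) then return e.port end
   end
   return nil    -- unknown destination: flood
end

-- toy usage
local src = ffi.new("uint8_t[6]", 0x02, 0, 0, 0, 0, 0x01)
learn(src, 1)
assert(lookup(src) == 1)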

This would all have been straightforward, except that it wasn't. The crux is the branchy nature of MAC table lookups and the tiny loop over the slots in a hash bucket. The comments at the top of apps/bridge/learning.lua and apps/bridge/mac_table.lua explain the issues and my solution to them. I won't reproduce those arguments here, but I'm happy to discuss them.

The upshot is that I had to hide a small portion of the code from the compiler inside a C function. I'm fairly satisfied with the performance, a little less so with the kludginess of the code. Suggestions for enhancements are very welcome.
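
For a flavour of the technique (again just a hedged illustration, not the PR's actual C helper): the per-byte comparison loop from the sketch above can be replaced by a single FFI call, so the trace compiler sees one opaque call instead of a nested, data-dependent loop. Here libc's memcmp stands in for the purpose-built C function the PR uses, reusing hash_mac, SLOTS, and entries from the sketch above:

local ffi = require("ffi")
ffi.cdef[[ int memcmp(const void *a, const void *b, size_t n); ]]

-- Drop-in variant of lookup() from the sketch above: the per-byte
-- comparison loop (mac_eq) is replaced by a single opaque C call.
local function lookup_ffi (mac)
   local base = hash_mac(mac) * SLOTS
   for s = base, base + SLOTS - 1 do
      local e = entries[s]
      if e.port ~= 0 and ffi.C.memcmp(e.mac, mac, 6) == 0 then
         return e.port
      end
   end
   return nil
end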

As a goodie, the code provides auto-scaling of the table and should be essentially maintenance-free in operation (this also ensures that the hash table always operates at a load factor low enough that performance is only marginally affected by the depth of the buckets).
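
A hedged illustration of the auto-scaling idea (the threshold and growth factor below are made-up numbers, not values from mac_table.lua): grow the table whenever the load factor creeps above a limit, so bucket depth stays small and lookups stay cheap.

local max_load = 0.4  -- made-up threshold

local function maybe_resize (t)
   local load = t.entries / (t.buckets * t.slots_per_bucket)
   if load > max_load then
      t.buckets = t.buckets * 2  -- double the bucket count
      t.needs_rehash = true      -- caller rebuilds the table at a safe point
   end
end

-- toy usage
local t = { entries = 900, buckets = 512, slots_per_bucket = 4 }
maybe_resize(t)
print(t.buckets, t.needs_rehash)  --> 1024   true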

apps/bridge has been re-written to use a custom-built hash table to
store MAC addresses instead of a bloom filter for performance and
management reasons.  Performance considerations have made it necessary
to implement part of the mac_table module in C.
@eugeneia (Member) commented Oct 5, 2015

Very cool to follow up on this, especially since I had my own attempt at making an L2 switch fast. :)

@alexandergall Do you think it would make sense to add a benchmark based on the learning bridge app? I feel like it would be a great addition to our basic1 benchmark. (Maybe revive the benchmark we used in #555?)

@alexandergall (Contributor, Author)

A benchmark would certainly make sense. I'm not sure what would be most useful for capturing the performance of the actual bridge, though; for example, whether we should run the packet generator in the same process. Do we want separate benchmarks for flooding and unicast performance? It would also be useful to benchmark the bridge when the MAC table is full.

@lukego (Member) commented Oct 6, 2015

(Haven't read the exciting code yet - more feedback to follow :-))

Great idea to make benchmarks for regression testing!

I see a lot of potential for interesting work on traffic matching/dispatching mechanisms over time, and it will be excellent to have enough tests for people to evaluate whether new mechanisms are suitable to succeed/replace existing ones (without breaking important performance characteristics of somebody's product).

From a software maintenance perspective I think the most important purpose of the benchmarks is to compare the relative performance of two software versions and to cover enough scenarios to represent the realistic performance characteristics that people care about in programs that are using the apps. This would make it possible to rewrite dispatching code over time and be confident when new versions are suitable/unsuitable to merge onto master. (That would also make the optimization work accessible to people who are masters of CPUs and data structures but don't have a background in networking and want to treat the benchmark setup as a black box that takes in code and outputs a score.)

Also, add a note that the method post_config() must be called after
engine.configure() to complete the initialization of the bridge.
@lukego (Member) commented Oct 20, 2015

This code is super interesting! I have still not been all the way through it but you have implemented multiple ideas here that I have been imagining somebody would tackle one day:

  1. To write an efficient lookup table for an application that is lookup-intensive. This seems extremely valuable as a reference benchmark. Then people can experiment with, on the one hand, optimizations to further increase performance and, on the other hand, cutting code while keeping performance steady. Fun times ahead! I am sure that our table lookup use cases will grow over time :).
  2. The programming technique of using FFI calls to hide small inner loops from the JIT so that they don't impact trace behavior. I have also been anticipating that this will be important if you have a loop in a trace and then want to introduce (say) a small hash code calculation that would need to iterate. (In a sense we are already doing this with memcpy.)
  3. The programming technique of processing packets in two steps: first sorting them into queues (flood, unicast, drop) and then using branch-free code to process each queue (a tiny illustrative sketch follows this list). I can really imagine LuaJIT being happy with the branch-free code, and perhaps that benefits the CPU itself too. (Hard to know when you are contorting yourself to satisfy a cruel JIT compiler vs fundamentally matching your problem to modern CPU architecture! ref Tracing JITs and modern CPUs: double trouble, or, a problem shared is a problem halved? lukego/blog#5).
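
A hedged, self-contained toy of that two-pass idea (illustrative only; not the PR's learning.lua): classify packets into per-action queues first, then run a separate tight loop per queue so each loop body has no data-dependent branches.

local mac_to_port = { ["02:00:00:00:00:01"] = 1 }  -- toy MAC table

local unicast, flood, drop = {}, {}, {}

local function classify (packets)
   for i = 1, #packets do
      local p = packets[i]
      local out = mac_to_port[p.dst]
      if out == nil then
         flood[#flood+1] = p       -- unknown destination: flood
      elseif out == p.ingress then
         drop[#drop+1] = p         -- would go back out the ingress port
      else
         p.egress = out
         unicast[#unicast+1] = p   -- known destination: unicast
      end
   end
end

local function process ()
   for i = 1, #unicast do print("unicast", unicast[i].dst, "->", unicast[i].egress) end
   for i = 1, #flood   do print("flood", flood[i].dst) end
   for i = 1, #drop    do print("drop", drop[i].dst) end
   unicast, flood, drop = {}, {}, {}
end

classify{
   { dst = "02:00:00:00:00:01", ingress = 2 },  -- unicast to port 1
   { dst = "02:00:00:00:00:01", ingress = 1 },  -- same port: drop
   { dst = "02:00:00:00:00:99", ingress = 3 },  -- unknown: flood
}
process()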

Very cool.

This also makes me think it would be really valuable to have a comprehensive benchmark in the style of SPECint that benchmarks the entire code base and says whether it is improving or degrading. The CI benchmarks are moving in this direction but I wonder if we need to make it easier to grow them somehow. Just thinking how wonderful it would be if the benchmarks were sufficient that people could independently work on problems like "Can I make this code run faster?" or "Can I halve the size of this module without reducing performance?" without actually having to understand the applications themselves (treat them as black boxes that either speed up or slow down as a result of changes to the source code).

Looking forward to digging a bit deeper into the implementation to see what more I can learn from your investigations :)

lukego added a commit to lukego/snabb that referenced this pull request Oct 31, 2015
@lukego (Member) commented Oct 31, 2015

Merged onto next.

How are you measuring performance of this code today? Can we make a performance regression test out of that somehow?

@alexandergall (Contributor, Author)

Currently, my primitive benchmark is to run a two-port bridge with a slightly modified Source app that lets me create synthetic packets with pre-defined headers. On my system (Xeon E5-2620v2, 2.4 GHz), I get around 11.5-12 Mpps with 64-byte packets. I guess this should at least be translated into some metric like cycles per packet before it could be used as a regression test in the CI.
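
(For a rough conversion, assuming the core runs at the nominal 2.4 GHz: 2.4e9 cycles/s divided by 12e6 packets/s is 200 cycles per packet, and 11.5 Mpps works out to about 209 cycles per packet.)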

One important variant of this check is to fill the MAC table with random data to simulate a busy system and exercise the code paths associated with that. With the C wrapper for the hash table, performance remains basically unchanged (that's where all of my attempts to write it in pure Lua failed miserably). But that's just some ad-hoc code right now.

@eugeneia (Member) commented Nov 2, 2015

I guess this should at least be translated into some metric like cycles per packet before it could be used as a regression test in the CI.

I don't think this translation is necessary for performance regression tests, as they are always measured against a previous result on the same hardware. E.g. in the future, once we have multiple SnabbBot instances running, we will effectively be able to tell how a change affects performance on a given CPU architecture / NIC. The effective “benchmark score” is computed by the simple formula HEAD_RESULT / BASE_RESULT.
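
(With made-up numbers for illustration: if the base commit measures 12.0 Mpps and the head commit 11.5 Mpps on the same machine, the score is 11.5 / 12.0 ≈ 0.96, i.e. roughly a 4% regression.)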

@lukego (Member) commented Nov 2, 2015

@alexandergall Ad-hoc code sounds fine. The purpose of this test case will be to estimate whether a new change is going to make you happy/sad/indifferent. So it only needs to test whatever you are already testing.

I wonder if this could be a nice basic template for making regression tests:

perf stat -e cycles ./snabb snsh -e 'require("mymodule").selfbenchmark()'

where selfbenchmark() does a fixed amount of work (e.g. 100M hashtable operations) and then returns.
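
A hedged sketch of a selfbenchmark() in this spirit ("mymodule" and the workload below are placeholders, not existing Snabb code): perform a fixed number of operations so the cycle count reported by perf stat divides into a stable cycles-per-operation figure.

local ffi = require("ffi")
local bit = require("bit")

local M = {}

function M.selfbenchmark ()
   local iterations = 100e6          -- fixed amount of work
   local key = ffi.new("uint8_t[6]")
   local acc = 0
   for i = 1, iterations do
      key[0], key[5] = bit.band(i, 255), bit.band(i * 7, 255)
      -- stand-in for a MAC table insert/lookup pair
      acc = bit.bxor(acc, key[0] + key[5])
   end
   return acc  -- return a value so the work cannot be optimized away
end

return M

Dividing the cycle count from a perf stat run of the command above by 100e6 then gives a cycles-per-operation figure that can be tracked across software versions.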

I suppose that time would work instead of perf stat but that might be a little less dependable if "Turbo Boost" is enabled on the server (CPU clock speed will change based on activity on other cores).

@lukego (Member) commented Nov 2, 2015

aside:

I would love to make it as easy as possible to write performance regression tests.

One idea is to support a selfbenchmark() method that runs performance regression unit tests in the same streamlined way as selftest(). Then we could have a make benchmark target that outputs a CSV file with all the unit benchmarks for a software version. This could be used for testing and tracking the performance of individual subsystems over time.

One other idea would be to stop throwing out new ideas when we already have a working mechanism and only need to use it more... :-)

@eugeneia (Member) commented Nov 2, 2015

#626 introduces make benchmarks, which is somewhat related.
