API Edge Cases Undocumented #2
Nice speedup! I have a newer version of the codec that needs some more…
Oh, also, if you set ALL_ORIGINAL and receive all blocks from non-FEC chunks, even though wirehair has copied each chunk into its memory, it will only give you back 0s when you ask for chunk reconstructions.
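A minimal caller-side guard for this edge case might look like the sketch below. It assumes a thin wrapper around the decoder; `ChunkTracker`, `on_original_chunk`, and `get_chunk` are hypothetical names, not wirehair's API. The idea is simply to keep the caller's own copies of original chunks and serve "reconstructions" from those when every original arrived, instead of asking the decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wrapper state around a wirehair decoder (names are illustrative).
struct ChunkTracker {
    explicit ChunkTracker(size_t n) : original(n), have(n, false) {}

    std::vector<std::vector<uint8_t>> original;  // caller-side copies of original chunks
    std::vector<bool> have;                      // which original chunk IDs have arrived
    size_t have_count = 0;

    void on_original_chunk(uint32_t id, const uint8_t* data, size_t len) {
        if (!have[id]) {
            original[id].assign(data, data + len);
            have[id] = true;
            ++have_count;
        }
        // ...also feed the chunk to the wirehair decoder here...
    }

    // When every original chunk arrived, serve the caller's own copy and skip the
    // decoder path that would otherwise hand back zeroed data.
    const uint8_t* get_chunk(uint32_t id) const {
        if (have_count == have.size())
            return original[id].data();
        return nullptr;  // placeholder for the normal decoder-reconstruction path
    }
};
```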
The optimizations in question are at bitcoinfibre/bitcoinfibre@eafd1ca, though note that I really only benchmarked in my benchmark tool, which works on relatively large data (~1 MB at 1024-byte chunks). Sadly, I think the initial setup/EncodeFeed adds too much latency for my application (since I'm bound by, essentially, setup/encode/first packet -> setup/decode with low-loss latency). Any pointers on reducing the initial setup latency, even if it comes at the cost of a handful of additional packets?
One thing you can do is run the setup in a background thread and start generating output symbols once it completes. Since the first ~1000 symbols are going to be the originals, you have some time to let the CPU catch up. cm256 is faster for a small number of symbols (about 32), but wirehair with these speedups is much faster beyond that point. More details here: https://github.com/catid/wh256/blob/master/docs/wh256_performance.pdf
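To make that concrete, here's a minimal sketch of overlapping the setup with sending the systematic (original) symbols. `build_encoder`, `send_original_chunk`, and `send_recovery_symbol` are hypothetical placeholders for whatever wraps the codec and the network layer, not real library calls.

```cpp
#include <algorithm>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical placeholders for the codec and network layer.
struct Encoder;
Encoder* build_encoder(const uint8_t* data, size_t bytes, size_t chunk_bytes);  // slow setup
void send_original_chunk(const uint8_t* chunk, size_t len, uint32_t id);
void send_recovery_symbol(Encoder* enc, uint32_t symbol_id);

void stream_block(const std::vector<uint8_t>& block, size_t chunk_bytes) {
    // Run the expensive encoder setup on a background thread...
    std::future<Encoder*> setup = std::async(std::launch::async, [&] {
        return build_encoder(block.data(), block.size(), chunk_bytes);
    });

    // ...while the first symbols of a systematic code (the originals) go out now.
    const size_t n_chunks = (block.size() + chunk_bytes - 1) / chunk_bytes;
    for (uint32_t id = 0; id < n_chunks; ++id) {
        const size_t off = size_t(id) * chunk_bytes;
        const size_t len = std::min(chunk_bytes, block.size() - off);
        send_original_chunk(block.data() + off, len, id);
    }

    // By now setup has usually finished; switch to generated recovery symbols.
    Encoder* enc = setup.get();
    for (uint32_t sym = n_chunks; sym < 2 * n_chunks; ++sym)  // until the receiver acks
        send_recovery_symbol(enc, sym);
}
```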
Sadly, for my application the receiving end has the vast majority of source symbols before it even receives the first packet (it just needs a few packets of metadata to figure out how to place the data in FEC chunks). The FEC is essentially just to communicate some random missing data, and since it normally only ever needs a few packets to fill in the gaps, the latency-to-first-FEC-packet is critical.
Maybe something like this is more what you need? https://github.com/catid/shorthair That project could be improved a lot by (1) using cm256 instead of longhair, and (2) using an approach like Tetrys instead: https://www.ietf.org/proceedings/86/slides/slides-86-nwcrg-1.pdf. Option (2) requires a bunch more work that I've started but haven't gotten into a working state yet.
I looked at shorthair, though it looks like a full network implementation and everything above... I've already got my own networking layer with lots of random crap that I need for my specific use-case. I'm not sure that cm256 and a sliding window are really ideal... let me describe a bit more: I'm sending Bitcoin blocks around a global network of servers at incredibly low latency (usually 10-20 milliseconds slower than the RTT/2 between the servers, including through the GFW). Every 10 minutes I get a new block (1-4 MB) which contains almost entirely transactions that both sides have already seen, and I need to get it to the other side fast. I send a small set of data to encode which transactions go where in the ultimate block, send that (1-2 ms, including FEC, though that should be easily optimized by using cm256, etc.), and then send FEC data for the whole block. Currently this works OK with some pretty shitty LDPC FEC code, though the code isn't well-optimized and is slightly frightening in its memory usage. I was hoping to switch to wirehair, but it ends up wasting too much time (about 2x) in the initial encode to be a strict gain (I haven't benchmarked, but I assume it has to do with calculating the GF(2^8) RS code data).
How big is the data to send? Sounds like less than 1 MB diffs that would…
It ranges widely... There's some data at http://bitcoinfibre.org/stats.html but, essentially, 90-something percent of the time you get the full block reconstructed with ~40 kB of data, while a non-trivial fraction of the time it can take 1 MB+ of data before the current LDPC stuff completes a reconstruction.
Oh, and I really would prefer not to introduce a knowledge-of-receiver-side-data assumption, because speed-of-light makes that a bitch :(.
You could give wh256 a shot. It switches to cm256 when it is faster to do so.
Yea, I think I was hoping a no-GF256 codec would be a simple constant-change that I could hack up... happy to get you a beer if it's more involved :).
Hi -- In addition to patching over packet loss, this system ab(uses) the forward error correction to create a diff where you have no idea what the remote side knows, which we don't, because it's a distributed system with data coming in every which direction and multiple sources... and 100ms+ one-way delays. Two potential source nodes might even receive the same data at once, and both can usefully contribute to all the receivers recovering as fast as possible (as they can send different ranges of recovery data).

The currently live system uses an LDPC erasure code which decodes exclusively via peeling (and has a fairly low-quality implementation that Matt has somewhat fixed up). It's fast -- but very high overhead (e.g. 20% more packets than an RS code would have needed). I encountered wirehair a while back and suggested Matt give it a shot -- the performance is really impressive overall. I think what is frustrating him now is the time to the first recovery packet (as mentioned, he doesn't send the source packets because they're reconstructed on the far end).

As an aside, if you want some number crunching done for coming up with except-seeds for wh256, I have a good hundred cores sitting idle for the moment and wouldn't mind running something for you.

I was suggesting WH256 to Matt last night, with the hope that the faster GF256 parts might help some of his issues. (Prior to that I tried turning the heavy row count down but got lost trying to figure out why it was crashing.)
Two other random thoughts: (1) an MTU size increase would improve latency in…
"The decoder recreates the recovery set during decoding" huh? I'm assuming you're referring to the GF2^8 RS stuff? Is that strictly neccessary? Seems like it should be able to get away with only doing that when needed? |
The recovery set is a set of symbols slightly larger than the original data. If a decoder has recovered lost data then it can be reused for encoding.
That is useful for the application here (once a participant receives enough to recover, it'll encode and start generating new symbols to help other nodes, beyond simply forwarding along already-received symbols), but unfortunately it's not on the critical path (which, it sounds like Matt is saying, is time to first encoded symbols on nodes that received the data from outside). Pardon my ignorance, but is it the case that some non-trivial number of IDs would happen to reference nothing but original packets? (I'm familiar with the basic relevant FEC tech, but ignorant of your construction except in a vague hand-waving sense :) ) If so, extracting and sending a wad of those first might resolve Matt's concern, though I suppose it would seriously degrade the overhead.
So I see two ways to make the encoder super-fast without writing a whole other codec:

(1) Add a feature to work in non-systematic mode, where it multiplies by the matrix to generate the recovery set and then you'd send across the network only linear XOR combinations of those. Instead of sending the original data you'd send the output of the encoder.

(2) Add a larger feature to work in a mixed mode, where it again multiplies by the matrix to generate the recovery set, but then allows you to send the original data followed by some random combinations of the original data, as in the systematic code. It then requires a more complicated decoder to solve the recovery set in two stages.

I've not fully thought through (2), so maybe the decoder would be much slower for 50-50 original data and recovery data? (1) is the safer way forward.

I tried out wh256 for your data sizes and it takes about 1.3-1.6 milliseconds to initialize the encoder. Is that really significant?
So I hacked up some code to cache the matrix generation, which seems to be a nontrivial improvement in synthetic and real-world benchmarks, though it barely shows up in callgrind at all (and I haven't hooked a real profiler up to it yet); see the code at bitcoinfibre/bitcoinfibre@d32a9a2, though I'm pretty sure the copy stuff is busted since I didn't bother to audit anything. I also swapped in wh256 and it seemed to improve a bit more (getting within spitting distance of the naive LDPC stuff I had previously, so probably worth swapping out at this point). Do you have the code to generate seeds handy? I could spin it up on some beefy servers.
Would you guys mind also evaluating https://github.com/catid/siamese ?
Neat! For FIBRE the targeting of few recovery symbols will likely get in the way, but we have some related things where the ability to have a rolling window of changing data would be very useful (... once I figure out how that actually works with Siamese :) ).
Oh, I thought the files were like 1 MB, so 1000 pieces, and the 256-symbol limit on output would be 25% -- that should be higher than the loss, right?
Actually, for that use case it would be nicer if Siamese had an API that supported fixed-size input blocks maintained outside of the library, to avoid some copies. It's probably still within 20% of current performance, though.
Well, FIBRE has a few strange properties: because we (usually) have almost the entire 1 MB of data at the other end before we start sending, we really want to get the data across in just a handful of packets. However, because it occasionally has a much lower hitrate, we also need to be able to get the data to the other end after having just sent a bunch of recovery symbols. I suppose I could do some hack where it first sends 256 recovery symbols and then starts sending data packets, but then you'd waste a bunch of packets on data that is useless to the other side :/.
If I increased the number of recovery packets to some larger number, would that work for you? It's somewhat arbitrary.
For FIBRE we really want recovery packets to be at least equal to the input packet count; twice the input size is even better.

Here is a brief and somewhat simplified outline of what we do with FIBRE today:

There is a small collection of nodes spread around the world. Our task is to get a block from a random source to all the targets with as little latency as possible. The nodes have gigabit+ connectivity, and the distance between them is such that a single round trip anywhere will dwarf the transmission time. Each has a many-megabyte queue of transactions that will potentially end up in blocks, learned via a gossip protocol and prioritized based on their odds of showing up soon. These queues are roughly but not precisely consistent among the participating nodes.

At one of these nodes, a block shows up (a Poisson process with an expectation of ~10 minutes). The receiving node makes a sketch of the block: a collection of short hashes and sizes for each of the transactions. This is on the order of 15 kbytes in size. It's streamed out to peers over UDP with a bit of FEC.

The targets take this data and use the hashes and sizes to mock up the block in memory. There is a size-based permutation applied to the transaction order that also adds a bit of padding, so that when the block gets packetized, missing transactions will cause fewer packet erasures.

The source performs the same permutation+packetization process and runs a FEC encoder on it (currently wirehair, originally a more boring LDPC code). It then starts streaming out recovery packets, interleaving between peers (you can think of it as sending half east around the world, half west around the world). There is no congestion control; rather, each node has a preconfigured transmission rate that is adjusted to send as fast as possible without causing queuing delays/losses.

The targets feed their locally generated packets (from their queues + sketches) into the FEC decoder. When a peer receives a recovery packet it may pass it on to another peer. (There is a precomputed forwarding topology that minimizes latencies and avoids peers getting duplicates of the same data -- it's not sufficient to simply have the source send directly; getting the latency-minimizing path can require passing through other nodes.)

Once a target has enough data, it reconstructs the whole block. It sends a packet to each of its peers telling them not to bother transmitting any more about the block to it. It will then FEC-encode the block and begin sending out novel chunks which it never received to whatever peers haven't told it that they're done. (This last point isn't that important except when there are outages, though it's easy to do with LDPC-like codes.)

Sometimes a block will show up which isn't very consistent with nodes' queues and may have a much smaller hitrate (e.g. a few percent), or nodes might have been recently restarted and not have a useful queue yet. In those cases, their reconstruction will take more data, but we still must avoid a round trip. Also, a block might show up at multiple sources at about the same time, so it is useful if both can operate independently sending non-overlapping packet sets, and both contribute usefully to the reconstruction... rather than having a bunch of duplicates show up due to sources which can't know about each other's transmissions (outside each other's light cone).

As you can see, in this model the source never sends any of the input packets, because any of them potentially duplicate what the target already knows.
Often recovery only needs a couple of packets past the sketch... maybe two dozen (I asked Matt to get a histogram of that). But sometimes the hitrate is low and it needs an amount equal to the input. Because the source sends packets in multiple directions around the world, it's also good if the recovery count is 2x what's needed, so non-duplicates can be sent in different directions.

The source can more or less detect in advance of sending if the hitrate is going to be very low, because the block will also not coincide well with the source's queue. We could potentially switch coding schemes based on that. We just can't wait for replies from the targets, e.g. to tell us what they were missing so that we can just send original data, because the whole process should be done before that reply would be received.

I think it would be fine for us if the encoder could produce more recovery packets but decode was somewhat slower/less efficient if many were needed... otherwise we could potentially switch from WH to Siamese based on the expected queue hitrate, and if Siamese is used, start streaming out the input after sending all the recovery packets, until we're told to stop. Whether that would be attractive depends on the performance. Right now, WH encode+decode performance is a major factor in the controllable part of our total latency (which 95% of the time is less than 16 ms over the one-packet latency).
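For illustration only, the source-side flow described above might be sketched roughly as below. Every type and function name here is a hypothetical stand-in, not actual FIBRE or codec code; the point is just the shape of the pipeline (sketch out, permute/packetize, then stream only recovery symbols until peers say they're done).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the steps described above (not actual FIBRE code).
struct Block  { /* 1-4 MB of transactions */ };
struct Peer   { /* UDP destination; knows whether it has said "done" */ };
struct Encoder;  // opaque FEC encoder state (e.g. wirehair)

std::vector<uint8_t> make_sketch(const Block& blk);            // short hashes + sizes, ~15 kB
std::vector<uint8_t> permute_and_packetize(const Block& blk);  // size-based permutation + padding
Encoder* fec_create(const std::vector<uint8_t>& payload, size_t chunk_bytes);
std::vector<uint8_t> fec_recovery_packet(Encoder* enc, uint32_t symbol_id);
void send_udp(Peer& p, const std::vector<uint8_t>& pkt);
bool peer_done(const Peer& p);  // set when the peer says "stop"

void relay_block(const Block& blk, std::vector<Peer>& peers, size_t chunk_bytes) {
    // 1) Stream the sketch so targets can mock up the block from their queues.
    for (Peer& p : peers)
        send_udp(p, make_sketch(blk));

    // 2) Apply the same permutation + packetization as the targets, then build the encoder.
    std::vector<uint8_t> payload = permute_and_packetize(blk);
    Encoder* enc = fec_create(payload, chunk_bytes);

    // 3) Stream only recovery packets (never the originals), interleaved across
    //    peers, until each peer reports it has reconstructed the block.
    uint32_t sym = uint32_t(payload.size() / chunk_bytes);  // skip the systematic symbols
    bool any_waiting = true;
    while (any_waiting) {
        any_waiting = false;
        for (Peer& p : peers) {
            if (peer_done(p)) continue;
            any_waiting = true;
            send_udp(p, fec_recovery_packet(enc, sym++));
        }
    }
}
```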
I appreciate the extra detail; let me think about other methods that may help. Okay, I'll put together a specialized block codec that can produce more output symbols (with higher perf) so you can evaluate that as an alternative.
Ideally, of course, the decode would be iterative on packets, to avoid taking too much CPU time if we do need a lot of packets. Even at 1 Gbps, packets aren't coming in that fast, so you've got a bit of time to process things, and I've already got infrastructure to fill the block with transactions on a background thread (which used to be an expensive process with lots of mallocs, but is now largely free) that could easily be converted to a "do a step of something useful in the FEC decoder" thread.
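As a rough illustration of that last idea, a generic producer/consumer sketch follows: the network thread hands each received symbol to a worker that performs one decode step per packet. It is not tied to any particular codec; `decoder_add_symbol` is a hypothetical hook for "feed one symbol, report whether the block is now recoverable".

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical decoder step: feed one received symbol, return true once solvable.
bool decoder_add_symbol(std::vector<uint8_t> symbol);

class DecodeWorker {
public:
    DecodeWorker() : worker_([this] { run(); }) {}
    ~DecodeWorker() {
        { std::lock_guard<std::mutex> g(m_); stop_ = true; }
        cv_.notify_one();
        worker_.join();
    }

    // Called from the network thread for each arriving FEC symbol.
    void on_packet(std::vector<uint8_t> symbol) {
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(symbol)); }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::vector<uint8_t> sym;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return stop_ || !q_.empty(); });
                if (stop_ && q_.empty()) return;
                sym = std::move(q_.front());
                q_.pop();
            }
            if (decoder_add_symbol(std::move(sym)))
                break;  // block fully recovered; worker goes idle
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::vector<uint8_t>> q_;
    bool stop_ = false;
    std::thread worker_;  // constructed last, after the state it uses
};
```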
Please evaluate this library for your purposes: https://github.com/catid/fecal
Hmm, I hacked a naive version into my FEC wrapper and it turned out quite a bit slower (2x on my few-missing-chunks test, which is around 14 missing chunks out of 880). You can see the FEC wrapper changes at bitcoinfibre/bitcoinfibre@6f69893#diff-208ddd15cda5e81d824f18d7daffdde1; it has some extra memcpys that add a bit of overhead, but at least callgrind's cycle estimation thinks fecal_decode is where nearly all the time ended up. I'm happy to provide callgrind output for your analysis (or you can run the bench_bitcoin binary in that repo using ./autogen.sh && CXXFLAGS="-march=native -mtune=native -O2 -g" ./configure --disable-wallet && make && ./src/bench/bench_bitcoin).
Disappointment =( |
Well, I'll get back to you guys next time I have some new software to share. Feel free to reach out to me at mrcatid@gmail.com for more support.
If you're interested in evaluating another codec, I just put up this one: https://github.com/catid/leopard It's limited in that the recovery data cannot be streamed out like a fountain code. Also, the recovery set cannot be larger than the original data size. Hope it's useful.
Damn, super close, but didn't beat wh256 :(. Indeed, the lack of streaming output/incremental decode really hurts this one for FIBRE. In my synthetic benchmark on my Haswell-EP workstation, wh256 has ~1.5 ms of setup time, then streams out enough chunks to get a decode super fast, with the decode run taking around 1.5-1.7 ms ignoring encode time. Leopard is closer to 1.5 ms on the encode, 4 ms on the decode. It got a bit faster after I hardcoded the block size like I do in wh256, but only by a ms or two in total, which wasn't quite enough to get over the hump. (So, in conclusion, one thing to consider for future FECs is an option to hardcode the FEC chunk block size: in FIBRE I use fixed-size UDP packets, which makes tons of things lots simpler, and hardcoding the block size all the way down to the XOR functions can be a decent gain.)
Haha, damn. Okay, I will keep trying. The encoder and decoder should get about 50% faster as I finish the DIT FFT optimization for the 16-bit field, and I'm thinking about 12 bits for speed for apps like yours that operate around the 1-4 MB file size.
Oh wow, alright, that might just do it to beat wh256. I can try to PR a decent version of the optionally-hardcode-FEC-size, if you'd like.
Cool, will let ya know when it is ready to test.
Okay I just finished the biggest optimizations for 16-bit finite fields.
Should be about 50% faster to encode and 50% faster to decode.
Nice: for a single test block (~800 1152-byte chunks) it went from ~1.3 ms on the encode to ~1.05 ms. On the decode side, ~4.5 ms -> ~3.6 ms. Still, compared to wh256's ~1.2 ms to start encoding, ~1.56 ms to do the full encode and ~1.1 ms to do a decode, wh256 still wins by a healthy margin. I'd be curious to see the 12-bit field. (fecal numbers, since I went and did things a bit more formally and went back and benchmarked that: ~0.2 ms to start streaming, ~0.5 ms to do a full encode, ~1 ms to decode in my low-loss bench, which is realistic for some reasonably high percentage of cases, but the ~6 ms to decode in my medium-loss bench makes the lack of consistency hard to swallow :( ).
If you allow for multithreading it can decode quite a lot faster. I am still working on gaining another 15% or so of performance by moving copies/XORs into the inner loops. Then MT after that's done, which could cut the decode time in half or more... 12-bit fields are going to be a little faster in that there would be less register pressure in the inner loops.
So Sunday and tonight I managed to implement a worker thread pool and got the work spread across the threads somewhat tuned. In practice it's not quite a 2x performance boost to add multi-threading:

Parameters: [original count=1000] [recovery count=1000] [buffer bytes=1344] [loss count=1000] [random seed=2] (multi-threading OFF)
Parameters: [original count=1000] [recovery count=1000] [buffer bytes=1344] [loss count=1000] [random seed=2] (multi-threading ON)

I'm not 100% sure that I'm spreading the work across the workers in the most optimal way, but it looks like about a 40% performance improvement for the case you care about. I don't think this is enough to make you prefer Leopard over Wirehair. Also, my thread pool only works on Windows.
Your CPU is 4 GHz in single-threaded mode and 2.8 GHz in multi-threaded mode.
Switched from my own thread pool to OpenMP and tuned more:

Leopard Encoder(1.344 MB in 1000 pieces, 1000 losses): Input=3162.35 MB/s, Output=3162.35 MB/s

Maybe this will be good enough?
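For reference, the OpenMP version of that kind of split can be as simple as a parallel-for over independent rows. This is a generic sketch of the approach rather than Leopard's actual internals; `xor_block` is just a plain helper. Compile with `-fopenmp`.

```cpp
#include <cstddef>
#include <cstdint>
#include <omp.h>

// Plain scalar helper: XOR a source block into a destination block.
static void xor_block(uint8_t* dst, const uint8_t* src, size_t bytes) {
    for (size_t i = 0; i < bytes; ++i)
        dst[i] ^= src[i];
}

// Rows are independent of one another, so a parallel-for over rows is the
// natural way to spread the bulk XOR work across cores.
void xor_rows_parallel(uint8_t** dst_rows, const uint8_t* const* src_rows,
                       int row_count, size_t block_bytes) {
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < row_count; ++r)
        xor_block(dst_rows[r], src_rows[r], block_bytes);
}
```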
New version of the codec is up |
Hey! Sorry, somehow I missed the first message and just got around to testing the new wirehair updates. Even with the OpenMP thread pool, the Leopard updates weren't a winner. They still did better than wirehair/wh256 on most/all-chunks-missing benchmarks, but on tests where most chunks were locally available before sending starts, they failed to quite reach wh256/wirehair speed (but...).

The wirehair updates did, indeed, provide a pretty healthy boost, though I went ahead and implemented the behavior of wh256 where it uses cm256 if the chunk count is below some threshold, which appeared to help my small-chunk-count reconstruction runs (recall, my use-case is two-pass: the first is around 5-10 chunks of 1152 bytes and never has any pre-knowledge; the second is usually around 800 chunks of 1152 bytes, but usually upwards of 780 chunks are available on the receiving end before any packets are even sent, as they can be calculated from the first pass). I did not, however, test backporting this behavior to Leopard, but it looked like Leopard lost to wirehair normally anyway, so I'd guess it won't save it.
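The dispatch described above is essentially a size check in front of the two codecs. A minimal sketch, with hypothetical wrapper functions and the ~32-symbol crossover mentioned earlier in the thread used as an assumed threshold (the real cutover point would need benchmarking):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical wrappers around the two codecs (not the real cm256/wirehair APIs).
bool encode_with_cm256(const uint8_t* data, size_t bytes, size_t chunk_bytes);
bool encode_with_wirehair(const uint8_t* data, size_t bytes, size_t chunk_bytes);

// Assumed crossover: cm256 (a Cauchy RS code) wins for small chunk counts,
// while wirehair's setup cost only pays off once the count grows.
constexpr size_t kSmallChunkThreshold = 32;

bool encode_block(const uint8_t* data, size_t bytes, size_t chunk_bytes) {
    const size_t chunk_count = (bytes + chunk_bytes - 1) / chunk_bytes;
    if (chunk_count <= kSmallChunkThreshold)
        return encode_with_cm256(data, bytes, chunk_bytes);
    return encode_with_wirehair(data, bytes, chunk_bytes);
}
```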
Cool, glad to know that helped.
Not sure if it's interesting to you, but I recently uploaded a new project here: tonk.io
Hmm, sadly the stuff I've got is all super-context-specific, so a more general non-FEC-only framework probably doesn't apply. Thanks for the update, though!
Some fun things callers need to be aware of:
In other news, for my application I hardcoded the chunk size, assumed 16/32-byte alignment (depending on where it's used) for XOR, replaced the different-compile-unit XOR code with simple AVX-/SSE-based intrinsics, and saw something like a 3x speedup :).
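For anyone curious what that kind of change looks like, here is a minimal AVX2-only sketch under the assumptions in the comment above (a compile-time-fixed chunk size and 32-byte alignment); the 1152-byte size is just the example from this thread, and the actual FIBRE change lives in the bitcoinfibre repo and differs in details (including an SSE path).

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Chunk size hardcoded at compile time (fixed-size UDP payloads are assumed;
// 1152 is an example and must be a multiple of 32 for this AVX2-only kernel).
constexpr size_t kChunkBytes = 1152;
static_assert(kChunkBytes % 32 == 0, "chunk must be a whole number of 32-byte lanes");

// XOR src into dst; both pointers are assumed 32-byte aligned. With the size a
// compile-time constant there is no tail handling and the loop unrolls cleanly.
inline void xor_chunk_avx2(uint8_t* dst, const uint8_t* src) {
    for (size_t i = 0; i < kChunkBytes; i += 32) {
        __m256i a = _mm256_load_si256(reinterpret_cast<const __m256i*>(dst + i));
        __m256i b = _mm256_load_si256(reinterpret_cast<const __m256i*>(src + i));
        _mm256_store_si256(reinterpret_cast<__m256i*>(dst + i), _mm256_xor_si256(a, b));
    }
}
```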