1BRC in Elixir #93
-
I was planning to do it in Erlang just for the fun of it; I'll keep you posted on my progress. :)
-
@IceDragon200 Did you get your 1brc challenge completed? I managed it in 141 seconds on an Apple M1 Pro with OTP 26 using plain Erlang. I'll try to push a solution later tonight after some cleaning up :-) I know that I still have a bottleneck at a specific place; I just don't know how to fix it just yet 😅
-
I'm trying something with Flow, as this is almost the same scenario as in their docs.

```elixir
defmodule VanillaFlow do
  def run() do
    filename = "./data/measurements.txt"
    parent = self()

    File.stream!(filename, :line)
    |> Flow.from_enumerable()
    |> Flow.map(fn line ->
      [ws, temp] = :binary.split(line, ";")
      temp = :binary.split(temp, "\n") |> List.first() |> :erlang.binary_to_float()
      [ws, temp]
    end)
    |> Flow.partition()
    |> Flow.reduce(fn -> :ets.new(:words, []) end, fn [ws, temp], ets ->
      case :ets.lookup(ets, ws) do
        [] ->
          :ets.insert(ets, {ws, {temp, temp, temp}})

        [{_, {current_min, current_mean, current_max}}] ->
          :ets.insert(
            ets,
            {ws, {min(current_min, temp), (current_mean + temp) / 2, max(current_max, temp)}}
          )
      end

      ets
    end)
    |> Flow.on_trigger(fn ets ->
      :ets.give_away(ets, parent, [])
      # Emit the ETS table
      {[ets], :new_reduce_state_which_wont_be_used}
    end)
    |> Enum.to_list()

    # then tab2list...
  end
end
```
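One thing worth noting about the reducer above: the `(current_mean + temp) / 2` update computes a running midpoint rather than the true mean. A hypothetical variant of the reducer callback (not part of the original post) that tracks `{min, sum, count, max}` instead would give the exact mean as `sum / count` at print time:

```elixir
# Illustrative replacement for the Flow.reduce callback above: keep a running
# sum and count so the true mean can be computed at the end.
fn [ws, temp], ets ->
  case :ets.lookup(ets, ws) do
    [] ->
      :ets.insert(ets, {ws, {temp, temp, 1, temp}})

    [{_, {mn, sum, n, mx}}] ->
      :ets.insert(ets, {ws, {min(mn, temp), sum + temp, n + 1, max(mx, temp)}})
  end

  ets
end
```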
-
Found an old post from 2007 about file processing in Erlang. They were also referencing some 1 million (not billion) rows problem, which seemed to take 6-7 seconds on their hardware. How far we've come :)
-
Pure Erlang solution which runs in 130s on the full 1B input: https://github.com/jesperes/erlang_1brc/blob/main/src/aggregate.erl. Almost all the time is spent inside
-
We should be able to go faster if we could read the file concurrently. That's what the Java implementations are doing, after all. Does anyone know if we could use Erlang's pread/2 to achieve that?
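A minimal, hypothetical sketch of that idea (everything here is illustrative, not an implementation from this thread): each worker opens its own file descriptor and reads its own slice with `:file.pread/3`, so no handle is shared between processes. Stitching the split lines at chunk boundaries back together is deliberately left out:

```elixir
defmodule PreadSketch do
  # Read a file in roughly equal slices, one task per scheduler. Each task
  # opens the file itself, so the raw file handle is never shared.
  def read_chunks(path, workers \\ System.schedulers_online()) do
    %File.Stat{size: size} = File.stat!(path)
    chunk = div(size, workers) + 1

    0..(workers - 1)
    |> Task.async_stream(
      fn i ->
        {:ok, fd} = :file.open(path, [:read, :raw, :binary])

        data =
          case :file.pread(fd, i * chunk, chunk) do
            {:ok, bin} -> bin
            :eof -> <<>>
          end

        :ok = :file.close(fd)
        # NOTE: a real solution must re-attach the partial line at each
        # chunk boundary to its neighbour before parsing.
        data
      end,
      timeout: :infinity
    )
    |> Enum.map(fn {:ok, bin} -> bin end)
  end
end
```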
-
I use pread in my approach and chop off the last bit of the binary that doesn't form a full line, appending it to the next chunk. This allows me to decouple reading and processing and do both in parallel. "Unfortunately" I went on vacation before I could clean it up and push my solution, so it'll have to wait a week. In essence though, the flow is as follows: Most time was spent in binary_to_float/1, so I'm curious how much improvement @jesperes' float parsing brings to my approach. Will try that next week when I'm back at home 👍
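For illustration, here is one hedged way that chop-and-carry step could look (a sketch, not the actual not-yet-pushed code): split the concatenation of the previous leftover and the new chunk at the last newline, hand the whole lines to a worker, and carry the tail forward:

```elixir
defmodule ChunkSplit do
  # Returns {whole_lines, leftover}: `whole_lines` ends on a newline and can
  # be parsed independently; `leftover` is prepended to the next chunk.
  def split(chunk, leftover) do
    data = leftover <> chunk

    case :binary.matches(data, "\n") do
      [] ->
        {<<>>, data}

      matches ->
        {pos, 1} = List.last(matches)
        len = pos + 1
        <<lines::binary-size(len), rest::binary>> = data
        {lines, rest}
    end
  end
end
```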
-
Well, finally finished my first run: 17 minutes. I think I can do better even with my potato hardware; I'll have new results in... 20 minutes. Good news: down to 8 minutes with https://github.com/IceDragon200/1brc_ex/blob/master/src/1brc.workers.blob.maps.chunk_to_worker.exs on my hardware.
-
For chunks C1, C2, C3, ... Cn, I split C2 once at
-
One problem with splitting first is the intermediate list; what you want is something like input.split('\n').fold(...), so that you can fold over the binaries without having to construct a (huge) list first.
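In the absence of such an API, a hand-rolled fold over `:binary.split/2` gets the same effect (a sketch: one line at a time, no intermediate list):

```elixir
defmodule LineFold do
  # Fold `fun` over each newline-separated line of `bin` without building a
  # list of lines first; each `line` is a sub-binary of the original.
  def fold(<<>>, acc, _fun), do: acc

  def fold(bin, acc, fun) do
    case :binary.split(bin, "\n") do
      [line, rest] -> fold(rest, fun.(line, acc), fun)
      [line] -> fun.(line, acc)
    end
  end
end

# Example: count lines without an intermediate list.
# LineFold.fold("a;1.0\nb;2.0", 0, fn _line, n -> n + 1 end) #=> 2
```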
-
Even though the challenge is about using "just Java" (or Elixir/Erlang in this case), this Explorer solution seemed quite interesting to me.

```elixir
defmodule WithExplorer do
  # Results:
  #   1_000_000_000: 675483.000ms
  #     500_000_000:  58244.713ms
  #     100_000_000:  10321.046ms
  #      50_000_000:   5104.949ms
  require Explorer.DataFrame
  alias Explorer.{DataFrame, Series}

  @filename "./data/measurements.txt"

  def run() do
    results =
      @filename
      |> DataFrame.from_csv!(header: false, delimiter: ";", eol_delimiter: "\n")
      |> DataFrame.group_by("column_1")
      |> DataFrame.summarise(
        min: Series.min(column_2),
        mean: Series.mean(column_2),
        max: Series.max(column_2)
      )
      |> DataFrame.arrange(column_1)

    # This could be improved... it would be nice to do it inside Explorer.
    for idx <- 0..(results["column_1"] |> Series.to_list() |> length() |> Kernel.-(1)) do
      "#{results["column_1"][idx]}=#{results["min"][idx]}/#{:erlang.float_to_binary(results["mean"][idx], decimals: 2)}/#{results["max"][idx]}"
    end
  end
end
```

The eager backend scaled linearly up to about 7GB (500 million lines) using all my cores. Something happens with larger files that makes my machine do a lot of disk IO, and the 1 billion version performs ~6x slower.
-
Another Elixir implementation, which is my first program in this language. It takes about 12 minutes for 1B rows on an ultra-low-voltage processor that is 7 years old.
-
Another Erlang implementation: https://github.com/Kartstig/1brc_erl. Takes ~80s on my Threadripper 1950X. However, I haven't utilized
-
@jesperes Could you provide a bash script to run your code? I'm putting together a table with the current times using my hardware for comparison; I just have no idea how to run yours.
-
Okay, everyone: I've tried gathering some concrete numbers from my hardware (and almost killed it in the process) by running everyone's implementation. I think we can all appreciate a single piece of hardware running everything and giving somewhat consistent numbers, yes? Check the primary comment for your implementation, some times, and my comments about it. For everyone who's new: you can get your implementation listed by simply commenting with a link, and I'll have it up, usually within an hour of seeing it.
-
@rparcus Broke one of your new implementations:
-
Everyone, the Ryzen numbers are up... and while I want to chalk it up to biased hardware being biased, look at the numbers yourself (gosh, I look good up there). The most surprising thing is that @rparcus was able to claim the 50M spot, but his implementation seems to have fallen apart at the 1B test. I have no idea why; considering the Intel test smoked everyone else, I'd like to believe it has something to do with Intel vs AMD optimizations in the NIFs themselves. Other implementations did, as expected with faster hardware, the same thing, just faster.
-
@garazdawi Results are up. Considering your low CPU and low memory footprint, I really wish I could have run jesperes' original version as well for comparison; that is some amazing performance, and it makes me think I could squeeze some more out of my own implementation right now. Bonus points: the 50M completed in under 6s on my Intel CPU when I was testing the test scripts; the only other implementation that did that was @rparcus' Explorer.
-
Updated Ryzen times for @IceDragon200, @garazdawi, and @jesperes. Also rerunning the entire suite to see if the figures change drastically; I think garazdawi's code wasn't JIT-ed at the time of testing, hence the slower time.
-
Even if :prim_file is not thread-safe, one can use a semaphore to regulate access and let each worker read its own chunk with :prim_file. A worker process could:
I thought it would be faster because we don't need to pass partial chunks of binaries around between processes, but it turns out that it's about 30% slower than something similar to what @IceDragon200's implementation does, where one process reads and all the other workers parse. My lesson learned here is:
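As a hypothetical illustration of the semaphore idea (names and message shapes are invented here, not taken from the comment above): a single owner process holds the `:prim_file` handle and serializes reads, while the workers still parse their chunks in parallel:

```elixir
defmodule FileLock do
  # Owner process: the only one that ever touches the :prim_file handle.
  def start(path) do
    {:ok, fd} = :prim_file.open(path, [:read, :raw, :binary])
    spawn_link(fn -> loop(fd) end)
  end

  defp loop(fd) do
    receive do
      {:pread, offset, size, from} ->
        send(from, {:chunk, :prim_file.pread(fd, offset, size)})
        loop(fd)
    end
  end

  # Called from worker processes; reads are serialized by the owner's mailbox.
  def pread(lock, offset, size) do
    send(lock, {:pread, offset, size, self()})

    receive do
      {:chunk, result} -> result
    end
  end
end
```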
-
Managed to tidy up my implementation a bit, push it (1brc), and create a PR to add my implementation @IceDragon200 👍 See: IceDragon200/1brc_erl_ex_test#1. It's not the cleanest code I ever wrote, but I don't believe that was part of the challenge here 😅 I'm using @jesperes' float parsing, but otherwise the structure is fairly similar to what the others have tried. Interestingly enough, my approach beats @garazdawi's (by a hair 😅) by using plain
Here are the results from my Mac M1 Pro. Versions used for testing:
50M Results:
1B Results:
-
I have pushed a PR with two optimizations:
-
@IceDragon200 just merged my latest version. See: IceDragon200/1brc_erl_ex_test#4. Results on my Mac M1 Pro compared to the fastest previous solution from jesperes: a 17.2% improvement on the 1B results. For some reason it's slower on the 50M results, something I'm looking into tonight 👍 50M results:
1B results:
Very excited to see how this'll do on Ryzen, and whether we can get an "official" sub-10-second time on the board 😅
-
Upon request, I have run the 10k weather station test. Results are under the default 420 (heh) weather station tests. Since this test takes much longer to run, I won't be updating it frequently.
-
Here are the 1B results on a 224-core Intel Xeon:
The machine is behind an HTTP proxy that denied access to some packages that rparcus uses, which is why those did not work. This was run using Erlang 26.2.1 and the latest main Elixir.
-
Updated my solution now: replaced a shoddy home-grown hash function with FNV32 (thanks @onno-vos-dev) and stole some spawn options from him too. :)
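For reference, the textbook 32-bit FNV-1a looks like this in Elixir (standard offset basis and prime; this is the generic algorithm, not code from jesperes' repo):

```elixir
defmodule Fnv32a do
  import Bitwise

  @offset_basis 2_166_136_261
  @prime 16_777_619
  @mask 0xFFFFFFFF

  # Hash a binary to a 32-bit integer: xor in each byte, multiply by the
  # FNV prime, and truncate to 32 bits.
  def hash(bin), do: do_hash(bin, @offset_basis)

  defp do_hash(<<>>, acc), do: acc

  defp do_hash(<<byte, rest::binary>>, acc) do
    do_hash(rest, bxor(acc, byte) * @prime &&& @mask)
  end
end

# Fnv32a.hash("Hamburg") returns an integer usable as a map or ETS key.
```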
-
In case anyone is interested, feel free to rip out my (half-baked) PropEr test (PR: onno-vos-dev/1brc#4) and throw it at your solution 😄 It found some bugs for me, and some really interesting ones... Apparently the FNV32a hash for
-
After a huge amount of fighting to come up with a variant of the FNV32 hash that was both fast and produced unique results, I finally managed! 🎉 See details in onno-vos-dev/1brc#5, which includes a list of 128k unique cities (those with a population over 1000 inhabitants) and a unit test to ensure uniqueness across all of them. With a runtime of
-
@mneumann Results updated with mneumann/1brc-elixir@efb7aae
-
Hi everyone, new idea on top of what already worked well: It does pretty well on my machine for the 420 test, and should be close to the top solutions without using dependencies. Didn't test for 10k...
-
Implementations
https://github.com/IceDragon200/1brc_erl_ex_test
AMD Ryzen 5 2600 / 32Gb DDR4 IceDragon200
Standard Data Set (420 weather stations)
50 Million
1 Billion
10k City Data Set
50 Million
1 Billion
RETIRED: Intel i7-2710QE @ 2.10Ghz / 16Gb DDR3 IceDragon200
This has been retired. If your implementation is not listed here, it's because I've stopped testing on my poor laptop; check the Ryzen section for your updated figure.
Dirty System - While under normal usage
Clean System - Nothing else running with it
AMD Threadripper 1950X @ ?Ghz / 94.2Gb DDR4 Kartstig
Wait a minute, how did you get these numbers?
tl;dr: an i7-2710QE capped at 2.10Ghz and 16Gb of DDR3 RAM, of which only 4Gb is readily available for benching, on a Lenovo E520. The hardware is 10+ years old.
All implementations were run on a dirty system (that is, one actively doing other work; mostly just Firefox running so I can check this discussion). I'm currently listening to music, which should be a low-CPU and low-disk job.
Hardware is a 10+ year old laptop CPU, more specifically an i7-2710QE running at a max clock of 2.10Ghz; turbo is disabled to avoid overheating, since the Lenovo E520 it's currently sitting in was not designed for it.
They are all run with:
time your_main_entry_point
I always run the 50M test first, so your code will at least compile during that window. It will be run twice, since all currently listed implementations can finish in under a minute.
OOM?
Out-of-memory. I have 16Gb of memory, but only 4~9Gb is usually free, so all of these implementations needed to run under 4Gb at worst. If your implementation OOMs, I will mark it as such, no further comment.
Okay, so you sabotaged it for your own implementation!
No, I am aware of my system's constraints and appropriately reduced the amount of work it had to do in order to stay within those constraints, relying on the disk and hammering the CPU instead of the memory. All implementations are run with the same constraints, and I try not to use the computer while it's benching, outside of listening to music to pass the time (come on, 18 minutes is a lot of nothing to do, you know).
How do I get my implementation listed?
Just comment in this discussion with a link to it. I'll check it and try to run it; just be aware of what I'm benchmarking it on.
Okay, I have an implementation!
Great! If it doesn't process 50M rows in under 2 minutes, I'm not doing the 1B row test; I'm not wasting time or compute on that (and to be fair, I treat all of my implementations the same).
Okay, I made changes, can I get a re-test?
Sure, just comment that you did and I'll re-test it.
Related or Good reads
Current Status
After much trial and error and gathering knowledge from everyone's methods, I finally have a fast-enough implementation (there may be more room for improvement):
https://github.com/IceDragon200/1brc_ex/blob/master/src/1brc.workers.blob.maps.chunk_to_worker.exs
This completes in under 7 minutes on an i7-2710QE @ 2.10Ghz (turbo was disabled to maintain a safe operating temp; this laptop was not designed for the CPU, after all).
Originally the code was idiomatic Elixir, but there are a few things worth noting:
- `File` vs `:prim_file` - if you run the original code under `:eprof`, one notices the countless `file`-related modules being called through standard `Elixir.File` or Erlang's `:file`. While I'm not entirely sure why it has so many processes involved, it likely has to do with concurrent access. The short solution is to cut all of it out and drop down to `:prim_file`; it is likely not concurrency-safe, but chances are you're already chewing through the file on a single process as fast as you can. This removes around 2 to 3 layers of indirection and sends.
- `:binary.split/2` - is the major bottleneck; it's unclear if there is a way to make it faster at this point.
- `Float.parse/1` vs `:erlang.binary_to_float/1` vs just doing it yourself - `Float.parse` was much slower than expected, until one looks into its implementation: since `Float.parse` needs to handle Elixir's niceties, it introduces additional parsing overhead. `:erlang.binary_to_float/1` is much faster if you don't need all the nice Elixir things (e.g. underscores). A do-it-yourself sketch follows this list.
- `IO.read_line/1` is a terrible idea - it is wasteful once one understands how line reading works internally. At the very bottom of it all is `:prim_file`: a read_line must read a chunk into a prim_buffer and scan each character to look for the newline before the result is returned. This would be fine if you didn't have 3+ function calls between each line read, and the same scanning happening over and over and over again.
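As a sketch of the do-it-yourself route mentioned above (assuming, as in the 1BRC spec, that every temperature has exactly one decimal digit and that the trailing newline has already been stripped), the value can be parsed as an integer in tenths of a degree, skipping floats entirely until print time:

```elixir
defmodule TempParse do
  # "4.5" -> 45, "12.3" -> 123, "-12.3" -> -123 (tenths of a degree).
  def parse(<<?-, rest::binary>>), do: -parse(rest)
  def parse(<<a, ?., b>>), do: (a - ?0) * 10 + (b - ?0)
  def parse(<<a, b, ?., c>>), do: (a - ?0) * 100 + (b - ?0) * 10 + (c - ?0)
end
```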
No longer relevant original post
Not exactly the fastest; Erlang/Elixir has trouble with heavy IO applications in general.
I only tested it against 50 million rows vs the 1 billion; the former completes in around a minute (specifically the `1brc.workers.stream.exs`; the others are 2x to 8x slower). Since it pretty much scales linearly from what I've seen, I estimate it would take 20 minutes to complete on my hardware (i.e. the Intel Core i7-2710QE). As of this writing my poor laptop's fan is screaming at me while I run the 1 billion line test.
I tried four different implementations; anything with the `reduce` suffix is 4x slower than its `stream` counterpart. The fastest implementation I have so far is `1brc.workers.stream.exs`; on my system it uses 32 workers/Erlang processes (that is, 8 logical processors x 4). By taking batches of 100,000 rows at a time for each worker, it keeps them quite busy; anything less and they end up idling. Each implementation is mostly self-contained, outside of the output module, which is shared across all four for writing the final result to the console/STDIO.
All reactions