1BRC 0.145s default dataset, 0.281s 10K dataset, accept all valid inputs (C++, SIMD, hash, tricks) #138
Replies: 16 comments 78 replies
-
The challenge author should have posted a link to the dataset: now it's unclear whether all participants generated the same file.
-
You may get a better idea of the minimum time to read the file and do some trivial calculation on it by computing a checksum of the file, rather than by copying data. In problems similar to this one, I've found that the cost of writing data to some other buffer is pretty large.
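A minimal sketch of that suggestion (the file name is a placeholder, and this is my toy example, not code from the repo): mmap the input and fold every byte into a checksum, so all data is read and trivially processed but never copied anywhere.

```cpp
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("measurements.txt", O_RDONLY);  // placeholder file name
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) return 1;
    size_t n = (size_t)st.st_size;

    auto* p = (const unsigned char*)mmap(nullptr, n, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = sum * 31 + p[i];  // trivial rolling checksum, no copy to another buffer
    printf("%llu\n", (unsigned long long)sum);
    return 0;  // the runtime of this is a floor for any solution that touches every byte
}
```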
-
@lehuyduc Great! I assume it's faster than @dannyvankooten's #46, isn't it?
-
Why not skip munmap by calling _exit()?
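A sketch of what that suggestion amounts to (my illustration): print the result with `write()`, then `_exit(0)`. This skips explicit `munmap()` calls, `atexit` handlers, and static destructors; address-space teardown is left to the kernel's process-exit path.

```cpp
#include <cstring>
#include <unistd.h>

[[noreturn]] void print_and_quit(const char* result) {
    write(STDOUT_FILENO, result, strlen(result));  // flush the answer directly
    _exit(0);  // no munmap, no C++ cleanup, straight to kernel exit
}
```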
-
Added this to a stable comparison across languages: https://github.com/buybackoff/1brc?tab=readme-ov-file#native. Maybe I missed something faster, but so far it's the top result.
-
I would expect automatic detection. Otherwise every host will have to manually adjust the code.

On Tue, 9 Jan 2024, lehuyduc wrote:

> Thanks! Can you run again but with N_THREADS set to 12? N_THREADS = 128 in my code, but your test PC only has 12 threads
-
Correct me if I'm wrong, but this solution assumes the number of keys is no larger than around 17000, which is not really a valid assumption. Fixing it may not affect performance much, though.
-
Continuing from my blog (https://curiouscoding.nl/posts/1brc/). I ran it with hyperthreading disabled (so only 6 cores/threads available) and with hyperthreading enabled (so 12 threads over 6 cores). So you're still a bit slower than my ~1.55s on my machine. But note that I'm cheating (participating out-of-competition, let's say) by assuming that all city names appear in the first 100k chars, and that lines are at most 33 chars long.
-
Hi. Great code. A small idea for optimization: have you tried replacing -O3 with -Ofast, and perhaps playing with some other compiler flags? Also, replacing GCC with Clang might provide a measurable improvement.
-
Just uploaded the final version of my 1BRC (unless I find bugs). It's ~4x faster on the 10K keys dataset compared to before. @buybackoff has a blog that contains a lot of different results.
-
Hi, I found a low-overhead physical server with the same CPU (AMD EPYC 7502P) as the official test server. Everyone's results are faster there, and some are massively faster, by up to 78%!! I'm not asking to change the test server or anything, I just want to share the results, because it's interesting how benchmark results can be wildly different even on the same hardware when running on different cloud providers. Tagging relevant people @gunnarmorling @artsiomkorzun @thomaswue @royvanrijn Edit: it might be because the results on the main page are slightly outdated 😢
-
Did you attempt using SIMD to process the chunks in parallel within a thread? I'm able to get a ~50% speedup over this implementation with that strategy.
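To make the "multiple chunks per thread" pattern concrete, here's a hedged scalar sketch (my own toy example, not code from either repo): one thread walks two independent sub-chunks in the same loop, so work from both streams can overlap in the pipeline. A SIMD version would replace the per-byte operations with vector ones, one lane group per stream.

```cpp
#include <cstddef>
#include <cstdint>

// Sum a buffer using two independent accumulators over two halves of the
// range, exposing instruction-level parallelism within a single thread.
uint64_t sum_two_streams(const uint8_t* data, size_t n) {
    size_t half = n / 2;
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < half; i++) {
        a += data[i];         // stream 1
        b += data[half + i];  // stream 2: independent, can execute in parallel
    }
    for (size_t i = 2 * half; i < n; i++) a += data[i];  // odd tail byte
    return a + b;
}
```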
-
A couple of ideas that have improved performance for me:

- Fork child processes instead of using threads. Each process then only mmap()s a portion of the file, and the munmap() work can actually happen in parallel. Time spent should decrease linearly with cores.
- Populate the page cache for the measurements file before parsing (a sketch follows below). There should be a way to do this efficiently with flags and system calls, but I can't figure it out. Simply having each thread walk the file and touch every page before it starts parsing improves your code's performance by 3%. Front-loading the page faults stops the caches from being polluted with kernel code while parsing.
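A hedged sketch of that second idea (my illustration, not the commenter's code): ask the kernel to prefetch with `madvise(MADV_WILLNEED)`, then touch one byte per page so all page faults happen before the hot parsing loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

// Pre-fault a mapped region: hint the kernel, then read one byte per page
// so the faults land here instead of inside the parsing loop.
void prefault(const uint8_t* base, size_t len) {
    madvise(const_cast<uint8_t*>(base), len, MADV_WILLNEED);  // prefetch hint
    volatile uint8_t sink = 0;
    const size_t page = 4096;
    for (size_t off = 0; off < len; off += page)
        sink ^= base[off];  // one read per page => fault now, not later
    (void)sink;
}
```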
-
I notice I'm behind on recent developments in this discussion, but still, here's another solution that's similar:
-
@noahfalk I've tested your code on a server similar to the official test PC. At this point CPU and dataset differences can be a big factor, so I need your help to test on your dataset/PC too :D It's definitely my final final version, the
-
Hey @lehuyduc, hope this helps! These numbers are coming from the Hetzner CCX33 I have access to. My data sets were generated on this machine using the create_measurements.sh and create_measurements3.sh scripts. I just used the first one I got, and it's up to random chance whether it winds up being good or bad for either of our hash functions :) The results for my entry look pretty similar to what I posted in the README at my repo. Your entry looks improved, and I assume the difference is that this benchmark run is using a more recent version of your code.

Machine info

Lehuyduc benchmarking

I compiled your app 3 different times with different thread settings and tested each on my default and 10K data. Let me know if there is some other specific config you are looking for.

noahfalk entry
-
https://github.com/lehuyduc/1brc-simd

`main.cpp` and `main_small.cpp` follow all the requirements specified by the challenge (no preprocessing, has to work with all valid inputs with key length <= 100, no extra input assumptions, single file, no libraries, etc.). Tested on a file of size 13795495846 bytes, generated by `./create_measurements.sh 1000000000`. See my repo for links to both the default and 10K datasets. Tested on many PCs. No HugeTLB.

Dual EPYC 9354 32c64t = 64c128t total, 1TB DDR5 unknown speed
This setup is actually slower than Threadripper PRO 5995WX when <= 32 threads are used, likely because the 5995WX has a higher clock speed, and at <= 32 threads RAM bandwidth doesn't matter as much as it does at 128 threads. Ubuntu 20.04, virtual machine, g++ 9.4. All the detailed results are in the `benchmark_results` folder in my repo.

Bandwidth = 1.4585e+11 bytes/s, tested using `test_bandwidth.sh`. To ensure benchmarks are comparable, you must check not just the CPU but also the RAM bandwidth. The dataset used can also make a big difference.

Version: `1brc_valid23_small`
To get the best number possible, you will need luck. Run your code a bunch of times and pray for a good result (it's just like playing a slot machine 😆). `hyperfine` will make the code slower, so just run manually. The `_small` suffix means it will still work with all inputs, but it is slower on the 10K dataset (1.5-2x slower than `1brc_valid23`). This is because the hash table size is only 16384, which is the smallest size > the number of unique keys.
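For readers following along, here is a minimal sketch (my illustration, not the actual table from the repo) of why a 16384-slot table works: with a power-of-two size the slot index is a simple mask, and 16384 just exceeds the ~10K unique keys. A real version must also compare key bytes, since distinct keys can share a hash.

```cpp
#include <cstdint>

constexpr uint32_t TABLE_SIZE = 16384;  // power of two just above the unique key count
constexpr uint32_t MASK = TABLE_SIZE - 1;

struct Slot {
    uint64_t hash = 0;  // 0 marks an empty slot in this toy version
    int64_t  sum = 0;
    uint32_t count = 0;
    int16_t  min = 32767, max = -32768;
};

Slot table[TABLE_SIZE];

// Linear probing; a real solution must also verify the key bytes on a
// hash match instead of trusting the 64-bit hash alone.
Slot* find_slot(uint64_t hash) {
    uint32_t idx = hash & MASK;  // cheap "mod TABLE_SIZE"
    while (table[idx].hash != 0 && table[idx].hash != hash)
        idx = (idx + 1) & MASK;  // probe the next slot on collision
    table[idx].hash = hash;
    return &table[idx];
}
```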
Version: `1brc_valid23`, default dataset. In this test, Dual EPYC 9354 has the same speed as Threadripper 5995WX despite having 2.25x the bandwidth, likely because all the hash table elements fit in cache.

Version: `1brc_valid23`, 10K dataset. This version shows a noticeable improvement (0.319s -> 0.281s).

At 128 threads, the program finishes the challenge faster than the OS can `munmap` the input file. `munmap` alone takes up over 50% of the time, so it costs more than the program's allocation, initialization, processing, output, and freeing of resources all combined. So I use the subprocess trick used by the top Java solutions, basically bypassing `munmap` and memory deallocation (a sketch of the trick follows below). It's possible to do this the correct way by either using `fork()` instead of threads, or by letting each thread process a different input size and `munmap` at a different time.
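For anyone wanting to reproduce the trick, here is a hedged reconstruction (my sketch, not the exact code from this repo or the Java solutions): the forked child does all the work and ships the final report to the parent over a pipe, closing its write end before exiting; the parent prints the report and exits immediately, so the child's address-space teardown (the expensive implicit `munmap`) happens after the timed command has already returned.

```cpp
#include <cstring>
#include <unistd.h>

int main() {
    int fds[2];
    if (pipe(fds) != 0) return 1;

    if (fork() == 0) {  // child: mmap, parse, aggregate... (elided)
        const char* report = "{...final results...}\n";  // placeholder output
        write(fds[1], report, strlen(report));
        close(fds[1]);  // EOF for the parent *before* our teardown starts
        _exit(0);       // implicit munmap happens here, off the parent's clock
    }

    close(fds[1]);  // parent: drop our copy of the write end, then relay
    char buf[1 << 16];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        write(STDOUT_FILENO, buf, n);
    return 0;  // no waitpid: the parent exits while the child is still cleaning up
}
```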
Running under `hyperfine` is slower than running the command manually, as the previous `munmap` hasn't finished.

With this many threads, aggregating the results from all threads takes a noticeable amount of time, so we use parallel processing to speed that up too, saving ~28ms at 128 threads. By using 2 layers of parallel aggregation, we save an extra ~2.5ms lol.
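A minimal sketch of what parallel aggregation can look like (illustrative only; `Table` here is a plain `std::map` stand-in for the real hash table): a tree reduction that halves the number of live tables each round, which generalizes the 2-layer merging described above.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <thread>
#include <vector>

struct Stats { long sum = 0; long cnt = 0; int mn = 1000, mx = -1000; };
using Table = std::map<std::string, Stats>;

// Merge src into dst: the same operation as the serial aggregation step.
void merge_from(Table& dst, const Table& src) {
    for (const auto& [city, s] : src) {
        Stats& d = dst[city];
        d.sum += s.sum; d.cnt += s.cnt;
        d.mn = std::min(d.mn, s.mn); d.mx = std::max(d.mx, s.mx);
    }
}

// Tree reduction: round k merges tables 2^k apart, all pairs in parallel.
void parallel_merge(std::vector<Table>& tables) {
    for (size_t step = 1; step < tables.size(); step *= 2) {
        std::vector<std::thread> workers;
        for (size_t i = 0; i + step < tables.size(); i += 2 * step)
            workers.emplace_back([&tables, i, step] {
                merge_from(tables[i], tables[i + step]);
            });
        for (auto& w : workers) w.join();  // barrier; next round merges the winners
    }
    // tables[0] now holds the global result.
}
```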
Results for the Java versions are currently outdated, so I've removed them. In my repo you can see the old results in `benchmark_results/old_post2.txt`, and the files `other_artsi_7502p`, `other_thomas_7502p`, `other_royvanrijn_7502p`, `valid20_7502p`, `valid20_7502p_10k`.

Other submissions
There's a super fast non-compliant solution at: https://curiouscoding.nl/posts/1brc/
It doesn't work with all inputs as required, but it is extremely fast and has very creative ideas.

There's a super fast compliant version using .NET at: https://github.com/noahfalk/1brc/tree/main
I tested it on Dual EPYC 9354 (the 5995WX PC was not available at the time). This code doesn't use `mmap`, so its `hyperfine` result is similar to running it manually. See files: `other_noahfalk_9354`, `other_noahfalk_9354_10k`, `valid23_9354.txt`, `valid23_9354_10k.txt`.

Below is a lot of raw data, so here's the summary (unit: second):

Original dataset

10K dataset

@noahfalk's solution uses an extreme amount of manual SIMD and ILP tricks, so it performs much better at lower thread counts. I'm not sure yet why it performs worse at higher thread counts. On the 10K dataset, for example, it plateaus completely around 32 threads.
Main ideas:

- Find the positions of `;` and loop through them
- If the key length is <= 16, use a compiler hint + a SIMD implementation for this specific case. If length > 16, use a fallback => still meets the requirement of `MAX_KEY_LENGTH = 100`
- `-99.9 <= temperature <= 99.9` is guaranteed, so use special code exploiting this property (see the sketch after this list)
- `mmap` once, then run the program again. Without this rule, all solutions would be completely different.
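To make the temperature bullet concrete, here is a hedged scalar sketch (the real code uses SIMD; this is only my illustration, assuming the 1BRC format of exactly one fractional digit): with `-99.9 <= t <= 99.9`, a value has one of four shapes (`X.X`, `XX.X`, `-X.X`, `-XX.X`), so it can be parsed into integer tenths with no loops and no floating point.

```cpp
// Parse a temperature at *p into tenths (e.g. "-12.3" -> -123).
// Advances p past the number.
int parse_temp_tenths(const char*& p) {
    int sign = 1;
    if (*p == '-') { sign = -1; p++; }
    int v = *p++ - '0';                        // first digit
    if (*p != '.') v = v * 10 + (*p++ - '0');  // optional second digit
    p++;                                       // skip '.'
    v = v * 10 + (*p++ - '0');                 // the single fractional digit
    return sign * v;
}
```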
Others

There's a potential out-of-range-access exploit in the code. It's left as an exercise for the reader.

I optimize the code for hyperthreading. For example, with a 16c32t CPU, if a change improves performance when running 32 threads but slightly decreases performance at 16, I will keep that change. Disabling HT will increase performance at the same thread count (maybe even by 20-25%), so there are some situations where it's the correct choice.