
Rewrite Data.IntMap to be faster and use less memory #340

Open
wants to merge 147 commits into master from direct-bounded-intmap

Conversation


@gereeter gereeter commented Sep 13, 2016

I still sort of think of this as a work in progress, but I've been sitting on this code for over 2 years now, and all the tests pass, with significant improvements on many benchmarks. The general approach is detailed at gereeter/bounded-intmap (https://github.com/gereeter/bounded-intmap) - the README is a bit outdated and I really should incorporate it into comments, but it is the best reference for how the structure works. As for compatibility, I think I'm missing some instances, some Safe Haskell stuff, and anything required to get things working on compilers other than the latest GHC.

If this is a large enough change, I wouldn't mind sending mail to the libraries mailing list, but it should be entirely backwards compatible.

cc @ekmett - This is a (heavily modified) version of an idea you came up with.

Memory usage

The old implementation had 6n-5 words of overhead (i.e. discounting storage of the keys and the pointers to values). The new implementation has only 3n-1 words of overhead.
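
For concreteness, here is one way to account for those numbers (my annotation, not code from either version: the first declaration is the current Data.IntMap shape, the second is the new Node type quoted in the review discussion below; the counts assume one header word per heap object):

data IntMap a = Bin !Prefix !Mask !(IntMap a) !(IntMap a)  -- 5 overhead words: header, prefix, mask, 2 child pointers
              | Tip !Key a                                 -- 1 overhead word: header
              | Nil

data Node t a = Bin !Key a !(Node L a) !(Node R a)  -- 3 overhead words: header, 2 child pointers
              | Tip                                 -- nullary: a single shared object, never allocated per key

-- Old: n keys means n Tips and n-1 Bins, so n + 5(n-1) = 6n-5 words.
-- New: n-1 Bins at 3 words each, plus a 2-word top-level wrapper (header
-- and node pointer for the constructor holding the minimum): 3(n-1) + 2 = 3n-1.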

Large runtime regressions

  • fromAscList and fromDistinctAscList are currently just aliases for fromList
  • I have no clue about foldlWithKey'
  • keysSet, fromSet, restrictKeys, and withoutKeys are currently implemented naively and are probably much slower

Benchmarks

Benchmark                     Runtime change  Original runtime
lookup                         -19.25%        4.38e-04
insert                         -22.57%        9.17e-04
"insertWith empty"             -10.20%        8.85e-04
"insertWith update"             -7.32%        1.96e-03
"insertWith' empty"             -9.92%        8.84e-04
"insertWith' update"           -11.00%        1.67e-03
"insertWithKey empty"          -10.23%        8.76e-04
"insertWithKey update"          +6.64%        1.96e-03
"insertWithKey' empty"         -10.47%        8.84e-04
"insertWithKey' update"        -11.29%        1.68e-03
"insertLookupWithKey empty"    -46.46%        1.74e-03
"insertLookupWithKey update"   -41.50%        3.94e-03
map                            -17.13%        1.81e-04
mapWithKey                     -24.97%        2.19e-04
foldlWithKey                   -19.19%        1.50e-03
foldlWithKey'                 +112.26%        4.29e-05
foldrWithKey                   -79.82%        1.02e-07
delete                          +9.36%        3.42e-04
update                         -12.59%        1.50e-03
updateLookupWithKey             +1.39%        2.44e-03
alter                           +5.04%        1.53e-03
mapMaybe                       -15.58%        2.25e-04
mapMaybeWithKey                -15.88%        2.24e-04
fromList                       -15.42%        8.96e-04
fromAscList                    -81.50%        6.83e-04
fromDistinctAscList            -43.34%        2.24e-04

Merges

Benchmark               Runtime change  Original runtime
union-disj_nn             +8.53%        2.08e-07
union-disj_nn             +9.74%        2.09e-07
union-disj_ns            +11.66%        1.85e-07
union-disj_ns             +7.15%        1.85e-07
union-disj_sn             +7.00%        1.94e-07
union-disj_sn             +5.67%        1.94e-07
union-disj_nt             +7.22%        1.55e-07
union-disj_nt             +5.51%        1.55e-07
union-disj_tn             +5.88%        1.61e-07
union-disj_tn             +5.91%        1.61e-07
union-common_nn          -15.58%        1.12e-02
union-common_nn          -15.30%        1.10e-02
union-common_ns          -11.65%        4.69e-03
union-common_ns          -11.57%        4.67e-03
union-common_sn          -10.49%        4.67e-03
union-common_sn          -10.23%        4.67e-03
union-common_nt          -26.05%        9.11e-05
union-common_nt          -28.51%        9.43e-05
union-common_tn          -21.01%        8.89e-05
union-common_tn          -19.67%        8.80e-05
union-mix_nn             +12.20%        2.06e-02
union-mix_nn             +11.70%        2.06e-02
union-mix_ns              +5.05%        5.10e-03
union-mix_ns              +5.84%        5.10e-03
union-mix_sn              +6.15%        5.11e-03
union-mix_sn              +7.16%        5.08e-03
union-mix_nt              -4.19%        8.05e-05
union-mix_nt              -4.29%        8.03e-05
union-mix_tn              -6.17%        8.58e-05
union-mix_tn              -6.81%        8.61e-05
union-block_nn           +38.23%        1.03e-04
union-block_nn           +39.21%        1.03e-04
union-block_ns           +30.53%        6.79e-06
union-block_ns           +31.57%        6.78e-06
union-block_sn           +29.41%        6.92e-06
union-block_sn           +30.39%        6.88e-06
difference-disj_nn       -78.96%        1.72e-07
difference-disj_nn       -78.95%        1.72e-07
difference-disj_ns       -73.77%        1.60e-07
difference-disj_ns       -77.32%        1.60e-07
difference-disj_sn       -82.64%        1.91e-07
difference-disj_sn       -82.02%        1.91e-07
difference-disj_nt       -74.05%        1.40e-07
difference-disj_nt       -74.09%        1.40e-07
difference-disj_tn       -77.35%        1.47e-07
difference-disj_tn       -77.28%        1.47e-07
difference-common_nn      +4.10%        7.50e-03
difference-common_nn      +4.25%        7.50e-03
difference-common_ns      -0.97%        4.27e-03
difference-common_ns      -1.31%        4.28e-03
difference-common_sn     -30.13%        8.64e-04
difference-common_sn     -34.10%        8.55e-04
difference-common_nt     -20.80%        1.00e-04
difference-common_nt     -18.82%        1.01e-04
difference-common_tn     -37.75%        5.91e-05
difference-common_tn     -32.56%        5.92e-05
difference-mix_nn        -18.42%        1.78e-02
difference-mix_nn        -18.95%        1.78e-02
difference-mix_ns        -21.13%        4.36e-03
difference-mix_ns        -21.18%        4.35e-03
difference-mix_sn        +19.57%        1.49e-03
difference-mix_sn        +19.49%        1.49e-03
difference-mix_nt        -29.75%        8.57e-05
difference-mix_nt        -29.13%        8.52e-05
difference-mix_tn        -27.97%        5.37e-05
difference-mix_tn        -29.38%        5.45e-05
difference-block_nn      -69.85%        7.37e-05
difference-block_nn      -70.66%        7.57e-05
difference-block_ns      -64.67%        6.55e-06
difference-block_ns      -64.53%        6.54e-06
difference-block_sn      -40.12%        5.43e-06
difference-block_sn      -40.19%        5.44e-06
intersection-disj_nn     -70.09%        1.24e-07
intersection-disj_nn     -69.98%        1.24e-07
intersection-disj_ns     -67.57%        1.12e-07
intersection-disj_ns     -66.18%        1.12e-07
intersection-disj_sn     -66.65%        1.12e-07
intersection-disj_sn     -66.92%        1.12e-07
intersection-disj_nt     -60.91%        9.38e-08
intersection-disj_nt     -61.18%        9.38e-08
intersection-disj_tn     -60.17%        9.28e-08
intersection-disj_tn     -60.32%        9.27e-08
intersection-common_nn    +6.06%        7.08e-03
intersection-common_nn    +5.82%        7.07e-03
intersection-common_ns   -25.00%        1.04e-03
intersection-common_ns   -25.84%        1.05e-03
intersection-common_sn   -20.87%        1.07e-03
intersection-common_sn   -15.66%        1.05e-03
intersection-common_nt   -31.67%        5.48e-05
intersection-common_nt   -31.14%        5.45e-05
intersection-common_tn   -28.09%        5.97e-05
intersection-common_tn   -28.55%        5.98e-05
intersection-mix_nn       +0.71%        3.01e-03
intersection-mix_nn       -6.14%        3.14e-03
intersection-mix_ns      -29.39%        6.40e-04
intersection-mix_ns      -30.65%        6.39e-04
intersection-mix_sn      -23.04%        6.64e-04
intersection-mix_sn      -22.10%        6.58e-04
intersection-mix_nt      -17.51%        3.90e-05
intersection-mix_nt      -19.35%        3.91e-05
intersection-mix_tn      -21.05%        4.63e-05
intersection-mix_tn      -17.28%        4.59e-05
intersection-block_nn    -62.01%        4.88e-05
intersection-block_nn    -62.37%        4.88e-05
intersection-block_ns    -45.10%        3.52e-06
intersection-block_ns    -45.39%        3.54e-06
intersection-block_sn    -27.86%        3.43e-06
intersection-block_sn    -27.83%        3.43e-06

@gereeter
Author

This also partially fixes #327 - it doesn't have all the tactics and instances or mergeA yet though.

Data/IntMap.hs Outdated
@@ -10,6 +10,7 @@
-- Module : Data.IntMap
-- Copyright : (c) Daan Leijen 2002
-- (c) Andriy Palamarchuk 2008
-- (c) Jonathan S. 2016
Author

I really have no clue what the proper thing to do here is. To be clear, I'm pretty sure that all the actual code is mine - I looked at the existing source a few times to clarify what exactly some functions did and to put things in a nice, consistent order, but otherwise everything was built from scratch. However, the documentation is lifted straight from the existing codebase.

Contributor

Documentation is also subject to copyright. I'm generally in favor of being expansive about attribution; it's almost always safer (legally, academically, and interpersonally) to give unnecessary credit than to fail to give necessary credit. That said, you should probably mention somewhere near the top that you wrote the code while previous authors designed the interface and wrote the documentation.

Author

Yeah, I figured so, and in general I wanted to be conservative both to properly credit the work done on documentation and to avoid legal issues since I know basically nothing real about copyright law. I'll add some clarification, but leave both existing lines.

@treeowl
Contributor

treeowl commented Sep 13, 2016

This will be primarily up to @wrengr to decide. Are there operations or contexts where this new implementation is known or believed to be inherently worse?

@gereeter
Author

Are there operations or contexts where this new implementation is known or believed to be inherently worse?

No. Everything has the same asymptotics (except findMin and findMax, which are faster), and the tree structure is identical to a PATRICIA tree. Strictly fewer pointers need to be dereferenced to reach any given key.

@gereeter
Author

gereeter commented Sep 13, 2016

Also, I think that most of the existing regressions can be fixed, especially the large ones - I haven't done the work to investigate foldlWithKey', and the rest basically just want specialized implementations.

@treeowl
Contributor

treeowl commented Sep 13, 2016

Can you get this to build with older GHC by conditionally using your own copy of Data.Functor.Identity? How important is ScopedTypeVariables here? We use it elsewhere, conditionally, but unless and until it's standardized I'd prefer to avoid it. That said, if you use it to revolutionize map performance, I'm personally willing to go there.


@gereeter force-pushed the direct-bounded-intmap branch 2 times, most recently from 32cfb0c to 8a35111 on September 13, 2016
@gereeter
Author

I removed the requirements of EmptyDataDecls and ScopedTypeVariables, and I fixed a bunch of compatibility problems, including the use of Data.Functor.Identity.

@treeowl
Contributor

treeowl commented Sep 13, 2016

The bounded-intmap README indicates regressions in intersection and difference. I don't see those in the benchmarks above. Were you able to work around the problem? If not, can you give us a sense of how bad it is? There are some set-operations benchmarks in a subdirectory of the containers benchmarks that may help.


@gereeter
Author

So, I ran the set-operations-intmap benchmark, and the results are quite interesting. Since this benchmark is far more detailed and thorough than what I threw together in the bounded-intmap repository, the conclusions look very different:

Benchmark               Runtime change  Original runtime
union-disj_nn            +20.14%        2.08e-07
union-disj_nn            +20.32%        2.09e-07
union-disj_ns            +19.76%        1.85e-07
union-disj_ns            +19.57%        1.85e-07
union-disj_sn            +13.50%        1.94e-07
union-disj_sn            +13.65%        1.94e-07
union-disj_nt            +16.34%        1.55e-07
union-disj_nt            +16.23%        1.55e-07
union-disj_tn            +10.41%        1.61e-07
union-disj_tn            +10.25%        1.61e-07
union-common_nn          -13.83%        1.12e-02
union-common_nn          -13.22%        1.10e-02
union-common_ns          -10.26%        4.69e-03
union-common_ns          -10.07%        4.67e-03
union-common_sn          -11.13%        4.67e-03
union-common_sn          -11.21%        4.67e-03
union-common_nt          -21.66%        9.11e-05
union-common_nt          -24.34%        9.43e-05
union-common_tn          -22.92%        8.89e-05
union-common_tn          -22.40%        8.80e-05
union-mix_nn             +13.34%        2.06e-02
union-mix_nn             +13.49%        2.06e-02
union-mix_ns              +6.80%        5.10e-03
union-mix_ns              +6.80%        5.10e-03
union-mix_sn              +5.54%        5.11e-03
union-mix_sn              +6.38%        5.08e-03
union-mix_nt              +1.41%        8.05e-05
union-mix_nt              +1.75%        8.03e-05
union-mix_tn              -7.61%        8.58e-05
union-mix_tn              -8.24%        8.61e-05
union-block_nn           +44.40%        1.03e-04
union-block_nn           +44.11%        1.03e-04
union-block_ns           +40.25%        6.79e-06
union-block_ns           +40.85%        6.78e-06
union-block_sn           +32.62%        6.92e-06
union-block_sn           +33.01%        6.88e-06
difference-disj_nn       -78.23%        1.72e-07
difference-disj_nn       -78.25%        1.72e-07
difference-disj_ns       -76.55%        1.60e-07
difference-disj_ns       -76.54%        1.60e-07
difference-disj_sn       -81.40%        1.91e-07
difference-disj_sn       -81.47%        1.91e-07
difference-disj_nt       -73.18%        1.40e-07
difference-disj_nt       -73.24%        1.40e-07
difference-disj_tn       -75.89%        1.47e-07
difference-disj_tn       -75.83%        1.47e-07
difference-common_nn      +7.85%        7.50e-03
difference-common_nn      +7.86%        7.50e-03
difference-common_ns      +1.65%        4.27e-03
difference-common_ns      +1.73%        4.28e-03
difference-common_sn     -27.05%        8.64e-04
difference-common_sn     -27.24%        8.55e-04
difference-common_nt     -20.93%        1.00e-04
difference-common_nt     -21.43%        1.01e-04
difference-common_tn     -37.30%        5.91e-05
difference-common_tn     -37.35%        5.92e-05
difference-mix_nn        -16.25%        1.78e-02
difference-mix_nn        -16.50%        1.78e-02
difference-mix_ns        -19.62%        4.36e-03
difference-mix_ns        -19.37%        4.35e-03
difference-mix_sn        +24.33%        1.49e-03
difference-mix_sn        +24.91%        1.49e-03
difference-mix_nt        -28.88%        8.57e-05
difference-mix_nt        -28.41%        8.52e-05
difference-mix_tn        -22.30%        5.37e-05
difference-mix_tn        -23.62%        5.45e-05
difference-block_nn      -64.96%        7.37e-05
difference-block_nn      -65.44%        7.57e-05
difference-block_ns      -52.32%        6.55e-06
difference-block_ns      -53.00%        6.54e-06
difference-block_sn      -32.62%        5.43e-06
difference-block_sn      -32.50%        5.44e-06
intersection-disj_nn     -73.10%        1.24e-07
intersection-disj_nn     -73.09%        1.24e-07
intersection-disj_ns     -70.37%        1.12e-07
intersection-disj_ns     -70.24%        1.12e-07
intersection-disj_sn     -70.05%        1.12e-07
intersection-disj_sn     -70.02%        1.12e-07
intersection-disj_nt     -64.38%        9.38e-08
intersection-disj_nt     -64.49%        9.38e-08
intersection-disj_tn     -63.88%        9.28e-08
intersection-disj_tn     -63.85%        9.27e-08
intersection-common_nn   +15.38%        7.08e-03
intersection-common_nn   +15.37%        7.07e-03
intersection-common_ns   +73.79%        1.04e-03
intersection-common_ns   +71.16%        1.05e-03
intersection-common_sn   +72.89%        1.07e-03
intersection-common_sn   +76.42%        1.05e-03
intersection-common_nt   -14.21%        5.48e-05
intersection-common_nt   -13.98%        5.45e-05
intersection-common_tn   -25.13%        5.97e-05
intersection-common_tn   -25.16%        5.98e-05
intersection-mix_nn       -4.39%        3.01e-03
intersection-mix_nn      -11.54%        3.14e-03
intersection-mix_ns      -31.01%        6.40e-04
intersection-mix_ns      -30.98%        6.39e-04
intersection-mix_sn      -20.84%        6.64e-04
intersection-mix_sn      -24.33%        6.58e-04
intersection-mix_nt       -6.07%        3.90e-05
intersection-mix_nt       -4.52%        3.91e-05
intersection-mix_tn      -26.01%        4.63e-05
intersection-mix_tn      -24.39%        4.59e-05
intersection-block_nn    -65.50%        4.88e-05
intersection-block_nn    -65.43%        4.88e-05
intersection-block_ns    -41.59%        3.52e-06
intersection-block_ns    -41.95%        3.54e-06
intersection-block_sn    -24.71%        3.43e-06
intersection-block_sn    -24.46%        3.43e-06

In particular, union-disj, union-mix, union-block, and intersection-common are the major regressions. I definitely need to look more at intersection and see what I can do to solve the 75% spikes.

@gereeter
Copy link
Author

So it seems that the worst of the intersection regression was another misguided constant argument capture - I really need to just go through the codebase for every instance of that and benchmark each one. New intersection benchmarks:

Benchmark               Runtime change  Original runtime
intersection-disj_nn     -74.95%        1.24e-07
intersection-disj_nn     -74.86%        1.24e-07
intersection-disj_ns     -72.28%        1.12e-07
intersection-disj_ns     -72.15%        1.12e-07
intersection-disj_sn     -70.07%        1.12e-07
intersection-disj_sn     -70.00%        1.12e-07
intersection-disj_nt     -66.86%        9.38e-08
intersection-disj_nt     -66.80%        9.38e-08
intersection-disj_tn     -63.92%        9.28e-08
intersection-disj_tn     -63.90%        9.27e-08
intersection-common_nn   +10.31%        7.08e-03
intersection-common_nn    +9.39%        7.07e-03
intersection-common_ns   +42.33%        1.04e-03
intersection-common_ns   +40.95%        1.05e-03
intersection-common_sn   +42.86%        1.07e-03
intersection-common_sn   +45.64%        1.05e-03
intersection-common_nt   -20.33%        5.48e-05
intersection-common_nt   -20.35%        5.45e-05
intersection-common_tn   -28.17%        5.97e-05
intersection-common_tn   -29.02%        5.98e-05
intersection-mix_nn      -16.26%        3.01e-03
intersection-mix_nn      -20.03%        3.14e-03
intersection-mix_ns      -40.50%        6.40e-04
intersection-mix_ns      -39.25%        6.39e-04
intersection-mix_sn      -34.05%        6.64e-04
intersection-mix_sn      -32.00%        6.58e-04
intersection-mix_nt      -16.56%        3.90e-05
intersection-mix_nt      -16.80%        3.91e-05
intersection-mix_tn      -29.41%        4.63e-05
intersection-mix_tn      -29.14%        4.59e-05
intersection-block_nn    -69.29%        4.88e-05
intersection-block_nn    -69.22%        4.88e-05
intersection-block_ns    -51.56%        3.52e-06
intersection-block_ns    -51.88%        3.54e-06
intersection-block_sn    -39.75%        3.43e-06
intersection-block_sn    -39.76%        3.43e-06

Now the worst regression is 45%.
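
(An aside on terminology: by "constant argument capture" I mean the static-argument/local-worker pattern, illustrated below on a toy function of my own rather than anything from this PR. Whether capturing the constant argument in a closure beats threading it through every call varies case by case, which is why each instance needs its own benchmark.)

-- the constant argument f is threaded through every recursive call:
mapEach :: (a -> b) -> [a] -> [b]
mapEach f (x:xs) = f x : mapEach f xs
mapEach _ []     = []

-- "constant argument capture": f is captured once by a local worker
mapEach' :: (a -> b) -> [a] -> [b]
mapEach' f = go
  where
    go (x:xs) = f x : go xs
    go []     = []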

@treeowl
Contributor

treeowl commented Sep 13, 2016

If the worst regressions remain that bad, it may make sense to add your version of IntMap in a separate module, rather than replacing the current one. I believe both Wren and I are generally in favor of adding more data structures to the package if there's a good case for them.


@gereeter
Author

I just pushed a hacky change to intersection that almost completely removes the regression. I'm not happy with it, since it makes every other intersection benchmark (those besides intersection-common) slower, but it may be a starting point for a better solution.

@treeowl
Contributor

treeowl left a comment

The current IntMap uses Bin and a "heavy" Tip everywhere but the top, where a Nil is permissible. You use data Node t a = Bin !Key a !(Node L a) !(Node R a) | Tip. There is a non-trivial cost to these empty Tips. In particular, every leaf of the tree ends up looking like (to use the current constructor names) Bin k a Nil Nil, which is two words larger than Tip k a. I'm not saying you're wrong, but it may be something to consider. One alternative might be data Node t a = Bin !Key a !(Node L a) !(Node R a) | LOnly !Key a !(Node L a) | ROnly !(Node R a).

Apropos of nothing, I do not like your definitions of R and L. Would it hurt anyone to use data L = L and data R = R?

Data/IntMap.hs Outdated
-- Workshop on ML, September 1998, pages 77-86,
-- <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452>
--
-- * D.R. Morrison, \"/PATRICIA -- Practical Algorithm To Retrieve
Contributor

Isn't this data structure still based on ideas from the PATRICIA paper? If so, the citation should remain. It may be that nothing remains here of Okasaki and Gill, in which case it would be reasonable to delete that reference; I'll leave that up to you.

@treeowl
Contributor

treeowl commented Sep 14, 2016

The vast number of changes to Data.IntMap.Internal makes it a bit hard to review them on GitHub. Do you know if GitHub would be happier if you were to temporarily restore the original (unused) Data.IntMap.Internal and create a brand-new Data.IntMap.Internal2? I don't know nearly enough about git or GitHub to say, but it might be worth a shot.

@gereeter
Author

I can copy some files around if you wish, but since this really is a rewrite, not an edit, I'd just recommend looking at the new code (https://github.com/gereeter/containers/blob/direct-bounded-intmap/Data/IntMap/Internal.hs), not in diff format.

@gereeter
Author

(Additionally, I don't think a separate file would work, since GitHub limits diffs to 1500 lines.)

@treeowl
Contributor

treeowl commented Sep 14, 2016

I'm not sure if there's a way to make it happy. The annoying thing is that I can't seem to make inline comments on that most-important file.


@gereeter
Author

to use the current constructor names

I actually chose to use Bin and Tip as opposed to Bin and Nil to more accurately correspond to the existing code - for the same set of keys, my map will have a Bin exactly where the old code would have a Bin, and my map will have a Tip exactly where the old code would have a Tip. Empty is the closest analogue to Nil.

There is a non-trivial cost to these empty Tips. In particular, every leaf of the tree ends up looking like (to use the current constructor names) Bin k a Nil Nil, which is two words larger than Tip k a. I'm not saying you're wrong, but it may be something to consider. One alternative might be data Node t a = Bin !Key a !(Node L a) !(Node R a) | LOnly !Key a !(Node L a) | ROnly !(Node R a).

This could work, and certainly would save more memory (only 2n-1 words of overhead). However,

  • Without pattern synonyms, it would make the code much larger and much uglier. Implementing the already unwieldy merge functions for the more complicated structure sounds like a nightmare.
  • I suspect that the overhead of branching based on all 4 possible node types would cause serious regressions.
  • I can't see any way that the more complicated structure could improve performance to compensate, except possibly better cache behavior due to being smaller. It doesn't save any pointer-chasing, since Tips are already empty and therefore don't have to be allocated specially.

I'd be willing to try the more compact variation if you think the space savings are important enough, but I'd prefer to do that as a separate PR if at all.

Apropos of nothing, I do not like your definitions of R and L. Would it hurt anyone to use data L = L and data R = R?

newtype L = L L has a slight safety advantage in that it is impossible to construct, so it can't accidentally be used as anything but a marker. It doesn't really matter though.
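
(Spelled out, with R presumably defined the same way, the two options are:)

newtype L = L L  -- uninhabited: no finite value of this type exists, so it can only be a phantom
newtype R = R R

-- versus the ordinary nullary markers:
-- data L = L
-- data R = R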

@treeowl
Contributor

treeowl commented Sep 14, 2016

I actually chose to use Bin and Tip as opposed to Bin and Nil to more accurately correspond to the existing code - for the same set of keys, my map will have a Bin exactly where the old code would have a Bin, and my map will have a Tip exactly where the old code would have a Tip. Empty is the closest analogue to Nil.

I'll have to read more deeply into this, because that doesn't make sense to me yet. The current code stores a key and a value in its Tip; how can that correspond to yours?

This could work, and certainly would save more memory (only 2n-1 words of overhead). However,

Without pattern synonyms, it would make the code much larger and much uglier. Implementing the already unwieldy merge functions for the more complicated structure sounds like a nightmare.

I fear you may be right about this. How hard would it be for you to use pattern synonyms temporarily, just to see how performance compares? I think these should all be "simple bidirectional" synonyms, which are generally rather straightforward.

I suspect that the overhead of branching based on all 4 possible node types would cause serious regressions.
I can't see any way that the more complicated structure could improve performance to compensate, except possibly better cache behavior due to being smaller. It doesn't save any pointer-chasing, since Tips are already empty and therefore don't have to be allocated specially.

Being smaller is tremendously helpful all by itself. But wait! There's more! A pointer to an already-evaluated node will generally be tagged (in low bits) to indicate which constructor that node was formed from. I don't know all the details of how this helps, but it helps quite tremendously.

Apropos of nothing, I do not like your definitions of R and L. Would it hurt anyone to use data L = L and data R = R?

newtype L = L L has a slight safety advantage in that it is impossible to construct, so it can't accidentally be used as anything but a marker. It doesn't really matter though.

How about something like this?

#if __GLASGOW_HASKELL__ >= whatever
data Hand = L' | R'
type L = 'L'
type R = 'R'
#else
data L = L
data R = R
#endif

@treeowl
Contributor

treeowl commented Sep 14, 2016

Actually, I guess the pattern synonyms may not be quite as simple as I thought. But they shouldn't be terrible, I don't think. If the benchmarks using them turn out fantastic, then we'll have proof positive that your structure can beat the current IntMap.

Another place to look for potential performance improvements is at the top of the tree. The way you do it now is good if GHC inlines the outermost constructor away. If not, you have an extra indirection. I haven't thought of a non-horrible way around this yet.


@treeowl
Contributor

treeowl commented Sep 15, 2016

Argh... I started looking into this some more, and I don't think pattern synonyms can help. However, I noticed that lookup, member, notMember, and findWithDefault can all be implemented (probably) efficiently using a single function lookupGen :: Key -> r -> (a -> r) -> IntMap a -> r.
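
A minimal sketch of that factoring, over a toy binary search tree rather than the PR's actual node type (Toy and the *Toy names are mine): each query becomes the one traversal instantiated with a different "miss" value and "hit" continuation.

type Key = Int

data Toy a = TBin !Key a (Toy a) (Toy a) | TTip

lookupGen :: Key -> r -> (a -> r) -> Toy a -> r
lookupGen k miss hit = go
  where
    go TTip = miss
    go (TBin k' v l r)
      | k == k'   = hit v
      | k <  k'   = go l
      | otherwise = go r

lookupToy :: Key -> Toy a -> Maybe a
lookupToy k = lookupGen k Nothing Just

memberToy :: Key -> Toy a -> Bool
memberToy k = lookupGen k False (const True)

notMemberToy :: Key -> Toy a -> Bool
notMemberToy k = lookupGen k True (const False)

findWithDefaultToy :: a -> Key -> Toy a -> a
findWithDefaultToy def k = lookupGen k def id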

Another question: is there an efficiency benefit to writing size by hand, or should we just use getSum . foldMap (const (Sum 1))?
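
That is, something like the following; whether it matches a hand-written traversal presumably depends on GHC eliminating the Sum wrappers.

import Data.Monoid (Sum(..))
import Data.IntMap (IntMap)

size' :: IntMap a -> Int
size' = getSum . foldMap (const (Sum 1))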

@treeowl
Contributor

treeowl commented Sep 15, 2016

Don't worry about using throwaway types like DeleteResult. I've sprayed similar things all over Data.Map and Data.Sequence. Such things seem to help a lot with strictness analysis, unboxing, etc. I wish we didn't need them, but don't think twice about using them, and don't waste your time trying to avoid them.

By the way: I should have already said that I greatly appreciate the work you've done on this tremendous rewrite already. Aside from the fact that it speeds up a lot of critical operations, I'm seeing now that its code is much easier to understand than that of the current IntMap. I'm licking my chops at the thought of deleting great gobs of impenetrable bit fiddling.

@gereeter
Author

Could you give me a sense of why some fold variants might be challenging?

No. I really have no clue why certain folds are slower with this implementation. The only thing I can think of is that we need 3x the stack of the previous implementation, since we need to store a node, a key, and a value instead of just a node at certain recursive calls.

If the worst regressions remain that bad, it may make sense to add your version of IntMap in a separate module, rather than replacing the current one.

Even if I fail to reduce the union regressions (now at worst 38%), I don't think this makes sense - the criteria for when to use which variant would be really odd, especially since the regression is so localized. For the old IntMap to be favorable (currently), you need to have a very union-heavy workload which generally unions maps that are disjoint but have coarsely interleaved keys. This seems very odd and certainly niche to me.

I'll have to read more deeply into this, because that doesn't make sense to me yet. The current code stores a key and a value in its Tip; how can that correspond to yours?

Although this structure stores the keys and values in different locations, the spine of the tree is the same. If the existing implementation looks like

Bin _ _ (Bin _ _ (Tip _ _) (Bin _ _ (Tip _ _) (Tip _ _))) (Tip _ _)

then my implementation will look like

NonEmpty _ _ (Bin _ _ (Bin _ _ Tip (Bin _ _ Tip Tip)) Tip)

See also the bounded-intmap README section (https://github.com/gereeter/bounded-intmap#moving-the-values-upward) about how this data structure was derived.

Being smaller is tremendously helpful all by itself.

This is true. However,

  • Since this PR already slims down the structure, we're starting to hit diminishing returns.
  • In the context of random accesses, what matters is having top levels of the tree in cache, since those help every access. However, the number of nodes in the top levels grows exponentially with depth, so a small constant factor decrease in size would probably only save 1 cache miss if it saves cache misses at all.

I'm not saying that space reduction is useless or unfavorable. I just think that it won't produce enough time benefit to outweigh the additional complexity.

In general, I think this may be worth it, solely because it saves memory even if it slows things down. I just want to put that discussion off until later, after the basic implementation is in.

But wait! There's more! A pointer to an already-evaluated node will generally be tagged (in low bits) to indicate which constructor that node was formed from. I don't know all the details of how this helps, but it helps quite tremendously.

I don't think this would actually help, though. The pointer tags already allow you to distinguish between Bin and Tip - the only new information would be how many Tips a Bin has as children, which would save a dereference except for the fact that you need to read the data stored in the Bin anyway. In fact, using only two constructors (just Bin and Tip) actually helps, as it makes it possible to check whether one child is a Tip independently of the other child, which is useful when you only actually care about a single child (as is the case in any point query).

I fear you may be right about this. How hard would it be for you to use pattern synonyms temporarily, just to see how performance compares? I think these should all be "simple bidirectional" synonyms, which are generally rather straightforward.

Actually, I guess the pattern synonyms may not be quite as simple as I thought. But they shouldn't be terrible, I don't think. If the benchmarks using then turn out fantastic, then we'll have proof positive that your structure can beat the current IntMap.

Argh... I started looking into this some more, and I don't think pattern synonyms can help.

Yeah, upon further thought, I think it would be a significant, ugly project even with pattern synonyms.

How about something like this?

#if __GLASGOW_HASKELL__ >= whatever
data Hand = L' | R'
type L = 'L'
type R = 'R'
#else
data L = L
data R = R
#endif

I could... but that seems like a ridiculous amount of complexity for basically no gain. If you truly feel passionate about this, I can just use data L = L, and otherwise I can keep it like it is. The newtype solution is simple at only a single line, doesn't require any extensions and is thus trivially portable, and prevents misuse. Using conditional compilation to get a slightly prettier definition when you still have to keep around the older, uglier definition that acts nearly identically doesn't make sense to me.

Another place to look for potential performance improvements is at the top of the tree. The way you do it now is good if GHC inlines the outermost constructor away. If not, you have an extra indirection. I haven't thought of a non-horrible way around this yet.

It's not quite an extra indirection, since the wrapper is not useless - NonEmpty stores the minimum key/value pair. However, you are right that the top level can cause some performance issues - the regression in intersection basically boiled down to the fact that GHC can't return sum types on the stack. This makes it very painful to return a sum type that is immediately going to be inspected and deconstructed, as that causes a lot of garbage. My super hacky solution for intersection was to create a product type isomorphic (roughly) to IntMap_ and convert to/from that on recursion boundaries. This allows GHC to return the value on the stack, fixing the regression, but it also causes a bunch of annoying conversions and stack traffic that slows everything else down.
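
(To illustrate the shape of that workaround - these are hypothetical types of my own, not the PR's definitions: GHC's constructed-product-result analysis can return the fields of a strict single-constructor type unboxed, but a multi-constructor result must be allocated so the caller can inspect its tag.)

type Key = Int

-- sum result: every recursive return allocates an Empty'/NonEmpty' box
-- that the caller immediately scrutinizes and throws away
data SumResult a = Empty' | NonEmpty' !Key a

-- product result: one constructor, so the fields can come back on the
-- stack or in registers; the Bool stands in for the lost constructor tag,
-- and the empty case needs a placeholder value - part of what makes it hacky
data ProdResult a = ProdResult !Bool !Key a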

I did have one idea here that I didn't try yet, which was to inline one level of Node into IntMap_, making

data IntMap_ a = Empty | Singleton Key a | Branched Key a Key a (Node L a) (Node R a)

This would decrease indirection by one level and make things more symmetrical. I didn't really implement this because I believe that the top level is rarely the bottleneck - the only reason it showed up in intersection was because it was being returned and inspected in a loop. Most functions only inspect the top level once, making it largely irrelevant. Improvements would be nice of course, but they aren't critical.

Don't worry about using throwaway types like DeleteResult. I've sprayed similar things all over Data.Map and Data.Sequence. Such things seem to help a lot with strictness analysis, unboxing, etc. I wish we didn't need to use them, but don't think twice to do so and don't waste your time trying to avoid it.

Yeah, I'm not too worried about that. The one type I do want to get rid of is SP, which is just another strict pair, a remnant from when I was writing this code out of tree.

By the way: I should have already said that I greatly appreciate the work you've done on this tremendous rewrite already. Aside from the fact that it speeds up a lot of critical operations, I'm seeing now that its code is much easier to understand than that of the current IntMap. I'm licking my chops at the thought of deleting great gobs of impenetrable bit fiddling.

Thank you. One of the things I really like about this design is the fact that everything is just simple xors and comparisons, requiring no exotic instructions or bit twiddling. However, your opinion might change once you look at the merging functions, especially since I haven't commented them yet.
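
(For a taste of what "simple xors and comparisons" means, here is the core step as I would summarize it from the bounded-intmap README - a sketch, not the exact code in this PR: xor acts as a distance measure dominated by the highest differing bit, so comparing a key's xor-distance to the two bounds tells you which side of a node the key belongs on.)

import Data.Bits (xor)

type Key = Int

-- a smaller xor means a longer shared prefix, i.e. "closer"; descend left
-- if the key is closer to the minimum bound than to the maximum bound
-- (glossing over the care the real code takes with signed Int comparisons)
goesLeft :: Key -> Key -> Key -> Bool
goesLeft k mn mx = xor k mn < xor k mx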

@treeowl
Contributor

treeowl commented Sep 15, 2016

Could you give me a sense of why some fold variants might be challenging?

No. I really have no clue why certain folds are slower with this implementation. The only thing I can think of is that we need 3x the stack of the previous implementation, since we need to store a node, a key, and a value instead of just a node at certain recursive calls.

Hrmm... stack space is usually pretty cheap, so I'm a bit surprised. I'll try to see if I can think of anything. How much slower are they again?

If the worst regressions remain that bad, it may make sense to add your version of IntMap in a separate module, rather than replacing the current one.

Even if I fail to reduce the union regressions (now at worst 38%), I don't think this makes sense - the criteria for when to use which variant would be really odd, especially since the regression is so localized. For the old IntMap to be favorable (currently), you need to have a very union-heavy workload which generally unions maps that are disjoint but have coarsely interleaved keys. This seems very odd and certainly niche to me.

One thing that really helped out in Data.Set merge operations (and some Data.Map ones - there are others I haven't explored yet) is trying to reuse as much structure as possible. For the most part, I did this using a thin wrapper around reallyUnsafePtrEquality# (Utils.Containers.Internal.PtrEquality.ptrEq). Maps with coarsely interleaved keys could benefit quite a bit from this sort of thing.
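
The wrapper is essentially the following; the rebuild helper in the comment is my own illustration of the reuse pattern, not code from containers.

{-# LANGUAGE MagicHash #-}
import GHC.Exts (isTrue#, reallyUnsafePtrEquality#)

-- like Utils.Containers.Internal.PtrEquality.ptrEq: True only if both
-- arguments are the same heap object (False proves nothing, so it is safe)
ptrEq :: a -> a -> Bool
ptrEq x y = isTrue# (reallyUnsafePtrEquality# x y)

-- the pattern in a merge, schematically (hypothetical helper):
--   rebuild orig k v l r
--     | l `ptrEq` origL && r `ptrEq` origR = orig  -- children unchanged: share the old node
--     | otherwise                          = bin k v l r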

I'll have to read more deeply into this, because that doesn't make sense to me yet. The current code stores a key and a value in its Tip; how can that correspond to yours?

Although this structure stores the keys and values in different locations, the spine of the tree is the same. If the existing implementation looks like

Bin _ _ (Bin _ _ (Tip _ _) (Bin _ _ (Tip _ _) (Tip _ _))) (Tip _ _)

then my implementation will look like

NonEmpty _ _ (Bin _ _ (Bin _ _ Tip (Bin _ _ Tip Tip)) Tip)

See also the bounded-intmap README section (https://github.com/gereeter/bounded-intmap#moving-the-values-upward) about how this data structure was derived.

Yeah, I did read that. It largely makes sense. One minor concern: if someone uses IntMap in a heavily persistent way, will your "pull things up" approach reduce sharing? I suspect that's not the most common scenario, but we have to think about it.

Being smaller is tremendously helpful all by itself.

This is true. However,

  • Since this PR already slims down the structure, we're starting to hit diminishing returns.
  • In the context of random accesses, what matters is having the top levels of the tree in cache, since those help every access. However, the number of nodes in the top levels grows exponentially with depth, so a small constant factor decrease in size would probably only save 1 cache miss, if it saves cache misses at all.

I'm not saying that space reduction is useless or unfavorable. I just think that it won't produce enough time benefit to outweigh the additional complexity.

In general, I think this may be worth it, solely because it saves memory even if it slows things down. I just want to put that discussion off until later, after the basic implementation is in.

That's fine. When I thought it could be benchmarked quickly using a couple
pattern synonyms, it looked like low-hanging fruit. Now that I realize I
have no idea how to get there quickly, it looks like a better back-burner
candidate.

But wait! There's more! A pointer to an already-evaluated node will
generally be tagged (in low bits) to indicate which constructor that node
was formed from. I don't know all the details of how this helps, but it
helps quite tremendously.

I don't think this would actually help, though. The pointer tags already
allow you to distinguish between Bin and Tip - the only new information
would be how many Tips a Bin has as children, which would save a
dereference except for the fact that you need to read the data stored in
the Bin anyway. In fact, using only two constructors (just Bin and Tip)
actually helps, as it makes it possible to check whether one child is a
Tip independent of the other child, which is useful when you only
actually care about a single child (as is the case in any point query).

Sounds plausible.

How about something like this?

#if GLASGOW_HASKELL >= whatever
data Hand = L' | R'
type L = 'L'
type R = 'R'
#else
data L = L
data R = R
#endif

I could... but that seems like a ridiculous amount of complexity for
basically no gain. If you truly feel passionate about this, I can just use data
L = L, and otherwise I can keep it like it is. The newtype solution is
simple at only a single line, doesn't require any extensions and is thus
trivially portable, and prevents misuse. Using conditional compilation to
get a slightly prettier definition when you still have to keep around the
older, uglier definition that acts nearly identically doesn't make sense to
me.

I personally strongly prefer the nullary-constructor datatypes over the
newtypes. When I see an uninhabited type, it always sends the message that
something interesting is going on; in fact, less than nothing is going on
because the type is only used as a phantom. You're probably right that
conditional DataKinds is overkill.
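
For anyone skimming the thread, the two encodings under discussion look roughly like this (a sketch based on the README; the branch may differ in details):

-- newtype encoding: one line each, no extensions; the types have no
-- finite values, so they can only ever be used as phantoms
newtype L = L L
newtype R = R R

-- nullary-constructor encoding: the types are inhabited, but the
-- intent is arguably clearer
-- data L = L
-- data R = R

-- either way, the phantom parameter keeps the node flavors apart:
-- data Node t a = Tip | Bin !Int a !(Node L a) !(Node R a)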

Another place to look for potential performance improvements is at the top
of the tree. The way you do it now is good if GHC inlines the outermost
constructor away. If not, you have an extra indirection. I haven't thought
of a non-horrible way around this yet.

It's not quite an extra indirection, since the wrapper is not useless -
NonEmpty stores the minimum key/value pair. However, you are right that
the top level can cause some performance issues - the regression in
intersection basically boiled down to the fact that GHC can't return sum
types on the stack. This makes it very painful to return a sum type that is
immediately going to be inspected and deconstructed, as that causes a lot
of garbage. My super hacky solution for intersection was to create a
product type isomorphic (roughly) to IntMap_ and convert to/from that on
recursion boundaries. This allows GHC to return the value on the stack,
fixing the regression, but it also causes a bunch of annoying conversions
and stack traffic that slows everything else down.
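
For illustration, the shape of that workaround (not the PR's code; Res stands in for IntMap_):

-- returning this sum from a hot loop allocates a constructor per call
data Res a = None | Some !Int a

-- a strict product can be returned unboxed via worker/wrapper, at
-- the price of a tag and dummy fields for the "empty" case
data Res' a = Res' !Bool !Int a

toSum :: Res' a -> Res a
toSum (Res' False _ _) = None
toSum (Res' True k v)  = Some k v

fromSum :: Res a -> Res' a
fromSum None       = Res' False 0 (error "Res': no value")
fromSum (Some k v) = Res' True k v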

Blegh. If we can avoid nastiness, let's avoid it. If we can't, well, we
can't.

I did have one idea here that I didn't try yet, which was to inline one
level of Node into IntMap_, making

data IntMap_ a = Empty | Singleton Key a | Branched Key a Key a (Node L a) (Node R a)

This would decrease indirection by one level and make things more
symmetrical. I didn't really implement this because I believe that the top
level is rarely the bottleneck - the only reason it showed up in
intersection was because it was being returned and inspected in a loop.
Most functions only inspect the top level once, making it largely
irrelevant. Improvements would be nice of course, but they aren't critical.

I was considering something like this. I suspect it will be helpful for
situations where someone stuffs a data structure full of IntMaps. How much
do you think it would blow up the code?

Apropos of not much: do you think there's some recursion-schemes-style
trickery that could make it easier to work with your partially-decorated
trees as though they were fully-decorated?

Thank you. One of the things I really like about this design is the fact
that everything is just simple xors and comparisons, requiring no exotic
instructions or bit twiddling. However, your opinion might change once you
look at the merging functions, especially since I haven't commented them
yet.

I'll get there eventually.

@treeowl
Contributor

treeowl commented Sep 15, 2016

How did you like my lookupGen idea, by the way? If it's INLINEd, I think it can probably produce good object code, and it cuts the source down a non-trivial (although non-large) amount.

@gereeter
Author

Another question: is there an efficiency benefit to writing size by hand, or should we just use getSum . foldMap (const (Sum 1))?

In terms of efficiency, I doubt there is any difference at the moment (I haven't actually benchmarked), though the current code counts Tips instead of key/value pairs.

I wrote it manually for a few reasons:

  • Clarity. getSum . foldMap (const (Sum 1)) isn't necessarily hard to read, but it involves a bunch of high-powered concepts and immediately raises a bunch of flags in my head: will everything get inlined correctly to produce a specialized function? Will that build up a large thunk instead of just evaluating things in one go, given that foldMap isn't automatically strict? Even if foldMap were strict, does Sum box the contained value and reintroduce laziness? Even if everything does get inlined and optimized properly, will this take several optimization passes to finish, thereby preventing further automatic fusion or improvement? These questions aren't necessarily hard to answer, but they just go away with a direct implementation.
  • Consistent performance. Even if using foldMap optimizes now, changes to it could prevent that in the future. For example, binding a constant argument (such as the function passed to foldMap) can cause slowdowns, but as seen in "Rewrite IntMap map so it can inline; define <$" (#302), it is also necessary to allow the function to be properly inlined, since GHC won't inline recursive functions. A manual implementation has no such forward-compatibility concerns.
  • No real reason to switch. Using a foldMap-based one-liner only saves 4 lines of code. I don't see the point.
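
For comparison, here is what both versions look like over the toy types from the fold sketch above (foldMapIM stands in for a real Foldable instance; illustrative, not the PR's code):

{-# LANGUAGE BangPatterns #-}

import Data.Monoid (Sum (..))

foldMapIM :: Monoid m => (a -> m) -> IntMap a -> m
foldMapIM _ Empty = mempty
foldMapIM f (NonEmpty _ v root) = f v <> goL root
  where
    goL Tip = mempty
    goL (Bin _ v' l r) = goL l <> goR r <> f v'
    goR Tip = mempty
    goR (Bin _ v' l r) = f v' <> goL l <> goR r

sizeViaFoldMap :: IntMap a -> Int
sizeViaFoldMap = getSum . foldMapIM (const (Sum 1))

-- direct version: strict accumulator, nothing left to the optimizer
size :: IntMap a -> Int
size Empty = 0
size (NonEmpty _ _ root) = go 1 root
  where
    go !acc Tip = acc
    go !acc (Bin _ _ l r) = go (go (acc + 1) l) r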

However, I noticed that lookup, member, notMember, and findWithDefault can all be implemented (probably) efficiently using a single function lookupGen :: Key -> r -> (a -> r) -> IntMap a -> r.

How did you like my lookupGen idea, by the way? If it's INLINEd, I think it can probably produce good object code, and it cuts the source down a non-trivial (although non-large) amount.

This is somewhat related to the case of size, but in this case I think that lookupGen or some variation is a good idea, mostly because lookup and friends are fairly large. There are a few unanswered questions about the best way to do this, though: is there a way to make it return the individual go functions so that it can replace the redundant logic in, e.g., intersection? Is this desired? Should the constant key be bound in a closure? Is the decision about that dependent on which function is being implemented?

Basically, yes, I like your idea, but I may not implement it immediately.
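
To make the discussion concrete: the factoring is independent of the tree layout, so here is a sketch of lookupGen over a plain binary search tree (illustrative types, not this PR's):

{-# LANGUAGE BangPatterns #-}

import Prelude hiding (lookup)

type Key = Int
data Tree a = Leaf | Node (Tree a) !Key a (Tree a)

-- one generic worker; INLINE so each wrapper gets a specialized copy
lookupGen :: Key -> r -> (a -> r) -> Tree a -> r
lookupGen !k nothing just = go
  where
    go Leaf = nothing
    go (Node l kx x r) = case compare k kx of
      LT -> go l
      EQ -> just x
      GT -> go r
{-# INLINE lookupGen #-}

lookup :: Key -> Tree a -> Maybe a
lookup k = lookupGen k Nothing Just

member :: Key -> Tree a -> Bool
member k = lookupGen k False (const True)

notMember :: Key -> Tree a -> Bool
notMember k = lookupGen k True (const False)

findWithDefault :: a -> Key -> Tree a -> a
findWithDefault def k = lookupGen k def id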

@gereeter
Author

I just pushed optimized versions of fromAscList and friends. However, in doing so, I realized that their strictness (in Data.IntMap.Strict) seems somewhat bogus:

Prelude Data.IntMap.Strict> fromList [(0, undefined), (0, 0)]
fromList *** Exception: Prelude.undefined
CallStack (from HasCallStack):
  error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
  undefined, called at <interactive>:2:15 in interactive:Ghci1
Prelude Data.IntMap.Strict> fromAscList [(0, undefined), (0, 0)]
fromList [(0,0)]
Prelude Data.IntMap.Strict> fromList [(0, 0), (0, undefined), (0, 0)]
fromList *** Exception: Prelude.undefined
CallStack (from HasCallStack):
  error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
  undefined, called at <interactive>:4:23 in interactive:Ghci2
Prelude Data.IntMap.Strict> fromAscList [(0, 0), (0, undefined), (0, 0)]
fromList *** Exception: Prelude.undefined
CallStack (from HasCallStack):
  error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
  undefined, called at <interactive>:5:26 in interactive:Ghci2

I believe I matched this behavior, but it possibly should be changed.

Contributor

@treeowl treeowl left a comment

We can't just do this. With -O0 all the rewrite rules are skipped. We need to shift all tests that are supposed to trip rewrite rules into another module....

@gereeter
Author

We can't just do this. With -O0 all the rewrite rules are skipped. We need to shift all tests that are supposed to trip rewrite rules into another module....

Ah, I didn't think about that. It's a really unimportant change; I was just trying to reduce my iteration time, so I'll revert. I do have a few less effective techniques for that. Is -O1 okay, though?


Now that I've written the long-form merge documentation and filled in the remaining API holes, are there any items you think I should specifically focus on? Otherwise I'll keep working on smaller documentation pieces and attempting to optimize union further.

@treeowl
Contributor

treeowl commented Jan 24, 2020

-O1 should be fine for testing, I believe. I'd love to see even more detailed documentation for insert, in an almost "explain it like I'm five" style. I'm sure it's all obvious to you by now, but it's still a bit fuzzy to me. How bad is the union regression at the moment?

@gereeter
Author

How bad is the union regression at the moment?

I need to rerun the full benchmarks (which take a while) to give you exact numbers. Roughly, the worst-case regressions are near 50%, but most cases are mild improvements or break even.

@treeowl
Contributor

treeowl commented Apr 6, 2020

@gereeter, I'm hoping to make a release in the next week or three. Do you think you can get this beaten the rest of the way into shape?

@treeowl
Contributor

treeowl commented Apr 6, 2020

Is there anything left but updating benchmark numbers?

@gereeter
Author

gereeter commented Apr 6, 2020

@gereeter, I'm hoping to make a release in the next week or three. Do you think you can get this beaten the rest of the way into shape?

Yeah, definitely. Sorry for dropping off the radar; I think at some point I was expecting some code review and didn't communicate properly while you were waiting on me, and then I forgot to check in further or make incremental improvements.

Is there anything left but updating benchmark numbers?

Not that I know of. I thought I had updated the benchmark numbers fairly recently, but I can't remember exactly when; I'll regenerate them tomorrow. Besides benchmarks, #698 lists diffing Haddocks (which I did, though I felt someone else should double-check), checking GHC's performance (we never found the root cause there, but since overall performance seemed improved even with the extra allocations, it didn't seem bad; I also never figured out how to check that myself), and trying to improve union (still potentially worthwhile, but very open-ended, and it isn't clear how much improvement is possible).

Documentation can always be better, and maybe with a fresh look I can write some better and more extensive explanations, but I don't think it should be a blocker.

@sjakobi
Member

sjakobi commented Apr 7, 2020

I also apologize for dropping the ball here! I finally asked for advice on measuring GHC's performance and received good tips, particularly from Ben Gamari and Andreas Klebinger.

I haven't actually done much benchmarking yet, though. One difficulty here is that to take reliable timings you need a quiet machine, and my desktop usually isn't quiet. Allocations are somewhat easier to measure reliably, but so far they don't indicate any improvements due to this patch.

In order to somewhat cancel out the noise when taking timings, I've given bench a try. When I compile a module often used for comparing compiler performance, I get the following results:

Before:

$ bench "_build/stage1/bin/ghc -fforce-recomp nofib/spectral/simple/Main.hs -O"
benchmarking _build/stage1/bin/ghc -fforce-recomp nofib/spectral/simple/Main.hs -O
time                 3.677 s    (3.665 s .. 3.701 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.642 s    (3.628 s .. 3.651 s)
std dev              14.37 ms   (0.0 s .. 16.59 ms)
variance introduced by outliers: 19% (moderately inflated)

This branch:

$ bench "_build/stage1/bin/ghc -fforce-recomp nofib/spectral/simple/Main.hs -O"
benchmarking _build/stage1/bin/ghc -fforce-recomp nofib/spectral/simple/Main.hs -O
time                 3.714 s    (3.674 s .. 3.753 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.721 s    (3.713 s .. 3.727 s)
std dev              10.46 ms   (0.0 s .. 11.29 ms)
variance introduced by outliers: 19% (moderately inflated)

So we see a small regression in compile times, although I can't confidently say by how much.

So while it would be good to have more data on the performance implications for GHC, I'm currently skeptical that we'll see GHC benefit from this change.

IMHO we should finally invite other people to review this patch and to give this branch a try in their applications. Maybe we just haven't found the applications that will profit from this yet.

@sjakobi
Member

sjakobi commented Apr 7, 2020

BTW if you want to try your own hand at benchmarking GHC, you can use my branch: https://gitlab.haskell.org/sjakobi/ghc/-/tree/sjakobi-new-intmap

If you need any help, I may be able to answer questions, but #ghc on freenode will be much more competent.

This should be a good start regarding building GHC: https://ghc.dev/

@sjakobi
Member

sjakobi commented May 21, 2020

After all the work that went into this rewrite I feel a bit bad that it hasn't been merged simply because I'm not fully convinced that the performance changes will pay off in practice.

How about advertising this code a bit, e.g. on the libraries list or r/haskell?! Publishing it in a separate library would of course help make it more accessible…

@sjakobi sjakobi added the IntMap label Jul 15, 2020
@sjakobi
Member

sjakobi commented Jul 26, 2021

For reference: @doyougnu has done an interesting performance analysis on how this branch would affect GHC. See https://gitlab.haskell.org/ghc/ghc/-/issues/19820#note_364086.

@doyougnu

For reference: @doyougnu has done an interesting performance analysis on how this branch would affect GHC. See https://gitlab.haskell.org/ghc/ghc/-/issues/19820#note_364086.

One thing I didn't dig into is the root cause of Demand Analysis being universally worse across the packages I tested. If it weren't for that effect, these IntMaps would be a pretty clear win.

@treeowl
Contributor

treeowl commented Jul 26, 2021 via email

@jwaldmann
Contributor

Publishing it in a separate library

I'd welcome that. If there's a trade-off between these implementations (is it efficient single-element operations vs. efficient bulk operations?), then, as a user, I'd like to have that choice.

@Boarders
Contributor

Boarders commented Sep 9, 2021

What is the status on this branch, is there any work we can do to get it towards merge status? Are there functions left to optimize?

@sjakobi
Member

sjakobi commented Sep 15, 2021

What is the status on this branch, is there any work we can do to get it towards merge status? Are there functions left to optimize?

IMHO the best thing that can be done for this new IntMap implementation is to extract it into a new library and upload it to Hackage. This would allow interested people to easily use it in their own code base and also try their hand at improving the implementation further.

The main reasons why I don't think that this work should be merged (yet) are:

  1. It does not seem to speed up GHC in general. (And it would be very interesting to figure out why!) This might be related to operations such as union and intersection being slower than in the current IntMap implementation.
  2. I perceive the implementation to be extremely complex and verbose, and am therefore worried about maintainability issues down the road. I think the most interesting improvement to the implementation would be a way to get rid of the mutually recursive L and R variants of many internal functions.
