The author ( full paper ) sought to establish a micro benchmark comparison between simple java programs and their implementations in Clojure. From there, a sample-based performance profiling methodology was applied to allow for JIT warm up and establish well founded statistical measures of each implementation. Per the author:
“The idea behind this approach was that seeing the way the languages perform in common fundamental tasks would give the reader an idea of how the languages will perform in their application. The reason that the fundamental areas selected were separated into their own experiments rather than putting them all into the same program, was so that the reader could more easily predict which language is better for their specific tasks.”
“The Java version that was used to execute both the Clojure and the Java code was 1.8.0_60. The JVM was run with the arguments –Xmx11g and -Xss11g to increase the max heap and stack space to 11 gigabytes when needed for the experiments.”
After reading a post about this paper on r/clojure on reddit, I decided to walk through the author’s code and offer some insights for the Clojure community and perhaps the java community regarding the efficiency of the micro benchmarks.
Where possible, I will delineate apples:apples and apples:oranges during the walk through, as well as detail areas where I had to fill in the gaps from the paper with my own hopefully correct implementation. For purposes of consistency and reproducibility, I inline all of my measurements in the source code, using the criterium library to benchmark both the clojure and java implementations. I also caveat the results in that I did not explore the same design space as the author with regard to experimental parameters (e.g. recursion count, collection size, tree depth, etc.). Rather, I stuck with relatively small collections amenable to quick iterative benchmarking and experimentation. I believe the resulting performance improvements will hold up at scale though (based on experience).
All noted measures are on a similar platform as the original paper, namely OpenJDK 1.8. Java measures are provided as a baseline for comparison to nullify the effects of different hardware.
“The recursion experiment consisted of a number of recursion calls with only a counter as a parameter and a simple exit condition. It was designed to test the performance of function calls in the two languages. The counter was a primitive integer in both languages and was decreased by one for each recursive call.”
Execution times were measured for problem sizes of 2000, 20000, 200000, 2000000 and 20000000,””
private void Recurse(int cnt)
{if (cnt > 0)
Recurse (cnt - 1);}
performancepaper.core> (c/quick-bench (bench.core/Recurse 10))
Evaluation count : 49359366 in 6 samples of 8226561 calls.
Execution time mean : 10.123866 ns
Execution time std-deviation : 0.199596 ns
Execution time lower quantile : 9.852564 ns ( 2.5%)
Execution time upper quantile : 10.309200 ns (97.5%)
Overhead used : 1.797578 ns
21x slower
(defn pure-recursion [cnt]
(if (> cnt 0)
(pure-recursion
(- cnt 1))))
performancepaper.core> (c/quick-bench (pure-recursion 10))
Evaluation count : 2776386 in 6 samples of 462731 calls.
Execution time mean : 217.748915 ns
Execution time std-deviation : 3.708932 ns
Execution time lower quantile : 213.224904 ns ( 2.5%)
Execution time upper quantile : 221.431481 ns (97.5%)
Overhead used : 1.804565 ns
Off the bat, we are in pretty bad spot. Thankfully this can be heavily mitigated
by using unchecked math and a type hint ( with-unchecked
has the effect of
setting *unchecked-math*
to true for the scope of the body, then reverting to
false
after evaluation).
(with-unchecked
(defn pure-recursion2 [^long cnt]
(if (pos? cnt)
(pure-recursion2 (dec cnt)))))
Evaluation count : 34678608 in 6 samples of 5779768 calls.
Execution time mean : 15.723221 ns
Execution time std-deviation : 0.156759 ns
Execution time lower quantile : 15.545890 ns ( 2.5%)
Execution time upper quantile : 15.907675 ns (97.5%)
Overhead used : 1.804565 ns
That gets us to within 1.5x.
We can still do better, although it may be argued over whether
going this route deviates from the author’s original intent.
Since the original intent was to measure function call overhead,
we can actually leverage a language feature that clojure provides
to eliminate that overhead completely. Where there was naive recursion
in the original, we can just use the recur
form to semantically
re-enter the function with new arguments, while operationally the
clojure compiler optimizes this to a loop:
(defn pure-recursion4 [^long cnt]
(if (> cnt 0)
(recur (dec cnt))))
performancepaper.core> (c/quick-bench (pure-recursion4 10))
Evaluation count : 68567172 in 6 samples of 11427862 calls.
Execution time mean : 6.972233 ns
Execution time std-deviation : 0.059662 ns
Execution time lower quantile : 6.887818 ns ( 2.5%)
Execution time upper quantile : 7.030860 ns (97.5%)
Overhead used : 1.797578 ns
nil
We are now 0.697x of the original java runtime, so faster. We’re also somewhat
cheating at the machine level, but at the language level, recur
(in my
opinion) is fair game to avoid function call overhead, which java can’t do.
“The sorting experiment consisted of sorting a collection of integers. In Clojure this was done by sorting a list of integers, shuffled by the shuffle function, using the sort function, all of which are included in the clojure.core library. In Java this was done similarly by sorting an array of primitive integers, which was shuffled using java.util.Collections.shuffle, using the Arrays.sort function.
Execution times were measured for collections with 2000, 20000, 200000, 2000000 and 20000000 integers.”
private int[] createArray (int size)
{int counter = Integer.MIN_VALUE;
ArrayList <Integer> arrList= new ArrayList <Integer>(size) ;
for(int i = 0; i < size ; ++ i)
arrList.add (counter ++);
java.util.Collections.shuffle(arrList);
int[] retArr = new int[size] ;
for(int i = 0; i < size ; ++ i )
retArr [i] = arrList.get(i);
return retArr;}
Arrays.sort(array) ;
performancepaper.core> (c/quick-bench (core/createArray 100))
Evaluation count : 138942 in 6 samples of 23157 calls.
Execution time mean : 4.369374 µs
Execution time std-deviation : 63.001723 ns
Execution time lower quantile : 4.310739 µs ( 2.5%)
Execution time upper quantile : 4.467841 µs (97.5%)
Overhead used : 1.797578 ns
Clojure implemention underspecified
(let [list (−> (create−list size (atom Integer/MIN_VALUE))
(shuffle))]
...) author elides this, and `create-list` is not provided.
(sort list)
Since the original paper elided the exact source code for the clojure implementation, I filled in the rest to maintain a bit of consistency with what was provided and the java implementation:
(defn create-sorted-array [n]
(->> (range Integer/MIN_VALUE 0 1)
(take n)
shuffle
sort))
performancepaper.core> (c/quick-bench (create-sorted-array 100))
Evaluation count : 17532 in 6 samples of 2922 calls.
Execution time mean : 34.841374 µs
Execution time std-deviation : 549.515702 ns
Execution time lower quantile : 34.210927 µs ( 2.5%)
Execution time upper quantile : 35.646224 µs (97.5%)
Overhead used : 1.804565 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
As a starting point, we are roughly 8x slower than the java implementation. We can improve this to 3x and stay within Clojure idioms though. One thing to target is to avoid creating copies of stuff; since we are producing a sorted array using an intermediate ArrayList, we can bypass clojure.core/shuffle since it creates an intermediate clojure vector we don’t need:
(defn create-sorted-array2 [^long n]
(let [^ArrayList alist
(->> (range Integer/MIN_VALUE 0 1)
(transduce (take n)
(completing (fn [^ArrayList acc n]
(doto acc (.add n))))
(java.util.ArrayList. n)))
_ (java.util.Collections/shuffle alist)]
(doto (int-array alist) Arrays/sort)))
Evaluation count : 46506 in 6 samples of 7751 calls.
Execution time mean : 12.985146 µs
Execution time std-deviation : 570.944434 ns
Execution time lower quantile : 12.451225 µs ( 2.5%)
Execution time upper quantile : 13.917159 µs (97.5%)
Overhead used : 1.800162 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
nil
We still incur overhead in a couple of places, namely
transduce has some checking inside it’s internal loop,
and coercing the ArrayList into a seq for int-array
is substantially slower than iterating the ArrayList and
updating a pre-allocated int-array, as java does. Using
more interop, we get to 1.07x, slightly slower but not bad:
(with-unchecked
(defn create-sorted-array3 [^long size]
(let [^ArrayList alist
(loop [^ArrayList acc (java.util.ArrayList. size)
counter (int Integer/MIN_VALUE)
n 0]
(if (< n size)
(let [c (inc counter)]
(recur (doto acc (.add c))
c
(inc n)))
acc))
_ (Collections/shuffle alist)
res (int-array size)]
(dotimes [i size] (aset res i ^int (.get alist i)))
(doto res Arrays/sort))))
performancepaper.core> (c/quick-bench (create-sorted-array3 100))
Evaluation count : 130794 in 6 samples of 21799 calls.
Execution time mean : 4.669894 µs
Execution time std-deviation : 179.454425 ns
Execution time lower quantile : 4.477268 µs ( 2.5%)
Execution time upper quantile : 4.902860 µs (97.5%)
Overhead used : 1.800162 ns
“The map creation experiment consisted of adding integers as keys and values to a map. In Java they were added to a HashMapfrom thejava.util library, and in Clojure they were added to the built-in persistent map data structure.
Execution times were measured for20000, 63246, 200000, 632456 and 2000000 different key-value pairs.”
private HashMap<Integer , Integer> createMap (int sze)
{HashMap<Integer , Integer> retMap= new HashMap<Integer , Integer>(sze) ;
for (int i = 0; i < sze ;)
retMap.put(i , ++ i ) ;
return retMap ;}
(c/quick-bench (bench.core/createMap 100))
Evaluation count : 538998 in 6 samples of 89833 calls.
Execution time mean : 1.178573 µs
Execution time std-deviation : 40.404054 ns
Execution time lower quantile : 1.142367 µs ( 2.5%)
Execution time upper quantile : 1.237344 µs (97.5%)
Overhead used : 1.800162 ns
We are comparing a java program that builds a mutable hashmap via tight loop iteration against a clojure program that uses a transient clojure hashmap to build and the coerce into a persistent clojure map.
(defn create-map [size]
(loop [map (transient {}),
i (int size)]
(if (> i 0)
(recur (assoc! map i (+ i 1)) (- i 1) )
(persistent! map))))
Evaluation count : 61686 in 6 samples of 10281 calls.
Execution time mean : 9.874480 µs
Execution time std-deviation : 96.973621 ns
Execution time lower quantile : 9.750675 µs ( 2.5%)
Execution time upper quantile : 9.964194 µs (97.5%)
Overhead used : 1.800162 ns
Our baseline is ~9x slower, despite the use of transients. We may try to leverage unchecked math as before, and direct method invocation to make things a tad more efficient:
(with-unchecked
(defn create-map2 [size]
(loop [^clojure.lang.ITransientAssociative
map (transient {}),
i (int size)]
(if (> i 0)
(recur (.assoc map i (+ i 1))
(- i 1))
(persistent! map)))))
performancepaper.core> (c/quick-bench (create-map2 100))
Evaluation count : 61260 in 6 samples of 10210 calls.
Execution time mean : 9.576160 µs
Execution time std-deviation : 147.638187 ns
Execution time lower quantile : 9.392887 µs ( 2.5%)
Execution time upper quantile : 9.723504 µs (97.5%)
Overhead used : 1.804565 ns
Looks like not much change; still around 9x slower. It seems that the cost of building and coercing a transient map is still substantially outweighed by a pure mutable java hashmap that pays no coercion cost. Thankfully, we can just use java hashmaps from clojure via interop:
(with-unchecked
(defn create-map3 [^ long size]
(let [^java.util.HashMap map (java.util.HashMap. size)]
(dotimes [i size]
(.put map i (+ i 1))))))
performancepaper.core> (c/quick-bench (create-map3 100))
Evaluation count : 487116 in 6 samples of 81186 calls.
Execution time mean : 1.229078 µs
Execution time std-deviation : 30.572826 ns
Execution time lower quantile : 1.191533 µs ( 2.5%)
Execution time upper quantile : 1.268660 µs (97.5%)
Overhead used : 1.804565 ns
Leveraging interop leaves us 1.04x, slower but perhaps that’s within the margins.
“The object creation experiment consisted of creating a linked list without values. In Java a custom class was used to create the links while in Clojure nested persistent maps were used. The links were created backwards in both languages, meaning that the first object created would have a next-pointer with a null value, and the second object created would point to the first, and so on.
Execution times were measured for 100000, 316228, 1000000, 3162278 and 10000000 linked objects”
private class LLNode{
public LLNode next ;
public LLNode (LLNode next ){
this.next = next ;}
private LLNode create Objects (int count )
{LLNode last = null ;
for (int i = 0; i < count; ++ i)
last = new LLNode(last) ;
return last;}
performancepaper.core> (c/quick-bench (bench.core/createObjects 100))
Evaluation count : 2368566 in 6 samples of 394761 calls.
Execution time mean : 249.927510 ns
Execution time std-deviation : 4.557640 ns
Execution time lower quantile : 244.464795 ns ( 2.5%)
Execution time upper quantile : 254.444188 ns (97.5%)
Overhead used : 1.800162 ns
(defn create-objects [count]
(loop [last nil
i (int count)]
(if (= 0 i )
last
(recur {:next last} (- i 1)))))
Evaluation count : 916590 in 6 samples of 152765 calls.
Execution time mean : 673.619823 ns
Execution time std-deviation : 26.588156 ns
Execution time lower quantile : 647.556044 ns ( 2.5%)
Execution time upper quantile : 701.464334 ns (97.5%)
Overhead used : 1.800162 ns
Our baseline implementation compares a java class-based implementation to a clojure hash-map based one. Notably unlike the java implementation, the hashmap must pay a key lookup cost to access fields, and has a higher construction/allocation cost as opposed to a simple class constructor with fixed fields (LLNode). Clojure starts off about 2.7x slower.
Allocations are hurting us here, as well as array-map instantation. We’re on a slow path compared to java. We can add unchecked math, and get some marginal gains,
(with-unchecked
(defn create-objects2 [count]
(loop [last nil
i (int count)]
(if (== i 0)
last
(recur {:next last} (- i 1))))))
Evaluation count : 933462 in 6 samples of 155577 calls.
Execution time mean : 646.923626 ns
Execution time std-deviation : 11.946099 ns
Execution time lower quantile : 634.453274 ns ( 2.5%)
Execution time upper quantile : 664.344180 ns (97.5%)
Overhead used : 1.800162 ns
but the real target is to get a simpler container that’s easy to construct.
Records are faster to construct, but they implement a bunch of stuff and carry more state, so there is more setup. Still they are very much faster to create when you have fixed fields, like the node class.
(defrecord ll-node [next])
(defn create-objects3 [count]
(loop [last nil
i (int count)]
(if (== i 0)
last
(recur (ll-node. last) (- i 1)))))
Evaluation count : 1699422 in 6 samples of 283237 calls.
Execution time mean : 348.583970 ns
Execution time std-deviation : 6.587955 ns
Execution time lower quantile : 337.022098 ns ( 2.5%)
Execution time upper quantile : 354.655388 ns (97.5%)
Overhead used : 1.800162 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
Record-based is now 1.39x slower; getting close. As it turns out, types have less to setup, very barebones like the node class.
(deftype ll-node-type [next])
(with-unchecked
(defn create-objects5 [^long count]
(loop [last nil
i count]
(if (== i 0)
last
(recur (ll-node-type. last) (dec i))))))
Evaluation count : 2440158 in 6 samples of 406693 calls.
Execution time mean : 249.399392 ns
Execution time std-deviation : 5.009429 ns
Execution time lower quantile : 244.748218 ns ( 2.5%)
Execution time upper quantile : 256.732288 ns (97.5%)
Overhead used : 1.800162 ns
With a barebones class equivalent and direct field access, we get ~1x, pretty much identical to java now, with very similar code.
“The binary tree DFS experiment consisted of searching a binary tree for a value it did not contain using depth first search. The depth first search was implemented recursively in both languages. In Java the binary tree was represented by a custom class while in Clojure they were represented using nested persistent maps.”
We have a similar situation with the object creation in 3.4 here, where the clojure solution is implemented on top of generic hashmaps, while the java implementation leverages classes and field access. Persistent hashmaps should have a bit higher instantiation and key lookup cost compared to raw classes.
public BinaryTreeNode createBinaryTree (int depth, int[] counter)
{if (depth == 0) return null;
int value = counter[0]++;
BinaryTreeNode btn = new BinaryTreeNode(value);
btn.left = createBinaryTree(depth - 1, counter) ;
btn.right = createBinaryTree(depth - 1 , counter) ;
return btn ;}
public boolean binaryTreeDFS(BinaryTreeNode root, int target)
{if (root == null) return false ;
return root.value == target ||
binaryTreeDFS(root.left, target) ||
binaryTreeDFS (root.right, target);}
//Added by joinr
public boolean binaryTreeDFSTest(int depth, int target)
{
int[] counter = new int[1];
counter[0] = 0;
return binaryTreeBFS(createBinaryTree(depth,counter),target);
}
performancepaper.core> (c/quick-bench (bench.core/binaryTreeDFSTest 7 126))
Evaluation count : 643680 in 6 samples of 107280 calls.
Execution time mean : 900.028340 ns
Execution time std-deviation : 25.156556 ns
Execution time lower quantile : 873.937425 ns ( 2.5%)
Execution time upper quantile : 927.532690 ns (97.5%)
Overhead used : 1.804565 ns
(defn create-binary-tree [depth counter−atom]
(when (> depth 0)
(let [val @counter−atom]
(swap! counter−atom inc )
{:value val
:left (create−binary−tree (- depth 1) counter−atom )
:right (create−binary−tree (- depth 1) counter−atom )})))
(defn binary-tree-DFS [root target]
(if (nil? root)
false
(or (= (:value root) target)
(binary-tree-DFS (:left root) target)
(binary-tree-DFS (:right root) target))))
(defn binary-tree-DFS-test [depth target]
(binary-tree-DFS (create-binary-tree depth (atom 0)) 126))
Evaluation count : 46068 in 6 samples of 7678 calls.
Execution time mean : 12.656700 µs
Execution time std-deviation : 244.046759 ns
Execution time lower quantile : 12.465987 µs ( 2.5%)
Execution time upper quantile : 13.059028 µs (97.5%)
Overhead used : 1.804565 ns
We start at 14x slower, although there is a lot of incidental overhead to explore:
- keyword access,
- map allocation,
- recursion,
- using atom as a mutable numeric counter,
- boxed numeric comparisons
with potentially lots of room to improve.
(with-unchecked
(defn create-binary-tree2 [^long depth counter-atom]
(when (> depth 0)
(let [val @counter-atom]
(swap! counter-atom inc)
{:value val
:left (create−binary−tree (- depth 1) counter-atom)
:right (create−binary−tree (- depth 1) counter-atom)}))))
(defn binary-tree-DFS2 [root ^long target]
(if (nil? root)
false
(or (== (root :value) target)
(binary-tree-DFS2 (root :left) target)
(binary-tree-DFS2 (root :right) target))))
(defn binary-tree-DFS-test2 [depth target]
(binary-tree-DFS2 (create-binary-tree2 depth (atom 0)) 126))
Evaluation count : 54552 in 6 samples of 9092 calls.
Execution time mean : 11.121393 µs
Execution time std-deviation : 168.460662 ns
Execution time lower quantile : 10.878002 µs ( 2.5%)
Execution time upper quantile : 11.307488 µs (97.5%)
Overhead used : 1.804565 ns
Found 2 outliers in 6 samples (33.3333 %)
low-severe 1 (16.6667 %)
low-mild 1 (16.6667 %)
Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
At 12.35x, unboxed numerics and faster keyword access help a bit, but they are not the choke point. We are still allocating though, so building the tree is probably the slow point.
As before, we know that types are barebones classes. Direct class instantiation is faster than map creation, and direct field access is faster than key lookup. We can also probably gain a bit of speed by looking at our counter, switching from an atom to a volatile for perhaps a little gain:
(deftype binary-node [^int value left right])
(with-unchecked
(defn create-binary-tree3 [^long depth counter-atom]
(when (> depth 0)
(let [^long val @counter-atom]
(vreset! counter-atom (inc val))
(binary-node. val
(create-binary-tree3 (- depth 1) counter-atom)
(create-binary-tree3 (- depth 1) counter-atom))))))
(defn binary-tree-DFS3 [^binary-node root ^long target]
(if (nil? root)
false
(or (== (.value root) target)
(binary-tree-DFS3 (.left root) target)
(binary-tree-DFS3 (.right root) target))))
(defn binary-tree-DFS-test3 [depth target]
(binary-tree-DFS3 (create-binary-tree3 depth (volatile! 0)) 126))
Evaluation count : 222192 in 6 samples of 37032 calls.
Execution time mean : 2.665373 µs
Execution time std-deviation : 69.489473 ns
Execution time lower quantile : 2.580740 µs ( 2.5%)
Execution time upper quantile : 2.737338 µs (97.5%)
Overhead used : 1.804565 ns
So that leaves 2.96x; using a custom type and a volatile as a mutable counter gets us much closer. One difference with the java implementation is the use of the counter; it’s a primitive int array leading to unboxed operations and primitive math. Our counter (either an atom or a volatile) has a tad bit of overhead compared to mutating a primitive array. Let’s copy the java implementation and use an array:
(with-unchecked
(defn create-binary-tree4 [^long depth ^ints counter]
(when (> depth 0)
(let [val (aget counter 0)]
(aset counter 0 (inc val))
(binary-node. val
(create-binary-tree4 (- depth 1) counter)
(create-binary-tree4 (- depth 1) counter))))))
(defn binary-tree-DFS4 [^binary-node root ^long target]
(if root
(or (== (.value root) target)
(binary-tree-DFS4 (.left root) target)
(binary-tree-DFS4 (.right root) target))
false))
(defn binary-tree-DFS-test4 [depth target]
(binary-tree-DFS4 (create-binary-tree4 depth (doto (int-array 1) (aset 0 1))) 126))
Evaluation count : 524934 in 6 samples of 87489 calls.
Execution time mean : 1.158351 µs
Execution time std-deviation : 46.874432 ns
Execution time lower quantile : 1.116454 µs ( 2.5%)
Execution time upper quantile : 1.222972 µs (97.5%)
Overhead used : 1.804565 ns
That leaves us with 1.27x, and like the java version, we use a mutable int array as a counter to save time on boxing with the volatile. There are perhaps more non-obvious optimizations, but I’m ending these for now since we’re still relatively high up and fairly idiomatic.
“The binary tree BFS, similar to the binary tree DFS experiment consisted of searching a binary tree for a value it did not contain, but using breadth first search. The breadth first search was implemented iteratively in both languages.In Java the binary tree was represented by a custom class while in Clojure they were represented using nested persistent maps.”
public boolean binaryTreeBFS(BinaryTreeNode root, int target)
{Queue<BinaryTreeNode >queue= new LinkedList <BinaryTreeNode>() ;
queue.add(root) ;
while (! queue.isEmpty())
{BinaryTreeNode item = queue.poll();
if (item.value == target) return true;
if (item.left != null) queue.add (item.left);
if (item.right != null) queue.add (item.right);}
return false;}
//Added by joinr
public boolean binaryTreeBFSTest(int depth, int target)
{
int[] counter = new int[1];
counter[0] = 0;
return binaryTreeBFS(createBinaryTree(depth,counter),target);
}
Here we are comparing a java implementation - based on a mutable queue (based on a doubly linked list) for the search fringe - against a clojure implementation that uses a persistent queue.
performancepaper.core> (c/quick-bench (bench.core/binaryTreeBFSTest 7 126))
Evaluation count : 465144 in 6 samples of 77524 calls.
Execution time mean : 1.325622 µs
Execution time std-deviation : 31.643248 ns
Execution time lower quantile : 1.301545 µs ( 2.5%)
Execution time upper quantile : 1.376586 µs (97.5%)
Overhead used : 1.804565 ns
(defn binary-tree-BFS [root target]
(loop [queue (conj clojure.lang.PersistentQueue/EMPTY root)]
(if (empty? queue)
false
(let [item (peek queue)]
(if (= target (:value item))
true
(recur (as-> (pop queue) $
(if (nil? (:left item))
$
(conj $ (:left item)))
(if (nil? (:right item))
$
(conj $ (:right item))))))))))
(defn binary-tree-BFS-test [depth tgt]
(binary-tree-BFS (create-binary-tree depth (atom 0)) 126))
performancepaper.core> (c/quick-bench (binary-tree-BFS-test 7 126))
Evaluation count : 23448 in 6 samples of 3908 calls.
Execution time mean : 27.534318 µs
Execution time std-deviation : 3.168409 µs
Execution time lower quantile : 25.831461 µs ( 2.5%)
Execution time upper quantile : 32.973576 µs (97.5%)
Overhead used : 1.804565 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 31.1481 % Variance is moderately inflated by outliers
As expected, the map-based, persistent queued clojure implementation is 20.8x
slower than the java implementation that stores information in plain classes and
uses a mutable queue. Let’s apply the lessons from BFS and use
our deftype
based nodes to build the tree, then search it:
(defn binary-tree-BFS-test2 [depth tgt]
(binary-tree-BFS (create-binary-tree4 depth (doto (int-array 1) (aset 0 0))) 126))
performancepaper.core> (c/quick-bench (binary-tree-BFS-test2 7 126))
Evaluation count : 509616 in 6 samples of 84936 calls.
Execution time mean : 1.221056 µs
Execution time std-deviation : 28.469631 ns
Execution time lower quantile : 1.193429 µs ( 2.5%)
Execution time upper quantile : 1.257014 µs (97.5%)
Overhead used : 1.804565 ns
We end up at 0.89x, which is surprisingly a bit faster. I would naively expect mutable implementations to have a 2-4x edge in most cases, but we may have a niche for the persistent queue here.
I ran through basic optimization/idiomatic stuff to explore each benchmark using criterium to compare the java implementation and the clojure ones.
I started with the original implementations from the paper, then adding derivative versions suffixed by N, e.g. some-fn, some-fn2, some-fn3, etc.
The goal here was to provide a layered approach to showing the impact of certain stuff. In almost all cases (except for the BFS test, which I don’t understand the performance yields), we see a typical pattern:
- the clojure implementation starts off about 10x worse or more,
- then you get some immediate gains with low-hanging optimizations,
- then eventually converge on typed java interop in the limit to get either
- equivalent performance, within some percentage (like 18% or less),
- or better in a few cases.
The BFS stuff in clojure was surprisingly a bit better using a persistent queue with similar optimization from the DFS, which is interesting since I would “imagine” that the mutable queue implementation in the jvm version would have an advantage.
Other than that, the other bench marks are predictable (from an experiental perspective).
I guess the real interest is comparing apples:apples in such microbenchmarks. The evolutionary pattern of optimization was to start with perhaps intentionally naive clojure implementations - which leverage persistent structures, boxed math, and perhaps a bit of overhead compared to their statically typed java counterparts - and then gradually morph toward something closer to the host (java) to level the playing field. We add hints and primitive math, leverage efficient class-based field access and instantiation, and where necessary, direct java interop to compete with java.
I’d like to address some points made by the author:
Like any good researcher, the author addresses some possible criticisms openly:
- “All of the code tested was implemented by the researcher and it might not be optimal for some experiments, meaning that there might exist faster solutions.”
I think we have demonstrated that this is the case for the sample code; although I’m not entirely certain if the code in this repository is admissible under potentially unseen criterion in the original paper. If there are no constraints placed on Clojure, we can typically get at Java performance given the tight level of host interop (as well as more esoteric techniques like runtime bytecode gen via asm and similar libraries). I think the raw java implementations still edge out clojure in like-for-like cases (e.g. primitive math and mutable collections), but the margins are certainly far less than the range demonstrated in the paper (at least for the subset of testing performed here).
- “This work is intended for private persons and companies to use when evaluating which language to use for their programming projects. This saves time and potentially money for the readers, benefiting the society’s economic sustainability positively, albeit very little.”
- “These results strongly suggest that the use of Clojure over Java comes with a cost of both startup and runtime performance.”
I hope to provide - if not additional context for pedagogical reasons - a bit of a counterpoint to the observations in the paper.