More efficient object finalising implementation #1638

dipinhora · 2017-03-06T21:40:04Z

This PR includes 2 changes

Ponybench arguments + finaliser benchmarks
This commit enhances ponybench to use options to take
optional command line arguments to change autobench_time
and autobench_max_ops dynamically at runtime.
This commit also adds benchmarks for objects with and without
finalisers to minimal-cates/finalisers.
More efficient object finalising implementation
Thanks to @sylvanc and @aturley for their ideas.
- The old finaliser implementation used the object hashmap to keep
  track of finalisers that needed to be run. This was not ideal
  because while the hashmap provides constant time operations,
  the constant time was still much larger than the time for a normal
  no finaliser allocation. Additionally, keeping track of finalisers
  in the object hashmap meant that every object with a finaliser
  would be added to the object hashmap even if it was only transient
  and never sent to another actor. This, once again, was different
  from normal allocations where the objects wouldn't be added to
  the hashmap until they were sent to another actor. The benchmark
  using ponybench showed that objects with finalisers were about
  1 order of magnitude slower than objects without finalisers due
  to the overhead of using the object hashmap for tracking them.
- The new finaliser implementation keeps the finaliser information
  in the chunk where the memory was allocated from. This is exactly
  the same as how non-finaliser allocations are tracked except for
  the additional work to keep track of the finaliser. The resulting
  benefit is that objects with finalisers will only get added to the
  object hashmap under the same circumstances as objects without
  finalisers. This gives us an increase in performance by 1 order of
  magnitude so that now objects with finalisers have the same allocation
  performance as objects without finalisers.
- Keep a finaliser bitmap in chunk_t instead of an array of finaliser
  pointers. Run the finaliser from the pony_type_t->final_fn instead
  of storing/using the function passed in to pony_alloc_final.
- Add pony_alloc_small_final and pony_alloc_large_final functions
  to avoid having to go through a branch and another function call
  to allocate memory with a finaliser.
- Update compiler to call the appropriate one of pony_alloc_small_final
  or pony_alloc_large_final instead of pony_alloc_final.
Future work:
- It should be possible to remove pony_alloc_*_final finctions altogether and
  update pony_alloc, pony_alloc_small, pony_alloc_large to take
  a boolean as to whether a finaliser exists for the type or not.
  This would also require changes to the compiler to generate the
  appropriate boolean true/false for when a finaliser exists or not.

Benchmarks before:

vagrant@buffy-leader-1:~/hp2$ ./finalisers --autobench_time 100_000_000
primitive init
nofinal	  20000000	        11 ns/op
final	   2000000	       125 ns/op
actor final
embed final
class final
primitive final

Benchmarks after:

vagrant@buffy-leader-1:~/hp2$ sudo chrt -f 80 ./finalisers --autobench_time 100_000_000
primitive init
nofinal	  10000000	        12 ns/op
final	  10000000	        13 ns/op
actor final
embed final
class final
primitive final

Please let me know if there are any questions or concerns or if any of the changes need to go through an RFC.

Also, as always, feedback/suggestions for improvements welcome.

Praetonus

We recently switched from the C standard library assert to our custom pony_assert. I've left comments on the asserts that need to be changed.

Praetonus · 2017-03-06T22:04:33Z

src/libponyrt/mem/heap.c

+    p = chunk->m + (bit << HEAP_MINBITS);
+
+    // run finaliser
+    assert((*(pony_type_t**)p)->final != NULL);


This should be pony_assert.

@Praetonus thanks for catching these. I should have caught that but lost track of it due to copy/paste from the sendence ponyc fork.

I'll fix these and the CI failures.

Praetonus · 2017-03-06T22:05:07Z

src/libponyrt/mem/heap.c

+    p = chunk->m + (bit << HEAP_MINBITS);
+
+    // run finaliser
+    assert((*(pony_type_t**)p)->final != NULL);


This should be pony_assert.

jemc · 2017-03-06T23:46:49Z

I think this should be split into two PRs, especially since the changes don't seem to have a strict dependency relationship to eachother (for example, the finaliser change doesn't seem to strictly depend on having more customizable benchmark facilities, even though it did help you to confirm the approach).

I'm also inclined to say the ponybench changes should go through the RFC process to get more feedback about the design, but maybe somebody else will disagree with me about that and we can discuss.

dipinhora · 2017-03-07T00:23:18Z

@jemc I'm okay with splitting this PR into two PRs.

I only made the ponybench changes because the process got OOMKilled otherwise:

vagrant@buffy-leader-1:~/hp2$ ./finalisers
primitive init
Killed

I modeled the ponybench changes based on how ponytest works (except that I used options which ponytest explicitly avoids).

If the ponybench changes do go through an RFC, it'd be a good idea to discuss whether having ponybench run all iterations in a single behavior is desired or not (or if it should be dynamically choosable as a runtime option). The current design of having all iterations run in a single behavior is why the finalisers benchmark was getting OOMKilled, however, I do understand that running iterations in multiple behavior calls will result in the added overhead of GC and contention from other actors so it's not a simple decision.

Regardless, I'll split this PR and/or open the RFC once other folks have had a chance to weigh in (no point in opening another PR if the decision is to go through an RFC first).

jemc · 2017-03-07T01:18:22Z

If the ponybench changes do go through an RFC, it'd be a good idea to discuss whether having ponybench run all iterations in a single behavior is desired or not (or if it should be dynamically choosable as a runtime option).

Yeah, I agree that sounds to me like the good topic of an RFC discussion.

dipinhora · 2017-03-08T16:47:25Z

@jemc I've updated this PR to remove the ponybench changes so they can be managed separately via another PR or an RFC.

dipinhora · 2017-03-09T13:33:42Z

@Praetonus @jemc Are there any other comments, questions or changes for this PR before it's merged?

I'll try and get the ponybench RFC started sometime in the next few days.

Praetonus · 2017-03-09T17:20:26Z

I'm not sure about the benchmark added to minimal-cases/finalisers. If I recall correctly, the minimal cases directory was added to contain regression tests that couldn't be added as compiler unit tests because we couldn't yet run a Pony program inside of a compiler unit test. Now that we can do that, the minimal cases should progressively move to the compiler test suite (and I've already started that with finalisers in #1629), so I don't think we should add more things in the files. Is there another place to put the benchmark?

dipinhora · 2017-03-09T17:45:16Z

@Praetonus I understand your concern and agree that there should be another place for the benchmarks. However, the only other Pony based benchmarks in the repo are part of stdlib for the persistent maps in packages/collections/persistent/benchmarks/.

Any suggestions for where to put this benchmark since it's not testing a specific stdlib package?

Praetonus · 2017-03-09T18:06:15Z

Maybe it could be implemented in C and added to benchmark/libponyrt?

dipinhora · 2017-03-10T14:08:31Z

@Praetonus I've updated the PR to remove the changes to minimal-cases/finalisers so that this PR can be merged without a benchmark.

I'll work on creating a benchmark in C that exercises the same functionality to add into benchmark/libponyrt and submit that as another PR when it is ready.

sylvanc · 2017-03-10T17:28:51Z

Sorry to be a pain, @dipinhora , but I think another set of changes is needed. In heap.c, you are passing a has_finaliser flag to several methods. This results in an extra test and branch instruction on every allocation.

However, whether a finaliser is present or not is known statically when the memory allocation occurs, so we don't need to pay this cost.

If we separate out functions that set the finalise bit from those that don't in heap.c, we can statically call the right one, and eliminate the branch during allocation.

sylvanc · 2017-03-10T17:29:58Z

src/libponyrt/mem/heap.c

@@ -225,6 +315,10 @@ void* ponyint_heap_alloc_small(pony_actor_t* actor, heap_t* heap,
    m = chunk->m + (bit << HEAP_MINBITS);
    chunk->slots = slots;

+    // note that a finaliser needs to run
+    if(has_finaliser)


This branch impacts all allocations adversely.

sylvanc · 2017-03-10T17:30:12Z

src/libponyrt/mem/heap.c

@@ -237,6 +331,12 @@ void* ponyint_heap_alloc_small(pony_actor_t* actor, heap_t* heap,
    n->m = (char*) POOL_ALLOC(block_t);
    n->size = sizeclass;

+    // note that a finaliser needs to run
+    if(has_finaliser)


And this one too.

sylvanc · 2017-03-10T17:31:01Z

src/libponyrt/mem/heap.c

@@ -265,6 +366,12 @@ void* ponyint_heap_alloc_large(pony_actor_t* actor, heap_t* heap, size_t size)
  chunk->slots = 0;
  chunk->shallow = 0;

+  // note that a finaliser needs to run
+  if(has_finaliser)


And this one for large allocations.

dipinhora · 2017-03-10T18:03:54Z

@sylvanc No problem at all and good catch on the extra/unnecessary branch. I didn't even notice it. 8*/

I've eliminated branch by converting the has_finaliser to a uint32_t which should turn true to 1 and false to 0 since we're using C99 stdbool.h.

Please let me know if you find anything else that needs fixing.

sylvanc · 2017-03-11T14:24:11Z

Ah I see what you did there :) But what I meant was, just like how in pony.h we have separate entry points for allocators with finalisers and without, we should maintain that distinction in heap.c. That avoids paying any cost, even bit shifts, for allocations of objects that have no finalisers. As such, there is no performance drop (not even a tiny one) on alloc compared to the old code.

Thanks to Sylvan and Andy Turley for their ideas. * The old finaliser implementation used the object hashmap to keep track of finalisers that needed to be run. This was not ideal because while the hashmap provides constant time operations, the constant time was still much larger than the time for a normal no finaliser allocation. Additionally, keeping track of finalisers in the object hashmap meant that every object with a finaliser would be added to the object hashmap even if it was only transient and never sent to another actor. This, once again, was different from normal allocations where the objects wouldn't be added to the hashmap until they were sent to another actor. The benchmark using ponybench showed that objects with finalisers were about 1 order of magnitude slower than objects without finalisers due to the overhead of using the object hashmap for tracking them. * The new finaliser implementation keeps the finaliser information in the chunk where the memory was allocated from. This is exactly the same as how non-finaliser allocations are tracked except for the additional work to keep track of the finaliser. The resulting benefit is that objects with finalisers will only get added to the object hashmap under the same circumstances as objects without finalisers. This gives us an increase in performance by 1 order of magnitude so that now objects with finalisers have the same allocation performance as objects without finalisers. * Keep a finaliser bitmap in chunk_t instead of an array of finaliser pointers. Run the finaliser from the pony_type_t->final_fn instead of storing/using the function passed in to pony_alloc_final. * Add pony_alloc_small_final and pony_alloc_large_final functions to avoid having to go through a branch and another function call to allocate memory with a finaliser. * Update compiler to call the appropriate one of pony_alloc_small_final or pony_alloc_large_final instead of pony_alloc_final.

dipinhora · 2017-03-11T17:17:47Z

@sylvanc changes made to add functions ponyint_heap_alloc_final, ponyint_heap_alloc_small_final and ponyint_heap_alloc_large_final that are a copy of the logic from the original functions. This introduces the maintenance overhead of making sure all core logic changes to ponyint_heap_alloc_small are also made in ponyint_heap_alloc_small_final and vice versa. Please let me know if you had something other than copying the logic in mind.

Also, one minor note, there is still a tiny overhead with the new implementation of alloc (without a finaliser) as compared to the old code because the new code needs to set chunk->finalisers that the old code didn't. I can't think of any way to avoid that without adding in a memset to 0 the new chunk_t first which would also add overhead as compared to the old code.

dipinhora · 2017-03-16T01:35:13Z

@sylvanc Any additional suggestions or comments before this can be merged?

sylvanc · 2017-03-16T08:50:29Z

Nice! Thanks @dipinhora for bearing with all the little changes on this one.

…segfault (#4522) A long time ago, some rando named "Dipin Hora" decided to be clever and rework how objects with finalisers are stored/garbage collected in actor heaps (see: #1638). Unfortunately for us all, he wasn't nearly as clever as he thought and introduced a use after free bug that only occurs when finalisers have logic to reference other objects that might have already been freed and reused during the same garbage collection process. Luckily for us, a smarterer rando who also happens to be named "Dipin Hora" has come along to save the day. This commit adds in tests to reproduce the bug and rework the actor heap garbage collection logic to make sure this issue can no longer occur by making sure that: * finalisers can no longer re-use memory that might be freed earlier in the garbage collection process * heap chunks are only freed/destroyed after all finalisers have been run to ensure all cross object references in finalisers remain valid at the time the finalisers are run

Praetonus suggested changes Mar 6, 2017

View reviewed changes

dipinhora force-pushed the PR_pony_alloc_final branch 2 times, most recently from 5683f1d to 01ee0f2 Compare March 6, 2017 22:52

dipinhora force-pushed the PR_pony_alloc_final branch from 01ee0f2 to 387f086 Compare March 8, 2017 16:45

dipinhora force-pushed the PR_pony_alloc_final branch from 387f086 to e2ae104 Compare March 10, 2017 14:06

Praetonus approved these changes Mar 10, 2017

View reviewed changes

sylvanc suggested changes Mar 10, 2017

View reviewed changes

dipinhora force-pushed the PR_pony_alloc_final branch from e2ae104 to 373daae Compare March 10, 2017 18:00

SeanTAllen force-pushed the master branch 3 times, most recently from 4ccafba to 362d1d0 Compare March 11, 2017 13:53

SeanTAllen force-pushed the master branch 3 times, most recently from 97b1943 to 21d37aa Compare March 11, 2017 14:51

dipinhora force-pushed the PR_pony_alloc_final branch from 373daae to c574c26 Compare March 11, 2017 17:07

sylvanc approved these changes Mar 16, 2017

View reviewed changes

sylvanc merged commit be82960 into ponylang:master Mar 16, 2017

dipinhora mentioned this pull request Oct 7, 2024

Fix use after free bug in actor heap finalisation that can lead to a segfault #4522

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More efficient object finalising implementation #1638

More efficient object finalising implementation #1638

dipinhora commented Mar 6, 2017

Praetonus left a comment

Praetonus Mar 6, 2017

dipinhora Mar 6, 2017

Praetonus Mar 6, 2017

jemc commented Mar 6, 2017 •

edited

Loading

dipinhora commented Mar 7, 2017

jemc commented Mar 7, 2017 •

edited

Loading

dipinhora commented Mar 8, 2017

dipinhora commented Mar 9, 2017

Praetonus commented Mar 9, 2017

dipinhora commented Mar 9, 2017

Praetonus commented Mar 9, 2017

dipinhora commented Mar 10, 2017

sylvanc commented Mar 10, 2017

sylvanc Mar 10, 2017

sylvanc Mar 10, 2017

sylvanc Mar 10, 2017

dipinhora commented Mar 10, 2017

sylvanc commented Mar 11, 2017

dipinhora commented Mar 11, 2017

dipinhora commented Mar 16, 2017

sylvanc commented Mar 16, 2017

More efficient object finalising implementation #1638

More efficient object finalising implementation #1638

Conversation

dipinhora commented Mar 6, 2017

Praetonus left a comment

Choose a reason for hiding this comment

Praetonus Mar 6, 2017

Choose a reason for hiding this comment

dipinhora Mar 6, 2017

Choose a reason for hiding this comment

Praetonus Mar 6, 2017

Choose a reason for hiding this comment

jemc commented Mar 6, 2017 • edited Loading

dipinhora commented Mar 7, 2017

jemc commented Mar 7, 2017 • edited Loading

dipinhora commented Mar 8, 2017

dipinhora commented Mar 9, 2017

Praetonus commented Mar 9, 2017

dipinhora commented Mar 9, 2017

Praetonus commented Mar 9, 2017

dipinhora commented Mar 10, 2017

sylvanc commented Mar 10, 2017

sylvanc Mar 10, 2017

Choose a reason for hiding this comment

sylvanc Mar 10, 2017

Choose a reason for hiding this comment

sylvanc Mar 10, 2017

Choose a reason for hiding this comment

dipinhora commented Mar 10, 2017

sylvanc commented Mar 11, 2017

dipinhora commented Mar 11, 2017

dipinhora commented Mar 16, 2017

sylvanc commented Mar 16, 2017

jemc commented Mar 6, 2017 •

edited

Loading

jemc commented Mar 7, 2017 •

edited

Loading