Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More efficient object finalising implementation #1638

Merged
merged 1 commit into from
Mar 16, 2017

Conversation

dipinhora
Copy link
Contributor

This PR includes 2 changes

  • Ponybench arguments + finaliser benchmarks
    This commit enhances ponybench to use options to take
    optional command line arguments to change autobench_time
    and autobench_max_ops dynamically at runtime.
    This commit also adds benchmarks for objects with and without
    finalisers to minimal-cates/finalisers.

  • More efficient object finalising implementation
    Thanks to @sylvanc and @aturley for their ideas.

    • The old finaliser implementation used the object hashmap to keep
      track of finalisers that needed to be run. This was not ideal
      because while the hashmap provides constant time operations,
      the constant time was still much larger than the time for a normal
      no finaliser allocation. Additionally, keeping track of finalisers
      in the object hashmap meant that every object with a finaliser
      would be added to the object hashmap even if it was only transient
      and never sent to another actor. This, once again, was different
      from normal allocations where the objects wouldn't be added to
      the hashmap until they were sent to another actor. The benchmark
      using ponybench showed that objects with finalisers were about
      1 order of magnitude slower than objects without finalisers due
      to the overhead of using the object hashmap for tracking them.
    • The new finaliser implementation keeps the finaliser information
      in the chunk where the memory was allocated from. This is exactly
      the same as how non-finaliser allocations are tracked except for
      the additional work to keep track of the finaliser. The resulting
      benefit is that objects with finalisers will only get added to the
      object hashmap under the same circumstances as objects without
      finalisers. This gives us an increase in performance by 1 order of
      magnitude so that now objects with finalisers have the same allocation
      performance as objects without finalisers.
    • Keep a finaliser bitmap in chunk_t instead of an array of finaliser
      pointers. Run the finaliser from the pony_type_t->final_fn instead
      of storing/using the function passed in to pony_alloc_final.
    • Add pony_alloc_small_final and pony_alloc_large_final functions
      to avoid having to go through a branch and another function call
      to allocate memory with a finaliser.
    • Update compiler to call the appropriate one of pony_alloc_small_final
      or pony_alloc_large_final instead of pony_alloc_final.

    Future work:

    • It should be possible to remove pony_alloc_*_final finctions altogether and
      update pony_alloc, pony_alloc_small, pony_alloc_large to take
      a boolean as to whether a finaliser exists for the type or not.
      This would also require changes to the compiler to generate the
      appropriate boolean true/false for when a finaliser exists or not.

Benchmarks before:

vagrant@buffy-leader-1:~/hp2$ ./finalisers --autobench_time 100_000_000
primitive init
nofinal	  20000000	        11 ns/op
final	   2000000	       125 ns/op
actor final
embed final
class final
primitive final

Benchmarks after:

vagrant@buffy-leader-1:~/hp2$ sudo chrt -f 80 ./finalisers --autobench_time 100_000_000
primitive init
nofinal	  10000000	        12 ns/op
final	  10000000	        13 ns/op
actor final
embed final
class final
primitive final

Please let me know if there are any questions or concerns or if any of the changes need to go through an RFC.

Also, as always, feedback/suggestions for improvements welcome.

Copy link
Member

@Praetonus Praetonus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We recently switched from the C standard library assert to our custom pony_assert. I've left comments on the asserts that need to be changed.

p = chunk->m + (bit << HEAP_MINBITS);

// run finaliser
assert((*(pony_type_t**)p)->final != NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be pony_assert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Praetonus thanks for catching these. I should have caught that but lost track of it due to copy/paste from the sendence ponyc fork.

I'll fix these and the CI failures.

p = chunk->m + (bit << HEAP_MINBITS);

// run finaliser
assert((*(pony_type_t**)p)->final != NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be pony_assert.

@dipinhora dipinhora force-pushed the PR_pony_alloc_final branch 2 times, most recently from 5683f1d to 01ee0f2 Compare March 6, 2017 22:52
@jemc
Copy link
Member

jemc commented Mar 6, 2017

I think this should be split into two PRs, especially since the changes don't seem to have a strict dependency relationship to eachother (for example, the finaliser change doesn't seem to strictly depend on having more customizable benchmark facilities, even though it did help you to confirm the approach).

I'm also inclined to say the ponybench changes should go through the RFC process to get more feedback about the design, but maybe somebody else will disagree with me about that and we can discuss.

@dipinhora
Copy link
Contributor Author

@jemc I'm okay with splitting this PR into two PRs.

I only made the ponybench changes because the process got OOMKilled otherwise:

vagrant@buffy-leader-1:~/hp2$ ./finalisers
primitive init
Killed

I modeled the ponybench changes based on how ponytest works (except that I used options which ponytest explicitly avoids).

If the ponybench changes do go through an RFC, it'd be a good idea to discuss whether having ponybench run all iterations in a single behavior is desired or not (or if it should be dynamically choosable as a runtime option). The current design of having all iterations run in a single behavior is why the finalisers benchmark was getting OOMKilled, however, I do understand that running iterations in multiple behavior calls will result in the added overhead of GC and contention from other actors so it's not a simple decision.

Regardless, I'll split this PR and/or open the RFC once other folks have had a chance to weigh in (no point in opening another PR if the decision is to go through an RFC first).

@jemc
Copy link
Member

jemc commented Mar 7, 2017

If the ponybench changes do go through an RFC, it'd be a good idea to discuss whether having ponybench run all iterations in a single behavior is desired or not (or if it should be dynamically choosable as a runtime option).

Yeah, I agree that sounds to me like the good topic of an RFC discussion.

@dipinhora dipinhora force-pushed the PR_pony_alloc_final branch from 01ee0f2 to 387f086 Compare March 8, 2017 16:45
@dipinhora
Copy link
Contributor Author

@jemc I've updated this PR to remove the ponybench changes so they can be managed separately via another PR or an RFC.

@dipinhora
Copy link
Contributor Author

@Praetonus @jemc Are there any other comments, questions or changes for this PR before it's merged?

I'll try and get the ponybench RFC started sometime in the next few days.

@Praetonus
Copy link
Member

I'm not sure about the benchmark added to minimal-cases/finalisers. If I recall correctly, the minimal cases directory was added to contain regression tests that couldn't be added as compiler unit tests because we couldn't yet run a Pony program inside of a compiler unit test. Now that we can do that, the minimal cases should progressively move to the compiler test suite (and I've already started that with finalisers in #1629), so I don't think we should add more things in the files. Is there another place to put the benchmark?

@dipinhora
Copy link
Contributor Author

@Praetonus I understand your concern and agree that there should be another place for the benchmarks. However, the only other Pony based benchmarks in the repo are part of stdlib for the persistent maps in packages/collections/persistent/benchmarks/.

Any suggestions for where to put this benchmark since it's not testing a specific stdlib package?

@Praetonus
Copy link
Member

Maybe it could be implemented in C and added to benchmark/libponyrt?

@dipinhora dipinhora force-pushed the PR_pony_alloc_final branch from 387f086 to e2ae104 Compare March 10, 2017 14:06
@dipinhora
Copy link
Contributor Author

@Praetonus I've updated the PR to remove the changes to minimal-cases/finalisers so that this PR can be merged without a benchmark.

I'll work on creating a benchmark in C that exercises the same functionality to add into benchmark/libponyrt and submit that as another PR when it is ready.

@sylvanc
Copy link
Contributor

sylvanc commented Mar 10, 2017

Sorry to be a pain, @dipinhora , but I think another set of changes is needed. In heap.c, you are passing a has_finaliser flag to several methods. This results in an extra test and branch instruction on every allocation.

However, whether a finaliser is present or not is known statically when the memory allocation occurs, so we don't need to pay this cost.

If we separate out functions that set the finalise bit from those that don't in heap.c, we can statically call the right one, and eliminate the branch during allocation.

@@ -225,6 +315,10 @@ void* ponyint_heap_alloc_small(pony_actor_t* actor, heap_t* heap,
m = chunk->m + (bit << HEAP_MINBITS);
chunk->slots = slots;

// note that a finaliser needs to run
if(has_finaliser)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch impacts all allocations adversely.

@@ -237,6 +331,12 @@ void* ponyint_heap_alloc_small(pony_actor_t* actor, heap_t* heap,
n->m = (char*) POOL_ALLOC(block_t);
n->size = sizeclass;

// note that a finaliser needs to run
if(has_finaliser)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one too.

@@ -265,6 +366,12 @@ void* ponyint_heap_alloc_large(pony_actor_t* actor, heap_t* heap, size_t size)
chunk->slots = 0;
chunk->shallow = 0;

// note that a finaliser needs to run
if(has_finaliser)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one for large allocations.

@dipinhora dipinhora force-pushed the PR_pony_alloc_final branch from e2ae104 to 373daae Compare March 10, 2017 18:00
@dipinhora
Copy link
Contributor Author

@sylvanc No problem at all and good catch on the extra/unnecessary branch. I didn't even notice it. 8*/

I've eliminated branch by converting the has_finaliser to a uint32_t which should turn true to 1 and false to 0 since we're using C99 stdbool.h.

Please let me know if you find anything else that needs fixing.

@SeanTAllen SeanTAllen force-pushed the master branch 3 times, most recently from 4ccafba to 362d1d0 Compare March 11, 2017 13:53
@sylvanc
Copy link
Contributor

sylvanc commented Mar 11, 2017

Ah I see what you did there :) But what I meant was, just like how in pony.h we have separate entry points for allocators with finalisers and without, we should maintain that distinction in heap.c. That avoids paying any cost, even bit shifts, for allocations of objects that have no finalisers. As such, there is no performance drop (not even a tiny one) on alloc compared to the old code.

@SeanTAllen SeanTAllen force-pushed the master branch 3 times, most recently from 97b1943 to 21d37aa Compare March 11, 2017 14:51
Thanks to Sylvan and Andy Turley for their ideas.

* The old finaliser implementation used the object hashmap to keep
  track of finalisers that needed to be run. This was not ideal
  because while the hashmap provides constant time operations,
  the constant time was still much larger than the time for a normal
  no finaliser allocation. Additionally, keeping track of finalisers
  in the object hashmap meant that every object with a finaliser
  would be added to the object hashmap even if it was only transient
  and never sent to another actor. This, once again, was different
  from normal allocations where the objects wouldn't be added to
  the hashmap until they were sent to another actor. The benchmark
  using ponybench showed that objects with finalisers were about
  1 order of magnitude slower than objects without finalisers due
  to the overhead of using the object hashmap for tracking them.
* The new finaliser implementation keeps the finaliser information
  in the chunk where the memory was allocated from. This is exactly
  the same as how non-finaliser allocations are tracked except for
  the additional work to keep track of the finaliser. The resulting
  benefit is that objects with finalisers will only get added to the
  object hashmap under the same circumstances as objects without
  finalisers. This gives us an increase in performance by 1 order of
  magnitude so that now objects with finalisers have the same allocation
  performance as objects without finalisers.
* Keep a finaliser bitmap in chunk_t instead of an array of finaliser
  pointers. Run the finaliser from the pony_type_t->final_fn instead
  of storing/using the function passed in to pony_alloc_final.
* Add pony_alloc_small_final and pony_alloc_large_final functions
  to avoid having to go through a branch and another function call
  to allocate memory with a finaliser.
* Update compiler to call the appropriate one of pony_alloc_small_final
  or pony_alloc_large_final instead of pony_alloc_final.
@dipinhora dipinhora force-pushed the PR_pony_alloc_final branch from 373daae to c574c26 Compare March 11, 2017 17:07
@dipinhora
Copy link
Contributor Author

@sylvanc changes made to add functions ponyint_heap_alloc_final, ponyint_heap_alloc_small_final and ponyint_heap_alloc_large_final that are a copy of the logic from the original functions. This introduces the maintenance overhead of making sure all core logic changes to ponyint_heap_alloc_small are also made in ponyint_heap_alloc_small_final and vice versa. Please let me know if you had something other than copying the logic in mind.

Also, one minor note, there is still a tiny overhead with the new implementation of alloc (without a finaliser) as compared to the old code because the new code needs to set chunk->finalisers that the old code didn't. I can't think of any way to avoid that without adding in a memset to 0 the new chunk_t first which would also add overhead as compared to the old code.

@dipinhora
Copy link
Contributor Author

@sylvanc Any additional suggestions or comments before this can be merged?

@sylvanc
Copy link
Contributor

sylvanc commented Mar 16, 2017

Nice! Thanks @dipinhora for bearing with all the little changes on this one.

@sylvanc sylvanc merged commit be82960 into ponylang:master Mar 16, 2017
SeanTAllen pushed a commit that referenced this pull request Oct 9, 2024
…segfault (#4522)

A long time ago, some rando named "Dipin Hora" decided to be clever and rework how objects with finalisers are stored/garbage collected in actor heaps (see: #1638). Unfortunately for us all, he wasn't nearly as clever as he thought and introduced a use after free bug that only occurs when finalisers have logic to reference other objects that might have already been freed and reused during the same garbage collection process. Luckily for us, a smarterer rando who also happens to be named "Dipin Hora" has come along to save the day.

This commit adds in tests to reproduce the bug and rework the actor heap garbage collection logic to make sure this issue can no longer occur by making sure that:

* finalisers can no longer re-use memory that might be freed earlier in the garbage collection process
* heap chunks are only freed/destroyed after all finalisers have been run to ensure all cross object references in finalisers remain valid at the time the finalisers are run
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants