heap-use-after-free in ExampleCluster #174

Closed
tmadlener opened this issue Jan 28, 2021 · 6 comments · Fixed by #514

Comments

@tmadlener
Collaborator

As of 29dfbdd there is at least one heap-use-after-free problem in our tests, and I think this also affects the generated code. It seems to have gone unnoticed until now because, for some reason, builds with gcc do not seem to suffer from any runtime problems, whereas builds with clang do and occasionally lead to segmentation faults.

After instrumenting the core podio library with AddressSanitizer, running tests/write points to an attempt to release an already destroyed Obj again. I have attached the complete output below. I have named the ExampleCluster in the issue title since this is the first instance where the problem occurs, but it doesn't have to be the only one: I am not yet sure whether this problem only affects types with relations, or whether it is a more general problem that affects instances that have not been created via Collection::create but instead constructed directly and then added to a collection. I also do not know why gcc builds seem to be less affected by this at runtime than clang builds.

address_sanitize_podio.txt

@tmadlener
Collaborator Author

This seems to be a more fundamental problem that is connected with clearing collections. A minimal reproducer is:

#include "datamodel/ExampleClusterCollection.h"

int main () {
  auto clusters = ExampleClusterCollection();
  auto cluster = clusters.create();
  clusters.clear(); // remove this and everything works
}

@tmadlener
Collaborator Author

This is a very fundamental problem that, as far as I can tell, happens every time we clear a collection in the same scope in which we have created or obtained objects from it.


The underlying problem is that the call to clusters.clear() already destroys the ExampleClusterObj that is held by the cluster. Consequently, by the time cluster goes out of scope and calls its destructor, we are more or less trying to destroy the same ExampleClusterObj again:

ExampleCluster::~ExampleCluster() {
  if (m_obj) m_obj->release();
}
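
For reference, a minimal sketch of what the generated clear() roughly does (member names like m_entries are simplified here, this is not the exact generated code): it deletes the owned Obj instances outright, without informing any handles that still point to them.

void ExampleClusterCollection::clear() {
  // sketch: the collection owns the Obj instances and deletes them directly
  for (auto* obj : m_entries) {
    delete obj;            // the ExampleClusterObj held by `cluster` dies here
  }
  m_entries.clear();
  // any ExampleCluster handle created from this collection still holds its
  // old m_obj pointer, which now dangles
}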

We would usually get a double-free here, but since podio::ObjBase::release checks the index first:

/// checks whether the object is "untracked" by a collection;
/// if yes, decrease the reference count and delete itself if the count reaches 0
int release() {
  if (id.index != podio::ObjectID::untracked) { return 1; };
  if (--ref_counter == 0) {
    delete this;
  }
  return 0;
};

we get a use-after-free. The reason why we do not run into a double-free more often is that id.index would first have to be -1, which is unlikely when reading from already freed memory, and additionally ref_counter would have to be 0, which is again unlikely. So the chances of actually calling delete this a second time are extremely small.

Since the cluster no longer has any connection to the collection, there is no way to "inform" it that m_obj has already been destroyed by the collection.

@tmadlener
Collaborator Author

Coming back to this once more. I think this is even more fundamental than I previously thought. Up until now I was under the impression that this would not be a problem in our "usual workflows", because objects are (almost) never created in the same scope in which the collections are cleared.

However, the issue goes deeper than that. For example, the following minimal showcase has the same heap-use-after-free problem:

#include "datamodel/ExampleWithOneRelationCollection.h"
#include "datamodel/ExampleClusterCollection.h"

int main() {
  auto clusters = ExampleClusterCollection();
  auto oneRelations = ExampleWithOneRelationCollection();

  // introduce a new scope to get around the originally described problem
  {
    auto cluster = clusters.create();
    auto rel = oneRelations.create();
    rel.cluster(cluster);
  }

  clusters.clear(); // clear this second and everything works
  oneRelations.clear();
}

In this case ~ExampleWithOneRelationObj looks like this (m_cluster is a ConstExampleCluster*):

~ExampleWithOneRelationObj() {
  if (m_cluster) delete m_cluster;
}

Hence, it calls the destructor of ConstExampleCluster, which in turn calls release, where the heap-use-after-free described above happens, because the corresponding ObjBase has already been removed by the call to clusters.clear(). Switching the order in which we call clear on the collections would fix the problem here, but that is not really a viable solution in general.
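
To spell out the call chain in this second reproducer (a sketch of the assumed sequence, not actual code):

clusters.clear();      // deletes the ExampleClusterObj; the ConstExampleCluster
                       // stored in the relation's Obj now points to freed memory
oneRelations.clear();  // deletes the ExampleWithOneRelationObj, whose destructor
                       // runs `delete m_cluster;`
// ~ConstExampleCluster() then calls m_obj->release() on the already freed
// ExampleClusterObj, which is the reported heap-use-after-free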

@hegner
Collaborator

hegner commented Sep 15, 2021

OK. I think we have to go back to the drawing board here and think through all possible corner cases. What seems a possible conclusion is that the ref counting part and the data itself may need to be separated more clearly, which would potentially make things incredibly complicated.
We should have a discussion meeting on that.

@gaede
Contributor

gaede commented Sep 16, 2021

Following up on our discussion yesterday: could the whole issue be resolved by introducing a new type of object that is not yet managed (owned) by a collection?
These handle types would manage the underlying storage (own the obj and thereby the data) and delete it when going out of scope, unless they have been moved into a collection. Normal types could then only be created by the collection, which has ownership from the beginning.
Some pseudo code:

auto clucol = event.get<ClusterCollection>("clusters");
for (i : range) {
  auto clu = clucol.create();
  clu.setEnergy(42.);
}

// some candidate clusters:
for (j : range) {
  ClusterCand clu;   // a cluster candidate
  clu.setEnergy(j * 4);

  if (clu.energy() > 42.) {
    clucol.emplace_back(clu);
    // ownership and memory handling moved to collection for these clusters
  }
}

// clusters not passed to collection go out of scope and are deleted

This would greatly simplify the memory handling. Am I missing something?
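
To make the idea a bit more concrete, here is a self-contained sketch of what such a candidate type could look like; ClusterCand, ClusterObj and emplace_back are hypothetical names and not existing podio API:

#include <vector>

struct ClusterObj { double energy{0.}; };

// candidate handle: owns its Obj until it is handed over to a collection
class ClusterCand {
public:
  ClusterCand() : m_obj(new ClusterObj()) {}
  ClusterCand(const ClusterCand&) = delete;             // copies omitted in this sketch
  ClusterCand& operator=(const ClusterCand&) = delete;
  ~ClusterCand() {
    if (m_owned) delete m_obj;   // only delete if never moved into a collection
  }
  void setEnergy(double e) { m_obj->energy = e; }
  double energy() const { return m_obj->energy; }

private:
  friend class ClusterCollection;
  ClusterObj* m_obj{nullptr};
  bool m_owned{true};
};

class ClusterCollection {
public:
  void emplace_back(ClusterCand& cand) {
    m_entries.push_back(cand.m_obj);
    cand.m_owned = false;        // ownership transferred to the collection
  }
  ~ClusterCollection() {
    for (auto* obj : m_entries) delete obj;
  }

private:
  std::vector<ClusterObj*> m_entries;
};

Handles handed out by create() would keep working as today; only the candidate type carries the extra ownership flag.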

@tmadlener
Collaborator Author

Another possibility could be to introduce a "semi smart pointer" that is used instead of the raw Obj* in the user classes. This would give us a handle in the destructor to deal with the different cases. The basic layout would look like this:

template<typename T>
class maybe_shared_ptr {
  // c'tors, d'tors, operator->, get()
private:
  T* ptr;
  struct ControlBlock {
    std::atomic<unsigned> count{1};
    std::atomic<bool> owned{true};
  };
  ControlBlock* ctrl_block{nullptr};
};

There are three different states:

  • When handed out by a collection, only the ptr will be populated and the ctrl_block will remain nullptr. The destructor checks for this and is then a no-op.
  • When created as a free-standing (user) object, the first construction creates the ctrl_block and assumes ownership (the defaults above). Copies simply get the same ptr and ctrl_block but increase the reference count by one. Similarly, destroying a user object decreases the reference count by one, until it reaches 0 and both the ptr and the ctrl_block are deleted.
  • When created as a free-standing (user) object but then added to a collection, everything happens as in the previous case (i.e. the reference counting remains unchanged). The major difference is that at the time the object is added to the collection, the owned flag is flipped. This changes the behavior of the destructor: it only destroys the ctrl_block once the ref count hits zero and leaves the ptr untouched (as that will be destroyed by the collection).

Basically, what would change with respect to the current situation is that the ref-count mainly controls the lifetime of the control block, and only conditionally that of the "managed" pointer. I am not yet entirely sure whether this covers all edge cases, but I think it could. One of the drawbacks is that the user-facing classes would double in size (essentially from one pointer to two pointers). However, I think that could be acceptable if it indeed solves all our problems.
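
A hedged sketch of how the destructor could implement the three states described above (illustrative only; the member names follow the skeleton, everything else is an assumption):

#include <atomic>

template<typename T>
class maybe_shared_ptr {
  struct ControlBlock {
    std::atomic<unsigned> count{1};
    std::atomic<bool> owned{true};
  };

public:
  // state 1: handed out by a collection -> no control block
  explicit maybe_shared_ptr(T* p) : ptr(p) {}

  // state 2: free-standing user object -> control block owns the pointer
  static maybe_shared_ptr make_owned(T* p) {
    maybe_shared_ptr sp(p);
    sp.ctrl_block = new ControlBlock();
    return sp;
  }

  maybe_shared_ptr(const maybe_shared_ptr& other)
      : ptr(other.ptr), ctrl_block(other.ctrl_block) {
    if (ctrl_block) ++ctrl_block->count;
  }
  maybe_shared_ptr& operator=(const maybe_shared_ptr&) = delete; // omitted in this sketch

  // state 3: the object was added to a collection -> the collection now owns ptr
  void collection_takes_ownership() {
    if (ctrl_block) ctrl_block->owned = false;
  }

  ~maybe_shared_ptr() {
    if (!ctrl_block) return;               // collection handle: no-op
    if (--ctrl_block->count == 0) {
      if (ctrl_block->owned) delete ptr;   // still free-standing: delete the Obj
      delete ctrl_block;                   // the control block is always ours
    }
  }

  T* get() const { return ptr; }
  T* operator->() const { return ptr; }

private:
  T* ptr{nullptr};
  ControlBlock* ctrl_block{nullptr};
};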
