mpi_aggregator: don't pass a local variable to MPI_Isend #1326

germasch · 2019-03-26T14:38:04Z

The data passed to MPI_Isend needs to remain available until after
MPI_Wait, so use a class member variable instead.

This also means the member variable can't be const, so the workaround for some MPI_Isend implementations isn't needed anymore.

chuckatkins · 2019-03-27T20:52:56Z

Please rebase off current master as the version of clang-format was just bumped from 3.8.1 to 7.0.1

eisenhauer · 2019-03-31T13:24:12Z

The original code is clearly erroneous because it passes the address of a stack-allocated variable to a routine that was allowed to grab that buffer asynchronously at any time prior to the MPI_Wait(), which occurs outside the scope of the passed variable. It likely worked because any MPI implementation worth its salt can (and likely would) buffer an 8 byte send, but it wasn't guaranteed to work by MPI semantics.

This PR fixes the erroneous nature of the code by creating a class member which will persist until the MPI_Wait() happens, in the WaitAbsolutePosition method. Still not entirely thrilled with this abstraction, because it is really customized for the particular usage that appears elsewhere. The IExchangeAbsolutePosition and WaitAbsolutePosition methods are semantically linked by some state which is exposed and returned to the calling environment (the requests vector), and other state that is hidden inside the object. So, one couldn't, for example, do two calls to IExchangeAbsolutePosition() before WaitAbsolutePosition() without having hidden potentially erroneous behavior (that would probably never happen because Isend of 8 bytes would probably always buffer, but still erroneous). I considered the utility of using Send here instead of ISend, relying upon the usual equivalence of Send and ISend for tiny buffers, but that would just be replacing one bit of strictly erroneous code with another. Alternatively, the code could be refactored to export the send buffer as well as the requests vector to the calling environment, where it could be deallocated after (or in) the Wait call. But this code is non-generic (so highly specialized to the needs of its caller that it is unlikely to be used anywhere else, in any other manner) and even if it was called differently, the same factor that made the original erroneous implementation work (MPI always buffering tiny sends) means that actual problems might be unlikely to occur. Given those factors, I'm tempted to say that a more complex, "export all the state to the caller" sort of fix is likely more trouble than it's worth. @germasch @pnorbert do you concur? Maybe at most add a comment about the situation, warning future developers who might be tempted to use this as an example?

germasch · 2019-03-31T15:11:23Z

I was actually going to suggest that this communication piece of code could be cleaned up in general. It's quite opaque in what it does, and also not necessarily particularly efficient in the communication pattern it uses. But I figured I'd start with what's obviously just a small self-contained fix.

On the particular topic of maintaining state, I actually changed things in my local version in a way that the MPI_Request's are not returned to the user but maintained in the class, because as @eisenhauer says, the code maintains internal state anyway that prevents it from being used more generally, so one might as well make the interface simpler.

Let me describe what MPIAggregator actually does (more specifically, what it's derived class MPIChain does):

It performs essentially an exclusive scan (in MPI terms, a.k.a. prefix sum) of m_AbsolutePosition. (On rank 0, however, it finds the total sum)
It performs a Gatherv-like gathering of all data buffers onto rank 0, where they get written.

So my suggestion for the absolute position would be to just use MPI_Exscan (or, ideally, MPI_Iexscan if it is okay to require MPI >= version 3). No loop over rank required to achieve this.

For the actual data communication, as implemented by MPIChain, it actually shifts buffers down one rank at a time, ie., if you're aggregating over 4 processes, the buffer on rank 3 will be communicated rank 3 -> rank 2 -> rank 1 -> rank 0. Obviously, less data would go over the network if it were sent rank 3 -> rank 0 directly.

While MPI_Gatherv could be used directly, I think that's dangerous to do in general because it would result in rank 0 having to buffer the entire data from all processes in the aggregation group at once, which it might not have enough memory for. So my suggestion would be to keep the current interface, where rank 0 receives one data buffer at a time and then writes it as it goes. One could eventually go fancy and adapt behavior depending on available buffer space (ie, if it's all small buffers to be collected, use Gatherv, if they're big, do one a time), but I don't think there's a good reason to do so right now. But even when receiving / handling buffers one at a time, I think they should be sent directly to rank 0, not passed down one rank at a time.

pnorbert · 2019-03-31T15:54:11Z

The chain communication (a.k.a. Brigade method) was implemented in adios 1.x by Qing Liu, which proved to be better performing on Jaguar/Titan for our apps then directly sending buffers to rank 0. However counter-intuitive and strange it seems, this was the case, because it always kept the next "receive next buffer on rank 0" operation faster than the "write current buffer to disk" and hence avoided gaps between write operations. This may or may not be true for Summit anymore but the approach would be to have various aggregations strategies available. I guess the infrastructure is not in place to allow for multiple strategies right now.

…

On Sun, Mar 31, 2019 at 11:11 AM Kai Germaschewski ***@***.***> wrote: I was actually going to suggest that this communication piece of code could be cleaned up in general. It's quite opaque in what it does, and also not necessarily particularly efficient in the communication pattern it uses. But I figured I'd start with what's obviously just a small self-contained fix. On the particular topic of maintaining state, I actually changed things in my local version in a way that the MPI_Request's are not returned to the user but maintained in the class, because as @eisenhauer <https://github.com/eisenhauer> says, the code maintains internal state anyway that prevents it from being used more generally, so one might as well make the interface simpler. Let me describe what MPIAggregator actually does (more specifically, what it's derived class MPIChain does): - It performs essentially an exclusive scan (in MPI terms, a.k.a. prefix sum) of m_AbsolutePosition. (On rank 0, however, it finds the total sum) - It performs a Gatherv-like gathering of all data buffers onto rank 0, where they get written. So my suggestion for the absolute position would be to just use MPI_Exscan (or, ideally, MPI_Iexscan if it is okay to require MPI >= version 3). No loop over rank required to achieve this. For the actual data communication, as implemented by MPIChain, it actually shifts buffers down one rank at a time, ie., if you're aggregating over 4 processes, the buffer on rank 3 will be communicated rank 3 -> rank 2 -> rank 1 -> rank 0. Obviously, less data would go over the network if it were sent rank 3 -> rank 0 directly. While MPI_Gatherv could be used directly, I think that's dangerous to do in general because it would result in rank 0 having to buffer the entire data from all processes in the aggregation group at once, which it might not have enough memory for. So my suggestion would be to keep the current interface, where rank 0 receives one data buffer at a time and then writes it as it goes. One could eventually go fancy and adapt behavior depending on available buffer space (ie, if it's all small buffers to be collected, use Gatherv, if they're big, do one a time), but I don't think there's a good reason to do so right now. But even when receiving / handling buffers one at a time, I think they should be sent directly to rank 0, not passed down one rank at a time. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1326 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ADGMLZP_cIUb-E_zeD3_HaSZY8wOY5y3ks5vcNAcgaJpZM4cLnS9> .

germasch · 2019-03-31T16:14:05Z

I agree that there is a point in having multiple strategies available, so they can be benchmarked and the best option chosen. The current code has laid the groundwork for that in that MPIAggregator can provide different implementations by way of a derived class. Right now MPIChain is the only (and hardcoded) option, but turning m_Aggregator into a pointer to the base class and making it configurable should be quite straightforward.

germasch · 2019-03-31T17:16:42Z

By the way, I think the current (and guess having a longer history method) of chaining the buffers has the following (possibly advantageous) properties:

It "double buffers", ie, while one buffer is written, the data from the next rank is being received into a second buffer using non-blocking MPI. I would think this kind of overlap is a good idea for performance.
It kinda throttles messages onto rank 0. That is, instead of having all ranks send the buffers to rank 0 at the same time, while rank 0 is only ready to receive one at a time, the current scheme basically makes sure that every process is at most sending one and receiving one buffer at the same time. For this one, I'm less sure what performance implications might be, because in general, I believe MPI may communicate small messages even while there is no receiver yet (so it needs to buffer them), while large messages will be kept on the sender side until the receiver is ready, so that'd do similar throttling. But I guess I also know that reality is yet more complicated, and beyond what I really know details about.

eisenhauer · 2019-03-31T23:00:47Z

So, action plan? @germasch, maybe amend this with your changes that maintain the requests internally to simplify the interface? Then alternative implementations of aggregate left for future work?

The data passed to MPI_Isend needs to remain available until after MPI_Wait, so use a class member variable instead.

…rnally

germasch · 2019-04-01T14:40:47Z

I rebased and added commits to maintain the requests internally.
While I was at it, I also added some simplification of MPI_Irecv + immediately following MPI_Wait to MPI_Recv, and used MPI_STATUS_IGNORE. Those additional commits are at the end, so they can be easily removed if considered objectionable.

There's actually more simple maintenance that could be done, but I kept it at the trivial changes:

use MPI_Waitall to combine MPI_Wait calls.
make MPIAggregator an abstract base class, because right now you can use MPIAggregator directly, but it won't actually communicate your data buffers, which is highly likely not what one wants/expects.

williamfgc · 2019-04-01T15:14:06Z

I'd be conservative in changing the current aggregation strategy (MPIChain) if no performance gain is achieved (fixing bugs is fine as in the first commit). Changing internal behavior means rerunning current application benchmarks at scale as that's the only way we'll know (the CI is pointless here, and I personally prefer non-blocking calls), which will involve other people's time and burning core-hours.

As for virtualizing calls to MPIAggregator, that's in the plans (hence the OO boilerplate, MPIChain predates this), so please go ahead as this brings the possibilities to add more strategies for upcoming systems other than MPIChain.

germasch · 2019-04-01T16:06:29Z

I wouldn't want to change the current MPIChain implementation (at least not in fundamental ways), if for no other reason that it wouldn't even be possible to tell whether an updated strategy performs better without having them both available to run benchmarks.

This PR doesn't change anything about the MPIChain strategy -- initially it was just a simple bugfix, now it also includes some cleanup, but doesn't change behavior. (Unless you consider changing MPI_Irecv + immediate MPI_Wait -> MPI_Recv a change of behavior, but your MPI library would have to be rather broken for that to cause troubles.)

williamfgc · 2019-04-01T17:19:13Z

Can we justify why we need the architectural change? (New use-case? New strategy? Performance gain? New engine? Reusability?) While the proposed interface makes everything private, takes away control for Engine developers to play with different Wait options, as performance is driven by potential overlap. Also, no need to add blocking calls as we don't know how each system handles network blocking without measuring (the standard is ambiguous). Rather be safe and keep everything non-blocking.

williamfgc · 2019-04-01T17:21:55Z

I would make a separate MPI aggregation strategy as it is allowed by the object oriented design. We can always overload MPIAggregator calls if needed. Let's not forget this is working at scale.

germasch · 2019-04-01T17:47:05Z

Can we justify why we need the architectural change?

@eisenhauer justified why not to return the MPI_Request vector above: Returning this state implies to the user that the function is designed to be called multiple times -- but it is not, because it also maintains internal state (m_SizeSend was already there, and with the bug fix m_AbsolutePositionSend). So instead of mixing internal and external state, keep it all internal to make it consistent / less prone to misuse.

While the proposed interface makes everything private, takes away control for Engine developers to play with different Wait options, as performance is driven by potential overlap.

Engine developers supposedly need to use WaitAbsolutePosition to do the waiting, so they don't have a choice anyway. If you mean that people developing different implementations of MPIAggregator don't have a choice, I don't think that's true: They can store as many MPI_Requests or whatever they need within their derived class, and external users will see no difference (that's a good thing).

Also, no need to add blocking calls as we don't know how each system handles networking blocking without measuring. Rather be safe and keep everything non-blocking.

MPI_Irecv + MPI_Wait is already blocking, just like MPI_Recv. The only difference is that the former allows for overlap in between the two calls, but in the cases I changed, there was nothing there in between. Nevertheless, I can drop that commit if that's what you want.

pnorbert · 2019-04-01T17:59:50Z

Let's put this topic on the dev-call agenda for tomorrow. We need to pick this topic apart and see what are bug fixes, what are changes to the (internal) interface to provide multiple aggregation strategies, and what are new strategies.
Kai, the problem with any direct changes to the aggregation is that we only have one and it requires testing apps to make sure we don't break anything. Well, the problem is manpower, not the testing itself, which will be needed anyway when the internal interface changes.
We need to discuss if we accept the fixes in this PR and than figure out who is going to test it with apps. Those steps cannot be too far apart.

williamfgc · 2019-04-01T18:00:02Z

That's exactly my point, the proposed architecture changes is constraining things to how the aggregator's Wait is being used today. That's not how object oriented systems are designed, but rather as flexible and extendable interfaces.

If a new performance strategy requires splitting those Wait calls, we'll have to go back to the current architecture. I wouldn't compromise flexibility of the current OO design in choosing to partition Wait calls if a new strategy requires for performance.

Yes, please drop the commit, rather make a separate strategy under MPIAggregator.

germasch · 2019-04-01T18:07:06Z

@williamfgc I'm sorry, but I'm a bit lost. What are we talking about: (1) stop returning a vector of MPI_Requests from the IExchange* functions and handing it back to the corresponding Wait* function or (2) Changing the internal MPI_IRecv + immediate MPI_Wait calls to MPI_Recv?

williamfgc · 2019-04-01T18:13:23Z

@germasch removing (1) will take away control of when to call each Wait for particular Requests as we are reusing the same interface, (2) let's just make everything non-blocking. These are deeper APIs for Engine developers (not users), as @pnorbert said, we want to be open to new strategies, so the MPIAggregator API must be very flexible (not tailored to a single use-case). FYI, getting to the point of making this work at scale was an iterative process that took a lot of work, but it's at least reusable.

williamfgc · 2019-04-01T18:16:45Z

@germasch BTW, glad you reached this point, as aggregation strategies is another optimization technique (@pnorbert described the history) to achieve performance. We need help understanding potential new strategies for new systems, the cost is higher as testing at scale is never cheap.

germasch · 2019-04-01T18:26:12Z

Let's put this topic on the dev-call agenda for tomorrow. We need to pick this topic apart and see what are bug fixes, what are changes to the (internal) interface to provide multiple aggregation strategies, and what are new strategies.

The first commit is a bug fix (temporary passed to MPI_Isend).
This triggered the following two commits which keep state entirely internal following comments from @eisenhauer (which I agree with). Note that this is also can be argued to be "object-oriented" (encapsulated) way. If some future generation machine for whatever reason wanted to use RDMA directly in the aggregation and would need to keep track of RDMA handles while overlapping, that change could be handled internally, transparent to the user (ie., the engines using aggregation). Whereas the current interface does not abstract the internal state between the I and Wait calls, but explicitly relies on std::vector<MPI_Request>. (This second rationale is not why me/Greg proposed this change, it's just about don't split state between internal and external. But it shows that it's generally a useful way to handle separate interface and implementation details.)

Kai, the problem with any direct changes to the aggregation is that we only have one and it requires testing apps to make sure we don't break anything. Well, the problem is manpower, not the testing itself, which will be needed anyway when the internal interface changes.

There is a test which exercises this functionality in a bunch of cases: TestBPWriteAggregateRead I'd consider that good enough to test the aggregation for correctness (a.k.a., stupid bugs I might have introduced by a typo or something). In terms of performance, there's really nothing in this PR which changes which messages are sent and when, so there will be no measurable difference. (It avoids a bunch of dynamic memory allocations for std::vector<MPI_Request>, but that also won't be measurable except for in a micro-benchmark).

We need to discuss if we accept the fixes in this PR and than figure out who is going to test it with apps. Those steps cannot be too far apart.

I certainly appreciate the strong push for stability. But I think the validation effort should be proportional to the kind of changes. If I were to add a new implementation of MPIAggregator with a different communication strategy, and the interface changes that are needed to select that at runtime, and if that new strategy became default, I most definitely agree that the impact on apps needs to be evaluated. But this PR is not that kind of a change, it's just a bunch of minor changes that don't change behavior in any noticeable way.

germasch · 2019-04-01T18:40:32Z

@germasch removing (1) will take away control of when to call each Wait for particular Requests as we are reusing the same interface,

The current API provides two calls to Wait, one for the data buffer, one for the absolute position. The user (ie, engine) can place those two calls according to their needs:

    void WaitAbsolutePosition(size_t &absolutePosition, const int step);
    void Wait(std::vector<MPI_Request> &requests, const int step);

After the change, the API still has those exact two calls:

    void WaitAbsolutePosition(const int step);
    void Wait(const int step);

So the user still has the two Wait calls that they can place according to their needs. I don't see how the second way takes away anything that was possible previously?

(2) let's just make everything non-blocking. These are deeper APIs for Engine developers (not users), as @pnorbert said, we want to be open to new strategies, so the MPIAggregator API must be very flexible. FYI, getting to the point of making this work at scale was an iterative process that took a lot of work.

No-one suggested to make everything non-blocking, as that's clearly a bad idea. Everything that's currently achieving overlap by non-blocking calls in MPIChain still does so (in particular, rank 0 writes out the last-received data buffer at the same time that the next buffer is in flight).

The change we're talking about looks like this:

@@ -64,19 +64,12 @@ void MPIChain::IExchange(BufferSTL &bufferSTL, const int step)
     if (receiver)
     {
         size_t bufferSize = 0;
-        MPI_Request receiveSizeRequest;
-        helper::CheckMPIReturn(MPI_Irecv(&bufferSize, 1, ADIOS2_MPI_SIZE_T,
-                                         m_Rank + 1, 0, m_Comm,
-                                         &receiveSizeRequest),
+        helper::CheckMPIReturn(MPI_Recv(&bufferSize, 1, ADIOS2_MPI_SIZE_T,
+                                        m_Rank + 1, 0, m_Comm,
+                                        MPI_STATUS_IGNORE),
                                ", aggregation Recv size at iteration " +
                                    std::to_string(step) + "\n");

-        MPI_Status receiveStatus;
-        helper::CheckMPIReturn(
-            MPI_Wait(&receiveSizeRequest, &receiveStatus),
-            ", aggregation waiting for receiver size at iteration " +
-                std::to_string(step) + "\n");
-
         BufferSTL &receiveBuffer = GetReceiver(bufferSTL);
         ResizeUpdateBufferSTL(
             bufferSize, receiveBuffer,

Note that it uses a local variable MPI_Request receiveSizeRequest. That's never exposed as part of the interface. None of this change has anything to do with the API of MPIAggregator, no matter how flexible or not you want it to be.

williamfgc · 2019-04-01T19:00:26Z

Current: void Wait(std::vector<MPI_Request> &requests, const int step);
Proposed: void Wait(const int step);

The current call can be reused by any group of Requests as decided by the Engine developer, the proposed one can't.

germasch · 2019-04-01T19:23:11Z

Can you provide an example of how you imagine reusing the call by passing "any group of Requests"? Where do those requests come from? Is the caller now doing their own MPI_Isend / MPI_Irecv? How many requests are in "any group of Requests"?

germasch · 2019-04-02T04:27:53Z

I took out the commit that changes MPI_Irecv + MPI_Wait to MPI_Recv. It's the only change in this PR that actually changed behavior (other than the original bug fix), and it's now gone. I don't think it's worth the time arguing over.

It still moves the MPI_Request vector internal to MPIAggregator, as suggested by @eisenhauer. I have trouble imagining an actual use case where this reduces needed flexibility, and no example has been provided to show otherwise.

germasch · 2019-04-02T04:30:22Z

Also, as I generally try to do, it reduces lines of code:

 8 files changed, 55 insertions(+), 77 deletions(-)

where we're ignoring the status anyway (which is everywhere). MPI_STATUS_IGNORE added to mpidummy.{h,cpp}

chuckatkins · 2019-04-02T19:59:44Z

Indeed this is a bug, sending a local stack variable via MPI_Isend, so thank you for identifying it. So, we just went through this in a fair amount of detail on our weekly dev call and the resolution is more complicated and nuanced. Your fix here, moving both both the absoluteposition and request to local state, resolves this but ends up breaking the interface since both of them are derived quantities based on the input buffer. So if IExchangeAbsolutePosition is called repeatedly with separate buffers then that internal state gets clobbered.

W.R.T how to fix the bug though, to do it via tracking an internal state we'd need to maintain an arbitrary length of requests and absolutesends returning some way to reference them on subsequent wait calls, or not keep it as internal state at all and instead return something akin to a pair<request, position> of sorts that can be referenced in subsequent wait calls; really just something that ensures the data sent lives as long as the request does. Since they are both derived quantities on the input parameter and either way would be returning something that needs to be passed back and referenced later, we're leaning towards the second to avoid adding additional internal state. Since it's more in depth than appears on the surface and goes in part to the intended use cases for the MPIAggregator class, it's been added to @williamfgc's list to address in the next week or two so I'm going to close this for now.

@williamfgc we discussed today about about you implementing the second solution. However, it may make sense to do the first if we can ensure that the buffer passed in that get's operated on is guaranteed to be one of m_Buffer and somehow ties to that.

chuckatkins · 2019-04-09T19:30:00Z

So, after more digging on this, rather than "fix" the underlying state corruption in various places in the class, I think we'll end up taking the approach you have here with adding an additional check to ensure the repeated call sequence that corrupts the state can't happen. I'm going to re-open this just add the 1 extra commit for said check. It should be good to go then.

germasch · 2019-04-10T14:59:30Z

@chuckatkins I can try fortifying it, but since I've thought about it for a while now, let me throw an alternative idea for the interface out there:

@@ -328,30 +348,14 @@ void BP3Writer::AggregateWriteData(const bool isFinal, const int transportIndex)
     m_BP3Serializer.CloseStream(m_IO, false);

     // async?
-    for (int r = 0; r < m_BP3Serializer.m_Aggregator.m_Size; ++r)
-    {
-        m_BP3Serializer.m_Aggregator.IExchange(m_BP3Serializer.m_Data, r);
-
-        m_BP3Serializer.m_Aggregator.IExchangeAbsolutePosition(
-            m_BP3Serializer.m_Data, r);
-
-        if (m_BP3Serializer.m_Aggregator.m_IsConsumer)
-        {
-            const BufferSTL &bufferSTL =
-                m_BP3Serializer.m_Aggregator.GetConsumerBuffer(
-                    m_BP3Serializer.m_Data);
-
-            m_FileDataManager.WriteFiles(bufferSTL.m_Buffer.data(),
-                                         bufferSTL.m_Position, transportIndex);
-
-            m_FileDataManager.FlushFiles(transportIndex);
-        }
-
-        m_BP3Serializer.m_Aggregator.WaitAbsolutePosition(r);
-
-        m_BP3Serializer.m_Aggregator.Wait(r);
-        m_BP3Serializer.m_Aggregator.SwapBuffers(r);
-    }
+    m_BP3Serializer.m_Aggregator(m_BP3Serializer.m_Data,
+              [&](const BufferSTL &buffer) {
+                  m_FileDataManager.WriteFiles(buffer.m_Buffer.data(),
+                                               buffer.m_Position,
+                                               transportIndex);
+
+                  m_FileDataManager.FlushFiles(transportIndex);
+              });

     m_BP3Serializer.UpdateOffsetsInMetadata();

This is clearly much simpler and less error-prone to use. It should be a straightforward change, basically just moving the loop over ranks into MPIAggregator::operator() and calling back to a lambda function that will be passed the aggregated buffers one-by-one.

It's still very specialized (really only useful for this one particular use case). But with this interface, it'd be hard to get issues with maintaining internal state, since it's one single function call one can not really jump into the middle of to break things, so the only way to get that issue would be calling it from multiple threads at the same time. And even that could then actually be handled nicely by moving the state to local variables in the implementation that performs the loop.

It's still not general enough to support all kinds of possible communication strategies easily (e.g., MPI_Gatherv wouldn't quite work), but it does at least hide the fact that double buffering is used.

chuckatkins · 2019-04-10T15:29:43Z

Supersceded by #1369, which is based on this implementation with the member variable plus a few changes.

chuckatkins · 2019-04-10T15:31:32Z

I don't disagree with your ideas here, but I want to limit the scope to fixing the bug in the most limited way possible given the critical path this code lives in.

@cyfdecyf

Code extracted from: https://github.com/pybind/pybind11.git at commit 80d452484c5409444b0ec19383faa84bb7a4d351 (v2.4.3). Upstream Shortlog ----------------- Ahuva Kroizer (1): 8f5b7fce FAQ addition (ornladios#1606) Alexander Gagarin (2): 0071a3fe Fix async Python functors invoking from multiple C++ threads (ornladios#1587) (ornladios#1595) b3bf248e Fix casting of time points with non-system-clock duration with VS (ornladios#1748) Allan Leal (1): e76dff77 Fix for Issue ornladios#1258 (ornladios#1298) Andre Schmeißer (1): 19189b4c Make `overload_cast_impl` available in C++11 mode. (ornladios#1581) Ansgar Burchardt (1): a22dd2d1 correct stride in matrix example and test Antony Lee (6): a303c6fc Remove spurious quote in error message. (ornladios#1202) 0826b3c1 Add spaces around "=" in signature repr. 8fbb5594 Clarify error_already_set documentation. 55dc1319 Clarify docs for functions taking bytes and not str. 58e551cc Properly report exceptions thrown during module initialization. baf6b990 Silence GCC8's -Wcast-function-type. (ornladios#1396) Axel Huebl (9): 4b84bad7 Fix Travis GCC 7 Python 3.6.6 (ornladios#1436) 97b20e53 CMake: Remember Python Version (ornladios#1434) 3a94561c Debug Builds: -DPy_DEBUG (ornladios#1438) 435dbdd1 add_module: allow include as SYSTEM (ornladios#1416) 9424d5d2 type_record: Uninit Member (ornladios#1658) 1c627c9e pybind11_getbuffer: useless safe nullptr check (ornladios#1664) a2cdd0b9 dict_readonly: member init (ornladios#1661) 000aabb2 Test: Numpy Scalar Creation (ornladios#1530) 38f408fc value_and_holder: uninit members (ornladios#1660) Baljak (1): 81da9888 Fix Intel C++ compiler warning on Windows (ornladios#1608) Blake Thompson (1): 30c03523 Added __contains__ to stl bindings for maps (ornladios#1767) Boris Dalstein (2): b30734ee Fix typo in doc: build-in -> built-in 96be2c15 Fix version mismatch typos in .travis.yml (ornladios#1948) Boris Staletic (4): 289e5d9c Implement an enum_ property "name" f4b4e2e9 Use new Doxygen archive URL - fixes Travis 0ca6867e Avoid Visual Studio 2017 15.9.4 ICE b3f0b4de new sphinx (ornladios#1786) Borja Zarco (2): e2b884c3 Use `PyGILState_GetThisThreadState` when using gil_scoped_acquire. (ornladios#1211) b2fdfd12 Avoid use of lambda to work around a clang bug. (ornladios#1883) Bruce Merry (2): 1e6172d4 Fix some minor mistakes in comments on struct instance 3b265787 Document using atexit for module destructors on PyPy (ornladios#1169) Chris Rusby (1): 22859bb8 Support more natural syntax for vector extend Christoph Kahl (1): 640b8fe6 fix ornladios#1406 add mingw compatibility (ornladios#1851) Dan (10): a175b21e Avoid decoding already-decoded strings from cindex. 4612db54 Try to autodetect the location of the clang standard libraries. b46bb64d Allow user to override default values of -x and -std=. a33212df Wrap the main functionality of mkdoc in a function. ede328a7 Allow writing output to file instead of stdout. a163f881 Delete partially-written file in the event of an error. 590e7ace Avoid storing global state. 2c8c5c4e Split into seperate functions for easier invocation from python. e0b8bbbc Use a file-local constant for non-prefixing nodes. 41f29ccd Parse command-line args in a separate function. Darius Arnold (1): 09330b94 Fix typos in documentation (ornladios#1635) David Caron (1): 307ea6b7 Typo Davis E. King (1): 9343e68b Fix cmake scripts so projects using CUDA .cu files build correctly. (ornladios#1441) Dean Moldovan (3): c10ac6cf Make it possible to generate constexpr signatures in C++11 mode 56613945 Use semi-constexpr signatures on MSVC 0aef6422 Simplify function signature annotation and parsing Dennis Luxen (1): 221fb1e1 Untangle cast logic to not implicitly require castability (ornladios#1442) Dmitry (1): 8f5a8ab4 Don't strip debug symbols in RelWithDebInfo mode (ornladios#1892) Elliott Sales de Andrade (1): 5e7591c6 Update PyPI URLs. Eric Cousineau (2): e9ca89f4 numpy: Add test for explicit dtype checks. At present, int64 + uint64 do not exactly match dtype(...).num 4a3464fd numpy: Provide concrete size aliases Francesco Biscani (1): ba33b2fc Add -Wdeprecated to test suite and fix associated warnings (ornladios#1191) François Becker (1): ce9d6e2c Fixed typo in classes.rst (ornladios#1388) Guilhem Saurel (2): e7ef34f2 compatibility with pytest 4.0, fix ornladios#1670 43a39bc7 ignore numpy.ufunc size warnings Henry Schreiner (10): 04b41f03 Upgrading to Xcode 9 & fix OSX/Py3 build failure cf0d0f9d Matching Python 2 int behavior on Python 2 (ornladios#1186) 6c62d279 Fix for conda failures on Windows ffd56ebe Fix pip issues on AppVeyor CI (ornladios#1369) 3789b4f9 Update C++ macros for C++17 and MSVC Z mode (ornladios#1347) 0f404a5d Allow recursive checkout without https (ornladios#1563) ae951ca0 CI fixes (ornladios#1744) 73b840dc Fix for issue in latest conda (ornladios#1757) 9bb33131 Fixing warnings about conversions in GCC 7+ (ornladios#1753) 047ce8c4 Fix iostream when used with nogil (ornladios#1368) Ian Bell (1): 502ffe50 Add docs and tests for unary op on class (ornladios#1814) Igor Socec (1): a301c5ad Dtype field ordering for NumPy 1.14 (ornladios#1837) Ivan Smirnov (1): d1db2ccf Make register_dtype() accept any field containers (ornladios#1225) Ivor Wanders (1): 2b045757 Improve documentation related to inheritance. (ornladios#1676) Jamie Snape (1): a0b8f70d Ensure PythonLibsNew_FOUND is set in FindPythonLibsNew module (ornladios#1373) Jason Rhinelander (31): c6a57c10 Fix dtype string leak d2757d04 Remove superfluous "requires_numpy" 64a99b92 Specify minimum needed cmake version in test suite 1b08df58 Fix `char &` arguments being non-bindable 7672292e Add informative compilation failure for method_adaptor failures 6a81dbbb Fix 2D Nx1/1xN inputs to eigen dense vector args a582d6c7 Build /permissive- under VS2017 835fa9bc Miscellaneous travis-ci updates/fixes 32ef69ac Qualify `cast_op_type` to help ICC 5c7a290d Fix new flake8 E741 error from using `l` variable 71178922 __qualname__ and nested class naming fixes (ornladios#1171) 086d53e8 Clean up eigen download code (and bump to 3.3.4) 3be401f2 Silence new MSVC C++17 deprecation warnings 48e1f9aa Fix premature destruction of args/kwargs arguments 367d723a Simplify arg copying b48d4a01 Added py::args ref counting tests 88efb251 Fixes for numpy 1.14.0 compatibility 507da418 Use a named rather than anon struct in instance 326deef2 Fix segfault when reloading interpreter with external modules (ornladios#1092) adbc8111 Use stricter brace initialization 657a51e8 Remove unnecessary `detail::` add56ccd MSVC workaround for broken `using detail::_` warning 431fc0e1 Fix numpy dtypes test on big-endian architectures 1ddfacba Fix for Python3 via brew e88656ab Improve macro type handling for types with commas 9f41c8ea Fix class name in overload failure message 6d0b4708 Reimplement version check and combine init macros 6862cb9b Add workaround for clang 3.3/3.4 e763f046 Base class destructor should be virtual f7bc18f5 Fix compatibility with catch v2 177713fa Fix gcc-8 compilation warning Jeff VanOss (3): 05d379a9 fix return from std::map bindings to __delitem__ (ornladios#1229) 01839dce remove duplicate feature from list (ornladios#1476) 77ef03d5 compile time check that properties have no py:arg values (ornladios#1524) Jeffrey Quesnelle (1): f93cd0aa PYBIND11_TLS_REPLACE_VALUE should use macro argument value in Python 3.7+ (ornladios#1683) Jeremy Maitin-Shepard (1): a3f4a0e8 Add support for __await__, __aiter__, and __anext__ protocols (ornladios#1842) Josh Kelley (1): 741576dd Update documentation for initialize_interpreter (ornladios#1584) Justin Bassett (1): 2cbafb05 fix detail::pythonbuf::overflow()'s return value to return not_eof(c) (ornladios#1479) Jörg Kreuzberger (1): 69dc380c ornladios#1208 Handle forced unwind exception (e.g. during pthread termination) Karl Haubenwallner (1): e9d6e879 Added a debug flag to the PYBIND11_INTERNALS_VERSION (ornladios#1549) Khachajantc Michael (1): e3cb2a67 Use std::addressof to obtain holder address instead of operator& Krzysztof Fornalczyk (1): 5c8746ff check for already existing enum value added; added test (ornladios#1453) Lori A. Burns (3): bdbe8d0b Enforces intel icpc >= 2017, fixes ornladios#1121 (ornladios#1363) 868d94fc Apply c++ standard flag only to files of CXX language. (ornladios#1678) f6c4c104 restores __invert__ to arithmetic-enabled enum, fixes ornladios#1907 (ornladios#1909) Maciek Starzyk (1): 9b028562 Update PyPI URLs Manuel Schneider (1): 492da592 another typo (ornladios#1675) Marc Schlaich (1): ab003dbd Correct VS version in FAQ Matthias Geier (1): 7bb1da96 fix copy-paste error: non-const -> const Michael Goulding (1): 77374a7e VS 15.8.0 Preview 4.0 has a bug with alias templates (ornladios#1462) Michał Wawrzyniec Urbańczyk (1): 978d439e Add PYBIND11_ prefix to the THROW macro to prevent name collisions. (ornladios#1578) Naotoshi Seo (1): 5ef1af13 Fix SEGV to create empty shaped numpy array (ornladios#1371) Nathan (1): 9b3fb053 Allow Windows.h min/max to coexist with pybind11 (ornladios#1847) Omar Awile (2): ac6cb91a Fixed small typo (ornladios#1633) 95f750a8 Add optional buffer size to pythonbuf::d_buffer constructor (ornladios#1687) Patrik Huber (1): 41a4fd8a Fix missing word typo Pauli Virtanen (1): c9d32a81 numpy: fix refcount leak to dtype singleton (ornladios#1860) Roland Dreier (2): 7a24bcf1 Fix malformed reST (ornladios#1802) 1aa8dd17 Fix assertion failure for unions (ornladios#1685) (ornladios#1709) Rune Paamand (2): 73634b6d Update iostream.h: Changed a local varname 'self' to 'self_' (ornladios#1535) 06d021b6 Issue ornladios#1532: Incompatible config options, /MP vs /Gm for MSVC in DEBUG (ornladios#1533) Ryota Suzuki (1): 1377fbf7 Fix unintentional escaping of character on Windows (ornladios#1574) (ornladios#1575) Samuel Debionne (2): 87fa6a43 Detect whether we are running in a Conda environment and adjust get_include() (ornladios#1877) 6ca312b3 Avoid infinite recursion in is_copy_constructible (ornladios#1910) Saran Tunyasuvunakool (2): b60fd233 Make sure `detail::get_internals` acquires the GIL before making Python calls. (ornladios#1836) bdf1a2cc In internals.h, only look at _DEBUG when compiling with MSVC. (ornladios#1855) Semen Yesylevskyy (1): ef13fb2e Info about inconsistent detection of Python version between pybind11 … (ornladios#1093) Sergei Izmailov (3): 979d75de doc: Add note about casting from `None` to `T*` (ornladios#1760) 09f08294 Avoid conversion to `int_` rhs argument of enum eq/ne (ornladios#1912) 6cb584e9 Adapt to python3.8 C API change (ornladios#1950) Sergei Lebedev (2): 08b0bda4 Added set::contains and generalized dict::contains (ornladios#1884) 046267c6 Added .empty() to all collection types (ornladios#1887) Stephen Larew (1): 5b4751af Add const to buffer:request() (ornladios#1890) Steven Johnson (1): 4ddf7c40 Add missing includes for better Bazel compatibility (ornladios#1255) Tarcísio Fischer (1): 54eb8193 Fix scoped enums comparison for equal/not equal cases (ornladios#1339) (ornladios#1571) Ted Drain (1): 0a0758ce Added write only property functions for issue ornladios#1142 (ornladios#1144) Thomas Hrabe (1): 534b756c Minor documentation clarification in numpy.rst (ornladios#1356) Thomas Peters (1): dffe869d quiet clang warning by adding default move ctor (ornladios#1821) Tom de Geus (1): a7ff616d Simplified example allowing more robust usage, fixed minor spelling issues Tomas Babej (1): 01fada76 Minor typo Toru Niina (1): 74d335a5 Replace a usage of C++14 language features with C++11 code (ornladios#1833) Trevor Laughlin (1): 63c2a972 Enable unique_ptr holder with mixed Deleters between base and derived types (ornladios#1353) Unknown (1): 0b3f44eb Trivial typos Vladimír Vondruš (2): 5b0ea77c Fix -Wmissing-prototypes warning on Clang. (ornladios#1863) 04c8f4b5 Expose BufferError among other pybind11 exceptions. (ornladios#1852) Wenzel Jakob (51): f94d7598 updated changelog for v2.2.1 release 6d19036c support docstrings in enum::value() (ornladios#1160) e7d304fb added citation reference (fixes ornladios#767) (ornladios#1189) 15e0e445 Moved section on licensing of contributions (fixes ornladios#1109) (ornladios#1188) ff6bd092 Fix pybind11 interoperability with Clang trunk (ornladios#1269) 2d0507db added v2.2.2 changelog 060936fe Detect pybind11 header path without depending on pip internals (fixes ornladios#1174) (ornladios#1190) ed670055 Minor fix for MSVC warning CS4459 (ornladios#1374) f5f66189 updated changelog for v2.2.3 cbd16a82 stl.h: propagate return value policies to type-specific casters (ornladios#1455) d4b37a28 added py::ellipsis() method for slicing of multidimensional NumPy arrays 885b5b90 Eigen test suite: don't create a np.matrix e0f3a766 Fixed flake8 error in test_iostream.py 44e39e0d fix regression reported by @cyfdecyf in ornladios#1454 (ornladios#1517) 35c82c72 changelog for version 2.2.4 & features targeted for 2.3.0 06710020 object_api: support the number protocol b4b22924 relax operator[] for tuples, lists, and sequences f4245181 enum_: move most functionality to a non-template implementation c8e9f3cc quench __setstate__ warnings (fixes ornladios#1522) c9b8933e flake8 fix 9f73060c std::array<> caster: support arbitrary sequences (ornladios#1602) adc2cdd5 fixed regression in STL type caster RVPs (fixes ornladios#1561) (ornladios#1603) e2eca4f8 Support C++17 aligned new statement (ornladios#1582) cea42467 fix py::cast<void *> (ornladios#1605) 64205140 added std::deque to overview.rst d1f64fa9 AppVeyor: quench pip deprecation warnings for v2.7 ccbe68b0 added binding delattr() -> PyObject_DelAttr analogous to hasattr() 25abf7ef flake8 fixes e11e71d8 Make compiler flags for -Werror specific to GNU, Clang, or Intel 51ca6b08 Update docs on std::out_of_range exception mapping (ornladios#1254) cf36e3d9 updated changelog 64f2a5f8 begin work on v2.3.1 ed39c504 README.md: added several folks who've made repeated contributions a1b71df1 fix issue ornladios#1804 (warning about redefined macros) 8b90b1da error_already_set: acquire GIL one line earlier (fixes ornladios#1779) 9fd47121 fix test suite (pytest changes in ExceptionInfo class) b2c4ff60 renamed local gil_scoped_acquire to gil_scoped_acquire_local to avoid ambiguity c9f5a464 pybind11 internals: separate different compilers 00a0aa99 v2.4.0 release e825205a begin working on v2.4.1 21d0eb46 Fix Python 3.8 test regression 5fd187eb minor changelog cleanup 31680e6f Implicit conversion from enum to int for Python 3.8 (fix by @sizmailov) e44fcc3c v2.4.1 release 82cf7935 begin working on next version f3109d84 future-proof Python version check from commit 31680e6 (@lgritz) 7f5dad7d Remove usage of C++14 constructs (fixes ornladios#1929) 7ec2ddfc v2.4.2 release 2abd7e1e updated release.rst to remove parts that are now automated 34c2281e begin working on next version 80d45248 v2.4.3 release Yannick Jadoul (4): b4719a60 Switching deprecated Thread Local Storage (TLS) usage in Python 3.7 to Thread Specific Storage (TSS) (ornladios#1454) 085a2943 Increasing timeout in test_gil_scoped.py to get AppVeyor to succeed 97784dad [BUGFIX] Fixing pybind11::error_already_set.matches to also work with exception subclasses (ornladios#1715) d23c821b Make static member functions, added with `def_static`, `staticmethod` descriptor instances (ornladios#1732) Zach DeVito (1): 03874e37 Fix leak in var arg handling ali-beep (1): 5ef13eb6 Add negative indexing support to stl_bind. (ornladios#1882) cdyson37 (1): 111b25b2 Mention flake8 and check-style.sh in CONTRIBUTING (ornladios#1567) kingofpayne (1): 12e8774b Added support for list insertion. (ornladios#1888) luz.paz (2): 28cb6764 misc. typos 13c08072 Typo luzpaz (2): 4b874616 Misc. typos (ornladios#1384) 21bf16f5 misc. comment typo (ornladios#1629) martinRenou (1): 35045eee Add getters for exception type, value and traceback (ornladios#1641) nstelzen (1): c2514340 Added note in documentation regarding make install (ornladios#1801) oremanj (2): fd9bc8f5 Add basic support for tag-based static polymorphism (ornladios#1326) e7761e33 Fix potential crash when calling an overloaded function (ornladios#1327) phil-zxx (1): c6b699d9 Added ability to convert from datetime.date to system_clock::time_point (ornladios#1848) sizmailov (1): 21c3911b add signed overload for `py::slice::compute` voxmea (1): 17983e74 Adds type_caster support for std::deque. (ornladios#1609)

germasch force-pushed the pr/fix_mpi_isend branch from a9f76cd to 6df1aaf Compare March 26, 2019 20:18

germasch force-pushed the pr/fix_mpi_isend branch 3 times, most recently from 9eae640 to d7f0d47 Compare March 28, 2019 14:30

germasch added 3 commits April 1, 2019 09:51

mpi_aggregator: don't pass a local variable to MPI_Isend

8a8b230

The data passed to MPI_Isend needs to remain available until after MPI_Wait, so use a class member variable instead.

mpi_aggregator: maintain requests for absolute position exchange inte…

dc1730f

…rnally

mpi_aggregator: keep requests for actual data exchange internal

f391c41

germasch force-pushed the pr/fix_mpi_isend branch from d7f0d47 to f55eaf0 Compare April 1, 2019 14:33

germasch force-pushed the pr/fix_mpi_isend branch from f55eaf0 to 3276034 Compare April 1, 2019 15:00

germasch force-pushed the pr/fix_mpi_isend branch from 3276034 to d01454f Compare April 1, 2019 16:03

germasch force-pushed the pr/fix_mpi_isend branch from d01454f to 5a7c98d Compare April 1, 2019 17:09

germasch force-pushed the pr/fix_mpi_isend branch from 5a7c98d to 55c3fbc Compare April 1, 2019 18:43

germasch force-pushed the pr/fix_mpi_isend branch from 55c3fbc to 0b3c9c6 Compare April 2, 2019 04:20

germasch force-pushed the pr/fix_mpi_isend branch 2 times, most recently from 9b9b4a3 to 01c3d0f Compare April 2, 2019 13:47

mpi_aggregator: use MPI_STATUS_IGNORE

307c1b8

where we're ignoring the status anyway (which is everywhere). MPI_STATUS_IGNORE added to mpidummy.{h,cpp}

germasch force-pushed the pr/fix_mpi_isend branch from 01c3d0f to 307c1b8 Compare April 2, 2019 14:41

chuckatkins mentioned this pull request Apr 2, 2019

MPIAggregator sending local stack variables via MPI_Isend #1348

Closed

chuckatkins closed this Apr 2, 2019

chuckatkins reopened this Apr 9, 2019

chuckatkins mentioned this pull request Apr 10, 2019

Fix mpi aggregator state #1369

Merged

chuckatkins closed this Apr 10, 2019

germasch deleted the pr/fix_mpi_isend branch April 10, 2019 18:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mpi_aggregator: don't pass a local variable to MPI_Isend #1326

mpi_aggregator: don't pass a local variable to MPI_Isend #1326

germasch commented Mar 26, 2019

chuckatkins commented Mar 27, 2019

eisenhauer commented Mar 31, 2019

germasch commented Mar 31, 2019

pnorbert commented Mar 31, 2019 via email

germasch commented Mar 31, 2019

germasch commented Mar 31, 2019

eisenhauer commented Mar 31, 2019

germasch commented Apr 1, 2019 •

edited

Loading

williamfgc commented Apr 1, 2019 •

edited

Loading

germasch commented Apr 1, 2019

williamfgc commented Apr 1, 2019 •

edited

Loading

williamfgc commented Apr 1, 2019 •

edited

Loading

germasch commented Apr 1, 2019

pnorbert commented Apr 1, 2019

williamfgc commented Apr 1, 2019 •

edited

Loading

germasch commented Apr 1, 2019

williamfgc commented Apr 1, 2019 •

edited

Loading

williamfgc commented Apr 1, 2019

germasch commented Apr 1, 2019

germasch commented Apr 1, 2019

williamfgc commented Apr 1, 2019 •

edited

Loading

germasch commented Apr 1, 2019

germasch commented Apr 2, 2019

germasch commented Apr 2, 2019

chuckatkins commented Apr 2, 2019

chuckatkins commented Apr 9, 2019

germasch commented Apr 10, 2019

chuckatkins commented Apr 10, 2019

chuckatkins commented Apr 10, 2019

mpi_aggregator: don't pass a local variable to MPI_Isend #1326

mpi_aggregator: don't pass a local variable to MPI_Isend #1326

Conversation

germasch commented Mar 26, 2019

chuckatkins commented Mar 27, 2019

eisenhauer commented Mar 31, 2019

germasch commented Mar 31, 2019

pnorbert commented Mar 31, 2019 via email

germasch commented Mar 31, 2019

germasch commented Mar 31, 2019

eisenhauer commented Mar 31, 2019

germasch commented Apr 1, 2019 • edited Loading

williamfgc commented Apr 1, 2019 • edited Loading

germasch commented Apr 1, 2019

williamfgc commented Apr 1, 2019 • edited Loading

williamfgc commented Apr 1, 2019 • edited Loading

germasch commented Apr 1, 2019

pnorbert commented Apr 1, 2019

williamfgc commented Apr 1, 2019 • edited Loading

germasch commented Apr 1, 2019

williamfgc commented Apr 1, 2019 • edited Loading

williamfgc commented Apr 1, 2019

germasch commented Apr 1, 2019

germasch commented Apr 1, 2019

williamfgc commented Apr 1, 2019 • edited Loading

germasch commented Apr 1, 2019

germasch commented Apr 2, 2019

germasch commented Apr 2, 2019

chuckatkins commented Apr 2, 2019

chuckatkins commented Apr 9, 2019

germasch commented Apr 10, 2019

chuckatkins commented Apr 10, 2019

chuckatkins commented Apr 10, 2019

germasch commented Apr 1, 2019 •

edited

Loading

williamfgc commented Apr 1, 2019 •

edited

Loading

williamfgc commented Apr 1, 2019 •

edited

Loading

williamfgc commented Apr 1, 2019 •

edited

Loading

williamfgc commented Apr 1, 2019 •

edited

Loading

williamfgc commented Apr 1, 2019 •

edited

Loading

williamfgc commented Apr 1, 2019 •

edited

Loading