Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WeakRB: New and proven SPSC FIFO implementation #2314

Closed
wants to merge 9 commits into from
Closed

WeakRB: New and proven SPSC FIFO implementation #2314

wants to merge 9 commits into from

Conversation

uklotzde
Copy link
Contributor

@uklotzde uklotzde commented Oct 4, 2019

A Weak-consistency Ring Buffer implementing a bounded, lock-free, single-producer/single-consumer FIFO. This could gradually replace the legacy implementation in lib/portaudio. The element type must provide a default constructor and support move assignment.

Used for:

  • Multi-threaded analysis (replaces MpscFifo)
  • Lock-free CachingReaderWorker (new) deferred
  • CachingReaderWorker <-> CachingReader (engine thread)
  • StatsManager
    • ...fixing a memory leak and avoiding allocations

References:

@uklotzde uklotzde added this to the 2.3.0 milestone Oct 4, 2019
@daschuer
Copy link
Member

daschuer commented Oct 4, 2019

While the new fifo looks intersting, I think it is not the best place to use it.
It is because, in the double load situation, we don't want to process old tracks on the queue. We always immediately want to process the last track the user has thrown into the deck.
So I think a ControlValuAtomic based solution fits better.

@daschuer
Copy link
Member

daschuer commented Oct 4, 2019

No ... we don't enven need a ControlValueAtomic. Just a QAtomicPointer, that contains the track to load and a reference track to compare if the track is already loaded.

@uklotzde
Copy link
Contributor Author

uklotzde commented Oct 4, 2019

No ... we don't enven need a ControlValueAtomic. Just a QAtomicPointer, that contains the track to load and a reference track to compare if the track is already loaded.

pre C++20: https://en.cppreference.com/w/cpp/memory/shared_ptr
since C++20: https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic2

@uklotzde
Copy link
Contributor Author

uklotzde commented Oct 4, 2019

I'll remove the commit from this PR.

@uklotzde uklotzde changed the title WeakRB: New and proven SPSC FIFO implementation [WiP] WeakRB: New and proven SPSC FIFO implementation Oct 4, 2019
@uklotzde uklotzde changed the title [WiP] WeakRB: New and proven SPSC FIFO implementation WeakRB: New and proven SPSC FIFO implementation Oct 5, 2019
@uklotzde
Copy link
Contributor Author

uklotzde commented Oct 5, 2019

Getting all the modulo calculations right was tricky. The new interface includes some nice convenience methods.

@daschuer
Copy link
Member

daschuer commented Oct 8, 2019

This is really an interesting topic. Unfortunately it disproves some of my assumptions about the use of the volatile keyword, which are also part of the Mixxx source.
https://stackoverflow.com/questions/26307071/does-the-c-volatile-keyword-introduce-a-memory-fence

Does it mean that the original PaUtilRingBuffer is not guaranteed to work, because memory fences are not guaranteed with volatile? Scary.

The Lamport queue from https://www.di.ens.fr/~guatto/papers/sbac13.pdf has also no memory explicit fences. The author assumes sequential access. How is this related?

@uklotzde
Copy link
Contributor Author

uklotzde commented Oct 8, 2019

volatile is useless for synchronization purposes. It guarantees that data is actually written to an address, i.e. required for memory-mapped access.

The PortAudio ring buffer uses explicit memory barriers, probably an implementation variant of the mentioned Lamport queue.

Copy link
Member

@daschuer daschuer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make use of https://en.cppreference.com/w/cpp/memory/uninitialized_move
and avoid to initially fill the cue with default constructed objects?

if (writable > 0) {
DEBUG_ASSERT(writable < m_size);
for (size_type i = 0; i < writable; ++i) {
m_data[(m_pfront + i) % m_size] = std::move(data[i]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we move the object out of the callers memory. I guess that is surprising and should be avoided.
Can we use value_type&& for the move type?
For me it is also unclear how the compiler distinguishes both push variants. Can we use different names to make it obvious?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(m_pfront + i) % m_size fails in case of overflow and max size_type is not dividable by m_size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed the overflow detection, adjusting the capacity if necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The const vs. non-const overload is chosen be the argument type. I have renamed the copying push operations to distinguish them explicitly. By default push moves values into the queue while pop moves values out of the queue. Moving is supposed to be more general and efficient than copying.

for (size_type i = 0; i < writable; ++i) {
m_data[(m_pfront + i) % m_size] = data[i];
}
m_pfront = (m_pfront + writable) % m_size;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we store here "relaxed" and "release" short after.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make sure to read and write the atomics only once par fifo access and cache their values in a local variable during usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m_cback and m_pfront are already non-shared, cached values. Both members are documented accordingly, that's the basic idea of the algorithm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introducing a local variable is not necessary, let's leave this optimization to the compiler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is IMHO necessary, to better understand the code. And to have a performance gain independent from the used comilerer.

The atomic type gurantees that the value is actually written and not optimized away. So we must not use it if optimization is desired.

In this case I thing = is overloaded with load which defaults to a sequential store.

It could be also interesting, if the new code actually is faster. Sometimes trivial quirks destroys a theoretical performance gain making this PR pointless.

In this case a memory fence is IMHO negligible compared to calling a sequential constructor vs. to a vectorized memcpy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again: m_pfront and m_cback are simple integers, not atomics. The atomics m_front and m_back are accessed through explicit load and store operations.

The use cases were this new FIFO is used only push or pop single messages into/from the queue, not for plain audio data. The memcpy is dangerous if those types are not PODs. Safety first. For PODs the compiler might be able to optimize the loops away. If this is prevented by the modulo operation then two separate loops could be used.

The existing code already contains a memory leak, because the detached alloc/free operations of the POD message type are used incorrectly. And this is only a minor bug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now I see the issue.
So we store the same value redundant in two variables?
That should be fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the point of the algorithm!!!

@uklotzde
Copy link
Contributor Author

uklotzde commented Oct 8, 2019

I've continued with the original Rust naming style after starting it this way out of habit. Negotiable.

m_cback(0) {
DEBUG_ASSERT(m_size > capacity);
DEBUG_ASSERT(m_size < m_size + capacity); // no overflow
DEBUG_ASSERT(std::atomic_is_lock_free(&m_front));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m_front.is_lock_free() will look nicer.

typedef S size_type;

explicit SpscFifo(size_type capacity)
: m_size(adjust_capacity(capacity) + 1),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use for the extra field?
In one use, we rely on the fact that only one element is queued. Does this fail now?

The use of the cue is there ugly anyway. I think we should fix the user code as well.

@@ -83,7 +83,7 @@ AnalyzerThread::AnalyzerThread(
m_dbConnectionPool(std::move(dbConnectionPool)),
m_pConfig(std::move(pConfig)),
m_modeFlags(modeFlags),
m_nextTrack(MpscFifoConcurrency::SingleProducer),
m_nextTrack(1),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not use a queue here.


private:
static size_type adjust_capacity(size_type capacity) {
VERIFY_OR_DEBUG_ASSERT(capacity <= std::numeric_limits<size_type>::max() / 2) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we limit the capacity to half the possible size?

The % trick only works if there is no rest after an overflow. We need to adjust the size for this.

Let's say size_type is char.
We have possible index 0...255
If capacity is 100 we have 100 + 1 (vor whatever reason)
Index 255 % 101 = 53
Index (255 + 1) % 101 = 0

If we force the capacity to 128, it works
Index 255 % 128 = 127 (last index)
Index (255 + 1) % 127 = 0 (first index)

This is done here:

size = roundUpToPowerOf2(size);

@daschuer
Copy link
Member

daschuer commented Oct 9, 2019

What is the main point of this PR?
Just fixing a memory leak?

I think if we introduce our own FIFO here, It should be fully optimized and real greate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants