chore: schedule chains #3819

Merged: 1 commit merged into main on Oct 11, 2024
Conversation

@romange (Collaborator) commented Sep 28, 2024

Use an intrusive queue that allows batching of scheduling calls instead of handling each call separately. This optimization improves latency and throughput by 3-5%.

@romange romange requested a review from dranikpg September 29, 2024 09:38
dranikpg previously approved these changes Sep 29, 2024

@dranikpg (Contributor) left a comment:

Questionable 🤔🤔🤔 I don't understand the technical motivation behind it. We have the same amount of inter-thread coordination (if not more); it's just an mpsc queue embedded in an mpsc queue.

We should ask why it is costly to use the shard set queue. Dynamic dispatch? Heavy functor objects? Object access times? What if we use variant<ScheduleContext, Functor> in the shard set?
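One way to read the `variant<ScheduleContext, Functor>` suggestion is the following hypothetical sketch (made-up names, not the PR's code): structured items carry plain data and skip type erasure entirely, while the functor alternative keeps the generic fallback.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>
#include <variant>

// Hypothetical stand-in for the transaction scheduling context.
struct ScheduleContext {
  int txid;
};

// One shard-queue item: either a structured schedule request or a generic
// functor, as suggested above. Names here are illustrative only.
using ShardTask = std::variant<ScheduleContext, std::function<void()>>;

// Dispatch returns the txid for the structured path, -1 for the functor path.
int Dispatch(const ShardTask& task) {
  return std::visit(
      [](const auto& t) -> int {
        using T = std::decay_t<decltype(t)>;
        if constexpr (std::is_same_v<T, ScheduleContext>) {
          return t.txid;  // structured fast path: no type erasure involved
        } else {
          t();  // generic functor path: type-erased indirect call
          return -1;
        }
      },
      task);
}
```

The structured alternative avoids the heap allocation and indirect call that a `std::function`-like functor can incur, which is exactly the cost question raised above.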

Comment on lines 106 to 104
struct ScheduleQ {
base::MPSCIntrusiveQueue<ScheduleContext> queue;

static constexpr size_t kSz = sizeof(queue);

char pad1[64];
atomic_bool armed{false};
char pad2[60];
};
Contributor:

🤓

alignas(64) MPSC queue;
alignas(64) atomic_bool armed;

Contributor:

but instead of 64 there is a better option: std::hardware_destructive_interference_size

Collaborator Author:

yes, but this is just an awful name that translates to 64 anyway

Contributor:

in the future it can be 128, so the awful name is better than a hard-coded const

Collaborator Author:

you are right, my answer was emotional 😺

Comment on lines 520 to 522

static void ScheduleBatchInShard();
Contributor:

Please comment on what it does and what the motivation is; we'll be puzzled in a few months

src/server/transaction.cc (resolved)
}

// of shard_num arity.
ScheduleQ* schedule_queues = nullptr;
Contributor:

unique_ptr

Collaborator Author:

does not provide any value, since we must delete manually anyway.

romange commented Sep 29, 2024

  1. Batching of code - it allows locality. I have not yet measured metrics, but the loop may process 10 items or more in a batch.
  2. More importantly - it will allow us to try to prefetch key memory in Dashtable when processing operations. Before this change it was not possible to prefetch memory for multiple ops, because all of them were executed individually.

You are asking a good question - why use an intrusive queue and not a regular queue? I do not know which is more efficient, but using a variant would require factoring FiberQueue out of helio, so it's more work.

romange commented Sep 29, 2024

FiberQueue uses mpmc_bounded_queue, which notifies a consumer. mpsc is fully lock-free - it does not allow blocking.
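For reference, the push/arm/drain interplay described in this thread might look roughly like the following single-threaded sketch. A `std::deque` stands in for `base::MPSCIntrusiveQueue`, and all names are hypothetical rather than the PR's actual code.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <vector>

// Minimal sketch of the "arm once, drain in batches" pattern.
struct ScheduleQ {
  std::deque<int> queue;          // pending ScheduleContext ids (stand-in)
  std::atomic_bool armed{false};  // set by the first producer after idle
};

// Producer side: enqueue, and request a shard wake-up only on the
// false -> true transition of `armed`.
bool Push(ScheduleQ& q, int txid) {
  q.queue.push_back(txid);
  bool expected = false;
  return q.armed.compare_exchange_strong(expected, true);
}

// Consumer side (a ScheduleBatchInShard analogue): disarm, then drain
// everything queued so far in one batch, enabling locality.
std::vector<int> DrainBatch(ScheduleQ& q) {
  q.armed.store(false);
  std::vector<int> batch;
  while (!q.queue.empty()) {
    batch.push_back(q.queue.front());
    q.queue.pop_front();
  }
  return batch;
}
```

The point of the compare-exchange on `armed` is that only the first producer after an idle period pays for waking the shard; later pushes just enqueue, and the consumer then processes the whole backlog in one batch.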

@dranikpg (Contributor):

> Batching of code - allows locality. I have not yet measured metrics but the loop may process 10 items and more in a batch

What you call batching and locality are purely code-driven concepts. Technically nothing has changed - you executed the task 10 times sequentially before, just calling a functor each time instead of calling the function itself directly. Instead of checking the shard queue atomics you check your new queue atomics, that's it.

pop one job on shard set {
  for i = 1, 10 {
     Run()
  }
}

and

for i = 1, 10 {
  pop job from shard set {
    Run()
  }
}

> More importantly - it will allow us to try and prefetch keys memory in Dashtable when processing operations. So before that it was not possible to prefetch memory for multiple ops because all of them were executed individually.

So we can plan it first at a higher level. Let's revise what we use the shard set queues for. Structured information is always better than generalized (functors), and our execution flow is always better off structured than hidden. So I assume that to parallelize memory access one day you intend to:

vector<ScheduleCntx> batch = next_ten_readonly();
vector<vector<it>> keys;
for (cntx: batch):
  keys += fetch_iterators()
for (cntx: batch)
  run(cntx, keys[i])
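A concrete rendering of the prefetch-then-run plan sketched above, using the GCC/Clang `__builtin_prefetch` builtin. This is a hypothetical illustration; none of these names come from the PR or from Dashtable.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a hash-table entry a transaction would touch.
struct Entry {
  long value;
};

// Two-phase batch execution: first issue prefetches for every key the batch
// will touch, then execute; by then the cache lines are hopefully resident.
long RunBatch(const std::vector<Entry*>& batch) {
  for (const Entry* e : batch) {
    __builtin_prefetch(e, /*rw=*/0, /*locality=*/3);  // read, keep in cache
  }
  long sum = 0;
  for (const Entry* e : batch) {
    sum += e->value;  // the actual per-op work goes here
  }
  return sum;
}
```

The win over per-op execution is that the memory latency of the N lookups overlaps instead of being paid serially, which is exactly what individually executed ops cannot do.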


romange commented Sep 29, 2024

Yes, this is the plan. In any case, I am not in a hurry to submit this PR; it can stay as food for thought.

dranikpg previously approved these changes Sep 30, 2024

dranikpg commented Oct 1, 2024

> In any case I am not in hurry to submit this Pr

Let's merge; it's just that you reveal your plans step by step, whereas we're curious about the final plan.

Use an intrusive queue that allows batching of scheduling calls instead of handling each call separately.
This optimization improves latency and throughput by 3-5%.
In addition, we expose batching statistics in the info transaction block.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>

romange commented Oct 11, 2024

hardware_destructive_interference_size does not exist in the gcc versions we use, so I cannot use it. I added a named const for that. PTAL

@BorysTheDev (Contributor):

> hardware_destructive_interference_size does not exist in the gcc versions we use so I can not use it. I added a named const for that. PTAL

It's really strange, because it's C++17.

@kostasrim (Contributor):

> hardware_destructive_interference_size does not exist in the gcc versions we use so I can not use it. I added a named const for that. PTAL

> It's really strange because it's c++17

Support for C++17 varies from compiler version to version - the same holds for all C++ releases. For example, you got modules in C++20: are they properly supported in recent versions of gcc and clang? I don't think so. Basically, every new release just adds support for the missing features, so even if you compile something with the C++17 flag, it doesn't necessarily mean the feature exists.
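Because of this, availability of the constant is signalled by its own library feature-test macro rather than by the `-std=c++17` flag alone. A hedged sketch of the fallback pattern (the constant name `kCacheLineSize` and the `Padded` struct are illustrative, not the PR's code):

```cpp
#include <cassert>
#include <cstddef>
#include <new>  // defines hardware_destructive_interference_size if supported

// Older libstdc++ versions compile with -std=c++17 yet do not ship the
// constant, so probe the feature-test macro and fall back to a named const.
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t kCacheLineSize =
    std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kCacheLineSize = 64;  // common x86-64 cache line
#endif

// Example use: keep hot atomics on their own cache line to avoid false
// sharing, as discussed for ScheduleQ above.
struct alignas(kCacheLineSize) Padded {
  int value;
};
```

This keeps the "named const" approach from the PR while picking up the standard value automatically on toolchains that provide it.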

@dranikpg (Contributor) left a comment:

> hardware_destructive_interference_size does not exist in the gcc versions we use so I can not use it. I added a named const for that. PTAL

We use absl::hardware_something in a few places 🙂


Let's merge and see what we make out of it later

@romange romange merged commit 5d2c308 into main Oct 11, 2024
12 checks passed
@romange romange deleted the ScheduleChains branch October 11, 2024 19:41
4 participants