Asynchronous stage in software pipeline #80
Conversation
Many thanks, @masahi! It's an outstanding contribution to push TVM perf to another level. My experiments show the async pipeline will speed up cutlass from 80T to 93T (on fp16 gemm, 3080). I am looking forward to the following PRs. One question: can it also speed up kernels with wmma intrinsics or those using CUDA cores?
@Hzfengsy On NVIDIA Ampere, the only asynchronous operation is the global-to-shared memory copy, so wmma, mma, and CUDA-core kernels can all benefit from it. I have an MMA schedule with a multi-stage pipeline, where the global-to-shared copy is 4x multi-buffered and asynchronous. The test case is here: https://github.com/masahi/tvm/blob/2b325394e951b8b38c9a24d9a4b7a8c6f6d749e7/tests/python/unittest/test_tir_transform_inject_software_pipeline.py#L1436. It generates interesting code (generated TIR after lowering to PTX async instructions), but currently I'm not getting good performance: the baseline MMA schedule without any pipelining or async gets 40 TFLOPS on a 3070, while this fancy schedule gets only 33 TFLOPS. My next step is to get Ampere async GEMM to actually perform on par with or better than the baseline.
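To make the 4x multi-buffering mentioned above concrete, here is a minimal pure-Python sketch (not TVM code; the helper names are hypothetical): an asynchronous copy into a 4-deep staging buffer is overlapped with compute, and a counting-style wait keeps at most three copies in flight.

```python
# Illustrative pure-Python model (not TVM code) of a 4x multi-buffered,
# asynchronous global-to-shared copy overlapped with compute.
from collections import deque

NUM_BUFFERS = 4        # the shared-memory staging buffer is multi-buffered 4x
in_flight = deque()    # copies issued but not yet awaited

def copy_async(tile):
    in_flight.append(tile)
    print(f"issue async copy of tile {tile} into buffer {tile % NUM_BUFFERS}")

def wait_pending_at_most(n):
    # Counting-style synchronization: block until at most n copies are pending.
    while len(in_flight) > n:
        print(f"  copy of tile {in_flight.popleft()} completed")

num_tiles = 8
for i in range(num_tiles):
    copy_async(i)
    if i >= NUM_BUFFERS - 1:
        wait_pending_at_most(NUM_BUFFERS - 1)       # keep at most 3 copies in flight
        print(f"compute on tile {i - (NUM_BUFFERS - 1)}")
wait_pending_at_most(0)                              # drain the epilogue
for t in range(num_tiles - (NUM_BUFFERS - 1), num_tiles):
    print(f"compute on tile {t}")
```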
My observation: async works well for fp16-16-16 (fp16 accumulator) but helps little for fp16-16-32. On the other hand, the best cutlass kernel only uses 3 stages on my machine; I guess that is because of shared memory usage.
Many thanks~ This setting seems to also greatly benefit DMA synchronization handling in NPU workloads. For example, there could be "input DMA" - "computation" - "output DMA" pipelines, where each pipeline stage may occupy its own IQ, so explicit synchronization instructions must be correctly inserted, like "the input DMA waits for the last (i-1 or i-2) output DMA". Here are my two questions, just out of my curiosity :),
Thanks @wrongtest for questions!
I've added a section explaining what
What do you mean by "the explicit control-flow dependency annotations" here? From the given annotation, we obviously need to determine read-write dependencies to tell which stmt is the consumer of which async stmt. In that sense, I think I'm already working with data dependencies only. I don't think the TIR software pipeline transform pass deals with control flow at all.
Yes, if I understand your question correctly, async_commit_stage/async_wait_stage are new TIR intrinsics: they are generated during sync insertion, and each backend can specify how to lower them, for example in CUDA (a sketch of the idea is below):
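Since the CUDA snippet referenced above is not preserved here, the following is a hedged illustration of the kind of lowering rule meant: on Ampere, the two intrinsics map almost one-to-one onto the `cp.async.commit_group` / `cp.async.wait_group` PTX instructions. The Python function below only returns the instruction text and is not TVM's actual lowering API.

```python
def lower_async_intrinsic_for_cuda(name, args):
    # Illustrative only: real TVM lowering goes through target-specific
    # intrinsic lowering, not a plain string substitution like this.
    if name == "async_commit_stage":
        return "cp.async.commit_group;"
    if name == "async_wait_stage":
        pending_allowed = args[1]      # args = (stage, allowed in-flight groups)
        return f"cp.async.wait_group {pending_allowed};"
    raise ValueError(f"unsupported async intrinsic: {name}")

print(lower_async_intrinsic_for_cuda("async_commit_stage", (0,)))   # cp.async.commit_group;
print(lower_async_intrinsic_for_cuda("async_wait_stage", (0, 3)))   # cp.async.wait_group 3;
```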
Awesome stuff, @masahi! Specific questions:
Small doc nits for clarity:
Higher-level thoughts: One concern is moving an async system into TIR that is too narrow. You discuss this quite a bit, and make a lot of good points. This system clearly works very well for CUDA, but if there isn't a good transform to token-based systems, do we end up with two async systems in TIR? Not trying to suggest this should not be incorporated, just genuinely curious what the long-term plans might be.
This can be a bit difficult, as it would make sense for future systems to try to use the existing async system to whatever extent possible. Again, not trying to suggest a specific course of action, just pointing out that there's a difficult general-vs-specific trade-off here.

Regarding interoperability with a token-based model: you make an excellent point that a token system has expressibility issues in a TIR program, because it has to refer to a specific statement at a specific loop iteration. But it's also more intuitive. This contrasts with the implicit system here, which feels unintuitive but is quite succinct and easily expressible. I'm still working through what a translation to/from a token system might look like, but I'm currently thinking that they're much closer than I initially thought. In either case, what we want (at a high level) is a way to refer to a specific chunk, blocking execution until that chunk has finished. Your comment about keeping a buffer in the token case made me realize that it ends up pretty similar: a token system waiting for chunk C from N iterations previous might use

It's interesting that you mention that there's no easy translation from tokens to counting (based on MLIR not having implemented one?), but you suspect the reverse could be simple. Does this suggest that the token system has less information encoded than the counting system? (I.e., we can go counting --> tokens but not the reverse because we lost information in the transformation.) Or is it just specifics of a PTX-like system, not a "counting system" in general, that make the translation to it hard?
Thanks for the detailed feedback @JosephTheOctonaut! I'll update the doc accordingly, but here are my answers.
Correct, commit groups must execute in FIFO order, but the order of completions within one commit group is not specified, following the PTX spec.
Yes, I haven't put deep thought into the name choice; here I simply want to say "the index of the stmt/block in the list", provided to
Both correct. I'm using "commit" in the same sense as PTX here. In the doc, I'm probably using "async operations" when I should be using "async commit groups" to be exact. But I think I'm using "async commit groups" when the distinction matters.
Yes, I agree that it was confusing; I was a bit informal since it is just pseudocode. As you said, the exact placement doesn't matter, both in the illustration and the implementation. I made it consistent in the doc. In the implementation,
This is an interesting question that I haven't really thought about. I would expect that each async "engine" is represented by its own thread, so for example if a vector unit finds out that it needs to wait at some point, the thread that's running the vector unit should block. I hope this makes sense... I think this is a natural model, but as you said, I'm not sure if such details should be specified at the TIR level.
What I said is that the reverse seems more "feasible", not simple :) I would say the counting system relies on more implicit state in the HW, so going from tokens to counts requires uncovering that state from the given token alone. The lost information would be the ordering of async operations (or commit groups, to be exact), or any information about "other" tokens. Given only a token, we don't know (1) how many other async ops are in flight before that sync point and (2) how many of them can still be in flight after (the latter is required by PTX). I claim that this is a difficult problem in general, and I gave the MLIR bit as a data point. Maybe we can encode some information in the token itself, but I haven't really thought about it.
@masahi Hi~ thanks for the reply several days ago. After a bit of learning about the CUDA async interface https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#with-memcpy_async-pipeline-pattern-multi
Thank you for putting this together, and I really like it!
**Semantics of the proposed intrinsics**. "stage" refers to the same notion in the TIR software pipeline.
- `async_commit_stage(i)` : Group one or more invocations of async operations, and "commit" them to the `i`-th stage. The exact interpretation of "committing" can be up to each backend, but informally it signifies that a group of async operations is now in flight. The group of operations committed together is awaited as one chunk, and thus it constitutes the granularity at which the synchronization intrinsic discussed next operates.
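The excerpt above quotes only the commit half of the semantics. For context, here is a small double-buffering example in the same pseudocode style as the RFC's other snippets, assuming (as described elsewhere in this thread) that `async_wait_stage(i, N)` blocks until at most `N` groups committed to stage `i` remain pending:

```python
# Pseudocode, not runnable TIR; assumes B[0] was filled by a prologue copy.
for i in range(16):
    with async_scope:
        B[(i + 1) % 2] = A[i] + 1   # async producer writes the "next" buffer
    async_commit_stage(0)           # the copy above becomes one committed group

    async_wait_stage(0, 1)          # at most 1 group may remain pending, so the
                                    # group that wrote B[i % 2] has completed
    C[i] = B[i % 2] * 2             # consumer reads the "current" buffer
```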
Would it be cleaner to express `async_commit_stage` as an annotation rather than a callable intrinsic? I'm thinking something like the following:
```python
# A stage includes one or more async_scope blocks, and defines a
# group. async_scope blocks may only occur directly within an
# async_stage block, or as elements of a SeqStmt that occurs directly
# within an async_stage block. The group is committed at the site
# where it is defined.
with async_commit_stage(0):
    with async_scope:
        B[(i + 1) % 2] = A[i] + 1
```
This way, there's less backtracking needed for a reader to determine which scopes are being launched, and a runtime wouldn't need to maintain state describing the stages to be launched the next time it encounters a call to `async_commit_stage`. This would also prevent cases where the scopes have been defined outside a conditional but the `async_commit_stage` call exists inside a conditional.
Ooh, this would also give a really clean notation for stages that consist of only a single scope.
```python
# If only one async_scope exists, the annotation can be dropped.
with async_stage(0):
    B[(i + 1) % 2] = A[i] + 1
```
Thanks, I think this is a great idea. We probably need a separate lowering pass for `commit_stage` to insert a target-specific commit at the right place, while currently the lowering is trivial (a line-by-line change), but it is certainly doable. I'll try to implement this while we resolve the other discussion points.
I've incorporated this suggestion into the doc and also in my implementation.
(The following is highly speculative) On the other hand, translation from “count” to “token” seems more feasible: At each synchronization point, a backend presumably maintains the number and the order of pending async operations. Given the count, it should be possible to derive the correct token from the corresponding ordered list of tokens.
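A rough pure-Python illustration of that idea (hypothetical helper names, not a proposed API): assuming the backend keeps the pending groups of a queue in FIFO order, a count-based wait can be rewritten as a wait on one specific token.

```python
from collections import deque

pending_tokens = deque()        # FIFO of tokens for async groups still in flight

def commit_group():
    token = object()            # a token uniquely identifies one committed group
    pending_tokens.append(token)
    return token

def lower_count_wait_to_token_wait(allowed_in_flight):
    # "Wait until at most `allowed_in_flight` groups are pending" is equivalent to
    # waiting on the (len(pending) - allowed_in_flight)-th oldest token, because
    # completion is in FIFO order: once that group finishes, all older ones have too.
    n_to_retire = len(pending_tokens) - allowed_in_flight
    if n_to_retire <= 0:
        return None             # nothing to wait for
    token = None
    for _ in range(n_to_retire):
        token = pending_tokens.popleft()
    return token                # a token-based backend would wait on this token

t0, t1, t2 = commit_group(), commit_group(), commit_group()
assert lower_count_wait_to_token_wait(allowed_in_flight=1) is t1
```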
Though speculative, this makes sense to me. This could even be done within the TIR itself, with each "stage" having a unique integer value, and only ever using `N=0` when waiting. In effect, the integer passed to commit/wait would be the unique token. (This assumes that there is minimal overhead in maintaining the existence of a "stage" that can be waited on.)

If sequentially assigned, I think this would also allow the token-based integer synchronization to be translated into the count-based synchronization. (E.g., if iteration `i` launches stage `i + offset` and wants `N` in flight, it could wait on stage `i + offset - N`.)
I realized that I should make the notion of "stage" more precise. For example, if you look at the "Multi-stage pipelined GEMM" under https://github.com/masahi/tvm-rfcs/blob/async-pipe/rfcs/0077-async-pipeline.md#more-examples, and in particular how `A_shared` and `B_shared` are produced and consumed, there are 4 "stages" in the traditional sense, one of which is overlapped with compute. But `commit_stage` and `wait_stage` only ever refer to "stage" 0, the async producer stage. This notion of stage might be TIR software pipeline specific, and it corresponds to "0" in the annotation
```python
sch.annotate(k0, ann_key="software_pipeline_stage", ann_val=[0, 0, 2, 3, 3])
```
I think the notion of "stage" you have in mind is the traditional one; for example, when you say "stage `i + offset`", that would correspond to the index `(i + 3) % 4` in the example.

On the other hand, this proposal and I talk about "stage" in the sense used by the TIR software pipeline. So for example, if I am given an annotation `"software_pipeline_stage", ann_val=[0, 0, 3]`, I'd say there are "two" stages in the TIR sense, even though in the traditional sense there would be four stages.
cc @vinx13
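A trivial way to see the two counts side by side, under the interpretation described above (illustrative arithmetic only, not part of the proposal):

```python
ann_val = [0, 0, 3]                    # per-block "software_pipeline_stage" annotation
tir_stages = len(set(ann_val))         # distinct stage labels: "two" stages in the TIR sense
traditional_stages = max(ann_val) + 1  # pipeline depth: four stages in the traditional sense
print(tir_stages, traditional_stages)  # -> 2 4
```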
@masahi I believe I understand the intended usage of "stage" here, but could we use it like Eric is suggesting? Is there a limitation to having an arbitrary number of async stages?
@Lunderberg Unless I've misunderstood, this translation would require fully unrolling all loops, right? Because each time you call `commit` you need to pass in a new static integer?
An alternative to unrolling is if non-constant stage numbers are allowed. It might not be pretty or succinct, but you could thread variable(s) through the program to hold stage (fence) IDs.
@masahi is accurate. In the TIR software pipeline, we can group multiple statements into one stage. Here "stage" is only an annotation of how the loop should be shifted: the i-th iteration of the pipelined loop executes the `(i - stage)`-th iteration of the original loop.
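A tiny runnable illustration of that shift (illustrative only, not the transform's actual implementation), for a hypothetical loop with two statements annotated with stages `[0, 1]`:

```python
# Original loop body: stmt0(i) then stmt1(i), for i in range(4).
# With software_pipeline_stage = [0, 1], the i-th iteration of the pipelined
# loop runs stmt0 at iteration i and stmt1 at iteration i - 1.
stages = [0, 1]
n = 4
for i in range(n + max(stages)):        # prologue/epilogue folded in for brevity
    for stmt, stage in enumerate(stages):
        j = i - stage                    # (i - stage)-th iteration of the original loop
        if 0 <= j < n:
            print(f"pipelined iter {i}: stmt{stmt} of original iter {j}")
```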
> @masahi I believe I understand the intended usage of "stage" here, but could we use it like Eric is suggesting? Is there a limitation to having an arbitrary number of async stages?
That's correct. My goal was to map out what can be expressed using the commit/wait semantics defined, whether it results in broader functionality than the pipelining use case, and whether that broader functionality is desired.
> On the other hand, I and this proposal talk about "stage" in the sense used by TIR software pipeline. So for example, if I am given an annotation "software_pipeline_stage", ann_val=[0, 0, 3], I'd say there are "two" stages in the TIR sense, even though the traditional sense would be four stages.
Reading through again, I think there are two different levels of abstraction, and that the use of the term "stage" at both levels may be causing my confusion. At the higher abstraction level with `"software_pipeline_stage"`, each stage is defined by its value in the annotation. At the lower abstraction level with commit/wait, the first argument defines a set of work to be completed, or to be waited on. The first argument at the commit/wait abstraction level is generated from the stage at the `"software_pipeline"` abstraction level, but that doesn't mean it would be the only possible lowering.
> @Lunderberg Unless I've misunderstood, this translation would require fully unrolling all loops, right? Because each time you call commit you need to pass in a new static integer?

> An alternative to unrolling is if non-constant stage numbers are allowed. It might not be pretty or succinct, but you could thread variable(s) through the program to hold stage (fence) IDs.
@JosephTheOctonaut Correct, the non-constant stage numbers are what I had been picturing. For static graphs without unrolling loops, this would be some `loop_iter + offset` used to define the unique ID for each commit/wait.
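A minimal sketch of that idea (plain Python with hypothetical helpers): without unrolling, the integer passed to commit/wait is computed from the loop variable, so the same static program text names a different "fence" on every iteration.

```python
def commit(fence_id):
    print(f"commit async group, fence id = {fence_id}")

def wait(fence_id):
    if fence_id >= 0:                   # nothing to wait for during the prologue
        print(f"wait until fence id {fence_id} has completed")

N, offset = 2, 0                        # N = number of groups allowed to stay in flight
for i in range(6):
    commit(i + offset)                  # token-style: each iteration gets a fresh integer id
    wait(i + offset - N)                # equivalent to the count-based "keep at most N in flight"
```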
Importantly, we are not trying to propose a "general async semantics for TIR". Rather, the goal is to come up with an async design specifically for the TIR software pipeline transform. Hopefully, this allows making assumptions that might not be reasonable in more general settings (control flow, relative order of operations, etc.) and simplifies the implementation by building on what the TIR software pipeline already produces as part of the transform. Hopefully, reading the explanation below should also convince one that the counting-based sync is a natural fit (or "good enough") for the TIR software pipeline.
Since a unique integer "stage" value passed to commit/wait would be equivalent to creating and waiting on a fence, I think the current design does result in general async semantics. I'm not opposed to doing so, but if we want to avoid a generic async framework, we should make a more restrictive data structure for it.
@wrongtest
@JosephTheOctonaut, I'm going to change the definition of For example, if we have three statements in the loop and we want to make the first two statements async, the current annotation is
this would become the following after this change.
With this, the label But the notion of
I want to keep the terminology between this RFC and the TIR software pipeline implementation consistent, so if we want to change the meaning of "stage" in this proposal, I want to evaluate the feasibility of such a change to the implementation first. Generally, I agree that we should use the common terminology. cc @vinx13 @Lunderberg if they have any thoughts on this topic.
On the terminology side, I'm wondering if we want to have separate terminology at the different abstraction levels. At the level of At the commit/wait abstraction level, what if we rename it from "stage" to "queue"? If I'm reading correctly, that is the functionality provided by commit/wait, that sequential calls to commit with the same value of
I agree with @Lunderberg here, but I'd also like to draw a distinction between the current intended usage of the system and how we might expand upon it later. Specifically, we "intend" that stage Putting in Eric's terms of "queues" (a terminology change I support), a standard usage would have one queue for each async stage, because you need to synchronize around the output of each async group. But we can imagine simple alternative usages that do not have this 1-to-1. E.g., in stage 0 we have 5 DMA units performing 5 parallel loads, which are used in 5 parallel pipelines; here, we'd want 5 queues, but they all correspond to stage 0.
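To make the "several queues inside one stage" idea concrete, here is a hedged, purely illustrative sketch (plain Python, not the RFC's syntax): five DMA engines each get their own queue, all committed from what the pipeline would call stage 0, and each consumer pipeline waits only on the queue it depends on.

```python
from collections import deque

queues = {q: deque() for q in range(5)}   # one queue per hypothetical DMA engine

def commit(q, op):
    queues[q].append(op)                   # commit one async group to queue q

def wait(q, in_flight_allowed=0):
    # counting-style wait, scoped to a single queue
    while len(queues[q]) > in_flight_allowed:
        print(f"queue {q}: retired {queues[q].popleft()}")

for i in range(3):
    for q in range(5):                     # "stage 0": five parallel async loads
        commit(q, f"load {q} for iter {i}")
    for q in range(5):                     # five parallel consumer pipelines
        wait(q)                            # each waits only on its own queue
        print(f"pipeline {q}: compute on load {q} for iter {i}")
```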
To double-check: would this kind of proposed usage affect the implementation at all? I thought you said in our chat that the
I'm going to write up a few clarifying notes from the chat yesterday; please correct if any of these are wrong.
Yeah, my thinking has been that there is a 1-1 mapping between a stage and a queue. So I have no problem with this change.
Oh, what I meant was that we might want to change the terminology "stage" throughout our existing TIR software pipeline implementation first, not just the implementation of this proposal, to align with more standard terminology and avoid potential confusion with this proposal.

But reading the discussions more carefully, I'm realizing that your suggestion is not necessarily changing the existing use of "stage" in the TIR software pipeline, but rather decoupling asynchrony (commit, wait) from the "stage" in this proposal. As the title of this proposal literally says, the original intention has been to bring asynchrony to the "stage" in the TIR software pipeline. So the current mechanics of commit / wait are naturally tied to the "stage" in the TIR sense, and the proposal / implementation are highly influenced by how the current TIR software pipeline works / is implemented. @Lunderberg rightfully hinted at this when he said: "That the "queue" is produced from the "stage" feels more like an implementation detail of the lowering, rather than something inherent to the commit/wait functionality."

I'll go through the discussion comments more carefully today and think about how to incorporate the proposed suggestions. Thank you very much for the detailed feedback so far!
Summarizing the current situation of the proposal and discussion points so far:
Updated the doc to talk about commit / wait in terms of "queue". The change is in https://github.com/masahi/tvm-rfcs/blob/async-pipe/rfcs/0077-async-pipeline.md#making-parallelism-more-explicit-asynchronous-pipeline, after the example.
I've incorporated @Lunderberg's suggestion #80 (comment) of making I think I've addressed all feedback from @JosephTheOctonaut and @Lunderberg so far. I'm not sure if the "main thread" issue is resolved by now, and I don't know what to do about it otherwise. So I'll leave it as it is for now.
The RFC states that the proposed mechanism is deliberately more general than what pipelining itself would require. Was that added after the feedback? I think that adding parallelization mechanisms that are specific to a particular scheduling operation is not the right thing to do, but it seems like the async queues could be used independently of pipelining.
Yes, this text is my attempt to incorporate the feedback I've received so far.
Yes, this is in agreement with the current direction.
Hey guys, thank you all for the very fruitful and edifying discussion! If all the blocking issues are addressed, let's explicitly approve this RFC and proceed with upstreaming this feature!
@JosephTheOctonaut brought up a valid point: initially both commit and wait were TIR intrinsics; later I made Any thoughts on this question before merging?
To elaborate: my instinct is that they should both be annotations or both be intrinsics. Based on some preliminary sketching of lowering to other targets, I think the annotation route might be easier, because it lets you lower the entire asynchronous block at once. This can reduce the duplication of information that might be needed for lowering each intrinsic separately. Further, if the back-end requires additional transformations of the block, or has a synchronization system that doesn't fit exactly with the
Ok @JosephTheOctonaut, thanks for the suggestion. I've updated the wording and pseudocode examples to make
This is now
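For readers following the thread, a hedged sketch of the shape this takes, in the same pseudocode style as the earlier examples; the names `async_commit_queue` / `async_wait_queue` below are placeholders rather than the final syntax:

```python
# Pseudocode: both primitives expressed as scope annotations instead of
# free-standing intrinsic calls. The scope names are illustrative placeholders.
with async_commit_queue(0):
    with async_scope:
        B[(i + 1) % 2] = A[i] + 1   # everything in this scope is one committed group

with async_wait_queue(0, 1):        # body runs once at most 1 group is still pending
    C[i] = B[i % 2] * 2
```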
@junrushao1994 @vinx13 I think all outstanding issues have been addressed, ready for a final look and merge.
Hmm I hadn't thought of the duplication issue for branches in
@JosephTheOctonaut You are right, in the example I gave, the if/else is inside the epilogue, so unrolling takes care of generating only one of the branches. The second suggestion also works: Right now I'm generating
Yes, this is correct.
As @junrushao1994 mentioned earlier this week, it appears we have reached consensus and there have been no more mentions of blocking concerns. Thus let us approve and merge RFC #80.
Many thanks to @masahi @vinx13 @junrushao1994 @JosephTheOctonaut @Lunderberg @Hzfengsy @wrongtest-intellif @kparzysz-quic and all others involved for the great RFC and updates that came from these fruitful discussions.
I'm looking for feedback, particularly on the synchronization model. Let me know whether your target of interest can or cannot be supported by this approach!
@junrushao1994 @vinx13 @csullivan @tqchen @Hzfengsy @kparzysz-quic @wrongtest