As noted above, I couldn't find the logic that splits prefill sequences in the source code. I looked at a few likely places, LlamaBatch.cc#L1635 and LlamaBatch.cc#L505, but neither seems to contain the chunk-splitting logic for prefill sequences. Where can I find TurboMind's prefill splitting logic?
Answered by MenglingD, Nov 26, 2024
I found some information in a comment at unified_attention_layer.cc#L371:

```cpp
if (pf_batch_size && !isTuning()) {
    const int offset    = dc_batch_size;
    const int sum_k_len = h_cu_k_len[offset + pf_batch_size] - h_cu_k_len[offset];
    // We are executing prefill & decoding kernels concurrently, but only have 1 workspace
    // disable split kv for prefill for now
    auto params = CreateParams(offset, pf_batch_size, 1, pf_stream);
    ...
```

So it seems that, at the moment, the sequences are only split into mini-batches based on whether their cumulative sum_q_len and sum_k_len satisfy the constraint:
`sum_q_len <= max_forward_token_num_ && sum_k_len <= max_context_token_num_`
In other words, sequences are grouped into mini-batches under this constraint, but individual prefill sequences are never chunked. Is my understanding correct?
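To make the understanding above concrete, here is a minimal sketch (not TurboMind's actual code) of what constraint-based mini-batch splitting looks like: sequences are packed greedily while the cumulative query and context token counts stay within the two budgets, and no sequence is ever chunked. The function name and the parameters `max_forward_token_num` / `max_context_token_num` are hypothetical stand-ins for the `max_forward_token_num_` / `max_context_token_num_` members mentioned in the answer.

```python
# Hypothetical illustration of mini-batch splitting under token budgets.
# Each sequence is a (q_len, k_len) pair; whole sequences are assigned to
# mini-batches, mirroring the "no chunk splitting" behavior discussed above.

def split_into_mini_batches(sequences, max_forward_token_num, max_context_token_num):
    """Greedily pack sequences so that, per mini-batch,
    sum_q_len <= max_forward_token_num and sum_k_len <= max_context_token_num."""
    batches = []
    current, sum_q, sum_k = [], 0, 0
    for q_len, k_len in sequences:
        # Start a new mini-batch when adding this sequence would
        # exceed either budget (a sequence is never split itself).
        if current and (sum_q + q_len > max_forward_token_num
                        or sum_k + k_len > max_context_token_num):
            batches.append(current)
            current, sum_q, sum_k = [], 0, 0
        current.append((q_len, k_len))
        sum_q += q_len
        sum_k += k_len
    if current:
        batches.append(current)
    return batches

# Example: budgets of 8 forward tokens and 16 context tokens.
print(split_into_mini_batches([(4, 8), (4, 8), (4, 8)], 8, 16))
# -> [[(4, 8), (4, 8)], [(4, 8)]]
```

Note that under this scheme a single sequence longer than the budget would still occupy a mini-batch on its own rather than being chunked, which matches the behavior being asked about.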