
Where does turbomind split prefill into chunks? #2815

Closed Answered by MenglingD
MenglingD asked this question in Q&A

I found some information in a comment at unified_attention_layer.cc#L371:

    if (pf_batch_size && !isTuning()) {
        const int offset    = dc_batch_size;
        const int sum_k_len = h_cu_k_len[offset + pf_batch_size] - h_cu_k_len[offset];
        // We are executing prefill & decoding kernels concurrently, but only have 1 workspace
        // disable split kv for prefill for now
        auto params = CreateParams(offset, pf_batch_size, 1, pf_stream);
        ...

So, as I understand it, turbomind currently only splits the sequences into mini-batches based on the accumulated `sum_q_len` and `sum_k_len` satisfying the constraints `sum_q_len <= max_forward_token_num_ && sum_k_len <= max_context_token_num_`, and it does not split an individual prefill sequence into chunks. Is my understanding correct?

Replies: 1 comment

Answer selected by MenglingD