[Bugfix] Disable CUDA graph when max_decode_seq_len is close to max_seq_len_to_capture #8145
When speculative decoding (num_lookahead_slots = 7) and CUDA graph are both enabled, we hit the following error: ValueError: could not broadcast input array from shape (513,) into shape (512,)
When speculative decoding starts, slots for the new tokens plus num_lookahead_slots are allocated by default to ensure enough space is reserved. Therefore, when input + output reaches 8186 tokens, 8186 + 7 = 8193 crosses the 8192 boundary, so one additional block has to be allocated. As a result, the block_table length exceeds the range of input_block_tables[i].
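The arithmetic behind the 513-vs-512 mismatch can be sketched as below. This is only an illustration, not the code changed in this PR: the block size of 16 tokens and the capture limit of 8192 are assumptions chosen to match the shapes in the error message, and `blocks_needed` and `fits_in_captured_graph` are hypothetical helper names.

```python
# Illustrative sketch only; the constants and helper names are assumptions.
BLOCK_SIZE = 16                # assumed KV-cache block size (tokens per block)
MAX_SEQ_LEN_TO_CAPTURE = 8192  # assumed CUDA graph capture limit (tokens)
NUM_LOOKAHEAD_SLOTS = 7        # value from the report above


def blocks_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks required to hold num_tokens (ceiling division)."""
    return (num_tokens + block_size - 1) // block_size


def fits_in_captured_graph(
    seq_len: int,
    num_lookahead_slots: int = NUM_LOOKAHEAD_SLOTS,
    max_seq_len_to_capture: int = MAX_SEQ_LEN_TO_CAPTURE,
) -> bool:
    """True if the block table, including lookahead slots, fits the
    block-table buffer sized for max_seq_len_to_capture."""
    max_blocks = blocks_needed(max_seq_len_to_capture)
    return blocks_needed(seq_len + num_lookahead_slots) <= max_blocks


if __name__ == "__main__":
    print(blocks_needed(MAX_SEQ_LEN_TO_CAPTURE))  # capacity of the captured buffer
    print(blocks_needed(8186 + NUM_LOOKAHEAD_SLOTS))  # blocks actually allocated
    print(fits_in_captured_graph(8186))
```

Under these assumptions the captured buffer holds 512 blocks while a sequence of 8186 tokens plus 7 lookahead slots needs 513, so a guard of this kind would fall back to eager mode for that batch instead of copying into the fixed-size buffer.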