[Performance]: Profile & optimize the BlockManagerV2 #4536

Closed
cadedaniel opened this issue May 1, 2024 · 4 comments
Labels
performance Performance-related issues stale

Comments

@cadedaniel
Collaborator

cadedaniel commented May 1, 2024

Proposal to improve performance

We recently rewrote the block management subsystem for better testability. Before it can replace block manager V1, we need to profile it under real load, confirm that its performance is adequate, and fix any issues we find.

We should do this once block manager V2 is feature complete (it is still missing a few items).

Known issue:

@cadedaniel
Collaborator Author

What we want to profile:
For the low-latency use case:

  • Batch size in the 8-16 range
  • Various block sizes (16, 32, 128)
  • Sequence length (long context, 1.5k). Can set num_output_tokens=50.
  • For speculative decoding, also set num_lookahead_tokens > 0. Try num_lookahead_tokens=5 (see: what is lookahead scheduling)
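To make the block-size and lookahead settings above concrete, here is a rough sketch of how many blocks a single sequence would occupy under each setting. It assumes "1.5k" means a 1500-token prompt and that lookahead slots are simply added to the token count before rounding up to whole blocks; `blocks_needed` is a hypothetical helper for illustration, not a function from the block manager.

```python
def blocks_needed(prompt_len: int,
                  num_output_tokens: int,
                  block_size: int,
                  num_lookahead_tokens: int = 0) -> int:
    """Blocks required to hold one sequence (illustrative assumption:
    lookahead slots count toward the total before rounding up)."""
    total_tokens = prompt_len + num_output_tokens + num_lookahead_tokens
    # Ceiling division without floating point.
    return -(-total_tokens // block_size)


# Assumed values taken from the list above: 1500-token prompt, 50 output tokens.
for block_size in (16, 32, 128):
    plain = blocks_needed(1500, 50, block_size)
    with_lookahead = blocks_needed(1500, 50, block_size, num_lookahead_tokens=5)
    print(f"block_size={block_size}: {plain} blocks, "
          f"{with_lookahead} with 5 lookahead tokens")
```

Smaller block sizes mean many more blocks per sequence, so per-block bookkeeping overhead in the block manager is most likely to show up at block size 16.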

For the high-throughput use case:

  • Batch size up to 256
  • Various block sizes (16, 32, 128)
  • Sequence length (long context, 1.5k). Can set num_output_tokens=50.
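The two scenarios above can be written down as an explicit sweep so that every combination gets profiled. This is only a sketch: `ProfileCase` and `build_cases` are hypothetical names, and the sampled batch sizes (8 and 16 for low latency; 64, 128, 256 for high throughput) are assumed points within the ranges given above, as is reading "1.5k" as 1500 tokens.

```python
from dataclasses import dataclass
from itertools import product

BLOCK_SIZES = (16, 32, 128)
PROMPT_LEN = 1500          # assumption: "1.5k" read as 1500 tokens
NUM_OUTPUT_TOKENS = 50


@dataclass(frozen=True)
class ProfileCase:
    scenario: str
    batch_size: int
    block_size: int
    num_lookahead_tokens: int
    prompt_len: int = PROMPT_LEN
    num_output_tokens: int = NUM_OUTPUT_TOKENS


def build_cases() -> list:
    """Enumerate the profiling matrix for both scenarios."""
    cases = []
    # Low latency: small batches, with and without lookahead (spec decode).
    for batch_size, block_size, lookahead in product((8, 16), BLOCK_SIZES, (0, 5)):
        cases.append(ProfileCase("low-latency", batch_size, block_size, lookahead))
    # High throughput: large batches, no lookahead.
    for batch_size, block_size in product((64, 128, 256), BLOCK_SIZES):
        cases.append(ProfileCase("high-throughput", batch_size, block_size, 0))
    return cases
```

Iterating over `build_cases()` and launching one profiling run per case keeps the results comparable across block sizes.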

Other important cases (perhaps tracked as separate tasks):

  • P0 prefix caching
  • P1 beam search
  • P1 swapping
  • P1 sliding window

To profile, use benchmark_latency with torch profiling (or a CPU profiler of your choosing). It exposes these flags:

parser.add_argument(
    '--profile',
    action='store_true',
    help='profile the generation process of a single batch')
parser.add_argument(
    '--profile-result-dir',
    type=str,
    default=None,
    help=('path to save the pytorch profiler output. Can be visualized '
          'with ui.perfetto.dev or Tensorboard.'))
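For the "CPU profiler of your choosing" route, a minimal sketch using the standard library's `cProfile` might look like the following. It reuses the same two flag names but is not the benchmark script itself: `run_single_batch` is a hypothetical placeholder for generating one batch, and the default output directory and file name are assumptions.

```python
import argparse
import cProfile
import os
import pstats


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument('--profile', action='store_true',
                        help='profile the generation process of a single batch')
    parser.add_argument('--profile-result-dir', type=str, default=None,
                        help='directory to save the cProfile output')
    return parser


def run_single_batch() -> int:
    """Hypothetical stand-in for generating a single batch."""
    return sum(i * i for i in range(10_000))


def main(argv=None):
    args = build_parser().parse_args(argv)
    if not args.profile:
        run_single_batch()
        return None

    # Assumed default location when no directory is given.
    result_dir = args.profile_result_dir or "profile_results"
    os.makedirs(result_dir, exist_ok=True)
    out_path = os.path.join(result_dir, "single_batch.prof")

    profiler = cProfile.Profile()
    profiler.enable()
    run_single_batch()
    profiler.disable()
    profiler.dump_stats(out_path)

    # Print the ten entries with the highest cumulative time.
    pstats.Stats(out_path).sort_stats("cumulative").print_stats(10)
    return out_path
```

Calling `main(['--profile', '--profile-result-dir', 'out'])` writes `out/single_batch.prof`, which can be inspected later with `pstats` or any viewer that reads cProfile dumps.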

@cadedaniel
Collaborator Author

@robertgshaw2-neuralmagic can you assign Alex?


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 28, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

@github-actions github-actions bot closed this as not planned Nov 28, 2024