⚡️ mlx-benchmark ⚡️

A comprehensive benchmark of MLX ops.

This repo aims to benchmark Apple's MLX operations and layers on all Apple Silicon chips, along with a few CUDA GPUs.

Contributions: Everyone can contribute to the benchmark! If you own a device that is not yet listed, or if you want to add a missing layer/operation, please read the contribution guidelines.

Current M chips: M1, M1 Pro, M1 Max, M2, M2 Pro, M2 Max, M2 Ultra, M3, M3 Pro, M3 Max.

Current CUDA GPUs: RTX4090, Tesla V100.

Missing devices: M1 Ultra, and other CUDA GPUs.

Note

You can submit your benchmark even for a device that is already listed, provided you use a newer version of MLX. Simply open a PR that overwrites the old benchmark table. Also, most of the existing benchmarks do not include the mx.compile feature, which was recently added to mlx-benchmark.
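For context, mx.compile traces a function and fuses its graph, which typically speeds up repeated calls. Here is a minimal, self-contained example (illustrative only, not taken from this repo's code):

```python
import mlx.core as mx

def gelu(x):
    # Sigmoid approximation of GELU, a typical compilation target.
    return x * mx.sigmoid(1.702 * x)

# mx.compile traces and fuses the graph on the first call.
compiled_gelu = mx.compile(gelu)

x = mx.random.normal((4096, 4096))
mx.eval(compiled_gelu(x))  # MLX is lazy: mx.eval forces execution
```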

Benchmarks 🧪

Benchmarks are generated by measuring the runtime of every MLX operation on GPU and CPU, along with their PyTorch equivalents on the MPS, CPU, and CUDA backends. On the MLX GPU backend, operations compiled with mx.compile are included in the benchmark by default. To exclude compiled functions from the benchmark, set --compile=False.

For each operation, we measure the runtime across multiple experiments, and two benchmarks are derived from these measurements.
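As a rough illustration of what such a measurement involves, here is a minimal timing sketch (the helper names `time_mlx` and `time_torch` are illustrative, not the repo's actual harness). The key subtlety is synchronization: MLX is lazy, so `mx.eval` must force execution, and the PyTorch MPS/CUDA backends run asynchronously, so they must be synchronized before reading the clock.

```python
import time
import mlx.core as mx
import torch

def time_mlx(fn, *args, iters=20):
    mx.eval(fn(*args))  # warmup; also triggers compilation for compiled fns
    start = time.perf_counter()
    for _ in range(iters):
        mx.eval(fn(*args))  # force the lazy graph to actually execute
    return (time.perf_counter() - start) / iters

def time_torch(fn, *args, device="mps", iters=20):
    def sync():
        # MPS and CUDA kernels are launched asynchronously.
        if device == "mps":
            torch.mps.synchronize()
        elif device == "cuda":
            torch.cuda.synchronize()

    fn(*args)  # warmup
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    sync()
    return (time.perf_counter() - start) / iters
```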

Installation 💻

Installation on Mac devices

Running the benchmark locally is straightforward. Create a new conda env with the osx-arm64 architecture and install the dependencies.

CONDA_SUBDIR=osx-arm64 conda create -n mlx_benchmark python=3.10 numpy pytorch torchvision scipy requests -c conda-forge

pip install -r requirements.txt

Installation on other devices

Operating systems other than macOS can only run the torch experiments, on CPU or on a CUDA device. Create a new env without the CONDA_SUBDIR=osx-arm64 prefix and install the torch package that matches your CUDA version. Then install all the requirements from requirements.txt, except mlx.
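For example, on a CUDA 12.1 machine the matching wheel can be installed via the index URL documented on pytorch.org (adjust the cu121 suffix to your CUDA version):

pip install torch --index-url https://download.pytorch.org/whl/cu121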

Finally, open the config.py file and set:

USE_MLX = False

to avoid importing the mlx package, which cannot be installed on non-Mac devices.
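A minimal sketch of how such a flag can gate the import (the exact guard used in this repo may differ):

```python
# Hypothetical guard: skip the MLX import entirely on non-Mac devices.
from config import USE_MLX

if USE_MLX:
    import mlx.core as mx  # only installable on Apple Silicon Macs
```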

Run the benchmark 🧑‍💻

Run on Mac

To run the benchmark on MPS, MLX (GPU and CPU), and CPU:

python run_benchmark.py --include_mps=True --include_mlx_gpu=True --include_mlx_cpu=True --include_cpu=True

Run on other devices

To run the torch benchmark on CUDA and CPU:

python run_benchmark.py --include_mps=False --include_mlx_gpu=False --include_mlx_cpu=False --include_cuda=True --include_cpu=True

Run only compiled functions

If you're only interested in comparing raw operations against their mx.compile-compiled counterparts, you can run:

python run_benchmark.py --include_mps=False --include_cpu=False --include_mlx_cpu=False

Contributing 🚀

If you have a device not yet featured in the benchmark, especially one of the missing devices listed above, your PR is welcome to broaden the scope and accuracy of this project.