This repo benchmarks the operations and layers of Apple's MLX framework across Apple Silicon chips, along with some CUDA GPUs.
Contributions: Everyone can contribute to the benchmark! If you own a device that is missing, or if you want to add a missing layer/operation, please read the contribution guidelines.
Current M chips: M1, M1 Pro, M1 Max, M2, M2 Pro, M2 Max, M2 Ultra, M3, M3 Pro, M3 Max.
Current CUDA GPUs: RTX4090, Tesla V100.
Missing devices: M1 Ultra, and other CUDA GPUs.
> [!NOTE]
> You can submit your benchmark even for a device that is already listed, provided you use a newer version of MLX: simply open a PR that overrides the old benchmark table. Also, most of the existing benchmarks do not include the `mx.compile` feature, which was recently added to mlx-benchmark.
Benchmarks are generated by measuring the runtime of every `mlx` operation on GPU and CPU, along with their PyTorch equivalents on the `mps`, `cpu`, and `cuda` backends. On MLX with GPU, operations compiled with `mx.compile` are included in the benchmark by default. To exclude the compiled functions from the benchmark, set `--compile=False`.
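For intuition, here is a minimal sketch of how a single experiment could be timed on the MLX side. It is illustrative only: `time_op` and the shapes are made up for this example, and `mx.eval` is used because MLX evaluates lazily.

```python
import time
import mlx.core as mx

def time_op(fn, *args, repeats=10):
    # Warm-up run so graph construction/compilation is not measured.
    mx.eval(fn(*args))
    start = time.perf_counter()
    for _ in range(repeats):
        mx.eval(fn(*args))  # mx.eval forces MLX's lazy computation to run
    return (time.perf_counter() - start) / repeats

a = mx.random.normal(shape=(1024, 1024))
b = mx.random.normal(shape=(1024, 1024))

matmul = lambda x, y: x @ y
print("eager   :", time_op(matmul, a, b))
print("compiled:", time_op(mx.compile(matmul), a, b))
```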
For each operation, we measure the runtime of multiple experiments. We propose two benchmarks based on these experiments (see the toy sketch after this list):
- Detailed benchmark: provides the runtime of each individual experiment.
- Average runtime benchmark: reports the mean runtime over all experiments. Easier to navigate, with fewer details.
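To make the relationship between the two concrete, here is a toy sketch of how the average benchmark is derived from the detailed one. The numbers, operation names, and backend keys are invented for illustration; the repo generates its tables from real runs.

```python
# Detailed benchmark: runtime (ms) per experiment, per backend (toy data).
detailed = {
    "matmul (dim=1024)": {"mlx_gpu": 1.2, "mps": 1.5},
    "matmul (dim=4096)": {"mlx_gpu": 9.8, "mps": 12.4},
}

# Average runtime benchmark: mean over all experiments for each backend.
backends = ["mlx_gpu", "mps"]
average = {
    b: sum(run[b] for run in detailed.values()) / len(detailed)
    for b in backends
}
print(average)  # {'mlx_gpu': 5.5, 'mps': 6.95}
```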
Running the benchmark locally is straightforward. Create a new env with the `osx-arm64` architecture and install the dependencies:

```shell
CONDA_SUBDIR=osx-arm64 conda create -n mlx_benchmark python=3.10 numpy pytorch torchvision scipy requests -c conda-forge
conda activate mlx_benchmark  # activate the env before installing into it
pip install -r requirements.txt
```
Operating systems other than macOS can only run the torch experiments, on CPU or with a CUDA device. Create a new env without the `CONDA_SUBDIR=osx-arm64` prefix and install the torch package that matches your CUDA version. Then install all the requirements within `requirements.txt`, except `mlx`.
Finally, open the `config.py` file and set:

```python
USE_MLX = False
```

to avoid importing the `mlx` package, which cannot be installed on non-Mac devices.
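Since `mlx` cannot even be imported off-Mac, the flag presumably gates the import itself. A minimal sketch of that guarded-import pattern (an assumed structure, not necessarily the repo's exact code):

```python
# Sketch of a guarded import keyed on the config flag (assumed pattern).
from config import USE_MLX

if USE_MLX:
    import mlx.core as mx  # only importable on Apple Silicon Macs
```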
To run the benchmark on `mps`, `mlx`, and CPU:

```shell
python run_benchmark.py --include_mps=True --include_mlx_gpu=True --include_mlx_cpu=True --include_cpu=True
```
To run the torch benchmark on CUDA and CPU:

```shell
python run_benchmark.py --include_mps=False --include_mlx_gpu=False --include_mlx_cpu=False --include_cuda=True --include_cpu=True
```
If you're interested in benchmarking only plain operations against their counterparts compiled with `mx.compile`, you can run:

```shell
python run_benchmark.py --include_mps=False --include_cpu=False --include_mlx_cpu=False
```
If you own a device that is not yet featured in the benchmark, especially one of the missing devices listed above, your PR is welcome to broaden the scope and accuracy of this project.