Releases · ml-explore/mlx

Highlights

Speed improvements
- Up to 6x faster CPU indexing benchmarks
- Faster Metal compiled kernels for strided inputs benchmarks
- Faster generation with fused-attention kernel benchmarks
Gradient for grouped convolutions
Due to Python 3.8's end-of-life we no longer test with it on CI

Core

New features
- Gradient for grouped convolutions
- mx.roll
- mx.random.permutation
- mx.real and mx.imag
Performance
- Up to 6x faster CPU indexing benchmarks
- Faster CPU sort benchmarks
- Faster Metal compiled kernels for strided inputs benchmarks
- Faster generation with fused-attention kernel benchmarks
- Bulk eval in safetensors to avoid unnecessary serialization of work
Misc
- Bump to nanobind 2.2
- Move testing to python 3.9 due to 3.8's end-of-life
- Make the GPU device more thread safe
- Fix the submodule stubs for better IDE support
- CI generated docs that will never be stale

NN

Add support for grouped 1D convolutions to the nn API
Add some missing type annotations

Bugfixes

Fix and speedup row-reduce with few rows
Fix normalization primitive segfault with unexpected inputs
Fix complex power on the GPU
Fix freeing deep unevaluated graphs details
Fix race with array::is_available
Consistently handle softmax with all -inf inputs
Fix streams in affine quantize
Fix CPU compile preamble for some linux machines
Stream safety in CPU compilation
Fix CPU compile segfault at program shutdown

Highlights

Speed improvements:
- Up to 2x faster I/O: benchmarks.
- Faster transposed copies, unary, and binary ops
  - CPU benchmarks here.
  - GPU benchmarks here and here.
Transposed convolutions
Improvements to mx.distributed (send/recv/average_gradients)

Core

New features:
- mx.conv_transpose{1,2,3}d
- Allow mx.take to work with integer index
- Add std as method on mx.array
- mx.put_along_axis
- mx.cross_product
- int() and float() work on scalar mx.array
- Add optional headers to mx.fast.metal_kernel
- mx.distributed.send and mx.distributed.recv
- mx.linalg.pinv
Performance
- Up to 2x faster I/O
- Much faster CPU convolutions
- Faster general n-dimensional copies, unary, and binary ops for both CPU and GPU
- Put reduction ops in default stream with async for faster comms
- Overhead reductions in mx.fast.metal_kernel
- Improve donation heuristics to reduce memory use
Misc
- Support Xcode 160

NN

Faster RNN layers
nn.ConvTranspose{1,2,3}d
mlx.nn.average_gradients data parallel helper for distributed training

Bug Fixes

Fix boolean all reduce bug
Fix extension metal library finding
Fix ternary for large arrays
Make eval just wait if all arrays are scheduled
Fix CPU softmax by removing redundant coefficient in neon_fast_exp
Fix JIT reductions
Fix overflow in quantize/dequantize
Fix compile with byte sized constants
Fix copy in the sort primitive
Fix reduce edge case
Fix slice data size
Throw for certain cases of non captured inputs in compile
Fix copying scalars by adding fill_gpu
Fix bug in module attribute set, reset, set
Ensure io/comm streams are active before eval
Fix mx.clip
Override class function in Repr so mx.array is not confused with array.array
Avoid using find_library to make install truly portable
Remove fmt dependencies from MLX install
Fix for partition VJP
Avoid command buffer timeout for IO on large arrays

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Highlights

Core

NN

Bug fixes

Highlights

Core

Bugfixes

Highlights

Core

NN

Bugfixes

Highlights

Core

NN

Bug Fixes

Releases: ml-explore/mlx

v0.21.1

v0.21.0

Highlights

Core

NN

Bug fixes

v0.20.0

Highlights

Core

Bugfixes

v0.19.3

v0.19.2

v0.19.1

v0.19.0

Highlights

Core

NN

Bugfixes

v0.18.1

v0.18.0

Highlights

Core

NN

Bug Fixes

v0.17.3