
v3.1 🍏 Apple Neural Engine Optimizations

@ashvardanian ashvardanian released this 20 Dec 12:31

Apple chips provide several functional units capable of high-throughput matrix multiplication and AI inference. These compute units include the CPU, the GPU, and the Apple Neural Engine (ANE). One might naively expect any typical architecture, such as BERT or ViT, to work equally well on all of these units in any of the common quantization forms, whether switching from f32 single-precision to bf16 or f16 half-precision floats, or to i8 and u8 integers. That is not the case. Of all the backends UForm has been tested on, quantizing the entire model for Core ML was the most challenging task, and Apple remained the only platform where we distributed the models in their original precision. That is a pity, given a fleet of 2 billion potential target devices running iOS worldwide, almost all of them in countries and language groups natively supported by UForm's multilingual multimodal embeddings.

When using @unum-cloud UForm models in Swift, we pass `computeUnits: .all` to let Apple's scheduler choose the target device itself, treating it as a black-box optimization. A better approach is to explicitly provide models tuned for the Apple Neural Engine. So, together with our friends at @TheStageAI, we've quantized our models to map perfectly onto ANE-supported operations with minimal loss in precision, reducing the model size by 2-4x and accelerating inference by up to 5x:
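In Core ML terms, steering a model toward a particular compute unit comes down to the `MLModelConfiguration.computeUnits` setting. A minimal sketch of the two scheduling strategies described above; the model path here is hypothetical, standing in for any compiled UForm encoder bundle:

```swift
import CoreML

// Hypothetical path to a compiled Core ML model bundle.
let modelURL = URL(fileURLWithPath: "uform-english-small-text.mlmodelc")

// Default strategy: let Apple's scheduler pick among CPU, GPU, and ANE.
let anyUnit = MLModelConfiguration()
anyUnit.computeUnits = .all

// For ANE-friendly quantized encoders, bias scheduling away from the GPU.
let aneFirst = MLModelConfiguration()
aneFirst.computeUnits = .cpuAndNeuralEngine

do {
    let model = try MLModel(contentsOf: modelURL, configuration: aneFirst)
    print(model.modelDescription)
} catch {
    print("Failed to load model: \(error)")
}
```

Note that `computeUnits` is a hint, not a guarantee: operations the ANE does not support will still fall back to the CPU, which is why the quantized encoders are constrained to ANE-supported operations in the first place.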

| Model | Text Encoder (GPU) | Text Encoder (ANE) | Image Encoder (GPU) | Image Encoder (ANE) |
| :--- | ---: | ---: | ---: | ---: |
| `english-small` | 2.53 ms | 0.53 ms | 6.57 ms | 1.23 ms |
| `english-base` | 2.54 ms | 0.61 ms | 18.90 ms | 3.79 ms |
| `english-large` | 2.30 ms | 0.61 ms | 79.68 ms | 20.94 ms |
| `multilingual-base` | 2.34 ms | 0.50 ms | 18.98 ms | 3.77 ms |

Measured on an Apple M4 iPad running iOS 18.2, with a batch size of 1 and the model pre-loaded into memory; the median latency is reported. The original encoders use f32 single-precision numbers for maximum compatibility and mostly rely on the GPU for computation. The quantized encoders use a mixture of i8, f16, and f32 numbers for maximum performance and mostly rely on the Apple Neural Engine (ANE).


To use them in Swift, check out the docs at unum-cloud.github.io/uform/swift/ or the SwiftSemanticSearch repository for an integrated example with USearch.

Thanks to @ArnoldMSU, @b1n0, @Aydarkhan, and @AndreyAgeev from TheStage.ai for their help 👍