Apple chips provide several functional units capable of high-throughput matrix multiplication and AI inference. Those compute units include the CPU, GPU, and the Apple Neural Engine (ANE). A user may naively hope that any typical architecture, like BERT or ViT, would work fine on all of those units in any of the common quantization forms: switching from `f32` single-precision to `bf16` and `f16` half-precision floats, or to `i8` and `u8` integers. That is not the case. Of all the backends UForm has been tested on, quantizing the entire model for CoreML was the most challenging, and Apple remained the only platform where we distributed the models in the original precision. That is a pity, given a fleet of 2 billion potential target devices running iOS worldwide, almost all of them in countries and language groups natively supported by UForm's multilingual multimodal embeddings.
When using @unum-cloud UForm models in Swift, we pass `computeUnits: .all` to let Apple's scheduler choose the target device itself, treating it as a black-box optimization. A better approach is to explicitly provide models tuned for the Apple Neural Engine. So, together with our friends from @TheStageAI, we've quantized our models to map perfectly onto ANE-supported operations with minimal loss in precision, reducing the model size by 2-4x and accelerating inference by up to 5x:
| Model | Text Encoder, GPU | Text Encoder, ANE | Image Encoder, GPU | Image Encoder, ANE |
|---|---|---|---|---|
| english-small | 2.53 ms | 0.53 ms | 6.57 ms | 1.23 ms |
| english-base | 2.54 ms | 0.61 ms | 18.90 ms | 3.79 ms |
| english-large | 2.30 ms | 0.61 ms | 79.68 ms | 20.94 ms |
| multilingual-base | 2.34 ms | 0.50 ms | 18.98 ms | 3.77 ms |
Measured on an Apple M4 iPad running iOS 18.2, with a batch size of 1 and the model pre-loaded into memory; the median latency is reported. The original encoders use `f32` single-precision numbers for maximum compatibility and mostly rely on the GPU for computation. The quantized encoders use a mixture of `i8`, `f16`, and `f32` numbers for maximum performance and mostly rely on the Apple Neural Engine (ANE).
To use them in Swift, check out the docs at unum-cloud.github.io/uform/swift/ or the SwiftSemanticSearch repository for an integrated example with USearch.
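As a minimal sketch, here is how the compute-unit preference can be set when loading a Core ML model directly. `MLModelConfiguration` and `MLComputeUnits` are real Core ML APIs; the model path is illustrative, and `.cpuAndNeuralEngine` is one option for steering work toward the ANE rather than letting `.all` pick freely:

```swift
import CoreML
import Foundation

// Prefer the Apple Neural Engine, with CPU fallback for unsupported ops.
// `.all` instead lets Apple's scheduler choose between CPU, GPU, and ANE.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

// Illustrative path: substitute the compiled UForm encoder in your bundle.
let url = URL(fileURLWithPath: "uform-text-encoder.mlmodelc")
do {
    let model = try MLModel(contentsOf: url, configuration: config)
    print("Loaded model with compute units:", config.computeUnits.rawValue)
} catch {
    print("Failed to load model:", error)
}
```

Note that the ANE is only used for layers it supports; ops outside that set silently fall back to the configured alternatives, which is why the quantized encoders above were mapped specifically to ANE-friendly operations.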
Thanks to @ArnoldMSU, @b1n0, @Aydarkhan, and @AndreyAgeev from TheStage.ai for their help 🙏