Apple Foundation Models (AFM) Metrics: Technical Deep-Dive
1. Overview & Research Landscape
Apple Foundation Models (AFM) consist of two primary tiers: AFM-on-device (~3B parameters) and AFM-server. These models power Apple Intelligence across the ecosystem. Apple introduced the models in June 2024, followed with a first technical report (arXiv:2407.21075), and published a significantly updated report in mid-2025 (arXiv:2507.13575).
- AFM-on-device: A ~3B parameter dense transformer model optimized for local execution on the Apple Neural Engine (ANE).
- AFM-server: A larger model designed for Private Cloud Compute (PCC), rivaling "frontier" models like GPT-4 in specific task categories.
2. Core Technical Metrics
The following metrics reflect the performance of the 16-bit base models before compression, as reported in the 2025 Apple Intelligence Foundation Language Models Tech Report.
| Benchmark | AFM-on-device (~3B) | AFM-server | Competitor Baseline (Llama-3-8B) |
|---|---|---|---|
| MMLU (5-shot) | 67.8 | 80.0 | 66.2 |
| GSM8K (8-shot CoT) | 70.4 | 72.4 | 79.6 |
| HumanEval (pass@1) | 16.48 | 30.84 | 33.5 |
| IFEval (Instruction-level) | 85.1 | 89.1 | 78.4 |
Analysis of Scores
- Instruction Following (IFEval): Both AFM models outscore Llama-3-8B (85.1 and 89.1 vs. 78.4). For AFM-on-device this comes with fewer than half the parameters, reflecting Apple's focus on task-oriented fine-tuning.
- Mathematical Reasoning (GSM8K): AFM-on-device punches above its weight class, outperforming Gemma-7B (46.4) and Mistral-7B (52.2), although it trails Llama-3-8B (70.4 vs. 79.6).
- Coding (HumanEval): AFM's general language models show modest coding scores; however, Apple uses specialized AFM derivatives for Xcode-specific features.
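The margins behind these bullets can be recomputed directly from the table. The sketch below copies the table values verbatim; the dictionary keys and the function name are illustrative choices, not anything defined in the reports.

```python
# Scores copied from the Core Technical Metrics table above.
SCORES = {
    "MMLU (5-shot)": {"afm_on_device": 67.8, "afm_server": 80.0, "llama_3_8b": 66.2},
    "GSM8K (8-shot CoT)": {"afm_on_device": 70.4, "afm_server": 72.4, "llama_3_8b": 79.6},
    "HumanEval (pass@1)": {"afm_on_device": 16.48, "afm_server": 30.84, "llama_3_8b": 33.5},
    "IFEval (Instruction-level)": {"afm_on_device": 85.1, "afm_server": 89.1, "llama_3_8b": 78.4},
}


def margin_vs_baseline(model: str, baseline: str = "llama_3_8b") -> dict[str, float]:
    """Return (model score - baseline score) per benchmark, in points."""
    return {bench: round(row[model] - row[baseline], 2) for bench, row in SCORES.items()}


if __name__ == "__main__":
    for bench, delta in margin_vs_baseline("afm_on_device").items():
        print(f"{bench:28s} {delta:+.2f}")
```

For AFM-on-device this gives +1.6 on MMLU, -9.2 on GSM8K, -17.02 on HumanEval, and +6.7 on IFEval relative to Llama-3-8B.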
3. Quantization & Efficiency: Low-Bit Palettization
Apple utilizes a proprietary Low-Bit Palettization technique to fit the 3B model into memory-constrained devices (e.g., iPhone 15 Pro with 8GB RAM).
- Mechanism: K-means clustering is used to group weights into a Lookup Table (LUT) of centroids.
- Mixed-Bit Strategy: Apple employs a variable bitrate approach, averaging 3.5 to 3.7 bits-per-weight (bpw). This involves a mix of 2-bit and 4-bit layers.
- Accuracy Recovery: To offset the perplexity spike from 2-bit quantization, Apple uses:
  - Quantization-Aware Training (QAT): Training the model with simulated rounding errors.
  - LoRA Adapters: 16-bit high-precision adapters (tens of MB) are used to "patch" the quantized base model at runtime.
- Impact: A 2-bit optimized AFM-on-device maintains an MMLU of 64.4 and an IFEval of 82.3, a drop of about 3 points on each from the 16-bit base (67.8 and 85.1).
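The mechanism described above can be illustrated with a minimal sketch: 1-D k-means (Lloyd's algorithm) builds a lookup table of `2**bits` centroids, and each weight is stored as an index into it. This is a generic illustration in pure Python, not Apple's implementation; the function names, the random initialization, and the synthetic Gaussian weights are all assumptions. The `average_bits` helper shows the mixed-bit arithmetic only and ignores the storage cost of the lookup tables themselves.

```python
import random


def palettize(weights, bits, iters=25, seed=0):
    """Cluster weights into 2**bits centroids with 1-D k-means.

    Returns (lut, indices): lut holds the centroids, and indices[i] is the
    LUT slot assigned to weights[i].
    """
    k = 2 ** bits
    rng = random.Random(seed)
    lut = rng.sample(weights, k)  # initialise centroids from the data
    indices = [0] * len(weights)
    for _ in range(iters):
        # Assignment step: pick the nearest centroid for every weight.
        indices = [min(range(k), key=lambda c: abs(w - lut[c])) for w in weights]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [w for w, i in zip(weights, indices) if i == c]
            if members:  # an empty cluster keeps its previous centroid
                lut[c] = sum(members) / len(members)
    return lut, indices


def dequantize(lut, indices):
    """Rebuild approximate weights by looking each index up in the LUT."""
    return [lut[i] for i in indices]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def average_bits(frac_2bit):
    """Average bits-per-weight if frac_2bit of weights use 2 bits, the rest 4."""
    return 2 * frac_2bit + 4 * (1 - frac_2bit)


if __name__ == "__main__":
    rng = random.Random(1)
    weights = [rng.gauss(0.0, 0.02) for _ in range(512)]  # synthetic stand-in
    for bits in (2, 4):
        lut, idx = palettize(weights, bits)
        print(bits, "bits -> reconstruction MSE", mse(weights, dequantize(lut, idx)))
    print("avg bpw with 25% 2-bit weights:", average_bits(0.25))
```

Under this simplified model, an average of 3.5 bpw corresponds to a quarter of the weights at 2 bits, and 3.7 bpw to about 15%. Those splits are just what the arithmetic implies, not figures taken from the reports.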
4. Hardware Benchmarks (M-Series & A-Series)
Performance is heavily driven by Memory Bandwidth and the Apple Neural Engine (ANE).
Throughput (Tokens Per Second - TPS)
| Hardware | AFM-on-device (TPS) | Memory Bandwidth |
|---|---|---|
| iPhone 15 Pro (A17 Pro) | ~30 TPS | 51.2 GB/s |
| M1 Max | ~33 TPS | 400 GB/s |
| M4 Max | ~58.7 TPS | 546 GB/s |
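To see how far memory bandwidth alone explains these figures, a rough upper bound can be computed: if generating each token requires streaming every weight from memory once, throughput cannot exceed bandwidth divided by model size. The sketch below assumes exactly 3e9 parameters at 3.7 bits-per-weight and ignores KV-cache and activation traffic; these are simplifying assumptions, and the function names are illustrative.

```python
def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Size of the weights in bytes."""
    return params * bits_per_weight / 8


def bandwidth_bound_tps(bandwidth_gb_s: float,
                        params: float = 3e9,
                        bits_per_weight: float = 3.7) -> float:
    """Upper bound on decode tokens/sec if each token streams all weights once."""
    return bandwidth_gb_s * 1e9 / weight_bytes(params, bits_per_weight)


# (measured TPS, memory bandwidth in GB/s) copied from the table above.
HARDWARE = {
    "iPhone 15 Pro (A17 Pro)": (30.0, 51.2),
    "M1 Max": (33.0, 400.0),
    "M4 Max": (58.7, 546.0),
}

if __name__ == "__main__":
    for name, (measured, bandwidth) in HARDWARE.items():
        bound = bandwidth_bound_tps(bandwidth)
        print(f"{name:24s} bound={bound:6.1f} TPS  measured/bound={measured / bound:.0%}")
```

With these assumptions the iPhone 15 Pro bound is roughly 37 TPS, so the measured ~30 TPS is close to bandwidth-limited. The M-series rows sit far below their bounds, which suggests that something other than raw bandwidth limits them in those measurements.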
Latency (Time To First Token - TTFT)
- Prompt Latency: ~0.6 ms per prompt token (iPhone 15 Pro).
- M-Series TTFT: Generally sub-second for warm starts; cold starts (loading from SSD into Unified Memory) can take 2 to 7 seconds depending on chip generation.
- Optimization: KV-cache sharing reduces KV-cache memory usage by 37.5%, improving throughput and reducing TTFT for longer contexts.
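The latency figures above lend themselves to a quick back-of-the-envelope check. The sketch below assumes prefill time scales linearly with prompt length, and it models the 37.5% saving (which equals 3/8) as a fraction of layers that store no KV cache of their own. That is an assumption about the mechanism, and the layer count, head count, and head dimension used in the demo are hypothetical placeholders rather than AFM's actual architecture.

```python
def ttft_ms(prompt_tokens: int, ms_per_prompt_token: float = 0.6) -> float:
    """Warm-start time to first token, assuming cost is linear in prompt length."""
    return prompt_tokens * ms_per_prompt_token


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   bytes_per_value: int = 2, shared_fraction: float = 0.0) -> float:
    """KV-cache size when shared_fraction of layers reuse another layer's cache."""
    layers_storing = layers * (1 - shared_fraction)
    # The leading 2 accounts for storing both keys and values.
    return 2 * layers_storing * kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    print("TTFT for a 1000-token prompt (ms):", ttft_ms(1000))

    # Hypothetical dimensions, chosen only to make the arithmetic concrete.
    dims = dict(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
    full = kv_cache_bytes(**dims)
    shared = kv_cache_bytes(**dims, shared_fraction=0.375)
    print("KV cache without sharing (MiB):", full / 2**20)
    print("KV cache with sharing (MiB):   ", shared / 2**20)
    print("reduction:", 1 - shared / full)
```

At 0.6 ms per prompt token, a 1000-token prompt costs about 600 ms, which is consistent with the sub-second warm-start claim. The KV-cache reduction is independent of the placeholder dimensions, since every term other than the layer count cancels in the ratio.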
5. Contradiction Detection: Human Preference vs. Raw Logic
There is a documented "Helpfulness Gap" in Apple's modeling strategy.
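- The Contradiction: AFM-on-device (~3B) has far fewer parameters than Mistral-7B or Llama-3-8B and trails Llama-3-8B on GSM8K and HumanEval, yet it consistently matches or exceeds them in Human Preference ratings. (On MMLU, the Section 2 table puts it slightly ahead of Llama-3-8B, 67.8 vs. 66.2, so the gap is in math and coding rather than general knowledge.)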
- Reasoning: Apple prioritizes Instruction Following (IFEval) and Safety through heavy Reinforcement Learning from Human Feedback (RLHF).
- Observation: AFM over-performs in "Human-likeness" and task completion (summarization, email drafting) compared to its performance on "Raw Logic" benchmarks like HumanEval. This suggests the model is highly specialized for "Digital Assistant" personas rather than general-purpose reasoning.
Gardener's Summary
For the developer focusing on ML Development, AFM's primary value lies in its latency-to-utility ratio. While it is not a "frontier" reasoning model like GPT-4o or Claude 3.5 Sonnet, its ability to run at ~59 TPS on an M4 Max with high instruction-following accuracy makes it the ideal engine for Local Agents and background system tasks.
Sources
- Apple Intelligence Foundation Language Models Tech Report 2025 (arXiv:2507.13575)
- Apple Intelligence Foundation Language Models (arXiv:2407.21075)
- Apple Machine Learning Research: Introducing Apple's On-Device and Server Foundation Models (June 2024)
- Apple Developer Documentation: FoundationModels
- Internal stress tests (M1 Max vs M4 Max, 2026) via [[apfel_deep_dive_raw]]