Apple Foundation Models 2: the new on-device model understands voice, text, and images

Apple has unveiled a second version of its Apple Foundation Models, this time multimodal: the model can understand speech, read text, and interpret images, all on the user's hardware without cloud involvement.

A second model, this time sensory

During the WWDC 2026 keynote, Apple announced the second generation of its Apple Foundation Models (AFM). As CNBC reports, the new model goes beyond text processing: it is designed to understand speech, process written text, and interpret images in combination. This is a step toward genuinely multimodal AI running entirely on the user's device.

Difference from AFM Cloud Pro

The new on-device AFM 2 is distinct from AFM Cloud Pro, the model announced alongside it, which runs on Nvidia GPUs in Google's cloud and matches Gemini Frontier in quality. AFM 2 is designed for local processing, where low latency and user privacy take precedence over raw computational power. The separation between a local tier and a cloud tier reflects the hybrid architecture Apple is building for Apple Intelligence.

Implications for developers

The multimodality of AFM 2 opens new possibilities for the public APIs announced for third-party developers at WWDC. An on-device model that can reason about voice, text, and images at the same time reduces cloud dependency for app categories that previously needed an external back-end for any visual or voice analysis.
