The first half of 2024 has been defined by one trend above all others: AI is going multimodal. Models are no longer just processing text — they are understanding images, generating audio, and combining modalities in ways that felt like science fiction a year ago.
The Claude 3 Family
In March 2024, Anthropic released Claude 3 in three sizes: Haiku, Sonnet, and Opus. The flagship Opus model set new state-of-the-art scores across standard benchmarks, but the real story was the vision capability.
For the first time, Claude could analyze images — charts, screenshots, documents, photos — and reason about them with the same depth it brought to text. Developers could now build applications that understand visual content without stitching together separate OCR, image classification, and text generation systems.
Claude 3 Sonnet quickly became the sweet spot for most applications: fast enough for production, capable enough for complex reasoning, and significantly cheaper than Opus.
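To make the "no stitched-together pipeline" point concrete, here is a minimal sketch of what a mixed image-and-text request looks like in the Anthropic Messages API content-block format. The model name and the helper function are illustrative, not part of any official SDK; only the payload shape follows the documented API.

```python
import base64

def build_vision_message(image_bytes: bytes, question: str,
                         media_type: str = "image/png") -> dict:
    """Assemble one user message mixing an image block and a text block,
    following the Anthropic Messages API content-block format."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    # Images are sent inline as base64-encoded bytes.
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": question},
        ],
    }

# The message would then be sent with the official SDK, e.g.:
# client = anthropic.Anthropic()
# response = client.messages.create(
#     model="claude-3-sonnet-20240229",
#     max_tokens=1024,
#     messages=[build_vision_message(png_bytes, "What does this chart show?")],
# )
```

The same message can carry several image blocks, so a multi-page document fits in a single request.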
GPT-4o: Real-Time Multimodal
In May 2024, OpenAI released GPT-4o (the "o" stands for "omni"). This was not just an upgrade — it was a reimagining of how AI interacts with humans:
- Native audio understanding — no more speech-to-text-to-LLM-to-text-to-speech pipeline
- Real-time conversation — 320ms average response time for audio
- Vision + audio + text in a single model
- GPT-4 level intelligence at GPT-3.5 Turbo speeds and costs
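For comparison with the Anthropic format above, here is a sketch of how text and an image share one user turn in the OpenAI Chat Completions content-part format. The helper and default model string are illustrative; only the request shape follows the documented API.

```python
def build_omni_request(image_url: str, prompt: str, model: str = "gpt-4o") -> dict:
    """Build a Chat Completions request body that sends text and an image
    in a single user turn, using the OpenAI content-part format."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    # Images are referenced by URL (or a base64 data: URL).
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

# With the official SDK this body maps directly onto:
# client = openai.OpenAI()
# response = client.chat.completions.create(
#     **build_omni_request(url, "What is in this image?"))
```

Note how close the two vendors' shapes already are: both express a turn as a list of typed content parts.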
The demo of GPT-4o having a real-time voice conversation while analyzing a live camera feed was a defining moment. It showed a path toward AI that perceives the world more like humans do.
Google Gemini 1.5 Pro
Not to be overlooked, Google's Gemini 1.5 Pro brought its own breakthrough: a 1 million token context window. You could feed it an entire codebase, a full book, or hours of video and ask questions about any part of it.
The combination of massive context and native multimodality made Gemini particularly strong for tasks involving large documents, long videos, or complex codebases.
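A million-token window changes the prompting pattern: instead of building a retrieval pipeline, you can often just concatenate the whole corpus into one prompt. A minimal sketch, with the packing function and file filters as assumptions of this example:

```python
from pathlib import Path

def pack_codebase(root: str, suffixes: tuple = (".py", ".md"),
                  question: str = "") -> str:
    """Concatenate every matching source file under `root` into one prompt,
    each preceded by a path header, relying on a large context window
    instead of a retrieval step."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            parts.append(f"=== {path} ===\n"
                         f"{path.read_text(encoding='utf-8', errors='replace')}")
    parts.append(question)  # the user's question goes last
    return "\n\n".join(parts)

# The packed prompt could then go to Gemini via the google-generativeai SDK, e.g.:
# model = google.generativeai.GenerativeModel("gemini-1.5-pro")
# answer = model.generate_content(
#     pack_codebase("my_repo/", question="Where is auth handled?"))
```

The path headers matter: they let the model cite which file an answer came from.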
What Multimodal Means for Developers
The shift to multimodal AI is not just about new features — it fundamentally changes what you can build:
Document processing — upload a PDF, invoice, or handwritten note and extract structured data. No OCR pipeline needed.
Visual Q&A — point the model at a screenshot of your app and ask "what's wrong with this UI?" or "write the CSS to match this design."
Accessibility — describe images for visually impaired users, transcribe audio for deaf users, all through a single API call.
Content understanding — analyze marketing creatives, social media posts with images, or product photos at scale.
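For the document-processing case, the practical pattern is to prompt the model to return JSON and then validate its reply before trusting it, since multimodal models default to prose. A minimal sketch; the field names are an illustrative schema, not a standard:

```python
import json

REQUIRED_FIELDS = ("vendor", "invoice_number", "total")  # illustrative schema

def parse_invoice_extraction(model_text: str) -> dict:
    """Parse the model's reply as JSON and check that the fields the prompt
    asked for are actually present, raising if any were omitted."""
    data = json.loads(model_text)  # raises if the reply is not valid JSON
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return data
```

Pairing a strict prompt ("reply with only a JSON object containing vendor, invoice_number, total") with a validator like this turns a vision model into a dependable extraction step.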
The Benchmark Landscape
With three major players releasing competitive multimodal models, benchmarks have become more important than ever:
- MMLU — still the standard for general knowledge (Claude 3 Opus: 86.8%, GPT-4o: 88.7%)
- GPQA — graduate-level science questions, testing deep reasoning
- HumanEval — code generation benchmark
- Vision benchmarks — MathVista, ChartQA, DocVQA for testing visual understanding
The gap between the top models is narrowing. Competition is fierce, and the biggest winner is the developer community — prices are falling, capabilities are rising, and the API surfaces are converging on similar patterns.
2024 is the year AI learned to see. The second half will be about what we build with this new capability.