How LLMs See Images
Vision-enabled LLMs convert images into sequences of tokens that can be processed alongside text. This typically involves dividing images into patches and encoding them with a vision transformer.
The Vision Transformer (ViT)
The Vision Transformer architecture adapts the transformer model for image processing. Instead of processing words, it processes image patches.
Divide into Patches
The image is split into a grid of fixed-size patches (typically 14x14 or 16x16 pixels).
Flatten & Project
Each patch is flattened into a vector and linearly projected into an embedding space.
Add Position Info
Positional embeddings are added so the model knows where each patch came from.
Process with Transformer
The sequence of patch embeddings is processed by standard transformer layers.
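The first three steps can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the function names `patchify` and `embed_patches` are hypothetical, the 224x224 input and 768-dimensional embedding are assumed values, and random matrices stand in for the projection weights and positional embeddings that a real model learns during training. The transformer layers of the final step are omitted.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened patches, in row-major grid order.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)


def embed_patches(image: np.ndarray, patch: int, dim: int,
                  rng: np.random.Generator) -> np.ndarray:
    """Divide into patches, flatten and project them, then add position info."""
    patches = patchify(image, patch)
    num_patches, flat_size = patches.shape
    # Random stand-ins for parameters that a trained model would have learned.
    projection = rng.normal(0.0, 0.02, size=(flat_size, dim))
    position = rng.normal(0.0, 0.02, size=(num_patches, dim))
    return patches @ projection + position


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # assumed input size, RGB
tokens = embed_patches(image, patch=16, dim=768, rng=rng)
print(tokens.shape)                      # (196, 768): a 14x14 grid of patch embeddings
```

The resulting array has one row per patch, which is the sequence the transformer layers would then process.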
Patch Encoding
Each fixed-size patch (e.g., 14x14 pixels) is converted into an embedding vector, which the model then processes much like the embedding of a text token.
Example patch grid: 8x8 = 64 patches of 28x28 pixels each. Each patch becomes one token, so the flattened image is a sequence of 64 tokens.
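The arithmetic behind a patch grid is simple enough to check by hand. The sketch below assumes one token per patch and that the image is padded up to a whole number of patches. The helper name `patch_token_count` is hypothetical, and real models may resize, tile, or pool patches differently.

```python
import math


def patch_token_count(width: int, height: int, patch: int) -> int:
    """Count patches (one token each) for an image padded to whole patches."""
    return math.ceil(width / patch) * math.ceil(height / patch)


print(patch_token_count(224, 224, 28))    # 64   (8x8 grid)
print(patch_token_count(224, 224, 14))    # 256  (16x16 grid)
print(patch_token_count(1024, 1024, 14))  # 5476 (74x74 grid)
```

Halving the patch size quadruples the token count for the same image, which is why patch size and resolution both matter for cost.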
Token Costs
Images are expensive in tokens: each patch becomes a token, so the token count grows with image resolution. Understanding this helps you control the cost and context usage of your applications.
Tip: Always consider whether a text description might be more efficient than passing the actual image.
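When the image itself is needed, one common way to limit cost is to downscale it before sending. The sketch below picks a smaller size that fits a patch-token budget while preserving aspect ratio. It assumes one token per patch; the function names, the 14-pixel patch size, and the 1024-token budget are illustrative assumptions, and actual providers apply their own resizing and counting rules.

```python
import math


def patch_tokens(width: int, height: int, patch: int) -> int:
    """Patch count (one token each) for an image padded to whole patches."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def downscale_to_budget(width: int, height: int, patch: int,
                        max_tokens: int) -> tuple[int, int]:
    """Return a (width, height) no larger than the input that fits max_tokens."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if patch_tokens(width, height, patch) <= max_tokens:
        return width, height
    # Start from the scale at which the pixel area matches the budget,
    # then shrink in small steps until the rounded-up grid also fits.
    scale = math.sqrt(max_tokens * patch * patch / (width * height))
    while True:
        w = max(1, int(width * scale))
        h = max(1, int(height * scale))
        if patch_tokens(w, h, patch) <= max_tokens:
            return w, h
        scale *= 0.95


original = (2048, 1536)
resized = downscale_to_budget(*original, patch=14, max_tokens=1024)
print(patch_tokens(*original, 14), "tokens at", original)
print(patch_tokens(*resized, 14), "tokens at", resized)
```

Downscaling trades detail for cost, so small text or fine structure in the image may no longer be legible to the model.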
Common Use Cases
Vision-enabled LLMs unlock many practical applications.
Document Analysis
Extract information from PDFs, receipts, forms, and handwritten notes.
Visual Q&A
Answer questions about image contents, charts, and diagrams.
Image Captioning
Generate detailed descriptions of images for accessibility or indexing.
UI Understanding
Analyze screenshots, wireframes, and user interfaces.
Multimodal Understanding
The model learns to align visual and textual representations, enabling tasks like image captioning, visual QA, and document understanding.
Key Takeaways
1. Images consume many more tokens than equivalent text descriptions
2. Resolution and patch size affect detail recognition
3. Visual understanding is approximate; models can miss fine details
4. Combining vision and language enables powerful new applications