Vision & Images

How modern LLMs process and understand visual information alongside text.

How LLMs See Images

Vision-enabled LLMs convert images into sequences of tokens that can be processed alongside text. This typically involves dividing images into patches and encoding them with a vision transformer.

The Vision Transformer (ViT)

The Vision Transformer architecture adapts the transformer model for image processing. Instead of processing words, it processes image patches.

1. Divide into Patches: The image is split into a grid of fixed-size patches (typically 14x14 or 16x16 pixels).

2. Flatten & Project: Each patch is flattened into a vector and linearly projected into an embedding space.

3. Add Position Info: Positional embeddings are added so the model knows where each patch came from.

4. Process with Transformer: The sequence of patch embeddings is processed by standard transformer layers.
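
Here is a minimal sketch of steps 1–3 in NumPy, with assumed shapes (a 224x224 RGB image, 16x16 patches, 768-dimensional embeddings) and random arrays standing in for the learned projection and positional embeddings:

```python
import numpy as np

def patch_embed(image, patch_size=16, d_model=768):
    """Illustrative ViT-style patch embedding: split, flatten, project, add positions."""
    rng = np.random.default_rng(0)
    h, w, c = image.shape                       # e.g. (224, 224, 3)
    gh, gw = h // patch_size, w // patch_size   # patches per side

    # 1. Divide into patches: (num_patches, patch_size * patch_size * channels)
    patches = (image.reshape(gh, patch_size, gw, patch_size, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(gh * gw, -1))

    # 2. Flatten & project: a learned linear layer in a real model;
    #    random weights here just to show the shapes involved.
    W = rng.normal(size=(patches.shape[1], d_model))
    tokens = patches @ W                        # (num_patches, d_model)

    # 3. Add position info: learned or fixed positional embeddings.
    pos = rng.normal(size=(gh * gw, d_model))
    return tokens + pos

image = np.zeros((224, 224, 3))
tokens = patch_embed(image)
print(tokens.shape)  # (196, 768): a 14x14 grid of patch tokens
```

The resulting (196, 768) array is what the transformer layers in step 4 then process, in the same way they would process a sequence of text-token embeddings.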

Patch Encoding

Images are divided into fixed-size patches (e.g., 14x14 pixels), each converted into an embedding vector similar to text tokens.

Example patch grid: an 8x8 grid of 28x28px patches, where each patch becomes one token, gives 64 patches and therefore 64 image tokens (compared with roughly 16 words of text).

Token Costs

Images are expensive in terms of tokens. Understanding this helps you optimize your applications.

  • A 512x512 image with 16x16 patches: ~1,024 tokens
  • A 1024x1024 high-res image: ~4,096 tokens
  • Equivalent text description: ~50-100 tokens

Tip: Always consider whether a text description might be more efficient than passing the actual image.
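
As a rough rule of thumb suggested by the figures above, the image token count is approximately the number of patches that fit in the image. A small Python sketch of that estimate follows; it is a simplification, since real providers may resize images, add separator tokens, or include a low-resolution thumbnail, so exact counts vary by model:

```python
import math

def estimate_image_tokens(width: int, height: int, patch_size: int = 16) -> int:
    """Rough estimate: one token per patch, ignoring provider-specific overhead."""
    patches_x = math.ceil(width / patch_size)
    patches_y = math.ceil(height / patch_size)
    return patches_x * patches_y

print(estimate_image_tokens(512, 512))    # ~1,024 tokens
print(estimate_image_tokens(1024, 1024))  # ~4,096 tokens
```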

Common Use Cases

Vision-enabled LLMs unlock many practical applications.

Document Analysis

Extract information from PDFs, receipts, forms, and handwritten notes.

Visual Q&A

Answer questions about image contents, charts, and diagrams.

Image Captioning

Generate detailed descriptions of images for accessibility or indexing.

UI Understanding

Analyze screenshots, wireframes, and user interfaces.

Multimodal Understanding

The model learns to align visual and textual representations, enabling tasks like image captioning, visual QA, and document understanding.
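
One widely used way to learn this alignment is contrastive training in the style of CLIP, where matching image-caption pairs are pulled together in a shared embedding space and mismatched pairs pushed apart; this is an illustrative example rather than the exact recipe every vision-enabled LLM follows. A minimal NumPy sketch of the shared-space similarity computation, with random arrays standing in for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings: 3 images and 3 captions, already projected into a
# shared 512-dim space (in a real model these come from the vision encoder
# and the text encoder respectively).
image_emb = rng.normal(size=(3, 512))
text_emb = rng.normal(size=(3, 512))

# Normalize so dot products become cosine similarities.
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Entry [i, j] scores image i against caption j. Contrastive training pushes
# the diagonal (matching pairs) to dominate each row and column.
similarity = image_emb @ text_emb.T
print(similarity.round(2))
```

In a trained model, each row's diagonal entry would dominate, which is what lets a caption retrieve its image (and vice versa) and gives downstream tasks a shared space to reason over.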

Key Takeaways

  • Images consume many more tokens than equivalent text descriptions
  • Resolution and patch size affect detail recognition
  • Visual understanding is approximate: models can miss fine details
  • Combining vision and language enables powerful new applications