Multimodal AI models process text, images, audio, and video together in a single API call. GPT-4o and Gemini 1.5 Pro are the two models worth building on right now. Both can analyze screenshots, extract data from documents, debug UIs from images, and convert designs into code.
This guide walks through building practical multimodal applications with both models, including working code for image analysis, document processing, visual debugging, and design-to-code generation. For the agent framework side, building agentic AI systems with LangChain covers how to wire these model calls into a reasoning loop with tools.
What these models actually do
The capabilities that matter for developers:
- Process an image and text prompt together, so you can ask questions about what's in the image
- Extract structured data (JSON) from photos of invoices, receipts, and forms
- Analyze UI screenshots for layout bugs, accessibility issues, and design inconsistencies
- Compare multiple images (before/after designs, version comparisons)
- Generate code from wireframe or mockup images
Before multimodal models, you'd stitch separate OCR, image classification, and NLP services together. Now it's one API call.
GPT-4o vs. Gemini: when to use which
GPT-4o for code and technical tasks
GPT-4o is stronger at analysing code screenshots, debugging UI issues from images, and generating code from wireframes. The OpenAI API is well documented, and the output is consistent for structured tasks.
Gemini 1.5 Pro for large inputs
The 2 million token context window is the real differentiator. If you need to process a 50-page PDF, analyse a 20-minute video, or batch-process hundreds of images in one request, Gemini handles it where GPT-4o runs into its 128K-token context limit.
Pricing (as of early 2025)
GPT-4o: $5/M input tokens, $15/M output tokens. Gemini 1.5 Pro: $3.50/M input, $10.50/M output. Gemini is cheaper per token, which adds up for high-volume image processing.
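To make the per-token gap concrete, here is a rough sketch of the arithmetic using the rates quoted above. The workload (10,000 images at roughly 800 input and 300 output tokens each) and the names `RATES` and `estimate_cost` are made up for illustration, so swap in your own measurements.

```python
# Per-million-token rates in USD, copied from the figures quoted above.
RATES = {
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gemini-1.5-pro": {"input": 3.50, "output": 10.50},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for the given token counts."""
    rate = RATES[model]
    cost = (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000
    return round(cost, 2)


if __name__ == "__main__":
    # Hypothetical workload: 10,000 images, ~800 input and ~300 output tokens each.
    images = 10_000
    for model in RATES:
        total = estimate_cost(model, images * 800, images * 300)
        print(f"{model}: ${total:.2f}")
```

At these rates the hypothetical workload comes to $85.00 on GPT-4o and $59.50 on Gemini 1.5 Pro.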
API maturity
Both have production-grade APIs with rate limiting, streaming, and batch processing. Both accept base64 images, URLs, and file uploads. Neither is a bottleneck for production use.
Your first multimodal app
Let's build an image analysis application using GPT-4o that can describe images, extract text, and answer questions about visual content.
First, install the required dependencies:
pip install openai python-dotenv pillow requests pydantic google-generativeai
Create a basic image analyzer with GPT-4o:
import os
import base64
from pathlib import Path
from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables
load_dotenv()

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def encode_image(image_path: str) -> str:
    """Encode image to base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def analyze_image(image_path: str, question: str = "Describe this image in detail") -> dict:
    """
    Analyze an image using GPT-4o vision capabilities.

    Args:
        image_path: Path to the image file
        question: Question to ask about the image

    Returns:
        Dictionary with analysis results and metadata
    """
    try:
        # Encode the image
        base64_image = encode_image(image_path)

        # Create the vision request
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": question
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{base64_image}",
                                "detail": "high"  # Options: "low", "high", "auto"
                            }
                        }
                    ]
                }
            ],
            max_tokens=1000,
            temperature=0.2
        )

        # Extract response
        analysis = response.choices[0].message.content

        # Calculate costs (approximate)
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        cost = (input_tokens * 0.005 / 1000) + (output_tokens * 0.015 / 1000)

        return {
            "success": True,
            "analysis": analysis,
            "tokens": {
                "input": input_tokens,
                "output": output_tokens,
                "total": response.usage.total_tokens
            },
            "cost": round(cost, 4),
            "model": response.model
        }
    except Exception as e:
        return {
            "success": False,
            "error": str(e)
        }


# Example usage
if __name__ == "__main__":
    # Analyze a code screenshot
    result = analyze_image(
        "screenshot.png",
        "What does this code do? Are there any bugs or improvements you'd suggest?"
    )
    if result["success"]:
        print("Analysis:")
        print(result["analysis"])
        print(f"\nTokens used: {result['tokens']['total']}")
        print(f"Cost: ${result['cost']}")
    else:
        print(f"Error: {result['error']}")
That's the core pattern for multimodal AI: combine text prompts with image data in a single API call. The detail parameter controls image resolution. Use "high" for screenshots with text or code, "low" for general image understanding to keep costs down.
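One detail worth tightening: the examples in this guide label every image as `image/jpeg` in the data URL, even when the file is a PNG. A minimal sketch of a helper that derives the MIME type from the file extension might look like this. The name `image_to_data_url` and the JPEG fallback are my own choices, not part of the OpenAI SDK.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(image_path: str, default_mime: str = "image/jpeg") -> str:
    """Build a base64 data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = default_mime  # Fall back when the extension is unknown
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be passed directly as the `url` value of an `image_url` part.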
Document processing
One of the most useful applications is extracting structured data from documents like invoices, receipts, forms, and reports:
import os
import json
import base64
from typing import Any, Dict, List, Optional
from dotenv import load_dotenv
from openai import OpenAI
from pydantic import BaseModel, Field

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


# Define structured output schema
class InvoiceData(BaseModel):
    """Structured invoice information."""
    invoice_number: str = Field(description="Invoice number or ID")
    date: str = Field(description="Invoice date")
    vendor_name: str = Field(description="Vendor or company name")
    total_amount: float = Field(description="Total amount due")
    currency: str = Field(description="Currency code (USD, EUR, etc)")
    # typing.Any, not the builtin any() function
    line_items: List[Dict[str, Any]] = Field(description="List of items with description, quantity, price")
    # default=None makes the field genuinely optional
    tax_amount: Optional[float] = Field(default=None, description="Tax amount if specified")


class DocumentProcessor:
    """Process documents using GPT-4o vision capabilities."""

    def __init__(self, model: str = "gpt-4o"):
        self.model = model
        self.client = client

    def encode_image(self, image_path: str) -> str:
        """Encode image to base64."""
        with open(image_path, "rb") as f:
            return base64.b64encode(f.read()).decode("utf-8")

    def extract_invoice_data(self, invoice_path: str) -> Dict:
        """
        Extract structured data from an invoice image.

        Args:
            invoice_path: Path to invoice image (PNG or JPG; convert PDF pages to images first)

        Returns:
            Dictionary with extracted invoice data
        """
        try:
            # Encode the invoice image
            base64_image = self.encode_image(invoice_path)

            # Create extraction prompt
            prompt = """
            Extract all relevant information from this invoice and structure it as JSON.
            Include: invoice number, date, vendor name, total amount, currency, line items
            (with description, quantity, unit price for each), and tax amount if present.
            Return ONLY valid JSON matching this structure:
            {
                "invoice_number": "string",
                "date": "YYYY-MM-DD",
                "vendor_name": "string",
                "total_amount": number,
                "currency": "string",
                "line_items": [
                    {"description": "string", "quantity": number, "unit_price": number, "total": number}
                ],
                "tax_amount": number or null
            }
            """

            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/jpeg;base64,{base64_image}",
                                    "detail": "high"
                                }
                            }
                        ]
                    }
                ],
                max_tokens=2000,
                temperature=0,  # Use 0 for consistent structured extraction
                response_format={"type": "json_object"}  # Ensure JSON output
            )

            # Parse the JSON response
            invoice_data = json.loads(response.choices[0].message.content)

            return {
                "success": True,
                "data": invoice_data,
                "tokens_used": response.usage.total_tokens,
                "cost": self._calculate_cost(response.usage)
            }
        except json.JSONDecodeError as e:
            return {
                "success": False,
                "error": f"Failed to parse JSON response: {str(e)}"
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e)
            }

    def extract_receipt_data(self, receipt_path: str) -> Dict:
        """Extract data from a receipt image."""
        try:
            base64_image = self.encode_image(receipt_path)

            prompt = """
            Extract information from this receipt. Return JSON with:
            - store_name: name of the store
            - date: purchase date (YYYY-MM-DD format)
            - items: array of {name, price}
            - subtotal: subtotal amount
            - tax: tax amount
            - total: total amount
            - payment_method: payment method if visible
            """

            response = self.client.chat.completions.create(
                model=self.model,
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/jpeg;base64,{base64_image}",
                                    "detail": "high"
                                }
                            }
                        ]
                    }
                ],
                max_tokens=1500,
                temperature=0,
                response_format={"type": "json_object"}
            )

            receipt_data = json.loads(response.choices[0].message.content)

            return {
                "success": True,
                "data": receipt_data,
                "tokens_used": response.usage.total_tokens
            }
        except Exception as e:
            return {"success": False, "error": str(e)}

    def _calculate_cost(self, usage) -> float:
        """Calculate approximate API cost."""
        input_cost = (usage.prompt_tokens * 5.0) / 1_000_000
        output_cost = (usage.completion_tokens * 15.0) / 1_000_000
        return round(input_cost + output_cost, 4)


# Example usage
if __name__ == "__main__":
    processor = DocumentProcessor()

    # Extract invoice data
    result = processor.extract_invoice_data("invoice.png")
    if result["success"]:
        print("Extracted Invoice Data:")
        print(json.dumps(result["data"], indent=2))
        print(f"\nCost: ${result['cost']}")
    else:
        print(f"Error: {result['error']}")
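The processor uses the response_format parameter to force syntactically valid JSON output. JSON mode does not guarantee that the keys match your schema, so the prompt still has to spell out the structure. Setting temperature to 0 makes results far more consistent from run to run (though not strictly deterministic), which is what you want for document extraction.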
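Parseable JSON can still carry wrong numbers, so a cheap sanity check on the arithmetic is worth having. The sketch below uses the field names from the invoice prompt above. The function name `check_invoice_totals`, the tolerance, and the assumption that `total_amount` equals line items plus tax are mine, and that last assumption will not hold for every invoice layout.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = (
    "invoice_number", "date", "vendor_name", "total_amount", "currency", "line_items",
)


def check_invoice_totals(invoice: Dict[str, Any], tolerance: float = 0.01) -> List[str]:
    """Return a list of problems found in an extracted invoice dict (empty if none)."""
    problems: List[str] = [
        f"missing field: {key}" for key in REQUIRED_FIELDS if key not in invoice
    ]
    if problems:
        return problems

    # Each line item should satisfy quantity * unit_price == total
    line_sum = 0.0
    for i, item in enumerate(invoice["line_items"]):
        expected = item["quantity"] * item["unit_price"]
        if abs(expected - item["total"]) > tolerance:
            problems.append(f"line item {i}: total {item['total']} != {expected}")
        line_sum += item["total"]

    # Assumption: total_amount is the sum of line items plus tax
    tax = invoice.get("tax_amount") or 0.0
    if abs(line_sum + tax - invoice["total_amount"]) > tolerance:
        problems.append(
            f"total_amount {invoice['total_amount']} != line items + tax ({line_sum + tax})"
        )
    return problems
```

An empty list means the numbers agree with each other. Anything else is a good candidate for a human review queue rather than an automatic rejection.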
Visual debugging tools
The same models are good at analysing UI screenshots for bugs, accessibility issues, and design inconsistencies:
import os
import base64
from typing import List, Dict
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


class UIDebugger:
    """Debug user interfaces using visual AI analysis."""

    def __init__(self):
        self.client = client

    def encode_image(self, image_path: str) -> str:
        """Encode image to base64."""
        with open(image_path, "rb") as f:
            return base64.b64encode(f.read()).decode("utf-8")

    def analyze_ui(self, screenshot_path: str, context: str = "") -> Dict:
        """
        Analyze a UI screenshot for bugs, design issues, and improvements.

        Args:
            screenshot_path: Path to UI screenshot
            context: Optional context about what the UI should do

        Returns:
            Analysis with identified issues and suggestions
        """
        base64_image = self.encode_image(screenshot_path)

        # Build the optional context line outside the prompt to avoid nested f-string quotes
        context_line = f"Context: {context}" if context else ""

        prompt = f"""
        Analyze this UI screenshot for issues and improvements.
        {context_line}
        Provide analysis in these categories:
        1. VISUAL BUGS:
        - Layout issues (misaligned elements, overflow, overlapping)
        - Styling problems (inconsistent fonts, colors, spacing)
        - Broken images or icons
        2. ACCESSIBILITY ISSUES:
        - Color contrast problems
        - Missing alt text indicators
        - Font sizes that may be too small
        - Touch targets that may be too small
        3. UX PROBLEMS:
        - Unclear navigation
        - Confusing button placements
        - Missing feedback indicators
        - Inconsistent patterns
        4. DESIGN IMPROVEMENTS:
        - Visual hierarchy suggestions
        - Spacing and layout recommendations
        - Consistency improvements
        For each issue, specify:
        - Severity: Critical/High/Medium/Low
        - Location: Where in the UI
        - Recommendation: How to fix
        Be specific and actionable.
        """

        try:
            response = self.client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/jpeg;base64,{base64_image}",
                                    "detail": "high"
                                }
                            }
                        ]
                    }
                ],
                max_tokens=2000,
                temperature=0.3
            )
            return {
                "success": True,
                "analysis": response.choices[0].message.content,
                "tokens_used": response.usage.total_tokens
            }
        except Exception as e:
            return {"success": False, "error": str(e)}

    def compare_uis(self, screenshot1_path: str, screenshot2_path: str,
                    comparison_focus: str = "differences") -> Dict:
        """
        Compare two UI screenshots (e.g., before/after, mobile/desktop).

        Args:
            screenshot1_path: Path to first screenshot
            screenshot2_path: Path to second screenshot
            comparison_focus: What to focus on (differences, improvements, etc)
        """
        image1 = self.encode_image(screenshot1_path)
        image2 = self.encode_image(screenshot2_path)

        prompt = f"""
        Compare these two UI screenshots and identify {comparison_focus}.
        For each difference:
        - Describe what changed
        - Assess if the change is an improvement or regression
        - Suggest any additional improvements
        Focus on: layout changes, visual design, functionality, user experience.
        """

        try:
            response = self.client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/jpeg;base64,{image1}",
                                    "detail": "high"
                                }
                            },
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/jpeg;base64,{image2}",
                                    "detail": "high"
                                }
                            }
                        ]
                    }
                ],
                max_tokens=2000,
                temperature=0.3
            )
            return {
                "success": True,
                "comparison": response.choices[0].message.content,
                "tokens_used": response.usage.total_tokens
            }
        except Exception as e:
            return {"success": False, "error": str(e)}

    def generate_accessibility_report(self, screenshot_path: str) -> Dict:
        """Generate comprehensive accessibility analysis."""
        base64_image = self.encode_image(screenshot_path)

        prompt = """
        Generate a comprehensive accessibility audit for this UI screenshot.
        Check for:
        1. Color contrast ratios (WCAG AA/AAA compliance)
        2. Text readability (font sizes, line heights)
        3. Touch target sizes (minimum 44x44px)
        4. Visual hierarchy and focus indicators
        5. Form label associations
        6. Error message visibility
        7. Icon clarity and labeling
        For each issue, provide:
        - WCAG guideline reference
        - Current state
        - Required fix
        - Priority level
        Format as a checklist with pass/fail for each criterion.
        """

        try:
            response = self.client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": prompt},
                            {
                                "type": "image_url",
                                "image_url": {
                                    "url": f"data:image/jpeg;base64,{base64_image}",
                                    "detail": "high"
                                }
                            }
                        ]
                    }
                ],
                max_tokens=2500,
                temperature=0.2
            )
            return {
                "success": True,
                "report": response.choices[0].message.content,
                "tokens_used": response.usage.total_tokens
            }
        except Exception as e:
            return {"success": False, "error": str(e)}


# Example usage
if __name__ == "__main__":
    debugger = UIDebugger()

    # Analyze a UI screenshot
    result = debugger.analyze_ui(
        "dashboard.png",
        context="This is an admin dashboard for monitoring user activity"
    )
    if result["success"]:
        print("UI Analysis:")
        print(result["analysis"])
        print(f"\nTokens used: {result['tokens_used']}")

    # Generate accessibility report
    print("\n" + "="*70)
    print("Accessibility Report:")
    print("="*70)
    a11y_result = debugger.generate_accessibility_report("dashboard.png")
    if a11y_result["success"]:
        print(a11y_result["report"])
Visual debugging with AI catches issues that automated testing tools miss: visual inconsistencies, bad colour choices, confusing layouts. It speeds up design review noticeably.
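The three methods above repeat the same message structure with one or two images. One way to cut the duplication is a small helper that builds the content list for any number of images. This is only a sketch: `build_vision_content` is a hypothetical name, and it expects data URLs (or plain URLs) that you have already prepared.

```python
from typing import Dict, List


def build_vision_content(prompt: str, image_urls: List[str], detail: str = "high") -> List[Dict]:
    """Build the content list for one user message: a text part, then one part per image."""
    if detail not in ("low", "high", "auto"):
        raise ValueError(f"unsupported detail level: {detail}")
    content: List[Dict] = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url, "detail": detail}})
    return content
```

The result slots in as the `content` value of the user message, so a before/after comparison becomes `build_vision_content(prompt, [url_before, url_after])`.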
Gemini for long-context analysis
Gemini is good at processing large volumes of visual data, thanks to the 2 million token context window:
import os
import google.generativeai as genai
from pathlib import Path
from typing import List, Dict
from dotenv import load_dotenv
from PIL import Image

load_dotenv()

# Configure Gemini API
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))


class GeminiMultimodalProcessor:
    """Process multimodal data using Google Gemini."""

    def __init__(self, model_name: str = "gemini-1.5-pro"):
        self.model_name = model_name
        self.model = genai.GenerativeModel(model_name)

    def analyze_single_image(self, image_path: str, prompt: str) -> Dict:
        """Analyze a single image with Gemini."""
        try:
            # Load the image
            image = Image.open(image_path)

            # Generate response
            response = self.model.generate_content([prompt, image])

            return {
                "success": True,
                "analysis": response.text,
                "model": self.model_name  # Report the model actually configured
            }
        except Exception as e:
            return {"success": False, "error": str(e)}

    def analyze_multiple_images(self, image_paths: List[str],
                                analysis_type: str = "comparison") -> Dict:
        """
        Analyze multiple images together.
        Useful for: comparing designs, tracking changes, analyzing sequences.

        Args:
            image_paths: List of paths to images
            analysis_type: Type of analysis (comparison, sequence, summary)
        """
        try:
            # Load all images
            images = [Image.open(path) for path in image_paths]

            prompts = {
                "comparison": f"""
                Compare these {len(images)} images and identify:
                1. Common patterns and themes
                2. Key differences between them
                3. Evolution or progression if present
                4. Inconsistencies in design or style
                5. Recommendations for consistency
                Provide a detailed analysis for each image and an overall summary.
                """,
                "sequence": f"""
                Analyze this sequence of {len(images)} images in order:
                1. Describe what happens in each image
                2. Identify the progression or workflow
                3. Note any missing steps or gaps
                4. Suggest improvements to the sequence
                """,
                "summary": f"""
                Analyze these {len(images)} images and provide:
                1. Overall theme or purpose
                2. Key visual elements present across images
                3. Quality assessment
                4. Suggested improvements
                """
            }
            prompt = prompts.get(analysis_type, prompts["comparison"])

            # Create content list with prompt and all images
            content = [prompt] + images

            # Generate analysis
            response = self.model.generate_content(content)

            return {
                "success": True,
                "analysis": response.text,
                "images_processed": len(images)
            }
        except Exception as e:
            return {"success": False, "error": str(e)}

    def extract_data_from_chart(self, chart_image_path: str) -> Dict:
        """Extract data points and insights from charts and graphs."""
        try:
            image = Image.open(chart_image_path)

            prompt = """
            Analyze this chart or graph and extract:
            1. Chart type (bar, line, pie, scatter, etc.)
            2. Title and axis labels
            3. Data points and values (as accurately as possible)
            4. Trends and patterns observed
            5. Key insights and conclusions
            6. Any anomalies or notable data points
            Format the data points as a table or structured list.
            """

            response = self.model.generate_content([prompt, image])

            return {
                "success": True,
                "analysis": response.text
            }
        except Exception as e:
            return {"success": False, "error": str(e)}

    def design_to_code(self, design_image_path: str,
                       framework: str = "React") -> Dict:
        """
        Generate code from a design mockup.

        Args:
            design_image_path: Path to design mockup/wireframe
            framework: Target framework (React, Vue, HTML/CSS, etc.)
        """
        try:
            image = Image.open(design_image_path)

            prompt = f"""
            Analyze this design mockup and generate {framework} code to implement it.
            Requirements:
            1. Identify all UI components (buttons, inputs, cards, etc.)
            2. Analyze the layout structure and spacing
            3. Note colors, fonts, and styling details
            4. Generate clean, semantic {framework} code
            5. Include responsive design considerations
            6. Add helpful comments explaining the structure
            Provide:
            - Component code
            - Styling (CSS/Tailwind)
            - Layout structure
            - Any necessary state management hints
            Make the code production-ready and follow best practices.
            """

            response = self.model.generate_content([prompt, image])

            return {
                "success": True,
                "code": response.text,
                "framework": framework
            }
        except Exception as e:
            return {"success": False, "error": str(e)}

    def analyze_document_pages(self, document_image_paths: List[str]) -> Dict:
        """
        Analyze multiple pages of a document.
        Useful for contracts, reports, forms, etc.
        """
        try:
            images = [Image.open(path) for path in document_image_paths]

            prompt = f"""
            Analyze this {len(images)}-page document and provide:
            1. Document type and purpose
            2. Summary of each page's content
            3. Key information extracted (dates, amounts, names, etc.)
            4. Overall document summary
            5. Any notable clauses, terms, or requirements
            6. Action items or important deadlines
            Structure the response clearly by page and overall summary.
            """

            content = [prompt] + images
            response = self.model.generate_content(content)

            return {
                "success": True,
                "analysis": response.text,
                "pages_processed": len(images)
            }
        except Exception as e:
            return {"success": False, "error": str(e)}


# Example usage
if __name__ == "__main__":
    processor = GeminiMultimodalProcessor()

    # Single image analysis
    result = processor.analyze_single_image(
        "wireframe.png",
        "Describe this wireframe design and suggest improvements for better UX"
    )
    if result["success"]:
        print("Wireframe Analysis:")
        print(result["analysis"])

    # Generate code from design
    print("\n" + "="*70)
    print("Design to Code:")
    print("="*70)
    code_result = processor.design_to_code("mockup.png", framework="React")
    if code_result["success"]:
        print(code_result["code"])

    # Analyze multiple screenshots for comparison
    screenshots = ["version1.png", "version2.png", "version3.png"]
    comparison = processor.analyze_multiple_images(
        screenshots,
        analysis_type="comparison"
    )
    if comparison["success"]:
        print("\n" + "="*70)
        print("Version Comparison:")
        print("="*70)
        print(comparison["analysis"])
Gemini's large context window is good for batch processing images, analysing entire documents, or tracking visual changes across multiple versions of a design.
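Even with a large context window, you will usually want to cap how many images go into a single request, if only to keep failures cheap to retry. A minimal sketch of a batching helper follows. `chunk_paths` is a hypothetical name and the batch size is something to tune for your workload, not a documented API limit.

```python
from typing import Iterator, List, Sequence


def chunk_paths(paths: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size paths, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])
```

Each batch can then be passed to `processor.analyze_multiple_images(batch, analysis_type="summary")` from the class above.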
What I've learned shipping these to production
Optimise image resolution for cost. The "low" detail setting cuts token usage by 50-70% on general understanding tasks, because GPT-4o bills a flat 85 tokens per image at "low" versus 765 tokens for a 1024x1024 image at "high". Reserve "high" detail for screenshots with text, code, detailed diagrams, or anything that needs OCR accuracy. Resize images to a sensible cap before sending (around 2048x2048).
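To see what a given image will cost before you send it, you can estimate the token count from its dimensions. The sketch below follows the tiling rules OpenAI publishes for GPT-4o (85 base tokens plus 170 per 512px tile in high detail). The function name is mine, it assumes small images are not upscaled, and the result is an estimate rather than a billing guarantee.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o input tokens for one image of the given pixel size."""
    if detail == "low":
        return 85  # Flat cost regardless of image size

    # Step 1: scale down (never up) to fit within a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    scaled_w, scaled_h = width * scale, height * scale

    # Step 2: scale down so the shortest side is at most 768px
    scale = min(1.0, 768 / min(scaled_w, scaled_h))
    scaled_w, scaled_h = scaled_w * scale, scaled_h * scale

    # Step 3: count 512px tiles; each costs 170 tokens on top of an 85-token base
    tiles = math.ceil(scaled_w / 512) * math.ceil(scaled_h / 512)
    return 85 + 170 * tiles
```

A 1024x1024 screenshot comes out at 765 tokens in high detail and 85 in low, which is why the detail setting matters so much for bulk jobs.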
Error handling actually matters here. API rate limits need exponential backoff. JSON parsing errors will happen when extracting structured data, so log them. Validate image formats and sizes before the API call. Have a fallback when vision analysis fails.
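A minimal sketch of exponential backoff with jitter, kept independent of any SDK. `call_with_backoff` and its parameters are hypothetical names. With the OpenAI client you would typically pass `retry_on=(openai.RateLimitError,)` and wrap the API call in a lambda.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return func()
        except retry_on:
            if attempt >= max_attempts:
                raise  # Out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Up to 10% random jitter so parallel workers don't retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1
```

The `sleep` parameter exists so tests can swap in a fake and run instantly. Keep `retry_on` narrow in production, since retrying on every exception also retries bugs.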
Use structured outputs for data extraction. Set response_format to json_object for document processing tasks. Put clear JSON schemas in your prompts. Set temperature=0 so repeated extractions stay as consistent as possible. Validate extracted data against the schema you expected.
Monitor token usage. Track tokens per request so you can spot expensive operations. Cache repeated image analyses. Set spending alerts. Compare AI costs against the manual processing time you're saving, otherwise the ROI conversation gets fuzzy.
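Caching repeated analyses can be as simple as keying results on a hash of the image bytes plus the prompt. The sketch below keeps everything in memory. `AnalysisCache` is a hypothetical class, and a real deployment would more likely back it with Redis or a database and add expiry.

```python
import hashlib
from typing import Callable, Dict


class AnalysisCache:
    """In-memory cache keyed by a hash of the image bytes plus the prompt."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(image_bytes: bytes, prompt: str) -> str:
        digest = hashlib.sha256(image_bytes)
        digest.update(b"\x00")  # Separator between image bytes and prompt bytes
        digest.update(prompt.encode("utf-8"))
        return digest.hexdigest()

    def get_or_compute(self, image_bytes: bytes, prompt: str,
                       analyze: Callable[[bytes, str], str]) -> str:
        """Return the cached result, calling analyze only on a cache miss."""
        key = self.make_key(image_bytes, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = analyze(image_bytes, prompt)
        self._store[key] = result
        return result
```

The hit and miss counters double as a quick way to check whether the cache is earning its keep.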
Write clear, specific prompts. Include context about what you need and why. Specify output format explicitly. Break complex requests into focused prompts rather than one giant one. Examples in the prompt help the model land in the right shape.
Handle image security and privacy. Never send sensitive personal data through vision APIs without user consent. Sanitise metadata. Store base64-encoded images temporarily and delete after processing. On-premises deployment is worth the effort for highly sensitive use cases.
Test with diverse image quality. Validate with various qualities, lighting, and orientations. Edge cases like blurry images, partial content, or unusual layouts will absolutely show up in production. Give the user a useful error message when image quality is too low for accurate analysis.
Deployment notes
Scalability. Queue requests so high-volume processing doesn't crater your throughput. Use async processing for non-real-time applications. Use the batch processing APIs when you have a lot of images. Load balance across multiple instances.
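For the async case, a semaphore is often enough to keep concurrency under your rate limit without standing up a full queue. This is a sketch under assumptions: `process_all` is a hypothetical name, and `worker` stands in for whatever coroutine wraps your API call (for example one built on the OpenAI SDK's `AsyncOpenAI` client).

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def process_all(
    items: Sequence[str],
    worker: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> List[str]:
    """Run worker over items with at most max_concurrency calls in flight.

    Results are returned in the same order as the input items.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: str) -> str:
        async with semaphore:  # Blocks while max_concurrency calls are active
            return await worker(item)

    return list(await asyncio.gather(*(run_one(item) for item in items)))
```

Call it with `asyncio.run(process_all(paths, worker))`. A failure in any one worker propagates out of `gather`, so wrap the worker in your retry logic if partial results matter.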
Cost management. Alert on unusual spending. Tier your processing: cheap models for simple tasks, expensive ones only where the work demands it. Cache results for images you analyse frequently. Calculate cost per transaction so you actually know if the unit economics work.
Security. Sanitise file uploads. Rate limit per user. Secure storage for temporary image processing. Auth on every API endpoint. Never expose API keys in client-side code.
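One concrete piece of upload sanitising is checking the file's leading bytes instead of trusting its extension. The sketch below recognises PNG, JPEG, GIF, and WebP signatures. The function names and the 20 MB cap are placeholders of mine, so set the limit to whatever your provider and infrastructure allow.

```python
from typing import Optional

# Leading bytes ("magic numbers") for common image formats
_SIGNATURES = {
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/jpeg": b"\xff\xd8\xff",
    "image/gif": b"GIF8",
}
MAX_BYTES = 20 * 1024 * 1024  # Hypothetical upload cap; tune to your own limits


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return the MIME type implied by the leading bytes, or None if unrecognised."""
    for mime, signature in _SIGNATURES.items():
        if data.startswith(signature):
            return mime
    # WebP is a RIFF container with "WEBP" at byte offset 8
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def validate_upload(data: bytes, max_bytes: int = MAX_BYTES) -> str:
    """Return the detected MIME type, or raise ValueError if the upload should be rejected."""
    if not data:
        raise ValueError("empty upload")
    if len(data) > max_bytes:
        raise ValueError(f"upload is {len(data)} bytes, limit is {max_bytes}")
    mime = sniff_image_type(data)
    if mime is None:
        raise ValueError("unrecognised image format")
    return mime
```

A signature check only confirms the file starts like an image. It is a first filter, not a substitute for decoding the file in a sandboxed step.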
Monitoring. Track success rates by analysis type. Watch response times and find the slow operations. Log failed requests with error details. Alert on API failures or performance degradation. A/B test different prompting strategies when you have the data to do it.
Where this is being used
- Invoice and receipt processing at scale, replacing manual data entry for accounts payable teams
- Visual QA in manufacturing, where cameras feed product images to the model for defect detection
- Automated UI testing, catching layout bugs and accessibility issues from screenshots during CI
- Document digitisation, converting scanned contracts and forms into structured data
- Design-to-code workflows, where designers hand off mockups and developers get a code starting point (often combined with AI coding assistants to turn the generated scaffold into production code)
- Content moderation, flagging images and video frames that violate platform policies
Conclusion
GPT-4o and Gemini make multimodal development practical in a way it wasn't two years ago. One API call replaces what used to require separate OCR, image classification, and NLP pipelines.
GPT-4o is the better choice for developer tooling and code-related visual tasks. Gemini wins on large inputs and batch processing where the context window matters. Both are production-ready with reasonable pricing.
Start with a simple image analysis task to get a feel for the APIs. Then build a document processor for whatever manual data entry is eating your team's time. The code above works out of the box with minor adjustments for your specific use case.
Next steps
- Get API keys from OpenAI or Google AI Studio and run the basic image analysis example
- Pick your most painful manual data entry task and build a document processor for it
- Add error handling and cost tracking before going to production
- Test with messy, real-world inputs (blurry photos, skewed scans, handwritten text) to calibrate accuracy expectations