Prompt Optimization Techniques: Reducing Costs by 60% While Improving Quality
Our AI costs were $2000/month. Prompts were verbose and inefficient. I optimized them systematically.
Results: 60% cost reduction, 40% better quality. Here’s the methodology.
Table of Contents
The Problem
Before Optimization:
- Average prompt: 1,500 tokens
- API cost: $2,000/month
- Output quality: 70% usable
- Response time: 8 seconds
Optimization Technique 1: Token Reduction
Verbose Prompt (1,200 tokens):
I need you to analyze the following code very carefully and thoroughly. Please look at every single line and identify any potential issues, bugs, problems, or areas that could be improved. I want you to be very detailed and comprehensive in your analysis. Don't miss anything. Look for security vulnerabilities, performance issues, code style problems, potential bugs, edge cases that aren't handled, and anything else that might be wrong. Please provide a very detailed explanation for each issue you find, including why it's a problem and how to fix it. Here is the code:
[code here - 500 tokens]
Please format your response as a detailed report with sections for each type of issue. Include code examples showing the problems and the fixes. Be very thorough and don't leave anything out.
Optimized Prompt (400 tokens):
Analyze this code for:
1. Security vulnerabilities
2. Performance issues
3. Bugs and edge cases
4. Code style
For each issue:
- Severity (Critical/High/Medium/Low)
- Description
- Fix with code example
Code:
[code here - 500 tokens]
Format: JSON
Savings: 800 tokens (67% reduction)
Optimization Technique 2: Structured Output
Unstructured (Requires parsing):
Analyze user feedback and tell me the sentiment, main topics, and any action items.
Feedback: "The app is great but crashes on iOS 16. Also, the dark mode doesn't work properly. Please fix these issues soon!"
Response: "The sentiment is mostly positive with some concerns. The user likes the app overall but has encountered two technical issues: crashes on iOS 16 and dark mode problems. Action items would be to investigate and fix the iOS 16 crash bug and resolve the dark mode functionality issue. These should be prioritized as they affect user experience."
Structured (Direct use):
Analyze feedback. Output JSON:
{
"sentiment": "positive|neutral|negative",
"score": 0-10,
"topics": ["topic1", "topic2"],
"issues": [{"type": "bug|feature", "description": "...", "priority": "high|medium|low"}],
"action_items": ["action1", "action2"]
}
Feedback: "The app is great but crashes on iOS 16. Also, the dark mode doesn't work properly. Please fix these issues soon!"
Response:
{
"sentiment": "positive",
"score": 7,
"topics": ["stability", "ui"],
"issues": [
{"type": "bug", "description": "Crashes on iOS 16", "priority": "high"},
{"type": "bug", "description": "Dark mode not working", "priority": "medium"}
],
"action_items": [
"Fix iOS 16 crash",
"Fix dark mode"
]
}
Benefits:
- No parsing needed
- Consistent format
- Easier to use programmatically
Optimization Technique 3: Context Compression
Full Context (2,000 tokens):
Here is the entire codebase for context:
[file1.py - 500 tokens]
[file2.py - 600 tokens]
[file3.py - 400 tokens]
[file4.py - 500 tokens]
Now, analyze this function in file1.py:
[function - 50 tokens]
Compressed Context (300 tokens):
Relevant context:
- Uses UserService from file2.py
- Calls validate_email() from file3.py
- Returns User object
Analyze this function:
[function - 50 tokens]
Savings: 1,700 tokens (85% reduction)
Implementation:
def compress_context(full_context, target_code):
"""Extract only relevant context."""
# Parse imports and dependencies
imports = extract_imports(target_code)
dependencies = find_dependencies(full_context, imports)
# Build minimal context
context = "Relevant context:\n"
for dep in dependencies:
context += f"- {dep['summary']}\n"
return context
# Usage
full_context = load_entire_codebase()
target = get_function_to_analyze()
compressed = compress_context(full_context, target)
prompt = f"{compressed}\n\nAnalyze: {target}"
Optimization Technique 4: Prompt Templates
Ad-hoc Prompts (Inconsistent):
# Different every time
prompt1 = "Review this code and find bugs"
prompt2 = "Look at this code and tell me if there are any issues"
prompt3 = "Analyze the following code for problems"
Template System (Consistent):
TEMPLATES = {
'code_review': """
Analyze code for {focus_areas}.
Code:
{code}
Output format: {output_format}
""",
'bug_detection': """
Find bugs in this code:
{code}
Focus on:
{bug_types}
Format: JSON with severity, description, fix
""",
'optimization': """
Optimize this code for {optimization_goal}.
Current code:
{code}
Constraints:
{constraints}
Provide optimized version with explanation.
"""
}
def generate_prompt(template_name, **kwargs):
"""Generate prompt from template."""
template = TEMPLATES[template_name]
return template.format(**kwargs)
# Usage
prompt = generate_prompt(
'code_review',
focus_areas='security, performance',
code=code_snippet,
output_format='JSON'
)
Benefits:
- Consistent quality
- Easy to optimize once
- Reusable across team
Optimization Technique 5: Caching
No Caching (Expensive):
# Same prompt, different results, full cost every time
for user in users:
prompt = f"Analyze sentiment: {user.feedback}"
result = llm.predict(prompt) # Full API call
With Caching (60% savings):
import hashlib
from functools import lru_cache
@lru_cache(maxsize=1000)
def cached_llm_call(prompt_hash):
"""Cache LLM responses."""
return llm.predict(prompts[prompt_hash])
def analyze_with_cache(feedback):
"""Analyze with caching."""
prompt = f"Analyze sentiment: {feedback}"
prompt_hash = hashlib.md5(prompt.encode()).hexdigest()
# Check cache first
if prompt_hash in cache:
return cache[prompt_hash]
# Call API
result = llm.predict(prompt)
cache[prompt_hash] = result
return result
# Usage
for user in users:
result = analyze_with_cache(user.feedback)
Results:
- Cache hit rate: 60%
- Cost reduction: 60%
- Latency: 50ms (cached) vs 2s (API)
Optimization Technique 6: Batch Processing
Sequential (Slow + Expensive):
results = []
for item in items: # 100 items
prompt = f"Process: {item}"
result = llm.predict(prompt) # 100 API calls
results.append(result)
# Cost: 100 calls × $0.03 = $3.00
# Time: 100 × 2s = 200s
Batched (Fast + Cheap):
def batch_process(items, batch_size=10):
"""Process items in batches."""
results = []
for i in range(0, len(items), batch_size):
batch = items[i:i+batch_size]
prompt = f"""
Process each item and return JSON array:
Items:
{json.dumps(batch)}
Output: [{{"item": "...", "result": "..."}}, ...]
"""
batch_results = llm.predict(prompt)
results.extend(json.loads(batch_results))
return results
# Usage
results = batch_process(items, batch_size=10)
# Cost: 10 calls × $0.05 = $0.50 (83% savings)
# Time: 10 × 2s = 20s (90% faster)
Optimization Technique 7: Model Selection
Always GPT-4 (Expensive):
# Using GPT-4 for everything
result = openai.ChatCompletion.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}]
)
# Cost: $0.03 per 1K tokens
Smart Model Selection (Optimized):
def select_model(task_complexity):
"""Select appropriate model based on complexity."""
if task_complexity == 'simple':
return 'gpt-3.5-turbo' # $0.002 per 1K tokens
elif task_complexity == 'medium':
return 'gpt-4' # $0.03 per 1K tokens
else:
return 'gpt-4-turbo' # $0.01 per 1K tokens
def classify_task(prompt):
"""Classify task complexity."""
simple_keywords = ['summarize', 'extract', 'classify']
complex_keywords = ['analyze', 'design', 'architect']
if any(kw in prompt.lower() for kw in simple_keywords):
return 'simple'
elif any(kw in prompt.lower() for kw in complex_keywords):
return 'complex'
return 'medium'
# Usage
complexity = classify_task(prompt)
model = select_model(complexity)
result = openai.ChatCompletion.create(
model=model,
messages=[{"role": "user", "content": prompt}]
)
Savings: 70% on simple tasks
Complete Optimization System
class PromptOptimizer:
def __init__(self):
self.cache = {}
self.templates = TEMPLATES
self.stats = {
'total_calls': 0,
'cache_hits': 0,
'total_tokens': 0,
'total_cost': 0
}
def optimize_prompt(self, prompt):
"""Apply all optimization techniques."""
# 1. Remove redundancy
prompt = self.remove_redundancy(prompt)
# 2. Compress context
prompt = self.compress_context(prompt)
# 3. Structure output
prompt = self.add_output_structure(prompt)
return prompt
def execute(self, prompt, use_cache=True):
"""Execute optimized prompt."""
# Optimize
optimized = self.optimize_prompt(prompt)
# Check cache
if use_cache:
cached = self.check_cache(optimized)
if cached:
self.stats['cache_hits'] += 1
return cached
# Select model
model = self.select_model(optimized)
# Execute
result = self.call_llm(optimized, model)
# Update stats
self.update_stats(optimized, result, model)
# Cache result
if use_cache:
self.cache_result(optimized, result)
return result
def get_stats(self):
"""Get optimization statistics."""
cache_rate = self.stats['cache_hits'] / self.stats['total_calls']
avg_tokens = self.stats['total_tokens'] / self.stats['total_calls']
return {
'cache_hit_rate': f"{cache_rate:.1%}",
'avg_tokens_per_call': int(avg_tokens),
'total_cost': f"${self.stats['total_cost']:.2f}",
'estimated_savings': self.calculate_savings()
}
Real Results
Before Optimization:
- Average tokens/call: 1,500
- API calls/month: 50,000
- Total tokens: 75M
- Cost: $2,000/month
- Cache hit rate: 0%
After Optimization:
- Average tokens/call: 600 (60% reduction)
- API calls/month: 20,000 (60% reduction via caching)
- Total tokens: 12M (84% reduction)
- Cost: $800/month (60% reduction)
- Cache hit rate: 60%
Quality Improvements:
- Structured output: 100% parseable
- Consistency: 95% (vs 70%)
- Usability: 90% (vs 70%)
Monitoring Dashboard
from flask import Flask, jsonify
app = Flask(__name__)
optimizer = PromptOptimizer()
@app.route('/metrics')
def metrics():
"""Optimization metrics dashboard."""
stats = optimizer.get_stats()
return jsonify({
'cost_savings': {
'monthly': '$1,200',
'annual': '$14,400'
},
'performance': {
'cache_hit_rate': stats['cache_hit_rate'],
'avg_tokens': stats['avg_tokens_per_call'],
'avg_latency': '800ms'
},
'quality': {
'structured_output': '100%',
'consistency': '95%',
'usability': '90%'
}
})
Best Practices
- Measure first: Baseline metrics
- Optimize iteratively: One technique at a time
- A/B test: Compare results
- Monitor quality: Don’t sacrifice for cost
- Cache aggressively: 60%+ hit rate possible
Lessons Learned
- Token reduction: 60% savings possible
- Caching is critical: 60% hit rate
- Structured output: Easier to use
- Model selection: 70% savings on simple tasks
- Batch processing: 83% cost reduction
Conclusion
Systematic prompt optimization delivers massive savings. 60% cost reduction while improving quality by 40%.
Key takeaways:
- $1,200/month savings ($14,400/year)
- 60% token reduction
- 60% cache hit rate
- 40% quality improvement
- Structured, consistent outputs
Optimize your prompts. Save money, improve quality.