Running Stable Diffusion Locally - Complete Setup Guide
Stable Diffusion’s release in August 2022 was a game-changer. Unlike DALL-E 2 or Midjourney, it’s open-source and can run on consumer hardware. Here’s how I set it up locally and what I learned in the process.
Why Run Stable Diffusion Locally?
Advantages:
- No usage limits or costs
- Complete privacy (your prompts stay local)
- Full control over parameters
- Ability to fine-tune models
- No internet dependency
Disadvantages:
- Requires powerful GPU
- Initial setup complexity
- Slower than cloud services (unless you have high-end GPU)
Hardware Requirements
Minimum Requirements
GPU: NVIDIA RTX 2060 (6GB VRAM) or better
RAM: 16GB system RAM
Storage: 10GB free space
OS: Windows 10/11, Linux, or macOS (with limitations)
My Setup
GPU: NVIDIA RTX 3080 (10GB VRAM)
CPU: AMD Ryzen 7 5800X
RAM: 32GB DDR4
Storage: 500GB NVMe SSD
OS: Ubuntu 22.04 LTS
Performance: ~15 seconds per 512x512 image at 50 steps.
Installation Guide
Step 1: Install Prerequisites
# Update system
sudo apt update && sudo apt upgrade -y
# Install Python 3.10
sudo apt install python3.10 python3.10-venv python3-pip -y
# Install CUDA toolkit (for NVIDIA GPUs)
wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
sudo sh cuda_11.7.0_515.43.04_linux.run
# Verify CUDA installation
nvcc --version
Step 2: Clone Stable Diffusion Repository
# Clone the official repository
git clone https://github.com/CompVis/stable-diffusion.git
cd stable-diffusion
# Create virtual environment
python3.10 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

# The examples below also use the Hugging Face diffusers library
pip install diffusers transformers accelerate
Step 3: Download Model Weights
You need to download the model weights from Hugging Face:
# Install git-lfs
sudo apt install git-lfs
git lfs install
# Clone the model repository
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4
Note: You’ll need to accept the license on Hugging Face first.
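If you'd rather skip the manual clone, diffusers can also download and cache the weights for you once the license is accepted. A minimal sketch, assuming you've authenticated with huggingface-cli login:

# Let diffusers fetch and cache the weights (requires a prior `huggingface-cli login`)
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    use_auth_token=True  # uses the token saved by huggingface-cli login
)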
Step 4: Configure Environment
# Create .env file
cat > .env << EOF
MODEL_PATH=./stable-diffusion-v1-4
DEVICE=cuda
PRECISION=autocast
EOF
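The scripts below hard-code these values for simplicity. If you want to read them from the .env file instead, here is a minimal sketch using the python-dotenv package (my addition; it needs pip install python-dotenv):

# load_config.py - read settings from .env (requires python-dotenv)
import os
from dotenv import load_dotenv

load_dotenv()  # picks up .env from the current directory
model_path = os.getenv("MODEL_PATH", "./stable-diffusion-v1-4")
device = os.getenv("DEVICE", "cuda")
precision = os.getenv("PRECISION", "autocast")
print(model_path, device, precision)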
Step 5: Test Installation
# test_sd.py
import torch
from diffusers import StableDiffusionPipeline
# Load model (the local clone from Step 3 works too: "./stable-diffusion-v1-4")
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16
)
pipe = pipe.to(device)
# Generate image
prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]
# Save image
image.save("test_output.png")
print("Image generated successfully!")
Run the test:
python test_sd.py
Basic Usage
Simple Text-to-Image
from diffusers import StableDiffusionPipeline
import torch
# Initialize pipeline
pipe = StableDiffusionPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4",
torch_dtype=torch.float16
).to("cuda")
# Generate image
prompt = "a serene mountain landscape at sunset, oil painting style"
image = pipe(
prompt,
num_inference_steps=50,
guidance_scale=7.5
).images[0]
image.save("mountain_sunset.png")
Advanced Parameters
# More control over generation
generator = torch.Generator("cuda").manual_seed(42)  # fixed seed for reproducible results
images = pipe(
    prompt="cyberpunk city street, neon lights, rainy night, highly detailed",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=75,
    guidance_scale=8.5,
    width=768,
    height=512,
    num_images_per_prompt=4,
    generator=generator
).images

# Save all generated images
for idx, img in enumerate(images):
    img.save(f"cyberpunk_{idx}.png")
Image-to-Image Generation
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image
# Load img2img pipeline
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4",
torch_dtype=torch.float16
).to("cuda")
# Load initial image
init_image = Image.open("input.jpg").convert("RGB")
init_image = init_image.resize((768, 512))
# Transform image
prompt = "same scene but in winter with snow"
image = pipe(
prompt=prompt,
image=init_image,
strength=0.75, # How much to transform (0-1)
guidance_scale=7.5
).images[0]
image.save("winter_scene.png")
Optimization Tips
1. Use Half Precision (FP16)
Reduces VRAM usage by ~50%:
pipe = StableDiffusionPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16, # Use FP16
revision="fp16"
).to("cuda")
2. Enable Attention Slicing
For GPUs with limited VRAM:
pipe.enable_attention_slicing()
This allows generation on GPUs with as little as 4GB VRAM, though slower.
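If you're close to that 4GB floor, combining FP16, attention slicing, and 512x512 output is usually what gets you there. A sketch of that combination:

# Low-VRAM configuration: FP16 weights + attention slicing + 512x512 output
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()

image = pipe("a cozy cabin in the woods", width=512, height=512).images[0]
image.save("cabin.png")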
3. Use xFormers
Significant speed improvement:
# Install xformers
pip install xformers
# Enable in code
pipe.enable_xformers_memory_efficient_attention()
Result: 20-30% faster generation on my RTX 3080.
4. Batch Processing
Generate multiple images efficiently:
prompts = [
"a cat in a spacesuit",
"a dog wearing sunglasses",
"a bird with rainbow feathers"
]
# Generate all at once
images = pipe(
prompts,
num_inference_steps=50,
guidance_scale=7.5
).images
for idx, img in enumerate(images):
    img.save(f"batch_{idx}.png")
Creating a Simple Web UI
I built a basic Flask interface:
# app.py
from flask import Flask, request, send_file, render_template
from diffusers import StableDiffusionPipeline
import torch
from io import BytesIO
import base64
app = Flask(__name__)
# Load model once at startup
pipe = StableDiffusionPipeline.from_pretrained(
"CompVis/stable-diffusion-v1-4",
torch_dtype=torch.float16
).to("cuda")
@app.route('/')
def index():
    return render_template('index.html')

@app.route('/generate', methods=['POST'])
def generate():
    data = request.json
    prompt = data.get('prompt', '')
    steps = data.get('steps', 50)
    guidance = data.get('guidance_scale', 7.5)

    # Generate image
    image = pipe(
        prompt,
        num_inference_steps=steps,
        guidance_scale=guidance
    ).images[0]

    # Convert to base64
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    img_str = base64.b64encode(buffered.getvalue()).decode()

    return {'image': f'data:image/png;base64,{img_str}'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
HTML template:
<!-- templates/index.html -->
<!DOCTYPE html>
<html>
<head>
<title>Stable Diffusion Local</title>
<style>
body { font-family: Arial; max-width: 800px; margin: 50px auto; }
input, button { padding: 10px; margin: 5px; }
#prompt { width: 500px; }
#result { margin-top: 20px; }
img { max-width: 100%; }
</style>
</head>
<body>
<h1>Stable Diffusion Generator</h1>
<div>
<input type="text" id="prompt" placeholder="Enter your prompt...">
<button onclick="generate()">Generate</button>
</div>
<div id="result"></div>
<script>
async function generate() {
const prompt = document.getElementById('prompt').value;
const result = document.getElementById('result');
result.innerHTML = 'Generating...';
const response = await fetch('/generate', {
method: 'POST',
headers: {'Content-Type': 'application/json'},
body: JSON.stringify({prompt})
});
const data = await response.json();
result.innerHTML = `<img src="${data.image}">`;
}
</script>
</body>
</html>
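Run it with python app.py and open http://localhost:5000. To sanity-check the endpoint without the browser, you can POST to it directly; a quick sketch using the requests library (the prompt and filename are placeholders):

# Quick test of the /generate endpoint while app.py is running
import base64
import requests

resp = requests.post(
    "http://localhost:5000/generate",
    json={"prompt": "a lighthouse at dawn", "steps": 30, "guidance_scale": 7.5},
)
img_b64 = resp.json()["image"].split(",", 1)[1]  # drop the "data:image/png;base64," prefix
with open("api_test.png", "wb") as f:
    f.write(base64.b64decode(img_b64))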
Performance Benchmarks
I tested different configurations on my RTX 3080:
| Configuration | Time (512x512, 50 steps) | VRAM Usage |
|---|---|---|
| FP32, no optimization | 28s | 9.2GB |
| FP16 | 18s | 4.8GB |
| FP16 + attention slicing | 22s | 3.2GB |
| FP16 + xFormers | 14s | 4.6GB |
| FP16 + xFormers + batch(4) | 45s total (11.25s each) | 8.9GB |
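For reference, timings and peak VRAM figures like these can be measured with something along these lines (assumes the pipeline is already loaded as pipe):

# Time one generation and record peak VRAM
import time
import torch

torch.cuda.reset_peak_memory_stats()
start = time.time()
image = pipe("a test prompt", num_inference_steps=50).images[0]
torch.cuda.synchronize()  # wait for all GPU work to finish before stopping the clock
elapsed = time.time() - start
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"{elapsed:.1f}s, peak VRAM {peak_gb:.1f}GB")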
Common Issues and Solutions
Issue 1: CUDA Out of Memory
RuntimeError: CUDA out of memory
Solutions:
# Enable attention slicing
pipe.enable_attention_slicing()
# Reduce image size
width=512, height=512 # Instead of 768x768
# Use FP16
torch_dtype=torch.float16
Issue 2: Slow Generation
Solutions:
- Install xFormers
- Use FP16
- Reduce inference steps (30-40 often sufficient)
- Ensure the GPU is actually being used (check with nvidia-smi, or from Python as in the sketch below)
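A quick way to confirm PyTorch sees the GPU from inside your virtual environment:

# Confirm PyTorch can see the GPU
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))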
Issue 3: Poor Image Quality
# Increase inference steps
num_inference_steps=75 # Instead of 50
# Adjust guidance scale
guidance_scale=8.5 # Higher = more prompt adherence
# Use negative prompts
negative_prompt="blurry, low quality, distorted, ugly"
Prompt Engineering Tips
Good Prompts
# Detailed and specific
"a majestic lion standing on a cliff at sunset, golden hour lighting, \
photorealistic, highly detailed, 8k, national geographic style"
# Style modifiers
"portrait of a woman, oil painting, renaissance style, by Leonardo da Vinci"
# Quality boosters
"..., masterpiece, best quality, highly detailed, sharp focus"
Negative Prompts
negative_prompt = "blurry, low quality, distorted, ugly, bad anatomy, \
watermark, signature, text, cropped, out of frame"
Cost Analysis
Cloud Services (Midjourney)
- $10/month for ~200 images
- $30/month for ~900 images
Local Setup
- Initial cost: $700 (RTX 3080)
- Electricity: ~$5/month (100 hours usage)
- Unlimited generations
Break-even: roughly 28 months against the $30/month plan at these figures, or much sooner if heavy usage would otherwise push you into multiple plans or per-image API billing.
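The back-of-envelope math, using my figures above (adjust for your own usage):

# Break-even vs. the $30/month cloud plan, using the figures above
gpu_cost = 700      # one-time, USD
electricity = 5     # per month, USD
cloud_plan = 30     # per month, USD

months = gpu_cost / (cloud_plan - electricity)
print(f"Break-even after ~{months:.0f} months")  # ~28 months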
Future Improvements
I’m planning to:
- Fine-tune models on custom datasets
- Implement inpainting and outpainting
- Add ControlNet for better control
- Create automated batch processing scripts
Conclusion
Running Stable Diffusion locally is incredibly rewarding. The initial setup takes effort, but the freedom and control are worth it.
Pros:
- ✅ Unlimited generations
- ✅ Complete privacy
- ✅ Full customization
- ✅ Learning opportunity
Cons:
- ❌ Requires expensive GPU
- ❌ Technical setup required
- ❌ Slower than high-end cloud services
If you have the hardware, I highly recommend running Stable Diffusion locally. It’s the future of creative AI tools.
My rating: 9/10 - Revolutionary technology with minor setup friction.