Stable Diffusion’s release in August 2022 was a game-changer. Unlike DALL-E 2 or Midjourney, it’s open-source and can run on consumer hardware. Here’s how I set it up locally and what I learned in the process.

Why Run Stable Diffusion Locally?

Advantages:

  • No usage limits or costs
  • Complete privacy (your prompts stay local)
  • Full control over parameters
  • Ability to fine-tune models
  • No internet dependency

Disadvantages:

  • Requires powerful GPU
  • Initial setup complexity
  • Slower than cloud services (unless you have a high-end GPU)

Hardware Requirements

Minimum Requirements

GPU: NVIDIA RTX 2060 (6GB VRAM) or better
RAM: 16GB system RAM
Storage: 10GB free space
OS: Windows 10/11, Linux, or macOS (with limitations)

My Setup

GPU: NVIDIA RTX 3080 (10GB VRAM)
CPU: AMD Ryzen 7 5800X
RAM: 32GB DDR4
Storage: 500GB NVMe SSD
OS: Ubuntu 22.04 LTS

Performance: ~15 seconds per 512x512 image at 50 steps.

Installation Guide

Step 1: Install Prerequisites

# Update system
sudo apt update && sudo apt upgrade -y

# Install Python 3.10
sudo apt install python3.10 python3.10-venv python3-pip -y

# Install CUDA toolkit (for NVIDIA GPUs)
wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
sudo sh cuda_11.7.0_515.43.04_linux.run

# Verify CUDA installation
nvcc --version

Step 2: Clone Stable Diffusion Repository

# Clone the official repository
git clone https://github.com/CompVis/stable-diffusion.git
cd stable-diffusion

# Create virtual environment
python3.10 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Step 3: Download Model Weights

You need to download the model weights from Hugging Face:

# Install git-lfs
sudo apt install git-lfs
git lfs install

# Clone the model repository
git clone https://huggingface.co/CompVis/stable-diffusion-v1-4

Note: You’ll need to accept the model license on Hugging Face first, and authenticate with your username and an access token when git prompts for credentials.

Step 4: Configure Environment

# Create .env file
cat > .env << EOF
MODEL_PATH=./stable-diffusion-v1-4
DEVICE=cuda
PRECISION=autocast
EOF
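
One way to pick these values up in Python is the python-dotenv package (an assumption on my part; the repo doesn’t require it, so pip install python-dotenv first). A minimal sketch:

# load_config.py - small helper that reads the .env values above (assumes python-dotenv)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory

model_path = os.getenv("MODEL_PATH", "./stable-diffusion-v1-4")
device = os.getenv("DEVICE", "cuda")
precision = os.getenv("PRECISION", "autocast")
print(model_path, device, precision)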

Step 5: Test Installation

# test_sd.py
import torch
from diffusers import StableDiffusionPipeline

# Load model
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"

pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16
)
pipe = pipe.to(device)

# Generate image
prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]

# Save image
image.save("test_output.png")
print("Image generated successfully!")

The test script uses the diffusers library; if it isn’t already in your virtual environment, install it first:

pip install diffusers transformers

Then run the test:

python test_sd.py

Basic Usage

Simple Text-to-Image

from diffusers import StableDiffusionPipeline
import torch

# Initialize pipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16
).to("cuda")

# Generate image
prompt = "a serene mountain landscape at sunset, oil painting style"
image = pipe(
    prompt,
    num_inference_steps=50,
    guidance_scale=7.5
).images[0]

image.save("mountain_sunset.png")

Advanced Parameters

# More control over generation
# (the pipeline takes a torch.Generator for reproducibility, not a seed argument)
generator = torch.Generator(device="cuda").manual_seed(42)

images = pipe(
    prompt="cyberpunk city street, neon lights, rainy night, highly detailed",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=75,
    guidance_scale=8.5,
    width=768,
    height=512,
    num_images_per_prompt=4,
    generator=generator
).images

# Save all generated images
for idx, img in enumerate(images):
    img.save(f"cyberpunk_{idx}.png")

Image-to-Image Generation

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load img2img pipeline
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16
).to("cuda")

# Load initial image
init_image = Image.open("input.jpg").convert("RGB")
init_image = init_image.resize((768, 512))

# Transform image
prompt = "same scene but in winter with snow"
image = pipe(
    prompt=prompt,
    image=init_image,
    strength=0.75,  # How much to transform (0-1)
    guidance_scale=7.5
).images[0]

image.save("winter_scene.png")

Optimization Tips

1. Use Half Precision (FP16)

Reduces VRAM usage by ~50%:

pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # Use FP16
    revision="fp16"
).to("cuda")

2. Enable Attention Slicing

For GPUs with limited VRAM:

pipe.enable_attention_slicing()

This allows generation on GPUs with as little as 4GB of VRAM, though generation is noticeably slower.

3. Use xFormers

Significant speed improvement:

# Install xformers
pip install xformers

# Enable in code
pipe.enable_xformers_memory_efficient_attention()

Result: 20-30% faster generation on my RTX 3080.

4. Batch Processing

Generate multiple images efficiently:

prompts = [
    "a cat in a spacesuit",
    "a dog wearing sunglasses",
    "a bird with rainbow feathers"
]

# Generate all at once
images = pipe(
    prompts,
    num_inference_steps=50,
    guidance_scale=7.5
).images

for idx, img in enumerate(images):
    img.save(f"batch_{idx}.png")

Creating a Simple Web UI

I built a basic Flask interface:

# app.py
from flask import Flask, request, send_file, render_template
from diffusers import StableDiffusionPipeline
import torch
from io import BytesIO
import base64

app = Flask(__name__)

# Load model once at startup
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16
).to("cuda")

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/generate', methods=['POST'])
def generate():
    data = request.json
    prompt = data.get('prompt', '')
    steps = data.get('steps', 50)
    guidance = data.get('guidance_scale', 7.5)
    
    # Generate image
    image = pipe(
        prompt,
        num_inference_steps=steps,
        guidance_scale=guidance
    ).images[0]
    
    # Convert to base64
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    img_str = base64.b64encode(buffered.getvalue()).decode()
    
    return {'image': f'data:image/png;base64,{img_str}'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
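
Before wiring up the front end, the /generate endpoint can be exercised directly. A quick client sketch (assuming the requests package is installed and the server is running on localhost:5000):

# query_api.py - hypothetical client for the /generate endpoint above
import base64
import requests

resp = requests.post(
    "http://localhost:5000/generate",
    json={"prompt": "a lighthouse in a storm", "steps": 40, "guidance_scale": 7.5},
)
data_url = resp.json()["image"]

# Strip the data-URL prefix and decode the PNG bytes
png_bytes = base64.b64decode(data_url.split(",", 1)[1])
with open("api_output.png", "wb") as f:
    f.write(png_bytes)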

HTML template:

<!-- templates/index.html -->
<!DOCTYPE html>
<html>
<head>
    <title>Stable Diffusion Local</title>
    <style>
        body { font-family: Arial; max-width: 800px; margin: 50px auto; }
        input, button { padding: 10px; margin: 5px; }
        #prompt { width: 500px; }
        #result { margin-top: 20px; }
        img { max-width: 100%; }
    </style>
</head>
<body>
    <h1>Stable Diffusion Generator</h1>
    <div>
        <input type="text" id="prompt" placeholder="Enter your prompt...">
        <button onclick="generate()">Generate</button>
    </div>
    <div id="result"></div>
    
    <script>
        async function generate() {
            const prompt = document.getElementById('prompt').value;
            const result = document.getElementById('result');
            
            result.innerHTML = 'Generating...';
            
            const response = await fetch('/generate', {
                method: 'POST',
                headers: {'Content-Type': 'application/json'},
                body: JSON.stringify({prompt})
            });
            
            const data = await response.json();
            result.innerHTML = `<img src="${data.image}">`;
        }
    </script>
</body>
</html>

Performance Benchmarks

I tested different configurations on my RTX 3080:

Configuration                   Time (512x512, 50 steps)    VRAM Usage
FP32, no optimization           28s                         9.2GB
FP16                            18s                         4.8GB
FP16 + attention slicing        22s                         3.2GB
FP16 + xFormers                 14s                         4.6GB
FP16 + xFormers + batch(4)      45s total (11.25s each)     8.9GB

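For reference, a minimal sketch of how numbers like these can be collected (per-image time and peak VRAM allocated by PyTorch), assuming pipe is the FP16 pipeline loaded earlier; this is not the exact script I used:

# benchmark.py - rough timing / VRAM measurement sketch
import time
import torch

torch.cuda.reset_peak_memory_stats()

start = time.perf_counter()
image = pipe(
    "a photograph of an astronaut riding a horse",
    num_inference_steps=50
).images[0]
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

peak_vram_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"{elapsed:.1f}s, peak VRAM {peak_vram_gb:.1f}GB")
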
Common Issues and Solutions

Issue 1: CUDA Out of Memory

RuntimeError: CUDA out of memory

Solutions:

# Enable attention slicing
pipe.enable_attention_slicing()

# Reduce image size
width=512, height=512  # Instead of 768x768

# Use FP16
torch_dtype=torch.float16

Issue 2: Slow Generation

Solutions:

  • Install xFormers
  • Use FP16
  • Reduce inference steps (30-40 often sufficient)
  • Ensure the GPU is actually being used (check with nvidia-smi, or run the quick check below)
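
A quick check that PyTorch can actually see the CUDA device (run it inside the virtual environment):

# check_gpu.py - confirm the CUDA device is visible to PyTorch
import torch

print(torch.cuda.is_available())           # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. the RTX 3080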

Issue 3: Poor Image Quality

# Increase inference steps
num_inference_steps=75  # Instead of 50

# Adjust guidance scale
guidance_scale=8.5  # Higher = more prompt adherence

# Use negative prompts
negative_prompt="blurry, low quality, distorted, ugly"

Prompt Engineering Tips

Good Prompts

# Detailed and specific
"a majestic lion standing on a cliff at sunset, golden hour lighting, \
photorealistic, highly detailed, 8k, national geographic style"

# Style modifiers
"portrait of a woman, oil painting, renaissance style, by Leonardo da Vinci"

# Quality boosters
"..., masterpiece, best quality, highly detailed, sharp focus"

Negative Prompts

negative_prompt = "blurry, low quality, distorted, ugly, bad anatomy, \
watermark, signature, text, cropped, out of frame"

Cost Analysis

Cloud Services (Midjourney)

  • $10/month for ~200 images
  • $30/month for ~900 images

Local Setup

  • Initial cost: $700 (RTX 3080)
  • Electricity: ~$5/month (100 hours usage)
  • Unlimited generations

Break-even: against the $30/month plan, roughly two years of use; sooner if your volume would otherwise require multiple subscriptions.
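
A back-of-the-envelope check of that figure, using only the numbers above:

# breakeven.py - rough payback estimate against the $30/month plan
gpu_cost = 700              # RTX 3080, USD
cloud_monthly = 30          # Midjourney plan, USD/month
electricity_monthly = 5     # ~100 hours of local generation, USD/month

net_savings = cloud_monthly - electricity_monthly   # 25 USD/month
print(gpu_cost / net_savings)                        # ~28 months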

Future Improvements

I’m planning to:

  1. Fine-tune models on custom datasets
  2. Implement inpainting and outpainting
  3. Add ControlNet for better control
  4. Create automated batch processing scripts

Conclusion

Running Stable Diffusion locally is incredibly rewarding. The initial setup takes effort, but the freedom and control are worth it.

Pros:

  • ✅ Unlimited generations
  • ✅ Complete privacy
  • ✅ Full customization
  • ✅ Learning opportunity

Cons:

  • ❌ Requires expensive GPU
  • ❌ Technical setup required
  • ❌ Slower than high-end cloud services

If you have the hardware, I highly recommend running Stable Diffusion locally. It’s the future of creative AI tools.

My rating: 9/10 - Revolutionary technology with minor setup friction.