I rewrote a Python data processing script in Rust expecting it to be faster. It was, but only by 2x. After optimization, it became 42x faster. Here’s how.

The Problem

Task: Process 100,000 log entries, extract patterns, and generate statistics.

Python version: 2.1 seconds
Initial Rust version: 1.0 seconds (disappointing!)
Optimized Rust version: 50 milliseconds (42x improvement!)

Initial Rust Implementation

use std::fs::File;
use std::io::{BufRead, BufReader};
use std::collections::HashMap;

fn process_logs(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);
    let mut stats = HashMap::new();
    
    for line in reader.lines() {
        let line = line.unwrap();
        let parts: Vec<&str> = line.split('|').collect();
        
        if parts.len() >= 3 {
            let level = parts[1].to_string();
            *stats.entry(level).or_insert(0) += 1;
        }
    }
    
    stats
}

Performance: 1.0 second

Not bad, but we can do much better.

Optimization 1: Avoid Unnecessary Allocations

Problem

let level = parts[1].to_string();  // Allocates new String
*stats.entry(level).or_insert(0) += 1;

Every iteration allocates a new String.

Solution

use std::fs::File;
use std::io::{BufRead, BufReader};
use std::collections::HashMap;

fn process_logs(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);
    let mut stats = HashMap::new();
    
    for line in reader.lines() {
        let line = line.unwrap();
        let parts: Vec<&str> = line.split('|').collect();
        
        if parts.len() >= 3 {
            // Look up with &str first; a String is allocated only
            // the first time a level is seen
            if let Some(count) = stats.get_mut(parts[1]) {
                *count += 1;
            } else {
                stats.insert(parts[1].to_owned(), 1);
            }
        }
    }
    
    stats
}

Performance: 800ms (20% improvement)
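The savings come from allocating only on the first sight of a key. A std-only sketch of that lookup-first pattern, separated from the file-reading code (the helper name is illustrative, not from the script):

```rust
use std::collections::HashMap;

/// Counts occurrences of each key, allocating a String only the
/// first time a key is seen; repeat keys take the &str lookup path.
fn count_levels<'a>(levels: impl Iterator<Item = &'a str>) -> HashMap<String, u32> {
    let mut stats: HashMap<String, u32> = HashMap::new();
    for level in levels {
        // `get_mut` accepts &str because String: Borrow<str>
        if let Some(count) = stats.get_mut(level) {
            *count += 1;
        } else {
            stats.insert(level.to_owned(), 1);
        }
    }
    stats
}
```

Repeat keys never touch the allocator, which matters when a handful of log levels repeats across 100,000 lines.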

Optimization 2: Pre-allocate Capacity

Problem

HashMap and Vec grow dynamically, causing reallocations.

Solution

fn process_logs(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);
    
    // Pre-allocate with estimated capacity
    let mut stats = HashMap::with_capacity(10);
    
    for line in reader.lines() {
        let line = line.unwrap();
        
        // Avoid collecting into Vec
        let mut parts = line.split('|');
        parts.next(); // Skip first part
        
        if let Some(level) = parts.next() {
            *stats.entry(level.to_owned()).or_insert(0) += 1;
        }
    }
    
    stats
}

Performance: 650ms (19% improvement)
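Whether the pre-allocation actually avoids rehashes can be checked directly: the map’s capacity should not move while the expected handful of keys goes in (a std-only sketch; 10 mirrors the guess above):

```rust
use std::collections::HashMap;

/// Inserts five log levels into a pre-sized map and reports the
/// capacity before and after; equal values mean no rehash occurred.
fn fill_preallocated() -> (usize, usize) {
    let mut stats: HashMap<String, u32> = HashMap::with_capacity(10);
    let before = stats.capacity();
    for level in ["DEBUG", "INFO", "WARN", "ERROR", "FATAL"] {
        stats.insert(level.to_owned(), 0);
    }
    (before, stats.capacity())
}
```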

Optimization 3: Use Faster Parsing

Problem

split() yields every field through generic iterator machinery, but only the second field is needed.

Solution: Locate the second field’s delimiters directly with find()

fn process_logs(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);
    let mut stats = HashMap::with_capacity(10);
    
    for line in reader.lines() {
        let line = line.unwrap();
        
        // Find second field directly
        if let Some(start) = line.find('|') {
            if let Some(end) = line[start + 1..].find('|') {
                let level = &line[start + 1..start + 1 + end];
                *stats.entry(level.to_owned()).or_insert(0) += 1;
            }
        }
    }
    
    stats
}

Performance: 450ms (31% improvement)
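The two-find() scan can be pulled out as a helper to make its behavior easy to check (the name is illustrative, not from the article’s code):

```rust
/// Returns the second '|'-delimited field of a log line, if present.
fn second_field(line: &str) -> Option<&str> {
    let start = line.find('|')?;
    let end = line[start + 1..].find('|')?;
    Some(&line[start + 1..start + 1 + end])
}
```

For example, second_field("2024-01-01|ERROR|disk full") yields Some("ERROR"), and a line without two delimiters yields None.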

Optimization 4: Parallel Processing

Use Rayon for Parallelism

use rayon::prelude::*;
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::sync::Mutex;

fn process_logs_parallel(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);
    
    let stats = Mutex::new(HashMap::with_capacity(10));
    
    reader.lines()
        .par_bridge()
        .for_each(|line| {
            if let Ok(line) = line {
                if let Some(start) = line.find('|') {
                    if let Some(end) = line[start + 1..].find('|') {
                        let level = &line[start + 1..start + 1 + end];
                        let mut map = stats.lock().unwrap();
                        *map.entry(level.to_owned()).or_insert(0) += 1;
                    }
                }
            }
        });
    
    stats.into_inner().unwrap()
}

Performance: 200ms (55% improvement)

But we can do better: lock contention is an issue.

Optimization 5: Lock-Free Parallel Processing

use rayon::prelude::*;
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn process_logs_parallel_v2(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);
    
    // Collect lines into Vec first
    let lines: Vec<String> = reader.lines()
        .filter_map(|l| l.ok())
        .collect();
    
    // Process in parallel, each thread gets its own HashMap
    let partial_results: Vec<HashMap<String, u32>> = lines
        .par_chunks(1000)
        .map(|chunk| {
            let mut local_stats = HashMap::with_capacity(10);
            
            for line in chunk {
                if let Some(start) = line.find('|') {
                    if let Some(end) = line[start + 1..].find('|') {
                        let level = &line[start + 1..start + 1 + end];
                        *local_stats.entry(level.to_owned()).or_insert(0) += 1;
                    }
                }
            }
            
            local_stats
        })
        .collect();
    
    // Merge results
    let mut final_stats = HashMap::with_capacity(10);
    for partial in partial_results {
        for (key, value) in partial {
            *final_stats.entry(key).or_insert(0) += value;
        }
    }
    
    final_stats
}

Performance: 120ms (40% improvement)

Optimization 6: Memory-Mapped Files

use memmap2::Mmap;
use std::collections::HashMap;
use std::fs::File;

fn process_logs_mmap(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let mmap = unsafe { Mmap::map(&file).unwrap() };
    
    let mut stats = HashMap::with_capacity(10);
    
    // Process as byte slice (note: a final line without a
    // trailing '\n' is skipped by this loop)
    let data = &mmap[..];
    let mut line_start = 0;
    
    for (i, &byte) in data.iter().enumerate() {
        if byte == b'\n' {
            let line = &data[line_start..i];
            
            // Find delimiters
            if let Some(first_pipe) = line.iter().position(|&b| b == b'|') {
                if let Some(second_pipe) = line[first_pipe + 1..]
                    .iter()
                    .position(|&b| b == b'|') 
                {
                    let level = &line[first_pipe + 1..first_pipe + 1 + second_pipe];
                    let level_str = std::str::from_utf8(level).unwrap();
                    *stats.entry(level_str.to_owned()).or_insert(0) += 1;
                }
            }
            
            line_start = i + 1;
        }
    }
    
    stats
}

Performance: 80ms (33% improvement)
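The byte-level scan itself doesn’t depend on mmap; a std-only sketch of the same parsing over any &[u8] looks like this (helper name is illustrative; slice::split also handles a final line without a trailing newline):

```rust
use std::collections::HashMap;

/// Tallies the second '|'-delimited field of every line in a byte
/// buffer, skipping lines without two delimiters.
fn count_levels_bytes(data: &[u8]) -> HashMap<String, u32> {
    let mut stats: HashMap<String, u32> = HashMap::new();
    for line in data.split(|&b| b == b'\n') {
        if let Some(first) = line.iter().position(|&b| b == b'|') {
            if let Some(second) = line[first + 1..].iter().position(|&b| b == b'|') {
                let level = &line[first + 1..first + 1 + second];
                // Validate UTF-8 only for the small extracted field
                if let Ok(s) = std::str::from_utf8(level) {
                    *stats.entry(s.to_owned()).or_insert(0) += 1;
                }
            }
        }
    }
    stats
}
```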

Optimization 7: SIMD and Byte-Level Processing

use std::collections::HashMap;
use std::fs::File;
use memmap2::Mmap;

fn process_logs_optimized(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let mmap = unsafe { Mmap::map(&file).unwrap() };
    let data = &mmap[..];
    
    let mut stats = HashMap::with_capacity(10);
    let mut i = 0;
    
    while i < data.len() {
        // Find first pipe
        let first_pipe = match memchr::memchr(b'|', &data[i..]) {
            Some(pos) => i + pos,
            None => break,
        };
        
        // Find second pipe (assumes every line has at least two '|')
        let second_pipe = match memchr::memchr(b'|', &data[first_pipe + 1..]) {
            Some(pos) => first_pipe + 1 + pos,
            None => break,
        };
        
        // Extract level
        let level = &data[first_pipe + 1..second_pipe];
        if let Ok(level_str) = std::str::from_utf8(level) {
            *stats.entry(level_str.to_owned()).or_insert(0) += 1;
        }
        
        // Find next line
        i = match memchr::memchr(b'\n', &data[second_pipe..]) {
            Some(pos) => second_pipe + pos + 1,
            None => break,
        };
    }
    
    stats
}

Performance: 50ms (38% improvement)

Final Optimized Version

use memmap2::Mmap;
use std::collections::HashMap;
use std::fs::File;
use rayon::prelude::*;

pub fn process_logs_final(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).expect("Failed to open file");
    let mmap = unsafe { Mmap::map(&file).expect("Failed to mmap file") };
    let data = &mmap[..];
    
    // Split data into chunks for parallel processing
    let n_threads = rayon::current_num_threads();
    let chunk_size = (data.len() / n_threads).max(1);
    
    let partial_results: Vec<HashMap<String, u32>> = (0..n_threads)
        .into_par_iter()
        .map(|thread_id| {
            let start = (thread_id * chunk_size).min(data.len());
            let end = if thread_id == n_threads - 1 {
                data.len()
            } else {
                ((thread_id + 1) * chunk_size).min(data.len())
            };
            
            let mut local_stats = HashMap::with_capacity(10);
            let mut i = start;
            
            // Each line belongs to the chunk it starts in: skip a
            // partial line at the front (its owner reads past `end`)
            if i > 0 && data[i - 1] != b'\n' {
                match memchr::memchr(b'\n', &data[i..end]) {
                    Some(newline) => i += newline + 1,
                    None => return local_stats, // no line starts here
                }
            }
            
            // Search the full buffer (not just ..end) so a line that
            // straddles the chunk boundary is still read to its end
            while i < end {
                let first_pipe = match memchr::memchr(b'|', &data[i..]) {
                    Some(pos) => i + pos,
                    None => break,
                };
                
                let second_pipe = match memchr::memchr(b'|', &data[first_pipe + 1..]) {
                    Some(pos) => first_pipe + 1 + pos,
                    None => break,
                };
                
                let level = &data[first_pipe + 1..second_pipe];
                if let Ok(level_str) = std::str::from_utf8(level) {
                    *local_stats.entry(level_str.to_owned()).or_insert(0) += 1;
                }
                
                i = match memchr::memchr(b'\n', &data[second_pipe..]) {
                    Some(pos) => second_pipe + pos + 1,
                    None => break,
                };
            }
            
            local_stats
        })
        .collect();
    
    // Merge results
    let mut final_stats = HashMap::with_capacity(10);
    for partial in partial_results {
        for (key, value) in partial {
            *final_stats.entry(key).or_insert(0) += value;
        }
    }
    
    final_stats
}

Final Performance: 50ms

Performance Comparison

Version               | Time   | Improvement
----------------------|--------|------------
Python                | 2100ms | Baseline
Rust (naive)          | 1000ms | 2.1x
Avoid allocations     | 800ms  | 2.6x
Pre-allocate          | 650ms  | 3.2x
Faster parsing        | 450ms  | 4.7x
Parallel (mutex)      | 200ms  | 10.5x
Parallel (lock-free)  | 120ms  | 17.5x
Memory-mapped         | 80ms   | 26.3x
SIMD + mmap           | 50ms   | 42x

Profiling Tools Used

1. Cargo Flamegraph

cargo install flamegraph
cargo flamegraph --bin process_logs

Identified hot paths in the code.

2. Criterion Benchmarks

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_process_logs(c: &mut Criterion) {
    c.bench_function("process_logs", |b| {
        b.iter(|| process_logs(black_box("test.log")))
    });
}

criterion_group!(benches, benchmark_process_logs);
criterion_main!(benches);

3. Perf

perf record --call-graph dwarf ./target/release/process_logs
perf report

Key Lessons

1. Measure First

Don’t optimize blindly. Profile to find bottlenecks.

2. Avoid Allocations

Every String::from() or collect() into a fresh Vec has a cost. (Vec::new() itself is free; the allocations come when elements go in.)
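One saving the loops above never took: BufReader::lines() allocates a fresh String for every line, while read_line into a reused buffer does not. A std-only sketch, with a Cursor standing in for a file:

```rust
use std::io::BufRead;

/// Counts lines while reusing a single String buffer instead of
/// allocating one per line as `lines()` would.
fn count_lines_reusing_buffer(reader: &mut impl BufRead) -> u32 {
    let mut buf = String::new();
    let mut count = 0;
    loop {
        buf.clear(); // keep the capacity, drop the contents
        match reader.read_line(&mut buf) {
            Ok(0) => break, // EOF
            Ok(_) => count += 1,
            Err(_) => break,
        }
    }
    count
}
```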

3. Use Appropriate Data Structures

HashMap with pre-allocated capacity is much faster.

4. Leverage Parallelism

Rayon makes parallel processing trivial in Rust.
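Rayon isn’t required to get the shape that worked here (a private map per thread, merged at the end); a std-only sketch with scoped threads shows the same pattern (function name and thread count are illustrative):

```rust
use std::collections::HashMap;
use std::thread;

/// Counts items in parallel: each thread builds a private map over
/// its slice of the input, then the maps are merged serially.
fn parallel_count(items: &[&str], n_threads: usize) -> HashMap<String, u32> {
    let n = n_threads.max(1);
    let chunk = ((items.len() + n - 1) / n).max(1); // ceiling division
    
    let partials: Vec<HashMap<String, u32>> = thread::scope(|s| {
        items
            .chunks(chunk)
            .map(|slice| {
                s.spawn(move || {
                    let mut local = HashMap::new();
                    for item in slice {
                        *local.entry(item.to_string()).or_insert(0) += 1;
                    }
                    local
                })
            })
            .collect::<Vec<_>>() // spawn all before joining any
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .collect()
    });
    
    // Merge: no locks needed because each map was thread-private
    let mut merged = HashMap::new();
    for partial in partials {
        for (key, value) in partial {
            *merged.entry(key).or_insert(0) += value;
        }
    }
    merged
}
```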

5. Memory-Mapped I/O

For large files, mmap is significantly faster than buffered I/O.

6. Byte-Level Processing

Working with &[u8] is faster than String operations.

7. Use Specialized Libraries

memchr uses SIMD instructions for fast byte searching.

Conclusion

Rust’s performance potential is incredible, but you need to:

  1. Profile to find bottlenecks
  2. Understand Rust’s ownership and borrowing
  3. Leverage zero-cost abstractions
  4. Use the right libraries and tools

Final Result: 42x faster than Python, 20x faster than naive Rust.

The journey from 2 seconds to 50 milliseconds taught me more about performance optimization than any tutorial could.