Rust Performance Optimization - From 2s to 50ms
I rewrote a Python data processing script in Rust expecting it to be faster. It was, but only 2x faster. After optimization, it became 40x faster. Here’s how.
The Problem
Task: Process 100,000 log entries, extract patterns, and generate statistics.
- Python version: 2.1 seconds
- Initial Rust version: 1.0 seconds (disappointing!)
- Optimized Rust version: 50 milliseconds (40x improvement!)
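The post doesn't show the input format, but the parsing code that follows implies pipe-delimited records with the log level in the second field, e.g. `timestamp|LEVEL|message`. A small generator for reproducing a comparable input (the layout, the helper name, and the `test.log` file name are my assumptions):

```rust
use std::io::Write;

// Generate a pipe-delimited test log: "timestamp|LEVEL|message".
// The format is inferred from the parsing code, not taken from the post.
fn write_test_log(path: &str, entries: usize) -> std::io::Result<()> {
    let levels = ["INFO", "WARN", "ERROR", "DEBUG"];
    let mut out = std::io::BufWriter::new(std::fs::File::create(path)?);
    for i in 0..entries {
        writeln!(
            out,
            "2024-01-01T00:00:{:02}|{}|event {}",
            i % 60,
            levels[i % levels.len()],
            i
        )?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    write_test_log("test.log", 100_000)?;
    println!("wrote test.log");
    Ok(())
}
```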
Initial Rust Implementation
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::collections::HashMap;

fn process_logs(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);
    let mut stats = HashMap::new();

    for line in reader.lines() {
        let line = line.unwrap();
        let parts: Vec<&str> = line.split('|').collect();
        if parts.len() >= 3 {
            let level = parts[1].to_string();
            *stats.entry(level).or_insert(0) += 1;
        }
    }
    stats
}
Performance: 1.0 second
Not bad, but we can do much better.
Optimization 1: Avoid Unnecessary Allocations
Problem
let level = parts[1].to_string(); // Allocates new String
*stats.entry(level).or_insert(0) += 1;
Every iteration allocates a new String.
Solution
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::collections::HashMap;

fn process_logs(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);
    let mut stats = HashMap::new();

    for line in reader.lines() {
        let line = line.unwrap();
        let parts: Vec<&str> = line.split('|').collect();
        if parts.len() >= 3 {
            // Look up with the borrowed &str; allocate a String only
            // the first time a level is seen, not on every line
            if let Some(count) = stats.get_mut(parts[1]) {
                *count += 1;
            } else {
                stats.insert(parts[1].to_string(), 1);
            }
        }
    }
    stats
}
Performance: 800ms (20% improvement)
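Whether a change really avoids allocations can be verified directly with a counting global allocator, rather than guessed from timings. This sketch (helper names are mine) compares allocating the key on every line against allocating only on first occurrence:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};

// A counting allocator: every heap allocation increments ALLOCS.
struct Counting;
static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static A: Counting = Counting;

// Allocate the key on every line, then look it up.
fn count_with_entry(lines: &[&str]) -> usize {
    let before = ALLOCS.load(Ordering::Relaxed);
    let mut stats: HashMap<String, u32> = HashMap::new();
    for line in lines {
        if let Some(level) = line.split('|').nth(1) {
            *stats.entry(level.to_owned()).or_insert(0) += 1; // String per line
        }
    }
    ALLOCS.load(Ordering::Relaxed) - before
}

// Look up by &str first; allocate only when inserting a new key.
fn count_with_get_mut(lines: &[&str]) -> usize {
    let before = ALLOCS.load(Ordering::Relaxed);
    let mut stats: HashMap<String, u32> = HashMap::new();
    for line in lines {
        if let Some(level) = line.split('|').nth(1) {
            if let Some(c) = stats.get_mut(level) {
                *c += 1;
            } else {
                stats.insert(level.to_string(), 1); // String on first sight only
            }
        }
    }
    ALLOCS.load(Ordering::Relaxed) - before
}

fn main() {
    let lines: Vec<&str> = vec!["t|INFO|x"; 10_000];
    println!("allocate every line:      {} allocations", count_with_entry(&lines));
    println!("allocate on first sight:  {} allocations", count_with_get_mut(&lines));
}
```

The counter also catches the HashMap's own table growth, which is what the next optimization targets.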
Optimization 2: Pre-allocate Capacity
Problem
HashMap and Vec grow dynamically, causing reallocations.
Solution
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::collections::HashMap;

fn process_logs(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);
    // Pre-allocate with estimated capacity (distinct log levels are few)
    let mut stats = HashMap::with_capacity(10);

    for line in reader.lines() {
        let line = line.unwrap();
        // Walk the iterator directly instead of collecting into a Vec
        let mut parts = line.split('|');
        parts.next(); // Skip first field
        if let Some(level) = parts.next() {
            if let Some(count) = stats.get_mut(level) {
                *count += 1;
            } else {
                stats.insert(level.to_owned(), 1);
            }
        }
    }
    stats
}
Performance: 650ms (19% improvement)
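To see what pre-allocation buys, watch a Vec's capacity as it grows: each capacity jump is a reallocation plus a copy of everything pushed so far. A std-only sketch (not from the original post):

```rust
use std::collections::HashMap;

// Count how many times a Vec reallocates while pushing n elements.
fn realloc_count(prealloc: bool, n: usize) -> usize {
    let mut v: Vec<u32> = if prealloc { Vec::with_capacity(n) } else { Vec::new() };
    let mut reallocs = 0;
    let mut cap = v.capacity();
    for i in 0..n as u32 {
        v.push(i);
        if v.capacity() != cap {
            reallocs += 1;
            cap = v.capacity();
        }
    }
    reallocs
}

fn main() {
    // Growing from empty doubles repeatedly; pre-allocating never reallocates.
    println!("Vec::new():         {} reallocations", realloc_count(false, 100_000));
    println!("Vec::with_capacity: {} reallocations", realloc_count(true, 100_000));

    // The same applies to HashMap: with_capacity reserves space up front.
    let stats: HashMap<String, u32> = HashMap::with_capacity(10);
    println!("HashMap capacity reserved: {}", stats.capacity());
}
```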
Optimization 3: Use Faster Parsing
Problem
The split() iterator is lazy and does not allocate by itself, but each next() call re-scans for the delimiter and constructs a new &str. We can locate the two delimiters ourselves and slice the level out in one pass.
Solution: Find the delimiters directly with find() and slice
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::collections::HashMap;

fn process_logs(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);
    let mut stats = HashMap::with_capacity(10);

    for line in reader.lines() {
        let line = line.unwrap();
        // Slice the second field out directly
        if let Some(start) = line.find('|') {
            if let Some(end) = line[start + 1..].find('|') {
                let level = &line[start + 1..start + 1 + end];
                if let Some(count) = stats.get_mut(level) {
                    *count += 1;
                } else {
                    stats.insert(level.to_owned(), 1);
                }
            }
        }
    }
    stats
}
Performance: 450ms (31% improvement)
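The two-find slicing is easy to get off by one, so it's worth sanity-checking against the split-based version. A quick check with hypothetical helper names:

```rust
// Slice the second '|'-separated field out of a line without collecting.
fn level_by_find(line: &str) -> Option<&str> {
    let start = line.find('|')?;
    let end = line[start + 1..].find('|')?;
    Some(&line[start + 1..start + 1 + end])
}

// Reference version using the split iterator, as in the earlier steps.
fn level_by_split(line: &str) -> Option<&str> {
    let mut parts = line.split('|');
    parts.next()?; // skip the timestamp field
    parts.next()
}

fn main() {
    let line = "2024-01-01T10:00:00|ERROR|disk full";
    // The two agree on well-formed lines with at least two '|' delimiters
    assert_eq!(level_by_find(line), level_by_split(line));
    println!("{:?}", level_by_find(line)); // Some("ERROR")
}
```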
Optimization 4: Parallel Processing
Use Rayon for Parallelism
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::collections::HashMap;
use std::sync::Mutex;
use rayon::prelude::*;

fn process_logs_parallel(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);
    let stats = Mutex::new(HashMap::with_capacity(10));

    reader.lines()
        .par_bridge()
        .for_each(|line| {
            if let Ok(line) = line {
                if let Some(start) = line.find('|') {
                    if let Some(end) = line[start + 1..].find('|') {
                        let level = &line[start + 1..start + 1 + end];
                        let mut map = stats.lock().unwrap();
                        *map.entry(level.to_owned()).or_insert(0) += 1;
                    }
                }
            }
        });

    stats.into_inner().unwrap()
}
Performance: 200ms (55% improvement)
But we can do better: every thread fights for the same Mutex, so lock contention is now the bottleneck.
Optimization 5: Lock-Free Parallel Processing
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::collections::HashMap;
use rayon::prelude::*;

fn process_logs_parallel_v2(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let reader = BufReader::new(file);

    // Collect lines into a Vec first so Rayon can chunk them
    let lines: Vec<String> = reader.lines()
        .filter_map(|l| l.ok())
        .collect();

    // Process in parallel; each chunk builds its own HashMap, so no locking
    let partial_results: Vec<HashMap<String, u32>> = lines
        .par_chunks(1000)
        .map(|chunk| {
            let mut local_stats = HashMap::with_capacity(10);
            for line in chunk {
                if let Some(start) = line.find('|') {
                    if let Some(end) = line[start + 1..].find('|') {
                        let level = &line[start + 1..start + 1 + end];
                        *local_stats.entry(level.to_owned()).or_insert(0) += 1;
                    }
                }
            }
            local_stats
        })
        .collect();

    // Merge the per-chunk results
    let mut final_stats = HashMap::with_capacity(10);
    for partial in partial_results {
        for (key, value) in partial {
            *final_stats.entry(key).or_insert(0) += value;
        }
    }
    final_stats
}
Performance: 120ms (40% improvement)
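The map-per-worker-then-merge structure is not specific to Rayon. The same idea with plain std scoped threads, as a minimal sketch that runs without any crates (helper names are mine):

```rust
use std::collections::HashMap;
use std::thread;

// Each thread counts its own chunk into a private HashMap; results merge at the end.
fn count_sharded(lines: &[String], threads: usize) -> HashMap<String, u32> {
    let chunk = (lines.len() / threads.max(1)).max(1);

    let partials: Vec<HashMap<String, u32>> = thread::scope(|s| {
        lines
            .chunks(chunk)
            .map(|c| {
                s.spawn(move || {
                    let mut m = HashMap::new();
                    for line in c {
                        if let Some(level) = line.split('|').nth(1) {
                            *m.entry(level.to_owned()).or_insert(0) += 1;
                        }
                    }
                    m
                })
            })
            .collect::<Vec<_>>()
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect()
    });

    // Merge: no locks needed, each map was thread-private
    let mut total = HashMap::new();
    for p in partials {
        for (k, v) in p {
            *total.entry(k).or_insert(0) += v;
        }
    }
    total
}

fn main() {
    let lines: Vec<String> = (0..1000)
        .map(|i| format!("t|{}|m", if i % 2 == 0 { "INFO" } else { "WARN" }))
        .collect();
    println!("{:?}", count_sharded(&lines, 4));
}
```

Rayon's work-stealing scheduler does the same partitioning with better load balancing, which is why the article sticks with it.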
Optimization 6: Memory-Mapped Files
use memmap2::Mmap;
use std::collections::HashMap;
use std::fs::File;

fn process_logs_mmap(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let mmap = unsafe { Mmap::map(&file).unwrap() };
    let mut stats = HashMap::with_capacity(10);

    // Process the file as one byte slice
    // (note: a final line without a trailing '\n' is skipped)
    let data = &mmap[..];
    let mut line_start = 0;

    for (i, &byte) in data.iter().enumerate() {
        if byte == b'\n' {
            let line = &data[line_start..i];
            // Find the two delimiters
            if let Some(first_pipe) = line.iter().position(|&b| b == b'|') {
                if let Some(second_pipe) = line[first_pipe + 1..]
                    .iter()
                    .position(|&b| b == b'|')
                {
                    let level = &line[first_pipe + 1..first_pipe + 1 + second_pipe];
                    let level_str = std::str::from_utf8(level).unwrap();
                    *stats.entry(level_str.to_owned()).or_insert(0) += 1;
                }
            }
            line_start = i + 1;
        }
    }
    stats
}
Performance: 80ms (33% improvement)
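One caveat with the newline scan above: it only processes a line when it sees '\n', so a final line without a trailing newline is silently dropped. Iterating with slice::split sidesteps that. Here the idea is shown on an in-memory byte slice (std only, no memmap2, so it runs anywhere; the helper name is mine):

```rust
use std::collections::HashMap;

// Count the second '|'-field of each line in a byte slice,
// including a final line that lacks a trailing newline.
fn count_levels(data: &[u8]) -> HashMap<String, u32> {
    let mut stats = HashMap::new();
    for line in data.split(|&b| b == b'\n') {
        if line.is_empty() {
            continue; // trailing newline yields one empty slice
        }
        if let Some(first) = line.iter().position(|&b| b == b'|') {
            if let Some(second) = line[first + 1..].iter().position(|&b| b == b'|') {
                if let Ok(level) = std::str::from_utf8(&line[first + 1..first + 1 + second]) {
                    *stats.entry(level.to_owned()).or_insert(0) += 1;
                }
            }
        }
    }
    stats
}

fn main() {
    let data = b"t1|INFO|a\nt2|WARN|b\nt3|INFO|c"; // no trailing newline
    println!("{:?}", count_levels(data));
}
```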
Optimization 7: SIMD and Byte-Level Processing
use memmap2::Mmap;
use std::collections::HashMap;
use std::fs::File;

fn process_logs_optimized(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).unwrap();
    let mmap = unsafe { Mmap::map(&file).unwrap() };
    let data = &mmap[..];
    let mut stats = HashMap::with_capacity(10);
    let mut i = 0;

    // memchr uses SIMD under the hood. Note that the searches are not
    // bounded by the current line, so this scan assumes every line
    // contains at least two '|' delimiters
    while i < data.len() {
        // Find first pipe
        let first_pipe = match memchr::memchr(b'|', &data[i..]) {
            Some(pos) => i + pos,
            None => break,
        };
        // Find second pipe
        let second_pipe = match memchr::memchr(b'|', &data[first_pipe + 1..]) {
            Some(pos) => first_pipe + 1 + pos,
            None => break,
        };
        // Extract the level
        let level = &data[first_pipe + 1..second_pipe];
        if let Ok(level_str) = std::str::from_utf8(level) {
            *stats.entry(level_str.to_owned()).or_insert(0) += 1;
        }
        // Jump to the start of the next line
        i = match memchr::memchr(b'\n', &data[second_pipe..]) {
            Some(pos) => second_pipe + pos + 1,
            None => break,
        };
    }
    stats
}
Performance: 50ms (38% improvement)
Final Optimized Version
use memmap2::Mmap;
use std::collections::HashMap;
use std::fs::File;
use rayon::prelude::*;
pub fn process_logs_final(filename: &str) -> HashMap<String, u32> {
    let file = File::open(filename).expect("Failed to open file");
    let mmap = unsafe { Mmap::map(&file).expect("Failed to mmap file") };
    let data = &mmap[..];

    // Split the data into roughly equal chunks; each thread handles the
    // lines that *start* inside its chunk and scans past the chunk end to
    // finish a line straddling the boundary, so no line is lost or counted twice
    let num_threads = rayon::current_num_threads();
    let chunk_size = (data.len() / num_threads).max(1);

    let partial_results: Vec<HashMap<String, u32>> = (0..num_threads)
        .into_par_iter()
        .map(|thread_id| {
            let start = (thread_id * chunk_size).min(data.len());
            let end = if thread_id == num_threads - 1 {
                data.len()
            } else {
                ((thread_id + 1) * chunk_size).min(data.len())
            };

            let mut local_stats = HashMap::with_capacity(10);
            let mut i = start;

            // A partial line at the chunk start belongs to the previous
            // chunk; skip ahead to the first line boundary
            if i > 0 && data[i - 1] != b'\n' {
                i = match memchr::memchr(b'\n', &data[i..end]) {
                    Some(newline) => i + newline + 1,
                    None => end, // no line starts in this chunk
                };
            }

            while i < end {
                let first_pipe = match memchr::memchr(b'|', &data[i..]) {
                    Some(pos) => i + pos,
                    None => break,
                };
                let second_pipe = match memchr::memchr(b'|', &data[first_pipe + 1..]) {
                    Some(pos) => first_pipe + 1 + pos,
                    None => break,
                };
                let level = &data[first_pipe + 1..second_pipe];
                if let Ok(level_str) = std::str::from_utf8(level) {
                    *local_stats.entry(level_str.to_owned()).or_insert(0) += 1;
                }
                i = match memchr::memchr(b'\n', &data[second_pipe..]) {
                    Some(pos) => second_pipe + pos + 1,
                    None => break,
                };
            }
            local_stats
        })
        .collect();

    // Merge the per-thread results
    let mut final_stats = HashMap::with_capacity(10);
    for partial in partial_results {
        for (key, value) in partial {
            *final_stats.entry(key).or_insert(0) += value;
        }
    }
    final_stats
}
Final Performance: 50ms
Performance Comparison
| Version | Time | Improvement |
|---|---|---|
| Python | 2100ms | Baseline |
| Rust (naive) | 1000ms | 2.1x |
| Avoid allocations | 800ms | 2.6x |
| Pre-allocate | 650ms | 3.2x |
| Faster parsing | 450ms | 4.7x |
| Parallel (mutex) | 200ms | 10.5x |
| Parallel (lock-free) | 120ms | 17.5x |
| Memory-mapped | 80ms | 26.3x |
| SIMD + mmap | 50ms | 42x |
Profiling Tools Used
1. Cargo Flamegraph
cargo install flamegraph
cargo flamegraph --bin process_logs
Identified hot paths in the code.
2. Criterion Benchmarks
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_process_logs(c: &mut Criterion) {
    c.bench_function("process_logs", |b| {
        b.iter(|| process_logs(black_box("test.log")))
    });
}

criterion_group!(benches, benchmark_process_logs);
criterion_main!(benches);
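Criterion is the right tool for stable numbers; for a quick sanity check between edits, a best-of-N timer built on std::time::Instant is often enough (a sketch of my own, not from the post):

```rust
use std::time::{Duration, Instant};

// Quick-and-dirty timing: run a closure several times, report the best run.
// The minimum is the least noisy single-number summary for short benchmarks.
fn time_best<F: FnMut()>(runs: usize, mut f: F) -> Duration {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .min()
        .expect("runs must be > 0")
}

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();
    let best = time_best(5, || {
        let sum: u64 = data.iter().sum();
        std::hint::black_box(sum); // keep the work from being optimized away
    });
    println!("best of 5: {:?}", best);
}
```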
3. Perf
perf record --call-graph dwarf ./target/release/process_logs
perf report
Key Lessons
1. Measure First
Don’t optimize blindly. Profile to find bottlenecks.
2. Avoid Allocations
Every String allocation or collect() into a Vec in a hot loop has a cost; hoist or eliminate them.
3. Use Appropriate Data Structures
A HashMap created with with_capacity avoids repeated rehashing as it grows.
4. Leverage Parallelism
Rayon makes parallel processing trivial in Rust.
5. Memory-Mapped I/O
For large files, mmap is significantly faster than buffered I/O.
6. Byte-Level Processing
Working with &[u8] skips UTF-8 validation and String allocation on the hot path.
7. Use Specialized Libraries
memchr uses SIMD instructions for fast byte searching.
Conclusion
Rust’s performance potential is incredible, but you need to:
- Profile to find bottlenecks
- Understand Rust’s ownership and borrowing
- Leverage zero-cost abstractions
- Use the right libraries and tools
Final Result: 42x faster than Python, 20x faster than naive Rust.
The journey from 2 seconds to 50 milliseconds taught me more about performance optimization than any tutorial could.