Memory-Mapped I/O for Handling Files Larger Than RAM



This content originally appeared on DEV Community and was authored by Khalid Hussein

Building rfgrep required solving the challenge of processing files that exceed available RAM. Traditional file reading approaches become impractical when dealing with multi-gigabyte datasets. Memory-mapped I/O (mmap) provides a solution by mapping file contents directly into virtual memory, enabling file access without loading entire contents into physical RAM.

Memory Limitations

Traditional file reading methods encounter significant limitations with large files:

use std::{fs, io, path::Path};

// Reads the entire file into one heap-allocated String in a single call.
fn read_entire_file(path: &Path) -> Result<String, io::Error> {
    let content = fs::read_to_string(path)?;
    Ok(content)
}

Limitations:

  • 10GB file requires 10GB of RAM
  • Out-of-memory errors with large datasets
  • Performance degradation due to swapping
  • Inability to process files exceeding available RAM

Real-world scenario: Processing 50GB log files on systems with 16GB RAM.

Memory-Mapped I/O

Memory mapping enables file processing without loading entire contents into physical memory:

use memmap2::Mmap;
use std::fs::File;
use std::path::Path;
use std::sync::Arc;
// IoConfig, ReadStrategy, FileContent, and RfgrepResult are rfgrep's own types.

pub struct MmapHandler {
    config: IoConfig,
}

impl MmapHandler {
    pub fn read_file(&self, path: &Path) -> RfgrepResult<FileContent> {
        let metadata = std::fs::metadata(path)?;
        let file_size = metadata.len();

        // Pick an I/O strategy from the file size alone (see below).
        // read_with_buffered and read_with_streaming are not shown here.
        match self.choose_strategy(file_size) {
            ReadStrategy::MemoryMapped => self.read_with_mmap(path),
            ReadStrategy::Buffered => self.read_with_buffered(path),
            ReadStrategy::Streaming => self.read_with_streaming(path),
        }
    }

    fn read_with_mmap(&self, path: &Path) -> RfgrepResult<FileContent> {
        let file = File::open(path)?;
        // Map the file into virtual memory; nothing is read into RAM yet.
        let mmap = unsafe { Mmap::map(&file)? };
        Ok(FileContent::MemoryMapped(Arc::new(mmap)))
    }
}
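
To make this concrete, a caller might use the handler as in the sketch below. MmapHandler::new and IoConfig::default() are assumptions; the post does not show how the handler is constructed.

use std::path::Path;

fn open_for_search(path: &Path) -> RfgrepResult<FileContent> {
    // `new` and IoConfig::default() are assumed constructors, not shown in the post.
    let handler = MmapHandler::new(IoConfig::default());
    // The handler picks mmap, buffered, or streaming based on file size;
    // the caller receives a FileContent either way.
    handler.read_file(path)
}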

How Memory Mapping Works

Virtual Memory Mapping

Mmap::map asks the operating system to map the file into the process's virtual address space. No file data is read at this point; the mapping only establishes page-table entries.

// Establish the mapping (near-instant, regardless of file size).
let mmap = unsafe { Mmap::map(&file)? };

// The mapping can then be indexed like an ordinary byte slice.
let content = &mmap[0..1000];

On-Demand Paging

The operating system loads file pages into physical memory only when accessed:

let mmap = unsafe { Mmap::map(&file)? };

// Touching this range triggers page faults that load only the pages
// covering bytes 1000..2000 (here, a single 4 KB page).
let search_region = &mmap[1000..2000];

Performance Analysis

Memory Usage Comparison

File Size | Traditional Read | Memory Mapped | Memory Reduction
--- | --- | --- | ---
1GB | 1.0GB RAM | ~64MB RAM | 94%
10GB | 10.0GB RAM | ~64MB RAM | 99.4%
100GB | Fails | ~64MB RAM | Effective
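
On Linux, you can sanity-check numbers like these by reading the process's resident set size from /proc/self/status before and after mapping a large file. This is a rough, Linux-only sketch, not rfgrep's benchmarking code; "large_file.txt" is a placeholder path.

use memmap2::Mmap;
use std::fs::{self, File};

// Linux-only: parse VmRSS (resident memory, in kB) from /proc/self/status.
fn resident_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|l| l.starts_with("VmRSS:"))?
        .split_whitespace()
        .nth(1)?
        .parse()
        .ok()
}

fn main() -> std::io::Result<()> {
    println!("RSS before: {:?} kB", resident_kb());

    // Mapping is nearly free: no pages are resident until they are touched.
    let file = File::open("large_file.txt")?;
    let mmap = unsafe { Mmap::map(&file)? };
    println!("RSS after map: {:?} kB", resident_kb());

    // Reading one byte faults in only the surrounding page (~4 KB).
    let _first = mmap.first().copied();
    println!("RSS after touching one page: {:?} kB", resident_kb());
    Ok(())
}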

Access Time Comparison

use std::time::Instant;

let start = Instant::now();

// Establishing the mapping is essentially a metadata operation: the kernel
// sets up page-table entries but reads no file data yet.
let file = File::open(&path)?;
let mmap = unsafe { Mmap::map(&file)? };
let content = &mmap[..];

println!("Memory mapping established in: {:?}", start.elapsed());

Implementation Strategies

Adaptive Strategy Selection

fn choose_strategy(&self, file_size: u64) -> ReadStrategy {
    match file_size {
        // Up to 1 MiB: cheapest to read straight into a buffer.
        0..=1_048_576 => ReadStrategy::Buffered,
        // 1 MiB to ~100 GB: memory-map and let the OS page on demand.
        1_048_577..=100_000_000_000 => ReadStrategy::MemoryMapped,
        // Anything larger: process as a stream.
        _ => ReadStrategy::Streaming,
    }
}
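
The thresholds are roughly 1 MiB and 100 GB. A variant with named constants (a sketch, not rfgrep's actual code) makes the policy easier to read and tune:

const BUFFERED_MAX: u64 = 1_048_576;      // 1 MiB: read straight into a buffer
const MMAP_MAX: u64 = 100_000_000_000;    // ~100 GB: beyond this, stream instead

fn choose_strategy(file_size: u64) -> ReadStrategy {
    if file_size <= BUFFERED_MAX {
        ReadStrategy::Buffered
    } else if file_size <= MMAP_MAX {
        ReadStrategy::MemoryMapped
    } else {
        ReadStrategy::Streaming
    }
}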

Memory Pool Implementation

use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::atomic::AtomicUsize;
use std::sync::{Arc, RwLock};
// IoError is rfgrep's error type.

pub struct MemoryPool {
    // Cache of active mappings, shared across worker threads.
    mappings: Arc<RwLock<HashMap<PathBuf, Arc<Mmap>>>>,
    max_size: usize,
    current_usage: AtomicUsize,
}

impl MemoryPool {
    pub fn get_mapping(&self, path: &Path) -> Result<Arc<Mmap>, IoError> {
        // Reuse an existing mapping when the same file is searched again.
        if let Some(mmap) = self.get_cached_mapping(path) {
            return Ok(mmap);
        }
        self.create_new_mapping(path)
    }
}
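
The helpers get_cached_mapping and create_new_mapping aren't shown in the post. A minimal sketch of how they could work, assuming IoError converts from std::io::Error, looks like this:

use std::sync::atomic::Ordering;

impl MemoryPool {
    // Hypothetical helpers; rfgrep's real implementations are not shown.
    fn get_cached_mapping(&self, path: &Path) -> Option<Arc<Mmap>> {
        // Read lock: concurrent searches can share cached mappings.
        self.mappings.read().ok()?.get(path).cloned()
    }

    fn create_new_mapping(&self, path: &Path) -> Result<Arc<Mmap>, IoError> {
        let file = File::open(path)?;
        let mmap = Arc::new(unsafe { Mmap::map(&file)? });

        // Track how many mapped bytes the pool holds so an eviction policy
        // could enforce max_size (eviction itself is not sketched here).
        self.current_usage.fetch_add(mmap.len(), Ordering::Relaxed);

        if let Ok(mut cache) = self.mappings.write() {
            cache.insert(path.to_path_buf(), Arc::clone(&mmap));
        }
        Ok(mmap)
    }
}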

Advanced Techniques

Zero-Copy String Processing

pub struct SliceProcessor<'a> {
    // Borrows directly from the memory mapping: no bytes are copied.
    content: &'a [u8],
    lines: Vec<&'a [u8]>,
}

impl<'a> SliceProcessor<'a> {
    pub fn search(&self, pattern: &[u8]) -> Vec<Match<'a>> {
        self.lines
            .iter()
            .enumerate()
            .filter_map(|(line_num, line)| {
                // memchr's SIMD-accelerated substring search over the raw bytes.
                memchr::memmem::find(line, pattern).map(|pos| Match {
                    line: line_num,
                    position: pos,
                    content: &line[pos..pos + pattern.len()],
                })
            })
            .collect()
    }
}
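
The post doesn't show how the lines index is built. One way to construct it, splitting on newline bytes without copying anything out of the mapping (a hypothetical constructor), is:

impl<'a> SliceProcessor<'a> {
    // Hypothetical constructor: index line boundaries without copying bytes.
    pub fn new(content: &'a [u8]) -> Self {
        let lines = content.split(|&b| b == b'\n').collect();
        Self { content, lines }
    }
}

// Everything below borrows from the mapping, so no allocation scales with file size.
let file = File::open("large_file.txt")?;
let mmap = unsafe { Mmap::map(&file)? };
let processor = SliceProcessor::new(&mmap);
let matches = processor.search(b"ERROR");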

File Content Abstraction

pub enum FileContent {
    // Backed by the page cache; cheap to clone via the Arc.
    MemoryMapped(Arc<Mmap>),
    // Whole file read into memory (small files only).
    Buffered(Vec<u8>),
    // Too large even for mmap; processed incrementally.
    Streaming(BufReader<File>),
}

impl FileContent {
    pub fn as_bytes(&self) -> &[u8] {
        match self {
            FileContent::MemoryMapped(mmap) => mmap.as_ref(),
            FileContent::Buffered(data) => data,
            // Streaming content has no contiguous buffer; callers need a
            // separate line-by-line path for this variant.
            FileContent::Streaming(_) => &[],
        }
    }
}
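
With this abstraction, search code can stay agnostic of how the bytes were obtained. A hypothetical helper might look like the following; note that the Streaming variant yields an empty slice, so streamed files would need their own code path.

// Hypothetical helper: count pattern occurrences regardless of I/O strategy.
// For FileContent::Streaming, as_bytes() is empty, so this returns 0.
fn count_matches(content: &FileContent, pattern: &[u8]) -> usize {
    memchr::memmem::find_iter(content.as_bytes(), pattern).count()
}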

Performance Evaluation

Memory Efficiency

// Traditional read: allocates a buffer the size of the entire file.
let content = fs::read("large_file.txt")?;

// Memory mapping: allocates only page-table entries; resident memory grows
// just with the pages actually touched during the search.
let mmap = unsafe { Mmap::map(&file)? };

Access Performance

Operation | Traditional I/O | Memory Mapped | Improvement Factor
--- | --- | --- | ---
Sequential Read | 2.3s | 0.1s | 23x
Random Access | 45.2s | 0.8s | 56x
Multiple Files | 128.1s | 3.2s | 40x
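
These figures depend heavily on hardware and on whether the file is already in the page cache. A rough way to run your own sequential-scan comparison (not rfgrep's benchmark harness) is sketched below; "large_file.txt" is a placeholder path.

use memmap2::Mmap;
use std::fs::{self, File};
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let path = "large_file.txt";

    // Traditional read: the whole file is copied into a heap buffer first.
    let start = Instant::now();
    let data = fs::read(path)?;
    let sum1: u64 = data.iter().map(|&b| b as u64).sum();
    println!("fs::read:  {:?} (checksum {})", start.elapsed(), sum1);

    // Memory-mapped: the kernel pages data in as the scan touches it.
    let start = Instant::now();
    let file = File::open(path)?;
    let mmap = unsafe { Mmap::map(&file)? };
    let sum2: u64 = mmap.iter().map(|&b| b as u64).sum();
    println!("mmap scan: {:?} (checksum {})", start.elapsed(), sum2);

    Ok(())
}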

Implementation Considerations

Error Handling

// Surface mapping failures through rfgrep's error type instead of panicking.
let mmap = unsafe { Mmap::map(&file) }
    .map_err(|e| RfgrepError::IoError(format!(
        "Memory mapping failed: {}", e
    )))?;

Resource Management

impl Drop for MmapHandler {
    fn drop(&mut self) {
        // memmap2 unmaps each Mmap automatically when it is dropped; this hook
        // covers any extra bookkeeping (cleanup_mappings itself is not shown).
        self.cleanup_mappings();
    }
}

Limitations and Considerations

Platform Differences

Memory mapping behavior varies across operating systems; a fallback pattern that absorbs these differences is sketched after the list below:

  • Linux: Robust support for large mappings
  • Windows: Different API and limitations
  • macOS: Similar to Linux with some constraints
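
One practical way to absorb these differences is to treat mapping as an optimization with a buffered fallback. The sketch below reuses the FileContent enum from earlier; it is illustrative, not rfgrep's actual fallback logic.

use memmap2::Mmap;
use std::fs::{self, File};
use std::io;
use std::path::Path;
use std::sync::Arc;

// If mapping fails (unsupported filesystem, platform quirk, etc.),
// fall back to reading the file into a buffer.
fn read_bytes(path: &Path) -> io::Result<FileContent> {
    let file = File::open(path)?;
    match unsafe { Mmap::map(&file) } {
        Ok(mmap) => Ok(FileContent::MemoryMapped(Arc::new(mmap))),
        Err(_) => Ok(FileContent::Buffered(fs::read(path)?)),
    }
}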

Safety Considerations

let mmap = unsafe { Mmap::map(&file)? };

Mmap::map is unsafe because the mapped bytes are only guaranteed to stay valid if the underlying file does not change. If another process truncates or rewrites the file while it is mapped, reads through the slice can observe inconsistent data or abort the process, which is why the call cannot be exposed as a safe API. For a read-only search tool this risk is usually acceptable, but it should be a deliberate decision.

Conclusion

Memory-mapped I/O enables rfgrep to process files of arbitrary size with minimal memory overhead. Key benefits include:

  • Scalability: Files limited only by storage capacity, not RAM
  • Efficiency: On-demand loading reduces memory footprint
  • Performance: Data is read directly from the page cache, avoiding redundant copies
  • Flexibility: Adaptive strategy selection based on file characteristics

The implementation demonstrates how operating system virtual memory capabilities can be leveraged for efficient large-file processing without application-level memory management complexity.

That’s all — happy tasking!

