When I ask my colleagues why mmap is faster than system calls, the answer is inevitably “system call overhead”: the cost of crossing the boundary between the user space and the kernel. It turns out that this overhead is more nuanced than I used to think, so let’s look under the hood to understand the performance differences.
Background (skip if you are OS expert):
System calls. A system call is a special function that lets you cross protection domains. When a program executes in user mode (an unprivileged protection domain) it is not allowed to do things that are permitted for the code executing in the kernel mode (a privileged protection domain). For example, a program running in user space typically cannot read files without help from the kernel. When a user program asks a service from an operating system, the system protects itself from malicious or buggy programs via system calls. A system call executes a special hardware instruction, often called “trap”, that transfers control into the kernel. Then the kernel can decide whether it will honour the request.
While this protection is super useful, it has a cost. When we cross from user space into the kernel, we have to save the hardware registers, because the kernel might need to use them. Further, since it is unsafe to directly dereference user-level pointers (what if they are null — that’ll crash the kernel!) the data referred to by these pointers must be copied into the kernel.
When we return from the system call, we have to repeat the sequence in the reverse order: copy out any data that the user requested (because we can’t just give user programs pointers into kernel memory), restore the registers and jump to user mode.
Page faults. The operating system and the hardware together translate the addresses that are written down in your program’s executable (these are called virtual addresses) to the addresses in the actual physical memory (physical addresses). It would be pretty inconvenient for the compiler to generate physical addresses directly, because it doesn’t know on what machine you might run your program, how much memory it has and what other programs might be using physical memory at the time your program runs. Hence the need for this virtual-to-physical address translation. The translations, or mappings, are set up in your program’s page table. When your program begins to run, none of these mappings are set up. So when your program tries to access a virtual address, it generates a page fault, which signals the kernel to go set up the mapping. The kernel is notified that it needs to handle a page fault via a trap, so in this way it is a bit similar to a system call. The difference is that the system call is explicit and the page fault is implicit.
Buffer cache. Buffer cache is a part of kernel memory that is used to keep recently accessed chunks of files (these chunks are called blocks or pages). When a user program requests to read a file, the page from the file is (usually) first put into the buffer cache. Then the data is copied from the buffer cache out to the user-supplied buffer during the return from the system call.
Mmap. Mmap stands for memory-mapped files. It is a way to read and write files without invoking system calls. The operating system reserves a chunk of a program virtual addresses to “map” directly to a chunk in a file. So if the program reads the data from that part of the address space, it will obtain the data that resides in the corresponding part of the file. If that part of the file happens to reside in the buffer cache, the virtual addresses of the mapped chunk will simply be mapped to the physical addresses of the corresponding buffer cache pages upon the first access, and no system calls or other traps will be invoked later on. If the file data is not in the buffer cache, accessing the mapped area will generate a page fault, prompting the kernel to go fetch the corresponding data from disk.
Why mmap should be faster
Let us begin by formulating the hypothesis. Why do we expect mmap to be faster? There are two obvious reasons. First, it requires no explicit crossing of protection domains, though there is still implicit crossing when we have page faults. That said, if a given range in the file is accessed more than once, chances are we won’t incur page faults after the first access. That, however, did not occur in my experiments, so I did expect to hit a page fault every time I read a new block of the file.
Second, if the application is written such that it can access the data directly in the mapped region, we don’t have to perform a memory copy. In my experiments, though, I was interested in measuring the scenario where the application has the separate target buffer for the data it reads. So even though the file is mmapped, the application will still copy the data from the mapped area into the target buffer.
Therefore, in my experimental environment, I expected mmap to be slightly faster than system calls, because I thought the code for handling page faults would be a bit more streamlined than that for system calls.
I set up my experiment in the following way. I create a 4GB file and then read it either sequentially or randomly using a block size of 4KB, 8KB or 16KB. I read the file using either a read system call or mmap. In the case of mmap, I don’t just access the mapped area directly, but I copy the data from a mapped area into a separate “destination” buffer (see the blog post describing my target use case to understand why I do things this way). So in both experiments, we are copying the data from the kernel buffer cache into the user destination buffer, but in the case of mmap we do this by way of page faults, and in the case of system calls we do this by way of a read system call.
I run these tests using either a cold buffer cache, meaning that the file is not cached there, or a warm buffer cache, meaning that the file is there in kernel memory. The storage medium is an SSD that you might expect to find in a typical server. All reads are performed using a single thread. The source code of my benchmark is here.
The following charts show the throughput of the read benchmark for the sequential/warm, sequential/cold, random/warm and random/cold runs.
Barring few exceptions, mmap is 2–6 times faster than system calls. Let’s analyze what happens in the warm experiments, since there mmap provides a more consistent improvement.
The following figure shows the CPU profile collected during the sequential/warm syscall experiment with 16KB block size. During this experiment the CPU utilization is 100%, so the CPU profile tells us the whole story.
We see that ~60% of the time is spent in copy_user_enhanced_fast_string — a function that copies data out to user space. About 15% is spent on other work that occurs on crossing the system call boundary (functions do_syscall_64, entry_SYSCALL_64 and syscall_return_via_sysret), and about 6% in functions that find the data in the buffer cache (find_get_entry and generic_file_buffered_read).
Now let’s look at what happens during the mmap test with the same parameters:
This profile is vastly different. About 60% of the time is spent in __memmove_avx_unaligned_erms. Like copy_user_enhanced_fast_string, this is also a memory copy function, but it copies data from the mapped file region to a user-supplied buffer.
In summary, we observe that a large portion of time in both of these experiments — roughly 60% — is spent copying data. However, the memory copy functions used with syscall and mmap are very different, and not only in the name.
__memmove_avx_unaligned_erms, called in the mmap experiment, is implemented using Advanced Vector Extensions (AVX) (here is the source code of the functions that it relies on). The implementation of copy_user_enhanced_fast_string, on the other hand, does not appear to rely on AVX and performs certain safety checks when accessing user-level addresses, which __memmove_avx_unaligned_erms does not.
Using profiling and timing information I computed how much time the experiment spends on memory copy and on “everything else”. Here are the data for a 16GB file read by one thread using the 8KB block size:
We observe that the system call experiment spends about 0.5 seconds longer in memory copy, and that is the reason why it takes about 0.5 seconds longer to complete.
Why can’t the kernel use the AVX instructions to improve the speed of its memory copy? Well, if it did, then it would have to save and restore those registers on each system call, and that would make domain crossing even more expensive. So this was a conscious decision in the Linux kernel. And safety checks, which could be another reason for why it is slower, cannot be avoided.
In the meantime, converting your application to use mmap rather than system calls could make it run faster. As part of my consulting for MongoDB, I introduced the option in the WiredTiger storage engine to (safely) use mmap for I/O instead of system calls, and we observed substantial performance improvements for workloads fitting into the buffer cache. See my MongoDB blogpost for details.
** Note (February 7, 2022): since the original publication of this blog, Andy Pavlo and Dick Sites mentioned to me that mmap would not be faster than system calls in use cases not covered in this blog: (1) when the data does not fit into the buffer cache, (2) when we use a much larger block size for I/O. So please consider these other use cases if you choose to apply the information in this blog post to your application.