Fundamentals of fast bulk IO
Basics, details, nuances

How to read/write a lot of file data as quickly as possible.

1.   IO buffers

Synchronous access

The simplest way for a program to read (or write) a chunk of a file is to allocate a buffer for the data, issue a request to the OS and then sit there and wait until the request is fulfilled. Once the data is in (or out), the program proceeds to the next step.

Since the program effectively stops running while waiting for the request to complete, this is called synchronous IO. It is very simple to implement, it keeps the code nice and tidy, and it is widely used in software for reading and writing files.
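
For illustration, here's a minimal sketch of a synchronous read using the Win32 API; the file name and buffer size are placeholders:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE f = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL,
                               NULL);
        if (f == INVALID_HANDLE_VALUE)
            return 1;

        static char buf[64 * 1024];   /* a single 64KB IO buffer */
        DWORD got = 0;

        /* the thread blocks here until the request is fulfilled */
        if (ReadFile(f, buf, sizeof buf, &got, NULL))
            printf("Read %lu bytes\n", (unsigned long)got);

        CloseHandle(f);
        return 0;
    }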

However, if we want to read/write fast, we can do significantly better.

Asynchronous access

Instead of waiting for a request to complete, a program can make a note that the request is pending and move on to other things. It then periodically checks if the request is done and, once it is, deals with the result. Since we are no longer blocking on an IO request, this is called asynchronous IO.
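
In Win32 terms this means opening the file with FILE_FLAG_OVERLAPPED and issuing the request with an OVERLAPPED structure. A rough sketch, with error handling trimmed and names as placeholders:

    HANDLE f = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED,
                           NULL);

    char        buf[64 * 1024];
    OVERLAPPED  ov  = { 0 };      /* carries the file offset and status */
    DWORD       got = 0;

    if (! ReadFile(f, buf, sizeof buf, NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING)
    {
        /* the request failed outright */
    }

    /* ... do other things ... */

    while (! HasOverlappedIoCompleted(&ov))
    {
        /* still pending - do more work, then check again */
    }

    GetOverlappedResult(f, &ov, &got, FALSE);   /* collect the result */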

Note that in terms of IO performance this is, so far, exactly the same as the synchronous case.

Asynchronous, multiple buffers

Now that we are free to do something else while our request is pending, what we can do is submit another request. And then, perhaps, a few more, all of which will be pending and queued somewhere in the guts of the OS. This ensures that once the OS is done with one request, it immediately has another one to process.

This eliminates idling when reading/writing data from the storage device, so we have data flowing through the file stack continuously.
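
In code this looks roughly as follows, continuing with the overlapped handle f from the sketch above; the queue depth and buffer size here are illustrative placeholders, not recommendations:

    #define DEPTH     4               /* requests kept in flight */
    #define BUF_SIZE  (256 * 1024)    /* per-request buffer size */

    OVERLAPPED  ov [DEPTH] = { 0 };
    char *      buf[DEPTH];
    LONGLONG    pos = 0;

    for (int i = 0; i < DEPTH; i++)
    {
        buf[i] = malloc(BUF_SIZE);

        ov[i].Offset     = (DWORD)(pos & 0xFFFFFFFF);
        ov[i].OffsetHigh = (DWORD)(pos >> 32);

        ReadFile(f, buf[i], BUF_SIZE, NULL, &ov[i]);   /* queued */
        pos += BUF_SIZE;
    }

    /* as each request completes - process its data, point it at the
       next unread offset and re-issue it, so the queue never runs dry */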

Knowing when to stop

It may seem that if we just throw a boatload of requests at the OS, it should allow us to go through a file as quickly as possible.

However, there's really no point in having too many requests in the queue, because past a certain depth it simply doesn't buy us any faster processing.

What we need is merely to make sure the request queue is never empty. If we can achieve that with as few requests as possible, we get the fastest processing rate at the lowest memory usage.

2.   IO buffer size

Another question is how much data to request in one go.

If we ask for too little, the OS may end up reading more than asked for and then trimming the result to fit it into our buffer.

For example, NTFS defaults to a 4KB cluster size on desktop versions of Windows, so asking for a smaller chunk is going to be wasteful.

If we ask for too much, it may translate into several requests further down the file stack and it's not likely to get us any speed gains.
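
The cluster size of a given volume can be queried directly, so the buffer size can be made a multiple of it. A sketch, with the drive letter as a placeholder:

    DWORD  spc, bps, free_c, total_c;

    if (GetDiskFreeSpaceA("C:\\", &spc, &bps, &free_c, &total_c))
    {
        /* e.g. 8 sectors per cluster x 512 bytes per sector = 4KB */
        DWORD cluster = spc * bps;
        printf("Cluster size: %lu bytes\n", (unsigned long)cluster);
    }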

3.   IO mode

Buffered and unbuffered access

Windows has two principal modes for accessing files - the so-called buffered and unbuffered IO.

In buffered mode all requests pass through the Cache Manager, which does just what its name implies - it tries to fulfill read requests from the cache and to aggregate/delay write requests when appropriate.

In unbuffered mode requests are always fulfilled with an actual disk read/write. They still pass through the Cache Manager to expire any cached copies of the same data, but this comes with less overhead than in buffered mode.
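
Unbuffered mode is selected when opening the file. It comes with alignment constraints - with FILE_FLAG_NO_BUFFERING the buffer address, the file offsets and the transfer sizes must all be multiples of the volume's sector size. A sketch, with the buffer size as a placeholder:

    HANDLE f = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED,
                           NULL);

    /* VirtualAlloc returns page-aligned memory, which satisfies the
       sector-alignment requirement of unbuffered IO */
    void * buf = VirtualAlloc(NULL, 256 * 1024,
                              MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);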

Sequential vs random access

Windows also allows programs to indicate if they are planning to read a file "mostly sequentially, but occasionally skipping forward over small ranges of bytes." This is referred to as the sequential scan mode.

This happens to be a very common access pattern that surfaces when files are saved, loaded, copied and when programs are getting launched.

The other pattern is that of random access, whereby the program jumps around the file, reading and writing at scattered offsets. This is typical of database applications, for example.

Explicitly specifying either access pattern serves as a hint to the OS and may help speed up the requests.
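
Both hints take the form of flags passed when opening the file. A minimal sketch, with file names as placeholders:

    /* expecting to read the file front to back */
    HANDLE seq = CreateFileA("big.dat", GENERIC_READ, FILE_SHARE_READ,
                             NULL, OPEN_EXISTING,
                             FILE_FLAG_SEQUENTIAL_SCAN, NULL);

    /* expecting to jump around the file */
    HANDLE rnd = CreateFileA("db.dat", GENERIC_READ | GENERIC_WRITE,
                             0, NULL, OPEN_EXISTING,
                             FILE_FLAG_RANDOM_ACCESS, NULL);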

Sequential access and the Cache Manager

Explicitly specifying the access pattern serves as a hint to the Cache Manager and allows it to adjust its caching strategy.

In sequential scan mode the manager will pre-fetch data when possible and will also discard cached data more aggressively.

Conversely, using buffered non-sequential access for large files is a recipe for thrashing the file cache. One of those "use with care" things.

In theory declaring intended access pattern should only matter for buffered (cached) access.

In practice, however, it has a noticeable effect on the performance of unbuffered access too, so we have to consider and test for this as well.

4.   IO nuances

There exist a number of finer points to doing bulk IO. Each has little impact on its own, but applied across millions of requests they can make things go a bit faster still.

IO completion ports

One is the question of how to organize IO processing in general and, more specifically, how the program learns that an IO request has actually completed.

There are, of course, several options.

One is poll-based overlapped IO, whereby it's the program that tracks the requests and asks the kernel about their status.

IO Completion Ports are another. They are an event-based mechanism that simply delivers requests back to the program once they are done. This is as light and efficient as it gets.

IOCP can also be used to handle non-IO requests, which makes it ideal for implementing advanced copying pipelines, e.g. like the one used for delta copying.
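
A bare-bones sketch of the mechanism, assuming the overlapped handle f from the earlier sketches:

    /* create a port and bind the file handle to it */
    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    CreateIoCompletionPort(f, port, (ULONG_PTR)1 /* key */, 0);

    /* ... issue overlapped reads/writes against f ... */

    for (;;)
    {
        DWORD        got;
        ULONG_PTR    key;
        OVERLAPPED * ov;

        if (! GetQueuedCompletionStatus(port, &got, &key, &ov, INFINITE))
            break;   /* a failed request or a port-level error */

        /* 'ov' identifies the completed request - process the 'got'
           bytes, then re-issue the buffer to keep the queue full */
    }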

Synchronous async IO

Windows may sometimes complete asynchronous IO requests ... synchronously.

There's some documentation as to when this may happen, but in reality it's not uncommon to see a fair share of requests completing this way for no apparent reason.

When this happens, it is possible to detect it and skip waiting for the completion ping. The savings are minuscule, but they add up.
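
One way to arrange this - a sketch assuming the FILE_SKIP_COMPLETION_PORT_ON_SUCCESS notification mode and the handle/buffers from the earlier sketches - is to suppress the completion packet for synchronously completed requests and handle them inline:

    /* ask the kernel not to queue a completion packet for requests
       that complete synchronously */
    SetFileCompletionNotificationModes(f,
                        FILE_SKIP_COMPLETION_PORT_ON_SUCCESS);

    if (ReadFile(f, buf, size, &got, &ov))
    {
        /* completed synchronously - the data is already in the buffer
           and no completion packet will arrive for this request */
    }
    else
    if (GetLastError() != ERROR_IO_PENDING)
    {
        /* the request failed outright */
    }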

Locked IO buffers

It is also possible for the program to lock its IO buffers in place. This protects them from being moved around behind the scenes and allows the Windows kernel to complete IO requests faster.

For more details see this post.
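
A sketch, assuming the mechanism in question is SetFileIoOverlappedRange, which pre-locks a buffer range for overlapped IO and requires the SE_LOCK_MEMORY privilege:

    /* pin the buffer region used for overlapped IO; this needs the
       SE_LOCK_MEMORY privilege and may be unsupported */
    if (! SetFileIoOverlappedRange(f, (PUCHAR)buf, BUF_SIZE))
    {
        /* fall back to regular, unlocked buffers */
    }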

Summed up

Performance of any bulk IO operation depends on three principal variables - the size of the IO buffers, their count and the IO mode.

Some combinations of these parameters will yield subpar performance, while others will push the stack to its limit and deliver the best IO rate possible.

Finding the perfect combo

One way to find a winning combination is to simply go through all of them, and CCSIO Benchmark does just that.

It's a little side project that evolved from an internal test framework used to prototype ideas for the ultra copier.

CCSIObench diligently goes through all parameter permutations and finds those that yield the best performance. This works well, but it's not terribly practical, because benchmarking takes time.

Its homepage is over here - https://ccsiobench.com.

Choosing good defaults

Another option is to try and pick reasonable defaults.

While there's no universal recipe that would work for all devices, it is possible to look at the device details - HDD, SSD, NVMe, if it's on a USB connection, etc. - and pin IO defaults to that.

Based on what we've seen so far, using large buffers (512KB+) and unbuffered mode tends to deliver rates that are within 10% of maximum achievable throughput.

It's also possible to tune this further and Bvckup 2 does that, but merely using multi-buffer async IO with a fast IO framework like IOCP is already 80% of the recipe.
