Parallel execution
Do more, faster

[Animated timers comparing a multi-threaded / parallel run against a single-threaded / sequential one]

* Capture of C:\Windows backup on one of our test boxes

Overview

This article provides an overview of multi-threaded backup processing, a mode in which multiple backup steps are executed in parallel.

Our goal is to see backups complete as quickly as possible, so the discussion is framed accordingly.

Executive summary

Executing several backup operations simultaneously results in a significant increase in overall processing speed, with one notable exception.

Speed improvements are especially pronounced in backups that include a lot of small files or that go over the network.

Local backups also benefit from parallel execution, especially when using non-mechanical drives, e.g. SSDs and NVMe drives.

The case of copying large files requires special handling which is described below.

Synchronous API

An average backup run involves creating, updating, moving and deleting files and folders.

These operations are carried out by sending requests to the file system and then awaiting confirmation.

The trouble is that most of these requests are synchronous.

This means that the operating system doesn't return control to the program until the request is complete. The program is basically stuck waiting for Windows to be done with the request.

Of particular interest here is the time that the program spends idling while requests are in transit. These slivers of idle time may not look like much, but once we have 100,000 folders to create, they start to add up.

Worse yet is the case of working with a remote network share over a high-latency connection - the overhead of passing requests back and forth can be so substantial that it may end up representing most of the processing time.

The async way

The obvious solution here is to not wait for a request to complete before firing off another one.

This effectively creates a request queue and ensures that the OS is kept busy processing our work for as long as we have work to give it. This is asynchronous processing. We issue a request and have it complete at some point later on, freeing us to do other things instead of waiting.

So now that we know that we want asynchronous processing, the question is how we can get a blocking CreateDirectory request to behave this way.

Luckily, it is very easy to do and the answer is threads.

However, threads are just a means to an end. What we are really after is the asynchronicity, and in one important case it is readily supported by the OS.

Working with files

All operating systems, Windows included, use the same pattern for working with files.

Before a program can do anything with the file contents or its metadata, the file needs to be "opened".

Opening a file involves passing the file name to the OS along with access requirements and other details. The OS verifies that the file exists, that the requester has the required privileges, that the file is not locked for exclusive use by some other program, etc. If all this checks out, the OS responds to the program with a "handle", which is basically a file access token.

Then, all further operations with a file - reading, writing, querying or setting of its attributes, timestamps, etc. - are done by using this handle.

Finally, once the program is done with a file, it needs to return the handle to the OS, which is referred to as "closing the file."

Opening a file is a relatively expensive operation, meaning that it may take as much time as, for example, reading the contents of a small file. It is also synchronous.
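The open / use / close pattern can be sketched in a few lines. Below it is illustrated with Python's os module as a stand-in; on Windows the underlying calls would be CreateFile, ReadFile / GetFileInformationByHandle and CloseHandle, but the flow is the same on every OS.

```python
import os

# The universal file access pattern, shown with Python's os module.
# On Windows the equivalent calls would be CreateFile / ReadFile /
# CloseHandle; the flow is identical.

path = "example.txt"
with open(path, "w") as f:
    f.write("hello")                # prepare a small file to work with

fd = os.open(path, os.O_RDONLY)     # 1. open: name + access mode -> handle
try:
    data = os.read(fd, 4096)        # 2. all further I/O uses the handle...
    size = os.fstat(fd).st_size     #    ...including metadata queries
finally:
    os.close(fd)                    # 3. close: return the handle to the OS

os.remove(path)
```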

By comparison, the actual file reading and writing can be done asynchronously. Combined with certain Windows kernel features this asynchronous data transfer can be made to fully saturate all the available bandwidth of the file system stack with just one thread.

* For complete details see Fundamentals of Fast Bulk IO.

In other words, when working with files we have:
  1. An expensive blocking open
  2. A (very) fast bulk transfer
This will come into play in a moment.

Asynchronization

Threads

As previously mentioned, the answer is threads.

Threads are independent execution flows within the program.

Every program starts with a single thread. As this thread goes through the code, it may create additional threads and set them to execute other bits of code.

Threads are a core programming abstraction - flexible and very powerful.

Parallel execution

The key property of threads is that they execute in parallel.

Threads are also typically made to cooperate with each other, i.e. one thread "delegates" some work to another thread and then collects the results once they are ready.

So if we want to execute a blocking request asynchronously, all we need to do is spawn a thread, tell it to issue that request on our behalf and then get back to us when it's done.

There's not much to it, but there are... details.
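The spawn-and-delegate idea can be sketched with a thread pool. In the snippet below, os.makedirs stands in for Windows' CreateDirectory; each submit() returns immediately, the blocking call runs on a worker thread, and the future "gets back to us" once the request is complete. The folder count and naming are illustrative.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Delegating blocking requests to threads: submit() queues the work and
# returns at once; the blocking call completes on a worker thread.

root = tempfile.mkdtemp()
paths = [os.path.join(root, "folder-%04d" % i) for i in range(100)]

with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    futures = [pool.submit(os.makedirs, p) for p in paths]
    for f in futures:
        f.result()                  # collect completions, surface errors

created = sum(os.path.isdir(p) for p in paths)
```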

Details

1. Dependencies

In the context of making backups, we typically have a long list of simple steps that, when executed, bring the backup in sync with the source.

The order of steps is important, because some steps may depend on the others. For example, we can't start copying a file until its parent folder is created, nor can we delete a folder until all of its contents are removed.

Fortunately, if a backup program implements a backup planner, all this is taken care of transparently. Two file trees go in - one for the source, another for the backup - and a list of backup steps comes out, in the right order and with all dependencies specified.

Still, the first detail to not be overlooked is that parallel execution needs to observe and respect inter-step dependencies.
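A dependency-respecting dispatcher can be sketched as follows. This is an illustrative toy, not Bvckup 2's actual planner: steps whose dependencies are all complete run in parallel as one "wave", and the rest wait for the next wave.

```python
from concurrent.futures import ThreadPoolExecutor

# A minimal dependency-aware dispatcher. Each step lists the steps it
# depends on; a step is submitted only once all of them have finished.

steps = {
    "create dir A": [],
    "copy A/x":     ["create dir A"],
    "copy A/y":     ["create dir A"],
    "delete old B": [],
}

done, order = set(), []

def run(name):
    order.append(name)              # stand-in for actual file system work

with ThreadPoolExecutor(max_workers=4) as pool:
    pending = dict(steps)
    while pending:
        # everything whose dependencies are satisfied goes in parallel
        ready = [s for s, d in pending.items() if all(x in done for x in d)]
        futures = {s: pool.submit(run, s) for s in ready}
        for s, f in futures.items():
            f.result()
            done.add(s)
            del pending[s]
```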

2. Thread count

Second is a question of how many threads to use.

More is not necessarily better.

For local operations there should be as many threads as there are CPU cores. Spawning more threads will only cause them to compete for CPU time, resulting in slower overall execution time. Spawning fewer threads will usually be sub-optimal.

For remote operations it depends, but the local CPU count is also a good starting point. For links and networks with high latency increasing the thread count may improve throughput, but for a regular LAN the effect is likely to be small, if any.

Conversely, there's little point in reducing the thread count, because spawning a few too many threads merely results in a slightly longer request queue, which is (usually) not harmful to performance.

That is, the rule of thumb for remote backups is to start with the local CPU count and test larger counts when fine-tuning.
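The rule of thumb can be put in code form. The latency threshold and the scaling factor below are illustrative assumptions, not Bvckup 2's actual tuning:

```python
import os

# Rule-of-thumb thread count defaults. The 20 ms threshold and the 2x
# factor are illustrative; the right values come from testing.

def default_thread_count(remote=False, latency_ms=0):
    n = os.cpu_count() or 4         # local: one thread per CPU core
    if remote and latency_ms > 20:  # high-latency link: a longer queue
        n *= 2                      # helps hide round-trip overhead
    return n
```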

3. File copying

The third gotcha has to do with file copying.

When copying smaller files, the time needed for opening and closing them is comparable to that spent on reading and writing them. If we are to copy a stream of small files, we will be looking at a lot of stalling.

=>  We do want to be copying more than one small file at a time.

But remember how we can saturate the IO pipeline by copying a single large file? If we are to throw another large copy into the mix, it will cause the transfers to compete for bandwidth and slow each other down.

=>  We do not want to copy more than one large file at a time.

The solution is simple - we limit file copying to a single large file at a time, whereby a file is considered large if it needs more than X requests to be read in full.

Bvckup 2 uses a threshold of 32 requests, which may seem a bit high, but has proved to work well in practice.
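This policy can be sketched with a semaphore. The 32-request threshold matches the article; the 4 MB request size and all function names are illustrative assumptions:

```python
import threading

# "At most one large copy at a time": a file counts as large if reading
# it takes more than REQUESTS_THRESHOLD read requests. The 4 MB request
# size is an illustrative assumption.

REQUEST_SIZE = 4 * 1024 * 1024
REQUESTS_THRESHOLD = 32

def is_large(file_size):
    return file_size > REQUEST_SIZE * REQUESTS_THRESHOLD

large_copy_gate = threading.Semaphore(1)    # serializes large copies

def copy_file(path, size, do_copy):
    if is_large(size):
        with large_copy_gate:       # at most one large copy at a time
            do_copy(path)
    else:
        do_copy(path)               # small files copy freely in parallel
```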

4. Error handling

Retrying on transient errors in the presence of parallel execution is another pitfall.

When there are X requests pending and one of them fails with, say, network unreachable, what's the right thing to do?

On one hand, the program may just retry each failed request after a pause. However, if the network remains down, this will pollute the backup log with redundant failures and generally create a lot of fuss where none is needed.

So instead the program would need to try and exit the retry pause state gingerly, by retrying a single step first and following up with the rest if all goes well.

But then there are also edge cases.

Sometimes a network hiccup will cause only some requests to fail right away. More will fail in a minute, a few more in 10 minutes, and some will actually manage to complete, but take an hour to do so.

As a result the retry logic ends up containing a lot more complexity than may seem initially necessary. Caveat emptor.
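The "retry one step first" approach can be sketched like this. Names, the exception type and the pause value are illustrative, and real retry logic carries considerably more state, per the caveat above:

```python
import time

# Cautious retry: after a pause, retry a single probe step first and
# release the remaining steps only if the probe succeeds.

def retry_after_pause(failed_steps, execute, pause=0.0):
    time.sleep(pause)
    probe, rest = failed_steps[0], failed_steps[1:]
    try:
        execute(probe)              # test the waters with one step
    except OSError:
        return list(failed_steps)   # still down - keep everything queued
    remaining = []
    for step in rest:               # probe succeeded - retry the rest
        try:
            execute(step)
        except OSError:
            remaining.append(step)  # individual steps may still fail
    return remaining
```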

5. Memory utilization

Using extra threads invariably means higher memory usage.

Usually the per-thread overhead is not very big unless a thread needs to copy a file. Copying involves allocating multiple IO buffers, and with modern drives these often need to be large to maximize performance.

It's not atypical for the copying module to use 32 MB in buffers per file transfer. Multiplied, say, by 16 cores, that's half a gigabyte.

However, since we do not copy large files in parallel, we kill two birds with one stone and get a free pass on this issue.

Robocopy /mt

As you may know, Windows ships with a very handy file copying utility called robocopy.

One of its many features is the support for multi-threaded file copying, done with the help of the /mt command-line switch.

There are two differences between robocopy /mt and Bvckup 2.

First, /mt applies to the file copying only. All other operations, e.g. folder creation and item deletion, are still done sequentially.

Second is that it does not differentiate between small and large files and runs all file copies in parallel, unconditionally.

* There are other important differences between the two, but these aren't related to multi-threaded processing.

Sample numbers

The following numbers are meant to give an idea of how parallel and sequential modes compare in terms of performance. These are specific to replicating from a local NVMe drive to a local SSD and to a remote share on a low-end NAS over Gigabit Ethernet.

Exact numbers and speed-up rates will vary greatly with the setup and the state of the environment. When running your own tests, make sure to compare apples to apples. In particular, the state of the file system cache - warm vs. cold - may sway test results by orders of magnitude.

That said, it is not uncommon to see speed-ups in full multiples of single-threaded numbers when using very fast storage and networks.

Bulk folder creation

Creating a folder structure. Just 99,185 folders, no files.

                 Sequential   Parallel   Speed-up
Local disk       20.3 sec     11.9 sec   1.70
Network share    45.8 sec     25.6 sec   1.79

Bulk folder deletion

Cleaning up after the previous test.

                 Sequential   Parallel   Speed-up
Local disk       11.3 sec      7.8 sec   1.45
Network share    85.8 sec     53.4 sec   1.61

Backing up C:\Windows

Replicating 7,278 folders, 78,896 files, 15.4 GB.

                 Sequential   Parallel   Speed-up
Local disk       73.2 sec     38.7 sec   1.89
Network share     820 sec      535 sec   1.53

Deleting a copy of C:\Windows

Cleaning up after the previous test.

                 Sequential   Parallel   Speed-up
Local disk       11.6 sec      7.4 sec   1.57
Network share     394 sec      259 sec   1.52


Made by IO Bureau in Switzerland