Ultra copier is the internal name of a brand new bulk and delta copying module, shipping with R79.

A complete rewrite, it builds on what we learned in the past 4 years and implements significant improvements in several key areas.

  • Copying of smaller files
  • Working with very fast drives
  • Faster scalable delta copying
  • First-class support for resuming of copying

What follows is a tour of the first two items. The other two will be covered in a separate post.

Smaller files

Copying lots of small files quickly is a challenge.

The per-file overhead of prep/post work is often comparable to the time needed for the actual copying. This fixed cost is split between the program's own overhead and the time spent actually opening and closing files, copying meta info, etc.

With older drives this was not an area worthy of optimization, because the cost of merely opening/creating a file dwarfed that of any prep work that the app itself was doing.

However, with newer drives and faster machines this is no longer the case. All that trivial activity like allocating buffers, writing to the log, pre-configuring the IO - it all suddenly adds up and starts to matter.

For this reason the ultra copier now aggressively pre-allocates, caches, recycles and otherwise streamlines the prep/post phases to keep its per-file overhead to an absolute minimum.
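To make the "pre-allocate and recycle" idea concrete, here is a minimal Python sketch of a buffer pool shared across many small copies. It is not the ultra copier's code. The names BufferPool and copy_stream are made up for illustration, and in-memory streams stand in for real files.

```python
import io


class BufferPool:
    """Hypothetical pool: buffers are allocated up front and reused per file."""

    def __init__(self, count, size):
        self._size = size
        self._free = [bytearray(size) for _ in range(count)]
        self.allocations = count  # total number of buffers ever allocated

    def acquire(self):
        # Reuse a cached buffer when possible, allocate only as a fallback.
        if self._free:
            return self._free.pop()
        self.allocations += 1
        return bytearray(self._size)

    def release(self, buf):
        # Return the buffer to the pool so the next file can reuse it.
        self._free.append(buf)


def copy_stream(src, dst, pool):
    """Copy src to dst through a pooled buffer. Returns the bytes copied."""
    buf = pool.acquire()
    total = 0
    try:
        with memoryview(buf) as view:
            while True:
                n = src.readinto(buf)
                if not n:
                    break
                dst.write(view[:n])
                total += n
    finally:
        pool.release(buf)
    return total


pool = BufferPool(count=2, size=64 * 1024)
for _ in range(100):
    copy_stream(io.BytesIO(b"x" * 1000), io.BytesIO(), pool)
# pool.allocations is still 2 here: no per-file buffer allocation took place
```

The same pattern extends to any per-file object that is expensive to set up and cheap to reset.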

The effect of this obviously varies, but it can be as eye-popping as a threefold speed-up, for example, when cloning C:\Windows on an NVMe drive.

Faster drives

Bvckup 2 has been using multi-buffer async IO from its very first release.

The core of the technique is that the program doesn't wait for read/write requests to complete, but rather just queues them with Windows and checks later whether they are done.

That last bit - check if it's done - is where the new code does things differently.

Ultra's IO pipeline is built around IO completion ports (IOCP) which it uses to track, well, the completion of IO requests.

The program issues read/write requests as before, but it also asks Windows to queue a "done" notification once a request is completed.

This queue is called an "IO completion... port", which tends to muddy the waters somewhat, but it is one of the more elegant and useful mechanisms of the Windows kernel.

The key point of IOCP is this: we no longer need to drag the full list of pending requests across the userspace boundary just to learn which of them might've been completed.
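The shape of this can be sketched without any Win32 calls. In the Python model below, a thread pool stands in for the kernel and a queue.Queue stands in for the completion port. Both are assumptions made purely for illustration, and submit, read_block and read_all are hypothetical names. On Windows the real thing is built from CreateIoCompletionPort and GetQueuedCompletionStatus.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def submit(pool, port, key, fn, *args):
    """Queue an operation. A 'done' notification lands on `port` later."""
    def run():
        try:
            result, error = fn(*args), None
        except Exception as exc:
            result, error = None, exc
        port.put((key, result, error))
    pool.submit(run)


def read_block(data, offset, size):
    # Stand-in for an actual disk read.
    return data[offset:offset + size]


def read_all(data, block_size, workers=4):
    port = queue.Queue()  # plays the role of the completion port
    blocks = {}
    offsets = range(0, len(data), block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for off in offsets:
            submit(pool, port, off, read_block, data, off, block_size)
        for _ in offsets:
            # One blocking call returns whichever request finished next.
            # The list of pending requests is never scanned.
            key, result, error = port.get()
            if error is not None:
                raise error
            blocks[key] = result
    return b"".join(blocks[off] for off in sorted(blocks))
```

Completions may arrive in any order, which is why each notification carries a key that ties it back to its request.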

This makes IOCP really quite fast.

But wait, there's more.

IOCP can also accommodate async non-IO operations.

The ultra copier makes full use of this when delta-copying a file. It feeds a stream of block hashing requests to a pool of worker threads and then receives their completion events via the same port that it uses for IO requests. This allows for uniform handling of all async operations in the IO code. Reading, hashing and writing now become equal parts of the IO pipeline, leading to simpler code.
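Here is a rough Python model of that uniform pipeline, again with a queue.Queue standing in for the completion port and a thread pool doing the work. The delta scheme shown (rewrite a block only if its SHA-256 differs from a saved one) is an assumption made for the sake of the example, as are all the function names and the requirement that dst is as long as src.

```python
import hashlib
import queue
from concurrent.futures import ThreadPoolExecutor


def post_when_done(pool, port, kind, index, fn, *args):
    """Run fn on the pool and post its completion to the shared port."""
    def run():
        try:
            port.put((kind, index, fn(*args)))
        except Exception as exc:
            port.put(("error", index, exc))
    pool.submit(run)


def read_block(src, i, size):
    return src[i * size:(i + 1) * size]


def hash_block(block):
    return hashlib.sha256(block).digest()


def write_block(dst, i, size, block):
    # Same-length slice assignment, so dst is never resized.
    dst[i * size:i * size + len(block)] = block
    return len(block)


def delta_copy(src, dst, saved_hashes, size, workers=4):
    """Update dst (a bytearray as long as src) with the changed blocks of src."""
    port = queue.Queue()
    count = (len(src) + size - 1) // size
    blocks, hashes, written = {}, [None] * count, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(count):
            post_when_done(pool, port, "read", i, read_block, src, i, size)
        pending = count
        while pending:
            # Reads, hashes and writes all complete through the same queue.
            kind, i, result = port.get()
            if kind == "error":
                raise result
            if kind == "read":
                blocks[i] = result
                post_when_done(pool, port, "hash", i, hash_block, result)
            elif kind == "hash":
                hashes[i] = result
                block = blocks.pop(i)
                if i < len(saved_hashes) and saved_hashes[i] == result:
                    pending -= 1  # block is unchanged, nothing to write
                else:
                    post_when_done(pool, port, "write", i,
                                   write_block, dst, i, size, block)
            else:  # "write"
                written.append(i)
                pending -= 1
    return hashes, sorted(written)
```

Note that the main loop has a single wait point and a plain dispatch on the completion kind. Adding another async stage means adding another branch, not another waiting mechanism.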

Synchronous IO

There are documented cases when async IO requests may complete synchronously. In fact, Microsoft also says that your code should be prepared for this to happen at all times.

We care about synchronous completion, because it makes the pipeline stutter, so it's not good for performance.

When this happens, Windows still queues an IOCP notification, so the simplest thing to do is to ignore how a request completes and just wait for an IOCP ping.

This however adds a small delay to the IO flow, because we end up completing a request later than we could've.

You probably see where this is headed and you are correct - the ultra copier suppresses IOCP pings for sync reads/writes and processes them immediately.
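The control flow looks roughly like the Python model below. FakeFile, PENDING and read_all are hypothetical stand-ins: a "cached" block completes in place and posts nothing to the queue, which mimics a handle configured with FILE_SKIP_COMPLETION_PORT_ON_SUCCESS, while everything else completes later via the queue.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

PENDING = object()  # stand-in for the "request is still in flight" result


class FakeFile:
    """Hypothetical file whose cached blocks complete synchronously."""

    def __init__(self, blocks, cached, pool, port):
        self.blocks = blocks
        self.cached = cached
        self.pool = pool
        self.port = port

    def start_read(self, i):
        if i in self.cached:
            # Completed in place. No notification is queued for this one.
            return self.blocks[i]
        # Completes later. The notification arrives through the port.
        self.pool.submit(lambda: self.port.put((i, self.blocks[i])))
        return PENDING


def read_all(f, count):
    results, inline, waiting = {}, 0, 0
    for i in range(count):
        r = f.start_read(i)
        if r is PENDING:
            waiting += 1
        else:
            results[i] = r  # process right away, no round trip via the port
            inline += 1
    for _ in range(waiting):
        i, data = f.port.get()
        results[i] = data
    return [results[i] for i in range(count)], inline
```

The important detail is that the two paths are mutually exclusive. A request handled inline must not also produce a queued notification, or it would get processed twice.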

If you ever wondered what SetFileCompletionNotificationModes is for - here you go, now you know :)

Locked IO buffers

Among smaller performance tweaks, the ultra copier defaults to using unbuffered IO when reading larger files. Larger files aren't likely to be cached in full, so bypassing the cache has a small, but noticeable effect on the reading speed.

There also happens to be a way to further improve performance by locking IO buffers with SetFileIoOverlappedRange.

    * Some conditions apply

In particular, this requires holding a rather exotic privilege and it just may lead to memory starvation for the rest of the system. Ask me how I know.

But when used with care it does appear to improve bulk IO rate on faster drives.

Next up, the delta copying improvements.

Made by Pipemetrics in Switzerland