Feb 23, 2016
Notice - June 2019
This post has been now reworked into a proper article that explains things at a bit greater depth - https://bvckup2.com/kb/delta-copying
Delta copying is an optimized way of copying a file when an older version of the same file already exists at destination.
When no copy of a file exists at destination, we have no option but to read every byte of the source file and then write them all into the backup copy.
If we are then to make a small change to the source file and repeat the copying, the vast majority of data we'll be writing will be exactly the same as what's already on the disk.
So it only makes sense to try and eliminate these redundant writes, and that's exactly what delta copying is about.
Naturally, there is more than one way to approach this matter.
The rsync way
The widely-used rsync tool uses two cooperating processes - one at the source and another at the destination - which both read their own copy of a file, block by block, and talk to each other to compare block checksums. When checksums don't match, the source process forwards the respective block to the destination process, which merges it into the destination file.
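The simplified scheme just described can be sketched in a few lines. This is a toy single-process version, not rsync's actual algorithm (which matches blocks at arbitrary offsets using rolling checksums); the block size and digest choice are illustrative:

```python
import hashlib, os

BLOCK = 64 * 1024  # an illustrative block size

def block_digests(path):
    """Each side computes digests of its own copy, one per fixed-size block."""
    out = []
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(BLOCK), b""):
            out.append(hashlib.md5(chunk).hexdigest())
    return out

def sync(src_path, dst_path):
    """The 'source process' sends only the blocks whose digests differ from
    the 'destination process' view; here both run in one process for brevity.
    Assumes dst_path already exists, per the delta copying premise."""
    src_d, dst_d = block_digests(src_path), block_digests(dst_path)
    with open(src_path, "rb") as s, open(dst_path, "r+b") as d:
        for i, h in enumerate(src_d):
            if i >= len(dst_d) or dst_d[i] != h:
                s.seek(i * BLOCK)
                d.seek(i * BLOCK)
                d.write(s.read(BLOCK))
        d.truncate(os.path.getsize(src_path))
```

Only the mismatched blocks cross the "wire" between the two loops; everything else is left in place.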
That's a much simplified description of rsync. There's quite a bit more to its algorithm - after all, it was Tridge's PhD thesis :)
The biggest plus of rsync is that all you need is just two copies of a file, and it can make one look like the other, expeditiously.
The biggest minus of rsync is that you need to have a copy of rsync running on the receiving end. Meaning, if your NAS doesn't support rsync, then that's it. No rsync for you.
Bvckup 2 and its older brother Bvckup take a different approach.
When a file is first copied, the app splits it into equally-sized blocks, computes a hash for each block and then stores these hashes locally.
On the next copy, as the app goes through the source file block by block, it re-computes the hashes and compares them to the saved versions. If they match (* see below), then a block is assumed to be unchanged and it is skipped over. Otherwise, it is written out and the saved hash is updated to its new value.
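Here's a minimal sketch of that loop. It is not Bvckup 2's actual code - it uses a single hash where the app uses two per block, and a JSON file as the stand-in for the locally stored hash data:

```python
import hashlib, json, os

BLOCK = 64 * 1024  # the Release 79 default block size

def delta_copy(src, dst, cache_path):
    """Copy src over dst, skipping blocks whose hash matches the cached value
    from the previous run. Returns the number of blocks actually written."""
    try:
        with open(cache_path) as f:
            cached = json.load(f)
    except FileNotFoundError:
        cached = []                                # first copy: write everything
    hashes, written, i = [], 0, 0
    mode = "r+b" if os.path.exists(dst) else "w+b"
    with open(src, "rb") as s, open(dst, mode) as d:
        while True:
            chunk = s.read(BLOCK)
            if not chunk:
                break
            h = hashlib.blake2b(chunk).hexdigest()
            if i >= len(cached) or cached[i] != h:  # changed or never-seen block
                d.seek(i * BLOCK)
                d.write(chunk)
                written += 1
            hashes.append(h)                        # saved hash gets the new value
            i += 1
        d.truncate(s.tell())
    with open(cache_path, "w") as f:
        json.dump(hashes, f)
    return written
```

On the first run every block gets written; on subsequent runs only the blocks whose hashes changed do.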
Easy-peasy. But what of them caveats, you wonder? Indeed.
1. The last thing we want is to skip a modified block only because it happened to have the exact same hash as its previous version. The risk of this event is mitigated by using two separate checksums for each block, both of which are stored in a hash file.
Additionally, Bvckup 2 computes a full-file checksum using a third digest algorithm. In cases when no block-level changes are detected in a file, this hash is verified against its version from the previous run. If there's ever a mismatch, the file is re-copied in full.
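In code, the idea looks something like this. The digest algorithms below are placeholders for illustration, not the app's actual choices:

```python
import hashlib

def block_fingerprint(chunk):
    """Two independent digests per block; a change must collide in BOTH
    to slip through, which makes the risk vanishingly small."""
    return hashlib.md5(chunk).digest(), hashlib.sha1(chunk).digest()

def file_fingerprint(path, block=64 * 1024):
    """Whole-file digest with a third algorithm, verified when no
    block-level changes were detected."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            h.update(chunk)
    return h.hexdigest()
```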
2. Delta copying assumes that the destination file remains unmodified between the runs. Because if it doesn't, then all our precious locally saved block hashes will simply be of no use.
Luckily, since we are in a backup software context, this holds true in the vast majority of cases. However, as they say - trust, but verify.
To catch changes to destination files, the app saves their size and created/last-modified timestamps alongside the block hashes. If these aren't an exact match to the reality on the next run, then the destination file is deemed to be modified and the file is re-copied in full.
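A sketch of that check, with created-time omitted for portability (the field names and storage format are made up for illustration):

```python
import os

def snapshot(path):
    """Values saved alongside the block hashes after a successful run."""
    st = os.stat(path)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns}

def destination_unchanged(path, saved):
    """True if the destination still matches what we recorded last time;
    False triggers a full re-copy."""
    st = os.stat(path)
    return st.st_size == saved["size"] and st.st_mtime_ns == saved["mtime_ns"]
```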
3. Delta copying is an in-place update algorithm. It works with a live copy of destination file, meaning that if we are to cancel/abort the copying mid-way through, we may end up with a partially updated file.
There's not much we can do about this other than to detect this regrettable development on the next run and deal with it appropriately.
Starting with Release 79, the copying module supports resuming and fast error recovery, both for orderly cancellation and error aborts. See https://bvckup2.com/wip/30042018 for complete details on this topic.
* Earlier releases simply re-copied files in full on the next run.
Delta copying is used only for larger files. Files smaller than 16MB, and files under 64MB that weren't modified within the last 30 days, are always copied in full. Thresholds were different in older releases, and at some point the program used a single threshold rather than two.
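As a predicate, the current rule works out to something like this (a sketch of the thresholds as described above, not the app's actual code):

```python
import time

DAY = 86400

def use_delta_copying(size, mtime, now=None):
    """Files under 16MB are always copied in full, as are files under 64MB
    that weren't modified within the last 30 days."""
    now = time.time() if now is None else now
    if size < 16 * 2**20:
        return False
    if size < 64 * 2**20 and (now - mtime) > 30 * DAY:
        return False
    return True
```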
Release 79 and newer:
⦁ Default block size is 64KB.
⦁ Per-block hashes are Blake2b, xxHash and SpookyHash.
This works out to 32 bytes of hashes per 64KB of raw data, plus 512 bytes of a fixed header - about 0.05% of the data size.
Release 78 and older:
⦁ Default block size is 32KB.
⦁ Per-block hashes are MD5 and a variation of CRC32.
⦁ Per-file hash is SHA1.
This works out to 20 bytes of hashes per 32KB of raw data, plus roughly 40 bytes of a fixed header - about 0.06% of the data size.
For the rationale on the change see https://bvckup2.com/wip/25042018
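The overhead figures above check out with simple arithmetic (ignoring the small fixed headers):

```python
# Release 79: 32 bytes of hashes per 64KB block
r79 = 32 / (64 * 1024)   # about 0.05% of the data size

# Release 78: 20 bytes of hashes per 32KB block
r78 = 20 / (32 * 1024)   # about 0.06% of the data size
```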
Internally, the delta copying routine is organized into a reading-hashing-writing pipeline, operating fully asynchronously on a pool of I/O buffers.
The copying starts with the app issuing multiple read requests in parallel.
Once a request is completed, the I/O buffer is forwarded to the hashing module, which maintains a standby pool of hashing threads. Once the buffer is hashed, and if it appears to be modified, a write request is issued for it. Then, once the write request completes, the buffer is again used to read the next block in sequence and the cycle repeats.
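The shape of that pipeline can be sketched with queues and threads. This is a toy model over in-memory blocks - the real module issues asynchronous file I/O and recycles a fixed buffer pool, and the digest and thread counts here are illustrative:

```python
import queue, threading

def pipeline(blocks, cached, n_hashers=2):
    """Toy read -> hash -> write pipeline; returns indices of blocks 'written'."""
    to_hash, to_write = queue.Queue(), queue.Queue()
    written = []

    def hasher():
        while True:
            item = to_hash.get()
            if item is None:
                break
            idx, data = item
            h = hash(data)                 # stand-in for the real digests
            if cached.get(idx) != h:       # changed or never-seen block
                to_write.put((idx, data))
            cached[idx] = h

    def writer():
        while True:
            item = to_write.get()
            if item is None:
                break
            written.append(item[0])        # a real writer issues a disk write here

    hashers = [threading.Thread(target=hasher) for _ in range(n_hashers)]
    wr = threading.Thread(target=writer)
    for t in hashers + [wr]:
        t.start()
    for i, b in enumerate(blocks):         # the "read" stage feeds the hashers
        to_hash.put((i, b))
    for _ in hashers:
        to_hash.put(None)                  # drain and stop the hashing stage
    for t in hashers:
        t.join()
    to_write.put(None)                     # then stop the writer
    wr.join()
    return sorted(written)
```

The point of the structure is that reading, hashing and writing all overlap in time instead of running one after another.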
The delta copying module comes with a lot of settings - from hashing thread count to buffer counts to read/write chunk sizes - all tweakable. However, the app does a decent job of picking the defaults based on the exact disposition of source and destination - whether they are on the same drive, whether they are on the network, whether they talk over an older or newer SMB protocol, etc. - so generally there's no need to mess with them.
So there you have it - the delta copying - a new best friend of your VM images and TC containers :-)
Feb 23, 2016
Pushing vs pulling backups
In short - push-style backups maximize the efficiency of delta copying.
When a backup is going over the network, there's often a question of where it's better to run the program.
If the program runs on the source machine, it's a "local-to-remote" or "push" backup. And when the program runs on the backup machine, it's a "remote-to-local" or "pull" backup.
Delta copying gets its speed benefits from being selective with writes. With push backups all reads are local (fast) and writes go over the network (slow). With pull backups all reads are over the network (slow) and writes are local (fast).
So if we are reducing the amount of writes, then with faster reads and slower writes the effect will be far more pronounced => push backups are better. In other words, running the program on a source machine will generally result in faster backups.
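A back-of-the-envelope model makes the difference concrete. The file size, change rate and throughput numbers below are made up for illustration:

```python
def copy_time(total_gb, changed_frac, read_mbps, write_mbps):
    """Rough model: read everything, write only the changed fraction."""
    read_s = total_gb * 1024 / read_mbps
    write_s = total_gb * changed_frac * 1024 / write_mbps
    return read_s + write_s

# 100 GB file, 2% changed; local disk at ~500 MB/s, network at ~100 MB/s
push = copy_time(100, 0.02, read_mbps=500, write_mbps=100)  # local read, remote write
pull = copy_time(100, 0.02, read_mbps=100, write_mbps=500)  # remote read, local write
```

With these numbers the push backup finishes several times faster, because the slow network link only carries the 2% of blocks that changed.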
Additionally, push backups can also make full use of destination snapshot caching - an option that tells Bvckup 2 to preserve and reuse destination tree index between backup runs.
This option is On by default and it eliminates the need for re-scanning the destination location on every run. When the destination happens to be over the network, this may translate into a considerable speed-up, especially if the backup is big but its per-run changes are few and far between.