Unicode, as you may know, is a standard that assigns pretty pictograms to numerical codes from the 0 to 0x0010FFFF range.

In simpler cases pictograms represent letters, numbers and symbols. In other cases they are decorative glyph elements, like umlauts, that are meant to be combined with the letters.

There are also Unicode encodings - UTF-8, UTF-16, etc. These deal with how specific Unicode numerical IDs are translated into sequences of *bytes* for storage, transmission, etc.

The simplest encoding is UTF-32 - it simply stores each Unicode ID as a 32-bit number. This is however rather wasteful, as at most 21 bits are needed to represent any Unicode ID.
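To put some numbers on that, here is a small Python sketch that counts how many bytes a single symbol takes in each encoding. The helper name `encoded_sizes` is made up for this illustration; the codec names are the standard Python ones.

```python
def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    # The "-le" variants are used so that no byte order mark is prepended.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    for ch in ("A", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):06X}", encoded_sizes(ch))

    # The largest Unicode ID needs 21 bits, so 11 of the 32 bits are always zero.
    print((0x10FFFF).bit_length())
```

UTF-32 spends 4 bytes on everything, including plain ASCII letters that UTF-8 stores in a single byte.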

Prior to Windows 2000, Microsoft used a simple 16-bit encoding that mapped every Unicode ID from the 0 to 0x00FFFF range to the exact same 16-bit number. This is called UCS-2 encoding. The Unicode range was smaller back then, so this approach worked.

Then the Unicode range grew and IDs no longer fit into 16 bits. To address this, UCS-2 was modified to encode symbols from the 0x010000 - 0x10FFFF range using two 16-bit numbers, both taken from the 0xD800 - 0xDFFF range. The exact details of this encoding are not important, but what it meant was that some 16-bit numbers, e.g. 0xD812, were no longer valid Unicode symbols on their own.

This is called UTF-16 encoding, and this is what Windows uses at the moment.
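The details may not be important, but they are short enough to sketch. The two 16-bit numbers are known as a surrogate pair: the first comes from 0xD800 - 0xDBFF, the second from 0xDC00 - 0xDFFF. The function names below are invented for this example.

```python
def to_surrogate_pair(code_point):
    """Split a Unicode ID from 0x10000..0x10FFFF into two 16-bit numbers."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("this ID fits into a single 16-bit number")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into the original Unicode ID."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(hex(high), hex(low))
    print(hex(from_surrogate_pair(high, low)))
```

This is why a number like 0xD812 means nothing by itself: it is only the first half of a pair.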

But wait!

As Apple likes to say after a screw-up - "it turns out" that despite routinely talking about UTF-16 in their API documentation, the Windows file system still uses plain 16-bit encoding for file names!

For example, "\xD812Haha" is a valid file name even though it is *not* a valid UTF-16 string, because 0xD812 is one half of a two-number sequence with no second half after it. This in turn means that if an app converts file names into another encoding, it should go easy on full UTF-16 compliance checks and be ready to digest ill-formed UTF-16 strings.
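A strict compliance check of the kind mentioned above could look roughly like this. It is only a sketch: the name is treated as a plain list of 16-bit numbers, and `is_well_formed_utf16` is a hypothetical helper, not a Windows API.

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate in a list of 16-bit numbers is properly paired."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be immediately followed by a low one.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no high surrogate in front of it.
            return False
        else:
            i += 1
    return True


if __name__ == "__main__":
    name = [0xD812] + [ord(c) for c in "Haha"]
    print(is_well_formed_utf16(name))
```

The file name from the example fails this check, yet the file system accepts it without complaint.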

Bvckup 2 keeps all text in UTF-8 format internally, so it does quite a bit of conversion from/to UTF-16 and this leads to some interesting things when the app runs into a malformed UTF-16 name. Like those created by *ahem* Adobe *ahem* After Effects *ahem*, for example.

The solution is simple - try to convert as if we have the UTF-16 encoding, but fall back to UCS-2 when running into invalid UTF-16 sequences. Also, apply the same approach when converting back to UTF-16 and all will be good.
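As an illustration of the idea, and not the actual Bvckup 2 code, here is a minimal Python sketch. It assumes that "fall back to UCS-2" means encoding a stray 16-bit number with the usual 3-byte UTF-8 bit layout; all function names are made up. The way back leans on Python's standard `surrogatepass` error handler, which accepts such 3-byte sequences.

```python
def encode_utf8_style(cp):
    """Encode one numeric ID with the UTF-8 bit layout, surrogates included."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


def utf16_to_utf8_lenient(units):
    """Convert 16-bit numbers to bytes, treating unpaired surrogates as UCS-2."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        paired = (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                  and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if paired:
            # Valid UTF-16: merge the pair into one ID above 0xFFFF.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # UCS-2 fallback: take the 16-bit number at face value.
            cp = u
            i += 1
        out += encode_utf8_style(cp)
    return bytes(out)


def utf8_to_utf16_lenient(data):
    """Convert back to 16-bit numbers, letting lone surrogates through."""
    units = []
    for ch in data.decode("utf-8", "surrogatepass"):
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            units += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
        else:
            units.append(cp)
    return units


if __name__ == "__main__":
    name = [0xD812] + [ord(c) for c in "Haha"]
    data = utf16_to_utf8_lenient(name)
    print(data, utf8_to_utf16_lenient(data) == name)
```

The point of the fallback is that the conversion never fails and the ill-formed name survives the round trip unchanged.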

Will be in Release 74.8.

* Kudos to Andrea for reporting this.
Made by IO Bureau in Switzerland