.reg files - what are those funny numbers?

If you’ve ever exported a segment of the registry into a .reg file for application elsewhere, you may have noticed that a value which seems like a perfectly coherent string in the Registry Editor becomes a sequence of hexadecimal digits. Something like this:

In the registry editor, I have the string:

"This is a test"

In the .reg file:

"test"=hex(2):54,00,68,00,69,00,73,00,20,00,69,00,73,00,20,00,61,00,20,00,74,\
  00,65,00,73,00,74,00,00,00

So how do you find out what all those numbers mean, assuming you no longer have the value in the registry? Python to the rescue. First, let’s cut-and-paste just the digits into an interpreter window. We could have used a triple-quoted string, but as it happens the lines already end with continuation markers, so:

x = "54,00,68,00,69,00,73,00,20,00,69,00,73,00,20,00,61,00,20,00,74,\
  00,65,00,73,00,74,00,00,00"

Now, we do that thing you never do in Python[*]: a one-liner.

encoded_string = "".join (chr (int (i, 16)) for i in x.split (","))
print encoded_string

But it’s got all those extra spaces in it (each one is really a zero byte)! Aha. I haven’t yet waved my magic Unicode wand:

print encoded_string.decode ("utf_16_le")

And voilà! The string you first thought of. So how did I know it was UTF-16LE encoded? Lucky guess, coupled with quite a few years of wandering around Windows.

OK, the briefest of explanations for anyone who’s less interested in my showmanship and more in a working solution. Windows has taken the string, encoded it as little-endian UTF-16, which uses two bytes per code unit (so two bytes for each of the characters here), appended a two-byte null terminator (the final 00,00), and then written each byte as a pair of hexadecimal digits in the .reg file. To reverse the effect, we unstitch the string along the commas, convert each of the resulting digit-strings to an integer using base 16 and convert each of those integers to its corresponding one-byte character. That gives you a UTF-16-encoded byte string which is just one decode away from the Unicode string you started off with.
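The snippets above use Python 2 idioms (the print statement, decoding a plain str). For anyone on Python 3, here is a rough sketch of the same steps rolled into one function. The name decode_reg_hex is my own invention for illustration, not anything from the standard library, and it assumes the value is UTF-16LE text in the hex(2) style shown above.

```python
def decode_reg_hex(value):
    """Decode the data of a hex(2)-style .reg value into text.

    Hypothetical helper: assumes the bytes are little-endian UTF-16.
    """
    # Drop an optional type prefix such as "hex(2):" if it was pasted too.
    if ":" in value:
        value = value.split(":", 1)[1]
    # Remove the line-continuation backslashes, newlines and indentation.
    cleaned = "".join(value.replace("\\", "").split())
    # Each comma-separated item is one byte written as two hex digits.
    raw = bytes(int(item, 16) for item in cleaned.split(",") if item)
    # Decode the bytes and strip the trailing null terminator.
    return raw.decode("utf-16-le").rstrip("\x00")


# A raw triple-quoted string keeps the backslashes exactly as exported.
reg_data = r"""hex(2):54,00,68,00,69,00,73,00,20,00,69,00,73,00,20,00,61,00,20,00,74,\
  00,65,00,73,00,74,00,00,00"""
print(decode_reg_hex(reg_data))
```

Going through bytes rather than chr avoids the Python 2 habit of treating a str as a bag of bytes, and stripping the trailing "\x00" removes the terminator that the export carries along.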

Clear?

[*] except when you do