alexpgp | What a difference a letter makes!

Back in the days before Unicode and UTF-8, Cyrillic was often encoded in that half of the 8-bit space that wasn't occupied by ASCII (i.e., in bytes with values of 128 and above). As these values were also used to encode accented Latin characters, you had to make sure you had the right font specified to read Cyrillic text, else all you'd see would be something that looked like: Àäàïòàöèÿ ê öèôðîâîìó çâó÷àíèþ.
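To make the mix-up concrete, here's a small Python sketch. It assumes the legacy encoding was Windows-1251 and the Latin interpretation was Windows-1252; neither is named above, but the garbled sample is consistent with that pair.

```python
# Assumption: the 8-bit Cyrillic encoding is Windows-1251 (cp1251) and the
# Latin reading is Windows-1252 (cp1252). Other legacy pairs behave similarly.
raw = "Адаптация".encode("cp1251")   # the 8-bit bytes as stored in the file

as_cyrillic = raw.decode("cp1251")   # same bytes, Cyrillic code page
as_latin = raw.decode("cp1252")      # same bytes, Latin code page

print(as_cyrillic)
print(as_latin)
```

The bytes never change; only the table used to turn them into characters does.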

Now, whenever I've run into one of these legacy files, it's generally been no problem to whip something up to convert them into proper Cyrillic. (I've even written a Word macro to do a global search and replace of 8-bit values with Unicode equivalents.)
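For the curious, a byte-by-byte replacement of that sort can be sketched in a few lines of Python. This is not the Word macro, just an illustration, and it assumes Windows-1251, where bytes 0xC0 through 0xFF line up in order with the Unicode letters from U+0410 (А) to U+044F (я). The function name `legacy_to_cyrillic` is made up for this example.

```python
def legacy_to_cyrillic(data: bytes) -> str:
    """Map Windows-1251 bytes to Unicode (assumed encoding; letters only)."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))                    # plain ASCII passes through
        elif 0xC0 <= b <= 0xFF:
            out.append(chr(0x0410 + (b - 0xC0)))  # А..я sit in one contiguous run
        elif b == 0xA8:
            out.append("\u0401")                  # Ё lives outside the main run
        elif b == 0xB8:
            out.append("\u0451")                  # ё likewise
        else:
            out.append("\ufffd")                  # anything else: not handled here
    return "".join(out)


print(legacy_to_cyrillic("Привет, мир".encode("cp1251")))
```

In practice Python's built-in `bytes.decode("cp1251")` does the whole job; the loop only shows why a search-and-replace table is enough.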

It's been no problem until now, that is.

I've just gotten two small files that look like legacy files, except that the garbled Latin characters are actually stored as UTF-8.

Stated another way, it's as if someone started with 8-bit Cyrillic, displayed it using a Latin font, and then saved the result as Unicode! I'll have to figure out a way to make sense of it, and stat (as the translation is due tomorrow).
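If that guess about the files' history is right, one way to undo the damage is to run the same two steps in reverse. The Python sketch below assumes (as a guess, since the files don't say) that the original bytes were Windows-1251 and that they were misread as Windows-1252; the function name `repair` is invented for the example.

```python
def repair(text: str) -> str:
    """Undo 'Cyrillic bytes read as Latin, then saved as Unicode'.

    Assumes cp1251 was the real encoding and cp1252 the mistaken one.
    """
    original_bytes = text.encode("cp1252")   # recover the original 8-bit bytes
    return original_bytes.decode("cp1251")   # read them as Cyrillic this time


garbled = "Àäàïòàöèÿ ê öèôðîâîìó çâó÷àíèþ"
print(repair(garbled))
```

If the text contains a character that doesn't exist in Windows-1252, `encode` raises a `UnicodeEncodeError`, which is a useful hint that the assumed code pages are wrong for that file.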

Apropos of which, I have been suffering from intermittent cerebral flatulence these past few days. Two days ago, I accepted a job due on the 15th, thinking that the 15th is tomorrow, which it isn't. Fortunately, I managed to catch the error before blowing a deadline, but now I'm looking at too much work due too soon, and a time-consuming simulation looms within the next two hours.

I hope to find some time to outline a method for handling the "faux" UTF encoding during the sim.

Cheers...

UPDATE: The mapping from 8-bit Latin to Unicode Cyrillic is pretty straightforward, so much so that, as I mentioned, I wrote a macro a while back to do this tedious job in Word. As it turns out, there was a very simple fix to modify the macro so that it mapped Unicode Latin to Unicode Cyrillic.

The original macro had a line in it that read: