Spellcheck nightmare...
Apr. 3rd, 2006 02:53 pm
The recently installed PDF package seemed to arrive at just the right time, as I had received a fairly large PDF (~50 pp.) that could not be read by FineReader. What's more, the new package claimed to allow the user to save the PDF in a variety of formats, including Word .doc files.
This sort of hinted at the ability to do OCR (although at this point, there's nothing "optical" in the technology, but I digress...). That ability is there, sort of.
The "clean" pages look pretty good, although both FineReader and the new program (ScanSoft PDF Professional) have a nasty habit of trying to create formatting that isn't there and that can't be edited with any ease, which basically makes the formatted result useless.
But in doing a spell check of the Russian, I started to notice that words such as "специального" would be highlighted as incorrect, with "специального" offered as a replacement.
No, your eyes do not deceive you; the words appear identical. This kind of behavior is a dead giveaway of a phenomenon where one or more of the homoform letters (e.g., "a", "c", "e", and some others) in a word are in the "other" language, i.e., a Latin character is sitting where a Cyrillic one belongs.
I quickly cobbled together a macro to highlight all such letters and was not pleased to see the result. Somewhere in the recognition algorithm, there needs to be a step whose simplified pseudocode might read: "If the immediately preceding character is in language X, and the immediately following character is in language X, the character being processed is in language X."
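For the curious, that neighbor rule can be sketched in a few lines of Python. This is not the macro mentioned above (which isn't shown here), just a minimal illustration under some assumptions: the look-alike table is a partial, hand-picked list, the function names are made up for the example, and the letters are written as Unicode escapes because the two alphabets can't be told apart by eye.

```python
import re
import unicodedata

# Assumed, non-exhaustive table: Latin letter -> look-alike Cyrillic letter.
LOOKALIKES = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
    "A": "\u0410", "B": "\u0412", "C": "\u0421", "E": "\u0415",
    "H": "\u041d", "K": "\u041a", "M": "\u041c", "O": "\u041e",
    "P": "\u0420", "T": "\u0422", "X": "\u0425",
}


def script_of(ch):
    """Return 'latin', 'cyrillic', or None, based on the Unicode character name."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "cyrillic"
    if name.startswith("LATIN"):
        return "latin"
    return None


def mixed_script_words(text):
    """List the words that contain both Latin and Cyrillic letters."""
    found = []
    for word in re.findall(r"\w+", text):
        scripts = {script_of(ch) for ch in word} - {None}
        if len(scripts) > 1:
            found.append(word)
    return found


def fix_by_neighbors(text):
    """Swap a Latin look-alike for its Cyrillic twin when both neighbors are Cyrillic."""
    chars = list(text)
    for i in range(1, len(chars) - 1):
        ch = chars[i]
        if (ch in LOOKALIKES
                and script_of(chars[i - 1]) == "cyrillic"
                and script_of(chars[i + 1]) == "cyrillic"):
            chars[i] = LOOKALIKES[ch]
    return "".join(chars)
```

Taken literally, the rule only helps letters in the middle of a word; a stray Latin letter at the start or end of a word has only one Cyrillic neighbor and is left alone, so a real fix would need something smarter at the word boundaries.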
This robotics text is turning out to be a real bear, OCR notwithstanding.
Cheers...
UPDATE: The damage is worse than I thought. Almost all instances of the Russian "В" ("Veh") were rendered as "B" (as in Bravo), and several Russian letters "З" ("Zeh") were rendered as "3" (three). Of course, I suppose having this kind of result is better than no result at all.
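A rough Python sketch of how the "B" and "3" cases might be patched up after the fact follows. It assumes a simple word-level rule (only touch a word that already contains at least one Cyrillic letter), and the table and function names are hypothetical, invented for this example.

```python
import re

# Assumed table: character the OCR produced -> Cyrillic letter it probably meant.
OCR_TO_CYRILLIC = {"B": "\u0412", "3": "\u0417"}


def is_cyrillic(ch):
    """True for characters in the basic Cyrillic block (U+0400 to U+04FF)."""
    return "\u0400" <= ch <= "\u04ff"


def repair_token(token):
    """Swap look-alikes, but only inside a word that also has Cyrillic letters."""
    if not any(is_cyrillic(ch) for ch in token):
        return token  # leave Latin words and plain numbers alone
    return "".join(OCR_TO_CYRILLIC.get(ch, ch) for ch in token)


def repair_text(text):
    """Apply repair_token to every word-like run of characters in the text."""
    return re.sub(r"\w+", lambda m: repair_token(m.group()), text)
```

This would not catch a "B" standing alone as a one-letter word, since there is nothing Cyrillic inside the token to go by, and it would wrongly rewrite a genuine digit glued to Cyrillic letters, so the output still needs a human eye.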