Aug. 10th, 2005

alexpgp: (Computing)
I managed to finish scanning the 445 pages of the duplicate dictionary I had at home. Yesterday, during a break in the action here at 92A-50, I set FineReader to the task of recognizing the text.

I figure the scanning took me about 3.7 hours (445 pages at an assumed 30 seconds per page), spread over nearly a week. The OCR took about 5 minutes. The result is a nearly 2-MB text file that will require significant manual cleanup unless I write some Perl scripts to automate the process.

Among the tasks to be done is to replace every tilde with the entry's main term (the tildes are used to save space in entries that refer to the main term more than once). I also notice that the punctuation within entries varies all over the map. Ultimately, the goal is to have a flat file of paired entries that I can later feed into a SQL database.
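As a rough idea of what the tilde cleanup might look like, here is a minimal sketch in Python (not the Perl I have in mind, just an illustration). It rests on assumptions I have not checked against the actual OCR output: that each entry sits on one line, and that the main term is the first whitespace-delimited word. The function names are made up for this example.

```python
import re

def expand_tildes(entry: str) -> tuple[str, str]:
    """Split one entry into (headword, body) and expand '~' in the body.

    Assumption: the headword is the first whitespace-delimited word on the line.
    """
    headword, _, body = entry.strip().partition(" ")
    # Put the main term back wherever the dictionary abbreviated it with a tilde.
    body = body.replace("~", headword)
    # Collapse runs of whitespace left over from OCR.
    body = re.sub(r"\s+", " ", body).strip()
    return headword, body

def to_flat_line(entry: str) -> str:
    """Return the entry as a tab-separated pair, ready for a flat file."""
    headword, body = expand_tildes(entry)
    return f"{headword}\t{body}"
```

Feeding it a made-up line such as `valve a device;  safety ~ for relief` should give back the headword `valve` paired with `a device; safety valve for relief`. Multi-word headwords and the inconsistent punctuation would need more work than this.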

I already have my eye on the next dictionary to scan, but need to assess how much trouble it will be to process this one, and how effectively its information can be retrieved once processed.

Cheers...
