Scanning redux...
Aug. 10th, 2005 11:37 am

I managed to finish scanning the 445 pages of the duplicate dictionary I had at home. Yesterday, during a break in the action here at 92A-50, I set FineReader to the task of recognizing the text.
I figure the scanning took me about 3.75 hours (assuming 30 seconds per page), spread over nearly a week. The OCR took about 5 minutes. The result is a nearly 2 MB text file that will require some significant processing, unless I write some Perl scripts to automate the work.
Among the tasks to be done is replacing all occurrences of the tildes (used to save space in entries that make multiple references to the entry's headword). I also notice that the punctuation within entries varies all over the map. Ultimately, the goal is a flat file of paired entries that I can later feed into a SQL database.
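The tilde-expansion pass might look something like the sketch below. It's in Python rather than the Perl I'd actually use, and it assumes a hypothetical line format of headword, tab, definition, which may not match the OCR output at all:

```python
def expand_tildes(line):
    """Replace each tilde in the definition with the entry's headword.

    Assumes (hypothetically) each line is 'headword<TAB>definition',
    with '~' standing in for the headword inside the definition.
    """
    headword, sep, definition = line.partition("\t")
    if not sep:
        return line  # no tab found: leave the line untouched
    return headword + "\t" + definition.replace("~", headword)

# e.g. '~s' in an entry for 'book' becomes 'books'
print(expand_tildes("book\ta bound volume; ~s are shelved by title"))
```

The same paired headword/definition lines would then serve as the flat file to load into the database.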
I already have my eye on the next dictionary to scan, but need to assess how much trouble it will be to process this one, and how effectively its information can be retrieved once processed.
Cheers...