alexpgp | Aug. 10th, 2005

I managed to finish scanning the 445 pages of the duplicate dictionary I had at home. Yesterday, during a break in the action here at 92A-50, I set FineReader to the task of recognizing the text.

I figure the scanning took me about 3.7 hours (445 pages at an assumed 30 seconds per page), spread over nearly a week. The OCR took about 5 minutes. The result is a nearly 2-MB text file that will require significant manual cleanup unless I write some Perl scripts to automate the process.

Among the tasks to be done is replacing every tilde with the entry's main term (the dictionary uses tildes to save space in entries that refer to the main term more than once). I also notice that the punctuation within entries is all over the map. Ultimately, the goal is to have a flat file of paired entries that I can later feed into a SQL database.
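For what it's worth, here is a minimal sketch of the tilde expansion and a first pass at the punctuation, written in Python rather than the Perl I have in mind. It assumes things I have not yet verified against the OCR output: that each entry sits on one line, that the main term is the first whitespace-delimited word, and that the tilde survives recognition as a plain `~`. The function name `to_pair` and the sample entry are made up for illustration.

```python
import re


def to_pair(line):
    """Split one dictionary line into (headword, body).

    Assumptions (hypothetical, not checked against the real file):
    the headword is the first word on the line, and '~' stands for it.
    """
    headword, _, body = line.strip().partition(" ")
    # Put the headword back wherever the dictionary abbreviated it.
    body = body.replace("~", headword)
    # Tidy the spacing around commas and semicolons: none before, one after.
    body = re.sub(r"\s*([;,])\s*", r"\1 ", body).strip()
    return headword, body


if __name__ == "__main__":
    # Invented sample entry, not taken from the scanned dictionary.
    head, body = to_pair("run to ~ away ;to ~ out")
    # One tab-separated pair per line is an easy flat-file shape to load later.
    print(head + "\t" + body)
```

Multi-word main terms and tildes that FineReader misreads as some other character would break this, so the real script will need a smarter way to find where the main term ends.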

I already have my eye on the next dictionary to scan, but first I need to assess how much trouble it will be to process this one, and how effectively its information can be retrieved once processed.

Cheers...

AlexPGP's Corner

Die Grenzen meiner Sprache bedeuten die Grenzen meiner Welt. ("The limits of my language mean the limits of my world.")

Scanning redux...
