The neverending search...
Sep. 27th, 2000 11:13 pm

As I dropped off to sleep last night, I started to dream about the various ways one could go about searching a series of LiveJournal posts. I recalled an approach from a book I read that suggested indexing the content in such a way that only relatively unique words enter into the index, while common words are excluded. I did a pass early this morning through the content of my posts and found that about half of the "words" in my posts are unique, and something like 70% occur six times or fewer. Another approach involves a database (e.g., MySQL), but I don't really have a lot of time to devote to coming up to speed in that area right now.
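A frequency pass like the one described might look something like this in Perl (the sub name, the sample text, and treating each run of letters as a "word" are my assumptions, not the actual script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count how often each lower-cased run of letters occurs in $text,
# returning a hash of word => occurrence count.
sub word_counts {
    my ($text) = @_;
    my %count;
    $count{ lc $_ }++ for $text =~ /([A-Za-z]+)/g;
    return %count;
}

my $sample   = "the cat and the dog and the bird";
my %count    = word_counts($sample);
my $distinct = keys %count;                       # number of distinct words
my $once     = grep { $_ == 1 } values %count;    # words appearing exactly once

printf "%d distinct words, %d of them unique\n", $distinct, $once;
# prints "5 distinct words, 3 of them unique"
```

Run over the whole posts file instead of a sample string, the same two tallies give the "half are unique" and "70% occur six times or fewer" figures.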
I eventually dismissed everything except the following approach.
First, write a Perl script that reformats a LiveJournal download file into a file that contains all the information regarding a particular post on a single line. Second, write a CGI script that accepts a query from a Web page and searches each line for it, outputting each line whose content contains the query string. Of course, you've got to write the HTML page containing the form, but that's old hat by now.
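The second step might be sketched like this (the data file name `posts.txt`, the parameter name `q`, and the helper name are my guesses, not the actual script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the elements of @lines that contain $query (case-insensitive,
# with regex metacharacters in the query treated literally).
sub matching_lines {
    my ($query, @lines) = @_;
    return grep { /\Q$query\E/i } @lines;
}

# When run as a CGI with the one-post-per-line data file present,
# answer the request.
if (defined $ENV{QUERY_STRING} and -r 'posts.txt') {
    my ($query) = $ENV{QUERY_STRING} =~ /q=([^&]*)/;
    $query = '' unless defined $query;
    $query =~ tr/+/ /;                  # undo form encoding of spaces
    print "Content-type: text/html\n\n";
    open my $fh, '<', 'posts.txt' or die "can't open posts.txt: $!";
    print "<p>$_</p>\n" for matching_lines($query, <$fh>);
    close $fh;
}
```

Since each line carries everything about one post, a hit on a line is a hit on a post, and printing the line is printing the result.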
So when I was asked today to review a document that's used to construct a Russian version of the 3A mission timeline, my mind naturally turned to Perl to find a solution.
At issue with this document is the fact that the timeline is a very compact document, as it summarizes what each member of the crew is doing at any given time. Some activities take a long time (for example, sleep), and thus there is a lot of room on the page to place a description. No problem there.
It is the activities that have 5 or 10 minutes allocated to them that are so very challenging to label. In such cases, there are only 5 or 6 characters available for the description (and spaces count as characters), and abbreviations tend to look a bit unnatural squinched in like that (knw wt I mn, jlybn?). We've done several passes through the timeline at the client's site over the past few months, so there's also a chance that some of these ad hoc abbreviations are inconsistent among themselves, as well. (That's why we're checking them now! :^)
With Perl running through my mind most of the morning, it turned out to be pretty easy to whip together the necessary code to do the CGI script (I wrote a pseudocode shell during lunch). The challenge was to design the code that takes what you get by downloading your journal and outputs one post per line. That took the rest of my lunch hour and I finished it at home.
The whole structure is still a little wobbly, particularly in the formatting department. Also, since the data file is the result of filtering the download file with a Perl script, the "freshness" of the data being searched depends on how often I download my LiveJournal, run the filter, and upload the result to my server. For now, I plan to do updates once per week, which ought to suffice. Finally, one of my design decisions was to have the search ignore anything but upper- and lower-case letters, so I don't have to worry about numbers, punctuation, hyphenation, etc. (Actually, the last step was to insert a link to the search page on my LiveJournal, and that's done, too. It'll do, for now.)
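The letters-only decision boils down to stripping everything else from both sides before comparing. A minimal sketch (the sub names are mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reduce $text to lower-case letters only: digits, punctuation,
# hyphens, and whitespace all disappear before matching.
sub letters_only {
    my ($text) = @_;
    $text =~ tr/A-Za-z//cd;   # delete every character that isn't a letter
    return lc $text;
}

# True if $line contains $query once both are reduced to letters.
sub line_matches {
    my ($query, $line) = @_;
    return index(letters_only($line), letters_only($query)) >= 0;
}

print line_matches("web page", "my Web-page, updated") ? "match\n" : "no match\n";
# prints "match"
```

One side effect of deleting rather than normalizing: word boundaries vanish too, so "web page" matches "Web-page" and "webpage" alike, which is exactly the hyphenation-proofing described above.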
The rest of the day at work passed in a blur. I recall holding an all-hands meeting to congratulate the gang that worked on a last-minute rush that arrived on Friday (and which was delivered on Sunday afternoon), and agreeing to audit a teleconference tomorrow morning, but for the most part, I had my head down at my computer all day. Looks like it'll be catch-up time tomorrow.
Cheers...