Wildcards are my friends...
Apr. 29th, 2008 06:52 pm
Looking back, it appears to me that I have yet to run across a document in which the section typically titled "List of Abbreviations" is anywhere near complete.
And that's when there actually is an attempt to compile such a section.
So right now, I've got some abbreviations that do not seem to appear at all in expanded form in the document before me.
One solution, of course, is to transliterate such an abbreviation and leave it alone, but this leaves an uncomfortable thorn in my side. This got me to thinking: why not try a wildcard approach?
Consider the situation where a document rambles on and on about something called СД, without actually telling you what it is. It is more than likely that the abbreviation represents two words, separated by a space, where the first word starts with... you get the idea.
So let's go look for the start of a word, followed by the first letter of the abbreviation, followed by one or more characters that are not a space, followed by a space, etc.:
<с[! ]@ д[! ]@>
Say, this works pretty well!
This approach will not catch what may be described as a "pathological" abbreviation, where spaces may be missing (e.g., ethylenediaminetetraacetic acid, which is commonly abbreviated EDTA), or some other strangeness is extant, but it works fairly well for run-of-the-text stuff.
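For anyone working outside a word processor, roughly the same search can be sketched in Python. The wildcard pattern above looks like Microsoft Word's wildcard syntax; the regular expression below is only my approximation of it, not an exact equivalent. The function names and the sample sentence are made up for illustration, and ignoring letter case is an extra assumption on my part (the original pattern spells out lowercase letters).

```python
import re
from collections import Counter

def expansion_pattern(abbreviation):
    """Build a regex that wants one word per letter of the abbreviation.

    Each word starts with the given letter and continues with one or more
    non-whitespace characters; the words are separated by single spaces.
    """
    words = [re.escape(letter) + r"[^\s]+" for letter in abbreviation]
    # \b plays the role of the start-of-word and end-of-word markers.
    return re.compile(r"\b" + " ".join(words) + r"\b", re.IGNORECASE)

def candidate_expansions(abbreviation, text):
    """Count each phrase in the text that could be the expanded form."""
    pattern = expansion_pattern(abbreviation)
    return Counter(match.group(0).lower() for match in pattern.finditer(text))

# Hypothetical sample text, invented purely to exercise the functions.
sample = "Система диагностики включена. СД работает, пока система диагностики исправна."
print(candidate_expansions("СД", sample).most_common())
```

Counting the hits is a small convenience: a phrase that keeps turning up is a better bet than a one-off. The same limitation applies here as above, so a run-together expansion such as the one behind EDTA yields no candidates at all.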
Cheers...