alexpgp: (Default)
My client sent me a huge (50 MB) Word file that, frankly, can only be transmitted using exceptional means (e.g., Dropbox, Skype) because few, if any, email systems will allow attachments of that size. My task is not to translate the monster, but to reduce the size of the file.

My first step involved doing a search-and-replace on the graphics, searching for ^g and replacing with nothing. The resulting file was less than 2 MB in size, so it's a safe bet that reducing the size of the graphics (in the sense of data volume, as opposed to display dimensions) is called for.
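
The same check can be sketched without deleting anything, assuming the file is in the newer .docx format (the post does not say which format the client used). A .docx is a zip archive that keeps embedded pictures under word/media/, so a few lines of Python can report how much of the archive the graphics occupy. The function name media_share is made up for this illustration.

```python
import io
import zipfile


def media_share(docx_bytes: bytes) -> tuple[int, int]:
    """Return (bytes taken by embedded media, bytes taken by all parts).

    Sizes are the compressed sizes stored in the archive, since those are
    what determine how big the .docx is on disk.
    """
    media = 0
    total = 0
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for info in archive.infolist():
            total += info.compress_size
            # Assumption: pictures live under word/media/, as in a typical .docx.
            if info.filename.startswith("word/media/"):
                media += info.compress_size
    return media, total
```

Feeding it the bytes of the real file (read in binary mode) gives two numbers to compare; if the first is nearly as large as the second, the graphics are the problem.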

I did not quite write a macro to do this, but I think one could be written with little trouble. The general algorithm would be:
  1. Make sure ImageMagick is installed with the legacy convert option enabled.
  2. In Word, find the next graphic.
  3. Cut the graphic to the clipboard.
  4. From a DOS command line, in a known directory (say, C:\temp) run the command convert clipboard:myimage image.png.
  5. In Word, click on Insert|Pictures, navigate to the known directory from step 4 (this only needs to be done the first time), select image.png and click on Insert.
  6. Unless all of the graphics have been replaced, go to step 2 and repeat the subsequent steps.

The final product was a hair over 6 MB in size, which is still large enough to make some email systems gag, but completely usable elsewhere.

alexpgp: (St. Jerome w/ computer)
As it turns out, I have a client who really, really does not like to see justified text in the documents I send him. This used to eat a certain amount of time during the post-translation phase, as the folks who put together the source document mix justified willy-nilly with left-aligned, with centered, as well as a dash of right-alignment to round things off.

So it was only natural for me to want to write a Word macro that, basically, goes through each paragraph within a document and changes only those with justified text to display as left-aligned. Check it out:
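Sub ChangeJustifiedToLeft()

    ' Change every justified paragraph in the active document to left-aligned.
    Dim oSource As Document
    Dim i As Long, j As Long

    Set oSource = ActiveDocument

    ' Read the paragraph count once and reuse it inside the loop.
    j = oSource.Paragraphs.Count

    For i = 1 To j
        ' Show progress in the status bar.
        StatusBar = "Processing " & i & " of " & j
        If oSource.Paragraphs(i).Format.Alignment = wdAlignParagraphJustify Then
            oSource.Paragraphs(i).Format.Alignment = wdAlignParagraphLeft
        End If
    Next i

End Sub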

The StatusBar line is there to provide some visual feedback as the macro works its way through a file. Execution is not objectionably slow, but it's far from blazing fast. (Assigning the paragraph count to the variable j is done to keep things from slowing down even further by having to call oSource.Paragraphs.Count for each value of i.)
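
Outside Word, the same change can be sketched against the XML inside a .docx file. This is only a Python illustration, not a replacement for the macro. It assumes the document is a .docx whose paragraphs are justified by direct formatting, which is stored as <w:jc w:val="both"/>. Paragraphs that inherit justification from a style would not be touched. The function name is made up for the example.

```python
import re

# Matches the alignment element of a directly justified paragraph.
JUSTIFIED = re.compile(r'(<w:jc\b[^>]*\bw:val=")both(")')


def justified_to_left(document_xml: str) -> tuple[str, int]:
    """Rewrite justified alignment to left; return (new XML, paragraphs changed)."""
    return JUSTIFIED.subn(r"\1left\2", document_xml)
```

The count that comes back plays the same role as the status bar message in the macro: it shows how much work was actually done.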

alexpgp: (St. Jerome w/ computer)
Ever since college, when it became clear to me that actually doing the homework was an important step toward getting an A or a B in a course like fluid mechanics, I've developed an informal set of tactics to turn stuff I'd rather not do into a game.

I do this now with some translations, and it generally stands me in good stead. One technique that I use is to track my progress through a document. The diagram below shows the number of words I had left to translate in a document against local time.

The gap between 10:40 am and 11:20 am occurred as I tried to deal with a small hurricane of tags in my segmented text.

To explain, Word files can contain tons of hidden information in text (this most often occurs when the file was created from a PDF). Since it's hidden, nobody generally gives a rat's tail about its existence. But when such a file is opened by a translation memory program such as memoQ, the result is pretty ugly:

All of those little gray doohickies represent some kind of instruction in the file text, such as a change in font or font size. Translation memory programs that use such doohickies (the technical term for which is "tags") pretty much require them to appear in the translation (else the translated text runs a high risk of not looking right), and you'll pardon me if I don't bore you with the million and one ways satisfying this requirement can go wrong when there are this many tags in a segment.

How does one get rid of tags? Well, there are a number of methods out there, and none that I've found are perfect. The one I like the best is a set of Word macros marketed by a fellow named Dave Turner under the name CodeZapper (a copy of which was bought and paid for by yours truly some while ago). After running the basic tag-zapping macro, the text in the above illustration turned into this:

You'll notice there are a lot fewer tags in the cleaned-up text, and while I could probably use this text as is, there were some other segments in the text that still retained a liberal quantity of tags. So I ran the heavy-duty zapping macro and got this:

Now, this is what I'm talking about!

The end result was mostly free of tags, and was a pleasure to translate.

P.S. For those impatient to know what kind of fascinatin' stuff it is I translate, here's the English translation:
The unit has a two-cylinder, four-stroke Briggs & Stratton engine, rated at 18 hp. The average fuel consumption (using unleaded gasoline) is 5.5 liters/hr.

alexpgp: (Visa)
So yesterday, I noticed that when I do a file-save-as in Word and then click on one of the listed folders, there's a pretty good chance that the save-as dialog box that pops up after selecting a folder isn't pointing at the correct folder.

I'd try to let Microsoft know about this, except for the following bit of intelligence:

Clicking on "Search all of" obtained a bunch of links that, by inspection, had nothing to do with reporting a bug.

Gr-r-r-rinning and bearing it!

alexpgp: (Visa)
I absolutely adore my ScanSnap scanner, except for one small peeve: the keyboard shortcut that calls up the unit's organizer software (Ctrl+Shift+Space) is the same key chord that inserts a nonbreaking space in Microsoft Word, the software I use to put bread on the table.

One easy fix to this problem would be to simply not have the organizer software running when I'm not actively scanning something, except doing so does away with much of the convenience of having the scanner available at all times.

I undertook a more active approach, and found some information on the Internet that purported to fix the problem by disabling the ScanSnap software's key chord. This was okay by me, since my principal way of activating the software consisted of putting something through the scanner. Unfortunately, none of the advice I found online actually, like, worked (which I increasingly suspect may be a problem with Windows 8, but that's a separate rant).

It then occurred to me that Word lets one assign (and remove) keystrokes to various actions. The sequence of steps is:
  1. On the keyboard, press Alt+I and then S. This brings up the symbol insertion dialog box. (Since I insert a lot of symbols, I learned to do this with the keyboard instead of the mouse. I'm sure it's pretty easy to do with a mouse, I just don't know how. :^)
  2. Click on the Special Characters tab.
  3. Click on the line for the Nonbreaking Space.
  4. Click on the Shortcut Key... button.
  5. With the cursor in the Press new shortcut key field (it's there when the dialog box comes up), press your desired key chord. This will display a text description of the chord in the field and a line starting Currently assigned to will appear under the Current keys list box, to warn you if your chord already does something else.
  6. If you like your new shortcut key chord, you can keep your new shortcut key chord by clicking on the Assign button. Doing so will override any action performed previously by the key chord. You can also get rid of the old shortcut key by highlighting it in the Current keys list box and clicking on the Remove button.
  7. Click on the Close button.
See? Easy peasy!

In any event, I've restored the balance on my machine, allowing the ScanSnap software and the "insert a nonbreaking space" function in my copy of Word to coexist.

alexpgp: (St Jerome a)
The OCR output for the document I'm working on right now is behaving very strangely in terms of font formatting.

For reasons best known to my computer's CPUs, in certain places, the OCR detected bold face font where there is clearly no bolding used in the original document. Editing the OCR output would normally not be a problem in Microsoft Word, because it's a simple enough matter to do a Select-All, then hit Ctrl-B a couple of times, where the first keychord takes a mixed string, such as
The rain in Spain falls mainly in the plain.
and makes it all bold, like this:
The rain in Spain falls mainly in the plain.
The second Ctrl-B just turns off the bold-face, and with that, the desired goal is reached, i.e.,
The rain in Spain falls mainly in the plain

However, what I'm running into is some kind of additional weirdness. Starting with that same selection of mixed bold and non-bold characters:
The rain in Spain falls mainly in the plain.
pressing Ctrl-B the first time results in the following:
The rain in Spain falls mainly in the plain.
In effect, it would appear that the bold property is being toggled for each individual character. Pressing Ctrl-B again gets me the original bold/not-bold selection (in effect, again toggling bold at the character level, not the selection level). Apparently, the only way to make the whole selection not bold is to highlight each individual string of bold characters and do whatever it takes (e.g., press Ctrl-B) to unbold it.

This is very strange. Anyone else ever experience behavior like this? (More to the point, has anyone figured out a way around it?)

alexpgp: (St Jerome a)
Every once in a while, your friendly, neighborhood translator is gobsmacked by an abbreviation that shows up in a document seemingly out of nowhere and, naturally, demands attention.

Take, for example, the Russian abbreviation "САР" (—please!) in a document I'm working on. To my credit, I realize there's a better than even chance that the last two letters stand for анализ риска (risk analysis), but without knowing what the first letter expands to, I may as well just transliterate the abbreviation (SAR) and move on, as it were.

That is, except for one sturdy little straw that's available for the grasping, involving a search using wildcards. Consider the following string:
[а-я]@
In Microsoft Word's variant of wildcard code, this means "one or more occurrences of any lower-case letter between 'а' and 'я'." If one tacks the character 'с' to the front, like this:
с[а-я]@
followed by a space, performing a search will find every instance of a word of at least two letters whose first letter is 'с'. Continuing with this logic,
с[а-я]@ а[а-я]@ р[а-я]@
will find three consecutive words, of two or more letters each, that begin with 'с', 'а', and 'р', respectively (I use lower case because Russian is generally pretty sparing when it comes to capitalizing words).

I hit paydirt with the second successful "find":
системный анализ риска
or "system risk analysis."

There are times this technique will not work, but it's almost always worth a try when you're up against it.
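
For anyone who would rather run this kind of search outside Word, here is a rough Python equivalent. It is a sketch, not Word's own syntax: Python writes "one or more" as + where Word's wildcard code uses @, and the function name expansion_candidates is invented for the example. Like the Word pattern, the range а-я does not include the letter ё.

```python
import re


def expansion_candidates(text: str, abbreviation: str) -> list[str]:
    """Find runs of consecutive words whose first letters spell the abbreviation.

    Each word must be lower-case Cyrillic and at least two letters long,
    mirroring the Word wildcard pattern in the post.
    """
    words = [re.escape(letter) + "[а-я]+" for letter in abbreviation.lower()]
    pattern = re.compile(r"\b" + " ".join(words) + r"\b")
    return pattern.findall(text)


sample = "Был выполнен системный анализ риска для установки."
print(expansion_candidates(sample, "САР"))
```

The sample sentence is made up; on a real job the text would come from the source document itself.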

alexpgp: (Default)
...or the next best thing: recovering original text from a Word file segmented in Wordfast.

Normally, this is done by simply grabbing the original file, but in this case, the client sent a table with two columns, one containing the Russian original and a blank column for the English translation. In the middle of work, I found a significant omission in the original, typed the omission, but forgot to save a "pristine" copy before translating it.

That probably makes no sense, but here are the keys to the kingdom anyway:
Find what: \{0\>(*)\<*\{\>(*)\<0\}
(Use wildcards)
Replace with: \1
(Not Hidden)
Technically, the wildcarded string within the last set of parens is not necessary, but it takes the translation and makes it available as '\2', should it be needed (to, say, quickly "clean" a segmented file without using Wordfast).
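
The same find-and-replace can be sketched in Python for text that has already been pulled out of the Word file. The delimiters are taken from the search string above. The sample segment, including the "}100" between source and target, is an assumption about what a segmented pair looks like, and the names are invented for the example.

```python
import re

# Group 1 is the source text, group 2 the translation. Whatever sits between
# "<" and "{>" is skipped, just as the bare * does in the Word pattern.
SEGMENT = re.compile(r"\{0>(.*?)<.*?\{>(.*?)<0\}", re.DOTALL)


def recover_source(segmented: str) -> str:
    """Replace each segmented pair with its source text (Word's \\1)."""
    return SEGMENT.sub(r"\1", segmented)


def recover_target(segmented: str) -> str:
    """Replace each segmented pair with its translation (Word's \\2)."""
    return SEGMENT.sub(r"\2", segmented)


sample = "{0>Привет, мир.<}100{>Hello, world.<0}"
print(recover_source(sample))
print(recover_target(sample))
```

As with the Word version, a source segment that itself contains a "<" would confuse the pattern, so the result deserves a quick look before it is trusted.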

alexpgp: (Default)
It occurs to me that the { DATE } field in Word documents isn't tremendously useful, since it seems to always show the current date.

One might argue that it would be useful to show a date, on paper, on which a document was printed, but then you'd never be able to tell what that date was unless you looked at the paper (in other words, the file would be of no help).

Better to use, perhaps:
  • { CREATEDATE } for the date on which the document was created;
  • { PRINTDATE } for the date on which the document was last printed;
  • { SAVEDATE } for the date on which the document was last saved.

alexpgp: (Default)
I was simply too tired last night to explain why the "- 1" was important in the macro line
myRange.End = myRangeEnd - 1
but not decrementing the value ate a lot of time, so hopefully, this post will reinforce the knowledge gained. (Can you tell I don't want to settle down and translate?)

By the time we get to this point in the code, we've selected the contents of the table cell we're working with and assigned them to a range called myRange. We've also saved the value of the end point in a variable called myRangeEnd.

If the pattern we're looking for is found, the next step after copying the pattern is to shorten the range to exclude the found pattern. To do this, I wrote code to move the start of the range over by the length of the pattern and then set the end of the range to the original end value.

Schematically, the initial range looks like:

it's time for {123}all{124} good men and women to party

After the first pattern is found, the range is:

it's time for {123}all{124} good men and women to party

Then we move the start point "over" the end point:

it's time for {123}all{124} good men and women to party

And we restore the original end point:

it's time for {123}all{124} good men and women to party

Only after instrumenting the code to show my selection did I realize what that did. If you work much with Word tables, have you ever noticed what happens if you select text in a cell and go one character past the end? The contents of the entire cell become selected!

That's what was killing me yesterday: every time the code was executed, the range being examined was being reset to the original range, and I couldn't figure out why.

Extending the end point to one character shy of "the end," i.e., like this:

it's time for {123}all{124} good men and women to party

solved the problem. The next time the pattern is searched for, "{124}" will be found (even if it's at the end of the text).

Apropos of which, the macro works like a charm and actually makes it much easier to deal with the source text. It was a good investment in time.
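
The bookkeeping described above can be imitated in a few lines of Python. This is only a model of the logic, not the macro itself: the end-of-cell marker is simplified to a single stand-in character, the tag pattern is assumed to be digits in braces, and all of the names are invented for the example.

```python
import re

TAG = re.compile(r"\{\d+\}")
CELL_MARK = "\x07"  # stand-in for Word's end-of-cell marker (simplified)


def collect_tags(cell: str) -> list[str]:
    """Collect every {nnn} tag by searching a range that shrinks from the left."""
    range_end = len(cell)          # like myRangeEnd: end of the cell, marker included
    start, end = 0, range_end - 1  # the "- 1" keeps the marker out of the range
    found = []
    while True:
        match = TAG.search(cell, start, end)
        if match is None:
            return found
        found.append(match.group())
        start = match.end()        # move the start past the tag just found
        end = range_end - 1        # restore the end, still one short of the marker


cell = "it's time for {123}all{124} good men and women to party" + CELL_MARK
print(collect_tags(cell))
```

Python has no notion of "select one character too many and get the whole cell," so the model cannot reproduce the bug. It only shows where the "- 1" sits in the loop and that a tag at the very end of the text is still found.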


UPDATE: Wordfast has this concept of a "placeable," which basically is any source text string that's not translatable, such as document designations (e.g., the '50578' in "SSP 50578"). As I found out by accident, strings such as "{123}" are treated as placeables as well, and since Wordfast has hotkeys for navigating and copying placeables, my macro sort of reinvented the wheel. Still, I don't regret the time spent developing the code.

