Wednesday, May 6, 2009

Regular expressions

I believe many translators already know how useful regular expressions are. They're not easy, and I'm pretty sure not that many translators use them in practice, but they (regexes) do a really great job.

But they're a pain as well. How useful they are depends considerably on the regex implementation in a CAT tool (or TEnT, which stands for Translation Environment Tool and is a pretty fashionable term now), and so far I have dealt with two such implementations. One was in MemoQ, and what I really liked about it was the possibility to specify exceptions for each particular regex. I consider that implementation very flexible, but as far as I can see, they haven't built regexes into the QA process, which is where I mostly use them.

The other implementation is in Trados QA Checker 2.0, where I use regexes often. What I really liked about it was that when I asked the tool developer to support passing variables between source and target segments, he did it. At the time it was extremely important for me, and it gave me the most flexible regex-enabled QA tool I had tried.
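
To show what I mean by passing variables between source and target, here is a rough Python sketch. It is not how QA Checker does it internally; the function name and the number pattern are my own assumptions, picked only for illustration. The idea is that whatever the regex captures in the source segment becomes a value the target segment is checked against.

```python
import re

# Assumed pattern: integers and simple decimals such as 25, 2.5 or 2,5.
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")

def missing_numbers(source, target):
    """Return numbers captured in the source that do not appear in the target."""
    target_numbers = set(NUMBER.findall(target))
    # Each value captured in the source acts as a "variable" looked up in the target.
    return [n for n in NUMBER.findall(source) if n not in target_numbers]

source = "Tighten 4 bolts to 25 Nm."
target = "Ziehen Sie 4 Schrauben mit 52 Nm an."
print(missing_numbers(source, target))  # ['25']
```

Here the translator swapped two digits, so the value captured in the source is not found in the target and gets reported.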

However, when we put regexes into "mass production" at the office and tried to describe all common errors with regexes, we ran into big trouble. Take checking that a word is not accidentally capitalized after a comma: what a good idea. Translators often change punctuation marks, often decide to join two sentences into one, and may easily forget to change the case of the next letter. But if you consider how many capitalized letters after a comma really are allowed (all proper names, at least), you'll understand this regex is gonna be a headache, especially in a technical manual.
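
A minimal Python sketch of such a check shows the problem. The pattern and the sample sentences are made up for illustration, and the pattern only covers unaccented Latin capitals, which is an assumption; a real rule would have to match the target language's alphabet.

```python
import re

# Assumed rule: a comma, some whitespace, then a word starting with A-Z.
CAPITAL_AFTER_COMMA = re.compile(r",\s+([A-Z]\w*)")

def capital_after_comma(segment):
    """Return every word that starts with a capital right after a comma."""
    return CAPITAL_AFTER_COMMA.findall(segment)

real_error = "Open the valve, Then close the lid."
false_alarm = "After the update, Windows will restart."

print(capital_after_comma(real_error))   # ['Then']
print(capital_after_comma(false_alarm))  # ['Windows']
```

The regex cannot tell the two cases apart: both are reported, and only the first one is an error.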

In fact, our engineers created a lot of regexes that were all meant to make life easier for QA people. Each of them had its downside and eventually generated a lot of noise. It certainly doesn't mean that regexes are bad. It doesn't even mean that our engineers didn't think things through (although sometimes that was the case). It simply means that the implementation needs to be more flexible.

For me, the minimum requirements at the moment are:

- variable support (including the possibility to pass variables from the source regex to the target regex and vice versa)
- exceptions to regexes
- branching support (if-then-else).
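
To make the list less abstract, here is a hedged Python sketch of how the three requirements might work together on the capital-after-comma check. Everything in it is hypothetical: the exception list, the function name and the sample segments are my own, and no existing tool is claimed to work this way.

```python
import re

# Assumed rule: a comma, whitespace, then a word starting with A-Z.
CAPITAL_AFTER_COMMA = re.compile(r",\s+([A-Z]\w*)")

# Hypothetical per-regex exception list (requirement 2).
EXCEPTIONS = {"Windows", "Trados"}

def check_segment(source, target):
    """Return capitalized words after a comma in the target that look like errors."""
    # Words from the source are passed to the target check (requirement 1).
    source_words = set(re.findall(r"\w+", source))
    warnings = []
    for word in CAPITAL_AFTER_COMMA.findall(target):
        # if-then-else branching (requirement 3)
        if word in EXCEPTIONS:
            continue  # listed exception: never report
        elif word in source_words:
            continue  # same capitalized word in the source: probably a name
        else:
            warnings.append(word)
    return warnings

print(check_segment("Connect the cable and start Acme Viewer.",
                    "Connect the cable, Then start Acme Viewer."))  # ['Then']
print(check_segment("Acme starts when ready.",
                    "When ready, Acme starts."))                    # []
```

Even a toy rule like this needs all three features at once, which is why I list them as the minimum.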

And the next step, I guess, is gonna be a scripting language that will let you write short scripts using regexes, maybe even a lite version of Perl. Oh, I wish... :)