Life is a funny thing. I’ve never been someone with a distinct desire to understand the intricacies of language or characters. That’s stuff that my wife enjoys – she’s the linguist while I’ve barely mastered English. 🙂
Yet, here I am, thanks to MarcEdit, having to become much more familiar with character sets and language than I’d ever dreamed. Some of these issues are easy. Character sets, for example. There are actual solutions to the character set issues (both moving data between them and identifying them). In MarcEdit, I found a long time ago that I could appease the math geek inside of me by forgoing the creation of large translation tables and coming up with mathematical translations for moving data between various character sets. It’s not perfect (i.e., there are exceptions that require the use of a lookup table) – but it’s slick enough that in MarcEdit all character set translations (save CJK) are handled through a relatively simple formula with support for exceptions.
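To give a feel for what "a formula with exceptions" can look like, here is a minimal Python sketch. It is not MarcEdit's actual formula or code. It assumes ISO-8859-5 (Cyrillic) as the source character set purely for illustration, because almost all of its upper range sits at a fixed distance from the matching Unicode code points. The names `OFFSET`, `EXCEPTIONS`, `byte_to_codepoint`, and `decode` are hypothetical.

```python
# Illustrative sketch only: ISO-8859-5 byte -> Unicode code point,
# using one arithmetic offset plus a tiny exception table.
# (Assumption: this stands in for the general idea, not MarcEdit's own mapping.)

OFFSET = 0x0360  # 0xA1 + 0x0360 == U+0401 (CYRILLIC CAPITAL LETTER IO)

# The few bytes in the upper range that do not follow the offset rule.
EXCEPTIONS = {
    0xAD: 0x00AD,  # soft hyphen
    0xF0: 0x2116,  # numero sign
    0xFD: 0x00A7,  # section sign
}


def byte_to_codepoint(b: int) -> int:
    """Map a single ISO-8859-5 byte to its Unicode code point."""
    if b in EXCEPTIONS:
        return EXCEPTIONS[b]   # lookup table for the oddballs
    if b <= 0xA0:
        return b               # ASCII, C1 controls, and NBSP map to themselves
    return b + OFFSET          # everything else is plain arithmetic


def decode(data: bytes) -> str:
    """Decode a byte string by applying the formula to every byte."""
    return "".join(chr(byte_to_codepoint(b)) for b in data)


if __name__ == "__main__":
    sample = bytes([0xBF, 0xE0, 0xD8, 0xD2, 0xD5, 0xE2])
    print(decode(sample))  # Привет
```

The appeal is that the exception table stays at three entries instead of a 256-row translation table. A character set with a less regular layout would need a larger table, or several offsets for different byte ranges.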
Language, on the other hand, stymies me. Lately I’ve been having conversations with Arabic and Hebrew users who would like MarcEdit to support right-to-left processing of their languages. In general, support for this is built right into the operating system through a localized implementation of the Unicode Bidirectional (BiDi) Algorithm. So, in MarcEdit, you can mark the MarcEditor window and have it shift to the OS-supported output. The problem comes from MARC and the notepad-like format MarcEdit uses for editing. Because of how the algorithm handles numbers, punctuation, and the mixing of Latin and non-Latin characters, the output presented to the user is close – but not perfect. For example, in Arabic:
This is a window from a colleague in Dubai who has been helping me wrap my head around the changes needed to make MarcEdit easier to use for the creation of non-English records.
The problem that we run into specifically has to do with numbers and words. Many of the switched elements are elements that have numbers (or groups of numbers) attached to them – which is problematic because there is a lot of numerical data in MARC.
I’ve been pounding away at this, and I’m hoping that I might have come up with a workable solution. It’s required writing a custom implementation of the Unicode Bidirectional (BiDi) Algorithm – though, I’ll admit, at this point, it’s actually a very simplified version so I’m going to need to send this to a few people to make sure that it doesn’t butcher some general assumptions. But with a little bit of work, I’ve been able to get an output that looks something like this:
Of course, since I’ve changed the output algorithm, I’ve had to create a method for reassembling the data so that it gets placed back into an order that MarcEdit can compile into MARC. I’m certain that there is still more work to be done, but the upside is that very soon, MarcEdit should be able to provide full (or at least, much closer) language support for any user working in a right-to-left language. Will I understand the whole language issue any better? Probably not – but I guess that’s as it should be.
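For the curious, here is a rough Python sketch of what a very simplified right-to-left reordering pass could look like. It is not the code in MarcEdit and it is nowhere near the full Unicode BiDi Algorithm. It assumes only three rules: Latin letters and digits stay left-to-right, neutral characters (spaces, punctuation) join a left-to-right run only when they sit between two left-to-right characters, and everything else is treated as right-to-left. The function names `_directions` and `reorder_rtl` are made up for this example.

```python
import unicodedata
from itertools import groupby

LTR_CLASSES = {"L", "EN", "AN"}  # Latin-style letters and digits (simplification)
RTL_CLASSES = {"R", "AL"}        # Hebrew and Arabic letters


def _directions(text: str) -> list:
    """Label every character 'L' or 'R', resolving neutrals with a simple rule."""
    dirs = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in LTR_CLASSES:
            dirs.append("L")
        elif cls in RTL_CLASSES:
            dirs.append("R")
        else:
            dirs.append("N")  # neutral: whitespace, punctuation, etc.

    i, n = 0, len(dirs)
    while i < n:
        if dirs[i] != "N":
            i += 1
            continue
        j = i
        while j < n and dirs[j] == "N":
            j += 1
        before = dirs[i - 1] if i > 0 else "R"  # line edges count as RTL
        after = dirs[j] if j < n else "R"
        # Neutrals go LTR only when flanked by LTR on both sides.
        dirs[i:j] = ["L" if before == after == "L" else "R"] * (j - i)
        i = j
    return dirs


def reorder_rtl(text: str) -> str:
    """Reorder one line for an RTL base direction (simplified sketch)."""
    runs, pos = [], 0
    for direction, group in groupby(_directions(text)):
        length = len(list(group))
        chunk = text[pos:pos + length]
        # LTR runs keep their internal order; RTL runs are mirrored.
        runs.append(chunk if direction == "L" else chunk[::-1])
        pos += length
    # With an RTL base direction, the runs themselves are laid out right to left.
    return "".join(reversed(runs))
```

One handy property of this particular simplification is that it undoes itself: calling `reorder_rtl` on the reordered line returns the original line, so the same routine can stand in for the reassembly step. That only holds under the three rules assumed above. The real algorithm has many more cases (embedding levels, mirrored brackets, separate handling of Arabic and European numbers), so treat this as a toy.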