Supporting Right to Left languages in MarcEdit

Life is a funny thing.  I’ve never been someone with a distinct desire to understand the intricacies of language or characters.  That’s stuff that my wife enjoys — she’s the linguist while I’ve barely mastered English. 🙂

Yet here I am, thanks to MarcEdit, having to become much more familiar with character sets and language than I’d ever dreamed.  Some of these issues are easy; character sets, for example.  There are actual solutions to the character set issues (both moving data between character sets and identifying them).  In MarcEdit, I found a long time ago that I could appease the math geek inside of me by forgoing the creation of large translation tables and coming up with mathematical translations for moving data between various character sets.  It’s not perfect (i.e., there are exceptions that require the use of a lookup table), but it’s slick enough that in MarcEdit all language translations (save CJK) are handled through a relatively simple formula with support for exceptions.
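To make the "formula with exceptions" idea concrete, here is a minimal sketch.  It is not MarcEdit's code, and the character set is my own pick for illustration: Windows-1251 (Cyrillic), where most letters sit at a fixed offset from their Unicode code points and a couple of characters fall outside that block.  The function names are hypothetical, and the sketch deliberately covers only ASCII, the contiguous letter block, and two exceptions.

```python
# Illustrative only: Windows-1251 -> Unicode with an offset formula
# plus a small exception table. Not MarcEdit's actual implementation.

# Characters that do not follow the offset rule (assumed, partial list).
EXCEPTIONS = {
    0xA8: 0x0401,  # CYRILLIC CAPITAL LETTER IO
    0xB8: 0x0451,  # CYRILLIC SMALL LETTER IO
}

# Bytes 0xC0..0xFF map onto U+0410..U+044F, a constant shift of 0x350.
OFFSET = 0x0410 - 0xC0


def cp1251_to_codepoint(byte: int) -> int:
    """Translate one Windows-1251 byte to a Unicode code point."""
    if byte in EXCEPTIONS:          # lookup table handles the odd cases
        return EXCEPTIONS[byte]
    if byte < 0x80:                 # ASCII range is unchanged
        return byte
    if 0xC0 <= byte <= 0xFF:        # the "mathematical translation"
        return byte + OFFSET
    raise ValueError(f"byte 0x{byte:02X} is not covered by this sketch")


def decode(data: bytes) -> str:
    """Decode a byte string using the formula and exception table."""
    return "".join(chr(cp1251_to_codepoint(b)) for b in data)
```

The appeal of this shape is that the table only has to hold the characters that break the pattern, rather than every character in the set.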

Language, on the other hand, stymies me.  Lately I’ve been having conversations with Arabic and Hebrew users who would like MarcEdit to support Right to Left processing of their languages.  In general, support for this is built right into the operating system through a localized implementation of the Unicode Bidirectional (BiDi) Algorithm.  So, in MarcEdit, you can mark the MarcEditor window and have it shift to the OS-supported output.  The problem comes from MARC and the Notepad-like format MarcEdit uses for editing.  Because of how the algorithm handles numbers, punctuation, and the mixing of Latinate and non-Latinate characters, the output presented to the user is close, but not perfect.  For example, in Arabic:

[Image: arabic]

This is a window from a colleague in Dubai who has been helping me wrap my head around the changes needed to make MarcEdit easier to use for the creation of non-English records.

The problem that we run into has to do specifically with numbers and words.  Many of the switched elements have numbers (or groups of numbers) attached to them, which is problematic because there is a lot of numerical data in MARC.
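One way to see why numbers and punctuation cause trouble is to look at the bidirectional class Unicode assigns to each character.  The sketch below uses Python's standard unicodedata module; the helper name bidi_classes is just something I made up for illustration.  Letters are "strong" (L for Latin, R for Hebrew, AL for Arabic), while digits, the dollar sign, and most punctuation are "weak" or "neutral", so where they end up depends on the strong characters around them and on the direction of the paragraph.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, bidirectional class) pairs for a string."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# A few characters that show up constantly in the mnemonic MARC view.
samples = {
    "Latin letter a": "a",
    "Hebrew alef": "\u05d0",
    "Arabic sad": "\u0635",
    "European digit 1": "1",
    "Arabic-Indic digit 3": "\u0663",
    "dollar sign": "$",
    "space": " ",
    "colon": ":",
}

for label, ch in samples.items():
    # L, R, AL are strong; EN, AN, ET, CS are weak; WS, ON are neutral.
    print(f"{label:22} U+{ord(ch):04X}  {unicodedata.bidirectional(ch)}")
```

A subfield like the 300 $a mixes all of these: a weak "$", a strong Latin subfield code, weak digits, and strong Arabic letters, which is exactly the kind of run where a general-purpose display algorithm has to guess.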

I’ve been pounding away at this, and I’m hoping that I might have come up with a workable solution.  It’s required writing a custom implementation of the Unicode Bidirectional (BiDi) Algorithm.  I’ll admit that, at this point, it’s actually a very simplified version, so I’m going to need to send this to a few people to make sure that it doesn’t butcher some general assumptions.  But with a little bit of work, I’ve been able to get an output that looks something like this:

[Image]

Of course, since I’ve changed the output algorithm, I’ve had to create a method for reassembling the data so that it gets placed back into an order that MarcEdit can compile into MARC.  I’m certain that there is still more work to be done, but the upside is that very soon, MarcEdit should be able to provide fuller (or at least closer) language support for any user working in a Right to Left language.  Will I understand the whole language issue any better?  Probably not, but I guess that’s as it should be.
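The display-then-reassemble round trip can be sketched in a few lines.  To be clear, this is not the custom reordering described above: it swaps in a different, simpler technique, wrapping each subfield's content in Unicode directional isolates (U+2067 and U+2069) for display and stripping them again before compiling.  The function names, the regular expression for a mnemonic line, and the assumption that the text control honors isolate characters are all mine.

```python
import re

RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Assumed shape of a mnemonic line: "=TAG", two spaces, two indicator
# characters, then the field data with "$" introducing each subfield.
LINE = re.compile(r"^=(\w{3})  (..)(.*)$")


def to_display(line: str) -> str:
    """Wrap each subfield's content in an RTL isolate for display."""
    m = LINE.match(line)
    if not m:
        return line  # leave anything unrecognized untouched
    tag, indicators, body = m.groups()
    head, *subfields = body.split("$")
    out = [f"={tag}  {indicators}{head}"]
    for sf in subfields:
        code, content = sf[:1], sf[1:]
        # The tag, indicators, and subfield code stay outside the isolate.
        out.append(f"${code}{RLI}{content}{PDI}")
    return "".join(out)


def reassemble(display: str) -> str:
    """Undo to_display so the line can be compiled back into MARC.

    Assumes the original data never contains RLI or PDI itself.
    """
    return display.replace(RLI, "").replace(PDI, "")


record_line = "=300  \\\\$a25 \u0635 ;$c24 \u0633\u0645"
assert reassemble(to_display(record_line)) == record_line
```

Whatever the display transformation is, the property that matters is the one the last line checks: reassembling the displayed text has to give back exactly what was there before.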

–TR


Comments

5 responses to “Supporting Right to Left languages in MarcEdit”

  1. Jonathan Rochkind

    That ‘close but not perfect’ output is in fact what we actually _get_ in our (proprietary) user-facing OPAC display from MARC records with right-to-left languages and numbers. Except I’ve never been able to wrap my head around _why_ it looks like it does, and what if anything could be done about it. Sounds like it’s an understandable consequence of the standard Unicode display algorithms, and if I read your post over again three or four more times, I might even understand why, which might be the first step to figuring out how to get it to display ‘right’ in, for example, Blacklight.

  2. Ayman Bustanji

    On the screenshot I found the following changes:
    Correct changes:
    Subfield codes (letters a, b, c) are now in the right place, located before the subfield content.
    Subfield content that contains only numeric data (without letters) is also OK. Example: 082/a.
    Still not correct:
    Subfield content that combines numerals and text is still inverted (not correct): the text precedes the numbers. Example: subfields 300/a, 300/c.
    Indicators are still inverted (not correct).

  3. Ayman Bustanji

    On the screenshot I found the following changes:

    Correct changes:
    Subfield codes (letters a, b, c) are now in the right place, located before the subfield content.
    Subfield content that contains only numeric data (without letters) is also OK. Example: 082/a.
    Still not correct:
    Subfield content that combines numerals and text is still inverted (not correct): the text precedes the numbers. Example: subfields 300/a, 300/c.
    Indicators are still inverted (not correct).

  4. Administrator

    Ayman,

    That’s good to know. Indicators are easy enough to deal with (just a quick change to the algorithm, so that’s been updated). I’m going to need to see an example of what would be correct in, say, the 300a or 300c so I can update the function.

    –TR

  5. caleb

    Terry, I think this is awesome. I’ll probably never look at another MARC record in Arabic, but the implications are mind-boggling. Let’s make simple tools that work all the time!