UTF-8 to MARC-8 component updates (July 22, 2005)

By reeset / In MarcEdit

When I first started putting this version of MarcEdit together, I had a number of goals. Some of the most prominent were:
1) Improve UTF-8 support to remove barriers to adoption for libraries
2) Provide UTF-8 to MARC-8 support to allow for seamless movement of data between XML and MARC sources.

The first was easy; the second has been a pain in the backside. There are a number of issues with providing a round trip back from UTF-8 to MARC-8, some of which are being discussed on LC’s UNICODE MARC listserv. However, for me, the two biggest issues deal with how characters can be represented in UTF-8 and how characters are mapped into UTF-8 when utilizing the MARC-8 to UTF-8 conversion specs.

The bane of my existence, at least recently, has been the Latin-1 character set. When moving data from MARC-8 to Unicode, the LC spec recommends that these characters be created as composite characters. So if you have a small e with a grave, the Unicode equivalent would be: e+[U+0300]. The nice thing about this syntax is that it’s easy to bring back into MARC-8. The modified character is still separate from the diacritic, and the diacritic can then be mapped back to its MARC equivalent. However, when typing Latin-1 Unicode characters, an e with a grave is represented by the single character 0xE8. This is much more difficult to break down, since there is a single character which must be split into the two corresponding bytes and written in correct MARC-8 encoding (which isn’t the same as plain ANSI).

I’ll admit it: this has caused me some problems. If you’ve been tracking the developments over the past week, I’ve posted a number of refreshed builds specifically intended to fix various issues relating to the crosswalking of UTF-8 CJK character sets and Latin-1 character sets. Well, after a lot of work, I think that I’ve finally got it nailed down (knock on wood). I’ve tested this build against 6 different types of files:
1) Plain ASCII, no diacritics
2) Latin-1
3) Mix of Latin-1 and Cyrillic
4) XML encoded UTF-8 with Latin-1 and CJK
5) CJK, Greek and Latin1
6) Arabic and Latin1
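
To make the Latin-1 problem described earlier a bit more concrete, here is a minimal Python sketch of one way to split a precomposed character such as 0xE8 into a base letter plus diacritic and write it out MARC-8 style. It is not MarcEdit's actual code. The names `COMBINING_TO_MARC8` and `to_marc8_sketch` are made up for illustration, the mapping table is deliberately partial, and it assumes Unicode NFD decomposition is an acceptable way to do the split. It relies on the fact that MARC-8 stores a diacritic before the letter it modifies.

```python
import unicodedata

# Partial, illustrative table (hypothetical name): Unicode combining mark
# -> MARC-8 (ANSEL) diacritic byte. A real converter needs the full LC table.
COMBINING_TO_MARC8 = {
    "\u0300": 0xE1,  # grave
    "\u0301": 0xE2,  # acute
    "\u0302": 0xE3,  # circumflex
    "\u0303": 0xE4,  # tilde
    "\u0308": 0xE8,  # umlaut / diaeresis
}


def to_marc8_sketch(text: str) -> bytes:
    """Encode ASCII letters plus the diacritics above; raise on anything else."""
    # NFD turns a precomposed character (e.g. U+00E8) into base + combining mark,
    # so precomposed and already-decomposed input are handled the same way.
    decomposed = unicodedata.normalize("NFD", text)
    out = bytearray()
    i = 0
    while i < len(decomposed):
        base = decomposed[i]
        i += 1
        # Collect the combining marks that follow this base character.
        marks = []
        while i < len(decomposed) and unicodedata.combining(decomposed[i]):
            marks.append(decomposed[i])
            i += 1
        # MARC-8 writes the diacritic bytes first, then the base letter.
        for mark in marks:
            if mark not in COMBINING_TO_MARC8:
                raise ValueError(f"no mapping in this sketch for U+{ord(mark):04X}")
            out.append(COMBINING_TO_MARC8[mark])
        if ord(base) > 0x7F:
            raise ValueError(f"no mapping in this sketch for U+{ord(base):04X}")
        out.append(ord(base))
    return bytes(out)
```

With this sketch, both the single character 0xE8 and the composite form e+[U+0300] come out as the same two MARC-8 bytes: the grave (0xE1) followed by a plain "e".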

In each test, it appears that the data is being handled correctly. There are only a few characters left in the Latin-1 code range that need to have support added: these are the superscripts 0-3. For some reason, the Unicode group has left most of these (superscripts 1-3) in the bottom 256 characters of the spec. The difficulty is that in MARC-8, these are special characters that require special handling. I’ll work on adding these to the next build, which, barring the discovery of a bug in the MARCEngine component, will be out on Sunday or Monday and, I’m hoping, will include a first look at the new Z39.50 client.
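
As a rough illustration of why the superscripts need special handling, here is a small Python sketch. It is an assumption-laden example, not the planned MarcEdit implementation. It assumes the MARC-8 convention of switching into the superscript set with ESC p (0x1B 0x70), writing the ordinary digit, and switching back to ASCII with ESC s (0x1B 0x73). The names `SUPERSCRIPT_DIGITS` and `encode_with_superscripts` are hypothetical. Superscripts 1-3 sit at U+00B9, U+00B2 and U+00B3; superscript 0 is at U+2070 and is included only for completeness.

```python
ESC = 0x1B
TO_SUPERSCRIPT = 0x70  # assumed: ESC p selects the superscript set
TO_ASCII = 0x73        # assumed: ESC s returns to the default ASCII set

# Hypothetical lookup: Unicode superscript digit -> plain digit.
SUPERSCRIPT_DIGITS = {
    "\u2070": "0",
    "\u00b9": "1",
    "\u00b2": "2",
    "\u00b3": "3",
}


def encode_with_superscripts(text: str) -> bytes:
    """Encode ASCII text plus superscript 0-3; raise on anything else."""
    out = bytearray()
    in_superscript = False
    for ch in text:
        digit = SUPERSCRIPT_DIGITS.get(ch)
        if digit is not None:
            # Switch sets once, so a run of superscripts shares one escape.
            if not in_superscript:
                out += bytes([ESC, TO_SUPERSCRIPT])
                in_superscript = True
            out.append(ord(digit))
        else:
            if in_superscript:
                out += bytes([ESC, TO_ASCII])
                in_superscript = False
            if ord(ch) > 0x7F:
                raise ValueError(f"not handled in this sketch: U+{ord(ch):04X}")
            out.append(ord(ch))
    # Never leave the field stuck in the superscript set.
    if in_superscript:
        out += bytes([ESC, TO_ASCII])
    return bytes(out)
```

The point of the sketch is that a single Unicode character turns into an escape sequence, a digit, and another escape sequence, which is quite different from the one-to-one or diacritic-plus-letter mappings used for the rest of Latin-1.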

You can download the new build at: marcedit50_2005_07_22.zip