Unicode Changes in MarcEdit

This change was implemented in June 2010, but I wanted to draw users’ attention to it, especially since I just commented on this change via the MarcEdit listserv.

As many people know, the state of Unicode usage in MARC, well, stinks.  While many systems utilize UTF-8 in their records, the UTF-8 notation that is used isn’t actually “used” in the real world.  In order to allow for lossless conversion between UTF-8 and MARC-8, the recommended UTF-8 notation in MARC is to utilize combining glyphs (KD notation, i.e. Unicode normalization form NFKD), so you have two separate characters, a base letter followed by a combining mark, that are placed side by side to display as a single character.  Within an ILS, I’m fairly certain that vendors normalize this data to what are called the “KC” or “C” UTF-8 notations to turn these multiple glyphs into a single UTF-8 character for indexing purposes.  However, when dealing with XML data or utilizing MARC UTF-8 data outside of the system, data formatted in the KD notation is getting to be problematic, specifically for international users, who tend to utilize the “KC” or “C” flavors.
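To make the difference concrete, here is a small Python sketch using the standard library's unicodedata module (this is an illustration only, not MarcEdit code). It shows the same letter in the single-character "C" form and in the decomposed "KD" form; the codepoints helper is a name I made up for the example.

```python
import unicodedata

def codepoints(text):
    """Return the code points of a string as U+XXXX labels (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# "é" stored as one precomposed character (what the "C"/"KC" forms produce).
composed = "\u00e9"

# The same letter in the decomposed "KD" form: base letter + combining acute.
decomposed = unicodedata.normalize("NFKD", composed)

print(codepoints(composed))    # ['U+00E9']
print(codepoints(decomposed))  # ['U+0065', 'U+0301']

# The two strings display the same but do not compare equal as raw strings,
# which is where trouble starts for tools outside the ILS.
print(composed == decomposed)  # False

# Normalizing back to the "C" form recovers the single character.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Any tool that compares, indexes, or measures these strings without normalizing first will treat the two forms as different data.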

Prior to June 2010, MarcEdit only supported the KD notation when translating data between MARC-8 and UTF-8.  Since this was the notation specified in the LC specs, this was the notation supported.  However, over the past half year, I’ve been receiving more and more requests from international users who are desperate to shed this albatross and utilize a more canonical normalization.  So, to that end, I’ve added an option to the MarcEdit preferences that gives users the ability to select the type of Unicode normalization utilized.

By default, the program will still utilize the KD notation, since that still remains the recommended specification for lossless conversions between MARC-8 and UTF-8, but I’m hoping that as more and more users, vendors and others call for support for more canonical encodings, LC will drop the current KD recommendations in favor of the more canonical flavors.

Finally, one last note.  The reason LC presently recommends the KD notation is to ensure a lossless conversion when encoding between MARC-8 and UTF-8.  In order to ensure that lossless conversion can still occur, MarcEdit will continue to convert data internally to the KD notation whenever a translation from UTF-8 to MARC-8 is requested.  This allows users to utilize the canonical notation while still being backward compatible with the older MARC-8 specification.  The normalization switch occurs behind the scenes and is transparent to the user.
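For the curious, the idea can be sketched in a few lines of Python. This is not MarcEdit's actual code; apply_preference and prepare_for_marc8 are hypothetical names, the list of selectable forms is my assumption, and the MARC-8 encoding step itself is left out. The point is only that whatever form the user picks for UTF-8 output, the data is put back into the KD form before a MARC-8 conversion would run.

```python
import unicodedata

# Assumed set of selectable forms; the real preference list may differ.
SUPPORTED_FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def apply_preference(text, form="NFKD"):
    """Normalize UTF-8 data to the form chosen by the user (hypothetical helper)."""
    if form not in SUPPORTED_FORMS:
        raise ValueError(f"unsupported normalization form: {form}")
    return unicodedata.normalize(form, text)

def prepare_for_marc8(text):
    """Put data back into the KD form before a MARC-8 conversion (hypothetical helper).

    The actual MARC-8 encoding step is not shown here.
    """
    return unicodedata.normalize("NFKD", text)

# A heading with precomposed characters (r-caron, a-acute, i-acute).
title = "Dvo\u0159\u00e1k, Anton\u00edn"

# The user chose the "C" form for UTF-8 output.
stored = apply_preference(title, "NFC")

# Before converting to MARC-8, the data is silently switched to KD.
outgoing = prepare_for_marc8(stored)

# Same result as if the record had been kept in KD all along.
print(outgoing == unicodedata.normalize("NFKD", title))  # True
print(len(stored), len(outgoing))  # 15 18
```

Because the switch back to KD happens at conversion time, the user's choice only affects how the UTF-8 data looks, not whether it can still round-trip to MARC-8.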

As I noted, this change occurred in June 2010, so if you are interested in taking advantage of the new Unicode notations, simply access the MarcEdit preferences and select the new option.

–TR

