UTF8 Normalizations and why I hate the 880 field in MARC21

Over the last year, I’ve spent a lot of time thinking about how normalizations impact records in our various current and next-generation MARC systems (https://blog.reeset.net/archives/2532). About a year ago, I embedded into MarcEdit the ability to enforce normalization, because many libraries were starting to experience significant record loading issues caused by their systems requiring either a decomposed (NFKD) or a composed (NFC) normalization. And I thought all was good. And if systems applied their own rules to all data within the record, it would be. However, it’s not. Over the past year, I have periodically gotten notes from users about systems not being able to render specific language data when the normalization is set to enforce the NFKD notation. This happens only under very specific circumstances, though it can be reliably recreated. And honestly, the problem is one that shouldn’t exist, but it does because systems that require decomposed characters are using special rules when working with 880 fields (and the fields linked to them) and are not supporting decomposed characters in those spaces.

Here’s a fairly common example of what I’m seeing. When working with languages like Japanese, Korean, etc. that have composed and decomposed characters, decomposing the characters found in the linked 880 fields causes the ILS to stop being able to render or index the data. For example, here’s the $a from a linked 880 with Korean data:

$a사관례, 사혼례, 사상 견례 -- v. 2 향음 주례, 향사례 -- v. 3 연례, 대사의 -- v. 4 빙례 -- v. 5 공사대부례, 근례 -- v. 6 상복 -- v. 7 사상례, 기석례, 사우례 -- v. 8 특생 궤사례, 소뢰 궤사례, 유사철 -- v. 9. 색인.

These characters can be decomposed to the NFKD notation like this:

$a사관례, 사혼례, 사상 견례 -- v. 2 향음 주례, 향사례 -- v. 3 연례, 대사의 -- v. 4 빙례 -- v. 5 공사대부례, 근례 -- v. 6 상복 -- v. 7 사상례, 기석례, 사우례 -- v. 8 특생 궤사례, 소뢰 궤사례, 유사철 -- v. 9. 색인.
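To make the difference between the two forms concrete, here is a minimal Python sketch using the standard `unicodedata` module. It is only an illustration, not how MarcEdit or any ILS does it. It takes the first word of the example above, written with escape sequences so the source file itself cannot be silently re-normalized, and shows that the composed and decomposed forms are different code point sequences that round-trip cleanly. For Hangul syllables, NFD and NFKD produce the same result.

```python
import unicodedata

# First word of the example above in composed (NFC) form:
# three precomposed Hangul syllables.
composed = "\uc0ac\uad00\ub840"

# Decompose into conjoining jamo (NFKD). Each syllable becomes
# two or three code points.
decomposed = unicodedata.normalize("NFKD", composed)

# Recompose. A system that accepts decomposed data has to do the
# equivalent of this (or let the font shape the jamo) to display it.
recomposed = unicodedata.normalize("NFC", decomposed)

print(len(composed), len(decomposed))          # code point counts differ
print([f"U+{ord(ch):04X}" for ch in decomposed])
print(composed == decomposed)                  # False: a plain comparison fails
print(composed == recomposed)                  # True: the round trip is lossless
```

The point of the comparison lines is that a system which indexes or matches strings without normalizing first will see these as two unrelated values, even though they render identically when everything works.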

If your system supports NFKD notation, it should be able to correctly recompose the data for display, as it does with all other fields where decomposed characters are required. However, I’m finding that within the 880 and its linked fields, many systems seem to treat this data very differently, and that is causing some problems. So, for users in this boat, I’ve added a new option to the MarcEditor Preferences: “Don’t normalize Paired 880 Fields”.

This option will not be enabled by default, and this item is also only available if you have identified that your record format is MARC21. When this value is enabled, the program will add a special check that will look for data in both the 880 field and fields that are paired to the 880 and will not normalize that data.
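The check described above can be sketched roughly as follows. This is not MarcEdit’s code. The `Field` class, the function names, and the `skip_paired_880` flag are all hypothetical, and I am assuming that a paired field is identified by the usual MARC21 linkage convention of a $6 subfield beginning with `880-`.

```python
import unicodedata
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Field:
    """Hypothetical, simplified MARC field: a tag plus (code, value) subfields."""
    tag: str
    subfields: List[Tuple[str, str]]


def is_880_or_paired(field: Field) -> bool:
    """True for an 880 field, or for a field whose $6 links it to an 880."""
    if field.tag == "880":
        return True
    # Assumption: a paired field carries a $6 such as "880-01".
    return any(code == "6" and value.startswith("880-")
               for code, value in field.subfields)


def normalize_record(fields: List[Field], form: str = "NFKD",
                     skip_paired_880: bool = True) -> List[Field]:
    """Normalize every subfield value, optionally leaving 880 pairs untouched."""
    result = []
    for field in fields:
        if skip_paired_880 and is_880_or_paired(field):
            result.append(field)  # leave the data exactly as it arrived
            continue
        normalized = [(code, unicodedata.normalize(form, value))
                      for code, value in field.subfields]
        result.append(Field(field.tag, normalized))
    return result


# Usage: only the unlinked 500 field is decomposed.
record = [
    Field("245", [("6", "880-01"), ("a", "\uc0ac")]),
    Field("500", [("a", "\uc0ac")]),
    Field("880", [("6", "245-01"), ("a", "\uc0ac")]),
]
normalized_record = normalize_record(record)
```

With the flag off, all three fields would be decomposed, which is the behavior that trips up the systems described above.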

Ideally, this setting will eventually go away as libraries shift to systems that no longer have to support the older NFKD notation and can utilize the NFC notation. When that is the case, these problems go away. But as long as libraries are required to provide a round trip back and forth to MARC8 when working with MARC21, this option will be available for users in need of it.

–tr