Gary Smith of OCLC’s response to MARBI Proposal 2006-04, Technique for conversion of Unicode to MARC-8

By reeset / On / In Digital Libraries, MarcEdit, Programming

I was glad to see Gary Smith from OCLC finally post OCLC’s official response to the current MARBI proposal on techniques for converting Unicode to MARC-8.  For those that haven’t seen the proposal, the general gist of the document is that the current recommendation is to drop non-transformable characters, replacing each with a fill character.  Personally, I was for one of the other options in the report: the generation of NCRs (Numeric Character References), like you see in XML, so that translation from Unicode to MARC-8 and from MARC-8 to Unicode would be a lossless process.  That round trip would be lost if a fill character were utilized.  However, Gary sums up a very good reason to give this further thought in his post.  He writes that:

 OCLC does not support this proposal.  Our recent experience in dealing
with Unicode data has shown that we require a lossless representation
for our own operations.  We expect that many of our users will have
similar requirements.  The use of a replacement character constitutes a
permanent loss of information.  If we produce and distribute records
containing replacement characters, they will inevitably come back to us
— and to every other system that takes in data from another system —
in a degraded and unrepairable form.

And he’s right… there are a number of toy ILS systems that will continue to require and share data in legacy formats.  Heck, we use Innovative Interfaces here at OSU, and our system hasn’t been converted to Unicode (though we could if we asked Innovative to do the conversion; there are consequences to this decision that we haven’t worked through yet), so it’s not simply a toy-ILS problem at this point.  The fact that OCLC, or any system, would be ingesting these records at some point would be problematic.  Currently, MarcEdit generates NCRs for unmappable characters when moving between UTF-8 and MARC-8; however, I’ll eventually support whatever MARBI blesses as the desired technique, so I’ll be keeping an eye on this and attending the discussion at Midwinter.
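As a rough illustration of the NCR technique (a sketch, not MarcEdit’s actual code; the `is_mappable` test here is a stand-in that treats only ASCII as directly mappable, whereas real MARC-8 covers far more, including ANSEL diacritics):

```python
def to_ncr(text, is_mappable=lambda ch: ord(ch) < 128):
    """Replace characters that cannot be represented in the target
    character set with XML-style numeric character references (NCRs).

    `is_mappable` stands in for a real MARC-8 mapping test; pretending
    only ASCII maps directly is a deliberate simplification.
    """
    out = []
    for ch in text:
        if is_mappable(ch):
            out.append(ch)
        else:
            # Lossless: the original code point survives as &#xHHHH;
            out.append("&#x%04X;" % ord(ch))
    return "".join(out)

print(to_ncr("Caf\u00e9 \u10b8"))  # → Caf&#x00E9; &#x10B8;
```

Because the code point is preserved in the text itself, a later conversion back to Unicode can restore the original character exactly, which is the whole point of preferring NCRs over a fill character.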



3 thoughts on “Gary Smith of OCLC’s response to MARBI Proposal 2006-04, Technique for conversion of Unicode to MARC-8”

  1. I’m glad that you and Glen are both holding out for lossless translation from Unicode back to our Z39.47. For purposes of clean display, Unicode is irresistible, but I have no commitment to it from an alphabeting point of view, or even in terms of mastering its complexity for use in my own coding for input. Use of those portions of Unicode implied by ANSEL in local systems’ displays makes sense. I imagine that users of other character sets would have an interest in using their own chars for displays, but I imagine that even they would use something as simple as ASCII for the guts of their systems.

    Apart from that, if I learned that there was something in ANSEL that could not be represented in Unicode, I would be seriously weirded out. I hope that the issues of loss were a matter of finding a way to kludge the contents of nonroman chars into something resembling Z39.47 chars.

    Of course, this could be done by using some of the chars undefined by ANSEL to add nonprinting descriptive parenthetical additions to the romanized strings of chars, but you’d be talking about a standard to determine the form of the additions, and then redefining Z39.47 to be able to physically accept the stuff required by the new standard.

    Lots of work. Little sense of reward. Fairly dubious benefits, I think. Anyway, thinking that anyone would settle for less than 100% is troubling.


    Unicode U+10B8** (GEORGIAN CAPITAL LETTER SHIN) =

    D0 start extended char
    53 S
    68 h
    69 i
    6E n
    D2 lang delimiter(Z39.53)
    67 g
    65 e
    6F o
    D3 identification delimiter*
    D1 end extended char

    * In CJK especially, alpha values tend not to be unique even within the scope of the language, requiring a further identifier (e.g. a number) to make the value unique. I think the presence of the ID delimiter ought to be required for all Unicode chars required for representation in version “x” of Z39.47 chars.

    D0-D3 currently undefined in ANSEL

    ** If I read Windows character map correctly. Their turn of phrase was U+10B8.
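The comment’s hypothetical escape scheme (not any real standard; D0–D3 are simply the currently undefined ANSEL bytes the comment proposes to assign) could be sketched in Python, with the delimiter roles named as above:

```python
# Byte values the comment proposes for the undefined ANSEL range D0-D3.
# This is a hypothetical scheme from the comment, not a real standard.
START, END, LANG, IDENT = 0xD0, 0xD1, 0xD2, 0xD3

def extended_char(romanized, lang_code, identifier=""):
    """Build the proposed escape sequence: a romanized name, a Z39.53
    language code, and an optional disambiguating identifier."""
    seq = bytes([START]) + romanized.encode("ascii")
    seq += bytes([LANG]) + lang_code.encode("ascii")
    seq += bytes([IDENT]) + identifier.encode("ascii")
    return seq + bytes([END])

print(extended_char("Shin", "geo").hex(" ").upper())
# → D0 53 68 69 6E D2 67 65 6F D3 D1
```

The identifier field is left empty here, matching the bare D3 in the byte listing; per the footnote, CJK values would normally carry a number after it.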

  2. Hi Terry,

    How does this all stand today, from your point of view? What’s your general sense of the kind of change that’s happened in the last seven years?


    1. This is kind of an older post, but I think that in the end, the NCR notation was adopted, and from my perspective, this has allowed for the kind of round-trip ability between UTF-8 and MARC-8 data that was being desired. My disappointment has been that this is still necessary. It’s been a number of years since these discussions were had, and the majority of libraries still rely on MARC-8 data. What’s more, OCLC, which does support NCR notation, only supports it for specific characters, not for the entire range. So, even with NCR support, it can be uneven. Personally, I was hoping that by 2012 we’d be past the need to provide legacy support for MARC-8. Unfortunately, it seems like the opposite is happening. Just this year, I have been asked to provide support for a number of legacy European character sets that I’d never heard of, simply because there is so much legacy data still out there and many organizations seem to have taken an approach of “if it works, why change it.”
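For completeness, the decode direction that makes the round trip lossless can be sketched as well (a simplification handling only the hexadecimal `&#xHHHH;` form; real parsers also accept decimal `&#NNNN;` references):

```python
import re

def from_ncr(text):
    """Decode XML-style hexadecimal NCRs back to their code points.
    This is the reverse half of the lossless round trip; only the
    &#xHHHH; form is handled in this sketch."""
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)),
                  text)

print(from_ncr("Caf&#x00E9;"))  # → Café
```

Any character a MARC-8 export had to write as an NCR comes back as the identical code point, which is why records that travel through MARC-8 this way are not degraded the way fill-character replacement would degrade them.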