MarcEdit – dealing with data in mixed character sets

By reeset / In MarcEdit

One of the hard and fast rules that MarcEdit has consistently enforced is that you are not allowed to mix character set streams.  What this means – if your data is in UTF8, MarcEdit will not process mnemonic data.  There are some good reasons for this – the best being that mnemonics in MarcEdit map back to MARC8 representations of a character, which are completely incompatible with the UTF8 character set.

I’ve tried a few times to look at different ways to deal with this – but in most cases, I’ve been thwarted by the way C# handles streams.  In C#, text streams are read as UTF8 unless the data is specifically typed otherwise.  In order to support MARC8 formatted data, MarcEdit reads all data as either UTF8 or as binary.  This allows MarcEdit to move easily between MARC8 and UTF8.  The problem occurs when someone wants to use mnemonics in a string that is already UTF8 encoded.  For example:
=246  13$aal-Mujāhid $bRees{aacute}, T{eacute}rry

The above is problematic to process.  Currently, MarcEdit ignores the mnemonics and treats them simply as strings.  Because these mnemonics convert directly to MARC8 bytes, at least one of the three diacritics in this field gets flattened however the stream is processed.  If the stream defaults to UTF8, the {aacute} and {eacute} data is flattened and the record generated by MarcEdit has incorrect lengths.  If the stream is converted to binary and then reconstituted as a UTF8 stream, the mnemonic data is processed correctly, but any UTF8 data already present in the stream (the ā here) is flattened.  A bit of a pickle.
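To make the flattening concrete, here is a minimal C# sketch.  It is not MarcEdit code, and the class and method names are made up for illustration.  It pushes the UTF8 part of the field above through a single-byte encoding (ASCII stands in for any encoding that cannot represent ā) and compares the character count with the UTF8 byte count, which is the kind of mismatch that throws off lengths when they are computed against the wrong encoding.

```csharp
using System;
using System.Text;

// Illustration only: FlattenDemo is a hypothetical name, not part of MarcEdit.
public static class FlattenDemo
{
    // Round-trips text through ASCII, a stand-in for any encoding that
    // cannot represent the character; unmappable characters become '?'.
    public static string FlattenToAscii(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }

    // Number of bytes the text occupies once written out as UTF8.
    public static int Utf8ByteLength(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        string field = "al-Muj\u0101hid"; // U+0101 is the precomposed a-macron
        Console.WriteLine(FlattenToAscii(field));   // al-Muj?hid
        Console.WriteLine(field.Length);            // 10 characters
        Console.WriteLine(Utf8ByteLength(field));   // 11 bytes
    }
}
```

The macron is lost as soon as the text passes through an encoding that has no slot for it, and the one-byte difference between characters and bytes is enough to misalign anything that depends on byte lengths.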

To make this work – I ended up having to atomize the data that is to be processed, meaning that only the data in the mnemonic is processed – and then inserted back into a UTF8 data stream.  So, it would look something like this:

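// Only take this path when the source string is detected as UTF8.
if (RecognizeUTF8(System.Text.Encoding.GetEncoding(1252).GetBytes(str_Source)) == RET_VAL_UTF_8)
{
    if (objChar == null)
    {
        objChar = new marc82utf8.MARCDictionary();
        objChar.UTFNormalize = UTFNormalization;
    }
    // Look up the MARC8 value for the mnemonic and convert only that fragment to UTF8.
    string tmp_diacritic = (string)lc_mnemonics_patch[tmp_string];
    tmp_diacritic = objChar.MARC8UTF8(tmp_diacritic);
    // Need to convert to bytes, then transcode to the internal 1252 staging encoding.
    byte[] bytes = System.Text.Encoding.UTF8.GetBytes(tmp_diacritic);
    tmp_diacritic = System.Text.Encoding.GetEncoding(1252).GetString(bytes);
    // Swap the mnemonic for the converted diacritic in the source string.
    str_Source = str_Source.Replace(tmp_string, tmp_diacritic);
}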

In this case, the atomized data is in tmp_diacritic and must be processed as MARC8 data to UTF8 utilizing the MarcEdit UTF8 Normalization library. At this point, the stream is switched to UTF8.  This data must now be converted to bytes, then transcoded to the internal base encoding for staging all character data, so it can then be passed back into the library for proper character handling. 
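For anyone who wants to try the idea outside of MarcEdit, here is a simplified, self-contained sketch of the same pattern.  Everything in it is an assumption made for illustration: the names are hypothetical, the mnemonic table maps straight to Unicode rather than going through MARC8 and the MarcEdit normalization library, and ISO-8859-1 stands in for the 1252 staging encoding because it is available on current .NET without registering a code page provider and maps every byte to exactly one character.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustration only: all names here are hypothetical, not MarcEdit's API.
public static class MnemonicDemo
{
    // Assumption: a tiny mnemonic table mapped straight to Unicode.
    // MarcEdit instead maps mnemonics to MARC8 and converts that to UTF8.
    static readonly Dictionary<string, string> Mnemonics =
        new Dictionary<string, string>
        {
            { "{aacute}", "\u00E1" },
            { "{eacute}", "\u00E9" },
        };

    // ISO-8859-1 maps each byte 0-255 to one char, so a string can carry
    // raw bytes; it stands in for the 1252 staging encoding in the post.
    static readonly Encoding Staging = Encoding.GetEncoding("ISO-8859-1");

    // Store the UTF8 bytes of the text in a one-char-per-byte string.
    public static string Stage(string text)
    {
        return Staging.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Recover the original text from its staged form.
    public static string Unstage(string staged)
    {
        return Encoding.UTF8.GetString(Staging.GetBytes(staged));
    }

    // Replace only the mnemonics in an already staged string; the
    // surrounding UTF8 bytes are never decoded or re-encoded.
    public static string ExpandStaged(string stagedSource)
    {
        foreach (KeyValuePair<string, string> pair in Mnemonics)
        {
            stagedSource = stagedSource.Replace(pair.Key, Stage(pair.Value));
        }
        return stagedSource;
    }

    public static void Main()
    {
        string field = "=246  13$aal-Muj\u0101hid $bRees{aacute}, T{eacute}rry";
        string result = Unstage(ExpandStaged(Stage(field)));
        string expected = "=246  13$aal-Muj\u0101hid $bRees\u00E1, T\u00E9rry";
        Console.WriteLine(result == expected); // True
    }
}
```

Because the mnemonics are plain ASCII, they survive staging unchanged and can be found and replaced in the staged string, while the bytes for ā pass through untouched.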

The upshot of this – MarcEdit will soon allow this type of mixed character editing.  The downside is that we still can’t get away from this type of MARC8 legacy crap.


4 thoughts on “MarcEdit – dealing with data in mixed character sets”

  1. Wait, mixing Marc8 and UTF8 encodings in a record is always bad data, isn’t it? Why would you want to support this? I may not be understanding what’s going on, and just reacting with instinctual horror due to the experience of dealing with records that have accidentally become of mixed encoding — it is a nightmare, it is by definition bad data, there becomes no reliable way for software to know how to interpret the bytes, which bytes are supposed to be interpreted as utf8, which as marc8, which may be just plain corrupt, whatever.

    (even if i’m misunderstanding what’s going on, the fact that some cataloging/metadata librarians still don’t understand the basic fact that mixing utf8 and marc8 in a record is by definition corrupt… makes it dangerous to title your post “dealing with data in mixed character sets” as if mixing character encodings is an okay thing!)

  2. Jonathan,

    In practical terms – yes – mixing the two character encoding streams isn’t optimal. This is why I have always enforced, for practical reasons, the notion that if you are working with UTF8 encoded data, you are not allowed to encode data into MARC records using either NCR notations or ALA mnemonics. The reason is that these values often exist to support MARC8 character sets, so the data transposes into MARC8 entities. Unfortunately, this is how data comes to librarians (how the data is generated, I’m not sure) and, yes, some folks have become accustomed to referring to data using the MARC8 mnemonics. For their workflows, even when dealing with UTF8 data, they would rather use the MARC8 entities because it doesn’t require hunting down the UTF8 key in a key map.

    Since the data exists (more than I’d like, especially when you start looking at data outside of MARC21 and at the larger bibliographic universe) and those workflows exist, I’ve been looking for a way to ensure that MarcEdit can accommodate this particular use case and protect the integrity of the records. That way, folks who are doing this are not creating data that cannot be read by other systems, since another system will pick one character set and flatten or mangle the rest of the data accordingly.

    I will disagree that there isn’t a reliable way to determine the character set of the data, as long as you make some assumptions. MarcEdit uses a heuristic approach to character set analysis that lets it guess better than most whether a character is in MARC8 or UTF8. This is part of the reason it can correct this data on the fly.

    But your point is taken. This wasn’t an endorsement of any particular workflow or way of encoding data. Nor was it an endorsement that people should be mixing character encodings within records. It was more of a missive that I’ve come up with what I believe will be an adequate solution for dealing with a problem that already exists within the bibliographic community.


  3. Terry-

    You’re absolutely right! This is especially a problem with MARC records encoded in the elusive MARC8 character set, but our only option is to export or process that data in another character set, usually UTF-8. Plus, not all MARC data is in MARC formatting – it’s just data. I’m dealing with this problem now. Your MarcEdit tool has been an absolute godsend with its ability to convert characters in the clipboard! Thanks for your continued development of it.