MarcEdit – dealing with data in mixed character sets

By reeset / In MarcEdit

One of the hard and fast rules that MarcEdit has consistently enforced is that you are not allowed to mix character set streams. In practice, this means that if your data is in UTF8, MarcEdit will not process mnemonic data. There are some good reasons for this, the best being that mnemonics in MarcEdit map back to the MARC8 representation of a character, which is completely incompatible with the UTF8 character set.

I’ve tried a few times to look at different ways to deal with this, but in most cases I’ve been thwarted by the way C# handles streams. In C#, text read from a stream is decoded as UTF8 unless another encoding is explicitly specified. In order to support MARC8 formatted data, MarcEdit reads all data as either UTF8 or as binary. This allows MarcEdit to move easily between MARC8 and UTF8. The problem occurs when someone wants to use mnemonics in a string that is already UTF8 encoded. For example:
=246  13$aal-Mujāhid $bRees{aacute}, T{eacute}rry

The above is problematic to process. Currently, MarcEdit ignores the mnemonics and treats them simply as strings. Because these mnemonics convert directly to MARC8 bytes, some of the three diacritics in this line will be flattened no matter how the stream is processed. If the stream defaults to UTF8, the {aacute} and {eacute} data will be flattened and the record generated by MarcEdit will have incorrect lengths. If the stream is converted to binary and then reconstituted as a UTF8 stream, any UTF8 data present in the stream (here, the ā) is flattened, but the mnemonic data is processed correctly. A bit of a pickle.
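Part of why the lengths go wrong is that MARC lengths are counted in bytes, not characters, and the three diacritics in the sample line all have different character and byte counts. The following is only a minimal sketch using standard .NET calls; it is not MarcEdit code, and the class and method names (LengthDemo, Utf8ByteLength) are made up for illustration.

```csharp
using System;
using System.Text;

static class LengthDemo
{
    // Hypothetical helper: how many bytes a string occupies once written out as UTF8.
    public static int Utf8ByteLength(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    static void Main()
    {
        string literal = "al-Muj\u0101hid"; // contains a literal a-with-macron (U+0101)
        string mnemonic = "{aacute}";       // ASCII placeholder for a single letter
        string resolved = "\u00E1";         // the letter the placeholder stands for

        // Character count and UTF8 byte count differ as soon as non-ASCII data appears,
        // and an unresolved mnemonic is far longer than the character it represents.
        Console.WriteLine(literal.Length + " chars / " + Utf8ByteLength(literal) + " bytes");
        Console.WriteLine(mnemonic.Length + " chars / " + Utf8ByteLength(mnemonic) + " bytes");
        Console.WriteLine(resolved.Length + " chars / " + Utf8ByteLength(resolved) + " bytes");
    }
}
```

Any length computed before the mnemonic is resolved, or computed from character counts rather than bytes, will not match what actually ends up in the record.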

To make this work – I ended up having to atomize the data that is to be processed, meaning that only the data in the mnemonic is processed – and then inserted back into a UTF8 data stream.  So, it would look something like this:

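// Only take this path when the source text is already UTF8
if (RecognizeUTF8(System.Text.Encoding.GetEncoding(1252).GetBytes(str_Source)) == RET_VAL_UTF_8)
{
    // Lazily create the MARC8-to-UTF8 converter
    if (objChar == null)
    {
        objChar = new marc82utf8.MARCDictionary();
        objChar.UTFNormalize = UTFNormalization;
    }
    // Look up the value behind the mnemonic, then convert just that piece to UTF8
    string tmp_diacritic = (string)lc_mnemonics_patch[tmp_string];
    tmp_diacritic = objChar.MARC8UTF8(tmp_diacritic);
    // Re-express the UTF8 bytes in the code page 1252 staging encoding
    byte[] bytes = System.Text.Encoding.UTF8.GetBytes(tmp_diacritic);
    tmp_diacritic = System.Text.Encoding.GetEncoding(1252).GetString(bytes);
    // Swap the converted character in for the mnemonic
    str_Source = str_Source.Replace(tmp_string, tmp_diacritic);
}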

In this case, the atomized data is in tmp_diacritic and must be converted from MARC8 to UTF8 using the MarcEdit UTF8 normalization library. At this point, the stream is switched to UTF8. The converted data must now be turned into bytes and then transcoded to the internal base encoding used for staging all character data, so it can be passed back into the library for proper character handling.
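The staging step is easier to follow in a standalone form. What follows is a rough sketch under several assumptions, not MarcEdit's implementation: the two-entry Mnemonics table is hypothetical and maps straight to Unicode (skipping the MARC8 conversion entirely), the names MnemonicSketch, ToStaging, FromStaging and ResolveMnemonics are invented, and ISO-8859-1 stands in for code page 1252 as the one-char-per-byte staging encoding because it is available in every .NET runtime without extra setup.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class MnemonicSketch
{
    // Hypothetical table for illustration only; the real mnemonic list is much larger.
    static readonly Dictionary<string, string> Mnemonics = new Dictionary<string, string>
    {
        { "{aacute}", "\u00E1" },
        { "{eacute}", "\u00E9" }
    };

    // Assumption: ISO-8859-1 as the staging encoding (every byte 0-255 maps to one char).
    static readonly Encoding Staging = Encoding.GetEncoding("iso-8859-1");

    // Re-express a string's UTF8 bytes as one char per byte.
    public static string ToStaging(string text)
    {
        return Staging.GetString(Encoding.UTF8.GetBytes(text));
    }

    // Reverse of ToStaging: recover the bytes and decode them as UTF8.
    public static string FromStaging(string staged)
    {
        return Encoding.UTF8.GetString(Staging.GetBytes(staged));
    }

    // Touch only the mnemonic tokens; everything else in the staged string is left alone.
    public static string ResolveMnemonics(string staged)
    {
        return Regex.Replace(staged, @"\{[a-z]+\}", m =>
        {
            string value;
            return Mnemonics.TryGetValue(m.Value, out value) ? ToStaging(value) : m.Value;
        });
    }

    static void Main()
    {
        string field = "=246  13$aal-Muj\u0101hid $bRees{aacute}, T{eacute}rry";
        string result = FromStaging(ResolveMnemonics(ToStaging(field)));
        // Compare against the field with both mnemonics resolved and the macron intact.
        Console.WriteLine(result == "=246  13$aal-Muj\u0101hid $bRees\u00E1, T\u00E9rry");
    }
}
```

The point of the design is that the existing UTF8 bytes are never decoded and re-encoded through a lossy path; only the small mnemonic token is converted, and it is inserted in the same byte-per-char form as the text around it.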

The upshot of this – MarcEdit will soon allow this type of mixed character editing.  The downside is that we still can’t get away from this type of MARC8 legacy crap.