MarcEdit Unicode Question [also posted on the listserv]

By reeset / In character encodings, MarcEdit

** This was posted on the listserv, but I’m putting this out there broadly **
** Updated to include a video demonstrating how Normalization currently impacts users **

Video demonstrating the question at hand:

 

So, I have an odd Unicode question and I’m looking for some feedback.  I had someone working with MarcEdit and looking for é.  This character (and a few others) presents some special problems when doing replacements, because it can be represented by multiple codepoints: either as a letter + diacritic (like you’d find in MARC8) or as a single codepoint.

Here’s the rub.  On Windows 10, if you do a find and replace using either normalization (.NET supports four major normalization forms), the program will find the string and replace the data.  The problem is that it replaces the data in the normalization that was entered.  Meaning, if your file contains data recorded as multiple codepoints (the traditional practice with MARC21, what is called the KD normalization) and your replacement uses a single codepoint, the replacement will swap the multiple codepoints for a single codepoint.  This is apparently a Windows 10 behavior.  But I find this behaves differently on Mac (and Linux) systems, which is problematic and confusing.
At the same time, most folks don’t realize that characters like é have multiple representations.  MarcEdit can find them, but it won’t replace them unless they are ordinally equivalent (unless you do a case-insensitive search).  So the tool may tell you it has found fields with this value and report that replacements have been made, yet no data is actually changed (because, ordinally, the strings are *not* the same).
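To make the codepoint issue concrete, here is a small illustrative C# snippet (not MarcEdit code) showing the two representations of é and how ordinal and normalized comparisons treat them:

    using System;
    using System.Text;

    class NormalizationDemo
    {
        static void Main()
        {
            // Precomposed form: a single codepoint (U+00E9).
            string composed = "\u00E9";
            // Decomposed form: letter + combining acute (U+0065 U+0301),
            // the representation traditionally found in MARC21 data.
            string decomposed = "e\u0301";

            // Ordinal comparison looks at raw codepoints, so these are NOT equal.
            Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal));  // False

            // Normalize both to the same form and they compare as equal.
            Console.WriteLine(composed.Normalize(NormalizationForm.FormKD) ==
                              decomposed.Normalize(NormalizationForm.FormKD));                 // True
        }
    }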
So, I’ve been thinking about this.  There is something I could do.  In the preferences, I allow users to define which Unicode normalization they want to use when converting data to Unicode.  This value is only used by the MarcEngine.  However, I could extend this to the editing functions.  Using this method, I could force data that comes through the search to conform to the desired normalization.  But you would still have cases where, say, you are looking for data that is normalized in Form C, you’ve told me you want all data in Form KD, and so again é may not be found, because ordinally the strings are not the same.
The other option, and this seems like the least confusing (though it has other impacts), would be to modify the functions so that the tool tests the Find string and, based on the data present, normalizes all data to match that normalization.  This way, replacements would always happen appropriately.  Of course, this means that if your data started in KD notation, it may end up (would likely end up, if you enter these diacritics from a keyboard) in C notation.  I’m not sure what the impact would be for ILS systems, as they may expect one notation and get another.  They should support all Unicode normalizations, but given that MARC21 assumes KD notation, they may be lazy and default to that set.  To prevent normalization switching, I could have the program, on save, ensure that all Unicode data matches the normalization specified in the preferences.  That would be possible; it comes with a small speed cost, probably not a big one, but I’d have to see what the trade-off would be.
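As a rough sketch of that second option (illustrative only, not MarcEdit’s code; the form detection is only a guess, since a string can be valid in more than one normalization form), the logic might look like this in .NET:

    using System;
    using System.Text;

    static class ReplaceHelper
    {
        // Guess which normalization form the find string is already in.
        static NormalizationForm DetectForm(string find)
        {
            if (find.IsNormalized(NormalizationForm.FormKD)) return NormalizationForm.FormKD;
            if (find.IsNormalized(NormalizationForm.FormC)) return NormalizationForm.FormC;
            return NormalizationForm.FormD;
        }

        // Normalize the record text to the find string's form, then replace.
        public static string ReplaceNormalized(string recordText, string find, string replace)
        {
            NormalizationForm form = DetectForm(find);
            return recordText.Normalize(form)
                             .Replace(find.Normalize(form), replace.Normalize(form));
        }
    }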
I’m bringing this up because on Windows 10 it looks as though the Replace functionality in the system is doing these normalizations automatically.  From the user’s perspective, this is likely desired, but from a final-output perspective, that’s harder to say.  And since you’d never be able to tell that the normalization has changed unless you looked at the data in a hex editor (because, honestly, it shouldn’t matter; but again, if your ILS only supports a single normalization, it very much would), this could be a problem.
My initial inclination, given that Windows 10 appears to be doing normalization on the fly, allowing users to search and replace é in multiple normalizations, is to normalize all data that is recognized as UTF-8.  That would allow me to filter all strings going into the system and then, when saving, push out the data using the normalization that was requested.  But I’m not sure whether this is still a big issue, or whether knowing that the data is in single or multiple codepoints (from a find and replace perspective) is actually desired.
So, I’m pushing this question out to the community, especially as UTF8 is becoming the rule, and not the exception.

MarcEditor Changes and Right-to-left displays

By reeset / In character encodings, MarcEdit

So, this was a tough one.  MarcEdit has a right-to-left data entry mode that was created primarily for users creating bibliographic records in a right-to-left language.  But what happens when a record mixes data from a left-to-right language, like English, with a right-to-left language, like Hebrew?  Well, in the display, odd things happen.  This is because of what the operating system does when rendering the data.  The operating system assumes certain data belongs to the right-to-left string, and then moves data in the way it thinks it should render.  Here’s an example:

In this example, the $a$0 are displayed side-by-side, but this is just a display issue.  Underneath, the data is actually correct.  If you compiled this data or loaded it into an ILS, the data would parse correctly (though how it displayed would be up to the ILS’s support for the language).  Still, it’s confusing, and unfortunately it’s one of the challenges of working with records in a notepad-like environment.

Now, there is a solution to the display problem.  There are two Unicode characters, 0x200E and 0x200F: the left-to-right mark and the right-to-left mark.  These can be embedded in the display to render characters more appropriately.  They show up only in the display (i.e., they are added when data is read into the display) and are not preserved in the MARC record.  They help to alleviate some of these problems.

 

The way this works: when the program identifies that it’s working with UTF-8 data, it screens the text for characters whose byte values indicate they should be rendered right-to-left.  The program then embeds an RTL marker at the beginning of the string and an LTR marker at the end of the string.  This gives the operating system instructions on how to render the data and, I believe, helps to solve this issue.
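As a rough illustration (a simplified sketch, not MarcEdit’s actual routine, and it only checks the Hebrew and Arabic blocks), the display-side wrapping looks something like this:

    static class RtlDisplayHelper
    {
        // Rough check for characters in common right-to-left blocks
        // (Hebrew U+0590-U+05FF, Arabic U+0600-U+06FF); real RTL detection
        // covers more ranges than this.
        static bool ContainsRtl(string s)
        {
            foreach (char c in s)
            {
                if ((c >= '\u0590' && c <= '\u05FF') || (c >= '\u0600' && c <= '\u06FF'))
                    return true;
            }
            return false;
        }

        // Wrap the display string with direction marks so mixed-direction
        // fields render predictably.  The marks live only in the display
        // copy, never in the saved MARC data.
        public static string ForDisplay(string fieldText)
        {
            if (!ContainsRtl(fieldText))
                return fieldText;
            return "\u200F" + fieldText + "\u200E";  // RLM ... LRM
        }
    }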

–tr

 

 

MarcEdit: Thinking about Charactersets and MARC

By reeset / In character encodings, MarcEdit

The topic of charactersets is likely something most North American catalogers rarely give a second thought to.  Our tools and systems are all built around a very Anglo-centric world-view that assumes data is primarily structured in MARC21 and recorded in either MARC-8 or UTF-8.  However, when you get outside of North America, the question of characterset, and even MARC flavor for that matter, becomes much more relevant.  While many programmers and catalogers who work with library data would like to believe that most data follows a fairly regular set of common rules and encodings, the reality is that it doesn’t.  While MARC21 is the primary MARC encoding for North American and many European libraries, it is just one of around 40+ different flavors of MARC, and while MARC-8 and UTF-8 are the predominant charactersets in libraries coding in MARC21, move outside of North America and OCLC and you will run into Big5, Cyrillic (codepage 1251), Central European (codepage 1250), ISO-5426, Arabic (codepage 1256), and a range of other localized codepages still in use today.  So while UTF-8 and MARC-8 are the predominant encodings in countries using MARC21, a large portion of the international metadata community still relies on localized codepages when encoding their library metadata.  And this can be a problem for any North American library looking to utilize metadata encoded in one of these local codepages, or to share data with a library utilizing one of them.

For years, MarcEdit has included a number of tools for handling this soup of character encodings, tools that work at different levels to let the program handle data from across the spectrum of metadata rules, encodings, and markups.  These break down into two different types of processing algorithms.

Characterset Identification:

This algorithm is internal to MarcEdit and vital to how the tool handles data at the byte level.  When working with file streams for rendering, the tool needs to decide whether the data is in UTF-8 or something else (for mnemonic processing); otherwise, data won’t render correctly in the graphical interface.  For a long time (and honestly, this is still true today), the byte in the LDR of a MARC21 record that indicates whether a record is encoded in UTF-8 or something else simply hasn’t been reliable.  It’s getting better, but a good number of systems and tools simply forget (or ignore) this value.  More important for MarcEdit, this value is only useful for MARC21: the encoding byte sits in a different field/position in each flavor of MARC.  For MarcEdit to handle this correctly, a small, fast algorithm needed to be created that could reliably identify UTF-8 data at the binary level.  And that’s what’s used: a heuristic algorithm that reads bytes to determine whether the characterset might be UTF-8 or something else.

Might be?  Sadly, yes.  There is no way to definitively auto-detect a characterset.  It just can’t happen.  Each codepage reuses the same codepoints; they just assign different characters to those codepoints based on which encoding is in use.  So a tool won’t know how to display textual data without first knowing the set of codepoint rules the data was encoded under.  It’s a real pain in the backside.
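To see why guessing is required, here’s a tiny C# illustration (mine, not MarcEdit’s; the codepage numbers are just examples): the same byte decodes to a different letter depending on which codepage’s rules you apply.

    using System;
    using System.Text;

    class CodepageDemo
    {
        static void Main()
        {
            // On .NET Core/5+, the legacy codepages come from the
            // System.Text.Encoding.CodePages package and must be registered;
            // they ship with the .NET Framework.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

            byte[] data = { 0xE9 };  // one byte, with no self-describing encoding

            // The same byte is a different letter under each codepage.
            Console.WriteLine(Encoding.GetEncoding(1252).GetString(data));  // é  (Western European)
            Console.WriteLine(Encoding.GetEncoding(1251).GetString(data));  // й  (Cyrillic)
        }
    }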

To solve this problem, MarcEdit uses the following code in an identification function:

 
            // Reconstructed for readability; the constant values and the method
            // signature shown here are illustrative, not MarcEdit's exact declarations.
            const int RET_VAL_ANSI = 0;
            const int RET_VAL_UTF_8 = 1;
            const int RET_VAL_ERR = -1;

            int IdentifyEncoding(byte[] p)
            {
                int iEType = RET_VAL_ANSI;
                int x = 0;
                int lLen = 0;

                try
                {
                    while (x < p.Length)
                    {
                        // Plain 7-bit ASCII byte: keep scanning.
                        if (p[x] <= 0x7F)
                        {
                            x++;
                            continue;
                        }
                        // Otherwise the byte must be a valid UTF-8 lead byte;
                        // its bit pattern tells us the sequence length.
                        else if ((p[x] & 0xE0) == 0xC0)
                        {
                            lLen = 2;
                        }
                        else if ((p[x] & 0xF0) == 0xE0)
                        {
                            lLen = 3;
                        }
                        else if ((p[x] & 0xF8) == 0xF0)
                        {
                            lLen = 4;
                        }
                        else if ((p[x] & 0xFC) == 0xF8)
                        {
                            lLen = 5;
                        }
                        else if ((p[x] & 0xFE) == 0xFC)
                        {
                            lLen = 6;
                        }
                        else
                        {
                            // A high-bit byte that cannot start a UTF-8 sequence:
                            // treat the data as a local (ANSI) codepage.
                            return RET_VAL_ANSI;
                        }

                        // Every remaining byte in the sequence must be a
                        // continuation byte (10xxxxxx).
                        while (lLen > 1)
                        {
                            x++;
                            if (x >= p.Length || (p[x] & 0xC0) != 0x80)
                            {
                                return RET_VAL_ERR;
                            }
                            lLen--;
                        }

                        // At least one well-formed multi-byte sequence was found.
                        iEType = RET_VAL_UTF_8;
                        x++;
                    }
                }
                catch (System.Exception)
                {
                    iEType = RET_VAL_ERR;
                }

                return iEType;
            }

This function allows the tool to quickly evaluate any data at a byte level and identify whether that data might be UTF-8 or not, which is really handy for my usage.

Character Conversion

MarcEdit has also included a tool that allows users to convert data from one character encoding to another.

[Screenshot: MarcEdit’s character conversion tool]

This tool requires users to identify the original characterset encoding of the file to be converted.  Without that information, MarcEdit would have no idea which set of rules to apply when shifting the data around, since those rules determine how characters have been assigned to their various codepoints.  Unfortunately, a common problem I hear from librarians, especially librarians in the United States who don’t have to deal with this problem regularly, is that they don’t know the file’s original characterset encoding, or how to find it.  It’s a common problem, especially when retrieving data from some Eastern European and Asian publishers.  In many of these cases, users send me files, and based on my experience looking at different encodings, I can make a couple of educated guesses and generally figure out how the data might be encoded.
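For anyone curious what the conversion itself amounts to, here is a minimal sketch (not MarcEdit’s code, and the file names are just examples) of re-encoding text from a known codepage to UTF-8.  Note that converting binary MARC also means the record length and directory offsets in the leader change, which MarcEdit’s tool accounts for; this sketch only shows the codepage step on a mnemonic (text) file.

    using System.IO;
    using System.Text;

    class ConvertDemo
    {
        static void Main()
        {
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

            // The user has to supply the source codepage -- here, Cyrillic (1251).
            Encoding source = Encoding.GetEncoding(1251);

            // Decode the bytes using the source codepage's rules, then write
            // the same characters back out as UTF-8 (without a BOM).
            string text = File.ReadAllText("records.mrk", source);
            File.WriteAllText("records-utf8.mrk", text, new UTF8Encoding(false));
        }
    }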

Automatic Character Detection

Obviously, it would be nice if MarcEdit could provide some kind of automatic characterset detection.  The problem is that this is a process that is always fraught with errors.  Since there is no way to definitively determine the characterset of a file simply by looking at the binary data, we are left having to guess.  And this is where heuristics come in again.

Current-generation web browsers automatically set character encodings when rendering pages.  They do this based on the presence of metadata in the header, information from the server, and a heuristic analysis of the data prior to rendering.  This is why everyone has seen pages that the browser believes are in one character set but are actually in another, making the data unreadable when it renders.  Still, the process browsers currently use is, as sad as this may be, the best we’ve got.

And so, I’m going to be pulling this functionality into MarcEdit.  Mozilla has made the algorithm it uses public, and some folks have ported that code to C#.  The library can be found on GitHub here: https://github.com/errepi/ude.  I’ve tested it, and it works pretty well, though it is not even close to perfect.  Unfortunately, this type of process works best when you have lots of data to evaluate, but most MARC records are just a few thousand bytes, which just isn’t enough data for a proper analysis.  However, it does provide something, and maybe that something will give users working with data in an unknown character encoding a way to figure out how their data might be encoded.
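For reference, wiring the library up takes only a few lines.  This is an illustrative sketch based on the usage described in the ude README (the file name and surrounding code are my own):

    using System;
    using System.IO;
    using Ude;  // https://github.com/errepi/ude

    class DetectDemo
    {
        static void Main()
        {
            // Example file name only.
            byte[] buffer = File.ReadAllBytes("records.mrc");

            var detector = new CharsetDetector();
            detector.Feed(buffer, 0, buffer.Length);
            detector.DataEnd();

            if (detector.Charset != null)
                Console.WriteLine($"Detected {detector.Charset} (confidence {detector.Confidence})");
            else
                Console.WriteLine("Detection failed -- not enough data to make a guess.");
        }
    }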

The new character detection tools will be added to the next official update of MarcEdit (all versions).

[Screenshot: the new character detection option in MarcEdit]

And as I noted, this is a tool that will be added to give users one more way to evaluate their records.  While detection may still only be a best guess, it’s likely a pretty good guess.

The MARC8 problem

Of course, not all is candy and unicorns.  MARC-8, the lingua franca for a wide range of ILS systems and libraries, complicates things.  Unlike many localized codepages, which are well-defined standards in use by a wide range of users and communities around the world, MARC-8 is not.  MARC-8 is essentially a made-up encoding; it simply doesn’t exist outside the small world of MARC21 libraries.  To a heuristic parser evaluating character encoding, MARC-8 looks like one of four different charactersets: US-ASCII, codepage 1252, ISO-8859, or UTF-8.  The problem is that MARC-8, as an escape-based encoding, reuses parts of a couple of different encodings.  This really complicates the identification of MARC-8, especially in a world where other encodings may (and probably will) be present.  To that end, I’ve had to add a secondary set of heuristics that evaluate data after detection, so that if the data is identified as one of these four types, some additional evaluation is done looking specifically for MARC-8’s fingerprints.  This allows, most of the time, for MARC-8 data to be correctly identified, but again, not always.  It just looks too much like other standard character encodings.  Again, it’s a good reminder that this tool is just a best guess at the characterset encoding of a set of records, not a definitive answer.
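I won’t detail the exact fingerprints MarcEdit looks for, but to give a flavor of the kind of check involved, here is a hypothetical sketch (my illustration, not MarcEdit’s actual heuristic).  It looks for two MARC-8 tells: the ESC (0x1B) sequences MARC-8 uses to switch character sets, and ANSEL-style combining-diacritic bytes (0xE0-0xFE) appearing immediately before an ASCII letter, since MARC-8 places the diacritic before the base character.

    static class Marc8Fingerprint
    {
        // Hypothetical secondary check: does this byte stream "smell" like MARC-8?
        public static bool LooksLikeMarc8(byte[] data)
        {
            for (int i = 0; i < data.Length - 1; i++)
            {
                // MARC-8 switches between character sets with ESC sequences.
                if (data[i] == 0x1B)
                    return true;

                // Combining diacritic byte immediately before a plain ASCII
                // letter -- the diacritic-first ordering is a MARC-8 habit,
                // unlike Unicode and the Windows codepages.
                bool isDiacritic = data[i] >= 0xE0 && data[i] <= 0xFE;
                bool nextIsLetter = (data[i + 1] >= 0x41 && data[i + 1] <= 0x5A) ||
                                    (data[i + 1] >= 0x61 && data[i + 1] <= 0x7A);
                if (isDiacritic && nextIsLetter)
                    return true;
            }
            return false;
        }
    }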

Honestly, I know a lot of people would like to see MARC as a data structure retired.  They write about it, talk about it, and hope that BibFrame might actually do it.  I get their point: MARC as a structure isn’t well suited to the way we process metadata today.  Most programmers simply don’t work with formats like MARC, and fewer tools exist that make MARC easy to work with.  Likewise, most evolving metadata models recognize that metadata lives within a larger context, and they take advantage of semantic linking to encourage the linking of knowledge across communities.  These are things libraries would like in their metadata models as well, and libraries will get there, though I think in baby steps.  When you consider what a train wreck RDA adoption and development was for what we got out of it (at a practical level), making a radical move like BibFrame will require a radical change (and maybe an event that causes that change).

But I think there is a bigger problem that needs more immediate action.  The continued reliance on MARC8 actually poses a bigger threat to the long-term health of library metadata.  MARC, as a structure, is easy to parse.  MARC8, as a character encoding, is essentially a virus, one that we are continuing to let corrupt our data and lock it away from future generations.  The sooner we can toss this encoding onto the trash heap, the better it will be for everyone, especially since we are likely only the passing of one generation away from losing the knowledge of how this made-up character encoding actually works.  And when that happens, it won’t matter how the record data is structured, because we won’t be able to read it anyway.

–tr