MarcEdit Unicode Question [also posted on the listserv]

By reeset / On / In character encodings, MarcEdit

** This was posted on the listserv, but I’m putting this out there broadly **
** Updated to include a video demonstrating how Normalization currently impacts users **

Video demonstrating the question at hand:


So, I have an odd Unicode question and I’m looking for some feedback.  I had someone working with MarcEdit and looking for é.  This (and a few other characters) presents some special problems when doing replacements because it can be represented by multiple code points.  It can be represented as a letter + diacritic (like you’d find in MARC8) or as a single code point.
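
To make the two representations concrete, here is a minimal sketch in Python rather than .NET (Python's `unicodedata` module is used purely for illustration; the .NET forms FormC, FormD, FormKC and FormKD correspond to NFC, NFD, NFKC and NFKD). The variable names are made up for this example.

```python
import unicodedata

composed = "\u00e9"      # é as a single, precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Both strings render as é, but they differ code point by code point.
print([hex(ord(c)) for c in composed])
print([hex(ord(c)) for c in decomposed])
print(composed == decomposed)  # ordinal comparison, so this is False

# Once both sides are in the same normalization form, they compare equal.
to_c = unicodedata.normalize("NFC", decomposed)
to_d = unicodedata.normalize("NFD", composed)
print(to_c == composed, to_d == decomposed)
```

For é specifically, the D and KD forms give the same decomposed result; they only diverge for compatibility characters such as ligatures.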

Here’s the rub.  In Windows 10, if you do a find and replace using either form of the character (.NET supports 4 major normalizations), the program will find the string and replace the data.  The problem is that it replaces the data in the normalization that is presented.  That is, if your file has data where the character is stored as multiple code points (the traditional standard with MARC21, what is called the KD normalization) and you do a search where the replacement uses a single code point, the replacement will replace the multiple code points with a single code point.  This is, apparently, a Windows 10 behavior.  But I find this behaves differently on Mac systems (and Linux), which is problematic and confusing.
At the same time, most folks don’t realize that characters like é have multiple representations.  MarcEdit can find them but won’t replace them unless they are ordinally equivalent (unless you do a case-insensitive search).  So the tool may tell you it has found fields with this value and then report that replacements were made, yet no data is actually changed (because ordinally, they are *not* the same).
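
The mismatch between "found" and "replaced" can be reproduced in a few lines. This is only a sketch of the general effect, not MarcEdit's actual code: `normalized_contains` is a hypothetical helper standing in for a normalization-aware find, and Python's `str.replace` stands in for an ordinal replace.

```python
import unicodedata

def normalized_contains(text: str, needle: str) -> bool:
    """Hypothetical find step: compare after normalizing both sides to NFC."""
    return unicodedata.normalize("NFC", needle) in unicodedata.normalize("NFC", text)

record = "=245 10$aCafe\u0301 culture"  # data stores é as e + combining acute
find = "Caf\u00e9"                       # user typed the precomposed é
replacement = "Coffee"

found = normalized_contains(record, find)   # the find step sees a match
result = record.replace(find, replacement)  # ordinal comparison, no match

print(found)             # True
print(result == record)  # True: nothing was actually changed
```
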
So, I’ve been thinking about this.  There is something I could do.  In the preferences, I allow users to define which Unicode normalization they want to use when converting data to Unicode.  This value is only used by the MarcEngine.  However, I could extend this to the editing functions.  Using this method, I could force data that comes through the search to conform to the desired normalization.  But you would still have times where, say, you are looking for data that is normalized in Form C and you’ve told me you want all data in Form KD, and so é may not be found because, again, ordinally they are not the same.
The other option, and this seems like the least confusing though it has other impacts, would be to modify the functions so that the tool tests the Find string and, based on the data present, normalizes all data so that it matches that normalization.  This way, replacements would always happen appropriately.  Of course, this means that if your data started in KD notation, it may end up (would likely end up, if you enter these diacritics from a keyboard) in C notation.  I’m not sure what the impact would be for ILS systems, as they may expect one notation and get another.  They should support all Unicode notations, but given that MARC21 assumes KD notation, they may be lazy and default to that set.  To prevent normalization switching, I could have the program, on save, ensure that all Unicode data matches the normalization specified in the preferences.  That would be possible.  It comes with a small speed cost, probably not a big one, but I’d have to see what the trade-off would be.
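
As a rough sketch of this second option (again in Python, and not how MarcEdit is implemented): detect the form of the Find string, normalize the data to match, do the replacement, and then normalize to the preferred form on save. The names `detect_form`, `replace_normalized` and `save_form` are hypothetical, and `unicodedata.is_normalized` assumes Python 3.8 or later.

```python
import unicodedata

def detect_form(s: str) -> str:
    """Guess whether s is composed (NFC) or decomposed (NFD); default to NFC."""
    for form in ("NFC", "NFD"):
        if unicodedata.is_normalized(form, s):
            return form
    return "NFC"  # mixed input: pick one form and normalize everything to it

def replace_normalized(data: str, find: str, replacement: str, save_form: str) -> str:
    form = detect_form(find)
    # Bring data, find string and replacement into the same form so the
    # ordinal replace below sees identical code points.
    working = unicodedata.normalize(form, data)
    find = unicodedata.normalize(form, find)
    replacement = unicodedata.normalize(form, replacement)
    replaced = working.replace(find, replacement)
    # On "save", convert back to the normalization chosen in the preferences.
    return unicodedata.normalize(save_form, replaced)

data = "Cafe\u0301 menu"  # decomposed é in the file
out = replace_normalized(data, "Caf\u00e9", "Caff\u00e8", save_form="NFKD")
print([hex(ord(c)) for c in out])
```

Here the user types both strings with precomposed characters, the match succeeds anyway, and the saved data goes back out decomposed. One caveat with this sketch: saving in a K form also rewrites compatibility characters elsewhere in the data, which may or may not be wanted.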
I’m bringing this up because on Windows 10, it looks as though the Replace functionality in the system is doing these normalizations automatically.  From the user’s perspective, this is likely desired, but from a final output perspective, that’s harder to say.  And since you’d never be able to tell if the normalization has changed unless you looked at the data in a hex editor (because honestly, it shouldn’t matter, but again, if your ILS only supported a single normalization, it very much would), this could be a problem.
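
If you want to check what a string actually contains without opening a hex editor, something along these lines would do it. This is an illustrative Python helper, not a MarcEdit feature; `describe` is a made-up name, and both `is_normalized` and `bytes.hex` with a separator assume Python 3.8 or later.

```python
import unicodedata

def describe(s: str) -> dict:
    """Report the UTF-8 bytes of s and which normalization forms it already satisfies."""
    return {
        "utf8_hex": s.encode("utf-8").hex(" "),
        "forms": [f for f in ("NFC", "NFD", "NFKC", "NFKD")
                  if unicodedata.is_normalized(f, s)],
    }

print(describe("\u00e9"))   # precomposed é
print(describe("e\u0301"))  # e + combining acute accent
```
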
My initial inclination, given that Windows 10 appears to be doing normalization on the fly and allowing users to search and replace é in multiple normalizations, is to potentially normalize all data that is recognized as UTF8, which would allow me to filter all strings going into the system, and then, when saving, push out the data using the normalization that was requested.  But then, I’m not sure if this is still a big issue, or if knowing that the data is in single or multiple code points (from a find and replace perspective) is actually desired.
So, I’m pushing this question out to the community, especially as UTF8 is becoming the rule, and not the exception.