MarcEdit Unicode Normalizations–specifying ordinal matching and what that means

I’m continuing to flesh out how to make it easier to work with normalizations, specifically so that what is being queried is actually what is being found.  In general, the new normalization enforcement options solve issues relating to finding and replacing text.  However, the place where this still comes up as a challenge is when using Find/Find All.  Internally, .NET’s string.IndexOf function uses culture-invariant settings, so it takes the data being queried and expands it into other variations of the character.  An example: ß gets searched both as its Unicode normalizations and as “ss”.  There is a pretty good chance users don’t want to search for “ss” when querying for “ß”, though there may be times when they do.  To handle this, I’ve updated the Find/Find All query so that users can determine how the tool will interpret data for searching.
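To make the difference concrete, here is a minimal sketch in Python rather than .NET. It is not MarcEdit's code: Python's `str.find` is always ordinal, so the looser, culture-style match is only approximated here with `str.casefold()`, which happens to map “ß” to “ss”. The function names `ordinal_find` and `loose_contains` are made up for illustration.

```python
def ordinal_find(haystack: str, needle: str) -> int:
    """Code point by code point search; returns -1 when nothing matches."""
    return haystack.find(needle)


def loose_contains(haystack: str, needle: str) -> bool:
    """Rough stand-in for a linguistic search (an assumption, not .NET's algorithm).

    casefold() maps "ß" to "ss", so both spellings compare as equal.
    """
    return needle.casefold() in haystack.casefold()


for record in ["Straße", "Strasse"]:
    # Ordinal search only hits the record that literally contains "ß";
    # the loose search hits both records.
    print(record, ordinal_find(record, "ß"), loose_contains(record, "ß"))
```

The ordinal search only reports the record that really contains “ß”, while the loose search also reports the “ss” spelling, which is the behavior the new Find/Find All option lets users choose between.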

What exactly does this look like?  Here’s an example:

image

image

 

In this case, both “ß” in the various normalizations and “ss” are found.  For example:

image

However, when we shift the query to an ordinal search, the results come back only for the character itself: “ß” in its various normalizations, but not culture-equivalent expressions like “ss”.

image

image

By providing different search types, users can get a better idea of what types of information are showing up in their records.

Finally, replacements always happen ordinally.  Unlike the search in .NET, which expands data into its culture-equivalent expressions, replacements are always ordinal, so the text must match exactly.  This is why the option to enforce Unicode normalizations is important: it lets replacements work across values that can be expressed using a wide range of valid codepoints.
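As a rough illustration of why enforcing a normalization matters for ordinal replacement, here is a small Python sketch using the standard `unicodedata` module. It is an assumption-laden example, not MarcEdit's implementation; `replace_normalized` is a hypothetical helper, and “é” is used because it has distinct composed and decomposed forms.

```python
import unicodedata


def replace_normalized(text: str, find: str, replacement: str, form: str = "NFC") -> str:
    """Normalize both the text and the search term, then replace ordinally.

    Note: the returned text is in the chosen normalization form throughout.
    """
    text_n = unicodedata.normalize(form, text)
    find_n = unicodedata.normalize(form, find)
    return text_n.replace(find_n, replacement)


composed = "caf\u00e9"     # "é" as a single code point (U+00E9)
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent (U+0301)

# A plain ordinal replace for the composed "é" misses the decomposed record.
plain = decomposed.replace("\u00e9", "e")

# Normalizing first lets the same ordinal replace match both records.
fixed_composed = replace_normalized(composed, "\u00e9", "e")
fixed_decomposed = replace_normalized(decomposed, "\u00e9", "e")
```

Both records look identical on screen, but only the normalized replace treats them the same way.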

Make sense?  This will be available in all versions of MarcEdit.

–tr

