One area where I’d like to see MarcEdit continue to evolve is support for clustering data for editing or data extraction. While tools like OpenRefine provide a much more robust set of tooling, there are barriers to using them due to the nature of library data. By embedding lightweight tools into MarcEdit, the tooling can help to overcome some of these issues.
To that end, I’m exploring a handful of additional clustering options for the application, and beginning the process of rolling a few of them out. The first will be an enhancement to the way the program develops keys/tokens when clustering data. By default, the program takes the data found in specific subfield codes and does some very light normalization before passing the data through a set of fuzzy matching algorithms. This process produces clusters, but can miss some data if names or values are inverted. Take for example:
Reese, Terry and Terry Reese. The clustering algorithms likely won’t put these together because the distance between the two strings is pretty high, so they would likely be represented as separate clusters. But this is very much one of the use cases that should be addressed. To that end, I’ve added an option that uses the same approach OpenRefine uses: tokenized fingerprints. Rather than working with the data as provided, the tool breaks the strings down into tokens, normalizes away punctuation and common diacritics, and then sorts the tokens so that Reese, Terry and Terry Reese turn into the following identical token: reese terry. By combining fingerprinting with the fuzzy matching algorithms, users can take even more control over how clustering occurs in the application.
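To make the fingerprinting idea concrete, here is a minimal Python sketch. It is not MarcEdit’s actual code, and it only loosely follows OpenRefine’s fingerprint method; the function names fingerprint and cluster_by_fingerprint are hypothetical, and the choice to replace ASCII punctuation with spaces and to drop duplicate tokens is an assumption made for illustration.

```python
import string
import unicodedata
from collections import defaultdict

# Translation table that turns every ASCII punctuation character into a space.
# (Assumption: punctuation acts as a token separator in this sketch.)
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def fingerprint(value: str) -> str:
    """Build an order-independent key for a string (illustrative only)."""
    # Decompose accented characters, then drop the combining marks,
    # which removes common diacritics.
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and turn punctuation into spaces.
    cleaned = stripped.lower().translate(_PUNCT_TO_SPACE)
    # Split on whitespace, drop duplicate tokens, and sort them so that
    # word order no longer matters.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster_by_fingerprint(values):
    """Group values that share the same fingerprint key."""
    clusters = defaultdict(list)
    for value in values:
        clusters[fingerprint(value)].append(value)
    return dict(clusters)


if __name__ == "__main__":
    names = ["Reese, Terry", "Terry Reese", "Smith, Jane"]
    for key, members in cluster_by_fingerprint(names).items():
        print(key, "->", members)
```

Both forms of the name collapse to the key reese terry and land in the same cluster, while Smith, Jane stays on its own. In practice the keys could then be passed on to a fuzzy matcher, so that near-identical fingerprints are merged as well.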
Users will see this option (in all versions of MarcEdit) within the Generate cluster screen.
One of the goals in implementing this new option is to extend the format support in the clustering application. Over the next week or so, I’ll be adding support for delimited formats (so you can cluster on columns) and XML documents (in any form; again, you will define the values for clustering), allowing users to then make changes across delimited data or any XML-formatted data. The Excel/delimited formats will come first, the XML formats second. With luck, I’ll have this work finished prior to hitting Austin for ER&L.
–tr