One of the benefits of moving the MARCXML=>MARC translation algorithm away from XSLT to an inline function is the ability to provide some sanity checking beyond simple XML validation. One of the issues that I see periodically when working with XML conversions is the need to code data truncation into my XSLT stylesheets. For example, the ETD process that we use with DSpace looks for the abstract and makes sure that the data in the abstract doesn’t exceed the 9,999-byte limit for a MARC field.
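To illustrate the kind of field-level truncation described above, here is a minimal sketch in Python rather than XSLT. It is not the stylesheet we use; the function name `truncate_utf8` and the byte budget are assumptions made for the example. It cuts a string so that its UTF-8 encoding fits within a byte budget without splitting a multibyte character.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return text cut so that its UTF-8 encoding is at most max_bytes long.

    Hypothetical helper for illustration only.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing bytes can leave a partial multibyte sequence at the end;
    # errors="ignore" drops that incomplete tail instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Example: cap an abstract at an assumed budget of 9,999 bytes.
abstract = "x" * 20000
print(len(truncate_utf8(abstract, 9999).encode("utf-8")))
```

In practice the budget for the abstract text would need to be somewhat smaller than 9,999, since the indicators, subfield codes, and field terminator also count toward the field length. The sketch also does not try to keep a combining mark together with its base character.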
Recently, however, I found a different problem, one that I don’t run into often but that showed up when working with some data provided by the Hathi Trust. Some colleagues were given a large sample of data (32 GB of MARCXML) to do some research into providing better identification of government documents records. The new MarcEdit MARCXML process is able to make short work of this 32 GB file, translating the data into MARC in ~20 minutes. The problem that arises, however, is that some of these records are too long. For reasons I cannot understand, the Hathi Trust data includes a local 9xx field that, from the context, appears to be item information. Unfortunately, some records include thousands of items, meaning that when the data is translated, the resulting record is too large (it exceeds the maximum total record length of 99,999 bytes).
However, because of the new MARCXML process, I’ve been able to create a workaround for situations like this. When processing MARCXML data, MarcEdit will internally track the length of each translated record. If that record would exceed the maximum record length, MarcEdit will truncate the record by dropping fields off the end of the record. The program will also modify the 008/38 byte (the Modified record position), setting the value to “s” (shortened), and will visually notify the user that a truncation occurred by turning the results panel purple.
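The general idea can be sketched in a few lines of Python. This is not MarcEdit’s actual code, just an illustration under stated assumptions: a record is modeled as a list of `(tag, data)` pairs with `data` as raw bytes, and the names `record_length`, `mark_shortened`, and `truncate_record` are hypothetical. The length arithmetic follows the MARC record layout: a 24-byte leader, a 12-byte directory entry per field, a terminator after the directory, a terminator after each field, and a record terminator.

```python
LEADER_LEN = 24          # fixed-length leader
DIR_ENTRY_LEN = 12       # tag (3) + field length (4) + start position (5)
MAX_RECORD_LEN = 99_999  # largest value the 5-digit record length can hold


def record_length(fields):
    """Total serialized length of a record built from (tag, data) pairs."""
    directory = DIR_ENTRY_LEN * len(fields) + 1     # + directory terminator
    data = sum(len(d) + 1 for _, d in fields)       # + terminator per field
    return LEADER_LEN + directory + data + 1        # + record terminator


def mark_shortened(data):
    """Set 008/38 to 's'; leave the field alone if it is too short."""
    if len(data) < 40:
        return data
    return data[:38] + b"s" + data[39:]


def truncate_record(fields, max_len=MAX_RECORD_LEN):
    """Drop fields from the end until the record fits within max_len.

    Returns (kept_fields, truncated_flag).
    """
    kept = list(fields)
    truncated = False
    while kept and record_length(kept) > max_len:
        kept.pop()
        truncated = True
    if truncated:
        kept = [(t, mark_shortened(d)) if t == "008" else (t, d)
                for t, d in kept]
    return kept, truncated


# Example: a record with 200 large local item fields (made-up data).
fields = [("008", b" " * 40), ("245", b"10\x1faTitle")]
fields += [("974", b"x" * 1000) for _ in range(200)]
kept, truncated = truncate_record(fields)
print(len(fields), len(kept), truncated, record_length(kept))
```

The returned flag is what a caller would use to signal the truncation to the user. The sketch only handles the record-level limit; a single field that exceeds 9,999 bytes is a separate problem and is not addressed here.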
While I generally take a hands-off approach to modifying MARC data during the translation process, this seems to be a good compromise for dealing with what is now a rare situation, but one that I predict will become all too common as more data is created in systems without the MARC record limitations.
These changes to the translation engine will be included in the next MarcEdit update (scheduled for 1/23/2012), when I’ll post an announcement and include a small record set that can demonstrate the new functionality. Hopefully, folks will find these changes useful, especially as technical services departments find themselves having to deal with more and more non-MARC metadata.