I’ve been working on a few changes to the way MarcEdit processes MARCXML data. In MarcEdit, XML metadata transactions happen via XSLT. This was done primarily to provide a great deal of flexibility in the types of XML transactions MarcEdit could perform. It also meant that others could create their own XSLT transactions and share them with the MarcEdit community, so I wouldn’t be a bottleneck.
So, within the application, there were only 4 canonical conversions:
- MARC => MarcEdit’s mnemonic format
- MarcEdit’s mnemonic format => MARC
- MARC => MARCXML (which occurs through an algorithm, no XSLT)
- MARCXML => MARC (which involves an XSLT translation from MARCXML to MarcEdit’s mnemonic format)
These four conversions represented the foundation on which all other work happened in MarcEdit. The last conversion, MARCXML => MARC, represented a hybrid approach that used an XSLT to translate the data into MarcEdit’s mnemonic format before handing the data off to the MARCEngine to complete data processing. This method has worked very well throughout the years that I’ve made this functionality available, but it has also imposed a significant bottleneck on users working with large XML data files. Because the MARCXML process relied on an XSLT translation, conversions from MARCXML were particularly expensive. It also meant that users wanting to process exceptionally large MARCXML documents would run into hard limits associated with the amount of memory available on their system.
In the next update, this will change. Starting in MarcEdit 5.7, the MARCXML => MARC function will shift from being an XSLT process to a native processing algorithm that uses SAX to process data. The effect of this is that MarcEdit’s ability to process MARCXML data will be greatly improved and much less expensive.
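To illustrate why a SAX approach avoids the memory limits of a DOM-based XSLT processor, here is a minimal sketch in Python (not MarcEdit’s actual code, which isn’t shown here). It streams MARCXML with the standard `xml.sax` module and emits one mnemonic-style text block per record, so only the record in progress is held in memory. The class name `MarcXmlHandler`, the `convert` function, and the `on_record` callback are hypothetical names chosen for this example, and the output only approximates the mnemonic layout.

```python
import io
import xml.sax


class MarcXmlHandler(xml.sax.ContentHandler):
    """Streams MARCXML and hands each finished record to a callback."""

    def __init__(self, on_record):
        super().__init__()
        self.on_record = on_record  # hypothetical callback, called once per record
        self.lines = []             # mnemonic-style lines for the record in progress
        self.text = []              # character data of the current element
        self.tag = None             # tag of the controlfield in progress
        self.field = None           # prefix of the datafield in progress
        self.subfields = []         # "$a..." pieces of the datafield in progress
        self.code = None            # code of the subfield in progress

    @staticmethod
    def _local(name):
        # Drop a namespace prefix such as "marc:" if one is present.
        return name.split(":")[-1]

    def startElement(self, name, attrs):
        name = self._local(name)
        self.text = []
        if name == "record":
            self.lines = []
        elif name == "controlfield":
            self.tag = attrs.get("tag", "")
        elif name == "datafield":
            ind1 = attrs.get("ind1", " ") or " "
            ind2 = attrs.get("ind2", " ") or " "
            # Assumption: blank indicators are written as a backslash.
            indicators = (ind1 + ind2).replace(" ", "\\")
            self.field = "=%s  %s" % (attrs.get("tag", ""), indicators)
            self.subfields = []
        elif name == "subfield":
            self.code = attrs.get("code", "")

    def characters(self, content):
        # The parser may deliver text in several chunks, so collect them.
        self.text.append(content)

    def endElement(self, name):
        name = self._local(name)
        value = "".join(self.text)
        if name == "leader":
            self.lines.append("=LDR  " + value)
        elif name == "controlfield":
            self.lines.append("=%s  %s" % (self.tag, value))
        elif name == "subfield":
            self.subfields.append("$" + self.code + value)
        elif name == "datafield":
            self.lines.append(self.field + "".join(self.subfields))
        elif name == "record":
            # The record is complete: pass it on and keep nothing else around.
            self.on_record("\n".join(self.lines))
        self.text = []


def convert(source, on_record):
    """Parse a MARCXML file name or file-like object as a stream."""
    xml.sax.parse(source, MarcXmlHandler(on_record))


sample = b"""<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1="1" ind2=" ">
    <subfield code="a">Example title</subfield>
  </datafield>
</record>
</collection>"""

records = []
convert(io.BytesIO(sample), records.append)
print(records[0])
```

In a real conversion the callback would write each record out immediately rather than collecting them in a list, which is what keeps memory use flat regardless of file size.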
In benchmarking the change in process, the results get more staggering the larger the source file is. For example, using a 50 MB MARCXML file (14,000 records), you can see the following improvement:
| Process | Time | Records Per Second |
| --- | --- | --- |
| MARCXML => MARC (old method) | 5.9 seconds | 2,372 |
| MARCXML => MARC (new method) | 2.1 seconds | 7,000 |
Working with this smaller file, you can see that there has definitely been an improvement. Using the new processing method, we are able to process nearly 3 times as many records per second. However, this difference becomes even more pronounced as the number of records and the size of the source XML file increase. Using a 7 GB MARCXML file (1.5 million records), the improvement is startling:
| Process | Time | Records Per Minute |
| --- | --- | --- |
| MARCXML => MARC (old method) | 50,400 seconds | 1,785 |
| MARCXML => MARC (new method) | 460 seconds | 197,368 |
Working with the larger file, we see that the new method was able to process 110.5 times as many records per minute. What’s more, it’s likely that on my benchmarking workstations, this represents the largest file I would be able to process using the old, XSLT-centric method. At nearly 14 hours, this file seriously tested the DOM XSLT processor during the initial loading and validation phases of the process. The new method, however, should easily be able to handle MARCXML files of any size.
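For anyone who wants to run the same comparison on their own files, the throughput arithmetic can be sketched as below. The `throughput` helper is a hypothetical name for this example, and the inputs are the rounded record counts and times from the tables above, so the computed values will differ slightly from the published figures, which presumably came from unrounded timings.

```python
def throughput(records, seconds, per=1.0):
    """Records processed per `per` seconds (per=60 gives records per minute)."""
    return records / seconds * per


# 50 MB file, 14,000 records: records per second
small_old = throughput(14_000, 5.9)
small_new = throughput(14_000, 2.1)

# 7 GB file, 1.5 million records: records per minute
large_old = throughput(1_500_000, 50_400, per=60)
large_new = throughput(1_500_000, 460, per=60)

print("small file: %.0f -> %.0f records/second (%.1fx)"
      % (small_old, small_new, small_new / small_old))
print("large file: %.0f -> %.0f records/minute (%.1fx)"
      % (large_old, large_new, large_new / large_old))
```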
So why make these changes? It’s probably pretty rare that folks will need to work with MARCXML files of this size. At least, I would hope not. However, the speed improvements were so great with both small and larger files that it was well worth the effort to implement this change. Likewise, it will improve MarcEdit’s XSLT-based translations by removing one crosswalking step for those transformations moving from a non-MARCXML XML format to MARC. So, the real practical effects of this change will be:
- MARCXML => MARC translations will be much faster
- XSLT translations from a non-MARCXML XML format to MARC will be improved (because you will no longer have the added MARCXML => mnemonic translation occurring)
- MarcEdit will be able to process MARCXML files of any size (physical system storage will be the only limiting factor)
- Processing non-MARCXML translations using XSLT will continue to have practical size limits due to the memory requirements of DOM-based XSLT processors. In my own benchmarking, practical limits tend to be around 500 MB to 1 GB.
These changes will be made available when MarcEdit 5.7 is released, sometime before January 1, 2012.
–TR