MarcEdit 7/3: Updating MarcEditor file processing

This year, the biggest project I’ve been undertaking is a rewrite of how MarcEdit’s MarcEditor handles and tracks changes in files. Since its very first version, MarcEdit has separated edits into two workflows:

1. Edits made manually

2. Edits made using the global tools

Some resources, like report generation and validation, fell into a middle ground.  These workflows existed because of MarcEdit’s special undo function and the need to track large global changes for undo.  Internally, the tool keeps copies of the file at various states, and these snapshots are taken when global updates occur.  This means the best practice within the application has always been to save any manual edits before editing globally.  The reason is that once a global edit occurs, manual edits may not be preserved, because the internal tracking file snapshots the data at the time of the global edit to support the undo functionality.
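To make that failure mode concrete, here is a minimal sketch of snapshot-based undo, written in Java purely for illustration; the names and structure are my own, not MarcEdit’s actual code. The point is that the snapshot copies the file, so manual edits that exist only in the editor buffer never make it into the copy:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical snapshot-based undo: before each global edit, the
// working file is copied aside; undo restores the copy. Manual edits
// that live only in the editor buffer are not on disk yet, so they
// never make it into the snapshot.
class GlobalUndoStack {
    private final Deque<Path> snapshots = new ArrayDeque<>();

    // Called just before a global edit runs.
    void snapshot(Path workingFile) throws IOException {
        Path copy = Files.createTempFile("undo-", ".mrk");
        Files.copy(workingFile, copy, StandardCopyOption.REPLACE_EXISTING);
        snapshots.push(copy);
    }

    // The special undo: roll the working file back to the last snapshot.
    boolean undo(Path workingFile) throws IOException {
        if (snapshots.isEmpty()) return false;
        Files.copy(snapshots.pop(), workingFile, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }
}
```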

To fix this, I needed to rethink how the program reads data.  In all versions since MarcEdit 5, the program has used pages to read MARC data.  These pages create a binary map of the file, allowing the application to locate and extract data at very specific locations (one per page).  If a page was manually edited, a physical page was created and reinserted when the entire file was saved.  The process works well unless users start to mix the two editing workflows together.  Do this too often, and the internal data map of the file can fall out of sync. 
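Reduced to a sketch (again, an illustration under my own assumptions, not MarcEdit’s internals), the paging idea looks something like this: one pass over the file records the byte range of each page, and any page can then be extracted with a single seek and read. I’m assuming mnemonic (.mrk) data where a blank line ends a record:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// A sketch of the binary-map idea: one pass records the byte range of
// each page, after which any page can be pulled out with one seek + read.
class PageMap {
    record Page(long start, long length) {}

    // Assumes mnemonic (.mrk) data where a blank line ends a record.
    static List<Page> build(RandomAccessFile f, int recordsPerPage) throws IOException {
        List<Page> pages = new ArrayList<>();
        long pageStart = 0;
        int recordsInPage = 0;
        String line;
        while ((line = f.readLine()) != null) {
            if (line.isEmpty() && ++recordsInPage == recordsPerPage) {
                pages.add(new Page(pageStart, f.getFilePointer() - pageStart));
                pageStart = f.getFilePointer();
                recordsInPage = 0;
            }
        }
        if (f.getFilePointer() > pageStart)
            pages.add(new Page(pageStart, f.getFilePointer() - pageStart));
        return pages;
    }

    // Extract one page without touching the rest of the file.
    static byte[] read(RandomAccessFile f, Page p) throws IOException {
        byte[] buf = new byte[(int) p.length];
        f.seek(p.start);
        f.readFully(buf);
        return buf;
    }
}
```

With a map like this, opening a large file only materializes the visible page rather than the whole document, which is what makes paging attractive in the first place.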

To fix the sync problem, the file simply needs to be repaged after manual edits have occurred but before a global action takes place.  The problem is that paging the document is expensive.  While MarcEdit 7 brought a number of speed enhancements to page loading and saving, repaging still took significant time. 
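In rough terms, the fix amounts to a dirty flag and a guard, as in this hypothetical sketch:

```java
// A sketch of the guard the fix implies (names are hypothetical):
// manual edits mark the document dirty, and any global action repages
// first, so the byte map is back in sync before the snapshot is taken.
class EditorDocument {
    private boolean dirty = false;      // set by manual edits

    void onManualEdit() { dirty = true; }

    void beforeGlobalEdit() {
        if (dirty) {                    // repage only when needed,
            repage();                   // since repaging is costly
            dirty = false;
        }
    }

    private void repage() { /* rebuild the page map over the current text */ }
}
```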

For example, when MarcEdit 7 first came out, the tool could page roughly 15,000 records per second, which meant a file of 150,000 records would take about 10 seconds to page.  That time was spent primarily on heavy IO work, because many of the files I encounter have mixed newline characters.  This cost was a disincentive to repage documents until necessary, so repaging happened only when a Save action occurred.

As of Christmas 2018, I’ve introduced a new file processing method.  This repaging method works much faster, processing closer to 120,000 records per second.  In real-life tests, this means I can process a GB of data in under 6 seconds and 12 GB of data in around a minute.  That is a significant improvement, and repaging can now happen much more liberally.
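As a rough illustration (this is a simplified stand-in, not the production code), a byte-level scanner that tolerates mixed newlines might look like the sketch below: read large buffered chunks and scan raw bytes, treating \n, \r, and \r\n uniformly, rather than decoding and splitting text line by line:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// A guess at the kind of change that buys this sort of speedup: scan
// raw bytes from a large buffer and treat \n, \r, and \r\n uniformly,
// instead of decoding and splitting text line by line. As above, this
// assumes .mrk data where a blank line ends a record.
class FastRecordScanner {
    static long countRecords(Path file) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file), 1 << 20)) {
            long records = 0;
            int prev = -1, b, breaks = 0;
            boolean sawContent = false;
            while ((b = in.read()) != -1) {
                if (b == '\n' && prev == '\r') { prev = b; continue; } // \r\n is one break
                if (b == '\n' || b == '\r') {
                    // two consecutive line breaks = a blank line = end of record
                    if (++breaks == 2 && sawContent) { records++; sawContent = false; breaks = 0; }
                } else {
                    breaks = 0;
                    sawContent = true;
                }
                prev = b;
            }
            if (sawContent) records++; // final record may lack a trailing blank line
            return records;
        }
    }
}
```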

This has led to a number of changes in how pages are loaded and processed.  A beta version is currently being tested to make sure that all functions are connected to the new document queue.  At this point, I think the work is done, but given the wide range of workflows, I want to make sure I haven’t missed something. 

Once this work is complete, the current best practice of separating manual and global editing workflows will end: liberal repaging will let manual edits live harmoniously with the special undo function.

If you have questions, please let me know.

–tr

