MarcEdit 7: MarcEditor Performance Metrics

By reeset / On / In MarcEdit

Because I change version numbers so rarely when it comes to MarcEdit, I usually like to take a major version change as an opportunity to look at how some of the core code works, and this time is no different.  One of the things that I’ve occasionally heard is that opening and saving larger files in the MarcEditor can be slow.  Before I talk about some of the early metrics, I’d like to explain how the MarcEditor works, because it works differently than a normal text editor in order to allow users to work with files of any size.

Opening records in the MarcEditor

When you open the MarcEditor, the program utilizes one of two modes to read files into the editing screen: Preview and Paging. 

Preview Mode:

Preview mode has been designed specifically for really large files, but the caveat is that when in Preview mode, the editor gets locked into Read Only mode.  This means you can’t type in the Editor, but you can use any of the Editing functions to change the file.  The benefit of Preview mode is that it removes the need to load the whole file (which is an expensive process).

Paging Mode:

Paging mode is the editing mode enabled by default.  This mode breaks files into pages, meaning that MarcEdit must first read the file to determine the number of records and create an internal directory of page start and end locations.  Once that is accomplished, the program then renders data onto the screen.  The pages created are all virtual (they don’t exist) unless a user actually edits information on a page by typing onto the screen.  Global edits affect the whole file, so the file gets re-paged after every global edit.
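To make the idea of a page directory concrete, here is a minimal Python sketch.  It is not MarcEdit’s actual C# code; the function names are hypothetical, and it assumes a text file in which records are separated by blank lines.  It scans the file once, records the byte range of each page, and only reads a page’s content when asked for it.

```python
def build_page_directory(path, records_per_page=100):
    """Scan the file once and return a list of (start, end) byte offsets, one per page.

    Assumption: records are separated by blank lines.
    """
    pages = []
    page_start = 0
    offset = 0
    records_in_page = 0
    in_record = False
    with open(path, "rb") as f:
        for line in f:
            offset += len(line)
            if line.strip():
                in_record = True
            elif in_record:
                # A blank line closes the current record.
                in_record = False
                records_in_page += 1
                if records_in_page == records_per_page:
                    pages.append((page_start, offset))
                    page_start = offset
                    records_in_page = 0
    if offset > page_start:
        # Whatever remains forms a final, partially filled page.
        pages.append((page_start, offset))
    return pages


def read_page(path, directory, page_index):
    """Read one page's bytes on demand; until then the page is only an entry in the directory."""
    start, end = directory[page_index]
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```

The directory costs one full pass over the file, which is the expensive step described above; after that, showing any page is a single seek and read.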

The paging mode is by far the best rendering mode for data under, say, 150 MB (in MarcEdit 6).  This is because at around 150 MB, it starts taking a lot longer to create the virtual pages.  And depending on your operating system and hard drive type, this process could be really expensive.  I’ve found that on older equipment (non-solid-state, i.e. non-SSD, drives), this process can really slow down reading and writing because so many disk accesses have to occur when creating pages (even virtually).

Saving records in the MarcEditor

Saving files essentially does the paging operation in reverse, though now, rather than working with a virtual page, the program does have to access the file and extract the page content for every virtual page in existence.  Again, if you have a non-SSD drive or an older 5400 rpm drive, this can be a slow process.  If your operating system is already having disk usage issues (and older computers upgraded to Windows 10 have many of these), this can slow the process considerably.
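As a rough illustration of why saving touches the disk so often, here is a minimal Python sketch (again, not the actual MarcEdit code).  The function name and the shape of `directory` and `edited_pages` are assumptions made for the example: `directory` is a list of (start, end) byte offsets per page, and `edited_pages` maps a page index to that page’s new content.

```python
def save_with_edits(src_path, dst_path, directory, edited_pages):
    """Write a new file by walking every page in order.

    Edited pages come from memory; every untouched page still has to be
    read back out of the source file, one seek and one read per page.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for index, (start, end) in enumerate(directory):
            if index in edited_pages:
                dst.write(edited_pages[index])
            else:
                src.seek(start)
                dst.write(src.read(end - start))
```

With 100 records per page, a large file has thousands of pages, so even a handful of edits means thousands of small reads and writes.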

MarcEdit 7 Enhancements

In thinking about how this process works, I started wondering how I could improve file operations in MarcEdit 7.  Obviously, the easiest way to improve the open and save processes would be to remove as many disk operations as possible.  The fewer file operations, the faster the process.  So, I started looking.  Now, one of the benefits of updating to the new version of .NET is that I have access to some new programming concepts.  One of these is Thread Tasks, which initiate parallel processes in C# (though I’ve found these must be handled with care, or I can really cause disk issues as threads spawn too quickly); the other is lambda expressions, which enable the compiler to optimize the operations code.  With this in mind, I started working.
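The caution about threads spawning too quickly can be sketched in Python with a bounded worker pool.  This is an analogy, not the .NET Task code itself; the function names and the `max_workers` value are assumptions chosen for the example.  The pool caps how many reads hit the disk at once, and `map` hands results back in the order the ranges were submitted, which matters when data has to be processed in order.

```python
from concurrent.futures import ThreadPoolExecutor


def read_range(path, start, end):
    """Read the bytes in [start, end) from the file."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)


def read_ranges_in_order(path, ranges, max_workers=4):
    """Read several byte ranges in parallel with a capped number of threads.

    The results come back in the same order as `ranges`, regardless of
    which read finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: read_range(path, r[0], r[1]), ranges))
```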

Testing:

For the purpose of this benchmark, I’m using a Dell Inspiron 13 with an i5 processor, an SSD, and 16 GB of RAM.

Reading Data into the MarcEditor

In order to speed up the reading operation, I had to reduce the number of file operations that were being run on the system.  To do this, I made two significant changes. 

  1. When MarcEdit’s Enhanced File reading mode is enabled, MarcEdit reads files under 60 MB into memory.  Using Parallel Tasks, I was able to improve this process, reducing the number of file reads by 50%.  So, if the old method made 100 file reads to build the page, the new process would only make 50 file reads.  Additionally, with the processing now in a Parallel process, data could be read asynchronously, though this doesn’t help as much as one might hope since data needs to be processed in order.  But, it does seem to help.
  2. For files larger than 60 MB, again, I needed to find a way to reduce the number of file reads.  To do this, I tried two things.  First, I increased the buffer.  This means that more data is read at a time, so fewer file reads must occur.  Previously, the buffer was 1 MB.  The buffer has been increased to 8 MB.  This makes a big difference, as files under 8 MB are now read only once, since all of the data fits in the buffer.  The second thing that I did was move file access down to the abstract classes.  This allowed me to interact beneath the StreamReader class and access the actual positions in the file when data was read.  This couldn’t be done in the current version of MarcEdit, because the position properties report where buffered data was read.  This meant that an additional file operation had to occur just to get the file positions.  Again, if the old process needed 100 reads to read the file, the updated process would only need 50 reads.
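Both ideas in the second item can be sketched in a few lines of Python.  This is a simplified illustration, not the StreamReader-level code in MarcEdit: the function name is hypothetical, and it assumes each record ends with a single terminator byte (0x1D, the ISO 2709 record terminator, is used here purely as an example).  Reading in larger chunks lowers the number of reads, and tracking the absolute offset of each chunk yields true file positions without a second file operation.

```python
def scan_record_ends(stream, buffer_size=8 * 1024 * 1024, terminator=b"\x1d"):
    """Return (ends, reads) for a binary stream.

    ends:  absolute offset just past each record terminator
    reads: number of non-empty reads issued against the stream
    """
    ends = []
    reads = 0
    base = 0  # absolute offset of the first byte of the current chunk
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        reads += 1
        idx = chunk.find(terminator)
        while idx != -1:
            # Position in the file = chunk's base offset + position in the chunk.
            ends.append(base + idx + 1)
            idx = chunk.find(terminator, idx + 1)
        base += len(chunk)
    return ends, reads
```

Running the same data through a small and a large buffer gives identical record positions but very different read counts, which is the effect the larger buffer is meant to achieve.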

So, what’s the impact of this?  Well, let’s see.  I have a 350 MB file and paging set to 100 records per page.  This is a UTF-8 file with records from materials The Ohio State University Libraries has loaded into the HathiTrust.  Using this as my test set, I simply opened the file in the MarcEditor in MarcEdit 6.3.x and MarcEdit 7.0.0.alpha.  To test, I loaded this file five times, throwing out the slowest and fastest times, and selecting the status message closest to the average.
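The selection method described above can be written down as a short Python sketch.  The function name is hypothetical, and it reads "closest to the average" as the closest of the runs that remain after the fastest and slowest are dropped, which is an assumption.

```python
def representative_time(times):
    """Drop the fastest and slowest run, then return (mean, closest_run).

    closest_run is the remaining measurement nearest to the mean of the
    remaining measurements.
    """
    if len(times) < 3:
        raise ValueError("need at least three runs to drop the extremes")
    kept = sorted(times)[1:-1]
    mean = sum(kept) / len(kept)
    closest = min(kept, key=lambda t: abs(t - mean))
    return mean, closest
```

For example, with five made-up timings such as `[14.2, 13.8, 15.0, 13.0, 14.0]`, the 13.0 and 15.0 runs are discarded and the run reported is the one nearest the mean of the other three.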

MarcEdit 6.3.x: [image: status message showing the load time]

MarcEdit 7.0.0.alpha: [image: status message showing the load time]

We can see that by reducing the number of file reads, the process improves significantly, though it could be better.  Digging deeper into the results, I’m finding that the actual reading of the data is even faster, with the actual rendering of the data in the newer control taking longer than in the previous editing control.  The reason for this is that in MarcEdit 6.3.x, this control usage has been optimized, double buffered, etc.  In MarcEdit 7.0.0.alpha, this hasn’t been done yet.  My guess is that I can probably get these numbers down to around 8.7-9 seconds for a file of this size.  That would represent a 5 to 5.5 second improvement in performance.  Of course the question is, will this help individuals opening smaller files?  I think yes.  On my SSD, loading a 50 MB file takes roughly the same amount of time in both versions: 1.3 seconds.  But on a non-SSD drive, I think the improvement will be significant given that the number of file reads will be reduced.

This test, though, was with the old defaults in MarcEdit.  For MarcEdit 7.0.0.alpha, I would like to change the default paging size to 1000 records per page (since the new component is more efficient when dealing with larger sets).  So, let’s run the test again, this time with the new paging value and the same approach as above:

MarcEdit 6.3.x: [image: status message showing the load time]

MarcEdit 7.0.0.alpha: [image: status message showing the load time]

Looking at the process, you can see that the gap between the two versions gets larger.  Again, looking closer at the data, the actual loading of the file is faster than in the first tests, but rendering the data pushed the final load times higher.  As in the first tests, I believe that once the Editor itself has been optimized, we’ll see this improve significantly.  By the time the final version comes out, the performance difference on this type of file could be between 6 and 8 seconds, or a 37-50% speed improvement over the current 6.3.x version of the software.

Writing files in the MarcEditor

In looking at the process used to write files on save, the same kind of issues are causing the problems there.  First, saving requires a lot of file access (both read and write), and second, once a file is saved, it is reloaded into the Editor.  This means that on systems with SSDs, the performance benefits may be modest, but for non-SSD systems, the gains should be significant.  But there was only one way to tell.  Using the same file, I made edits on 4 pages: the first page, the 50th page, the 150th page, and the last page.  Paging was set back to 100 records per page.  This forces the tool to combine the changed pages with the unchanged data in the virtual space.  Using the loading times above, we can estimate the time actually used when saving the data.  I’ll be providing numbers for both the Save and Save As processes (since they work slightly differently):

Saving the file using Save:

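MarcEdit 6.3.x: [image: status message showing the save and reload time]

MarcEdit 7.0.0.alpha: [image: status message showing the save and reload time]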

Saving the file using Save As:

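MarcEdit 6.3.x: [image: status message showing the save and reload time]

MarcEdit 7.0.0.alpha: [image: status message showing the save and reload time]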

As you can see here, the difference between the new saving method and the old saving method is pretty significant.  The time posted here reflects the time it takes to both save the file and then reload the data back into the Editor window.  Taking the load times from the first test, where rendering the file in MarcEdit 6.3.x takes an average of 14 seconds, we can determine that the Save function takes ~6.2 seconds and the Save As operation approximately 6.7 seconds.  Let’s compare that to MarcEdit 7.0.0.alpha.  We know that the rendering of the file takes approximately 10 seconds.  That means that the Save function takes ~0.8 seconds to complete, and the Save As function 1.2 seconds.  In each case, this represents a significant performance improvement, and as noted above, optimizations have yet to be completed.  Additionally, I do believe that on non-SSD systems, the performance gains will be even more noticeable.
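To put the save figures quoted above on a common scale, the short Python sketch below turns them into percentage reductions.  Only the four approximate save times come from the text; the helper name is hypothetical, and since the inputs are estimates (total time minus render time), the results are rough.

```python
def percent_faster(old_seconds, new_seconds):
    """Percentage reduction in elapsed time going from old to new."""
    return (old_seconds - new_seconds) / old_seconds * 100.0


# Approximate save-only times from the text (total time minus render time).
save = percent_faster(6.2, 0.8)
save_as = percent_faster(6.7, 1.2)

print(f"Save: about {save:.0f}% less time; Save As: about {save_as:.0f}% less time")
# prints "Save: about 87% less time; Save As: about 82% less time"
```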

Thoughts, Conclusions, and So what

Given how early I am in the development and optimization process, why start looking at any of these metrics now?  Surely some of these things will change, and I’m sure they will.  But these numbers give me a baseline to work with, and a touchstone as I continue working on optimizing the process.  It is early, but one of the things that I wanted to highlight here is that in addition to the new features, updated interface, and accessibility improvements, a big part of this update is about performance and speed.  When I initially wrote MarcEdit, nearly all the code was written in Assembly.  Shifting to a higher-level language was incredibly painful for me to do because I want things to be fast, and Assembly programming is all about building things small and building things fast.  You have access to the CPU registers, and you can make magic happen.  Unfortunately, keeping up with the changes in the metadata world, the need to provide better Unicode support, and my desire to support Mac systems (which used, at the time, a different CPU architecture) meant moving to a higher-level language that could be compiled for different systems.  Ever since that code migration, I’ve been chasing the clock, trying to get the processing speeds down to those of the original Assembly code base.  Is that possible?  No.  And even if it were, so many things have changed and been added that the process simply does more than the simple libraries that I first created in 1999…but still, that desire is there.

So, while I am spending most of my time communicating publicly about the new wireframes and new functionality in MarcEdit 7 (and I’m really excited about these changes), please know that MarcEdit 7 is also about making it fast.  I think MarcEdit 6.3.x is already pretty quick on its feet.  As you can see here, it’s about to get faster.

–tr