MarcEdit Paging approach

By reeset / In MarcEdit

I’m just about to the point where I have this work completed and will be ready to send it out to a few people for testing. However, I want to provide some background first so folks have an idea of how this will work (even if you’re not that interested).

Paging:

The idea here is that loading the entire data file into an edit window is a big waste of resources and a performance killer.  So, rather than load all the data, we load small snippets of data, but allow users to search the entire file or page through it.  At this point, here’s what this looks like:

image

This is a sample using a 109 MB file. Previously, this would have consumed over 450 MB of virtual memory to open, and editing would be limited. Using the paging approach, memory allocation is down to 37 MB, essentially the memory allocated when the program opens (thanks to the need to initialize the .NET framework).

image

This is a big difference and it shows. But how exactly does this work so that performance doesn’t suffer as you page through files?

Well, here’s the process when paging. 

  1. The user selects a file to open
  2. MarcEdit opens the file and performs the following preprocessing steps
    1. Is Preview mode selected –> If yes, open in Preview mode
    2. Is Preview mode turned off –> If yes, continue to paging
      1. Pull the configuration option that defines number of records per page (found on the preferences dialog)
      2. Pre-process the file.  Preprocessing does the following
        1. Determine number of records in the file
        2. Determine number of pages to display
        3. Create an internal memory map of the file, capturing a structure of start and end positions within the file for a set of pages.
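
The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration in Python, not MarcEdit's actual code (which runs on .NET). The function name `build_page_index`, the chunk size, and the choice of the ISO 2709 record terminator byte (0x1D) as the record delimiter are all assumptions made for the example; the real delimiter depends on the format being edited.

```python
from typing import List, Tuple

# Assumption: each record ends with a single delimiter byte. 0x1D is the
# record terminator in binary MARC (ISO 2709); other formats would differ.
RECORD_END = b"\x1d"


def build_page_index(
    path: str,
    records_per_page: int,
    delimiter: bytes = RECORD_END,
    chunk_size: int = 1 << 16,
) -> Tuple[int, List[Tuple[int, int]]]:
    """Scan the file once; return (record_count, [(start, end), ...] per page)."""
    if records_per_page < 1:
        raise ValueError("records_per_page must be at least 1")
    if len(delimiter) != 1:
        raise ValueError("this sketch only handles a single-byte delimiter")

    pages: List[Tuple[int, int]] = []
    record_count = 0
    page_start = 0  # file offset where the current page begins
    offset = 0      # file offset of the first byte of the current chunk

    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # buffered read, never the whole file
            if not chunk:
                break
            pos = chunk.find(delimiter)
            while pos != -1:
                record_count += 1
                if record_count % records_per_page == 0:
                    end = offset + pos + 1  # one past the delimiter
                    pages.append((page_start, end))
                    page_start = end
                pos = chunk.find(delimiter, pos + 1)
            offset += len(chunk)

    # Whatever remains after the last full page becomes a final, shorter page.
    if offset > page_start:
        pages.append((page_start, offset))
    return record_count, pages
```

The index only holds two integers per page, so its size depends on the number of pages rather than on the size of the file.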

 

The most important part of the paging process is the pre-processing that occurs on the file. In order to do paging (at the record level), MarcEdit must read the file and determine how many records it contains. This means that when you open a large file, there will be an initial pause while the file is pre-processed. Once this preprocessing is done, the program should not need to do it again unless the file is reloaded (through a global edit, etc). How long will it take? This is hard to say. The process that I use is fairly optimized, uses buffers, etc. For example, on the 109 MB file above, preprocessing took approximately 2 seconds. I think that this is fair. Once the preprocessing is done, each page, no matter where it sits in the file, should be addressable in under a second (or right at 1 second for allocation and render). For my 109 MB test file, page rendering averages 0.7 seconds. I’m happy with this.
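
To illustrate why the cost of fetching a page should not depend on where it sits in the file: with start and end offsets already recorded, loading a page is one seek plus one bounded read. The sketch below is a Python illustration under that assumption; `read_page` and the shape of the `pages` list (a list of `(start, end)` byte offsets, pages numbered from 1) are hypothetical and not taken from MarcEdit.

```python
from typing import List, Tuple


def read_page(path: str, pages: List[Tuple[int, int]], page_number: int) -> bytes:
    """Return the raw bytes of one page (pages are numbered from 1)."""
    if not 1 <= page_number <= len(pages):
        raise IndexError("page %d is out of range" % page_number)
    start, end = pages[page_number - 1]
    with open(path, "rb") as f:
        f.seek(start)                # jump straight to the page
        return f.read(end - start)   # read only that page's bytes
```

Only the requested page is ever held in memory, which is consistent with the flat memory use described earlier.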

Saving/edits:

I knew when doing this that saving and handling edits on paged data would be one of the biggest issues of this method. The primary reason is that in most cases, the method that would be used would be to create a shadow copy (memory mapped file) of the original and save changes to it as the user paged through and made edits. The problem with this approach is twofold. Since we are dealing with records (not characters), each edit would need to be saved, the file re-preprocessed (because file positions would change), and the page re-rendered. When I attempted to use this approach on my 109 MB test file, page rendering jumped to nearly 6 seconds because of all the work being done to save and reprocess the file. Obviously, that’s not acceptable. So, I’ve decided to use a different approach. Internally, I’ve added an enumerated structure that stores a page number and a file pointer. As pages are changed, a temporary file is created that stores just that modified page. As you page through the file, MarcEdit checks the enumerator to see if a modified copy of the page exists before pulling it from the source. This way, if you change page 1, then move to page 2 and go back to page 1, you’d see your changes, which would be pulled directly from the shadow buffer. These temp files will be stored and will then be reconciled with the source file when:

  1. The user saves a file
  2. The user completes a global edit function (because these always require a full save – even if it is to an internal shadow file).

Using this approach, paging isn’t affected by edits to pages, and saving appears to work fine. 
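
A rough sketch of this overlay idea, in Python for illustration only. The class name `PagedFile`, its methods, and the use of a dictionary mapping page numbers to temp file paths are assumptions standing in for the enumerated structure described above; MarcEdit's internal layout may well differ.

```python
import os
import tempfile
from typing import Dict, List, Tuple


class PagedFile:
    """Source file plus an overlay of modified pages kept in temp files."""

    def __init__(self, path: str, pages: List[Tuple[int, int]]) -> None:
        self.path = path
        self.pages = pages                  # (start, end) offsets per page
        self.modified: Dict[int, str] = {}  # page number -> temp file path

    def get_page(self, page_number: int) -> bytes:
        temp = self.modified.get(page_number)
        if temp is not None:                # edited page: read the temp copy
            with open(temp, "rb") as f:
                return f.read()
        start, end = self.pages[page_number - 1]
        with open(self.path, "rb") as f:    # untouched page: read the source
            f.seek(start)
            return f.read(end - start)

    def set_page(self, page_number: int, data: bytes) -> None:
        temp = self.modified.get(page_number)
        if temp is None:
            fd, temp = tempfile.mkstemp(suffix=".page")
            os.close(fd)
            self.modified[page_number] = temp
        with open(temp, "wb") as f:         # source offsets are left untouched
            f.write(data)

    def save(self, out_path: str) -> None:
        """Merge source and edited pages; out_path must differ from self.path."""
        with open(out_path, "wb") as out:
            for n in range(1, len(self.pages) + 1):
                out.write(self.get_page(n))
        for temp in self.modified.values():
            os.remove(temp)
        self.modified.clear()
```

Because an edit never changes offsets in the source file, the page index stays valid until a save. The saved file has new offsets, so in this sketch it would need to be pre-processed again before being paged.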

Anyway, that’s the approach that I’m working with right now. As I say, I’m hoping to wrap up this work tonight/tomorrow, and if that happens, I’ll be posting a test version for those brave souls who want to give this a whirl and give me feedback. This may also let folks see one more tool: I’m going to add a debugger switch which will allow you to capture a log file that stores variable states at critical moments. This is something that I’ve been wanting, as it should help me when people ask for debugging help.

 

–TR