MarcEdit 7: Startup Wizard

By reeset / On / In MarcEdit

One of the aspects of MarcEdit that I’ve been thinking a lot about over the past year is how to make it easier for users to know which configuration settings are important, and which ones are not.  This is the problem of writing a library metadata application that is MARC agnostic.  There are a lot of assumptions that users make because they associate MARC with the specific flavor of MARC that they are using.  So, for someone who only has exposure to MARC21, associating the title with MARC field 245 would be second nature.  But MarcEdit is used by a large community that uses UNIMARC rather than MARC21 (or other flavors, for that matter).  For those users, the 245 field has a completely different meaning.

This presents a special challenge.  Simple things, like just displaying title information for a record, get harder, because assumptions I make for one set of users will cause issues for others.  To address this, MarcEdit has a rich set of application settings, designed to enable users to tell the application a little about the data they are working with.  Once that information is provided, MarcEdit can configure the components and adjust assumptions so title information pulls from the correct fields, or Unicode bits get updated in the correct leader locations.  The problem, from a usability perspective, is that these values are sorted into a wide range of other MarcEdit settings and preferences…which raises the question: which are the most important?

If you’ve installed MarcEdit 6 recently on a new computer, the way that the program has attempted to deal with this issue is by showing the preferences window on the application’s first run.  This means that the first time the program is executed, you see the following window:

image

Now, I’m not naïve.  I know that most users just click OK, and the program opens up for them, and they work with MarcEdit until they run across something that might require them to go back and look at the settings.  But when I do MarcEdit workshops, I get some specific questions related to accessibility (e.g., can I make the fonts bigger or change the font), display (my Unicode characters don’t display), UNIMARC versus MARC21, etc.  From the window above, you can answer all of these questions, but you have to know which settings group handles each option.  It’s admittedly a pain, and because of that, most workshops I do include 20-30 minutes just going over the settings that might be worth considering.

With MarcEdit 7, I have an opportunity to rethink how users interact with the program, and I started to think about how other software does this successfully.  By and large, the ones that I think are more successful provide a kind of wizard at the start that helps to push the most important options forward…and the best examples include a little bit of whimsy in the process.  No, I might not do whimsy well, but I can think about the setting groups that might be the most important to bring front and center to the user.

To that end, I’ve developed a startup wizard for MarcEdit 7.  All users that install the application will see it (because MarcEdit 7 will install into its own user space, everyone will have this first run experience).  Based on the answers to questions, I’m able to automatically set data in the background to ensure that the application is better configured for the user the first time they start using MarcEdit, rather than later, when they need help finding configuration settings.  It also will give me an opportunity to bring potential issues to the user’s attention.  So, for example, the tool will specifically look to see if you have a comprehensive Unicode font installed (so, MS Arial Unicode or the Noto Sans fonts).  If you don’t, the program will point you to help files that discuss how to get one for free, as this will directly impact how the program displays Unicode characters (and comes up all the time given some decisions Microsoft has made in distributing their own Unicode fonts).  Additionally, I’ll be utilizing some automatic translation services, so the program will automatically react to your system’s default language settings.  If they are English, text will show in English.  If they are Greek, the interface will show the machine translated Greek.  Users will have the option to change the language in the wizard, and I’ll provide notes about the translations (since machine translations are getting better, but there’s bound to be some pretty odd text).  The hope is that this will make the program more accessible, and usable…and whimsical.  Yes, there is that too.  MarcEdit 7’s codename was developed after a nickname for my Golden Doodle.  So, she’s volunteered to help get users through the initial startup process.
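To make the two automatic checks a little more concrete, here is a minimal sketch of what a font check and a language pick could look like.  The class and method names are hypothetical and this is not MarcEdit’s actual code; the font family names are assumptions based on the two fonts mentioned above, and the list of installed families is passed in (on Windows it could come from System.Drawing.Text.InstalledFontCollection) so that the logic itself stays platform neutral.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class StartupChecks
{
    // Returns true if any installed family looks like a comprehensive Unicode font.
    // The two names checked here are assumptions taken from the fonts named in the post.
    public static bool HasUnicodeFont(IEnumerable<string> installedFamilies)
    {
        return installedFamilies.Any(name =>
            name.Equals("Arial Unicode MS", StringComparison.OrdinalIgnoreCase) ||
            name.StartsWith("Noto Sans", StringComparison.OrdinalIgnoreCase));
    }

    // Picks the interface language: the system UI language when a translation
    // exists for it, otherwise English.
    public static string PickLanguage(CultureInfo uiCulture, ISet<string> translated)
    {
        string code = uiCulture.TwoLetterISOLanguageName;
        return translated.Contains(code) ? code : "en";
    }
}
```

A wizard could call PickLanguage with CultureInfo.CurrentUICulture and still let the user override the result on the first page.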

The Wizard will likely change as I continue to evaluate settings groups, but at this point, I’m kind of leaning towards something that looks like this:

image

image

image

image

 

I’ve had a few folks walk through this process, and by and large, they find it much more accessible than the current process of just showing the settings screen.  Additionally, they like the idea of the language translations, but wonder if the machine translations will be useful (I did an initial set, they are what they are)…I’ll get more feedback on that before release.  If they aren’t useful, I may remove that option, though I have to feel that for folks where English is a challenge, having anything is better than nothing (though, I could be wrong).

But this is what I’m thinking.  It’s hopefully a little fun, easy to walk through, and will allow me to ensure that MarcEdit has been optimally configured for your data.  What do you think?

–tr

MarcEdit 7: Super charging Task Processing

By reeset / On / In MarcEdit

One of the components getting a significant overhaul in MarcEdit 7 is how the application processes tasks.  This work started in MarcEdit 6.3.x, when I introduced a new –experimental bit when processing tasks from the command-line.  This bit shifted task processing from within the MarcEdit application to directly against the libraries where the underlying functions for each task were run.  The process was marked as experimental, in part, because task processing has always been tied to the MarcEdit GUI.  Essentially, this is how a task works in MarcEdit:

image

Essentially, when running a task, MarcEdit opens and closes the corresponding edit windows and processes the entire file, on each edit.  So, if there are 30 steps in a task, the program will read the entire file, 30 times.  This is wildly inefficient, but also represents the easiest way that tasks could be added into MarcEdit 6 based on the limitations within the current structure of the program.

In the console program, I started to experiment with accessing the underlying libraries directly – but still, maintained the structure where each task item represented a new pass through the program.  So, while the UI components were no longer being interacted with (improving performance), the program was still doing a lot of file reading and writing.

In MarcEdit 7, I re-architected how the application interacts with the underlying editing libraries, and as part of that, included the ability to process tasks at that more abstract level.  The benefit of this, is that now all tasks on a record can be completed in one pass.  So, using the example of a 30 item task – rather than needing to open and close a file 30 times, the process now opens the file once and then processes all defined task operations on the record.  The tool can do this, because all task processing has been pulled out of the MarcEdit application, and pushed into a task broker.  This new library accepts from MarcEdit the file to process, and the defined task (and associated tasks), and then facilitates task processing at a record, rather than file, level.  I then modified the underlying library functions, which actually was really straightforward given how streams work in .NET. 
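The difference between “one pass per task step” and “all steps per record” can be sketched roughly as follows.  This is only an illustration of the idea, not the actual task broker: the class name, the representation of a task step as a Func&lt;string, string&gt;, and the assumption that mnemonic records are separated by blank lines are all mine.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TaskBrokerSketch
{
    // Reads mnemonic records one at a time (assumes a blank line between records).
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        var record = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0)
            {
                if (record.Length > 0) { yield return record.ToString(); record.Clear(); }
            }
            else
            {
                record.AppendLine(line);
            }
        }
        if (record.Length > 0) yield return record.ToString();
    }

    // Applies every task step to each record, so the input is read exactly once
    // no matter how many steps the task contains.
    public static void Run(TextReader input, TextWriter output, IList<Func<string, string>> steps)
    {
        foreach (string record in ReadRecords(input))
        {
            string current = record;
            foreach (var step in steps) current = step(current);
            output.Write(current);
            output.WriteLine(); // blank line between records
        }
    }
}
```

With 30 steps in the list, the file is still opened and read a single time; only the inner loop grows.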

Within MarcEdit, all data is generally read and written using the StreamReader/StreamWriter classes, unless I specifically need to access data at the binary level.  In those cases, I use a MemoryStream.  The benefit of using the StreamReader/StreamWriter classes, however, is that they derive from the abstract TextReader/TextWriter classes.  .NET also has StringReader and StringWriter classes, which allow C# to read and write strings like a stream – they too derive from TextReader and TextWriter.  This means that I’ve been able to make the following changes to the functions, and re-use all the existing code while still providing processing at both a file and a record level:

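string function(string sSource, string sDest, bool isFile = true) {

    // File mode: output holds the destination path.  Record mode: output collects the result.
    System.Text.StringBuilder output = new System.Text.StringBuilder(sDest);

    System.IO.TextReader reader = null;
    System.IO.TextWriter writer = null;

    try {
        if (isFile) {
            // sSource and sDest are file paths
            reader = new System.IO.StreamReader(sSource);
            writer = new System.IO.StreamWriter(output.ToString(), false);
        } else {
            // sSource is the record data itself
            output.Clear();
            reader = new System.IO.StringReader(sSource);
            writer = new System.IO.StringWriter(output);
        }

        //…Do Stuff

    } finally {
        // Flush and release both ends before the result is returned
        if (writer != null) writer.Dispose();
        if (reader != null) reader.Dispose();
    }

    return output.ToString();
}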

As a TextReader/TextWriter, I now have access to the necessary functions needed to process both data streams like a file.  This means that I can now handle file or record level processing using the same code – as long as both data sources are in the mnemonic format.  Pretty cool.
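For anyone who wants to see the pattern end to end, here is a small, self-contained sketch.  The edit itself (dropping every field with a given tag) and all of the names are made up for illustration; the point is only that the one routine holding the logic takes TextReader/TextWriter, and the file and record entry points differ only in which concrete classes they wrap around the data.

```csharp
using System;
using System.IO;

public static class SharedProcessing
{
    // The editing logic only sees the abstract base types, so it cannot tell
    // whether the data comes from a file or from an in-memory record.
    public static int DropFields(TextReader reader, TextWriter writer, string tagToDrop)
    {
        int dropped = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.StartsWith("=" + tagToDrop, StringComparison.Ordinal)) { dropped++; continue; }
            writer.WriteLine(line);
        }
        return dropped;
    }

    // Record-level call: wrap the strings in StringReader/StringWriter.
    public static string ProcessRecord(string record, string tagToDrop)
    {
        using (var reader = new StringReader(record))
        using (var writer = new StringWriter())
        {
            DropFields(reader, writer, tagToDrop);
            return writer.ToString();
        }
    }

    // File-level call: wrap the paths in StreamReader/StreamWriter.
    public static void ProcessFile(string source, string dest, string tagToDrop)
    {
        using (var reader = new StreamReader(source))
        using (var writer = new StreamWriter(dest, false))
        {
            DropFields(reader, writer, tagToDrop);
        }
    }
}
```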

What does this mean for users?  It means that in MarcEdit 7, tasks will be supercharged.  In testing, I’m seeing tasks that used to take 1, 2, or 3 minutes to complete now run in a matter of seconds.  So, while there are a lot of really interesting changes planned for MarcEdit 7, this enhancement feels like the one that might have the biggest impact for users, as it will represent significant time savings when you consider processing time over the course of a month or year.

Questions, let me know.

–tr

MarcEdit 7 release schedule planning

By reeset / On / In MarcEdit

I’m going to put this here to help folks that need to work with IT depts when putting new software on their machines.  At this point, with the new features, the updates related to the .NET language changes, the filtering of old XP code and the updated performance code, and new installer – this will be the largest update to the application since I ported the codebase from Assembly to C#.  Just looking at this past weekend, I added close to 17,000 lines of code while completing the clustering work, and removed ~3000 lines of code doing optimization work and removing redundant information. 

In total, work on MarcEdit 7 has been ongoing since April 2017 (formally), and informally since Jan. 2017.  However, last night, I hit a milestone of sorts – I set up the new build environment for MarcEdit 7.  In fact, this morning (around 1 am), I created the first version of the new MarcEdit 7 installer that can be installed without administrator permissions.  I’ve heard again and again that the administrator requirements are one of the single biggest issues for users in staying up to date.  With MarcEdit 7, the program will provide multiple installation options that should help to alleviate these problems.

Anyway, given the pace of change and my desire to have some folks put this through its paces prior to the formal release, I’ll be making multiple versions of MarcEdit 7 available for testing using the schedule below.  Please note, the Alpha and Beta dates are soft dates (they could move up or down by a few days), but the Release Date is a hard date.  Also, unlike previous versions of MarcEdit, MarcEdit 7 can be installed alongside MarcEdit 6, so both versions will be able to live on the same machine.  To simplify this process, all test builds of MarcEdit will be released requiring non-administrator access to install, as this will allow me to sandbox the software more easily.

Alpha Testing:

Sept. 14, 2017 – this will be the first test version of MarcEdit 7.  It won’t be feature complete, but the features included should be finished and working – though I’m expecting to hear from people that some things are broken.  Really, this first version is for those waiting to get their hands on the installer and play with software that likely is a little broken.

Beta Testing:

Oct 2, 2017 – First beta build will be created.  New builds will likely be made available biweekly.

MarcEdit 7 Release Date:

Nov. 25, 2017 – MarcEdit 7.0.x release date.  The release will happen over the U.S. Thanksgiving Holiday. 

This gives users approximately 3 months to ensure that their local systems will be ready for the new update.  Remember, the system requirements are changing.  As of MarcEdit 7, the software will have the following system requirements on Windows (the Mac and Linux versions already have these requirements):

System Requirements:

  1. Operating System
    Windows 7-present (software may work on Windows Vista, but given the low install-base [smaller than Windows XP], Windows 7 will be the lowest version of Windows I’ll be officially testing on and supporting)
  2. .NET Version
    4.6.1+ –  Version 4.6.1 is the minimal required version of the .NET platform.  If you have Windows 8-10, you should be fine.  If you have Windows 7, you may have to update your .NET instance (though, this will happen automatically if you accept Microsoft’s updates).  If you have questions, you’ll want to contact your IT department.

That’s it.  But this does represent a very significant change for the program.  For years, I’ve been limping Windows XP support along, and MarcEdit 7 does represent a break from that platform.  I’ll be keeping the last version of MarcEdit 6.3.x available for users that run an unsupported operating system and cannot upgrade, though, I won’t be making any more changes to MarcEdit 6.3.x after MarcEdit 7 comes out. 

If you have questions, let me know.

–tr

MarcEdit 7 alpha: Introducing Clustering tools

By reeset / On / In MarcEdit

Folks sometimes ask me how I decide what kinds of new tools and functions to add to MarcEdit.  When I was an active cataloger/metadata librarian, the answer was easy – I added tools and functions that helped me do my work.  As my work has transitioned to more and more non-MARC/integrations work, I still add things to the program that I need (like the linked data tooling), but I’ve become more reliant on the MarcEdit and metadata communities to provide feedback regarding new features or changes to the program.

This is kind of how the Clustering work came about.  It started with this tweet: https://twitter.com/LibSkrat/status/898189609859002368.  There are already tools that catalogers can use to do large scale data clustering (OpenRefine), and my hope is that more and more individuals make use of them.  But in reading the responses and asking some questions, I started thinking about what this might look like in a tool like MarcEdit – and could I provide a set of lightweight functionality that would help users solve some problems, while at the same time exposing them to other tooling (like OpenRefine)…and I hope this is what I’ve done.

This work is very much still in active development, but I’ve started the process of creating a new way of batch editing records in MarcEdit.  The clustering tools will be provided as both a stand alone resource and a resource integrated into the MarcEditor, and will be somewhat special in that it will require that the application extract the data out of MARC and store it in a different data model.  This will allow me to provide a different way of visualizing one’s data, and potentially make it easier to surface issues with specific data elements.

The challenge with doing clustering is that this is a very computationally expensive process.  From the indexing of the data out of MARC, to the creation of the clusters using different matching algorithms, the process can take time to generate.  But beyond performance, the question that I’m most interested in right now is how to make this function easier for users to navigate and understand.  How to create an interface that makes it simple to navigate clustered groups and make edits within or across clustered groups.  I’m still trying to think about what this looks like.  Presently, I’ve created a simple interface to test the processes and start asking those questions.
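To give a sense of what a “matching algorithm” can mean here, below is a minimal sketch of key-collision clustering using a fingerprint key, which is one of the methods OpenRefine documents.  I’m using it purely as an illustration; treat it as an assumption rather than a description of the algorithms MarcEdit uses, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ClusterSketch
{
    // Fingerprint key: lowercase, strip punctuation and control characters,
    // then de-duplicate and sort the remaining tokens.
    public static string Fingerprint(string value)
    {
        string cleaned = Regex.Replace(value.Trim().ToLowerInvariant(), @"[\p{P}\p{Cc}]", "");
        var tokens = cleaned.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                            .Distinct()
                            .OrderBy(t => t, StringComparer.Ordinal);
        return string.Join(" ", tokens);
    }

    // Groups values that share a key; only groups holding more than one
    // distinct spelling are worth showing to the user as a cluster.
    public static List<List<string>> Cluster(IEnumerable<string> values)
    {
        return values.GroupBy(Fingerprint)
                     .Select(g => g.Distinct().ToList())
                     .Where(g => g.Count > 1)
                     .ToList();
    }
}
```

With this key, “Twain, Mark” and “Mark Twain.” land in the same cluster, while “Austen, Jane” stays on its own.  Keying every extracted value is also where a good part of the computational cost mentioned above would come from.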

If you are interested in seeing how this function is being created and some of the assumptions being made as part of the development work – please see: https://youtu.be/DH93QDmeOW8

I’m interested in feedback – particularly around the questions of UI and editing options, so if you see the video and have thoughts, let me know.

–tr

MarcEdit 6.3 Updates (all versions)

By reeset / On / In MarcEdit

I spent some time this week working on a few updates for MarcEdit 6.3.  Full change log below (for all versions).

Windows/Linux/MacOS:

* Bug Fix: MarcEditor: When processing data with right to left characters, the embedded markers were getting flagged by the validator.
* Bug Fix: MarcEditor: When processing data with right to left characters, I’ve heard that there have been some occasions when the markers are making it into the binary files (they shouldn’t).  I can’t recreate it, but I’ve strengthened the filters to make sure that these markers are removed when the mnemonic file format is saved.
* Bug Fix: Linked data tool:  When creating VIAF entries in the $0, the subfield code can be dropped.  This was missed because VIAF should no longer be added to the $0, so I assumed this was no longer a valid use case.  However, local practice in some places is overriding best practice.  This has been fixed.

A note on the MarcEditor changes.  The processing of right to left characters is something I was aware of in regards to the validator – but in all my testing and unit tests, the data was always filtered prior to compiling the data.  These markers that are inserted are for display, as noted here: http://blog.reeset.net/archives/2103.  However, on the pymarc list, there was apparently an instance where these markers slipped through.  The conversation can be found here: https://groups.google.com/forum/#!topic/pymarc/5zxuOh0fVuc.  I posted a long response on the list, but I think it’s being held in moderation (I’m a new member to the list), but generally, here’s what I found.  I can’t recreate it, but I have updated the code to ensure that this shouldn’t happen.  Once a mnemonic file is saved (and that happens prior to compiling), these markers are removed from the file.  I guess if you find this isn’t the case, let me know.  I can add the filter down into the MARCEngine level, but I’d rather not, as there are cases where these values may be present (legally)…this is why the filtering happens in the Editor, where it can assess their use and, if the markers are present already, determine if they are used correctly.
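For folks who want to guard against this in their own tooling, a filter along these lines is easy to write.  This is a sketch under assumptions: I’m assuming the display markers are the usual Unicode directional formatting characters (left-to-right and right-to-left marks, plus the embedding, override, and isolate controls), and the class name is hypothetical.  Unlike the Editor, this version strips every occurrence and makes no attempt to decide whether a marker is there legitimately.

```csharp
using System;
using System.Text;

public static class DirectionalMarkerFilter
{
    // Assumed marker set: LRM/RLM, the embedding and override controls
    // (U+202A to U+202E), and the isolate controls (U+2066 to U+2069).
    static bool IsDirectionalMarker(char c)
    {
        return c == '\u200E' || c == '\u200F' ||
               (c >= '\u202A' && c <= '\u202E') ||
               (c >= '\u2066' && c <= '\u2069');
    }

    // Returns the text with every directional marker removed.
    public static string Strip(string mnemonicText)
    {
        var cleaned = new StringBuilder(mnemonicText.Length);
        foreach (char c in mnemonicText)
        {
            if (!IsDirectionalMarker(c)) cleaned.Append(c);
        }
        return cleaned.ToString();
    }
}
```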

Downloads can be picked up through the automated update tool, or via http://marcedit.reeset.net/downloads.

–tr

MarcEdit 7 Z39.50/SRU Client Wireframes

By reeset / On / In MarcEdit

One of the appalling discoveries when taking a closer look at the MarcEdit 6 codebase was the presence of 3(!) Z39.50 clients (all using slightly different codebases).  This happened because of the ILS integration, the direct Z39.50 Database editing, and the actual Z39.50 client.  In the Mac version, these clients are all the same thing – so I wanted to emulate that approach in the Windows/Linux version.  And as a plus, maybe I would stop (or reduce) my utter disdain at having to support Z39.50 generally, within any library program that I work with.

* Sidebar – I really, really, really can’t stand working with Z39.50.  SRU is a fine replacement for the protocol, and yet, over the 10-15 years that it’s been available, SRU remains a fringe protocol.  That tells me two things:

  1. Library vendors generally have rejected this as a protocol and there are some good reasons for this…most vendors that support it (and I’m thinking specifically about ExLibris) use a custom profile.  This is a pain in the ass because the custom profile requires code to handle foreign namespaces.  This wouldn’t be a problem if this only happened occasionally, but it happens all the time.  Every SRU implementation works best if you use their custom profiles.  I think what made Z39.50 work is the well-defined set of Bib-1 attributes.  The flexibility in SRU is a good thing, but I also think it’s why very few people support it, and fewer understand how it actually works.
  2. That SRU is a poor solution to begin with.  Hey, just like OAI-PMH, we created library standards to work on the web.  If we had it to do over again, we’d do it differently.  We should probably do it differently at this point…because supporting SRU in software is basically just checking a box.  People have heard about it, they ask for it, but pretty much no one uses it.
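For anyone who hasn’t looked at SRU directly, a request is just an HTTP GET with a handful of standard parameters (operation, version, query, maximumRecords, recordSchema) wrapped around a CQL query.  Here is a minimal sketch of building one; the class name and the base URL in the usage note are placeholders, and the index names and record schemas a given server accepts depend on its profile, which is exactly the problem described above.

```csharp
using System;

public static class SruSketch
{
    // Builds an SRU 1.2 searchRetrieve URL.  The CQL query and schema name are
    // percent-encoded; which indexes and schemas work is server-specific.
    public static string BuildSearchUrl(string baseUrl, string cqlQuery, int maximumRecords, string recordSchema)
    {
        return baseUrl +
            "?operation=searchRetrieve" +
            "&version=1.2" +
            "&query=" + Uri.EscapeDataString(cqlQuery) +
            "&maximumRecords=" + maximumRecords +
            "&recordSchema=" + Uri.EscapeDataString(recordSchema);
    }
}
```

Calling BuildSearchUrl("https://sru.example.org/catalog", "dc.title = \"moby dick\"", 5, "marcxml") gives a URL that could be fetched with any HTTP client, assuming the server understands the dc.title index and the marcxml schema.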

By consolidating the Z39.50 client code, I’m able to clean out a lot of old code, and better yet, actually focus on a few improvements (which has been hard because I make improvements in the main client, but forget to port them everywhere else).  The main improvements that I’ll be applying have to do with searching multiple databases.  Single search has always allowed users to select up to 5 databases to query.  I may remove that limit.  It’s kind of an arbitrary one.  However, I’ll also be adding this functionality to the batch search.  When doing multiple database searches in batch, users will have an option to take all records, the first record found, or potentially (I haven’t worked this one out), records based on order of database preference.

Wireframes:

Main Window:

image

Z39.50 Database Settings:

image

SRU Settings:

image

There will be a preferences panel as well (haven’t created it yet), but this is where you will set proxy information and notes related to batch preferences.  You will no longer need to set title field or limits, as the limits are moving to the search screen (this has always needed to be variable) and the title field data is being pulled from preferences already set in the program preferences.

One of the benefits of making the changes is that this folds the Z39.50/SRU client into the main MarcEdit application (rather than as a program that was shelled to), which allows me to leverage the same accessibility platform that has been developed for the rest of the application.  It also highlights one of the other changes happening in MarcEdit 7.  MarcEdit 6- is a collection of about 7 or 8 individual executables.  This makes sense in some cases, less sense in others.  I’m evaluating all the stand-alone programs, and if I replicate the functionality in the main program, it means that while having these as separate programs might initially have been a good thing, the structure of the application has changed, and so the code (both external and internal) needs to be re-evaluated and put in one spot.  In the application, this has meant that in some cases, like the Z39.50 client, the code will move into MarcEdit proper (rather than being a separate program called mebatch.exe), and for SQL interactions, it will mean that I’ll create a single shared library (rather than replicating code between three different component parts….the SQL explorer, the ILS integration, and the local database query tooling).

Questions, let me know.

–tr

MarcEdit 7 Alpha: the XML/JSON Profiler

By reeset / On / In MarcEdit

Metadata transformations can be really difficult.  While I try to make them easier in MarcEdit, the reality is, the program really has functioned for a long time as a facilitator of the process; handling the binary data processing and character set conversions that may be necessary.  But the heavy lifting, that’s all been on the user.  And if you think about it, there is a lot of expertise tied up in even the simplest transformation.  Say your library gets an XML file full of records from a vendor.  As a technical services librarian, I’d have to go through the following steps to remap that data into MARC (or something else):

  1. Evaluate the vended data file
  2. Create a metadata dictionary for the new xml file (so I know what each data element represents)
  3. Create a mapping between the data dictionary for the vended file and MARC
  4. Create the XSLT crosswalk that contains all the logic for turning this data into MARCXML
  5. Set up the process to move data between XML=>MARC
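To show how much is hiding in step 4, here is about the smallest crosswalk I can think of, run through .NET’s XslCompiledTransform.  Everything about the vendor format is hypothetical (a records/record/title structure I made up), the class name is mine, and the MARCXML it produces is deliberately incomplete (no leader, one field), so treat it as a sketch of the moving parts rather than a usable crosswalk.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class CrosswalkSketch
{
    // Hypothetical vendor format: <records><record><title>…</title></record></records>
    const string Xslt = @"<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:marc='http://www.loc.gov/MARC21/slim'>
  <xsl:output method='xml' omit-xml-declaration='yes'/>
  <xsl:template match='/records'>
    <marc:collection>
      <xsl:for-each select='record'>
        <marc:record>
          <marc:datafield tag='245' ind1='0' ind2='0'>
            <marc:subfield code='a'><xsl:value-of select='title'/></marc:subfield>
          </marc:datafield>
        </marc:record>
      </xsl:for-each>
    </marc:collection>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the stylesheet to the vendor XML and returns the MARCXML as a string.
    public static string ToMarcXml(string vendorXml)
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(Xslt)))
        {
            transform.Load(xsltReader);
        }

        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(vendorXml)))
        using (var writer = XmlWriter.Create(output, transform.OutputSettings))
        {
            transform.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

Even this toy version requires knowing the source structure, the MARCXML namespace, and XSLT syntax, and a real one still needs the leader, indicators, and every other mapped field.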

 

All of these steps are really time consuming, but the development of the XSLT/XQuery to actually translate the data is the one that stops most people.  While there are many folks in the library technology space (and technical services spaces) that would argue that the ability to create XSLT is a vital job skill, let’s be honest, people are busy.  Additionally, there is a big difference between knowing how to create an XSLT and writing a metadata translation.  These things get really complicated, and change all the time (XSLT is up to version 3), meaning that even if you’ve learned how to do this years ago, the skills may be stale or not translate into the current XSLT version.

Additionally, in MarcEdit, I’ve tried really hard to make the XSLT process as simple and straightforward as possible.  But, the reality is, I’ve only been able to work on the edges of this goal.  The tool handles the transformation of binary and character encoding data (since the XSLT engines cannot do that), and it uses a smart processing algorithm to try to improve speed and memory handling while still enabling users to work with either DOM or SAX processing techniques.  And I’ve tried to introduce a paradigm that enables reuse and flexibility when creating transformations.  Folks that have heard me speak have likely heard me talk about this model as a wheel and spoke:

image

The idea behind this model is that as long as users create translations that map to and from MARCXML, the tool can automatically enable transformations to any of the known metadata formats registered with MarcEdit.  There are definitely tradeoffs to this approach (for sure, doing a 1-to-1, direct translation would produce the best translation, but it also requires more work and users to be experts in the source and final metadata formats), but the benefit from my perspective is that I don’t have to be the bottleneck in the process.  Were I to hard-code or create 1-to-1 conversions, any deviation or local use within a spec, would render the process unusable…and that was something that I really tried to avoid.  I’d like to think that this approach has been successful, and has enabled technical services folks to make better use of the marked up metadata that they are provided.
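In code terms, the wheel and spoke idea boils down to a registry where each format supplies one transform into the hub format (MARCXML) and one out of it.  The sketch below is illustrative only: the class name is hypothetical and the transforms are plain Func&lt;string, string&gt; delegates standing in for registered XSLT files.

```csharp
using System;
using System.Collections.Generic;

public class FormatHub
{
    readonly Dictionary<string, Func<string, string>> toHub = new Dictionary<string, Func<string, string>>();
    readonly Dictionary<string, Func<string, string>> fromHub = new Dictionary<string, Func<string, string>>();

    // Each format registers one transform into MARCXML and one out of it.
    public void Register(string format, Func<string, string> toMarcXml, Func<string, string> fromMarcXml)
    {
        toHub[format] = toMarcXml;
        fromHub[format] = fromMarcXml;
    }

    // Any registered format converts to any other by passing through the hub.
    public string Convert(string data, string sourceFormat, string targetFormat)
    {
        string marcXml = toHub[sourceFormat](data);
        return fromHub[targetFormat](marcXml);
    }
}
```

The design choice shows up in the numbers: with N formats, the hub needs 2N transforms, where direct 1-to-1 conversions would need N × (N − 1).  The cost is that every conversion can only carry what MARCXML can express.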

The problem is that as content providers have moved more of their metadata operations online, a large number have shifted away from standards-based metadata to locally defined metadata profiles.  This is challenging because these are one off formats that really are only applicable for a publisher’s particular customers.  As a result, it’s really hard to find conversions for these formats.  The result of this, for me, is large numbers of catalogers/MarcEdit users asking for help creating these one off transformations…work that I simply don’t have time to do.  And that can surprise folks.  I try hard to make myself available to answer questions.  If you find yourself on the MarcEdit listserv, you’ll likely notice that I answer a lot of the questions…I enjoy working with the community.  And I’m pretty much always ready to give folks feedback and toss around ideas when folks are working on projects.  But there is only so much time in the day, and only so much that I can do when folks ask for this type of help.

So, transformations are an area where I get a lot of questions.  Users faced with these publisher specific metadata formats often reach out for advice or to see if I’ve worked with a vendor in the past.  And for years, I’ve been wanting to do more for this group.  While many metadata librarians would consider XSLT or XQuery as required skills, these are not always in high demand when faced with a mountain of content moving through an organization.  So, I’ve been collecting user stories and outlining a process that I think could help: an XML/JSON Profiler.

So, it’s with a lot of excitement that I can write that MarcEdit 7 will include this tool.  As I say, it’s been a long time coming, and the goal is to reduce the technical requirements needed to process XML or JSON metadata.

XML/JSON Profiler

To create this tool, I had to decide how users would define their data for mapping.  Given that MarcEdit has a Delimited Text Translator for converting Excel data to MARC, I decided to work from this model.  The code produced does a couple of things:

  1. It validates the XML format to be profiled.  Mostly, this means that the tool is making sure that schemas are followed, namespaces are defined and discoverable, etc.
  2. It outputs data in MARC, MARCXML, or another XML format.
  3. It shifts mapping of data from an XML file to a delimited text file (though, it’s not actually creating a delimited text file).
  4. It generally assumes that data should be in UTF-8, since the data is in XML.

 

Users can access the Wizard through the updated XML Functions Editor.  Open MARC Tools and select Edit XML function list, and you’ll see the following:

image

I highlighted the XML Function Wizard.  I may also make this tool available from the main window.  Once selected, the program walks users through a basic reference interview:

Page 1:

image

 

From here, users just need to follow the interview questions.  Users will need a sample XML file that contains at least one record in order to create the mappings against.  As users walk through the interview, they are asked to identify the record element in the XML file, as well as map XML tags to MARC tags, using the same interface and tools as found in the delimited text translator.  Users also have the option to map data directly to a new metadata format by creating an XML mapping file – or a representation of the XML output, which MarcEdit will then use to generate new records.
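Conceptually, what the interview collects is a record element plus a table of “this XML tag goes to that MARC field”.  The sketch below shows that idea in its simplest form; it is not the profiler’s code (that is in the GitHub repository linked further down), and the class name, the flat tag-to-field dictionary, and the sample element names are all made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class ProfileSketch
{
    // For each record element, writes one mnemonic line per mapped child element.
    // The map goes from XML tag name to a MARC field prefix such as "=245  10$a".
    public static string ToMnemonic(string xml, string recordElement, IDictionary<string, string> map)
    {
        var output = new StringBuilder();
        foreach (XElement record in XDocument.Parse(xml).Descendants(recordElement))
        {
            foreach (XElement element in record.Elements())
            {
                string prefix;
                if (map.TryGetValue(element.Name.LocalName, out prefix))
                {
                    output.Append(prefix).Append(element.Value).Append('\n');
                }
            }
            output.Append('\n'); // blank line between records
        }
        return output.ToString();
    }
}
```

Unmapped elements are simply skipped, which mirrors how unmapped columns behave when working from a delimited text model.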

Once a new mapping has been created, the function will then be registered into MarcEdit, and be available like any other translation.  Whether this process simplifies the conversion of XML and JSON data for librarians, I don’t know.  But I’m super excited to find out.  This creates a significant shift in how users can interact with marked up metadata, and I think will remove many of the technical barriers that exist for users today…at least, for those users working with MarcEdit.

To give a better idea of what is actually happening, I created a demonstration video of the early version of this tool in action.  You can find it here: https://youtu.be/9CtxjoIktwM.  This provides an early look at the functionality, and hopefully helps provide some context around the above discussion.  If you are interested in seeing how the process works, I’ve posted the code for the parser on my GitHub page here: https://github.com/reeset/meparsemarkup

Do you have questions, concerns?  Let me know.

 

–tr

MarcEdit 7: MarcEditor Performance Metrics

By reeset / On / In MarcEdit

Because I change version numbers so rarely when it comes to MarcEdit, I usually like to take the major version numbers as an opportunity to look at how some of the core code works, and this time is no different.  One of the things that I’ve occasionally heard is that Opening and Saving larger files in the MarcEditor can be slow.  I guess, before I talk about some of the early metrics, I’d like to explain how the MarcEditor works, because it works differently than a normal text editor in order to allow users to work with files of any size.

Opening records in the MarcEditor

When you open the MarcEditor, the program utilizes one of two modes to read files into the editing screen: Preview and Paging. 

Preview Mode:

Preview mode has been designed specifically for really large files – but the caveat is that when in Preview mode, the editor gets locked into Read Only mode.  This means you can’t type in the Editor, but you can use any of the Editing functions to change the file.  The benefit of the Preview mode is you remove the need to load the file (which is an expensive process).

Paging Mode:

Paging mode is the editing mode enabled by default.  This mode breaks files into pages, meaning that MarcEdit must first read the file to determine the number of records, and create an internal directory of page start and end locations.  Once that is accomplished, the program then renders data onto the screen.  The pages created are all virtual (they don’t exist), unless a user actually edits (types onto the screen) information on a page.  Global edits affect the whole file, so the file gets re-paged after every global edit. 
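The page directory described above can be sketched roughly as follows.  This is an illustration in Python, not MarcEdit's C# implementation, and it assumes records in a mnemonic-style text file are separated by a blank line; `build_page_directory` and `read_page` are hypothetical names.

```python
import io

def build_page_directory(stream, records_per_page=100):
    """Scan a binary stream once and return (start, end) byte offsets per page.

    Assumption: a blank line ends each record.
    """
    pages = []
    page_start = 0
    offset = 0
    records_in_page = 0
    in_record = False
    for line in stream:
        offset += len(line)
        if line.strip():
            in_record = True
        elif in_record:
            # A blank line closes the current record.
            in_record = False
            records_in_page += 1
            if records_in_page == records_per_page:
                pages.append((page_start, offset))
                page_start = offset
                records_in_page = 0
    # Whatever is left over becomes the final, shorter page.
    if in_record or records_in_page:
        pages.append((page_start, offset))
    return pages

def read_page(stream, pages, index):
    """Pages are virtual: content is only fetched when a page is shown."""
    start, end = pages[index]
    stream.seek(start)
    return stream.read(end - start)

if __name__ == "__main__":
    data = b"".join(
        b"=LDR  rec%d\n=245  10$aTitle %d\n\n" % (i, i) for i in range(5)
    )
    stream = io.BytesIO(data)
    pages = build_page_directory(stream, records_per_page=2)
    print(pages)
    print(read_page(stream, pages, 1).decode())
```

Only the offsets are kept in memory, which is why the initial scan, and not the display of any single page, is the expensive part for large files.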

The paging mode is by far the best rendering mode for data under, say, 150 MB (in MarcEdit 6).  This is because at around 150 MB, it starts taking a lot longer to create the virtual pages.  And depending on your operating system, and hard drive type, this process could be really expensive.  I’ve found that on older equipment (without solid-state drives, or SSDs), this process can really slow down reading and writing because so many disk accesses have to occur when creating pages (even virtually).

Saving records in the MarcEditor

Saving files essentially does the paging operation in reverse, though now, rather than a virtual page, the program does have to access the file and extract the page content for every virtual page in existence.  Again, if you have a non-SSD drive or an older 5400 rpm drive, this can be a slow process.  If your operating system is already having disk usage issues (and older computers upgraded to Windows 10 have many of these), this can slow the process considerably.
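A rough sketch of that save step, again in Python and under assumptions rather than taken from MarcEdit: pages are represented as (start, end) byte offsets into the original file, edited pages are held in memory keyed by page number, and `save_pages` is a hypothetical name.

```python
import io

def save_pages(source, pages, edited, out):
    """Write every page to `out` in order.

    Edited pages come from memory; untouched (virtual) pages are copied
    from the source file by byte range, which costs one seek and one read
    per page.
    """
    for index, (start, end) in enumerate(pages):
        if index in edited:
            out.write(edited[index])
        else:
            source.seek(start)
            out.write(source.read(end - start))

if __name__ == "__main__":
    source = io.BytesIO(b"AAAABBBBCCCC")
    pages = [(0, 4), (4, 8), (8, 12)]
    edited = {1: b"xx"}          # only the second page was typed into
    out = io.BytesIO()
    save_pages(source, pages, edited, out)
    print(out.getvalue())
```

The per-page seek and read is where slower drives pay the price, since the number of disk accesses grows with the number of pages.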

MarcEdit 7 Enhancements

In thinking about how this process works, I started wondering how I could improve file operations in MarcEdit 7.  Obviously, the easiest way to improve the open and save processes would be to remove as many disk operations as possible.  The fewer file operations, the faster the process.  So, I started looking.  Now, one of the benefits of updating to the new version of .NET is that I have access to some new programming concepts.  One of these is Thread Tasks, which initiate parallel processes in C# (though I’ve found these must be handled with care, or I can really cause disk issues as threads spawn too quickly), and the other is lambda expressions, which enable the compiler to optimize the operations code.  With this in mind, I started working.
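The caution about threads spawning too quickly can be illustrated with a bounded worker pool.  This sketch uses Python's `concurrent.futures` as a stand-in for .NET Tasks, so it shows the general idea only; `process_chunks_in_order`, `count_records`, and the `max_workers` value are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunks_in_order(chunks, worker, max_workers=4):
    """Run `worker` over chunks on a capped number of threads.

    The cap keeps the pool from flooding the disk with requests, and
    Executor.map returns results in input order even if workers finish
    out of order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunks))

def count_records(chunk):
    # Assumption for the example: a blank line ends each record.
    return chunk.count(b"\n\n")

if __name__ == "__main__":
    chunks = [b"rec1\n\nrec2\n\n", b"rec3\n\n"]
    print(process_chunks_in_order(chunks, count_records))
```

Ordered results matter here because the page directory depends on records being counted in file order.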

Testing:

For the purpose of this benchmark, I’m using a Dell Inspiron 13, with an i5 processor, an SSD, and 16 GB of RAM. 

Reading Data into the MarcEditor

In order to speed up the reading operation, I had to reduce the number of file operations that were being run on the system.  To do this, I made two significant changes. 

  1. When MarcEdit’s Enhanced File reading mode is enabled, MarcEdit reads files under 60 MB into memory.  Using Parallel Tasks, I was able to improve this process, reducing the number of file reads by 50%.  So, if the old method made 100 file reads to build the page, the new process would only make 50 file reads.  Additionally, with the processing now in a Parallel process, data could be read asynchronously, though this doesn’t help as much as one might hope since data needs to be processed in order.  But, it does seem to help.
  2. For files larger than 60 MB, again, I needed to find a way to reduce the number of file reads.  To do this, I tried two things.  First, I increased the buffer.  This means that more data is read at a time, so fewer file reads must occur.  Previously, the buffer was 1 MB.  The buffer has been increased to 8 MB.  This makes a big difference, as now files under 8 MB are read only once, as the remainder of the data lives in the buffer.  The second thing that I did was move access down to the abstract classes.  This allowed me to interact beneath the StreamReader class and access the actual positions in the file when data was read.  This couldn’t be done in the current version of MarcEdit, because the position properties report where buffered data was read.  This meant that an additional file operation had to occur just to get the file positions.  Again, if the file needed 100 reads to read the file, the updated process would only need 50 reads. 
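Both ideas in the second item, a larger read buffer and tracking file positions yourself instead of asking a buffered reader, can be sketched as below.  This is Python rather than the C# used in MarcEdit, and `scan_lines` is a hypothetical helper written for this illustration.

```python
import io

def scan_lines(raw, buffer_size):
    """Read `raw` in fixed-size chunks and return (lines, reads).

    Each line is paired with its absolute byte offset, computed from the
    bytes consumed so far, so no extra file operation is needed to learn
    where a line starts.
    """
    reads = 0
    offset = 0        # absolute position of the next line's first byte
    pending = b""     # partial line carried over between chunks
    lines = []
    while True:
        chunk = raw.read(buffer_size)
        if not chunk:
            break
        reads += 1
        pending += chunk
        *complete, pending = pending.split(b"\n")
        for line in complete:
            lines.append((offset, line))
            offset += len(line) + 1   # +1 for the newline
    if pending:
        lines.append((offset, pending))
    return lines, reads

if __name__ == "__main__":
    data = b"=LDR  one\n=245  two\n=650  three\n"
    small, small_reads = scan_lines(io.BytesIO(data), 8)
    large, large_reads = scan_lines(io.BytesIO(data), 64)
    print(small_reads, large_reads)   # fewer reads with the larger buffer
    print(small == large)             # same lines and offsets either way
```

On a real file, opening it with `open(path, "rb", buffering=0)` would make each `read` call map to one request against the file, which is the quantity the larger buffer reduces.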

So, what’s the impact of this?  Well, let’s see.  I have a 350 MB file and paging set to 100 records per page.  This is a UTF8 file with records from materials The Ohio State University Libraries has loaded into the HathiTrust.  Using this as my test set, I simply opened the file in the MarcEditor in MarcEdit 6.3.x and MarcEdit 7.0.0.alpha.  To test, I loaded this file five times, throwing out the slowest and fastest times, and selecting the status message closest to the average.
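The selection rule used for these timings can be written down as a small function.  This is a sketch under one assumption: "the average" is taken over the runs that remain after the slowest and fastest are discarded.  The timings in the example are made up for illustration and are not measurements from these tests.

```python
def representative_time(timings):
    """Drop the fastest and slowest run, then return the remaining
    timing that lies closest to the mean of the remaining runs."""
    if len(timings) < 3:
        raise ValueError("need at least three runs")
    trimmed = sorted(timings)[1:-1]
    mean = sum(trimmed) / len(trimmed)
    return min(trimmed, key=lambda t: abs(t - mean))

if __name__ == "__main__":
    # Hypothetical load times in seconds for five runs.
    runs = [14.6, 13.2, 14.1, 13.9, 15.8]
    print(representative_time(runs))
```

Reporting an actual run rather than the computed mean matches the approach above, where a real status message from one of the runs is shown.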

MarcEdit 6.3.x:    image

MarcEdit 7.0.0.alpha:    image

We can see that by reducing the number of file reads, the process improves significantly, though it could be better.  Digging deeper into the results, I’m finding that the actual reading of the data is even faster, with the actual rendering of the data in the newer control taking longer than in the previous editing control.  The reason for this is that in MarcEdit 6.3.x, this control usage has been optimized, double buffered, etc.  In MarcEdit 7.0.0.alpha, this hasn’t been done yet.  My guess is that I can probably get these numbers down to around 8.7-9 seconds for a file of this size.  That would represent an improvement of 5 to 5.5 seconds.  Of course, the question is, will this help individuals opening smaller files?  I think yes.  On my SSD, loading of a 50 MB file takes roughly the same amount of time: 1.3 seconds.  But on a non-SSD drive, I think the improvement will be significant given that the number of file reads will be reduced. 

This test, though, was with the old defaults in MarcEdit.  For MarcEdit 7.0.0.alpha, I would like to change the default paging size to 1000 records per page (since the new component is more efficient when dealing with larger sets).  So, let’s run the test again, this time using the different paging value and the same approach as above:

MarcEdit 6.3.x:   image

MarcEdit 7.0.0.alpha:    image

Looking at the process, you can see that the gap between the two versions gets larger.  Again, looking closer at the data, the actual loading of the file is faster than in the first tests, but rendering the data pushed the final load times higher.  As in the first tests, I believe that once the Editor itself has been optimized, we’ll see this improve significantly.  By the time the final version comes out, the performance difference on this type of file could be between 6 and 8 seconds, or a 37-50% speed improvement over the current 6.3.x version of the software. 

Writing files in the MarcEditor

In looking at the process used to write files on save, the same kinds of issues are causing the problems there.  First, saving requires a lot of file access (both read and write), and second, once a file is saved, it is reloaded into the Editor.  This means that on systems with SSDs, the performance benefits may be modest, but for non-SSD systems, the gains should be significant.  But there was only one way to tell.  Using the same file, I made edits on 4 pages: the first page, the 50th page, the 150th page, and the last page.  Paging was set back to 100 records per page.  This forces the tool to combine the changed pages with the unchanged data in the virtual space.  Using the loading times above, we can estimate the time actually used when saving the data.  I’ll be providing numbers for both the Save and Save As processes (since they work slightly differently):

Saving the file using Save:

MarcEdit 6.3.x: image

MarcEdit 7.0.0.alpha: image

Saving the file using Save As:

MarcEdit 6.3.x: image

MarcEdit 7.0.0.alpha: image

As you can see here, the difference between the new saving method and the old saving method is pretty significant.  The time posted here reflects the time it takes to both save the file, and then reload the data back into the Editor window.  Taking the times from the first test, where rendering the file takes an average of 14 seconds, we can determine that the Save function in MarcEdit 6.3.x takes ~6.2 seconds and the Save As operation takes approximately 6.7 seconds.  Let’s compare that to MarcEdit 7.0.0.alpha.  We know that the rendering of the file takes approximately 10 seconds.  That means that the Save function takes ~0.8 seconds to complete, and the Save As function, 1.2 seconds to complete.  In each case, this represents a significant performance improvement, and as noted above, optimizations have yet to be completed.  Additionally, I do believe that on non-SSD systems, the performance gains will be even more noticeable.
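The arithmetic in the paragraph above can be laid out explicitly.  The sketch uses only the rounded figures quoted in the text (render times of about 14 and 10 seconds, and the save-only times); the totals it prints are therefore implied by those figures and are not read from the screenshots.  The dictionary and function names are hypothetical.

```python
# Approximate render (reload) times in seconds, as quoted in the text.
RENDER = {"6.3.x": 14.0, "7.0.0.alpha": 10.0}

# Approximate save-only times in seconds, as quoted in the text.
SAVE_ONLY = {
    "6.3.x": {"Save": 6.2, "Save As": 6.7},
    "7.0.0.alpha": {"Save": 0.8, "Save As": 1.2},
}

def implied_total(version, operation):
    """Save time plus reload time: what the status message would report."""
    return RENDER[version] + SAVE_ONLY[version][operation]

def speedup(operation):
    """How many times faster the save-only step is in 7.0.0.alpha."""
    return SAVE_ONLY["6.3.x"][operation] / SAVE_ONLY["7.0.0.alpha"][operation]

if __name__ == "__main__":
    for op in ("Save", "Save As"):
        old = implied_total("6.3.x", op)
        new = implied_total("7.0.0.alpha", op)
        print(f"{op}: {old:.1f}s -> {new:.1f}s total, "
              f"save step about {speedup(op):.1f}x faster")
```

On these figures the save step itself is several times faster in the alpha, while the totals are still dominated by the time needed to reload and render the file.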

Thoughts, Conclusions, and So what

Given how early I am in the development and optimization process, why start looking at any of these metrics now?  Surely, some of these things will change, and I’m sure they will.  But these give me a baseline to work with, and a touchstone as I continue working on optimizing the process.  And it is early, but one of the things that I wanted to highlight here is that in addition to the new features, updated interface, and accessibility improvements – a big part of this update is about performance and speed.  When I initially wrote MarcEdit, nearly all the code was written in Assembly.  Shifting to a higher-level language was incredibly painful for me to do because I want things to be fast, and Assembly programming is all about building things small and building things fast.  You have access to the CPU registers, and you can make magic happen.  Unfortunately, keeping up with the changes in the metadata world, the need to provide better Unicode support, and my desire to support Mac systems (which used, at the time, a different CPU architecture) meant moving to a higher-level language that could be compiled for different systems.  Ever since that code migration, I’ve been chasing the clock, trying to get the processing speeds down to those of the original assembly code-base.  Is that possible?  No.  Though, even if it were, so many things have changed and been added that the process simply does more than the simple libraries that I first created in 1999…but still, that desire is there.

So, while I am spending most of my time communicating publicly about the new wireframes and new functionality in MarcEdit 7 (and I’m really excited about these changes)…please know – MarcEdit 7 is also about making it fast.  I think MarcEdit 6.3.x is already pretty quick on its feet.  As you can see here, it’s about to get faster.

–tr

MarcEdit Updates (all)

By reeset / On / In Uncategorized

I’ve posted updates for all versions.  Windows and Linux updates for 6.3.x were posted Sunday evening, and updates to MacOS for 2.5.x on Wed. morning.  Change log below:

Windows/Linux:

* Bug Fix: MarcEditor: Convert clipboard content to….: The change in control caused this to stop working – mostly because the data container that renders the content is a rich object, not plain text like the function was expecting.  Missed that one.  I’ve fixed this in the code.
* Enhancement: Extract Selected Records:  Connected the exact match to the search by file
* Bug Fix: MarcEditor: Right to left flipping wasn’t working correctly for Arabic and Hebrew if the codes were already embedded into the file.
* Update: Cleaned up some UI code.
* Update: Batch Process MarcXML: respecting the native versus the XSLT options.

MacOS Updates:

* Bug Fix: MarcEditor: Right to left flipping wasn’t working correctly for Arabic and Hebrew if the codes were already embedded into the file.
* Update: Cleaned up some UI code.
* Update: Batch Process MarcXML: respecting the native versus the XSLT options.
* Enhancement: Exact Match searching in the Extract, Delete Selected Records tool
* Enhancement: Exact Match searching in the Find/Replace Tool
* Enhancement: Work updates in the Linked data tool to support the new MAC proposal

–tr

MarcEdit 7 Wireframes–XML Functions

By reeset / On / In MarcEdit

In this set of wireframes, you can see one of the concepts that I’ll be introducing with MarcEdit 7…wizards.  Each wizard is designed to encapsulate a reference interview to attempt to make adding new functions, etc. to the tool easier.  You will find these throughout MarcEdit 7. 

XML Functions Window:

image

XML Functions Wizard Screens:

image

image

image

image

image

You’ll notice one of the options is the new XML/JSON Profiler.  This is a new tool that I’ll wireframe later; likely sometime in August 2017.

–tr