MarcEdit 7: Continued Task Improvement; Part 2

By reeset / On / In MarcEdit

Last week, I discussed some of the work I was doing to continue to evaluate how Task processing will work in MarcEdit 7.  To do some of this work, I’ve been working with a set of outlier data whose performance in MarcEdit 6.3 left much to be desired.  You can read about the testing and the file set here: MarcEdit 7: Continued Task Refinements

Over the week, I’ve continued to work on how this data is processed, hoping to move the processing time from almost 7 hours in MarcEdit 6.3 to around 1 1/2 hours, and I’ve been able to do that and more.  My guess was that by adding targeted pre-processing statements into the task processing queue, I could improve processing by only running the task actions that absolutely had to be run.  In this case, I had 962 task actions, but on any given record, maybe 20-30 needed to be run.  By adding a pre-processing step, I was able to move the processing time from 2+ hours to 25 minutes.  My guess is that I’ve reached the ceiling in terms of optimizations, but I can live with this.  Of course, over the next few days, I’ll need to validate that these new changes don’t cause the program to skip a step that should be run.  Generally, I’ve set up the pre-processing steps so that they fall back to running the task when in doubt.
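A minimal sketch of what such a pre-processing check could look like. Everything here is an assumption for illustration: the TaskAction type, its members, and the TaskPreFilter class are hypothetical and are not taken from MarcEdit. The sketch assumes records are in the mnemonic format, where each field line starts with "=" followed by the tag. The check runs just before each action, against the record as it currently stands, because an earlier action may have changed it, and it returns true whenever it cannot prove the action would do nothing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical task action; MarcEdit's real task model is not shown in this post.
public sealed class TaskAction
{
    public string FieldTag;             // e.g. "650"; null if the action is not tied to one field
    public string FindText;             // literal find text; null if unknown (regex, add field, ...)
    public Func<string, string> Apply;  // the edit itself: record in, record out
}

public static class TaskPreFilter
{
    // Cheap test run before an action. Returns true (run the action)
    // whenever the check cannot prove the action is a no-op.
    public static bool MightApply(TaskAction action, string record)
    {
        if (action.FieldTag == null) return true;   // in doubt: run it

        bool hasField = record
            .Split('\n')
            .Any(line => line.StartsWith("=" + action.FieldTag, StringComparison.Ordinal));
        if (!hasField) return false;                 // target field is absent: skip

        if (action.FindText == null) return true;   // no literal criteria: run it

        // Looks anywhere in the record, not only in the target field,
        // so it errs on the side of running the action.
        return record.IndexOf(action.FindText, StringComparison.Ordinal) >= 0;
    }

    // Applies actions in order, skipping those the pre-check rules out.
    public static string Run(string record, IEnumerable<TaskAction> actions, out int executed)
    {
        executed = 0;
        foreach (var action in actions)
        {
            if (!MightApply(action, record)) continue;
            record = action.Apply(record);
            executed++;
        }
        return record;
    }
}
```

Calling TaskPreFilter.Run with a record and a task list returns the edited record and reports how many actions actually ran, which is the number the optimization tries to keep small.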

–tr

MarcEdit 7: Continued Task Refinements

By reeset / On / In MarcEdit

Now that MarcEdit 7 is available for alpha testers, I’ve been getting back some feedback on the new task processing.  Some of this feedback relates to a couple of errors showing up in tasks that request user interaction…other feedback is related to the tasks themselves and continued performance improvements.

In this implementation, one of the areas that I am really focusing on is performance.  To that end, I changed the way that tasks are processed.  Previously, task processing looked very much like this:

image

A user would initiate a task via the GUI or command-line, and once the task was processed, the program would then, via the GUI, open a hidden window that would populate each of the “task” windows and then “click” the process button.  Essentially, it was working much like a program that “sends” keystrokes to a window, but in a method that was a bit more automated.

This process had some pros and cons.  On the plus-side, Tasks were something added to MarcEdit 6.x, so this allowed me to easily add the task processing functionality without tearing the program apart.  That was a major win, as tasks then were just a simple matter of processing the commands and filling a hidden form for the user.   On the con-side, the task processing had a number of hidden performance penalties.  While tasks automated processing (which allowed for improved workflows), each task processed the file separately, and after each process, the file would be reloaded into the MarcEditor.  Say you had a file that took 10 seconds to load and a task list with 6 tasks.  The file loading alone would cost you a minute.  Now, consider if that same file had to be processed by a task list with 60 different task elements – that would be 10 minutes dedicated just to file loading and unloading, and that doesn’t count the time to actually process the data.

This was a problem, so with MarcEdit 7, I took the opportunity to actually tear down the way that tasks work.  This meant divorcing the application from the task process and creating a broker that could evaluate tasks being passed to it, and manage the various aspects of task processing.  This has led to the development of a process model that looks more like this now:

image

Once a task is initiated and it has been parsed, the task operations are passed to a broker.  The broker then looks at the task elements and the file to be processed, and then negotiates those actions directly with the program libraries.  This removes any file loading penalties, and allows me to manage memory and temporary file use in a much more granular way.  It also immediately speeds up the process.  Take that file that takes 10 seconds to load and 60 tasks to complete.  Immediately, you improve processing time by 10 minutes.  But the question still arises, could I do more?

And the answer to this question is actually yes.  The broker has the ability to process tasks in a number of different ways.  One of these is handling each task process one by one at a file level; the other is handling all tasks at once, but at a record level.  You might think that record level processing would always be faster, but it’s not.  Consider the task list with 60 tasks.  Some of these elements may only apply to a small subset of records.  In the by file process, I can quickly shortcut processing of records that are out of scope; in a record by record approach, I actually have to evaluate the record.  So, in testing, I found that when records are smaller than a certain size, and the number of task actions to process was within a certain number (regardless of file size), it was almost always better to process the data by file.

Where this changes is when you have a larger task list.  How large?  I’m still trying to figure that out.  But as an example, I had a real-world example sent to me that has over 950 task actions to process on a file ~350 MB (344,000 records) in size.   While the by file process is significantly faster than the MarcEdit 6.x method (each process incurred a 17 second file load penalty), this still takes a lot of time to process because you are doing 950+ actions and complete file reads.  While this type of processing might not be particularly common (I do believe this is getting into outlier territory), the process can help to illustrate what I’m trying to teach the broker to do.  I ran this file using the three different processing methodologies, and here are the results:

  1. MarcEdit 6.3.x: 962 Task Actions, completing in ~7 hours
  2. By File: 962 Task Actions, completing in 3 hours, 12 minutes
  3. By Record: 962 Task Actions, completing in 2 hours and 20 minutes
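The difference between the two MarcEdit 7 approaches is mostly a matter of loop order. The following is a rough sketch under simplifying assumptions: records are plain strings already held in memory, each task action is a function from record to record, and the ProcessingOrder class and its method names are made up for illustration. It does not model the broker, file I/O, or the out-of-scope shortcut described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ProcessingOrder
{
    // By file: every action makes one complete pass over all records
    // before the next action starts.
    public static List<string> ByFile(
        IEnumerable<string> records,
        IReadOnlyList<Func<string, string>> actions)
    {
        List<string> current = records.ToList();
        foreach (var action in actions)
        {
            current = current.Select(action).ToList();
        }
        return current;
    }

    // By record: each record is read once, and all actions are applied
    // to it in order before moving to the next record.
    public static List<string> ByRecord(
        IEnumerable<string> records,
        IReadOnlyList<Func<string, string>> actions)
    {
        var result = new List<string>();
        foreach (var record in records)
        {
            string r = record;
            foreach (var action in actions)
            {
                r = action(r);
            }
            result.Add(r);
        }
        return result;
    }
}
```

Both orders produce the same output because each action only ever sees one record at a time; what differs is how many times the data set is traversed and how much work can be skipped along the way.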

 

So, that’s still a really, really long time, but take a closer look at the file and the changes made, and you can start to see why this process takes so much time.  Looking at the results file, >10 million changes have been processed against the 340,000+ records.  Also, consider the number of iterations that must take place.  The average record has approximately 20 fields.  Since each task needs to act upon the results of the task before it, it’s impossible to have tasks process at the same time; rather, tasks must happen in succession.  This means that each task must process the entire record, as the results of a task may require an action based on data changed anywhere in the record.  This means that for one record, the program needs to run 962 operations, which means looping through 19,240 fields (assuming no fields are added or deleted).  Extrapolate that number for 340,000 records, and the program needs to evaluate 6,541,600,000 fields, or over 6 billion field evaluations, which works out to 49,557,575 field evaluations per minute.
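The arithmetic behind these figures, written out under the same assumptions the text makes (exactly 20 fields per record, 340,000 records, no fields added or deleted). The last line simply divides the total by the quoted per-minute rate, which corresponds to a run of about 132 minutes.

```latex
\begin{align*}
962 \times 20 &= 19{,}240 \quad \text{field evaluations per record} \\
19{,}240 \times 340{,}000 &= 6{,}541{,}600{,}000 \quad \text{field evaluations in total} \\
\frac{6{,}541{,}600{,}000}{49{,}557{,}575 \text{ per minute}} &\approx 132 \text{ minutes}
\end{align*}
```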

Ideally, I’d love to see the processing time for this task/file pair down around 1 hour and 30 minutes.  That would cut the current MarcEdit 7 processing time in half, and be almost 5 hours and 30 minutes faster than the current MarcEdit 6.3.x processing.  Can I get the processing down to that number?  I’m not sure.  There are still optimizations to be had (loops that can be optimized, buffering, etc.), but I think the biggest potential speed gains may come from adding some pre-processing to a task process to do a cursory evaluation of a recordset if a set of find criteria is present.  This wouldn’t affect every task, but potentially could improve selective processing of Edit Indicator, Edit Field, Edit Subfield, and Add/Delete field functions.  This is likely the next area that I’ll be evaluating.

Of course, the other question to solve is what exactly is the tipping point when By File processing becomes less efficient than By Record processing.  My guess is that the characteristic that will be most applicable in this decision will be the number of task actions needing to be processed.  Splitting this file, for example, into a file of 1,000 records and running this task by record versus by file, we see the following:

  1. By File processing, 962 Task Actions, completed in: 0.69 minutes
  2. By Record Processing, 962 Task Actions, completed in: 0.36 minutes

 

The processing times are relatively close, but the By Record processing is nearly twice as fast as the By File processing.  If we reduce the number of tasks to under 20, there is a dramatic switch in the processing time and By File processing is the clear winner.

Obviously, there is some additional work to be done here, and more testing to do to understand what characteristics and which processing style will lead to the greatest processing gains, but from this testing, I came away with a couple pieces of information.  First, the MarcEdit 7 process, regardless of method used, is way faster than MarcEdit 6.3.x.  Second, the MarcEdit 7 process and the MarcEdit 6.3.x process suffered from a flaw related to temp file management.  You can’t see it unless you work with files this large and with this many tasks, but the program cleans up temporary files after all processing is complete.  Normally, in a single operation environment, that happens right away.  Since a task is treated as a single operation, cleanup waited until the whole task finished, so ~962 temporary files at 350 MB per file were created as part of both processes.  That’s 336,700 MB of data, or roughly 336 GB of temporary data!  When you close the program, that data is all cleared, but again, ouch.  As I say, normally, you’d never see this kind of problem, but in this kind of edge case, it shows up clearly.  This has led me to implement periodic temp file cleanup so that no more than 10 temporary files are stored at any given time.  While that still means that in the case of this test file, roughly 3.5 GB of temporary data could be stored (10 files at 350 MB each), the size of that temp cache would never grow larger.  This seems to be a big win, and something I would have never seen without working with this kind of data file and use case.
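One way such a cap could be implemented is sketched below. This is an illustration only: the TempFileTracker class is hypothetical, and it assumes that the oldest temporary files are always safe to delete once newer ones exist, which may not match how MarcEdit decides what to clean up.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper that keeps at most maxFiles temporary files on disk.
public sealed class TempFileTracker
{
    private readonly Queue<string> _files = new Queue<string>();
    private readonly int _maxFiles;

    public TempFileTracker(int maxFiles = 10)
    {
        _maxFiles = maxFiles;
    }

    public int Count => _files.Count;

    // Creates a new temp file and deletes the oldest ones
    // once the cap is exceeded.
    public string CreateTempFile()
    {
        string path = Path.GetTempFileName();
        _files.Enqueue(path);
        while (_files.Count > _maxFiles)
        {
            File.Delete(_files.Dequeue());
        }
        return path;
    }

    // Removes everything that is still tracked, e.g. when the task ends.
    public void Clear()
    {
        while (_files.Count > 0)
        {
            File.Delete(_files.Dequeue());
        }
    }
}
```

With a cap of 10 and a 350 MB working file, the tracked files never hold more than about 3.5 GB, no matter how many task actions run.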

Finally, let’s say after all this work, I’m able to hit the best case benchmarks (1 hr. 30 min.) and a user still feels that this is too long.  What more could be done?  Honestly, I’ve been thinking about that…but really, very little.  There will be a performance ceiling given how MarcEdit has to process task data.  So for those users – if this kind of performance time wasn’t acceptable, I believe only a custom built solution would provide better performance – but even with a custom build, I doubt you’d see significant gains if one continued to require tasks to be processed in sequence.

Anyway – this is maybe a deeper dive into how tasks work in MarcEdit 6.3.x and how they will work in MarcEdit 7 than anyone really was looking for – but this particular set of files and use case represented an interesting opportunity to really test the various methods and provide benchmarks that easily demonstrate the impact of the current task process changes.

If you have questions, feel free to let me know.

–tr

MarcEdit 7: Add/Delete Field Changes

By reeset / On / In MarcEdit

I’m starting to think about the global editing functions in the MarcEditor – and one of the first things I’m trying to do is start to flesh out a few confusing options related to the interface.  This is the first update in thinking about these kinds of changes.

image

The idea here is to make it clear which options belong to which editing groupset as sometimes folks aren’t sure which options are add field options and which are delete field options.  Hopefully, this will make the form easier to decipher.

–tr

MarcEdit 7: Startup Wizard

By reeset / On / In MarcEdit

One of the aspects of MarcEdit that I’ve been trying to think a lot about over the past year, is how to make it easier for users to know which configuration settings are important, and which ones are not.  This is the problem of writing a library metadata application that is MARC agnostic.  There are a lot of assumptions that users make because they associate MARC with the specific flavor of MARC that they are using.  So, for someone who only has exposure to MARC21, associating the title with MARC field 245 would be second nature.  But MarcEdit is used by a large community that doesn’t use MARC21, but UNIMARC (or other flavors for that matter).  For those users, the 245 field has a completely different meaning.

This presents a special challenge.  Simple things, like just displaying title information for a record, get harder, because assumptions I make for one set of users will cause issues for others.  To address this, MarcEdit has a rich set of application settings, designed to enable users to tell the application a little about the data they are working with.  Once that information is provided, MarcEdit can configure the components and adjust assumptions so title information pulls from the correct fields, or Unicode bits get updated in the correct leader locations.  The problem, from a usability perspective, is that these values are scattered among a wide range of other MarcEdit settings and preferences…which raises the question: which are the most important?
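As a small illustration of why such a setting matters, the sketch below maps a MARC flavor to the field that holds the title: 245 in MARC21 and 200 in UNIMARC. The MarcFlavor enum and FlavorSettings class are hypothetical names, not MarcEdit settings, and a real configuration would cover far more than the title field.

```csharp
using System;

// Hypothetical setting: which flavor of MARC the user works with.
public enum MarcFlavor
{
    Marc21,
    Unimarc
}

public static class FlavorSettings
{
    // Returns the tag of the field that carries the title for a flavor.
    public static string TitleField(MarcFlavor flavor)
    {
        switch (flavor)
        {
            case MarcFlavor.Marc21:
                return "245";   // MARC21 title statement
            case MarcFlavor.Unimarc:
                return "200";   // UNIMARC title and statement of responsibility
            default:
                throw new ArgumentOutOfRangeException(nameof(flavor));
        }
    }
}
```

Once the flavor is known, anything that displays a title can ask for FlavorSettings.TitleField(flavor) instead of assuming 245.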

If you’ve installed MarcEdit 6 recently on a new computer, the way that the program has attempted to deal with this issue is by showing the preferences window on the application’s first run.  This means that the first time the program is executed, you see the following window:

image

Now, I’m not naïve.  I know that most users just click OK, and the program opens up for them, and they work with MarcEdit until they run across something that might require them to go back and look at the settings.  But when I do MarcEdit workshops, I get some specific questions related to accessibility (e.g., can I make the fonts bigger or change the font), display (my Unicode characters don’t display), UNIMARC versus MARC21, etc.  From the window above, you can answer all of these questions, but you have to know which settings group handles each option.  It’s admittedly a pain, and because of that, most workshops I do include 20-30 minutes just going over the settings that might be worth considering.

With MarcEdit 7, I have an opportunity to rethink how users interact with the program, and I started to think about how other software does this successfully.  By and large, the ones that I think are more successful provide a kind of wizard at the start that helps to push the most important options forward…and the best examples include a little bit of whimsy in the process.  No, I might not do whimsy well, but I can think about the setting groups that might be the most important to bring front and center to the user.

To that end, I’ve developed a startup wizard for MarcEdit 7.  All users that install the application will see it (because MarcEdit 7 will install into its own user space, everyone will have this first run experience).  Based on the answers to questions, I’m able to automatically set data in the background to ensure that the application is better configured for the user the first time they start using MarcEdit, rather than later, when they need help finding configuration settings.   It also will give me an opportunity to bring potential issues to the user’s attention.  So, for example, the tool will specifically look to see if you have a comprehensive Unicode font installed (so, Arial Unicode MS or the Noto Sans fonts).  If you don’t, the program will point you to help files that discuss how to get one for free, as this will directly impact how the program displays Unicode characters (and comes up all the time given some decisions Microsoft has made in distributing their own Unicode fonts).  Additionally, I’ll be utilizing some automatic translation services, so the program will automatically react to your system’s default language settings.  If they are English, text will show in English.  If they are Greek, the interface will show the machine translated Greek.  Users will have the option to change the language in the wizard, and I’ll provide notes about the translations (since machine translations are getting better, but there’s bound to be some pretty odd text).  The hope is that this will make the program more accessible, and usable…and whimsical.  Yes, there is that too.  MarcEdit 7’s codename was developed after a nickname for my Golden Doodle.  So, she’s volunteered to help get users through the initial startup process.
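The language behavior described above amounts to a lookup with a fallback. A minimal sketch, assuming translations are keyed by two-letter ISO language codes and that English is the fallback; the WizardLanguage class and its methods are hypothetical, and only CultureInfo.CurrentUICulture is a real .NET API here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class WizardLanguage
{
    // Picks the interface language: the requested code if a translation
    // exists for it, otherwise the fallback.
    public static string Pick(string languageCode, ISet<string> available, string fallback = "en")
    {
        if (languageCode != null && available.Contains(languageCode))
        {
            return languageCode;
        }
        return fallback;
    }

    // Uses the operating system's UI language as the starting point.
    public static string PickForCurrentSystem(ISet<string> available)
    {
        string code = CultureInfo.CurrentUICulture.TwoLetterISOLanguageName;
        return Pick(code, available);
    }
}
```

A user who later changes the language in the wizard would simply override the value this lookup returns.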

The Wizard will likely change as I continue to evaluate settings groups, but at this point, I’m kind of leaning towards something that looks like this:

image

image

image

image

 

I’ve had a few folks walk through this process, and by and large, they find it much more accessible than the current “just show the settings screen” process.  Additionally, they like the idea of the language translations, but wonder if the machine translations will be useful (I did an initial set, they are what they are)…I’ll get more feedback on that before release.  If they aren’t useful, I may remove that option, though I have to feel that for folks where English is a challenge, having anything is better than nothing (though, I could be wrong).

But this is what I’m thinking.  It’s hopefully a little fun, easy to walk through, and will allow me to ensure that MarcEdit has been optimally configured for your data.  What do you think?

–tr

MarcEdit 7: Super charging Task Processing

By reeset / On / In MarcEdit

One of the components getting a significant overhaul in MarcEdit 7 is how the application processes tasks.  This work started in MarcEdit 6.3.x, when I introduced a new –experimental bit when processing tasks from the command-line.  This bit shifted task processing from within the MarcEdit application to directly against the libraries where the underlying functions for each task were run.  The process was marked as experimental, in part, because task processing has always been tied to the MarcEdit GUI.  Essentially, this is how a task works in MarcEdit:

image

Essentially, when running a task, MarcEdit opens and closes the corresponding edit windows and processes the entire file, on each edit.  So, if there are 30 steps in a task, the program will read the entire file, 30 times.  This is wildly inefficient, but also represents the easiest way that tasks could be added into MarcEdit 6 based on the limitations within the current structure of the program.

In the console program, I started to experiment with accessing the underlying libraries directly, but still maintained the structure where each task item represented a new pass through the program.  So, while the UI components were no longer being interacted with (improving performance), the program was still doing a lot of file reading and writing.

In MarcEdit 7, I re-architected how the application interacts with the underlying editing libraries, and as part of that, included the ability to process tasks at that more abstract level.  The benefit of this is that now all tasks on a record can be completed in one pass.  So, using the example of a 30 item task – rather than needing to open and close a file 30 times, the process now opens the file once and then processes all defined task operations on the record.  The tool can do this because all task processing has been pulled out of the MarcEdit application and pushed into a task broker.  This new library accepts from MarcEdit the file to process and the defined task (and associated tasks), and then facilitates task processing at a record, rather than file, level.  I then modified the underlying library functions, which actually was really straightforward given how streams work in .NET.

Within MarcEdit, all data is generally read and written using the StreamReader/StreamWriter classes, unless I specifically need to access data at the binary level.  In those cases, I’d use a MemoryStream.  The benefit of using the StreamReader/StreamWriter classes, however, is that they derive from the abstract TextReader/TextWriter classes.  .NET also has StringReader and StringWriter classes, which allow C# to read and write strings like a stream; they too derive from TextReader and TextWriter.  This means that I’ve been able to make the following changes to the functions, and re-use all the existing code while still providing processing at both a file and a record level:

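// sSource is a file path (isFile) or the mnemonic record text itself.
// sDest is the output file path; it is only used when isFile is true.
string function(string sSource, string sDest, bool isFile=true) {

    StringBuilder output = new StringBuilder(sDest);

    System.IO.TextReader reader = null;
    System.IO.TextWriter writer = null;

    if (isFile) {

        // File level: read from and write to disk
        reader = new System.IO.StreamReader(sSource);
        writer = new System.IO.StreamWriter(output.ToString(), false);

    } else {

        // Record level: read the string directly and collect the result in output
        output.Clear();
        reader = new System.IO.StringReader(sSource);
        writer = new System.IO.StringWriter(output);

    }

    //…Do Stuff

    // Flush and release both the reader and the writer before returning
    reader.Dispose();
    writer.Dispose();

    // File level: the destination path; record level: the processed record
    return output.ToString();

}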

As a TextReader/TextWriter, I now have access to the necessary functions needed to process both data streams like a file.  This means that I can now handle file or record level processing using the same code – as long as both data sources are in the mnemonic format.  Pretty cool.
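To make the idea concrete, here is a small self-contained sketch in which one routine written against TextReader serves both a file and an in-memory record. The MnemonicReaderDemo class and its methods are illustrative names only; the routine just counts field lines, assuming mnemonic data where each field line begins with "=".

```csharp
using System;
using System.IO;

public static class MnemonicReaderDemo
{
    // Shared logic: works on any TextReader, whatever the data source.
    public static int CountFields(TextReader reader)
    {
        int count = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.StartsWith("=", StringComparison.Ordinal))
            {
                count++;
            }
        }
        return count;
    }

    // Record level: wrap the string in a StringReader.
    public static int CountFieldsInRecord(string record)
    {
        using (var reader = new StringReader(record))
        {
            return CountFields(reader);
        }
    }

    // File level: wrap the file in a StreamReader.
    public static int CountFieldsInFile(string path)
    {
        using (var reader = new StreamReader(path))
        {
            return CountFields(reader);
        }
    }
}
```

Only the two thin wrappers know where the data comes from; the processing code itself never changes.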

What does this mean for users?  It means that in MarcEdit 7, tasks will be supercharged.  In testing, I’m seeing tasks that used to take 1, 2, or 3 minutes to complete now run in a matter of seconds.  So, while there are a lot of really interesting changes planned for MarcEdit 7, this enhancement feels like the one that might have the biggest impact for users, as it will represent significant time savings when you consider processing time over the course of a month or year.

Questions, let me know.

–tr

MarcEdit 7 release schedule planning

By reeset / On / In MarcEdit

I’m going to put this here to help folks that need to work with IT depts when putting new software on their machines.  At this point, with the new features, the updates related to the .NET language changes, the filtering of old XP code and the updated performance code, and new installer – this will be the largest update to the application since I ported the codebase from Assembly to C#.  Just looking at this past weekend, I added close to 17,000 lines of code while completing the clustering work, and removed ~3000 lines of code doing optimization work and removing redundant information. 

In total, work on MarcEdit 7 has been ongoing since April 2017 (formally), and informally since Jan. 2017.  However, last night, I hit a milestone of sorts – I set up the new build environment for MarcEdit 7.  In fact, this morning (around 1 am), I created the first version of the new MarcEdit 7 installer that can be installed without administrator permissions.  I’ve heard again and again that the administrator requirements are one of the single biggest issues for users in staying up to date.  With MarcEdit 7, the program will provide multiple installation options that should help to alleviate these problems.

Anyway, given the pace of change and my desire to have some folks put this through its paces prior to the formal release, I’ll be making multiple versions of MarcEdit 7 available for testing using the schedule below.  Please note, the Alpha and Beta dates are soft dates (they could move up or down by a few days), but the Release Date is a hard date.  Also, unlike previous versions of MarcEdit, MarcEdit 7 can be installed alongside MarcEdit 6, so both versions can live on the same machine.  To simplify this process, all test builds of MarcEdit will be released requiring non-administrator access to install, as this will allow me to sandbox the software more easily.

Alpha Testing:

Sept. 14, 2017 – this will be the first test version of MarcEdit 7.  It won’t be feature complete, but the features included should be finished and working, though I’m expecting to hear from people that some things are broken.  Really, this first version is for those waiting to get their hands on the installer and play with software that likely is a little broken.

Beta Testing:

Oct 2, 2017 – First beta build will be created.  New builds will likely be made available biweekly.

MarcEdit 7 Release Date:

Nov. 25, 2017 – MarcEdit 7.0.x release date.  The release will happen over the U.S. Thanksgiving Holiday. 

This gives users approximately 3 months to ensure that their local systems will be ready for the new update.  Remember, the system requirements are changing.  As of MarcEdit 7, the software will have the following system requirements on Windows (the Mac and Linux versions already have these requirements):

System Requirements:

  1. Operating System
    Windows 7-present (software may work on Windows Vista, but given the low install-base [smaller than Windows XP], Windows 7 will be the lowest version of Windows I’ll be officially testing on and supporting)
  2. .NET Version
    4.6.1+ –  Version 4.6.1 is the minimum required version of the .NET platform.  If you have Windows 8-10, you should be fine.  If you have Windows 7, you may have to update your .NET instance (though, this will happen automatically if you accept Microsoft’s updates).  If you have questions, you’ll want to contact your IT department.

That’s it.  But this does represent a very significant change for the program.  For years, I’ve been limping Windows XP support along, and MarcEdit 7 does represent a break from that platform.  I’ll be keeping the last version of MarcEdit 6.3.x available for users that run an unsupported operating system and cannot upgrade, though, I won’t be making any more changes to MarcEdit 6.3.x after MarcEdit 7 comes out. 

If you have questions, let me know.

–tr

MarcEdit 7 alpha: Introducing Clustering tools

By reeset / On / In MarcEdit

Folks sometimes ask me how I decide what kinds of new tools and functions to add to MarcEdit.  When I was an active cataloger/metadata librarian, the answer was easy – I added tools and functions that helped me do my work.  As my work has transitioned to more and more non-MARC/integrations work, I still add things to the program that I need (like the linked data tooling), but I’ve become more reliant on the MarcEdit and metadata communities to provide feedback regarding new features or changes to the program.

This is kind of how the Clustering work came about.  It started with this tweet: https://twitter.com/LibSkrat/status/898189609859002368.  There are already tools that catalogers can use to do large scale data clustering (OpenRefine), and my hope is that more and more individuals make use of them.  But in reading the responses and asking some questions, I started thinking about what this might look like in a tool like MarcEdit – and could I provide a set of lightweight functionality that would help users solve some problems, while at the same time exposing them to other tooling (like OpenRefine)…and I hope this is what I’ve done.

This work is very much still in active development, but I’ve started the process of creating a new way of batch editing records in MarcEdit.  The clustering tools will be provided as both a standalone resource and a resource integrated into the MarcEditor, and will be somewhat special in that they will require that the application extract the data out of MARC and store it in a different data model.  This will allow me to provide a different way of visualizing one’s data, and potentially make it easier to surface issues with specific data elements.

The challenge with doing clustering is that this is a very computationally expensive process.  From the indexing of the data out of MARC, to the creation of the clusters using different matching algorithms, the process can take time to generate.  But beyond performance, the question that I’m most interested in right now is how to make this function easier for users to navigate and understand: how to create an interface that makes it simple to navigate clustered groups and make edits within or across clustered groups.  I’m still trying to think about what this looks like.  Presently, I’ve created a simple interface to test the processes and start asking those questions.
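The post does not say which matching algorithms MarcEdit uses, so the following is only an illustration of the general idea, using a simplified form of the key-collision "fingerprint" approach that OpenRefine documents. Values are normalized (trimmed, lowercased, punctuation removed, tokens de-duplicated and sorted), and values that share a normalized key land in the same cluster. The FingerprintClustering class is a hypothetical name, and this version skips steps such as folding accented characters to ASCII.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FingerprintClustering
{
    // Builds a normalized key so that trivially different spellings collide.
    public static string Fingerprint(string value)
    {
        // Keep only letters, digits, and whitespace; lowercase everything.
        string cleaned = new string(value
            .Trim()
            .ToLowerInvariant()
            .Where(c => char.IsLetterOrDigit(c) || char.IsWhiteSpace(c))
            .ToArray());

        // Split on whitespace, drop duplicates, and sort the tokens.
        IEnumerable<string> tokens = cleaned
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Distinct()
            .OrderBy(t => t, StringComparer.Ordinal);

        return string.Join(" ", tokens);
    }

    // Groups the original values by their fingerprint.
    public static Dictionary<string, List<string>> Cluster(IEnumerable<string> values)
    {
        return values
            .GroupBy(Fingerprint)
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

Given headings such as "Twain, Mark" and "Mark Twain.", both reduce to the same key, so an editor could review them together and pick one form for the whole cluster.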

If you are interested in seeing how this function is being created and some of the assumptions being made as part of the development work, please see: https://youtu.be/DH93QDmeOW8

I’m interested in feedback – particularly around the questions of UI and editing options, so if you see the video and have thoughts, let me know.

–tr