MarcEdit 7: Release Candidate

By reeset / On / In MarcEdit

A new milestone was reached this past weekend: the MarcEdit 7 release candidate was posted. Over this next week, I’ll be working on tests, prepping final installation packages, writing documentation, and getting a package together for Linux installation. As I noted, the Mac version of MarcEdit will come later, as there are a number of UI changes that will need to be accommodated due to some differences with the new macOS install. My guess at this point is that I should complete most of the Mac work by Christmas.

Keep an eye out for more information on the final release. At this point, it should happen on Nov. 26th.

-tr

MarcEdit 7 alpha weekly build

By reeset / On / In MarcEdit

The following changes were made:

  • Bug Fix: Export Tab Delimited Records (open file and save file buttons not working)
  • Enhancement: XML Crosswalk wizard — enabled root element processing
  • Bug Fix: XML Crosswalk wizard — some elements not being picked up, all descendants should now be accounted for
  • Bug Fix: Batch Process Function – file collisions in subfolders would result in overwritten results.
  • Enhancement: Batch Processing Function: task processing uses the new task manager
  • Enhancement: Batch Processing Function: tasks can be processed as subdirectories

I had intended to move the program into beta this Sunday, but the above issues made me decide to keep it in alpha for one more week while I finish checking legacy forms/code.

Downloads can be retrieved from: http://marcedit.reeset.net/marcedit-7-alphabeta-downloads-page

–tr

MarcEdit Delete Field by Position documentation

By reeset / On / In MarcEdit

I was working through the code and found an option that quite honestly, I didn’t even know existed.  Since I’m creating new documentation for MarcEdit 7, I wanted to pin this somewhere so I wouldn’t forget again.

A number of times on the list, folks will ask if they can delete say the second field in a field group.  Apparently, you can.  In the MarcEditor, select the Add/Delete field tool.  To delete by position, you would enter {#} to denote the position to delete in the find.

Obviously, this is pretty obscure, so in MarcEdit 7, this function is exposed as an option:

image

To delete multiple field positions, you just add a comma. So, say I wanted to delete fields 2-5, I would enter 2,3,4,5 into the Field Data box and check this option. One enhancement that I would anticipate a request for is the ability to delete just the last field in a group. This is actually harder than you’d think, in part because it means I can’t process data as it comes in, but have to buffer it first and then process it, and there are some reasons why this complicates things due to the structure of the function. So for now, it’s by direct position. I’ll look at what it might take to allow for more abstract options (like last).
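As a rough illustration of the idea (this is not MarcEdit's code), here is a minimal Python sketch of deleting field occurrences by position. The record model (a list of tag/data tuples) and the function names `parse_positions` and `delete_by_position` are assumptions made up for the example.

```python
from typing import List, Set, Tuple

# Hypothetical, simplified stand-in for a MARC field: (tag, field data).
Field = Tuple[str, str]


def parse_positions(spec: str) -> Set[int]:
    """Turn a comma-separated list such as "2,3,4,5" into 1-based positions."""
    return {int(part) for part in spec.split(",") if part.strip()}


def delete_by_position(fields: List[Field], tag: str, spec: str) -> List[Field]:
    """Drop the Nth occurrences of `tag`, reading the record in a single pass."""
    positions = parse_positions(spec)
    seen = 0  # how many fields with this tag have been encountered so far
    kept = []
    for field in fields:
        if field[0] == tag:
            seen += 1
            if seen in positions:
                continue  # this occurrence is marked for deletion
        kept.append(field)
    return kept


record = [
    ("245", "Title"),
    ("650", "first"),
    ("650", "second"),
    ("650", "third"),
    ("856", "link"),
]
result = delete_by_position(record, "650", "2")
print(result)
```

Because the loop decides about each field as it arrives, it only works for absolute positions. A "last" option would need the total count of matching fields first, which is the buffering problem described above.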

–tr

MarcEdit 7 weekly build

By reeset / On / In MarcEdit

Issues completed as part of the MarcEdit 7 weekly update. A couple of things to highlight:

  • I’ve integrated a check and download of a Unicode Font into the Startup Wizard.  This will enable users to retrieve and install the Noto Fonts set into a private fonts collection for use by the application.
  • Clustering tools are now available as a stand-alone tool
  • New Translations
  • Lots of bug fixes

Thanks to all the folks that are downloading the alpha and trying it out.  Most of the bug reports are directly related to user testing.

  1. Enhancement: All processes: Updated Temp file management
  2. Bug Fix: Plugin Manager failing because it’s missing a column for MarcEdit version (note, none of the current plugins will work with MarcEdit 7)
  3. Enhancement: Added new languages for Croatian, Estonian, Indonesian, Hungarian, and Vietnamese
  4. Enhancement: Offer to download the Noto fonts into a private font collection when no Unicode font is present. This will make the fonts *only* available for use with MarcEdit.
  5. Bug Fix: When editing a task list, the list might not refresh. This occurs when you have a theme defined.
  6. Bug Fix: Update all the Z39.50/SRU databases (specifically — the lc databases point to the old voyager endpoint that I believe is turned off)
  7. Bug Fix: Working with Saxon, XSLT transformations that link to files with spaces or special characters fail
  8. Bug Fix: Clustering Tool — selecting a top level cluster would include # of records in the cluster, not just the data to copy
  9. Enhancement: Clustering Tools — Add to the Main Window as a stand-alone tool
  10. Bug Fix: On install, the file types are not associated
  11. Enhancement: New Fonts dialog to support private fonts collections
  12. Bug Fix: Fonts not sticking when using the startup wizard
  13. Enhancement: Added Unicode Font download to help
  14. Bug Fix: Z39.50/SRU downloads were only downloading as .mrk formatted data, not as binary MARC. The Tool has been updated to select download type by extension.
  15. Enhancement: Updated the Icon a bit so that it’s not so transparent on the desktop.

Finally, I recorded and uploaded a video demonstrating the new startup wizard options related to the unicode fonts.  Please see: https://youtu.be/7GWZ_UDUf00

The download can be retrieved from the MarcEdit 7 alpha/beta downloads page: http://marcedit.reeset.net/marcedit-7-alphabeta-downloads-page

Questions, let me know.

–tr

MarcEdit 7: Continued Task Improvement; Part 2

By reeset / On / In MarcEdit

Last week, I discussed some of the work I was doing to continue to evaluate how Task processing will work in MarcEdit 7.  To do some of this work, I’ve been working with a set of outlier data whose performance in MarcEdit 6.3 left much to be desired.  You can read about the testing and the file set here: MarcEdit 7: Continued Task Refinements

Over the week, I’ve continued to work on how this data is processed, hoping to move the processing time of this data from almost 7 hours in MarcEdit 6.3 to around 1 1/2 hours, and I’ve been able to do that and more.  My guess was that by adding targeted pre-processing statements into the task processing queue, I could improve processing by only running the task processes that absolutely had to be run.  In this case, I had 962 task actions, but on any given record, maybe 20-30 needed to be run.  By adding a preprocessing step, I was able to move the processing time from 2+ hours to 25 minutes.  My guess is that I’ve reached the ceiling in terms of optimizations, but I can live with this.  Of course, over the next few days, what I’ll need to do is validate that these new changes don’t cause the program to miss processing a step that should be run.  Generally, I’ve set up the preprocessing steps so that they will fall back to running the task when in doubt.
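To make the preprocessing idea concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the `TaskAction` class, its `find_text` attribute, and the plain-string record are all hypothetical and are not MarcEdit's internal structures. The point is the shape of the check: a cheap test decides whether an action can be skipped, and anything without a cheap test is run anyway.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TaskAction:
    """Hypothetical task action: a name, an edit, and an optional cheap pre-check."""
    name: str
    apply: Callable[[str], str]
    find_text: Optional[str] = None  # None means no cheap test is available


def needs_run(action: TaskAction, record: str) -> bool:
    """Cursory check: skip only when we are sure the action cannot match."""
    if action.find_text is None:
        return True  # when in doubt, fall back to running the task
    return action.find_text in record


def process_record(record: str, actions: List[TaskAction]) -> Tuple[str, int]:
    """Apply actions in order, testing each against the record's current state."""
    ran = 0
    for action in actions:
        if needs_run(action, record):
            record = action.apply(record)
            ran += 1
    return record, ran


actions = [
    TaskAction("fix-spelling", lambda r: r.replace("colour", "color"), find_text="colour"),
    TaskAction("trim", lambda r: r.strip()),  # no pre-check, so it always runs
    TaskAction("drop-local-note", lambda r: r.replace("[local]", ""), find_text="[local]"),
]
edited, actions_run = process_record("The colour of water ", actions)
print(edited, actions_run)
```

The pre-check looks at the record as it stands after earlier actions, so an action whose match text is created by a previous step is still picked up.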

–tr

MarcEdit 7: Continued Task Refinements

By reeset / On / In MarcEdit

Now that MarcEdit 7 is available for alpha testers, I’ve been getting back some feedback on the new task processing.  Some of this feedback relates to a couple of errors showing up in tasks that request user interaction…other feedback is related to the tasks themselves and continued performance improvements.

In this implementation, one of the areas that I am really focusing on is performance.  To that end, I changed the way that tasks are processed.  Previously, task processing looked very much like this:

image

A user would initiate a task via the GUI or command-line, and once the task was processed, the program would then, via the GUI, open a hidden window that would populate each of the “task” windows and then “click” the process button.  Essentially, it was working much like a program that “sends” keystrokes to a window, but in a method that was a bit more automated.

This process had some pros and cons.  On the plus-side, Tasks were something added to MarcEdit 6.x, so this allowed me to easily add the task processing functionality without tearing the program apart.  That was a major win, as tasks then were just a simple matter of processing the commands and filling a hidden form for the user.   On the con-side, the task processing had a number of hidden performance penalties.  While tasks automated processing (which allowed for improved workflows), each task processed the file separately, and after each process, the file would be reloaded into the MarcEditor.  Say you had a file that took 10 seconds to load and a task list with 6 tasks.  The file loading alone would cost you a minute.  Now, consider if that same file had to be processed by a task list with 60 different task elements: that would be 10 minutes dedicated just to file loading and unloading, and that doesn’t count the time to actually process the data.

This was a problem, so with MarcEdit 7, I took the opportunity to actually tear down the way that tasks work.  This meant divorcing the application from the task process and creating a broker that could evaluate tasks being passed to it, and manage the various aspects of task processing.  This has led to the development of a process model that looks more like this now:

image

Once a task is initiated and it has been parsed, the task operations are passed to a broker.  The broker then looks at the task elements and the file to be processed, and then negotiates those actions directly with the program libraries.  This removes any file loading penalties, and allows me to manage memory and temporary file use in a much more granular way.  It also immediately speeds up the process.  Take that file that takes 10 seconds to load and 60 tasks to complete.  Immediately, you improve processing time by 10 minutes.  But the question still arises, could I do more?

And the answer to this question is actually yes.  The broker has the ability to process tasks in a number of different ways.  One of these is by handling each task process one by one at a file level; the other is handling all tasks at once, but at a record level.  You might think that record level processing would always be faster, but it’s not.  Consider the task list with 60 tasks.  Some of these elements may only apply to a small subset of records.  In the by file process, I can quickly shortcut processing of records that are out of scope; in a record by record approach, I actually have to evaluate the record.  So, in testing, I found that when records are smaller than a certain size, and the number of task actions to process is within a certain number (regardless of file size), it was almost always better to process the data by file.  Where this changes is when you have a larger task list.  How large?  I’m still trying to figure that out.  But as an example, I had a real-world case sent to me that has over 950 task actions to process on a file ~350 MB (344,000 records) in size.   While the by file process is significantly faster than the MarcEdit 6.x method (where each process incurred a 17 second file load penalty), it still takes a lot of time because you are doing 950+ actions and complete file reads.  While this type of processing might not be particularly common (I do believe this is getting into outlier territory), the process can help to illustrate what I’m trying to teach the broker to do.  I ran this file using the three different processing methodologies, and here are the results:

  1. MarcEdit 6.3.x: 962 Task Actions, completing in ~7 hours
  2. By File: 962 Task Actions, completing in 3 hours, 12 minutes
  3. By Record: 962 Task Actions, completing in 2 hours and 20 minutes
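The difference between the two broker strategies is essentially loop order. The following Python sketch shows that under simplifying assumptions: records are plain strings, tasks are plain functions, and the names `process_by_file` and `process_by_record` are invented for the example. It says nothing about how MarcEdit implements either path internally.

```python
from typing import Callable, List

Record = str
Task = Callable[[Record], Record]


def process_by_file(records: List[Record], tasks: List[Task]) -> List[Record]:
    """Outer loop over tasks: each task makes one full pass over the file."""
    for task in tasks:
        records = [task(record) for record in records]
    return records


def process_by_record(records: List[Record], tasks: List[Task]) -> List[Record]:
    """Outer loop over records: every task runs on a record before the next record."""
    output = []
    for record in records:
        for task in tasks:
            record = task(record)
        output.append(record)
    return output


tasks: List[Task] = [
    lambda r: r.replace("colour", "color"),
    lambda r: r.upper(),
]
records = ["the colour red", "no match here"]

by_file = process_by_file(records, tasks)
by_record = process_by_record(records, tasks)
print(by_file == by_record)
```

Both orders apply the tasks in the same sequence to each record, so the output is the same. What changes is the cost profile: the by file form does one pass over the data per task, while the by record form reads the data once but must consider every task for every record.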

 

So, that’s still a really, really long time, but take a closer look at the file and the changes made, and you can start to see why this process takes so much time.  Looking at the results file, >10 million changes have been processed against the 340,000+ records.  Also, consider the number of iterations that must take place.  The average record has approximately 20 fields.  Since each task needs to act upon the results of the task before it, it’s impossible to have tasks process at the same time; rather, tasks must happen in succession.  This means that each task must process the entire record, as the results of a task may require an action based on data changed anywhere in the record.  This means that for one record, the program needs to run 962 operations, which means looping through 19,240 fields (assuming no fields are added or deleted).  Extrapolate that number to 340,000 records, and the program needs to evaluate 6,541,600,000 fields, or over 6 billion field evaluations, which works out to 49,557,575 field evaluations per minute.
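The arithmetic above can be checked with a few lines of Python. The inputs are the figures from the post; the only assumption is the run time used as the divisor. The 49,557,575 figure corresponds to a run of roughly 132 minutes, while the By Record time of 2 hours and 20 minutes (140 minutes) gives a somewhat lower rate.

```python
TASK_ACTIONS = 962
FIELDS_PER_RECORD = 20  # approximate average from the post
RECORDS = 340_000

# Each task walks every field of the record, so one record costs
# TASK_ACTIONS * FIELDS_PER_RECORD field evaluations.
evaluations_per_record = TASK_ACTIONS * FIELDS_PER_RECORD
total_evaluations = evaluations_per_record * RECORDS


def per_minute(total: int, minutes: int) -> int:
    """Whole field evaluations per minute for a run of the given length."""
    return total // minutes


rate_at_132_minutes = per_minute(total_evaluations, 132)
rate_at_140_minutes = per_minute(total_evaluations, 140)

print(f"{evaluations_per_record:,} field evaluations per record")
print(f"{total_evaluations:,} field evaluations in total")
print(f"{rate_at_132_minutes:,} per minute over 132 minutes")
print(f"{rate_at_140_minutes:,} per minute over 140 minutes")
```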

Ideally, I’d love to see the processing time for this task/file pair get down to around 1 hour and 30 minutes.  That would cut the current MarcEdit 7 processing time in half, and be almost 5 hours and 30 minutes faster than the current MarcEdit 6.3.x processing.  Can I get the processing down to that number?  I’m not sure.  There are still optimizations to be had (loops that can be optimized, buffering, etc.), but I think the biggest potential speed gains may be available by adding some pre-processing to a task process to do a cursory evaluation of a recordset if a set of find criteria is present.  This wouldn’t affect every task, but potentially could improve selective processing of Edit Indicator, Edit Field, Edit Subfield, and Add/Delete field functions.  This is likely the next area that I’ll be evaluating.

Of course, the other question to solve is what exactly is the tipping point when By File processing becomes less efficient than By Record processing.  My guess is that the characteristic that will be most applicable in this decision will be the number of task actions needing to be processed.  Splitting this file, for example, into a file of 1,000 records and running this task by record versus by file, we see the following:

  1. By File processing, 962 Task Actions, completed in: 0.69 minutes
  2. By Record Processing, 962 Task Actions, completed in: 0.36 minutes

 

In absolute terms the processing times are close, but By Record processing is nearly twice as fast as By File processing.  If we reduce the number of tasks to under 20, there is a dramatic switch in the processing time and By File processing is the clear winner.

Obviously, there is some additional work to be done here, and more testing to do to understand what characteristics and which processing style will lead to the greatest processing gains, but from this testing, I came away with a couple of pieces of information.  First, the MarcEdit 7 process, regardless of method used, is way faster than MarcEdit 6.3.x.  Second, the MarcEdit 7 process and the MarcEdit 6.3.x process suffered from a flaw related to temp file management.  You can’t see it unless you work with files this large and with this many tasks, but the program cleans up temporary files after all processing is complete.  Normally, in a single operation environment, that happens right away.  Since a task represents a single operation, ~962 temporary files at 350 MB per file were created as part of both processes.  That’s 336,700 MB of data, or roughly 336 GB of temporary data!  When you close the program, that data is all cleared, but again, ouch.  As I say, normally, you’d never see this kind of problem, but in this kind of edge case, it shows up clearly.  This has led me to implement periodic temp file cleanup so that no more than 10 temporary files are stored at any given time.  While that still means that in the case of this test file, up to roughly 3.5 GB of temporary data (10 files at 350 MB each) could be stored, the size of that temp cache would never grow larger.  This seems to be a big win, and something I would have never seen without working with this kind of data file and use case.
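One simple way to cap the number of temporary files is to track them in creation order and delete the oldest whenever the cap is exceeded. The Python sketch below shows that approach; the `TempFileCache` class is hypothetical and is not how MarcEdit manages its temp files, it just illustrates the "no more than 10 at a time" rule.

```python
import os
import tempfile
from collections import deque


class TempFileCache:
    """Hypothetical helper that keeps at most `max_files` temp files on disk."""

    def __init__(self, max_files: int = 10):
        self.max_files = max_files
        self._paths = deque()  # oldest file on the left, newest on the right

    def new_file(self) -> str:
        """Create a temp file, then evict the oldest files beyond the cap."""
        handle, path = tempfile.mkstemp(suffix=".tmp")
        os.close(handle)
        self._paths.append(path)
        while len(self._paths) > self.max_files:
            oldest = self._paths.popleft()
            if os.path.exists(oldest):
                os.remove(oldest)
        return path

    def cleanup(self) -> None:
        """Remove everything that is still tracked (for example, on exit)."""
        while self._paths:
            path = self._paths.popleft()
            if os.path.exists(path):
                os.remove(path)

    def __len__(self) -> int:
        return len(self._paths)


cache = TempFileCache(max_files=10)
paths = [cache.new_file() for _ in range(25)]  # simulate 25 task operations
still_on_disk = sum(os.path.exists(p) for p in paths)
print(still_on_disk)
cache.cleanup()
```

A real implementation would also have to make sure that a file still needed by a later step is never evicted; this sketch assumes each step only needs the most recent files.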

Finally, let’s say after all this work, I’m able to hit the best case benchmarks (1 hr. 30 min.) and a user still feels that this is too long.  What more could be done?  Honestly, I’ve been thinking about that…but really, very little.  There will be a performance ceiling given how MarcEdit has to process task data.  So for those users – if this kind of performance time wasn’t acceptable, I believe only a custom built solution would provide better performance – but even with a custom build, I doubt you’d see significant gains if one continued to require tasks to be processed in sequence.

Anyway, this is maybe a deeper dive into how tasks work in MarcEdit 6.3.x and how they will work in MarcEdit 7 than anyone really was looking for, but this particular set of files and use case represented an interesting opportunity to really test the various methods and provide benchmarks that easily demonstrate the impact of the current task process changes.

If you have questions, feel free to let me know.

–tr

MarcEdit 7: Add/Delete Field Changes

By reeset / On / In MarcEdit

I’m starting to think about the global editing functions in the MarcEditor, and one of the first things I’m trying to do is clean up a few confusing options in the interface.  This is the first update in thinking about these kinds of changes:

image

The idea here is to make it clear which options belong to which editing groupset as sometimes folks aren’t sure which options are add field options and which are delete field options.  Hopefully, this will make the form easier to decipher.

–tr

MarcEdit 7: Startup Wizard

By reeset / On / In MarcEdit

One of the aspects of MarcEdit that I’ve been trying to think a lot about over the past year is how to make it easier for users to know which configuration settings are important, and which ones are not.  This is the problem of writing a library metadata application that is MARC agnostic.  There are a lot of assumptions that users make because they associate MARC with the specific flavor of MARC that they are using.  So, for someone who only has exposure to MARC21, associating the title with MARC field 245 would be second nature.  But MarcEdit is used by a large community that doesn’t use MARC21, but UNIMARC (or other flavors for that matter).  For those users, the 245 field has a completely different meaning.

This presents a special challenge.  Simple things, like just displaying title information for a record, get harder, because assumptions I make for one set of users will cause issues for others.  To address this, MarcEdit has a rich set of application settings, designed to enable users to tell the application a little about the data they are working with.  Once that information is provided, MarcEdit can configure the components and adjust assumptions so title information pulls from the correct fields, or Unicode bits get updated in the correct leader locations.  The problem, from a usability perspective, is that these values are scattered among a wide range of other MarcEdit settings and preferences…which raises the question: which are the most important?
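To show why a flavor setting matters, here is a small Python sketch of a flavor-dependent title lookup. The mapping reflects the standard title fields (245 in MARC21, 200 in UNIMARC), but the dictionary, the `title_of` function, and the tuple-based record are assumptions for illustration and do not represent MarcEdit's settings format.

```python
from typing import List, Optional, Tuple

# Title field per MARC flavor; a real settings file would hold far more than this.
TITLE_FIELD_BY_FLAVOR = {
    "MARC21": "245",
    "UNIMARC": "200",
}

# Hypothetical, simplified record: a list of (tag, field data) pairs.
Record = List[Tuple[str, str]]


def title_of(record: Record, flavor: str) -> Optional[str]:
    """Return the first title field for the configured flavor, if present."""
    title_tag = TITLE_FIELD_BY_FLAVOR[flavor]
    for tag, data in record:
        if tag == title_tag:
            return data
    return None


unimarc_record = [("200", "Les Misérables"), ("700", "Hugo, Victor")]

right_setting = title_of(unimarc_record, "UNIMARC")
wrong_setting = title_of(unimarc_record, "MARC21")
print(right_setting, wrong_setting)
```

With the wrong flavor configured, the same record appears to have no title at all, which is the kind of silent mismatch the settings are meant to prevent.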

If you’ve installed MarcEdit 6 recently on a new computer, the way that the program has attempted to deal with this issue is by showing the preferences window on the application’s first run.  This means that the first time the program is executed, you see the following window:

image

Now, I’m not naïve.  I know that most users just click OK, and the program opens up for them, and they work with MarcEdit until they run across something that might require them to go back and look at the settings.  But when I do MarcEdit workshops, I get some specific questions related to accessibility (e.g., can I make the fonts bigger or change the font), display (my Unicode characters don’t display), UNIMARC versus MARC21, etc.  From the window above, you can answer all of these questions, but you have to know which settings group handles each option.  It’s admittedly a pain, and because of that, most workshops I do include 20-30 minutes just going over the settings that might be worth considering.

With MarcEdit 7, I have an opportunity to rethink how users interact with the program, and I started to think about how other software does this successfully.  By and large, the ones that I think are more successful provide a kind of wizard at the start that helps to push the most important options forward…and the best examples include a little bit of whimsy in the process.  No, I might not do whimsy well, but I can think about the setting groups that might be the most important to bring front and center to the user.

To that end, I’ve developed a startup wizard for MarcEdit 7.  All users that install the application will see it (because MarcEdit 7 will install into its own user space, everyone will have this first run experience).  Based on the answers to questions, I’m able to automatically set data in the background to ensure that the application is better configured for the user the first time they start using MarcEdit, rather than later, when they need help finding configuration settings.   It also will give me an opportunity to bring potential issues to the user’s attention.  So, for example, the tool will specifically look to see if you have a comprehensive Unicode font installed (so, MS Arial Unicode or the Noto Sans fonts).  If you don’t, the program will point you to help files that discuss how to get one for free, as this will directly impact how the program displays Unicode characters (and comes up all the time given some decisions Microsoft has made in distributing their own Unicode fonts).  Additionally, I’ll be utilizing some automatic translation services, so the program will automatically react to your system’s default language settings.  If they are English, text will show in English.  If they are Greek, the interface will show the machine translated Greek.  Users will have the option to change the language in the wizard, and I’ll provide notes about the translations (machine translations are getting better, but there’s bound to be some pretty odd text).  The hope is that this will make the program more accessible, and usable…and whimsical.  Yes, there is that too.  MarcEdit 7’s codename was developed after a nickname for my Golden Doodle.  So, she’s volunteered to help get users through the initial startup process.

The Wizard will likely change as I continue to evaluate settings groups, but at this point, I’m kind of leaning towards something that looks like this:

image

image

image

image

 

I’ve had a few folks walk through this process, and by and large, they find it much more accessible than the current “just show the settings screen” process.  Additionally, they like the idea of the language translations, but wonder if the machine translations will be useful (I did an initial set, they are what they are)…I’ll get more feedback on that before release.  If they aren’t useful, I may remove that option, though I have to feel that for folks where English is a challenge, having anything is better than nothing (though, I could be wrong).

But this is what I’m thinking.  It’s hopefully a little fun, easy to walk through, and will allow me to ensure that MarcEdit has been optimally configured for your data.  What do you think?

–tr