MarcEdit 7 alpha weekly build

By reeset / On / In MarcEdit

The following changes were made:

  • Bug Fix: Export Tab Delimited Records: the open file and save file buttons were not working
  • Enhancement: XML Crosswalk wizard: enabled root element processing
  • Bug Fix: XML Crosswalk wizard: some elements were not being picked up; all descendants should now be accounted for
  • Bug Fix: Batch Process Function: file collisions in subfolders would result in overwritten results
  • Enhancement: Batch Processing Function: task processing uses the new task manager
  • Enhancement: Batch Processing Function: tasks can be processed as subdirectories

I had intended to move the program into beta this Sunday, but the above issues made me decide to keep it in alpha for one more week while I finish checking legacy forms/code.

Downloads can be retrieved from: http://marcedit.reeset.net/marcedit-7-alphabeta-downloads-page

–tr

MarcEdit Delete Field by Position documentation

By reeset / On / In MarcEdit

I was working through the code and found an option that, quite honestly, I didn’t even know existed. Since I’m creating new documentation for MarcEdit 7, I wanted to pin this somewhere so I wouldn’t forget again.

A number of times on the list, folks have asked whether they can delete, say, the second field in a field group. Apparently, you can. In the MarcEditor, select the Add/Delete Field tool. To delete by position, you would enter {#} in the find, where # denotes the position to delete.

Obviously, this is pretty obscure, so in MarcEdit 7 this function is exposed as an option:

[Image: the delete-by-position option in the Add/Delete Field tool]

To delete multiple field positions, you just add a comma. So, say I wanted to delete fields 2-5, I would enter 2,3,4,5 into the Field Data box and check this option. One enhancement that I anticipate a request for is the ability to delete just the last field. This is actually harder than you’d think, in part because it means I can’t process data as it comes in; I have to buffer it first and then process it, and the structure of the function complicates that. So for now, it’s by direct position. I’ll look at what it might take to allow for more abstract options (like last).
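The positional delete described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not MarcEdit's actual code: a record is modelled as a list of (tag, data) tuples, and `parse_positions` and `delete_by_position` are hypothetical helper names. Because the positions are explicit, fields can be dropped as they stream past; a "last" option would need the whole field group buffered before anything could be written out.

```python
from typing import Iterable, Iterator, Set, Tuple

Field = Tuple[str, str]  # (tag, field data): a simplified stand-in for a MARC field


def parse_positions(spec: str) -> Set[int]:
    """Turn a comma-separated list such as "2,3,4,5" into {2, 3, 4, 5}."""
    return {int(part) for part in spec.split(",") if part.strip()}


def delete_by_position(fields: Iterable[Field], tag: str, spec: str) -> Iterator[Field]:
    """Yield fields, skipping occurrences of `tag` whose 1-based position is in `spec`."""
    positions = parse_positions(spec)
    seen = 0  # how many fields with this tag have gone past so far
    for field in fields:
        if field[0] == tag:
            seen += 1
            if seen in positions:
                continue  # drop this occurrence
        yield field


record = [("245", "Title"), ("650", "A"), ("650", "B"), ("650", "C"), ("700", "Name")]
kept = list(delete_by_position(record, "650", "2,3"))
print(kept)
```

Here the second and third 650 fields are removed, while the first 650 and all other fields pass through untouched.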

–tr

MarcEdit 7 weekly build

By reeset / On / In MarcEdit

Issues completed as part of the MarcEdit 7 weekly update are listed below. A couple of things to highlight:

  • I’ve integrated a check and download of a Unicode font into the Startup Wizard. This will enable users to retrieve and install the Noto Fonts set into a private fonts collection for use by the application.
  • Clustering tools are now available as a stand-alone tool
  • New translations
  • Lots of bug fixes

Thanks to all the folks that are downloading the alpha and trying it out.  Most of the bug reports are directly related to user testing.

  1. Enhancement: All processes: Updated Temp file management
  2. Bug Fix: Plugin Manager failing because it’s missing a column for MarcEdit version (note, none of the current plugins will work with MarcEdit 7)
  3. Enhancement: Added new languages for Croatian, Estonian, Indonesian, Hungarian, and Vietnamese
  4. Enhancement: Offer to download the Noto fonts into a private font collection when no Unicode font is present. This will make the fonts *only* available for use with MarcEdit.
  5. Bug Fix: When editing a task list, the list could fail to refresh. This occurred when a theme was defined.
  6. Bug Fix: Update all the Z39.50/SRU databases (specifically, the LC databases pointed to the old Voyager endpoint, which I believe is turned off)
  7. Bug Fix: Working with Saxon, XSLT transformations that link to files with spaces or special characters fail
  8. Bug Fix: Clustering Tool — selecting a top level cluster would include # of records in the cluster, not just the data to copy
  9. Enhancement: Clustering Tools — Add to the Main Window as a stand-alone tool
  10. Bug Fix: On install, the file types are not associated
  11. Enhancement: New Fonts dialog to support private font collections
  12. Bug Fix: Fonts not sticking when using the startup wizard
  13. Enhancement: Added Unicode Font download to help
  14. Bug Fix: Z39.50/SRU downloads were only downloading as .mrk formatted data, not as binary MARC. The Tool has been updated to select download type by extension.
  15. Enhancement: Updated the Icon a bit so that it’s not so transparent on the desktop.

Finally, I recorded and uploaded a video demonstrating the new startup wizard options related to the Unicode fonts. Please see: https://youtu.be/7GWZ_UDUf00

The download can be retrieved from the MarcEdit 7 alpha/beta downloads page: http://marcedit.reeset.net/marcedit-7-alphabeta-downloads-page

Questions, let me know.

–tr

Saxon.NET and local file paths with special characters and spaces

By reeset / On / In C#

I thought I’d post this here in case it can help other folks. One of the parsers that I like to use is Saxon.NET, but within the .NET platform at least, it has problems doing XSLT or XQuery transformations when the files in question have paths with special characters or spaces (or when they reference files via xsl:include statements that live inside such paths). The question comes up a lot on the Saxon support site, and it sounds like Saxon is actually processing the data correctly: Saxon expects valid URIs, and a URI can’t contain spaces. Internally, the URI is escaped, but when those escaped paths are resolved against a local file system, accessing the file fails. So, what do I mean? Here are two different types of problems I encounter:

  • Path 1: c:\myfile\C#\folder1\test.xsl
  • Path 2: c:\myfile\C#\folder 1\test.xsl

When setting up a transformation using Saxon, you set up an XsltTransformer. You can set this up using either a stream, like an XmlReader, or a URI. But here’s the problem. If you create the statement like this:

System.Xml.XmlReader xstream = System.Xml.XmlReader.Create(filepath);
transformer = xsltCompiler.Compile(xstream).Load();

The program can read Path 1, but will always fail on Path 2, and will fail on Path 1 if the stylesheet pulls in secondary files. If, rather than using a stream, I use a Uri like:

transformer = xsltCompiler.Compile(new Uri(sXSLT, UriKind.Absolute)).Load();

Both paths will break. On the Saxon list, there was a suggestion to create a sealed class and to wrap the URI in that class. So, you’d end up with code that looks more like:

transformer = xsltCompiler.Compile(new SaxonUri(new Uri(sXSLT, UriKind.Absolute))).Load();

// Requires: using System;
public sealed class SaxonUri : Uri
{
    public SaxonUri(Uri wrappedUri)
        : base(GetUriString(wrappedUri), GetUriKind(wrappedUri))
    {
    }

    // Absolute URIs are passed to the base class in their escaped form;
    // relative URIs are passed exactly as they were originally written.
    private static string GetUriString(Uri wrappedUri)
    {
        if (wrappedUri == null)
            throw new ArgumentNullException("wrappedUri", "wrappedUri is null.");
        if (wrappedUri.IsAbsoluteUri)
            return wrappedUri.AbsoluteUri;
        return wrappedUri.OriginalString;
    }

    private static UriKind GetUriKind(Uri wrappedUri)
    {
        if (wrappedUri == null)
            throw new ArgumentNullException("wrappedUri", "wrappedUri is null.");
        if (wrappedUri.IsAbsoluteUri)
            return UriKind.Absolute;
        return UriKind.Relative;
    }

    // Return the original string when it is well formed; otherwise
    // return the escaped absolute form when one is available.
    public override string ToString()
    {
        if (IsWellFormedOriginalString())
            return OriginalString;
        else if (IsAbsoluteUri)
            return AbsoluteUri;
        return base.ToString();
    }
}

And this gets a little closer. Using this syntax, Path 1 doesn’t work, but Path 2 will. So, you could use an if…then statement to look for spaces in the XSLT file path: if there are no spaces, open the stream, and if there are, wrap the URI. Unfortunately, that doesn’t work either, because if you include a reference (like xsl:include) in your XSLT, both Path 1 and Path 2 fail. Internally, the BaseURI is set to an escaped version of the URI, and Windows will fail to locate the file. At this point, you end up feeling like you might be pretty much screwed, but there are still other options; they just take more work. In my case, the solution that I adopted was to create a custom XmlResolver. This allows me to handle all the URI processing myself, and in the case of the two path statements, I’m interested in handling all local file URIs. Here’s how that works:

xsltCompiler.XmlResolver = new CustomeResolver();
transformer = xsltCompiler.Compile(new Uri(sXSLT, UriKind.Absolute)).Load();

// Requires: using System; using System.Xml;
internal class CustomeResolver : XmlUrlResolver
{
    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        if (absoluteUri.IsFile)
        {
            string filename = absoluteUri.LocalPath;
            if (!System.IO.File.Exists(filename))
            {
                // The path may still be escaped (for example %20 or %23),
                // so try the unescaped form before giving up.
                filename = Uri.UnescapeDataString(filename);
                if (System.IO.File.Exists(filename))
                {
                    // Open read-only so that read-only files can still be loaded.
                    return System.IO.File.OpenRead(filename);
                }
            }
        }

        // Non-file URIs, paths that already resolve, and files that cannot be
        // found either way are all handed to the default resolver.
        return base.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}

By creating your own XmlResolver, you can fix the URI problems and allow Saxon to process both use cases above.

–tr

MarcEdit 7: Continued Task Improvement; Part 2

By reeset / On / In MarcEdit

Last week, I discussed some of the work I was doing to continue to evaluate how Task processing will work in MarcEdit 7. To do some of this work, I’ve been working with a set of outlier data whose performance in MarcEdit 6.3 left much to be desired. You can read about the testing and the file set here: MarcEdit 7: Continued Task Refinements

Over the week, I’ve continued to work on how this data is processed, hoping to move the processing time from almost 7 hours in MarcEdit 6.3 to around 1 1/2 hours, and I’ve been able to do that and more. My guess was that by adding targeted pre-processing statements into the task processing queue, I could improve processing by only running the task actions that absolutely had to be run. In this case, I had 962 task actions, but on any given record, maybe 20-30 needed to be run. By adding a pre-processing step, I was able to move the processing time from 2+ hours to 25 minutes. My guess is that I’ve reached the ceiling in terms of optimizations, but I can live with this. Of course, over the next few days, what I’ll need to do is validate that these new changes don’t cause the program to miss a step that should be run. Generally, I’ve set up the pre-processing steps so that they fall back to running the task when in doubt.
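The pre-processing idea can be illustrated with a small sketch. This is an assumption-heavy Python model, not MarcEdit's implementation: records are plain strings, `TaskAction`, `might_apply`, and `run_tasks` are hypothetical names, and the check is a simple substring test. The one rule taken from the description above is the fallback: when there is no way to tell cheaply whether an action applies, it runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TaskAction:
    name: str
    apply: Callable[[str], str]      # transforms a record (plain text in this sketch)
    find_text: Optional[str] = None  # literal find criteria, when the action has one


def might_apply(action: TaskAction, record: str) -> bool:
    """Cheap pre-check. With no find criteria we are in doubt, so we say yes."""
    if action.find_text is None:
        return True
    return action.find_text in record


def run_tasks(record: str, actions: List[TaskAction]) -> Tuple[str, int]:
    """Apply actions in order, skipping those the pre-check rules out.

    The check uses the current state of the record, because earlier
    actions may have changed the data a later action looks for.
    """
    executed = 0
    for action in actions:
        if might_apply(action, record):
            record = action.apply(record)
            executed += 1
    return record, executed


actions = [
    TaskAction("normalize spelling", lambda r: r.replace("colour", "color"), "colour"),
    TaskAction("drop local note", lambda r: r.replace("$xLOCAL", ""), "$xLOCAL"),
    TaskAction("trim whitespace", lambda r: r.strip()),  # no criteria: always runs
]
result, executed = run_tasks("=245  10$aA colour atlas ", actions)
print(result, executed)
```

In this example only two of the three actions run, since the record never contains the local note the second action looks for.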

–tr

MarcEdit 7: Continued Task Refinements

By reeset / On / In MarcEdit

Now that MarcEdit 7 is available for alpha testers, I’ve been getting back some feedback on the new task processing.  Some of this feedback relates to a couple of errors showing up in tasks that request user interaction…other feedback is related to the tasks themselves and continued performance improvements.

In this implementation, one of the areas that I am really focusing on is performance.  To that end, I changed the way that tasks are processed.  Previously, task processing looked very much like this:

[Image: diagram of the MarcEdit 6.x task processing model]

A user would initiate a task via the GUI or command-line, and once the task was processed, the program would then, via the GUI, open a hidden window that would populate each of the “task” windows and then “click” the process button.  Essentially, it was working much like a program that “sends” keystrokes to a window, but in a method that was a bit more automated.

This process had some pros and cons. On the plus side, tasks were something added in MarcEdit 6.x, so this approach allowed me to easily add the task processing functionality without tearing the program apart. That was a major win, as tasks then were just a simple matter of processing the commands and filling a hidden form for the user. On the con side, the task processing had a number of hidden performance penalties. While tasks automated processing (which allowed for improved workflows), each task processed the file separately, and after each process, the file would be reloaded into the MarcEditor. Say you had a file that took 10 seconds to load and a task list with 6 tasks. The file loading alone would cost you a minute. Now, consider if that same file had to be processed by a task list with 60 different task elements: that would be 10 minutes dedicated just to file loading and unloading, and that doesn’t count the time to actually process the data.
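The reload penalty is simple arithmetic, shown here as a tiny Python sketch using the 10 second load time and the task counts from the example above. The function name `reload_penalty_minutes` is only for illustration.

```python
def reload_penalty_minutes(load_seconds: float, task_count: int) -> float:
    """Time spent only on reloading the file, once per task, in minutes."""
    return load_seconds * task_count / 60


for task_count in (6, 60):
    minutes = reload_penalty_minutes(10, task_count)
    print(task_count, "tasks:", minutes, "minutes of file loading")
```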

This was a problem, so with MarcEdit 7, I took the opportunity to actually tear down the way that tasks work.  This meant divorcing the application from the task process and creating a broker that could evaluate tasks being passed to it, and manage the various aspects of task processing.  This has led to the development of a process model that looks more like this now:

[Image: diagram of the MarcEdit 7 task processing model with the task broker]

Once a task is initiated and it has been parsed, the task operations are passed to a broker. The broker then looks at the task elements and the file to be processed, and then negotiates those actions directly with the program libraries. This removes any file loading penalties, and allows me to manage memory and temporary file use in a much more granular way. It also immediately speeds up the process. Take that file that takes 10 seconds to load and 60 tasks to complete. Immediately, you improve processing time by 10 minutes. But the question still arises, could I do more?

And the answer to this question is actually yes. The broker has the ability to process tasks in a number of different ways. One is handling each task one by one at the file level; the other is handling all tasks at once, but at the record level. You might think that record-level processing would always be faster, but it’s not. Consider the task list with 60 tasks. Some of these elements may only apply to a small subset of records. In the by-file process, I can quickly shortcut processing of records that are out of scope; in a record-by-record approach, I actually have to evaluate the record. So, in testing, I found that when records are smaller than a certain size, and the number of task actions to process is within a certain number (regardless of file size), it is almost always better to process the data by file.

Where this changes is when you have a larger task list. How large? I’m trying to figure that out. But as an example, I had a real-world case sent to me that has over 950 task actions to process on a file ~350 MB (344,000 records) in size. While the by-file process is significantly faster than the MarcEdit 6.x method (where each process incurred a 17 second file load penalty), this still takes a lot of time to process because you are doing 950+ actions and complete file reads. While this type of processing might not be particularly common (I do believe this is getting into outlier territory), the process can help to illustrate what I’m trying to teach the broker to do. I ran this file using the three different processing methodologies, and here are the results:

  1. MarcEdit 6.3.x: 962 Task Actions, completing in ~7 hours
  2. By File: 962 Task Actions, completing in 3 hours, 12 minutes
  3. By Record: 962 Task Actions, completing in 2 hours and 20 minutes
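The difference between the two broker strategies comes down to loop order. The sketch below is a simplified Python model, not MarcEdit's broker: records are strings, actions are plain functions, and `process_by_file` and `process_by_record` are hypothetical names. It leaves out the out-of-scope shortcuts described above and shows only that both orders produce the same result, because each record still sees every action in sequence.

```python
from typing import Callable, List

Record = str
Action = Callable[[Record], Record]


def process_by_file(records: List[Record], actions: List[Action]) -> List[Record]:
    """Outer loop over actions: one complete pass over the file per action."""
    for action in actions:
        records = [action(record) for record in records]
    return records


def process_by_record(records: List[Record], actions: List[Action]) -> List[Record]:
    """Outer loop over records: one pass over the file, all actions per record."""
    result = []
    for record in records:
        for action in actions:
            record = action(record)
        result.append(record)
    return result


actions = [str.strip, str.upper, lambda r: r.replace("A", "a")]
records = [" abc ", "banana"]
by_file = process_by_file(records, actions)
by_record = process_by_record(records, actions)
print(by_file, by_record)
```

With 962 actions, the by-file order means 962 passes over the whole file, while the by-record order reads the file once but cannot skip a record without evaluating it.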

So, that’s still a really, really long time, but take a closer look at the file and the changes made, and you can start to see why this process takes so much time. Looking at the results file, >10 million changes have been processed against the 340,000+ records. Also, consider the number of iterations that must take place. The average record has approximately 20 fields. Since each task needs to act upon the results of the task before it, it’s impossible to have tasks process at the same time; rather, tasks must happen in succession. This means that each task must process the entire record, as the results of a task may require an action based on data changed anywhere in the record. So for one record, the program needs to run 962 operations, which means looping through 19,240 fields (assuming no fields are added or deleted). Extrapolate that number for 340,000 records, and the program needs to evaluate 6,541,600,000 fields, or over 6 billion field evaluations. Over the 2 hours and 20 minutes of the By Record run, that works out to roughly 46.7 million field evaluations per minute.
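For anyone who wants to check the arithmetic, here is the calculation as a short Python sketch. The inputs are the figures quoted above (962 task actions, roughly 20 fields per record, 340,000 records); the per-minute rate assumes the By Record time of 2 hours and 20 minutes, i.e. 140 minutes.

```python
task_actions = 962
fields_per_record = 20           # approximate average from the text
records = 340_000
by_record_minutes = 2 * 60 + 20  # 2 hours 20 minutes

fields_per_record_pass = task_actions * fields_per_record
total_field_evaluations = fields_per_record_pass * records
per_minute = total_field_evaluations / by_record_minutes

print(f"{fields_per_record_pass:,} field evaluations per record")
print(f"{total_field_evaluations:,} field evaluations in total")
print(f"{per_minute:,.0f} field evaluations per minute")
```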

Ideally, I’d love to see the processing time for this task/file pair get down to around 1 hour and 30 minutes. That would cut the current MarcEdit 7 processing time in half, and be almost 5 hours and 30 minutes faster than the current MarcEdit 6.3.x processing. Can I get the processing down to that number? I’m not sure. There are still optimizations to be had (loops that can be optimized, buffering, etc.), but I think the biggest potential speed gains may come from adding some pre-processing to a task process to do a cursory evaluation of a recordset if a set of find criteria is present. This wouldn’t affect every task, but potentially could improve selective processing of Edit Indicator, Edit Field, Edit Subfield, and Add/Delete field functions. This is likely the next area that I’ll be evaluating.

Of course, the other question to solve is where exactly the tipping point lies at which By File processing becomes less efficient than By Record processing. My guess is that the characteristic most applicable to this decision will be the number of task actions needing to be processed. For example, splitting this file into a file of 1,000 records and running this task by record versus by file, we see the following:

  1. By File processing, 962 Task Actions, completed in: 0.69 minutes
  2. By Record Processing, 962 Task Actions, completed in: 0.36 minutes

The processing times are relatively close, but the By Record processing is twice as fast as the By File Processing.  If we reduced the number of tasks to under 20, there is a dramatic switch in the processing time and By File Processing is the clear winner.

Obviously, there is some additional work to be done here, and more testing to do to understand which characteristics and which processing style will lead to the greatest processing gains, but from this testing, I came away with a couple of pieces of information. First, the MarcEdit 7 process, regardless of method used, is way faster than MarcEdit 6.3.x. Second, the MarcEdit 7 process and the MarcEdit 6.3.x process both suffered from a flaw related to temp file management. You can’t see it unless you work with files this large and with this many tasks, but the program cleans up temporary files only after all processing is complete. Normally, in a single operation environment, that happens right away. Since each task represents a single operation, ~962 temporary files at 350 MB per file were created as part of both processes. That’s 336,700 MB, or roughly 336 GB, of temporary data! When you close the program, that data is all cleared, but again, ouch. As I say, normally you’d never see this kind of problem, but in this kind of edge case, it shows up clearly. This has led me to implement periodic temp file cleanup so that no more than 10 temporary files are stored at any given time. While that still means that in the case of this test file, up to roughly 3.5 GB of temporary data could be stored, the temp cache would never grow larger than that. This seems to be a big win, and something I would never have seen without working with this kind of data file and use case.
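A capped temp file cache of the kind described above might look something like the following. This is a hedged Python sketch, not MarcEdit's code: the `TempFilePool` class and its methods are hypothetical, and the only details taken from the text are the limit of 10 files and the sizes used in the arithmetic at the end.

```python
import os
import tempfile
from collections import deque
from typing import Deque


class TempFilePool:
    """Tracks temp files and deletes the oldest once a limit is exceeded."""

    def __init__(self, max_files: int = 10) -> None:
        self.max_files = max_files
        self._paths: Deque[str] = deque()

    def create(self, data: bytes) -> str:
        """Write data to a new temp file, then trim the pool to its limit."""
        fd, path = tempfile.mkstemp(suffix=".tmp")
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        self._paths.append(path)
        self._trim()
        return path

    def _trim(self) -> None:
        while len(self._paths) > self.max_files:
            self._remove(self._paths.popleft())  # oldest file goes first

    def cleanup(self) -> None:
        """Remove everything that is left, e.g. when processing finishes."""
        while self._paths:
            self._remove(self._paths.popleft())

    @staticmethod
    def _remove(path: str) -> None:
        if os.path.exists(path):
            os.remove(path)


pool = TempFilePool(max_files=10)
paths = [pool.create(b"intermediate result") for _ in range(25)]
remaining = [p for p in paths if os.path.exists(p)]
print(len(remaining), "temp files on disk")
pool.cleanup()

# Worst-case disk use for the test file discussed above, in MB.
uncapped_mb = 962 * 350  # one 350 MB temp file per task action
capped_mb = 10 * 350     # never more than 10 files at once
print(uncapped_mb, capped_mb)
```

The pool keeps only the most recent files, which is enough when each step reads just the output of the step before it.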

Finally, let’s say after all this work, I’m able to hit the best case benchmarks (1 hr. 30 min.) and a user still feels that this is too long.  What more could be done?  Honestly, I’ve been thinking about that…but really, very little.  There will be a performance ceiling given how MarcEdit has to process task data.  So for those users – if this kind of performance time wasn’t acceptable, I believe only a custom built solution would provide better performance – but even with a custom build, I doubt you’d see significant gains if one continued to require tasks to be processed in sequence.

Anyway, this is maybe a bit more of a deep dive into how tasks work in MarcEdit 6.3.x and how they will work in MarcEdit 7 than anyone was really looking for, but this particular set of files and use case represented an interesting opportunity to really test the various methods and provide benchmarks that easily demonstrate the impact of the current task process changes.

If you have questions, feel free to let me know.

–tr

MarcEdit 7: Add/Delete Field Changes

By reeset / On / In MarcEdit

I’m starting to think about the global editing functions in the MarcEditor, and one of the first things I’m trying to do is start to flesh out a few confusing options related to the interface. This is the first update in thinking about these kinds of changes:

[Image: the revised Add/Delete Field form]

The idea here is to make it clear which options belong to which editing group, as sometimes folks aren’t sure which options are add field options and which are delete field options. Hopefully, this will make the form easier to decipher.

–tr