I spent some time this week working on a few updates for MarcEdit 6.3. The full change log is below (for all versions).
* Bug Fix: MarcEditor: When processing data with right to left characters, the embedded markers were getting flagged by the validator.
* Bug Fix: MarcEditor: When processing data with right to left characters, I’ve heard that there have been some occasions when the markers are making it into the binary files (they shouldn’t). I can’t recreate it, but I’ve strengthened the filters to make sure that these markers are removed when the mnemonic file format is saved.
* Bug Fix: Linked data tool: When creating VIAF entries in the $0, the subfield code could be dropped. This was missed because VIAF should no longer be added to the $0, so I assumed this was no longer a valid use case. However, local practice in some places is overriding best practice. This has been fixed.
A note on the MarcEditor changes. The processing of right to left characters is something I was aware of with regard to the validator – but in all my testing and unit tests, the data was always filtered prior to compiling. The markers that are inserted are for display, as noted here: http://blog.reeset.net/archives/2103. However, on the pymarc list, there was apparently an instance where these markers slipped through. The conversation can be found here: https://groups.google.com/forum/#!topic/pymarc/5zxuOh0fVuc. I posted a long response on the list, but I think it’s being held in moderation (I’m a new member of the list). Generally, here’s what I found: I can’t recreate it, but I have updated the code to ensure that this shouldn’t happen. Once a mnemonic file is saved (and that happens prior to compiling), these markers are removed from the file. If you find this isn’t the case, let me know. I could add the filter down at the MARCEngine level, but I’d rather not, as there are cases where these values may be present (legally)…this is why the filtering happens in the Editor, where it can assess their use and, if the markers are already present, determine whether they are used correctly.
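To make the filtering idea concrete, here’s a rough Python sketch of the kind of cleanup described above – this is an illustration, not MarcEdit’s actual C# code, and the sample record content is invented:

```python
import re

# Unicode bidirectional control characters of the kind inserted for
# right-to-left display: LRM/RLM marks and the LRE/RLE/PDF/LRO/RLO
# embedding controls.
BIDI_MARKERS = re.compile(r"[\u200e\u200f\u202a-\u202e]")

def strip_bidi_markers(text: str) -> str:
    """Remove display-only bidi markers before compiling to binary MARC."""
    return BIDI_MARKERS.sub("", text)

# An invented mnemonic-format line with embedded RLE/PDF markers.
record = "=245  10$a\u202b\u05e2\u05d1\u05e8\u05d9\u05ea\u202c : $bsubtitle."
cleaned = strip_bidi_markers(record)
```

The point is that the markers are legal in display contexts, which is why a blanket filter at the engine level is undesirable – the Editor is where their use can be assessed.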
One of the appalling discoveries when taking a closer look at the MarcEdit 6 codebase was the presence of 3(!) Z39.50 clients, all using slightly different codebases. This happened because of the ILS integration, the direct Z39.50 database editing, and the actual Z39.50 client. In the Mac version, these clients are all the same thing – so I wanted to emulate that approach in the Windows/Linux version. And as a plus, maybe I would stop (or reduce) my utter disdain at having to support Z39.50 generally, within any library program that I work with.
* Sidebar – I really, really, really can’t stand working with Z39.50. SRU is a fine replacement for the protocol, and yet, over the 10-15 years that it’s been available, SRU remains a fringe protocol. That tells me two things:
Library vendors generally have rejected this as a protocol, and there are some good reasons for this…most vendors that support it (and I’m thinking specifically about ExLibris) use a custom profile. This is a pain in the ass because a custom profile requires code to handle foreign namespaces. This wouldn’t be a problem if it only happened occasionally, but it happens all the time. Every SRU implementation works best if you use its custom profiles. I think what made Z39.50 work is the well-defined set of Bib-1 attributes. The flexibility in SRU is a good thing, but I also think it’s why very few people support it, and fewer understand how it actually works.
SRU is a poor solution to begin with. Hey, just like OAI-PMH, we created library standards to work on the web. If we had it to do over again, we’d do it differently. We should probably do it differently at this point…because supporting SRU in software is basically just checking a box. People have heard about it, they ask for it, but pretty much no one uses it.
By consolidating the Z39.50 client code, I’m able to clean out a lot of old code, and better yet, actually focus on a few improvements (which has been hard because I make improvements in the main client, but forget to port them everywhere else). The main improvements that I’ll be applying have to do with searching multiple databases. Single search has always allowed users to select up to 5 databases to query. I may remove that limit; it’s kind of an arbitrary one. However, I’ll also be adding this functionality to the batch search. When doing multiple database searches in batch, users will have an option to take all records, the first record found, or potentially (I haven’t worked this one out), records based on order of database preference.
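To sketch the batch options being considered, here’s a hypothetical Python illustration – the database names and the `search_db()` helper are invented placeholders standing in for live Z39.50 queries, not MarcEdit internals:

```python
# Invented per-database results, standing in for live Z39.50 responses.
CATALOGS = {
    "LC": [],
    "OhioLINK": ["rec-1"],
    "Local": ["rec-2", "rec-3"],
}

def search_db(db, term):
    """Placeholder for a real Z39.50 query against one database."""
    return CATALOGS[db]

def batch_search(term, databases, mode="first"):
    """Query databases in preference order.

    mode="first": stop at the first record found (preference order wins).
    mode="all":   keep every record from every database.
    """
    hits = []
    for db in databases:
        results = search_db(db, term)
        if mode == "first" and results:
            return results[:1]
        hits.extend(results)
    return hits
```

Because the databases are walked in the order the user lists them, the "first record found" option doubles as a database-preference ranking.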
Z39.50 Database Settings:
There will be a preferences panel as well (haven’t created it yet), but this is where you will set proxy information and notes related to batch preferences. You will no longer need to set title field or limits, as the limits are moving to the search screen (this has always needed to be variable) and the title field data is being pulled from preferences already set in the program preferences.
One of the benefits of making these changes is that they fold the Z39.50/SRU client into the main MarcEdit application (rather than a program that was shelled to), which allows me to leverage the same accessibility platform that has been developed for the rest of the application. It also highlights one of the other changes happening in MarcEdit 7. MarcEdit 6 is a collection of about 7 or 8 individual executables. This makes sense in some cases, less sense in others. I’m evaluating all the stand-alone programs; where I replicate their functionality in the main program, the code (both external and internal) needs to be re-evaluated and put in one spot. Having these as separate programs might have been a good thing initially, but the structure of the application has changed. In practice, this means that in some cases, like the Z39.50 client, the code will move into MarcEdit proper (rather than being a separate program called mebatch.exe), and for SQL interactions, it means I’ll create a single shared library (rather than replicating code between three different component parts…the SQL explorer, the ILS integration, and the local database query tooling).
Metadata transformations can be really difficult. While I try to make them easier in MarcEdit, the reality is, the program really has functioned for a long time as a facilitator of the process; handling the binary data processing and character set conversions that may be necessary. But the heavy lifting, that’s all been on the user. And if you think about it, there is a lot of expertise tied up in even the simplest transformation. Say your library gets an XML file full of records from a vendor. As a technical services librarian, I’d have to go through the following steps to remap that data into MARC (or something else):
1. Evaluate the vended data file
2. Create a metadata dictionary for the new XML file (so I know what each data element represents)
3. Create a mapping between the data dictionary for the vended file and MARC
4. Create the XSLT crosswalk that contains all the logic for turning this data into MARCXML
5. Set up the process to move data between XML=>MARC
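The mapping and crosswalk steps above are where the expertise lives. As a rough, stdlib-only Python sketch of that logic (the vendor element names and the mapping are invented; in practice the crosswalk would usually be an XSLT):

```python
import xml.etree.ElementTree as ET

# A tiny invented vendor record; the element names are illustrative only.
vendor_xml = "<book><title>Example Title</title><author>Doe, Jane</author></book>"

# The data-dictionary-to-MARC mapping: vendor element -> (MARC tag, subfield).
MAPPING = {"title": ("245", "a"), "author": ("100", "a")}

def to_marcxml(xml_text):
    """Crosswalk one vendor record to a minimal MARCXML <record>."""
    src = ET.fromstring(xml_text)
    ns = "http://www.loc.gov/MARC21/slim"
    record = ET.Element(f"{{{ns}}}record")
    for elem in src:
        if elem.tag not in MAPPING:
            continue  # unmapped elements are simply dropped
        tag, code = MAPPING[elem.tag]
        field = ET.SubElement(record, f"{{{ns}}}datafield",
                              tag=tag, ind1=" ", ind2=" ")
        sub = ET.SubElement(field, f"{{{ns}}}subfield", code=code)
        sub.text = elem.text
    return record

rec = to_marcxml(vendor_xml)
```

Even this toy version shows why the work stops people: every vendor element needs a mapping decision before a single record can move.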
All of these steps are really time consuming, but the development of the XSLT/XQuery to actually translate the data is the one that stops most people. While there are many folks in the library technology space (and technical services spaces) that would argue that the ability to create XSLT is a vital job skill, let’s be honest, people are busy. Additionally, there is a big difference between knowing how to create an XSLT and writing a metadata translation. These things get really complicated, and change all the time (XSLT is up to version 3), meaning that even if you’ve learned how to do this years ago, the skills may be stale or not translate into the current XSLT version.
Additionally, in MarcEdit, I’ve tried really hard to make the XSLT process as simple and straightforward as possible. But, the reality is, I’ve only been able to work on the edges of this goal. The tool handles the transformation of binary and character encoding data (since the XSLT engines cannot do that), and it uses a smart processing algorithm to try to improve speed and memory handling while still enabling users to work with either DOM or SAX processing techniques. And I’ve tried to introduce a paradigm that enables reuse and flexibility when creating transformations. Folks that have heard me speak have likely heard me talk about this model as a wheel and spoke:
The idea behind this model is that as long as users create translations that map to and from MARCXML, the tool can automatically enable transformations to any of the known metadata formats registered with MarcEdit. There are definitely tradeoffs to this approach (for sure, doing a 1-to-1, direct translation would produce the best translation, but it also requires more work and users to be experts in the source and final metadata formats), but the benefit from my perspective is that I don’t have to be the bottleneck in the process. Were I to hard-code or create 1-to-1 conversions, any deviation or local use within a spec, would render the process unusable…and that was something that I really tried to avoid. I’d like to think that this approach has been successful, and has enabled technical services folks to make better use of the marked up metadata that they are provided.
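The payoff of the wheel-and-spoke model is arithmetic: with N formats you maintain 2N spokes (to and from MARCXML) instead of N×N direct crosswalks. A hypothetical Python sketch of the composition – the format names and transforms are invented placeholders:

```python
# Each registered format supplies a transform to and/or from MARCXML.
# The lambdas here just tag the record so the composition is visible.
TO_MARCXML = {"mods": lambda r: f"marcxml({r})"}
FROM_MARCXML = {"dc": lambda r: f"dc({r})"}

def transform(record, source_fmt, target_fmt):
    """Convert source -> MARCXML -> target; no direct 1-to-1 map needed."""
    as_marcxml = TO_MARCXML[source_fmt](record)
    return FROM_MARCXML[target_fmt](as_marcxml)
```

Registering one new spoke immediately unlocks conversion to and from every other registered format, which is the trade made against the fidelity of a direct 1-to-1 translation.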
The problem is that as content providers have moved more of their metadata operations online, a large number have shifted away from standards-based metadata to locally defined metadata profiles. This is challenging because these are one-off formats that really are only applicable to a publisher’s particular customers. As a result, it’s really hard to find conversions for these formats. The result of this, for me, is large numbers of catalogers/MarcEdit users asking for help creating these one-off transformations…work that I simply don’t have time to do. And that can surprise folks. I try hard to make myself available to answer questions. If you find yourself on the MarcEdit listserv, you’ll likely notice that I answer a lot of the questions…I enjoy working with the community. And I’m pretty much always ready to give folks feedback and toss around ideas when folks are working on projects. But there is only so much time in the day, and only so much that I can do when folks ask for this type of help.
So, transformations are an area where I get a lot of questions. Users faced with these publisher-specific metadata formats often reach out for advice or to see if I’ve worked with a vendor in the past. And for years, I’ve been wanting to do more for this group. While many metadata librarians would consider XSLT or XQuery required skills, those skills are not always at hand when facing a mountain of content moving through an organization. So, I’ve been collecting user stories and outlining a process that I think could help: an XML/JSON Profiler.
So, it’s with a lot of excitement that I can write that MarcEdit 7 will include this tool. As I say, it’s been a long time coming; the goal is to reduce the technical requirements needed to process XML or JSON metadata.
To create this tool, I had to decide how users would define their data for mapping. Given that MarcEdit has a Delimited Text Translator for converting Excel data to MARC, I decided to work from this model. The code produced does a couple of things:
It validates the XML format to be profiled. Mostly, this means that the tool is making sure that schemas are followed, namespaces are defined and discoverable, etc.
It outputs data in MARC, MARCXML, or another XML format.
It shifts the mapping of data from an XML file into the delimited-text model (though it’s not actually creating a delimited text file).
Since the data is in XML, there is a general assumption that it will be in UTF-8.
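The profiler’s core idea – treating XML elements like columns in a delimited file – can be sketched in a few lines of Python. This is an illustration only; the sample publisher XML, the profile structure, and the field names are all invented:

```python
import xml.etree.ElementTree as ET

# Invented sample of a publisher's one-off XML export.
sample = ("<export><item><name>Widget Cataloging</name>"
          "<isbn>12345</isbn></item></export>")

# The profile a user builds in the interview: the record element, plus
# element-path -> MARC field mappings (as in the delimited text translator).
PROFILE = {
    "record_element": "item",
    "map": {"name": "245$a", "isbn": "020$a"},
}

def profile_records(xml_text, profile):
    """Yield one {MARC field: value} dict per record element found."""
    root = ET.fromstring(xml_text)
    for rec in root.iter(profile["record_element"]):
        yield {marc: rec.findtext(path)
               for path, marc in profile["map"].items()}

rows = list(profile_records(sample, PROFILE))
```

Once a profile like this exists, generating records is mechanical – which is exactly the step the interview is meant to take off the user’s plate.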
Users can access the Wizard through the updated XML Functions Editor. Users open MARC Tools and select Edit XML function list, and you see the following:
I highlighted the XML Function Wizard. I may also make this tool available from the main window. Once selected, the program walks users through a basic reference interview:
From here, users just need to follow the interview questions. Users will need a sample XML file that contains at least one record to create the mappings against. As users walk through the interview, they are asked to identify the record element in the XML file, as well as map XML tags to MARC tags, using the same interface and tools found in the delimited text translator. Users also have the option to map data directly to a new metadata format by creating an XML mapping file – or a representation of the XML output – which MarcEdit will then use to generate new records.
Once a new mapping has been created, the function will then be registered into MarcEdit, and be available like any other translation. Whether this process simplifies the conversion of XML and JSON data for librarians, I don’t know. But I’m super excited to find out. This creates a significant shift in how users can interact with marked up metadata, and I think will remove many of the technical barriers that exist for users today…at least, for those users working with MarcEdit.
To give a better idea of what is actually happening, I created a demonstration video of the early version of this tool in action. You can find it here: https://youtu.be/9CtxjoIktwM. This provides an early look at the functionality, and hopefully helps provide some context around the discussion above. If you are interested in seeing how the process works, I’ve posted the code for the parser on my github page here: https://github.com/reeset/meparsemarkup
Because I change version numbers so rarely when it comes to MarcEdit, I usually like to take the major version numbers as an opportunity to look at how some of the core code works, and this time is no different. One of the things that I’ve occasionally heard is that Opening and Saving larger files in the MarcEditor can be slow. I guess, before I talk about some of the early metrics, I’d like to explain how the MarcEditor works, because it works differently than a normal text editor in order to allow users to work with files of any size.
Opening records in the MarcEditor
When you open the MarcEditor, the program utilizes one of two modes to read files into the editing screen: Preview and Paging.
Preview mode has been designed specifically for really large files – but the caveat is that when in Preview mode, the editor is locked into read-only mode. This means you can’t type in the Editor, but you can use any of the editing functions to change the file. The benefit of Preview mode is that you remove the need to load the entire file (which is an expensive process).
Paging mode is the editing mode enabled by default. This mode breaks files into pages, meaning that MarcEdit must first read the file to determine the number of records and create an internal directory of page start and end locations. Once that is accomplished, the program renders data onto the screen. The pages created are all virtual (they don’t exist on disk) unless a user actually edits (types onto the screen) information on a page. Global edits affect the whole file, so the file gets re-paged after every global edit.
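A rough Python sketch of what building that internal page directory involves – the page size and the blank-line record delimiter are simplifying assumptions for illustration, not MarcEdit’s actual format handling:

```python
RECORDS_PER_PAGE = 2  # assumed page size for the sketch

def build_page_directory(lines):
    """Return [(start_line, end_line)] for each virtual page.

    Assumes a record ends at a blank line, mnemonic-format style.
    The pages themselves are never materialized, only their bounds.
    """
    pages, count, start = [], 0, 0
    for i, line in enumerate(lines):
        if line.strip() == "":          # record boundary
            count += 1
            if count == RECORDS_PER_PAGE:
                pages.append((start, i))
                start, count = i + 1, 0
    if start < len(lines):              # trailing partial page
        pages.append((start, len(lines) - 1))
    return pages
```

The cost the post describes comes from the fact that building this directory requires scanning the whole file before anything can be rendered.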
The paging mode is by far the best rendering mode for data under, say, 150 MB (in MarcEdit 6). This is because at around 150 MB, it starts taking a lot longer to create the virtual pages. And depending on your operating system and hard drive type, this process can be really expensive. I’ve found that on older equipment (non-SSD drives), this process can really slow down reading and writing because so many disk accesses have to occur when creating pages (even virtually).
Saving records in the MarcEditor
Saving files essentially does the paging operation in reverse, though now, rather than using a virtual page, the program does have to access the file and extract the page content for every virtual page in existence. Again, if you have a non-SSD drive or an older 5400 rpm drive, this can be a slow process. If your operating system is already having disk usage issues (and older computers upgraded to Windows 10 have many of these), this can slow the process considerably.
MarcEdit 7 Enhancements
In thinking about how this process works, I started wondering how I could improve file operations in MarcEdit 7. Obviously, the easiest way to improve the open and save processes would be to remove as many disk operations as possible. The fewer file operations, the faster the process. So, I started looking. Now, one of the benefits of updating to the new version of .NET is that I have access to some new programming concepts. One of these is Thread Tasks to initiate parallel processes in C# (though I’ve found these must be handled with care, or I can really cause disk issues as threads spawn too quickly), and the other is simply lambda expressions that enable the compiler to optimize the operations code. With this in mind, I started working.
For the purpose of this benchmark, I’m using a Dell Inspiron 13 with an i5 processor, SSD drive, and 16 GB of RAM.
Reading Data into the MarcEditor
In order to speed up the reading operation, I had to reduce the number of file operations that were being run on the system. To do this, I made two significant changes.
When MarcEdit’s Enhanced File reading mode is enabled, MarcEdit reads files under 60 MB into memory. Using Parallel Tasks, I was able to improve this process, reducing the number of file reads by 50%. So, if the old method made 100 file reads to build the pages, the new process makes only 50. Additionally, with the processing now parallel, data can be read asynchronously, though this doesn’t help as much as one might hope since data needs to be processed in order. But it does seem to help.
For files larger than 60 MB, again, I needed to find a way to reduce the number of file reads. To do this, I tried two things. First, I increased the buffer. This means that more data is read at a time, so fewer file reads must occur. Previously, the buffer was 1 MB; it has been increased to 8 MB. This makes a big difference, as files under 8 MB are now read only once, with the remainder of the data living in the buffer. The second thing I did was move access down to the abstract classes. This allowed me to interact beneath the StreamReader class and access the actual positions in the file when data was read. This couldn’t be done in the current version of MarcEdit because the position properties report where buffered data was read, which meant an additional file operation had to occur just to get the file positions. Again, if the file previously needed 100 reads, the updated process needs only 50.
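The buffer-size arithmetic is easy to demonstrate. Here’s a small Python sketch (not MarcEdit’s C# StreamReader code, but the read-count math is the same):

```python
import io

def count_reads(data: bytes, buffer_size: int) -> int:
    """Count how many read() calls it takes to consume the data."""
    stream = io.BytesIO(data)
    reads = 0
    while stream.read(buffer_size):
        reads += 1
    return reads

data = b"x" * (8 * 1024 * 1024)            # an 8 MB stand-in file
old = count_reads(data, 1 * 1024 * 1024)   # old 1 MB buffer: 8 reads
new = count_reads(data, 8 * 1024 * 1024)   # new 8 MB buffer: 1 read
```

Eight times fewer read calls on a file this size; on spinning disks, where each access is expensive, that difference dominates.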
So, what’s the impact of this? Well, let’s see. I have a 350 MB file and paging set to 100 records per page. This is a UTF-8 file with records from materials The Ohio State University Libraries has loaded into the HathiTrust. Using this as my test set, I simply opened the file in the MarcEditor in MarcEdit 6.3.x and MarcEdit 7.0.0.alpha. To test, I loaded the file five times, throwing out the slowest and fastest times, and selecting the status message closest to the average.
We can see that by reducing the number of file reads, the process improves significantly, though it could be better. Digging deeper into the results, I’m finding that the actual reading of the data is even faster, with the rendering of the data in the newer control taking longer than in the previous editing control. The reason is that in MarcEdit 6.3.x, this control usage has been optimized, double buffered, etc.; in MarcEdit 7.0.0.alpha, this hasn’t been done yet. My guess is I can probably get these numbers down to around 8.7-9 seconds for a file of this size. That would represent a 5-5 1/2 second improvement in performance. Of course, the question is, will this help individuals opening smaller files? I think yes. On my SSD drive, loading a 50 MB file takes roughly the same amount of time: 1.3 seconds. But on a non-SSD drive, I think the improvement will be significant given that the number of file reads will be reduced.
This test, though, was with the old defaults in MarcEdit. For MarcEdit 7.0.0.alpha, I would like to change the default paging size to 1000 records per page (since the new component is more efficient when dealing with larger sets). So, let’s run the test again, this time using the different paging values and the same approach as above:
Looking at the process, you can see that the gap between the two versions gets larger. Again, looking closer at the data, the actual loading of the file is faster than in the first tests, but rendering the data pushed the final load times higher. As in the first tests, I believe that once the Editor itself has been optimized, we’ll see this improve significantly. By the time the final version comes out, the performance difference on this type of file could be between 6-8 seconds, or a 37-50% speed improvement over the current 6.3.x version of the software.
Writing files in the MarcEditor
In looking at the process used to write files on save, the same kinds of issues cause problems there. First, saving requires a lot of file access (both read and write), and second, once a file is saved, it is reloaded into the Editor. This means that on systems with SSD drives, the performance benefits may be modest, but for non-SSD systems, the gains should be significant. But there was only one way to tell. Using the same file, I made edits on 4 pages: the first page, the 50th page, the 150th page, and the last page. Paging was set back to 100 records per page. This forces the tool to combine the changed pages with the unchanged data in the virtual space. Using the loading times above, we can estimate the time actually spent saving the data. I’ll be providing numbers for both the Save and Save As processes (since they work slightly differently):
Saving the file using Save:
Saving the file using Save As:
As you can see here, the difference between the new saving method and the old one is pretty significant. The time posted here reflects the time it takes to both save the file and then reload the data back into the Editor window. Taking the times from the first test, we can determine that the Save function in MarcEdit 6.3.x takes ~6.2 seconds (if rendering the file takes an average of 14 seconds), and the Save As operation takes approximately 6.7 seconds. Let’s compare that to MarcEdit 7.0.0.alpha. We know that rendering the file takes approximately 10 seconds. That means that the Save function takes ~0.8 seconds to complete, and the Save As function, 1.2 seconds. In each case, this represents a significant performance improvement, and as noted above, optimizations have yet to be completed. Additionally, I do believe that on non-SSD systems, the performance gains will be even more noticeable.
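The save path described above – stitching the user’s edited pages together with the unchanged spans pulled from the original file – can be sketched like this in Python. The page contents are invented stand-ins:

```python
# The original file, viewed as its sequence of virtual pages.
original_pages = ["page0", "page1", "page2", "page3"]

# Only the pages the user actually typed on exist in memory.
edited = {0: "page0*", 3: "page3*"}

def save(pages, edits):
    """Write each page, preferring the in-memory edit when one exists."""
    return "\n".join(edits.get(i, page) for i, page in enumerate(pages))

result = save(original_pages, edited)
```

Every unchanged page still has to be pulled from disk, which is why reducing reads (and avoiding a full reload afterward) pays off so much on slower drives.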
Thoughts, Conclusions, and So what
Given how early I am in the development and optimization process, why start looking at any of these metrics now? Surely some of these things will change, and I’m sure they will. But these numbers give me a baseline to work with, and a touchstone as I continue optimizing the process. It is early, but one of the things that I wanted to highlight here is that in addition to the new features, updated interface, and accessibility improvements, a big part of this update is about performance and speed. When I initially wrote MarcEdit, nearly all the code was written in Assembly. Shifting to a higher-level language was incredibly painful for me because I want things to be fast, and Assembly programming is all about building things small and building things fast. You have access to the CPU registers, and you can make magic happen. Unfortunately, keeping up with the changes in the metadata world, the need to provide better Unicode support, and my desire to support Mac systems (which used, at the time, a different CPU architecture) meant moving to a higher-level language that could be compiled for different systems. Ever since that code migration, I’ve been chasing the clock, trying to get the processing speeds down to those of the original Assembly codebase. Is that possible? No. Though even if it were, so many things have changed and been added that the process simply does more than the simple libraries I first created in 1999…but still, the desire is there.
So, while I am spending most of my time communicating publicly about the new wireframes and new functionality in MarcEdit 7 (and I’m really excited about these changes)…please know – MarcEdit 7 is also about making it fast. I think MarcEdit 6.3.x is already pretty quick on its feet. As you can see here, it’s about to get faster.
I’ve posted updates for all versions: Windows and Linux updates for 6.3.x on Sunday evening, and updates to MacOS 2.5.x on Wednesday morning. Change log below:
* Bug Fix: MarcEditor: Convert clipboard content to….: The change in control caused this to stop working – mostly because the data container that renders the content is a rich object, not plain text like the function was expecting. Missed that one. I’ve fixed this in the code.
* Enhancement: Extract Selected Records: Connected the exact match to the search by file
* Bug Fix: MarcEditor: Right to left flipping wasn’t working correctly for Arabic and Hebrew if the codes were already embedded into the file.
* Update: Cleaned up some UI code.
* Update: Batch Process MarcXML: respecting the native versus the XSLT options.
* Bug Fix: MarcEditor: Right to left flipping wasn’t working correctly for Arabic and Hebrew if the codes were already embedded into the file.
* Update: Cleaned up some UI code.
* Update: Batch Process MarcXML: respecting the native versus the XSLT options.
* Enhancement: Exact Match searching in the Extract, Delete Selected Records tool
* Enhancement: Exact Match searching in the Find/Replace Tool
* Enhancement: Work updates in the Linked data tool to support the new MAC proposal
In this set of wireframes, you can see one of the concepts that I’ll be introducing with MarcEdit 7…wizards. Each wizard is designed to encapsulate a reference interview to attempt to make adding new functions, etc. to the tool easier. You will find these throughout MarcEdit 7.
XML Functions Window:
XML Functions Wizard Screens:
You’ll notice one of the options is the new XML/JSON Profiler. This is a new tool that I’ll wireframe later; likely sometime in August 2017.
Something that comes up a lot is the lack of key combinations or pathways to using functions in MarcEdit. I’ll admit, the program is very mouse heavy. So, as part of the accessibility work in MarcEdit 7, I’m taking a long look at how access to all functions can be accommodated via the keyboard. This means that for MarcEdit 7, I’m mapping out all keycode combinations (the ALT+[KEY] paths) and the more traditional shortcut key combinations for each window in MarcEdit. When it’s finished, I’ll make this part of the application documentation. Before I get too far along, I wanted to show what this looks like. Please see: http://marcedit.reeset.net/software/MarcEdit7_KeycodeMap.pdf
I’m continuing to flesh out new wireframes, and one of the areas where I’ll be consolidating some options is in the preferences window. I’ve decided to reorganize the menu and some of the settings. Additionally, I’m adding a new setting: Ease of Access.
Here’s the Initial Wireframes demonstrating the new menu layout
Ease of Use:
This is a new section developed to support Accessibility options. At this point, these are the options that I’m working on:
While MarcEdit will respect the operating system’s accessibility settings (i.e., if you’ve scaled fonts, etc.), these settings directly affect the MarcEdit application itself. In this section, you’ll find the themes (I’m working out a way to provide a wizardry way to create themes and find ones that have been created), feedback options (right now, if this is selected, you’ll get audible clicks letting you know that an action has occurred), and keyboard options. I’m spending a lot of time mapping the current keyboard options, with the intention of mapping all actions to some keyboard combination. These settings tell MarcEdit whether this information should show up in the tooltips, as well as in rich descriptions of an operation. The last thing that I’ll likely add is a set of links to topics for users looking for accessibility-friendly fonts, etc.
I think that the reorganization should help to provide some clarity in the settings and will help me in thinking about the first run wizard – and hopefully the currently planned accessibility options will provide users with a wider range of options.
I’ve been working a bit more around this notion of creating “themes” to improve visible accessibility options. This started with an initial implementation that included the default interface and then a High Contrast interface. Over the past few days, I’ve been getting a wide range of feedback, and one of the things that is becoming apparent is that folks would like to have a wide range of preferences. So, this afternoon, I spent time taking the hardcoded default and high contrast themes and rewriting all non-default UI implementations as themes.
When I think about theming, I immediately start thinking about operating system themes, or themes that you can download for browsers. At this point, we aren’t talking about anything quite so complex. In fact, until I get feedback, I’ll be keeping theming lightweight – but I think that in the long run, this might actually make themes more useful.
How do they work? Essentially, a theme is defined by an XML file. Here’s the dark (high contrast) theme written out in the new XML theme structure.
<?xml version="1.0" encoding="utf-8" ?>
<theme>
  <name>Dark (High Contrast) Theme</name>
  <global>
    <!-- Use HTML web color codes for these values -->
    <font_color>#ffffff</font_color>
    <background_color>#000000</background_color>
  </global>
  <marceditor>
    <font_color>#000000</font_color>
    <background_color>#ffffff</background_color>
  </marceditor>
  <overrides>
    <!-- Override values
    <menus>
      <font_color>
      <background_color>
    </menus>
    <links>
      <font_color>
      <visited_font_color>
      <behavior> [set to always, hover, none]
    </links>
    -->
  </overrides>
</theme>
As you can see, the initial implementation of theming is very limited. Essentially, users can theme font color and background colors globally, at the MarcEditor level, and override options for menus and links found within the program. This may (and likely will) be extended prior to the release of MarcEdit 7, but I don’t anticipate it being enhanced a lot. While the new GUI rendering engine makes this kind of work easier, I don’t want to develop an entire rendering process around this method until I know there is more than a passing interest.
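To show how lightweight the format is, here’s a minimal Python sketch of reading a theme file like the one above into a lookup structure – an illustration, not MarcEdit’s actual loader:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the theme structure shown above.
theme_xml = """<theme>
  <name>Dark (High Contrast) Theme</name>
  <global>
    <font_color>#ffffff</font_color>
    <background_color>#000000</background_color>
  </global>
  <marceditor>
    <font_color>#000000</font_color>
    <background_color>#ffffff</background_color>
  </marceditor>
</theme>"""

def load_theme(xml_text):
    """Parse a theme file into {section: {property: value}} dicts."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name"),
        "global": {e.tag: e.text for e in root.find("global")},
        "marceditor": {e.tag: e.text for e in root.find("marceditor")},
    }

theme = load_theme(theme_xml)
```

Because the file is just a flat set of color properties, creating a new theme really is a matter of copying and recoloring an XML file.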
What this means, however, is that I can quickly create new themes. Right now, I’ve implemented this in the Options dialog. You can see the current line of thinking below:
Using the Main Windows of MarcEdit 7 as the example page, I’ll run through the current themes that I’ve marked up:
Default Theme (hardcoded):
Dark (High Contrast) Theme:
Dark Gray Theme:
All these themes were created using the theming xml files. As I say, if I get feedback, I’ll look to expand this as we move towards the official release.
An interesting request made while reviewing the wireframes was whether MarcEdit 7 could support a kind of high contrast, or “Dark”, theme mode. An example of this would be Office:
Some people find this interface easier on the eyes, especially if you are working on a screen all day.
Since MarcEdit utilizes its own GUI engine to handle font sizing, scaling, and styling – this seems like a pretty easy request. So, I did some experimentation. Here’s MarcEdit 7 using the conventional UI:
And here it is under the “high contrast” theme:
Since theming falls into general accessibility options, I’ve put this in the language section of the options:
However, I should point out that in MarcEdit 7, I will be changing this layout to include a dedicated setting area for Accessibility options, and this will likely move into that area.
I’m not sure this is an option that I’d personally use as the “Dark” theme or High Contrast isn’t my cup of tea, but with the new GUI engine added to MarcEdit 7 with the removal of XP support – supporting this option really took about 5 minutes to turn on.