MarcEdit 7 Alpha: the XML/JSON Profiler

Metadata transformations can be really difficult.  While I try to make them easier in MarcEdit, the reality is that the program has long functioned as a facilitator of the process, handling the binary data processing and character set conversions that may be necessary.  But the heavy lifting, that’s all been on the user.  And if you think about it, there is a lot of expertise tied up in even the simplest transformation.  Say your library gets an XML file full of records from a vendor.  As a technical services librarian, I’d have to go through the following steps to remap that data into MARC (or something else):

  1. Evaluate the vended data file
  2. Create a metadata dictionary for the new XML file (so I know what each data element represents)
  3. Create a mapping between the metadata dictionary for the vended file and MARC
  4. Create the XSLT crosswalk that contains all the logic for turning this data into MARCXML (there’s a small sketch of what this involves after this list)
  5. Set up the process to move data between XML=>MARC
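
Step 4 is the one that stops most people, so to make it concrete, here’s a minimal, hedged sketch of what a crosswalk can look like: a tiny XSLT for an invented vendor format (<items>/<item>/<name>), applied with Python’s lxml.  A real vendor file and mapping would be far larger, but the shape is the same.

```python
# A minimal XSLT crosswalk, applied with lxml. The vendor element
# names (<items>, <item>, <name>) are invented for illustration.
from lxml import etree

XSLT = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:marc="http://www.loc.gov/MARC21/slim">
  <xsl:template match="/items">
    <marc:collection>
      <xsl:for-each select="item">
        <marc:record>
          <marc:datafield tag="245" ind1="0" ind2="0">
            <marc:subfield code="a"><xsl:value-of select="name"/></marc:subfield>
          </marc:datafield>
        </marc:record>
      </xsl:for-each>
    </marc:collection>
  </xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.XML(XSLT))
doc = etree.XML(b"<items><item><name>Sample title</name></item></items>")
print(etree.tostring(transform(doc), pretty_print=True).decode())
```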


All of these steps are really time consuming, but the development of the XSLT/XQuery that actually translates the data is the one that stops most people.  While there are many folks in the library technology space (and technical services spaces) who would argue that the ability to create XSLT is a vital job skill, let’s be honest: people are busy.  Additionally, there is a big difference between knowing how to create an XSLT and writing a metadata translation.  These things get really complicated, and they change all the time (XSLT is up to version 3), meaning that even if you learned how to do this years ago, your skills may be stale or may not translate to the current XSLT version.

I’ve also tried really hard, within MarcEdit, to make the XSLT process as simple and straightforward as possible.  But the reality is that I’ve only been able to work on the edges of this goal.  The tool handles the transformation of binary data and character encodings (since the XSLT engines cannot do that), and it uses a smart processing algorithm to improve speed and memory handling while still letting users work with either DOM or SAX processing techniques.  And I’ve tried to introduce a paradigm that enables reuse and flexibility when creating transformations.  Folks who have heard me speak have likely heard me talk about this model as a wheel and spoke:

[Image: the wheel-and-spoke transformation model, with MARCXML at the hub]

The idea behind this model is that as long as users create translations that map to and from MARCXML, the tool can automatically enable transformations to any of the known metadata formats registered with MarcEdit.  There are definitely tradeoffs to this approach (a direct, 1-to-1 translation would produce the best result, but it requires more work and requires users to be experts in both the source and final metadata formats), but the benefit, from my perspective, is that I don’t have to be the bottleneck in the process.  Were I to hard-code 1-to-1 conversions, any deviation or local use within a spec would render the process unusable…and that was something that I really tried to avoid.  I’d like to think that this approach has been successful, and that it has enabled technical services folks to make better use of the marked-up metadata they are provided.
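
To show why this pays off, here’s a minimal sketch of the registry idea (hypothetical code, not MarcEdit’s internals): each format supplies at most two transforms, one to MARCXML and one from it, and any-to-any conversion is just the composition of the two legs.

```python
# Wheel-and-spoke in miniature: with N formats you maintain roughly
# 2N transforms instead of the N*N direct crosswalks a strictly
# 1-to-1 approach would require. The registered functions below are
# stand-in stubs, not real translations.
to_marcxml = {}    # format name -> function(data) -> MARCXML
from_marcxml = {}  # format name -> function(MARCXML) -> data

def convert(data, source_fmt, target_fmt):
    """Pivot every conversion through the MARCXML hub."""
    hub = data if source_fmt == "marcxml" else to_marcxml[source_fmt](data)
    return hub if target_fmt == "marcxml" else from_marcxml[target_fmt](hub)

# Registering one new "to MARCXML" spoke makes every registered
# output format reachable from that source:
to_marcxml["oai_dc"] = lambda xml: "<marcxml stub/>"
from_marcxml["mods"] = lambda mx: "<mods stub/>"
print(convert("<oai_dc record/>", "oai_dc", "mods"))
```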

The problem is that as content providers have moved more of their metadata operations online, a large number have shifted away from standards-based metadata to locally defined metadata profiles.  This is challenging because these are one-off formats that really are only applicable to a particular publisher’s customers.  As a result, it’s really hard to find conversions for these formats.  The result, for me, is large numbers of catalogers/MarcEdit users asking for help creating these one-off transformations…work that I simply don’t have time to do.  And that can surprise folks.  I try hard to make myself available to answer questions.  If you find yourself on the MarcEdit listserv, you’ll likely notice that I answer a lot of the questions…I enjoy working with the community.  And I’m pretty much always ready to give folks feedback and toss around ideas when they are working on projects.  But there is only so much time in the day, and only so much that I can do when folks ask for this type of help.

So, transformations are an area where I get a lot of questions.  Users faced with these publisher-specific metadata formats often reach out for advice, or to see if I’ve worked with a vendor in the past.  And for years, I’ve wanted to do more for this group.  While many metadata librarians would consider XSLT or XQuery required skills, those skills are not always on hand when an organization is faced with a mountain of content moving through it.  So, I’ve been collecting user stories and outlining a process that I think could help: an XML/JSON Profiler.

So, it’s with a lot of excitement that I can write that MarcEdit 7 will include this tool.  As I say, it’s been a long time coming; and the goal is to reduce the technical requirements needed to process XML or JSON metadata.

XML/JSON Profiler

To create this tool, I had to decide how users would define their data for mapping.  Given that MarcEdit has a Delimited Text Translator for converting Excel data to MARC, I decided to work from that model.  The code produced does a few things:

  1. It validates the XML format to be profiled.  Mostly, this means the tool is making sure that schemas are followed, namespaces are defined and discoverable, etc.
  2. It outputs data in MARC, MARCXML, or another XML format.
  3. It shifts the mapping model from an XML file to that of a delimited text file (though it’s not actually creating a delimited text file); there’s a sketch of this idea after the list.
  4. Since the data is in XML, it generally assumes the data is in UTF-8.
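
Conceptually, a profile then boils down to something like the following hedged sketch (the profile shape, element names, and sample data are all invented): pairs of XML paths and MARC targets, applied record by record, much as columns map to fields in the Delimited Text Translator.

```python
# Hypothetical profile: XML paths paired with MARC targets, emitted
# here loosely in MarcEdit's mnemonic style. Not MarcEdit's actual
# profile format.
import xml.etree.ElementTree as ET

PROFILE = {
    "record_element": "item",   # which element delimits one record
    "maps": [
        {"xpath": "name",    "tag": "245", "ind": "00", "sub": "a"},
        {"xpath": "creator", "tag": "100", "ind": "1 ", "sub": "a"},
    ],
}

def profile_record(elem):
    """Turn one record element into mnemonic-style MARC field strings."""
    fields = []
    for m in PROFILE["maps"]:
        node = elem.find(m["xpath"])
        if node is not None and node.text:
            fields.append(f"={m['tag']}  {m['ind']}${m['sub']}{node.text}")
    return fields

doc = ET.XML("<items><item><name>Sample title</name>"
             "<creator>Doe, J.</creator></item></items>")
for rec in doc.iter(PROFILE["record_element"]):
    print("\n".join(profile_record(rec)))
```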


Users can access the Wizard through the updated XML Functions Editor.  Open MARC Tools and select Edit XML function list, and you’ll see the following:

[Image: the XML Functions Editor, with the XML Function Wizard highlighted]

In the image above, I’ve highlighted the XML Function Wizard.  I may also make this tool available from the main window.  Once the wizard is selected, the program walks users through a basic reference interview:

Page 1:

[Image: page 1 of the XML Function Wizard’s reference interview]


From here, users just need to follow the interview questions.  Users will need a sample XML file that contains at least one record to create the mappings against.  As users walk through the interview, they are asked to identify the record element in the XML file and to map XML tags to MARC tags, using the same interface and tools found in the Delimited Text Translator.  Users also have the option to map data directly to a new metadata format by creating an XML mapping file (a representation of the XML output), which MarcEdit will then use to generate new records.
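
The XML walkthrough above has a natural JSON analogue: identify the key that holds the record list, then map key paths to MARC targets.  Here’s a hedged sketch with invented keys and targets:

```python
# Hypothetical JSON profiling: the record list lives under one key,
# and each record key maps to a MARC target. Keys, targets, and the
# sample data are all invented for illustration.
import json

data = json.loads('{"items": [{"name": "Sample title", "creator": "Doe, J."}]}')

profile = {
    "record_path": "items",                            # where records live
    "maps": [("name", "245$a"), ("creator", "100$a")], # key -> MARC target
}

for record in data[profile["record_path"]]:
    for key, target in profile["maps"]:
        if key in record:
            print(f"{target} <- {record[key]}")
```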

Once a new mapping has been created, the function is registered in MarcEdit and becomes available like any other translation.  Whether this process simplifies the conversion of XML and JSON data for librarians, I don’t know.  But I’m super excited to find out.  It represents a significant shift in how users can interact with marked-up metadata, and I think it will remove many of the technical barriers that exist for users today…at least, for those users working with MarcEdit.

To give a better idea of what is actually happening, I created a demonstration video of an early version of this tool in action.  You can find it here: https://youtu.be/9CtxjoIktwM.  It provides an early look at the functionality, and hopefully helps provide some context around the above discussion.  If you are interested in seeing how the process works, I’ve posted the code for the parser on my GitHub page here: https://github.com/reeset/meparsemarkup

Do you have questions or concerns?  Let me know.


–tr

