Jan 25, 2016
 

One of the changes in the current MarcEdit update is the introduction of a linked data rules file to help the program understand what data elements should be processed for automatic URL generation, and how that data should be treated.  The Rules file is found in the Configs directory and is called: linked_data_profile.xml

 


The rules file is pretty straightforward.  At this point, I haven’t created a schema for it, but I will, to make defining data easier.  Until then, I’ve added notes in the header of the document describing the fields and values. 

Here’s a small snippet of the file:

<?xml version="1.0" encoding="UTF-8"?>
<marcedit_linked_data_profile>
  <!--
    rules block:
        top level: field
            Attributes:
                type: authority, bibliographic, authority|bibliographic
            tag (required):
                Value: Field value
                Description: field to process
            subfield (required):
                Value: Subfield codes
                Description: subfields to use for matching
            index (optional):
                Values: subfield code or empty
                Description: field that denotes index
            atomize (optional):
                Values: 1 or empty
                Description: determines if the field should be broken up for URI disambiguation
            special_instructions (optional):
                Values: name|subject|mixed
                Description: special instructions to improve normalization for names and subjects. 
            uri (required):
                Values: subfield code to include a url
                Description: Used to determine which subfield is used to embed a URI
            vocab (optional):
                Values (see supported vocabularies section)
                Description: when no index is supplied, you can predefine a supported index
               
               
  Supported Vocabularies:
    Value: lcshac
    Description: LC Children's Subjects

    Value: lcdgt
    Description: LC Demographic Group Terms

    Value: lcsh
    Description: LC Subjects

    Value: lctmg
    Description: TGM (Thesaurus for Graphic Materials)

    Value: aat
    Description: Getty Art and Architecture Thesaurus

    Value: ulan
    Description: Getty ULAN (Union List of Artist Names)

    Value: lcgft
    Description: LC Genre/Form Terms

    Value: lcmpt
    Description: LC Medium of Performance Thesaurus

    Value: naf
    Description: LC/NACO Name Authority File

    Value: naf_lcsh
    Description: combined lcsh/naf indexes

    Value: mesh
    Description: MeSH indexes
    -->
  <rules>
    <field type="bibliographic">
      <tag>100</tag>
      <subfields>abcdqnp</subfields>
      <uri>0</uri>
      <special_instructions>name</special_instructions>
    </field>
    <field type="bibliographic">
      <tag>110</tag>
      <subfields>abcdqnp</subfields>
      <uri>0</uri>
      <special_instructions>name</special_instructions>
    </field>
  </rules>
</marcedit_linked_data_profile>

Working with the rules file is simple.  You have a field element where you define a type; acceptable values are: authority, bibliographic, or authority|bibliographic.  This tells the tool which type of record the processing rules apply to.  Next you define a tag, the subfields to process when evaluating for linking, a uri value (the subfield used when outputting the URI), special instructions (if there are any), whether the field is atomized (i.e., broken up so that you have one concept per URI), and a vocab (to preset a default vocabulary for processing).  So, for example, say a user wanted to atomize a field that currently isn’t defined as such – they would just find the processing block for that field and add <atomize>1</atomize> into the block – and that’s it.
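As a sketch, here’s what an atomized subject field could look like – note that the tag, subfields, and vocabulary below are illustrative choices on my part, not values copied from the shipped rules file:

    <field type="bibliographic">
      <tag>650</tag>
      <subfields>abvxz</subfields>
      <uri>0</uri>
      <atomize>1</atomize>
      <special_instructions>subject</special_instructions>
      <vocab>lcsh</vocab>
    </field>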

The idea behind this rules file is to support the work of a PCC Task Force while they test the embedding of URIs in MARC records.  By shifting from a compiled solution to a rules-based solution, I can provide immediate feedback, and it should make the process easier to customize and test. 

An important note – these rules will change.  They are pretty well defined for bibliographic data, but authority data is still being worked out. 

–tr


MarcEdit Update (all versions)

Jan 18, 2016
 

This update includes a new tool, changes to the merge tool, and a behavior change in the MARCEngine.  You can see the change log at:

You can get the update through MarcEdit’s automated update mechanism or from: http://marcedit.reeset.net/downloads/

–tr


MarcEdit and OpenRefine

Jan 16, 2016
 

There have been a number of workshops and presentations floating around that talk about ways of using MarcEdit and OpenRefine together when doing record editing.  OpenRefine, for folks that might not be familiar with it, used to be known as Google Refine, and is a handy tool for working with messy data.  While there is a lot of potential overlap between the types of edits available in MarcEdit and OpenRefine, OpenRefine’s strength is that it gives you a tabular view of your data that makes it easy to find variations, relationships, and patterns in your metadata.

For most folks working with MarcEdit and OpenRefine together, the biggest challenge is moving the data back and forth.  MARC binary data isn’t supported by OpenRefine, and MarcEdit’s mnemonic format isn’t well suited to OpenRefine’s import options either.  And once the data has been put into OpenRefine, getting it back out and turned into MARC can be difficult for first-time users as well.

Because I’m a firm believer that users should use the tool they are most comfortable with, I’ve been talking to a few OpenRefine users about how I could make the process of moving data between the two systems easier.  To that end, I’ll be adding to MarcEdit a toolset that facilitates the export and import of MARC (and MarcEdit’s mnemonic) data in formats that OpenRefine can parse and easily generate.  I’ve implemented this functionality in two places – as a standalone application found on the main MarcEdit window, and as part of the MarcEditor, which will automatically convert or import data directly into the MarcEditor window.

Exporting Data from MarcEdit

As noted above, there will be two methods of exporting data from MarcEdit in one of two formats for import into OpenRefine.  Presently, MarcEdit supports generating either JSON or tab-delimited output – two formats that OpenRefine can import to create a new project.

OpenRefine Option from the Main Window

OpenRefine Export/Import Tool

If I have a MARC file and I want to export it for use in OpenRefine, I would use the following steps:

  1. Open MarcEdit
  2. Select Tools/OpenRefine/Export from the menu
  3. Enter my Source File (either a MARC or mnemonic file)
  4. Enter my Save File – MarcEdit supports export as JSON or TSV (tab-delimited)
  5. Select Process

This will generate a file that can be used for importing into OpenRefine.  A couple of notes about that process.  When importing via the tab-delimited format, you will want to unselect the option that interprets numbers.  I’d also uncheck the option that turns blanks into nulls, and make sure the option to retain blank rows is selected.  These settings matter when exporting and reimporting into MarcEdit.  When using JSON as the file format, you will want to make sure after import to order your columns as TAG, Indicators, Content.  I’ve found OpenRefine will mix this order up, even though the JSON data is structured in this order.
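For reference, here’s a made-up sketch of the tabular shape described above – the field data is invented, and tabs are shown as spaces, but the column order (TAG, Indicators, Content) is the one MarcEdit expects:

    TAG     Indicators    Content
    245     10            $aSample title /$cby A. Nobody.
    856     40            $uhttp://example.org/resource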

Once you’ve made the changes to your data, select the export option in OpenRefine and choose the tab-delimited export.  This is the file format MarcEdit can turn back into either MARC or the mnemonic file format.  Please note – I’d recommend always going back to the mnemonic file format until you are comfortable with the process, to ensure that the import worked like you expected.

And that’s it.  I’ve recorded a video on YouTube walking through these steps – you can find it here:

This of course just shows how to move data between the two systems.  If you want to learn more about how to work with the data once it’s in OpenRefine, I’d recommend one of the many excellent workshops that I’ve been seeing put on at conferences and via webinars by a wide range of talented metadata librarians.

*** Update ***

In addition to adding the tool, I’ve set it up so that it can be selected as one of the user-defined tools on the front page for quick access.  This way, if this is one of the tools you use often, you can get right to it.

MarcEdit’s Start Window Preferences with new OpenRefine Data Transfer Tool Option

Main Window with OpenRefine Data Transfer Tool


MarcEdit Update (all versions)

Jan 10, 2016
 

I decided to celebrate my absence from ALA’s Midwinter by doing a little coding.  I’ve uploaded updates for all versions of MarcEdit, though the Mac version has seen the most significant revisions.  The changes:

Windows/Linux ChangeLog:

OSX ChangeLog:

You can get the update from the Downloads page (http://marcedit.reeset.net/downloads) or using the automated updating tools within MarcEdit.

Questions, let me know.

–tr


MarcEdit Mac: Verify URLs

Jan 10, 2016
 

In the Windows/Linux version, one of the oldest tools has been the ability to validate URLs.  This tool generates a report providing the HTTP status codes returned for the URLs in a record set.  It didn’t make the initial migration, but it has now been added to the current OSX version of MarcEdit.

To find the tool, you open the main window and select the menu option:

MarcEdit Mac: Main Window Menu – Verify URLs

Once selected, it works a lot like the Windows/Linux version.  You have two report types (HTML/XML), you can define a title field, and you can also set the fields to check.  By default, MarcEdit checks them all.  To change this, you just need to add each field/subfield combination on a new line.
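For example, to limit checking to a couple of URL-bearing fields, the list could look something like the lines below – I’m using the field/subfield notation from elsewhere in this blog, so check the dialog for the exact syntax it expects:

    856$u
    956$u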

MarcEdit Mac: Verify URLs screen

Questions, let me know.

–tr

Jan 10, 2016
 

One of the functions that didn’t make the initial migration cut in the MarcEditor was the ability to edit the 006/008 in a graphical interface.  I’ve added this back into the OSX version.  You can find it in the Edit Menu:

MarcEdit Mac – Edit 006/008 Menu Location

Invoking the tool works a little differently than the Windows/Linux version.  Just put your cursor into the field that you want to edit, and then select Edit.  MarcEdit will then read your record data and generate an edit form based on the material format selected (or the material format from the record, if editing).

MarcEdit Mac – Edit 006/008 Screen

Questions — let me know.

–tr


Build New Field Enhancements

Jan 9, 2016
 

A couple of interesting questions this week got me thinking about a couple of enhancements to MarcEdit.  I’m not sure these are things that other folks will make use of often, but I can see them being really useful for answering questions that come up on the listserv.

The particular question that got me thinking about this today was the following scenario:

The user has two fields – an 099 that includes data that needs to be retained, and then an 830$v that needs to be placed into the 099.  The 830$v has trailing punctuation that will need to be removed. 

Example data:
=099  \\$aELECTRONIC DATA
=830  \\$aSeries Title $v 12-031.

The final data output should be:
=099  \\$aELECTRONIC RESOURCE 12-031
=830  \\$aSeries Title $v 12-031.

With the current tools, you can do this but it would require multiple steps.  Using the current build new field tool, you could create the pattern for the data:
=099  \\$a{099$a} {830$v}

This would lead to an output of:
=099  \\$aELECTRONIC RESOURCE 12-031.

To remove the period, you could use a replace function and fix the $a at the same time.  You could also have made the ELECTRONIC RESOURCE string a constant in the Build New Field pattern – but the problem is that you’d have to know that this was the only data that ever showed up in the 099$a (and it probably won’t be).

So, thinking about this problem, I considered how I might be able to add a few processing “macros” into the pattern language – and that’s what I’ve done.  At this point, I’ve added the following commands:

  • replace(find,replace)
  • trim(chars)
  • trimend(chars)
  • trimstart(chars)
  • substring(start,length)

The way these have been implemented, the commands are stackable – but they are also very rigid in structure.  The commands are case sensitive (command labels are all lower case), and in places where a command takes multiple parameters, there are no spaces around the commas. 

So how does this work?  Here are some examples (not full patterns):
{099$a.trim(".")}
{050$b.replace("1950","1980").trim(".")}
{LDR.substring(6,1)}

As you can see in the patterns, the commands are invoked by adding ".command" to the end of the field pattern.  So how would we apply this to the user story above?  It’s easy:
=099  \\$a{099$a.replace("DATA","RESOURCE")} {830$v.trimend(".")}

And that would be it.  With this single pattern, we can run the replacement on the data in the 099$a and trim the data in the 830$v. 

Now, I realize that this syntax might not be the easiest for everyone right out of the gate, but as I said, I’m hoping it will be useful for folks interested in learning the new options – and I’m really excited to have it in my toolkit for answering questions posed on the listserv.

This has been implemented in all versions of MarcEdit, and will be part of this weekend’s update.

–tr


MarcEdit updates

Jan 6, 2016
 

I noted earlier today that I’d be making a couple of MarcEdit updates.  You can see the change logs here:

Please note – if you use the linked data tools, it is highly recommended that you update.  This update was done in part to make the interactions with LC more efficient on all sides.

You can get the download from the automated update mechanism in MarcEdit or from the downloads page: http://marcedit.reeset.net/downloads

Questions, let me know.

–tr


Heads Up: MarcEdit Linked Data Components Update (all versions) scheduled for this evening

Jan 6, 2016
 

A heads up to those folks using MarcEdit with the following components:

  • Validate Headings
  • Build Links
  • Command-Line tool using the build links option

These components rely on MarcEdit’s linked data framework to retrieve semantic data from a wide range of vocabulary services.  I’ll be updating one of these components in order to improve its performance and how it interacts with the Library of Congress’s id.loc.gov service.  This will provide a noticeable improvement on the MarcEdit side (with response time cut by a little over two-thirds) and will make MarcEdit much friendlier to the LC id.loc.gov service.  Given the wide range of talks at Midwinter this year discussing experimentation related to embedding semantic data into MARC records, and the role MarcEdit is playing in that work, I wanted to make sure this was available prior to ALA.

Why the change

When MarcEdit interacts with id.loc.gov, its communications are nearly always just HEAD requests.  This is because over the past year or so, the folks at LC have been incredibly responsive, building into their response headers nearly all the information someone might need if they are just interested in looking up a controlled term and finding out:

  1. Whether it exists
  2. Its preferred label
  3. Its URI
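To make that concrete, here’s a hedged illustration of the kind of lookup involved.  The URL pattern and the X-Uri/X-PrefLabel header names reflect my understanding of the id.loc.gov label service, so treat them as assumptions rather than a definitive API reference:

    // Illustrative only -- not MarcEdit's actual lookup code.
    var req = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(
        "http://id.loc.gov/authorities/subjects/label/World%20Wide%20Web");
    req.Method = "HEAD";
    try
    {
        using (var resp = (System.Net.HttpWebResponse)req.GetResponse())
        {
            // Header names are assumptions based on the label service.
            string termUri = resp.Headers["X-Uri"];
            string prefLabel = resp.Headers["X-PrefLabel"];
        }
    }
    catch (System.Net.WebException)
    {
        // An unmatched term (404) surfaces here as a WebException.
    }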

Prior to the header lookup, this had to be done using a different API, which resulted in two requests – one to the API, and then one to the XML representation of the document for parsing.  By moving the most important information into the document headers (the X- elements), I can minimize the amount of data I’m requesting from LC.  And that’s a good thing – because LC tends to have strict guidelines around how often and how much data you are allowed to request at any given time.  In fact, were it not for LC’s willingness to allow me to bypass those caps when working with this service, a good deal of the new functionality being developed into the tool simply wouldn’t exist.  So, if you find the linked data work in MarcEdit useful, you shouldn’t be thanking me – this work has been made possible by LC and their willingness to experiment with id.loc.gov. 

Anyway – the linked data tools have been available in MarcEdit for a while, and they are starting to generate significant traffic on the LC side of things.  Adding the Validate Headings tool only exacerbated this – enough so that LC has been asking if I could do some things to help throttle the requests coming from MarcEdit.  So, we are working on some options – but in the meantime, LC noticed something odd in their logs.  While MarcEdit only makes HEAD requests, and only processes the information from that first request, they were seeing 3 requests showing up in their logs for every lookup. 

Some background on the LC service – it performs a lot of redirection.  One request to the label service results in ~3 redirects.  All the information MarcEdit needs is found in the first response, but when looking at the logs, LC could see MarcEdit following the redirects, resulting in 2 more HEAD requests for data that the tool was simply throwing away.  This means that in most cases, a single request for information was generating 3 HEAD requests – and if you take a file of 2,000 records, with ~5 headings to be validated per record (on average), that means MarcEdit would generate ~30,000 requests (10,000 lookups x 3).  That’s not good – and when LC approached me to ask why MarcEdit was asking for the other data, I didn’t have an answer.  It wasn’t until I went to the .NET documentation that the answer became apparent.

As folks should know, MarcEdit is developed using C#, which means it utilizes .NET.  The primary handling of network interactions happens in the System.Net namespace – specifically, the System.Net.HttpWebRequest component.  Here’s the function:

public System.Collections.Hashtable ReadUriHeaders(string uri, string[] headers)
{
    System.Net.ServicePointManager.DefaultConnectionLimit = 10;
    System.Collections.Hashtable headerTable = new System.Collections.Hashtable();
    uri = System.Uri.EscapeUriString(uri);

    //after escape -- we need to catch ? and &
    uri = uri.Replace("?", "%3F").Replace("&", "%26");

    System.Net.WebRequest.DefaultWebProxy = null;
    System.Net.HttpWebRequest objRequest = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(MyUri(uri));
    objRequest.UserAgent = "MarcEdit 6.2 Headings Retrieval";
    objRequest.Proxy = null;

    //Changing the default timeout from 100 seconds to 30 seconds.
    objRequest.Timeout = 30000;

    //only the headers are needed -- never the response body
    objRequest.Method = "HEAD";

    try
    {
        using (var objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse())
        {
            if (objResponse.StatusCode == System.Net.HttpStatusCode.NotFound)
            {
                foreach (string name in headers)
                {
                    headerTable.Add(name, "");
                }
            }
            else
            {
                foreach (string name in headers)
                {
                    //Contains() on AllKeys comes from System.Linq
                    if (objResponse.Headers.AllKeys.Contains(name))
                    {
                        //header values arrive decoded as Latin-1; recover the
                        //raw bytes and re-decode them as UTF-8
                        string orig_header = objResponse.Headers[name];
                        byte[] b = System.Text.Encoding.GetEncoding(28591).GetBytes(orig_header);

                        headerTable.Add(name, System.Text.Encoding.UTF8.GetString(b));
                    }
                    else
                    {
                        headerTable.Add(name, "");
                    }
                }
            }
        }

        return headerTable;
    }
    catch (System.Exception p)
    {
        //on failure, return empty values along with the error text
        foreach (string name in headers)
        {
            headerTable.Add(name, "");
        }
        headerTable.Add("error", p.ToString());
        return headerTable;
    }
}

It’s a pretty straightforward piece of code – the tool looks up a URI, reads the headers, and outputs a hash of the values.  There doesn’t appear to be anything in the code that would explain why MarcEdit was generating so many requests (this function is only called once per item).  But looking at the documentation – well, there is.  The HttpWebRequest object has a property, AllowAutoRedirect, and it’s set to true by default.  This tells the component that a web request can be automatically redirected, up to the value set in MaximumAutomaticRedirections (50, by default).  Since every request to the LC service generates redirects, MarcEdit was following them and just tossing the data.  So that was my problem.

Allowing redirects is a fine assumption to make for a lot of things – but for my purposes, not so much.  It’s an easy fix – I added a parameter to the function header, one that is set to false by default, and use that value to set the AllowAutoRedirect bit.  This way I can allow redirects when I need them, but turn them off by default when I don’t (which is almost always).  Once finished, I tested against LC’s service, and they confirmed that this reduced the number of HEAD requests.  On my side, I noticed that things were much, much faster.  On the LC side, they are pleased because MarcEdit generates a lot of traffic, and this should help to reduce and focus that traffic.  So win, win, all around.
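Here’s a minimal sketch of that change – the parameter name is illustrative, since the actual signature lives in MarcEdit’s source:

    public System.Collections.Hashtable ReadUriHeaders(string uri, string[] headers, bool allowRedirects = false)
    {
        // ...same setup as the function above...
        System.Net.HttpWebRequest objRequest = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(MyUri(uri));
        objRequest.Method = "HEAD";

        // Don't chase id.loc.gov's redirect chain by default -- the first
        // HEAD response already carries everything MarcEdit needs.
        objRequest.AllowAutoRedirect = allowRedirects;

        // ...rest of the function unchanged...
    }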

What does this mean

So what this means – I’ll be posting an update this evening.  It will include a couple of tweaks based on feedback from the update this past Sunday – but most importantly, it will include this change.  If you use the linked data tools or the Validate Headings tool, you will want to update.  I’ve updated MarcEdit’s user agent string, so LC will now be able to tell if a user is running a version of MarcEdit that includes the fix.  If you aren’t, and you are generating a lot of traffic, don’t be surprised if they ask you to update. 

The other thing that I think this shows (and this I’m excited about) is that LC really has been incredibly accommodating when it comes to using this service.  Rather than telling me that MarcEdit needed to start following LC’s data request guidelines for the id.loc.gov service (which would make this service essentially useless), they worked with me to figure out what was going on so we could find a solution that everyone is happy with.  And like I said, we are both aware that as more users hit the service, there will be a need to throttle those requests globally, so we are talking about how that might be done. 

For me, this type of back and forth has been incredibly refreshing and somewhat new.  It has certainly never happened when I’ve spoken to any ILS vendor or data provider (save for members of the Koha and OLE communities) – and it gives me some hope that just maybe we can all come together and make this semantic web thing actually work.  The problem with linked data is that unless there is trust – trust in the data and trust in the service providing the data – it just doesn’t work.  And honestly, I’ve had concerns that in library land, there are very few services that I feel you could actually trust (and that includes OCLC at this point).  Service providers are slowly wading in – but these types of infrastructure components take resources, lots of resources, and they are invisible to the user… or rather, when they are working, they are invisible.  Couple that with the fact that these services are infrastructure components, not profit engines, and it’s not a surprise that so few services exist – and that the ones that do are not designed to support real-time, automated lookup.  When you realize that this is the space we live in right now, it makes me appreciate the folks at LC, and especially Nate Trail, all the more.  Again, if you happen to be at ALA and find these services useful, you really should let them know.

Anyway – I started the process to run tests and build this morning before heading off to work.  So, sometime this evening, I’ll be making this update available.  However, given that these components are becoming more mainstream and making their way into authority workflows, I wanted to give a heads up.

Questions – let me know.

–tr


Happy Holidays: MarcEdit Update

Jan 3, 2016
 

Over the past few years, holiday updates have become part of a MarcEdit tradition.  This year, I’ve spent the past month working on two significant sets of changes.  On the Windows side, I’ve been working on enhancing the linked data tools, profiling more fields and more services.  This update represents a first step in the process, as I’ll be working with the PCC to profile additional services and add new elements as we work through a pilot test around embedding linked data into MARC records and its potential implications.  For a full change list, please see: http://blog.reeset.net/archives/1822

The Mac version has seen a lot of changes – and because of that, I’ve moved the version number from 1.3.35 to 1.4.5.  In addition to all the infrastructure changes made within the Windows/Linux program (the tools share a lot of code), I’ve also done significant work exposing preferences and re-enabling the ILS integration.  I didn’t get to test the ILS integration as thoroughly as I’d like, so there may be a few updates to correct problems once people start working with it – but getting to this point took a lot of work, and I’m glad to see it through.  For a full list of updates on the Mac version, please see: http://blog.reeset.net/archives/1824

Before Christmas, I’d mentioned that I was working on three projects, with the idea that all would be ready by the time these updates were complete.  I was wrong – so it looks like I’ll have one more Christmas/New Year’s gift left to give, and I’ll be trying to wrap that work up this week.

Downloads – you can pick up the new downloads at http://marcedit.reeset.net/downloads or, if you have automatic update notification enabled, the tool should provide you with an option to update from within the program.

This represents a lot of work, and a lot of changes.  I’ve tested to the best of my ability, but I’m expecting that I may have missed something.  If you find something, let me know.  I’m setting aside time over the next couple of weeks to fix problems that might come up and turn around builds faster than normal.

Here’s looking forward to a wonderful 2016.

–tr
