Heads Up: MarcEdit Linked Data Components Update (all versions) scheduled for this evening

By reeset / In MarcEdit

A heads up to folks using MarcEdit with any of the following components:

  • Validate Headings
  • Build Links
  • Command-Line tool using the build links option

These components rely on MarcEdit’s linked data framework to retrieve semantic data from a wide range of vocabulary services. I’ll be updating one of these components to improve performance and how they interact with the Library of Congress’s id.loc.gov service. This will provide a noticeable improvement on the MarcEdit side (with response time cut by a little over two-thirds) and will make MarcEdit much friendlier to the LC id.loc.gov service. Given the wide range of talks at Midwinter this year discussing experiments with embedding semantic data into MARC records, and the role MarcEdit is playing in that work, I wanted to make sure this was available prior to ALA.

Why the change

When MarcEdit interacts with id.loc.gov, its communications are nearly always just HEAD requests. This is because, over the past year or so, the folks at LC have been incredibly responsive in building into their response headers nearly all the information someone might need if they are just interested in looking up a controlled term and finding out:

  1. Whether it exists
  2. Its preferred label
  3. Its URI

Before the HEAD lookup was available, this had to be done using a different API, which resulted in two requests: one to the API, and then one to the XML representation of the document for parsing. With the most important information moved into the response headers (X- elements), I can minimize the amount of data I have to request from LC. And that’s a good thing, because LC tends to have strict guidelines around how often and how much data you are allowed to request from them at any given time. In fact, were it not for LC’s willingness to allow me to bypass those caps when working with this service, a good deal of the new functionality being developed into the tool simply wouldn’t exist. So, if you find the linked data work in MarcEdit useful, you shouldn’t be thanking me. This work has been made possible by LC and their willingness to experiment with id.loc.gov.

Anyway, the linked data tools have been available in MarcEdit for a while, and they are starting to generate significant traffic on the LC side of things. Adding the Validate Headings tool only exacerbated this, enough so that LC has been asking if I could do some things to help throttle the requests coming from MarcEdit. So, we are working on some options, but in the meantime, LC noticed something odd in their logs. While MarcEdit only makes HEAD requests, and only processes the information from that request, they were seeing 3 requests showing up in their logs.

Some background on the LC service: it performs a lot of redirection. One request to the label service results in ~3 redirects. All the information MarcEdit needs is found in the first response, but when looking at the logs, they can see MarcEdit is following the redirects, resulting in 2 more HEAD requests for data that the tool is simply throwing away. This means that in most cases, a single request for information is generating 3 HEAD requests, and if you take a file of 2000 records with ~5 headings to be validated (on average), that means MarcEdit would generate ~30,000 requests (10,000 x 3). That’s not good, and when LC approached me to ask why MarcEdit was asking for the other data files, I didn’t have an answer. It wasn’t until I went to the .NET documentation that the answer became apparent.
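
To make that arithmetic concrete, here is a small C# sketch. The class and method names are made up for illustration and are not part of MarcEdit; the figures are the ones quoted above (2000 records, about 5 headings each, 3 requests per lookup when redirects are followed).

```csharp
using System;

public static class RequestEstimate
{
    // Total HTTP requests = records x headings per record x requests per lookup.
    public static long TotalRequests(int records, int headingsPerRecord, int requestsPerLookup)
    {
        return (long)records * headingsPerRecord * requestsPerLookup;
    }

    public static void Main()
    {
        // Redirects followed: each lookup costs 3 HEAD requests.
        Console.WriteLine(TotalRequests(2000, 5, 3)); // prints 30000

        // Redirects not followed: each lookup costs 1 HEAD request.
        Console.WriteLine(TotalRequests(2000, 5, 1)); // prints 10000
    }
}
```

In other words, not following the redirects should remove roughly two of every three requests for the same file.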

As folks should know, MarcEdit is developed in C#, which means it uses .NET. Network interactions are handled through the System.Net namespace, specifically the System.Net.HttpWebRequest class. Here’s the function:

        public System.Collections.Hashtable ReadUriHeaders(string uri, string[] headers)
        {
            System.Net.ServicePointManager.DefaultConnectionLimit = 10;
            System.Collections.Hashtable headerTable = new System.Collections.Hashtable();
            uri = System.Uri.EscapeUriString(uri);

            // EscapeUriString leaves reserved characters such as ? and & untouched,
            // so escape them explicitly.
            uri = uri.Replace("?", "%3F").Replace("&", "%26");

            System.Net.WebRequest.DefaultWebProxy = null;
            System.Net.HttpWebRequest objRequest = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(MyUri(uri));
            objRequest.UserAgent = "MarcEdit 6.2 Headings Retrieval";
            objRequest.Proxy = null;
            
            // Change the default timeout from 100 seconds to 30 seconds.
            objRequest.Timeout = 30000;

            // Only the response headers are needed, so issue a HEAD request instead of a GET.
            objRequest.Method = "HEAD";


            try
            {
                using (var objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse())
                {
                    // GetResponse normally throws a WebException for a 404 (handled in the catch below);
                    // this check covers the case where a 404 response is returned instead.
                    if (objResponse.StatusCode == System.Net.HttpStatusCode.NotFound)
                    {
                        foreach (string name in headers)
                        {
                            headerTable.Add(name, "");
                        }
                    }
                    else
                    {

                        foreach (string name in headers)
                        {
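                            // AllKeys is a string[]; Contains here is the LINQ extension, so the file needs "using System.Linq;".
                            if (objResponse.Headers.AllKeys.Contains(name))
                            {
                                // The code treats the header text as having been decoded as ISO-8859-1 (code page 28591):
                                // it re-encodes the text to the original bytes and decodes those as UTF-8
                                // so that non-ASCII labels come through intact.
                                string orig_header = objResponse.Headers[name];
                                byte[] b = System.Text.Encoding.GetEncoding(28591).GetBytes(orig_header);

                                headerTable.Add(name, System.Text.Encoding.UTF8.GetString(b));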
                                
                            }
                            else
                            {
                                headerTable.Add(name, "");
                            }
                        }
                    }
                }
                
                return headerTable;
            }
            catch (System.Exception p)
            {
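                // Use the indexer rather than Add: if the exception was thrown partway through
                // the loop above, some keys already exist and Add would throw on the duplicate.
                foreach (string name in headers)
                {
                    headerTable[name] = "";
                }
                headerTable["error"] = p.ToString();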
                return headerTable;
            }
        }
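
One step in the function that is easy to miss is the re-decoding of each header value. A minimal sketch of that step on its own, assuming the header bytes are UTF-8 that has been read as ISO-8859-1 (which is what the function appears to assume). The class name, method name, and sample label are hypothetical.

```csharp
using System;
using System.Text;

public static class HeaderDecoding
{
    // Turn the mis-decoded text back into its original bytes, then read those bytes as UTF-8.
    public static string RecoverUtf8(string headerValue)
    {
        Encoding latin1 = Encoding.GetEncoding(28591); // ISO-8859-1
        byte[] raw = latin1.GetBytes(headerValue);
        return Encoding.UTF8.GetString(raw);
    }

    public static void Main()
    {
        // Simulate what the function is guarding against:
        // the UTF-8 bytes of a label, misread as ISO-8859-1.
        string label = "Bront\u00EB, Charlotte";
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(label);
        string asReceived = Encoding.GetEncoding(28591).GetString(utf8Bytes);

        Console.WriteLine(asReceived == label);              // prints False
        Console.WriteLine(RecoverUtf8(asReceived) == label); // prints True
    }
}
```

The round trip is lossless because ISO-8859-1 maps every byte value to exactly one character, so no bytes are dropped on the way back.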

It’s a pretty straightforward piece of code: the tool looks up a URI, reads the headers, and returns a hash of the values. There doesn’t appear to be anything in the code that would explain why MarcEdit was generating so many requests (because this function was only being called once per item). But looking at the documentation, well, there is. The HttpWebRequest object has a property, AllowAutoRedirect, and it’s set to true by default. This tells the component that a web request can be automatically redirected up to the value set in MaximumAutomaticRedirections (50 by default). Since every request to the LC service generates redirects, MarcEdit was following them and just tossing the data. So that was my problem.

Allowing redirects is a fine assumption to make for a lot of things, but for my purposes, not so much. It’s an easy fix: I added a parameter to the function signature, one that is false by default, and then use that value to set the AllowAutoRedirect property. This way I can allow redirects when I need them, but leave them off by default when I don’t (which is almost always). Once finished, I tested against LC’s service and they confirmed that this reduced the number of HEAD requests. On my side, I noticed that things were much, much faster. On the LC side, they are pleased because MarcEdit is generating a lot of traffic, and this should help to reduce and focus that traffic. So win, win, all around.
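
The shape of that change can be sketched roughly as follows. This is not the actual MarcEdit code: the class name, method name, parameter name, and URL are placeholders, and only the request setup is shown. Nothing is sent over the network here. On newer .NET versions WebRequest.Create is flagged as obsolete by the compiler, but it still builds the request.

```csharp
using System;
using System.Net;

public static class HeadRequestFactory
{
    // allowRedirect is a hypothetical parameter name. It defaults to false,
    // so callers must opt in when they actually want redirects followed.
    public static HttpWebRequest CreateHeadRequest(string uri, bool allowRedirect = false)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "HEAD";
        request.Timeout = 30000;

        // HttpWebRequest follows redirects unless told otherwise; set the property explicitly.
        request.AllowAutoRedirect = allowRedirect;
        return request;
    }

    public static void Main()
    {
        // Placeholder URL; the request is only constructed, never sent.
        HttpWebRequest request = CreateHeadRequest("http://example.org/label/placeholder");
        Console.WriteLine(request.AllowAutoRedirect); // prints False
    }
}
```

Making false the default keeps every existing call site on the cheaper path, while the rare caller that does need the final target of a redirect chain can pass true.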

What does this mean

So what this means: I’ll be posting an update this evening. It will include a couple of tweaks based on feedback from the update this past Sunday, but most importantly, it will include this change. If you use the linked data tools or the Validate Headings tool, you will want to update. I’ve updated MarcEdit’s user agent string, so LC will now be able to tell if a user is running a version of MarcEdit that has the fix. If you aren’t and you are generating a lot of traffic, don’t be surprised if they ask you to update.

The other thing that I think it shows (and this I’m excited about) is that LC really has been incredibly accommodating when it comes to using this service. Rather than telling me that MarcEdit needed to start following LC’s data request guidelines for the id.loc.gov service (which would make this service essentially useless), they worked with me to figure out what was going on so we could find a solution that everyone is happy with. And like I said, we both are concerned that as more users hit the service, there will be a need to throttle those requests globally, so we are talking about how that might be done.

For me, this type of back and forth has been incredibly refreshing and somewhat new. It certainly has never happened when I’ve spoken to any ILS vendor or data provider (save for members of the Koha and OLE communities), and it gives me some hope that just maybe we can all come together and make this semantic web thing actually work. The problem with linked data is that unless there is trust, trust in the data and trust in the service providing the data, it just doesn’t work. And honestly, I’ve had concerns that in library land, there are very few services that I feel you could actually trust (and that includes OCLC at this point). Service providers are slowly wading in, but these types of infrastructure components take resources, lots of resources, and they are invisible to the user…or, when they are working, they are invisible. Couple that with the fact that these services are infrastructure components, not profit engines, and it’s not a surprise that so few services exist, and that the ones that do are not designed to support real-time, automated lookup. Realizing that this is the space we live in right now makes me appreciate the folks at LC, and especially Nate Trail, all the more. Again, if you happen to be at ALA and find these services useful, you really should let them know.

Anyway – I started the process to run tests and then build this morning before heading off to work.  So, sometime this evening, I’ll be making this update available.  However, given that these components are becoming more mainstream and making their way into authority workflows – I wanted to give a heads up.

Questions – let me know.

–tr