Jan 06, 2016

A heads up to those folks using MarcEdit and using the following components:

  • Validate Headings
  • Build Links
  • Command-Line tool using the build links option

These components rely on MarcEdit’s linked data framework to retrieve semantic data from a wide range of vocabulary services.  I’ll be updating this shared framework to improve performance and how these tools interact with the Library of Congress’s id.loc.gov service.  This will provide a noticeable improvement on the MarcEdit side (with response times cut by a little over two-thirds) and will make MarcEdit much friendlier to the id.loc.gov service.  Given the wide range of talks at Midwinter this year discussing experimentation related to embedding semantic data into MARC records, and the role MarcEdit is playing in that work, I wanted to make sure this was available prior to ALA.

Why the change

When MarcEdit interacts with id.loc.gov, its communications are nearly always just HEAD requests.  This is because, over the past year or so, the folks at LC have been incredibly responsive in building into their response headers nearly all the information someone might need when looking up a controlled term, namely:

  1. Whether it exists
  2. Its preferred label
  3. Its URI

Prior to the header lookup, this had to be done using a different API, which resulted in two requests – one to the API, and then one to the XML representation of the document for parsing.  By moving the most important information into the document headers (X- elements), I can minimize the amount of data I’m requesting from LC.  And that’s a good thing – because LC tends to have strict guidelines around how often and how much data you are allowed to request at any given time.  In fact, were it not for LC’s willingness to let me bypass those caps when working with this service, a good deal of the new functionality being developed into the tool simply wouldn’t exist.  So, if you find the linked data work in MarcEdit useful, you shouldn’t be thanking me – this work has been made possible by LC and their willingness to experiment with id.loc.gov.
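To make the single-request pattern concrete, here is a minimal sketch of this kind of HEAD lookup.  The label-service URL pattern and the X-PrefLabel / X-Uri header names reflect my reading of how id.loc.gov behaves – treat the specifics as illustrative rather than authoritative.

```csharp
using System;
using System.Net;

class LabelLookup
{
    static void Main()
    {
        // Look up a subject heading against the id.loc.gov label service.
        // (URL pattern and header names are illustrative, not authoritative.)
        string term = Uri.EscapeDataString("Cartography");
        string url = "http://id.loc.gov/authorities/subjects/label/" + term;

        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";  // headers only -- no response body to parse

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            // The three pieces of information from the list above:
            // a non-404 response means the term exists, and the
            // X- headers carry the preferred label and the URI.
            Console.WriteLine(response.Headers["X-PrefLabel"]);
            Console.WriteLine(response.Headers["X-Uri"]);
        }
    }
}
```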

Anyway – the linked data tools have been available in MarcEdit for a while, and they are starting to generate significant traffic on the LC side of things.  Adding the Validate Headings tool only exacerbated this – enough so that LC has been asking if I could do some things to help throttle the requests coming from MarcEdit.  So, we are working on some options – but in the meantime, LC noticed something odd in their logs.  While MarcEdit only makes HEAD requests, and only processes the information from that first request, they were seeing 3 requests show up in their logs for every lookup.

Some background on the LC service: it performs a lot of redirection.  One request to the label service results in ~3 redirects.  All the information MarcEdit needs is found in the first response, but when looking at the logs, they could see MarcEdit following the redirects, resulting in 2 more HEAD requests for data that the tool was simply throwing away.  This means that in most cases, a single request for information was generating 3 HEAD requests – and if you take a file of 2,000 records, with ~5 headings to be validated per record (on average), MarcEdit would generate ~30,000 requests (10,000 lookups x 3).  That’s not good – and when LC approached me to ask why MarcEdit was requesting the other data, I didn’t have an answer.  It wasn’t until I went to the .NET documentation that the answer became apparent.

As folks should know, MarcEdit is developed using C#, which means it utilizes .NET.  The primary component for handling network interactions is System.Net – specifically, the System.Net.HttpWebRequest class.  Here’s the function:

public System.Collections.Hashtable ReadUriHeaders(string uri, string[] headers)
{
    System.Net.ServicePointManager.DefaultConnectionLimit = 10;
    System.Collections.Hashtable headerTable = new System.Collections.Hashtable();
    uri = System.Uri.EscapeUriString(uri);

    //after escape -- we need to catch ? and &
    uri = uri.Replace("?", "%3F").Replace("&", "%26");

    System.Net.WebRequest.DefaultWebProxy = null;
    System.Net.HttpWebRequest objRequest = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(MyUri(uri));
    objRequest.UserAgent = "MarcEdit 6.2 Headings Retrieval";
    objRequest.Proxy = null;

    //Changing the default timeout from 100 seconds to 30 seconds.
    objRequest.Timeout = 30000;

    objRequest.Method = "HEAD";

    try
    {
        using (var objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse())
        {
            if (objResponse.StatusCode == System.Net.HttpStatusCode.NotFound)
            {
                foreach (string name in headers)
                {
                    headerTable.Add(name, "");
                }
            }
            else
            {
                foreach (string name in headers)
                {
                    if (objResponse.Headers.AllKeys.Contains(name))
                    {
                        //headers arrive Latin-1 encoded; round-trip the bytes to recover UTF-8
                        string orig_header = objResponse.Headers[name];
                        byte[] b = System.Text.Encoding.GetEncoding(28591).GetBytes(orig_header);
                        headerTable.Add(name, System.Text.Encoding.UTF8.GetString(b));
                    }
                    else
                    {
                        headerTable.Add(name, "");
                    }
                }
            }
        }

        return headerTable;
    }
    catch (System.Exception p)
    {
        foreach (string name in headers)
        {
            headerTable.Add(name, "");
        }
        headerTable.Add("error", p.ToString());
        return headerTable;
    }
}

It’s a pretty straightforward piece of code – the tool looks up a URI, reads the headers, and outputs a hash of the values.  There doesn’t appear to be anything in the code that would explain why MarcEdit was generating so many requests (this function is only called once per item).  But looking at the documentation – there is.  The HttpWebRequest object has a property, AllowAutoRedirect, which is set to true by default.  This tells the component that a web request can be automatically redirected, up to the value set in MaximumAutomaticRedirections (50 by default).  Since every request to the LC service generates redirects, MarcEdit was following them and just tossing the data.  So that was my problem.  Allowing redirects is a fine default for a lot of purposes – but for mine, not so much.

It’s an easy fix – I added a parameter to the function header, set to false by default, and use that value to set the AllowAutoRedirect bit.  This way I can allow redirects when I need them, but turn them off by default when I don’t (which is almost always).  Once finished, I tested against LC’s service, and they confirmed that this reduced the number of HEAD requests.  On my side, I noticed that things were much, much faster.  On the LC side, they are pleased because MarcEdit generates a lot of traffic, and this should help to reduce and focus that traffic.  So win, win, all around.
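As a sketch, the fix looks something like the following – an optional parameter on the function, off by default, wired to AllowAutoRedirect.  The parameter name is mine, and the body is condensed; the actual MarcEdit signature and implementation may differ.

```csharp
using System.Collections;
using System.Net;

public class HeaderReader
{
    // Same shape as ReadUriHeaders above, with the one-line fix:
    // redirect-following is now opt-in rather than .NET's default.
    public Hashtable ReadUriHeaders(string uri, string[] headers,
                                    bool allowRedirect = false)
    {
        Hashtable headerTable = new Hashtable();

        HttpWebRequest objRequest = (HttpWebRequest)WebRequest.Create(uri);
        objRequest.Method = "HEAD";
        objRequest.Timeout = 30000;

        // The fix: without this, .NET silently follows each redirect
        // (up to MaximumAutomaticRedirections), issuing extra HEAD
        // requests whose responses were simply thrown away.
        objRequest.AllowAutoRedirect = allowRedirect;

        try
        {
            using (var objResponse = (HttpWebResponse)objRequest.GetResponse())
            {
                foreach (string name in headers)
                {
                    // missing headers come back as empty strings
                    headerTable.Add(name, objResponse.Headers[name] ?? "");
                }
            }
        }
        catch (System.Exception p)
        {
            headerTable["error"] = p.ToString();
        }

        return headerTable;
    }
}
```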

What does this mean

So what this means – I’ll be posting an update this evening.  It will include a couple of tweaks based on feedback from this past Sunday’s update – but most importantly, it will include this change.  If you use the linked data tools or the Validate Headings tool, you will want to update.  I’ve updated MarcEdit’s user agent string, so LC will now be able to tell whether a user is running a fixed version of MarcEdit.  If you aren’t, and you are generating a lot of traffic, don’t be surprised if they ask you to update.

The other thing that I think this shows (and this I’m excited about) is that LC really has been incredibly accommodating when it comes to using this service.  Rather than telling me that MarcEdit needed to start following LC’s data request guidelines for the id.loc.gov service (which would make this service essentially useless), they worked with me to figure out what was going on so we could find a solution that everyone is happy with.  And like I said, we are both concerned that as more users hit the service, there will be a need to throttle those requests globally, so we are talking about how that might be done.

For me, this type of back and forth has been incredibly refreshing and somewhat new.  It has certainly never happened when I’ve spoken to any ILS vendor or data provider (save for members of the Koha and OLE communities) – and it gives me some hope that just maybe we can all come together and make this semantic web thing actually work.  The problem with linked data is that unless there is trust – trust in the data and trust in the service providing the data – it just doesn’t work.  And honestly, I’ve had concerns that in library land there are very few services you could actually trust (and that includes OCLC at this point).  Service providers are slowly wading in – but these types of infrastructure components take resources, lots of resources, and they are invisible to the user…or rather, when they are working, they are invisible.  Couple that with the fact that these services are infrastructure components, not profit engines, and it’s not a surprise that so few services exist, and that the ones that do are not designed to support real-time, automated lookup.  When you realize that this is the space we live in right now, it makes me appreciate the folks at LC, and especially Nate Trail, all the more.  Again, if you happen to be at ALA and find these services useful, you really should let them know.

Anyway – I started running tests and builds this morning before heading off to work.  So, sometime this evening, I’ll be making this update available.  However, given that these components are becoming more mainstream and making their way into authority workflows, I wanted to give a heads up.

Questions – let me know.

–tr

 Posted by at 8:18 am
Jan 03, 2016

Over the past few years, holiday updates have become a part of a MarcEdit tradition.  This year, I’ve spent the past month working on two significant sets of changes.  On the Windows side, I’ve been working on enhancing the Linked Data tools, profiling more fields and more services.  This update represents a first step in the process, as I’ll be working with the PCC to profile additional services and add new elements as we work through a pilot test around embedding linked data into MARC records and its potential implications.  For a full change list, please see: http://blog.reeset.net/archives/1822

The Mac version has seen a lot of changes – and because of that, I’ve moved the version number from 1.3.35 to 1.4.5.  In addition to all the infrastructure changes made within the Windows/Linux program (the tools share a lot of code), I’ve also done significant work exposing preferences and re-enabling the ILS Integration.  I didn’t get to test the ILS integration well – so there may be a few updates to correct problems once people start working with it – but getting to this point took a lot of work and I’m glad to see it through.  For a full list of updates on the Mac version, please see: http://blog.reeset.net/archives/1824

Before Christmas, I’d mentioned that I was working on three projects – with the idea that all would be ready by the time these updates were complete.  I was wrong – so it looks like I’ll have one more Christmas/New Years gift left to give – and I’ll be trying to wrap that work up this week.

Downloads – you can pick up the new downloads at: http://marcedit.reeset.net/downloads or if you have the automatic update notification enabled, the tool should provide you with an option to update from within the program.

This represents a lot of work, and a lot of changes.  I’ve tested to the best of my ability – but I’m expecting that I may have missed something.  If you find something, let me know.  I’m saving time over the next couple weeks to fix problems that might come up and turn around builds faster than normal.

Here’s looking forward to a wonderful 2016.

–tr

 Posted by at 8:55 pm
Jan 02, 2016

I’ve been working with the PCC Linked Data in MARC Task Group over the past couple of months, and as part of this process, I’ve been expanding the values that can be recognized by the Linking tool in MarcEdit.  As those who have used it might remember, MarcEdit’s linking tool showed up about a year and a half ago, and leverages id.loc.gov, MESH, and VIAF (primarily).  As part of this process with the PCC, a number of new vocabularies and fields have been added to the tool’s capabilities.  This has also meant creating profiles for linking data in both bibliographic and authority records.

The big changes come in the range of indexes now supported by the tool (if defined within the record).  At this point, the following vocabularies are profiled for use:

  1. NAF
  2. LCSH
  3. LCSH Children
  4. MESH
  5. ULAN
  6. AAT
  7. LCGFT
  8. AGROVOC
  9. LCMPT
  10. LCDGT
  11. TGM
  12. RDA Vocabularies

The data profiled has also expanded beyond just 1xx, 6xx, and 7xx fields to include 3xx fields and data unique to authority records.

This has required changing the interface slightly:

[Image: updated Build Links interface]

But I believe that I have the bugs worked out.  This function will be changing often over the next month or so as the PCC utilizes this and other tools while piloting a variety of methods for embedding linked data into MARC records and considering the implications.  As such, I’ll be adding to the list of profiled data over the coming month.  However, if you use a specific vocabulary and don’t see it in the list, let me know – I can profile it as long as the resource provides a set of APIs that can support a high volume of queries.  (It cannot be a data download – that doesn’t work for client applications.  At this point, going that route would require users to download almost 12 GB of data, almost monthly.)

Questions…let me know.

 Posted by at 12:54 pm
Jan 02, 2016

Sometime this past month, I was asked if there was a way to automate the batch processing of Sanborn Cutters.  OCLC’s Connexion provides a handy set of methods for doing this if you are an OCLC member.  It’s wrapped up in a nifty library and provides access to the current Sanborn Table 4 Cutters (which I believe are under the control of OCLC).  For most users, this is probably what they want to use – and at some point I’d be interested in seeing whether OCLC might be willing to let me link to this particular library to provide a set of batch tools around Sanborn Cutter creation for users of the Four Figure table.  However, the Three Figure Tables were published long before 1921 (the copy I’m using dates to 1904) – so I decided to provide a tool for batch creation of Sanborn Table 3 Cutters.

I guess before we go further, I’m not particularly familiar with this set of cutters.  I’ve only cataloged in an academic library, so I’m primarily familiar with LC’s cuttering methodology — so I’ll be interested in hearing if this actually works.

Ok, with that out of the way — this tool is available from within the MarcEditor.  The assumptions here are that:

  1. You have a call number stem (assuming Dewey).  The tool defaults to cuttering on (in order of preference) the 1xx, 245, 243, or 240 fields.  However, if you want to cutter on a different field (say, a 600), you can identify the cuttering field and provide data to be queried to select the correct field (in case there are multiple values).
  2. That you know what you are doing (because I don’t :))

To run the tool, have the file you want to process open in the MarcEditor, then select Tools/Cuttering Tools/Sanborn Table 3 Cutters.

MarcEdit Mac:

MarcEdit Mac Sanborn Cutter menu

MarcEdit Windows/Linux:

Sanborn Cutters Menu Windows

When you select the value — you see the following window:

MarcEdit Mac:

Generate Sanborn Cutters Mac

MarcEdit Windows/Linux

Generate Sanborn Cutters Windows Form

As you can see, I’ve tried to keep the function identical across platforms.  In the first textbox, you need to enter the field to evaluate.  Again, the tool assumes that you have the start of a call number in your record.  So, for LC call numbers, that would be 090$a, 099$a, or 050$a; for Dewey, 082 or 092.  You then select how the cutter will be generated.

A couple of words about how this works.  Cuttering is generated from a set of tables.  The process works by looking for either an exact match in the cutter table or the closest match within the table.  In my testing, it looks like the process I’m employing works reliably – but text data can be weird – so you’ll have to let me know if you see problems.
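As a sketch of the “exact match or closest preceding entry” logic described above (the table entries below are invented for illustration – MarcEdit’s actual Sanborn Table 3 data will differ):

```csharp
using System;
using System.Collections.Generic;

class CutterTable
{
    // key: name prefix, value: cutter number (sample, hypothetical values)
    static readonly SortedList<string, string> Table =
        new SortedList<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "Sma", "51" },
            { "Smith", "55" },
            { "Smy", "57" },
        };

    public static string Lookup(string name)
    {
        // Exact match first
        if (Table.ContainsKey(name)) return Table[name];

        // Otherwise, take the closest entry that sorts at or before the name
        string best = null;
        foreach (var entry in Table)
        {
            if (string.Compare(entry.Key, name, StringComparison.OrdinalIgnoreCase) <= 0)
                best = entry.Value;
            else
                break;  // SortedList enumerates in ascending key order
        }
        return best;
    }

    static void Main()
    {
        Console.WriteLine(Lookup("Smith"));    // exact match
        Console.WriteLine(Lookup("Smithson")); // closest preceding entry
    }
}
```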

When you process the data, the program will insert a $b into the defined call number field based on the entry it determined best represents the information in the cutter tables.

–tr

 Posted by at 12:42 pm
Jan 02, 2016

One of the parts of the MarcEdit Mac port that has been lagging is the ability to manage a number of the preferences MarcEdit uses when running.  Originally, I exposed the preferences for the MARCEngine and for updates.  As part of the next update, I’ve included options to handle preference settings for the MarcEditor, locations, miscellaneous settings, and ILS integration.  The set still isn’t as robust as the Windows/Linux version – but part of that is because some of the options are not applicable; more likely, though, they just weren’t among the options most commonly asked for.  I’ll be working on adding the remainder through January.

MarcEditor Preferences:

MarcEditor Mac Preferences

Locations:

Mac Locations Preferences

Other Settings:

Other Mac Preferences

ILS Integration:

ILS Mac Integrations

 Posted by at 12:20 pm
Jan 01, 2016

I’ve been working hard over the last month and a half to complete the process of porting functionality into the OS X version of MarcEdit.  I’ve completed the vast majority of this work, in addition to bringing in a number of other changes.  These changes will be made available as part of the 1/3/2016 update, and will be as follows:

  • Bug Fix: RDA Helper — 260/264 changes were sometimes not happening when running across specific 260$c formatting.
  • Bug Fix: MARCValidator: Validator was crashing when records would go beyond 150,000 bytes.  This has been corrected.
  • Bug Fix: Build Links Tool — MESH headings were utilizing older syntax and occasionally missing values.
  • Bug Fix: Validate Headings tool: When checking Automatically Correct variants, the embed URIs option was automatically selected.  This has been corrected.
  • Bug Fix: Edit XML Functions: In the modify option, the save button was turned on.  This has been corrected.
  • Enhancement: Build Links: The build links tool used to use the OpenSearch API for id.loc.gov resolution.  This was changed to work like the Validate Headings tool and provide more consistent linking.
  • Enhancement: Most of MarcEdit’s preferences have been exposed.
  • Enhancement: Build Links Tool – I’ve added profiles for a wide range of vocabularies being tested by the PCC Linked Data task force.  These are available.
  • Enhancement: Build Links Tool — Profiled services are found under a link.
  • Enhancement: Build Links Tool — Task management options have been added for the new validate options.
  • Enhancement: MarcEditor: Generate Cutters: LC cutter generation has been updated.
  • Enhancement: MarcEditor: Generate Sanborn Cutters: Added function to generate Sanborn Table 3 Cutters.
  • Enhancement: ILS Framework — MarcEdit’s ILS framework options were added.
  • Enhancement: Koha Integration: Koha Integration options were added to the tool.

This doesn’t complete the function migration, but it’s close.  These changes will be part of the 1/3/2016 update.  I’ll be working on adding a few YouTube videos to document the new functions.  Let me know if you have questions.

 Posted by at 9:04 pm
Jan 01, 2016

Over the past month, I’ve been working hard on a number of MarcEdit changes.  These changes will be released on 1/3/2016.  This update includes a version number change to 6.2, and will have the following changes:

  • Bug Fix: RDA Helper — 260/264 changes were sometimes not happening when running across specific 260$c formatting.
  • Bug Fix: MARCValidator: Validator was crashing when records would go beyond 150,000 bytes.  This has been corrected.
  • Bug Fix: Build Links Tool — MESH headings were utilizing older syntax and occasionally missing values.
  • Bug Fix: Tutorials Link pointed to dead endpoint.  Corrected.
  • Bug Fix: 006/007 Menu Selection: The incorrect form was being selected when choosing the Serial and Cartographic materials options.  This has been corrected.
  • Bug Fix: Validate Headings tool: When checking Automatically Correct variants, the embed URIs option was automatically selected.  This has been corrected.
  • Bug Fix: MarcEditor Find: When selecting edit query, the find box went to the Replace dialog.  This has been corrected.
  • Bug Fix: Harvest OAI Records: If the harvester.txt file isn’t present, an unrecoverable error occurs.  This has been corrected.
  • Bug Fix: MarcEditor Task List: When you have a lot of tasks, the list of available tasks may not refresh on first run.  I believe I’ve corrected this.
  • Enhancement: Build Links: The build links tool used to use the OpenSearch API for id.loc.gov resolution.  This was changed to work like the Validate Headings tool and provide more consistent linking.
  • Enhancement: Preferences: Under File preferences, you can set the default drive for the information in the MARC Tools source and output textboxes.
  • Enhancement: Build Links Tool – I’ve added profiles for a wide range of vocabularies being tested by the PCC Linked Data task force.  These are available.
  • Enhancement: Build Links Tool — Profiled services are found under a link.
  • Enhancement: Build Links Tool — Task management options have been added for the new validate options.
  • Enhancement: MarcEditor: Generate Cutters: LC cutter generation has been updated.
  • Enhancement: MarcEditor: Generate Sanborn Cutters: Added function to generate Sanborn Table 3 Cutters.

This update will be posted 1/3/2016. I’ll be working to add a few YouTube videos to document new functions. 

 Posted by at 9:04 pm
Nov 08, 2015

I’ve posted a new MarcEdit update.  You can get the builds directly from: http://marcedit.reeset.net/downloads or using the automated update tool within MarcEdit.  Direct links:

The change log follows:

–tr

***********************************************************************************************

MarcEdit Mac ChangeLog: 11/8/2015

MarcEdit Applications Changes:
* Build New Field Tool Added
** Added Build New Field Tool to the Task Manager
* Validate Headings Tool Added
* Extract/Delete Selected Records Tool Added

* Updates to Linked Data tool
** Added option to select oclc number for work id embedding
** Updated Task Manager signatures

* Edit Indicators
** Removed a blank space as legacy wildcard value.  Wildcards are now strictly “*”

Merge Records Tool
* Updated User defined fields options to allow 776$w to be used (fields used as part of the MARC21 option couldn’t previously be redefined to act as a single match point)

Validator
* Results page will print UTF8 characters (always) if present

Sorting
* Added an option so that, if selected, 880 fields are sorted as part of their paired fields.

Z39.50 Client
* Supports Single and Batch Search Options

 Posted by at 8:52 am
Nov 08, 2015

I’ve posted a new MarcEdit update.  You can get the builds directly from: http://marcedit.reeset.net/downloads or using the automated update tool within MarcEdit.  Direct links:

The change log follows:

–tr

***********************************************************************************************

MarcEdit Windows/Linux ChangeLog: 11/8/2015

MarcEdit Application Changes:
* Updates to the Build New Field Tool
** Code moved into meedit code library (for portability to the mac system)
** Separated options to provide an option to add new field only, add when not present, replace existing fields
** Updated Task Manager signatures — if you use this function in a task, you will need to update the task

* Updates to Linked Data tool
** Added option to select oclc number for work id embedding
** Updated Task Manager signatures
** Updated cmarcedit commandline options

* Edit Indicators
** Removed a blank space as legacy wildcard value.  Wildcards are now strictly “*”

Merge Records Tool
* Updated User defined fields options to allow 776$w to be used (fields used as part of the MARC21 option couldn’t previously be redefined to act as a single match point)

Validator
* Results page will print UTF8 characters (always) if present

Validate ISBN/ISSN
* Results page now includes the 001 if present in addition to the record # in the file

Sorting
* Added an option so that, if selected, 880 fields are sorted as part of their paired fields.

Preferences:
* Added Sorting Preferences
* Added New Options Option, shifting the place where the folder settings are set.

UI Improvements
* Various UI improvements made to better support Windows 10.

 Posted by at 8:51 am