Harvesting UMich OAI records with MarcEdit

By reeset / On / In MarcEdit, OAI

I’ve had a few folks ask what the procedure would be for a user wanting to harvest the UMich OAI records using MarcEdit.  Well, there are two workflows that can be followed depending on what you want to do.  You can harvest the OAI data and translate it directly to MARC, or you can harvest the raw data directly to your file system.  Here’s how each would work:

Generating MARC records from the OAI content:

  1. Start MarcEdit
  2. From the Main Screen, click on the Harvest OAI Records Link
  3. Once the link has been selected, you have a number of options available to you to control the harvesting.  Required options are those that are seen when the screen opens.  Advanced Settings, or optional settings define additional options available to the user.  Here’s a screenshot of the Harvester with the Advanced Options expanded:
    The required elements that must be filled in are the Server Address (the address pointing to the OAI URL), metadata type (format to be downloaded) and Crosswalk Path.  If you select any of the predefined metadata types, the program will select the crosswalk path for you.  If you add your own, then you will need to point the program to the crosswalk path.  Set name is optional.  If you leave this value blank, the harvester will attempt to harvest all available sets on the defined server. 

    Advanced settings give the user a number of additional harvesting options, generally set aside to help users control flow.  For example, users can harvest an individual record by entering the record’s identifier into the GetRecord textbox.  A user can resume a harvest by entering the resumptionToken into the ResumptionToken textbox.  If the user wants to harvest a subset of a specific data set, they can use a date limit (of course, you must use the date format supported by the server, generally yyyy or yyyy-mm-dd format).  Users can also determine if they want their metadata translated into MARC8 (since the harvester assumes UTF8 for all XML data) and change the timeout settings the harvester uses for returning data (you generally shouldn’t change this).  Finally, for users that don’t want to harvest data into MARC, but just need the raw data, there is the ability to tell the harvester to just harvest data to the local file system.  If this option is checked, then the Crosswalk Path’s label and behavior will change, requiring the user to enter a path to a directory to tell the harvester where it should save the harvested files.

  4. For the UMich Digital Books, a user would want to utilize the following settings to harvest metadata into MARC:
    Users wanting to ensure that the MARC data is in MARC8 and not UTF8 format should check the Translate to MARC-8 option.  Once these settings have been set, a user will just need to click the OK button.  For this set (mbooks), there are 111000+ records, so harvesting will take about an hour to complete, and longer if you ask the program to translate data into MARC8.
  5. When finished, users will be prompted with a status box indicating the number of records, the number of resumptionTokens and the last resumptionToken processed (and any error information if an error occurred during processing).
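
For anyone curious what these settings turn into on the wire, here is a minimal sketch of how they map onto OAI-PMH request URLs. This is not MarcEdit's code: `build_request` and its parameter names are made up for illustration. The server address, the `marc21` metadata prefix and the `mbooks` set name come from this post, and the query arguments (`verb`, `metadataPrefix`, `set`, `from`, `until`, `identifier`, `resumptionToken`) are the standard OAI-PMH ones.

```python
from urllib.parse import urlencode

# Server address of the UM OAI repository, as given in this post.
BASE_URL = "http://quod.lib.umich.edu/cgi/o/oai/oai"


def build_request(base_url, metadata_prefix="marc21", set_name=None,
                  from_date=None, until_date=None,
                  identifier=None, resumption_token=None):
    """Return the OAI-PMH request URL for one set of harvester options.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if identifier:
        # Harvest a single record (the GetRecord textbox).
        params = {"verb": "GetRecord", "identifier": identifier,
                  "metadataPrefix": metadata_prefix}
    elif resumption_token:
        # Resume a harvest: in OAI-PMH the resumptionToken is sent on its
        # own, without metadataPrefix, set or date arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_name:
            params["set"] = set_name  # leave out to harvest every set
        if from_date:
            params["from"] = from_date
        if until_date:
            params["until"] = until_date
    return base_url + "?" + urlencode(params)
```

Calling `build_request(BASE_URL, set_name="mbooks")` gives the first request of a full harvest of the mbooks set; passing `resumption_token=...` gives the request that picks a harvest back up.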


Harvesting OAI records directly to the filesystem

  1. Start up MarcEdit
  2. Select the Harvest OAI Records link
  3. Enter the following information (Server folder location will obviously vary):
  4. Files are harvested into the defined directory, numbered sequentially according to the resumption token processed.  Again, when processing is finished, a summary window will be generated to inform the user of harvest status and error information related to the harvest.
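
If you want to see roughly what a raw harvest involves outside of MarcEdit, the loop below is a minimal sketch under a few assumptions: `harvest_raw` and `fetch_url` are hypothetical names, the `1.xml`, `2.xml`, ... file naming is my own stand-in for "numbered according to resumption token processed", and the 60 second timeout is arbitrary. The resumptionToken handling follows the OAI-PMH protocol: keep requesting until the response carries no token or an empty one.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_url(url):
    """Default fetcher: return the raw response bytes for a URL."""
    with urlopen(url, timeout=60) as response:  # timeout is an assumption
        return response.read()


def harvest_raw(base_url, out_dir, metadata_prefix="marc21", set_name=None,
                fetch=fetch_url):
    """Save each ListRecords response into out_dir; return the file count."""
    os.makedirs(out_dir, exist_ok=True)
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_name:
        params["set"] = set_name
    count = 0
    while True:
        data = fetch(base_url + "?" + urlencode(params))
        count += 1
        # Hypothetical naming scheme: one file per response, in order.
        with open(os.path.join(out_dir, f"{count}.xml"), "wb") as handle:
            handle.write(data)
        token = ET.fromstring(data).find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            return count  # a missing or empty token ends the list
        params = {"verb": "ListRecords",
                  "resumptionToken": token.text.strip()}
```

The `fetch` argument is there so the loop can be tried against canned responses, or wrapped with retry logic, without touching the rest of the code.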

Errors related to the UMich Harvest that could be encountered:

My guess is that you would not see these if you are using the most current version of MarcEdit, uploaded 2008-01-27; however, you may run into them if harvesting using other tools or older versions of MarcEdit.

  1. Server Timeout:  When harvesting all records, I was routinely seeing the server reset its connection after harvesting 10-18 resumption tokens.  The current version of MarcEdit has some fall-over code that will reinitiate the harvest under these conditions, stopping after 3 failed attempts.
  2. Invalid MARC data:  Within the 111000+ records, there are approximately 40-60+ MARC records that have too few characters represented in the MARC leader element.  This is problematic because this error will invalidate the record and, depending on how the MARC parser handles records, poison the remainder of the file.  MarcEdit accommodates these errors by auto-correcting the leader values, but this could be a problem with other tools.
  3. Invalid date format (screenshot of the error message not preserved):
    This error message will be generated if you set the start and end elements using an invalid date format.  You should always check with the OAI server to see what date formats are supported by the server.  In this case, the date format expected by the UM OAI server is as follows:
    <repositoryName>University of Michigan Library Repository</repositoryName> 

    Notice the granularity element — this tells me that any of the following formats would be valid:

Anyway, that’s pretty much it.  If you are just interested in seeing what type of data UM is exposing with these data elements, you can find that data (harvested 2008-01-25) at: umich_books.zip (~63 MB).




MARC21 University of Michigan Google Digital Books Records (records for testing/viewing)

By reeset / On / In MarcEdit, OAI

I was playing with MarcEdit’s OAI harvester, making a few changes to fix a problem that had been discovered, as well as add some fall-over code that allows the harvester to continue processing (or at least, attempt to continue processing) when the OAI server breaks the connection (generally through timeout).  To test, I decided to work with the UMichigan Google Books sets of records Michigan recently made available.  It’s a large set and is one of those servers where the server timeout had been identified as an issue (i.e., this came up because a MarcEdit user had inquired about a problem they were having harvesting data). 

Anyway, I’ll likely post the update to the OAI harvesting code on Sunday or so (which will also include an update to the CJK processing component when going from MARC8-UTF8 — particularly when the record sets contain badly encoded data), and with it, I’ll likely include a small tutorial for users wanting to use MarcEdit to do one of the following:

  1. Harvest the UM digital book records from OAI directly into MARC21 (saving the character set in either legacy MARC8 or UTF8 format)
  2. Harvest the raw UM digital book metadata records via OAI (without the MARC conversion)

While I think that the Harvester is fairly straightforward to use, I’m going to post some instructions, in part, so that I can underline some of the common error messages that one might see and what they mean.  For example, with the UM harvesting, I found that the OAI server tended to time out after approximately 15 queries using a persistent connection.  When it stopped, the server would throw a 503 error.  I was able to overcome the issue by adding some code to the app to track failures, pause harvesting and restart the connection to the server.  These types of errors are not easy for most users to debug, though, since they are not sure if the issue lies with the harvesting software or the server being harvested.
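
For folks harvesting with their own scripts, the track-failures, pause and reconnect idea can be sketched roughly as below. This is not the code that ships in MarcEdit: `fetch_with_retry` is a hypothetical name, the 30 second pause is an arbitrary assumption, and the limit of 3 attempts is simply borrowed from the behavior described for the updated harvester.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_with_retry(url, max_attempts=3, pause_seconds=30,
                     opener=urlopen, sleep=time.sleep):
    """Fetch url, pausing and opening a fresh connection after a failure."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Each attempt opens a new connection to the server.
            with opener(url) as response:
                return response.read()
        except (HTTPError, URLError, ConnectionError) as error:
            # A 503 surfaces as HTTPError; a dropped or reset connection
            # surfaces as URLError or ConnectionError.
            last_error = error
            if attempt < max_attempts:
                sleep(pause_seconds)  # give the server a moment to recover
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The `opener` and `sleep` arguments exist only so the function can be exercised without a live server or real delays.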

Another problem that I’ve coded MarcEdit to fix on the fly is that a handful of MARC21 records (I believe I identified approximately 40ish of 111000+) sent via OAI have invalid leader statements (i.e., not enough characters in the string).  For example, in this record: http://quod.lib.umich.edu/cgi/o/oai/oai?verb=GetRecord&metadataPrefix=marc21&identifier=oai:quod.lib.umich.edu:MIU01-001300473, the leader is one character too short.  MarcEdit can fix these on the fly (at least it will try) by validating the length of the LDR and, if short, padding spaces to the end of the string.  Since length and directory are calculated algorithmically, the records will be valid, but some of the leader data may get offset due to the padding.  However, there isn’t a thing you can really do about that, outside of rejecting the records as invalid or accepting the data as is (which then poisons all the other records downloaded in the set).  I’m putting together some info for the folks at UM that includes some of the problems that I’ve run into working with their OAI data just in case they are interested.
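
The padding fix itself is tiny. Here is a minimal sketch, assuming the leader has already been pulled out of the record as a string; `fix_leader` is a hypothetical name, and the only hard fact used is that a MARC 21 leader is 24 characters long.

```python
LEADER_LENGTH = 24  # a MARC 21 leader is always 24 characters long


def fix_leader(leader):
    """Pad a short leader with trailing spaces up to the fixed length.

    Leaders that are already long enough are returned unchanged. Padding
    at the end cannot tell which position was actually missing, so some
    leader values may end up offset from where they belong.
    """
    if len(leader) >= LEADER_LENGTH:
        return leader
    return leader.ljust(LEADER_LENGTH)
```

A 23-character leader comes back with one space added at the end, which is enough to keep a strict parser from rejecting the record, even though it cannot restore the missing value.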

Anyway, one thing I thought I would do is post a set of these records, in MARC UTF8 and MARC8 character sets (harvested 20080126 around 1:30 am to 3:00 am) for folks interested in taking a look at the exposed metadata.  You will find that the vast majority of these records appear to be brief metadata records containing basically an author, title and URL, though full records are scattered through the record sets.  There are over 111000 records found in the six files.  The files in the zip are:

  1. mbooks-utf8 (combined data set)
  2. mbooks-marc8 (combined data set in marc8)
  3. pd-utf8 (international public domain books)
  4. pd-marc8 (international public domain books in marc8)
  5. pdus-utf8 (u.s. public domain books)
  6. pdus-marc8 (u.s. public domain books in marc8)

A quick note.  These are largish files.  MarcEdit has a preview mode specifically for this purpose.  Unless disabled, MarcEdit by default only loads the first 1 MB of data into the MarcEditor.  This will allow you to preview ~1000-1500 records, but using the editor tools, you can globally edit the entire data file.  This is done because reading data into the Editor is expensive (memory and time).  If you really want to open large files into the Editor, you need to make sure your virtual memory is set fairly high. 
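
To give a feel for why a preview is cheap, here is a rough sketch of reading only the start of a large file. It is not a description of MarcEdit's internals: `preview_records` is a hypothetical name, it assumes a binary MARC file (where every record ends with the 0x1D record terminator), and it takes "1 MB" to mean 1024 * 1024 bytes.

```python
RECORD_TERMINATOR = b"\x1d"  # ends every record in a binary MARC file
PREVIEW_BYTES = 1024 * 1024  # "first 1 MB", taken here as 2**20 bytes


def preview_records(path, limit=PREVIEW_BYTES):
    """Return the complete records found in the first `limit` bytes."""
    with open(path, "rb") as handle:
        chunk = handle.read(limit)  # never reads past the preview window
    pieces = chunk.split(RECORD_TERMINATOR)
    # The last piece is either empty or a record cut off by the limit,
    # so it is dropped; the terminator is put back on the rest.
    return [piece + RECORD_TERMINATOR for piece in pieces[:-1]]
```

Only the preview window is ever held in memory, which is the whole point: the cost stays the same whether the file holds a thousand records or a hundred thousand.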

So long as the folks at UM don’t ask me to take it down, I’ve posted these test files at: http://osulibrary.oregonstate.edu/techservices/marc/umich_books.zip for viewing and testing purposes (~62.7 MB), but I would recommend harvesting these records from http://quod.lib.umich.edu/cgi/o/oai/oai directly yourself if you want to use them since UM is adding new records all the time.  And remember, if you want to harvest them with MarcEdit, you’ll need to wait till I post the update on Sunday.

