Harvesting UMich OAI records with MarcEdit

By reeset / In MarcEdit, OAI

I’ve had a few folks ask what the procedure would be for a user wanting to harvest the UMich OAI records using MarcEdit.  Well, there are two workflows that can be followed depending on what you want to do.  You can harvest the OAI data and translate it directly to MARC, or you can harvest the raw data directly to your file system.  Here’s how each would work:

Generating MARC records from the OAI content:

  1. Start MarcEdit
  2. From the Main Screen, click on the Harvest OAI Records Link
  3. Once the link has been selected, you have a number of options available to control the harvesting.  Required options are those that are seen when the screen opens.  Advanced Settings, or optional settings, define additional options available to the user.  Here’s a screenshot of the Harvester with the Advanced Options expanded:
    The required elements that must be filled in are the Server Address (the address pointing to the OAI URL), metadata type (format to be downloaded) and Crosswalk Path.  If you select any of the predefined metadata types, the program will select the crosswalk path for you.  If you add your own, then you will need to point the program to the crosswalk path.  Set name is optional.  If you leave this value blank, the harvester will attempt to harvest all available sets on the defined server. 

    Advanced settings give the user a number of additional harvesting options, generally intended to help users control the flow of the harvest.  For example, users can harvest an individual record by entering the record’s identifier into the GetRecord textbox.  A user can resume a harvest by entering the resumptionToken into the ResumptionToken textbox.  If the user wants to harvest a subset of a specific data set, they can use a date limit (of course, you must use the date format supported by the server, generally yyyy or yyyy-mm-dd format).  Users can also determine whether they want their metadata translated into MARC8 (since the harvester assumes UTF8 for all XML data) and change the timeout settings the harvester uses for returning data (you generally shouldn’t change this).  Finally, for users who don’t want to harvest data into MARC but just need the raw data, there is the ability to tell the harvester to just harvest data to the local file system.  If this option is checked, the Crosswalk Path’s label and behavior will change, requiring the user to enter the path to the directory where the harvester should save the harvested files.

  4. For the UMich Digital Books, a user would want to utilize the following settings to harvest metadata into MARC:
    Users wanting to ensure that the MARC data is in MARC8 and not UTF8 format should check the Translate to MARC-8 option.  Once these settings have been set, a user will just need to click the OK button.  For this set (mbooks), there are more than 111,000 records, so harvesting will take an hour or so to complete, and longer if you ask the program to translate data into MARC8.
  5. When finished, users will be prompted with a status box indicating the number of records, the number of resumptionTokens and the last resumptionToken processed (and any error information if an error occurred during processing).
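
For anyone curious how the harvester options above map onto the OAI-PMH protocol itself, here is a minimal sketch of the request URLs involved.  This is not MarcEdit's code.  The endpoint `https://example.org/oai` is a placeholder for whatever you enter as the Server Address, and the `marc21` metadata prefix is an assumption (servers advertise their real prefixes through the ListMetadataFormats verb).  The verb and argument names come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the OAI URL you enter as the Server Address.
BASE_URL = "https://example.org/oai"


def list_records_url(base_url, metadata_prefix=None, set_name=None,
                     from_date=None, until_date=None, resumption_token=None):
    """Build a ListRecords request URL (illustrative helper, not a MarcEdit API)."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it is sent
        # alone with the verb when resuming a harvest.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        if not metadata_prefix:
            raise ValueError("metadata_prefix is required for a new harvest")
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_name:          # leaving the set out asks for every set
            params["set"] = set_name
        if from_date:         # date limits must use a format the server supports
            params["from"] = from_date
        if until_date:
            params["until"] = until_date
    return base_url + "?" + urlencode(params)


def get_record_url(base_url, identifier, metadata_prefix):
    """Build a GetRecord request URL for a single record identifier."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # "marc21" is an assumed prefix; "mbooks" is the set named in this post.
    print(list_records_url(BASE_URL, "marc21", set_name="mbooks"))
    print(list_records_url(BASE_URL, resumption_token="some-token"))
```

The split between a fresh request and a resumed one mirrors the ResumptionToken textbox: once you have a token, the server already knows the set, format and date range, so nothing else is sent.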


Harvesting OAI records directly to the filesystem

  1. Start up MarcEdit
  2. Select the Harvest OAI Records link
  3. Enter the following information (Server folder location will obviously vary):
  4. Files are harvested into the defined directory, numbered numerically according to the resumption token processed.  Again, when processing is finished, a summary window will be generated to inform the user of harvest status and error information related to the harvest.
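
If you want to do the same kind of raw harvest with a script, the loop is roughly as follows.  This is a sketch under assumptions, not a description of MarcEdit's internals: the `1.xml`, `2.xml` file naming, the 60-second timeout, and the function names are all mine, and the `fetch` parameter exists only so the loop can be tried out without touching a real server.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetch(url):
    """Download one response; the 60 second timeout is an arbitrary choice."""
    with urlopen(url, timeout=60) as response:
        return response.read()


def harvest_to_directory(base_url, metadata_prefix, out_dir,
                         set_name=None, fetch=http_fetch):
    """Save each ListRecords response as 1.xml, 2.xml, ... in out_dir.

    Returns the list of file paths written.  The naming scheme is an
    assumption made for this sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_name:
        params["set"] = set_name

    saved = []
    page = 0
    while True:
        page += 1
        body = fetch(base_url + "?" + urlencode(params))
        path = os.path.join(out_dir, f"{page}.xml")
        with open(path, "wb") as handle:
            handle.write(body)
        saved.append(path)

        # A missing or empty resumptionToken marks the final response.
        token = ET.fromstring(body).find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords",
                  "resumptionToken": token.text.strip()}
    return saved
```

A call such as `harvest_to_directory("https://example.org/oai", "marc21", "harvest", set_name="mbooks")` would write one file per response; both the URL and the `marc21` prefix there are placeholders to replace with the server's real values.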

Errors related to the UMich Harvest that could be encountered:

My guess is that you wouldn’t see these if you are using the most current version of MarcEdit (uploaded 2008-01-27); however, you may run into them if harvesting using other tools or older versions of MarcEdit.

  1. Server Timeout:  When harvesting all records, I was routinely seeing the server reset its connection after harvesting 10-18 resumptionTokens.  The current version of MarcEdit has some failover code that will reinitiate the harvest under these conditions, stopping after 3 failed attempts.
  2. Invalid MARC data:  Within the 111000+ records, there are approximately 40-60+ MARC records that have too few characters represented in the MARC leader element.  This is problematic because this error will invalidate the record and, depending on how the MARC parser handles records, can poison the remainder of the file.  MarcEdit accommodates these errors by auto-correcting the leader values, but this could be a problem with other tools.
  3. Invalid date format:  An error message will be generated if you set the start and end elements using an invalid date format.  You should always check with the OAI server to see what date formats are supported by the server.  In this case, the date format expected by the UM OAI server is as follows:
    <repositoryName>University of Michigan Library Repository</repositoryName> 

    Notice the granularity element — this tells me that any of the following formats would be valid:

Anyway, that’s pretty much it.  If you are just interested in seeing what type of data the UM is exposing with these data elements, you can find that data (harvested 2008-01-25) at: umich_books.zip (~63 MB).
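
For those harvesting with other tools, the first two errors listed above can be guarded against with a couple of small helpers.  What follows is a rough sketch and not how MarcEdit does it: the function names and the 5-second delay are assumptions, the three-attempt limit simply echoes the behavior described above, and the leader check only reports short leaders (a MARC 21 leader is always 24 characters) rather than trying to repair them.

```python
import time
import xml.etree.ElementTree as ET

# Namespace of MARCXML records (MARC 21 "slim" schema).
MARCXML_NS = "{http://www.loc.gov/MARC21/slim}"
LEADER_LENGTH = 24  # a MARC 21 leader is always 24 characters long


def fetch_with_retries(fetch, url, max_attempts=3, delay_seconds=5.0):
    """Call fetch(url), retrying on connection errors up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError as error:
            # Connection resets, socket timeouts and urllib's URLError
            # are all OSError subclasses.
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise last_error


def short_leader_records(marcxml_bytes):
    """Return (position, leader) for each record whose leader is not 24 characters."""
    root = ET.fromstring(marcxml_bytes)
    problems = []
    records = root.iter(f"{MARCXML_NS}record")
    for position, record in enumerate(records, start=1):
        leader = record.findtext(f"{MARCXML_NS}leader", default="")
        if len(leader) != LEADER_LENGTH:
            problems.append((position, leader))
    return problems
```

Running `short_leader_records` over each harvested file before conversion gives you a list of the records to inspect by hand, which is safer than letting one bad leader take the rest of the file with it.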




2 thoughts on “Harvesting UMich OAI records with MarcEdit”

  1. Hi Terry, I am librarian working in HKU. I would like to harvest the hathi (mbooks) record into normal MARC format for further editing in MarcEdit (like those files in umich_books.zip).

    But I am just confused which “Metadata type” and “Crosswalk path” to choose in MarcEdit. Is the suitable Crosswalk already built-in in MarcEdit?

    It will be very helpful if you can give me some advice! Thanks so much!

    Alan Ng
    Library Systems Analyst
    HKU Libraries
    Feb 27

  2. So, the metadata type just identifies what type of data that you’ll be expecting from the OAI server. For UMich, the type should be MARCXML. When you set that, MarcEdit should auto select the OAIMARCXML2Mnemonic (or something like that) Xsl file (so yes, the necessary stylesheets are built into the program).