Harvesting UMich OAI records with MarcEdit

By reeset / On / In MarcEdit, OAI

I’ve had a few folks ask what the procedure would be for a user wanting to harvest the UMich OAI records using MarcEdit.  Well, there are two workflows that can be followed depending on what you want to do.  You can harvest the OAI data and translate it directly to MARC, or you can harvest the raw data directly to your file system.  Here’s how each would work:

Generating MARC records from the OAI content:

  1. Start MarcEdit
  2. From the Main Screen, click on the Harvest OAI Records Link
  3. Once the link has been selected, you have a number of options available to you to control the harvesting.  Required options are those that are seen when the screen opens.  Advanced Settings (the optional settings) define additional options available to the user.  Here’s a screenshot of the Harvester with the Advanced Options expanded:
    The required elements that must be filled in are the Server Address (the address pointing to the OAI URL), metadata type (format to be downloaded) and Crosswalk Path.  If you select any of the predefined metadata types, the program will select the crosswalk path for you.  If you add your own, then you will need to point the program to the crosswalk path.  Set name is optional.  If you leave this value blank, the harvester will attempt to harvest all available sets on the defined server. 

    Advanced settings give the user a number of additional harvesting options, generally set aside to help users control the flow of the harvest.  For example, users can harvest an individual record by entering the record’s identifier into the GetRecord textbox.  A user can resume a harvest by entering the resumptionToken into the ResumptionToken textbox.  If the user wants to harvest a subset of a specific data set, they can use a date limit (of course, you must use the date format supported by the server, generally yyyy or yyyy-mm-dd).  Users can also choose whether their metadata is translated into MARC8 (since the harvester assumes UTF8 for all XML data) and change the timeout settings the harvester uses for returning data (you generally shouldn’t change this).  Finally, for users who don’t want to harvest data into MARC but just need the raw data, there is the ability to tell the harvester to just harvest data to the local file system.  If this option is checked, the Crosswalk Path’s label and behavior will change, requiring the user to enter a path to the directory where the harvester should save the harvested files.

  4. For the UMich Digital Books, a user would want to utilize the following settings to harvest metadata into MARC:
    Users wanting to ensure that the MARC data is in MARC8 and not UTF8 format should check the Translate to MARC-8 option.  Once these settings have been set, a user will just need to click the OK button.  This set (mbooks) contains more than 111,000 records, so harvesting will take approximately an hour to complete, and longer if you ask the program to translate data into MARC8.
  5. When finished, users will be prompted with a status box indicating the number of records, the number of resumption tokens, and the last resumption token processed (and any error information if an error occurred during processing).
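
For anyone scripting against the server instead of using the MarcEdit screen, the options above map fairly directly onto OAI-PMH request parameters.  What follows is only a rough sketch, not MarcEdit’s own code: the helper name build_list_records_url is hypothetical, while the base URL, the marc21 metadata prefix and the mbooks set name are the ones mentioned in these posts.  One OAI-PMH detail worth noting is that a resumptionToken is sent by itself alongside the verb, without the other arguments.

```python
from urllib.parse import urlencode

# OAI base URL for the UMich repository, as given in this post.
UMICH_OAI = "http://quod.lib.umich.edu/cgi/o/oai/oai"


def build_list_records_url(base_url, metadata_prefix=None, set_spec=None,
                           from_date=None, until_date=None,
                           resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper, for illustration)."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # In OAI-PMH the resumptionToken is an exclusive argument:
        # it replaces metadataPrefix, set, from and until.
        params["resumptionToken"] = resumption_token
    else:
        if not metadata_prefix:
            raise ValueError("metadata_prefix is required without a token")
        params["metadataPrefix"] = metadata_prefix
        if set_spec:
            params["set"] = set_spec
        # Date limits must use a format the server supports.
        if from_date:
            params["from"] = from_date
        if until_date:
            params["until"] = until_date
    return base_url + "?" + urlencode(params)
```

Calling build_list_records_url(UMICH_OAI, "marc21", set_spec="mbooks") gives the first request of a harvest; each later request passes only the resumption token returned by the previous response.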


Harvesting OAI records directly to the filesystem

  1. Start up MarcEdit
  2. Select Harvest OAI records link
  3. Enter the following information (Server folder location will obviously vary):
  4. Files are harvested into the defined directory, numbered sequentially according to the resumption token processed.  Again, when processing is finished, a summary window will be generated to inform the user of harvest status and any error information related to the harvest.
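
If you wanted to reproduce this raw-harvest behavior in a script, the two pieces you need are a way to pull the resumptionToken out of each response and a way to name the saved files in order.  The sketch below shows one way to do that with Python’s standard library.  The function names and the harvest_00001.xml naming pattern are my own assumptions for illustration; they are not what MarcEdit uses internally.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def extract_resumption_token(response_xml):
    """Return the resumptionToken in a response, or None when the list is done."""
    root = ET.fromstring(response_xml)
    element = root.find(f".//{OAI_NS}resumptionToken")
    if element is None:
        return None
    # An empty resumptionToken element marks the final page of results.
    token = (element.text or "").strip()
    return token or None


def numbered_filename(sequence, prefix="harvest"):
    """Name the file for the Nth response (hypothetical naming scheme)."""
    return f"{prefix}_{sequence:05d}.xml"
```

A harvest loop would save each response under numbered_filename(n), then keep requesting pages until extract_resumption_token returns None.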

Errors related to the UMich Harvest that could be encountered:

My guess is that you wouldn’t see these if you are using the most current version of MarcEdit (uploaded 2008-01-27); however, you may run into them if harvesting using other tools or older versions of MarcEdit.

  1. Server Timeout:  When harvesting all records, I was routinely seeing the server reset its connection after harvesting 10-18 resumption tokens.  The current version of MarcEdit has some fall-over code that will reinitiate the harvest under these conditions, stopping after 3 failed attempts.
  2. Invalid MARC data:  Within the 111,000+ records, there are approximately 40-60+ MARC records that have too few characters in the MARC leader element.  This is problematic because the error invalidates the record and, depending on how the MARC parser handles records, can poison the remainder of the file.  MarcEdit accommodates these errors by auto-correcting the leader values, but this could be a problem with other tools.
  3. Invalid date format:
    This error message will be generated if you set the start and end elements using an invalid date format.  You should always check with the OAI server to see what date formats it supports.  In this case, the date format expected by the UM OAI server is as follows:
    <repositoryName>University of Michigan Library Repository</repositoryName> 

    Notice the granularity element — this tells me that any of the following formats would be valid:

Anyway, that’s pretty much it.  If you are just interested in seeing what type of data the UM is exposing with these data elements, you can find that data (harvested 2008-01-25) at: umich_books.zip (~63 MB).




MARC21 University of Michigan Google Digital Books Records (records for testing/viewing)

By reeset / On / In MarcEdit, OAI

I was playing with MarcEdit’s OAI harvester, making a few changes to fix a problem that had been discovered, as well as add some fall-over code that allows the harvester to continue processing (or at least, attempt to continue processing) when the OAI server breaks the connection (generally through timeout).  To test, I decided to work with the UMichigan Google Books sets of records Michigan recently made available.  It’s a large set and is one of those servers where the server timeout had been identified as an issue (i.e., this came up because a MarcEdit user had inquired about a problem they were having harvesting data). 

Anyway, I’ll likely post the update to the OAI harvesting code on Sunday or so (which will also include an update to the CJK processing component when going from MARC8-UTF8 — particularly when the record sets contain badly encoded data), and with it, I’ll likely include a small tutorial for users wanting to use MarcEdit to do one of the following:

  1. Harvest the UM digital book records from OAI directly into MARC21 (saving the character set in either legacy MARC8 or UTF8 format)
  2. Harvest the raw UM digital book metadata records via OAI (without the MARC conversion)

While I think that the Harvester is fairly straightforward to use, I’m going to post some instructions, in part so that I can highlight some of the common error messages that one might see and what they mean.  For example, with the UM harvesting, I found that the OAI server tended to time out after approximately 15 queries using a persistent connection.  When it stopped, the server would throw a 503 error.  I was able to overcome the issue by simply adding some code to the application to track failures, pause harvesting, and restart the connection to the server.  These types of errors are not easy for most users to debug, though, since they are not sure if the issue lies with the harvesting software or the server being harvested.
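
To give a feel for the track-failures, pause, and restart approach, here is a minimal sketch in Python.  It is not the code that went into MarcEdit.  The function names are hypothetical and the 30-second pause is an assumption on my part; the limit of three attempts mirrors the behavior described for the harvester.

```python
import time
import urllib.request


def fetch_url(url, timeout=30):
    """Fetch one OAI response body as bytes, on a fresh connection."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retry(url, fetch=fetch_url, max_attempts=3,
                     pause_seconds=30, sleep=time.sleep):
    """Try a request up to max_attempts times, pausing between failures."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError as error:
            # HTTP errors such as 503 (urllib.error.HTTPError), connection
            # resets and timeouts are all OSError subclasses.
            last_error = error
            if attempt < max_attempts:
                sleep(pause_seconds)  # give the server a moment to recover
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Passing fetch and sleep in as arguments is just a convenience so the retry logic can be exercised without a live server.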

Another problem that I’ve coded MarcEdit to fix on the fly is that a handful of MARC21 records (I believe I identified approximately 40 of 111,000+) sent via OAI have invalid leader statements (i.e., not enough characters in the string).  For example, in this record: http://quod.lib.umich.edu/cgi/o/oai/oai?verb=GetRecord&metadataPrefix=marc21&identifier=oai:quod.lib.umich.edu:MIU01-001300473, the leader is one character too short.  MarcEdit can fix these on the fly (at least it will try) by validating the length of the LDR and, if it is short, padding spaces onto the end of the string.  Since the record length and directory are calculated algorithmically, the records will be valid, but some of the leader data may end up offset due to the padding.  However, there isn’t really anything you can do about that, outside of rejecting the records as invalid or accepting the data as is (which then poisons all the other records downloaded in the set).  I’m putting together some info for the folks at UM that includes some of the problems that I’ve run into working with their OAI data, just in case they are interested.
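
The padding fix itself is simple enough to sketch.  A MARC 21 leader is always 24 characters long, so a short one can be right-padded with spaces.  The function below is a hypothetical illustration of that idea, not MarcEdit’s implementation.

```python
LEADER_LENGTH = 24  # a MARC 21 leader is always 24 characters


def pad_leader(leader):
    """Right-pad a short leader with spaces; return valid leaders unchanged."""
    if len(leader) > LEADER_LENGTH:
        raise ValueError("leader is longer than 24 characters")
    # Padding goes on the end, so if the missing character was somewhere in
    # the middle, the positions after it stay shifted by one.
    return leader.ljust(LEADER_LENGTH)
```

The comment in the code is the catch mentioned above: the result has the right length, but the individual leader positions are only trustworthy if the missing character really was at the end.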

Anyway, one thing I thought I would do is post a set of these records, in MARC UTF8 and MARC8 character sets (harvested 2008-01-26, around 1:30 am to 3:00 am), for folks interested in taking a look at the exposed metadata.  You will find that the vast majority of these records appear to be brief metadata records containing basically an author, title and URL, though full records are scattered through the record sets.  There are over 111,000 records in the six files.  The files in the zip are:

  1. mbooks-utf8 (combined data set)
  2. mbooks-marc8 (combined data set in marc8)
  3. pd-utf8 (international public domain books)
  4. pd-marc8 (international public domain books in marc8)
  5. pdus-utf8 (u.s. public domain books)
  6. pdus-marc8 (u.s. public domain books in marc8)

A quick note.  These are largish files.  MarcEdit has a preview mode specifically for this purpose.  Unless disabled, MarcEdit by default only loads the first 1 MB of data into the MarcEditor.  This will allow you to preview ~1000-1500 records, but using the editor tools, you can globally edit the entire data file.  This is done because reading data into the Editor is expensive (memory and time).  If you really want to open large files into the Editor, you need to make sure your virtual memory is set fairly high. 
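
If you would rather peek at one of these files from a script, the same preview idea can be approximated by reading only the first 1 MB and keeping the records that arrived whole.  This is a sketch of the general technique, not of how MarcEdit’s preview mode works internally, and the function names are made up for the example.  It relies on the fact that every record in a binary MARC file ends with the record terminator byte 0x1D.

```python
RECORD_TERMINATOR = b"\x1d"  # ends every record in a binary MARC file
PREVIEW_BYTES = 1024 * 1024  # 1 MB, matching the preview size described above


def complete_records(chunk):
    """Split a chunk of MARC data into whole records, dropping a cut-off tail."""
    end = chunk.rfind(RECORD_TERMINATOR)
    if end == -1:
        return []  # not even one complete record in this chunk
    return [record + RECORD_TERMINATOR
            for record in chunk[:end].split(RECORD_TERMINATOR)]


def preview_file(path, limit=PREVIEW_BYTES):
    """Read at most `limit` bytes of a MARC file and return the whole records."""
    with open(path, "rb") as handle:
        return complete_records(handle.read(limit))
```

Only the first megabyte is ever read, which keeps memory use flat no matter how large the file is.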

So long as the folks at UM don’t ask me to take it down, I’ve posted these test files at: http://osulibrary.oregonstate.edu/techservices/marc/umich_books.zip for viewing and testing purposes (~62.7 MB), but I would recommend harvesting these records from http://quod.lib.umich.edu/cgi/o/oai/oai directly yourself if you want to use them since UM is adding new records all the time.  And remember, if you want to harvest them with MarcEdit, you’ll need to wait till I post the update on Sunday.



Securing licenses

By reeset / On / In Uncategorized

Kind of off topic, but I had to make a trip to the DMV this morning.  I’ve been finding that airports, rental car agencies, etc. are having a difficult time accepting my current license because much of the information on it has been smeared or damaged by water (I have an older license that was renewed about 8 years ago).  Anyway, this came to a head when I was at ALA in Philly and the airport folks were not thrilled about letting me pass through security with my current license.

Anyway, I was curious as to what I’d need to bring to renew my license.  Oregon is one of the many states that has recently moved to make getting a license more difficult.  So, I brought my passport and a few other things and found getting my license renewed to be easy enough.  However, after getting my picture taken and info updated, I didn’t get a license.  I got a piece of paper that is noted as an interim license.  Given the current frenzy around securing state licenses, it seems odd that in Oregon an interim license is simply a slip of paper that has no security information at all and is easily reproducible (and modifiable) with a scanner and laser printer.  This certainly makes me feel much more secure.


atscap and pchdtvr GPL revoked or can it be

By reeset / On / In General Computing

I’ve never used this package (apparently it’s used for HDTV scheduling/recording on Linux), but this link on Slashdot caught my eye: http://sourceforge.net/developer/diary.php?diary_id=26407&diary_user=147583.  Apparently, the developer of this software package is seeking to revoke the GPL license not just for his current code, but for his past code/packages as well.  I have a difficult time believing that this is possible, but I’m sure we will soon find out.  My guess is that this guy is productizing his software and has a good idea who is currently using, selling and distributing his source, so there will likely be some kind of legal challenge to the GPL as well.  It’s always interesting to see how these kinds of things play out in the U.S. courts, which can sometimes be a little schizophrenic, though I’d have a difficult time believing that this type of retroactive license change is actually possible.



Is IT becoming too disposable?

By reeset / On / In General Computing

This is something that came up when I was expanding my thoughts from one of my “non-lita tech trends” earlier this morning, and the more I’ve thought about it, the more I’m finding it weighing on my mind.  I’m wondering if we are making our hardware too disposable in the name of convenience.  This comes from my conversations about low-budget, ultra-portable systems and from thinking about Apple’s new MacBook Air, a computer that comes without a replaceable battery and with limited upgradability.  I’ll admit, I’m a little bit of a pack rat.  I’ve either kept or found homes for every computer I’ve ever owned.  In fact, it was only recently that I upgraded our 8-year-old desktop at home to something newer and zippier (relegating the old machine to file server status).  When things break, I like to fix them.  When things slow down, I take them apart and upgrade the components.  I do this for a number of reasons, one being that I do like to encourage an environmentally friendly lifestyle (more or less).  I drive very little, we recycle fanatically, and we try to buy local, but I’m having a hard time reconciling this lifestyle with the gadgets that I’ve come to know and love.  One of the problems, as I’m seeing it, is that many of these low-budget machines (or in the MacBook Air’s case, premium-priced machines) are making hardware much more throwaway than it ever was before.  If I have a $200 desktop (or notebook for that matter) and something breaks, do I fix it?  If it’s a year old, probably not, since the cost to fix it will likely be close to the cost to replace it.  So the computer is landfilled (as most computers are, even though most companies offer recycling programs) and the process repeats.  Even Apple’s MacBook Air seems to be built to encourage a rapid replacement cycle.  Low expandability, no battery replacement, an underpowered processor: while these machines are sleek and stylish, I wonder if they too won’t become high-end disposable products.

In a time when green computing seems to be gaining traction everywhere, the current disposable PC trend seems to fly in its face.  And I’m no better in this regard.  I too would like an ultra-portable device and am in the group looking for something on the higher end of the scale (I want something that will perform better than a PDA), and there’s the dilemma.  This class of machines simply is disposable by default due to the nature of the beast.  Keep size down, keep price down, and performance suffers.  When performance suffers, performance lust sets in and the cycle repeats.  A great cycle for investors, maybe, but not for those wanting to live a little greener.

Anyway, random thoughts for a Thursday.



Bring on the Penn State Kitty cats

By reeset / On / In Family, Travel

So, it’s official (as of last week, I believe): Oregon State University’s football team will be traveling to Happy Valley to play the Lions, and I’ll be there.  My wife has given me the OK, so I’ve got the plane tickets, a hotel room in Altoona, PA and a rental car.  Now all I need are tickets to the game. :)  But how hard can it be to get tickets to a stadium that holds over 100,000 people, right?  No, I’m asking.  🙂


My non-LITA top tech trends

By reeset / On / In Digital Libraries, General Computing

(Note, I started this post last night, but had to put it away so I could get some rest before a 6 am flight.  I finished the remainder of this while waiting for my flight). 

So, after getting up way too early this morning, I staggered my way down to the LITA Top Tech Trends discussion.  Unfortunately, it seemed like a number of other folks did the same thing, so I only ended up hanging out for a little bit.  I just don’t have the stamina in the morning to live through cramped quarters, poor broadband and no caffeine.  I get enough of that when I fly (which I get to do tomorrow).  Fortunately, a number of folks who had been asked to provide tech trends have begun (or have been) posting their lists, and some folks who braved the early morning hours have started blogging their responses (here).  I personally wasn’t asked to provide my list of tech trends, but I’m going to anyway, as well as comment on a few of the trends either posted or discussed during the meeting.  Remember, this is just one nut’s list, so take it for what it is.

  1. Ultra-light and small PCs (Referenced from Karen Coombs)
    Karen is one of a number of folks who have taken note of the wide range of low-cost computers currently being made available to the general public.  These machines, which run between $189 and $400, are low-cost, portable machines that have the potential to bring computers to a wider audience.  I’ll have to admit, I’m personally not sold on these machines, in part because of the customer base that they are aiming for.  The makers of machines such as the Eee PC note that they are primarily targeted at users looking for a portable second machine and at kids and the elderly looking for a machine simply to surf the web.  A look at the specifications for many of these low-cost machines shows Celeron-class processors with 512 MB of RAM and poor graphics processing.  Is this good enough for browsing the web?  I’d argue no.  The current and future web is a rich environment, built on CSS, XML, XSLT, Flash, Java, etc.  I think what people seem to forget is that this rich content takes a number of resources simply to view.  Case in point: I set up a copy of CentOS on a 1.2 GHz Centrino with 512 MB RAM and a generic graphics card (8 MB of shared memory), and while I could use this machine to browse the web and do office work with OpenOffice, I certainly wouldn’t want to.  Just running the Linux shell was painful, web browsing was clunky, and office work was basically unusable; the software essentially surpassed the machine’s capabilities right out of the box.  Is this the type of resource I’d want to be lending to my patrons?  Probably not, since I wouldn’t want my patrons to associate my library’s technical expertise with sub-standard resources.  Does this mean that ultra-portables will not be in vogue this year and the next?  Well, I didn’t say that.  A look at the success the iPhone is having (a pocket PC retailing for close to $1500 without a contract) seems to indicate that users want, and are willing to pay a premium price for, portability, so long as that portability doesn’t come at too high of a price.
  2. Branding outside services as our own (and branding in general)
    There was a little bit of talk about this: the idea of moving specific services outside the library to services like Google or Amazon and, essentially, rebranding them.  This makes some sense.  However, I always cringe when we start talking about branding and how to make the library more visible.  From my perspective, the library is already too visible, i.e., intrusive into our users’ lives.  Libraries want to be noticed, and we want our patrons and organizations to see where the library gives them value.  It’s a necessary evil in times when competition for budget dollars is high.  However, I think it does our users a disservice.  Personally, I’d like to see the library become less visible, providing users direct access to information without the need to have the library’s fingerprints all over the process.  We can make services that are transparent (or mostly transparent), and we should.

    The same thing goes for our vendors.  I’ll use III as an example only because we are an Innovative library, so I’m more familiar with their software.  By all rights, Encore is a serviceable product that will likely make III a lot of money.  However, in the public instances currently available (Michigan State, Nashville Public Library), the III branding is actually larger than that of the library (if the library branding shows up at all).  And this is in no way unique to III.  Do patrons care what software is being used?  I doubt it.  Should they care?  No.  They should simply be concerned that it works, and works in a way that doesn’t get in their way.  From my perspective, branding is just one more thing that gets in the way.

  3. Collections as services will change the way libraries do collection development
    I’m surprised that we don’t hear more about this, but I’m honestly of the opinion that metadata portability and the ability for libraries to build their collections as web services will change the way libraries do collection development.  In the past, collection development was focused primarily on what could be physically or digitally acquired.  However, as more organizations move content online (particularly primary resources), libraries will be able to shift from an acquisitions model to a services model.  Protocols like OAI-PMH make it possible (and relatively simple) for libraries to actively “collect” content from their peer institutions in ways that were never possible in the past.
  4. Increased move to outside library IT and increased love for hosted services (whether we want them or not)
    While it has taken a great deal of time, I think it is fair to say that libraries are more open to the idea of using Open Source software than ever before.  In the short term, this has been a boon for library IT departments, which have seen an investment in hardware and programmer support.  I think this investment in programming support will be short-lived.  In some respects, I see libraries going through their own version of the .COM boom (just without all the money).  Open Source is suddenly in vogue.  Sexy programs like Evergreen have made a great deal of noise and inroads into a very traditionally vendor-oriented community.  People are excited, and that excitement is being made manifest by the growing number of software development positions being offered within libraries.  However, at some point, I see the bubble bursting.  And why?  Because most libraries will come to realize that either 1) having a programmer on staff is prohibitively expensive or 2) the library will be bled dry by what I’ve heard Kyle Banerjee call vampire services.  What is a vampire service?  A vampire service is a service that consumes a disproportionate amount of resources but will not die (generally for political reasons).  One of the dangers for libraries employing developers is the inclination to develop services as part of a grant or grandiose vision that eventually become vampire services.  They bleed an organization dry and build a culture that is distrustful of all in-house development (see our current caution in looking at open source ILS systems.  It wasn’t too long ago that a number of institutions used locally developed [or open] ILS systems, and the pain associated with those early products still affects our opinions of non-vendor ILS software today).

    But here’s the good news.  Will all software development positions within the library go away?  No.  In fact, I’d like to think that as positions within individual organizations become more scarce, consortia will step into this vacated space.  Like many of our other services moving to a network level, I think that the centralization of library development efforts would be a very positive outcome, in that it would help to increase collaboration between organizations and reduce the number of projects that are all trying to re-invent the same wheel.  I think of our own consortium in Oregon and Washington, Summit, and the dynamic organization it could become if only the institutions within it would be willing to give over some of their autonomy and funding to create a research and development branch within the consortium.  Much of the current development work (not all) could be moved up to the consortium level, allowing more members to directly benefit from the work done.

    At the same time, I see the increase of hosted services on the horizon.  I think that folks like LibLime really get it.  Their hosted services for small to medium size libraries presumably reduce LibLime’s costs to manage and maintain the software and free those hosted libraries from the need to worry about hardware and support issues.  When you look at the future of open source in libraries, I think that this is it.  For every one organization willing to run open source within their library, there will be 5 others that will only be able to feasibly support that infrastructure if it is outsourced as a hosted service.  We will see a number of open source projects move in this direction.  Hosted services for DSpace, Fedora, metasearch, the ILS: these will all continue to emerge and grow throughout this year and into the next 5 years.  And we will see the vendor space start to react to this phenomenon as well.  A number of vendors, like III, already provide hosted services.  However, I see them making a much more aggressive push to compel their users (higher licensing, etc.) to move to a hosted service model.

  5. OCLC will continue down the path to becoming just another vendor
    I’d like nothing more than to be wrong, but I don’t think I am.  Whether it’s this year, the next or the year after that, OCLC will continue to alienate its member institutions, eventually losing the privileged status libraries have granted it throughout the years and becoming just another vendor (though a powerful one).  Over the last two years, we’ve seen a lot of happenings come from Dublin, Ohio.  There was the merger with RLG, the hiring of many talented librarians, WorldCat.org, WorldCat Local and OCLC’s newest initiatives circulating around their grid services.  OCLC is amassing a great deal of capital (money, data, members) and I think we will see how they intend to leverage this capital this year and the next.  Now, how they leverage this capital will go a long way toward deciding what type of company OCLC will be from here forward.  Already, grumblings are being heard within the library development community as OCLC continues to build new revenue streams from web services made possible only through the contribution of metadata records from member libraries.  As this process continues, I think you will continue to hear grumblings from libraries who believe that these services should be made freely available to members, since it was member dollars and time that provided OCLC exclusively with the data necessary to develop these services.  **Sidebar, this is something that we shouldn’t overlook.  If your library is an OCLC member, you should be paying close attention to how OCLC develops their grid services.  Remember, OCLC is supposed to be a member-driven organization.  It’s your organization.  Hold it accountable and make your voice heard when it comes to how these services are implemented.  Remember, OCLC only exists through the cooperative efforts of both OCLC and the thousands of member libraries that contribute metadata to the database.**  Unfortunately, I’m not sure what OCLC could do at this point to retain this position of privilege.  Already, too many people that I talk to see OCLC as just another vendor that doesn’t necessarily have the best interests of the library community at heart.  I’d like to think that they are wrong, and that OCLC still remains an organization dedicated to furthering libraries and not just OCLC.  But at this point, I’m not sure we know (or they know).  What we do know is that there are a number of dedicated individuals who came to OCLC because they wanted to help move libraries forward.  Let’s hope OCLC will continue to let them do so.  And we watch, and wait.

Anyway, that’s my list of trends.


