I was playing with MarcEdit’s OAI harvester, making a few changes to fix a problem that had been discovered, as well as adding some fail-over code that allows the harvester to continue processing (or at least, attempt to continue processing) when the OAI server breaks the connection (generally through a timeout). To test, I decided to work with the Google Books record sets that the University of Michigan recently made available. It’s a large set, and it’s one of those servers where the timeout had been identified as an issue (i.e., this came up because a MarcEdit user had asked about a problem they were having harvesting data).
Anyway, I’ll likely post the update to the OAI harvesting code on Sunday or so (which will also include an update to the CJK processing component when going from MARC8-UTF8 — particularly when the record sets contain badly encoded data), and with it, I’ll likely include a small tutorial for users wanting to use MarcEdit to do one of the following:
- Harvest the UM digital book records from OAI directly into MARC21 (saving the character set in either legacy MARC8 or UTF8 format)
- Harvest the raw UM digital book metadata records via OAI (without the MARC conversion)
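For anyone who wants to script a harvest by hand rather than use MarcEdit, here is a rough Python sketch of a standard OAI-PMH ListRecords loop. It is not MarcEdit's code. The function names are made up for illustration, and the `fetch` argument is a placeholder for whatever HTTP call you prefer. The `marc21` prefix comes from the UM GetRecord URL quoted later in this post; `oai_dc` is the prefix the OAI-PMH specification requires every repository to support, so it is a reasonable guess for pulling raw metadata.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # A resumptionToken is sent on its own, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken in a response, or None when the list is done."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


def harvest(base_url, metadata_prefix, fetch):
    """Yield each page of XML; `fetch` is any callable that maps a URL to XML text."""
    token = None
    while True:
        page = fetch(list_records_url(base_url, metadata_prefix, token))
        yield page
        token = next_token(page)
        if token is None:
            break
```

Each page that comes back still has to be unpacked (and, for MARC21, converted from MARCXML), which is the part MarcEdit handles for you.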
While I think that the Harvester is fairly straightforward to use, I’m going to post some instructions, in part so that I can highlight some of the common error messages that one might see and what they mean. For example, with the UM harvesting, I found that the OAI server tended to time out after approximately 15 queries over a persistent connection. When it stopped, the server returned a 503 error. I was able to overcome the issue by adding some code to the application that tracks failures, pauses harvesting, and then restarts the connection to the server. Still, these types of errors are not easy for most users to debug, since they can’t tell whether the issue lies with the harvesting software or with the server being harvested.
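The pause-and-restart idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the code that went into MarcEdit; the function names, the limit of 5 failures, and the 30-second pause are all assumptions chosen for the example.

```python
import time
import urllib.error
import urllib.request


def http_get(url):
    """Open a fresh connection for every request and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def fetch_with_retry(url, fetch=http_get, max_failures=5,
                     pause_seconds=30, sleep=time.sleep):
    """Retry a request after a 503 or a dropped connection (hypothetical helper).

    max_failures and pause_seconds are illustrative values, not settings
    taken from MarcEdit or recommended by the UM server.
    """
    failures = 0
    while True:
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            if err.code != 503:
                raise  # other HTTP errors are not treated as timeouts
            failures += 1
            if failures >= max_failures:
                raise
        except ConnectionError:
            failures += 1
            if failures >= max_failures:
                raise
        # Give the server a rest, then try again on a new connection.
        sleep(pause_seconds)
```

A harvester built this way gives up only after several consecutive failures on the same request, which makes it easier to tell a server that is briefly throttling from one that is really down.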
Another problem that I’ve coded MarcEdit to fix on the fly is that a handful of the MARC21 records sent via OAI (I believe I identified roughly 40 of 111,000+) have invalid leader statements (i.e., not enough characters in the string). For example, in this record: http://quod.lib.umich.edu/cgi/o/oai/oai?verb=GetRecord&metadataPrefix=marc21&identifier=oai:quod.lib.umich.edu:MIU01-001300473, the leader is one character too short. MarcEdit can fix these on the fly (at least it will try) by validating the length of the LDR and, if it is short, padding the end of the string with spaces. Since the record length and directory are calculated algorithmically, the records will be valid, but some of the leader data may end up offset because of the padding. However, there isn’t really anything you can do about that, short of rejecting the records as invalid or accepting the data as is (which then poisons all the other records downloaded in the set). I’m putting together some info for the folks at UM that includes some of the problems that I’ve run into working with their OAI data, just in case they are interested.
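The padding step itself is tiny. Here is a minimal Python sketch of it, assuming only that a MARC21 leader is always 24 characters long; the function name and the sample leader are made up for illustration, and this is not MarcEdit's actual code.

```python
LEADER_LENGTH = 24  # a MARC21 leader is always 24 characters


def pad_leader(leader):
    """Pad a short leader with trailing spaces; leave other leaders alone."""
    if len(leader) < LEADER_LENGTH:
        return leader.ljust(LEADER_LENGTH)
    return leader


# A made-up leader that is one character short (23 characters).
short_leader = "00000nam a22000001a 450"
fixed = pad_leader(short_leader)
print(len(short_leader), len(fixed))
```

Note that the space always goes on the end. If the missing character actually belonged somewhere in the middle of the leader, everything after that position stays shifted by one, which is the offset problem described above.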
Anyway, one thing I thought I would do is post a set of these records, in both the MARC UTF8 and MARC8 character sets (harvested 20080126 around 1:30 am to 3:00 am), for folks interested in taking a look at the exposed metadata. You will find that the vast majority of these records appear to be brief metadata records containing basically an author, title and URL, though full records are scattered through the record sets. There are over 111,000 records in the six files. The files in the zip are:
- mbooks-utf8 (combined data set)
- mbooks-marc8 (combined data set in marc8)
- pd-utf8 (international public domain books)
- pd-marc8 (international public domain books in marc8)
- pdus-utf8 (u.s. public domain books)
- pdus-marc8 (u.s. public domain books in marc8)
A quick note. These are largish files. MarcEdit has a preview mode specifically for this purpose. Unless disabled, MarcEdit by default only loads the first 1 MB of data into the MarcEditor. This will allow you to preview ~1000-1500 records, but using the editor tools, you can globally edit the entire data file. This is done because reading data into the Editor is expensive (memory and time). If you really want to open large files into the Editor, you need to make sure your virtual memory is set fairly high.
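If you would rather peek at one of these files from a script, the same preview idea is easy to imitate. The Python sketch below reads only the first 1 MB of a binary MARC file and keeps the records that arrived whole, relying on the fact that every MARC record ends with the 0x1D record terminator. The function names are hypothetical and this is not how MarcEdit implements its preview mode.

```python
RECORD_TERMINATOR = b"\x1d"   # every MARC record ends with this byte
PREVIEW_BYTES = 1024 * 1024   # 1 MB, matching the preview size mentioned above


def complete_records(chunk):
    """Return the whole records in a chunk, dropping any truncated tail."""
    end = chunk.rfind(RECORD_TERMINATOR)
    if end == -1:
        return []  # not even one complete record in this chunk
    return [rec + RECORD_TERMINATOR
            for rec in chunk[:end].split(RECORD_TERMINATOR)]


def preview_file(path, limit=PREVIEW_BYTES):
    """Read at most `limit` bytes from a MARC file and return whole records."""
    with open(path, "rb") as handle:
        return complete_records(handle.read(limit))
```

Because only the first chunk is ever read, memory use stays flat no matter how large the file is.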
So long as the folks at UM don’t ask me to take it down, I’ve posted these test files at: http://osulibrary.oregonstate.edu/techservices/marc/umich_books.zip for viewing and testing purposes (~62.7 MB), but I would recommend harvesting these records from http://quod.lib.umich.edu/cgi/o/oai/oai directly yourself if you want to use them since UM is adding new records all the time. And remember, if you want to harvest them with MarcEdit, you’ll need to wait till I post the update on Sunday.
–TR