DSpace REST API built in JERSEY

By reeset / On / In Dspace, Programming

I thought I’d take a quick moment to highlight some work that was done by one of the programmers here at The OSU, Peter Dietz.  Peter is a bit of a DSpace wiz and a contributor to the project, and one of the things that he’s been interested in working on has been the development of a REST API for DSpace.  You can see the notes on his work on this GitHub pull request: https://github.com/DSpace/DSpace/pull/323.


Thankfully, I’m at a point in my career where I no longer have to be the person wrestling with DSpace’s UI development, but I’ve never been a big fan of it.  From the days when the interface was primarily JSP to the XSLT interfaces (which sounded like a good idea at the time) that most people use today, I’ve long pined for the ability to separate the DSpace interface development from the actual application, and move that development into a framework environment (any framework environment).  However, the lack of a mature REST API has made this type of separation very difficult.


The work that Peter has done introduces a simple READ API into the DSpace environment.  A good deal more work would need to be done around authentication to manage access to non-public materials as well as expansions to the API around search, etc., but I think that this work represents a good first step. 
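To give a flavor of what even a simple read-only API enables, here’s a rough Python sketch of the kind of client call such an API might support.  The endpoint path and JSON field names below are hypothetical illustrations, not necessarily the ones in Peter’s pull request:

```python
import json
from urllib.parse import urljoin

# Hypothetical service root and endpoint name -- the actual paths and
# field names in the pull request may differ.
BASE = "https://demo.dspace.org/rest/"

def item_url(item_id):
    """Build the URL for a single-item read request."""
    return urljoin(BASE, "items/{}".format(item_id))

def parse_item(payload):
    """Pull out the fields a simple display client might care about."""
    record = json.loads(payload)
    return {"title": record.get("name"), "handle": record.get("handle")}

# A real client would fetch item_url(n) with urllib.request and hand the
# response body to parse_item(); a canned payload stands in here.
sample = '{"id": 42, "name": "Sample item", "handle": "123/456"}'
print(parse_item(sample))
```

The point of the sketch is that an interface built this way only ever touches URLs and JSON, so it can live in any framework, on any platform, without touching the repository stack itself.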

However, what’s even more exciting is the demonstration applications that Peter has written to test the API.  The primary client that he’s used to test his implementation is a Google Play application, developed using an MVC framework.  While a very, very simple client, I think it’s a great first step that shows some of the benefits of separating the interface development from the core repository functionality, as changes related to the API, or development around the API, no longer require recompiling the entire repository stack.

Anyway – Peter’s work and his notes can be found as part of this GitHub pull request: https://github.com/DSpace/DSpace/pull/323.  Here’s hoping that, either through Peter’s work, new development, or a combination of the two, we will see the inclusion of a REST API in the next release of the DSpace application.


Libraries and cooking

By reeset / On / In Digital Libraries

With my move to The Ohio State University Libraries, one of the things that I’ve been thinking a lot about is how to tie together a lot of seemingly disparate systems to start creating a coherent digital initiatives infrastructure.  It’s a fun problem to be working through…something I was trying to explain to a friend the other day when they asked me about my recent career change.  Either I haven’t found a way to make my work sound particularly interesting, or I have a wildly different idea of what is fun, but after discussing some of the challenges and opportunities that I see available for libraries, I got contemplative silence, followed by a comment that they’d worry they might screw everything up.

The funny thing is, I don’t think I really worry about that anymore.  Don’t get me wrong, I can see where they are coming from.  It seems like every day, new projects, new initiatives, and new whiz-bang solutions are popping up.  How is a library to decide where to invest its time and money – and what happens if you are wrong (because invariably, on occasion, you will be)?  Between concerns related to changing workflows, migrating legacy data, and just building up capacity within your staff – how do you go about making changes confidently?

Obviously, there are a lot of things that go into making a decision like this…but as I was working on dinner this evening, I think I found a better way to think about this type of work.  I enjoy cooking, and I enjoy taking recipes or meals that I’m familiar with and mixing and matching different flavor profiles together.  Today, for example, I was looking for a way to spice up a pasta dish, and decided that it might be fun to make a Thai-flavored chili sauce to cook the chicken in and coat the pasta noodles with (rather than a traditional Alfredo sauce).  It sounded good, and happily, it tasted good.  But, the point is, I have absolutely no problem messing up dinner (and if you ask my kids, it happens).  The way I look at it is, if I screw it up, it’s only one meal, and I can always make folks peanut butter and apple butter sandwiches.  I like to look at system building the same way.  Invariably, you are going to start down a path and realize that a new tool or component doesn’t fit within your environment.  It might have looked like it would fit, but the end result just doesn’t “taste” right.  That’s ok – just try a new recipe and start again.

I think my favorite thing about cooking is I love making new things.  Sure, there are recipes, but I see those more like loose guidelines, rather than hard and fast rules (this is probably why I don’t do desserts – I find they tend to be a bit less forgiving).  I look at the work I do in libraries the same way.  There are a lot of recipes for building digital infrastructure, but they are really more like loose guidelines.  The trick is to take those recipes and come up with a flavor that meets the needs of your local environment. 


Thinking about what’s not being measured by the Ithaka S+R Survey

By reeset / On / In Asides, Digital Libraries, Library

I’d posted some not-so-random thoughts on the Ohio State Libraries’ local IT blog.  If interested, you can find it here:


In a nutshell, I’ve been struck this time around by the very old-world view of the library the survey seems to be presenting to faculty.  Certainly, the survey has value, but I wonder how much, when so much of what libraries do today is beyond simply providing access to journals (in fact, I’d argue most of the interesting stuff we do for faculty has nothing to do with providing access to traditional materials).  Anyway, if you think I’m off base, let me know.


Turning your phones into Cataloging clients

By reeset / On / In Digital Libraries, MarcEdit

Over the past few years, I’ve owned a number of different smart phones.  I’ve had an iPhone, an Android (the first, in fact) and now a Windows Phone 7.  I have to admit, they are all great, especially when I compare them to my old Blackberry.  What you can do with each of these devices is quite cool.  One of my favorite aspects of these phones is how easy it is to hack on them.  When I had my iPhone, I spent some time learning Objective-C and writing a few simple iOS apps.  I did the same thing with Android and Java.  However, now that I have a Windows Phone, I find that I have many more opportunities to write applications for it because there’s no learning curve…I already use both Silverlight and C# in some personal coding projects.

So, why do I bring this up?  Well, one of the things I’ve been thinking about is how these little micro computers that fit in our pockets can potentially be used in libraries.  There are some obvious uses (making our catalogs more mobile, using geolocation within a building to help users navigate to a book, etc.), but what I’m more interested in is how we can make staff life a little easier with these devices.  Looking around our library, one area where I can definitely see these kinds of devices making a big impact is cataloging and technical services – well, more specifically, eliminating the need to perform recon within cataloging and technical services.

Traveling around a few libraries in my immediate area, one thing that I’ve found is that many libraries still have small card catalogs.  These are often of materials that have yet to be reconned, and represent older journal titles and monographs.  Many libraries also have large gift shelves, and areas in the stacks themselves, that remain uncataloged.  It would be nice if we could take these micro computers, fully equipped with a digital camera, and photograph ourselves out of this problem.  The difficulties of course relate to OCR and the conversion of this data into MARC itself…or maybe it’s not a difficulty.

I’ve been doing a little bit of playing around (well, more than a little bit) and here’s what I’ve found.  It’s easy to do OCR on the web, for free.  Folks may not realize it, but the Google Docs API provides a free OCR service.  So does Microsoft.  By working with the camera on a smart phone, it’s easy to send a snapshot of a book title page or catalog card to one of these OCR services and return the results to the phone.  Using MarcEdit (being written in C#, MarcEdit can be compiled to run on a Windows Phone – I’ve done it), I’m able to utilize the MARCEngine to take that OCR data and either retrieve data from Amazon, the Library of Congress, or another library catalog – massage the data, and upload it to my catalog – all from my phone.  Pretty cool stuff.
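To make that pipeline a little more concrete, here’s a deliberately naive sketch (in Python, rather than the C# of the actual app) of the last step: shaping OCR’d card text into MarcEdit’s mnemonic MARC format.  The card layout assumed here (first line is the main entry, second line the title) is just an assumption for illustration; real cards are far messier, which is exactly where the hard parsing work lives:

```python
def card_to_mnemonic(ocr_text):
    """Naively map OCR'd catalog-card text to MarcEdit's mnemonic format.
    Assumes line 1 is the main entry (author) and line 2 is the title,
    which only roughly matches how real cards are laid out."""
    lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("need at least an author line and a title line")
    author, title = lines[0], lines[1]
    return "\n".join([
        "=LDR  00000nam a2200000 a 4500",  # placeholder leader
        "=100  1\\$a" + author,            # main entry - personal name
        "=245  10$a" + title,              # title statement
    ])

card = "Melville, Herman.\n  Moby Dick; or, The whale.\n  New York, Harper, 1851."
print(card_to_mnemonic(card))
```

From a mnemonic file like this, MarcEdit’s MARCEngine can compile actual MARC for upload, or the text can be used as a search seed against Amazon or a Z39.50 target.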

Right now, this work all remains in the research stages…it’s rough.  The UI is sad, and the parsing of the OCR’d data could be much better.  But the interesting thing is that it does work.  Does it have real applicability in the library world – maybe, maybe not.  I’m just not sure if enough material needing recon still exists for this type of application to be needed.  But what this type of experimentation does show is that libraries probably should be looking at these little micro computers as more than consumer devices (i.e., how they change the way our users interact with our services) and consider how these devices may change the way libraries perform their own work.

BTW, if folks are interested in this recon project – my intention is to talk about it at C4L this Feb during a lightning talk.  Ideally, I’ll have it cleaned up enough to show it off, and maybe, if there is interest, talk to some folks about how they can run something like this on their own Windows Phone 7.


Evaluating digital services: an alternative reality

By reeset / On / In Digital Libraries

Sorry for what I’m sure will be a longish post.  This is a bit of a brain dump —

Lately, I’ve been thinking a lot about how libraries determine if services that they provide are successful.  Well, specifically, how libraries determine if digital services that they provide are successful.  And after attending DLF and listening to more than a few folks talk about very cool digital programs, I’m starting to think that libraries view digital programs as living in a kind of alternative reality, where the rules of regular evaluation and assessment don’t apply. 

Our recently retired University Librarian and I would spend a good deal of time talking about assessing digital projects.  In the months prior to her retirement, we spent a lot of time talking about digital services and the necessity for libraries to begin looking critically at the services that we provide, and to assess their feasibility in terms of impact and cost of operation.

I think in the library world, there is a feeling that libraries need to catch up, and in many respects, catching up is a code-word for building digital programs, creating mobile sites, creating institutional repositories, etc.  It could be a lot of things – but I think the question that sometimes gets lost is whether a library should participate in those activities at all.  Let’s use the institutional repository as an example.  I know that libraries and librarians love to jump on the open access hobby horse (yes, I’m quite cynical about this one), but does every library need to have an institutional repository?  I’d argue no.  In fact, OSU’s IR is very successful when compared to other IR efforts in the library community, but even here, I sometimes wonder if it’s necessary for an institution of our size to maintain such a repository.  I honestly think that as repository efforts go, OSU’s is one of the best.  At the same time, I know how many resources (both infrastructure and people) go into delivering this service, and because of that, I sometimes get discouraged when I consider the substantial cost per item.  At the same time, I see how this effort becomes more important to the campus and the library with each passing year as more content finds its way into the IR.

At the same time, it seems to me that the concept of an IR is an old one.  Yes, organizations want to maintain walls around their information to demonstrate ownership and provenance, but in reality, I oftentimes believe our patrons would be much better served by repository efforts that removed the concept of the organization.  Thinking about OSU’s repository efforts…could this work have more impact if we could leverage a larger state-wide repository effort?  For that matter, couldn’t every institution in Oregon?  And why would it have to just be statewide…again, those are fairly arbitrary borders.

I think the thing that I find myself struggling against sometimes is that libraries search for digital services to distinguish themselves and provide services that make resources more readily available for their patrons.  But we oftentimes have approached this the same way we build traditional print collections.  We start a local program, brand it, promote it.  In reality, the digital space provides libraries an opportunity to work outside the traditional boundaries of ownership and build more collaborative services.  And collaborative not in the sense that I run an IR and you run an IR and we have an API that lets us communicate with each other – but collaborative in the sense that we build tools that everyone shares.  We see this model happening, and it will happen more as libraries are forced to justify stretched resources, but I oftentimes think that we could be doing so much more today.

Anyway – what has got me thinking about this lately has been talking to people about users and their digital services.  Over the past couple of months, I’ve spoken with folks building mobile sites or grant-driven projects who excitedly talk about 500 visitors a day, or 2,000 visitors daily – and who see those numbers as justifying tens of thousands of dollars of startup monies, or locking up finite FTE resources.  It makes me wonder if libraries have lost their minds, and whether we should be doing better assessment of our digital collections.  Because of MarcEdit’s automatic updating, I know that it’s opened over 5,000 times daily.  A few of the map cataloging tools on my page that I don’t even update any more get a few hundred visitors a day – these are tools created in my spare time, with few resources – yet many times, they have larger audiences than many digital library services being spun out by libraries.  And yet, it is these vampire projects that are siphoning valuable time and resources away from libraries and making truly transformative development more difficult.

This isn’t to say that libraries shouldn’t be doing things.  We should.  However, what I’m finding myself looking for more and more at digital library conferences is librarians talking seriously about assessment of their digital assets and projects.  They are out there –


An OA mandate for the OSU library faculty

By reeset / On / In Digital Libraries, Open Access

The OSU Library faculty recently adopted an OA mandate, which is pretty cool.  You can read about it here: http://ir.library.oregonstate.edu/dspace/handle/1957/10850 and what a few others are saying about it:

I think that this is important on a number of levels.

  1. Symbolically, it’s important.  It’s very difficult for the library to go to faculty on campus and ask them to contribute content to the IR when, in fact, the Library faculty itself is not regularly submitting to the IR.  This changes that – and hopefully – will act as a catalyst for other departments on campus to follow the Library faculty’s lead.
  2. As tenured faculty, the research (both papers and presentations) our librarians generate represent an important contribution to the scholarly community. As researchers and scholars, preserving our content and making it freely accessible to future researchers is indeed one of our primary responsibilities as faculty.
  3. This was really a faculty-initiated endeavor that has a great back story, but I won’t include it here right now.  Suffice it to say, a good number of people at OSU deserve a lot of credit for making this happen, chief among them Michael Boock and Janet Webster – who have worked tirelessly from the beginning to advertise, grow, and advocate for the IR in the library.  And the faculty as well, for stepping up and making this a reality.
  4. Finally, it’s just one more example of that Beaver ingenuity and can do’edness.  🙂



What would it look like if OCLC was broken up?

By reeset / On / In Digital Libraries, OCLC, Programming

The other day, I posted what I saw as some very big concerns with OCLC’s revised policy (currently being reconsidered) on the transfer of records (two of which I would consider deal breakers).  In that post, I made the argument that maybe it was time to consider breaking OCLC up to reflect what it has become — an organization with two distinct facets: a membership component and a vendor component.  This comment led to a conversation with someone at OCLC who questioned whether I honestly believed that the library community would be better off if OCLC was broken up, and it was obvious from our conversation that on this point, we would simply need to agree to disagree.  As a side note, I think that these types of disagreements and conversations are actually really important to have.  I’m always nervous about communities or groups in which everyone agrees, since it usually means that people either are not thinking critically or no one really cares.  Secondly, I think that we all (OCLC and myself for that matter) want what’s best for the library community — we just have different visions of what that might be.

Anyway, back to my topic.  Now, I’m going to preface this discussion by saying that this is obviously my own opinion and one that may not be shared by many people within the library community (I really have no idea).  Even within the library open source community, where I’m sure this opinion would be more prevalent (or at least entertained), I’m pretty sure I’m still in the minority.  But as I say, I think that these conversations are important to consider — specifically as we move down a path where OCLC is very quickly positioning themselves to become the library community’s default service provider for all things library (in terms of ILL, ILS interface, cataloging, etc.).

So when I talk about breaking up OCLC, exactly what am I talking about?  Well, in order to follow me down the path that I am going to take you, we have to talk about OCLC as I currently see them.  Watching OCLC during the 10 years (I can’t believe it’s actually been 10 years) that I have been in libraries, I have seen a quickening evolution of OCLC from a strictly member-driven organization to more of a hybrid organization.  On the one hand, there is what many would consider the membership side of OCLC, that being WorldCat, ILL and their research and development office.  On the other hand, there is OCLC’s vendor arm…a good example of this would be WorldCat Local and WorldCat Navigator.  So how do I make these distinctions?  Membership services are those that I would consider core services.  These are services that OCLC has developed to add value to what OCLC likes to refer to as the Library Commons (WorldCat).  OCLC’s vendor services are those tools or programs that OCLC sells on top of the Library Commons, of which I think WorldCat Local/Navigator is a good example.  Now, at this point, I know that folks at OCLC (and likely in the membership) would argue that WorldCat Local/Navigator do provide services that the OCLC membership is currently requesting.  I won’t deny that — however, I would answer that the fact that OCLC treats the Library Commons (WorldCat) as its own closed personal community has the unintended effect of limiting the library community’s (and I include both commercial and non-commercial entities in my definition of community) ability to develop new service models.  In effect, we become much more dependent on how OCLC envisions the future of libraries.  Let me try and tease this out a little bit more…

Philosophically, the biggest problem that I have with the current situation is the commingling of OCLC’s treatment of the Commons (WorldCat) and their current strategy of being the sole commercial entity with the ability to interact with the Commons.  I’m a firm believer that the more diverse the landscape or ecology, the more likely it is that innovation will take place.  We’ve seen this time and time again both inside (Evergreen and Koha certainly have shaken up the traditional ILS market) and outside (web browsers are a good example of how competition breeds innovation) the library community.  However, by isolating the Commons, OCLC is threatening this diversity of thought.  Now, I have a whole set of different issues with the current library ILS community, but in this case, I think that OCLC’s treatment of the Commons, and their ability to leverage that service, unfairly skews the ability of both commercial and non-commercial entities to provide innovative services on top of the Commons (and before anyone jumps on me about non-commercial use, let me finish my thoughts here).  Commercially, I’m fairly certain that the current crop of ILS vendors would very much like to provide their own WorldCat Local/Navigator interfaces to their customers, and I’m sure they would be able to tie these interfaces closely with services already provided by the user’s ILS.  I could envision things like ERM (electronic resource management), simplified requesting, etc. all being possible if the likes of ExLibris or Innovative Interfaces were allowed to build tools upon the Library Commons (WorldCat).  Maybe I would like to develop my own version of WorldCat Local/Navigator that interacts with the Commons and sell it as a product (kind of the same way EZproxy was sold prior to being acquired by OCLC), or a group of researchers would like to do the same.
As a commercial entity, I’m fairly certain that this type of development model wouldn’t be kosher with OCLC unless I licensed access to WorldCat (and I’m not certain that they would allow it, given that this would compete against one of their services).  Likewise, open source folks like LibLime or Equinox might like to create an open source version of the WorldCat Local interface.  Under the current guidelines, I understand that an open source implementation of WorldCat Local can exist — but as I read that agreement, I’m not certain that groups like LibLime or Equinox (or another entity) could take that project and then sell support-based services around it (I’m unclear on that one though).  However, it’s very unlikely that the library world will see any of these types of developments (well, maybe the open source WorldCat Local, since I have a group that could use this and a number of people interested in developing it) because OCLC has come to treat what it calls the Commons (WorldCat) as its own personal data store.  There’s that commingling again.

So if it was up to me, how would I resolve this situation?  Well, I see two possible scenarios. 

  1. Open up WorldCat.  OCLC likes to refer to WorldCat as the Library Commons — well, let’s treat it as such.  Remove the barriers to access and allow anyone and everyone the ability to essentially have their own copy of the Library Commons and its data.  Then, rather than specifying terms of transfer and telling libraries under what conditions they can and cannot make their metadata available to other groups, the membership could consider what type of Open Data license the Commons could be made available under.  Something like the Creative Commons share-alike license, which allows for both commercial and non-commercial usage but requires all parties to contribute all changes to the data back to the community (in essence, this is kind of what Open Library is doing with their metadata), may be appropriate.  OCLC would be free to develop their own products, but the rest of the library community (both library and vendor community) would have equal opportunity to develop new services and ways of visualizing the data found in the Commons.  Does this devalue the Commons (WorldCat)?  I don’t think so — look at Wikipedia.  It uses this model of distribution, yet I’ve never heard anyone say that this devalues its content.  Would there be challenges?  For sure.  Probably one of the biggest would be the way that it would change what it means to be a member of OCLC.  If each person could download their own personal copy of the Commons, would libraries stay members?  I’m certain that they would — but I’m sure that what it means to be a member would certainly change.
  2. Split OCLC’s membership services from OCLC’s vendor services.  Under this example, WorldCat Local/Navigator development would be spun away from OCLC as a separate business (this happens in academia all the time).  Were this to happen, OCLC would be able to develop terms of license that could then be leveraged by all members of the commercial library community, removing the artificial advantage OCLC is currently able to exploit (both in terms of data and in deciding who is allowed to work with the Commons).  In all likelihood, I think that this model represents the smallest change for the membership, and it would continue to allow OCLC to make the Commons more available to non-commercial development without artificially limiting other groups interested in building new services.

One last thought.  In talking to people today, I heard a number of times that OCLC restricting access to the Commons was in fact a good thing, in part because it finally allowed the library community the ability to leverage resources not available to the vendor communities.  In some way, we could finally stick it to them.  That’s fine, I’m all for developing tools and services, but this particular type of thinking I find worrisome.  If we, as a community, feel that we are unable to develop compelling tools and services that can compete with other vendor offerings without an artificial advantage — well, that’s just sad, and it says a little something about how we see ourselves as a community.  And this too is something that I’d like to see change, because if you look around, you will see that there are a myriad of projects (Koha, Evergreen, VuFind, Fedora, DSpace, LibraryFind, XC Catalog, Zotero, etc.) where developers (some library developers, some not) are re-envisioning how they see many of the services within the library and putting their time and effort into realizing those visions.



Reading between the lines

By reeset / On / In Digital Libraries, OCLC

Like a number of people, I found the following piece (http://chronicle.com/weekly/v54/i24/24a01101.htm) from the Chronicle of Higher Education on the Open Library fairly interesting — in part, because of the topics that the author chose to highlight.  I tend to categorize pieces such as this as fluff, in that one rarely gets any content of substance from them.  However, in a short article about the Internet Archive’s Open Library initiative, I found it interesting that so much of the article centered around OCLC — or, should I say, the silence coming from OCLC as members seek to clarify OCLC’s position in regards to the Open Library and its members’ potential participation in this project.  Two things jump out:

  1. “Librarians are not just uneasy having nonlibrarians edit catalogs; they are also afraid of offending OCLC.”

    An exceptional understatement, though one that doesn’t extend just to the Open Library.  As a general rule, I find that librarians are way too concerned with offending OCLC, with many having a feeling that should offense be taken, it could have long-running repercussions for the institution.  Are these concerns valid?  For OCLC, I think not.  While I firmly believe that OCLC occupies the same vendor space as other entities like EBSCOhost, Elsevier and Serials Solutions, I think that they are much more responsive to their member customers — due in part to the organization’s roots as a large co-op.  Of course, librarians and libraries have been conditioned to believe that consequences will follow if one rocks the boat or steps on a partner’s toes.  And unfortunately (and much to my chagrin), I’ve had occasion myself to say or post opinions that have caused pushback from content/software providers currently serving Oregon State.  Fortunately, my director doesn’t mind when the pot periodically gets stirred, but not everyone is as lucky.  So, I can certainly understand where the nervousness is coming from.  At the same time, I think that OCLC is contributing to this sense of uncertainty.  OCLC hasn’t been caught by surprise by the Open Library’s development work, and certainly hasn’t been surprised by the Open Library asking OCLC members to contribute data to the project.  For close to a year, OCLC has had the opportunity to provide some form of guidance or position as it relates to the Open Library project.  Instead, they have been silent.  This leaves librarians and libraries to consult their local OCLC representatives, who have been given widely varying information regarding the legality of participating in this project.  While I’ve yet to hear of anyone being told that a library could not participate in the project, it has been quietly discouraged by OCLC’s deafening silence.
  2. “But one OCLC official, speaking on the condition that he not be identified, said Open Library was a waste of time and resources, and predicted it would fail.”

    Again, it’s interesting that a comment like this would make its way into the article.  Whether or not this reflects OCLC’s current position on this particular project, I think that a number of good things may come out of the Open Library project, even if indirectly.  First, OCLC’s grid services.  While likely not a direct result of the Open Library’s project, I’d guess that the current desire to accelerate their availability is in response to the growing number of projects currently looking to move into the space that OCLC has traditionally monopolized.  Yes, let’s call it what it is: in this space, OCLC functions as a monopoly, because OCLC has essentially been allowed to rely on its position to squeeze out competing projects (RLG) and leverage their data to create services that would be otherwise impossible to create without the metadata that OCLC currently possesses.  I think to some degree, projects like the Open Library give OCLC pause in the sense that at present, they see their bibliographic and holdings content, WorldCat, as their crown jewel.  It represents a body of work that exists nowhere else in the world and gives them a potential advantage over any cloud-based service being developed within the library community.  At the same time, as OCLC goes forward and libraries become more interested in building some of their own tools (either individually or as part of a consortium), I think that WorldCat, and the data beneath it, will actually become less important for OCLC — rather, it will be the services that they develop on top of it that will hold the most value.  And I think that projects like the Open Library have accelerated this development.  As Martha Stewart would say, it’s a good thing.

    Secondly, I think that this quote is interesting in a larger sense as to how it relates to OCLC as a whole.  They are undergoing big changes — business changes, philosophical changes — and I think that this represents that to some degree.  As the piece notes, OCLC’s public face sees cooperation as a good thing, while maybe privately, that’s not the case.  But honestly, I think that this is healthy.  OCLC is hiring a lot of bright people, and has traditionally had a lot of bright people on staff, and what we see is that they are thinking about these issues and how they relate to the larger community (even beyond OCLC).  Now, whether or not OCLC is particularly happy that these disagreements are being aired publicly (something that hasn’t traditionally happened), well, that would be something to keep an eye on as well.


[update: Spell check fails me again, sorry Martha]


Harvesting UMich OAI records with MarcEdit

By reeset / On / In MarcEdit, OAI

I’ve had a few folks ask what the procedure would be for harvesting the UMich OAI records using MarcEdit.  Well, there are two workflows that can be followed, depending on what you want to do.  You can harvest the OAI data and translate it directly to MARC, or you can harvest the raw data directly to your file system.  Here’s how each works:

Generating MARC records from the OAI content:

  1. Start MarcEdit
  2. From the Main Screen, click on the Harvest OAI Records Link
  3. Once the link has been selected, you have a number of options available to you to control the harvesting.  Required options are those that are visible when the screen opens; Advanced Settings define additional, optional settings available to the user.  Here’s a screenshot of the Harvester with the Advanced Options expanded:
    The required elements that must be filled in are the Server Address (the address of the OAI endpoint), the metadata type (the format to be downloaded) and the Crosswalk Path.  If you select any of the predefined metadata types, the program will select the crosswalk path for you.  If you add your own, you will need to point the program to the crosswalk path.  The set name is optional; if you leave this value blank, the harvester will attempt to harvest all available sets on the defined server. 

    Advanced settings give the user a number of additional harvesting options, generally intended to help the user control the flow of the harvest.  For example, a user can harvest an individual record by entering the record’s identifier into the GetRecord textbox, or resume a harvest by entering the resumptionToken into the ResumptionToken textbox.  If the user wants to harvest a subset of a specific data set, they can use a date limit (of course, you must use the date format supported by the server, generally yyyy or yyyy-mm-dd).  Users can also choose to have their metadata translated into MARC8 (since the harvester assumes UTF-8 for all XML data) and change the timeout settings the harvester uses when retrieving data (you generally shouldn’t change this).  Finally, for users who don’t want to harvest data into MARC but just need the raw data, there is the ability to tell the harvester to harvest data directly to the local file system.  If this option is checked, the Crosswalk Path’s label and behavior will change, requiring the user to enter a directory path telling the harvester where it should save the harvested files.

  4. For the UMich Digital Books, a user would want to utilize the following settings to harvest metadata into MARC:
    Users wanting to ensure that the MARC data is in MARC8 rather than UTF-8 format should check the Translate to MARC-8 option.  Once these settings have been set, the user just needs to click the OK button.  For this set (mbooks), there are approximately 111,000+ records, so harvesting will take roughly an hour to complete, longer if you ask the program to translate the data into MARC8.
  5. When finished, users will be prompted with a status box indicating the number of records, the number of resumption tokens, and the last resumption token processed (along with any error information if an error occurred during processing).
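For those curious what these settings translate to on the wire, the harvester’s required and advanced options map directly onto standard OAI-PMH request parameters.  Here’s a rough sketch in Python; the endpoint URL is a placeholder, and this is my illustration of the protocol, not MarcEdit’s internal code:

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute the real OAI server address
BASE_URL = "http://example.edu/oai"

def build_request(metadata_prefix, set_name=None, from_date=None,
                  until_date=None, resumption_token=None, identifier=None):
    """Map the harvester's settings onto OAI-PMH verbs and parameters."""
    if identifier:
        # GetRecord textbox -> harvest a single record by identifier
        params = {"verb": "GetRecord", "identifier": identifier,
                  "metadataPrefix": metadata_prefix}
    elif resumption_token:
        # ResumptionToken textbox -> resume an interrupted harvest
        # (the token replaces all other arguments, per the OAI-PMH spec)
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_name:
            params["set"] = set_name      # blank set name = harvest all sets
        if from_date:
            params["from"] = from_date    # must match the server's granularity
        if until_date:
            params["until"] = until_date
    return BASE_URL + "?" + urlencode(params)
```

Note that when a resumption token is present it replaces the metadataPrefix, set, and date arguments entirely; that is why the harvester can resume a harvest from nothing but the token.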


Harvesting OAI records directly to the filesystem

  1. Start up MarcEdit
  2. Select Harvest OAI records link
  3. Enter the following information (the server and folder locations will obviously vary):
  4. Files are harvested into the defined directory, numbered sequentially according to the resumption token processed.  Again, when processing is finished, a summary window will be generated to inform the user of the harvest status and any error information related to the harvest.
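If you’d rather script a raw harvest yourself, the loop the harvester performs can be sketched like this (the endpoint, set name, and output directory are placeholders, and the resumption-token parsing uses a simple regex rather than a full XML parser):

```python
import os
import re
import urllib.request

def next_token(xml):
    """Extract a non-empty resumptionToken from a ListRecords response,
    or return None when the harvest is complete."""
    m = re.search(r"<resumptionToken[^>]*>([^<]+)</resumptionToken>", xml)
    return m.group(1) if m else None

def harvest_raw(base_url, metadata_prefix, set_name, out_dir):
    """Save each OAI-PMH response page to a numbered file, one file per
    resumption token, mirroring the harvester's raw-harvest behavior."""
    os.makedirs(out_dir, exist_ok=True)
    url = (f"{base_url}?verb=ListRecords"
           f"&metadataPrefix={metadata_prefix}&set={set_name}")
    page = 0
    while url:
        with urllib.request.urlopen(url, timeout=60) as resp:
            xml = resp.read().decode("utf-8")
        with open(os.path.join(out_dir, f"oai_{page:05d}.xml"), "w",
                  encoding="utf-8") as f:
            f.write(xml)
        token = next_token(xml)
        url = (f"{base_url}?verb=ListRecords&resumptionToken={token}"
               if token else None)
        page += 1
```

A harvest of this size would call for retry handling around the request as well (see the server-timeout note below), which is omitted here for brevity.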

Errors that could be encountered during the UMich harvest:

My guess is that you won’t see these if you are using the most current version of MarcEdit (uploaded 2008-01-27); however, you may run into them if harvesting using other tools or older versions of MarcEdit.

  1. Server timeout:  When harvesting all records, I routinely saw the server reset its connection after harvesting 10-18 resumption tokens.  The current version of MarcEdit has some failover code that will reinitiate the harvest under these conditions, stopping after 3 failed attempts.
  2. Invalid MARC data:  Within the 111,000+ records, there are approximately 40-60 MARC records that have too few characters in the MARC leader element.  This is problematic because this error will invalidate the record and, depending on how the MARC parser handles records, poison the remainder of the file.  MarcEdit accommodates these errors by auto-correcting the leader values, but this could be a problem with other tools.
  3. Invalid date format:
    This error message will be generated if you set the start and end elements using an invalid date format.  You should always check with the OAI server to see what date formats are supported by the server.  In this case, the date format expected by the UM OAI server is as follows:
    <repositoryName>University of Michigan Library Repository</repositoryName> 

    Notice the granularity element — this tells me that any of the following formats would be valid:
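On the invalid-leader problem in #2 above: a valid MARC leader is exactly 24 characters, so a too-short leader can be detected and padded before the record is parsed.  Here’s a minimal sketch of that kind of auto-correction (my illustration, not MarcEdit’s actual code):

```python
LEADER_LENGTH = 24  # a valid MARC leader is exactly 24 characters

def fix_leader(leader: str) -> str:
    """Pad a too-short leader with blanks (or truncate a too-long one)
    so downstream parsers don't lose sync on the record."""
    if len(leader) < LEADER_LENGTH:
        return leader.ljust(LEADER_LENGTH)
    return leader[:LEADER_LENGTH]
```

Blank-padding keeps the record structurally parseable, though a real repair would also want to recompute the record length in leader positions 0-4.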

Anyway, that’s pretty much it.  If you are just interested in seeing what type of data UM is exposing with these data elements, you can find that data (harvested 2008-01-25) at: umich_books.zip (~63 mb).