Apr 11, 2013
 

I’d posted some not so random thoughts on the Ohio State Libraries local IT blog.  If interested, you can find it here:

http://library.osu.edu/blogs/it/thinking-about-whats-not-being-measured-by-the-ithaka-sr-survey/

In a nutshell, I’ve been struck this time around by the very old-world view of the library the survey seems to be presenting to faculty.  Certainly, the survey has value, but I wonder how much value when so much of what libraries do today is beyond simply providing access to journals (in fact, I’d argue most of the interesting stuff we do for faculty has nothing to do with providing access to traditional materials).  Anyway, if you think I’m off base, let me know.

TR

 Posted by at 10:29 am
Jan 08, 2012
 

Over the past few years, I’ve owned a number of different smart phones.  I’ve had an iPhone, an Android (the first, in fact) and now a Windows Phone 7.  I have to admit, they are all great, especially when I compare them to my old Blackberry.  What you can do with each of these devices is quite cool.  One of my favorite aspects of these phones is how easy it is to hack on them.  When I had my iPhone, I spent some time learning Objective-C and writing a few simple iOS apps.  I did the same thing with Android and Java.  However, now that I have a Windows Phone, I find that I have many more opportunities to write applications for it because there’s no learning curve…I already use both Silverlight and C# in some personal coding projects.

So, why do I bring this up?  Well, one of the things I’ve been thinking about is how these little micro computers that fit in our pockets can potentially be used in libraries.  There are some obvious uses (making our catalogs more mobile, using geolocation within a building to help users navigate to a book, etc.), but what I’m more interested in is how we can make staff life a little easier with these devices.  Looking around our library, one area where I can definitely see these kinds of devices making a big impact is in cataloging and technical services – well, more specifically, in eliminating the need to perform recon within cataloging and technical services.

Travelling around a few libraries in my immediate area, one thing that I’ve found is that many libraries still have small card catalogs.  They often are of materials that have yet to be reconned and represent older journal titles and monographs.  Many libraries also have large gift shelves, and areas in the stacks themselves, that remain uncataloged.  It would be nice if we could take these micro computers, fully equipped with a digital camera, and photograph ourselves out of this problem.  The difficulties of course relate to OCR and the conversion of this data into MARC itself…or maybe it’s not a difficulty.

I’ve been doing a little bit of playing around (well, more than a little bit) and here’s what I’ve found.  It’s easy to do OCR on the web (free OCR).  Folks may not realize it, but the Google Docs API provides a free OCR service.  So does Microsoft.  By working with the camera on a smart phone, it’s easy to send a snapshot of a book title page or card catalog card to one of these OCR services and return the results back to the phone.  Using MarcEdit (being written in C#, MarcEdit can be compiled to run on a Windows Phone; I’ve done it), I’m able to utilize the MARCEngine to take that OCR data, retrieve data from Amazon, the Library of Congress, or another library catalog, massage the data, and upload it to my catalog – all from my phone.  Pretty cool stuff.

Right now, this work all remains in the research stages…it’s rough.  The UI is sad, and the parsing of the OCR’d data could be much better.  But the interesting thing is that it does work.  Does it have real applicability in the library world?  Maybe, maybe not.  I’m just not sure if enough unreconned material still exists for this type of application to be needed.  But what this type of experimentation does show is that libraries probably should be looking at these little micro computers as more than consumer devices (i.e., how they change the way our users interact with our services) and consider how these devices may change the way libraries perform their own work.

BTW, if folks are interested in this recon project – my intention is to talk about it at C4L this Feb during a lightning talk.  Ideally, I’ll have it cleaned up enough to show it off, and maybe, if there is interest, talk to some folks about how they can run something like this on their own Windows Phone 7.

–TR

 Posted by at 5:29 pm
Dec 04, 2010
 

Sorry for what I’m sure will be a longish post.  This is a bit of a brain dump —

Lately, I’ve been thinking a lot about how libraries determine if services that they provide are successful.  Well, specifically, how libraries determine if digital services that they provide are successful.  And after attending DLF and listening to more than a few folks talk about very cool digital programs, I’m starting to think that libraries view digital programs as living in a kind of alternative reality, where the rules of regular evaluation and assessment don’t apply. 

Our recently retired University Librarian and I would spend a good deal of time talking about assessing digital projects.  However, in the months prior to her retirement, we spent a lot of time talking about digital services and the necessity for libraries to begin looking critically at the services that we provide and assessing their feasibility in terms of impact and cost of operation.

I think in the library world, there is a feeling that libraries need to catch up, and in many respects, catching up is a code-word for building digital programs, creating mobile sites, creating institutional repositories, etc.  It could be a lot of things – but I think the question that sometimes gets lost is whether a library should participate in those activities at all.  Let’s use the institutional repository as an example.  I know that libraries and librarians love to jump on the open access hobby horse (yes, I’m quite cynical about this one), but does every library need to have an institutional repository?  I’d argue no.  In fact, OSU’s IR is very successful when compared to other IR efforts in the library community, but even here, I sometimes wonder if it’s necessary for an institution of our size to maintain such a repository.  I honestly think that as repository efforts go, OSU’s is one of the best.  At the same time, I know how many resources (both infrastructure and people) go into delivering this service and because of that, I sometimes get discouraged when I consider the substantial costs per item.  And yet, I see how this effort becomes more important to the campus and the library each passing year as more content finds its way into the IR.

At the same time, it seems to me that the concept of an IR is an old one.  Yes, organizations want to maintain walls around their information to demonstrate ownership and provenance, but in reality, I often times believe our patrons would be much better served with repository efforts that removed the concept of the organization.  Thinking about OSU’s repository efforts…could this work have more impact if we could leverage a larger state-wide repository effort?  For that matter, couldn’t every institution in Oregon?  And why would it have to just be statewide…again, those are fairly arbitrary borders.

I think the thing that I find myself struggling against sometimes is that libraries search for digital services to distinguish themselves and provide services that make resources more readily available for their patrons.  But we often times have approached this the same way we build traditional print collections.  We start a local program, brand it, promote it.  In reality, the digital space provides libraries an opportunity to work outside the traditional boundaries of ownership and build more collaborative services.  And collaborative not in the sense that I run an IR and you run an IR and we have an API that lets us communicate with each other – but collaborative in the sense that we build tools that everyone shares.  We see this model happening and it will happen more as libraries are forced to justify stretched resources, but I often times think that we could be doing so much more today.

Anyway – what has got me thinking about this lately has been talking to people about users and their digital services.  Over the past couple of months, the number of times I’ve spoken to folks building mobile sites or grant-driven projects who talk excitedly about 500 visitors a day, or 2000 visitors daily, and see those numbers as justifying tens of thousands of dollars of startup monies or locking up finite FTE resources makes me wonder if libraries have lost their minds, and whether we should be doing better assessment of our digital collections.  Because of MarcEdit’s automatic updating, I know that it’s opened over 5,000 times daily.  A few of the map cataloging tools on my page that I don’t even update any more get a few hundred visitors a day.  These are tools that are created in my spare time, with few resources, yet many times they have larger audiences than many digital library services being spun out by libraries.  And yet, it is these vampire projects that are siphoning valuable time and resources away from libraries and making really transformative development more difficult.

This isn’t to say that libraries shouldn’t be doing things.  We should.  However, what I’m finding myself looking for more and more at digital library conferences, are librarians talking seriously about assessment of their digital assets and projects.  They are out there –

–TR

 Posted by at 10:22 pm
Mar 16, 2009
 

The OSU Library faculty recently adopted an OA mandate, which is pretty cool.  You can read about it here: http://ir.library.oregonstate.edu/dspace/handle/1957/10850 and what a few others are saying about it:

I think that this is important on a number of levels.

  1. Symbolically, it’s important.  It’s very difficult for the library to go to faculty on campus and ask them to contribute content to the IR, when in fact, the Library faculty itself is not regularly submitting to the IR.  This changes that – and hopefully – will act as a catalyst for other departments on campus to follow the Library faculty’s lead.
  2. As tenured faculty, the research (both papers and presentations) our librarians generate represents an important contribution to the scholarly community. As researchers and scholars, preserving our content and making it freely accessible to future researchers is indeed one of our primary responsibilities as faculty.
  3. This was really a faculty initiated endeavor, that has a great back story, but I won’t include it here right now.  But suffice it to say, a good number of people at OSU deserve a lot of credit for making this happen, chief among those being Michael Boock and Janet Webster – who have worked tirelessly from the beginning to advertise, grow and advocate for the IR in the library.  And for the faculty as well, for stepping up and making this a reality. 
  4. Finally, it’s just one more example of that Beaver ingenuity and can do’edness.  :)

 

–TR

Nov 03, 2008
 

The other day, I posted what I saw as some very big concerns with OCLC’s revised policy (currently being reconsidered) on the transfer of records (two of which I would consider deal breakers).  In that post, I made the argument that maybe it was time to consider breaking OCLC up to reflect what it has become: an organization with two distinct facets, a membership component and a vendor component.  This comment led to a conversation with someone at OCLC who questioned whether I honestly believed that the library community would be better off if OCLC was broken up, and it was obvious from our conversation that on this point, we would simply need to agree to disagree.  As a side note, I think that these types of disagreements and conversations are actually really important to have.  I’m always nervous of communities or groups in which everyone agrees, since it usually means that people either are not thinking critically or no one really cares.  Secondly, I think that we all (OCLC and myself for that matter) want what’s best for the library community; we just have different visions of what that might be.

Anyway, back to my topic.  Now, I’m going to preface this discussion by saying that this is obviously my own opinion and one that may not be shared by many people within the library community (I really have no idea).  Even within the library open source community, where I’m sure this opinion would be more prevalent (or at least entertained), I’m pretty sure I’m still in the minority.  But as I say, I think that these conversations are important to consider — specifically as we move down a path where OCLC is very quickly positioning themselves to become the library community’s default service provider for all things library (in terms of ILL, ILS interface, cataloging, etc.).

So when I talk about breaking up OCLC, exactly what am I talking about?  Well, in order to follow me down the path that I am going to take you, we have to talk about OCLC as I currently see them.  Watching OCLC during the 10 years (I can’t believe it’s actually been 10 years) that I have been in libraries, I have seen a quickening evolution of OCLC from strictly a member-driven organization to more of a hybrid organization.  On the one hand, there is what many would consider the membership side of OCLC, that being WorldCat, ILL and their research and development office.  On the other hand, there is OCLC’s vendor arm…a good example of this would be WorldCat Local and WorldCat Navigator.  So how do I make these distinctions?  Membership services are those that I would consider core services.  These are services that OCLC has developed to add value to what OCLC likes to refer to as the Library Commons (WorldCat).  OCLC’s vendor services are those tools or programs that OCLC sells on top of the Library Commons, of which I think WorldCat Local/Navigator is a good example.  Now, at this point, I know that folks at OCLC (and likely in the membership) would argue that both WorldCat Local and Navigator do provide services that the OCLC membership is currently requesting.  I won’t deny that.  However, I would answer that the fact that OCLC treats the Library Commons (WorldCat) as its own closed personal community has the unintended effect of limiting the library community’s (and I include both commercial and non-commercial entities in my definition of community) ability to develop new service models.  In effect, we become much more dependent on how OCLC envisions the future of libraries.  Let me try and tease this out a little bit more…

Philosophically, the biggest problem that I have with the current situation is the commingling of OCLC’s treatment of the Commons (WorldCat) and their current strategy of being the sole commercial entity with the ability to interact with the Commons.  I’m a firm believer that the more diverse the landscape or ecology, the more likely that innovation will take place.  We’ve seen this time and time again both inside (Evergreen and Koha certainly have shaken up the traditional ILS market) and outside (web browsers are a good example of how competition breeds innovation) the library community.  However, by isolating the Commons, OCLC is threatening this diversity of thought.  Now, I have a whole set of different issues with the current library ILS community, but in this case, I think that OCLC’s treatment of the Commons, and their ability to leverage that service, unfairly skews the ability of both commercial and non-commercial entities to provide innovative services on top of those Commons (and before anyone jumps on me for non-commercial use, let me finish my thoughts here).  Commercially, I’m fairly certain that the current crop of ILS vendors would very much like to provide their own WorldCat Local/Navigator interfaces to their customers and, I’m sure, would be able to tie these interfaces closely with services already provided by the user’s ILS.  I could envision things like ERM (electronic resource management), simplified requesting, etc. all being possible if the likes of ExLibris or Innovative Interfaces were allowed to build tools upon the Library Commons (WorldCat).  Maybe I would like to develop my own version of WorldCat Local/Navigator that interacts with the Commons and sell it as a product (kind of the same way ezproxy was sold prior to being acquired by OCLC), or a group of researchers would like to do the same.
As a commercial entity, I’m fairly certain that this type of development model wouldn’t be kosher with OCLC unless I licensed access to WorldCat (and I’m not certain that they would, given that this would compete against one of their services).  Likewise, open source folks like LibLime or Equinox may like to create an open source version of the WorldCat Local interface.  Under the current guidelines, I understand that an open source implementation of WorldCat Local can exist, but as I understand that agreement, I’m not certain that groups like LibLime or Equinox (or another entity) could take that project and then sell support-based services around it (I’m unclear on that one though).  However, it’s very unlikely that the library world will see any of these types of developments (well, maybe the open source WorldCat Local, since I have a group that could use this and a number of people interested in developing it) because OCLC has come to treat what it calls the Commons (WorldCat) as its own personal data store.  There’s that commingling again.

So if it was up to me, how would I resolve this situation?  Well, I see two possible scenarios. 

  1. Open up WorldCat.  OCLC likes to refer to WorldCat as the Library Commons, so let’s treat it as such.  Remove the barriers for access and allow anyone and everyone the ability to essentially have their own copy of the Library Commons and its data.  Now, rather than specifying terms of transfer and telling libraries under what conditions they can and cannot make their metadata available to other groups, the membership could consider what type of Open Data license the Commons could be made available under.  Something like the Creative Commons share-alike license, which allows for both commercial and non-commercial usage but requires all parties to contribute all changes to the data back to the community (in essence, this is kind of what Open Library is doing with their metadata), may be appropriate.  OCLC would be free to develop their own products, but the rest of the library community (both library and vendor community) would have equal opportunity to develop new services and ways of visualizing the data found in the Commons.  Does this devalue the Commons (WorldCat)?  I don’t think so; look at Wikipedia.  It uses this model of distribution, yet I’ve never heard anyone say that this devalues its content.  Would there be challenges?  For sure.  Probably one of the biggest would be the way that it would change what it means to be a member of OCLC.  If each person could download their own personal copy of the Commons, would libraries stay members?  I’m certain that they would, but I’m sure that what it means to be a member would certainly change.
  2. Split OCLC’s membership services from OCLC’s vendor services.  Under this example, WorldCat Local/Navigator development would be spun away from OCLC as a separate business (this happens in academia all the time).  Were this to happen, OCLC would be able to develop terms for license that could then be leveraged by all members of the commercial library community, removing the artificial advantage OCLC is currently able to leverage (both in terms of data and deciding who is allowed to work with the Commons).  In all likelihood, I think that this model represents the smallest change for the membership and would continue to allow OCLC to make the Commons more available to non-commercial development without artificially limiting other groups interested in building new services.

One last thought.  In talking to people today, I heard a number of times that OCLC restricting access to the Commons was in fact a good thing, in part because it finally allowed the library community the ability to leverage resources not available to the vendor communities.  In some way, we could finally stick it to them.  That’s fine, I’m all for developing tools and services, but this particular type of thinking I find worrisome.  If we, as a community, feel that we are unable to develop compelling tools and services that are able to compete with other vendor offerings without an artificial advantage, well, that’s just sad and says a little something about how we see ourselves as a community.  And this too is something that I’d like to see change, because if you look around, you will see that there are a myriad of projects (Koha, Evergreen, VuFind, Fedora, DSpace, LibraryFind, XC Catalog, Zotero, etc.) where developers (some library developers, some not) are re-envisioning how they see many of the services within the library and putting their time and effort into realizing those visions.

 

–TR

 Posted by at 10:49 pm
Feb 23, 2008
 

Like a number of people, I found the following piece (http://chronicle.com/weekly/v54/i24/24a01101.htm) from the Chronicle of Higher Education on the Open Library fairly interesting, in part because of the topics that the author chose to highlight.  I tend to categorize pieces such as this as fluff, in that one rarely gets any content of substance from them.  However, in a short article about the Internet Archive’s Open Library initiative, I found it interesting that so much of the article centered around OCLC, or, should I say, the silence coming from OCLC as members seek to clarify OCLC’s position in regards to the Open Library and its members’ potential participation in this project.  Two things that jump out:

  1. “Librarians are not just uneasy having nonlibrarians edit catalogs; they are also afraid of offending OCLC.”

    An exceptional understatement, though one that doesn’t extend just to the Open Library.  As a general rule, I find that librarians are way too concerned with offending OCLC, with many having a feeling that, should offense be taken, it could have long-running repercussions for the institution.  Are these concerns valid?  For OCLC, I think not.  While I firmly believe that OCLC occupies the same vendor space as other entities like EBSCOhost, Elsevier and Serials Solutions, I think that they are much more responsive to their member customers, due in part to the organization’s roots as a large co-op.  Of course, librarians and libraries have been conditioned to believe that consequences will follow if one rocks the boat or steps on their partner’s toes.  And unfortunately (and much to my chagrin), I’ve had occasion myself to say or post opinions that have caused push back from content/software providers currently serving Oregon State.  Fortunately, my director doesn’t mind when the pot periodically gets stirred, but not everyone is as lucky.  So, I can certainly understand where the nervousness is coming from.

    At the same time, I think that OCLC is contributing to this sense of uncertainty.  OCLC hasn’t been caught by surprise by the Open Library’s development work and certainly hasn’t been surprised by the Open Library asking OCLC members to contribute data to the project.  For close to a year, OCLC has had the opportunity to provide some form of guidance or position as it relates to the Open Library project.  Instead, they have been silent.  This leaves librarians and libraries to consult their local OCLC representatives, who have been given widely varying information regarding the legality of participating in this project.  While I’ve yet to hear of anyone being told that a library could not participate in the project, it has been quietly discouraged by OCLC’s deafening silence.
  2. “But one OCLC official, speaking on the condition that he not be identified, said Open Library was a waste of time and resources, and predicted it would fail.”

    Again, it’s interesting that in a piece like this, this comment would make its way into the article.  Whether or not this reflects OCLC’s current position on this particular project, I think that a number of good things may come out of the Open Library project, even if indirectly.  First, OCLC’s grid services.  While likely not a direct result of the Open Library’s project, I’d guess that the current desire to accelerate their availability is in response to the growing number of projects currently looking to move into the space that OCLC has traditionally monopolized.  Yes, let’s call it what it is: in this space, OCLC functions as a monopoly, because OCLC has essentially been allowed to rely on its position to squeeze out competing projects (RLG) and leverage their data to create services that would be otherwise impossible to create without the metadata that OCLC currently possesses.  I think to some degree, projects like the Open Library give OCLC pause in the sense that at present, they see their bibliographic and holdings content, WorldCat, as their crown jewel.  It represents a body of work that exists nowhere else in the world and gives them a potential advantage over any cloud-based service being developed within the library community.  At the same time, as OCLC goes forward and libraries become more interested in building some of their own tools (either individually or as part of a consortium), I think that WorldCat, and the data beneath it, will actually become less important for OCLC; rather, it will be the services that they develop on top of it that will hold the most value.  And I think that projects like the Open Library have accelerated this development.  As Martha Stewart would say, it’s a good thing.

    Secondly, I think that this quote is interesting in a larger sense as to how it relates to OCLC as a whole.  They are undergoing big changes (business changes, philosophical changes), and I think that this represents that to some degree.  As the piece notes, OCLC’s public face sees cooperation as a good thing, while maybe privately, that’s not the case.  But honestly, I think that this is healthy.  OCLC is hiring a lot of bright people and has traditionally had a lot of bright people on staff, and what we see is that they are thinking about these issues and how they relate within the larger community (even beyond OCLC).  Now, whether or not OCLC is particularly happy that these disagreements are being aired publicly (something that hasn’t traditionally happened), well, that would be something to keep an eye on as well.

–TR

[update: Spell check fails me again, sorry Martha]


 Posted by at 3:23 am
Jan 28, 2008
 

I’ve had a few folks ask what the procedure would be for a user wanting to harvest the UMich OAI records using MarcEdit.  Well, there are two workflows that can be followed depending on what you want to do.  You can harvest the OAI data and translate it directly to MARC, or you can harvest the raw data directly to your file system.  Here’s how each would work:

Generating MARC records from the OAI content:

  1. Start MarcEdit
  2. From the Main Screen, click on the Harvest OAI Records Link
    image
  3. Once the link has been selected, you have a number of options available to you to control the harvesting.  Required options are those that are seen when the screen opens.  Advanced Settings, or optional settings define additional options available to the user.  Here’s a screenshot of the Harvester with the Advanced Options expanded:
     image
    The required elements that must be filled in are the Server Address (the address pointing to the OAI URL), metadata type (format to be downloaded) and Crosswalk Path.  If you select any of the predefined metadata types, the program will select the crosswalk path for you.  If you add your own, then you will need to point the program to the crosswalk path.  Set name is optional.  If you leave this value blank, the harvester will attempt to harvest all available sets on the defined server. 

    Advanced settings give the user a number of additional harvesting options, generally set aside to help the user control the flow of the harvest.  For example, users can harvest an individual record by entering the record’s identifier into the GetRecord textbox.  A user could resume a harvest by entering the resumptionToken into the ResumptionToken textbox.  If the user wanted to harvest a subset of a specific data set, they can use a date limit (of course, you must use the date format supported by the server, generally yyyy or yyyy-mm-dd format).  Users can also determine if they want their metadata translated into MARC8 (since the harvester assumes UTF8 for all xml data) and change the timeout settings the harvester uses for returning data (you generally shouldn’t change this).  Finally, for users that don’t want to harvest data into MARC, but just need the raw data, there is the ability to tell the harvester to just harvest data to the local file system.  If this option is checked, then the CrossWalk Path’s label and behavior will change, requiring the user to enter a path to a directory to tell the harvester where it should save the harvested files.

  4. For the UMich Digital Books, a user would want to utilize the following settings to harvest metadata into MARC:
    image
    Users wanting to ensure that the MARC data is in MARC8 and not UTF8 format should check the Translate to MARC-8 option.  Once these settings have been set, a user will just need to click the OK button.  For this set (mbooks), there are approximately 111000+ records, so harvesting will take approximately an hour or so to complete.  Longer if you ask the program to translate data into MARC8.
  5. When finished, users will be prompted with a status box indicating the number of records, resumptiontokens and last resumptiontoken processed (and any error information if an error occurred on process).
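
For folks who want to see roughly what the harvester options above turn into on the wire, here’s a small Python sketch of the OAI-PMH request URLs involved.  This is not MarcEdit’s code; the function names are made up for illustration, and the `marc21` metadata prefix is an assumption (the server’s ListMetadataFormats response will tell you what it really offers).  The base URL and the `mbooks` set name are the ones used in this post.

```python
from urllib.parse import urlencode

# Base URL reported by the UM OAI server (see the Identify output in this post).
UMICH_BASE_URL = "http://quod.lib.umich.edu/cgi/o/oai/oai"


def build_list_records_url(base_url, metadata_prefix=None, set_name=None,
                           from_date=None, until_date=None,
                           resumption_token=None):
    """Return an OAI-PMH ListRecords URL for the given harvest options."""
    if resumption_token:
        # A resumptionToken is an exclusive argument: it is sent with the
        # verb only, never together with set/from/until/metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        if not metadata_prefix:
            raise ValueError("metadata_prefix is required for a new harvest")
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_name:
            params["set"] = set_name      # leave blank to harvest all sets
        if from_date:
            params["from"] = from_date    # date limit, in a server-supported format
        if until_date:
            params["until"] = until_date
    return base_url + "?" + urlencode(params)


def build_get_record_url(base_url, identifier, metadata_prefix):
    """Return an OAI-PMH GetRecord URL for a single record identifier."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # "marc21" is an assumed metadata prefix, used here only as an example.
    print(build_list_records_url(UMICH_BASE_URL, "marc21", set_name="mbooks"))
```

Resuming an interrupted harvest is then just a matter of calling `build_list_records_url` again with the last resumptionToken you were given.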

 

Harvesting OAI records directly to the filesystem

  1. Start up MarcEdit
  2. Select Harvest OAI records link
  3. Enter the following information (Server folder location will obviously vary):
    image 
  4. Files are harvested into the defined directory, numbered numerically according to the resumption token processed.  Again, when processing is finished, a summary window will be generated to inform the user of harvest status and error information related to the harvest.
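
If you’d rather script the raw harvest yourself, the loop looks something like the Python sketch below: request a page, save it, pull out the resumptionToken, repeat until the server stops handing one back.  Again, this is only an illustration and not how MarcEdit is written; the function names, the zero-padded file naming and the 60 second timeout are all my own assumptions.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch(url, timeout=60):
    """Download one response page (the 60 second timeout is an arbitrary choice)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def next_token(page_bytes):
    """Return the resumptionToken in a response page, or None if the list is complete."""
    root = ET.fromstring(page_bytes)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


def harvest_to_directory(base_url, metadata_prefix, out_dir,
                         set_name=None, fetch_page=fetch):
    """Save every ListRecords page as a numbered XML file; return the page count."""
    os.makedirs(out_dir, exist_ok=True)
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_name:
        params["set"] = set_name
    page_number = 0
    while True:
        page = fetch_page(base_url + "?" + urlencode(params))
        page_number += 1
        # Hypothetical naming scheme: 000001.xml, 000002.xml, ...
        path = os.path.join(out_dir, f"{page_number:06d}.xml")
        with open(path, "wb") as handle:
            handle.write(page)
        token = next_token(page)
        if token is None:
            return page_number
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token}
```

Passing the download function in as `fetch_page` keeps the loop easy to test without touching a live server, and gives you one place to add retry logic.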

Errors related to the UMich Harvest that could be encountered:

My guess is that you wouldn’t see these if you are using the most current version of MarcEdit (uploaded 2008-01-27); however, you may run into them if harvesting using other tools or older versions of MarcEdit.

  1. Server Timeout:  When harvesting all records, I was routinely seeing the server reset its connection after harvesting 10-18 resumption Tokens.  The current version of MarcEdit has some fall over code that will reinitiate the harvest under these conditions, stopping after 3 failed attempts.
  2. Invalid MARC data:  Within the 111000+ records, there are approximately 40-60+ MARC records that have too few characters represented in the MARC leader element.  This is problematic because this error will invalidate the record and depending on how the MARC parser handles records, poison the remainder of the file.  MarcEdit accommodates these errors by auto correcting the leader values — but this could be a problem with other tools.
  3. image
    This error message will be generated if you set the start and end elements using an invalid date format.  You should always check with the OAI server to see what date formats are supported by the server.  In this case, the date format expected by the UM OAI server is as follows:
    <repositoryName>University of Michigan Library Repository</repositoryName> 
      <baseURL>http://quod.lib.umich.edu/cgi/o/oai/oai</baseURL> 
      <protocolVersion>2.0</protocolVersion> 
      <adminEmail>dlps-help@umich.edu</adminEmail> 
      <earliestDatestamp>2007-10-24T18:48:49Z</earliestDatestamp> 
      <deletedRecord>persistent</deletedRecord> 
      <granularity>YYYY-MM-DDThh:mm:ssZ</granularity> 
    

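    Notice the granularity element.  Under OAI-PMH 2.0, this tells me the server accepts dates at either day or second granularity, so either of the following formats would be valid (a bare year or year-month such as 2008 or 2008-01 is not a legitimate form under the specification):
    2008-01-01
    2008-01-01T00:00:00Z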
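If you are scripting your own harvest, one way to avoid the date error is to generate the from/until values rather than typing them.  The sketch below is my own illustration (the function name is hypothetical) and assumes the two date forms defined by OAI-PMH 2.0, day and second granularity, both expressed in UTC.

```python
from datetime import datetime, timezone

# The two granularity strings an OAI-PMH 2.0 Identify response can report.
DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"


def format_oai_date(moment, granularity=DAY):
    """Format a datetime for the from/until arguments (hypothetical helper)."""
    # Treat naive datetimes as UTC; convert aware ones to UTC.
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    moment = moment.astimezone(timezone.utc)
    if granularity == DAY:
        return moment.strftime("%Y-%m-%d")
    if granularity == SECONDS:
        return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError(f"unsupported granularity: {granularity}")
```

Whatever form you pick, use the same one for both the start and end values in a single request.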

Anyway, that’s pretty much it.  If you are just interested in seeing what type of data UM is exposing with these data elements, you can find that data (harvested 2008-01-25) at: umich_books.zip (~63 MB).

 

–TR

 

 Posted by at 1:35 am
Jan 26, 2008
 

I was playing with MarcEdit’s OAI harvester, making a few changes to fix a problem that had been discovered, as well as adding some fail-over code that allows the harvester to continue processing (or at least, attempt to continue processing) when the OAI server breaks the connection (generally through a timeout).  To test, I decided to work with the Google Books record sets that the University of Michigan recently made available.  It’s a large set, and the server is one where timeouts had been identified as an issue (i.e., this came up because a MarcEdit user had inquired about a problem they were having harvesting data). 

Anyway, I’ll likely post the update to the OAI harvesting code on Sunday or so (which will also include an update to the CJK processing component when going from MARC8-UTF8 — particularly when the record sets contain badly encoded data), and with it, I’ll likely include a small tutorial for users wanting to use MarcEdit to do one of the following:

  1. Harvest the UM digital book records from OAI directly into MARC21 (saving the character set in either legacy MARC8 or UTF8 format)
  2. Harvest the raw UM digital book metadata records via OAI (without the MARC conversion)

While I think that the Harvester is fairly straightforward to use, I’m going to post some instructions, in part so that I can underline some of the common error messages that one might see and what they mean.  For example, with the UM harvesting, I found that the OAI server tended to time out after approximately 15 queries using a persistent connection.  When it stopped, the server would throw a 503 error.  I was able to overcome the issue by adding some code to the application that tracks failures, pauses harvesting and restarts the connection to the server.  Still, these types of errors are not easy for most users to debug, since they can’t be sure whether the issue lies with the harvesting software or the server being harvested. 
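For folks writing their own harvesters, the pause-and-restart idea looks roughly like the Python sketch below.  To be clear, this is an illustration and not the MarcEdit code: the function name, the 30 second pause and the injectable opener and sleep parameters are all my assumptions, and the limit of three attempts is simply the number MarcEdit happens to use.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, max_attempts=3, pause_seconds=30,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL, pausing and reconnecting after a failure (hypothetical helper)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Each attempt opens a fresh connection to the server.
            with opener(url) as response:
                return response.read()
        except (urllib.error.URLError, ConnectionError) as error:
            # HTTPError (e.g. a 503) is a subclass of URLError, so it lands here too.
            last_error = error
            if attempt < max_attempts:
                sleep(pause_seconds)
    raise RuntimeError(f"giving up after {max_attempts} failed attempts") from last_error
```

Reporting the final failure with the original error attached at least tells the user that the server, and not the harvesting software, was the one that gave up.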

Another problem that I’ve coded MarcEdit to fix on the fly is that a handful of MARC21 records (I believe I identified approximately 40 of 111000+) sent via OAI have invalid leader statements (i.e., not enough characters in the string).  For example, in this record: http://quod.lib.umich.edu/cgi/o/oai/oai?verb=GetRecord&metadataPrefix=marc21&identifier=oai:quod.lib.umich.edu:MIU01-001300473, the leader is one character too short.  MarcEdit can fix these on the fly (at least it will try) by validating the length of the LDR and, if it is short, padding spaces to the end of the string.  Since length and directory are calculated algorithmically, the records will be valid, but some of the leader data may get offset due to the padding.  However, there isn’t really a thing you can do about that, outside of rejecting the records as invalid or accepting the data as is (which then poisons all the other records downloaded in the set).  I’m putting together some info for the folks at UM that includes some of the problems that I’ve run into working with their OAI data, just in case they are interested.
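The padding fix itself is tiny.  Here is a sketch in Python (again, not MarcEdit’s code, and the function name is mine) that relies only on the fact that a MARC21 leader is always 24 characters long.  The sample leader value is made up for the example.

```python
# A MARC21 leader is a fixed-length field of 24 characters.
LEADER_LENGTH = 24


def pad_leader(leader):
    """Pad a short leader with trailing spaces (hypothetical helper)."""
    if len(leader) >= LEADER_LENGTH:
        # Nothing to repair; leave the leader untouched.
        return leader
    # Padding goes on the end, so positions after the missing
    # character may still hold offset values.
    return leader.ljust(LEADER_LENGTH)
```

Note that this only restores the length.  It cannot know which position lost a character, which is why some leader values can end up shifted.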

Anyway, one thing I thought I would do is post a set of these records, in MARC UTF8 and MARC8 character sets (harvested 2008-01-26 between around 1:30 am and 3:00 am), for folks interested in taking a look at the exposed metadata.  You will find that the vast majority of these records appear to be brief metadata records containing basically an author, title and URL, though full records are scattered through the record sets.  There are over 111000 records in the six files.  The files in the zip are:

  1. mbooks-utf8 (combined data set)
  2. mbooks-marc8 (combined data set in marc8)
  3. pd-utf8 (international public domain books)
  4. pd-marc8 (international public domain books in marc8)
  5. pdus-utf8 (u.s. public domain books)
  6. pdus-marc8 (u.s. public domain books in marc8)

A quick note.  These are largish files.  MarcEdit has a preview mode specifically for this purpose.  Unless disabled, MarcEdit by default only loads the first 1 MB of data into the MarcEditor.  This will allow you to preview ~1000-1500 records, but using the editor tools, you can globally edit the entire data file.  This is done because reading data into the Editor is expensive (memory and time).  If you really want to open large files into the Editor, you need to make sure your virtual memory is set fairly high. 
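If you want to do something similar in your own scripts, the idea can be sketched as follows.  This is only an illustration of the preview approach in Python, not how MarcEdit implements it; the function name is hypothetical, and the only MARC fact it leans on is that every record ends with the record terminator byte 0x1D.

```python
# Every MARC record ends with the record terminator byte 0x1D.
RECORD_TERMINATOR = b"\x1d"
PREVIEW_BYTES = 1024 * 1024  # first 1 MB of the file


def preview_records(stream, limit=PREVIEW_BYTES):
    """Return the complete records found in the first `limit` bytes (hypothetical helper)."""
    chunk = stream.read(limit)
    # Drop the trailing partial record, if the chunk ends mid-record.
    end = chunk.rfind(RECORD_TERMINATOR)
    if end == -1:
        return []
    return [record + RECORD_TERMINATOR
            for record in chunk[:end].split(RECORD_TERMINATOR)]
```

Open the file in binary mode (for example, open(path, "rb")) and pass the file object in; only the first chunk is ever read into memory.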

So long as the folks at UM don’t ask me to take it down, I’ve posted these test files at: http://osulibrary.oregonstate.edu/techservices/marc/umich_books.zip for viewing and testing purposes (~62.7 MB), but I would recommend harvesting these records from http://quod.lib.umich.edu/cgi/o/oai/oai directly yourself if you want to use them since UM is adding new records all the time.  And remember, if you want to harvest them with MarcEdit, you’ll need to wait till I post the update on Sunday.

–TR


 Posted by at 4:16 pm
Jan 14, 2008
 

(Note, I started this post last night, but had to put it away so I could get some rest before a 6 am flight.  I finished the remainder of this while waiting for my flight). 

So, after getting up way too early this morning, I staggered my way down to the LITA Top Tech Trends discussion.  Unfortunately, it seemed like a number of other folks did the same thing, so I only ended up hanging out for a little bit.  I just don’t have the stamina in the morning to live through cramped quarters, poor broadband and no caffeine.  I get enough of that when I fly (which I get to do tomorrow).  Fortunately, a number of folks who had been asked to provide tech trends have begun (or have been) posting their lists, and some folks who braved the early morning hours have started blogging their responses.  I personally wasn’t asked to provide my list of tech trends, but I’m going to anyway, as well as comment on a few of the trends either posted or discussed during the meeting.  Remember, this is just one nut’s list, so take it for what it is.

  1. Ultra-light and small PCs (Referenced from Karen Coombs)
    Karen is one of a number of folks who have taken note of the wide range of low-cost computers currently being made available to the general public.  These machines, which run between $189 and $400, are portable and have the potential to bring computers to a wider audience.  I’ll have to admit, I’m personally not sold on these machines, in part because of the customer base that they are aiming for.  Companies such as Asus (maker of the Eee PC) note that these machines are primarily targeted at users looking for a portable second machine and at kids or elderly users looking for a machine simply to surf the web.  A look at the specifications shows that many of these low-cost machines have Celeron-class processors, 512 MB of RAM and poor graphics processing.  Is this good enough for browsing the web?  I’d argue, no.  The current and future web is a rich environment, built on CSS, XML, XSLT, Flash, Java, etc.  I think what people seem to forget is that this rich content takes a number of resources simply to view.  Case in point: I set up a copy of CentOS on a 1.2 GHz Centrino with 512 MB RAM and a generic graphics card (8 MB of shared memory), and while I could use this machine to browse the web and do office work with OpenOffice, I certainly wouldn’t want to.  Just running the Linux shell was painful, web browsing was clunky and office work was basically unusable; essentially, the software surpassed the machine’s capabilities right out of the box.  Is this the type of resource I’d want to be lending to my patrons?  Probably not, since I wouldn’t want my patrons to associate my library’s technical expertise with sub-standard resources.  Does this mean that ultra-portables will not be in vogue this year and the next?  Well, I didn’t say that.  
    A look at the success the iPhone is having (a pocket PC retailing for close to $1500 without a contract) seems to indicate that users want, and are willing to pay a premium price for, portability, so long as that portability doesn’t come at too high of a price. 
  2. Branding outside services as our own (and branding in general)
    There was a little bit of talk about this: the idea of moving specific services outside the library to services like Google or Amazon and, essentially, rebranding them.  This makes some sense.  However, I always cringe when we start talking about branding and how to make the library more visible.  From my perspective, the library is already too visible, i.e., intrusive into our users’ lives.  Libraries want to be noticed, and we want our patrons and organizations to see where the library gives them value.  It’s a necessary evil in times when competition for budget dollars is high.  However, I think it does our users a disservice.  Personally, I’d like to see the library become less visible, providing users direct access to information without the need to have the library’s fingerprints all over the process.  We can make services that are transparent (or mostly transparent), and we should. 

    The same thing goes for our vendors.  I’ll use III as an example only because we are an Innovative library, so I’m more familiar with their software.  By all rights, Encore is a serviceable product that will likely make III a lot of money.  However, in the public instances currently available (Michigan State, Nashville Public Library), the III branding is actually larger than that of the library (if the library branding shows up at all).  And this is in no way unique to III.  Do patrons care what software is being used?  I doubt it.  Should they care?  No.  They should simply be concerned that it works, and works in a way that doesn’t get in their way.  From my perspective, branding is just one more thing that gets in the way.

  3. Collections as services will change the way libraries do collection development
    I’m surprised that we don’t hear more about this, but I’m honestly of the opinion that metadata portability and the ability for libraries to build their collections as web services will change the way libraries do collection development.  In the past, collection development was focused primarily on what could be physically or digitally acquired.  However, as more organizations move content online (particularly primary resources), libraries will be able to shift from an acquisitions model to a services model.  Protocols like OAI-PMH make it possible (and relatively simple) for libraries to actively “collect” content from their peer institutions in ways that were never possible in the past. 
  4. Increased move to outside library IT and increased love for hosted services (whether we want them or not)
    While it has taken a great deal of time, I think it is fair to say that libraries are more open to the idea of using Open Source software than ever before.  In the short term, this has been a boon for library IT departments, which have seen an investment in hardware and programmer support.  I think this investment in programming support will be short-lived.  In some respects, I see libraries going through their own version of the .COM boom (just without all the money).  Open Source is suddenly in vogue.  Sexy programs like Evergreen have made a great deal of noise and inroads into a very traditionally vendor-oriented community.  People are excited, and that excitement is being made manifest by the growing number of software development positions being offered within libraries.  However, at some point, I see the bubble bursting.  And why?  Because most libraries will come to realize that either 1) having a programmer on staff is prohibitively expensive or 2) the library will be bled dry by what I’ve heard Kyle Banerjee call vampire services.  What is a vampire service?  A vampire service is a service that consumes a disproportionate number of resources but will not die (generally for political reasons).  One of the dangers for libraries employing developers is the inclination to develop services as part of a grant or grandiose vision that eventually become vampire services.  They bleed an organization dry and build a culture that is distrustful of all in-house development (see our current caution in looking at open source ILS systems.  It wasn’t too long ago that a number of institutions used locally developed [or open] ILS systems, and the pain associated with those early products still affects our opinions of non-vendor ILS software today). 

    But here’s the good news.  Will all software development positions within the library go away?  No.  In fact, I’d like to think that as positions within individual organizations become more scarce, consortia will step into this vacated space.  Like many of our other services moving to a network level, I think that the centralization of library development efforts would be a very positive outcome, in that it would help to increase collaboration between organizations and reduce the number of projects that are all trying to re-invent the same wheel.  I think of our own consortium in Oregon and Washington, Summit, and the dynamic organization it could become if only the institutions within it would be willing to give over some of their autonomy and funding to create a research and development branch within the consortium.  Much of the current development work (not all) could be moved up to the consortium level, allowing more members to directly benefit from the work done. 

    At the same time, I see the increase of hosted services on the horizon.  I think that folks like LibLime really get it.  Their hosted services for small to medium size libraries presumably reduce LibLime’s costs to manage and maintain the software and free those hosted libraries from the need to worry about hardware and support issues.  When you look at the future of open source in libraries, I think that this is it.  For every one organization willing to run open source within their library, there will be 5 others that will only be able to feasibly support that infrastructure if it is outsourced as a hosted service.  We will see a number of open source projects move in this direction.  Hosted services for DSpace, Fedora, metasearch, the ILS: these will all continue to emerge and grow throughout this year and into the next 5 years.  And we will see the vendor space start to react to this phenomenon as well.  A number of vendors, like III, already provide hosted services.  However, I see them making a much more aggressive push to compel their users (higher licensing, etc) to move to a hosted service model. 

  5. OCLC will continue down the path to becoming just another vendor
    I’d like nothing more than to be wrong, but I don’t think I am.  Whether it’s this year, the next or the year after that, OCLC will continue to alienate its member institutions, eventually losing the privileged status libraries have granted it throughout the years and becoming just another vendor (though a powerful one).  Over the last two years, we’ve seen a lot of happenings come from Dublin, Ohio.  There was the merger with RLG, the hiring of many talented librarians, WorldCat.org, WorldCat Local and OCLC’s newest initiatives circulating around their grid services.  OCLC is amassing a great deal of capital (money, data, members), and I think we will see how they intend to leverage this capital this year and the next.  Now, how they leverage this capital will go a long way to deciding what type of company OCLC will be from here forward.  Already, grumblings are being heard within the library development community as OCLC continues to build new revenue streams from web services made possible only through the contribution of metadata records from member libraries.  As this process continues, I think you will continue to hear grumblings from libraries who believe that these services should be made freely available to members, since it was member dollars and time that provided OCLC exclusively with the data necessary to develop these services.  **Sidebar, this is something that we shouldn’t overlook.  If your library is an OCLC member, you should be paying close attention to how OCLC develops their grid services.  Remember, OCLC is supposed to be a member-driven organization.  It’s your organization.  Hold it accountable and make your voice heard when it comes to how these services are implemented.  Remember, OCLC only exists through the cooperative efforts of both OCLC and the thousands of member libraries that contribute metadata to the database.**  Unfortunately, I’m not sure what OCLC could do at this point to retain this position of privilege.  
    Already, too many people that I talk to see OCLC as just another vendor that doesn’t necessarily have the best interests of the library community at heart.  I’d like to think that they are wrong, that OCLC still remains an organization dedicated to furthering libraries and not just OCLC.  But at this point, I’m not sure we know (or they know).  What we do know is that there are a number of dedicated individuals that came to OCLC because they wanted to help move libraries forward; let’s hope OCLC will continue to let them do so.  And we watch, and wait.

Anyway, that’s my list of trends.

–TR

 

 Posted by at 4:12 pm