Finally, in Athens

By reeset / On / In Code4Lib 2007

So Kyle and I finally made our way from Oregon to Athens, Georgia for Code4Lib.  Apparently, we missed a big storm in the Valley: the wind is blowing 30-45 mph, with cold, slushy rain falling off and on.  I was told that it hit hard around 5 am.  Fortunately (depending on your point of view), I was up at 2:30 this morning to drive up to PDX for my flight, so I basically missed it.

The flight was uneventful, though the drive to Athens was fun.  Got into the airport and picked up a car from Enterprise.  I originally hadn’t planned on driving, but my flight home is early enough that I didn’t want to trust the bus to get me there.  So, picked up a car and got the full treatment of Atlanta rush hour, fighting my way up 85 N.  Thank goodness for the carpool lane: Kyle and I were able to skip a lot of the traffic by using it.

Found Athens, then lost our way to our hotel, then found it.  So bring on the conference.


Keeping your cycle rolling

By reeset / On / In Cycling

Most folks know I love cycling, and I’m probably one of the relatively few folks commuting any real distance who actually put more miles on their bike than they do on their car.  I was figuring it out last year:  ~15,000 miles for the bike, ~10,000 miles for the car.  🙂

So when I buy equipment for my bike, I rarely skimp on the essentials.  However, for a lot of components, much of the cost difference comes from weight and some extra durability of materials, something most normal folks wouldn’t notice or probably even care about (exactly how many hundreds of dollars is shedding a few grams worth to you :)).  But the bike tires…that’s completely different.  Few things can make a commute as miserable as a flat tire.  Since I bike year round, half of the year I’m riding in the dark and the rain.  Stopping and trying to change a tire in that slop is not only time consuming but oftentimes fruitless.  More than once I have pulled off a tire and changed it, only to have it go flat 5 minutes later because some grime found its way inside the tire (and how can it not in weather like this).

So, the solution: Kyle had turned me on to the tires he rides in the winter, Schwalbe’s Marathon Plus.  These things are beasts of a tire (though I found a 700 x 25 to fit my bike) that I never really considered riding until I changed 3 flats in one day on relatively clean road.  Suddenly, adding a little extra weight to the bike didn’t seem so bad.

Well, it’s been almost 3 months and I finally got my first flat (which is close to unheard of; I tend to throw out most tires after 4 months), and I’m impressed.  It was a slow leak that I actually managed to limp home on.  When I got home, I pulled a construction staple, ~1/2 inch long, from my tire.  It had just pricked the inner tube.  I love these things.  They’re a little spendy (as far as bike tires go) and I’ll only ride them in the winter (when flats are most plentiful), but am I ever glad that Kyle convinced me to get myself a set.  For any of you winter road warriors out there: they come highly recommended.


MarcEdit update

By reeset / On / In MarcEdit

I’ve been doing some work on the Delimited Text importer this weekend.  Nothing big, just some simple changes to make the import process work better, particularly when dealing with CSV files (which MarcEdit periodically splits up incorrectly when dealing with MS Excel data).  Anyway, I think I’ve got a much better process using a much better set of regular expressions.  If this affects you, you can get the update at: MarcEdit50_Setup.exe.
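
To illustrate the kind of splitting problem involved, here is a minimal sketch in Python. It uses the standard csv module rather than regular expressions, the sample row is made up, and it is not the code MarcEdit itself uses. Excel-style CSV wraps any field containing a comma in double quotes and doubles any embedded quotes, so splitting on every comma breaks those fields apart:

```python
import csv
import io


def naive_split(line):
    """Split on every comma, ignoring quoting (the failure mode)."""
    return line.split(",")


def quoted_split(line):
    """Parse one CSV line, honoring double-quoted fields."""
    return next(csv.reader(io.StringIO(line)))


# Hypothetical sample row in the style Excel exports.
sample = '245,"Dune, a novel","He said ""hi"""'

print(naive_split(sample))   # the quoted title is cut in two
print(quoted_split(sample))  # three fields, quotes unescaped
```

The naive version returns four pieces for a three-field row, which is exactly the sort of mis-split a delimited text importer has to guard against.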



Images: New radiation warning–run for your life | ZDNet Photo Gallery

By reeset / On / In Uncategorized

I’m not quite sure why this strikes me as hilarious…probably the description as posted by ZDNet.  It reads:

The International Atomic Energy Agency and the International Organization for Standardization want you to know when it’s time to panic by adding a person running and a skull and crossbones to its radiation warning symbol.

Instantly, I remembered the Simpsons episode where Kent Brockman is talking to an analyst about a nuclear warning at the power plant and asks:

KENT BROCKMAN: Without knowing any of the facts, would you say now is the perfect time to panic?

Ha, love those Simpsons.


Running MarcEdit 5 on macs, linux and other updates

By reeset / On / In MarcEdit

So with the coming preconference at Code4Lib (which I wish I was going to :(), I’ve had a few folks ask if MarcEdit can run on a Mac or Linux system.  Well, the console version of the application can.  The GUI portion of the program still relies on components that haven’t been fully migrated into Mono, but the console version has worked just fine for about the past 6 months.  So, if you have a copy of Mono installed on your Mac or Linux box, you can try the instructions below.

In addition to providing some info on how to run on a Mac or Linux, I also thought I’d let folks know that this (the zipped content) and the formal install program have gone through a number of changes.  Most of these changes are related to the work I’ve been doing playing with Solr over the last week.  I wanted a large dataset to work with, and in doing some testing, I was disappointed in how slowly MarcEdit was translating data.  So I pulled a random sample of data from our catalog (1000 records) and started benchmarking from there.  At the start, processing a MARC file from MARC=>Solr was taking ~8 seconds.  After spending some time reworking the algorithms used to do this processing, I’ve cut processing time for the 1000 records to just under 2 seconds.  So this means you can process a 10,000 record file in ~18 to 20 seconds.  As an FYI, 10,000 record files seem to be the sweet spot for the application.  I had two large data sets: 2 million records from our catalog and 20 million records from our consortium.  Originally, I tried to process the 2 million records directly.  It worked, but it took forever.  Since MarcEdit does these translations using XSLT, the processing of 2 million records directly took ~6 hours.  However, splitting these into files of 10,000, I was able to process my 2 million records in under an hour.  Much better processing time, I thought.
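
For anyone curious why splitting is cheap: each record in a binary MARC (ISO 2709) file ends with a record terminator byte (0x1D), so a file can be cut into batches without parsing the records themselves. Below is a minimal sketch in Python under that assumption. The function names are made up for illustration, the whole file is held in memory for simplicity, and this is not MarcEdit's own splitting code.

```python
RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte


def split_records(data):
    """Return the individual records found in a binary MARC byte string."""
    parts = data.split(RECORD_TERMINATOR)
    # Re-attach the terminator; skip the empty piece after the last record.
    return [part + RECORD_TERMINATOR for part in parts if part]


def batch(records, size):
    """Group records into consecutive batches of at most `size` records."""
    return [b"".join(records[i:i + size])
            for i in range(0, len(records), size)]


# Stand-in data: three fake "records" (real ones carry a leader and directory).
fake = b"rec1\x1drec2\x1drec3\x1d"
chunks = batch(split_records(fake), 2)
print(len(chunks))  # number of batch files that would be written
```

With a batch size of 10,000, a 2 million record file comes out as 200 smaller files, each of which can then be translated on its own.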

Anyway, the changes to this build:

  1. Changes to the MARCEngine (stated above).  I’ve turned over 78 million records over the past 2 days to ensure that the character encoding is working correctly.  As far as I can tell, everything is working fine, though my datasets were not the most linguistically diverse.  So if you see a problem, let me know.  A smaller change that I worked on is the addition of some healing functions to the engine.  These allow the program to “correct” invalid character data that can sometimes (at least in our records) appear.
  2. I added two parameters that are available in all XSLT transformations.  You can define global params for pdate (the processing date of the file in yyyymmdd format) and for destfile (the name of the created xslt file).  I’ll likely add a few more parameters so that I can get access to data elements that I have a hard time recreating in pure XSLT.
  3. OAI harvester — I added the ability to harvest individual items from a repository for targeted harvesting.

You can download the update to MarcEdit at: MarcEdit50_Setup.exe.

So, instructions.  For folks looking to run on alternative platforms, give this a try:

1) Download the following to your mac:

2) Some common commands
a) Breaking a record:
mono cmarcedit.exe -s [your file] -d [save file] -break

b) Making a file:
mono cmarcedit.exe -s [your file] -d [save file] -make

c) Splitting a large MARC file to smaller files:
mono cmarcedit.exe -s [your file] -d [path to save directory] -split -records [num of records]

d) Translate MARC=>XML
mono cmarcedit.exe -s /home/reeset/Desktop/z3950.mrc -d /home/reeset/Desktop/solrtext.xml -marctoxml -xslt /home/reeset/marcedit5/XSLT/marcxml2solr.xsl

e) Translate a batch of MARC records to XML

mono cmarcedit.exe -s /home/reeset/Desktop/oasis_split2/ -d mrc -xslt /home/reeset/marcedit5/XSLT/marcxml2solr.xsl -batch -marctoxml

f) Get help info:

mono cmarcedit.exe -help
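
The -batch form in (e) already handles a whole directory. If you would rather drive the single-file form in (d) from a script, say to log or retry individual files, a wrapper could look roughly like the sketch below. It is written in Python, uses only the switches shown above (-s, -d, -marctoxml, -xslt), and the function name and file names are made up for illustration. It only builds the command lines; nothing is executed here.

```python
import os
import tempfile


def batch_commands(src_dir, xslt_path, exe="cmarcedit.exe"):
    """Build one `mono cmarcedit.exe ... -marctoxml` argv per .mrc file."""
    commands = []
    for name in sorted(os.listdir(src_dir)):
        base, ext = os.path.splitext(name)
        if ext.lower() != ".mrc":
            continue  # skip anything that is not a MARC file
        source = os.path.join(src_dir, name)
        dest = os.path.join(src_dir, base + ".xml")
        commands.append(["mono", exe, "-s", source, "-d", dest,
                         "-marctoxml", "-xslt", xslt_path])
    return commands


# Demonstration against a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.mrc", "b.mrc", "notes.txt"):
        open(os.path.join(tmp, name), "wb").close()
    demo = batch_commands(tmp, "marcxml2solr.xsl")

for cmd in demo:
    print(" ".join(cmd))
```

To actually run the conversions, each list could be handed to subprocess.run(cmd, check=True) from the directory holding cmarcedit.exe.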

As I mentioned, I’ve been making some changes to the XML components to make them faster.  I’m pretty sure you won’t run into any character set issues, but if you do, let me know.  I’ve processed some 70 million items over the past 2 days using the new method, generating items for indexing in Solr.  BTW, the Solr XSLT that Andrew Nagy had sent out is included in the marcedit5/XSLT folder (as are my current in-development stylesheets).

I’ve run all these tests on my Linux box (CentOS), but I’m sure it will work on a Mac.


Mainstreaming R&D

By reeset / On / In Digital Libraries, Innovative Interfaces

I’ve become more and more convinced over the past year, talking to directors, that for OSS development to be accepted as a part of the library community, it’s going to have to become a mainstream service.  Too much R&D in libraries is done as part of an individual, student or demo project.  To a large degree, front-line workers and developers within the library community have a healthy bent towards OSS.  But organizational attitudes change more slowly, and these are the ones that tend to matter.  So, I’m going to be taking a different course over this next year, at least within my own small part of the world.

Over the past three months, I’ve been leading a group looking at next generation ILS services for our regional consortium, Summit.  Summit is a consortium made up of 33 academic libraries throughout Oregon and Washington, with all systems being Innovative.  This is due to the fact that III’s consortial software really only works with III libraries.  In looking at the various options available, we’ve tried to keep an open mind.  I’ve been running copies of Koha and Evergreen over the past month to look at current functionality within a very untraditional consortial setting; folks have spoken to vendors like Endeca, Aquabrowser, III and OCLC as well as others.  In all, the process has shown me a couple of things.

  1. Given that this decision will be just on the consortium database, our options are somewhat limited.  III doesn’t make the process of having an outside vendor interact with the Innreach system easy, though we’ve been told it could be done.  This leaves three options: migrating off III as a group (can’t see that happening), partnering with III (what I think many would consider to be the safer, least disruptive choice), or working closely with OCLC, though the second and third options don’t hold much appeal to me personally.
  2. Which leads me to number 2: while the consortium has more than enough talent to develop an in-house solution, the organizational infrastructure simply doesn’t exist to allow such a solution to be considered.

The second realization is what struck me most.  I spend a great deal of my time helping folks within the Pacific NW implement tools around their ILS, but there really isn’t a centralized or formalized R&D process within the consortium, and for a group this large, that seems to be a shame.  There is a lot of talent tied up within the 33 member organizations; the question is how to get at it.

Well, I’ve got an idea.  While my group really cannot make a recommendation related to the current software available (we can talk about what’s available and what I believe to be the future trends), I can advise that we formalize an R&D group within the consortium.  Fortunately, Summit is hiring a digital library coordinator, and I think that this position would be perfect to lead this group.  I envision a committee that could be used to:

  1. coordinate Summit development efforts and investigate options like SOPAC, metasearching within a consortial environment, OpenURL within a consortial environment, etc.
  2. provide Summit with shared development resources — allowing member libraries to help drive development of services, while distributing the R&D between member libraries
  3. advocate for OSS and an active R&D agenda to the member libraries’ directors and the Summit executive board.

In all honesty, I think #3 is the most important.  The proprietary vendor community is very adept at dealing with the library community at a high level, and this allows them to shape the overall environment within the organization.  My hope is that creating a formal working group within the consortium, and identifying that this is indeed important, will help lead to an attitude shift within the Pacific Northwest.

Will it work?  Who knows.  I’ve floated the idea by a few folks — some on other committees, some familiar with the current makeup of the executive committee, and the overall mood isn’t optimistic.  The biggest challenge to overcome is this idea that one’s library doesn’t have any special skills to offer (or any bodies to offer).  If R&D is valued at an organization — resources and people can be found. 

Anyway, my hope is that the recommendations that come out of this study will help to move this conversation forward.  As I said, there is a lot of talent in the Pacific Northwest.  It’s time we started tapping into it as a group and seeing what can be accomplished within a consortium when everyone contributes.


MarcEdit 5 update

By reeset / On / In MarcEdit

Apparently in an update, I’d broken the MARCJoin function (the list of files to join wasn’t being set).  Sorry about that.  So, I’ve fixed this and, in fixing it, corrected a problem with the same function in the console program.

To run in the console, you’d use the following:

%cmarcedit.exe -s c:\myfile.mrc;c:\myfile2.mrc -d c:\newfile.mrc -join

In the GUI, it’s on the main window, in the top menu under Tools/MARCJoin.

The update, as always, can be found at: MarcEdit50_Setup.exe.



MarcEdit and Solr

By reeset / On / In MarcEdit

With the preconference at Code4Lib coming, folks are looking for ways to get their MARC data into a format that Solr can load.  Andrew Nagy has made an XSLT that can convert MARCXML data to a Solr format, and David Bigwood notes that MarcEdit can be used to generate those MARCXML records.  This is true, but you could also generate the Solr records directly.  Basically, you just need to register the crosswalk with MarcEdit and then you can process items directly into the Solr format from MARC.  I’d left some instructions as a comment on David’s page; however, for those who might not see it there and would find this helpful, I’ve reproduced my comment below.



Actually, you could use MarcEdit to go straight from MARC to the Solr syntax — though, you’d want to modify the posted stylesheet to include the marc: namespace. This way, the tool could process files with or without that namespace.
The way that you make it work is simply to register the crosswalk with MarcEdit. Since some folks aren’t sure how this works, I’ve recorded a quick AVI file of what that looks like. See: Adding and Using MARC=>Solr crosswalk for the AVI file showing how to register the MARC=>Solr crosswalk. BTW, the AVI file is ~29 MB.
I also modified the crosswalk that you’d linked to so that it works better in MarcEdit. Since MarcEdit uses the marc namespace by default, xslt stylesheets work best in MarcEdit if they include the namespace. This way, MarcEdit can process items with namespace and without. Here’s the stylesheet with the revisions made (BTW, this is the stylesheet I used in my example): Modified MARC21XML=>Solr XSLT
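
To show why the namespace matters, here is a small sketch in Python using the standard xml.etree.ElementTree module instead of XSLT. The sample documents are made up, and this is not how MarcEdit itself does the matching. The point is the same in any namespace-aware XML tool: a lookup for plain record does not see marc:record, so code (or a stylesheet) that should accept both has to ask for both.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML namespace URI


def count_records(xml_text):
    """Count <record> elements with or without the MARC namespace."""
    root = ET.fromstring(xml_text)
    plain = list(root.iter("record"))
    namespaced = list(root.iter("{%s}record" % MARC_NS))
    return len(plain) + len(namespaced)


# Hypothetical minimal documents: one namespaced, one not.
with_ns = ('<marc:collection xmlns:marc="%s">'
           '<marc:record/><marc:record/></marc:collection>' % MARC_NS)
without_ns = "<collection><record/></collection>"

# A lookup for the bare name misses the namespaced records entirely.
missed = len(list(ET.fromstring(with_ns).iter("record")))

print(missed, count_records(with_ns), count_records(without_ns))
```

In XSLT terms, this is the difference between a template that matches record and one that matches marc:record; a stylesheet that declares the marc prefix and matches both forms can process either kind of file.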

Can the open source community help the ILS matter?

By reeset / On / In Digital Libraries, General Computing, Travel

So, let’s start out with a preface to my comments here.  First, it’s a little on the long side.  Sorry.  I got a bit wordy and occasionally wander a little bit here and there :).  Second, these reflect my opinions and observations.  So with that out of the way…

This question comes from two recent experiences.  First, at Midwinter in Seattle, a number of OSU folks and I met with Innovative Interfaces regarding Encore (III’s “next generation” public interface in development) and the difficulty that we have accessing our data in real time without buying additional software or access to the system (via access to an API or, in III’s case, access via a special XML Server).  The second has been the current eXtensible Catalog meeting here in Rochester, where I’ve been talking to a lot of folks that are currently looking at next generation library tools.

Sitting here, listening to the XC project and other projects currently ongoing, I’m more convinced than ever that our public ILS, which was once the library community’s most visible public success (i.e., getting our library catalogs online), has become one of the library community’s biggest liabilities: an albatross holding back our community’s ability to innovate.  The ILS and how our patrons interact with the ILS shape their view of the library.  The ILS, or at least the part of the system that we show to the public (or would like to show to the public, like web services, etc.), simply has failed to keep up with library patrons’ or the library community’s needs.  The internet and the ways in which our patrons interact with the internet have moved forward, while libraries have not.  Our patrons have become a savvy bunch.  They work with social systems to create communities of interest, oftentimes without even realizing it.  Users are driving the development and evolution of many services.  A perfect example of this has been Google Maps, a service that, in and of itself, isn’t too interesting in my opinion.  But what is interesting is the way in which the service has embraced user participation.  Google Maps mashups litter the virtual world, to the point that the service (Google Maps) has become a transparent part of the world that the user is creating.

So what does this have to do with libraries?  Libraries up to this point simply are not participating in the space that our users currently occupy.  Vendors, librarians: we are all trying to play catch-up in this space by brandishing phrases like “next generation”, though I doubt anyone really knows what that means.  During one of my many conversations over the weekend, something that Andrew Pace said really stuck with me.  Libraries don’t need a next generation ILS; they need a current generation system.  Once we catch up, then maybe we can start looking at ways to anticipate the needs of our community.  But until the library community creates a viable current generation system and catches up, we will continue to fall further and further behind.

So how do we catch up?  Is it with our vendors?  Certainly, I think that there is a path in which this could happen.  It would take a tremendous shift in the current business models utilized by today’s ILS systems, but it is a shift that needs to occur.  Too many ILS systems make it very difficult for libraries to access their data outside of a few very specific points of access.  As an Innovative Interfaces library, our access points are limited based on the types of services we are willing to purchase from our vendor.  However, I don’t want to turn this specifically into a rant against the current state of ILS systems.  I’m not going to throw stones, because I live in a glass house that the library community created and has carefully cultivated to the present.  I think to a very large degree, the library community…no, I’ll qualify this, the decision makers within the library community remember the time when moving to a vendor ILS meant better times for a library.  This was before my time, but I still hear decision makers within the library community who are apprehensive of library-initiated development efforts because the community had “gone down that road” before, when many organizations spun their own ILS systems and were then forced to maintain them over the long term.  For these folks, moving away from a vendor controlled system would be analogous to going back to the dark ages.  The vendor ILS has become a security blanket for libraries.  It’s the teddy bear that lets everyone sleep at night because we know that when we wake up, our ILS system will be running, and if it’s not, there’s always someone else to call.

With that said, our ILS vendors certainly aren’t doing libraries any favors.  NSIP, SRU/W, OpenSearch, web services: these are just a few standards that libraries could easily adopt to standardize the flow of information into and out of the ILS, but that find little support in the current vendor community.  RSS, for example, a simple protocol that most ILS vendors now support in one way or another, took years to finally be implemented.

Talking to an ILS vendor, I’d used the analogy that the ILS business closely resembles the PC business of the late 80’s and early 90’s, when Microsoft made life difficult for 3rd-party developers looking to build tools that competed against them.  Three anti-trust cases later (US, EU and Korean), Microsoft is legally bound to produce specific documentation and protocols to allow 3rd-party developers the ability to compete on the same level as Microsoft themselves.  At which point, the vendor deftly noted that they have no such requirements, i.e., don’t hold your breath.  Until the ILS community is literally forced to provide standard access methods to data within their systems, I don’t foresee a scenario in which this will ever happen, at least in the next 10 years.  And why is that?  Why wouldn’t the vendor community want to enable the creation of a vibrant user community?  I’ll tell you: we are competitors now.  The upswing in open source development within libraryland has placed the library community in the position of being competitors with our ILS vendors.  DSpace, Umlaut, LibraryFind, XC: these projects directly compete against products that our ILS vendors are currently developing or have developed.  We are encroaching into their space, and the more we encroach, the more difficult I predict our current systems will become to work with.

A good example could be the open source development of not one, but two mainstream open source ILS products.  At this point in time, commercial vendors don’t have to worry about losing customers to open source projects like Koha and Evergreen, but this won’t always be the case.  And let me just say, this isn’t a knock against Evergreen or Koha.  I love both projects and am particularly infatuated with Evergreen right now, but the simple fact is that libraries have come to rely on our ILS systems (for better or worse) as acquisition systems, serial control systems, and ERM systems, and ILS vendors have little incentive to commoditize these functions.  This makes it very difficult for an organization to simply move to or interact with another system.  For one, it’s expensive.  Fortunately, the industrious folks building Evergreen will get to the point where it will be a viable option, and when it does, will the library community respond?  I hope so, but I wonder which large ACRL organization will have the courage to let go of their security blanket and make the move (maybe for the second time) to using an institutionally supported ILS.  But get that first large organization with the courage to switch, and I think you’ll find a critical mass waiting and maybe, just maybe, it will finally breathe some competitive life into what has quickly become a very stale marketplace.  Of course, that assumes that the concept of an OPAC will still be relevant, but that’s another post I guess.

Anyway, back to the meeting at Rochester.  Looking at the projects currently being described, there is an interesting characteristic of nearly all “next generation” OPAC projects.  All involve exporting the data out of their ILS.  Did you get that?  The software that we are currently spending tens or even hundreds of thousands of dollars on to do all kinds of magical things must be cut out of the equation when it comes to developing systems that interact with the public.  I think that this is the message that libraries and those making decisions about the ILS within libraries are missing.  A quick look around at folks recognized for creating current generation OPACs (the list isn’t long), like NCState, shows that they have one thing in common: the ILS has become more of an inventory management system, providing information relating to an item’s status, while the data itself is being moved outside of the ILS for indexing and display.

What worries me about current solutions being considered (like Endeca) is that they aren’t cheap and will not be available to every library.  NCState’s solution, for example, still requires NCState to have their ILS, as well as an Endeca license.  XC, an ambitious project with grand goals, may suffer from the same problem.  Even if the program is wildly successful and meets all its goals, implementers may still have a hard time selling their institutions on taking on a new project that likely won’t save the organization any money upfront.  XC partners will be required to provide money and time while still supporting their vendor systems.  What concerns me most about the current path that we are on is the potential to deepen the inequities that already exist between libraries with funding and libraries without.

But projects like XC and the preconference at Code4Lib discussing Solr and Lucene are developments that should excite and encourage the library community.  As a community, we should continue to cultivate these types of projects and experimentation.  In part, because that’s what research organizations do: seek knowledge through research.  But also, to encourage the community to take a more active role when it comes to how our systems are developed and interact with our patrons.


Chilly in Rochester

By reeset / On / In Travel

So I spent a rather nippy night last night hanging out in Rochester, NY.  I’m in town for the next few days with a number of other (maybe 30) folks to talk about (and learn about) the University of Rochester’s XC (eXtensible Catalog) project. 

Funny, the day started out oddly.  While driving to Portland to catch my flight, all the lights on the interior of the car went out (speedometer, etc).  Fortunately, my headlights didn’t go out, but for about 60 miles on the freeway, I drove by penlight so I could see how fast I was going.  Not an auspicious start, but the only hitch I experienced getting to Rochester.  Once here, I found it to be cold and snowy.  The snowy I loved.  Spent some time romping around in the snow before making my way to the Eastman mansion for a tour and dinner.  Afterwards, a group of us walked the 5 blocks back (much colder) and then stayed up much too late (6 am [or 3 am my time] comes awful early in the morning).

And it did come early.  I got up this morning (Friday) and found that it had snowed a bit last night, which was great.  I bundled up and took a 2 mile run to get the blood going (and wake up) and had a great time cutting new tracks in the snow.

I’ll post some about the meeting later tonight (or tomorrow).