I’ve been having a great time playing with ruby, but one of the things I constantly run up against is the “ruby way”. The place I notice this most is when dealing with XML. In working with XML in ruby, I’ve noticed that the ruby crowd doesn’t quite seem to know what to do with it. In most of the documentation I’ve read, ruby folks find little value in XML and prefer to work in YAML. In my world, however, XML is king, and working around it really isn’t an option. So what are the ruby options for XML? There are a few, but by default ruby ships with a component known as REXML. In general, I’ve found this to be a nifty little library with lots of convenience functions. However, it comes at a very high cost. First, I’ve found that there’s a limit to the size of file that can be loaded (~5 MB, after which the component starts breaking). And it’s slow. Oh, how slow it is. But how slow?
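To make the slowness concrete, here’s a minimal, self-contained timing sketch using only the standard library. The 1,000-record document and element names are invented for illustration; they aren’t from any real OAI feed:

```ruby
require 'rexml/document'
require 'benchmark'

# Build a synthetic XML document with 1,000 <record> elements.
xml = "<records>" +
      (1..1000).map { |i| "<record><id>#{i}</id></record>" }.join +
      "</records>"

ids = nil
time = Benchmark.realtime do
  doc = REXML::Document.new(xml)
  # Pull each identifier out with XPath, the same kind of work a harvester does.
  ids = REXML::XPath.match(doc, "//id").map { |e| e.text }
end

puts "Parsed #{ids.size} ids in #{time} seconds"
```

Scale that document up toward a few megabytes and the parse time grows painfully; a stream- or C-backed parser like libxml handles the same input in a fraction of the time.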
Well, I’ve been spending a lot of time reworking the API for LibraryFind, our soon-to-be open source hybrid federated search system. The big project, of course, has been moving the code into ruby and out of PHP so we can place the project within a web framework (in our case, Rails). Part of the project deals with harvesting data for local indexing, and in many cases the data being harvested is in OAI. So I’ve been playing with Ed Summers’ ruby-oai (very cool), which is built around REXML. After installing the module and testing the harvesting of a small collection, I was pretty dismayed with the speed of the application. I suspected that REXML was causing the slowdown.
So after chatting with Ed a little bit, he said that he’d be open to hacking up a version that supported libxml, provided the changes didn’t:
- Turn the code into an unholy mess
- Require custom XPath statements (and they almost do)
Given that I was going to make these changes anyway for our own local instance, I thought this sounded like a good idea (and I’d never had a chance to work with Ed, an opportunity I couldn’t pass up either 🙂 )
So I spent some time today modifying the ruby-oai module and finished integrating libxml into it (I guess I’ll wait now to see if I violated number 1 though 🙂 ). After running a small benchmarking application, all is well with the world again.
So how did the benchmarking go? Funny you should ask. 🙂 I made an OAI request using the REXML codebase which returned 394 records parsing just the identifier from the header. Total time:
Time to run: 21.685583
Records returned: 394
Same codebase, but just changing the parser. What’s the difference? Let’s see:
Time to run: 0.75901
Records returned: 394
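For scale, here’s a quick back-of-the-envelope on those two timings, assuming (roughly, and generously) that every harvest behaves like this one request:

```ruby
# Rough arithmetic on the benchmark numbers above (times in seconds).
rexml_time  = 21.685583
libxml_time = 0.75901

speedup        = rexml_time / libxml_time    # roughly a 28x speedup
saved_per_site = rexml_time - libxml_time    # ~21 seconds saved per harvest

# Across ~300 OAI sites, the savings add up to well over an hour.
total_saved_minutes = saved_per_site * 300 / 60.0

puts "Speedup: #{speedup.round(1)}x"
puts "Saved across 300 harvests: ~#{total_saved_minutes.round} minutes"
```

Real harvests vary in size, so treat this as an order-of-magnitude estimate, not a promise.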
This kind of difference really brings a smile to my face. And considering how many OAI sites I have to harvest from (~300), these extra seconds really start to add up. Test code below…
```ruby
require 'oai'
require 'date'

buffer = ""
start_time = Time.now()

client = OAI::Client.new 'http://digitalcollections.library.oregonstate.edu/cgi-bin/oai.exe',
  :parser => 'libxml'

last_check = Date.new(2006, 1, 1)
records = client.list_records :set => 'archives',
  :metadata_prefix => 'oai_dc', :from => last_check

x = 0
records.each do |record|
  #fields = record.serialize_metadata(record.metadata, "oai_dc", "Oai_Dc")
  #puts "Primary Title: " + fields.title + "\n"
  buffer << record.header.identifier + "\n"
  x += 1
end

end_time = Time.now()

puts buffer
puts "Time to run: " + (end_time - start_time).to_s + "\n"
puts "Records returned: " + x.to_s
```