ruby-oai and processing metadata elements

Since modifying the ruby-oai module to work with libxml, I’ve found extracting the data from the “metadata” element to be much easier. And the nice part: using libxml, I was able to do an initial metadata harvest from 4 collections in under 10 seconds. Here’s the code from the harvesting component.


oai_dc harvester

require 'rubygems'
require 'xml/libxml'
require 'oai'

  class OaiDc
    attr_accessor :title, :creator, :subject, :description, :publisher,
                  :relation, :date, :type, :format, :contributor,
                  :identifier, :source, :language, :coverage, :rights

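    def parse_metadata(element)
      return nil if element.nil?

      labels = self.metadata_list()

      labels.each do |item|
        x = 0
        # Find every matching Dublin Core element under oai_dc:dc.
        tmp_element = element.metadata.find("./oai_dc:dc/" + item, ["oai_dc:","dc:"])
        # Strip the prefix so the name matches the accessor, e.g. "title".
        item = item.gsub('dc:','')
        eval("@" + item + " = []")
        tmp_element.each do |i|
          s = i.content
          if s != nil
            # String#dump escapes the content so it is evaluated as a string literal.
            eval("@" + item + "[" + x.to_s + '] = ' +  s.dump )
            x += 1
          end
        end

        # Nothing usable was found: store a single nil.
        if x == 0 then eval('@' + item + '[' + x.to_s + '] = nil') end
      end
    end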

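    def metadata_list()
      # Entries after dc:creator are assumed from the attr_accessor list above.
      labels = ['dc:title','dc:creator','dc:subject','dc:description','dc:publisher',
                'dc:relation','dc:date','dc:type','dc:format','dc:contributor',
                'dc:identifier','dc:source','dc:language','dc:coverage','dc:rights']
      return labels
    end
  end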
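
The eval calls in parse_metadata only exist to build an instance variable name from a string, and Ruby's instance_variable_set can do that without evaluating any harvested text. Here is a rough sketch of that idea, not the code I actually ran. DcRecord, FIELDS and the lookup argument are names made up for the illustration; lookup stands in for the XPath find call and is assumed to return an array of strings for a label like "dc:title".

```ruby
# Hypothetical sketch: DcRecord, FIELDS and `lookup` are illustrative names,
# not part of ruby-oai or libxml.
class DcRecord
  FIELDS = %w[title creator subject description publisher
              relation date type format contributor
              identifier source language coverage rights].freeze

  attr_accessor(*FIELDS.map(&:to_sym))

  # `lookup` is any callable that takes a label such as "dc:title" and
  # returns an array of strings (a stand-in for the XPath find call).
  def parse_metadata(lookup)
    return nil if lookup.nil?

    FIELDS.each do |field|
      # Drop nil content, as the original loop does.
      values = lookup.call("dc:#{field}").compact
      # Mirror the original behaviour: no usable match is stored as [nil].
      values = [nil] if values.empty?
      # Set @title, @creator, ... by name; nothing is evaluated as code.
      instance_variable_set("@#{field}", values)
    end
    self
  end
end

# Usage with canned data in place of a harvested record.
sample = { "dc:title" => ["A harvested title"], "dc:creator" => ["Doe, J.", nil] }
record = DcRecord.new
record.parse_metadata(->(label) { sample.fetch(label, []) })
```

With real records the lambda would wrap whatever call returns the element contents for a label; the rest of the loop would stay the same.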

  1. How long did the rexml version take to do the same thing? It’s a bit hard to read your unindented code. I’d be interested to know what about libxml made the xml any easier to process. I’m fundamentally scared of evaling code that comes over the wire, so those evals are giving me the willies. Enough for now 🙂

  2. I hadn’t tried it, though I just did. The REXML code took ~15 minutes and change to run for 3 collections; one collection (~13 MB of data) seems to hang the component. In terms of XML processing, there’s nothing about libxml that makes it easier to access in ruby; it’s just a better library. As I’d mentioned earlier, the REXML code has a number of great convenience functions that actually make accessing elements easier. The problem is that this comes at a high cost: speed and loading of large files are problems that I can’t seem to overcome using REXML.


