Since modifying the ruby-oai module to work with libxml, I’ve found extracting the data from the “metadata” element to be much easier. The nice part: using libxml, I was able to do an initial metadata harvest from 4 collections in under 10 seconds. Here’s the code from the harvesting component.
–TR
oai_dc harvester
require 'rubygems'
require 'xml/libxml'
require 'oai'

# Holds the fifteen Dublin Core elements of one harvested oai_dc record.
class OaiDc
  attr_accessor :title, :creator, :subject, :description, :publisher,
                :relation, :date, :type, :format, :contributor,
                :identifier, :source, :language, :coverage, :rights

  # Namespace prefixes used by the XPath queries in parse_metadata.
  NAMESPACES = ["oai_dc:http://www.openarchives.org/OAI/2.0/oai_dc/",
                "dc:http://purl.org/dc/elements/1.1/"]

  # Fills each accessor with an array of values found in the record's
  # metadata; an element that is absent from the record yields [nil].
  def parse_metadata(element)
    return nil if element.nil?
    metadata_list.each do |item|
      nodes = element.metadata.find("./oai_dc:dc/" + item, NAMESPACES)
      values = []
      nodes.each do |node|
        content = node.content
        values << content unless content.nil?
      end
      values = [nil] if values.empty?
      # instance_variable_set replaces the eval calls, so harvested text
      # is stored as data and never evaluated as code.
      instance_variable_set("@" + item.gsub('dc:', ''), values)
    end
  end

  # XPath labels for the Dublin Core elements to extract.
  def metadata_list
    ['dc:title', 'dc:creator',
     'dc:subject', 'dc:description',
     'dc:publisher', 'dc:relation',
     'dc:date', 'dc:type', 'dc:format',
     'dc:contributor', 'dc:identifier',
     'dc:source', 'dc:language',
     'dc:coverage', 'dc:rights']
  end
end
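The step that assigns each element's values to an attribute can be tried on its own, with no libxml or OAI server involved. What follows is only a minimal sketch under stated assumptions: DcRecord and fill are hypothetical names that are not part of ruby-oai, and a plain hash stands in for the nodes an XPath query would return. It shows one way to set the attributes dynamically without eval.

```ruby
# Hypothetical stand-in for OaiDc; needs neither libxml nor a network.
class DcRecord
  FIELDS = %w[title creator subject description publisher relation date
              type format contributor identifier source language
              coverage rights].freeze

  attr_accessor(*FIELDS.map(&:to_sym))

  # values_by_label maps labels such as 'dc:title' to arrays of strings.
  # This hash is an assumption standing in for parsed XML node content.
  def fill(values_by_label)
    FIELDS.each do |field|
      values = Array(values_by_label["dc:#{field}"]).compact
      # Mirror the harvester: a missing element becomes [nil].
      values = [nil] if values.empty?
      # The variable name comes from the fixed FIELDS list, never from
      # harvested text, so nothing from the wire is evaluated as code.
      instance_variable_set("@#{field}", values)
    end
    self
  end
end

record = DcRecord.new.fill(
  'dc:title'   => ['A "quoted" title; system("echo oops")'],
  'dc:creator' => ['Smith, J.', 'Doe, A.']
)
puts record.title.first
puts record.creator.length
puts record.rights.inspect
```

The title deliberately contains quotes and something that looks like a method call; because it is only ever stored as a string, it comes back unchanged.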
Comments
2 responses to “ruby-oai and processing metadata elements”
How long did the REXML version take to do the same thing? It’s a bit hard to read your unindented code. I’d be interested to know what about libxml made the XML any easier to process. I’m fundamentally scared of evaling code that comes over the wire, so those evals are giving me the willies. Enough for now 🙂
I hadn’t tried it, though I just did. The REXML code took a little over 15 minutes to run for 3 collections; one collection (~13 MB of data) seems to hang the component. In terms of XML processing, there’s nothing about libxml that makes the XML easier to access in Ruby; it’s just a better library. As I’d mentioned earlier, REXML has a number of great convenience functions that actually make accessing elements easier. The problem is that this comes at a high cost: speed and loading of large files are problems that I can’t seem to overcome using REXML.
–TR