Sep 29, 2006
 

At OSU, we’ve played with this on and off and finally decided to just take this live.  For those that use CONTENTdm, I’ve created a small document that discusses how this works and what it looks like.  As I said, simple implementation at this point, but if use takes off, I’ll look to add things like tag clouds, integrated search results, etc.  This won’t be interesting to anyone but folks using CONTENTdm.  Sorry.

Here’s a link to the document: CONTENTdm_Tagging.doc

 

–TR

 Posted by at 11:16 am
Sep 27, 2006
 

As I’ve started to get more involved in the Dspace work here at OSU, one thing that has struck me is that the interface really could use a once-over.  Nowhere is this more obvious than the community-list page.  While I doubt that this page was ever meant to be the default access mechanism into an institution’s Dspace collection, I’d bet that most often it is (aside from accesses that come from outside Dspace, like Google).  Like most institutions, Oregon State University is adding lots of new communities and collections to Dspace, and as the number of collections grows, so does the list.  At some point, this list simply becomes too unwieldy for folks to actually work with, so our IR group has started looking at ways to make it easier for users. 

So our IR group has taken a first crack at this.  Basically, the idea is to simply collapse the list using a little dhtml and then allow users to toggle the collections that they want to see. 

So with this interface here, we’ve obviously shortened the list, but introduced a whole new problem, which I’m not sure will be a step forward or back: you can no longer see the collections.  Now, a user has to know what community a collection is a part of in order to find it.  To make this a little easier, I added a toggle-all button, but that just gets you back to the big list interface.

I’ve toyed around with another option.  One thing that I thought might be interesting is adding links to the collections that are used the most.  At first, I thought that this was the type of information that dspace probably would be logging in the database — but after looking over the tables, I couldn’t find it.  So I started looking at the source and found a set of statistic classes that generate the dspace-general log files.  These log files are what dspace uses to generate the statistics screen in the administrative interface.  So with that, we are in business. 

If you look at the dspace-log-general log files, you see a couple of things. 

  1. A new one exists for each day (so long as you have logging enabled)
  2. They are very parseable. 

Using properties set in the config file, dspace analyzes log entries using a threshold to determine if the item should be placed into the list of most viewed items.  At OSU, we continue to use the default floor of 20 views to make the list, but you can set it to just about anything.  Once an item reaches that floor, an entry like the following will be placed into the dspace-log-general log file (from our dev server):

  • item.123456789/28=31

Here, you can see that you are getting a stem that can be matched on: item., the handle for the item being accessed: 123456789/28 and the number of times that the item has been accessed: 31.  So one of my thoughts for exposing often used collections would be to take this list and determine how often collections are being utilized.
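To make the three parts of that entry concrete, here is a minimal sketch in Ruby (not the Java used in the classes further down) that splits one log line into its handle and view count. The helper name `parse_item_line` and the regular expression are my own illustration, not anything from the Dspace source.

```ruby
# Hypothetical helper: split one "item.<handle>=<count>" line from a
# dspace-log-general file into its handle and its view count.
ITEM_LINE = /\Aitem\.(?<handle>[^=]+)=(?<count>\d+)\s*\z/

def parse_item_line(line)
  match = ITEM_LINE.match(line)
  return nil unless match            # not an item entry, or malformed
  [match[:handle], match[:count].to_i]
end

handle, count = parse_item_line("item.123456789/28=31")
puts "#{handle} was viewed #{count} times"
```

Lines that do not start with the `item.` stem simply come back as `nil`, so a caller can skip them.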

Technically, the process isn’t really that difficult, though I’ll admit that I stumbled a little bit when working with the HashMaps since I’ve gotten lazy working with scripting languages.  Ideally, what you’d want to do is resolve each item’s collection and keep track of the number of times items have been accessed from the collection.  This way, you could then do a descending sort on the count to get the collections with the greatest number of accesses, i.e., probably your most used collections.  The initial problem that I ran into was that I wanted to use Java’s HashMap like a PHP associative array, which allows you to sort on either key or value while maintaining keys.  There are a number of ways in Java to accomplish the same thing, but most seemed to require the use of Lists or LinkedHashMaps, and I didn’t want to bother with either.  So instead, I just created my own storage class and implemented the Comparable interface.  Much easier and cleaner, I think. 
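For comparison, this is roughly what the "associative array" approach looks like in a scripting language. It is only a sketch in Ruby: the item-to-collection table is made-up sample data standing in for the handle lookup that Dspace performs, and `popular_collections` is a hypothetical name.

```ruby
# Made-up sample data standing in for resolving an item handle to its
# owning collection; in Dspace this goes through the handle system.
ITEM_TO_COLLECTION = {
  "123456789/28" => "123456789/5",
  "123456789/31" => "123456789/5",
  "123456789/40" => "123456789/9"
}

# Sum item view counts per collection, then sort by total, highest first.
def popular_collections(item_views, item_to_collection)
  totals = Hash.new(0)
  item_views.each do |item_handle, views|
    collection = item_to_collection[item_handle]
    next if collection.nil?          # skip items that cannot be resolved
    totals[collection] += views
  end
  totals.sort_by { |_handle, views| -views }
end

item_views = { "123456789/28" => 31, "123456789/31" => 22, "123456789/40" => 45 }
popular_collections(item_views, ITEM_TO_COLLECTION).each do |handle, views|
  puts "#{handle}: #{views}"
end
```

The result is an array of handle and count pairs, which plays the same role as the sorted array of storage objects in the Java version.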

So, code.  I’ll have to note that I haven’t run this on our production environment yet, and because of the limited logging that we do on our dspace dev instance, the code hasn’t yet been run against a large set of items to be resolved, so there may need to be revisions.  But for anyone wanting to look at the code, here you go. 

The changes made are basically broken into three parts:

  1. New class files (there are 2)
  2. Changes to the JSPs (one change to the community-list.jsp file)
  3. Additions to the dspace.cfg (basically, I didn’t want to hard code the log matching elements)

New Classes:

OSUCollection: This is the storage container that holds the collection name, the collection handle and the number of times the items logged in the collection have been accessed.  It implements Comparable on the count, so an array of these objects can be sorted with the most accessed collections first.

 


package edu.oregonstate.library.util.objects;

import java.lang.*;
import java.util.*;

public class OSUCollection implements Comparable {
    private String name;
    private String handle;
    private int count = 0;


    public void setName(String s) {
        name = s;
    }

    public String getName() {
        return name;
    }

    public void setHandle(String s) {
        handle = s;
    }

    public String getHandle() {
        return handle;
    }

    public void setCount(int i) {
        count = i;
    }

    public int getCount() {
        return count;
    }

    public int compareTo(Object o) {
        return count - ((OSUCollection)o).count;
    }
}


GetPopularCollections – This is the actual logic part of the class that takes the stats log file, parses it, retrieves the collection specific metadata and returns an array of OSUCollection objects.

 


package edu.oregonstate.library.util;

import java.util.Arrays;
import java.lang.Integer;
import java.sql.SQLException;
import java.io.*;
import java.net.*;
import java.util.Set;
import java.util.Iterator;
import java.util.HashMap;
import java.util.Collections;
import java.util.Date;
import java.text.SimpleDateFormat;
import java.text.Format;


import org.dspace.content.Collection;
import org.dspace.content.DCValue;
import org.dspace.content.Item;
import org.dspace.core.ConfigurationManager;
import org.dspace.core.Context;
import org.dspace.handle.HandleManager;
import edu.oregonstate.library.util.objects.OSUCollection;

public class GetPopularCollections {
   public OSUCollection[] GetCollections()  throws Exception, SQLException {
    Context context = new Context();
    context.setIgnoreAuthorization(true);
    String record = null;
    HashMap tmpMap = new HashMap();
    HashMap itemMap = new HashMap();

    try {
       FileReader fr = null;
       BufferedReader br = null;

       try {
           String file = getLogFile(ConfigurationManager.getProperty("log.dir"),
                                    ConfigurationManager.getProperty("log.stem"),
                                    ConfigurationManager.getProperty("log.extension"));

           //String file = ConfigurationManager.getProperty("log.dir") + "/dspace-log-general-2005-9-20.dat";
           fr = new FileReader(file);
           br = new BufferedReader(fr);
       }
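       catch (IOException e) {
          // Calling System.exit() inside a web application would stop the
          // whole servlet container, so report the problem and return an empty list.
          e.printStackTrace();
          System.out.println("Failed to read input file");
          return new OSUCollection[0];
       }
       while ((record = br.readLine()) != null) {
         // startsWith avoids the exception substring(0,5) throws on short lines
         if (record.startsWith("item.")) {
            record = record.substring(5);
            int eq = record.indexOf("=");
            if (eq > 0) {
               // key: item handle, value: view count (still a String here)
               tmpMap.put(record.substring(0, eq), record.substring(eq + 1).trim());
            }
         }
       }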

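       //Setup the OSUCollections Object (one entry per distinct collection)
       OSUCollection[] c = new OSUCollection[tmpMap.size()];
       Set set = tmpMap.keySet();
       Iterator it = set.iterator();
       int index = 0;
       while (it.hasNext()) {
           /*getCollectionInfo returns:
            * element[0]: collection name
            * element[1]: collection handle
            */
           String element = (String)it.next();
           String[] tmp = getCollectionInfo(context, element);
           if (tmp == null) {
              continue; // the handle did not resolve to an item, so skip it
           }
           int views = Integer.parseInt((String)tmpMap.get(element));
           if (!itemMap.containsKey(tmp[1])) {
              // itemMap: collection handle -> position of that collection in c
              itemMap.put(tmp[1], new Integer(index));
              c[index] = new OSUCollection();
              c[index].setName(tmp[0]);
              c[index].setHandle(tmp[1]);
              c[index].setCount(views);
              index++;
           } else {
              // look up by collection handle; the stored value is an Integer
              int tint = ((Integer)itemMap.get(tmp[1])).intValue();
              c[tint].setCount(c[tint].getCount() + views);
           }
       }

       // Only the first `index` slots are filled, so trim before sorting;
       // sorting an array with null entries would throw in compareTo.
       OSUCollection[] result = new OSUCollection[index];
       System.arraycopy(c, 0, result, 0, index);
       Arrays.sort(result, Collections.reverseOrder());

       br.close();
       fr.close();
       return result;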
    }catch(Exception e) {
        System.out.println(e.toString());
        return null;
    }
  }


  private String[] getCollectionInfo(Context context, String handle) throws Exception, SQLException {
     Item item = null;
     String[] vals = new String[2];

     // ensure that the handle exists
     try
     {
       item = (Item) HandleManager.resolveToObject(context, handle);
     }
     catch (Exception e)
     {
        return null;
     }

     // if no handle that matches is found then also return null
     if (item == null)
     {
        return null;
     }

     // build the referece
     // FIXME: here we have blurred the line between content and presentation
     // and it should probably be un-blurred
     Collection myCollection = null;
     myCollection = item.getOwningCollection();
     vals[0] = myCollection.getMetadata("name");
     vals[1] = myCollection.getHandle();
     return vals;
  }

  private static String getLogFile(String lfile, String lstem, String extension) {
      //We use the simpledateformat to check for the presence of the default
      //file, which dspace creates in the stats-general file.
      //If this file isn't present, then we use the stem and extension to return our file
      Format sdf = new SimpleDateFormat("yyyy-MM-dd");
      Date myDate = new Date();
      String default_file = lfile + File.separator + lstem + "-" + sdf.format(myDate) + "." + extension;
      File tmp = new File(default_file);
      if (tmp.exists()) {
         return default_file;
      }

      File dir = new File(lfile);
      String[] children = dir.list();
      String logfile = "";
      long lastmod = 0;
      if (children == null) {
          return null;
      } else {
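          // Assumption: candidate files are matched on the configured stem and
          // extension, and the most recently modified match wins.
          for (int i = 0; i < children.length; i++) {
            if (children[i].startsWith(lstem)) {
              if (children[i].endsWith(extension)) {
                tmp = new File(lfile + File.separator + children[i]);
                if (tmp.lastModified() > lastmod) {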
                  lastmod = tmp.lastModified();
                  logfile = lfile + File.separator + children[i];
               }
              }
            }
          }
          return logfile;
      }
  }

}


JSP Changes:

Community-list.jsp

Around line 189 you should add the following:

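
<%@ page import="edu.oregonstate.library.util.GetPopularCollections" %>
<%@ page import="edu.oregonstate.library.util.objects.OSUCollection" %>

<%
   GetPopularCollections objCol = new GetPopularCollections();
   OSUCollection[] col  = objCol.GetCollections();
   int x = 0;
   // GetCollections() returns null if the log file could not be processed
   if (col != null && col.length > 0) {
%>

Most Viewed Collections

<%-- Assumes links are built from the context path, as elsewhere in the Dspace JSPs --%>
<ul>
<% for (int i = 0; i < col.length; i++) { %>
  <li><a href="<%= request.getContextPath() %>/handle/<%=col[i].getHandle()%>"><%=col[i].getName()%></a> [<%=col[i].getCount()%>]</li>
<% x++; if (x > 5) { break; } } %>
</ul>
<% } %>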

    Config additions

    dspace.cfg — as noted, I basically add these entries so I don’t have to hardcode the values into the GetPopularCollections class. These are added around line 166, near log.dir

     

    
    
    log.stem = dspace-log-general
    log.extension  = dat
    
    
    

    And that’s it.  If you want to access the two class files directly, they can be found at:

    The end result of all this, is a display that looks like the following:

    Again, I have no idea if this will be useful for our users in general, but I can see some future applicability in terms of evaluating collection usage in helping to determine what materials to collect and archive.  But I guess we’ll see.

     

    –TR

     Posted by at 12:11 pm
    Sep 21, 2006
     

    It’s funny how things work out.  As some folks know, we are currently in the process at OSU of rewriting our hybrid metasearch tool from PHP to ruby, and now that much of the heavy lifting has been accomplished (a simple caching engine, wsdl api, oai harvesting, ferret integration, etc.), it’s time to start porting some of the niceties that we added to the PHP instance of LibraryFind to gauge our users’ reactions.  One of the features that remained to be ported was a spell checker. 

    In the PHP instance of LibraryFind, the utility made use of a couple of built-in PHP components to provide a spell checker that wasn’t simply a dictionary lookup.  With Aspell as the dictionary behind the spell checker, the tool used metaphonics (the analysis of sounds) to determine which entry returned by the dictionary was closest to the actual typed text.  This of course meant that the text entered into the application had to be misspelled phonetically if the tool was to find the best match possible.

    Well, ruby doesn’t have many of these tools available to it.  While I could have coupled it with Aspell, I couldn’t find a good metaphonics engine to attach to ruby, which I think is important since straight dictionary matching is only so useful.  So I started looking for alternatives…

    It’s actually interesting how many alternatives one can find.  Google actually has two.  The first is its publicly available Search API, which includes an option to return spell-checked items.  This looked promising, but the 1000 search limit was disappointing.  However, Google provides a secondary service, for the most part undocumented, that is currently used by its Firefox plugin.  Apparently, this undocumented API has been well known for some time, going back to 2005 when Google first released its beta Firefox toolbar.  What’s interesting as well (and I took a look at the source to see it for myself) is that within the toolbar, Google has a number of code comments that indicate whether a specific piece of functionality is intended to be used by outside organizations.  A good example of this is the pagerank calls in the toolbar.  Comments surround the code that make it clear that this isn’t a publicly available function.  However, this isn’t the case with the toolbar spell checker API, and after looking through the EULA and doing some web sleuthing, I couldn’t find anyone talking about restrictions on use.  Quite the contrary, there are a number of people building interfaces to this api.  You can find them here and here and here and here and here.  In fact, I could find examples working with this api in PHP, Python, C# and Perl.  So I figured, why not ruby?  So, last night I wrote a small ruby class to parse this data for LibraryFind.

    I know, I know: what does this have to do with Dspace?  Well, after coding it into ruby I figured, heck, why not Java?  So I spent today putting together a new java class that implements the Google Spell checking API within Dspace.  Unfortunately, this isn’t the simplest of hacks (since it requires touching a number of the jsp files and adding a class file), but take a look at the results:

    Currently, this exists on our development server, but will likely make its way into our production environment next week.  ***Side note:  it makes sense that today would be the day that I’d choose to make these changes.  We are currently freezing Dspace development so we can finish porting our current hacks into the 1.4 source so we can bring that into production.  The process was just about finished when I decided that we had time to sneak one more change into the migration. :)  You can see what Jeremy puts up with.  Too much copious free time at home, I guess. :)

    So how can you too add this functionality to your Dspace instance?  Good question.  Well, first you need this java class.  I won’t guarantee that you won’t find any problems with it (and if you do, give me a holler) since I just finished it today and only did a little bit of light testing, but I haven’t had trouble with it so far, so I feel pretty good about it.  You’ll likely need to change where the class is packaged.  We tend to package OSU specific classes within their own namespace.  Second, you need to make changes to the following files:

    1. search/results.jsp
    2. utils.js

    First, results.jsp: 

    Around line: 75 — add the reference to the class.

    
    <%@ page import="edu.oregonstate.library.util.GoogleSpell" %>
    
    

    Around line 104 — you need to add an id to the simple-search form tag.

    
    

    Around line 161 — you need to add an id to the query text box

     
      " />
    
    

    Around line 170, add the following snippet. You will notice that there is a snippet of inline javascript code. The reason it’s there is that if Google offers no corrections for any word, I just don’t want to show them. So, by default, the results are hidden and the inline javascript actually displays the results if the can_see variable is set to true.

     
    <%
    
    if (query!=null) {
    %>
       
    <%
       boolean can_see = false;
       GoogleSpell objURL = new GoogleSpell();
       String[] words = query.split(" ");
       String[] t = objURL.GetWords(query);
       if (t!=null) {
    %>
     
       
    <%
      }
    
    if (can_see==true) {
    %>
      
    <%
    }
    }
    %>
    
    
    

     

    And next, the utils.js file:

    In the utils.js file we are just adding two new functions. One is a convenience function and the other is the BuildSearch function. I put these at the end of the utils.js file, but it really doesn’t matter where they go:

    
    function BuildSearch(f,q, cf, n) {
      var s = "";
      for (i =0; i
    

    And that's pretty much it.  Recompile the source and next time you do a search, if Google returns a suggestion, the tool will present it in the context of a Did you mean question.  Since misspellings sometimes occur within a phrase, or can have multiple suggestions, I've built the interface so that multiple selections show up in a listbox.  If a word is spelled correctly -- it is frozen, so only the misspelled words can be selected from.  For example:

    In this example, degree was spelled correctly, but forestry was not.  Since Google returns suggestions for the misspelled word, those options are placed into a listbox, while the other is frozen since it's spelled correctly and no other suggestions were offered.
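The frozen-versus-listbox decision can be sketched independently of the JSP. The Ruby below is only an illustration: the `did_you_mean` name, the shape of the suggestions hash, and the sample misspelling are all assumptions of mine, not the data the Google service actually returns.

```ruby
# Build one entry per query word. Words with no suggestions are frozen;
# words with suggestions carry the alternatives a listbox would offer.
def did_you_mean(query, suggestions)
  query.split(" ").map do |word|
    options = suggestions.fetch(word, [])
    if options.empty?
      { word: word, frozen: true, choices: [word] }
    else
      { word: word, frozen: false, choices: options }
    end
  end
end

# Made-up sample input: "forstry" stands in for a misspelled "forestry".
sample_suggestions = { "forstry" => ["forestry", "forester"] }
did_you_mean("forstry degree", sample_suggestions).each do |entry|
  label = entry[:frozen] ? "frozen" : "listbox"
  puts "#{entry[:word]} (#{label}): #{entry[:choices].join(', ')}"
end
```

Keeping the decision in one small function like this makes it easy to hide the whole "Did you mean" row when every entry comes back frozen.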

    So that's it in a nutshell.  Hopefully someone else will find these snippets useful.

     

    --TR

     Posted by at 10:09 pm
    Sep 20, 2006
     

    Ed posted revisions to the ruby-oai package.  It includes the ability to utilize libxml as the xml parser. 

     

    For those interested, this can now be coupled with the following class to parse a normal unqualified Dublin Core record into an object version of the record. BTW, I realize that this code makes some use of the eval function, probably one that you want to generally avoid, but it is a very powerful function when one needs to execute dynamic code.

    
    require 'xml/libxml'
    require 'oai'
    
      class OaiDc
        attr_accessor :title, :creator, :subject, :description, :publisher,
                      :relation, :date, :type, :format, :contributor,
                      :identifier, :source, :language, :coverage, :rights
    
        def parse_metadata(element)
    
          labels = self.metadata_list()
    
          return nil if element.nil?
          labels.each do |item|
            x = 0
            tmp_element = element.metadata.find("./oai_dc:dc/" + item, ["oai_dc:http://www.openarchives.org/OAI/2.0/oai_dc/","dc:http://purl.org/dc/elements/1.1/"]) 
            item = item.gsub('dc:','')
            eval("@" + item + " = []")
            tmp_element.each do |i|
              s = i.content
              if s != nil
                eval("@" + item + "[" + x.to_s + '] = ' +  s.dump )
                x += 1
              end
            end
    
            eval('@' + item + '[' + x.to_s + '] = nil') if x == 0
          end
        end
    
        def metadata_list()
          labels = ['dc:title','dc:creator',
                    'dc:subject','dc:description',
                    'dc:publisher','dc:relation',
                    'dc:date','dc:type','dc:format',
                    'dc:contributor','dc:identifier',
                    'dc:source','dc:language',
                    'dc:coverage','dc:rights']
        end
      end
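If the eval calls in the class above are a concern, Ruby's built-in `instance_variable_set` does the same job without building code strings. The sketch below is a trimmed-down illustration with only three fields; the `DcRecord` class and its `assign` method are hypothetical names, and it takes a plain hash instead of a libxml element so that it runs on its own.

```ruby
# Sketch: a trimmed-down record class with three Dublin Core fields.
class DcRecord
  FIELDS = %w[title creator subject].freeze
  attr_accessor(*FIELDS)

  # values_by_field maps a field name to an array of strings,
  # e.g. { "title" => ["A sample title"] }.
  def assign(values_by_field)
    FIELDS.each do |field|
      values = values_by_field.fetch(field, [])
      values = [nil] if values.empty?   # same result as the eval version: [nil] when nothing was found
      instance_variable_set("@#{field}", values)
    end
    self
  end
end

record = DcRecord.new.assign("title" => ["A sample title"], "creator" => [])
p record.title
p record.creator
```

Because the values are passed as objects rather than spliced into a string, there is no need for the `dump` escaping step either.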
    
    
     Posted by at 9:59 pm
    Sep 11, 2006
     

    A milestone for Kenny (seems we are having a few at the house lately) — it was his first day at kindergarten.  He’s been so, so, so excited about today.  Last week we went in and met his teacher and his teacher got to see first hand how much hard work Kenny has been doing with Alyce.  His teacher, Ms. Young, had him count as high as he could (and stopped him at 50), tested him with his colors, shapes, letters, sounds, writing his name and some simple words.  He did great.  To be honest, I didn’t even realize how much work he and Alyce had done together.  I was really impressed. 

    Anyway, today, we took him to school (and we will be picking him back up in about an hour) and he was bouncing off the walls.  I really hope he has a good time today.  He loves to learn, to read, to be with other kids — I just hope that he stays excited for school.

    So pictures — I do have pictures.  Here’s a few we took before we went to school (he wanted a picture of his backpack as well):

    And then we took some pictures at the school:

    He’s getting so grown up.  I have a hard time putting into words how proud I was of him today.  So grown up. 

    –TR

     Posted by at 1:57 pm

    Daily commute

    Sep 11, 2006
     

    I’ve been playing with the Microsoft Live Writer and have been itching to include a map into one of my posts.  So here’s a good one.  Here’s a map of my daily commute — 50 miles, round-trip down hwy. 99.  It’s actually a fantastic ride — with mostly courteous drivers (till you hit Corvallis anyway) — and fairly scenic.  I figure, I save myself ~1000 commuting miles per month between the months of April – October, and somewhere in the neighborhood of 600 miles per month during the winter months (when I commute with friends a couple of days a week to avoid the cold). 

     

    –TR

     Posted by at 2:02 am
    Sep 11, 2006
     

    Since modifying the ruby-oai module to work with libxml — I’ve found extracting the data from “metadata” to be much easier.  And the nice part — using libxml, I was able to do an initial metadata harvest from 4 collections in under 10 seconds.  Here’s the code from the harvesting component — see below.

    –TR

    oai_dc harvester

    
    require 'rubygems'
    require 'xml/libxml'
    require 'oai'
    
      class OaiDc
        attr_accessor :title, :creator, :subject, :description, :publisher,
                      :relation, :date, :type, :format, :contributor,
                      :identifier, :source, :language, :coverage, :rights
    
        def parse_metadata(element)
    
          labels = self.metadata_list()
    
          return nil if element.nil?
          labels.each do |item|
            x = 0
            tmp_element = element.metadata.find("./oai_dc:dc/" + item, ["oai_dc:http://www.openarchives.org/OAI/2.0/oai_dc/","dc:http://purl.org/dc/elements/1.1/"])
            item = item.gsub('dc:','')
            eval("@" + item + " = []")
            tmp_element.each do |i|
              s = i.content
              if s != nil
                eval("@" + item + "[" + x.to_s + '] = ' +  s.dump )
                x += 1
              end
            end
    
            eval('@' + item + '[' + x.to_s + '] = nil') if x == 0
          end
        end
    
        def metadata_list()
          labels = ['dc:title','dc:creator',
                    'dc:subject','dc:description',
                    'dc:publisher','dc:relation',
                    'dc:date','dc:type','dc:format',
                    'dc:contributor','dc:identifier',
                    'dc:source','dc:language',
                    'dc:coverage','dc:rights']
        end
      end
    
    
    
     Posted by at 1:48 am
    Sep 09, 2006
     

    Two years ago, my wife and I were blessed with our second son, Nathan Wallace.  Most folks probably don’t realize it, but Nathan’s name has a special meaning.  When Kenny was born, we chose a name that would give him his own personal identity.  Most folks in our family had thought that Kenny would take my name (I’m a junior), but I really wanted to give Kenny his own name and identity.  And while there is a story behind his name, I’ll save that for another time.  Anyway, that’s how we came up with Kenneth Terry. 

    Nathan however was an answer to a prayer and a culmination of a very long pregnancy for my wife.  So when we were trying to come up with names for our second — there was one name that really stuck out — Nathan.  The name Nathan is of Hebrew origin and translates as God has given.  Considering some past events — this name really fit — he is a gift that was given and we knew it.  Like Kenny, Nathan’s middle name is significant as well.  Wallace happens to be Nathan’s great-grandfather’s (on my dad’s side) first name and this was something that I wanted to do for my grandfather.

    So in our house, Sept. 11th is Nathan’s day.  He’s a little boy that’s 2, going on 5.  I wonder if that’s the case with all younger siblings.  Nathan wants so bad to be able to do everything that Kenny does — and Kenny, like a good big brother, dutifully lets Nathan tag along whenever possible.  I think that we’ve been pretty lucky that the two boys are so close.  It’s funny sometimes to watch.  Recently for example — Kenny was out riding his bike out on the street.  Well, out comes Nathan pulling Kenny’s old Red Rider out of the garage.  He’d seen Kenny riding and he wanted to ride his bike too.  So we dusted off Kenny’s old bike helmet and onto the bike he went. 

    If you look carefully, you can see that his feet just about reached the pedals :).  Which of course made me think about Kenny’s first time on this little bike.  Kenny was actually the same age as Nathan when he got this bike (2 years) but really didn’t start riding it until he was three – so to the pictures I went and look at what I found…

    Kenny's birthday  Kenny on the bike with his fire hat

    The first picture is at Kenny’s birthday.  The second is when he is 3 after we moved to our new house.  It’s funny looking at pictures of the two boys.  They are such different little people….So…

    On Saturday, family came up and celebrated Nathan’s birthday with us, which was pretty funny.  I don’t think Nathan quite understood what was going on — but I think he’s probably hoping that every day will be like Saturday was.  He got cake, pizza, presents, lots of toys, etc.  as well as getting to be the center of everyone’s attention. 

    Right now, Nathan is really partial to Nemo.  So this year, we had a Nemo birthday cake made for him.

    Nathan's birthday cake

    This was obviously a crowd favorite.  Not only were Nemo and Dory on the cake, but Nemo and Dory were bath toys, so after we started cutting the cake, he got to keep these two little guys as toys. 

    Now Nathan is very different from Kenny in one way: Nathan loves sweets.  Kenny really didn’t start liking candy, cake, doughnuts…until he was around 3, 3 1/2.  Nathan, on the other hand, has always liked his sweets.  And he really gets into them.

       

    If you look closely, you can see that Nemo is enjoying a little bit of cake as well.  As I said, he was pretty thrilled that he was able to play with Nemo and Dory as well.

    And what is a birthday party without presents?  And Nathan came out with a pretty good haul. 

     

    Some new clothes (lots of Cars themed sweaters), some trucks (lots of trucks) and a few other things.  At this point, he’s really into cars and trucks, so he spent the rest of the day carrying his new favorite trucks around the house.  (Did I mention he went to bed with them :)?  No?  Well, he did.)  For about an hour after he went to bed, I’ve been hearing, “Ninny (that’s how he pronounces Kenny), it’s a truck!  Vurroomm, Vurroom.”  I’m glad he had such a good time and hope he enjoys a night filled with dreams of trucks towing cars.

    –TR

     Posted by at 11:21 pm
    Sep 08, 2006
     

    I’ve been having a great time playing with ruby, but one of the things that I find myself constantly running up against is the “ruby way”.  The place where I notice this most is when dealing with XML.  One of the things I’ve noticed in working with XML in ruby is that the ruby crowd doesn’t seem to know what to do with it.  In most documentation that I’ve read, it appears that most ruby folks find little value in XML and prefer to work in YAML.  However, in my world, XML is king and working around it really isn’t an option.  So with XML, what are the ruby options?  There are a few, but by default, ruby pushes a component known as REXML.  In general, I’ve found this to be a nifty little library with lots of convenience functions.  However, it comes at a very high cost.  First, I’ve found that there’s a limit to the size of file that can be loaded (~5 MB, then the component starts breaking), and second, it’s slow.  Oh, how it is slow.  But how slow?

    Well, I’ve been spending a lot of time reworking the API for LibraryFind, our soon to be open source hybrid Federated Search system.  The big project of course has been moving the data into ruby and out of PHP so we can place the project within a web framework (in our case, Rails).  Well, part of the project deals with harvesting data for local indexing, and in many cases, the data being harvested is in OAI.  So, I’ve been playing with Ed Summers’ ruby-oai (very cool) and it’s based around REXML.  After installing the module and testing the harvesting of a small collection, I was pretty dismayed with the speed of the application.  I suspected that it was REXML that was causing the slowdown. 

    So after chatting with Ed a little bit, he said that he’d be open to hacking up a version that supported libxml provided the changes didn’t:

    1. Turn the code into an unholy mess
    2. Require custom XPath statements (and they almost do)

    Given that I was going to make these changes anyway for our own local instance, I thought this sounded like a good idea (and I’ve never had a chance to work with Ed, which I couldn’t pass up either :) ).

    So I spent some time today modifying the ruby-oai module and finished integrating libxml into the module (I guess I’ll wait now to see if I violated number 1 though :) ), and after running a small benchmarking application, all is well with the world again. 

    So how did the benchmarking go?  Funny you should ask. :)  I made an OAI request using the REXML codebase which returned 394 records parsing just the identifier from the header.  Total time:

    Time to run: 21.685583
    Records returned: 394

    Same codebase, but just changing the parser.  What’s the difference?  Let’s see:

    Time to run: 0.75901
    Records returned: 394

    This kind of difference really brings a smile to my face.  And considering how many oai sites I have to harvest from (~300), these extra seconds really start to add up.  Test code below…

    –TR

    Test Code:

    
    require 'date'
    require 'oai'
    
    buffer = ""
    start_time = Time.now()
    
    client = OAI::Client.new 'http://digitalcollections.library.oregonstate.edu/cgi-bin/oai.exe', :parser =>'libxml'
    
    last_check = Date.new(2006,1,1)
    records = client.list_records :set => 'archives', :metadata_prefix => 'oai_dc', :from => last_check
    x = 0
    records.each do |record|
      #fields = record.serialize_metadata(record.metadata, "oai_dc", "Oai_Dc")
      #puts "Primary Title: " + fields.title[0] + "\n"
      buffer << record.header.identifier + "\n"
      x += 1
    end
    
    end_time = Time.now()
    
    puts buffer
    puts "Time to run: " + (end_time - start_time).to_s + "\n"
    puts "Records returned: " + x.to_s
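The timing in the test code is done by hand with Time.now. As a small aside, the same measurement can be wrapped with Ruby's standard Benchmark module, and the two timings reported above work out to a speedup of somewhere between 28 and 29 times. The `timed` helper below is a hypothetical name of my own, and the summing block is just a stand-in workload, not an OAI request.

```ruby
require 'benchmark'

# Time a block and return [elapsed_seconds, block_result].
def timed
  result = nil
  elapsed = Benchmark.realtime { result = yield }
  [elapsed, result]
end

# Stand-in workload; in the test above this would be the harvesting loop.
elapsed, total = timed { (1..1000).reduce(:+) }
puts "Summed to #{total} in #{elapsed} seconds"

# Speedup implied by the two timings reported in this post.
rexml_seconds  = 21.685583
libxml_seconds = 0.75901
speedup = rexml_seconds / libxml_seconds
puts format("libxml was about %.1fx faster", speedup)
```

Returning the block's result alongside the elapsed time means the record count can still be reported without a separate counter variable.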
    
    
    
     Posted by at 11:08 pm
    Sep 07, 2006
     

    Very cool.  I’ll admit that my experience with Python is limited to IronPython, but I’ve been playing with it a bit and benchmarking it against the C-based implementation of Python, and in many places it’s actually faster or dead even.  See: IronPython version 1.0.  Great work by the IronPython team.

    –TR

     Posted by at 1:07 am