Oct 10 2013
 

I thought I’d take a quick moment to highlight some work that was done by one of the programmers here at The OSU, Peter Dietz.  Peter is a bit of a DSpace wiz and a contributor to the project, and one of the things that he’s been interested in working on has been the development of a REST API for DSpace.  You can see the notes on his work on this GitHub pull request: https://github.com/DSpace/DSpace/pull/323.

<sidenote>

Thankfully, I’m at a point in my career where I no longer have to be the one who wrestles with DSpace’s UI development, and I’ve never been a big fan of it.  From the days when the interface was primarily JSP to the XSLT interfaces that most people use today (which sounded like a good idea at the time), I’ve long pined for the ability to separate DSpace interface development from the actual application and move that development into a framework environment (any framework environment).  However, the lack of a mature REST API has made this type of separation very difficult.

</sidenote>

The work that Peter has done introduces a simple READ API into the DSpace environment.  A good deal more work would need to be done around authentication to manage access to non-public materials as well as expansions to the API around search, etc., but I think that this work represents a good first step. 

However, what’s even more exciting are the demonstration applications that Peter has written to test the API.  The primary client that he’s used to test his implementation is a Google Play application, which was developed using an MVC framework.  While it is a very, very simple client, I think it’s a great first step that shows some of the benefits of separating interface development from the core repository functionality: changes related to the API, or development around the API, no longer require recompiling the entire repository stack.

Anyway, Peter’s work and his notes can be found as part of this GitHub pull request: https://github.com/DSpace/DSpace/pull/323.  Here’s hoping that through Peter’s work, new development, or a combination of the two, we will see the inclusion of a REST API in the next release of the DSpace application.

–tr

 Posted by at 9:53 pm
Jan 02 2008
 

I’ve been thinking a little bit about some of the things that I use MarcEdit for and have been pushing some of this work off my desk to some of the staff in our technical services department.  We actually use MarcEdit quite a bit when it comes to sharing metadata from our Dspace instance with other systems, like OCLC’s WorldCat and our online Catalog.  For example, we use MarcEdit to automatically generate MARC21 records for our theses submitted through Dspace.  The process seems to work fairly well, and has been very easy for our staff to learn.  I should write an article documenting this process and how it’s working at OSU at some point.

To that end, I’m writing a plug-in for MarcEdit that may enable me to mainstream the processing of web page archiving in Dspace.  At this point, the process is a bit too manual for my tastes.  Along with spidering a site (using whatever the chosen depth may be), there is this pesky manual step of flattening the site and making the urls relative.  Not a big deal (unless there are file name collisions [which there always are] when reading depths), but it takes time.  So, I spent some time this afternoon and wrote a threaded web crawler.  Seems to work well.  At this point, I just need to add the logic to flatten all paths, and come up with a naming schema to re-write all urls to provide unique file names.  Once I get that down, building the batch import package for Dspace should be fairly trivial.  Not sure how much time I’ll have to work on this over the week/weekend, but it would be a pretty cool project to finish, I think.  It would certainly allow the library to provide site archiving as a dspace option (at this point, it’s only done under very special circumstances) and should simplify the process enough that it could probably become a mainstream process.
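The flattening and renaming step described above could be sketched roughly as follows. This is only an illustration of one possible naming schema, not the plug-in's code: the `FlatNamer` class, its method names, and the `_2`, `_3` suffix convention are all assumptions made up for the example.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: maps crawled URLs to unique file names in one flat directory. */
final class FlatNamer {
    private final Map<String, String> assigned = new HashMap<>(); // url -> flat file name
    private final Set<String> taken = new HashSet<>();            // names already handed out

    /** Returns the flat file name for a URL, always the same name for the same URL. */
    String nameFor(String url) {
        String existing = assigned.get(url);
        if (existing != null) {
            return existing;
        }
        // URI.create rejects malformed URLs (e.g. unescaped spaces); a real crawler
        // would need to escape or skip those.
        String path = URI.create(url).getPath();
        String base = (path == null || path.isEmpty() || path.endsWith("/"))
                ? "index.html"
                : path.substring(path.lastIndexOf('/') + 1);
        // On a collision, try base_2, base_3, ... until an unused name is found.
        String name = base;
        for (int n = 2; !taken.add(name); n++) {
            name = withSuffix(base, n);
        }
        assigned.put(url, name);
        return name;
    }

    /** Inserts "_n" before the extension, or appends it when there is none. */
    private static String withSuffix(String base, int n) {
        int dot = base.lastIndexOf('.');
        return dot > 0 ? base.substring(0, dot) + "_" + n + base.substring(dot)
                       : base + "_" + n;
    }
}
```

Because the same URL always maps to the same name, the map built up during the crawl can double as the lookup table for re-writing links inside the harvested pages.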

Anyway, if I do get a chance to get this finished, I’ll certainly make it available as a plug-in (with source).  Of course, if someone has already developed a simplified process that requires no manual processing after harvest, I would love to hear it.

–TR

 Posted by at 9:38 pm
Jan 21 2007
 

Ok — here’s the info. 
File: MidWinter 2007 ALCTS Presentation
So what’s included?  The zip file contains our custom XSLT that’s used in MarcEdit, the macro that I use to clean data, and the ppt slides.  The XSLT file is a custom version of the default OAIDC translation found in MarcEdit.  It’s customized to deal with the specific data that will be encountered within our ETD records.  If you want to use this XSLT for your own library, you will very likely need to make some small modifications, but it should get you started.  Anyway, questions?  Send cookies :)

–TR

 Posted by at 12:58 am
Sep 27 2006
 

As I’ve started to get more involved in the Dspace work here at OSU, one thing that has struck me is that the interface really could use a once over.  Nowhere is this more obvious than the community-list page.  While I doubt that this page was ever meant to be the default access mechanism into an institution’s Dspace collection, I’d bet that most often it is (apart from accesses that come from outside Dspace, like Google).  Like most institutions, Oregon State University is adding lots of new communities and collections to Dspace, and as the number of collections grows, so does the list.  At some point, this list simply becomes too unwieldy for folks to actually work with, so our IR group has started looking at ways to make it easier for users.

So our IR group has taken a first crack at this.  Basically, the idea is to simply collapse the list using a little dhtml and then allow users to toggle the collections that they want to see. 

So with this interface, we’ve obviously shortened the list, but introduced a whole new problem, which I’m not sure is a step forward or back: you can no longer see the collections.  Now, a user has to know what community a collection is a part of in order to find it.  To make this a little easier, I added a toggle all button, but that just gets you back to the big list interface.

I’ve toyed around with another option.  One thing that I thought might be interesting is adding links to the collections that are used the most.  At first, I thought that this was the type of information that dspace probably would be logging in the database — but after looking over the tables, I couldn’t find it.  So I started looking at the source and found a set of statistic classes that generate the dspace-general log files.  These log files are what dspace uses to generate the statistics screen in the administrative interface.  So with that, we are in business. 

If you look at the dspace-log-general log files, you see a couple of things. 

  1. A new one exists for each day (so long as you have logging enabled)
  2. They are very parseable. 

Using properties set in the config file, dspace analyzes log entries using a threshold to determine if the item should be placed into the list of most viewed items.  At OSU, we continue to use the default floor of 20 views to make the list, but you can set it to just about anything.  Once an item reaches that floor, the following entry will be placed into the dspace-log-general log file (from our dev server):

  • item.123456789/28=31

Here, you can see that you are getting a stem that can be matched on: item., the handle for the item being accessed: 123456789/28 and the number of times that the item has been accessed: 31.  So one of my thoughts for exposing often used collections would be to take this list and determine how often collections are being utilized.
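To make the three parts of that entry concrete, here is a minimal sketch of pulling them apart. The `ItemViewLine` class is a hypothetical helper written for this illustration, not something that ships with dspace, and it assumes every entry of interest has exactly the `item.<handle>=<count>` shape shown above.

```java
import java.util.Optional;

/** Hypothetical helper: one "item.<handle>=<count>" entry from a dspace-log-general file. */
final class ItemViewLine {
    static final String STEM = "item.";

    final String handle;
    final int views;

    private ItemViewLine(String handle, int views) {
        this.handle = handle;
        this.views = views;
    }

    /** Returns empty for any line that is not a well-formed item entry. */
    static Optional<ItemViewLine> parse(String line) {
        if (line == null || !line.startsWith(STEM)) {
            return Optional.empty();
        }
        int eq = line.lastIndexOf('=');
        if (eq <= STEM.length()) {
            return Optional.empty(); // no '=' at all, or nothing between the stem and '='
        }
        try {
            int views = Integer.parseInt(line.substring(eq + 1).trim());
            return Optional.of(new ItemViewLine(line.substring(STEM.length(), eq), views));
        } catch (NumberFormatException e) {
            return Optional.empty(); // count was not a number
        }
    }
}
```

For the sample entry, `ItemViewLine.parse("item.123456789/28=31")` yields the handle `123456789/28` and a view count of 31; lines with any other stem come back empty.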

Technically, the process isn’t really that difficult, though I’ll admit that I stumbled a little bit when working with the HashMaps since I’ve gotten lazy working with scripting languages.  Ideally, what you’d want to do is resolve each item’s collection and keep track of the number of times items have been accessed from the collection.  This way, you could then do a descending sort on the count to get the collections whose items have the most accesses, i.e., probably your most used collections.  The initial problem that I ran into was that I wanted to use Java’s HashMap like a PHP associative array, which allows you to sort on either key or value while maintaining keys.  There are a number of ways in Java to accomplish the same thing, but most seemed to require the use of Lists or LinkedHashMaps, and I didn’t want to bother with either.  So instead, I just created my own storage class and implemented the Comparable interface.  Much easier and cleaner, I think.
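For comparison, here is roughly what the List-based route looks like in current Java. It is a sketch of the alternative mentioned above, not the approach taken in this post; the `CollectionTally` class is hypothetical, and the `itemToCollection` map is a stand-in for the handle resolution that dspace itself would do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: totals item views per collection and sorts by total. */
final class CollectionTally {

    /**
     * Adds up item view counts per collection handle.
     * itemToCollection stands in for the real item-to-collection lookup.
     */
    static Map<String, Integer> tally(Map<String, Integer> itemViews,
                                      Map<String, String> itemToCollection) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : itemViews.entrySet()) {
            String collection = itemToCollection.get(e.getKey());
            if (collection == null) {
                continue; // item could not be resolved to a collection
            }
            totals.merge(collection, e.getValue(), Integer::sum);
        }
        return totals;
    }

    /** Returns the entries sorted by view count, highest first, keys kept alongside values. */
    static List<Map.Entry<String, Integer>> byViewsDescending(Map<String, Integer> totals) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }
}
```

The sorted list keeps each key next to its value, which is the behaviour of a PHP associative array that a bare HashMap does not give you.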

So, code.  I’ll have to note that I haven’t run this in our production environment yet, and because of the limited logging that we do on our dspace dev instance, the code hasn’t yet been run against a large set of items to be resolved, so there may need to be revisions.  But for anyone wanting to look at the code, here you go.

The changes made are basically broken into three parts:

  1. New class files (there are 2)
  2. Changes to the JSPs (one change to the community-list.jsp file)
  3. Additions to the dspace.cfg (basically, I didn’t want to hard code the log matching elements)

New Classes:

OSUCollection — This is the storage container that returns the collection name, the collection handle and the number of times the items logged in the collection had been accessed.  Items are returned in descending order — with most accessed items returned first.

 


package edu.oregonstate.library.util.objects;

import java.lang.*;
import java.util.*;

public class OSUCollection implements Comparable {
    private String name;
    private String handle;
    private int count = 0;


    public void setName(String s) {
        name = s;
    }

    public String getName() {
        return name;
    }

    public void setHandle(String s) {
        handle = s;
    }

    public String getHandle() {
        return handle;
    }

    public void setCount(int i) {
        count = i;
    }

    public int getCount() {
        return count;
    }

    public int compareTo(Object o) {
        return count - ((OSUCollection)o).count;
    }
}


GetPopularCollections — This is the actual logic part of the class that takes the stats log file, parses it, retrieves the collection specific metadata and returns an array of OSUCollection objects.

 


package edu.oregonstate.library.util;

import java.util.Arrays;
import java.lang.Integer;
import java.sql.SQLException;
import java.io.*;
import java.net.*;
import java.util.Set;
import java.util.Iterator;
import java.util.HashMap;
import java.util.Collections;
import java.util.Date;
import java.text.SimpleDateFormat;
import java.text.Format;


import org.dspace.content.Collection;
import org.dspace.content.DCValue;
import org.dspace.content.Item;
import org.dspace.core.ConfigurationManager;
import org.dspace.core.Context;
import org.dspace.handle.HandleManager;
import edu.oregonstate.library.util.objects.OSUCollection;

public class GetPopularCollections {
   public OSUCollection[] GetCollections()  throws Exception, SQLException {
    Context context = new Context();
    context.setIgnoreAuthorization(true);
    String record = null;
    HashMap tmpMap = new HashMap();
    HashMap itemMap = new HashMap();

    try {
       FileReader fr = null;
       BufferedReader br = null;

       try {
           String file = getLogFile(ConfigurationManager.getProperty("log.dir"),
                                    ConfigurationManager.getProperty("log.stem"),
                                    ConfigurationManager.getProperty("log.extension"));

           //String file = ConfigurationManager.getProperty("log.dir") + "/dspace-log-general-2005-9-20.dat";
           fr = new FileReader(file);
           br = new BufferedReader(fr);
       }
       catch (IOException e) {
          e.printStackTrace();
          System.out.println("Failed to read input file");
          System.exit(0);
       }
       while ((record = br.readLine()) != null) {
         // Only "item.<handle>=<count>" lines are of interest; startsWith also
         // copes with lines shorter than the stem
         if (record.startsWith("item.") && record.indexOf("=") > 5) {
            record = record.substring(5);
            tmpMap.put(record.substring(0, record.indexOf("=")), record.substring(record.indexOf("=")+1));
         }
       }

       //Setup the OSUCollections Object
       OSUCollection[] c = new OSUCollection[tmpMap.size()];
       Set set = tmpMap.keySet();
       Iterator it = set.iterator();
       int index = 0;
       while (it.hasNext()) {
           /*getCollectionInfo returns:
            * element[0]: collection name
            * element[1]: collection handle
            */
           String element = (String)it.next();
           String[] tmp = getCollectionInfo(context, element);
           if (tmp == null) {
              // The handle could not be resolved to an item; skip this entry
              continue;
           }
           int views = Integer.parseInt((String)tmpMap.get(element));
           if (!itemMap.containsKey(tmp[1])) {
              // First item seen for this collection: remember its slot in c[]
              itemMap.put(tmp[1], new Integer(index));
              c[index] = new OSUCollection();
              c[index].setName(tmp[0]);
              c[index].setHandle(tmp[1]);
              c[index].setCount(views);
              index++;
           } else {
              // itemMap is keyed by collection handle and holds the slot as an Integer
              int tint = ((Integer)itemMap.get(tmp[1])).intValue();
              c[tint].setCount(views + c[tint].getCount());
           }
       }

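       // c was sized for one entry per item, so trim the unused (null)
       // slots before sorting; Arrays.sort would fail on null elements
       OSUCollection[] result = new OSUCollection[index];
       System.arraycopy(c, 0, result, 0, index);
       Arrays.sort(result, Collections.reverseOrder());

       br.close();
       fr.close();
       return result;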
    }catch(Exception e) {
        System.out.println(e.toString());
        return null;
    }
  }


  private String[] getCollectionInfo(Context context, String handle) throws Exception, SQLException {
     Item item = null;
     String[] vals = new String[2];

     // ensure that the handle exists
     try
     {
       item = (Item) HandleManager.resolveToObject(context, handle);
     }
     catch (Exception e)
     {
        return null;
     }

     // if no handle that matches is found then also return null
     if (item == null)
     {
        return null;
     }

     // build the referece
     // FIXME: here we have blurred the line between content and presentation
     // and it should probably be un-blurred
     Collection myCollection = null;
     myCollection = item.getOwningCollection();
     vals[0] = myCollection.getMetadata("name");
     vals[1] = myCollection.getHandle();
     return vals;
  }

  private static String getLogFile(String lfile, String lstem, String extension) {
      //We use the simpledateformat to check for the presence of the default
      //file, which dspace creates in the stats-general file.
      //If this file isn't present, then we use the stem and extension to return our file
      Format sdf = new SimpleDateFormat("yyyy-MM-dd");
      Date myDate = new Date();
      String default_file = lfile + File.separator + lstem + "-" + sdf.format(myDate) + "." + extension;
      File tmp = new File(default_file);
      if (tmp.exists()) {
         return default_file;
      }

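      // No file for today's date: fall back to the most recently modified
      // matching file in the log directory.
      // Assumption: files are matched on the configured stem and extension;
      // the original condition did not survive in this copy of the post.
      File dir = new File(lfile);
      String[] children = dir.list();
      String logfile = "";
      long lastmod = 0;
      if (children == null) {
          return null;
      }
      for (int i = 0; i < children.length; i++) {
          if (children[i].startsWith(lstem) && children[i].endsWith("." + extension)) {
              tmp = new File(lfile + File.separator + children[i]);
              if (tmp.lastModified() > lastmod) {
                  lastmod = tmp.lastModified();
                  logfile = lfile + File.separator + children[i];
              }
          }
      }
      return logfile;
  }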

}


JSP Changes:

Community-list.jsp

Around line 189 you should add the following:

 



<%
   GetPopularCollections objCol = new GetPopularCollections();
   OSUCollection[] col  = objCol.GetCollections();
   int x =0;
   if (col != null && col.length > 0) {
%>
   

Most Viewed Collections

<% for (int i = 0; i < col.length; i++) { %>
  • /handle/<%=col[i].getHandle()%>"><%=col[i].getName()%> [<%=col[i].getCount()%>]
  • <% x++; if (x > 5) { break; } } } %>

    Config additions

    dspace.cfg — as noted, I basically add these entries so I don’t have to hardcode the values into the GetPopularCollections class. These are added around line 166, near log.dir

     

    
    
    log.stem = dspace-log-general
    log.extension  = dat
    
    
    

    And that’s it.  If you want to access the two class files directly, they can be found at:

    The end result of all this, is a display that looks like the following:

    Again, I have no idea if this will be useful for our users in general, but I can see some future applicability in terms of evaluating collection usage in helping to determine what materials to collect and archive.  But I guess we’ll see.

     

    –TR

     Posted by at 12:11 pm
    Sep 21 2006
     

    It’s funny how things work out.  As some folks know, we are currently in the process at OSU of re-writing our hybrid metasearch tool from PHP to ruby, and now that much of the heavy lifting has been accomplished (a simple caching engine, wsdl api, oai harvesting, ferret integration, etc.), it’s time to start looking at porting some of the niceties that we added to the PHP instance of LibraryFind to gauge our users’ reactions.  One of the features that remained to be ported was a spell checker.

    In the PHP instance of LibraryFind, the utility made use of a couple of built-in PHP components to provide a spell checker that wasn’t simply a dictionary lookup.  Utilizing Aspell as the dictionary behind the spell checker, the tool utilized metaphonics (the analysis of sounds) to determine which entry returned by the dictionary was closest to the actual typed text.  This of course meant that the text entered into the application had to be misspelled phonetically if the tool was to find the best match possible.

    Well, ruby doesn’t have many of these tools available to it.  While I could have coupled it with Aspell, I couldn’t find a good metaphonics engine to attach to ruby, which I think is important since straight dictionary matching is only so useful.  So I started looking for alternatives…

    It’s actually interesting how many alternatives one can find.  Google actually has two.  The first is its publicly available Search API, which includes an option to return spell-checked items.  This looked promising, but the 1000 search limit was disappointing.  However, Google provides a secondary service, for the most part undocumented, that is currently utilized by its Firefox plugin.  Apparently, this undocumented API has been well known for some time, going back to 2005 when Google first released its beta Firefox toolbar.  What’s interesting as well (and I took a look at the source to see it for myself) is that within the toolbar, Google has a number of code comments that indicate whether a specific piece of functionality is intended to be used by outside organizations.  A good example of this is the pagerank calls in the toolbar.  Comments surround the code that make it clear that this isn’t a publicly available function.  However, this isn’t the case with the toolbar spell checker API, and after looking through the EULA and doing some web sleuthing, I couldn’t find anyone talking about restrictions on use.  Quite the contrary, there are a number of people building interfaces to this api.  You can find them here and here and here and here and here.  In fact, I could find examples working with this api in PHP, python, C# and PERL.  So I figured, why not ruby?  So, last night I wrote a small ruby class to parse this data for LibraryFind.

    I know, I know: what does this have to do with Dspace?  Well, after coding it in ruby I figured, heck, why not Java?  So I spent today putting together a new java class that implements the Google Spell checking API within Dspace.  Unfortunately, this isn’t the simplest of hacks (since it requires touching a number of the jsp files and adding a class file), but take a look at the results:

    Currently, this exists on our development server, but will likely make its way into our production environment next week.  ***Side note:  it makes sense that today would be the day that I’d choose to make these changes.  We are currently freezing Dspace development so we can finish porting our current hacks into the 1.4 source and bring that into production.  The process was just about finished when I decided that we had time to sneak one more change into the migration. :)  You can see what Jeremy puts up with (too much copious free time at home, I guess). :)

    So how can you too add this functionality to your Dspace instance?  Good question.  Well, first you need this java class.  I won’t guarantee that you won’t find any problems with it (and if you do, give me a holler) since I just finished it today and only did a little bit of light testing, but I haven’t had trouble with it so far, so I feel pretty good about it.  You’ll likely need to change where the class is packaged.   We tend to package OSU specific classes within their own namespace.  Second, you need to make changes to the following files:

    1. search/results.jsp
    2. utils.js

    First, results.jsp: 

    Around line 75, add the reference to the class.

    
    <%@ page import="edu.oregonstate.library.util.GoogleSpell" %>
    
    

    Around line 104 — you need to add an id to the simple-search form tag.

    
    

    Around line 161 — you need to add an id to the query text box

     
      " />
    
    

    Around line 170, add the following snippet. You will notice that there is a snippet of inline javascript code. The reason it’s there is that if Google offers no corrections for any word, I just don’t want to show them. So, by default, the results are hidden and the inline javascript actually displays the results if the can_see variable is set to true.

     
    <%
    
    if (query!=null) {
    %>
       
    <%
       boolean can_see = false;
       GoogleSpell objURL = new GoogleSpell();
       String[] words = query.split(" ");
       String[] t = objURL.GetWords(query);
       if (t!=null) {
    %>
     
       
    <%
      }
    
    if (can_see==true) {
    %>
      
    <%
    }
    }
    %>
    
    
    

     

    And next, the utils.js file:

    In the utils.js file we are just adding two new functions. One is a convenience function and the other is the BuildSearch function. I put these at the end of the utils.js file, but it really doesn’t matter where they go:

    
    function BuildSearch(f,q, cf, n) {
      var s = "";
      for (i =0; i
    

    And that's pretty much it.  Recompile the source and next time you do a search, if Google returns a suggestion, the tool will present it in the context of a Did you mean question.  Since misspellings sometimes occur within a phrase, or can have multiple suggestions, I've built the interface so that multiple selections show up in a listbox.  If a word is spelled correctly, it is frozen, so only the misspelled words can be selected from.  For example:

    In this example, degree was spelled correctly, but forestry was not.  Since Google returns suggestions for the misspelled word, those options are placed into a listbox, while the other is frozen since it's spelled correctly and no other suggestions were offered.

    So that's it in a nutshell.  Hopefully someone else will find these snippets useful.

     

    --TR

     Posted by at 10:09 pm
    Aug 30 2006
     

    At OSU, we have very few Dspace collections configured to allow direct submission to the repository.  Nearly anyone on campus can submit an item into Dspace, but that item is then vetted through Technical Services where metadata is looked at and corrected before being added to Dspace.  To do this, we have ~4 individuals (though primarily 3) that can take a task from the pool for evaluation. 

    Now those folks that use Dspace know that Dspace places items into the task list in the order that they were received.  This means that if a cataloger was responsible for a particular collection, they would always have to look over the entire task list to see if any items from their collections had been submitted.  It was a fairly time consuming process and one that constantly soured staff on working within the Dspace interface.

    Usually, my level of caring for this problem would be ancillary.  Up until August of this year, we had a programmer that handled the majority of the Dspace customizations.  I think that my well-known aversion to Java might have had a hand in this, but to be honest I didn’t mind.  (Yes, I never was a Java convert.  I use it when I have to, but traditionally I’ve always preferred the more procedural style found in Assembler or C.  However, over the past two years, my experience with C# has really softened my stance on Java a bit.  I still find some of the syntax non-intuitive.)  Anyway, in August, I was asked to spend some time working with Dspace, since changes could then be incorporated faster: they could now be requested just by knocking on my door.

    Anyway, getting back to the task pool.  During one of my weekly Digital Production Unit meetings, my staff let me know that this was an issue.  What they wanted was the pool, sorted by collection/date.  Seemed easy enough — and it was.  Now I’ll admit, this is a bit of a quick and dirty hack — but as I look at Dspace, I seem to see a lot of these types of hacks, so mine should fit in.  Changes need to be made only to the main.jsp file in the mydspace directory. 

    Original Code:

    lns: 204-212

    String row = "even";
    for (int i = 0; i < pooled.length; i++)
    {
        DCValue[] titleArray = pooled[i].getItem().getDC("title", null, Item.ANY);
        String title = (titleArray.length > 0 ? titleArray[0].value : LocaleSupport.getLocalizedMessage(pageContext, "jsp.general.untitled"));
        EPerson submitter = pooled[i].getItem().getSubmitter();

     

    Modified:

    String[] tcoll = new String[pooled.length];
    for (int i = 0; i < pooled.length; i++) {
        tcoll[i] = pooled[i].getCollection().getMetadata("name") + "_" + String.valueOf(i);
    }
    Arrays.sort(tcoll);

    for (int z = 0; z < tcoll.length; z++)
    {
        int i = Integer.parseInt(tcoll[z].substring(tcoll[z].lastIndexOf("_")+1));

     

    As folks can see, basically, the modification reads the collection name of each item in the task pool and stores the data in an array as: [collection name]_number.  The number stored is the position at which the item occurs in the list.  Once the new array is set up, it’s sorted, and then it is this array, not the pooled array, that is used to step through the tasks.  The index number for accessing items in the pooled array is pulled by parsing the position number from the collection array.
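    One caveat with string keys of this form: they sort character by character, so within a single collection a key ending in _10 lands before one ending in _2 once the pool holds ten or more tasks. A possible variation, sketched below under the assumption that only the collection names are needed, sorts the positions themselves and leans on the stable sort to keep the received order inside each collection. The PoolOrder class is hypothetical and is not part of the change we deployed.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helper: orders task pool positions by collection name. */
final class PoolOrder {

    /**
     * Returns the positions 0..n-1 ordered by collection name.
     * Arrays.sort on objects is stable, so positions that share a
     * collection name stay in their original (received) order.
     */
    static Integer[] sortedIndexes(String[] collectionNames) {
        Integer[] order = new Integer[collectionNames.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparing((Integer i) -> collectionNames[i]));
        return order;
    }
}
```

    In main.jsp the names would come from pooled[i].getCollection().getMetadata("name"), and the loop would step through the returned positions instead of parsing them back out of the strings.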

    We’ve been using this for ~2 1/2 weeks now, and the staff are much happier. 

    Now that I’m working on Dspace, I may periodically post changes that we are making to the application if I find them interesting.  Whether anyone else will, well, we’ll see.

     

    –TR