It's funny how things work out. As some folks know, we at OSU are currently rewriting our hybrid metasearch tool, LibraryFind, from PHP to Ruby. Now that much of the heavy lifting is done (a simple caching engine, WSDL API, OAI harvesting, Ferret integration, etc.), it's time to start porting some of the niceties we added to the PHP instance of LibraryFind to gauge our users' reactions. One of the features that remained to be ported was a spell checker.
In the PHP instance of LibraryFind, the utility used a couple of built-in PHP components to provide a spell checker that wasn’t simply a dictionary lookup. With Aspell as the dictionary behind the spell checker, the tool used metaphonics (the analysis of sounds) to determine which entry returned by the dictionary was closest to the text actually typed. This of course meant that the text entered into the application had to be misspelled phonetically if the tool was to find the best possible match.
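For a feel of what phonetic ranking looks like, here is a rough Java sketch. It is not the PHP code from LibraryFind: it swaps in a simplified Soundex code as a stand-in for metaphone (the JDK ships neither), breaks ties with plain edit distance, and the class name `PhoneticRanker`, its method names, and the hard-coded word list are all made up for illustration. In a real setup the candidates would come from a dictionary such as Aspell.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: Soundex stands in for metaphone, which the JDK does not provide.
final class PhoneticRanker {

    // Simplified Soundex: the first letter plus up to three consonant-class digits.
    static String soundex(String word) {
        String s = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (s.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char prev = code(s.charAt(0));
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            char d = code(c);
            if (d != '0' && d != prev) {
                out.append(d);
            }
            if (c != 'H' && c != 'W') {
                prev = d; // H and W do not break up a run of the same code
            }
        }
        while (out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    private static char code(char c) {
        switch (c) {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0'; // vowels, H, W, Y carry no code
        }
    }

    // Classic Levenshtein distance, keeping only two rows of the table.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Candidates that sound like the typed word come first; edit distance breaks ties.
    static List<String> rank(String typed, List<String> candidates) {
        String key = soundex(typed);
        String lowered = typed.toLowerCase(Locale.ROOT);
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator
                .comparingInt((String c) -> soundex(c).equals(key) ? 0 : 1)
                .thenComparingInt(c -> editDistance(lowered, c.toLowerCase(Locale.ROOT))));
        return ranked;
    }

    public static void main(String[] args) {
        List<String> fromDictionary = Arrays.asList("forest", "fortress", "foresters", "forestry");
        System.out.println(rank("forrestry", fromDictionary));
    }
}
```

With the sample list, "forestry" ranks first for the typed word "forrestry" because it shares the same phonetic key and is only one edit away. A word that is misspelled in a way that changes its sound would not benefit from the phonetic step, which is the limitation described above.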
Well, Ruby doesn’t have many of these tools available to it. While I could have coupled it with Aspell, I couldn’t find a good metaphonics engine to attach to Ruby, which I think is important since straight dictionary matching is only so useful. So I started looking for alternatives…
It’s actually interesting how many alternatives one can find. Google has two. The first is its publicly available Search API, which includes an option to return spell-checked items. This looked promising, but the 1,000-search limit was disappointing. However, Google provides a second, largely undocumented service that is currently used by its Firefox plugin. Apparently, this undocumented API has been well known for some time, going back to 2005 when Google first released its beta Firefox toolbar. What’s interesting as well (and I took a look at the source to see it for myself) is that within the toolbar, Google has a number of code comments that indicate whether a specific piece of functionality is intended to be used by outside organizations. A good example of this is the PageRank calls in the toolbar: comments surrounding the code make it clear that this isn’t a publicly available function. However, this isn’t the case with the toolbar spell checker API, and after looking through the EULA and doing some web sleuthing, I couldn’t find anyone talking about restrictions on use. Quite the contrary, there are a number of people building interfaces to this API. You can find them here and here and here and here and here. In fact, I could find examples working with this API in PHP, Python, C# and Perl. So I figured, why not Ruby? So, last night I wrote a small Ruby class to parse this data for LibraryFind.
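To give a sense of what "parsing this data" involves, here is a hedged Java sketch of a response parser. The post does not show the response format, so everything about it here is an assumption recalled from third-party wrappers of the time: a `spellresult` root, one `c` element per flagged word, `o` and `l` attributes for the word's offset and length in the query, and tab-separated suggestions as the element text. The class name `SpellResponseSketch` and the sample XML are invented for illustration, and no network call is made.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch; the XML layout is an assumption, not taken from the post.
final class SpellResponseSketch {

    /** One flagged word: where it sits in the query and what was suggested for it. */
    static final class Correction {
        final int offset;
        final int length;
        final List<String> suggestions;

        Correction(int offset, int length, List<String> suggestions) {
            this.offset = offset;
            this.length = length;
            this.suggestions = suggestions;
        }
    }

    static List<Correction> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("c"); // assumed element name
            List<Correction> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element c = (Element) nodes.item(i);
                String text = c.getTextContent().trim();
                // Assumed: suggestions are separated by tab characters.
                List<String> words = text.isEmpty()
                        ? Collections.<String>emptyList()
                        : Arrays.asList(text.split("\t"));
                out.add(new Correction(
                        Integer.parseInt(c.getAttribute("o")),  // assumed: offset in query
                        Integer.parseInt(c.getAttribute("l")),  // assumed: length of word
                        words));
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Unparseable spell response", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample for the query "degree forrestry"; not a captured response.
        String sample = "<spellresult error=\"0\" clipped=\"0\" charschecked=\"16\">"
                + "<c o=\"7\" l=\"9\" s=\"1\">forestry\tforester</c></spellresult>";
        for (Correction c : parse(sample)) {
            System.out.println(c.offset + "+" + c.length + " -> " + c.suggestions);
        }
    }
}
```

Keeping the offset and length with each set of suggestions is what lets a caller map them back onto the exact word in the original query, rather than guessing by position in a word list.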
I know, I know: what does this have to do with DSpace? Well, after coding it in Ruby I figured, heck, why not Java? So I spent today putting together a new Java class that implements the Google spell-checking API within DSpace. Unfortunately, this isn’t the simplest of hacks (it requires touching a number of the JSP files and adding a class file), but take a look at the results:
Currently, this exists on our development server, but it will likely make its way into our production environment next week. ***Side note: it makes sense that today would be the day that I’d choose to make these changes. We are currently freezing DSpace development so we can finish porting our current hacks into the 1.4 source and bring that into production. The process was just about finished when I decided that we had time to sneak one more change into the migration. 🙂 You can see what Jeremy puts up. Too much copious free time at home, I guess. 🙂
So how can you add this functionality to your DSpace instance too? Good question. Well, first you need this Java class. I won’t guarantee that you won’t find any problems with it (and if you do, give me a holler), since I just finished it today and only did a little bit of light testing, but I haven’t had trouble with it so far, so I feel pretty good about it. You’ll likely need to change where the class is packaged; we tend to package OSU-specific classes within their own namespace. Second, you need to make changes to the following files:
- search/results.jsp
- utils.js
First, results.jsp:
Around line 75, add the reference to the class.
<%@ page import="edu.oregonstate.library.util.GoogleSpell" %>
Around line 104, you need to add an id to the simple-search form tag.
Around line 161, you need to add an id to the query text box.
Around line 170, add the following snippet. You will notice a snippet of inline JavaScript. It’s there because if Google offers no corrections for any word, I don’t want to show the suggestions at all. So, by default, the results are hidden, and the inline JavaScript displays them only if the can_see variable is set to true.
<%
if (query!=null) {
%>
<%
}
if (can_see==true) {
%>
<%
}
}
%>
And next, the utils.js file:
In the utils.js file we are just adding two new functions. One is a convenience function and the other is the BuildSearch function. I put these at the end of the utils.js file, but it really doesn’t matter where they go:
function BuildSearch(f,q, cf, n) {
var s = "";
for (i =0; i
And that's pretty much it. Recompile the source, and the next time you do a search, if Google returns a suggestion, the tool will present it as a "Did you mean" question. Since misspellings sometimes occur within a phrase, or can have multiple suggestions, I've built the interface so that multiple selections show up in a listbox. If a word is spelled correctly, it is frozen, so only the misspelled words offer a selection. For example:
In this example, degree was spelled correctly, but forestry was not. Since Google returns suggestions for the misspelled word, those options are placed into a listbox, while the other word is frozen since it's spelled correctly and no other suggestions were offered.
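The frozen-versus-listbox behavior can be modeled in a few lines of Java. This is only a sketch of the idea, not the code from the GoogleSpell class or results.jsp: `DidYouMeanSketch`, `Slot`, `buildSlots`, and the map of suggestions keyed by word are all hypothetical names. The `canSee` method mirrors the can_see flag described earlier, which stays false when no word has any suggestion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "Did you mean" model; all names are illustrative.
final class DidYouMeanSketch {

    /** One word of the query: frozen text, or a word with selectable options. */
    static final class Slot {
        final String word;
        final List<String> options; // empty when the word is frozen

        Slot(String word, List<String> options) {
            this.word = word;
            this.options = options;
        }

        boolean isFrozen() {
            return options.isEmpty();
        }
    }

    // Words with suggestions become listbox slots; every other word is frozen.
    static List<Slot> buildSlots(String query, Map<String, List<String>> suggestions) {
        List<Slot> slots = new ArrayList<>();
        for (String word : query.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // an empty query splits into a single empty string
            }
            List<String> options =
                    suggestions.getOrDefault(word, Collections.<String>emptyList());
            slots.add(new Slot(word, options));
        }
        return slots;
    }

    // True only if at least one word has something to offer.
    static boolean canSee(List<Slot> slots) {
        for (Slot slot : slots) {
            if (!slot.isFrozen()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> suggestions = new HashMap<>();
        suggestions.put("forrestry", Arrays.asList("forestry", "forester"));
        List<Slot> slots = buildSlots("degree forrestry", suggestions);
        for (Slot slot : slots) {
            System.out.println(slot.word + (slot.isFrozen() ? " (frozen)" : " -> " + slot.options));
        }
        System.out.println("show suggestions: " + canSee(slots));
    }
}
```

Modeling every word as a slot, frozen or not, keeps the rebuilt query in the original word order when the user picks a replacement for just one of them.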
So that's it in a nutshell. Hopefully someone else will find these snippets useful.
--TR
Comments
6 responses to “Dspace hack #2 — Did you mean?”
THIS IS TEH AWESOME. You should submit it as a patch. They won’t take it (because of its dependence on Google), but that should smooth the road toward accepting it as an add-in later.
Until now, I hadn’t seen any projects actually utilizing this feature full-time. I think your implementation is particularly cool (even if it is written in Java). Now, if I could work that into the live search on my blog, that would rock…
Anyway, glad I could be a source of hackery inspiration!
Chris,
Thank you for the documentation. I’d pretty much resigned myself to using a pure dictionary search until I ran across your post. And then once I knew what to look for — well, then I was off and running.
–TR
Terry,
I’d second Dorothea’s request that you make this into a DSpace patch. Even as a “committer”, I cannot promise that we’d be legally allowed to distribute it as part of DSpace (that’s probably a question for MIT/HP or the Advisory Board), but I’m sure others would LOVE to add this in themselves.
Also, as Dorothea mentioned, DSpace does have some AddOn functionality in the works, which this would be perfect for (hopefully that link works..the DSpace Wiki is currently undergoing a migration to MediaWiki).
If you don’t have a chance to make a patch, or won’t be able to maintain it, you could just “copy” or link this post over on the DSpace wiki (once the wiki is stable again), and perhaps someone else can take on the maintenance, until it can be made into an “AddOn” for DSpace.
In any case, thanks for the great code! We’re hoping to be able to use it here at UIUC as well!
Tim, Dorothea,
Thanks for the kind words. My intention is to bake this as a patch and get it submitted — hopefully this week. But you both are probably right — something like this makes more sense as an add-on. Maybe when I have time — I’ll take a closer look at what’s been done here so far.
–TR
This is awesome. Can you post the ruby code you wrote to access google’s web service?