For those who have been following development, I’ve completed the harvesting component of the Metasearch tool. Basically, we are envisioning this tool as a hybrid search: we harvest as much data as we can but federate the search when we have to. Then we bring the results together and rank them within the context of the returned results. Anyway, here’s the updated search screen:
As you can see from the screenshot, the search is querying ~25 databases in about 8 seconds. The reason we are getting such good response times is that many of these items have been harvested and indexed within our MySQL database of harvested records (which, by the way, is internally normalized to Dublin Core). I realize we lose some granularity in the metadata, but for our purposes (search), I think that it’s OK, though I guess we’ll see.
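To make the hybrid idea a bit more concrete, here is a rough Python sketch, not the tool’s actual code. It assumes a hypothetical `harvested_search` callable that queries the local harvested index and a hypothetical `federated_targets` dict of callables for the sources that still have to be searched live. The 8-second default time budget simply echoes the number above, and the optional `score` function stands in for whatever ranking is applied to the merged set.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def hybrid_search(query, harvested_search, federated_targets, timeout=8.0, score=None):
    """Merge local (harvested) hits with live (federated) hits.

    harvested_search: callable(query) -> list of records   (hypothetical)
    federated_targets: dict of name -> callable(query) -> list of records (hypothetical)
    Returns (records, failed_target_names).
    """
    # The harvested index is local, so query it directly.
    records = list(harvested_search(query))
    failed = []

    # Query every remote target in parallel, within one shared time budget.
    pool = ThreadPoolExecutor(max_workers=max(1, len(federated_targets)))
    futures = {pool.submit(fn, query): name for name, fn in federated_targets.items()}
    done, not_done = wait(futures, timeout=timeout)

    for future in done:
        try:
            records.extend(future.result())
        except Exception:
            # A broken target should not sink the whole search.
            failed.append(futures[future])

    # Targets that missed the deadline are reported, and their results dropped.
    failed.extend(futures[future] for future in not_done)
    pool.shutdown(wait=False)

    # Rank only within the set of records that actually came back.
    if score is not None:
        records.sort(key=score, reverse=True)
    return records, failed
```

The design choice worth noting is that a slow or failing remote target only costs its own results; everything already harvested still comes back quickly.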
Currently, the ranking algorithm is fairly simple. It uses the following to create a numeric rank:
- Exact title match
- Instring title match
- Words in title (with first word in the phrase ranking higher)
- Match within the subjects
- Match within the creators
- Instring match in all metadata (all words together)
- Instring match of each search word within the metadata
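The list above could be turned into a score roughly like this. This is only a sketch: the weights in `WEIGHTS` are made up for illustration, the record field names (`title`, `subjects`, `creators`, `description`) are assumed rather than taken from the real schema, and "first word in the phrase" is read here as the first word of the search phrase.

```python
import re

# Hypothetical weights; chosen only so stronger signals outrank weaker ones.
WEIGHTS = {
    "exact_title": 100,
    "instring_title": 50,
    "title_word": 5,
    "first_word_bonus": 5,
    "subject": 10,
    "creator": 10,
    "phrase_anywhere": 8,
    "word_anywhere": 1,
}


def normalize(text):
    """Lowercase and strip punctuation so matching is forgiving."""
    return " ".join(re.findall(r"\w+", text.lower()))


def rank(query, record):
    """Return a numeric rank for one record (a dict) against the query."""
    phrase = normalize(query)
    if not phrase:
        return 0
    query_words = phrase.split()

    title = normalize(record.get("title", ""))
    subjects = [normalize(s) for s in record.get("subjects", [])]
    creators = [normalize(c) for c in record.get("creators", [])]
    description = normalize(record.get("description", ""))
    everything = " ".join([title, *subjects, *creators, description])

    score = 0
    if title == phrase:                        # exact title match
        score += WEIGHTS["exact_title"]
    if phrase in title:                        # instring title match
        score += WEIGHTS["instring_title"]

    title_words = set(title.split())
    for position, word in enumerate(query_words):
        if word in title_words:                # whole-word match in title
            score += WEIGHTS["title_word"]
            if position == 0:                  # first search word counts more
                score += WEIGHTS["first_word_bonus"]

    if any(phrase in s for s in subjects):     # match within the subjects
        score += WEIGHTS["subject"]
    if any(phrase in c for c in creators):     # match within the creators
        score += WEIGHTS["creator"]
    if phrase in everything:                   # all words together, anywhere
        score += WEIGHTS["phrase_anywhere"]
    for word in query_words:                   # each word on its own, anywhere
        if word in everything:
            score += WEIGHTS["word_anywhere"]
    return score
```

Sorting the returned records by `rank(query, record)` in descending order then gives the display order. Because the score is just a sum of weights, it only means something relative to the other records in the same result set.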
The number that comes up isn’t a percentage in any sense of the word, but it does seem to do a pretty good job of putting the most relevant result in the returned record set at the top.
Anyway, I have a list of 2,500 actual user searches, and I’m going to write a script to beat the heck out of this tool, capturing error messages, processing time, number of results, etc., to see how it might work under load. Currently, we have a metasearch tool that we pay for, Innovative’s MetaFind. However, looking at the numbers sent to us by III, usage of that tool (and you have to realize, it’s been available for a year) has hovered around 90 queries a day. I know the new system could easily handle that type of load, but we are expecting this one to be successful, so it needs to hold up under a good deal more than that.
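A replay script along those lines might look something like the sketch below. The `search_fn` argument is a hypothetical stand-in for whatever actually submits a query to the tool and returns its results; nothing here reflects the real interface. This version runs the queries one at a time, so it measures per-query behavior rather than true concurrent load.

```python
import time


def replay(queries, search_fn):
    """Run each query through search_fn, logging time, hit count, and errors."""
    log = []
    for query in queries:
        start = time.perf_counter()
        try:
            results = search_fn(query)
            entry = {"query": query, "results": len(results), "error": None}
        except Exception as exc:
            # Record the failure and keep going with the next query.
            entry = {"query": query, "results": 0, "error": repr(exc)}
        entry["seconds"] = time.perf_counter() - start
        log.append(entry)
    return log


def summarize(log):
    """Boil the per-query log down to a few headline numbers."""
    times = sorted(entry["seconds"] for entry in log)
    return {
        "queries": len(log),
        "errors": sum(1 for e in log if e["error"]),
        "zero_hits": sum(1 for e in log if not e["error"] and e["results"] == 0),
        "median_seconds": times[len(times) // 2] if times else None,
        "max_seconds": times[-1] if times else None,
    }
```

To approximate several users searching at once, the same `replay` function could be run from multiple threads (for example with `concurrent.futures.ThreadPoolExecutor`), each working through its own slice of the query list.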