I’ve been thinking a bit about the things I use MarcEdit for, and I’ve been pushing some of this work off my desk to staff in our technical services department. We use MarcEdit quite a bit to share metadata from our DSpace instance with other systems, like OCLC’s WorldCat and our online catalog. For example, we use MarcEdit to automatically generate MARC21 records for theses submitted through DSpace. The process seems to work fairly well and has been very easy for our staff to learn. At some point I should write an article documenting this process and how it’s working at OSU.
To that end, I’m writing a plug-in for MarcEdit that may enable me to mainstream the processing of web page archiving in DSpace. At this point, the process is a bit too manual for my tastes. Along with spidering a site (to whatever depth is chosen), there is the pesky manual step of flattening the site and making the URLs relative. It’s not a big deal (unless there are file name collisions, which there always are when crawling more than one level deep), but it takes time. So I spent some time this afternoon and wrote a threaded web crawler. It seems to work well. At this point, I just need to add the logic to flatten all paths and come up with a naming scheme that rewrites every URL to a unique file name. Once I get that down, building the batch import package for DSpace should be fairly trivial. I’m not sure how much time I’ll have to work on this over the week/weekend, but I think it would be a pretty cool project to finish. It would certainly allow the library to offer site archiving as a DSpace option (at this point, it’s only done under very special circumstances), and it should simplify the process enough that it could probably become a mainstream one.
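To make the flattening step concrete, here is a rough sketch in Python of one way the naming scheme could work. This is not the plug-in code (which isn't finished), just an illustration under a few assumptions of mine: `flat_name` and `rewrite_links` are hypothetical helper names, uniqueness comes from appending a short hash of the original host and path, and links are found with a simple regular expression rather than a real HTML parser.

```python
import hashlib
import posixpath
import re
from urllib.parse import urldefrag, urljoin, urlsplit


def flat_name(url):
    """Map an absolute URL to one flat file name (hypothetical scheme)."""
    parts = urlsplit(url)
    path = parts.path
    # Treat directory-style URLs as their index page (an assumption).
    if path == "" or path.endswith("/"):
        path += "index.html"
    stem, ext = posixpath.splitext(posixpath.basename(path))
    # Hash host + path + query so a/index.html and b/index.html differ.
    key = parts.netloc.lower() + path
    if parts.query:
        key += "?" + parts.query
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]
    return f"{stem}_{digest}{ext}"


# Simplified matcher for quoted href/src attributes; a real HTML
# parser would be more robust.
LINK_RE = re.compile(
    r'''(?P<attr>\b(?:href|src))=(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)


def rewrite_links(html, page_url, harvested):
    """Point links to harvested pages at their flat, relative file names.

    `harvested` is the set of absolute URLs the crawler actually saved;
    links to anything else are left untouched.
    """
    def swap(match):
        absolute, fragment = urldefrag(urljoin(page_url, match.group("url")))
        if absolute not in harvested:
            return match.group(0)
        target = flat_name(absolute)
        if fragment:
            target += "#" + fragment
        q = match.group("q")
        return f"{match.group('attr')}={q}{target}{q}"

    return LINK_RE.sub(swap, html)
```

With something like this, two pages that share a file name in different directories end up as two distinct files in a single folder, and every internal link becomes a bare relative file name. The trade-off is that the hashed names are not very readable, so a mapping file from original URL to flat name would probably be worth keeping alongside the harvest.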
Anyway, if I do get a chance to finish this, I’ll certainly make it available as a plug-in (with source). Of course, if someone has already developed a simplified process that requires no manual processing after harvest, I would love to hear about it.