Running MarcEdit 5 on Macs, Linux, and other updates

By reeset / On / In MarcEdit

So with the coming preconference at Code4Lib (which I wish I was going to :( ), I’ve had a few folks ask if MarcEdit can run on a Mac or Linux system.  Well, the console version of the application can.  The GUI portion of the program still relies on components that haven’t been fully migrated to Mono, but the console version has worked just fine for about the past 6 months.  So, if you have a copy of Mono installed on your Mac or Linux box, you can try the instructions below.
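If you’re not sure whether Mono is set up, a quick sketch of a terminal check (the install location doesn’t matter, as long as mono is on your PATH):

```shell
# Verify Mono is installed and on the PATH before trying cmarcedit.exe
if command -v mono >/dev/null 2>&1; then
    mono --version | head -n 1   # prints the Mono runtime version line
else
    echo "Mono not found; grab it from mono-project.com first"
fi
```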

In addition to providing some info on how to run on a Mac or Linux box, I also thought I’d let folks know that this (the zipped content) and the formal install program have gone through a number of changes.  Most of these changes are related to the work I’ve been doing playing with Solr over the last week.  I wanted a large dataset to work with, and in doing some testing, I was disappointed by how slowly MarcEdit was translating data.  So I pulled a random sample of data from our catalog (1,000 records) and started benchmarking from there.  At the start, processing a MARC file from MARC=>Solr was taking ~8 seconds.  After spending some time re-examining the algorithms used to do this processing, I’ve cut processing time for the 1,000 records to just under 2 seconds.  This means you can process a 10,000-record file in ~18 to 20 seconds.  As an FYI, 10,000-record files seem to be the sweet spot for the application.

I had two large datasets: 2 million records from our catalog and 20 million records from our consortia.  Originally, I tried to process the 2 million records directly.  It worked, but it took forever.  Since MarcEdit does these translations using XSLT, processing 2 million records directly took ~6 hours.  However, splitting these into files of 10,000 records each, I was able to process my 2 million records in under an hour.  Much better processing time, I thought.
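The chunked workflow above can be sketched with the console version’s own switches.  The file and directory paths here are illustrative, and the sketch assumes you have Mono and cmarcedit.exe on hand:

```shell
# Sketch of the chunked MARC=>Solr workflow; paths are illustrative.
SRC="$HOME/Desktop/catalog.mrc"   # the large source file
CHUNKS="$HOME/Desktop/chunks"     # where the 10,000-record pieces go

if command -v mono >/dev/null 2>&1 && [ -f cmarcedit.exe ]; then
    # 1) split the large file into 10,000-record pieces
    mono cmarcedit.exe -s "$SRC" -d "$CHUNKS" -split -records 10000
    # 2) transform every piece in the directory with the Solr stylesheet
    mono cmarcedit.exe -s "$CHUNKS" -d mrc \
        -xslt "$HOME/marcedit5/XSLT/marcxml2solr.xsl" -batch -marctoxml
else
    echo "mono and cmarcedit.exe are needed; install Mono and unzip MarcEdit first"
fi
```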

Anyway, the changes to this build:

  1. Changes to the MARCEngine (stated above).  I’ve run over 78 million records through it over the past 2 days to ensure that the character encoding is working correctly.  As far as I can tell, everything is working fine, though my datasets were not the most linguistically diverse, so if you see a problem, let me know.  A smaller change: I’ve added some additional healing functions to the engine, which allow the program to “correct” invalid character data that can sometimes (at least in our records) appear. 
  2. I added two parameters that are available in all XSLT transformations.  You can define global params for pdate (the processing date of the file, in yyyymmdd format) and for destfile (the name of the destination file being created).  I’ll likely add a few more parameters so that I can get access to data elements that are hard to recreate in pure XSLT.
  3. OAI harvester — I added the ability to harvest individual items from a repository for targeted harvesting.
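A minimal sketch of how a stylesheet would pick up the two global params from item 2.  The param names come from the post; the `<processed>` output element is invented here purely for illustration:

```xml
<!-- Minimal sketch: receiving MarcEdit's global params in a stylesheet. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- supplied by MarcEdit at transform time -->
  <xsl:param name="pdate"/>     <!-- processing date, yyyymmdd -->
  <xsl:param name="destfile"/>  <!-- name of the destination file -->

  <xsl:template match="/">
    <!-- e.g., stamp the output with both values -->
    <processed date="{$pdate}" file="{$destfile}"/>
  </xsl:template>
</xsl:stylesheet>
```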

You can download the update to MarcEdit at: MarcEdit50_Setup.exe.

So, instructions.  For folks looking to run on alternative platforms, give this a try:

1) Download the following to your Mac:

2) Some common commands
a) Breaking a record:
mono cmarcedit.exe -s [your file] -d [save file] -break

b) Making a file:
mono cmarcedit.exe -s [your file] -d [save file] -make

c) Splitting a large MARC file to smaller files:
mono cmarcedit.exe -s [your file] -d [path to save directory] -split -records [num of records]

d) Translating MARC=>XML:
mono cmarcedit.exe -s /home/reeset/Desktop/z3950.mrc -d /home/reeset/Desktop/solrtext.xml -marctoxml -xslt /home/reeset/marcedit5/XSLT/marcxml2solr.xsl

e) Translating a batch of MARC records to XML:

mono cmarcedit.exe -s /home/reeset/Desktop/oasis_split2/ -d mrc -xslt /home/reeset/marcedit5/XSLT/marcxml2solr.xsl -batch -marctoxml

f) Getting help info:

mono cmarcedit.exe -help

As I mentioned, I’ve been making some changes to the XML components to make them faster.  I’m pretty sure you won’t run into any character set issues, but if you do, let me know.  I’ve processed some 70 million items over the past 2 days using the new method, generating items for indexing in Solr.  BTW, the Solr XSLT that Andrew Nagy had sent out is included in the marcedit5/XSLT folder (as are my current in-development stylesheets).

I’ve run all these tests on my Linux box (CentOS), but I’m sure it will work on a Mac.


2 thoughts on “Running MarcEdit 5 on macs, linux and other updates”

  1. Yeah, for the XSLT processing, there really is a happy spot. I found I could go as high as 20,000 records and still get good performance, but given the variable nature of MARC record sizes, 10,000 seemed to work best.

    If you are just making/breaking files or translating to MARCXML, this isn’t a concern. In fact, for the traditional algorithms, processing larger files actually seems to be more efficient. For example, when processing the 70 million records just to make and break the records, I was piping through, on average, 25,000 records per second, taking ~46 minutes to process on a 1.8 GHz Intel Cent., single-core machine with 2 GB of RAM.