Dec 25, 2011

I’ve been meaning to write something about this because I’ve noticed the trend myself, but it took a post from Jonathan (http://bibwild.wordpress.com/2011/12/25/why-a-shift-to-ebooks-imperils-libraries/) about the challenges libraries will continue to face with the shift to e-books to remind me to get back to it.  There was an interesting story in the Guardian the other day entitled “The great ebook price swindle” (http://www.guardian.co.uk/commentisfree/cifamerica/2011/dec/23/ebook-price-swindle-publishing) that talked about rising ebook prices.

The article is interesting as it looks at how the six large publishing houses, along with Apple, have worked together to shift ebook pricing.  The end result is that ebook prices for many popular books are quickly becoming more costly than their print alternatives.  It’s a pertinent read given the millions of people that sat down to open gifts today, only to find a shiny new ebook reader.  Once you have one of those little buggers, you are going to want to put something on it – and while there is a lot of free content available, you are going to want to be able to purchase books as well.  And this is where it gets tricky.  When ebook readers were first starting to show up, the big selling point was that the device was convenient (it reduced the physical footprint of owning a text) and the content had a very low price point.  This year, the ebook readers themselves have literally become disposable technology – but the content is becoming more and more cost prohibitive.  The question now for ebook owners is when the convenience of owning digital copies becomes less attractive than the price point for the books.

For example, I could go onto my Kindle today and buy a digital copy of The Girl with the Dragon Tattoo.  The book will cost me $9.99.  That doesn’t seem bad, but just the other day, I found a paperback copy of this book in the sale bin for a little less than half that price.  Depending on the circumstances around my purchase (am I traveling, or at home), this price point would factor into my decision regarding which copy of the item to own.  At a reduced price point, I can live with some of the inherent limitations of current ebooks (the inability to lend liberally being the largest), but when the cost of the digital object exceeds that of the physical copy, those limitations become much more difficult to overlook.

So the question that would be really interesting to ask the millions who got shiny new e-readers, come next Christmas, is how many actually use their e-readers, and whether the shifting cost of digital versus analog has affected their purchasing.

–TR

Dec 25, 2011

Merry Christmas everyone.  I hope that everyone has a safe and happy holiday with their family and friends.  In what has become a bit of a holiday tradition, I’m releasing an update to MarcEdit: MarcEdit 5.7.  Yep, this shifts the version number from 5.6 to 5.7, and there are some pretty good reasons why – so let’s get to it.

Updates

Native MARCXML Processing

I’ve talked about this change at length in an earlier post, but in order to facilitate some of the work that I’m interested in doing around MarcEdit and Linked Data, I had to improve the XML processing related to MARCXML.  Previously, MarcEdit utilized XSLT processing for all XML conversions.  This works great and provides a lot of flexibility, but it has a fairly substantial memory footprint, with visible performance issues when dealing with larger (500 MB+) MARCXML file sets.  To deal with these issues, I’ve updated MarcEdit to include native processing of MARCXML data using a SAX-style XML processor.  Validation of the document still happens as the document is processed, but the takeaway is that MarcEdit’s MARCXML process now has nearly no additional memory footprint and processes data approximately 190 times faster than the current process.  Of course, some people may have good reason to want to continue to use the XSLT-style processing (for example, they may have customized the MARCXML=>MARC XSLT), so I’ve also maintained the ability for users to continue to use the previous XSLT-style MARCXML processing (though the new method is the default).
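To make the distinction concrete, here is a minimal sketch of SAX-style MARCXML reading using .NET’s XmlReader.  The element names come from the MARCXML schema; the class name is hypothetical, and this illustrates the streaming approach rather than MarcEdit’s actual code:

    using System;
    using System.Xml;

    class MarcXmlStreamSketch
    {
        static void Main(string[] args)
        {
            int recordCount = 0;
            // XmlReader pulls the document forward one node at a time, so
            // memory use stays flat regardless of how large the file is.
            using (XmlReader reader = XmlReader.Create(args[0]))
            {
                while (reader.Read())
                {
                    // each <record> element marks the start of a MARC record;
                    // a converter would translate its children as they stream past
                    if (reader.NodeType == XmlNodeType.Element &&
                        reader.LocalName == "record")
                    {
                        recordCount++;
                    }
                }
            }
            Console.WriteLine("Processed {0} records.", recordCount);
        }
    }

Because nothing is buffered beyond the current node, the working memory is the same whether the file holds a hundred records or a million.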

You can modify the MARCXML processing preferences within the Application Preferences window.

[Screenshot: MARCXML processing options in the Application Preferences window]

Users wanting to disable the native processing function and utilize the previous XSLT process simply need to uncheck the Use Native Option (Non-XSLT Process) checkbox; when this option is unchecked, the XSLT-based conversion will be used.

This change has an impact on other parts of the program as well.  If you use the MarcEdit COM-based API or .NET API to access the MARCEngine, API calls to the engine for MARCXML=>MARC processing will utilize the XSLT translation process if an XSLT is passed into the function.  If you want to use the native process, simply pass an empty string (or null value) to the function.

The same goes for individuals using the cmarcedit.exe program (MarcEdit’s console program): if you want to use the native process, simply do not provide an XSLT when calling the MARCXML=>MARC translation.
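In rough terms, the call shapes look something like the following.  The method name and parameter order are purely illustrative – check the API documentation for the real signatures – but the rule from above holds: an XSLT path selects the XSLT process, an empty string (or null) selects the native one.

    // Hypothetical sketch -- MARCXMLtoMARC is a stand-in name, not
    // necessarily the engine's real method signature.
    MARCEngine engine = new MARCEngine();

    // XSLT-based conversion (e.g., a customized MARCXML=>MARC stylesheet):
    engine.MARCXMLtoMARC("records.xml", "records.mrc", "MARCXML2MARC.xsl");

    // Native SAX-style conversion -- pass an empty string instead of an XSLT:
    engine.MARCXMLtoMARC("records.xml", "records.mrc", "");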

UTF8=>MARC8 conversion updates

The UTF8=>MARC8 character conversion process wasn’t treating combining characters for diacritics represented as {dotb} or {commab} correctly.  These diacritics were recognized, but the combining byte wasn’t being moved properly within the string, causing the diacritic to modify the wrong value.  I’d like to thank Joe Altimus at Arizona State University for bringing this to my attention this week.
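For those curious about what “moving the combining byte” means, here is a simplified sketch of the underlying issue, assuming decomposed (NFD) input.  In Unicode, a combining mark such as U+0323 COMBINING DOT BELOW (the {dotb} mnemonic) follows its base character, while in MARC-8 the diacritic precedes the character it modifies – so a converter has to relocate each mark, and attaching it to the wrong position produces exactly the bug described above.  Real MARC-8 output also involves escape sequences, which this sketch ignores:

    using System;
    using System.Globalization;
    using System.Text;

    class CombiningReorderSketch
    {
        // Move each Unicode combining mark in front of its base character,
        // mirroring MARC-8's diacritic-first ordering.
        static string ReorderForMarc8(string nfdInput)
        {
            var sb = new StringBuilder();
            foreach (char c in nfdInput)
            {
                bool isMark = CharUnicodeInfo.GetUnicodeCategory(c)
                              == UnicodeCategory.NonSpacingMark;
                if (isMark && sb.Length > 0)
                    sb.Insert(sb.Length - 1, c);  // place mark before its base
                else
                    sb.Append(c);
            }
            return sb.ToString();
        }

        static void Main()
        {
            string nfd = "s\u0323";  // "ṣ" decomposed: base 's' + combining dot below
            Console.WriteLine(ReorderForMarc8(nfd) == "\u0323s");  // True
        }
    }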

Multiple File Record Deduplication Utility

One of the feature requests that I get every now and again is a request to update the MarcEdit duplicate record function found in the MarcEditor.  Very often, users want to run this tool over multiple files, rather than find duplicate records in a single source file.  So, I’ve modified the existing function so that you can now perform this operation outside of the MarcEditor, and over multiple files.

You’ll find this function on the main MarcEdit window, under Tools/Find Duplicate Records.

[Screenshot: the Find Duplicate Records entry under the Tools menu on the main MarcEdit window]

When you run this function, you get the following window.

[Screenshot: the Find Duplicate Records window]

Simply click on the Open folder icon and select a file.  To add another file, select the open icon again and select another file.  You’ll see selected files added to the dropdown list, and MarcEdit will utilize the files in this list to perform the stated operation.  At this point, this function is an extension of the existing deduplication tool.  I was considering making a tool that did a more heuristic analysis of the records to determine duplicates, but for now I’m going to wait for users to give this a try and provide some feedback so I can target my development accordingly.
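As a rough illustration of the idea (and emphatically not MarcEdit’s actual implementation), cross-file deduplication amounts to building one match-key table that spans every file in the list.  The sketch below assumes mnemonic-format files and uses the 001 control number as the match point, flagging any key that appears more than once:

    using System;
    using System.Collections.Generic;
    using System.IO;

    class MultiFileDedupSketch
    {
        static void Main(string[] args)
        {
            // maps a match key to the first file it was seen in
            var seen = new Dictionary<string, string>();

            foreach (string file in args)
            {
                foreach (string line in File.ReadLines(file))
                {
                    // mnemonic-format control field, e.g. "=001  ocm12345678"
                    if (!line.StartsWith("=001")) continue;
                    string key = line.Substring(4).Trim();

                    if (seen.TryGetValue(key, out string firstFile))
                        Console.WriteLine("Duplicate {0}: {1} and {2}",
                                          key, firstFile, file);
                    else
                        seen[key] = file;
                }
            }
        }
    }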

Introduction of MarcEditor Editing Shortcuts

I was spending some time looking through the MarcEdit listserv the past few weeks, and one of the things I noticed is that a lot of questions to the listserv revolve around regular expressions.  Generally, these are questions from catalogers who have used regular expressions in the past, but just need a little nudge to solve a problem.  That’s great…but I also noticed that a few questions come up a lot.  One of these revolves around character case within records (specifically titles).  So what I’ve done (and if it’s useful we’ll keep it; if it’s not, I can retire it quickly) is added a new menu entry in the MarcEditor/Tools menu called Edit Shortcuts.

[Screenshot: the Edit Shortcuts entry in the MarcEditor/Tools menu]

As you can see from the screenshot, the first set of Edit Shortcuts that I’ve added to the program deals with changing character case.  Essentially, these are shortcuts that initialize specific regular expressions for you over a defined set of MARC data (a field/subfield combination).  My hope is that people will find these shortcuts useful and will suggest additional shortcuts that I can add to the program.  One caveat: at this point, you cannot add these shortcuts to an Automation Task.  This is primarily because these shortcuts are virtual placeholders within the program – they are only meta-functions.  However, if people think that this would be useful, I’m certainly happy to go back and figure out a way to make these a part of the task automation function.
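To give a feel for what one of these shortcuts does behind the scenes, here is a rough sketch of a case change applied to a 245$a in MarcEdit’s mnemonic format.  The regular expression illustrates the general approach, not the exact expression the program uses:

    using System;
    using System.Text.RegularExpressions;

    class EditShortcutSketch
    {
        // Upper-case the first letter of the 245$a (title) subfield.
        static string TitleCase245a(string record)
        {
            // match "=245  XX$a" plus the first lower-case letter after it
            return Regex.Replace(
                record,
                @"(=245  ..\$a)(\p{Ll})",
                m => m.Groups[1].Value + m.Groups[2].Value.ToUpper());
        }

        static void Main()
        {
            string field = "=245  10$athe girl with the dragon tattoo /$cStieg Larsson.";
            Console.WriteLine(TitleCase245a(field));
            // => =245  10$aThe girl with the dragon tattoo /$cStieg Larsson.
        }
    }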

MODS=>RDF XSLT stylesheet added to the XSLT repository

I’ve started to look at ways to:

    1. Make the generation of linked data easier
    2. Provide tangible linked data examples from MARC

As part of that work, I’ve been working with a MODS=>RDF (linked data) example created by Stefano Mazzocchi in 2006, with edits.  Users interested in following that work, or playing with it themselves, can download the stylesheet from the MarcEdit XSLT repository.  Along the way, I’ve found one enhancement that I’ve started working on – the ability to chain XSLT processes together.  Currently, if you want to use this stylesheet from MARC, you will need to translate the data from MARC=>MODS, and then run a second process translating the data from MODS to RDF triples.  Ideally, I’d like to make that one step, so I’ll be spending some time looking at how that might be accomplished.
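Mechanically, chaining is straightforward in .NET – run the first transform into an in-memory buffer and feed that buffer to the second.  The sketch below shows the general shape; the stylesheet and file names are placeholders, and this is not a preview of how MarcEdit will actually implement it:

    using System.IO;
    using System.Xml;
    using System.Xml.Xsl;

    class ChainedTransformSketch
    {
        static void Main()
        {
            var marcToMods = new XslCompiledTransform();
            marcToMods.Load("MARC21slim2MODS.xsl");

            var modsToRdf = new XslCompiledTransform();
            modsToRdf.Load("MODS2RDF.xsl");

            // first pass: MARCXML => MODS, held in memory
            using (var intermediate = new MemoryStream())
            {
                using (var reader = XmlReader.Create("records.xml"))
                    marcToMods.Transform(reader, null, intermediate);

                // second pass: MODS => RDF triples
                intermediate.Position = 0;
                using (var reader = XmlReader.Create(intermediate))
                using (var output = XmlWriter.Create("records.rdf"))
                    modsToRdf.Transform(reader, null, output);
            }
        }
    }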

Getting the update

In addition to the updates listed above, I made a handful of minor changes to the program.  The majority of these changes represent usability or code optimizations, but they are there nevertheless.  If you want the update and you currently have MarcEdit, you can download the updated application through the automated updater found within MarcEdit, or you can get it from:

    1. MarcEdit Website:  http://people.oregonstate.edu/~reeset/marcedit/html/downloads.html
    2. Windows 32-bit download:  MarcEdit_Setup.msi
    3. Windows 64-bit download:  MarcEdit_Setup64.msi
    4. Alternative Windows/Linux/Mac Download:  marcedit.zip

Again, have a safe and merry Christmas everybody,

 

–TR

Dec 23, 2011

This is one of those questions that I ponder every now and again, because I wonder how effective libraries really can be as open data advocates when our current practice demonstrates that we don’t fully believe in the concept.  Well, I should qualify that – we have no problem believing that other people have a moral obligation to make their research and data open to the world using the most permissive licenses available (CC0), but we have an extremely difficult time doing the same.  That’s right, my name is Terry Reese and I’m a hypocrite, I mean librarian.

There are a lot of places where libraries could be doing much better in terms of how we manage our own “research assets”.  This ranges from how we manage the release and reuse of metadata and digital objects within our special and archival collections to how we manage really mundane information like the bibliographic data found in library catalogs.  In a sense, this is our research data, and as a group, libraries continue to tell other communities that, for the good of libraries, this data cannot be openly shared.

This question of how committed the library community is to open data came up again this week.  The National Library of Sweden recently announced (http://www.kb.se/english/about/news/No-deal-with-OCLC/) that they were ending negotiations with OCLC regarding the use and reuse of WorldCat-derived data within their national catalog.  The two organizations simply couldn’t come to an agreement due to the restrictions placed on the sharing and redistribution of WorldCat-derived MARC data.  Essentially, the National Library of Sweden, its participating members, and Europeana (the European Library) feel that library bibliographic data wants to be free.  OCLC (and to a large degree, many within its membership) disagrees.  Hence the impasse, and the source of my dilemma.

Within the U.S., it is nearly impossible to run a research library and not be a member of the OCLC cooperative.  And that’s not necessarily a bad thing.  OCLC is doing some great things.  Just this month, they released FAST (http://www.oclc.org/research/news/2011-12-14.htm) as linked data and announced the WorldShare Platform (http://oclc.org/developer/platform).  Additionally, they continue to advocate for libraries and use their position within the library community to focus RLG on research pertinent to the library community (http://www.oclc.org/research/publications/reports.htm).  Heck, I doubt most ILS vendors would be putting so many resources into network-based ILS systems had OCLC not lit a fire under them with their WMS platform development.  And yet, for all OCLC is and has done for libraries, the WorldCat Rights and Responsibilities statement (http://www.oclc.org/worldcat/recorduse/policy/default.htm) continues to put OCLC at odds with libraries and their missions.  For libraries that want to advocate for open data, OCLC’s position on data reuse unfortunately makes OCLC much more of a hindrance than a willing partner.

Which brings me back to my original question.  Can libraries really be effective advocates for open data, given our own inability to make our most basic “research” data openly available?  I think that we can certainly try…we’ve always been a community that advocates pretty passionately for knowing what’s best for other people’s data.  And who knows, maybe if we advocate for open data long enough, we might begin to believe in it ourselves and become participants (rather than cheerleaders) in the open data community.

–TR

Dec 19, 2011

I’ve been working on making a few changes to the way in which MarcEdit processes MARCXML data.  In MarcEdit, XML metadata transactions happen via XSLT.  This was done primarily to provide a great deal of flexibility in the types of XML transactions MarcEdit could perform.  It also meant that others could create their own XSLT transformations and share them with the MarcEdit community – meaning that I wouldn’t be a bottleneck.

So, within the application, there were only four canonical conversions:

    1. MARC=>MarcEdit’s mnemonic format
    2. MarcEdit’s mnemonic format => MARC
    3. MARC => MARCXML (which occurs through an algorithm, no XSLT)
    4. MARCXML => MARC (which involves an XSLT translation from MARCXML to MarcEdit’s mnemonic format)

 

The four conversions represented the foundation upon which all other work happened in MarcEdit.  The last conversion, MARCXML => MARC, represented a hybrid approach that used an XSLT to translate the data into MarcEdit’s mnemonic format before handing the data off to the MARCEngine to complete processing.  This method has worked very well throughout the years that I’ve made this functionality available, but it has also imposed a significant bottleneck on users working with large XML data files, because the XSLT processing made conversions from MARCXML particularly expensive.  It also meant that users wanting to process exceptionally large MARCXML documents would run into hard limits imposed by the amount of memory available on their systems.

In the next update, this will change.  Starting in MarcEdit 5.7, the MARCXML => MARC function will shift from being an XSLT process to a native processing algorithm that uses SAX to process data.  The effect of this is that MarcEdit’s ability to process MARCXML data will be greatly improved and much less expensive.
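Schematically, the difference between the two paths looks something like the sketch below.  The stylesheet and file names are placeholders, and this illustrates the two processing models rather than MarcEdit’s actual code:

    using System.Xml;
    using System.Xml.Xsl;

    class DomVersusStreamingSketch
    {
        static void Main()
        {
            // Old path -- XSLT over a DOM: the processor materializes the
            // whole source document in memory before the transform starts,
            // so cost grows with file size.
            var xslt = new XslCompiledTransform();
            xslt.Load("MARCXML2Mnemonic.xsl");
            xslt.Transform("records.xml", "records.mrk");

            // New path -- SAX-style streaming: the reader walks the document
            // forward one node at a time, so a 7 GB file needs no more
            // working memory than a 5 MB one.
            using (var reader = XmlReader.Create("records.xml"))
            {
                while (reader.Read())
                {
                    // convert each record to mnemonic output as it streams past
                }
            }
        }
    }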

In benchmarking the change, the results are pretty staggering, and grow more so the larger the source file is.  For example, using a 50 MB MARCXML file (14,000 records), you can see the following improvement:

Process                         Time           Records Per Second
MARCXML => MARC (old method)    5.9 seconds    2,372
MARCXML => MARC (new method)    2.1 seconds    7,000

 

Working with this smaller file, you can see that there has definitely been an improvement: using the new processing method, we are able to process nearly 3 times as many records per second.  However, the difference becomes even more pronounced as the number of records and the size of the source XML file increase.  Using a 7 GB MARCXML file (1.5 million records), the improvement is startling:

Process                         Time             Records Per Minute
MARCXML => MARC (old method)    50,400 seconds   1,785
MARCXML => MARC (new method)    460 seconds      197,368

 

Working with the larger file, we see that the new method was able to process 110.5 times more records per minute.  What’s more, it’s likely that on my benchmarking workstation, this represents the largest file I would be able to process utilizing the old, XSLT-centric method: at nearly 14 hours, this file size seriously tested the DOM XSLT processor during the initial loading and validation phases of the process.  The new method, however, should easily be able to handle MARCXML files of any size.
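For the curious, those figures fall straight out of the table above:

    improvement:   197,368 ÷ 1,785 ≈ 110.5 times more records per minute
    old run time:  50,400 seconds ÷ 3,600 seconds/hour = 14 hours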

So why make these changes?  It’s probably pretty rare that folks will need to work with MARCXML files of this size very often.  At least, I would hope not.  However, the speed improvements were so great working with both small and large files that it was well worth the effort to implement this change.  Likewise, it will improve MarcEdit’s XSLT-based translations by removing one crosswalking step for those transformations moving from a non-MARCXML XML format to MARC.  So, the real practical effects of this change will be:

    1. MARCXML => MARC translations will be much faster
    2. XSLT translations from a non-MARCXML XML format to MARC will be improved (because you will no longer have the added MARCXML=>Mnemonic translation occurring)
    3. MarcEdit will be able to process MARCXML files of any size (physical system storage will be the only limiting factor)
    4. Translations of non-MARCXML XML formats using XSLT will continue to have practical size limits due to the memory requirements of DOM-based XSLT processors.   In my own benchmarking, practical limits tend to be around 500 MB – 1 GB.

 

These changes will be made available when MarcEdit 5.7 is released, sometime before January 1, 2012.

–TR
