MarcEdit 7: The great [Normalization] escape

By reeset / On / In General Computing, MarcEdit, Programming

working out some thoughts here — this will change as I continue working through some of these issues.

If you follow MarcEdit development, you’ll know that last week, I posted a question in a number of venues about the effects of Unicode Normalization and its potential impacts for our community.  I’ve been doing a little bit of work in MarcEdit, having a number of discussions with vendors and folks that work with normalizations regularly – and have started to come up with a plan.  But I think there is a teaching opportunity here as well: an opportunity to discuss how we find ourselves having to deal with this particular problem, where the issue is rooted, and the impacts that I see right now in ILS systems and for users of tools like MarcEdit.  This isn’t going to be an exhaustive discussion, but hopefully it helps folks understand a little bit more about what’s going on, and why this needs to be addressed.

Background

So, let’s start at the beginning.  What exactly are Unicode normalizations, and why is this something that we even need to care about….

Unicode Normalizations are, in my opinion, largely an artifact of our (the computing industry’s) transition from a non-Unicode world to Unicode, especially in the way that the extended Latin character sets ended up being supported.

So, let’s talk about character sets and code pages.  Character sets define the repertoire of characters used to represent a specific set of data.  Within the operating system and programming languages, these character sets are represented as code pages.  For example, Windows provides support for the following code pages: https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx.  Essentially, code pages are lists of numeric values that tell the computer how to map the representation of a letter to a specific byte.  So, let’s use a simple example, “A”.  In ASCII and UTF8 (and other) code pages, the A that we read is actually represented as a byte of data.  This byte is 0x41.  When the browser (or word processor) sees this value, it checks the value against the defined code page, and then provides the appropriate glyph from the font being utilized.  This is why, in some fonts, some characters will be represented as a “?” or a block.  These represent bytes or byte sequences that may (or may not) be defined within the code page, but are not available in the font.
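
To make the byte/code page relationship concrete, here’s a minimal sketch in Python (my own illustration, not anything taken from MarcEdit) showing the same bytes interpreted under different code pages:

# The byte 0x41 maps to "A" under ASCII/UTF-8.
raw = b"\x41"
print(raw.decode("ascii"))         # A

# The single byte 0xE9 is a different letter depending on the code page used to read it.
accented = b"\xe9"
print(accented.decode("latin-1"))  # é  (ISO-8859-1)
print(accented.decode("cp1251"))   # й  (Windows Cyrillic) - same byte, different character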

Prior to Unicode implementations, most languages had their own code pages.  In Windows, the U.S. English code page would default to 1252.  In Europe, if ISO-8859 was utilized, the code page would default to 28591.  In China, the code page could be one of many – maybe “Big-5”, or code page 950, or what is referred to as Simplified Chinese, code page 936.  The gist here is that prior to the Unicode standard, languages were represented by different values, and the keyboards, fonts, and systems would take the information about a specific code page and interpret the data so that it could be read.  Today, this is why catalogers may still encounter confusion if they get records from Asia, where the vendor or organization makes use of “Big-5” as the encoding.  When they open the data in their catalog (or editor), the data will be jumbled.  This is because MARC doesn’t include information about the record’s code page – rather, it only flags values as Unicode, or something else.  So, it is on catalogers and systems to know the character set being utilized, and to utilize tools to convert the bytes from a character encoding that they might not be able to use to one that is friendly for their systems.
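
That “jumbled data” problem, and its fix, come down to interpreting the bytes with the right code page and re-encoding them.  Here’s a quick Python sketch of the kind of conversion a cataloger (or a tool like MarcEdit) ends up performing – again, just my illustration:

# Bytes as they might arrive in a vendor file that uses Big-5.
big5_bytes = "中文".encode("big5")
text = big5_bytes.decode("big5")     # interpret the bytes using the correct code page
utf8_bytes = text.encode("utf-8")    # re-encode for a UTF-8 friendly system

print(big5_bytes)   # b'\xa4\xa4\xa4\xe5'
print(utf8_bytes)   # b'\xe4\xb8\xad\xe6\x96\x87'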

So, let’s get back to this idea of Normalization Forms.  My guess is that much of the normalization mess that we find ourselves in is related to ISO-8859.  This code page and standard has been widely utilized in European countries, and provides a standard method of representing extended Latin characters [those between 129-255], though normalizations affect other languages as well.  Essentially, the Unicode specification included the ISO-8859 characters to ease the transition, but also provided new, composed code points for many of the characters.  And normalizations were born.

Unicode Normalizations, very basically, define how characters are represented.  There are 4 primary normalization forms that I think we need to care about in libraries.  These are (https://en.wikipedia.org/wiki/Unicode_equivalence):

  1. NFC – Canonical normalization in which decomposed character sequences are replaced with composed code points.
  2. NFD – Canonical normalization in which the data is fully decomposed.
  3. NFKC – A normalization that applies a full compatibility decomposition, followed by the replacement of sequences with their primary composites, where possible.
  4. NFKD – A normalization that applies a full compatibility decomposition and leaves the data decomposed.

 

Practically, what does this mean?  Well, it means that a value like é can be represented in multiple ways.  In fact, this is a good example of the problems that differing Unicode normalization forms create in the library community.  In the NFC and NFKC forms, the value é is represented by a single code point that represents the letter and its diacritic together.  In the NFD and NFKD forms, this character is represented by code points that correspond to the “e” and the diacritic separately.  This has definite implications, as composed characters make indexing of data with diacritical marks easier, whereas decomposed characters must be composed to index correctly.
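
Here’s a quick illustration in Python (any language with a Unicode library will show the same thing) of how that single visible character breaks down differently under the composed and decomposed forms:

import unicodedata

nfc = unicodedata.normalize("NFC", "é")   # composed:   U+00E9
nfd = unicodedata.normalize("NFD", "é")   # decomposed: U+0065 + U+0301 ("e" + combining acute)

print([hex(ord(c)) for c in nfc])   # ['0xe9']
print([hex(ord(c)) for c in nfd])   # ['0x65', '0x301']
print(nfc == nfd)                   # False - visually identical, binary different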

And how does this affect the library community?  Well, we have this made-up character encoding known as MARC8 (https://en.wikipedia.org/wiki/MARC-8).  MARC8 is a library-specific character set (it doesn’t have a code page value, so all rendering is done by applications that understand MARC8) with no equivalent outside the library world.  Like many character sets that need to represent characters with diacritics, MARC8 represents them using decomposed characters (though this decomposition is MARC8-specific).  For librarians, this matters because the U.S. Library of Congress, when providing instructions on support for Unicode in MARC records, provided for the ability to round-trip between MARC8 and UTF8 and back (http://www.loc.gov/marc/specifications/speccharucs.html).  This round-tripability comes at a cost, and that cost is that data, to be in sync with the recommendations, should only be provided in the NFKD notation.

This has implications, however.  Current-generation operating systems generally use NFC as the internal representation for string data.  It also creates challenges for programmers: in most languages, functions that deal with in-string searching or regular expressions use settings that make them culturally aware (i.e., they allow searching across data in different normalizations), but the replacement and manipulation of that data is almost always done using ordinal (binary) matching, which means that data in different normalization forms is not compatible.  And quite honestly, this is confusing the hell out of metadata people.  Using our “é” character as an example – a user may open a program or work in a programming language and find this value regardless of the underlying normalization, but when it comes to making changes, the data will need to match the underlying normalization; otherwise, no changes are actually made.  And if you are a user just looking at the data on the screen (without the ability to see the underlying binary data, or without knowledge of which normalization is being used), you’d rightly start to wonder why the changes didn’t complete.  This is the legacy that round-trip support for MARC-8 has left us within the library community, and the implications of having data moving fluidly between different normalizations are having real consequences today.
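
A minimal sketch of that find-versus-replace mismatch (Python here; the behavior MarcEdit has to navigate in .NET is analogous):

import unicodedata

record = unicodedata.normalize("NFD", "Müller")   # data stored decomposed ("u" + combining diaeresis)
needle = unicodedata.normalize("NFC", "ü")        # what the user types (composed)

# A search that normalizes for comparison (roughly what culture-aware matching does) finds it...
print(unicodedata.normalize("NFC", record).find(needle) != -1)   # True

# ...but ordinal (binary) replacement silently does nothing, because the underlying bytes don't match.
print(record.replace(needle, "ue"))               # Müller - unchanged
print(record.replace(needle, "ue") == record)     # True: no changes are actually made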

Had We Listened to Gandalf

(Image: cat holding on, captioned “Run you fools” – source: http://quicklol.com/wp-content/uploads/2012/03/run-you-fools-cat-lol.jpg)

The ability to round-trip data from MARC-8 to UTF8 and back seemed like such a good idea at the time.  And the specifications that the U.S. Library of Congress laid out were/are easy enough to understand and implement.  But we should have known that, in creating this kind of backward compatibility, we were just looking for trouble down the road.

Probably the first indication that this was going to be problematic was the use of Numeric Character Reference (NCR) form to represent characters that exist outside of the MARC-8 repertoire.  Once UTF8 became allowed as a standard for representation of bibliographic data, the frequency with which MARC-8 records were littered with NCR representations (i.e., &#xXXXX; notation) increased exponentially, as did the number of questions on the MarcEdit list about ways to find better substitutions for that data – primarily because most ILS providers never fully adopted support for NCR-encoded data.  Looking back now, what is interesting is that many of the questions related to the substitution of NCR notations can be traced to the utilization of NFC-normalized data and the rise in the presence of “smart” characters generated by our text editing systems.  Looking at the MarcEdit archive, I can find multiple entries from users looking to replace NCR data elements that exist simply because these elements represented composed code points, and were thus incompatible with MARC-8.  So, we probably should have seen this coming…and quite honestly, should have made a break.  Data created in UTF8 will almost always result in some level of data change when being converted back to MARC8…we should probably have just accepted that this was a likely outcome, and not worried about the importance of round-tripability.
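
For reference, the NCR form is just the Unicode code point written out as a character entity.  A short Python sketch of why composed data ends up escaped while decomposed data does not (my illustration of the pattern described above):

import unicodedata

composed = unicodedata.normalize("NFC", "é")             # U+00E9 - no composed equivalent in MARC-8
ncr = "".join(f"&#x{ord(c):04X};" for c in composed)     # the lossless escape: &#x00E9;
decomposed = unicodedata.normalize("NFKD", "é")          # "e" + combining acute, which MARC-8 can carry

print(ncr)                                  # &#x00E9;
print([hex(ord(c)) for c in decomposed])    # ['0x65', '0x301']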

But….we have, and did, and now we have to find a way to make the data that we have, within the limitations of our systems, work.   But what are the limitations or consequences when thinking about the normalization form of data?  The data should render the same, right?  The data should search the same, right?  The data should export the same, right?  The answer to those questions is that this shouldn’t matter – if the local system were standardizing the normalization of data as it is added to or exported from the system.  In practice, it appears that few (if any) systems do that, so the normalization form of the data can have significant impacts on what the user sees, can discover, or can export.

What the user sees

Probably the most perplexing issues related to the normalization form of data arise in how the data is rendered to the user.  While normalization forms differ at the binary level, the system should be able to accommodate them so that these differences aren’t visible to the user.  Throughout this document, I’ve been using different normalized forms of the letter “é”, but if the browser and the operating system are working like they are supposed to, you, as the reader, shouldn’t be aware of these differences.  But we know that this isn’t always the case.  Here’s one such example:

[Screenshot: the same field as displayed in the ILS before export (top) and after reimport (bottom)]

The top set of data represents the data seen in an ILS prior to export.  The bottom shows the data once reimported, after the normalization form had shifted from NFC to NFKD.  The interface being presented to the user has chosen to represent the data as bytes to flag that the data is represented as decomposed characters.  But this is jarring to the user, who shouldn’t have to care.

The above example is actually not as uncommon as you might think.  In experimenting with a variety of ILS systems, changes in normalization form can often have unintended effects for the user…and since it is impossible to know which normalization form is utilized without looking at the data at the binary level – how would one know when changes to records will result in significant changes to the user experience?

The short answer is, you can’t.  I started to wonder how OCLC treats Unicode data, and if internally, OCLC normalized the data coming into and out of its system.  And the answer is no – as long as the data is valid, the characters, in whatever normalization, are accepted into the system.  To test this, I made changes to the following record: http://osu.worldcat.org/title/record-builder-added-this-test-record-on-06262013-130714/oclc/850940559.  First, I was interested in whether any normalization was happening when interacting with OCLC’s Metadata API, and secondly, I was wondering if data brought in with different normalizations would impact searching for the resource.  And the answers to these questions are interesting.  First, I wanted to confirm that OCLC accepted data in any normalization provided (as was relayed to me by OCLC), and indeed that is the case.  OCLC doesn’t do any normalization, as far as I can tell, of data going into the system.  This means that a user could download a master record, make no other change to the record but updating the normalized form, and replace that record.  From the user’s perspective, the change wouldn’t be noticeable – but at the data level, the changes could be profound.  Given the differences in how ILS systems utilize data in the different Unicode normalization forms, this likely explains some of the “diacritic display issue” questions that periodically make their way onto the MarcEdit listserv.  Users are expecting that their data is compatible with their system because the OCLC data downloaded is in UTF8 and their system supports UTF8.  However, unknown to the cataloger, the system’s reliance on data existing in a specific normalized form may cause issues.

The second question I was interested in, as it related to OCLC, was indexing.  Would a difference in normalization form cause indexing issues?  We know that in some systems, it does.  And for many European users, I have long recommended using MarcEdit’s normalization options to ensure that data converted to UTF8 utilizes the NFC normalization – as it enables local systems to index data correctly (i.e., index the letter + diacritic, rather than the letter, then the diacritic, then other data).  I was wondering if OCLC would demonstrate this kind of indexing behavior, but curiously, I found OCLC had trouble indexing any data with diacritical values.  Since I’m sure that isn’t the expected result, I’ve reached out to see exactly what the expectation for the user is.

Indexing implications

As noted above, for years now, I’ve recommended that users who utilize Koha as their ILS system configure MarcEdit to utilize the NFC normalization as the standard data output when converting data between MARC-8 and UTF-8.  The reason for this has been to ensure that data indexes correctly rather than flatly.  But maybe this recommendation should have been made more broadly.  While I didn’t look at every system, one common aspect of many of the systems that I did look at is that data normalized as NFKD tends not to index the diacritical value at all.  They either normalize all diacritical data away, or they index the data as it appears in the binary – so, for example, a value like évery would index as e + acute + very; i.e., the indexed value would start with a plain “e”.  But if the data appeared in NFC notation, it would be indexed with the é (the combined character), allowing users to search for data using the letter + diacritic.  How does your system index its data?  It’s a question I’m asking today, and I’m wondering how much of an impact normalization form has within the ILS, as well as outside the ILS (as we reuse data in a variety of contexts).  Since each system may make different assumptions and indexing decisions based on the UTF8 data presented – it’s an interesting question to consider.
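
Here’s a toy sketch of one way a system might fold data before indexing – simply dropping combining marks from whatever bytes it is handed – to show why the normalization form changes what ends up in the index.  This is an illustration of the pattern I’m describing, not any particular vendor’s code:

import unicodedata

def naive_index_key(term: str) -> str:
    # Drop combining marks from the term exactly as it was received.
    return "".join(c for c in term if not unicodedata.combining(c))

nfc_term = unicodedata.normalize("NFC", "évery")
nfd_term = unicodedata.normalize("NFD", "évery")

print(naive_index_key(nfc_term))   # évery - the composed é survives and is indexed
print(naive_index_key(nfd_term))   # every - the acute is stripped; only a plain "e" is searchable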

Export implications

The best case scenario is that a system would export data the same way that it’s represented in the system.  This is what OCLC does – and while it likely helps to exacerbate some of the problems I see upstream in systems that may look for specific normalizations, it’s regular and expected.  Is this behavior the rule?  Unfortunately, it is not.  I see many examples where data is altered on export, and oftentimes, when the issue is diacritic related, it can be traced to the normalized form of the original data.  Again, the system probably shouldn’t care which form is provided (in a perfect world), but if the system is implementing the MARC specification as written (see the LC guidance above), then operations developed around the expectation of NFKD-formed data will likely lead to complications when other forms show up.  But again, you’d likely never know until you tried to take the data out of the system.

Thinking about this in MarcEdit

So if you’ve stayed with me this long, you may be wondering if there is anything that we can do about the problem, short of getting everyone to agree that we all normalize our data a certain way (good luck).  In MarcEdit, I’ve been looking at this question in order to address the following problems that I get asked about regularly:

  1. When I try to replace x diacritic, I can find the instances, but when I try to replace, only some (or none) are replaced
  2. When I import my data back into my system, diacritics are decomposed
  3. How can I ensure my records can index diacritics correctly

 

The first two issues are ones that come up periodically, and are especially confusing to users because the differences in the data are at a binary level – and so, hard to see.  The last issue is one MarcEdit has provided a half answer for.  It has always provided a way to set the normalization when converting data to UTF8, but once there, it assumes that the user will provide the data in the form that they require (I’m realizing this is a bad assumption).

To address this problem, I’m providing a method in MarcEdit that will allow the user to force the normalization of UTF8 data into a specific form, and will enable the application to support search and replace of data regardless of the normalized form of a character that a user might use.  This will show up in the MarcEdit preferences.  Under the MARCEngine settings, there are options related to data normalization.  These show up as:

[Screenshot: the MARCEngine normalization options in the MarcEdit preferences]

MarcEdit has included support for some time for setting the normalization when compiling data.  But this doesn’t solve the problem when trying to edit, search, etc. records in the MarcEditor or within the other areas of the program.  So, a new option will be available – Enforce Defined Normalization.  This will enable the application to save data in the preferred normalization and also force all user-submitted data through a wrapper that will enable edit operations to be completed, regardless of the normalized form a user may use when searching for data or the underlying normalization form of the individual records.  Internally, MarcEdit will make this process invisible, but the output created will be records that place all UTF8 characters into the specified normalization.  This seems to be a good option, and it’s very unlikely that tomorrow, the systems that we use will suddenly all start to use UTF8 data the same way – and taking this approach, they don’t have to.  MarcEdit will work as a bridge, taking data in any UTF8 normalization and ensuring that the data output all meets the criteria specified by the user.
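
Conceptually, the wrapper is simple: normalize the user’s search/replace strings and the record text to the preferred form before matching, and emit the result in that form.  Here’s a rough Python sketch of the idea – not MarcEdit’s actual implementation (MarcEdit is a .NET application), just the shape of the approach:

import unicodedata

PREFERRED_FORM = "NFC"   # whatever the user selects in the preferences

def normalized_replace(record_text: str, find: str, replace: str) -> str:
    # Normalize everything to the preferred form so the match works regardless of
    # how the user typed the character or how the record stored it.
    norm = lambda s: unicodedata.normalize(PREFERRED_FORM, s)
    return norm(record_text).replace(norm(find), norm(replace))

# The record is decomposed, the user types a composed character - the edit still lands,
# and the output comes back in the preferred normalization.
record = unicodedata.normalize("NFD", "=245  10$aLe célèbre roman")
print(normalized_replace(record, "é", "e"))   # =245  10$aLe celèbre roman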

Sounds good – I think so.  But it makes me a little nervous as well.  Why – because OCLC takes any data provided to it.  In theory, a record could switch normalizations multiple times if users pulled data down, edited it using this option, and uploaded the data back to the database.  Does this matter?  Will it cause unforeseen issues?  I don’t know – I’m asking OCLC.  I also worry that allowing users to specify the normalization form could have cascading issues when it comes to record sharing.  Not everyone uses MarcEdit (nor should they), and it’s hard to know what impact this makes on other coding tools, etc.  This is why this function won’t be enabled by default – but will need to be turned on by the user – as I continue to inquire and have conversations about the larger implications of this work.  The short answer is that this is a pain point, and a problem that needs to be addressed somehow.  I see too many questions and too many records where the normalization form of the data plays a role in providing confusing data to the user, confusing data to the cataloger, or difficulties in reusing or sharing the data with other systems/processes.  At the same time, this feels like a band-aid fix until we reach a point in the evolution of our systems and metadata where we can free ourselves from MARC-8, and begin to think only about our data in UTF8.

Conclusions

So what should folks take away from all this?  Let’s start with the obvious.  Just because your data is in UTF8 doesn’t mean that it’s the same as my data in UTF8.  Normalization forms, a tool initially used to ease the transition from non-Unicode data to Unicode data, can have other implications as well.  The information that I’ve provided is just a set of examples of challenges that make their way to me through my work on MarcEdit.  I’m sure other folks have had different experiences…and I’d love to hear about these if you want to provide them below.

Best,

–tr

Working with the Koha ILS HTTP API

By reeset / On / In General Computing, Koha

I’ve been spending the last week working with the Koha API, using it as an example for MarcEdit’s direct ILS integration platform.  After spending some time working with it and pushing some data through it, I have a couple of brief thoughts.

  1. I was pleasantly surprised at how easy the API was to work with.  Generally, the need for good authentication often stymies many a good API design, because the process for doing and maintaining authentication becomes so painful.  I found the cookiejar approach that Koha implemented to be a very simple one to support and work with (a rough sketch of the flow appears after this list).  What’s more, error responses when working with the API tended to show up as HTTP status codes, so it was easy to work with them using existing HTTP tools.
  2. While the API is easy to use, it’s also really, really sparse.  There isn’t a facility for deleting records and I’m not sure if there is an easy way with the API to affect holdings for a set of records. I do know you can create items, but I’m not sure if that is a one off that occurs when you pass an entire bib record for update, or if there is a separate API that works just for Item data.  Search is also disappointing.  There is a specific API for retrieving individual records data – but the Search API is essentially Z39.50 (or SRU).  I’m not particularly enamored with either, though Z39.50 works (and I’m told that it’s fairly universal in terms of implementation).  I’ve never really liked SRU so it didn’t hurt my feelings too much to not work with it.  However, after spending time working with the Summon search API for other projects here at Oregon State, I was disappointed that search wasn’t something that the API specifically addressed.
  3. The API documentation leaves much to be desired.  I was primarily utilizing the wiki (http://wiki.koha-community.org/wiki/Koha_/svc/_HTTP_API), which includes a single page on the API.  The page provides some simple demonstrations to show usage, which are really helpful.  What is less helpful is the lack of information regarding what happens when an error occurs.  The Authorization API returns an XML file with a status message – however, all the other APIs return HTTP status codes.  This caught me a little by surprise, given the Authorization response – it would be nice if that information were documented somewhere.
  4. One thing that I can’t find in the documentation – so I really can’t answer this question – is the impact of the API on system resources.  The API seems really to be geared towards working with individual records.  Well, MarcEdit is a batch records tool.  So, in my testing, I tried to see what would happen if I uploaded 1010 records through the API.  The process finished, sluggishly, but it appeared that uploading records through the API at high rates was having an impact on system performance.  The upload process itself slowed considerably as the records were fed through the API.  But more curiously – after the process finished, I had to wait about 15 minutes or so for all the records to make it through the workflow.  I’m assuming the API must queue items coming into the system, but this made it very difficult to test successful upload, because the API was reporting success but the data changes were not visible for a considerable amount of time.  Since I’ve never worked in a library that ran Koha in a production environment, I’m not sure if this type of record queuing is normal, but a better description of what is happening in the documentation would have been nice.  When I first started working with the API, I actually thought that the data updates were failing, because I was expecting the changes to be reflected in the system in real time…my experience, however, seemed to indicate that they are not.
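
For reference, here’s roughly what that cookiejar flow looks like from a client, written as a hedged Python sketch using the requests library.  The endpoint paths and parameter names are recalled from the wiki page linked above rather than verified against a live install – treat them as assumptions and check your own Koha instance:

import requests

KOHA = "http://koha.example.org"          # hypothetical server
session = requests.Session()              # the session's cookie jar carries the auth cookie for us

# 1. Authenticate; this call answers with an XML status body.
resp = session.post(KOHA + "/cgi-bin/koha/svc/authentication",
                    data={"userid": "marcedit", "password": "secret"})
print(resp.status_code, resp.text)

# 2. Subsequent calls reuse the cookie; errors come back as plain HTTP status codes.
bib = session.get(KOHA + "/cgi-bin/koha/svc/bib/1")
print(bib.status_code)                    # e.g. 200 with MARCXML, or an error status if not found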

Anyway – those are my quick thoughts.  I need to caveat these notes by saying I have never worked at a library where Koha has been used in production, so maybe some of these behaviors are common knowledge.

 

–TR

Preventing outside linking to images to prevent bandwidth stealing

By reeset / On / In General Computing

One of the hats that I wear at home is IT professional for my wife, specifically when it comes to her blog.  To keep things running well, I periodically monitor bandwidth usage and space usage to make sure that we keep our hosts happy.  Well, over the past two months, I’d noticed a really, really, really large spike in bandwidth traffic.  In a typical month, the blog handles approximately 60 GB of HTTP traffic.  A generous portion of that comes from robots that I allow to harvest the site (~12 GB), the remainder from traffic from visitors.  This changed, however, last month when bandwidth usage jumped from ~60 GB a month to a little over 120 GB.  Now, our hosts are great.  We pay for 80 GB of bandwidth a month, but this is a soft cap, so they didn’t complain when we went way over our allotment.  At the same time, I like to be a good neighbor on our shared host – so I wanted to figure out what was causing the spike in traffic.

Looking through the log files and chatting with the hosts (who have better log files), we were able to determine that the jump in traffic was due to one image.  This one (example of linking to the file — should be broken unless you are reading it through the google reader, which I allow as an exception):

 

(this one has been downloaded and placed on my blog)

It’s an image from the Thomas Jefferson memorial.  My wife had taken the picture the last time we were in DC and had posted it here: http://athomewithbooks.net/2012/10/saturday-snapshot-october-27/.  In October and the first half of November, this single image had been responsible for close to 100 GB of bandwidth traffic.  What I couldn’t figure out was where it was all coming from…but looking at the logs, we were able to determine that it was being linked to from StumbleUpon.  While the linking to the image wasn’t a problem, the bandwidth usage was.  So, I started to look at options, and there actually is a quite elegant one if you find yourself in a position where you need to limit bandwidth.

The simple solution is to not allow linking to any images (or specific file types) from outside the blog’s domain.  This is actually pretty easy to accomplish using mod_rewrite and an .htaccess file, using the following snippet (replacing my domain – athomewithbooks.net – with your own):

RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?athomewithbooks\.net/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://(www\.)?athomewithbooks\.net.*$ [NC]
RewriteRule \.(gif|jpg|js|css|png|jpeg)$ - [F]

I’ve directed the webserver to only serve these file types (gif|jpg|jpeg|js|css|png) when linked from my specific domain.  So, if I were to try and link to an image from a different domain, the image would come back broken (though you can also serve a “you can’t link to this image” image if you want to).  The way this works is that the server reads the headers passed by the browser, specifically the HTTP_REFERER value, to determine if the request is originating from an allowed domain.  If it’s not, it doesn’t serve the image.

Now, this method isn’t perfect.  Some browsers don’t pass this value, some pass it poorly – so it’s likely some people that shouldn’t see the image will see it (due to the first line, which serves content if no referrer is provided) – but in general, it provides a method for managing your bandwidth usage.

–TR

Hybrid storage solutions

By reeset / On / In General Computing

While I was at the PASIG conference this last weekend, a number of people talked about the death of the harddrive, at least in the sense of our personal portable devices.  The popularity of ultrabooks and small form notebooks was discussed many times, noting that personal computing will move more and more away from local copies to cloud-based drives because:

  • Solid State Drives provide the instant on/performance that people are wanting in their portable devices
  • The expense of solid state drives and their currently relatively small sizes will eventually push storage off the local device and into the cloud.

While I certainly agree that this likely will continue to be a trend (look at how tools like Dropbox are changing the way researchers store and share their data), I think that many of the folks at PASIG may be too quick to overlook some of the very cool developments related to SSD technology that allow for micro form factors, allowing ultraportables to support both a traditional SSD drive and the more traditional spinning drive.  Of course, I’m talking about the current work being done with mSATA drives.

Currently, there are very few mainstream systems that support mSATA technology, which is unfortunate because these really are cool devices.  The two best are probably produced by Intel, which offers a 40 GB and an 80 GB flavor of its drive (http://ark.intel.com/products/56547/Intel-SSD-310-Series-(80GB-mSATA-3Gbs-34nm-MLC)).  When I was looking for a replacement laptop this last month, I was looking specifically for a device that had both an SSD and a traditional drive setup.  However, my requirement that the system be under 4 lbs and compact made this a difficult search.  In doing my research, though, I stumbled upon the Intel mSATA drive system.

Now SSD drives are small to begin with, but the mSATA drives are downright microscopic.  The image below, taken from a review of these devices, shows just how small.  In fact, when I ordered one, I had a hard time believing that they really got an 80 GB drive on a chip a little bit bigger than a quarter.  Yet, they did.


(Image linked from http://hothardware.com/Reviews/Intel-310-Series-80GB-SSD-Review/)

So how well does this work?  From my limited experience with it (about 2 weeks) – great.  Intel provides a set of disk tools that allow you to migrate your current partitions onto the SSD disk – however, I chose to do a fresh install.  Installing Windows and all my programs onto the SSD drive cost me ~35 GB.  Setting up a little symlinking, I moved all the data components to the traditional hard drive (500 GB), leaving the SSD for just the operating system and programs.  Then I tested.

When I first received the laptop, I did some startup and shutdown testing.  On a clean system, the laptop, running an i7 with 8 GB of RAM, would take approximately 35 seconds for Windows 7 to finish its startup cycle.  Not bad, but not great.  Additionally, on a full charge, the system would run for ~3.7 hours on the battery (not good).  Running the Windows Experience tests, it gave the 500 GB, 7200 rpm drive a 6.2 (of 7.9) performance score.

After installing the mSATA drive and making it the primary boot partition, I gave the tests another whirl, and the difference was striking.  First, on the Windows Experience testing, there was a significant difference in rating.  Using the SSD as the primary system disk, the Experience tests gave the Intel 80 GB mSATA drive a score of 7.7 (of 7.9) – a pretty high score.  So what does that mean in real life?  Well, let’s start with boot times.  From a cold boot, it now takes Windows 7 approximately 5-7 seconds.  Closing the lid and opening it back up has essentially become instant on (for a while, I was wondering if the system was actually going to sleep when I closed the lid, because it was on as soon as I opened it).  And finally, battery life.  Under heavy use at the PASIG conference, I got nearly 8 hours on a single charge.

While the move away from local disks may indeed happen in the near future, my recent laptop purchasing experience showed me that for those who want to continue to have a very high performance system with a small form factor, it is possible to have the best of both worlds, utilizing these emerging SSD technologies to create very high performance (and relatively low-cost) portable systems.

–TR

Probably time to rethink our icons

By reeset / On / In General Computing

One of my boys has really been on a bit of a writing jag lately.  I’m guessing that it’s because I write a lot for my job (and his mom writes a lot as well) and they have a lot of stories that they’d like to tell.  Well, my youngest had asked me to set up the computer so that he could type a story there.  So, I fired up Word and let him have at it.  About an hour later, he came over and asked me how he could save his work.  Without thinking, I told him to click on the little disk picture in the upper left hand corner.  At which point, he looked at me like I had two heads, because he’d never seen a disk.

My first reaction was to chuckle a little bit and then show him where the button was.  But, as I’m sitting here looking at MarcEdit, and thinking about the icon palette that I use in my own applications, it got me thinking.  So I started opening up applications on my computer and looked at how they represent the “save” action.  Nearly universally, the save icon is still represented by some iteration of a disk (on my Windows and Linux systems) (I should note, MarcEdit uses a folder with an arrow pointing downwards for save).  Talking to my boys (hardly a representative sampling), I started asking about other icons on the palette, and one of the things that strikes me is that many of the graphical representations that we still use to represent actions are relics of a bygone technical past.  My oldest son knew which icon was used for saving, but the image was meaningless to him.  He’d learned how to save because someone had taught him which image to push.  In fact, I think the idea of having “disks” sounded somewhat cool, until I told him that the disks were about the size of his little brother’s hand and didn’t have enough space to save his PowerPoint to them.  At that point, we all agreed that his 4 GB jump drive was much better.  But to get back to my point, it was interesting that the save icon wasn’t something that he knew intuitively, and certainly isn’t something my youngest son knew intuitively.

This does bring up a question.  Technology is changing and if we are developing interfaces so that they can be used intuitively by our users, which users become our baseline when doing interface development?  Let’s use the save button as an example.  If we created a save button today, it would likely look much different, but would today’s users intuitively understand what it was?  My guess, no (actually, it’s not a guess, when I changed the save button in MarcEdit, it was quite confusing for a number of people).  So, an image of a nearly extinct technology remains the mainstream representation for how we “save” content within most applications.

But this isn’t just an application design question.  Libraries perpetuate these kinds of relic technologies as well.  A great example, in my mind, is the ILS.  While this isn’t universally true, the ILS is essentially an electronic representation of a card catalog.  Users come to the library and generally can find things in the ILS, but really, how many would consider their ILS to be intuitive?  How many libraries have classes, online tutorials, etc. to teach users how to find things efficiently in their ILS?  I’m sure most still do.  We have these classes because we have to.  Our ILSs simply don’t work logically when considered against current search technology.  And why should they – they were built to be electronic card catalogs, not the search engines (or even product engines like Amazon) that people use today.  Those tools work by building transparent connections to other related options.  ILSs work by building connections through subject headings, or as I heard someone in the library say, by using “what the hell’s a subject heading”.

Of course, changing things is easier said than done.  We have large groups of legacy users that would be just as confused if today’s interfaces were actually created to be more intuitive (i.e., were more representative of today’s technology).  [I think this is the same thing we say about why we still use MARC]  So, we have two pain points.  One is new users who won’t understand today’s interfaces, because they lack the institutional memory required to understand what they mean; the other is legacy users, who are currently writing and designing the interfaces and tools in use today.  Guess who wins.

I’ve got to believe that there is a way around this tension point between legacy users with institutional memory and new users looking for new, intuitive interfaces.  Do we develop multiple interfaces (classic and contemporary), or maybe do something different altogether?  I’m really not sure, but as I think about how I do my own coding and consider the projects that I’m working on – this is one of those things that I’m going to start thinking about.  We talk about taking bias out of the research process…well, this institutional memory represents its own type of bias that essentially poisons the design process.  I think one of my new year’s resolutions will be working on ways to eliminate this type of bias as I consider my own interface design projects.

–TR

Geneseo Resource Sharing project

By reeset / On / In General Computing, Travel

For the first time, in a really long time, I won’t be home on Mother’s Day.  A trip to Boston last week was extended to include some time in NY to talk to friends and the Geneseo library about the resource sharing project that they are working on.  I was really impressed by the work that has been done on this project so far.  Their work to integrate ILLiad instances to support unmediated article sharing among their partners, as well as their ability to generate workflow reports showing the time spent on each part of the request and delivery process (with information about both the borrowing and lending institutions), was pretty cool.  This group is now looking to expand their current work into a very ambitious open source project that could potentially help their own consortium and also provide an open tool that could be utilized by other interlibrary loan offices to deal with issues relating to publisher licensing guidelines as they relate to lending digital articles.  If anyone happens to be working on something like that, they should contact Cyril Oberlander.  At this point, they are starting work on their next implementation of this project and are interested in knowing if anyone else is working on a project like this.

 

–TR

Sun to begin close-sourcing parts of MySQL development

By reeset / On / In General Computing

I remember mentioning (http://blog.reeset.net/archives/490) that I wasn’t sure why, but I wasn’t wild about Sun acquiring MySQL.  And then today, I saw this link picked up on Slashdot (http://jcole.us/blog/archives/2008/04/14/just-announced-mysql-to-launch-new-features-only-in-mysql-enterprise/).  Apparently, Sun will start close-sourcing parts of the code-base, making specific elements of the database (think enterprise-level functionality) available only to MySQL Enterprise customers.  I can’t say that this surprises me, though it does disappoint.

–TR

atscap and pchdtvr GPL revoked – or can it be?

By reeset / On / In General Computing

I’ve never used this package (apparently it’s used for HDTV scheduling/recording on Linux), but this link on Slashdot caught my eye: http://sourceforge.net/developer/diary.php?diary_id=26407&diary_user=147583.  Apparently, the developer of this software package is seeking to revoke the GPL license not just for his current code, but for his past code/packages as well.  I have a difficult time believing that this is possible, but I’m sure we will soon find out.  My guess is that this guy is productizing his software and has a good idea of who is currently using, selling and distributing his source, so there will likely be some kind of legal challenge to the GPL as well.  It’s always interesting to see how these kinds of things play out in the U.S. courts, which can sometimes be a little schizophrenic, though I’d have a difficult time believing that this type of retroactive license change is actually possible.

 

–TR

Is IT becoming too disposable?

By reeset / On / In General Computing

This is something that came up when I was expanding my thoughts from one of my “non-LITA tech trends” earlier this morning, and the more I’ve thought about it, the more I’m finding it weighing on my mind.  I’m wondering if we are making our hardware too disposable in the name of convenience.  This comes from my conversations about low budget, ultra portable systems to thinking about Apple’s new MacBook Air – a computer that comes without a replaceable battery and with limited upgradability.  I’ll admit, I’m a little bit of a pack rat.  I’ve either kept or found homes for every computer I’ve ever owned.  In fact, it was only recently that I upgraded our 8 year old desktop at home to something newer and zippier (relegating the old machine to file server status).  When things break – I like to fix them.  When things slow down – I take them apart and upgrade the components.  I do this for a number of reasons – one being that I do like to encourage an environmentally friendly lifestyle (more or less).  I drive very little, we recycle fanatically, we try to buy local – but I’m having a hard time reconciling this lifestyle with the gadgets that I’ve come to know and love.  One of the problems, as I’m seeing it, is that many of these low budget machines (or in the Mac Air’s case – premium priced machines) are making hardware much more throwaway than it ever was before.  If I have a $200 desktop (or notebook for that matter) and something breaks – do I fix it?  If it’s a year old – probably not, since the cost to fix it will likely be close to the cost to replace it.  So, the computer is landfill’d (as most computers are, even though most companies offer recycling programs) and the process repeats.  Even Apple’s Mac Air seems to be built to encourage a rapid replacement cycle.  Low expandability, no battery replacement, an underpowered processor – while sleek and stylish, I wonder if these too won’t become high end disposable products.

In a time when green computing seems to be gaining traction everywhere, the current disposable PC trend seems to fly in its face.  And I’m no better in this regard.  I too would like an ultra-portable device and am in the group looking for something on the higher end of the scale (I want something that will perform better than a PDA), and there’s the dilemma.  This class of machines simply is disposable by default due to the nature of the beast.  Keep size down, keep price down – and performance suffers.  When performance suffers – performance lust sets in and the cycle repeats.  A great cycle for investors, maybe, but not for those wanting to live a little greener.

Anyway, random thoughts for a Thursday,

–TR


My non-LITA top tech trends

By reeset / On / In Digital Libraries, General Computing

(Note, I started this post last night, but had to put it away so I could get some rest before a 6 am flight.  I finished the remainder of this while waiting for my flight). 

So, after getting up way too early, I staggered my way down to the LITA Top Tech Trends discussion this morning.  Unfortunately, it seemed like a number of other folks did the same thing as well, so I only ended up hanging out for a little bit.  I just don’t have the stamina in the morning to live through cramped quarters, poor broadband and no caffeine.  I get enough of that when I fly (which I get to do tomorrow).  Fortunately, a number of folks who had been asked to provide tech trends have begun (or have been) posting their lists, and some folks who braved the early morning hours have started blogging their responses (here).  I personally wasn’t asked to provide my list of tech trends, but I’m going to anyway, as well as comment on a few of the trends either posted or discussed during the meeting.  Remember, this is just one nut’s list, so take it for what it is.

  1. Ultra-light and small PCs (Referenced from Karen Coombs)
    Karen is one of a number of folks that have taken note of the wide range of low-cost computers currently being made available to the general public.  These machines, which run between $189-$400, provide low-cost, portable machines that have the potential to bring computers to a wider audience.  I’ll have to admit, I’m personally not sold on these machines, in part because of the customer base that they are aiming for.  Companies such as Asus (with the Eee PC) note that these machines are primarily targeted at users that are looking for a portable second machine and kids/elderly looking for a machine simply to surf the web.  A look at the specifications for many of these low cost machines shows Celeron-class processors with 512 MB of RAM and poor graphics processing.  Is this good enough for surfing the web?  I’d argue, no.  The current and future web is a rich environment, built on CSS, XML, XSLT, flash, java, etc.  I think what people seem to forget is that this rich content takes a number of resources simply to view.  Case in point – I set up a copy of CentOS on a 1.2 GHz Centrino with 512 MB RAM and a generic graphics card (8 MB of shared memory), and while I could use this machine to browse the web and do office work with OpenOffice, I certainly wouldn’t want to.  Just running the Linux shell was painful, but web browsing is clunky and office work is basically unusable – essentially surpassing the machine’s capabilities right out of the box.  Is this the type of resource I’d want to be lending to my patrons…probably not, since I wouldn’t want my patrons to associate my library’s technical expertise with sub-standard resources.  Does this mean that ultra-portables will not be in vogue this year and the next?  Well, I didn’t say that.  A look at the success the iPhone is having (a pocket PC retailing for close to $1500 without a contract) seems to indicate that users are wanting to and willing to pay a premium price for portability – so long as that portability doesn’t come at too high of a price.
  2. Branding outside services as our own (and branding in general)
    There was a little bit of talk about this – the idea of moving specific services outside the library to services like Google or Amazon, and essentially rebranding them.  This makes some sense – however, I always cringe when we start talking about branding and how to make the library more visible.  From my perspective, the library is already too visible, i.e., intrusive into our users’ lives.  Libraries want to be noticed, and we want our patrons and organizations to see where the library gives them value.  It’s a necessary evil in times when competition for budget dollars is high.  However, I think it does our users a disservice.  Personally, I’d like to see the library become less visible – providing users direct access to information without the need to have the library’s fingerprints all over the process.  We can make services that are transparent (or mostly transparent), and we should.

    The same thing goes for our vendors.  I’ll use III as an example only because we are an Innovative library, so I’m more familiar with their software.  By all rights, Encore is a serviceable product that will likely make III a lot of money.  However, in the public instances currently available (Michigan State, Nashville Public Library), the III branding is actually larger than that of the library (if the library’s branding shows up at all).  And this is in no way unique to III.  Do patrons care what software is being used?  I doubt it.  Should they care?  No.  They should simply be concerned that it works, and works in a way that doesn’t get in their way.  From my perspective, branding is just one more thing that gets in the way.

  3. Collections as services will change the way libraries do collection development
    I’m surprised that we don’t hear more about this – but I’m honestly of the opinion that metadata portability and the ability for libraries to build their collections as web services will change the way libraries do collection development.  In the past, collection development was focused primarily on what could be physically or digitally acquired.  However, as more organizations move content online (particularly primary resources), libraries will be able to shift from an acquisitions model to a services model.  Protocols like OAI-PMH make it possible (and relatively simple) for libraries to actively “collect” content from their peer institutions in ways that were never possible in the past.
  4. Increased move to outside library IT and increased love for hosted services (whether we want them or not)
    While it has taken a great deal of time, I think it is fair to say that libraries are more open to the idea of using open source software than ever before.  In the short term, this has been a boon for library IT departments, which have seen an investment in hardware and programmer support.  I think this investment in programming support will be short-lived.  In some respects, I see libraries going through their own version of the .COM boom (just without all the money).  Open source is suddenly in vogue.  Sexy programs like Evergreen have made a great deal of noise and inroads into a very traditionally vendor-oriented community.  People are excited, and that excitement is being made manifest by the growing number of software development positions being offered within libraries.  However, at some point, I see the bubble bursting.  And why?  Because most libraries will come to realize either 1) that having a programmer on staff is prohibitively expensive, or 2) that the library will be bled dry by what I’ve heard Kyle Banerjee call vampire services.  What is a vampire service?  A vampire service is a service that consumes a disproportionate amount of resources but will not die (generally for political reasons).  One of the dangers for libraries employing developers is the inclination to develop services as part of a grant or grandiose vision that eventually become vampire services.  They bleed an organization dry and build a culture that is distrustful of all in-house development (see our current caution in looking at open source ILS systems.  It wasn’t too long ago that a number of institutions used locally developed [or open] ILS systems, and the pain associated with those early products still affects our opinions of non-vendor ILS software today).

    But here’s the good news.  Will all software development positions within libraries go away?  No.  In fact, I’d like to think that as positions within individual organizations become more scarce, consortia will move to step into this vacated space.  Like many of our other services moving to the network level, I think that the centralization of library development efforts would be a very positive outcome, in that it would help to increase collaboration between organizations and reduce the number of projects that are all trying to re-invent the same wheel.  I think of our own consortium in Oregon and Washington – Summit – and the dynamic organization it could become if only the institutions within it would be willing to give over some of their autonomy and funding to create a research and development branch within the consortium.  Much of the current development work (not all) could be moved up to the consortium level, allowing more members to directly benefit from the work done.

    At the same time, I see the increase of hosted services on the horizon.  I think that folks like LibLime really get it.  Their hosted services for small to medium size libraries presumably reduce LibLime’s costs to manage and maintain the software, and free those hosted libraries from the need to worry about hardware and support issues.  When you look at the future of open source in libraries – I think that this is it.  For every one organization willing to run open source within their library, there will be 5 others that will only be able to feasibly support that infrastructure if it is outsourced as a hosted service.  We will see a number of open source projects move in this direction.  Hosted services for DSpace, Fedora, metasearch, the ILS – these will all continue to emerge and grow throughout this year and into the next 5 years.  And we will see the vendor space start to react to this phenomenon as well.  A number of vendors, like III, already provide hosted services.  However, I see them making a much more aggressive push to compel their users (higher licensing, etc.) to move to a hosted service model.

  5. OCLC will continue to down the path to becoming just another vendor
    I’d like nothing more than to be wrong, but I don’t think I am.  Whether it’s this year, the next or the year after that, OCLC will continue to alienate its member institutions, eventually losing the privileged status libraries have granted it throughout the years, becoming just another vendor (though a powerful one).  Over the last two years, we’ve seen a lot of happenings come from Dublin, Ohio.  There was the merger with RLG, the hiring of many talented librarians, WorldCat.org, WorldCat Local and OCLC’s newest initiatives centered around their grid services.  OCLC is amassing a great deal of capital (money, data, members), and I think we will see how they intend to leverage this capital this year and the next.  Now, how they leverage this capital will go a long way to deciding what type of company OCLC will be from here forward.  Already, grumblings are being heard within the library development community as OCLC continues to move to build new revenue streams from web services made possible only through the contribution of metadata records from member libraries.  As this process continues, I think you will continue to hear grumblings from libraries who believe that these services should be made freely available to members, since it was member dollars and time that provided OCLC exclusively with the data necessary to develop these services.  **Sidebar: this is something that we shouldn’t overlook.  If your library is an OCLC member, you should be paying close attention to how OCLC develops their grid services.  Remember, OCLC is supposed to be a member-driven organization.  It’s your organization.  Hold it accountable and make your voice heard when it comes to how these services are implemented.  Remember, OCLC only exists through the cooperative efforts of both OCLC and the thousands of member libraries that contribute metadata to the database.**  Unfortunately, I’m not sure what OCLC could do at this point to retain this position of privilege.  Already, too many people that I talk to see OCLC as just another vendor that doesn’t necessarily have the best interests of the library community at heart.  I’d like to think that they are wrong – that OCLC still remains an organization dedicated to furthering libraries and not just OCLC.  But at this point, I’m not sure we know (or they know).  What we do know is that there are a number of dedicated individuals that came to OCLC because they wanted to help move libraries forward – let’s hope OCLC will continue to let them do so.  And we watch, and wait.

Anyway, that’s my list of trends.

–TR

 
