MarcEdit 7: The great [Normalization] escape

By reeset / On / In General Computing, MarcEdit, Programming

working out some thoughts here — this will change as I continue working through some of these issues.

If you follow the MarcEdit development, you’ll know that last week, I posted a question in a number of venues about the affects of Unicode Normalization and its potential impacts for our community.  I’ve been doing a little bit of work in MarcEdit, having a number of discussions with vendors and folks that work with normalizations regularly – and have started to come up with a plan.  But I think there is a teaching opportunity here as well, an opportunity to discuss how we find ourselves having to deal with this particular problem, where the issue is rooted, and the impacts that I see right now in ILS systems and for users of tools like MarcEdit.  This isn’t going to be an exhaustive discussion, but hopefully it helps folks understand a little bit more what’s going on, and why this needs to be addressed.


So, let’s start at the beginning.  What exactly are Unicode normalizations, and why is this something that we even need to care about….

Unicode Normalizations are, in my opinion, largely an artifact of our (the computing industry’s) transition from a non-Unicode world to Unicode, especially in the way that the extended Latin character sets ended up being supported.

So, let’s talk about character sets and code pages.  Character sets define the language that is utilized to represent a specific set of data.  Within the operating system and programming languages, these character sets are represented as code pages. For example, Windows provides support for the following code pages:    Essentially, code pages are lists of numeric values that tell the computer how to map a  representation of a letter to a specific byte.  So, let’s use a simple example, “A”.  In ASCII and UTF8 (and other) code pages, the A that we read, is actually represented as a byte of data.  This byte is 0x41.  When the browser (or word processor) sees this value, it checks the value against the defined code page, and then provides the appropriate value from the font being utilized.  This is why, in some fonts, some characters will be represented as a “?” or a block.  These represent bytes or byte sequences that may (or may not) be defined within the code page, but are not available in the font.

Prior to Unicode implementations, most languages had their own code pages.  In Windows, the US. English code page would default to 1252.  In Europe, if ISO-8859 was utilized, the code page would default to  28591.  In China, the code page could be one of many.  Maybe “Big-5”, or code page 950, or what is referred to as Simplified Chinese, or code page 936.  The gist here, is that prior to the Unicode standard, languages were represented by different values, and the keyboards, fonts, systems – would take the information about a specific code page and interpret the data so that it could be read.  Today, this is why catalogers may still encounter confusion if they get records from Asia, where the vendor or organization makes use of “Big-5” as the encoding.  When they open the data in their catalog (or editor), the data will be jumbled.  This is because MARC doesn’t include information about the record code page – rather, it defines values as Unicode, or something else.  So, it is on catalogers and systems to know the character set being utilized, and utilized tools to convert the byte points from a character encoding that they might not be able to use, to one that is friendly for their systems.

So, let’s get back to this idea of Normalization Forms.  My guess is that much of the Normalization mess that we find ourselves in is related to ISO-8859.  This code page and standard has been widely utilized in European countries, and provides a standard method of representing extended Latinate characters [those between 129-255], though, Normalizations also affect other languages as well.  Essentially, the unicode specification included ISO-8859 to ease the transition, but also provide new, composed code points for many of the characters.  And Normalizations were born.

Unicode Normalizations, very basically, define how characters are represented.  There are 4 primary normalization forms that I think we need to care about in libraries.  These are (

  1. NFC – The canonical normalization, which will replace decomposed characters with composed code points.
  2. NFD – The canonical normalization, but in which data is fully decomposed
  3. NFKC – A normalization that utilizes a full compatibility decomposition, followed by the replacement of sequences with their primary composites, if possible.
  4. NFKD – A normalization that utilizes a full compatibility decomposition.


Practically, what does this mean.  Well, it means that a value like: é can be represented in multiple ways.  In fact, this is a good example of the problems that having differing Unicode Normalization Forms is having in the library community.  In the NFC and NFKC notation, this value é is represented by a single code point, that represents the letter and its diacritic fully.  In the NFD and NFKD notations, this character is represented by code points that correspond to the “e” and the diacritic separately.  This has definite implications, as composed characters make indexing of data with diacritical marks easier, whereas, decomposed characters must be composed to index correctly.

And how does this affect the library community.  Well, we have this made up character encoding known as MARC8 (  MARC8 is a library specific character set (doesn’t have a code page value, so all rendering is done by applications the understand MARC8) that has no equivalent outside of the library.  Like many character sets with a need to represent wide-characters (those with diacritics), MARC8 represented characters with diacritics by utilizing decomposed characters (though, this decomposition was MARC8 specific).  For librarians, this matters because the U.S. Library of Congress, when providing instructions on support for Unicode in MARC records, provided for the ability to round-trip between MARC8 and UTF8 and back (  This round-tripability comes at a cost, and that cost is that data, to be in sync with the recommendations, should only be provided in the NFKD notation.  This has implications however, as current generation operating systems are are generally implemented utilizing NFC as the internal representation for string data, and for programmers, who have to navigate challenges within their languages, as in most cases, language functions that deal with concepts like in-string searching or regular expressions, are done using settings that make them culturally aware (i.e., allows searching across data in different normalizations), but the replacement and manipulation of that data is almost always done using ordinal (binary) matching, which means that data using different normalization forms are not compatible.  And quite honestly, this is confusing the hell out of metadata people.  Using our “é” character as an example – a user may be able to open a program or work in a programming language, and find this value regardless of the underlying data normalization, but when it comes to making changes – the data will need to match the underlying normalization; otherwise, no changes are actually made.  And if you are a user that is just looking at the data on the screen (without the ability to see the underlying binary data or without knowledge of what normalization is being used), you’d rightly start to wonder, why didn’t the changes complete.  This is the legacy that round-trip support for MARC-8 has left us within the library community, and the implications of having data moving fluidly between different normalizations is having real consequences today.

Had We Listened to Gandalf

Can holding on with caption -- Run you fool

The ability to round-trip data from MARC-8 to UTF8 and back seemed like such a good idea at the time.  And the specifications that the U.S. Library of Congress lays out were/are easy enough  to understand and implement.  But we should have known that it was going to be easy, and that in creating this kind of backward compatibility, we were just looking for trouble down the road.

Probably the first indication that this was going to be problematic was the use of Numeric Character Reference (NRC) form to represent characters that exist outside of the MARC-8 repertoire.  Once UTF8 became allowed and a standard for representation of bibliographic data, the frequency in which MARC-8 records began to be littered by NRC representations (i.e., &#xXXXX; notation) increased exponentially, as too did the number of questions on the MarcEdit list for ways to find better substitutions for that data – primarily, because most ILS providers never fully adopted support for NRC encoded data.  Looking back now, what is interesting, is that many of the questioned related to the substitution of NRC notations can be traced to the utilization of NFC normalized data and the rise in the presence of “smart” characters generated in our text editing systems.  Looking at the MarcEdit archive, I can find multiple entries from users looking to replace NRC data elements that exist, simply because these elements represented the composed data points, and were thus, in compatible with MARC-8.  So, we probably should have seen this coming…and quite honestly, should have made a break.  Data created in UTF8 will almost always result in some level of data change when being converted back to MARC8…we should probably have just accepted that was a likely outcome, and not worried about the importance of round-tripability.

But….we have, and did, and now we have to find a way to make the data that we have, within the limitations of our systems, work.   But what are the limitations or consequences when thinking about the normalization form of data?  The data should render the same, right?  The data should search the same, right?  The data should export the same, right?  The answer to those questions, is that this shouldn’t matter, if the local system was standardizing the normalization of data as it is added or exported from the system, but in practice, it appears that few (if any systems) do that, so the normalization form of the data can have significant impacts on what the user sees, can discovery, or can export.

What the user sees

Probably the most perplexing issues that arise related to the normalization form of data, arise in how the data is rendered to the user.  While normalization forms have a binary difference, the system should be able to accommodate these differences aren’t visible to the user.  Throughout this document, I’ve been using different normalized forms of the letter “é”, but if the browser and the operating system are working like they are suppose to – you – as the reader shouldn’t be aware of these differences.  But we know that this isn’t always the case.  Here’s one such example:


The top set of data represents the data seen in an ILS prior to export.  The bottom shows the data once reimported, but the Normalization form had shifted from NFC to NFKD.  The interface being presented to the user has chosen to represent the data as bytes to flag that the data is represented as a decomposed character.  But this is jarring to the user as they probably should care.

The above example is actually not as uncommon as you might think.  In experimenting with a variety of ILS systems, changes in Normalization form can often have unintended effects for the user…and since it is impossible to know which normalization form is utilized without looking at the data at the binary level – how would one know when changes to records will result in significant changes to the user experience.

The short answer, is you can’t.  I started to wonder how OCLC treats Unicode data, and if internally, OCLC normalized the data coming into and out of its system.  And the answer is no – as long as the data is valid, the characters, in whatever normalization, is accepted into the system.  To test this, I made changes to the following record:  First, I was interested if any normalization was happening when interacting with OCLC’s Metadata API, and secondly, I was wondering if data brought in with different normalizations would impact searching of the resource.  And, the answers to these questions are interesting.  First, I wanted to confirm that OCLC accepted data in any normalization provided (as was relayed to me by OCLC), and indeed that is the case.  OCLC doesn’t do any normalization, as far as I can tell, of data going into the system.  This means that a user could download a master record, and make no other change to the record but updating the normalized form, and replace that record.  From the users perspective, the change wouldn’t be noticeable – but at the data level, the changes could be profound.  Given the variety of differences in how different ILS system utilize data in the different Unicode normalization forms, this likely explains some of the “diacritic display issue” questions that periodically make their way on the MarcEdit listserv.  Users are expecting that their data is compatible with their system because the OCLC data downloaded is in UTF8 and their system supports UTF8.  However, unknown to the cataloger, the reliance of data existing in a specific normalized form may cause issues.

The second question I was interested in, as it related to OCLC, was indexing.  Would a difference in normalization form cause indexing issues.  We know that in some systems, it does.  And for many European users, I have long recommended using MarcEdit’s normalization options to ensure that data converted to UTF8 utilizes the NFC normalization – as it enables local systems to index data correctly (i.e., index the letter + diacritic, rather than the letter then diacritic, then other data).  I was wondering if OCLC would demonstrate this kind of indexing behavior, but curiously, I found OCLC had trouble indexing any data with diacritical values.  Since I’m sure that isn’t the expected result, I’ve reached out to see exactly what is the expectation for the user.

Indexing implications

As noted above, for years now, I’ve recommended that users who utilize Koha as their ILS system configure MarcEdit to utilize the NFC normalization as the standard data output when converting data between MARC-8 and UTF-8.  The reason for this has been to ensure that data indexes correctly rather than flatly.  But maybe this recommendation should have been more broadly.  While  I didn’t look at every system, one common aspect of many of the systems that I did look at show, is that data normalized as NFKD tends to not index a representation of data as diacritical value.   They either normalize all diacritical data way, or they index the data as it appears in the binary – so for example, a record like this: évery would index as e_acute_very, i.e., the indexed value would be a plan “e”, but if the data appeared in NFC notation, the data would be indexed as an é (the combined character) allowing users to search for data using the letter + diacritic.  How does your system index its data?  It’s a question I’m asking today, and wondering how much of an impact normalization form has without the ILS, as well as outside the ILS (as we reuse data in a variety of contexts).  Since each system may make different assumptions and indexing decisions based on UTF8 data presented – its an interesting question to consider.

Export implications

The best case scenario is that a system would export data the same way that its represented in the system.  This is what OCLC does – and while it likely helps to exasperate some of the problems I see upstream with systems that may look for specific normalizations, its regular and expected.  Is this behavior the rule?  Unfortunately it is not.  I see many examples where data is altered on export, and often times, the diacritic related, the issue can be traced to the normalized form of the original data.  Again, the system probably should care which form is provided (in a perfect work), but if the system is implementing the MARC specification as written (see LC guidance above), then developing operations around the expectation of NFKD formed would likely led to complications.  But again, you’d likely never know until you tried to take the data out of the system.

Thinking about this in MarcEdit

So if you’ve stayed with me this long, you may be wondering if there is anything that we can do about the problems, short of getting everyone to agree that we all normalize our data a certain way (good luck).  In MarcEdit, I’ve been looking at addressing this question in order to address the following problems that I get asked about regularly:

  1. When I try to replace x diacritic, I can find the instances, but when I try to replace, only some (or none) are replaced
  2. When I import my data back into my system, diacritics are decomposed
  3. How can I ensure my records can index diacritics correctly


The first two issues are ones that come up periodically, and are especially confusing to users because the differences in data is at a binary level – so hard to see.  The last issue, MarcEdit has provided a 1/2 answer for.  It has always provided a way to set normalization when converted data to UTF8, but once there, it assumes that the user will provide the data in the form that they require (I’m realizing, this is a bad assumption).

To address this problem, I’m providing a method in MarcEdit that will allow the user to force the normalization of UTF8 data into a specific normalization, and will enable the application to support search and replace of data, regardless of the normalized form of a character that a user might us.  This will show up in the MarcEdit preferences.  Under the MARCEngine settings, there are options related to data normalization.  These show up as:


MarcEdit has included support for sometime to set normalization when compiling data.  But this doesn’t solve the problem when trying to edit, search, etc. records in the MarcEditor or within the other areas of the program.  So, a new option will be available – Enforce Defined Normalization.  This will enable the application to save data in the preferred normalization and also force all user submitted data through a wrapper that will enable edit operations to be completed, regardless of the normalized form a user may use when searching for data or the underlying normalization form of the individual records.  Internally, MarcEdit will make this process invisible, but the output created will be records that place all UTF8 characters into the specified normalization.  This seems to be a good option, and its very unlikely that tomorrow, the systems that we use will suddenly all start to use UTF8 data the same way – and taking this approach, they don’t have to.  MarcEdit will work as a bridge to take data in any UTF8 normalization, and will ensure that the data outputted all meets the criteria specified for the user.

Sounds good – I think so.  But it makes me a little nervous as well.  Why – because OCLC takes any data provided to it.  In theory, a record could switch normalizations multiple times, if users pulled data down, edited them using this option, and uploaded the data back to the database.  Does this matter?  Will it cause unforeseen issues?  I don’t know – I’m asking OCLC.  I also worry that allowing users to specify normalization form could have cascading issues when it comes to record sharing.  No everyone uses MarcEdit (nor should they) and its hard to know what impact this makes on other coding tools, etc.  This is why this function won’t be enabled by default – but will need to be turned on by the user – as I continue to inquire and have conversations about the larger implications of this work.  The short answer is that this is a pain point, and a problem that needs to be addressed somehow.  I see too many questions and too many records where the normalization form of the data plays a role in providing confusing data to the user, confusing data to the cataloger, or difficulties in reusing or sharing the data with other systems/processes.  At the same time, this feels like a band-aid fix until we reach a point in the evolution of our systems and metadata that we can free ourselves from MARC-8, and begin to think only about our data in UTF8.


So what should folks take away from all this?  Let’s start with the obvious.  Just because your data is in UTF8, doesn’t mean that its the same as my data in UTF8.  Normalization forms of data, a tool that was initially used to ease the transition of data from non-Unicode to Unicode data, can have other implications as well.  The information that I’ve provided, are just examples of challenges that make there way to me due to my work with MarcEdit.  I’m sure other folks have had different experiences…and I’d love to hear these if you want to provide them below.



Saxon.NET and local file paths with special characters and spaces

By reeset / On / In C#

I thought I’d post this here in case this can help other folks.  One of the parsers that I like to use is Saxon.Net, but within the .net platform at least, it has problems doing XSLT or XQuery transformations when the files in question have paths with special characters or spaces (or if they reference files via xsl:include statements that live inside paths with special characters or spaces).  The question comes up a lot on the Saxon support site and it sounds like Saxon is actually processing the data correctly.  Saxon is expecting valid URIs, and a URI can’t have a spaces.  Internally, the URI is escaped, but when you process those escaped paths against a local file system, accessing the file will fail.  So, what do I mean – here are two different types of problems I encounter:

  • Path 1: c:\myfile\C#\folder1\test.xsl
  • Path2: c:\myfile\C#\folder 1\test.xsl

When setting up a transformation using Saxon, you setup a XSLTransform.  You can set this up using either a stream, like an XMLReader, or a URI.  But here the problem.  If you create the statement like this:

System.Xml.XmlReader xstream = System.Xml.XmlReader.Create(filepath);
transformer = xsltCompiler.Compile(xstream).Load();

The program can read Path 1, but will always fail on Path 2, and will fail on Path 1 if it includes secondary data.  If rather than using a stream, I use a URI class like:

transformer = xsltCompiler.Compile(new Uri(sXSLT, UriKind.Absolute)).Load();

Both Path’s will break.  On the Saxon list, there was a suggestion to create a sealed class, and to wrap the URI in that class.  So, you’d end up code that looked more like:

transformer = xsltCompiler.Compile(new SaxonUri(new Uri(sXSLT, UriKind.Absolute))).Load();

public sealed class SaxonUri : Uri
        public SaxonUri(Uri wrappedUri)
            : base(GetUriString(wrappedUri), GetUriKind(wrappedUri))
        private static string GetUriString(Uri wrappedUri, bool localuri = false)
            if (wrappedUri == null)
                throw new ArgumentNullException("wrappedUri", "wrappedUri is null.");            
            if (wrappedUri.IsAbsoluteUri) 
                return wrappedUri.AbsoluteUri;
            return wrappedUri.OriginalString;
        private static UriKind GetUriKind(Uri wrappedUri)
            if (wrappedUri == null)
                throw new ArgumentNullException("wrappedUri", "wrappedUri is null.");
            if (wrappedUri.IsAbsoluteUri)
                return UriKind.Absolute;
            return UriKind.Relative;
        public override string ToString()
            if (IsWellFormedOriginalString())
                return OriginalString;
            else if (IsAbsoluteUri)
                return AbsoluteUri;
            return base.ToString();

And this get’s a closer.  Using this syntax, Path 1 doesn’t work, but Path 2 will.  So, you could use an if…then statement to look for spaces in the XSLT file path, and if there are no spaces, open the stream, and if there are, wrap the URI.  Unfortunately, that doesn’t work either – because if you include a reference (like xsl:include) in your XSLT, Path 1 and Path 2 fail, because internally, the BaseURI is set to an escaped version of the URI, and Windows will fail to locate the string.  At which point, you end up feeling like you might be pretty much screwed, but there are still other options but they take more work.  In my case, the solution that I adopted was to create a custom XmlResolver.  This allows me to handle all the URI processing myself, and in the case of the two path statements, I’m interested in handling all local file URIs.  So how does that work:

xsltCompiler.XmlResolver = new CustomeResolver();
transformer = xsltCompiler.Compile(new Uri(sXSLT, UriKind.Absolute)).Load();

internal class CustomeResolver : XmlUrlResolver
        public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
            if (absoluteUri.IsFile)
                string filename = absoluteUri.LocalPath;
                if (System.IO.File.Exists(filename)==false) {
                    filename = Uri.UnescapeDataString(filename);
                    if (System.IO.File.Exists(filename)==false)
                        return (System.IO.Stream)base.GetEntity(absoluteUri, role, ofObjectToReturn);
                    } else
                        System.IO.Stream myStream = new System.IO.FileStream(filename, System.IO.FileMode.Open);
                        return myStream;
                } else
                    return (System.IO.Stream)base.GetEntity(absoluteUri, role, ofObjectToReturn);

                return (System.IO.Stream) base.GetEntity(absoluteUri, role, ofObjectToReturn);

By creating your own XmlResolver, you can fix the URI problems and allow Saxon to process both use cases above.


Fonts, Font-sizes and the MacOS

By reeset / On / In C#, code, Programming

So, one of the questions I’ve occasionally been getting from Mac users is that they would really like the ability to shift the font and font sizes of the programs’ interface.  If you’ve used the Windows version of MarcEdit, this has been available for some time, but I’ve not put it into the Mac version in part, because, I didn’t know how.  The Mac UI is definitely different from what I’m use to, and the way that the AppKit exposes controls and the way controls are structures as a collection of Views and Subviews complicates some of the sizing and layout options.  But I’ve been wanting to provide something because on really high resolution screens, the application was definitely getting hard to read.

Anyway, I’m not sure if this is the best way to do it, but this is what I’ve come up with.  Essentially, it’s a function that can determine if an element has text, an image, and perform the font scaling, control resizing and ultimately, windows sizing to take advantage of Apples Autolayout features.  Code is below.


public void SizeLabels(NSWindow objW, NSControl custom_control = null)
			string test_string = "THIS IS MY TEST STRING";
			string val = string.Empty;
			string font_name = "";
			string font_size = "";
			NSStringAttributes myattribute = new NSStringAttributes();
			cxmlini.GetSettings(XMLPath(), "settings", "mac_font_name", "", ref font_name);
			cxmlini.GetSettings(XMLPath(), "settings", "mac_font_size", "", ref font_size);
			if (string.IsNullOrEmpty(font_name) && string.IsNullOrEmpty(font_size))
			NSFont myfont = null;
			if (String.IsNullOrEmpty(font_name))
				myfont = NSFont.UserFontOfSize((nfloat)System.Convert.ToInt32(font_size));
			else if (String.IsNullOrEmpty(font_size))
				font_size = "13";
				myfont = NSFont.FromFontName(font_name, (nfloat)System.Convert.ToInt32(font_size));
			else {
				myfont = NSFont.FromFontName(font_name, (nfloat)System.Convert.ToInt32(font_size));
			if (custom_control == null)
				CoreGraphics.CGSize original_size = NSStringDrawing.StringSize(test_string, myattribute);
				myattribute.Font = myfont;
				CoreGraphics.CGSize new_size = NSStringDrawing.StringSize(test_string, myattribute);
				CoreGraphics.CGRect frame = objW.Frame;
				frame.Size = ResizeWindow(original_size, new_size, frame.Size);
				objW.MinSize = frame.Size;
				objW.SetFrame(frame, true);
				//MessageBox(objW, objW.Frame.Size.Width.ToString() + ":" + objW.Frame.Size.Height.ToString());
				foreach (NSView v in objW.ContentView.Subviews)
					if (v.IsKindOfClass(new ObjCRuntime.Class("NSControl")))
						NSControl mycontrol = ((NSControl)v);
						switch (mycontrol.GetType().ToString())
							case "AppKit.NSTextField":
							case "AppKit.NSButtonCell":
							case "AppKit.NSBox":
							case "AppKit.NSButton":
								if (mycontrol.GetType().ToString() == "AppKit.NSButton")
									if (((NSButton)mycontrol).Image != null)
								mycontrol.Font = myfont;
								//if (!string.IsNullOrEmpty(mycontrol.StringValue))
								//	mycontrol.SizeToFit();
						if (mycontrol.Subviews.Length > 0)
							SizeLabels(objW, mycontrol);
					else if (v.IsKindOfClass(new ObjCRuntime.Class("NSTabView")))
						NSTabView mytabview = ((NSTabView)v);
						foreach (NSTabViewItem ti in mytabview.Items)
							foreach (NSView tv in ti.View.Subviews)
								if (tv.IsKindOfClass(new ObjCRuntime.Class("NSControl")))
									SizeLabels(objW, (NSControl)tv);
			else {
				if (custom_control.Subviews.Length == 0)
					if (custom_control.GetType().ToString() != "AppKit.NSButton" ||
						(custom_control.GetType().ToString() == "AppKit.NSButton" &&
						 ((NSButton)custom_control).Image == null))
						custom_control.Font = myfont;
				else {
					foreach (NSView v in custom_control.Subviews)
						NSControl mycontrol = ((NSControl)v);
						switch (mycontrol.GetType().ToString())
							case "AppKit.NSTextField":
							case "AppKit.NSButtonCell":
							case "AppKit.NSBox":
							case "AppKit.NSButton":
								if (mycontrol.GetType().ToString() == "AppKit.NSButton")
									if (((NSButton)mycontrol).Image != null)
								mycontrol.Font = myfont;
								//if (!string.IsNullOrEmpty(mycontrol.StringValue))
								//	mycontrol.SizeToFit();
								if (mycontrol.Subviews.Length > 0)
									SizeLabels(objW, mycontrol);

And that was it. I’m sure there might be better ways, but this is (crossing my fingers) working for me right now.

Working with the Clipboard on OSX

By reeset / On / In C#, code, Programming

Coming from the Windows and Linux world — the object where data is copy and pasted from is called the Clipboard.  Not so in OSX.  In OSX, this is referred to as the NSPasteBoard.  Should you need to get string data on and off of it – use the following:


private static string[] pboardTypes = new string[] { "NSStringPboardType" };
public void SetClipboardText(string text)
	NSPasteboard.GeneralPasteboard.DeclareTypes(pboardTypes, null);
	NSPasteboard.GeneralPasteboard.SetStringForType(text, pboardTypes[0]);

public string GetClipboardText()
	return NSPasteboard.GeneralPasteboard.GetStringForType(pboardTypes[0]);


Automated Language Translation using Microsoft’s Translation Services

By reeset / On / In C#, translation services

We hear the refrain over and over – we live in a global community.  Socially, politically, economically – the ubiquity of the internet and free/cheap communications has definitely changed the world that we live in.  For software developers, this shift has definitely been felt as well.  My primary domain tends to focus around software built for the library community, but I’ve participated in a number of open source efforts in other domains as well, and while it is easier than ever to make one’s project/source available to the masses, efforts to localize said projects is still largely overlooked.  And why?  Well, doing internationalization work is hard and often times requires large numbers of volunteers proficient in multiple languages to provide quality translations of content in a wide range of languages.  It also tends to slow down the development process and requires developers to create interfaces and inputs that support language sets that they themselves may not be able to test or validate.   


If your project team doesn’t have the language expertise to provide quality internalization support, you have a variety of options available to you (with the best ones reserved for those with significant funding).  These range of tools available to open source projects like: TranslateWiki ( which provides a platform for volunteers to participate in crowd-sourced translation services.  There are also some very good subscription services like Transifex (, a subscription service that again, works as both a platform and match-making service between projects and translators.  Additionally, Amazon’s Mechanical Turk can be utilized to provide one off translation services at a fairly low cost.  The main point though, is that services do exist that cover a wide spectrum in terms of cost and quality.   The challenge of course, is that many of the services above require a significant amount of match-making, either on the part of the service or the individuals involved with the project and oftentimes money.  All of this ultimately takes time, sometimes a significant amount of time, making it a difficult cost/benefit analysis of determining which languages one should invest the time and resources to support.

Automated Translation

This is a problem that I’ve been running into a lot lately.  I work on a number of projects where the primary user community hails largely from North America; or, well, the community that I interact with most often are fairly English language centric.  But that’s changing — I’ve seen a rapidly growing international community and increasing calls for localized versions of software or utilities that have traditionally had very niche audiences. 

I’ll use MarcEdit ( as an example.  Over the past 5 years, I’ve seen the number of users working with the program steadily increase, with much of that increase coming from a growing international user community.  Today, 1/3-1/2 of each month’s total application usage comes from outside of North America, a number that I would have never expected when I first started working on the program in 1999.  But things have changed, and finding ways to support these changing demographics are challenging.. 

In thinking about ways to provide better support for localization, one area that I found particularly interesting was the idea of marrying automated language transcription with human intervention.  The idea being that a localized interface could be automatically generated using an automated translation tool to provide a “good enough” translation, that could also serve as the template for human volunteers to correct and improve the work.  This would enable support for a wide range of languages where English really is a barrier but no human volunteer has been secured to provide localized translation; but would enable established communities to have a “good enough” template to use as a jump-off point to improve and speed up the process of human enhanced translation.  Additionally, as interfaces change and are updated, or new services are added, automated processes could generate the initial localization, until a local expert was available to provide the high quality transcription of the new content, to avoid slowing down the development and release process.

This is an idea that I’ve been pursing for a number of months now, and over the past week, have been putting into practice.  Utilizing Microsoft’s Translation Services, I’ve been working on a process to extract all text strings from a C# application and generate localized language files for the content.  Once the files have been generated, I’ve been having the files evaluated by native speakers to comment on quality and usability…and for the most part, the results have been surprising.  While I had no expectation that the translations generated through any automated service would be comparable to human mediated translation, I was pleasantly surprised to hear that the automated data is very often, good enough.  That isn’t to say that it’s without its problems, there are definitely problems.  The bigger question has been, do these problems impede the use of the application or utility.  In most cases, the most glaring issue with the automated translation services has been context.  For example, take the word Score.  Within the context of MarcEdit and library bibliographic description, we know score applies to musical scores, not points scored in a game…context.  The problem is that many languages do make these distinctions with distinct words, and if the translation service cannot determine the context, it tends to default to the most common usage of a term – and in the case of library bibliographic description, that would be often times incorrect.  It’s made for some interesting conversations with volunteers evaluating the automated translations – which can range from very good, to down right comical.  But by a large margin, evaluators have said that while the translations were at times very awkward, they would be “good enough” until someone could provide better a better translation of the content.  And what is more, the service gets enough of the content right, that it could be used as a template to speed the translation process.  And for me, this is kind of what I wanted to hear.

Microsoft’s Translation Services

There really aren’t a lot of options available for good free automated translation services, and I guess that’s for good reason.  It’s hard, and requires both resources and adequate content to learn how to read and output natural language.  I looked hard at the two services that folks would be most familiar with: Google’s Translation API ( and Microsoft’s translation services (  When I started this project, my intention was to work with Google’s Translation API – I’d used it in the past with some success, but at some point in the past few years, Google seems to have shut down its free API translation services and replace them with a more traditional subscription service model.  Now, the costs for that subscription (which tend to be based on number of characters processed) is certainly quite reasonable, my usage will always be fairly low and a little scattershot making the monthly subscription costs hard to justify.  Microsoft’s translation service is also a subscription based service, but it provides a free tier that supports 2 million characters of through-put a month.  Since that more than meets my needs, I decided to start here. 

The service provides access to a wide range of languages, including Klingon (Qo’noS marcedit qaStaHvIS tlhIngan! nuq laH ‘oH Dunmo’?), which made working with the service kind of fun.  Likewise, the APIs are well-documented, though can be slightly confusing due to shifts in authentication practice to an OAuth Token-based process sometime in the past year or two.  While documentation on the new process can be found, most code samples found online still reference the now defunct key/secret key process.

So how does it work?  Performance-wise, not bad.  In generating 15 language files, it took around 5-8 minutes per file, with each file requiring close to 1600 calls against the server, per file.  As noted above, accuracy varies, especially when doing translations of one word commands that could have multiple meanings depending on context.  It was actually suggested that some of these context problems may actually be able to be overcome by using a language other than English as the source, which is a really interesting idea and one that might be worth investigating in the future. 

Seeing how it works

If you are interested in seeing how this works, you can download a sample program which pulls together code copied or cribbed from the Microsoft documentation (and then cleaned for brevity) as well as code on how to use the service from:–Language-Translator.  I’m kicking around the idea of converting the C# code into a ruby gem (which is actually pretty straight forward), so if there is any interest, let me know.


OCLC WorldCat Metadata API Ruby Gem

By reeset / On / In OCLC, ruby

Since last December, I’ve had the opportunity to spend a good deal of time working with the OCLC WorldCat Metadata API.  The focus was primarily around kicking the tires, and then eventually developing some integration components with MarcEdit, as well as a C# library ( for those that may have use of such things.

However, most folks in libraries tend to not use C# – they tend to focus on lighter-weight languages like ruby, python, or PHP.  Well, I work in ruby as well, so I decided to take the C# library and port it to Ruby.  The resulting code can be found here:

It’s pretty straightforward – and provides wrappers for all the current functionality found in the API, with the caveat being that the defined transfer format between OCLC and the client is in MARCXML, and response formats are in application/atom+xml.  The API supports a handful of other formats, but I’ve standardized on MARCXML since it’s well understood and there are lots of tools that can work with it.

The code isn’t well commented at this point.  I’ll take some time over the next few days to add some commenting, improve exception handling, and possibly add a few helper functions to make message processing easier – but this should help anyone interested in knowing how the API works get a better understanding.


Code Example:

** Get Local Bib Record

require ‘rubygems’
require ‘wc_metadata_api’

key = ‘[your key]’
secret = ‘[your secret]’
principalid = ‘[your principal_id]’
principaldns = ‘[your principal_idns]’
schema = ‘LibraryOfCongress’
holdingLibraryCode='[your holding code]’
instSymbol = ‘[your oclc symbol]’

client = => key, :secret => secret, :principalID => principalid, :principalDNS => principaldns, :debug =>false)

response = client.WorldCatReadLocalBibRecord(:oclcNumber =>’338544583′, :schema => schema, :holdingLibraryCode => holdingLibraryCode, :instSymbol => instSymbol)

puts response

MarcEdit and the OCLC Metadata API: Introduction

By reeset / On / In MarcEdit, OCLC, Programming


I wanted to note that I’ve updated this post to correct/clarify two statements within this post. 

  1. The requirement of 2 wskeys
  2. Terms of use

OCLC has two wskey structures.  For those developers that have been working with OCLC for a long time and have a wskey for their search services, OCLC can decommission your former key and create a new one that supports all functionality or they can give you a second key for the Metadata API.  For new users, you simply need to request a key that includes specific functionality.  In MarcEdit, I will continue to keep the key’s separate for configuration purposes, but this value can be the same.

Secondly – related to terms of use.  I need to clarify – if you are developer and have been using the WorldCat Platform Services outside of the Search API, the terms to use this service as a developer are no different from the other licensing terms you or your institution may have agreed to.  However, if you are a cataloger, the terms to retrieve/create a developer key are very likely different from the terms of service associated with your license related to the cataloging service.  Users need to be aware of these differences because your organization may/will care.


I’ve been starting to work on a couple of different write-ups around working with the OCLC Metadata API ( – my general impressions and some specific improvements that really need to be in place to make this resource more usable.  But I also have realized that I have neglected to give any kind of write-up about using the API within MarcEdit, save for an early YouTube video.  So, I’m taking a little bit of time here to jot down some information related to how the API can be utilized within MarcEdit.


So a little bit of background.  Over the past couple of years, I’ve been looking for opportunities to make use of the great work that OCLC Research does in exposing some interesting new data streams to the library community.  When OCLC first released their classify service a number of years ago, I believe I was probably one of the first people to grab it and integrate it into a product that could be used within existing cataloging workflows.  This year, I expanded the service to include integration with OCLC’s FAST headings service.  While OCLC’s services do have some baggage attached (in terms of who is allowed to use them), the baggage is pretty small and they provide real value to the library community.  Providing them through MarcEdit made a lot of sense.

The same can be true of OCLC’s new Metadata API.  For a number of years, I’ve been having individuals in the user community looking for the ability to interact directly with WorldCat, specifically around the ability to set and delete holdings.  The API provides this functionality, in addition to the ability to add and edit bibliographic records, as well as functionality around local data records.  For the first round of integration, I’ve limited my work primarily to dealing with holdings and bibliographic data.  I’ll be looking for people interested in the local data functionality and see how MarcEdit might be augmented to support that as well.

Early Use cases

Part of the reason I chose to work and support the specific API actions that I did was due to existing use cases.  Processing around E-books and E-Resources have lead users within the MarcEdit community to make a couple common requests related to OCLC and WorldCat.  Specifically, users are asking to:

  1. Automate the process of setting holdings
  2. Automate the upload of new records without using Connexion
  3. Automate Headings validation

The API provides the ability to address the first two questions, and I hope with time, OCLC will make some of their heading validation services available to support the 3rd request.  But for now, MarcEdit’s development has really been focused around supporting these two specific use cases.

First Impressions

I’ll take some time to provide some additional feedback in a few days, but my first impressions were mixed.  One the one hand, interacting with the service once I understood what the API expected in terms of requests and responses was pretty straightforward.  On the other hand, very little documentation exists.  Often times, I would initiate purposeful errors because simple things, like acceptable schemas, are left undocumented but are provided in an error message.  Unfortunately, the missing documentation definitely complicates working with the service, as things related to validation, reasons for authentication errors, etc. simply are not documented and whose descriptions are fairly cryptic when reading as a response from the API.

However, I think anyone that does development probably is use to working with scarce documentation, and to OCLC’s credit, they have been very responsive in providing answers to the holes that currently exist in the documentation.  From my perspective, the most concerning aspect of the API right now is the authentication.  From a developer’s standpoint, the authentication process is easy enough.  For the user however, the process for getting keys to utilize tools built upon the services is fairly problematic because the process assumes that users are developers and forces key requests and management through that portal.  Likewise, these keys come with new terms of use outside of your traditional cataloging licensing and may (likely) will need to be vetted through an organizations legal council.  Again, this is primarily a problem for non-developers – who have relationships with OCLC in ways outside of the WorldShare Platform Services…but these are primarily the users that this integration  in MarcEdit targets (i.e., catalogers, not developers).  This honestly is my biggest concern with the service right now (that and that keys are tied to individuals, not institutions) – bigger than issues related to documentation, validation, and the somewhat cryptic nature of the feedback received through the API.

Using the API in MarcEdit

So, to use the OCLC Metadata API in MarcEdit, there are a couple things that you need to do.  First, you need to request your OCLC keys.  MarcEdit’s integration requires users the appropriate Wskeys:

  1. OCLC’s Search API Key
  2. OCLC’s Metadata API Key

For long-term OCLC Platform Users (think Search API) – this means requesting two keys due to the fact that the previous key format isn’t compatible with their new authentication system.  For new users, a single key should suffice.  Regardless, a key that can support OCLC’s Search functionality is required to support MarcEdit’s ability to query the WorldCat database.

Likewise, MarcEdit needs a key that can utilize the Metadata API key.  Again, for legacy API users, this will likely be a separate key – for new users – the search and metadata keys should be the same.  This key has two parts – there is the Key and a Secret key.  You need both to make requests.  Likewise, there are 3 other values that users need to know to utilize the Metadata API services, and that is what is called a principalID a principalIDNS, a schema, a holdings symbol and a 4 character holdings code.  All of these values need to be available in order to work with the various functions provided by the API.

Once you have these keys and supplemental information, you need to enter them into MarcEdit.  Users will be able to see the menu entries for many of the OCLC functions, but until a user has entered their OCLC key data into the MarcEdit preferences and validated the data – these functions will remain disabled.

In the Preference’s area, there is a new tab that has been setup to store data related to OCLC’s Metadata services.


From this screen, users can enter all the information that they will need in order to utilize the various Metadata and Search API functions.  MarcEdit will store this information in it’s configuration file, but does encrypt the data using a methodology that would make it impractical to reconstruct the key data.

After a user enters their data – the user should Click Apply, and then Validate.  The Validate process is what enables the OCLC API integration functions.  MarcEdit performs a couple API operations on an existing test record to determine if the information that user provided will support the required functionality.  As MarcEdit validates a key, it will turn the textbox either green (for validated) or red (to indicate a problem).

Example of a problem with a key

Example of all keys validated

Once the data has been validated – the OCLC Menu Items found in the Main Window and on the MarcEditor will be enabled.

Main Window – OCLC Functions

MarcEditor OCLC Functions

Using the OCLC API Functions

Hopefully, the functions are fairly straightforward to use and look identical regardless of where you interact with them within the application.  At present, MarcEdit provides a function for setting holdings and working with bibliographic data.

Setting Holdings

MarcEdit’s Holdings Tool

MarcEdit provides a straightforward tool for updating a user’s holdings for a batch of records.  MarcEdit’s holdings tool can process either MARC data, MarcEdit’s mnemonic data or a plain text file of OCLC numbers.  The tool does exactly what it says.  Using the information set in the Preferences, MarcEdit will pre-populate your Institution Code and Classification Schema.  Users then need to select an action.  MarcEdit’s batch tool can update or delete holdings, but will perform just one of those actions on all records within a designated file.  That means, you cannot upload a file that contains records that need to have holdings added and deleted.  The tool requires that these actions are isolated into discrete operations.

Update/Creating Bibliographic Data

MarcEdit’s Bibliographic Tool

When working with MarcEdit’s bibliographic updating/creation tool for WorldCat – this tool does allow users to work with files that contain records that are both new or updated.  OCLC’s system and MarcEdit evaluates records within a file and based on the presence or absence of an OCLC number in a record will determine if a bibliographic record is to be created or updated.  Of course, it’s not that simple – all bibliographic records passed through the API must be validated on the OCLC side, and at this point, that validation only occurs when the data is submitted for update or creation – a weakness that I hope at some point will be rectified since this causes a number of issues when working in a batch record environment.

Working within the MarcEditor

When working within the MarcEditor, the tools for working with Holdings and Bibliographic data are identical to outside.  The main difference is that the data being acted upon is the data in the MarcEditor.  This means that the MarcEditor needs to include one additional function, and that is the ability to retrieve data from WorldCat.  MarcEdit has the ability to search and download a record set from WorldCat for editing.

Searching WorldCat from Within the MarcEditor

The search tool provides a handy way for users to quickly retrieve data from WorldCat for edit.  However, this too underlines one of the glaring issues with the API – especially around record editing.  Traditionally, when catalogers work on a record, they can lock the record for editing.  This prevents other users from changing the record while they are working on it, and protects the transaction information in the 005 which OCLC requires for updating purposes.  When working with the API, there appears to be no way to lock a record.  This means that records can be downloaded, edited and then on upload, be invalid due to someone else making an edit that updates the 005 information.  And since OCLC doesn’t provide a method for validating information prior to upload, users won’t know that this problem has occurred until an upload is attempted.  Again, in order to make the API functionally equivalent to OCLC’s other editing services, there needs to be a way for catalogers to lock records for editing.

Where we go from here

This is really hard to say.  I believe that OCLC sees the release of the Metadata APIs as the first of a much larger portfolio of API services to provide programmatic support for their WorldShare endeavors.  So, it will be interesting to see how these develop.  At the same time, I’m really interested in seeing if the organization puts the type of resources behind the development of their API Services Portfolio to provide really meaningful services that librarians, developers, and 3-party library developers can work with, consume, or augment the wide range of data streams that they have available.  On the one hand, I’m hopeful that the release of the Metadata API signals are wider commitment to providing transparent access to a wide variety of services.  At the same time, the lack of development around the OCLC search API – which has seen only incremental changes since they were first released some 5 or so years ago – makes me wonder how much support within the larger organization exists around this type of work.  So I guess that is a question that remains to be answered.



DSpace REST API built in JERSEY

By reeset / On / In Dspace, Programming

I thought I’d take a quick moment to highlight some work that was done by one of the programmers here at The OSU, Peter Dietz.  Peter is a bit of a DSpace wiz and a contributor to the project, and one of the things that he’s been interested in working on has been the development of a REST API for DSpace.  You can see the notes on his work on this GitHub pull request:


Thankfully, I’m at a point in my career where I no longer have to be the individual that has to wrestle with DSpace’s UI development, but I’ve never been a big fan of it.  From the days when the interface was primarily JSP to the, it sounded like a good idea at the time, XSLT interfaces that most people use today, I’ve long pined for the ability to separate the DSpace interface development from the actual application, and move that development into a framework environment (any framework environment).  However, the lack of a mature REST API has made this type of separation very difficult. 


The work that Peter has done introduces a simple READ API into the DSpace environment.  A good deal more work would need to be done around authentication to manage access to non-public materials as well as expansions to the API around search, etc., but I think that this work represents a good first step. 

However, what’s even more exciting is the demonstration applications that Peter has written to test the API.  The primary client that he’s used to test his implementation is a Google Play application, which was developed utilizing a MVC framework.  While a very, very simple client, it’s a great first step I think that shows some of the benefits of separating the interface development away from the core repository functionality, as changes related to the API or development around the API no longer require recompiling the entire repository stack. 

Anyway – Peter’s work and his notes can be found as part of this GitHub pull request.  Here’s hoping that either through Peter’s work, or new development, or a combination of the two; we will see the inclusion of a REST API in the next release of the DSpace application.


Building your own reminder system

By reeset / On / In C#, Family, Library, Microsoft

One of the hats I wear is as a member of the Independence Library Board.  I love it because I don’t work with public libraries as often as I’d like to in my real job, and honestly, the Independence Public Library is the center of the community.  The Library is a center for adults looking for education opportunities, kids looking for resources, and home to a number of talented librarians that are dedicated to encouraging a love of reading to our community.  It’s one of the few libraries I’ve ever known to have both a children’s and adult reading programs and takes advantage of that in the summer – by having the adults and kids compete against each other to see who logs the most pages (the kids always win). 

Each board meeting is interesting, because as the economy became more difficult for people, more people turned to the library.  Every month, the library sees more circulations, more bodies in the building, more kids, more adults – just more.  And they do it on a budget that doesn’t accurately reflect the impact that they have on the community. 

Anyway, one of the things that the Library has going for it is a very active friends program – and through that group (and some grant funds), the library was able to purchase a number of Laptop computers for circulation within the Library.  The Library currently has some, 8-10 terminals that are always being used and the laptops would provide additional seats, and allow people to work anywhere within the library using the wifi.

The Library setup the laptops using the usual software – DeepFreeze, etc. to provide a fairly locked down environment.  However, what was missing was a customizable timer on the machines.  Essentially, the staff was looking for a way to make it easier for patrons checking out the laptops to avoid fines.  The Laptops circulate for a finite period of time within the building.  Once that time is over, the clock starts ticking for fines.  To avoid confusion, and help make it easier for patrons to know when the clock was running out – I’d offered to work on building a simplified timer/kiosk program. 

The impetus for this work comes from Access 2007 I think.  I had attended the hackfest before the conference and one of the project ideas was an open source timing program.  I had worked on and developed a proof of concept that I passed on.  And while I never worked on the code since – I kept a copy myself.  When we were talking about things that would be helpful, I was reminded of this work. 

Now, unfortunately, I couldn’t use much of the old project at all.  The needs were slightly different – but it helped me have a place to start so that I wasn’t just looking at a blank screen.  So, with idea in hand, I decided to see how much time it would take to whip together an application that could meet the needs. 

I’ll admit, nights like tonight make me happy that I still do more than write code in scripting languages like python and ruby.  Taking about 3 hours, I put together a feature complete application that meets our specific needs.  I’ll be at the Oregon Library Association meeting this week, and if folks find this kind of work interesting, I’ll make it a bit more generic and post the source for anyone that wants to tinker with it.

So what does it do?  It’s pretty simple.  Basically, it’s an application that keeps time for the user and provides some built-in kiosk functionality to prevent the application was being disabled. 

Here are a few of the screen shots:

When the program is running, you see the clock situated in the task tray

Click on the icon, and see the program menu

Preferences – password protected



Because we have a large Hispanic population, all the strings will need be able to be translated.  This was essentially is just the locked message.  I’ll ensure the others are customizable as well – maybe with an option to just use Google Translate (even though it far, far from perfect) if a need to just get the gist across is the most important.

Run an action (both functions require a password)

Place your cursor over the icon to get the minutes

Information box letting you know you are running out of time

Sample lockout screen

In order to run any of the functions, you must authenticate yourself.  In order to disable the lockout screen, you must authenticate yourself.  What’s more, while the program is running, it creates a low-level keyboard hook to capture and pre-process all keystrokes, disabling things like escape keys, the windows key, ctrl+alt+del so that once this screen comes up – a user can not break out of it without shutting off the computer (which would result in needing to log in).  Coupled with DeepFreeze and some group policy settings, my guess is that this will suffice.

The source code itself is a few thousand lines of code, with maybe a 1000 or 1500 lines of actual business logic and the remainder around the UI/threading components.  Short and simple.

Hopefully, I’ll get a chance to do a little testing and get some feedback later this week – but for now – I’m just happy that maybe I can give a little bit back to the community library that gives so much to my family.  And if I hear from anyone that this might be of interest outside my library – I’ll certainly post the code up to github.


Getting the real %windir%\system32 directory from within a 32 bit process on a 64 bit machine

By reeset / On / In C#

When working with the 64-bit flavor of Windows, there are a couple of quirks that you just need to accept.  First, when Microsoft designed windows for 64 bit processing, they weren’t going to break legacy applications and second, how this gets done makes absolutely no sense unless you simply have faith that the Operating System will take care of it all for you.

So why doesn’t it make sense?  Well, Windows 64-bit systems are essentially operating systems with two minds.  One the one hand, you have the system designed to run 64-bit processes, but lives within an ecosystem where nearly all windows applications are still designed for 32-bit systems.  So what is an OS designer to do?  Well, you have both versions of the OS running, obviously.  Really, it’s not that simple (or complex) – in reality, Microsoft has a 32 bit emulator that allows all 32 bit applications to run on 64-bit versions of Windows.  However, this is where it gets complicated.  Since 64 bit processes cannot access 32 bit processes and vise versa – you run into a scenerio where Windows must mirror a number of systems components for both 32 and 64 bit processes.  They do this two ways:

  1. In the registry – Microsoft has what I like to think of as a shadow registry that exists for WOW64 bit processes (that’s the 32 bit emulator if you can’t tell) – to register COM objects and components accessible to 32-bit applications.
  2. The presence of a system32 (this is ironically where the 64bit libraries live) and a SysWow64 (this is where the 32 bit libraries live) system folders which replicate large portions of the Windows API and provide system components for 32-bit and 64 bit processes.

So how does this all work together?  Well, Microsoft makes it work through redirection.  Quietly, transparently – the Windows operating system will access the applicable part of the system based on the process type (being either 64 or 32 bit).  The problem arises when one is running a 32-bit process, but want to have access to a dedicated 64 bit folder or registry key.  Because redirection happens at the system level, programming tools, api access, etc. all are redirected to the appropriate folder for the process type.

Presently, I’m in the process of rebuilding MarcEdit, and one of the requirements is that it run natively in either 32 or 64 bit mode.  The problem is that there are a large cohort of users that are working with MarcEdit on 64 bit systems, but have it installed as a 32 bit app.  So, I’ve been tweaking an Installer class that will evaluate the user’s environment (if 32 bit process running in a 64 bit OS) and move the appropriate 64 bit files into their correct locations so that when they run the program for the first time, the C# code compiled to run on Any CPU (so, if it’s a 64 bit OS, it will run natively as a 64 bit app) – the necessary components will be available.

So how do we do this?  Actually, it’s not all that hard.  Microsoft provides two important API for the task: Wow64DisableWow64FsRedirection and Wow64RevertWow64FsRedirection.  Using these two functions, you can temporarily disable redirection within your application.  However, you want to make sure you re-enable when your required access of these components is over (or apparently, the world as we know it will come to and end).   So, here’s my simple example of how this might work:

#region API_To_Disable_File_Redirection_On_64bit_Machines
public static string gslash = System.IO.Path.DirectorySeparatorChar.ToString();
[DllImport("kernel32.dll", SetLastError = true)]
private static extern bool Wow64DisableWow64FsRedirection(ref IntPtr ptr);
[DllImport("kernel32.dll", SetLastError = true)]
private static extern bool Wow64RevertWow64FsRedirection(IntPtr ptr);
#region Fix 64bit 
if (System.Environment.Is64BitProcess == false && System.Environment.Is64BitOperatingSystem == true)
//This needs to have some data elements moved from the 64bit folder to the System32 directory
string[] files = System.IO.Directory.GetFiles(AppPath() + "64bit" + gslash);
string windir = Environment.ExpandEnvironmentVariables("%windir%");
string system32dir = Path.Combine(windir, "System32");
if (system32dir.EndsWith(gslash) == false) {
system32dir += gslash;

//We need to run off the redirection that happens on 64 bit systems.
IntPtr ptr = new IntPtr();
bool isWow64FsRedirectionDisabled = Wow64DisableWow64FsRedirection(ref ptr);
foreach (string f in files)
string filename = System.IO.Path.GetFileName(f);
if (System.IO.File.Exists(system32dir + filename) == false)
    System.IO.File.Copy(AppPath() + "64bit" + gslash + filename, system32dir + filename);
catch (System.Exception yyy){//System.Windows.Forms.MessageBox.Show(yyy.ToString()); 
//We need to run the redirection back on
if (isWow64FsRedirectionDisabled)
bool isWow64FsRedirectionReverted = Wow64RevertWow64FsRedirection(ptr);



And that’s it.  Pretty straightforward.  As noted above, the big thing to remember is to Reenable the Redirection.  If you don’t, lord knows what will happen.