Character conversion in .NET

By reeset / On / In General Computing, Programming

I had someone who uses UniMARC ask me about some problems they were having with the conversion from UniMARC to Dublin Core.  Some of the characters were being skewed when the translation occurred.  The problem, of course, relates to the character set that the UniMARC records are encoded in.  Because a number of MARC formats utilize the MARC8 character set, I’ve coded MarcEdit to expect either MARC8 or UTF8 when moving to XML.  The reason for this is that the MARC8 character set overlaps with the ISO 8859-1 (cp 1252) character map.  Since the UniMARC data was in the ISO 8859-1 encoding, the overlapping elements were translated into Unicode as if they were in MARC8.  Ugh.  Fortunately, MarcEdit provides a facility that allows users to migrate their data between other formats and UTF8.  This layer allows users to move data from any supported character set (any available Windows code page) to UTF8 and then back to MARC8 if necessary.  So for this example, I was able to recommend that the user use this character tool to convert the data into UTF8 and then process the file into XML.  It adds one extra step, but it works for now.  I’m thinking that in the near future I’ll likely add an option in the Preferences to allow users to set a default character set.  This will allow the MarcEngine to handle these problems internally when moving between MARC and XML.
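The overlap problem described above is easy to reproduce: the same bytes decode to different characters depending on which encoding you assume, which is why the engine has to be told (or guess) the source character set.  A quick sketch, using ISO 8859-1 directly since it matches codepage 1252 for this byte:

```csharp
using System;
using System.Text;

class EncodingOverlapDemo
{
    static void Main()
    {
        // 0xE9 is "é" in ISO 8859-1 (and codepage 1252), but the same
        // byte is not a valid sequence at all in UTF-8.
        byte[] raw = { 0xE9 };

        string asLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(raw);
        string asUtf8 = Encoding.UTF8.GetString(raw);

        Console.WriteLine(asLatin1); // é
        Console.WriteLine(asUtf8);   // U+FFFD replacement character
    }
}
```

The same ambiguity applies to MARC8: a byte stream carries no label saying which character set it is, so a converter that assumes the wrong one will happily produce skewed output.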

One of the things I’ve been happy with in .NET has been the ease of moving between character sets.  As some of you may know, the MARCEngine in MarcEdit has traditionally been written in assembler.  This meant that I wrote my own character conversions for the most part, making the process fairly tedious.  In C#, however, this is handled in a couple of lines of code.  So, for example, if I were opening a file in Windows codepage 1252 and needed to convert it to codepage 1250 or even UTF8:

string s = "";
byte[] inBytes;
byte[] outBytes;
System.IO.StreamReader reader = new System.IO.StreamReader(@"c:\test1252.txt", System.Text.Encoding.GetEncoding(1252));
System.IO.StreamWriter writer = new System.IO.StreamWriter(@"c:\testutf8.txt", false, System.Text.Encoding.UTF8);

//Read the file in
s = reader.ReadToEnd();
//Get the raw bytes back in codepage 1252, convert them to UTF8, and write them out
inBytes = System.Text.Encoding.GetEncoding(1252).GetBytes(s);
outBytes = System.Text.Encoding.Convert(System.Text.Encoding.GetEncoding(1252), System.Text.Encoding.UTF8, inBytes);
writer.Write(System.Text.Encoding.UTF8.GetString(outBytes));

reader.Close();
writer.Close();
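When you don’t need the intermediate string, the same conversion can be done entirely at the byte level, which avoids the read-as-string round trip.  A sketch along those lines (the paths are placeholders, and ISO 8859-1 stands in for codepage 1252 here; on .NET Core/5+, GetEncoding(1252) additionally requires registering the CodePages encoding provider):

```csharp
using System.IO;
using System.Text;

class ConvertFileBytes
{
    static void Main()
    {
        // Read the raw bytes, convert the encoding, write the result.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] source = File.ReadAllBytes(@"c:\test1252.txt");
        byte[] target = Encoding.Convert(latin1, Encoding.UTF8, source);
        File.WriteAllBytes(@"c:\testutf8.txt", target);
    }
}
```

Working on bytes also sidesteps any line-ending or BOM handling that StreamReader/StreamWriter would otherwise do for you.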

5 thoughts on “Character conversion in .NET”

  1. Terry,

    I have needed something like this for a while, but I have a problem with your code. I cannot find any reference to System.Text.Convert in any namespaces. Actually, I searched all of Microsoft and found nothing either. Can you verify this line or tell me what references you have in your project to compile this code?

    Thanks,

    Matt

  2. Yep, you’re right. When I copied the code over, I left out the Encoding. I’ve corrected it above. The correct class is System.Text.Encoding.Convert.

    –TR

  3. Hi, I have a .mrc file containing 30,000 records. Can I use the same code for that? Sorry to ask like this, but I’m new to coding and wanted some help.

  4. For MARC records, no. MARC-8, the legacy character set, is an artificial codepage that doesn’t actually exist except as code points in a documentation manual. Mapping between MARC-8 and UTF8 involves analyzing the individual code points and making the conversions yourself.

    Sorry
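To make that last point concrete, a hand-rolled MARC-8 converter is essentially a lookup table built from the published code charts. A minimal sketch of the idea, where the table entries are placeholders (ASCII passthrough) rather than real MARC-8 assignments:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class Marc8Sketch
{
    // Placeholder entries only -- a real table would be built from the
    // Library of Congress MARC-8 code charts, not from this example.
    static readonly Dictionary<byte, char> Map = new Dictionary<byte, char>
    {
        { 0x41, 'A' }, // the ASCII range maps straight through
        { 0x42, 'B' },
    };

    static string Marc8ToUnicode(byte[] input)
    {
        // A real converter would also have to reorder MARC-8 combining
        // diacritics, which precede the letter they modify -- the
        // opposite of Unicode's ordering.
        var sb = new StringBuilder();
        foreach (byte b in input)
        {
            char c;
            sb.Append(Map.TryGetValue(b, out c) ? c : '\uFFFD');
        }
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(Marc8ToUnicode(new byte[] { 0x41, 0x42 })); // AB
    }
}
```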