Character conversion in .NET

I had someone who uses UniMARC ask me about some problems they were having with the conversion from UniMARC to Dublin Core.  Some of the characters were being skewed when the translation occurred.  The problem, of course, relates to the characterset that the UniMARC records are encoded in.  Because a number of MARC formats utilize the MARC8 characterset, I've coded MarcEdit to expect either MARC8 or UTF8 when moving to XML.  The reason is that the MARC8 characterset overlaps with the ISO 8859-1 (cp 1252) character map.  Since the UniMARC data was in the ISO 8859-1 encoding, the overlapping elements were translated into Unicode as if they were in MARC8.  Ugh.

Fortunately, MarcEdit provides a facility that allows users to migrate their data between other charactersets and UTF8.  This layer allows users to move data from any supported characterset (any available Windows code page) to UTF8 and then back to MARC8 if necessary.  So in this case, I was able to recommend that the user run the character conversion tool to convert the data into UTF8 and then process the file into XML.  It adds one extra step, but it works for now.  In the near future I'll likely add an option in the Preferences to let users set a default characterset, which will allow the MarcEngine to handle these problems internally when moving between MARC and XML.
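To see why the assumed characterset matters, here's a quick sketch (the byte values and code pages are just examples, not data from the user's records): the very same bytes come out as different characters depending on which code page you decode them with.

byte[] raw = new byte[] { 0x66, 0xF8, 0x72 };  // "før" if the bytes were written in cp 1252

// the same bytes read as two different code pages
// (on .NET Core / .NET 5+ the legacy code pages also need
// System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance))
string asWestern = System.Text.Encoding.GetEncoding(1252).GetString(raw);
string asCentralEuropean = System.Text.Encoding.GetEncoding(1250).GetString(raw);

System.Console.WriteLine(asWestern);           // før
System.Console.WriteLine(asCentralEuropean);   // fřr -- same bytes, skewed characters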

One of the things I've been happy with in .NET has been the ease of moving between charactersets.  As some of you may know, the MARCEngine in MarcEdit has traditionally been written in assembler, which meant that I wrote most of my own character conversions, making the process fairly tedious.  In C#, however, this is handled in a couple of lines of code.  For example, if I were opening a file in Windows codepage 1252 and needed to convert it to codepage 1250 or even UTF8:

string s = "";
byte[] inBytes;   // "in" and "out" are reserved words in C#, so the buffers get different names
byte[] outBytes;
System.IO.StreamReader reader = new System.IO.StreamReader(@"c:\test1252.txt", System.Text.Encoding.GetEncoding(1252));
System.IO.StreamWriter writer = new System.IO.StreamWriter(@"c:\testutf8.txt", false, System.Text.Encoding.UTF8);

//Read the file in, convert the cp 1252 bytes to UTF8, and write it back out
s = reader.ReadToEnd();
inBytes = System.Text.Encoding.GetEncoding(1252).GetBytes(s);
outBytes = System.Text.Encoding.Convert(System.Text.Encoding.GetEncoding(1252), System.Text.Encoding.UTF8, inBytes);
writer.Write(System.Text.Encoding.UTF8.GetString(outBytes));

reader.Close();
writer.Close();
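As an aside, since the StreamReader has already decoded the cp 1252 bytes into a Unicode string, the explicit Encoding.Convert round-trip isn't strictly necessary; the same conversion can be sketched more compactly (same example file names as above):

// StreamReader decodes the cp 1252 bytes into a Unicode string and
// StreamWriter re-encodes it as UTF8 on the way back out.
using (System.IO.StreamReader reader = new System.IO.StreamReader(@"c:\test1252.txt", System.Text.Encoding.GetEncoding(1252)))
using (System.IO.StreamWriter writer = new System.IO.StreamWriter(@"c:\testutf8.txt", false, System.Text.Encoding.UTF8))
{
    writer.Write(reader.ReadToEnd());
}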



Comments

5 responses to “Character conversion in .NET”

  1. Matt

    Terry,

    I have needed something like this for a while, but I have a problem with your code. I cannot find any reference to System.Text.Convert in any namespaces. Actually, I searched all of Microsoft and found nothing either. Can you verify this line or tell me what references you have in your project to compile this code?

    Thanks,

    Matt

  2. Administrator

    Yep, you're right. When I copied the code over, I left out the Encoding. I've corrected it above. The correct call is System.Text.Encoding.Convert.

    –TR

  3. Marko

    Thank you very much

  4. uma

    Hi, I have a .mrc file with 30,000 records. Can I use the same code for that? Sorry to ask like this, but I am new to coding and wanted some help.

  5. Administrator

    For MARC records, no. MARC-8, the legacy characterset, is an artificial code page that doesn't actually exist except as code points in a documentation manual. Mapping between MARC-8 and UTF8 involves analysing the individual code points and making the conversions yourself (see the sketch after the comments).

    Sorry.
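To give a sense of what that hand-rolled mapping looks like, here's a very rough sketch. This is not MarcEdit's actual code: the two ANSEL code points shown are only examples and should be checked against the Library of Congress MARC-8 code tables, and a real converter also has to handle the escape sequences that switch character sets and reorder combining diacritics (which precede the base letter in MARC-8 but follow it in Unicode).

// Tiny, illustrative slice of an ANSEL/MARC-8 lookup table
System.Collections.Generic.Dictionary<byte, string> ansel = new System.Collections.Generic.Dictionary<byte, string>();
ansel[0xB5] = "\u00E6";  // ae digraph (æ) -- verify against the LC code tables
ansel[0xB2] = "\u00F8";  // o with slash (ø) -- verify against the LC code tables

byte[] marc8 = new byte[] { 0x61, 0xB5, 0x62 };  // 'a', MARC-8 æ, 'b'
System.Text.StringBuilder sb = new System.Text.StringBuilder();
foreach (byte b in marc8)
{
    if (b < 0x80)
        sb.Append((char)b);       // plain ASCII passes straight through
    else if (ansel.ContainsKey(b))
        sb.Append(ansel[b]);      // mapped extended code point
    else
        sb.Append('?');           // unmapped -- real code shouldn't drop these silently
}
System.Console.WriteLine(sb.ToString());  // aæb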