MarcEdit 5.0 and MARC Proposal #2006-04

By reeset / In MarcEdit, Programming

At Midwinter this year, MARBI will be discussing MARC Proposal #2006-04. This proposal deals with the mapping of Unicode characters back to the MARC-8 character set. The proposal lays out a number of options, settling on the recommendation that systems doing character remapping should use a placeholder character for Unicode characters not in the MARC-8 character set. It’s a lossy conversion (i.e., you will not be able to translate the data back to Unicode), but it’s the cheapest and easiest method to implement. Personally, I wish MARBI had recommended the use of Numeric Character References (NCRs), since they are already used in the XML world and would have made moving data from XML to MARC-8 easier. Now data will need to be processed twice for this conversion to take place.
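To make the difference between the two approaches concrete, here is a minimal Python sketch. It is not MarcEdit code: plain ASCII stands in for the MARC-8 repertoire, the function names are made up for illustration, and the "|" fill character is only an assumption, since the proposal has not settled on one yet.

```python
# Illustrative only. ASCII stands in for the MARC-8 repertoire (an assumption;
# the real MARC-8 character set is larger than this).
def in_target_repertoire(ch: str) -> bool:
    return ord(ch) < 128


def to_placeholder(text: str, fill: str = "|") -> str:
    """Lossy: every unmappable character becomes the same fill character.

    The "|" default is an assumption, not a value taken from the proposal.
    """
    return "".join(ch if in_target_repertoire(ch) else fill for ch in text)


def to_ncr(text: str) -> str:
    """Recoverable: every unmappable character becomes a hex NCR."""
    return "".join(
        ch if in_target_repertoire(ch) else f"&#x{ord(ch):04X};" for ch in text
    )


sample = "Tokyo 東京"
print(to_placeholder(sample))  # Tokyo ||
print(to_ncr(sample))          # Tokyo &#x6771;&#x4EAC;
```

With the placeholder, both characters collapse to the same "|", so nothing in the output says which Unicode characters were there. The NCR output still carries the code points, which is why it can be turned back into Unicode later.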

So what does this mean for MarcEdit? Well, at this point, the UTF-8 to MARC-8 conversion tool does use NCRs and will continue to allow individuals to do so. However, once this proposal is finalized and a fill character has been agreed upon, I will set up a place in the options to allow users to specify how the UTF-8 translation should work. By default, the translation will use the fill character (since that will be the blessed proposal), but it will also still allow you to use NCRs if desired. Internally, MarcEdit will still use NCRs, particularly when moving information from XML to MARC-8, since the MarcEdit compiler has been created to read NCRs.
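For anyone wondering what "reading NCRs" involves, here is a rough Python sketch of the idea. It is not how the MarcEdit compiler is actually written; the `decode_ncrs` name and the choice to leave malformed references untouched are assumptions made for the example.

```python
import re

# Matches hex (&#x6771;) and decimal (&#26481;) numeric character references.
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_ncrs(text: str) -> str:
    """Replace each numeric character reference with the character it names.

    Hypothetical helper for illustration; references that are not valid
    code points are left in the text as-is.
    """
    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Out of Unicode range: keep the original reference untouched.
            return match.group(0)

    return _NCR.sub(replace, text)


print(decode_ncrs("Tokyo &#x6771;&#x4EAC;"))  # Tokyo 東京
```

Because the reference carries the code point itself, the data survives a trip through MARC-8 and back, which is not possible once a fill character has replaced it.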

When I get back from Midwinter, I’ll post some more information on the decision (if one is made) and will turn on the changes. MarcEdit can currently accommodate all of these proposals, but I’m waiting for a final word before turning them on.