Just an fyi —
I received some test files today for a small set of records where some diacritics were being mangled when translating data from MARCXML to MARC in MARC-8. I investigated, and there is a very narrow bug in the application that I’ve corrected and will post as an update this evening. The problem only occurs when doing the following:
- Have a MARC record already in UTF-8
- Convert the UTF-8 MARC record to MARCXML with a narrow band of characters (about 3 combining ANSEL characters, and one of these characters must be the last character in the byte stream)
The problem was easy to correct but should never have shown up in the first place. It was spawned from a lack of test data (we don’t have a Unicode database, so all my tests generating MARCXML records have been with MARC-8 records that are converted to UTF-8 on the fly during the conversion). So here’s why it happened: because MarcEdit must support both the MARC-8 and UTF-8 character sets, the program includes code to handle data at a byte level (in the case of MARC data) and a character level (for XML data). Rather than forcing users to specify the character set used within a set of records, I’ve made the MARCEngine smart enough to detect what character set is in use (a real pain, when you consider that MARC records cannot use a BOM marker to designate the character set, so instead a byte’s characteristics must be evaluated to determine its character set). Yes, there are MARC fields for setting the character set, but I find that they are unreliable (i.e., unused by most systems). So, the program uses a custom algorithm that can read a set of bytes and determine if those bytes make up a UTF-8 character. Within the part of the MARCEngine that handles MARC data processing (the Maker and Breaker), this algorithm works fine (which is why the error doesn’t affect this part of the program). However, in the XML API, I tried to create a “lite” algorithm that wasn’t quite robust enough. So, I’ve taken the old algorithm out, inserted the robust algorithm, and everything works again.
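For anyone curious what that kind of byte-level detection involves, here’s a rough sketch of the general technique in Python (MarcEdit isn’t written in Python, and this is not its actual algorithm, just an illustration of the idea): UTF-8’s multi-byte characters follow a strict lead-byte/continuation-byte pattern, so malformed or truncated sequences, like a multi-byte character cut off at the very end of the byte stream, give the game away.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristically decide whether a byte stream is UTF-8.

    Every multi-byte UTF-8 character has a lead byte (110xxxxx,
    1110xxxx, or 11110xxx) followed by the right number of
    continuation bytes (10xxxxxx). Pure ASCII is valid UTF-8 too,
    but tells us nothing, so this sketch only answers True when it
    finds at least one well-formed multi-byte sequence and no
    malformed ones; anything else would be treated as MARC-8.
    """
    i = 0
    saw_multibyte = False
    while i < len(data):
        b = data[i]
        if b < 0x80:                  # plain ASCII byte
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:         # lead byte of a 2-byte sequence
            n = 1
        elif 0xE0 <= b <= 0xEF:       # lead byte of a 3-byte sequence
            n = 2
        elif 0xF0 <= b <= 0xF4:       # lead byte of a 4-byte sequence
            n = 3
        else:                         # not a legal UTF-8 lead byte
            return False
        # A sequence truncated at the end of the stream is malformed --
        # exactly the sort of edge case described above.
        if i + n >= len(data):
            return False
        if any(not (0x80 <= data[i + j] <= 0xBF) for j in range(1, n + 1)):
            return False
        saw_multibyte = True
        i += n + 1
    return saw_multibyte
```

The real complication, as noted above, is that MARC-8 bytes in the 0x80–0xFF range can accidentally line up to look like valid UTF-8, which is why a production detector has to be more thorough than a “lite” version.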
In the process of testing the algorithms, I’ve found one other error that I’ve corrected as well. One of the difficulties with doing UTF-8 translation is that there are multiple code points for some characters. If they are translated from MARC-8 to UTF-8 using the LC-defined tables, they will be one code point; if they are entered from a keyboard, they will be another code point. To accommodate this, I’d created a catch function to handle these characters, and it ended up trapping some characters that shouldn’t have been trapped. This affects any translation from UTF-8 to MARC-8, though it was limited primarily to Cyrillic scripts (since their translated characters matched the extra code points). I’ve corrected this as well, and this too will be in tonight’s update.
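To see why one character can arrive as two different code points: an accented letter like “é” exists in Unicode both as a single precomposed code point and as a base letter plus a combining accent (combining sequences being the kind of thing ANSEL diacritics map to). A small Python illustration of the two forms (again just an illustration, not MarcEdit’s code):

```python
import unicodedata

# Two legitimate Unicode spellings of the same visible character:
precomposed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # 'e' + U+0301 COMBINING ACUTE ACCENT

# They are unequal byte-for-byte...
assert precomposed != decomposed

# ...but Unicode normalization maps each onto the other form,
# which is why a UTF-8 -> MARC-8 converter has to catch both:
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

A catch function for the “extra” code points has to be careful to trap only true alternate spellings; trapping a character that merely shares a code point with one of them is exactly the kind of over-matching described above.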
Anyway, sorry for missing these two (I guess I’m fortunate that most systems still don’t export and import Unicode data); I will have it uploaded and fixed tonight. I’m also going to spend some time downloading some records from OCLC to get 880s so that I can test other character sets as well.