Thoughts on NACO's proposed process for updating CJK records

I would like to take a few minutes to share my thoughts about an updated best practice recently posted by the PCC and NACO related to CJK records. The update is found here: https://www.loc.gov/aba/pcc/naco/CJK/CJK-Best-Practice-NCR.docx. I'm not certain whether this is active policy or simply a proposal, but I've been having a number of private discussions with members at the Library of Congress and the PCC as I've tried to understand the genesis of this policy change. I personally believe that formally adopting a policy like this would be exceptionally problematic, and I wanted to flesh out my thoughts on why, along with some better options that could fix the issue this proposal is attempting to solve.

But first, I owe some folks an apology. In chatting with some folks at LC (because, let's be clear, this proposal was created specifically because local, limiting practices at LC are artificially complicating this work), it came to my attention that the individuals who spent a good deal of time considering and creating this proposal have received some unfair criticism, and I think I bear a lot of responsibility for that. I have done work creating best practices and standards, and it's thankless, difficult work. Because of that, in cases where I disagree with a particular best practice, my preference has been to address those concerns privately and attempt to understand the practice and share my issues with it. That is what I have been doing related to this work. However, on the MarcEdit list (a private list), when a request was made for a MarcEdit feature to support this work, I was less thoughtful in my response, because the proposed change could fundamentally undo almost a decade of work; I have dealt with thousands of libraries stymied by these kinds of best practices, which carry significant unintended consequences. My regret is that I've been told my thoughts shared on the MarcEdit list have been used by others in more public spaces to take this committee's work to task. This is unfortunate and disappointing, and something I should have been more thoughtful about in my responses on the MarcEdit list, especially given that every member of that committee is doing this work as a service to the community. I know I forget that sometimes. So, to the folks who did this work: I've not followed (or seen) any feedback you may have received, but inasmuch as I'm sure I played a part in any pushback you may have received, I'm sorry.

What does this proposal seek to solve?

If you look at the proposal, I think the writers do a good job identifying the issue. Essentially, the issue is unique to authority records. At present, NACO still requires that records created within the program use only UTF8 characters that fall within the MARC-8 repertoire. OCLC, the pipeline for creating these records, enforces this rule by invalidating records with UTF8 characters outside the MARC8 range. The proposal seeks to address this by encouraging the use of NCR (Numeric Character Reference) data in UTF8 records as a way to work around this restriction.
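
To make the mechanics concrete, here is a small illustration of my own (the characters are arbitrary examples I've chosen, not drawn from the proposal) showing the difference between a heading stored as real UTF8 and the same heading carried as NCR text inside a record that claims to be UTF8:

    # A CJK heading expressed two ways: as native UTF-8 code points and as
    # the Numeric Character Reference (&#xXXXX;) text that would be embedded
    # inside otherwise UTF-8 authority records.
    heading_utf8 = "\u8b1b\u5ea7"        # the characters stored as real code points
    heading_ncr = "&#x8B1B;&#x5EA7;"     # the same two characters as NCR text

    print(heading_utf8)                  # the two CJK characters
    print(heading_ncr)                   # &#x8B1B;&#x5EA7;
    print(heading_utf8 == heading_ncr)   # False: a system sees two different strings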

So, in a nutshell, that is the problem, and that is the proposed solution. But before we move on, let's talk a little bit about how we got here. This problem exists because of what I believe to be an extremely narrow and unproductive reading of what the MARC8 repertoire actually means. For those not in libraries, MARC8 is essentially a made-up character encoding, used only in libraries, that has long outlived its usefulness. Modern systems have largely stopped supporting it outside of legacy ingest workflows. The issue is that for every academic or national library that has transitioned to UTF8, hundreds of small libraries and organizations around the world have not. MARC8 continues to exist because the infrastructure that supports these smaller libraries is built around it.

But again, I think it is worth asking what the MARC8 repertoire actually is today. Previously, this had been a hard set of defined values. That changed around 2004, when LC updated its guidance and introduced the concept of NCRs to preserve lossless data transfer between fully UTF8-compliant systems and older MARC8 systems. NCRs in MARC8 were workable because they left local systems free to handle (or not handle) the data as they saw fit, and they finally provided an avenue for the library community as a whole to move past the limitations MARC8 was imposing on systems. They eased the flow of data into UTF8-compliant non-MARC formats and provided a pathway for reusing data from other metadata formats in MARC records. I would argue that today the MARC8 repertoire includes NCR notation, and to assume or pretend otherwise is shortsighted and revisionist.
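
To make the lossless technique concrete, here is a rough sketch of the round trip as I understand it. The in_marc8_repertoire predicate is a stand-in I've invented for this example; the real repertoire test is considerably more involved.

    import re

    def to_marc8_lossless(text, in_marc8_repertoire):
        """Downgrade a UTF-8 string for a MARC-8 system, replacing any
        character outside the MARC-8 repertoire with an NCR (&#xXXXX;)."""
        return "".join(
            ch if in_marc8_repertoire(ch) else "&#x{:04X};".format(ord(ch))
            for ch in text
        )

    def ncr_to_utf8(text):
        """Restore NCRs to real Unicode characters when moving the data back
        into a fully UTF-8 environment."""
        return re.sub(r"&#x([0-9A-Fa-f]+);",
                      lambda m: chr(int(m.group(1), 16)),
                      text)

    # e.g. to_marc8_lossless("\u03a9", lambda ch: ord(ch) < 128) -> "&#x03A9;"
    #      ncr_to_utf8("&#x03A9;")                              -> "\u03a9"

The point of the sketch is that the NCRs live on the MARC8 side of the exchange; a fully UTF8 system never needs them.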

But why is all of this important? Well, it is at the heart of the problem we find ourselves in. For authority data, the Library of Congress appears to have adopted this very narrow view of what MARC8 means (against its own stated recommendations), and as a result, NACO and OCLC place artificial limits on the pipeline. There are lots of reasons why LC does this. I recognize they move slowly because any changes they make are often met with some level of resistance from members of our community, but in this case, this paralysis is causing the community more harm than good.

Why is this proposal problematic?

So, this is the environment we are working in and the issue this proposal sought to solve. The issue, however, is that the proposal attempts to solve this problem by adopting a MARC8 solution and applying it within UTF8 data, essentially making the case that NCR values can be embedded in UTF8 records to ensure lossless data entry. And while I can see why someone might think that would work, the assumption is fundamentally incorrect. When LC developed its guidance on NCR notation, that guidance was specifically directed at the lossless translation of data into MARC8. UTF8 data has no need for NCR notation. This does not mean the notation does not sometimes show up, and as a practical matter, I've spent thousands of hours working with libraries dealing with the issues it creates in local systems. Aside from the problems it causes in MARC systems around indexing and discovery, it makes the data almost impossible to use outside of that system and at migration time. In thinking about the implications of this change in the context of MarcEdit, I had the following specific concerns:

  1. NCR data in UTF8 records would break existing workflows for users of current-generation systems, which have no reason to expect this kind of data to be present in UTF8 MARC records.
  2. It would make normalization virtually impossible and would potentially reintroduce a problem I spent months solving for organizations related to how UTF8 data is normalized and introduced into local systems (see the sketch following this list).
  3. It would break many of the transformation options.  MarcEdit allows for the flow of data to many different metadata formats, and all of these are built on the premise that the first thing MarcEdit does is clean up character encodings to ensure the output data is in UTF8.
  4. MarcEdit is used by ~20k active users and ~60k annual users.  Over 1/3 of those users do not use MARC21 and do not use MARC-8.  Allowing the mixing of NCRs and UTF8 data potentially breaks functionality for broad groups of international users.
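
To illustrate the normalization concern in the second item, here is a minimal sketch of my own (not MarcEdit code, and using a non-CJK character purely for readability). Once NCR text is mixed into data declared as UTF8, Unicode normalization can no longer see the affected characters, so otherwise identical headings stop matching:

    import unicodedata

    native = "P\u00e9rez"    # "Pérez" stored as real UTF-8 code points
    mixed = "P&#xE9;rez"     # the same heading with the é hidden in an NCR

    # Normalization operates on code points; it cannot see a character that
    # is spelled out as the literal text "&#xE9;", so the headings never match.
    print(unicodedata.normalize("NFC", native) ==
          unicodedata.normalize("NFC", mixed))    # False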

While I very much appreciate the issue this is attempting to solve, I've spent years working with libraries where this kind of practice would introduce long-term data issues that are very difficult to identify and fix, and that often show up unexpectedly when it comes time to migrate or share this information with other services, communities, or organizations.

So what is the solution?

I think we can address this issue on two fronts. First, I would advise NACO and OCLC to stop limiting data entry to this very narrow notion of the MARC8 repertoire. In all other contexts, OCLC provides the ability to enter any valid UTF8 data. The current limit within the authority process is artificial and unnecessary. OCLC could easily remove it, and NACO could amend its process to allow record entry using any valid UTF8 character. This would address the problem this group was attempting to solve for catalogers creating these records.

The second step could take two forms. If LC continues to ignore its own guidance and cleave to an outdated concept of the MARC8 repertoire, OCLC could provide to LC, through the pipeline, a version of the records in which the data includes NCR notation for use in LC's own systems. Were that the practice, I would not recommend using LC as a trusted source for downloading authorities unless I had a local process in place to remove any NCR data found in otherwise valid UTF8 records. Essentially, we would treat LC's requirements as a disease and quarantine them, and their influence, within this process. Of course, what would be more ideal is LC deciding to accept UTF8 data without restrictions and relying on applicable guidance and MARC21 best practice: supporting UTF8 data fully and, for those still needing MARC8 data, providing that data using the lossless NCR process (per LC's own recommendations).
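
For anyone who does end up pulling records from a source that embeds NCRs in data declared as UTF8, a local quarantine step might look something like the sketch below. This is my own illustration of the kind of cleanup I describe above (records are plain strings here purely for simplicity), not an existing tool or an endorsement of the practice.

    import re

    NCR = re.compile(r"&#x([0-9A-Fa-f]+);")

    def quarantine_or_repair(records):
        """Split incoming 'UTF-8' records into clean ones and ones carrying
        NCR debris, decoding the debris before it reaches the local system."""
        clean, repaired = [], []
        for rec in records:
            if NCR.search(rec):
                repaired.append(NCR.sub(lambda m: chr(int(m.group(1), 16)), rec))
            else:
                clean.append(rec)
        return clean, repaired

    # e.g. quarantine_or_repair(["P\u00e9rez", "P&#xE9;rez"])
    #      -> (["Pérez"], ["Pérez"])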

Conclusion

Ultimately, this proposal is a recognition that the current NACO rules and process are broken, and broken in a way that actively undermines other work in the PCC around linked data development. And while I very much appreciate the thoughtful work that went into considering a different approach, I think the unintended side effects would cause more long-term damage than any short-term gains would justify. What we need is for the principals to rethink why these limitations are in place and, honestly, to really consider ways to start deemphasizing the role LC plays as a standards holder if, in that role, LC's presence continues to be an impediment to moving libraries forward.

