MarcEdit 7.2: Generating LCSH for Records

For about the past year, I’ve been interested in building a tool that would take records that include an LC call number and generate LCSH headings for them. In theory, I’ve thought this was generally possible. Library of Congress call numbers roughly correspond to parts of LCSH, at least for the primary subject, location, or subject/location pair. The problem is that this correspondence isn’t granular, and it doesn’t expose any subjects or topics that might be related to a specific item. And then, of course, there is the pesky issue that, by and large, most of the data I would need to poll doesn’t exist in a place that is easy to automate. Sure, LC makes its data available, but oftentimes as PDFs. The Linked Data tools don’t cover my need, and the resources that would be really, really helpful are locked up in tools like Cataloger’s Desktop. So, I let this one go for a little while so I could ruminate on the problem.

My solution

While letting this go, I spent this year doing some work within MarcEdit to enable the creation, loading, and management of fairly small knowledge graphs (1 million items or fewer) within the application. These are built in memory and are performant. The work is built upon a couple of .NET components (like dotnetrdf) and some glue written into MarcEdit’s linked data platform to allow me to think about how I might address the import, creation, and editing of records using application profiles (so I could think about Bibframe or whatever comes after). However, in completing this work, I had an idea as to how I might be able to address the LCSH generation. Since it would be almost impossible to actually generate subjects out of thin air, the next best thing would be to take information about a record, develop a knowledge graph of information related to that record, and then extract the common subjects that are most likely applicable to the record at hand.

I started writing workflows on my whiteboards at home, and came up with something that roughly looks like the following:

Essentially, to “generate subjects”, the tool starts with a query derived from breaking down the LC call number found within the record. Using that as a starting point, the tool queries either WorldCat (using the Search API) or the U.S. Library of Congress (currently with Z39.50, but I’d love to transition to SRU if the call number index could be enabled) to get an initial set of records. From there, the tool breaks down the records and starts to build a graph, looking for common threads. As threads are discovered, new queries are spawned. This occurs quickly, and across asynchronous threads. Once a corpus is set, the tool evaluates the available subjects to select those that meet a specific threshold and have commonalities. This means that the generation process isn’t static, so generating subjects across the same set of records could result in minor differences between the suggestions, as a thread ignored in building the graph may be promoted on a second run. The suggestions are near, but not always exactly, the same (I don’t recommend multiple runs; I just found this interesting).
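To make the two ends of that workflow a bit more concrete, here is a minimal Python sketch. It is not MarcEdit’s code (MarcEdit is a .NET application, and its internals aren’t shown here). The function names, the regular expression, and the 50% threshold are all assumptions made for illustration. The sketch shows one way a call number might be broken down into progressively broader queries, and one way subjects that meet a threshold across a set of retrieved records might be selected. The graph building and the remote WorldCat or Z39.50 queries are left out entirely.

```python
import re
from collections import Counter

# Hypothetical pattern: 1-3 class letters, then an optional class number.
# Cutters and dates after the class number are ignored in this sketch.
_LC_CLASS = re.compile(r"^\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)?")


def call_number_queries(call_number):
    """Return search keys for an LC call number, most specific first."""
    match = _LC_CLASS.match(call_number.upper())
    if not match:
        return []
    letters, number = match.group(1), match.group(2)
    queries = []
    if number:
        queries.append(f"{letters}{number}")
        if "." in number:
            # Drop the decimal extension to broaden the query.
            queries.append(f"{letters}{number.split('.')[0]}")
    queries.append(letters)
    return queries


def suggest_subjects(records, threshold=0.5):
    """Pick subjects that appear in at least `threshold` of the records.

    `records` is a list of subject-heading lists, one list per retrieved
    record. The threshold value is an assumption, not MarcEdit's setting.
    """
    if not records:
        return []
    counts = Counter()
    for subjects in records:
        counts.update(set(subjects))  # count each subject once per record
    minimum = threshold * len(records)
    return sorted(s for s, n in counts.items() if n >= minimum)
```

For example, `call_number_queries("QA76.73.P98 L88 2019")` yields the keys `QA76.73`, `QA76`, and `QA`, which could be tried in turn until a query returns enough records. A simple frequency count like this one is deterministic, unlike the process described above, where asynchronous threads can change which records end up in the corpus.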

But does it work?

Well, if you are able to live with the caveat that these subjects are generated based on similarities to other records and are not generated out of thin air, I’ve found that this approach works pretty well. It works best if you have a WorldCat API key (or can get one), as the corpus of records to query is much larger and the response time is a lot better than Z39.50 (maybe SRU would solve the issue). You also need to be able to live with the process not being super fast (it’s not).

And if I don’t have Call Numbers?

Well, MarcEdit and OCLC have you covered. MarcEdit leverages OCLC’s Classify API, allowing you to build call numbers.

OCLC makes this tool public under very permissive usage terms. So, you can use this tool to generate call numbers, and then use the Generate LCSH tool to generate subjects. To make sure this works, I pulled a sample of 1200 records from Harvard’s open MARC record set. Within the set, there were only 400ish LC call numbers. Using the call number tool, that number rises to 1100. I deleted all subjects, and then asked the tool to generate subjects. The tool then created suggestions for 90% of the records with call numbers, taking ~4 minutes to generate.

When can I try this?

I’m currently allowing users in the MarcEdit community to test the work. I made a call and have emailed those interested with links to the beta software for testing. I’m hoping that within two weeks or so, I’ll hear from folks that this is working as expected, and I’ll move the tool into a production release of MarcEdit.

Can I learn more?

I posted about this work on Twitter over the weekend. You can see:

Original Post with initial wireframes: https://twitter.com/reese_terry/status/1203175593317163009
Video using the US Library of Congress (no sound): https://twitter.com/reese_terry/status/1203729634841509892
Video using WorldCat (no sound): https://twitter.com/reese_terry/status/1203750861110960128

Otherwise, feel free to ask questions. I’ll answer the best I can.

–tr
