MarcEdit Validate Headings: Part 2

By reeset / On / In MarcEdit

Last week, I posted an update that included the early implementation of the Validate Headings tool.  After a week of testing, feedback and refinement, I think that the tool now functions in a way that will be helpful to users.  So, let me describe how the tool works and what you can expect when the tool is run.

Background:

The Validate Headings tool was added as a new report to the MarcEditor to enable users to take a set of records and get back a report detailing how many records had corresponding Library of Congress authority headings.  The tool was designed to validate data in the 1xx, 6xx, and 7xx fields.  The tool has been set to only query headings and subjects that utilize the LC authorities.  At some point, I’ll look to expand to other vocabularies.

How does it work

Presently, this tool must be run from within the MarcEditor – though at some point in the future, I’ll extract this out of the MarcEditor, and provide a stand alone function and a integration with the command line tool.  Right now, to use the function, you open the MarcEditor and select the Reports/Validate Headings menu.

image

Selecting this option will open the following window:

image

Options – you’ll notice 3 options available to you.  The tool allows users to decide what values that they would like to have validated.  They can select names (1xx, 600,10,11, 7xx) or subjects (6xx).  Please note, when you select names, the tool does look up the 600,610,611 as part of the process because the validation of these subjects occurs within the name authority file.  The last option deals with the local cache.  As MarcEdit pulls data from the Library of Congress – it caches the data that it receives so that it can use it on subsequent headings validation checked.  The cache will be used until it expires in 30 days…however, a user at any time can check this option and MarcEdit will delete the existing cache and rebuild it during the current data run. 

Couple things you’ll also note on this screen. There is an extract button and it’s not enabled.  Once the Validate report is run, this button will become enabled if there are any records that are identified as having headings that could not be validated against the service. 

Running the Tool:

Couple notes about running the tool.  When you run the tool, what you are asking MarcEdit to do is process your data file and query the Library of Congress for information related to the authorized terms in your records.  As part of this process, MarcEdit sends a lot of data back and forth to the Library of Congress utilizing the http://id.loc.gov service.  The tool attempts to use a light touch, only pulling down headings for a specific request – but do realize that a lot of data requests are generated through this function.  You can estimate approximately how many requests might be made on a specific file by using the following formula: (number of records x 2)  + (number of records), assuming that most records will have 1 name to authorize and 1 subjects per record.  So a file with 2500 records would generate ~7500 requests to the Library of Congress.  Now, this is just a guess, in my tests, I’ve had some sets generate as many as 12,000 requests for 2500 records and as few as 4000 requests for 2500 records – but 7500 tended to be within 500 requests in most test files.

So why do we care?  Well, this report has the potential to generate a lot of requests to the Library of Congress’s identifier service – and while I’ve been told that there shouldn’t be any issues with this – I think that question won’t really be known until people start using it.  At the same time – this function won’t come as a surprise to the folks at the Library of Congress – as we’ve spoken a number of times during the development.  At this point, we are all kind of waiting to see how popular this function might be, and if MarcEdit usage will create any noticeable up-tick in the service usage.

Validation Results:

When you run the validation tool, the program will go through each record, making the necessary validation requests of the LC ID service.  When the service has completed, the user will receive a report with the following information:

Validation Results:
Process completed in: 121.546001431667 minutes. 
Average Response Time from LC: 0.847667984420415
Total Records: 2500
Records with Invalid Headings: 1464
**************************************************************
1xx Headings Found: 1403
6xx Headings Found: 4106
7xx Headings Found: 1434
**************************************************************
1xx Headings Not Found: 521
6xx Headings Not Found: 1538
7xx Headings Not Found: 624
**************************************************************
1xx Variants Found: 6
6xx Variants Found: 1
7xx Variants Found: 3
**************************************************************
Total Unique Headings Queried: 8604
Found in Local Cache: 1001
***************************************************************

This represents the header of the report.  I wanted users to be able to quickly, at a glance, see what the Validator determined during the course of the process.  From here, I can see a couple of things:

  1. The tool queried a total of 2500 records
  2. Of those 2500 records, 1464 of those records had a least one heading that was not found
  3. Within those 2500 records, 8604 unique headers were queried
  4. Within those 2500 records, there were 1001 duplicate headings across records (these were not duplicate headings within the same record, but for example, multiple records with the same author, subject, etc.)
  5. We can see how many Headings were found by the LC ID service within the 1xx, 6xx, and 7xx blocks
  6. Likewise, we can see how many headings were not found by the LC ID service within the 1xx, 6xx, and 7xx blocks.
  7. We can see number of Variants as well.  Variants are defined as names that resolved, but that the preferred name returned by the Library of Congress didn’t match what was in the record.  Variants will be extracted as part of the records that need further evaluation.

After this summary of information, the Validation report returns information related to the record # (record number count starts at zero) and the headings that were not found.  For example:

Record #0
Heading not found for: Performing arts--Management--Congresses
Heading not found for: Crawford, Robert W

Record #5
Heading not found for: Social service--Teamwork--Great Britain

Record #7
Heading not found for: Morris, A. J

Record #9
Heading not found for: Sambul, Nathan J

Record #13
Heading not found for: Opera--Social aspects--United States
Heading not found for: Opera--Production and direction--United States

The current report format includes specific information about the heading that was not found.  If the value is a variant, it shows up in the report as:

Record #612
Term in Record: bible.--criticism, interpretation, etc., jewish
LC Preferred Term: Bible. Old Testament--Criticism, interpretation, etc., Jewish
URL: http://id.loc.gov/authorities/subjects/sh85013771
Heading not found for: Bible.--Criticism, interpretation, etc

Here you see – the report returns the record number, the normalized form of the term as queried, the current LC Preferred term, and the URL to the term that’s been found.

The report can be copied and placed into a different program for viewing or can be printed (see buttons).

image

To extract the records that need work, minimize or close this window and go back to the Validate Headings Window.  You will now see two new options:

image

First, you’ll see that the Extract button has been enabled.  Click this button, and all the records that have been identified as having headings in need of work will be exported to the MarcEditor.  You can now save this file and work on the records. 

Second, you’ll see the new link – save delimited.  Click on this link, and the program will save a tab delimited copy of the validation report.  The report will have the following format:

Record ID [tab] 1xx [tab] 6xx [tab] 7xx [new line]

Each column will be delimited by a colon, so if two 1xx headings appear in a record, the current process would create a single column, but with the headings separated by a colon like: heading 1:heading 2. 

Future Work:

This function required making a number of improvements to the linked data components – and because of that, the linking tool should work better and faster now.  Additionally, because of the variant work I’ve done, I’ll soon be adding code that will give the user the option to update headings for Variants as is report or the linking tool is running – and I think that is pretty cool.  If you have other ideas or find that this is missing a key piece of functionality – let me know.

–tr

One thought on “MarcEdit Validate Headings: Part 2