One of the changes in the current MarcEdit update is the introduction of a linked data rules file, which tells the program which data elements should be processed for automatic URL generation and how that data should be treated. The rules file is found in the Configs directory and is called: linked_data_profile.xml
The rules file is pretty straightforward. I haven’t created a schema for it yet, but I will, since a schema should make defining rules easier. Until then, I’ve added notes in the header of the document that describe the fields and their accepted values.
Here’s a small snippet of the file:
<?xml version="1.0" encoding="UTF-8"?>
top level: field
type: authority, bibliographic, authority|bibliographic
Value: Field value
Description: field to process
Value: Subfield codes
Description: subfields to use for matching
Values: subfield code or empty
Description: field that denotes index
Values: 1 or empty
Description: determines if field should be broken up for uri disambiguation
Description: special instructions to improve normalization for names and subjects.
Values: subfield code to include a url
Description: Used to determine which subfield is used to embed a URI
Values (see supported vocabularies section)
Description: when no index is supplied, you can predefine a supported index
Description: LC Childrens Subjects
Description: LC Demographic Terms
Description: LC Subjects
Description: Getty Arts and Architecture Thesaurus
Description: Getty ULAN
Description: LC Genre Forms
Description: LC Medium Performance Thesaurus
Description: LC NACO Terms
Description: lcsh/naf combined indexes.
Description: MESH indexes
Each rule is a field block on which you define a type. Acceptable values are: authority, bibliographic, authority|bibliographic. This tells the tool which type of record the processing rules apply to. Within the block you then define a tag, the subfields to process when evaluating for linking, a uri field (the subfield used when outputting the URI), special instructions (if there are any), whether the field is atomized (i.e., broken up so that you have one concept per URI), and vocab (to preset a default vocabulary for processing). So, for example, say a user wanted to atomize a field that currently isn’t defined as such: they would just find the processing block for the field, add <atomize>1</atomize> to the block, and that’s it.
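For anyone who would rather script that change than edit the file by hand, here is a minimal Python sketch of the same idea. It is only an illustration under stated assumptions: the sample XML is hypothetical, the root element name, the child element names other than atomize, and the placement of type as an attribute are guesses, and set_atomize is a made-up helper, not part of MarcEdit. Check the real linked_data_profile.xml for the actual names before adapting it.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules snippet. Only <atomize> is taken from the post;
# <rules>, <tag>, <subfields>, <uri>, and the type attribute are assumed
# names, and the values are purely illustrative.
SAMPLE_RULES = """
<rules>
  <field type="bibliographic">
    <tag>650</tag>
    <subfields>abvxyz</subfields>
    <uri>0</uri>
  </field>
</rules>
"""


def set_atomize(root, tag, record_type="bibliographic"):
    """Set <atomize>1</atomize> on each block matching tag and record type.

    Returns the number of blocks that were updated.
    """
    changed = 0
    for field in root.iter("field"):
        # A type such as "authority|bibliographic" applies to both record types.
        if record_type not in field.get("type", "").split("|"):
            continue
        if field.findtext("tag") != tag:
            continue
        atomize = field.find("atomize")
        if atomize is None:
            atomize = ET.SubElement(field, "atomize")
        atomize.text = "1"
        changed += 1
    return changed


root = ET.fromstring(SAMPLE_RULES)
count = set_atomize(root, "650")
print(count)
print(ET.tostring(root, encoding="unicode"))
```

The helper only touches blocks whose type covers the requested record type, so a rule defined for authority records alone is left as it was.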
The idea behind this rules file is to support the work of a PCC Task Force while they are testing embedding of URIs in MARC records. By shifting from a compiled solution to a rules-based solution, I can provide immediate feedback, and the process should be easier to customize and test.
An important note – these rules will change. They are pretty well defined for bibliographic data, but authority data is still being worked out.