Dealing with recursive case replacement in MarcEdit

**This post deals with functionality officially made available May 30, 2011.**

As more and more of our libraries outsource their bibliographic description, metadata records that would previously have been considered unusable are quietly making their way into our libraries’ ILS.  A number of common issues show up: some are largely cosmetic, and some actually affect how records are indexed for query.  For a lot of people, MarcEdit is a tool that allows them to quickly correct records before loading them into their libraries’ systems.

One issue that has become all too common when receiving records from vendors has to do with casing.  Take, for example, the following record from ProQuest (specific fields and URL redacted):

=LDR  00975nam a2200253   4500
=001  AAI0003706
=005  20110415143745.5
=006  m\\\\\\\\d\\\\\\\\
=007  cr\cn|
=008  110415s1929\\\\|||||||||||||||||\|||||\d
=035  \\$a(UMI)AAI0003706
=040  \\$aUMI$cUMI
=100  1\$aHANKINS, JOHN ERSKINE.
=245  14$aTHE POEMS OF GEORGE TURBERVILE $h[electronic resource]
 /$c EDITED WITH CRITICAL NOTES AND A STUDY OFHIS LIFE AND WORKS.
=260  \\$c1929
=300  \\$a 1 online resource (712 p.)
=500  \\$aSource: Dissertation Abstracts International,
Volume: 12-05, page: 0619.
=502  \\$aThesis (Ph.D.)--Yale University, 1929.
=650  \4$aLiterature, English.
=773  0\$tDissertation Abstracts International$g12-05.
=791  \\$aPh.D.
=792  \\$a1929

The above record demonstrates an issue that is becoming more and more common: the improper casing of primary access fields.  In this case, the 100 and the 245 have been entered entirely in upper case.

MarcEdit includes a number of methods that can help alleviate this problem.  You could write a script or use regular expressions.  Since most users tend not to use scripts as part of their normal MarcEdit work, the most widely used option tends to be regular expressions.  However, even here, the process can sometimes be laborious, because MarcEdit uses the .NET regular expression syntax, which doesn’t include handy case-switching functions (as you might find in Perl).

As such, MarcEdit’s regular expression functionality has been augmented to include two additional functions, lcase and ucase.  These functions allow you to change the case of entire regular expression groups.  This certainly helps, but the problem still isn’t a simple one.

Since this is an issue that is becoming more and more common, I’ve spent some time working to add support for redundancy matching available in the .NET framework.  What does this mean?  Well, in the record above, say we wanted to change the 100 and the 245 $a and $c so that the first letter of each word remained upper case while the remaining characters were changed to lower case.  We could now do that with a simple expression.

Changing the 100:

Using the Edit Subfield tool in the MarcEditor, you’d enter the following:

Field: 100

Subfield: a

Find: ([A-Z])([A-Z0-9']+)

Replace: $1lcase($+)

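That’s it.  The Replace syntax is where the magic occurs.  In .NET, $+ substitutes the last group that was captured (here, the second group), so wrapping it in lcase lower-cases everything after the leading capital.  Because the find is applied repeatedly across the subfield, it works as a recursive match on every word with a leading upper-case character followed by other upper-case characters.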
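For readers who want to see the mechanics outside MarcEdit, here is a minimal Python sketch of what this replacement does.  It is an emulation, not MarcEdit’s code: Python’s re module has no $+ or lcase, so a replacement function (replace_word and fix_case are names made up for this sketch) stands in for them, and the apostrophe in the character class is an assumption about what the Find pattern intends.

```python
import re

# Same idea as the Find box: one capital letter, then one or more
# capitals, digits, or apostrophes (the apostrophe is an assumption).
FIND = re.compile(r"([A-Z])([A-Z0-9']+)")


def replace_word(match):
    # "$1"        -> the first group, unchanged.
    # "lcase($+)" -> the last captured group, lower-cased.
    # match.lastindex is the number of the last group that matched.
    return match.group(1) + match.group(match.lastindex).lower()


def fix_case(text):
    # re.sub applies the replacement to every non-overlapping match,
    # which is what makes the edit run across all words in the subfield.
    return FIND.sub(replace_word, text)


print(fix_case("HANKINS, JOHN ERSKINE."))  # Hankins, John Erskine.
```

Note that a one-letter word such as “A” is left alone, because the pattern requires at least one character after the leading capital.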

Changing the 245 $a and $c:

The process is the same, though we perform the action twice: once for the $a and once for the $c.

Field: 245

Subfield: a

Find: ([A-Z])([A-Z0-9']+)

Replace: $1lcase($+)

Field: 245

Subfield: c

Find: ([A-Z])([A-Z0-9']+)

Replace: $1lcase($+)

Using the task automation tool, these three operations could be chained together and run as a single process, allowing the user to fix all issues of this type in one keystroke.
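To illustrate the idea of running the three operations as one batch, here is a rough Python sketch that works on lines in the mnemonic format shown above.  It does not use MarcEdit or its task tool; edit_subfield, run_task and OPERATIONS are hypothetical names, the apostrophe in the pattern is assumed, and the subfield splitting is deliberately naive (it treats every “$” as a subfield delimiter and expects each field on a single line).

```python
import re

# Find pattern from the post; the apostrophe in the class is an assumption.
FIND = re.compile(r"([A-Z])([A-Z0-9']+)")

# The three Edit Subfield operations, as (field, subfield) pairs.
OPERATIONS = [("100", "a"), ("245", "a"), ("245", "c")]


def lower_tail(match):
    # Stand-in for "$1lcase($+)": keep group 1, lower-case the last group.
    return match.group(1) + match.group(match.lastindex).lower()


def edit_subfield(line, tag, code):
    """Apply the find/replace to one subfield of a mnemonic field line."""
    if not line.startswith("=" + tag):
        return line
    # Text before the first "$" is the tag and indicators; leave it alone.
    head, *subfields = line.split("$")
    fixed = [
        sf[0] + FIND.sub(lower_tail, sf[1:]) if sf[:1] == code else sf
        for sf in subfields
    ]
    return "$".join([head] + fixed)


def run_task(lines):
    """Run every operation, in order, over every line of the record."""
    for tag, code in OPERATIONS:
        lines = [edit_subfield(line, tag, code) for line in lines]
    return lines


record = [
    r"=100  1\$aHANKINS, JOHN ERSKINE.",
    r"=650  \4$aLiterature, English.",
]
for line in run_task(record):
    print(line)
```

Only the listed field and subfield pairs are touched, so a subfield such as the 245 $h passes through unchanged.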

Running the above operations on our test record returns the following result:

=LDR  00975nam a2200253   4500
=001  AAI0003706
=005  20110415143745.5
=006  m\\\\\\\\d\\\\\\\\
=007  cr\cn|
=008  110415s1929\\\\|||||||||||||||||\|||||\d
=035  \\$a(UMI)AAI0003706
=040  \\$aUMI$cUMI
=100  1\$aHankins, John Erskine.
=245  14$aThe Poems Of George Turbervile $h[electronic resource]
/$c Edited With Critical Notes And A Study Ofhis Life And Works.
=260  \\$c1929
=300  \\$a 1 online resource (712 p.)
=500  \\$aSource: Dissertation Abstracts International,
Volume: 12-05, page: 0619.
=502  \\$aThesis (Ph.D.)--Yale University, 1929.
=650  \4$aLiterature, English.
=773  0\$tDissertation Abstracts International$g12-05.
=791  \\$aPh.D.
=792  \\$a1929

As you can see from the example above, the regular expressions readjust the case for these elements.  The regular expression could be refined further, but in my experience, the above satisfies the most common requests that I’ve seen from catalogers dealing with these records.

The change to MarcEdit, while fairly simple code-wise, allows MarcEdit to make use of a wide variety of advanced pattern replacement functions found in the .NET framework.  As of this update, the following substitutions can be used within MarcEdit and paired with the lcase and ucase functions:

$&  Substitutes a copy of the whole match.
    Pattern: (\$*(\d*(\.+\d+)?){1})
    Replacement: **$&
    Input: “$1.30”
    Result: “**$1.30**”

$`  Substitutes all the text of the input string before the match.
    Pattern: B+
    Replacement: $`
    Input: “AABBCC”
    Result: “AAAACC”

$'  Substitutes all the text of the input string after the match.
    Pattern: B+
    Replacement: $'
    Input: “AABBCC”
    Result: “AACCCC”

$+  Substitutes the last group that was captured.
    Pattern: B+(C+)
    Replacement: $+
    Input: “AABBCCDD”
    Result: “AACCDD”

$_  Substitutes the entire input string.
    Pattern: B+
    Replacement: $_
    Input: “AABBCC”
    Result: “AAAABBCCCC”

(Table information from: http://msdn.microsoft.com/en-us/library/ewy2t5e0.aspx)
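To make the substitutions listed above concrete, the following Python sketch mimics each one with a small function over the match object.  Python’s re module does not understand these .NET tokens, so the mapping below (SUBSTITUTIONS and substitute are illustrative names, not part of MarcEdit or .NET) is only an approximation of their behavior on the simple inputs from the table.

```python
import re

# Each .NET substitution token mapped to an equivalent on a Python match.
SUBSTITUTIONS = {
    "$&": lambda m: m.group(0),                  # the whole match
    "$`": lambda m: m.string[:m.start()],        # input text before the match
    "$'": lambda m: m.string[m.end():],          # input text after the match
    "$+": lambda m: m.group(m.lastindex or 0),   # last captured group
    "$_": lambda m: m.string,                    # the entire input string
}


def substitute(pattern, token, text):
    """Replace every match of pattern with what the token stands for."""
    return re.sub(pattern, SUBSTITUTIONS[token], text)


print(substitute("B+", "$`", "AABBCC"))        # AAAACC
print(substitute("B+(C+)", "$+", "AABBCCDD"))  # AACCDD
```

Pairing one of these with lcase or ucase then amounts to calling .lower() or .upper() on the returned text before it is put back into the string.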

Hopefully, by augmenting this functionality, users will have one more tool in their tool chest for dealing with faulty data.

–TR


Comments

3 responses to “Dealing with recursive case replacement in MarcEdit”

  1. Matthew Beacom

    thank you, Terry. This will be a big help for many catalogers.

  2. Chris Fox

    This is great. How could the regular expressions above be modified so that only the first letter of the first word in the 245a remains capitalized, while the rest of the words are uncapitalized? In other words, in the example, how could you go from _The Poems Of George Turbervile_ to _The poems of george turbervile_ ? This would obviously require some manual re-editing to capitalize proper nouns, names, etc., but would still save a lot of time.
    Thanks.

    1. Administrator

      Actually, this may have gotten much easier. In MarcEdit, I added some shortcuts that simplify this type of case shifting. If you run MarcEdit 5.7+, you can find them in the MarcEditor under Tools/shortcuts.

      –TR