Truncating a field by a # of words in MarcEdit

By reeset / On / In Cataloging, MarcEdit

This question came up on the listserv, and I thought that it might be generically useful that other folks might find it interesting.  Here’s the question:

I’d like to limit the length of the 520 summary fields to a maximum of 100 words and adding the punctuation “…” at the end. Anyone have a good process/regex for doing this?
Example:
=520  \\$aNew York Times Bestseller Award-winning and New York Times bestselling author Laura Lippman’s Tess Monaghan—first introduced in the classic Baltimore Blues—must protect an up-and-coming Hollywood actress, but when murder strikes on a TV set, the unflappable PI discovers everyone’s got a secret. {esc}(S2{esc}(B[A] welcome addition to Tess Monaghan’s adventures and an insightful look at the desperation that drives those grasping for a shot at fame and those who will do anything to keep it.{esc}(S3{esc}(B—San Francisco Chronicle When private investigator Tess Monaghan literally runs into the crew of the fledgling TV series Mann of Steel while sculling, she expects sharp words and evil looks, not an assignment. But the company has been plagued by a series of disturbing incidents since its arrival on location in Baltimore: bad press, union threats, and small, costly on-set “accidents” that have wreaked havoc with its shooting schedule. As a result, Mann’s creator, Flip Tumulty, the son of a Hollywood legend, is worried for the safety of his young female lead, Selene Waites, and asks Tess to serve as her bodyguard. Tumulty’s concern may be well founded. Recently, a Baltimore man was discovered dead in his home, surrounded by photos of the beautiful—if difficult—aspiring star. In the past, Tess has had enough trouble guarding her own body. Keeping a spoiled movie princess under wraps may be more than she can handle since Selene is not as naive as everyone seems to think, and instead is quite devious. Once Tess gets a taste of this world of make-believe—with their vanities, their self-serving agendas, and their remarkably skewed visions of reality—she’s just about ready to throw in the towel. But she’s pulled back in when a grisly on-set murder occurs, threatening to topple the wall of secrets surrounding Mann of Steel as lives, dreams, and careers are scattered among the ruins.
So, there isn’t really a true expression that can break on number of words, in part, because how we define word boundaries will vary between different languages.  Likewise, the MARC formatting can cause a challenge.  So, the best approach is to look for good enough – and in this case, good enough is likely breaking on spaces.  My suggestion is to look for 100 spaces, and then truncate.
In MarcEdit, this is easiest to do using the Replace function.  The expression would look like the following:
Find: (=520.{4})(\$a)(?<words>([^ ]*\s){100})(.*)
Replace: $1$2${words}…
Check the use regular expressions option. (image below).
So why does this work.  Let’s break it down.
Find:
(=520.{4}) – this matches the field number, the two spaces related to the mnemonic format, and then the two indicator values.
(\$a) – this matches on the subfield a
(?<words>([^ ]*\s){100}) – this is where the magic happens.  You’ll notice two things about this.   First, I use a nested expression, and second, I name one.  Why do I do that?  Well, the reason is because the group numbering gets wonky once you start nesting expressions.  In those cases, it’s easier to name them.  So, in this case, I’ve named the group that I want to retrieve, and then have created a subgroup that matches on characters that aren’t a space, and then a space.  I then use the qualifier {100}, which means, must match at least 100 times.
(.*) — match the rest of the field.
Now when we do the replace, putting the field back together is really easy.  We know we want to reprint the field number, the subfield code, and then the group that captured the 100 units.  Since we named the 100 units, we call that directly by name.  Hence,
Replace:
$1 — prints out =520  \\
$2 — $a
${words} — prints 100 words
… — the literals
And that’s it.  Pretty easy if you know what you are looking for.
–tr