Conditional Regular Expression Replacements using substitutions in MarcEdit

A question that comes up occasionally is how to conditionally add or replace a set of character data within a MARC field.  For example, consider this use case:

I’d like to add a period to the end of a field (say, the 650 field), but only under the following conditions:

  1. The field ends with a word character (a letter or digit).
  2. The field doesn’t already end in a period or closing parenthesis.
  3. If the field ends with any other punctuation, that punctuation is replaced with a period.

Handling conditions 1 and 2 is easy and straightforward.  For those, I’d probably do something like this:

Find: (=650.*[^.)])$
Replace With: $1.

This allows MarcEdit to match any 650 field that doesn’t end in a period or a parenthesis.  However, condition 3 makes this more difficult, because the trailing punctuation has to be dropped rather than kept.  In .NET’s implementation of regular expressions, which MarcEdit uses, you can combine named groups with alternation to achieve the above result.  Consider the following data:

=650  \6$aMusique populaire$zQuébec (Province)$y1951-1960.
=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970,
=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970)
=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970;
=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970
=650  \6$aMusique populaire$zQuébec
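
As a quick check of why conditions 1 and 2 alone fall short for this data, here is a minimal Python sketch. Python’s re module stands in for MarcEdit’s .NET engine, and the helper name add_period is purely illustrative. It uses a simple pattern that captures any 650 line not ending in a period or closing parenthesis and appends a period; Python writes the back-reference as \1 where MarcEdit uses $1.

```python
import re

# Sample 650 fields from the post, in MarcEdit's mnemonic format.
records = [
    r"=650  \6$aMusique populaire$zQuébec (Province)$y1951-1960.",
    r"=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970,",
    r"=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970)",
    r"=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970;",
    r"=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970",
    r"=650  \6$aMusique populaire$zQuébec",
]

# Capture a whole 650 line whose last character is neither a period
# nor a closing parenthesis.
simple = re.compile(r"(=650.*[^.)])$")


def add_period(line):
    """Append a period to the captured line (hypothetical helper)."""
    return simple.sub(r"\1.", line)


results = [add_period(line) for line in records]
for line in results:
    print(line)
```

The lines ending in a comma or semicolon come out as 1970,. and 1970;. because the punctuation is kept instead of replaced, which is the gap condition 3 describes.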

Using the above criteria, I’d like to be able to run a process that will turn the comma in line 2 into a period, turn the semicolon in line 4 into a period, and add a period to the end of lines 5 and 6.  To do this, you’d set up a substitution.

Find: ((?<one>=650.*[\w])|(?<one>=650.*)(?<two>[^.)]))$
Replace With: ${one}.

So what exactly is happening here?  In .NET regular expressions, you can give capture groups names and refer to them by name in the replacement, and the same name can be used more than once in a pattern.  In this case, we create a conditional using an ‘or’ clause (alternation), using the same group name, one, for each branch of the clause.  The first branch captures the whole field when it ends in a word character.  The second branch captures everything before the trailing punctuation and pushes that punctuation out into a separate group, two.  Now we have isolated the data we want to keep, and the same replacement, ${one}., appends the period in both cases.  Using the above, you will receive the following output:

=650  \6$aMusique populaire$zQuébec (Province)$y1951-1960.
=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970.
=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970)
=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970.
=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970.
=650  \6$aMusique populaire$zQuébec.
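
For readers who want to try the idea outside MarcEdit, here is a rough Python sketch of the same alternation. It is an approximation under one stated assumption: Python’s re module does not accept the same group name twice in a pattern, so the two branches use different names (word and punct, both hypothetical) and a small replacement function picks whichever one matched. The helper name normalize is also illustrative.

```python
import re

# Sample 650 fields from the post, in MarcEdit's mnemonic format.
records = [
    r"=650  \6$aMusique populaire$zQuébec (Province)$y1951-1960.",
    r"=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970,",
    r"=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970)",
    r"=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970;",
    r"=650  \6$aMusique populaire$zQuébec (Province)$y1961-1970",
    r"=650  \6$aMusique populaire$zQuébec",
]

conditional = re.compile(
    r"(?:(?P<word>=650.*\w)"       # branch 1: field ends in a word character
    r"|(?P<punct>=650.*)[^.)])$"   # branch 2: text before other trailing punctuation
)


def normalize(line):
    """Keep whichever branch matched and append a period (hypothetical helper)."""
    def keep(match):
        # Exactly one of the two groups is set when the pattern matches.
        return (match.group("word") or match.group("punct")) + "."
    return conditional.sub(keep, line)


results = [normalize(line) for line in records]
for line in results:
    print(line)
```

Fields that already end in a period or closing parenthesis match neither branch, so they pass through unchanged, which mirrors the output listed above.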

Obviously, the above is a fairly simple example, but the concept can be applied to much more complicated workflows.  If you are interested in reading more about the Regular Expression implementation used in MarcEdit, please see: https://msdn.microsoft.com/en-us/library/vstudio/az24scfc(v=vs.100).aspx.

If you have questions, let me know.

–tr

