Using RegEx in C#

By reeset / In General Computing, MarcEdit

Something I found out today that I thought was interesting.  Within MarcEdit, I have one regular expression that looks for entities and converts them to their byte equivalents (e.g., &#1B;).  Since this regular expression has the potential to be called a number of times, I decided to make use of the RegEx compile option (RegexOptions.Compiled).  The documentation notes that this compiles the regular expression object, which has a startup cost associated with it but makes matching faster at runtime.  After setting the option and testing the code, everything worked fine and was nice and peppy.  Great…

Not so fast.  Apparently, what this option does is invoke the CLR compiler each time this statement is seen.  For some reason, when I call this code directly within my application, it runs nice and fast.  However, when I call it indirectly through a COM object, it runs like it’s wading through molasses.  I’m not exactly sure what the difference is here, but looking at memory usage, it appears that the CLR is being loaded and invoked on each pass through the RegEx, which does not appear to be happening through the direct call.  Either way, I found that removing the Compiled option sped up the process from the COM execution by roughly two orders of magnitude (with the option, it was processing 20 records per second; without it, it processed 241 records in 0.1 seconds).  Looking through the user groups for a little more explanation, I found notes saying that this option should only be used when the regular expression is static and can be defined globally within a context.  Since this isn’t the case within my application, I can see why I had the problem.  Too bad the MS documentation doesn’t make the behavior of this option a little clearer.
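For anyone wanting to see the difference in code, here is a minimal sketch of the two approaches.  The class name, method names, and the entity pattern are made up for illustration (MarcEdit's actual expression isn't shown here); the pattern simply assumes entities of the form &#1B;, with two hex digits.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class EntityDecoder
{
    // Hypothetical pattern, assumed for this sketch: "&#" + two hex digits + ";".
    private const string Pattern = @"&#([0-9A-Fa-f]{2});";

    // Static, compiled once: the compile cost is paid a single time when the
    // type is initialized, and every later call reuses the same Regex object.
    private static readonly Regex CompiledPattern =
        new Regex(Pattern, RegexOptions.Compiled);

    // Reuses the shared compiled Regex.
    public static string Decode(string input)
    {
        return CompiledPattern.Replace(input, ToChar);
    }

    // Builds a new Regex on every call. Without RegexOptions.Compiled this is
    // relatively cheap; adding the Compiled option here would repeat the
    // compile step on each call, which is the situation to avoid.
    public static string DecodePerCall(string input)
    {
        var regex = new Regex(Pattern, RegexOptions.None);
        return regex.Replace(input, ToChar);
    }

    // Turns the captured hex digits into the single character they encode.
    private static string ToChar(Match m)
    {
        int code = int.Parse(m.Groups[1].Value,
                             NumberStyles.HexNumber,
                             CultureInfo.InvariantCulture);
        return ((char)code).ToString();
    }
}
```

Calling EntityDecoder.Decode on a string containing &#1B; replaces that entity with the escape character (0x1B).  If the expression can't be held in a static field like this, leaving off the Compiled option, as in DecodePerCall, is probably the safer choice.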

Anyway, I’ll be updating the MarcEdit code sometime this weekend as I finish up adding Perl support to the script maker.