Excel Quirks when working with the Delimited Text Translator

By reeset / On / In MarcEdit

Every now and again, I get messages on the list noting that when trying to translate Excel files with longer fields or with numerical data – MarcEdit’s extracted data doesn’t always reflect what’s in the file.  It can be confusing, and understanding the why of it all probably isn’t something folks want/need to understand – but I’ll try to explain what is happening here and then what I’ve been working on to try and make this go away.

So why does this sometimes happy?  Well, to understand why it happens, you have to understand a little bit about how MarcEdit views Excel.  MarcEdit doesn’t actually know anything about Excel (or Access for that point), but rather interacts with them abstractly using Microsoft’s built-in JET engine.  This is an OBDC layer that essentially allows applications to interact with these file types as databases.  This definitely simplifies the process of working with these data files – but it introduces some quirks as well.  When MarcEdit converses with an Excel file via the OBDC layer – it knows nothing about the Excel file.  As you might or might not be aware – Excel has the ability to type columns or cells.  By default, Excel attempts to type your data based on the contents of that data.  Sometimes it gets it right, sometime wrong – often times, it just leaves it marked as general.  Well, MarcEdit interacts with this data abstractly through a generic late-bound object – so it doesn’t know what the field data type is for any data in the database – so it relies on the OBDC connection to negotiate that with Excel, and sometime it negotiates this data type incorrectly (or at least, falls back to primitive data types rather than more complex types like TEXT).  This will occur most frequently in two specific cases.

  1. You have a long data field.  If the data field is marked as type General – the OBDC connection will default to treating that data as a VARCHAR.  That means MarcEdit can only receive 255 characters from the field.  This problem goes away if the data is typed as TEXT.
  2. You have alphanumeric data like ISBN that Excel wants to treat as scientific notation (or some-sort numeric).  When extracting data – the OBDC connection will mistype the information and often times return the data in a scientific notation or numeric notation.  Again, the problem goes away if the data is specifically typed.

Obviously, this is confusing when it occurs (though I’m not actually sure how often people encounter this problem since it isn’t reported often).  Since the workaround is a relatively simple one (i.e., changing the cell/column formatting), I haven’t spent too many cycles looking into simplifying this process.  However, with a few large enhancements completed, I’m going to be dedicating some cycles to investigate automated field typing within the Jet engine to see if I can come up with a solution that will virtual eliminate the problem.  The tricky part, of course, is making sure that any changes to the OBDC code-base doesn’t introduce regressions, but that’s just a normal part of development.

At this point in time, I’m looking to continue what has become a somewhat annual tradition of releasing a MarcEdit update right around Christmas, so I’m tentatively working to have a solution to this issue as part of this release.