MarcEdit and OpenRefine
I’ve seen a number of workshops and presentations floating around that talk about ways of using MarcEdit and OpenRefine together when doing record editing. OpenRefine, for folks that might not be familiar, used to be known as Google Refine, and is a handy tool for working with messy data. While there is a lot of potential overlap between the types of edits available in MarcEdit and OpenRefine, OpenRefine’s strength is that it allows you to access your data via a tabular interface to easily find variations in metadata, relationships, and patterns.
For most folks working with MarcEdit and OpenRefine together, the biggest challenge is moving the data back and forth. MARC binary data isn’t supported by OpenRefine, and MarcEdit’s mnemonic format isn’t well suited to OpenRefine’s import options either. And once the data has been put into OpenRefine, getting it back out and turned into MARC can be difficult for first-time users.
Because I’m a firm believer that users should use the tool that they are most comfortable with, I’ve been talking to a few OpenRefine users about how I could make the process of moving data between the two systems easier. To that end, I’ll be adding to MarcEdit a toolset that facilitates the export of MARC (and MarcEdit’s mnemonic) data into formats that OpenRefine can parse, and the import of the data that OpenRefine generates. I’ve implemented this functionality in two places: one as a standalone application found on the Main MarcEdit Window, and one as part of the MarcEditor, which will automatically convert or import data directly into the MarcEditor Window.
Exporting Data from MarcEdit
As noted above, there will be two methods of exporting data from MarcEdit into one of two formats for import into OpenRefine. Presently, MarcEdit supports generating either JSON or tab delimited format. Both are formats that OpenRefine can import to create a new project.
If I have a MARC file and I want to export it for use in OpenRefine, I would use the following steps:
- Open MarcEdit
- Select Tools/OpenRefine/Export from the menu
- Enter my Source File (either a MARC or mnemonic file)
- Enter my Save File (MarcEdit supports export in JSON or TSV, i.e. tab delimited)
- Select Process
This will generate a file that can be used for importing into OpenRefine. A couple of notes about that process. When importing via tab delimited format, you will want to unselect the option that does number interpretation. I’d also uncheck the option to turn blanks into nulls and make sure the option that retains blank rows is selected. These settings matter on export and reimport into MarcEdit. When using JSON as the file format, you will want to make sure after import to order your columns as TAG, Indicators, Content. I’ve found OpenRefine will mix this order, even though the JSON data is structured in this order.
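As a rough illustration of why number interpretation is worth turning off, the Python sketch below reads a small tab delimited sample twice: once keeping every cell as text, and once converting all-digit cells to numbers. The sample rows, the header row, and the helper names are assumptions made for the example, not MarcEdit’s exact export layout.

```python
import csv
import io

# Hypothetical sample rows; the real MarcEdit export layout may differ.
SAMPLE = (
    "TAG\tIndicators\tContent\n"
    "001\t\t000123\n"
    "245\t10\t$aExample title\n"
    "\t\t\n"  # a blank row, kept as three empty cells
    "001\t\t000124\n"
)

def read_rows(text):
    """Read tab delimited text, keeping every cell as a string."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader]

def interpret_numbers(rows):
    """Mimic number interpretation: all-digit cells become integers."""
    return [[int(cell) if cell.isdigit() else cell for cell in row] for row in rows]

rows = read_rows(SAMPLE)
as_numbers = interpret_numbers(rows)

# Kept as text, the tag and control number retain their leading zeros.
print(rows[1])
# Interpreted as numbers, "001" and "000123" lose their leading zeros.
print(as_numbers[1])
```

Once a tag such as 001 has been turned into a number, the leading zeros are gone and the original tag cannot be recovered from the value alone, which is why keeping everything as text is the safer choice for a round trip.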
Once you’ve made the changes to your data, select the export option in OpenRefine and choose the tab delimited export. This is the file format MarcEdit can turn back into either MARC or the mnemonic file format. Please note that I’d recommend always going back to the mnemonic file format until you are comfortable with the process, so you can confirm that the import worked the way you expected.
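If the columns were shuffled along the way, they can also be put back in TAG, Indicators, Content order outside of OpenRefine before the file goes back to MarcEdit. The sketch below is one way to do that in Python. It assumes the exported file starts with a header row holding exactly those three names, which may not match your export, and `reorder_columns` is a hypothetical helper, not part of MarcEdit or OpenRefine.

```python
EXPECTED = ["TAG", "Indicators", "Content"]  # column order named in this post

def reorder_columns(tsv_text, expected=EXPECTED):
    """Return tab delimited text with its columns in the expected order.

    Assumes the first line is a header holding exactly the expected names.
    Lines are split on "\n" only, so separator characters inside the data
    are left untouched.
    """
    lines = tsv_text.split("\n")
    # A trailing newline leaves one empty piece at the end; drop only that one.
    if lines and lines[-1] == "":
        lines.pop()
    header = lines[0].split("\t")
    if sorted(header) != sorted(expected):
        raise ValueError(f"unexpected columns: {header}")
    order = [header.index(name) for name in expected]
    out = []
    for line in lines:
        cells = line.split("\t")
        # Pad short or empty lines so blank rows survive as empty cells.
        cells += [""] * (len(header) - len(cells))
        out.append("\t".join(cells[i] for i in order))
    return "\n".join(out) + "\n"

shuffled = "Content\tTAG\tIndicators\n$aExample title\t245\t10\n\n"
fixed = reorder_columns(shuffled)
print(fixed)
```

The function raises an error instead of guessing when the header does not match, since a silently misordered file is harder to spot after it has been converted back to MARC.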
And that’s it. I’ve recorded a video on YouTube walking through these steps – you can find it here:
This of course just shows how to move data between the two systems. If you want to learn more about how to work with the data once it’s in OpenRefine, I’d recommend one of the many excellent workshops that I’ve been seeing put on at conferences and via webinars by a wide range of talented metadata librarians.
Beyond adding the tool itself, I’ve set it up so that it can be selected as one of the user defined tools on the front page for quick access. This way, if this is one of the tools you use often, you can get right to it.