Mar 03, 2016
 

Yesterday, someone reported a problem with the Add/Delete Field function.  An update in the last version that allowed deduplication deletions based on subfields was tripping other deletions.  This was definitely problematic, and it has been corrected, along with a couple of other changes.

Change log:

6.2.88

  • Bug Fix: Add/Delete Field: I introduced an element into the Delete function to allow dedup deletions to happen at the subfield level. This tripped non-dedup deletions. This has been corrected.
  • Update: Build New Links: FAST headings in the 600, 611, and 630 fields weren’t being processed. I’ve updated the rules file appropriately.
  • Update: RDA Helper Abbrevs File: Added the S.L. abbreviation.
  • Bug Fix: Validate Headings: The “check $a only when subject checking” option wasn’t being honored. This has been corrected.

Changes can be found on the downloads page: http://marcedit.reeset.net/downloads

 

–tr


MarcEdit Update

Feb 28, 2016
 

The update was posted Feb. 27 for all versions.  It contains the following changes:

6.2.85

  • Enhancement: Characterset Detection: MarcEdit now includes a tool that performs a heuristic analysis of a file to provide best-guess characterset detection. (http://blog.reeset.net/archives/1897)
  • Enhancement: Build New Tool Function: Adding a find macro to the function so that users can now identify specific fields when building new fields from data in a MARC record. (http://blog.reeset.net/archives/1902)
  • Update: Build Links — improved handling of MESH data
  • Update: Build Links — improved handling of AAT data
  • Update: Build Links — improved handling of ULAN data
  • Update: Build Links — added a workaround for character escaping issues found in .NET 4.0. The issue impacts URIs with trailing periods and slashes (/). Apparently, the URI encoding tool doesn’t escape them properly because of how Windows handles file paths.
  • Update: Build Links — Rules file updated to include refined definitions for the 6xx fields.
  • Update: MarcEdit Command-Line: program updated to include the new Build Links functional updates
  • Update: COM object: Updated character encoding switching to simplify streaming functions.
  • Update: Validate Headings: Integrated rules file into checking.
  • Bug Fix: Validate Headings: headings validation was being tripped by the URI escaping issue in .NET 4.0. This has been corrected.
  • Update: RDA Helper: Finished code refinements
  • Update: Build Links — tool is now asynchronous
  • Enhancement: Build Links — Users can now select and build their own rules files
  • Enhancement: Build Links — Tool now includes a function that will track resolution speed from linked services and attempt to provide notification when services are performing poorly. First version won’t identify particular services — just that data isn’t being processed in a timely manner.
  • Bug Fix: Character Conversion — When converting UTF-8 to MARC-8, the {dollar} literal wasn’t being converted back to a literal dollar sign. This is related to removing the fall-back entity checking in the last update. This has been corrected.

Updates can be picked up through the automated update tools in MarcEdit or via the downloads page: http://marcedit.reeset.net/downloads

 

–tr


MarcEdit: Build New Field Enhancement

Feb 23, 2016
 

I’m wrapping up a few odds and ends prior to releasing the next MarcEdit update – mostly around the linked data work and how the tool works with specific linked data services – but one of the specific changes that should make folks using the Build New Field tool happy is the addition of a new macro that can be used to select specific data elements when building a new field. 

So, for those that might not be aware, the Build New Field tool is a pattern-based tool that allows users to select information from various MARC fields in a record and create a new field.  You can read the initial description at: http://blog.reeset.net/archives/1782 and about the enhancements that added a kind of macro language to the tool here: http://blog.reeset.net/archives/1853

When the tool runs, one of the assumptions made is that the data for the pattern is pulled from the first field/subfield combination that meets the pattern criteria.  This works well if your record has only a single field for the data that you need to capture.  But what if you have multiple fields?  Say, for example, the user needs to create a call number, and one of those elements will be the ISBN – however, the record has multiple ISBN fields like:
=020  \\$a123456 (ebook)
=020  \\$a654321 (hardcopy)

Say I need to specifically get the ISBN from the hardcopy.  In the current Build New Field function, this wouldn’t be possible without first changing the first 020 to something else (like an 021) – then changing it back when the operation was completed.  This is because if I used, say:
=099  \\$aMyCall Number {020$a}

I would get the first 020$a value.  There hasn’t been a way to ask the tool to find specific field data in this function.  But that has changed – I’ve introduced: find. 

Function: .find
Arguments: needle
Example: {020$a.find("hardcopy")}

Find will allow you to selectively find data in a field.  So, in the example above, I can now select the correct 020.
=099  \\$aMyCall Number {020$a.find("hardcopy").replace("(hardcopy)","")}

This will output:
=099  \\$aMyCall Number 654321

A couple notes about usage.  Find must always be the first option in a chain of macros.  This is because the tool performs the other operations, like substitutions, afterwards – so the criteria being queried must reflect the data in the record as read, not after it has been processed.  If you place find in any other position, you may invalidate your pattern. 
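To illustrate (reusing the ISBN records from above), the first pattern below is valid; the second places find after the replace, so the criteria no longer reflect the data as read, and the pattern may be invalidated:
=099  \\$aMyCall Number {020$a.find("hardcopy").replace("(hardcopy)","")}
=099  \\$aMyCall Number {020$a.replace("(hardcopy)","").find("hardcopy")}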

This will be part of the next upcoming MarcEdit update.

–tr

Feb 14, 2016
 

The topic of charactersets is likely something most North American catalogers rarely give a second thought to.  Our tools and systems are all built around a very Anglo-centric worldview that assumes data is primarily structured in MARC21, and recorded in either MARC-8 or UTF-8.  However, when you get outside of North America, the question of characterset, and even MARC flavor for that matter, becomes much more relevant.  While many programmers and catalogers that work with library data would like to believe that most data follows a fairly regular set of common rules and encodings – the reality is that it doesn’t.  While MARC21 is the primary MARC encoding for North American and many European libraries, it is just one of around 40 different flavors of MARC; and while MARC-8 and UTF-8 are the predominant charactersets in libraries coding in MARC21, move outside of North America and OCLC, and you will run into Big5, Cyrillic (codepage 1251), Central European (codepage 1250), ISO-5426, Arabic (codepage 1256), and a range of other localized codepages in use today.  So while UTF-8 and MARC-8 are the predominant encodings in countries using MARC21, a large portion of the international metadata community still relies on localized codepages when encoding their library metadata.  And this can be a problem for any North American library looking to utilize metadata encoded in one of these local codepages, or to share data with a library utilizing one of them.

For years, MarcEdit has included a number of tools for handling this soup of character encodings – tools that work at different levels to let the program handle data from across the spectrum of metadata rules, encodings, and markups.  These break down into two different types of processing algorithms.

Characterset Identification:

This algorithm is internal to MarcEdit and vital to how the tool handles data at a byte level.  When working with file streams for rendering, the tool needs to decide if the data is in UTF-8 or something else (for mnemonic processing) – otherwise, data won’t render correctly in the graphical interface.  For a long time (and honestly, this is still true today), the byte in the LDR of a MARC21 record that indicates whether a record is encoded in UTF-8 simply hasn’t been reliable.  It’s getting better, but a good number of systems and tools simply forget (or ignore) this value.  More important for MarcEdit, this value is only useful for MARC21 – the encoding byte sits in a different field/position within each different flavor of MARC.  In order for MarcEdit to handle this correctly, a small, fast algorithm needed to be created that could reliably identify UTF-8 data at the binary level.  And that’s what’s used – a heuristic algorithm that reads bytes to determine if the characterset might be UTF-8 or something else.

Might be?  Sadly, yes.  There is no way to definitively auto-detect a characterset.  It just can’t happen.  Each codepage reuses the same codepoints; they just assign different characters to those codepoints based on which encoding is in use.  So, a tool won’t know how to display textual data without first knowing the set of codepoint rules the data was encoded under.  It’s a real pain in the backside.

To solve this problem, MarcEdit uses the following code in an identification function:

 
        // Note: the original post shows only the body of this function; the
        // method signature and the constant values below are assumptions added
        // here so the snippet stands alone.
        const int RET_VAL_ERR = -1;
        const int RET_VAL_ANSI = 0;
        const int RET_VAL_UTF_8 = 1;

        static int DetectUtf8(byte[] p)
        {
            int x = 0;
            int lLen = 0;
            int iEType = RET_VAL_ANSI;

            try
            {
                while (x < p.Length)
                {
                    if (p[x] <= 0x7F)
                    {
                        // single-byte (ASCII) value -- keep scanning
                        x++;
                        continue;
                    }
                    else if ((p[x] & 0xE0) == 0xC0)
                    {
                        lLen = 2;   // 110xxxxx: lead byte of a 2-byte sequence
                    }
                    else if ((p[x] & 0xF0) == 0xE0)
                    {
                        lLen = 3;   // 1110xxxx: lead byte of a 3-byte sequence
                    }
                    else if ((p[x] & 0xF8) == 0xF0)
                    {
                        lLen = 4;   // 11110xxx: lead byte of a 4-byte sequence
                    }
                    else if ((p[x] & 0xFC) == 0xF8)
                    {
                        lLen = 5;   // legacy 5-byte sequence
                    }
                    else if ((p[x] & 0xFE) == 0xFC)
                    {
                        lLen = 6;   // legacy 6-byte sequence
                    }
                    else
                    {
                        // not a valid UTF-8 lead byte -- assume a legacy codepage
                        return RET_VAL_ANSI;
                    }

                    // each continuation byte must match 10xxxxxx
                    while (lLen > 1)
                    {
                        x++;
                        if (x >= p.Length || (p[x] & 0xC0) != 0x80)
                        {
                            return RET_VAL_ERR;
                        }
                        lLen--;
                    }

                    // found at least one well-formed multi-byte sequence
                    iEType = RET_VAL_UTF_8;
                    x++;
                }
            }
            catch (System.Exception)
            {
                iEType = RET_VAL_ERR;
            }

            return iEType;
        }

This function allows the tool to quickly evaluate any data at a byte level and identify if that data might be UTF-8 or not.  Which is really handy for my usage.
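For context, here’s a minimal sketch of how a function like this gets called (the file path is hypothetical, and the names follow the cleaned-up listing above):

        // read the raw bytes of a record file and branch on the detected encoding
        byte[] data = System.IO.File.ReadAllBytes(@"C:\data\records.mrc");  // hypothetical path
        if (DetectUtf8(data) == RET_VAL_UTF_8)
        {
            // safe to decode as UTF-8 before rendering in the editor
        }
        else
        {
            // fall back to MARC-8/mnemonic handling or further analysis
        }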

Character Conversion

MarcEdit has also included a tool that allows users to convert data from one character encoding to another.


This tool requires users to identify the original characterset encoding of the file to be converted.  Without that information, MarcEdit would have no idea which set of rules to apply when shifting the data around, based on how characters have been assigned to their various codepoints.  Unfortunately, a common problem that I hear from librarians – especially librarians in the United States who don’t have to deal with this problem regularly – is that they don’t know the file’s original characterset encoding, or how to find it.  It’s a common problem – especially when retrieving data from some Eastern European and Asian publishers.  In many of these cases, users send me files, and based on my experience looking at different encodings, I can make a couple educated guesses and generally figure out how the data might be encoded.

Automatic Character Detection

Obviously, it would be nice if MarcEdit could provide some kind of automatic characterset detection.  The problem is that this process is always fraught with errors.  Since there is no way to definitively determine the characterset of a file or data simply by looking at the binary data, we are left having to guess.  And this is where heuristics come in again.

Current-generation web browsers automatically set character encodings when rendering pages.  They do this based on the presence of metadata in the header, information from the server, and a heuristic analysis of the data prior to rendering.  This is why everyone has seen pages that the browser believes are in one character set but are actually in another, making the data unreadable when it renders.  Still, the process browsers currently use, as sad as this may be, is the best we’ve got.

And so, I’m going to be pulling this functionality into MarcEdit.  Mozilla has made the algorithm that they use public, and some folks have ported that code to C#.  The library can be found on GitHub here: https://github.com/errepi/ude.  I’ve tested it – it works pretty well, though it’s not even close to perfect.  Unfortunately, this type of process works best when you have lots of data to evaluate – but most MARC records are just a few thousand bytes, which just isn’t enough data for a proper analysis.  However, it does provide something – and maybe that something will give users working with data in an unknown character encoding a way to actually figure out how their data might be encoded.
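For anyone curious what using the library looks like, here’s a minimal sketch based on the usage example in the ude project’s documentation (the file name is hypothetical):

        using System;
        using System.IO;

        class DetectorDemo
        {
            static void Main(string[] args)
            {
                // feed the raw bytes of a file to the detector and read back its best guess
                using (FileStream fs = File.OpenRead("records.mrc"))  // hypothetical file
                {
                    Ude.CharsetDetector cdet = new Ude.CharsetDetector();
                    cdet.Feed(fs);
                    cdet.DataEnd();
                    if (cdet.Charset != null)
                        Console.WriteLine("Charset: {0}, confidence: {1}", cdet.Charset, cdet.Confidence);
                    else
                        Console.WriteLine("Detection failed.");
                }
            }
        }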

The new character detection tools will be added to the next official update of MarcEdit (all versions).


And as I noted – this is a tool that will be added to give users one more way to evaluate their records.  While detection may still only be a best guess – it’s likely a pretty good guess.

The MARC8 problem

Of course, not all is candy and unicorns.  MARC-8, the lingua franca for a wide range of ILS systems and libraries – well, it complicates things.  Unlike many of the localized codepages, which are actually well-defined standards in use by a wide range of users and communities around the world, MARC-8 is not.  MARC-8 is essentially a made-up encoding – it simply doesn’t exist outside of the small world of MARC21 libraries.  To a heuristic parser evaluating character encoding, MARC-8 looks like one of four different charactersets: US-ASCII, codepage 1252, ISO-8859, and UTF-8.  The problem is that MARC-8, as an escape-based encoding, reuses parts of a couple different encodings.  This really complicates the identification of MARC-8, especially in a world where other encodings may (probably will) be present.  To that end, I’ve had to add a secondary set of heuristics that evaluate data after detection, so that if the data is identified as one of these four types, some additional evaluation is done looking specifically for MARC-8’s fingerprints.  This allows, most of the time, for MARC-8 data to be correctly identified – but again, not always.  It just looks too much like other standard character encodings.  Again, it’s a good reminder that this tool provides just a best guess at the characterset encoding of a set of records – not a definitive answer.
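To make that concrete, a naive version of such a fingerprint check might look something like the sketch below.  This is my own illustration, not MarcEdit’s actual heuristic – it keys on two visible MARC-8 traits: ISO-2022-style escape bytes used to switch graphic sets, and ANSEL combining diacritics in the 0xE0–0xFE range:

        // Hypothetical sketch -- not MarcEdit's actual fingerprint check.
        static bool HasMarc8Fingerprints(byte[] p)
        {
            foreach (byte b in p)
            {
                if (b == 0x1B)
                    return true;   // ESC: MARC-8 switches graphic sets via escape sequences
                if (b >= 0xE0 && b <= 0xFE)
                    return true;   // ANSEL combining diacritic range
            }
            return false;
        }

Of course, those same byte values occur in UTF-8 and Latin-1 data too – which is exactly why MARC-8 can only be identified most of the time.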

Honestly, I know a lot of people would like to see MARC as a data structure retired.  They write about it, talk about it, hope that BibFrame might actually do it.  I get their point – MARC as a structure isn’t well suited to the way we process metadata today.  Most programmers simply don’t work with formats like MARC, and few tools exist that make MARC easy to work with.  Likewise, most evolving metadata models recognize that metadata lives within a larger context, and are taking advantage of semantic linking to encourage the linking of knowledge across communities.  These are things libraries would like in their metadata models as well, and libraries will get there, though I think in baby steps.  When you consider what a train wreck RDA adoption and development was for what we got out of it (at a practical level), making a radical move like BibFrame will require a radical change (and maybe an event that causes that change).

But I think there is a bigger problem that needs more immediate action.  The continued reliance on MARC-8 actually poses a bigger threat to the long-term health of library metadata.  MARC, as a structure, is easy to parse.  MARC-8, as a character encoding, is essentially a virus – one that we are continuing to let corrupt our data and lock it away from future generations.  The sooner we can toss this encoding onto the trash heap, the better it will be for everyone – especially since we are likely one generation away from losing the knowledge of how this made-up character encoding actually works.  And when that happens, it won’t matter how the record data is structured – because we won’t be able to read it anyway.

–tr

Feb 06, 2016
 

Would this be the Super Bowl edition? Super-duper update? I don’t know – but I am planning an update. Here’s what I’m hoping to accomplish for this update (2/7/2016):

MarcEdit (Windows/Linux)

· Z39.50/SRU Enhancement: Enable user defined profiles and schemas within the SRU configuration. Status: Complete

· Z39.50/SRU Enhancement: Allow SRU searches to be completed as part of the batch tool. Status: ToDo

· Build Links: Updating rules file and updating components to remove the last hardcoded elements. Status: Complete

· MarcValidators: Updating rules file Status: Complete

· RDA Bug Fix: 260 conversion – on rare occasions when {} are present, you could lose a character. Status: Complete

· RDA Enhancement: 260 conversion – cleaned up the code Status: Complete

· Jump List Enhancement: Selections in the jump list remain highlighted Status: Complete

· Script Wizard Bug Fix: Corrected error in the generator that was adding an extra “=” when using the conditional arguments. Status: Complete

MarcEdit Linux

· MarcEdit expects /home/[username] to be present…when it’s not, the application data is lost, causing problems with the program. Updating this to allow the program to fall back to the application directory/shadow directory. Status: Testing

MarcEdit OSX

· RDA Fix [crash error when encountering invalid data] Status: Testing

· Z39.50 Bug: Raw Queries failing Status: Complete

· Command-line MarcEdit: Porting the Command line version of marcedit (cmarcedit). Status: Testing

· Installer: The installer needs to be changed to allow individual installation of the GUI MarcEdit and the command-line version of MarcEdit. These two versions share the same configuration data. Status: ToDo

–tr

Jan 25, 2016
 

I’ve posted an update for all versions – changes noted below.

The significant change was a shift in how the linked data processing works.  I’ve shifted from hard-coded rules to a rules file.  You can read about that here: http://blog.reeset.net/archives/1887

If you need to download the file, you can get it from the automated update tool or from: http://marcedit.reeset.net/downloads.

–tr

Jan 25, 2016
 

One of the changes in the current MarcEdit update is the introduction of a linked data rules file to help the program understand which data elements should be processed for automatic URI generation, and how that data should be treated.  The rules file is found in the Configs directory and is called: linked_data_profile.xml

 


The rules file is pretty straightforward.  At this point, I haven’t created a schema for it, but I will, to make defining data easier.  Until then, I’ve added references in the header of the document to note the fields and values. 

Here’s a small snippet of the file:

<?xml version="1.0" encoding="UTF-8"?>
<marcedit_linked_data_profile>
  <!--
    rules block:
        top level: field
            Attributes:
                type: authority, bibliographic, authority|bibliographic
            tag (required):
                Value: Field value
                Description: field to process
            subfield (required):
                Value: Subfield codes
                Description: subfields to use for matching
            index (optional):
                Values: subfield code or empty
                Description: field that denotes index
            atomize(optional):
                Values: 1 or empty
                Description: determines if field should be broken up for uri disambiguation
            special_instructions (optional):
                Values: name|subject|mixed
                Description: special instructions to improve normalization for names and subjects. 
            uri (required):
                Values: subfield code to include a url
                Description: Used to determine which subfield is used to embed a URI
            vocab (optional):
                Values (see supported vocabularies section)
                Description: when no index is supplied, you can predefine a supported index
               
               
  Supported Vocabularies:
    Value: lcshac
    Description: LC Childrens Subjects
   
    Value: lcdgt
    Description: LC Demographic Terms
   
    Value: lcsh
    Description: LC Subjects
   
    Value: lctmg
    Description: TGM
   
    Value: aat
    Description: Getty Arts and Architecture Thesaurus
   
    Value: ulan
    Description: Getty ULAN
   
    Value: lcgft
    Description: LC Genre Forms
  
   Value: lcmpt
   Description: LC Medium Performance Thesaurus
  
   Value: naf
   Description: LC NACO Terms
  
   Value: naf_lcsh
   Description: lcsh/naf combined indexes.
  
   Value: mesh
   Description: MESH indexes
    -->
  <rules>
    <field type="bibliographic">
      <tag>100</tag>
      <subfields>abcdqnp</subfields>
      <uri>0</uri>
      <special_instructions>name</special_instructions>
    </field>
    <field type="bibliographic">
      <tag>110</tag>
      <subfields>abcdqnp</subfields>
      <uri>0</uri>
      <special_instructions>name</special_instructions>
    </field>
  </rules>
</marcedit_linked_data_profile>

The rules file is pretty straightforward.  You have a field where you define a type; acceptable values are: authority, bibliographic, authority|bibliographic.  This tells the tool which type of record the processing rules apply to.  Second, you define a tag, the subfields to process when evaluating for linking, a uri field (this is the subfield used when outputting the URI), special instructions (if there are any), whether the field is atomized (i.e., broken up so that you have one concept per URI), and a vocab (to preset a default vocabulary for processing).  So, for example, say a user wanted to atomize a field that currently isn’t defined as such – they would just find the processing block for the field and add <atomize>1</atomize> into the block – and that’s it.
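As an illustration, here’s what a hypothetical 650 block with atomization switched on might look like (this block isn’t from the shipped rules file – the tag and subfield choices here are mine, following the structure of the snippet above):

    <field type="bibliographic">
      <tag>650</tag>
      <subfields>abvxyz</subfields>
      <uri>0</uri>
      <atomize>1</atomize>
      <special_instructions>subject</special_instructions>
    </field>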

The idea behind this rules file is to support the work of a PCC Task Force while they are testing the embedding of URIs in MARC records.  By shifting from a compiled solution to a rules-based solution, I can provide immediate feedback, and it should make the process easier to customize and test. 

An important note – these rules will change.  They are pretty well defined for bibliographic data, but authority data is still being worked out. 

–tr


MarcEdit Update (all versions)

Jan 18, 2016
 

This update includes a new tool, changes to the merge tool, and a behavior change in the MARCEngine.  You can see the change log at:

You can get the update through MarcEdit’s automated update mechanism or from: http://marcedit.reeset.net/downloads/

–tr


MarcEdit and OpenRefine

Jan 16, 2016
 

There have been a number of workshops and presentations floating around that talk about ways of using MarcEdit and OpenRefine together when doing record editing.  OpenRefine, for folks that might not be familiar, used to be known as Google Refine, and is a handy tool for working with messy data.  While there is a lot of overlap between the types of edits available in MarcEdit and OpenRefine, the strength of OpenRefine is that it gives you access to your data via a tabular interface, making it easy to find variations in metadata, relationships, and patterns.

For most folks working with MarcEdit and OpenRefine together, the biggest challenge is moving the data back and forth.  MARC binary data isn’t supported by OpenRefine, and MarcEdit’s mnemonic format isn’t well suited for import using OpenRefine’s import options either.  And once the data has been put into OpenRefine, getting it back out and turned into MARC can be difficult for first-time users as well.

Because I’m a firm believer that users should use the tool that they are most comfortable with, I’ve been talking to a few OpenRefine users, trying to think about how I could make the process of moving data between the two systems easier.  To that end, I’ll be adding to MarcEdit a toolset that will facilitate the export and import of MARC (and MarcEdit’s mnemonic) data in formats that OpenRefine can parse and easily generate.  I’ve implemented this functionality in two places – one as a standalone application found on the main MarcEdit window, and one as part of the MarcEditor, which will automatically convert or import data directly into the MarcEditor window.

Exporting Data from MarcEdit

As noted above, there will be two methods of exporting data from MarcEdit into one of two formats for import into OpenRefine.  Presently, MarcEdit supports generating either json or tab-delimited (tsv) data.  These are two formats that OpenRefine can import to create a new project.

[Image: OpenRefine Option from the Main Window]

[Image: OpenRefine Export/Import Tool]

If I have a MARC file and I want to export it for use in OpenRefine – I would use the following steps:

  1. Open MarcEdit
  2. Select Tools/OpenRefine/Export from the menu
  3. Enter my Source File (either a marc or mnemonic file)
  4. Enter my Save File – MarcEdit supports export in json or tsv (tab delimited)
  5. Select Process

This will generate a file that can be used for importing into OpenRefine.  A couple notes about that process.  When importing via the tab-delimited format, you will want to deselect the options that do number interpretation.  I’d also uncheck the option to turn blanks into nulls, and make sure the option to retain blank rows is selected.  These settings matter when exporting and reimporting into MarcEdit.  When using json as the file format, you will want to make sure after import to order your columns as TAG, Indicators, Content.  I’ve found OpenRefine will mix up this order, even though the json data is structured in this order.
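To give a sense of the shape of the exported data (an illustrative mock-up rather than MarcEdit’s exact output – the rows reuse the ISBN examples from an earlier post), each MARC field becomes a row of TAG, Indicators, Content:

    TAG     Indicators      Content
    020     \\              $a123456 (ebook)
    020     \\              $a654321 (hardcopy)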

Once you’ve made the changes to your data, select the export option in OpenRefine and choose the tab-delimited export.  This is the file format MarcEdit can turn back into either MARC or the mnemonic file format.  Please note – I’d recommend always going back to the mnemonic file format until you are comfortable with the process, to ensure that the import worked as you expected.

And that’s it.  I’ve recorded a video on YouTube walking through these steps – you can find it here:

This of course just shows how to move data between the two systems.  If you want to learn more about how to work with the data once it’s in OpenRefine, I’d recommend one of the many excellent workshops that I’ve been seeing put on at conferences and via webinars by a wide range of talented metadata librarians.

*** Update ***

In addition to the tool itself, I’ve set it up so that it can be selected as one of the user-defined tools on the front page for quick access.  This way, if this is one of the tools you use often, you can get right to it.

[Image: MarcEdit’s Start Window Preferences with new OpenRefine Data Transfer Tool Option]

[Image: Main Window with OpenRefine Data Transfer Tool]


MarcEdit Update (all versions)

Jan 10, 2016
 

I decided to celebrate my absence from ALA’s Midwinter by doing a little coding.  I’ve uploaded updates for all versions of MarcEdit, though the Mac version has seen the most significant revisions.  The changes:

Windows/Linux ChangeLog:

OSX ChangeLog:

You can get the update from the Downloads page (http://marcedit.reeset.net/downloads) or using the automated updating tools within MarcEdit.

Questions,

–tr
