MarcEdit 2017 Usage Information

By reeset / On / In MarcEdit

Every year, I like to take a couple of minutes and pull my log files to get a quick and dirty look at who might be making use of MarcEdit.  This year, I was also interested in how quickly MarcEdit 7 is being picked up, since the application install base is large and diverse and I’m thinking about how many MarcEdit 6 maintenance releases to plan for this next year (four, at this point).

Look at the numbers:

Number of Executions: ~3 million

Executions are measured by tracking users who make use of the automated update/notifications tool.  Since MarcEdit pings the update service, I get a rough idea of how often the program was started during the year.  However, as this only captures folks who take advantage of the notification service and are online – this number represents only a slice of usage.

Countries: ~190

Again, using the log files from the update service, the analytics software I use provides a set of broad country/administrative region codes.  Over the course of 2017, ~190 individual regions were represented, with ~120 regions having an active presence month-over-month.

MarcEdit 7/MacOS Downloads

I’m interested in this because I’m curious about the rate of adoption.  On update, I usually see ~8-10,000 active users that routinely update the software.  Looking at Dec. 2017 (the first month of release) and Jan. 2018 – it looks like folks are slowly starting to test and put MarcEdit 7 through its paces.  Below, a download represents a unique user.

Dec. 2017: 6,400 total downloads

Jan. 2018: 11,700 total downloads

Finally – how many questions did I get to answer?  Again, this is hard to say, but looking just at the MarcEdit Listserv, I provided roughly 5,500 responses.  Given the questions I get on and off the list, it wouldn’t be a stretch to say that I probably answer ~20 questions a day regarding some aspect of the application.

Development Hours: ~820 hrs

This one surprised me, but this past year was spent revising the application – and honestly, it could be low.  On average, it wouldn’t be out of the realm of possibility to say that I spent ~17 hours per week in 2017 writing code for MarcEdit, most of it happening between the hours of midnight and 3 am.

So, that’s roughly a snapshot of 2017 usage.  I’ll be interested to see what 2018 will bring.

–tr

MarcEdit Unicode Normalizations–specifying ordinal matching and what that means

By reeset / On / In MarcEdit

I’m continuing to flesh out how to make it easier to work with normalizations – specifically, so that what is being queried is actually what is being found.  In general, the new normalization enforcement options solve issues related to finding and replacing text.  However, the place where this still comes up as a challenge is when using Find/Find All.  Internally, .NET’s string.IndexOf function uses culture-aware (invariant culture) comparison settings rather than ordinal matching – so it takes the data being queried and matches it against other linguistically equivalent variations of the character.  An example: ß gets searched both in its various Unicode normalizations and as “ss”.  There is a pretty good chance users don’t want to search for “ss” when querying for “ß”, though there may be times when they do.  In this case, I’ve updated the Find/Find All query so that users can determine how the tool will interpret data for searching.
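
To make the difference concrete, here’s a minimal C# sketch (not MarcEdit’s actual code) showing how a culture-aware IndexOf can report a hit for “ß” inside “ss”, while an ordinal IndexOf does not.  The “Strasse” test string and the results noted in the comments assume .NET Framework’s Windows (NLS) collation; results can differ on other runtimes.

```csharp
using System;
using System.Globalization;

class FindComparisonDemo
{
    static void Main()
    {
        string field = "Strasse"; // contains "ss", but no "ß" code point
        string query = "ß";

        // Culture-aware (linguistic) search: under Windows NLS collation,
        // "ß" is treated as equivalent to "ss", so this can report a match.
        int linguistic = CultureInfo.InvariantCulture.CompareInfo
            .IndexOf(field, query, CompareOptions.None);

        // Ordinal search: compares raw code points, so "ß" is not found.
        int ordinal = field.IndexOf(query, StringComparison.Ordinal);

        Console.WriteLine($"Culture-aware: {linguistic}"); // typically 4 (a match)
        Console.WriteLine($"Ordinal:       {ordinal}");    // -1 (no match)
    }
}
```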

What exactly does this look like?  Here’s an example:


 

In this case, both “ß” in the various normalizations and “ss” are found.  For example:


However, when we shift the query to an ordinal search, we return results just for “ß” in its various normalizations, but not culturally equivalent expressions like “ss”.


By providing different comparison options, users can get a better idea of what types of information are showing up in their records.

Finally, replacements always happen ordinally.  Unlike search in .NET, which expands data into its culturally equivalent expressions, replacements are always ordinal, so the code points must match exactly.  This is why the option to enforce Unicode normalizations is important, as it enables this to work across values that can be expressed using a wide range of valid code points.
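
A small, self-contained illustration using standard .NET string APIs (this is not MarcEdit’s internal code): String.Replace matches ordinally, so a precomposed “é” in the find text won’t touch a decomposed “é” in the record until both sides share the same normalization.

```csharp
using System;
using System.Text;

class OrdinalReplaceDemo
{
    static void Main()
    {
        string composed   = "caf\u00E9";  // "café" with a single precomposed code point (NFC)
        string decomposed = "cafe\u0301"; // "café" as "e" + combining acute accent (NFD/NFKD)

        // Ordinal replacement: nothing changes, because U+00E9 never occurs
        // in the decomposed string, even though both render identically.
        Console.WriteLine(decomposed.Replace("\u00E9", "e") == decomposed); // True

        // Normalizing the record data first makes the ordinal replacement line up.
        string normalized = decomposed.Normalize(NormalizationForm.FormC);
        Console.WriteLine(normalized.Replace("\u00E9", "e")); // "cafe"

        Console.WriteLine(composed == decomposed);            // False: different code points
        Console.WriteLine(composed == normalized);            // True: same normalization
    }
}
```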

Make sense?  This will be available in all versions of MarcEdit.

–tr

MarcEdit 7 Clustering Enhancements

By reeset / On / In MarcEdit

One area where I’d like to see MarcEdit continue to evolve is support for the clustering of data to aid editing or data extraction.  While tools like OpenRefine provide a much more robust set of tooling, there are barriers to using those tools due to the nature of library data.  By embedding lightweight tools into MarcEdit, the tooling can help to overcome some of these issues.

To that end, I’m exploring a handful of additional clustering options for the application, and beginning the process of rolling a few new options out.  The first of these will be an enhancement to the way the program develops keys/tokens when clustering data.  By default, the program takes the data found in specific subfield codes and does some very light normalization before passing the data through a set of fuzzy matching algorithms.  This process produces clusters, but can miss some data if names or values are inverted.  Take for example:

“Reese, Terry” and “Terry Reese”.  The clustering algorithms likely won’t put these together, because the distance required to normalize them together is pretty high; they would likely be represented as separate clusters.  But this is very much one of the use cases that should be addressed.  To that end, I’ve added an option that utilizes the same approach OpenRefine uses – tokenized fingerprints.  Rather than working with the data as provided, the tool breaks down the strings, normalizes away punctuation and common diacritics, and then sorts the tokens, so that “Reese, Terry” and “Terry Reese” both turn into the same key: reese terry.  Utilizing a combination of fingerprinting and the fuzzy matching algorithms, users can take even more control over how clustering occurs in the application.
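
For the curious, here’s a rough C# sketch of what a tokenized fingerprint key can look like – an illustration of the general OpenRefine-style approach described above, not MarcEdit’s actual implementation: lowercase the value, strip combining diacritics and punctuation, then de-duplicate and sort the remaining tokens.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

static class Fingerprint
{
    // Build a tokenized fingerprint key: lowercase, decompose, drop diacritics
    // and punctuation, split into tokens, de-duplicate, and sort.
    public static string Key(string value)
    {
        string decomposed = value.ToLowerInvariant().Normalize(NormalizationForm.FormD);
        var cleaned = new StringBuilder();
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;                                      // drop combining diacritics
            cleaned.Append(char.IsLetterOrDigit(c) ? c : ' '); // punctuation -> space
        }
        var tokens = cleaned.ToString()
                            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                            .Distinct()
                            .OrderBy(t => t, StringComparer.Ordinal);
        return string.Join(" ", tokens);
    }
}

class FingerprintDemo
{
    static void Main()
    {
        Console.WriteLine(Fingerprint.Key("Reese, Terry")); // reese terry
        Console.WriteLine(Fingerprint.Key("Terry Reese"));  // reese terry
    }
}
```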

Users will see this option (in all versions of MarcEdit) within the Generate Cluster screen.


Alongside this new option, I’ll be extending the format support of the clustering tool.  Over the next week or so, I’ll be adding support for delimited formats (so you can cluster on columns) and XML documents (any form – again, you define the values to cluster on), allowing users to make changes across delimited data or any XML-formatted data.  The Excel/delimited formats will come first, the XML formats second.  With luck, I’ll have this work finished prior to hitting Austin for ER&L.

–tr

MarcEdit 7 Updates

By reeset / On / In MarcEdit

I posted updates that spanned all versions of MarcEdit yesterday.   The updates were primarily based around the thoughts I’ve been having on Unicode normalizations, and the need to provide a way to formalize and standardize normalizations across records.

Unicode Normalizations

In all versions of MarcEdit, there is a new setting in the preferences window:


When the option Enforce Defined Normalization is selected, the tool will ensure all input and output created through the MarcEditor and MarcEngine conforms to the selected normalization.  This will allow users to find and replace data without needing to worry about normalization, and will ensure that the data that comes out of MarcEdit complies with a specific normalization when working with UTF8 data.

Executable tasks

I had a question asking if MarcEdit’s Tasks could be created as executable files (like scripts).  The answer is: kind of.  I’ve added a new option to the tool – in the MarcEdit Task Manager, you can select a task for export and now choose: Task Executable (*.exe)


What this does is allow you to create a new program that wraps execution of the task.  Now, this doesn’t make the task portable or a standalone file.  The executable assumes MarcEdit is installed and that the task lives in your task store – but this gives you a shortcut of sorts that can be used to just drag and drop records for processing.  On the desktop, your new executable task will look like:


And you can drag files to process onto the icon.  The program will run the task, outputting the new file with the same file name as the processed file, with .rev.[extension] appended to it.

This is still somewhat a proof of concept – your mileage may vary.  But if you try it, let me know how it goes.

–tr

MarcEdit 7: The great [Normalization] escape

By reeset / On / In General Computing, MarcEdit, Programming

working out some thoughts here — this will change as I continue working through some of these issues.

If you follow MarcEdit development, you’ll know that last week I posted a question in a number of venues about the effects of Unicode normalization and its potential impacts for our community.  I’ve been doing a little bit of work in MarcEdit, having a number of discussions with vendors and folks that work with normalizations regularly – and have started to come up with a plan.  But I think there is a teaching opportunity here as well: an opportunity to discuss how we find ourselves having to deal with this particular problem, where the issue is rooted, and the impacts that I see right now in ILS systems and for users of tools like MarcEdit.  This isn’t going to be an exhaustive discussion, but hopefully it helps folks understand a little bit more about what’s going on, and why this needs to be addressed.

Background

So, let’s start at the beginning.  What exactly are Unicode normalizations, and why is this something that we even need to care about….

Unicode Normalizations are, in my opinion, largely an artifact of our (the computing industry’s) transition from a non-Unicode world to Unicode, especially in the way that the extended Latin character sets ended up being supported.

So, let’s talk about character sets and code pages.  Character sets define the language that is utilized to represent a specific set of data.  Within the operating system and programming languages, these character sets are represented as code pages. For example, Windows provides support for the following code pages: https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx.    Essentially, code pages are lists of numeric values that tell the computer how to map a representation of a letter to a specific byte.  So, let’s use a simple example, “A”.  In ASCII and UTF8 (and other) code pages, the A that we read is actually represented as a byte of data.  This byte is 0x41.  When the browser (or word processor) sees this value, it checks the value against the defined code page, and then provides the appropriate glyph from the font being utilized.  This is why, in some fonts, some characters will be represented as a “?” or a block.  These represent bytes or byte sequences that may (or may not) be defined within the code page, but are not available in the font.
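
As a quick illustration using .NET’s encoding classes (a sketch only; on .NET Core/.NET 5+, code page 1252 additionally requires registering the CodePagesEncodingProvider), the same visible characters map to different bytes depending on the code page:

```csharp
using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        // "A" is the single byte 0x41 in both ASCII and UTF-8.
        Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes("A"))); // 41
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("A")));  // 41

        // "é" (U+00E9) is one byte in the Windows-1252 code page, two bytes in UTF-8.
        Console.WriteLine(BitConverter.ToString(Encoding.GetEncoding(1252).GetBytes("\u00E9"))); // E9
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u00E9")));              // C3-A9
    }
}
```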

Prior to Unicode implementations, most languages had their own code pages.  In Windows, the U.S. English code page would default to 1252.  In Europe, if ISO-8859-1 was utilized, the code page would default to 28591.  In China, the code page could be one of many – maybe “Big-5”, or code page 950, or what is referred to as Simplified Chinese, code page 936.  The gist here is that prior to the Unicode standard, languages were represented by different values, and the keyboards, fonts, and systems would take the information about a specific code page and interpret the data so that it could be read.  Today, this is why catalogers may still encounter confusion if they get records from Asia where the vendor or organization makes use of “Big-5” as the encoding.  When they open the data in their catalog (or editor), the data will be jumbled.  This is because MARC doesn’t include information about the record’s code page – rather, it defines values as Unicode, or something else.  So, it is on catalogers and systems to know the character set being utilized, and to utilize tools to convert the bytes from a character encoding that they might not be able to use to one that is friendly for their systems.

So, let’s get back to this idea of normalization forms.  My guess is that much of the normalization mess that we find ourselves in is related to ISO-8859.  This code page and standard has been widely utilized in European countries, and provides a standard method of representing extended Latin characters [those between 129-255], though normalizations affect other languages as well.  Essentially, the Unicode specification included the ISO-8859 range to ease the transition, but also provided new, composed code points for many of the characters.  And normalizations were born.

Unicode Normalizations, very basically, define how characters are represented.  There are 4 primary normalization forms that I think we need to care about in libraries.  These are (https://en.wikipedia.org/wiki/Unicode_equivalence):

  1. NFC – Canonical composition: decomposed character sequences are replaced with their composed code points.
  2. NFD – Canonical decomposition: data is fully decomposed.
  3. NFKC – Compatibility decomposition, followed by the replacement of sequences with their primary composites, where possible.
  4. NFKD – Compatibility decomposition, with no recomposition.

 

Practically, what does this mean?  Well, it means that a value like é can be represented in multiple ways.  In fact, this is a good example of the problems that having differing Unicode normalization forms is causing in the library community.  In the NFC and NFKC notations, the value é is represented by a single code point that represents the letter and its diacritic fully.  In the NFD and NFKD notations, the character is represented by code points that correspond to the “e” and the diacritic separately.  This has definite implications, as composed characters make indexing of data with diacritical marks easier, whereas decomposed characters must be composed to index correctly.
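
In .NET terms (a minimal sketch, not tied to any particular application), the same “é” collapses to one code point under the composed forms and stays as two under the decomposed forms:

```csharp
using System;
using System.Text;

class NormalizationFormsDemo
{
    static void Main()
    {
        string decomposedE = "e\u0301"; // "é" written as "e" + combining acute accent

        foreach (var form in new[] { NormalizationForm.FormC, NormalizationForm.FormD,
                                     NormalizationForm.FormKC, NormalizationForm.FormKD })
        {
            string normalized = decomposedE.Normalize(form);
            // FormC and FormKC compose the pair into the single code point U+00E9;
            // FormD and FormKD leave it as two code points (U+0065 U+0301).
            Console.WriteLine($"{form}: {normalized.Length} code point(s)");
        }
    }
}
```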

And how does this affect the library community?  Well, we have this made-up character encoding known as MARC8 (https://en.wikipedia.org/wiki/MARC-8).  MARC8 is a library-specific character set (it doesn’t have a code page value, so all rendering is done by applications that understand MARC8) that has no equivalent outside of the library world.  Like many character sets with a need to represent wide characters (those with diacritics), MARC8 represented characters with diacritics by utilizing decomposed characters (though this decomposition was MARC8-specific).  For librarians, this matters because the U.S. Library of Congress, when providing instructions on support for Unicode in MARC records, provided for the ability to round-trip between MARC8 and UTF8 (http://www.loc.gov/marc/specifications/speccharucs.html).  This round-tripability comes at a cost, and that cost is that data, to be in sync with the recommendations, should only be provided in the NFKD notation.

This has implications, however, as current-generation operating systems are generally implemented utilizing NFC as the internal representation for string data, and for programmers, who have to navigate challenges within their languages: in most cases, language functions that deal with concepts like in-string searching or regular expressions are done using settings that make them culturally aware (i.e., allowing searching across data in different normalizations), but the replacement and manipulation of that data is almost always done using ordinal (binary) matching, which means that data using different normalization forms is not compatible.  And quite honestly, this is confusing the hell out of metadata people.

Using our “é” character as an example – a user may be able to open a program or work in a programming language and find this value regardless of the underlying data normalization, but when it comes to making changes, the data will need to match the underlying normalization; otherwise, no changes are actually made.  And if you are a user that is just looking at the data on the screen (without the ability to see the underlying binary data or without knowledge of what normalization is being used), you’d rightly start to wonder why the changes didn’t complete.  This is the legacy that round-trip support for MARC-8 has left us within the library community, and the implications of having data moving fluidly between different normalizations are having real consequences today.

Had We Listened to Gandalf

[Image: cat holding on, captioned “Run you fools” (Source: http://quicklol.com/wp-content/uploads/2012/03/run-you-fools-cat-lol.jpg)]

The ability to round-trip data from MARC-8 to UTF8 and back seemed like such a good idea at the time.  And the specifications that the U.S. Library of Congress laid out were/are easy enough to understand and implement.  But we should have known that it wasn’t going to be that easy, and that in creating this kind of backward compatibility, we were just looking for trouble down the road.

Probably the first indication that this was going to be problematic was the use of the Numeric Character Reference (NCR) form to represent characters that exist outside of the MARC-8 repertoire.  Once UTF8 became allowed and a standard for representation of bibliographic data, the frequency with which MARC-8 records were littered with NCR representations (i.e., &#xXXXX; notation) increased exponentially, as did the number of questions on the MarcEdit list about ways to find better substitutions for that data – primarily because most ILS providers never fully adopted support for NCR-encoded data.  Looking back now, what is interesting is that many of the questions related to the substitution of NCR notations can be traced to the utilization of NFC-normalized data and the rise in the presence of “smart” characters generated in our text editing systems.  Looking at the MarcEdit archive, I can find multiple entries from users looking to replace NCR data elements that exist simply because those elements represented composed code points, and were thus incompatible with MARC-8.  So, we probably should have seen this coming…and quite honestly, should have made a break.  Data created in UTF8 will almost always result in some level of data change when being converted back to MARC8…we should probably have just accepted that as a likely outcome, and not worried about the importance of round-tripability.

But…we have, and did, and now we have to find a way to make the data that we have work within the limitations of our systems.   But what are the limitations or consequences when thinking about the normalization form of data?  The data should render the same, right?  The data should search the same, right?  The data should export the same, right?  The answer to those questions is that this shouldn’t matter if the local system was standardizing the normalization of data as it is added to or exported from the system, but in practice, it appears that few (if any) systems do that, so the normalization form of the data can have significant impacts on what the user sees, can discover, or can export.

What the user sees

Probably the most perplexing issues related to the normalization form of data arise in how the data is rendered to the user.  While normalization forms differ at the binary level, the system should be able to accommodate these differences so that they aren’t visible to the user.  Throughout this document, I’ve been using different normalized forms of the letter “é”, but if the browser and the operating system are working like they are supposed to, you – as the reader – shouldn’t be aware of these differences.  But we know that this isn’t always the case.  Here’s one such example:


The top set of data represents the data seen in an ILS prior to export.  The bottom shows the data once reimported, after the normalization form had shifted from NFC to NFKD.  The interface being presented to the user has chosen to represent the data as bytes to flag that the data is represented as a decomposed character.  But this is jarring to the user, who shouldn’t have to care.

The above example is actually not as uncommon as you might think.  In experimenting with a variety of ILS systems, changes in normalization form can often have unintended effects for the user…and since it is impossible to know which normalization form is utilized without looking at the data at the binary level – how would one know when changes to records will result in significant changes to the user experience?

The short answer is, you can’t.  I started to wonder how OCLC treats Unicode data, and if internally, OCLC normalized the data coming into and out of its system.  And the answer is no – as long as the data is valid, the characters, in whatever normalization, are accepted into the system.  To test this, I made changes to the following record: http://osu.worldcat.org/title/record-builder-added-this-test-record-on-06262013-130714/oclc/850940559.  First, I was interested in whether any normalization was happening when interacting with OCLC’s Metadata API, and secondly, I was wondering if data brought in with different normalizations would impact searching of the resource.  And the answers to these questions are interesting.

First, I wanted to confirm that OCLC accepted data in any normalization provided (as was relayed to me by OCLC), and indeed that is the case.  OCLC doesn’t do any normalization, as far as I can tell, of data going into the system.  This means that a user could download a master record, make no other change to the record but updating the normalized form, and replace that record.  From the user’s perspective, the change wouldn’t be noticeable – but at the data level, the changes could be profound.  Given the variety of differences in how different ILS systems utilize data in the different Unicode normalization forms, this likely explains some of the “diacritic display issue” questions that periodically make their way onto the MarcEdit listserv.  Users are expecting that their data is compatible with their system because the OCLC data downloaded is in UTF8 and their system supports UTF8.  However, unknown to the cataloger, the reliance on data existing in a specific normalized form may cause issues.

The second question I was interested in, as it related to OCLC, was indexing.  Would a difference in normalization form cause indexing issues?  We know that in some systems, it does.  For many European users, I have long recommended using MarcEdit’s normalization options to ensure that data converted to UTF8 utilizes the NFC normalization, as it enables local systems to index data correctly (i.e., index the letter + diacritic, rather than the letter, then the diacritic, then other data).  I was wondering if OCLC would demonstrate this kind of indexing behavior, but curiously, I found OCLC had trouble indexing any data with diacritical values.  Since I’m sure that isn’t the expected result, I’ve reached out to see exactly what the expectation for the user is.

Indexing implications

As noted above, for years now, I’ve recommended that users who utilize Koha as their ILS configure MarcEdit to use the NFC normalization as the standard output when converting data between MARC-8 and UTF-8.  The reason for this has been to ensure that data indexes correctly rather than flatly.  But maybe this recommendation should have been made more broadly.  While I didn’t look at every system, one common aspect of many of the systems that I did look at is that data normalized as NFKD tends not to index the diacritical value at all.  They either normalize all diacritical data away, or they index the data as it appears in the binary – so, for example, a value like “évery” would index as e_acute_very, i.e., the indexed value would be a plain “e”; but if the data appeared in NFC notation, the data would be indexed as an “é” (the combined character), allowing users to search for data using the letter + diacritic.  How does your system index its data?  It’s a question I’m asking today, and I’m wondering how much of an impact normalization form has within the ILS, as well as outside it (as we reuse data in a variety of contexts).  Since each system may make different assumptions and indexing decisions based on the UTF8 data presented, it’s an interesting question to consider.

Export implications

The best-case scenario is that a system would export data the same way that it’s represented in the system.  This is what OCLC does – and while it likely helps to exacerbate some of the problems I see upstream with systems that may look for specific normalizations, it’s regular and expected.  Is this behavior the rule?  Unfortunately, it is not.  I see many examples where data is altered on export, and often, when the issue is diacritic-related, it can be traced to the normalized form of the original data.  Again, the system probably shouldn’t care which form is provided (in a perfect world), but if the system is implementing the MARC specification as written (see the LC guidance above), then developing operations around the expectation of NFKD-formed data would likely lead to complications.  But again, you’d likely never know until you tried to take the data out of the system.

Thinking about this in MarcEdit

So if you’ve stayed with me this long, you may be wondering if there is anything that we can do about these problems, short of getting everyone to agree to normalize our data a certain way (good luck).  In MarcEdit, I’ve been looking at this question in order to address the following problems that I get asked about regularly:

  1. When I try to replace x diacritic, I can find the instances, but when I try to replace, only some (or none) are replaced
  2. When I import my data back into my system, diacritics are decomposed
  3. How can I ensure my records can index diacritics correctly

 

The first two issues come up periodically, and are especially confusing to users because the difference in the data is at a binary level – so, hard to see.  For the last issue, MarcEdit has provided a half answer.  It has always provided a way to set the normalization when converting data to UTF8, but once there, it assumes that the user will provide the data in the form that they require (I’m realizing this is a bad assumption).

To address this problem, I’m providing a method in MarcEdit that will allow the user to force UTF8 data into a specific normalization, and will enable the application to support search and replace of data regardless of the normalized form of a character that a user might use.  This will show up in the MarcEdit preferences.  Under the MARCEngine settings, there are options related to data normalization.  These show up as:


MarcEdit has included support for some time to set the normalization used when compiling data.  But this doesn’t solve the problem when trying to edit, search, etc. records in the MarcEditor or within other areas of the program.  So, a new option will be available – Enforce Defined Normalization.  This will enable the application to save data in the preferred normalization and also force all user-submitted data through a wrapper that will enable edit operations to be completed, regardless of the normalized form a user may use when searching for data or the underlying normalization form of the individual records.  Internally, MarcEdit will make this process invisible, but the output created will be records that place all UTF8 characters into the specified normalization.  This seems to be a good option, and it’s very unlikely that tomorrow the systems that we use will suddenly all start to use UTF8 data the same way – and taking this approach, they don’t have to.  MarcEdit will work as a bridge, taking data in any UTF8 normalization and ensuring that the data outputted meets the criteria specified by the user.
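
Conceptually – and this is only a sketch of the idea, with a made-up helper name, not the actual MarcEdit wrapper – the enforcement layer boils down to pushing both the user’s find/replace strings and the record data through the same normalization before any ordinal operation runs:

```csharp
using System;
using System.Text;

static class NormalizationWrapper
{
    // Hypothetical helper: apply the preferred normalization to the record data
    // and to the user-supplied find/replace strings, so the ordinal replacement
    // underneath always sees matching code point sequences.
    public static string NormalizedReplace(string recordData, string find, string replace,
                                           NormalizationForm preferred)
    {
        return recordData.Normalize(preferred)
                         .Replace(find.Normalize(preferred), replace.Normalize(preferred));
    }
}

class WrapperDemo
{
    static void Main()
    {
        string record = "cafe\u0301"; // decomposed "café" in the record
        string find   = "caf\u00E9";  // precomposed "café" typed by the user

        // Without the wrapper, an ordinal replace finds nothing; with it, the
        // edit completes and the output is in the preferred normalization.
        Console.WriteLine(record.Replace(find, "tea"));                              // unchanged
        Console.WriteLine(NormalizationWrapper.NormalizedReplace(
            record, find, "tea", NormalizationForm.FormC));                          // "tea"
    }
}
```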

Sounds good – I think so.  But it makes me a little nervous as well.  Why?  Because OCLC takes any data provided to it.  In theory, a record could switch normalizations multiple times if users pulled data down, edited it using this option, and uploaded the data back to the database.  Does this matter?  Will it cause unforeseen issues?  I don’t know – I’m asking OCLC.  I also worry that allowing users to specify the normalization form could have cascading issues when it comes to record sharing.  Not everyone uses MarcEdit (nor should they), and it’s hard to know what impact this makes on other coding tools, etc.  This is why this function won’t be enabled by default – it will need to be turned on by the user – as I continue to inquire and have conversations about the larger implications of this work.  The short answer is that this is a pain point, and a problem that needs to be addressed somehow.  I see too many questions and too many records where the normalization form of the data plays a role in providing confusing data to the user, confusing data to the cataloger, or difficulties in reusing or sharing the data with other systems/processes.  At the same time, this feels like a band-aid fix until we reach a point in the evolution of our systems and metadata where we can free ourselves from MARC-8 and begin to think only about our data in UTF8.

Conclusions

So what should folks take away from all this?  Let’s start with the obvious.  Just because your data is in UTF8 doesn’t mean that it’s the same as my data in UTF8.  Normalization forms of data – a tool that was initially used to ease the transition from non-Unicode to Unicode data – can have other implications as well.  The information that I’ve provided is just a set of examples of challenges that make their way to me due to my work with MarcEdit.  I’m sure other folks have had different experiences…and I’d love to hear about these if you want to provide them below.

Best,

–tr

MarcEdit Unicode Question [also posted on the listserv]

By reeset / On / In character encodings, MarcEdit

** This was posted on the listserv, but I’m putting this out there broadly **
** Updated to include a video demonstrating how Normalization currently impacts users **

Video demonstrating the question at hand:

 

So, I have an odd Unicode question, and I’m looking for some feedback.  I had someone working with MarcEdit and looking for é.  This (and a few other characters) presents some special problems when doing replacements, because it can be represented by multiple code points: as a letter + diacritic (like you’d find in MARC8) or as a single code point.

Here’s the rub.  In Windows 10 — if you do a find and replace using either type of normalization (.NET supports 4 major normalizations), the program will find the string and replace the data.  The problem is that it replaces the data in the normalization that is presented — meaning that if your file has data where your system provides multiple code points (the traditional standard with MARC21 — what is called the KD normalization) and you do a search where the replacement uses a single code point, the replacement will replace the multiple code points with a single code point.  This is, apparently, a Windows 10 behavior.  But I find this behaves differently on Mac systems (and Linux) — which is problematic and confusing.

At the same time, most folks don’t realize that characters like é have multiple representations, and MarcEdit can find them but won’t replace them unless they are ordinally equivalent (unless you do a case-insensitive search).  So, the tool may tell you it’s found fields with this value, and when the replacement happens it reports replacements having been made, but no data is actually changed (because ordinally, they are *not* the same).

So, I’ve been thinking about this.  There is something I could do.  In the preferences, I allow users to define which Unicode normalization they want to use when converting data to Unicode.  This value is only used by the MarcEngine.  However, I could extend this to the editing functions.  Using this method, I could force data that comes through the search to conform to the desired normalization — but you would still have times where, say, you are looking for data that is normalized in Form C, you’ve told me you want all data in Form KD, and so again, é may not be found because, ordinally, they are not the same.

The other option — and this seems like the least confusing, but it has other impacts — would be to modify the functions so that the tool tests the Find string and, based on the data present, normalizes all data so that it matches that normalization.  This way, replacements would always happen appropriately.  Of course, this means that if your data started in KD notation, it may end up (would likely end up, if you enter these diacritics from a keyboard) in C notation.  I’m not sure what the impact would be for ILS systems, as they may expect one notation and get another.  They should support all Unicode notations, but given that MARC21 assumes KD notation, they may be lazy and default to that set.  To prevent normalization switching, I could have the program, on save, ensure that all Unicode data matches the encoding specified in the preferences.  That would be possible — it comes with a small speed cost — probably not a big one — but I’d have to see what the trade-off would be.

I’m bringing this up because on Windows 10 — it looks as though the Replace functionality in the system is doing these normalizations automatically.  From the user’s perspective, this is likely desired, but from a final-output perspective — that’s harder to say.  And since you’d never be able to tell if the normalization has changed unless you looked at the data under a hex editor (because honestly, it shouldn’t matter, but again, if your ILS only supported a single normalization, it very much would) — this could be a problem.

My initial inclination, given that Windows 10 appears to be doing normalization on the fly, allowing users to search and replace é in multiple normalizations — is to potentially normalize all data that is recognized as UTF8, which would allow me to filter all strings going into the system, and then, when saving, push out the data using the normalization that was requested.  But then, I’m not sure if this is still a big issue, or if knowing that the data is in single or multiple code points (from a find and replace perspective) is actually desired.

So, I’m pushing this question out to the community, especially as UTF8 is becoming the rule, and not the exception.

MarcEdit MacOS 3 has Arrived!

By reeset / On / In MarcEdit

MarcEdit MacOS 3 is the latest branch of the MarcEdit 7 family. MarcEdit MacOS 3 represents the next generational update for MarcEdit on the Mac and is functionally equivalent to MarcEdit 7. MarcEdit MacOS 3 introduces the following features:

  1. Startup Wizard
  2. Clustering Tools
  3. New Linked Data Framework
  4. New Task Management and Task Processing
  5. Task Broker
  6. OCLC Integration with OCLC Profiles
  7. OCLC Integration and search in the MarcEditor
  8. New Global Editing Tools
  9. Updated UI
  10. More

 

There are also a couple of things that are currently missing that I’ll be filling in over the next couple of weeks. Presently, the following elements are missing in the MacOS version:

  1. OCLC Downloader
  2. OCLC Bib Uploader (local and non-local)
  3. OCLC Holdings update (update for profiles)
  4. Task Processing Updates
  5. Need to update Editor Functions
    1. Dedup tool – Add/Delete Function
    2. Move tool — Copy Field Function
    3. RDA Helper — 040 $b language
    4. Edit Shortcuts — generate paired ISBN-13
    5. Replace Function — Exact word match
    6. Extract/Delete Selected Records — Exact word match
  6. Connect the search dropdown
    1. Add to the MARC Tools Window
    2. Add to the MarcEditor Window
    3. Connect to the Main Window
  7. Update Configuration information
  8. XML Profiler
  9. Linked Data File Editor
  10. Startup Wizard

Rather than hold the update until these elements are completed, I’m making the MarcEdit MacOS version available now so that users can begin testing and interacting with the tooling while I finish adding the remaining elements to the application. Once completed, all versions of MarcEdit will share the same functionality, save for elements that rely on technology or practices tied to a specific operating system.

Updated UI

MarcEdit MacOS 3 introduces a new UI. While the UI is still reflective of MacOS best practices, it also shares many of the design elements developed as part of MarcEdit 7. This includes new elements like the StartUp Wizard with the Fluffy install agent:

 

The Setup Wizard provides users the ability to customize various application settings, as well as import previous settings from earlier versions of MarcEdit.

 

Updates to the UI

New Clustering tools

MarcEdit MacOS 3 provides MacOS users more tools, more help, more speed…it gives you more, so you can do more.

Downloading:

Download the latest version of MarcEdit MacOS 3 from the downloads page at: http://marcedit.reeset.net/downloads

-tr

MarcEdit MacOS 3 Design notes

By reeset / On / In MarcEdit

** Updated 12/28 **

Ok, so I’m elbow deep in putting some of the final touches on the MacOS version of MarcEdit. Most of the changes to be completed are adding new functionality (introduced in MarcEdit 7 in recent updates), implementing the new task browser, updating the terminal mode, and completing some of the UI touches. My plan had been to target Jan. 1 as the release date for the next MacOS version of MarcEdit, but at this point, I’m thinking this version will release sometime between Jan. 1 and Jan. 7. Hopefully, folks will be OK if I need a little bit of extra time.

As I’m getting closer to completing this work, I wanted to talk about some of the ways that I’m thinking about how the MacOS version of MarcEdit is being redesigned.

  1. As much as possible, I’m syncing the interfaces so that all versions of MarcEdit will fundamentally look the same. In some places, this means significantly redoing parts of the UI. In others, it’s mostly cosmetic (colors). While the design does have to stay within Apple’s UI best practices (as much as possible), I am trying to make sure that the interfaces will be very similar. Probably the biggest differences will be in the menuing. To do this, I’ve had to improve the thread queuing system that I’ve developed in the MacOS version of MarcEdit to make up for some of the weaknesses in how default UI threads work.
  2. One of the challenges I’ve been having is related to some of the layout changes in the UI. To simplify this process, the initial version of the Mac update won’t allow a lot of the forms to be resized. The form itself will resize automatically based on the font and font sizes selected by the user – but resizing and scaling all the items on the window and views automatically, and in spatial context, is proving to be a real pain. Looking at a large number of Mac apps, window resizing doesn’t always appear to be available anyway (unless the window is more editor-based) – so maybe this won’t be a big issue.
  3. Like MarcEdit 7, I’m working on integrations. Enhancing the OCLC integrations, updating the ILS integrations, integrating a lot of help into MarcEdit MacOS, adding new wizards, integrating plugins – I’m trying to make sure that I fully embrace with the MacOS update, one of the key MarcEdit 7 design rules – that MarcEdit should integrate or simplify the moving of data between systems (because you are probably using more programs than MarcEdit).
  4. MarcEdit MacOS 3 should be functionally equivalent to MarcEdit 7, with the following exceptions:
    1. There is no COM functionality (this is windows only)
    2. Initially, there will be no language switching (the way controls are named is very different than on Windows – I haven’t figured out how to bridge the differences)
  5. Like MarcEdit 7, I’m targeting this build for newer versions of MacOS. In this case, I’ll be targeting 10.10+. This means that users will need to be running Yosemite (released in 2014) or greater. Let me know if this is problematic. I can push this down one, maybe two versions – but I’m just not sure how common older OSX versions are in the wild.

 

Here’s a working wireframe of the new main MacOS MarcEdit update.

Anyway – these are the general goals. All the functional code written for MarcEdit 7 has been able to be reused in MarcEdit MacOS 3, so like the MarcEdit 7 update, this will be a big release.

As I’ve noted before, I’m not a Mac user. I’ve spent more time in the ecosystem to get a better idea of how programs handle (or don’t handle) resizing windows, etc. – but fundamentally, working with MacOS feels like working with a broken version of Linux. So, with that in mind, I’m developing MarcEdit MacOS 3 using the concepts above. Once completed, I’d be happy to talk with (and hear from) primarily-MacOS users about how some of the UI design decisions might be updated to make the program easier for MacOS users.

Best,

–tr

MarcEdit 7: Holiday Edition

By reeset / On / In MarcEdit

I hope that this note finds everyone in good spirits. We are in the midst of the holiday season, and I hope that everyone this reaches has had a happy one. If you are like me, the past couple of days have been spent cleaning up. There are boxes to put away, trees to un-trim, decorations to store away for another year. But one thing has been missing, and that has been my annual Christmas Eve update. Hopefully, folks won’t mind it being a little belated this year.

The update includes a number of changes – I posted about the most interesting (I think) here: http://blog.reeset.net/archives/2493, but the full changelog is below:

  • Enhancement: Clustering Tools: Added the ability to extract records via the clustering tooling
  • Enhancement: Clustering Tools: Added the ability to search within clusters
  • Enhancement: Linux Build created
  • Bug Fix: Clustering Tools: Numbering at the top wasn’t always correct
  • Bug Fix: Task Manager: Processing number count wouldn’t reset when run
  • Enhancement: Task Broker: Various updates to improve performance and address some outlier formats
  • Bug Fix: Find/Replace Task Processing: Task Editor was incorrectly always checking the conditional option. This shouldn’t affect runs, but it was messy.
  • Enhancement: Copy Field: Added a new field feature
  • Enhancement: Startup Wizard — added tools to simplify migration of data from MarcEdit 6 to MarcEdit 7

One thing I specifically want to highlight is the presence of a Linux build. I’ve posted a quick video documenting the installation process at: https://www.youtube.com/watch?v=EfoSt0ll8S0. The MarcEdit 7 Linux build is much more self-contained than previous versions, something I’m hoping to do with the MacOS build as well. I’ll tell folks upfront, there are some UI issues with the Linux version – but I’ll keep working to resolve them. However, I’ve had a few folks asking about the tool, so I wanted to make it ready and available.

Throughout this week, I’ll be working on updating the MacOS build. I’ve fallen a little behind, so this build may take an extra week to complete (I was targeting Jan. 1; it might slip a few days past). Functionality-wise, I think folks will be happy, as it fills in a number of gaps while still integrating the new MarcEdit 7 functionality (including the new clustering tools).

As always, if you have questions, please let me know. Otherwise, I’d like to wish everyone a Happy New Year, filled with joy, love, friendship, and success.

Best,

–tr

MarcEdit 7 Christmas Update: Preview

By reeset / On / In MarcEdit

As is my habit – there will be an update coming out around Christmas. And while it won’t be a large update (since MarcEdit 7 was just made available) – I think there will be a couple of new features that will make the changes worth it.

Clustering:

I’m continuing to enhance the clustering functionality – and for the Christmas update, I will be adding the ability to search within the clusters, as well as the ability to extract records from selected clusters (rather than just providing the ability to change the data in the cluster). By allowing the extraction of records within a cluster, this will give users the ability to use the clustering tools to extract record sets, and then run specific reports or perform selected edits against very targeted data.

Extracting Records:

New Search Functionality:


 

These two new clustering options should, I hope, give users some additional control over how they search for and interact with clustered data within MarcEdit, and also provide some new functionality that continues to give all catalogers, regardless of their technical background, the ability to utilize the power that clustering data can provide.

Copy Field Changes:

One common question – one that usually involves a fairly complicated regular expression in MarcEdit’s multi-line replacement mode (which can be terrifying for some to use) – is how to move fields from the same field group into a new field group. For example, when converting data from a non-MARC metadata format, it might not be possible to set up a process that distinguishes between first and second authors. So, the final transformation may look something like this:

=100 \\$aLast Name, First Name
=100 \\$aLast Name2, First Name
=245 10$aTitle

In this instance, it would be desirable to be able to move the data from the second 100 field into a 700 field. As noted above, this was previously accomplished with a regular expression. However, this update introduces a new option in the Copy Field Function: Move Field Data.

 

The Move Field Data option allows users to identify a field group and then set the positions that shouldn’t be changed. So, in my example, I would set the preserve position to 1, which would then update field #2 (or 3 or 4 or 5, etc.). Currently the tool does not allow you to preserve a range of values, but I may try to flesh out that functionality in anticipation of the request, assuming that the process is straightforward. If it’s not, then I’ll wait for the request.
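
To make the behavior concrete, here’s a rough sketch (not MarcEdit’s actual implementation, and the helper name is made up) of the same operation applied to mnemonic-format text: preserve the first occurrence of the 100 group and retag later occurrences as 700s.

```csharp
using System;
using System.Collections.Generic;

class MoveFieldSketch
{
    // Keep the first `preserve` occurrences of a field group and retag the rest.
    static IEnumerable<string> MoveExtraOccurrences(IEnumerable<string> record,
                                                    string sourceTag, string targetTag, int preserve)
    {
        int seen = 0;
        foreach (string line in record)
        {
            if (line.StartsWith("=" + sourceTag) && ++seen > preserve)
                yield return "=" + targetTag + line.Substring(4); // retag; indicators/subfields unchanged
            else
                yield return line;
        }
    }

    static void Main()
    {
        var record = new[]
        {
            @"=100 \\$aLast Name, First Name",
            @"=100 \\$aLast Name2, First Name",
            @"=245 10$aTitle"
        };

        // The second 100 becomes a 700; everything else passes through untouched.
        foreach (var line in MoveExtraOccurrences(record, "100", "700", preserve: 1))
            Console.WriteLine(line);
    }
}
```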

MarcEdit 6 to MarcEdit 7 Migration

I’m working on a set of common questions that have come up, and one of the most common has been related to moving tasks from MarcEdit 6 into MarcEdit 7. By default, the tool attempts to make that transition for you – but in many cases, the process isn’t able to automatically transfer the data. So, I’ve been spending some time adding this to the initial Startup Wizard. Now, when you first install MarcEdit 7, the tool will attempt to determine if you have a copy of MarcEdit 6 installed on your machine. If you do, a new Wizard page will show up to walk you through the data migration process.

The new Wizard page looks like the following:

If you click on Select Data to Migrate:


At this point, you can select the classes of data that you want to import into MarcEdit 7. Some users might want to pull all their data into MarcEdit 7, while others may just want tasks. Select Export – and then wait for the tool to finish migrating your data.

Linux Version of MarcEdit 7

Finally, on Christmas, I will post a zip file with instructions for running MarcEdit 7 on Linux. I’m still wrapping up the “build” but I’m hoping that this version of MarcEdit 7 will require zero configuration work to make it run – though, I will be updating the ReadMe file to match the new install/run information.

And I think that is mostly it. I may include some additional help information, a couple new videos/documentation pages – and the MacOS version of MarcEdit is still on target for Jan. 1, 2018.

If you have any questions, feel free to let me know.

–tr