MarcEdit 7 Update: XML Profile/OAI Harvester Changes

Updated: BePress reached out on Twitter, pointing to a status page created this afternoon (May 8th): https://www.bepress.com/status/  I’m willing to assume good intentions here, and trust that this will be a temporary condition.  I’ll also trust that in the future, they will do a better job of notifying their customers before blacklisting services that they might be using.

I’ve been keeping myself busy over the past few weeks rebuilding my MacOS environment (it kind of bricked after a failed update) so I can get a number of changes moved to the Mac version of MarcEdit.  Crossing my fingers that this will happen this week.

In the meantime, I’ve completed two significant updates to MarcEdit.  The first is an expansion of the XML Profiler.  When MarcEdit 7 came out, the idea was that this tool would enable users to process XML and JSON data without needing to understand exactly how the processing happens.  I started with XML processing because that is more prevalent, but after receiving my first set of vendor data in JSON, I went about adding the JSON functionality.  I’ve documented the changes in this video:

OAI Harvester changes

The second update is related to the OAI Harvester.  These changes were necessitated by some bad actors, specifically BePress.  About a week ago, I started getting feedback from users that MarcEdit was failing when harvesting BePress sites.  That seemed kind of odd, so I started to investigate.  Users told me that BePress couldn’t tell them exactly what was wrong, and on their own end, they couldn’t tell that anything had changed.  So, for a week, I poked at this, and for the longest time, I couldn’t figure out what was going on because in my tests, everything kept working.  It wasn’t until I started playing with the user agent string that I discovered that BePress appears to be actively blocking MarcEdit users from harvesting their own data from their sites.  I say "appears" because I haven’t actually talked to BePress, but from the tests, I think it’s kind of obvious.  I recorded this video.

You can see that when the tool uses either the default UserAgent or any UserAgent with MarcEdit in the name, the service returns a 503.  When I change the UserAgent string to anything else, the process completes.  A coincidence?  I doubt it.
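For anyone who wants to reproduce this kind of check outside of MarcEdit, here is a minimal sketch in Python.  It is not MarcEdit’s code: the function names are made up for illustration, and you would need to supply your own repository URL and the agent strings you want to compare.  It simply sends the same request with different UserAgent strings and records the status code each one gets back.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent, timeout=30):
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error responses (such as a 503) are raised as HTTPError; keep the code.
        return err.code


def compare_user_agents(url, user_agents, fetch=fetch_status):
    """Map each User-Agent string to the status code it receives."""
    return {agent: fetch(url, agent) for agent in user_agents}


def blocked_agents(results):
    """List agents refused with a 503 while at least one other agent got a 200."""
    if not any(code == 200 for code in results.values()):
        return []  # nothing succeeded, so the service may be down for everyone
    return [agent for agent, code in results.items() if code == 503]
```

If one agent string consistently gets a 503 while another gets a 200 against the same URL, that points to filtering on the UserAgent rather than a general outage.  A single run is only a hint; repeat it a few times before drawing conclusions.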

Now, I use these user agent strings so that vendors can contact me if they find that MarcEdit’s process is too aggressive.  And vendors have, and we talk about best HTTP practice so that they can let MarcEdit know if the service is being overwhelmed.  This is how it’s supposed to work.  Now, I could change the UserAgent string, or automatically rotate it and spoof browser agent strings, making it really difficult to determine the client, but I’d prefer not to do that.  So I won’t.  But I will enable users to do this, mostly because the systems they are trying to harvest from are their own, and the data they are trying to capture is, again, their own.  With this update, users will find a new option in the Advanced Settings tab: UserAgent string.


Included you will find the default MarcEdit user agent, as well as two agents that match Firefox and Edge.  These rotate as the browsers update themselves.  Users can use one of these default agents and spoof their request as a browser, or they can edit the string and add their own custom agent.  Personally, if BePress is intent on limiting harvesting on their system, I would recommend working with them so you can provide a custom UserAgent that identifies the process as your organization.  Better yet, if you are a BePress customer, you should let them know that they shouldn’t be doing this kind of filtering to begin with.
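To make the custom UserAgent idea concrete, here is a small Python sketch of what a harvest request that identifies your organization could look like on the wire.  This is an illustration only, not how MarcEdit builds its requests: the base URL, the agent string, and the function name are all placeholders.  The verb and metadataPrefix parameters are standard OAI-PMH.

```python
import urllib.parse
import urllib.request


def build_harvest_request(base_url, user_agent, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request that carries a custom User-Agent."""
    query = urllib.parse.urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    )
    return urllib.request.Request(
        f"{base_url}?{query}",
        headers={"User-Agent": user_agent},
    )


# Placeholder values: swap in your repository's OAI endpoint and a string
# that tells the vendor who is harvesting and how to reach you.
request = build_harvest_request(
    "https://example.org/do/oai/",
    "ExampleUniversityLibrary-Harvester/1.0 (mailto:systems@example.org)",
)
```

Including a contact address in the agent string keeps the original purpose of the UserAgent intact: the vendor can still reach someone if the harvest is too aggressive.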

Anyway — these changes are available in the latest version of MarcEdit 7.  I’ll be working this week to move these into the Linux and Mac versions.

The update is available through the automated update tool, or can be downloaded directly from: http://marcedit.reeset.net/downloads/

–tr

