We hear the refrain over and over – we live in a global community. Socially, politically, economically – the ubiquity of the internet and free or cheap communications has changed the world we live in. Software developers have felt this shift as well. My primary domain tends to be software built for the library community, but I’ve participated in a number of open source efforts in other domains too, and while it is easier than ever to make one’s project and source available to the masses, localization of those projects is still largely overlooked. And why? Internationalization work is hard, and it often requires large numbers of volunteers proficient in multiple languages to produce quality translations of content across a wide range of languages. It also tends to slow down the development process and requires developers to create interfaces and inputs that support language sets they themselves may not be able to test or validate.
If your project team doesn’t have the language expertise to provide quality internationalization support, a variety of options are available (with the best ones reserved for those with significant funding). These range from tools open to any open source project, like TranslateWiki (https://translatewiki.net/wiki/Translating:New_project), a platform where volunteers participate in crowd-sourced translation, to very good subscription services like Transifex (https://www.transifex.com/), which works as both a platform and a match-making service between projects and translators. Additionally, Amazon’s Mechanical Turk can be used for one-off translations at fairly low cost. The main point is that services exist across a wide spectrum of cost and quality. The challenge, of course, is that many of these services require a significant amount of match-making, either by the service or by the people involved with the project, and often money. All of this takes time, sometimes a significant amount of time, which makes it a difficult cost/benefit analysis when deciding which languages are worth the investment of time and resources.
This is a problem I’ve been running into a lot lately. I work on a number of projects where the primary user community hails largely from North America; or, at least, the community I interact with most often is fairly English-language centric. But that’s changing. I’ve seen a rapidly growing international community and increasing calls for localized versions of software and utilities that have traditionally had very niche audiences.
I’ll use MarcEdit (http://marcedit.reeset.net) as an example. Over the past 5 years, I’ve seen the number of users working with the program steadily increase, with much of that growth coming from an expanding international user community. Today, one third to one half of each month’s total application usage comes from outside North America, a number I would never have expected when I first started working on the program in 1999. But things have changed, and finding ways to support these changing demographics is challenging.
In thinking about ways to provide better localization support, one idea I found particularly interesting was marrying automated translation with human intervention. The idea is that a localized interface could be generated automatically with a machine translation tool to provide a “good enough” translation, which could also serve as a template for human volunteers to correct and improve. This would enable support for a wide range of languages where English really is a barrier but no human volunteer has stepped forward to provide a localized translation, while giving established communities a “good enough” template to use as a jumping-off point, speeding up the process of human-enhanced translation. Additionally, as interfaces change and are updated, or new services are added, automated processes could generate the initial localization until a local expert was available to provide a high-quality translation of the new content, avoiding a slowdown in the development and release process.
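The “machine output as a template, human corrections on top” idea above can be sketched in a few lines. This is a minimal Python illustration, not MarcEdit’s actual implementation (which is C#); the function name and the sample strings are hypothetical:

```python
# Minimal sketch: machine translation fills every key so the UI is never
# untranslated; human-reviewed entries override the machine output.
def merge_translations(machine: dict, human: dict) -> dict:
    merged = dict(machine)   # start from the "good enough" template
    merged.update(human)     # volunteer corrections win
    return merged

# Hypothetical example: the machine picked the wrong sense of "Score".
machine = {"File": "Fichier", "Score": "Pointage"}
human = {"Score": "Partition"}   # corrected by a volunteer
print(merge_translations(machine, human))
```

As volunteers review more strings, the human dictionary grows and the machine template matters less, but untranslated gaps never appear in the interface.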
This is an idea that I’ve been pursuing for a number of months now, and over the past week I’ve been putting it into practice. Using Microsoft’s Translation Services, I’ve been working on a process to extract all text strings from a C# application and generate localized language files from that content. Once the files have been generated, I’ve had them evaluated by native speakers to comment on quality and usability…and for the most part, the results have been surprising. While I had no expectation that translations generated by any automated service would be comparable to human-mediated translation, I was pleasantly surprised to hear that the automated output is very often good enough. That isn’t to say it’s without problems; there definitely are problems. The bigger question has been: do these problems impede the use of the application or utility? In most cases, the most glaring issue with the automated translation service has been context. Take the word “Score”. Within the context of MarcEdit and library bibliographic description, we know score refers to musical scores, not points scored in a game…context. The problem is that many languages make this distinction with distinct words, and if the translation service cannot determine the context, it tends to default to the most common usage of a term, which, in the case of library bibliographic description, is often incorrect. This has made for some interesting conversations with the volunteers evaluating the automated translations, which range from very good to downright comical. But by a large margin, evaluators have said that while the translations were at times very awkward, they would be “good enough” until someone could provide a better translation of the content. What’s more, the service gets enough of the content right that its output can serve as a template to speed up human translation.
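The extraction step can be sketched roughly as follows. This is an illustrative Python assumption, not the actual C# process: it pulls double-quoted string literals out of source text and writes them to a simple key/value language file that a translation service (or a volunteer) can fill in.

```python
import re

# Illustrative sketch of string extraction. The regex and the key=value
# file format are assumptions for demonstration, not MarcEdit's format.
STRING_RE = re.compile(r'"((?:[^"\\]|\\.)+)"')

def extract_strings(source: str) -> list:
    """Return every double-quoted string literal found in the source text."""
    return STRING_RE.findall(source)

def write_language_file(strings, path):
    """Write unique strings as numbered key=value pairs, one per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for i, text in enumerate(sorted(set(strings))):
            fh.write(f"string{i}={text}\n")

sample = 'MessageBox.Show("Score", "Edit Complete");'
print(extract_strings(sample))   # ['Score', 'Edit Complete']
```

In practice a real extractor would key each string stably (so translations survive reordering) and skip non-UI literals such as format strings and file paths.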
And for me, this is kind of what I wanted to hear.
Microsoft’s Translation Services
There really aren’t a lot of options for good, free automated translation services, and I guess that’s for good reason: it’s hard, and requires both resources and an adequate corpus to learn how to read and output natural language. I looked hard at the two services that folks would be most familiar with: Google’s Translation API (https://cloud.google.com/translate/) and Microsoft’s translation services (https://datamarket.azure.com/dataset/bing/microsofttranslator). When I started this project, my intention was to work with Google’s Translation API. I’d used it in the past with some success, but at some point in the past few years Google seems to have shut down its free API translation service and replaced it with a more traditional subscription service model. While the costs for that subscription (which tend to be based on the number of characters processed) are certainly quite reasonable, my usage will always be fairly low and a little scattershot, making the monthly subscription costs hard to justify. Microsoft’s translation service is also a subscription-based service, but it provides a free tier that supports 2 million characters of throughput a month. Since that more than meets my needs, I decided to start there.
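A quick back-of-the-envelope check shows why the free tier is sufficient for a run like the one described in this post. The 40-character average string length is an assumed figure for illustration; the file and string counts are those from my actual runs:

```python
# Rough quota check against the 2,000,000 character/month free tier.
# avg_chars is an assumption for illustration; the other figures come
# from the runs described in this post.
FREE_TIER_CHARS = 2_000_000
files = 15              # language files generated
strings_per_file = 1600 # strings translated per file
avg_chars = 40          # assumed average length of a UI string

total = files * strings_per_file * avg_chars
print(total, total <= FREE_TIER_CHARS)   # 960000 True
```

Even with a generous average string length, a full regeneration of every language file fits comfortably inside a single month’s free allocation.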
The service provides access to a wide range of languages, including Klingon (Qo’noS marcedit qaStaHvIS tlhIngan! nuq laH ‘oH Dunmo’?), which made working with the service kind of fun. Likewise, the APIs are well documented, though they can be slightly confusing due to a shift in authentication practice to an OAuth token-based process sometime in the past year or two. While documentation on the new process can be found, most code samples online still reference the now-defunct key/secret-key process.
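For reference, the token-based flow starts with a client-credentials request against the token endpoint. The sketch below builds that request in Python; the endpoint URL, scope, and parameter names follow the Azure datamarket OAuth flow as documented at the time and should be treated as assumptions to verify against the current docs (no network call is made here):

```python
import urllib.parse

# Sketch of the OAuth token request body for the translator API.
# Endpoint URL, scope, and field names are assumptions based on the
# Azure datamarket client-credentials flow of this era.
TOKEN_URL = "https://datamarket.accesscontrol.windows.net/v2/OAuth2-13"

def build_token_request(client_id: str, client_secret: str) -> str:
    """Build the form-encoded body for the client-credentials grant."""
    return urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "http://api.microsofttranslator.com",
    })

body = build_token_request("my-app-id", "my-secret")
print(body.split("&")[0])   # grant_type=client_credentials
```

The returned access token is then sent as a bearer token on each translation call, and tokens expire quickly (on the order of minutes), so long-running jobs have to refresh them periodically.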
So how does it work? Performance-wise, not bad. In generating 15 language files, it took around 5-8 minutes per file, with each file requiring close to 1,600 calls against the server. As noted above, accuracy varies, especially when translating one-word commands that could have multiple meanings depending on context. It was actually suggested that some of these context problems might be overcome by using a language other than English as the source, which is a really interesting idea and one that might be worth investigating in the future.
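Those figures work out to a sub-second round trip per call, which is what makes the one-call-per-string approach workable at this scale:

```python
# Back-of-the-envelope check of the figures above:
# ~1600 calls per language file at 5 to 8 minutes per file.
calls_per_file = 1600
for minutes in (5, 8):
    seconds_per_call = minutes * 60 / calls_per_file
    print(f"{minutes} min/file -> {seconds_per_call:.2f} s per call")
# 5 min/file -> 0.19 s per call
# 8 min/file -> 0.30 s per call
```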
Seeing how it works
If you are interested in seeing how this works, you can download a sample program that pulls together code copied or cribbed from the Microsoft documentation (and then cleaned up for brevity), along with code showing how to use the service, from: https://github.com/reeset/C–Language-Translator. I’m kicking around the idea of converting the C# code into a Ruby gem (which is actually pretty straightforward), so if there is any interest, let me know.