10 Mistakes Made in Web Translations (and How to Fix Them)

Developers make many assumptions when building multilingual websites, and any one of these mistakes is easy to make when starting work on your first multilingual Web project. I've compiled a list of 10 common mistakes made by developers and others (the number was a happy coincidence, not a deliberate target!), along with how to avoid them.

A Translation is Just a Change of the Language for a Different Country
Not Giving Your Design Room for Content
Language Direction
Fonts That Don't Cover All Languages
Context of Translation Strings
- What is Context?
Concatenating Strings After Individual Translation
Assumptions About Pluralisation
Multiple Placeholders Within Strings
Reinventing the Wheel
Translating Numbers
Conclusion

A Translation is Just a Change of the Language for a Different Country

This is an error I've been guilty of in the past: assuming that a translation was simply a conversion into another language for a country. The reality is a little more complicated.

Countries don't have a 1 to 1 mapping to a language: languages are shared among countries, and countries may use more than one language. For example, Switzerland counts German and French among its main languages. To make things more interesting, the same language varies between countries. There is a difference between the English used in England and in America; the ground floor in British English is the equivalent of the first floor in American English.

There are standards which help with all of this: language tags (defined by BCP 47) combine an ISO 639 language code with an optional ISO 3166 country code, giving short codes such as en-GB or en-US that identify both the language and its regional variant.

For developers, this means that you need to set up the means to provide translations based not only on the language, but on the locale: the language together with the region it's used in.
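One common way to act on this is to try the most specific locale first and fall back to the bare language. The following is a minimal sketch of that idea, not a definitive implementation: localeFallbackChain() is a hypothetical helper name, the default of 'en' is an assumption, and it only understands tags made of a language and an optional region (script subtags such as zh-Hant-TW are out of scope).

```php
<?php

/**
 * Build the list of locales to try, from most to least specific.
 * Hypothetical helper for illustration; not part of gettext or any framework.
 */
function localeFallbackChain(string $tag, string $default = 'en'): array
{
    // Accept both "en-GB" and "en_GB" style tags.
    $parts = preg_split('/[-_]/', $tag);
    $language = strtolower($parts[0]);

    $chain = [];
    if (isset($parts[1]) && $parts[1] !== '') {
        // Region subtags are conventionally upper case: en-GB, de-CH.
        $chain[] = $language . '-' . strtoupper($parts[1]);
    }
    $chain[] = $language;
    $chain[] = $default;

    // Drop duplicates (for example "en" twice) while keeping the order.
    return array_values(array_unique($chain));
}

print_r(localeFallbackChain('de-ch'));
```

With this in place, a request for de-CH would look for Swiss German strings first, then generic German, and only then the default language.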

Not Giving Your Design Room for Content

This is one I encountered on my second multilingual project, a website that had translations in 28 languages. Some languages, such as German and Polish, tend to have sentences a little longer than their English counterparts, which means there's a need for larger content areas.

The project I was working on had some amazing designs, but all the designs were initially based around English content. Single-line headings grew to two lines, tables of content no longer fit their content on smaller screens, and images needed more work to make text fit easily.

English	English Character Count	German	German Character Count
How batteries work	18	Wie Batterien funktionieren	27
History of batteries	20	Geschichte der Batterien	24
Battery care	12	Batteriepflege	14

The resolution for this goes beyond just development and into the design sphere. Designs should be flexible enough to work with content beyond the ideal, and assumptions about the length of content shouldn't be made in the development process either. Those headings that you've set a maximum character length for in the database? Increase it, or better still, remove it altogether. If you're programmatically truncating strings, consider using CSS instead to do it at a presentational level only.

Language Direction

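For the majority of languages, text goes left to right, and this is the default in most browsers. However, not all languages do; Arabic and Hebrew, for example, are written right to left. In the majority of cases the browser orders right-to-left characters correctly by itself, because each character carries its own direction in Unicode. Be aware, though, that the lang attribute on the <html> element (or on a sub-element in mixed-language pages) only identifies the language; it does not set the base direction of the text. If the automatic behaviour was all there was to it, this would make a very minor point in my list.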

As with all other areas, sometimes the assumptions a computer (or browser) makes are wrong, so you have to give it a helping nudge. There are a couple of nudges you can give it, one through HTML, and one through CSS.

First, you can set the text direction within the HTML using the dir attribute:


<p dir="rtl">Content.</p>

Then, you can aid the browser with CSS:


*[dir="rtl"] { direction: rtl; }
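On the server side, the dir value can be derived from the locale so that templates don't have to hard-code it. Here is a minimal PHP sketch of that approach. RTL_LANGUAGES, textDirection() and openingHtmlTag() are hypothetical names, and the list of right-to-left languages is an assumption that is deliberately incomplete; extend it for the languages your site supports.

```php
<?php

// Non-exhaustive list of languages usually written right to left.
// This is an assumption for the sketch, not an authoritative registry.
const RTL_LANGUAGES = ['ar', 'fa', 'he', 'ur'];

function textDirection(string $locale): string
{
    // Take the primary language subtag: "ar-EG" and "ar_EG" both give "ar".
    $language = strtolower(preg_split('/[-_]/', $locale)[0]);

    return in_array($language, RTL_LANGUAGES, true) ? 'rtl' : 'ltr';
}

function openingHtmlTag(string $locale): string
{
    // Emit lang and dir together so they never drift apart.
    return sprintf(
        '<html lang="%s" dir="%s">',
        htmlspecialchars($locale, ENT_QUOTES),
        textDirection($locale)
    );
}

echo openingHtmlTag('ar-EG'), PHP_EOL; // <html lang="ar-EG" dir="rtl">
echo openingHtmlTag('en-GB'), PHP_EOL; // <html lang="en-GB" dir="ltr">
```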

Fonts That Don't Cover All Languages

An interesting issue I ran into about 7 years ago involved a specific font that had been chosen for a website. It was only when we came to add the Greek translations that we saw the text looked awful, with lots of the characters appearing quite out of place among their neighbors. The problem was that the chosen font, while beautiful, didn't contain characters beyond the standard extended Latin sets. This meant that any sentence containing both Greek and Latin characters, such as one that included an English brand name, was rendered in two different fonts. Even though the font in question was quite simple, the mismatch was very obvious with characters like 'E' (Latin capital letter E) and 'Ε' (Greek capital letter Epsilon).

It's best shown with an example. Imagine I had decided to use the free Anton font from Google Fonts. This font does not contain the Greek Epsilon, so the browser uses whatever is the fallback font, going all the way down the stack until it finds a font that can display the required character. If no font is found that contains that character on the CSS font list, the browser falls back to the OS fonts. If all of that fails, it will display an empty box.

The graphic below shows the text "My name is Ashley Sheridan" in both English and Greek. The Anton font does not have the Greek glyphs available, so the browser uses the fallback font on my system instead, which obviously looks awful.

My name is Ashley Sheridan, in English and Greek, highlighting the missing glyphs in the Anton font.

There are a few things you can do to circumvent this problem:

Choose a font that contains all the glyphs you require for all languages you need
Override the font completely when the page's language is one that isn't supported by your font
Select a close fallback font and apply it in the font stack

Context of Translation Strings

A common mistake when translating is to ignore the context of the word or phrase being translated. Typically, this happens more with very short sentences or individual words, as these are the easiest to confuse the context for.

Consider the word "address". When used as the label for a field in the delivery section of an online purchase form, it obviously refers to the location the goods should be delivered to. However, when used as the heading for a blog post detailing an introductory speech given at a conference, it refers to the speech itself.

It's not totally unfeasible that both uses could exist on the same website selling tickets to tech conferences. Given that it's the same spelling in English for both words, it will likely only exist in the dictionary once.

When this word is translated though, which of the two meanings do you translate to?

There are a couple of ways to avoid running into this issue:

Use translation keys which aren't just the string value
With strings that could be confused like this, give them a translation context

What is Context?

There is a popular format for translations that's nearly 3 decades old: gettext. A typical translation of a simple string would look like this in your .po file:


msgid "Address"
msgstr "Address"

As we've seen, in some cases this will lead to confusion when the string is translated, so we can add a context to the string to help the right translation get used in the right place:


msgctxt "delivery"
msgid "Address"
msgstr "Address"

msgctxt "speech"
msgid "Address"
msgstr "Address"

Then to use a specific translation in your code, you would do something like this:


<?php
// "\004" is the EOT control character that separates the context from the msgid.
echo gettext("delivery" . "\004" . "Address");

This is the context name you set up in your .po file for the string, followed by the EOT (End of Transmission) character, and then the string for which you want the translation.

The Friendly how-to for translators article on the layout of .po files and how to edit them is a great resource if you want to find out more about this format.
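Building the key by hand every time is easy to get wrong, so the context lookup is often wrapped in a small helper. The sketch below shows one way that might look. translateInContext() is a hypothetical name, the $lookup parameter is an assumption added so the example runs without compiled .mo files, and the German strings are only illustrative stand-ins for a real catalogue.

```php
<?php

/**
 * Look up a message within a context, falling back to the plain msgid.
 * $lookup is any callable that maps a key to a translation; it defaults
 * to PHP's gettext(), which requires the gettext extension.
 */
function translateInContext(string $context, string $msgid, ?callable $lookup = null): string
{
    $lookup = $lookup ?? 'gettext';
    $key = $context . "\004" . $msgid;
    $translation = $lookup($key);

    // When nothing is found the key comes back unchanged, so return the
    // bare msgid rather than showing the context prefix to the user.
    return $translation === $key ? $msgid : $translation;
}

// Stand-in catalogue that mimics gettext's "return the key on a miss" behaviour.
$german = [
    "delivery\004Address" => 'Adresse',
    "speech\004Address"   => 'Ansprache',
];
$lookup = fn (string $key): string => $german[$key] ?? $key;

echo translateInContext('delivery', 'Address', $lookup), PHP_EOL; // Adresse
echo translateInContext('speech', 'Address', $lookup), PHP_EOL;   // Ansprache
echo translateInContext('menu', 'Address', $lookup), PHP_EOL;     // Address
```

In a real project you would drop the third argument and let the helper call gettext() against your compiled catalogue.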

Concatenating Strings After Individual Translation

A fairly standard need for any translation is to use variable content within a string that itself shouldn't be translated. For example, phone numbers and brand names are generally left alone. The phrase "you have 10 unread emails" would be "Sie haben 10 ungelesene E-Mails" in German. Notice that the number is left entirely untranslated (numbers written with numerical characters are typically left alone in any translation).

A bad practice I've seen recently went something along the lines of this:


<?php

echo _("you have") . " $totalEmails " . _("unread emails");

The problem here is that it makes the assumption that any translation will follow the rules of English sentence structure. For example, in Hungarian, this would look something like "10 olvasatlan e-mailt kapott".

The solution is very simple: wherever possible, always translate whole phrases/sentences. If you need variable content within the string, you can use placeholders:


msgid "you have %d unread emails"
msgstr "You have %d unread emails"

Then, in your code, you could do something like this:


<?php

echo sprintf(_('you have %d unread emails'), $totalEmails);

Assumptions About Pluralisation

If you're someone who only speaks English and has little understanding of other languages and how they can differ, then it's easy to understand why certain assumptions about plurals are made.

One very obvious mistake I've seen recently was something like this:


<?php

echo _("Your birthday is in %d day");

if($daysUntilBirthday > 1)
{
	echo 's';
}

The problem here should be obvious: not every language uses an 's' to denote the plural of 'day'. In German, for example, day is Tag, and days is Tage. The answer is to not handle plurals manually; let the i18n library you're using do it for you. It's a wheel you do not need to reinvent.

Another bad assumption is to treat all plurals above 1 as the same:


<?php

echo ($hoursLeft == 1) ? _("1 hour") : sprintf(_("%d hours"), $hoursLeft);

It's less obvious why this will fail, unless you understand a language like Russian, where the plural translation changes a little more based on the number. The following table shows how even the numbers 0-4 result in some very different translations:

English	Russian
0 hours	0 часов
1 hour	1 час
2 hours	2 часа
3 hours	3 часа
4 hours	4 часа

Again, this is a solved problem, and one that takes a lot of effort and an understanding of a wide assortment of languages to get right. It's more efficient (and easier!) to let your library deal with this for you. There's a great example in the PHP Manual for ngettext:


"Plural-Forms: nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;\n"

msgid "File"
msgid_plural "Files"
msgstr[0] "Файл"
msgstr[1] "Файла"
msgstr[2] "Файлов"

and the PHP:


<?php

echo ngettext("File", "Files", $number);

You don't need to worry too much about the first line in the .po file, as that's typically something that will be handled by whoever is doing your translation, and they will understand the nuances of the language. Basically though, this line instructs your code how to handle plurals, and which numbers will translate to which form of the plural. If you're interested, there is a lot more information on the GNU gettext plural forms manual page.
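To see what that Plural-Forms line actually does, it can help to transcribe it into plain PHP. The sketch below is only an illustration of the rule from the .po header above; russianPluralIndex() is a hypothetical name, it assumes a non-negative count, and in real code ngettext() does this work for you.

```php
<?php

/**
 * PHP transcription of the Plural-Forms expression shown above.
 * Returns the index of the msgstr[] entry to use for $n.
 */
function russianPluralIndex(int $n): int
{
    if ($n % 10 === 1 && $n % 100 !== 11) {
        return 0; // 1, 21, 31, ... but not 11
    }
    if ($n % 10 >= 2 && $n % 10 <= 4 && ($n % 100 < 10 || $n % 100 >= 20)) {
        return 1; // 2-4, 22-24, ... but not 12-14
    }
    return 2; // everything else: 0, 5-20, 25-30, ...
}

// The three msgstr entries from the .po example, in index order.
$forms = ['Файл', 'Файла', 'Файлов'];

foreach ([1, 2, 5, 11, 21, 22] as $n) {
    echo $n, ' ', $forms[russianPluralIndex($n)], PHP_EOL;
}
```

The rule is written with if statements rather than copied as a nested ternary because the C-style expression in the header relies on right-associative ternaries, which PHP does not share.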

Multiple Placeholders Within Strings

The traditional method of using %s (for a string) or %d (for an integer) as a placeholder in a translation has one minor flaw: when you use more than one of them, their order within the phrase is fixed. Change the language, and you're stuck with the word order of your starting language.

This is referred to as word order in linguistics, and the specific order of the subject, object, and verb changes depending on the language. The Wikipedia article on word order explains this with the simple phrase "she ate bread". That subject-verb-object ordering is fine for English and for the Romance languages, but if you're writing in Korean (subject-object-verb) or Arabic (verb-subject-object), the order of the words changes.

Now, consider the situation where the subject "she" and the object "bread" were variables within your web app. If you're translating to a language with a different word order, that fixed placeholder order could break the translation.

So what can you do? Thankfully, sprintf() (which is most often used in conjunction with gettext() to replace variables within translations) supports numbered arguments that don't rely on their position within the whole string.


# English - subject verb object
msgid "%1$s someverb %2$s"
msgstr "%1$s someverb %2$s"

...

# A verb object subject language (Malagasy, for example)
msgid "%1$s someverb %2$s"
msgstr "someverb %2$s %1$s"

This way, your translators are free to move the variable parts of your translations around so that each sentence reads naturally in its own language, and as a developer you need to make no other changes. The number in %1$s refers to the position of the argument that you're passing in to sprintf() after the format string.
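The effect is easy to try out with sprintf() on its own. In this small sketch the two English templates stand in for translations with different word orders, and the variable values are made up for illustration; the calling code is identical for both.

```php
<?php

$subject = 'Alice';
$object  = 'bread';

// Single quotes matter here: inside double quotes PHP would try to
// interpolate $s as a variable and mangle the placeholder.
$templates = [
    'subject first' => '%1$s ate the %2$s',
    'object first'  => 'The %2$s was eaten by %1$s',
];

foreach ($templates as $label => $template) {
    // Same arguments, same order, whatever the template does with them.
    echo $label, ': ', sprintf($template, $subject, $object), PHP_EOL;
}
```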

There is another way if you're using a framework like Symfony, which adds some nice syntactic sugar to the whole thing, and lets you set up your dictionary in a much more readable manner:


msgid "This sentence uses a %namedPlaceholder%"
msgstr "This sentence uses a %namedPlaceholder%"

Then, in your Twig views, you would do something like:


{% set namedPlaceholder = 'something' %}
{% trans %}This sentence uses a %namedPlaceholder%{% endtrans %}
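If you're not using a framework, named placeholders of this style can be approximated in plain PHP with strtr(). This is a rough sketch of the idea rather than how any particular library implements it; fillPlaceholders() is a hypothetical helper name.

```php
<?php

/**
 * Replace %name% style placeholders in a translated message.
 * Hypothetical helper for illustration only.
 */
function fillPlaceholders(string $message, array $values): string
{
    $replacements = [];
    foreach ($values as $name => $value) {
        // Wrap each name in percent signs to match the dictionary format.
        $replacements['%' . $name . '%'] = (string) $value;
    }

    // strtr() replaces all pairs in one pass, so a value that happens to
    // contain another placeholder is not substituted a second time.
    return strtr($message, $replacements);
}

echo fillPlaceholders(
    'This sentence uses a %namedPlaceholder%',
    ['namedPlaceholder' => 'something']
), PHP_EOL;
```

Because the placeholders are looked up by name, translators can reorder them freely, just as with numbered sprintf() arguments.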

Reinventing the Wheel

I mentioned earlier in Assumptions about pluralisation that reinventing the wheel when it comes to translations is generally a bad idea (unless you happen to be very smart and fluent in a lot of languages, and even then it's not recommended).

When it comes to all things i18n, there are a lot of great options out there. If you're using PHP, Python, or C++, you've got gettext() and its ilk, along with the .po dictionary format. If you're using JavaScript, you could use something like JED (which uses .po files converted into JSON files). On .NET, you can work with the newer XLIFF format, and there are several packages available on NuGet for it.

If you're a developer who is working on, about to work on, or interested in a web-based translation project, have a look at the tools available to you before you try rolling your own. Add this to the list of other things you should avoid creating from scratch:

Security
Date libraries
Caching
Translations

If you really feel that the existing tools just don't give you that feature you desperately need, then be very mindful with anything new that you create, and try to consider all the ways that it could be used with various languages.

Translating Numbers

Numbers are a confusing thing when it comes to translations, and it would be logical to assume that they need to be translated (not formatted, that's a different thing entirely) much like words are. However, numbers do seem to be the one thing that is shared between language cultures.

So even though the number 8, for example, can be represented with the characters ㍠, ㏧, ８, Ⅷ, and ⅷ, the regular character 8 is perfectly fine, and preferred under almost all circumstances.

Of course, there are still some times where you might want to translate numbers to their Roman numeral equivalent as I did some 6 years ago! It's not something I would recommend in conjunction with a multilingual website though.

Conclusion

Translations are complicated, and any web translation project can pose a few problems for you. However, if you look at the existing tools and the plethora of documentation to accompany them, you can avoid making some of the mistakes I and many others have made in the past.

Try to test things with different locales, and if possible, get someone who knows the language to help test with you. At the very least, short and simple phrases and words can be changed to a different language with online tools (Google Translate and Bing Translator being some of the better free ones), which should give you a good idea of whether you're doing things correctly.

Ashley Sheridan.co.uk