

When working with texts in different languages, identifying those languages reliably is difficult because of conflicting standards. For instance, Wikipedia uses two-letter ISO 639-1 codes, while FreeDict uses three-letter codes from the ISO 639-3 standard. Converting between them can be tedious, and retrieving an English name even more so. The Rust crate isolang makes this easy and fast.

Each kind of language code has its use cases. The two-letter codes from the ISO 639-1 standard, for instance, are short and widely used. They do not cover all languages, however, which is a problem for less widely spoken languages that have no two-letter code. The 639-3 codes cover all languages (past and present), but they often differ from the 639-1 codes (Spanish, for example, is es in 639-1 but spa in 639-3), so mapping between the two is hard.

Rust is known for its optimization potential, and for the isolang crate I decided to invest some time in making it fast. The key is to have all information available at compile time, so that no additional work needs to be done at program startup or on data access.

The crate provides an enum with the ISO 639-3 codes and offers methods to convert from and into 639-1 codes and into the English language name. Using the three-letter language codes for the enum was the most sensible solution, because it allows the unique identification of all languages.
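To make the shape of such an API concrete, here is a minimal, dependency-free sketch. It is not the crate's implementation: the `Lang` enum, its three variants and the method bodies are a hand-written stand-in covering only German, English and Low German, chosen because Low German has a 639-3 code (nds) but no 639-1 code.

```rust
/// Illustrative stand-in for a language enum keyed by ISO 639-3 codes.
/// Only three languages are listed; this is an assumption for the example.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lang {
    Deu,
    Eng,
    Nds,
}

impl Lang {
    /// Look up a language by its two-letter ISO 639-1 code.
    fn from_639_1(code: &str) -> Option<Lang> {
        match code {
            "de" => Some(Lang::Deu),
            "en" => Some(Lang::Eng),
            _ => None,
        }
    }

    /// Every variant has a three-letter ISO 639-3 code.
    fn to_639_3(self) -> &'static str {
        match self {
            Lang::Deu => "deu",
            Lang::Eng => "eng",
            Lang::Nds => "nds",
        }
    }

    /// Not every language has a two-letter code, hence the Option.
    fn to_639_1(self) -> Option<&'static str> {
        match self {
            Lang::Deu => Some("de"),
            Lang::Eng => Some("en"),
            Lang::Nds => None,
        }
    }

    /// English name of the language.
    fn to_name(self) -> &'static str {
        match self {
            Lang::Deu => "German",
            Lang::Eng => "English",
            Lang::Nds => "Low German",
        }
    }
}

fn main() {
    let german = Lang::from_639_1("de").expect("unknown ISO 639-1 code");
    println!("de -> {} ({})", german.to_639_3(), german.to_name());
    println!("nds has a 639-1 code: {}", Lang::Nds.to_639_1().is_some());
}
```

Returning an `Option` from the 639-1 conversions keeps the missing two-letter codes visible in the type instead of hiding them behind an empty string.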

Here is a very short usage example:

use isolang::Language;

// Parse a two-letter code, then convert it to the three-letter code.
let iso_639_3 = Language::from_639_1("de")
    .expect("Illegal ISO 639-1 language code given")
    .to_639_3();

That doesn't look like much, but that is intentional. You shouldn't have to care about this simple task, and all you need is contained in the data segment of your binary. Not a single one of your precious cycles is wasted on initialization or allocation. This is possible thanks to a static Map from the PHF crate, which generates its perfect hash function at compile time.
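The idea of a lookup table that lives entirely in the binary can be sketched without any dependencies. Note that this sketch swaps the technique: instead of a perfect hash map it uses a sorted `static` array with a binary search, and the table name `TWO_TO_THREE`, the function `two_to_three` and the four entries are assumptions made for illustration. What it shares with the real thing is that nothing is built or allocated at runtime.

```rust
/// (ISO 639-1 code, ISO 639-3 code) pairs, sorted by the first field.
/// The table is part of the compiled binary; no setup code runs at startup.
static TWO_TO_THREE: [(&str, &str); 4] = [
    ("de", "deu"),
    ("en", "eng"),
    ("fr", "fra"),
    ("it", "ita"),
];

/// Binary search over the static table; returns None for unknown codes.
fn two_to_three(code: &str) -> Option<&'static str> {
    TWO_TO_THREE
        .binary_search_by_key(&code, |&(two, _)| two)
        .ok()
        .map(|idx| TWO_TO_THREE[idx].1)
}

fn main() {
    for code in ["de", "it", "zz"] {
        match two_to_three(code) {
            Some(three) => println!("{code} -> {three}"),
            None => println!("{code} is not in the table"),
        }
    }
}
```

A binary search needs a logarithmic number of comparisons, whereas a perfect hash map answers with a single hash and one comparison, which is why a generated map is the more attractive choice for a table with thousands of entries.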

One missing feature is the conversion from an English language name to the Language data structure. It is not included because the mapping is slightly problematic: some language names might have multiple codes assigned, which makes it hard to identify a language properly that way. It could be added if need be.
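A small sketch shows why a name lookup cannot simply return one language. The table `NAMES`, the function `codes_for_name` and the case-insensitive prefix matching are all assumptions for this example, not something the crate offers. The three Tonga entries are separate ISO 639-3 languages that a user would plausibly all call "Tonga".

```rust
/// (ISO 639-3 code, English name) pairs; a tiny illustrative excerpt.
static NAMES: [(&str, &str); 4] = [
    ("deu", "German"),
    ("tog", "Tonga (Nyasa)"),
    ("toi", "Tonga (Zambia)"),
    ("ton", "Tonga (Tonga Islands)"),
];

/// Return every code whose English name starts with `query`,
/// ignoring case. More than one result means the name is ambiguous.
fn codes_for_name(query: &str) -> Vec<&'static str> {
    let query = query.to_lowercase();
    NAMES
        .iter()
        .filter(|(_, name)| name.to_lowercase().starts_with(query.as_str()))
        .map(|&(code, _)| code)
        .collect()
}

fn main() {
    println!("German: {:?}", codes_for_name("German"));
    println!("Tonga:  {:?}", codes_for_name("Tonga"));
}
```

Any name-based API would have to decide what to do with such a result: return all candidates, as here, or insist on the full reference name including the qualifier in parentheses.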

I'm using this crate in CRAFT, a word2vec text preprocessor.


