Normalizing Accented Words

14 / Jun / 2012 by Sachin

We often need to work on data aggregated from different sources, and before we analyse it, we usually have to normalize it to a common standard. A normalization process typically includes removing special characters and converting all text to lower case. We can also have rules of our own, for example that the word “saint” is always normalized to “st.”.
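To make those steps concrete, here is a minimal sketch of such a pipeline. The class name BasicNormalizer, the exact definition of "special character" (anything that is not a letter, digit or whitespace) and the order of the steps are assumptions made for illustration, not a fixed recipe.

```java
import java.util.Locale;

final class BasicNormalizer {
    // Hypothetical helper: lower-cases, strips special characters,
    // collapses whitespace, then applies the "saint" -> "st." rule.
    static String normalize(String input) {
        String s = input.toLowerCase(Locale.ROOT);
        // Keep only letters, digits and whitespace (assumed definition of "special").
        s = s.replaceAll("[^\\p{L}\\p{N}\\s]", "");
        // Collapse runs of whitespace into a single space.
        s = s.replaceAll("\\s+", " ").trim();
        // Domain rule from the text: "saint" is always written "st."
        // Applied last so the dot is not removed by the step above.
        return s.replaceAll("\\bsaint\\b", "st.");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Saint  Louis!")); // prints "st. louis"
    }
}
```

The abbreviation rule runs last on purpose: if it ran before the special-character step, the dot in “st.” would be stripped again.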

An important part of such normalization is handling ‘accented’ characters like é or è. Generally you would want them normalized to the plain English letter ‘e’, as that helps when sorting or searching words containing these characters. For example, you would want “Indianapòlis” to be normalized to “Indianapolis”.

We can achieve this with the java.text.Normalizer class. All we need to do is:

Normalizer.normalize("Indianapòlis", Normalizer.Form.NFKD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
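For anyone who wants to run it, the one-liner can be wrapped in a small class. This is only a sketch: the class name AccentStripDemo and the method name stripAccents are made up for this example, and the ò is written as a Unicode escape so the result does not depend on the source file encoding.

```java
import java.text.Normalizer;

final class AccentStripDemo {
    // Decompose accented characters, then drop the combining marks.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // U+00F2 is the precomposed character "o with grave".
        System.out.println(stripAccents("Indianap\u00F2lis")); // prints "Indianapolis"
    }
}
```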

Let's understand what is happening here. We are calling the static method normalize of the java.text.Normalizer class. The first parameter is the string we want to normalize, and the second parameter is the normalization form. There are 4 normalization forms:

  1. NFC – Canonical Decomposition, followed by Canonical Composition
  2. NFD – Canonical Decomposition
  3. NFKC – Compatibility Decomposition, followed by Canonical Composition
  4. NFKD – Compatibility Decomposition
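A quick way to see how the four forms differ is to normalize one sample string with each of them and compare the lengths. The sketch below assumes a sample made of the “fi” ligature (U+FB01, which only has a compatibility decomposition) followed by a precomposed é (U+00E9); the class and method names are invented for the example.

```java
import java.text.Normalizer;

final class NormalizationForms {
    // Returns the number of UTF-16 code units after normalizing with the given form.
    static int lengthAfter(String input, Normalizer.Form form) {
        return Normalizer.normalize(input, form).length();
    }

    public static void main(String[] args) {
        // U+FB01 is the "fi" ligature; U+00E9 is the precomposed "e with acute".
        String sample = "\uFB01\u00E9";
        for (Normalizer.Form form : Normalizer.Form.values()) {
            System.out.println(form + " -> " + lengthAfter(sample, form) + " code units");
        }
    }
}
```

The canonical forms (NFC, NFD) leave the ligature alone, while the compatibility forms (NFKC, NFKD) split it into “f” and “i”. The decomposing forms (NFD, NFKD) additionally split é into ‘e’ plus a combining acute accent, so the lengths should come out as 2, 3, 3 and 4 for NFC, NFD, NFKC and NFKD respectively.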

So as the second parameter we pass NFKD, which is a constant of the enum Normalizer.Form.

On its own, the normalize call returns a string that still displays as “Indianapòlis”. So what happened there?

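Let's understand. We can create an ò in two ways. It can either be a single precomposed Unicode character (U+00F2), or it can be a plain English ‘o’ followed by a combining grave accent (o + U+0300). Note that the combining accent is U+0300, not U+0060, which is the standalone grave accent (backtick) character.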

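What the normalize method did was convert both cases to the second representation: an English ‘o’ followed by the combining grave accent (o + U+0300). The string looks the same on screen, but it is now one character longer.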
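Since the two representations look identical when printed, it helps to dump the code points before and after normalization. The following is a small sketch (class and helper names are made up) that does this for both ways of writing ò, where the combining grave accent is U+0300.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

final class DecompositionDemo {
    // Formats each code point of the string as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String precomposed = "\u00F2";  // single precomposed character
        String combined = "o\u0300";    // 'o' followed by a combining grave accent
        String a = Normalizer.normalize(precomposed, Normalizer.Form.NFKD);
        String b = Normalizer.normalize(combined, Normalizer.Form.NFKD);
        System.out.println(codePoints(precomposed) + " -> " + codePoints(a));
        System.out.println(codePoints(combined) + " -> " + codePoints(b));
        System.out.println(a.equals(b)); // both inputs end up in the same form
    }
}
```

Both lines should end in “U+006F U+0300”, and the final comparison should print true.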

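The next part is a replaceAll with a regex. \p{InCombiningDiacriticalMarks} is a Unicode block property; it matches any character in the Combining Diacritical Marks block (U+0300 to U+036F), which is where the accents separated out by the decomposition live. It does not match the base letters themselves.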
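To check what the block property does and does not match, one can test a few individual code points against it. This is a sketch with an invented helper, isCombiningMark; the three code points are chosen only for illustration.

```java
import java.util.regex.Pattern;

final class DiacriticalBlockDemo {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}");

    // True if the code point falls inside the Combining Diacritical Marks block.
    static boolean isCombiningMark(int codePoint) {
        return MARKS.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        System.out.println(isCombiningMark(0x0300)); // combining grave accent
        System.out.println(isCombiningMark(0x0060)); // standalone grave accent (backtick)
        System.out.println(isCombiningMark('o'));    // plain letter
    }
}
```

Only the first call should print true. This is also why the decomposition step has to come first: a precomposed ò (U+00F2) is a single character outside the block, so the regex alone would leave it untouched.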

The replaceAll call replaces every combining mark (the combining grave accent U+0300 in this case) with an empty string, leaving only the base letters. So what we finally get is “Indianapolis”, and we are done.

Hope it helped!

Sachin Anand



