Normalization Forms for Accented Characters in Java

29 / Jun / 2012 by Sachin

Text normalization is the process of “standardizing” text to a certain form so as to enable searching, indexing and other types of analytical processing on it. When working with large quantities of text we often encounter characters with accents, like é, â etc. Unicode provides multiple ways to create such characters. For example, we can have é as the single code point \u00E9 (a precomposed character), or we can create it as a combination of e and a combining acute accent, that is, e + \u0301.
The é looks the same in both representations and means the same thing, but to a Java program they are different character sequences, so the two strings are not equal when compared with equals(). Clearly we need to normalize these two different representations to a fixed standard.
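To make the problem concrete, here is a minimal sketch of the mismatch. The class and constant names are only illustrative; the two strings are the two representations of é described above.

```java
class AccentEquality {
    // é as a single precomposed code point
    static final String COMPOSED = "\u00E9";
    // e followed by the combining acute accent
    static final String DECOMPOSED = "e\u0301";

    public static void main(String[] args) {
        // Both strings render as é, but they are different char sequences.
        System.out.println(COMPOSED.equals(DECOMPOSED)); // false
        System.out.println(COMPOSED.length());           // 1
        System.out.println(DECOMPOSED.length());         // 2
    }
}
```

The same mismatch affects anything built on equals() and hashCode(), such as map lookups or set membership.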

And it's here that the java.text.Normalizer class comes to our rescue. All we need to do is normalize the text to one of these 4 normalization forms:

  1. NFC – Canonical Decomposition, followed by Canonical Composition.
  2. NFD – Canonical Decomposition
  3. NFKC – Compatibility Decomposition, followed by Canonical Composition
  4. NFKD – Compatibility Decomposition
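As a rough sketch of how this looks in code, the example below runs both representations of é through each of the four forms using Normalizer.normalize. The class name and the codePoints helper are made up for illustration, and the stream-based helper assumes Java 8 or later.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

class NormalizationForms {

    // Renders a string as its code points, e.g. "U+0065 U+0301".
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // precomposed é
        String decomposed = "e\u0301"; // e + combining acute accent

        // Normalizer.Form is an enum holding NFC, NFD, NFKC and NFKD.
        for (Normalizer.Form form : Normalizer.Form.values()) {
            String a = Normalizer.normalize(composed, form);
            String b = Normalizer.normalize(decomposed, form);
            System.out.println(form + ": " + codePoints(a) + " | " + codePoints(b)
                    + " | equal=" + a.equals(b));
        }
    }
}
```

Whichever form is chosen, the two inputs end up as the same sequence, so the important thing is to apply the same form consistently on both sides of a comparison.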



Canonical decomposition means taking a character and decomposing it into its component characters, for example é into e followed by the combining acute accent.



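Compatibility decomposition means decomposing a character by both canonical and compatibility mappings, so that formatting variants such as the ligature ﬁ (U+FB01) are also replaced by their plain equivalents (f followed by i). In both kinds of decomposition, combining marks are arranged in a specific, well-defined order.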


Canonical composition means recomposing decomposed sequences into precomposed characters, based on their canonical equivalence.


Canonical equivalence, in turn, means that the characters or sequences have the same appearance and meaning when printed or displayed.
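The difference between canonical and compatibility decomposition is easiest to see on a character that only has a compatibility mapping. The sketch below uses the ligature ﬁ (U+FB01) as an assumed example; the class and helper names are illustrative only.

```java
import java.text.Normalizer;

class CanonicalVsCompatibility {

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String ligature = "\uFB01"; // single-character ligature for "fi"
        String eAcute = "\u00E9";   // precomposed é

        // Canonical decomposition leaves the ligature untouched ...
        System.out.println(nfd(ligature).equals(ligature)); // true
        // ... while compatibility decomposition replaces it with plain letters.
        System.out.println(nfkd(ligature));                 // fi

        // For é, both decompositions give e + combining acute accent.
        System.out.println(nfd(eAcute).equals(nfkd(eAcute))); // true
    }
}
```

Because the compatibility forms discard formatting distinctions like this, they tend to suit searching and indexing, while the canonical forms are the safer choice when the original text must be preserved.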


To summarize all of this in an example:

Consider the Angstrom sign “Å” (U+212B) and the Swedish letter “Å” (U+00C5). Both are expanded by NFD (or NFKD) into “A” followed by the combining ring above (U+0041 and U+030A), which is then recomposed by NFC (or NFKC) into the Swedish letter “Å” (U+00C5). The Swedish letter “Å” is canonically equivalent to the Angstrom sign “Å”: they are printed and displayed exactly the same, though they are different code points.
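The Angstrom example can be checked with a few lines of Java. This is only a sketch; the class and method names are made up for illustration.

```java
import java.text.Normalizer;

class AngstromExample {

    static String toNfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String angstromSign = "\u212B"; // Angstrom sign
        String swedishLetter = "\u00C5"; // Latin capital letter A with ring above

        // Different code points, so the raw strings are not equal.
        System.out.println(angstromSign.equals(swedishLetter)); // false

        // NFD expands both to A (U+0041) + combining ring above (U+030A).
        System.out.println(toNfd(angstromSign).equals("A\u030A"));  // true
        System.out.println(toNfd(swedishLetter).equals("A\u030A")); // true

        // NFC reduces both to the Swedish letter U+00C5.
        System.out.println(toNfc(angstromSign).equals(swedishLetter)); // true
    }
}
```

Note that NFC never produces the Angstrom sign itself: once normalized, both inputs are stored as U+00C5.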

Now we know how to normalize Unicode characters to a standard form wherever required.



Sachin Anand

@babasachinanand

sachin[at]intelligrape[dot]com
