{"id":5782,"date":"2012-06-29T18:30:12","date_gmt":"2012-06-29T13:00:12","guid":{"rendered":"http:\/\/www.tothenew.com\/blog\/?p=5782"},"modified":"2014-12-17T09:14:34","modified_gmt":"2014-12-17T03:44:34","slug":"normalization-forms-for-accented-characters-in-java","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/normalization-forms-for-accented-characters-in-java\/","title":{"rendered":"Normalization Forms for Accented Characters in Java"},"content":{"rendered":"<p>Text Normalization is the process of &#8220;standardizing&#8221; text to a certain form, so as to enable, searching, indexing and other types of analytical processing on it. Often working with large quantities of text we encounter character with accents like <strong>\u00e9 , \u00e2 <\/strong>etc. Unicode provides multiple ways to create such characters . For example we can have<strong> \u00e9<\/strong> created using Unicode sequence \\u00E9 (composite character) or we can create it using a combination of e + acute accent. that would be (e + \\u0301).<br \/>\nNow the <strong>\u00e9<\/strong> created would look same in both the representations and would also mean the same thing, but for a java program they are actually not the same characters so the <strong>\u00e9 <\/strong>created using the two methods are actually not equal for your program. Clearly we need to normalize these two different representations to a fixed standard.<\/p>\n<p>And its here, that <strong>java.text.Normalizer <\/strong>class comes to our rescue. 
All we need to do is normalize text to one of these four normalization forms:<\/p>\n<ol>\n<li>NFC \u2013 Canonical Decomposition, followed by Canonical Composition<\/li>\n<li>NFD \u2013 Canonical Decomposition<\/li>\n<li>NFKC \u2013 Compatibility Decomposition, followed by Canonical Composition<\/li>\n<li>NFKD \u2013 Compatibility Decomposition<\/li>\n<\/ol>\n<p>\n\tCanonical decomposition means taking a character and decomposing it into its component characters.<\/p>\n<p>\n\tCompatibility decomposition means taking a character and decomposing it by its compatibility mapping, arranging the results in a specific order.<br \/>\n<br \/>\n\tCanonical composition means recomposing characters based on their <strong>canonical equivalence<\/strong>.<br \/>\n<br \/>\n<strong>Canonical equivalence<\/strong> means that characters have the same appearance and meaning when printed or displayed.<br \/>\n<br \/>\nTo summarize this with an example:<br \/>\nConsider the Angstrom sign <strong>&#8220;\u00c5&#8221;<\/strong> <strong>(U+212B)<\/strong> and the Swedish letter <strong>&#8220;\u00c5&#8221;<\/strong> <strong>(U+00C5)<\/strong>. Both are expanded by NFD (or NFKD) into <strong>&#8220;A&#8221;<\/strong> plus a combining ring above <strong>(U+0041 and U+030A)<\/strong>, which is then recomposed by NFC (or NFKC) into the Swedish letter <strong>&#8220;\u00c5&#8221;<\/strong> <strong>(U+00C5)<\/strong>. The Swedish letter <strong>&#8220;\u00c5&#8221;<\/strong> is canonically equivalent to the Angstrom sign <strong>&#8220;\u00c5&#8221;<\/strong>: they are printed and displayed exactly the same, though their code points differ.<br \/>\n<\/p>\n<p>Now we know how to normalize Unicode characters to a standard form wherever required.<\/p>\n<p>\nSachin Anand<\/p>\n<p>@babasachinanand<\/p>\n<p>sachin[at]intelligrape[dot]com<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Text Normalization is the process of &#8220;standardizing&#8221; text to a 
certain form, so as to enable searching, indexing, and other types of analytical processing on it. When working with large quantities of text, we often encounter characters with accents like \u00e9, \u00e2 etc. Unicode provides multiple ways to create such characters. For example, we [&hellip;]<\/p>\n","protected":false},"author":14,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":13},"categories":[1],"tags":[833,4844,832,834,831],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/5782"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=5782"}],"version-history":[{"count":0,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/5782\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=5782"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=5782"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=5782"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
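The behavior described in the post can be sketched with the standard JDK `java.text.Normalizer` API. This is a minimal demo, not part of the original article; the class name `NormalizerDemo` is invented for illustration, while the code points (é as \u00E9 vs. e + \u0301, and the Angstrom sign U+212B vs. Swedish Å U+00C5) come straight from the post's two examples:

```java
import java.text.Normalizer;

public class NormalizerDemo {
    public static void main(String[] args) {
        // Composite character é (U+00E9) vs. decomposed e + combining acute accent (U+0301):
        // they render identically but are different character sequences.
        String composed = "\u00E9";
        String decomposed = "e\u0301";
        System.out.println(composed.equals(decomposed)); // false

        // After normalizing both to NFC, they compare as equal.
        String nfcComposed = Normalizer.normalize(composed, Normalizer.Form.NFC);
        String nfcDecomposed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfcComposed.equals(nfcDecomposed)); // true

        // Angstrom sign (U+212B) vs. Swedish letter Å (U+00C5): NFD expands both
        // into A (U+0041) plus combining ring above (U+030A) ...
        String angstromSign = "\u212B";
        String swedishA = "\u00C5";
        String nfd1 = Normalizer.normalize(angstromSign, Normalizer.Form.NFD);
        String nfd2 = Normalizer.normalize(swedishA, Normalizer.Form.NFD);
        System.out.println(nfd1.equals(nfd2)); // true

        // ... and NFC recomposes the Angstrom sign into the Swedish letter (U+00C5),
        // because the two are canonically equivalent.
        String nfcAngstrom = Normalizer.normalize(angstromSign, Normalizer.Form.NFC);
        System.out.println(nfcAngstrom.equals("\u00C5")); // true
    }
}
```

In practice, the usual choice for comparing or indexing user-visible text is NFC (or NFKC when compatibility variants such as ligatures should also fold together), applied consistently to both sides of the comparison.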