{"id":5754,"date":"2012-06-14T17:42:24","date_gmt":"2012-06-14T12:12:24","guid":{"rendered":"http:\/\/www.tothenew.com\/blog\/?p=5754"},"modified":"2016-12-19T15:16:40","modified_gmt":"2016-12-19T09:46:40","slug":"normalizing-accented-words","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/normalizing-accented-words\/","title":{"rendered":"Normalizing Accented Words"},"content":{"rendered":"<p>We often need to work on data aggregated from different sources, and before we analyse it, we usually need to normalize it to a certain standard. A normalization process typically includes removing special characters and converting all text to lower case. We can also have rules such as: words like &#8220;saint&#8221; will always be normalized to &#8220;st.&#8221;<\/p>\n<p>An important part of such normalization is to account for &#8216;accented&#8217; characters like \u00e9 or \u00e8. Generally you would want them to normalize to the plain\u00a0English\u00a0letter &#8216;e&#8217;, as that helps in sorting\/searching words containing these characters. For example, you would want &#8220;Indianap\u00f2lis&#8221; to be normalized to &#8220;Indianapolis&#8221;.<\/p>\n<p>We can achieve this by using the <strong>java.text.Normalizer<\/strong> class. All we need to do is<\/p>\n<p>[java]<br \/>\nNormalizer.normalize(&quot;Indianap\u00f2lis&quot;, Normalizer.Form.NFKD).replaceAll(&quot;\\\\p{InCombiningDiacriticalMarks}+&quot;, &quot;&quot;)<br \/>\n[\/java]<\/p>\n<p>Let&#8217;s understand what is happening here. We are calling the static method normalize in the <strong>java.text.Normalizer<\/strong> class. The first parameter is the string we want to normalize; the second parameter is the <a href=\"http:\/\/www.tothenew.com\/blog\/normalization-forms-for-accented-characters-in-java\/\">normalization form<\/a>. 
There are 4 normalization forms<\/p>\n<ol>\n<li>NFC &#8211; Canonical Decomposition, followed by Canonical Composition.<\/li>\n<li>NFD &#8211; Canonical Decomposition<\/li>\n<li>NFKC &#8211; Compatibility Decomposition, followed by Canonical Composition<\/li>\n<li>NFKD &#8211; Compatibility Decomposition<\/li>\n<\/ol>\n<p>So in the second parameter we pass in the NFKD form, which is a constant of the enum type\u00a0<strong>Form<\/strong>.<\/p>\n<p>The normalize method will return a string that still looks like &#8220;Indianap\u00f2lis&#8221;. So, what happened there?<br \/>\n<br \/>\nAn \u00f2 can be represented in two ways. It can either be a single precomposed Unicode character (U+00F2), or a plain English &#8216;o&#8217; followed by a combining grave accent (o + U+0300).<br \/>\n<br \/>What normalize did was convert both cases to the decomposed form: an English &#8216;o&#8217; followed by the combining accent (o + U+0300).<br \/>\n<br \/>The next part is a replaceAll with a regex. \\p{InCombiningDiacriticalMarks} matches any character in the Unicode block Combining Diacritical Marks (U+0300 to U+036F), which contains the combining accent characters.<br \/>\n<br \/>The replaceAll call replaces all such combining marks (the combining grave accent U+0300 in this case) with an empty string. 
So what we finally get is &#8220;Indianapolis&#8221;, and we are done.<br \/>\n<br \/>\nHope it helped!<\/p>\n<p>Sachin Anand<\/p>\n<p>sachin[at]intelligrape[dot]com<\/p>\n<p>@babasachinanand<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We all often need to work on data aggregated together from different sources, and before we analyse it, we often need to normalize it to a certain standard, A normalization process typically includes removing special characters, converting all text to lower case , We can also have certain rules that words like &#8220;saint&#8221; will always [&hellip;]<\/p>\n","protected":false},"author":14,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":10},"categories":[1],"tags":[829,830,828,827,831],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/5754"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=5754"}],"version-history":[{"count":0,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/5754\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=5754"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=5754"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=5754"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}