Convert Unicode code points to Unicode hex values in Java


While developing our crawler, we came across a Bangla news site (http://www.kalerkantho.com/) that writes its text as decimal code points (numeric character references) instead of the Unicode characters themselves. The browser renders these as Bangla text, but the page source contains only the code points, so what the crawler downloaded was strings like &#2453;&#2709;&#1609;. For indexing we needed to convert them to real Unicode characters so that the Bangla text renders anywhere. The conversion is quite simple.

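First split a string such as "&#2453;&#2709;&#1609;" on ";" and strip the leading "&#" from each token. The number that remains is the decimal value of the code point. Parse that string as an integer, add it to '\u0000', and cast the result to char; you then have the Unicode character itself (for example, 2537 becomes '\u09E9'). Note that a cast to char only works for code points up to 65535, which is enough for Bangla (U+0980 to U+09FF).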

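// Decimal code point parsed from "&#2537;" (hex 0x09E9)
int codePoint = 2537;
// Adding to '\u0000' yields an int; the cast turns it into the character U+09E9
char c = (char) ('\u0000' + codePoint);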
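
The steps above can be put together into a small helper that converts every decimal reference in a downloaded string. This is only a minimal sketch, not the code our crawler actually runs: the class name NcrDecoder and the method decode are made up for illustration, it uses a regular expression instead of manual tokenizing, and it uses appendCodePoint instead of a char cast so that code points above 65535 also work. References it cannot parse are left in the output unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the name is not from any library.
final class NcrDecoder {
    // Matches decimal numeric character references such as "&#2453;"
    private static final Pattern DECIMAL_NCR = Pattern.compile("&#(\\d+);");

    static String decode(String input) {
        Matcher m = DECIMAL_NCR.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the plain text between the previous match and this one
            out.append(input, last, m.start());
            try {
                int codePoint = Integer.parseInt(m.group(1));
                if (Character.isValidCodePoint(codePoint)) {
                    // Handles code points above 0xFFFF as well
                    out.appendCodePoint(codePoint);
                } else {
                    // Out of Unicode range: keep the reference as it was
                    out.append(m.group());
                }
            } catch (NumberFormatException e) {
                // Number too large for an int: keep the reference as it was
                out.append(m.group());
            }
            last = m.end();
        }
        // Copy whatever follows the last match
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Three Bangla code points: U+0995, U+09BE, U+09B2
        System.out.println(decode("&#2453;&#2494;&#2482;"));
    }
}
```

The sketch only handles the decimal form (&#2453;). Hexadecimal references (&#x995;) and named entities (&amp;) would need extra handling.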


