Convert Unicode code points to Unicode characters in Java
While developing our crawler, we came across a Bangla news site (http://www.kalerkantho.com/) that writes its text as decimal code points (HTML numeric character references of the form &#NNNN;) instead of the Unicode characters themselves. The browser renders these as Bangla text, but the page source contains only the code points, so what our crawler downloaded looked like &#2453;&#2709;&#1609; and similar. For indexing we needed to convert these references to the actual Unicode characters so that the Bangla text displays correctly anywhere. The conversion is quite simple.
First split a string such as "&#2453;&#2709;&#1609;" on ";" and strip the leading "&#" from each token. The number that remains is the code point in decimal. Parse that string to an int, add it to '\u0000', and cast the result to char, and you get the Unicode character. The cast to char only works for code points up to 65535 (the Basic Multilingual Plane), which includes the Bangla block:
int codePoint = 2537; // decimal value parsed from "&#2537;"
char c = (char) ('\u0000' + codePoint); // c is now the character U+09E9
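The whole process described above can be sketched as a small helper. This is only an illustration, not the code our crawler actually runs: the class name NumericEntityDecoder and the method decode are made up for this example, and it uses a regular expression instead of splitting on ";" so that ordinary text between references is left untouched. It also uses appendCodePoint rather than a cast to char, so code points above 65535 are handled as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; names are not from any library.
final class NumericEntityDecoder {
    // Matches decimal numeric character references such as "&#2453;".
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d+);");

    static String decode(String input) {
        Matcher m = DECIMAL_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the plain text between the previous reference and this one.
            out.append(input, last, m.start());
            try {
                int codePoint = Integer.parseInt(m.group(1));
                if (Character.isValidCodePoint(codePoint)) {
                    // Handles supplementary code points too, unlike a cast to char.
                    out.appendCodePoint(codePoint);
                } else {
                    // Not a valid code point: keep the reference as it was.
                    out.append(m.group());
                }
            } catch (NumberFormatException e) {
                // Number too large for an int: keep the reference as it was.
                out.append(m.group());
            }
            last = m.end();
        }
        // Copy whatever follows the last reference.
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

Calling NumericEntityDecoder.decode("&#2453;") returns a one-character string containing U+0995 (the Bangla letter ক). Hexadecimal references of the form &#x995; are not handled by this sketch.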
