How to trim() No-Break space (&nbsp;) when parsing HTML
The &nbsp; entity in web pages represents a no-break space. When the page is parsed, it is converted to the character U+00A0 (decimal 160, which lies outside the 7-bit ASCII range) rather than the ordinary space U+0020 (ASCII 32). This does not work well with Java functions like trim(), which only strip characters with code points up to U+0020.
Let us consider the scenario where we are trying to parse an HTML page that contains the following text block.
<td>Blog entry&nbsp;</td>
Once you get the text node, you can get the value of the Node by using the getNodeValue() call.
// first use the DOM API to get the correct node
// (td must be the text node inside the cell; getNodeValue() returns null on an element node)
Node td = ....;
// then get the node text
String text = td.getNodeValue();
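For readers who want to reproduce the situation end to end, here is a minimal sketch using the JAXP DOM parser that ships with the JDK. It rests on a few assumptions that are not part of the original post: the class and method names (TdTextExample, readCellText) are made up for illustration, the input is a well-formed fragment, and the no-break space is written as the numeric reference &#160; because a plain XML parser does not know the named &nbsp; entity. A real HTML parser may expose the node differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class TdTextExample {

    // Hypothetical helper: parses a well-formed fragment whose root is the <td>
    // element and returns the value of its first child (the text node).
    static String readCellText(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            Node td = doc.getDocumentElement();  // the <td> element itself
            Node textNode = td.getFirstChild();  // the text node inside it
            return textNode.getNodeValue();      // the element's own node value would be null
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        // &#160; is the numeric form of the no-break space (assumption: used
        // here because &nbsp; is not predefined in XML).
        String text = readCellText("<td>Blog entry&#160;</td>");
        System.out.println("'" + text + "'");
        System.out.println((int) text.charAt(text.length() - 1)); // 160
    }
}
```

The point of the sketch is only that the value comes from the text node and that its last character has code point 160, not 32.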
When you try printing the text you will see a space at the end of the text.
System.out.println("'" + text + "'");
'Blog entry '
Trimming does not fix this, because trim() only strips characters with code points up to U+0020 (ASCII 32, the ordinary space), and the no-break space is U+00A0.
System.out.println("'" + text.trim() + "'");
'Blog entry '
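Because the no-break space looks exactly like an ordinary space when printed, it can help to print the code point of the last character instead. The following is a small sketch of that check; the class and method names (TrailingCharInspector, describeLastChar) are made up for illustration, and the string literal stands in for the value read from the DOM.

```java
class TrailingCharInspector {

    // Hypothetical helper: reports the last code point of a string as "U+XXXX"
    // so that characters which print identically can be told apart.
    static String describeLastChar(String s) {
        if (s.isEmpty()) {
            return "(empty)";
        }
        int codePoint = s.codePointBefore(s.length());
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // Stand-in for the text node value; ends with a no-break space.
        String text = "Blog entry\u00A0";

        System.out.println(describeLastChar(text));        // U+00A0
        System.out.println(describeLastChar(text.trim())); // U+00A0, trim() left it alone

        // Java does not class the no-break space as whitespace,
        // although it is a Unicode space separator.
        System.out.println(Character.isWhitespace('\u00A0')); // false
        System.out.println(Character.isSpaceChar('\u00A0'));  // true
    }
}
```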
One way to get rid of it is the following regular expression. The no-break space is the Unicode character U+00A0, written "\u00A0" in Java, and it can be matched alongside ordinary whitespace in a character class. The trailing $ anchors the match, so only whitespace at the end of the text is removed.
// use regex '\u00A0' to match No-Break space
System.out.println("'" + text.replaceAll("[\\s\\u00A0]+$", "") + "'");
'Blog entry'
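The expression above only cleans the end of the string. If the no-break space can also appear at the start, the same character class can be anchored at both ends. The sketch below wraps that in a helper; the names NbspTrimmer and trimWithNbsp are made up for illustration, and the pattern is one possible choice rather than the only one.

```java
import java.util.regex.Pattern;

class NbspTrimmer {

    // Matches a leading or a trailing run of regex whitespace and U+00A0.
    private static final Pattern EDGES =
            Pattern.compile("^[\\s\\u00A0]+|[\\s\\u00A0]+$");

    // Hypothetical helper: like trim(), but also strips no-break spaces
    // from both ends. Characters in the middle are left untouched.
    static String trimWithNbsp(String s) {
        return EDGES.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "\u00A0 Blog entry \u00A0";
        System.out.println("'" + trimWithNbsp(text) + "'"); // 'Blog entry'
    }
}
```

A no-break space between two words is kept, which is usually what you want, since only the edges are matched. Note that String.strip() in Java 11 and later does not help here either, because it relies on Character.isWhitespace(), which excludes the no-break space.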
Reference page
http://www.fileformat.info/info/unicode/char/00a0/index.htm
Thanks a lot, this article has helped me a lot!!
Hello!
Vineet, you make it sound as if there’s some mistake or something wrong with translating “ ” into a non-breaking space instead of just an ordinary blank space. But that is exactly what “ ” means — “Non-Breaking SPace”. (What did you think the characters N, B, S, and P in the entity stand for?)
If there is a problem, it is absolutely not with converting “ ” to ASCII 160, but with trim() and Java functions like it — if anything should change, then those functions should be changed to handle valid characters correctly, not the perfectly correct conversion.
Or, hey, come to think of it, maybe it’s just your expectations that are wrong: Since “ ” is explicitly intended to be _Non-Breaking_, why on Earth would you think it should be handled like an ordinary space by “trim()”? The whole point of a Non-Breaking Space is that it is NOT to be considered a word boundary, but left intact with the letters to either side of it, just like an ordinary letter.
Or, in other words: NOT to be handled like an ordinary space is exactly what “ ” IS FOR. So for the conversion to convert it to ASCII 160, and for trim() NOT to trim that character, seems to me (from your terse description) to be precisely the correct behaviour.
HTH!
OK, so add “ampersand-nbsp-semicolon” into my previous comment in quite a few places, where now it has those mysterious empty quotes ” ” …