How to trim() No-Break space (&nbsp;) when parsing HTML

The &nbsp; entity in web pages represents a non-breaking space. When the HTML is parsed, it is decoded to the no-break space character U+00A0 (decimal 160) rather than the ordinary space U+0020 (ASCII 32). This does not work well with Java functions like trim(), which only strips leading and trailing characters with code points up to U+0020 (' ').

Let us consider a scenario where we are trying to parse an HTML page that contains the following text block.

   
   <td>Blog entry              
            &nbsp;</td>

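Once you have the <td> node, you can read its text. Note that getNodeValue() returns the text only when it is called on the text node itself; called on the element node it returns null. For the element, use getTextContent() (or getFirstChild().getNodeValue() to go through the child text node).

// first use the DOM api to get the correct node
Node td = ....;
// then get the node text; td is the <td> element, so read its text content
String text = td.getTextContent();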
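
To make the difference between the element node and its text concrete, here is a minimal sketch. It assumes the fragment is parsed with the JDK's built-in XML DOM parser, and it writes the no-break space as the numeric reference &#160; because a plain XML parser does not know the &nbsp; entity. The class name TdTextDemo and the helper parseRoot are hypothetical names used only for this illustration; a real HTML parser would be set up differently.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TdTextDemo {
    // Hypothetical helper: parses a small well-formed fragment and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // &#160; stands in for &nbsp; (assumption: plain XML parser, no HTML entities).
        Element td = parseRoot("<td>Blog entry &#160;</td>");

        // On an element node, getNodeValue() is null by definition.
        System.out.println(td.getNodeValue());              // null

        // getTextContent() returns the text of the element's children.
        String text = td.getTextContent();
        System.out.println((int) text.charAt(text.length() - 1)); // 160
    }
}
```

The last character of the extracted text is code point 160, the no-break space, not the ordinary space 32.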

When you print the text, you will see what looks like a space at the end.

   System.out.println("'" + text + "'");
   'Blog entry '

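Trimming does not fix this, because trim() only removes leading and trailing characters with code points up to U+0020 (the ASCII space and control characters), and the no-break space is U+00A0.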

   System.out.println("'" + text.trim() + "'");
   'Blog entry '
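
A quick way to confirm what that trailing character really is: the sketch below prints the code point of the last character before and after trim(). The class name NbspInspect and the helper lastCodePoint are hypothetical, and the input string is hard-coded here rather than read from a DOM node.

```java
class NbspInspect {
    // Returns the last code point of s formatted as "U+XXXX", or "" for an empty string.
    static String lastCodePoint(String s) {
        if (s.isEmpty()) {
            return "";
        }
        return String.format("U+%04X", s.codePointBefore(s.length()));
    }

    public static void main(String[] args) {
        String text = "Blog entry\u00A0"; // ends with a no-break space

        System.out.println(lastCodePoint(text));                 // U+00A0
        System.out.println(lastCodePoint(text.trim()));          // U+00A0 (trim() left it in place)
        System.out.println(lastCodePoint("Blog entry ".trim())); // U+0079 (ordinary space removed)

        // Java classifies U+00A0 as a space character but not as "whitespace".
        System.out.println(Character.isWhitespace('\u00A0'));    // false
        System.out.println(Character.isSpaceChar('\u00A0'));     // true
    }
}
```

Because Character.isWhitespace() excludes the no-break space, utilities built on it will not strip the character either.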

One way to get rid of it is to use the following regular expression. The no-break space is the Unicode character "\u00A0", which can be matched by adding it to a character class next to \s (the regex whitespace class, which does not include U+00A0 by default).

   // use regex '\u00A0' to match No-Break space
   System.out.println("'" + text.replaceAll("[\\s\\u00A0]+$", "") + "'");
   'Blog entry'
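
The expression above only strips the end of the string. If the no-break space can also appear at the start, the same character class can be anchored at both ends. The following is a minimal sketch of that idea; the class name NbspTrim and the method trimNbsp are hypothetical names, not part of any library.

```java
import java.util.regex.Pattern;

class NbspTrim {
    // Runs of regex whitespace or U+00A0 at the very start or the very end of the input.
    private static final Pattern EDGE_SPACE =
            Pattern.compile("^[\\s\\u00A0]+|[\\s\\u00A0]+$");

    // Removes leading and trailing whitespace, including no-break spaces.
    // No-break spaces inside the text are left untouched.
    static String trimNbsp(String s) {
        return EDGE_SPACE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "\u00A0 Blog entry   \n   \u00A0";
        System.out.println("'" + trimNbsp(text) + "'"); // 'Blog entry'
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which String.replaceAll() would otherwise do.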

Reference page

http://www.fileformat.info/info/unicode/char/00a0/index.htm
