Access to www.w3.org DTDs blocked from Java

I was trying to parse an XML file with JAXB in Java. The file began with the following XML declaration and DOCTYPE:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE appspec [
 <!ENTITY % HTMLlat1 PUBLIC
 "-//W3C//ENTITIES Latin 1 for XHTML//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
 <!ENTITY % HTMLspec PUBLIC
 "-//W3C//ENTITIES Special for XHTML//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent">
 %HTMLlat1;
 %HTMLspec;
]>

The program failed with the following error:
Server returned HTTP response code: 500 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent

Full stacktrace:

[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Could not read the xml, make sure that it is a valid xml and it validates against the schema

Server returned HTTP response code: 500 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
[INFO] ------------------------------------------------------------------------
[DEBUG] Trace
org.apache.maven.lifecycle.LifecycleExecutionException: Could not read the xml, make sure that it is a valid xml and it validates against the schema
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:719)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeStandaloneGoal(DefaultLifecycleExecutor.java:569)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(DefaultLifecycleExecutor.java:539)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHandleFailures(DefaultLifecycleExecutor.java:387)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegments(DefaultLifecycleExecutor.java:284)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLifecycleExecutor.java:180)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:328)
        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:138)
        at org.apache.maven.cli.MavenCli.main(MavenCli.java:362)
        at org.apache.maven.cli.compat.CompatibleMain.main(CompatibleMain.java:60)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
        at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
        at org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
        at org.codehaus.classworlds.Launcher.main(Launcher.java:375)
Caused by: org.apache.maven.plugin.MojoExecutionException: Could not read the xml, make sure that it is a valid xml and it validates against the schema
        at org.clickframes.mavenplugin.ClickframesGenPlugin.readProject(ClickframesGenPlugin.java:246)
        at org.clickframes.mavenplugin.ClickframesGenPlugin.execute(ClickframesGenPlugin.java:126)
        at org.apache.maven.plugin.DefaultPluginManager.executeMojo(DefaultPluginManager.java:490)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:694)
        ... 17 more
Caused by: javax.xml.bind.UnmarshalException
 - with linked exception:
[java.io.IOException: Server returned HTTP response code: 500 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent]
        at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:197)
        at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:174)
        at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:225)
        at org.clickframes.AppspecJaxbWrapper.readAppspecType(AppspecJaxbWrapper.java:53)
        at org.clickframes.AppspecReader.readProject(AppspecReader.java:54)
        at org.clickframes.mavenplugin.ClickframesGenPlugin.readProject(ClickframesGenPlugin.java:244)
        ... 20 more
Caused by: java.io.IOException: Server returned HTTP response code: 500 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:677)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1315)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1252)
        at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.startPE(XMLDTDScannerImpl.java:722)
        at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.skipSeparator(XMLDTDScannerImpl.java:2069)
        at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDecls(XMLDTDScannerImpl.java:2032)
        at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDTDInternalSubset(XMLDTDScannerImpl.java:377)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1141)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1090)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:977)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
        at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:195)
        ... 25 more
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 34 seconds
[INFO] Finished at: Fri Oct 29 17:18:58 EDT 2010
[INFO] Final Memory: 17M/177M
[INFO] ------------------------------------------------------------------------

The program complained that it could not download the file from www.w3.org, yet the same URL opened without trouble in Firefox. My hypothesis at that point was that w3.org was allowing Firefox but blocking Java. I confirmed this with wget by manually setting the User-Agent header.

Using wget with User-Agent set to Firefox

You can set the User-Agent with wget's -U option:

wget http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent -U "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10"
--11:24:08--  http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
Resolving www.w3.org... 128.30.52.37
Connecting to www.w3.org|128.30.52.37|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11775 (11K) [application/xml-external-parsed-entity]
Saving to: `xhtml-lat1.ent'
100%[======================================================================================================================>] 11,775      --.-K/s   in 0.07s
11:24:08 (157 KB/s) - `xhtml-lat1.ent' saved [11775/11775]

Using wget with User-Agent set to Java

wget http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent -U "Java/1.6.0_20"
--11:24:17--  http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
Resolving www.w3.org... 128.30.52.37
Connecting to www.w3.org|128.30.52.37|:80... connected.
HTTP request sent, awaiting response... 500 Server Error
11:24:48 ERROR 500: Server Error.

The above verified that www.w3.org was blocking the request from Java.
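
To see why a User-Agent of "Java/1.6.0_20" stands in for a real Java client, it helps to look at what HttpURLConnection (the class at the bottom of the stack trace) sends by default. The sketch below is only an illustration: the class name UserAgentProbe and its method are hypothetical, and it talks to a throwaway server on localhost rather than to w3.org. It assumes the JDK's built-in com.sun.net.httpserver package is available.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper, not part of the original program.
public class UserAgentProbe {

    /**
     * Starts a temporary local HTTP server, requests it with
     * HttpURLConnection, and returns the User-Agent header the server saw.
     */
    public static String observedUserAgent() throws Exception {
        AtomicReference<String> seen = new AtomicReference<>();

        // Port 0 asks the OS for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            seen.set(exchange.getRequestHeaders().getFirst("User-Agent"));
            exchange.sendResponseHeaders(204, -1); // 204 No Content, no body
            exchange.close();
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            HttpURLConnection conn = (HttpURLConnection)
                    URI.create("http://127.0.0.1:" + port + "/").toURL().openConnection();
            conn.getResponseCode(); // forces the request to be sent
            conn.disconnect();
        } finally {
            server.stop(0);
        }
        return seen.get();
    }

    public static void main(String[] args) throws Exception {
        // The exact version suffix depends on the JVM running this.
        System.out.println(observedUserAgent());
    }
}
```

Unless the http.agent system property has been changed, the printed value should contain "Java/" followed by the running JVM's version, which is the pattern w3.org was matching.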

Conclusion

W3.org blocks requests for certain resources when they come from Java programs, which it identifies by the User-Agent header ‘Java/1.6.0_20’. This is a known issue. The URLs have been deliberately blocked by w3.org due to ‘abusive’ use by Java programs. Read the full story here: W3C’s Excessive DTD Traffic.

The bottom line is that you, or your program, should download and cache a copy of this resource rather than requesting the same static file from w3.org over and over. Respect the Expires HTTP header on the response.
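
One way to follow that advice in Java is to give the parser an org.xml.sax.EntityResolver that answers from a local copy instead of the network. The following is a minimal sketch, not the code from the failing plugin: the class name LocalEntityResolver, its register and parseRootText helpers, and the one-line stand-in for the entity file are all assumptions made for illustration. In practice you would register the contents of the xhtml-lat1.ent file saved with wget.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver that serves external entities from an in-memory cache.
public class LocalEntityResolver implements EntityResolver {

    // Maps a system ID (the URL in the DOCTYPE) to locally cached entity text.
    private final Map<String, String> cache = new HashMap<>();

    public void register(String systemId, String content) {
        cache.put(systemId, content);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String content = cache.get(systemId);
        if (content == null) {
            // Returning null tells the parser to fall back to its default
            // behaviour, which is to open the URL itself.
            return null;
        }
        InputSource source = new InputSource(new StringReader(content));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    /** Parses the XML with the given resolver and returns the root element's text. */
    public static String parseRootText(String xml, LocalEntityResolver resolver) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        LocalEntityResolver resolver = new LocalEntityResolver();
        // Stand-in for the real file: a single entity instead of the full Latin-1 set.
        resolver.register("http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent",
                "<!ENTITY eacute \"&#233;\">");

        String xml =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE appspec [\n"
                + " <!ENTITY % HTMLlat1 PUBLIC\n"
                + " \"-//W3C//ENTITIES Latin 1 for XHTML//EN\"\n"
                + " \"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent\">\n"
                + " %HTMLlat1;\n"
                + "]>\n"
                + "<appspec>caf&eacute;</appspec>";

        System.out.println(parseRootText(xml, resolver));
    }
}
```

The sketch uses plain JAXP so that it runs without extra libraries. With JAXB, the same resolver could presumably be set on an XMLReader that is wrapped in a SAXSource and passed to Unmarshaller.unmarshal(Source); that variant is untested against the plugin in the stack trace.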
