Access to www.w3.org DTDs blocked from Java

I was trying to parse an XML file with JAXB in Java. The file began with the following header:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE appspec [
 <!ENTITY % HTMLlat1 PUBLIC
 "-//W3C//ENTITIES Latin 1 for XHTML//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
 <!ENTITY % HTMLspec PUBLIC
 "-//W3C//ENTITIES Special for XHTML//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent">
 %HTMLlat1;
 %HTMLspec;
]>

The program failed with the following error:
Server returned HTTP response code: 500 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent

Full stacktrace:

[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Could not read the xml, make sure that it is a valid xml and it validates against the schema

Server returned HTTP response code: 500 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
[INFO] ------------------------------------------------------------------------
[DEBUG] Trace
org.apache.maven.lifecycle.LifecycleExecutionException: Could not read the xml, make sure that it is a valid xml and it validates against the schema
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:719)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeStandaloneGoal(DefaultLifecycleExecutor.java:569)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoal(DefaultLifecycleExecutor.java:539)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoalAndHandleFailures(DefaultLifecycleExecutor.java:387)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeTaskSegments(DefaultLifecycleExecutor.java:284)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.execute(DefaultLifecycleExecutor.java:180)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:328)
        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:138)
        at org.apache.maven.cli.MavenCli.main(MavenCli.java:362)
        at org.apache.maven.cli.compat.CompatibleMain.main(CompatibleMain.java:60)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.codehaus.classworlds.Launcher.launchEnhanced(Launcher.java:315)
        at org.codehaus.classworlds.Launcher.launch(Launcher.java:255)
        at org.codehaus.classworlds.Launcher.mainWithExitCode(Launcher.java:430)
        at org.codehaus.classworlds.Launcher.main(Launcher.java:375)
Caused by: org.apache.maven.plugin.MojoExecutionException: Could not read the xml, make sure that it is a valid xml and it validates against the schema
        at org.clickframes.mavenplugin.ClickframesGenPlugin.readProject(ClickframesGenPlugin.java:246)
        at org.clickframes.mavenplugin.ClickframesGenPlugin.execute(ClickframesGenPlugin.java:126)
        at org.apache.maven.plugin.DefaultPluginManager.executeMojo(DefaultPluginManager.java:490)
        at org.apache.maven.lifecycle.DefaultLifecycleExecutor.executeGoals(DefaultLifecycleExecutor.java:694)
        ... 17 more
Caused by: javax.xml.bind.UnmarshalException
 - with linked exception:
[java.io.IOException: Server returned HTTP response code: 500 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent]
        at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:197)
        at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:174)
        at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:225)
        at org.clickframes.AppspecJaxbWrapper.readAppspecType(AppspecJaxbWrapper.java:53)
        at org.clickframes.AppspecReader.readProject(AppspecReader.java:54)
        at org.clickframes.mavenplugin.ClickframesGenPlugin.readProject(ClickframesGenPlugin.java:244)
        ... 20 more
Caused by: java.io.IOException: Server returned HTTP response code: 500 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:677)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1315)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1252)
        at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.startPE(XMLDTDScannerImpl.java:722)
        at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.skipSeparator(XMLDTDScannerImpl.java:2069)
        at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDecls(XMLDTDScannerImpl.java:2032)
        at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDTDInternalSubset(XMLDTDScannerImpl.java:377)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1141)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1090)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:977)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
        at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
        at com.sun.xml.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:195)
        ... 25 more
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 34 seconds
[INFO] Finished at: Fri Oct 29 17:18:58 EDT 2010
[INFO] Final Memory: 17M/177M
[INFO] ------------------------------------------------------------------------

The program complained that it could not download the file from www.w3.org, yet I could open the same URL in Firefox without any trouble. My hypothesis at that point was that w3.org was allowing Firefox but blocking Java. I confirmed this using wget with a manually set User-Agent.

Using wget with User-Agent set to Firefox

You can set the User-Agent header with wget's -U option.

wget http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent -U "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10"
--11:24:08--  http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
Resolving www.w3.org... 128.30.52.37
Connecting to www.w3.org|128.30.52.37|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11775 (11K) [application/xml-external-parsed-entity]
Saving to: `xhtml-lat1.ent'
100%[======================================================================================================================>] 11,775      --.-K/s   in 0.07s
11:24:08 (157 KB/s) - `xhtml-lat1.ent' saved [11775/11775]

Using wget with User-Agent set to Java

wget http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent -U "Java/1.6.0_20"
--11:24:17--  http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
Resolving www.w3.org... 128.30.52.37
Connecting to www.w3.org|128.30.52.37|:80... connected.
HTTP request sent, awaiting response... 500 Server Error
11:24:48 ERROR 500: Server Error.

The above verified that www.w3.org was blocking the request from Java.
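
The header that Java sends can also be observed without contacting www.w3.org at all. The sketch below is an illustration under assumptions, not part of the original investigation: it starts a throwaway server on the loopback interface using the JDK's com.sun.net.httpserver package, requests it through HttpURLConnection, and returns the User-Agent the server saw. The class name UserAgentProbe and the request path are hypothetical. On Sun/Oracle and OpenJDK runtimes the default value has the form Java/<version>, with the http.agent system property prepended if it is set.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical helper: reports the User-Agent that HttpURLConnection sends by default. */
class UserAgentProbe {

    static String observedUserAgent() throws Exception {
        AtomicReference<String> seen = new AtomicReference<>();
        InetAddress loopback = InetAddress.getLoopbackAddress();

        // Port 0 asks the operating system for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress(loopback, 0), 0);
        server.createContext("/", exchange -> {
            // Record the header exactly as the server received it.
            seen.set(exchange.getRequestHeaders().getFirst("User-Agent"));
            exchange.sendResponseHeaders(204, -1); // 204 No Content, no body
            exchange.close();
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            URL url = new URL("http", loopback.getHostAddress(), port, "/xhtml-lat1.ent");
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.getResponseCode(); // forces the request to be sent
            connection.disconnect();
        } finally {
            server.stop(0);
        }
        return seen.get();
    }
}
```

Calling UserAgentProbe.observedUserAgent() returns the same kind of string that was passed to wget with -U in the second test above.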

Conclusion

w3.org blocks requests for certain resources when they come from Java programs, which it identifies by a User-Agent header such as ‘Java/1.6.0_20’. This is a known issue: w3.org deliberately blocked these URLs because of ‘abusive’ use by Java programs. Read the full story here: W3C’s Excessive DTD Traffic.

The bottom line is that you, or your program, should download and cache a copy of this resource rather than hitting w3.org with requests for the same static file over and over. Respect the Expires HTTP header.
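
One way to follow that advice in Java is to install a SAX EntityResolver that answers from local copies. What follows is a minimal sketch under assumptions, not the fix that was applied to the Clickframes plugin: the class name LocalEntityResolver, its register method, and the parseText helper are all hypothetical, and it assumes the entity files were downloaded once and are available as strings keyed by their public identifier.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical resolver that serves external entities from local copies. */
class LocalEntityResolver implements EntityResolver {

    // Public identifier -> text of the locally cached copy (assumed to be loaded elsewhere).
    private final Map<String, String> localCopies = new HashMap<>();

    void register(String publicId, String content) {
        localCopies.put(publicId, content);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        String content = localCopies.get(publicId);
        if (content == null) {
            // Fail fast instead of silently falling back to a network download.
            throw new SAXException("No local copy for " + publicId + " (" + systemId + ")");
        }
        InputSource source = new InputSource(new StringReader(content));
        source.setPublicId(publicId);
        source.setSystemId(systemId); // kept as an identifier only; nothing is fetched
        return source;
    }

    /** Parses the XML text with the given resolver and returns its character data. */
    static String parseText(String xml, EntityResolver resolver) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver);
        StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }
}
```

With JAXB, an XMLReader configured this way could be wrapped in a javax.xml.transform.sax.SAXSource and passed to Unmarshaller.unmarshal(Source), so that unmarshalling goes through the resolver instead of the network.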

4 comments to Access to www.w3.org DTDs blocked from Java

  • IBBoard

    I’d have thought that people would have learned from experience. Netscape removed DTDs before and it caused havoc – all because people were relying on downloading a DTD from some external site.

    What happens if a) the site is down, b) the user has no Internet connection or c) the site changes the file in some way? Surely a local cache is the only sane way to do it so that you know you’re always going to have the DTD and so that you know you’re going to get the same DTD. That’s ignoring the fact that the URI of the DTD doesn’t even have to be its actual address on the web.

  • Drew Sudell

    w3.org has been refusing requests for DTDs, schemas, and entities based on user agent, specifically blocking Java, for some time. See their systems team’s blog entry W3C’s Excessive DTD Traffic.

    The use of URLs as identifiers in XML is arguably a bad choice. W3C argues that technically they are identifiers, not links, yet a GET on those URLs returns the relevant DTD, schema or entity (so long as your user agent isn’t Java).

    For real production systems the entities should be cached and resolved locally, as the previous commenter suggested. But for short code spikes that’s not very practical. And frankly basing the response on user agent is kind of marginal and heavy handed.

  • IBBoard

    “And frankly basing the response on user agent is kind of marginal and heavy handed.”

    True – they’d be much better off putting the schemas somewhere else so that the XML ID returns a 410 Gone if someone uses the ID as a URL. They could then link to the new location from their pages, with some kind of lock-down system (timed access using mod_auth_token or similar?)

    As for short code spikes, I think it took me about 10 lines of code in C# to use local copies of the DTDs instead of downloading them from the web. Originally I hadn’t thought about web downloads and had assumed that, as my dev environment had the schemas and the code wasn’t complaining, the framework had its own caches or similar. It was only when the app first ran without a Web connection that I found it, and it’ll be something that I know about in future.

  • I have been fighting with this problem a lot recently, as it seems they also block .NET user agents. I proved this using IE: the DTDs were blocked, but when I altered the IE user agent to remove the .NET identifier, I was suddenly able to access the resource. Whilst I appreciate their reasons for doing something, a complete block is not a great solution, as apps by default cannot access the DTDs to cache them in the first place!
