Making Web content available as PDF is one way to facilitate its dissemination. In some industries, providing access to print-formatted documents, such as employee benefit descriptions, is mandatory. The law dictates that summary plan descriptions (SPDs) be made available in print format even if the content is also provided online. Simply printing the Web page is not sufficient because the print format must include a table of contents with page number references.
To add such functionality to a Web page, developers can convert the HTML content to PDF format; this article illustrates how. The conversion method illustrated here uses only open source components. Commercial products also support dynamic document generation. Adobe has the Document Server product line, for example; however, its cost is substantial. Using an open source solution mitigates the cost factor while adding source code transparency.
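The conversion consists of three steps:
1. Translate the HTML to XHTML with JTidy.
2. Transform the XHTML to XSL-FO (XSL Formatting Objects) with Xalan-J and an XSLT stylesheet.
3. Render the XSL-FO to PDF with FOP.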
This article demonstrates how to perform the translations using the command line interfaces provided by the tools and then introduces a Java program that uses the DOM (Document Object Model) interfaces.
The code in this article was tested with the following versions:
| Component | Version |
| --- | --- |
| JDK | 1.5_06 |
| JTidy | r7-dev |
| Xalan-J | 2.7 |
| FOP | 0.20.5 |
Each of the three steps consists of generating an output file from an input file. The inputs and outputs of the steps are shown in the figure below.

Translation
Using the three tools' command line interfaces allows for an easy way to get started. However, this approach is not suitable for a production-level system because of the temporary intermediate files that would be written to disk. This extra I/O would result in poor performance. Later in this article, the issue of temporary files becomes moot when the three tools are invoked by a Java program.
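To make the point about temporary files concrete, here is a minimal sketch of an in-memory handoff between two stages using only the JAXP classes that ship with the JDK. It is not the program presented later in this article. The class name InMemoryHandoff and the tiny stylesheet are hypothetical, the input is assumed to have no XHTML namespace, and the output is only an XSL-FO-like skeleton, not a complete document that FOP could render. The idea it shows is that a DOM tree produced by one stage can be passed directly to the next stage, so nothing is written to disk in between.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class InMemoryHandoff {

    // Hypothetical stylesheet for illustration only: it wraps the text of
    // every <p> element in an fo:block. A real stylesheet would also emit
    // the page layout that a formatter needs.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
      + "<xsl:template match='/'>"
      + "<fo:root><xsl:apply-templates select='//p'/></fo:root>"
      + "</xsl:template>"
      + "<xsl:template match='p'>"
      + "<fo:block><xsl:value-of select='.'/></fo:block>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Parse well-formed markup into a DOM tree held in memory.
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)));
    }

    // Transform one DOM tree into another; no intermediate file is created.
    public static Document toFo(Document xhtml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        DOMResult result = new DOMResult();
        transformer.transform(new DOMSource(xhtml), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        Document xhtml = parse("<html><body><p>One</p><p>Two</p></body></html>");
        Document fo = toFo(xhtml);
        System.out.println(fo.getDocumentElement().getNodeName());
        System.out.println(fo.getElementsByTagName("fo:block").getLength());
    }
}
```

The same DOMSource and DOMResult pattern applies whichever XSLT processor JAXP selects at runtime, which is why the intermediate files used by the command line approach are not needed in a Java program.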
The first step is to translate the HTML file to a new XHTML file. Of course, if the starting point for the conversion is already XHTML, this step does not apply.
I used JTidy to perform the translation. JTidy is a Java port of the Tidy HTML parser. In the process of translating to XHTML, JTidy also adds missing close tags to create a well-formed XML document. I used the most recent version listed (r7-dev) on the SourceForge Website.
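Why well-formedness matters for the later steps can be shown with a small check that uses only the XML parser bundled with the JDK. This is a sketch for illustration, not part of JTidy: the class name WellFormedCheck and the sample markup are hypothetical. It shows that typical HTML with unclosed paragraph tags is rejected by an XML parser, while the same content with the close tags added is accepted.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class WellFormedCheck {

    // Returns true if the markup can be parsed as well-formed XML.
    public static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse problems instead of printing them to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (Exception e) {
            // Any parser or I/O failure means the markup is not usable as XML.
            return false;
        }
    }

    public static void main(String[] args) {
        // Legal in HTML, but the <p> elements are never closed.
        String html = "<html><body><p>First<p>Second</body></html>";
        // The same content with the close tags added.
        String xhtml = "<html><body><p>First</p><p>Second</p></body></html>";

        System.out.println(isWellFormed(html));   // prints false
        System.out.println(isWellFormed(xhtml));  // prints true
    }
}
```

An XSLT processor needs input that passes this kind of check, which is why the HTML is cleaned up before the later steps run.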
To run JTidy, use the following tidy.sh script:
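#!/bin/sh
# Usage: tidy.sh input.html
# Translates the HTML file given as the first argument to XHTML on standard output.
java -classpath lib/Tidy.jar org.w3c.tidy.Tidy -asxml "$1"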
This script puts Tidy.jar on the classpath and invokes JTidy. The input file is passed as a command line argument. By default, the generated XHTML is written to standard output. The -modify switch can be used instead to overwrite the input file in place. The -asxml switch directs JTidy to output well-formed XML as opposed to HTML.