Convert HTML content to PDF format

Support access to PDF files on your Webpages

Making Web content available as PDF is one way to facilitate the dissemination of content. In some industries, providing access to print-formatted documents, such as employee benefit descriptions, is mandatory. The law actually dictates that summary plan descriptions (SPDs) be made available in print format even though the content may be provided online. Just printing the Webpage is not sufficient because the print format must include a table of contents with page number references.

To add such functionality to a Webpage, developers can convert the HTML content to PDF format; this article illustrates how. The method illustrated here to perform the conversion uses only open source components. Commercial products also support dynamic document generation. Adobe has the Document Server product line, for example; however, its cost is substantial. Using an open source solution mitigates the cost factor while adding source code transparency.

The conversion consists of three steps:

  1. Convert the HTML to XHTML
  2. Convert the XHTML document to XSL-FO (Extensible Stylesheet Language Formatting Objects) using an XSL stylesheet and an XSLT transformer
  3. Pass the XSL-FO document to a formatter to generate the target PDF document

This article demonstrates how to perform the translations using the command line interfaces provided by the tools and then introduces a Java program that uses the DOM (Document Object Model) interfaces.

Component versions

The code in this article was tested with the following versions:

Component   Version
JDK         1.5_06
JTidy       r7-dev
Xalan-J     2.7
FOP         0.20.5


Using the command line interfaces

Each of the three steps consists of generating an output file from an input file. The inputs and outputs of the steps are shown in the figure below.

[Figure: Translation (inputs and outputs of the three steps)]

Using the three tools' command line interfaces is an easy way to get started. However, this approach is not suitable for a production-level system, because each step writes a temporary intermediate file to disk and the extra I/O hurts performance. The issue of temporary files becomes moot later in this article, when the three tools are invoked from a Java program.
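To illustrate how intermediate files can be avoided, here is a minimal sketch that passes a DOM tree between stages in memory. It uses only the XSLT processor bundled with the JDK, so it covers step 2 alone; the parse step stands in for JTidy's output and the final serialization stands in for the hand-off to the formatter. The class name, the method names, and the tiny stylesheet (paragraphs only, no XHTML namespace) are assumptions made for illustration, not the stylesheet or program used in this article.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class InMemoryPipelineSketch {

    // Hypothetical, deliberately tiny stylesheet: it maps each <p> in the
    // body to an fo:block and ignores everything else. It assumes the input
    // elements are in no namespace.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
      + "<xsl:template match='/html'>"
      + "<fo:root>"
      + "<fo:layout-master-set>"
      + "<fo:simple-page-master master-name='page'>"
      + "<fo:region-body/>"
      + "</fo:simple-page-master>"
      + "</fo:layout-master-set>"
      + "<fo:page-sequence master-reference='page'>"
      + "<fo:flow flow-name='xsl-region-body'>"
      + "<xsl:apply-templates select='body/p'/>"
      + "</fo:flow>"
      + "</fo:page-sequence>"
      + "</fo:root>"
      + "</xsl:template>"
      + "<xsl:template match='p'>"
      + "<fo:block><xsl:value-of select='.'/></fo:block>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stand-in for step 1: parse already well-formed markup into a DOM tree.
    static Document parse(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // XSLT expects a namespace-aware DOM
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xhtml)));
    }

    // Step 2: apply the stylesheet, DOM in and DOM out, with no file I/O.
    static Document toFo(Document xhtml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        DOMResult result = new DOMResult();
        transformer.transform(new DOMSource(xhtml), result);
        return (Document) result.getNode();
    }

    // Stand-in for the hand-off to step 3: serialize the XSL-FO tree to text
    // using an identity transformer.
    static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Chains the stages; checked exceptions are wrapped to keep callers short.
    static String convert(String xhtml) {
        try {
            return serialize(toFo(parse(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Conversion failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("<html><body><p>Hello</p></body></html>"));
    }
}
```

In a real pipeline the stylesheet would need to match elements in the XHTML namespace if the input declares one, and the XSL-FO tree would go to the formatter rather than to a string.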

Step 1: HTML to XHTML

The first step is to translate the HTML file to a new XHTML file. Of course, if the starting point for the conversion is already XHTML, then this step does not apply.

I used JTidy to perform the translation. JTidy is a Java port of the Tidy HTML parser. In the process of translating to XHTML, JTidy also adds missing close tags to create a well-formed XML document. I used the most recent version listed (r7-dev) on the SourceForge Website.
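Well-formedness matters because the XSLT transformer in step 2 reads its input with an XML parser, which rejects markup with unclosed tags. The following sketch, which uses only the JDK's built-in parser, shows the difference between typical loose HTML and its tidied equivalent. The class name, the method name, and the sample strings are hypothetical, and the check is not part of JTidy.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // Returns true if the markup parses as XML, false if the parser rejects it.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Samples omit the DOCTYPE so the parser does not try to fetch a DTD.
        String looseHtml = "<html><body><p>one<br><p>two</body></html>";
        String tidied = "<html><body><p>one<br/></p><p>two</p></body></html>";
        System.out.println("loose HTML well-formed: " + isWellFormed(looseHtml));
        System.out.println("tidied XHTML well-formed: " + isWellFormed(tidied));
    }
}
```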

To run JTidy, use the following tidy.sh script:

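#!/bin/sh

# Run JTidy on the file named on the command line.
# The generated XHTML goes to standard output; redirect it to save it.
java -classpath lib/Tidy.jar org.w3c.tidy.Tidy -asxml "$@"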


This script puts lib/Tidy.jar on the classpath and invokes JTidy. The input file is passed as a command line argument. By default, the generated XHTML is written to standard output. Alternatively, the -modify switch makes JTidy overwrite the input file. The -asxml switch directs JTidy to output well-formed XML rather than HTML.



Resources