Unicode development with DbForms, MySQL 4.1, and friends

Developing a Unicode-enabled application with DbForms and MySQL 4.1

Developing a Unicode-based application can be a frustrating experience because of the myriad of little "gotchas" involved with making each piece of a complex system work with Unicode data. This article covers my recent experiences in building a college registration system supporting English and the Thai languages. The development was J2EE-based and used the following technologies:

  • Jakarta Tomcat 5.0.19
  • MySQL 4.1.2
  • DbForms 2.4rcX
  • JasperReports 0.5.2
  • Apache POI (Poor Obfuscation Implementation)

During the project's initial phases, my team needed to decide on a database to use that would store Thai characters. For maximum flexibility, we decided to use Unicode to store data, instead of a Thai character set. Our database would also store English data, and future versions of this system might need to support other languages as well. Due to our limited budget, we narrowed the database choice to open source products that support Unicode.

My prior experience with MySQL led me to make the potentially risky choice of using MySQL 4.1, which is clearly marked as a develop release. After doing research on the Web and reading mail archives, I concluded that MySQL 4.1 had some problems with complex queries and complicated collating with fields of different character sets; however, 4.1 had no reported problems with data integrity or database corruption.

We started development with the MySQL 4.1.1 release and noticed a few complex queries where the ordering did not work as expected; we were forced to rewrite the queries or break a single query into two queries. We were also able to create one complex query that crashed the system (but did not cause any corruption). Upgrading to 4.1.2 solved the problem with the crashing query. To date, we have not experienced any crashing with 4.1.2.

We are happy with our decision to use 4.1.2, but it was a risky choice, and we were prepared to back out if necessary. Also, given the application type, this risk level was acceptable. For a more mission-critical application, I probably would not have used MySQL 4.1.

MySQL supports multiple table types, the default is ISAM (indexed sequential access method), which is fast, but does not offer referential integrity through foreign key constraints. Instead, we used InnoDB tables in our system and used the constraint feature for referential integrity. Any potential performance loss was more than offset by the reliability that referential integrity adds to a complex database.

Using Unicode with MySQL 4.1 is simple, and there are two ways to specify which character set to use. The first approach specifies a character set as the default for a table; the second declares a character set for a specific field. Specifying a field's character set takes precedence over the table's character set.

Example 1. Define a default character set for a table

CREATE TABLE `city` (
  `city_id` int(11) NOT NULL default '0',
  `name` varchar(25) NOT NULL default '',
  `name_eng` varchar(25) NOT NULL default '',
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Example 2. Define a default character set and specific character set for a field

CREATE TABLE `test2` (
  `alpha` varchar(5) character set utf8 NOT NULL default '',
  `beta` varchar(10) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

In the first example, all fields are UTF-8: name stores Thai data and name_eng stores English language data. In the second example, the table's character set is latin1 and field alpha is UTF-8. latin1 is the standard character set for tables that don't specify a default character set.

On my version of Linux (SuSE 9.1), I can display Thai characters using the MySQL shell by starting it with the command mysql --default-character-set=utf8. Unfortunately, this did not work on Windows. Under Windows, we had to wait until we had a Webpage working before we could visually verify that we had Thai data. When using the commands mysqldump to back up a database and mysql to restore the database, you must use --default-character-set=utf8 option, otherwise character set information is lost.

We used a commercial product MySQL Manager 2 from EMS to design the database and generate our queries. The tool offers a nice visual query builder that can then display the query's text. We then copied the SQL code into our source. The tool is designed to work with MySQL 4.0 and did not support 4.1's character set extensions. If we created a table with the tool, we then had to manually execute an alter table command to change the character set.

Server platform

After selecting MySQL, our next choice was deciding which J2EE server to use. Development started with Tomcat 4.1.x, but we quickly switched to 5.0.19 for the better tag library support. As an extra bonus, we were able to take advantage of the EL (expression language) support integrated with 5.0.x.

The final major decision was our selection of DbForms. DbForms bills itself as a tool that "enables developers to build sophisticated Web-based database-driven applications in very short time and with very little efforts. DbForms applications are built in a manner conceptually similar to RAD [rapid application development]."

DbForms is best thought of as many different pieces: first, a J2EE framework for writing database-driven applications; second, a rich library of tags to take advantage of the framework; and third, a GUI tool that generates JSP (JavaServer Pages) pages from a database via XSL (Extensible Stylesheet Language) templates. We modified some of these template files to add support for including a common file at the top and generating all pages as UTF-8.

DBForms's GUI tool is useful in creating initial JSP pages for data views and entry. For simple screens, the code created by the GUI tool can be used unaltered. In our application, approximately 40 percent of the data entry screens were generated by the GUI tool and deployed unaltered. Another 30 percent of the screens required minimal coding to be useful, usually adding DbForms's Select tag so that the user could select a field value from a foreign database.

One limitation of the GUI tool is that it is not a "lifecycle" tool. If you use it to regenerate a page or the global dbforms-config.xml file, it will overwrite the old file. The GUI tool does not remember changes from previous edits. To get around this limitation, I created a parallel directory structure called "dummy," made new edits there, then copied and pasted from the dummy directory to the real directory. This proved useful in situations where I added a few fields to a table and wanted to regenerate the data entry screen, but did not want to lose any changes I had previously made to the JSP page.

Now that we have covered the big picture, let's start looking at the individual steps required to get things working.

Getting Unicode to work

The first thing we did was write a standalone Java program that would read in Thai characters from a MySQL database and write them out to a file. The test database was populated by hand. The program's key part is the section that opens the connection:

DriverManager.setLogWriter(new PrintWriter(System.out));
Class.forName("com.mysql.jdbc.Driver");
Connection db =
DriverManager.getConnection(
"jdbc:mysql:///" + dbname +
"?requireSSL=false&useUnicode=true&characterEncoding=UTF-8",
   "root", "");

The setLogWriter() function is used for diagnostics and isn't needed in production, but is helpful for tracking down problems.

The JDBC (Java Database Connectivity) connect string has three parameters. Their meaning should be obvious. We used requireSSL=false since we had problems building MySQL 4.1.x from scratch using SSL (Secure Socket Layer) on Linux. We didn't try very hard though; in our environment, SSL was not important. The useUnicode and characterSetEncoding parameters tell the JDBC subsystem that the database is Unicode and everything is UTF-8. Note that the string UTF-8 is the Java way of specifying the charset; MySQL uses the string utf8. If changing charsets, you'll have to compare what Java supports with what MySQL supports.

We also copied the above code to a JSP page and ensured we could display Thai. The only tricky part of that exercise was setting the page output charset. That was accomplished by putting the following line at the top of each page:

<%@ page contentType="text/html;charset=UTF-8" language="java" %>

Note: If you use Apache with mod_jk (the Apache module for running Tomcat as a plug-in within Apache), you will have to configure Apache to treat the page as UTF-8 as well; the Apache setting might overwrite the Tomcat setting (depending on which version you use). Our project used Tomcat as a standalone server.

Our next step was to get things working with DbForms. We started by using the GUI tool and setting the JDBC connect string, jdbc:mysql:///dbtest?requireSSL=false&useUnicode=true&characterEncoding=UTF-8. The tool then created a dbforms-config.xml file. Scroll down to the bottom of the file and you should see:

<dbconnection
  Name="jdbc:mysql:///stpneal2?requireSSL=false&amp;useUnicode=true
    &amp;characterEncoding=UTF-8"
  isJndi = "false"
  conClass  = "com.mysql.jdbc.Driver"
  username = "root"
  password  = ""/>

Note: If you do not see &amp;, then you are using an old version of the GUI tool, which does not handle XML escaping correctly.

In the DbForms GUI tool, click on the XSL Transformation tab and create some JSP files. You can then review the code produced to see what DbForms and the GUI tool can do for you in terms of generating ready-to-use JSP pages.

We then manually edited the generated JSP files and inserted the <%@ page contentType="text/html;charset=UTF-8" language="java" %> directive. After verifying it worked, we modified the XSL files used by the GUI tool to generate the code. The change will vary file-to-file, but it will look something like this:

 //--file " 
  <xsl:value-of select="$fileName" /> 
  " ------------------------------------------------ 
  <xsl:text disable-output-escaping="yes">
     <%@ page contentType="text/html;charset=UTF-8" language="java" %>
     <%@ include file="inc_taglibs.jsp" %> 
  </xsl:text>

As you can see, we modified the XSL file to output two directives at the top of each file generated. One sets the page contentType and charset; the other includes a file that we used to define all our taglibs.

Everything worked fine in terms of displaying data, but updating data did not work correctly. A little digging revealed the problem to be that the form-processing code did not know what charset to use. A limitation of HTML form processing is that the data's charset entered in the form or in the page used to display the form is not passed back to the server. Once we discovered the problem, the answer was simple: we set Tomcat to treat all form data as UTF-8. We did that by defining a filter in the web.xml file. A filter is a code snippet called by the Web server that manipulates the input before passing it on. The filter code is in the class file org.dbforms.util.external.SetCharacterEncodingFilter. This file is a copy of the sample filter that ships with Tomcat, but it has been copied into the DbForms JAR for ease of use:

<!-- ======== Filter for Set Character Encoding  ======= --> 
<filter>
  <filter-name>Set Character Encoding</filter-name> 
  <filter-class>
     org.dbforms.util.external.SetCharacterEncodingFilter
  </filter-class> 
 <init-param>
  <param-name>encoding</param-name> 
  <param-value>UTF-8</param-value> 
  </init-param>
</filter>

At this point, our application displayed and updated Thai information in the database, but we still had many tasks to complete in order to make our application fully usable as a multilingual site.

The next task was to display static content on the page that was localized. The easiest way to perform localization is with tags, and we could choose from two different tag libraries: the standard formatting (prefix FMT) tag library that is part of JSTL (Java Standard Tag Library), or DbForms's db:message tag. We ended up supporting both. In most cases, we used the FMT tag explicitly in our JSP code, but we used the DbForms message tag code implicitly to perform localization on HTML widgets like buttons.

Fortunately, both tag libraries worked the same way—they used a set of property files in the WEB-INF/ classes directory composed of key-value statements.

A brief introduction to ApplicationResources files for localization

Localization is performed through a hierarchy of property files. There is a base file and then descendent files. If a given key cannot be found in a descendent file, its parent files will be checked in ascending order. The hierarchy model is used to support a default file, a specific language, and its dialects.

Assuming we use the name ApplicationResources for our file naming, we can have four files in WEB-INF/ classes:

ApplicationResources.propertiesThe base file
ApplicationResources_en.propertiesEnglish language
ApplicationResources_en_UK.propertiesEnglish, British dialect
ApplicationResources_th.properties,Thai language

Depending on our current locale's setting, we start with one of these files. If a key cannot be found, then the next ascending file is checked until we run out of files and generate a not found.

1 2 Page 1
Page 1 of 2