Tuesday, October 23, 2012

Character encoding, Java, JavaScript and Tomcat

Today my main issue has been with character encoding. This seems to pop up every couple of months and I either learn something new or deal with the same thing all over again. This time I learned a few things that were new.

The Setup: Users can type in custom tags for photos in the app I'm working on, and they can filter the photos by those tags. The filter sends the selected tags to the server in a GET request; the server parses the request and builds a new photos page containing only the photos that match. QA tested the tags with every character we might care about, which revealed a few major problems.

The Issue: The app is older and foreign language support is new, so we constantly find places in the app where non-ASCII characters don't work or cause strange problems. This time there were three separate problems. First, when the JavaScript sent these characters in a GET request, they went out as something other than UTF-8. Second, when the request arrived at the server, Tomcat decoded the characters incorrectly again, so what I got back from the request as a parameter was garbled. Third, some of the characters had been HTML-encoded before being saved in the database, so they came back in that form when the information about a photo was retrieved.
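To make the second problem concrete, here is a minimal sketch of what a wrong decode does to a tag. It assumes the wrong charset was ISO-8859-1, which was Tomcat's default for decoding query strings in versions from that period; the exact charset involved here is an assumption, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encodes the text as UTF-8 bytes, then decodes those bytes as ISO-8859-1,
    // which is what happens when the receiver assumes the wrong charset.
    // "é" (\u00e9) comes out as the two characters "Ã©" (\u00c3\u00a9).
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Reverses misdecode(). This works because ISO-8859-1 maps every byte to
    // exactly one character, so no bytes were lost in the wrong decode.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    // Forces the text through ISO-8859-1. Characters that the charset cannot
    // represent are replaced with '?', and the original is gone for good.
    static String lossyEncode(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.ISO_8859_1);
    }
}
```

The first kind of damage is reversible, but it is much better to decode correctly in the first place. The second kind is one way to end up with literal ? symbols where a tag should be.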

The Solutions:
First off, I made sure each JSP contained the following line:
<%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8" %>
This fixed the problem with the JavaScript sending the characters in the wrong format.
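Second, I made sure the Tomcat server.xml started with the following line:
<?xml version="1.0" encoding="UTF-8"?>
This didn't fix anything but was something I read should be done anyway to ensure we support UTF-8. (The XML declaration only tells the parser how server.xml itself is encoded. The Tomcat setting that affects how GET parameters are decoded is the URIEncoding attribute on the Connector element in the same file, for example URIEncoding="UTF-8".)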
Third, in the Java code I made sure to use StringEscapeUtils.unescapeHtml (from Apache Commons Lang) when retrieving the photo tags from the database.
Finally, I couldn't use the HttpServletRequest.getParameter function because Tomcat was decoding the strings, apparently with the wrong charset, before returning them. From what I read, step two above was supposed to help with that but isn't a guarantee. The most reliable thing to do is call getQueryString(), which returns the raw query string without any decoding by the container, and decode it yourself. So I did that, using URLDecoder.decode(string, "UTF-8"), and finally got the tags from the GET request in the correct format.
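The last step can be sketched roughly as follows. This is not the exact code from the app, just a minimal version of the idea; the class name, the method names and the "tags" parameter name in the usage note are placeholders. The one detail worth getting right is to split on & and = before decoding, so that an encoded & or = inside a tag isn't mistaken for a separator.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class QueryStringParser {
    // Parses a raw (still percent-encoded) query string into decoded
    // name -> values pairs, treating the encoded bytes as UTF-8.
    static Map<String, List<String>> parse(String rawQuery) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        // Split while the string is still encoded, then decode each piece.
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String rawName = eq >= 0 ? pair.substring(0, eq) : pair;
            String rawValue = eq >= 0 ? pair.substring(eq + 1) : "";
            params.computeIfAbsent(decode(rawName), k -> new ArrayList<>())
                  .add(decode(rawValue));
        }
        return params;
    }

    // URLDecoder also turns '+' into a space and throws
    // IllegalArgumentException on a malformed % sequence.
    private static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```

In a servlet this would be called as something like QueryStringParser.parse(request.getQueryString()).get("tags"), in place of request.getParameterValues("tags").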

These steps combined got the tags into a regular readable format instead of those annoying ? symbols. Once the tags were in the right format all my other code started working correctly.
