
Getting UTF-8 out of Domino web agents

A common technique for getting XML data out of IBM Lotus Domino is to build an agent which outputs the DXL encoding of a document and call it via HTTP. The code typically looks like this:

Print "Content-type: text/xml"
Dim session As New NotesSession
Dim doc As NotesDocument
[...obtain your data somehow in the variable doc...]
Dim exporter As NotesDXLExporter
Set exporter = session.CreateDXLExporter
exporter.OutputDOCTYPE = False
Dim stream As NotesStream
Set stream = session.CreateStream
Call exporter.SetInput(doc)
Call exporter.SetOutput(stream)
Call exporter.Process
Print stream.ReadText()

However, there’s a subtle error in the above code. The kind of error that can make everything look fine in testing, then cause your integration work to fall over in production.

NotesDXLExporter creates UTF-8 output, including a first line that says

<?xml version="1.0" encoding="utf-8"?>

NotesStream objects created from a NotesSession default to UTF-8 as well. LotusScript String variables hold Unicode text internally (two bytes per character), so accented characters survive intact up to this point. However, Notes agents output ISO-8859-1 by default, because that was historically the default character set for text content served over HTTP.

So if your data happens to contain any of the accented characters which ISO-8859-1 represents via single octets in the range 192-255, those characters will be output as single octets in that range. However, that’s not valid UTF-8; in UTF-8, an octet with the top bit set only ever appears as part of a multibyte character sequence, so every character outside ASCII takes at least two octets.
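To make the mismatch concrete, here is a minimal sketch in Python (used purely for illustration; it is not Domino code). The sample word and the helper name is_valid_utf8 are assumptions chosen for the example. It encodes the same text both ways and checks whether each result is legal UTF-8:

```python
# Illustration only: Python stands in for the octets that travel over the wire.
text = "café"  # hypothetical sample data containing one accented character

latin1_bytes = text.encode("iso-8859-1")  # what the agent sends by default
utf8_bytes = text.encode("utf-8")         # what the XML declaration promises


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(list(latin1_bytes))           # [99, 97, 102, 233]
print(list(utf8_bytes))             # [99, 97, 102, 195, 169]
print(is_valid_utf8(latin1_bytes))  # False
print(is_valid_utf8(utf8_bytes))    # True
```

The single octet 233 is what ISO-8859-1 uses for é, whereas UTF-8 needs the pair 195, 169. A strict UTF-8 decoder rejects the lone 233.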

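That means your XML will contain an invalid character sequence, and any conforming XML parser that attempts to read it will choke, because the XML specification treats a byte sequence that is illegal in the declared encoding as a fatal error. That includes the JAXP parser bundled with Java 6.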
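The failure is easy to reproduce on the receiving side. The sketch below uses Python's standard xml.etree.ElementTree parser rather than JAXP, and the name element is a made-up payload, not real DXL. It feeds the parser one document whose octets are ISO-8859-1 despite the utf-8 declaration, and one that is genuinely UTF-8:

```python
import xml.etree.ElementTree as ET

DECLARATION = '<?xml version="1.0" encoding="utf-8"?>'
BODY = "<name>José</name>"  # hypothetical payload, not real DXL


def parses(payload: bytes) -> bool:
    """Return True if the parser accepts the payload, False on a parse error."""
    try:
        ET.fromstring(payload)
        return True
    except ET.ParseError:
        return False


# Declared as UTF-8 but actually encoded as ISO-8859-1: the broken case.
mislabelled = (DECLARATION + BODY).encode("iso-8859-1")
# Declared as UTF-8 and really UTF-8: the fixed case.
correct = (DECLARATION + BODY).encode("utf-8")

print(parses(mislabelled))  # False
print(parses(correct))      # True
```

Note that a test document containing only ASCII would parse fine in both cases, which is why this bug tends to slip through testing.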

The solution turns out to be simple, but obscure. Instead of

Print "Content-type: text/xml"

you need to do

Print "Content-type: text/xml; charset=utf-8"

Believe it or not, Domino sniffs the output of your agent for a Content-type header, and uses the charset it finds there to decide what character set translations it should carry out. The added parameter will switch Domino’s web server into UTF-8 mode, effectively preventing Domino from messing with your data. Your accented characters will then be left as UTF-8 and will decode properly for the recipient.
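If you consume such an agent from another system, a cheap sanity check is to confirm that the response declares UTF-8 and that the body really decodes as UTF-8. The following is a rough sketch in Python, not part of any Domino API. The function names are invented for this example, and falling back to iso-8859-1 when no charset is given is an assumption that mirrors the default described above:

```python
from email.message import Message


def declared_charset(content_type: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-type header value.

    Falls back to `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def looks_like_safe_utf8_response(content_type: str, body: bytes) -> bool:
    """Return True only if the header says UTF-8 and the body decodes as UTF-8."""
    if declared_charset(content_type) != "utf-8":
        return False
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(declared_charset("text/xml"))                 # iso-8859-1
print(declared_charset("text/xml; charset=utf-8"))  # utf-8
```

Run against a few documents that are known to contain accented characters, a check like this would catch the problem in testing rather than in production.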