
As some might have noticed, I have rewritten my internal blog handling to make its appearance more consistent with the general website design and to allow comments in the future. Since I have now switched "full-time" to Python 3.x, I ran into some real problems with encodings in Python 3 CGI scripts, which I want to discuss here.

It all started when a Python 3.x script running under Apache on my server failed with:

UnicodeEncodeError: 'ascii' codec can't encode character 'xxx' in position 0: ordinal not in range(128)

Since I encode everything in UTF-8 and all my systems use this encoding, I didn't expect problems. I first tried to write the UTF-8 encoded bytes directly to sys.stdout, like this:

import sys
bprint = sys.stdout.buffer.write
text_ugly = "öäü……·"
bprint( text_ugly.encode('utf-8') )
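A likely reason why the plain text stream fails in the first place is that Apache starts CGI scripts without a UTF-8 locale, so Python picks ASCII for sys.stdout; that is an assumption on my part, not something I verified on the server. Instead of encoding every string by hand, a minimal sketch is to wrap the binary stream once in a UTF-8 text stream. The helper name utf8_text_stream is made up for this example, and an in-memory buffer stands in for sys.stdout.buffer:

```python
import io

def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that str written to it is encoded as UTF-8.

    Hypothetical helper for illustration; in a CGI script the argument
    would be sys.stdout.buffer.
    """
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)

# Demonstration with an in-memory buffer instead of the real stdout.
demo_buffer = io.BytesIO()
out = utf8_text_stream(demo_buffer)
out.write("Content-Type: text/html; charset=utf-8\n\n")
out.write("öäü……·\n")
out.flush()

# The buffer now holds UTF-8 bytes, regardless of the process locale.
written = demo_buffer.getvalue()
```

In a real script the call would be out = utf8_text_stream(sys.stdout.buffer), done once at the top before anything is printed.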

Strangely enough, this worked with only one of my CGI scripts; the other one still threw an error, even though both read the same source. So I added the following to my apache.conf:

AddDefaultCharset UTF-8

I thought that would be the nicest solution, but it didn't work either. I then read numerous threads discussing the same problem, but the solutions people found there did not work for me. My concrete problem was a UnicodeEncodeError saying that surrogate characters could not be encoded to UTF-8. Something weird had turned my umlauts into surrogate characters, or more specifically, just one of them. All umlauts before and after it were fine!

This article at least pointed me towards an understanding of the problem. Not all functions in Python 3 are (or were) fully Unicode-compatible; some, such as os.environ, sometimes produce surrogate characters. In my case it was os.listdir(). My local laptop runs Debian Unstable with Python 3.3.x, while my server has Python 3.2. The problem only occurs on my server, so I guess the Python developers might have fixed it. My simple solution for now is to do the encoding like this:

s.encode('utf-8', errors='surrogateescape')
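
To see what this error handler actually does, here is a small sketch that reproduces the effect without touching the file system. It assumes the surrogate came from a single byte that was not valid UTF-8 (for example a Latin-1 "ö" in a file name); I did not confirm that this is what happened on my server, and the byte string is made up for the demonstration:

```python
# A name that is valid UTF-8 except for one trailing Latin-1 byte (0xF6 = "ö").
raw = b"gr\xc3\xbc\xc3\x9f \xf6"

# Decoding with surrogateescape keeps the undecodable byte as a lone surrogate.
text = raw.decode("utf-8", errors="surrogateescape")

# A strict encode refuses the surrogate; this is the error I was seeing.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With surrogateescape the surrogate is turned back into the original byte.
restored = text.encode("utf-8", errors="surrogateescape")
```

Note that the restored bytes are the original ones, so a byte that was not valid UTF-8 going in is still not valid UTF-8 coming out. The error goes away, but a browser may still show a replacement character for that one position.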