As some of you might have noticed, I have rewritten my blog's internals to bring its appearance in line with the general website design and to allow comments in the future. Since I have now switched "full-time" to Python 3.x, I ran into some real problems with encodings in Python 3 CGI scripts, which I want to discuss here.
It all started with a Python 3.x script running under Apache on my server, which failed with:
UnicodeEncodeError: 'ascii' codec can't encode character 'xxx' in position 0: ordinal not in range(128)
Since I encode everything in UTF-8 and all my systems use this
encoding, I didn't expect problems. I first tried to write the UTF-8
bytes directly to sys.stdout, like:
import sys
bprint = sys.stdout.buffer.write
text_ugly = "öäü……·"
bprint( text_ugly.encode('utf-8') )
Strangely enough, this worked for only one CGI script; the other one
still threw an error, even though both read the same source. So I added
the following two lines to my apache.conf:
AddDefaultCharset UTF-8
SetEnv PYTHONIOENCODING utf8
I thought this would be the nicest solution, but it didn't work either.
I then read numerous threads discussing the same problem, but the
solutions people found there were of no use to me. My concrete problem
was a UnicodeEncodeError saying that surrogate characters could not be
encoded to UTF-8. Something weird turned my umlauts into surrogate
characters, or more specifically, just one of them. All umlauts before
and after this one were fine!
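To see what such an error looks like in isolation, here is a minimal sketch. It assumes the surrogate stems from a single byte that is not valid UTF-8; the byte string and the helper find_surrogates are made up for illustration and are not taken from the actual scripts.

```python
def find_surrogates(text):
    """Return (index, code point) pairs for every lone surrogate in text."""
    return [(i, hex(ord(ch))) for i, ch in enumerate(text)
            if 0xD800 <= ord(ch) <= 0xDFFF]

# Hypothetical input: valid UTF-8 for "ä" and "ü", with one stray
# Latin-1 byte (0xF6, "ö") in between that is not valid UTF-8.
raw = b"\xc3\xa4\xf6\xc3\xbc"

# Decoding with 'surrogateescape' maps the undecodable byte to a lone
# surrogate instead of raising an error.
name = raw.decode("utf-8", errors="surrogateescape")
print(find_surrogates(name))

# A strict encode then fails on exactly that one character.
try:
    name.encode("utf-8")
    reason = None
except UnicodeEncodeError as exc:
    reason = exc.reason
    print("position", exc.start, "-", reason)
```

Running a helper like this over the strings a script is about to print shows which character is the offending one, which is handy when everything around it encodes fine.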
This article at least pointed me towards an understanding of the
problem. Not all functions in Python 3 are (or were) fully
Unicode-compatible, e.g. os.environ, and they sometimes produce
surrogate characters. In my case it was os.listdir(). On my local
laptop I'm running Debian Unstable with Python 3.3.x, while my server
has Python 3.2. The problem only occurs on my server, so I guess the
Python developers might have fixed it. My simple solution for now is to
do the encoding like this:
s.encode('utf-8', errors='surrogateescape')
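As a rough sketch of how this could be wrapped up for a CGI script, the helper below encodes with 'surrogateescape' before the bytes go to sys.stdout.buffer. The names to_output_bytes and emit as well as the sample bytes are hypothetical, chosen only for this example.

```python
import sys

def to_output_bytes(text, errors="surrogateescape"):
    """Encode text as UTF-8; lone surrogates become their original bytes."""
    return text.encode("utf-8", errors=errors)

def emit(text):
    """Write text to stdout as raw bytes, bypassing the stdout encoding."""
    sys.stdout.buffer.write(to_output_bytes(text))

# Hypothetical string containing one escaped byte (0xF6) as a surrogate.
raw = b"\xc3\xa4\xf6\xc3\xbc"
name = raw.decode("utf-8", errors="surrogateescape")

# 'surrogateescape' restores the original bytes exactly ...
restored = to_output_bytes(name)
# ... while 'replace' substitutes a question mark for the bad character.
replaced = to_output_bytes(name, errors="replace")
```

Note that 'surrogateescape' only avoids the exception: the escaped byte is written out unchanged, so the output may still contain a byte that is not valid UTF-8. If the page has to be strictly valid, errors="replace" is an alternative that trades the bad character for a "?".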