[Bf-committers] Proposal for handling string encoding in blender.
rogerwickes at yahoo.com
Fri Aug 13 15:47:00 CEST 2010
I volunteered to look into this for scripts, and found that UTF-8 encoding is the
way to go. There are many string encoding/decoding standards/codecs (UTF, UCS,
etc.) with variants within each family.
The same situation we have with video codecs has happened with strings. Geez. Anyhoo...
Requirements: In general, I think we want to encode and decode arbitrary binary
data as text strings that can be entered from any keyboard, saved and decoded
losslessly in the blend file, displayed on the user's computer, safely sent by
email, used as part of a URL or an HTTP POST request, be valid in a filename, etc.
I think that UTF-8 would suit our purposes now and for the next decade or two.
UTF-8 can encode any Unicode character.
One caveat: UTF-8 was designed so that encoded strings never contain zero bytes
(other than NUL itself), so plain byte-level C string handling such as copying
still works. What breaks are character-oriented operations: counting characters,
truncating at an arbitrary byte offset, or stepping through a string byte by byte
can split a multi-byte sequence. This means that a pass through the ENTIRE code
base is needed to seek out the str functions that assume one byte per character
and make them multi-byte aware.
In 2007, Python adopted UTF-8 as the default source encoding for Python 3 (PEP
3120) and recoded their base to use it.
For the Py3 handling of non-decodable bytes from the OS, see
http://www.python.org/dev/peps/pep-0383/. For displaying encoded strings,
Py3k uses a pretty involved process: http://www.python.org/dev/peps/pep-3138/
For the Python code base itself, as of 2007 they also had issues and more
questions than answers; see http://www.python.org/dev/peps/pep-3131/, and the
bottom line is: stick to English (ASCII) identifiers for shared code.
UTF-16 is the main alternative for international error messages etc., and is
what is used internally in Mac OS X and Windows. It too can encode any Unicode
character. There are some space-saving advantages to UTF-16, but only if the
text is mostly non-ASCII.
Characters U+0800 through U+FFFF use three bytes in UTF-8 but only two in
UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi can
take more space in UTF-8 if there are more of these characters than there are
ASCII characters. This rarely happens in real documents; for example, both the
Japanese and the Korean Wikipedia articles on UTF-8 take more space if saved as
UTF-16 than the original UTF-8 versions.