[Bf-committers] Proposal for handling string encoding in blender.

Roger Wickes rogerwickes at yahoo.com
Fri Aug 13 15:47:00 CEST 2010


I volunteered to look into this for scripts, and found that UTF-8 encoding is a
safe way to go. There are many string encoding/decoding standards (UTF, UCS,
etc.) and variants within each family; see
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings.
Strings have ended up in the same situation as video codecs. Geez. Anyhoo...

Requirements: in general, I think we want to encode and decode arbitrary binary
strings into text strings that can be entered from any keyboard, saved in the
blend file and decoded losslessly, displayed on the user's computer, safely
sent by email, used as part of a URL or an HTTP POST request, used as a valid
filename, and so on.

I think that UTF-8 would suit our purposes now and for the next decade or two.
UTF-8 can encode any Unicode character.
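As a quick illustration (a Python sketch; the sample string is an arbitrary
mixed-script example), any Unicode text round-trips through UTF-8 losslessly,
which is the save/load property we want for the blend file:

```python
# Any Unicode text survives a UTF-8 encode/decode round trip unchanged.
text = "Grüße, 日本語, Ελληνικά"   # arbitrary mixed-script sample
encoded = text.encode("utf-8")      # bytes suitable for storage on disk
decoded = encoded.decode("utf-8")
assert decoded == text              # lossless round trip
```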

One subtlety: UTF-8 never produces zero bytes, so simple C string operations
such as copy still work, but any code that assumes one byte per character
(lengths, indexing, truncation) can split a multi-byte sequence and corrupt
the string. This means a pass through the ENTIRE code base is needed to seek
out such str functions and make them encoding-aware.
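To make the byte-vs-character pitfall concrete, here is a small Python sketch
(the string is just an example); the same hazard applies to any fixed-size,
byte-counting buffer copy in C:

```python
s = "naïve"                 # 5 characters...
b = s.encode("utf-8")       # ...but 6 bytes: "ï" encodes as two bytes
assert len(s) == 5 and len(b) == 6

# Truncating at an arbitrary byte boundary can split a multi-byte
# sequence, which is exactly what a byte-counting strncpy-style
# operation would do:
try:
    b[:3].decode("utf-8")   # cuts the two-byte "ï" in half
except UnicodeDecodeError:
    print("truncation split a multi-byte character")
```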

In 2007, Python adopted UTF-8 as the default source encoding:
http://www.python.org/dev/peps/pep-3120/
For a Py3 discussion of handling OS data that is not valid UTF-8, see
http://www.python.org/dev/peps/pep-0383/. For displaying encoded strings,
Py3k uses a fairly involved process: http://www.python.org/dev/peps/pep-3138/
For the Python code base itself, as of 2007 they also had issues and more
questions than answers; see http://www.python.org/dev/peps/pep-3131/, and the
bottom line is: stick to normal English characters.


UTF-16 is the main alternative for international error messages, etc., and is
what Mac OS X and Windows use internally. It, too, can encode any glyph.
UTF-16 has some space-saving advantages, but only if the text is mostly
non-ASCII glyphs: characters U+0800 through U+FFFF take three bytes in UTF-8
but only two in UTF-16, so text in (for example) Chinese, Japanese, or Hindi
could take more space in UTF-8 if those characters outnumber the ASCII
characters. This rarely happens in real documents; for example, both the
Japanese and the Korean Wikipedia articles on UTF-8 take more space if saved
as UTF-16 than in their original UTF-8 form.
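The size trade-off is easy to check in Python (the sample strings are
arbitrary; utf-16-le is used so the byte-order mark is not counted):

```python
ascii_text = "hello world"
cjk_text = "こんにちは"  # five Japanese characters in the U+0800..U+FFFF range

# ASCII-heavy text: UTF-8 is half the size of UTF-16 (1 byte/char vs 2).
assert len(ascii_text.encode("utf-8")) == 11
assert len(ascii_text.encode("utf-16-le")) == 22

# CJK-heavy text: UTF-16 wins, 2 bytes per character vs 3.
assert len(cjk_text.encode("utf-8")) == 15
assert len(cjk_text.encode("utf-16-le")) == 10
```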

--Roger


      

