[Bf-committers] Proposal for handling string encoding in blender.
vekoon at gmail.com
Fri Aug 13 16:35:19 CEST 2010
Let's clarify some things.
First of all, UTF-8 was specifically designed to replace ASCII
painlessly: no character other than NUL itself is encoded with a zero
byte, so you can always use a C string to hold UTF-8 text.
Also, UTF-8 can encode ANY character representable in the Unicode
standard, which covers most of the characters in use in the world.
The "é" character is simply not ASCII but Latin-1. You don't notice the
difference when using it in C because in Latin-1 it still takes only 1
byte to store (in UTF-8 the same character takes 2 bytes, 0xC3 0xA9).
ASCII is actually a 7-bit standard; you get the extra bit for free
because a char in C is at least 8 bits wide.
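
To make the byte-versus-character point concrete, here is a minimal
sketch. The helper name utf8_char_count is made up for illustration (it
is not an existing Blender function) and it assumes the input is already
well-formed UTF-8:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a Blender function): counts Unicode code points
 * in a NUL-terminated UTF-8 string by skipping continuation bytes, which
 * always have the bit pattern 10xxxxxx.
 * Assumption: the input is well-formed UTF-8. */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s; s++) {
        /* Every byte that is NOT a continuation byte starts a new character. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the UTF-8 string "t\xC3\xA8st" ("tèst"), strlen() reports 5 bytes
while utf8_char_count() reports 4 characters, and the ordinary C string
functions keep working because no byte in the text is zero.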
That being said, I don't think it's a good idea to store the encoding
in the .blend; that would be useless without full Unicode support.
The easiest thing to do is to simply assume we always use UTF-8
internally, as it's ASCII-compatible; any other encoding would require
too much work.
What I was suggesting is that we try to detect whether certain strings
(like file paths) are UTF-8 and, if they aren't, check whether they are
UTF-16/32 and convert them before storing. The same applies in reverse
when turning a file path from the .blend into one for external use.
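
The first step of that detection, checking whether a byte string is
well-formed UTF-8, could look roughly like the sketch below. This is my
own illustration, not the Firefox detector and not existing Blender
code; the function names are hypothetical. It only answers "is this
valid UTF-8?"; telling UTF-16/32 apart would need further heuristics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the len bytes at s are well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;          /* number of continuation bytes expected */
        unsigned long cp;  /* decoded code point */

        if (c < 0x80)                { i++; continue; }       /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; }
        else return false;  /* stray continuation byte or invalid lead byte */

        if (len - i <= n) return false;  /* sequence is cut off */

        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlong forms, UTF-16 surrogates and values past U+10FFFF. */
        if ((n == 1 && cp < 0x80) ||
            (n == 2 && cp < 0x800) ||
            (n == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF)
            return false;
        i += n + 1;
    }
    return true;
}

/* Convenience wrapper for NUL-terminated C strings (also hypothetical). */
bool is_valid_utf8_cstr(const char *s)
{
    assert(s != NULL);
    return is_valid_utf8((const unsigned char *)s, strlen(s));
}
```

A path containing a Latin-1 "è" (the single byte 0xE8) fails this check,
because 0xE8 looks like the start of a 3-byte sequence but is not
followed by continuation bytes, while the same path stored as UTF-8
passes.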
So if we have stored a relative path like //tèst/file.png (in UTF-8) and
the OS wants UTF-16, we can convert it before passing the path to the
OS. And because most OSes use some version of UTF, by handling all of
them we can be reasonably sure we won't break too many file tree setups.
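
As an illustration of the UTF-8 to UTF-16 step, here is a minimal sketch.
The function name and signature are hypothetical (nothing like this is
claimed to exist in Blender), the output is in native byte order, and the
checking is deliberately light: overlong forms are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts a NUL-terminated UTF-8 string into UTF-16
 * code units (native byte order), writing at most cap units including the
 * terminating 0. Returns the number of units written, not counting the
 * terminator, or (size_t)-1 on malformed input or insufficient space.
 * Assumption: overlong UTF-8 forms are not detected by this sketch. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    const unsigned char *s = (const unsigned char *)src;
    size_t out = 0;

    while (*s) {
        uint32_t cp;
        int n;  /* continuation bytes still to read */
        if (*s < 0x80)                { cp = *s;        n = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; n = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; n = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; n = 3; }
        else return (size_t)-1;
        s++;
        while (n-- > 0) {
            /* This test also stops at the NUL terminator of a cut-off string. */
            if ((*s & 0xC0) != 0x80) return (size_t)-1;
            cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (cp < 0x10000) {
            if (out + 1 >= cap) return (size_t)-1;  /* keep room for the 0 */
            dst[out++] = (uint16_t)cp;
        } else {
            /* Characters above U+FFFF become a surrogate pair. */
            if (out + 2 >= cap) return (size_t)-1;
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (out >= cap) return (size_t)-1;
    dst[out] = 0;
    return out;
}
```

With the path from the example, "//t\xC3\xA8st/file.png" comes out as 15
UTF-16 units, and the "è" becomes the single unit 0x00E8.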
Of course there are other apps already doing this, like all web
browsers. For instance Firefox, which even provides a library for
encoding detection (independent C++ library usable from C too):
Note that I've never actually tried it, but I guess it has to work
considering it's used in Firefox. The article is dated, but I think it's
still valid. The source is here, I believe: