[Bf-committers] Proposal for handling string encoding in blender.

Fri Aug 13 18:01:16 CEST 2010

@Jeroen, my plan was to make UTF-8 the only possible encoding (it is a
superset of ASCII), so there would be no need to flag files or treat
them differently.

@Roger Wickes, I think UTF-8 is the only encoding we can consider
supporting in Blender, because of its ASCII compatibility. Later on we
can make sure string editing is properly supported (at the moment a
backspace won't work on a multibyte UTF-8 char). Thanks for the extra
info all the same.
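
To illustrate the backspace issue: in UTF-8 every continuation byte has
the bit pattern 10xxxxxx, so a backspace has to step back over those
bytes to the lead byte of the character. This is only a rough sketch in
Python, not Blender's text editing code; utf8_backspace is a made-up
name and the buffer is assumed to already hold valid UTF-8.

```python
def utf8_backspace(buf: bytes) -> bytes:
    """Drop the last UTF-8 character from buf (hypothetical helper).

    Assumes buf is valid UTF-8.
    """
    if not buf:
        return buf
    i = len(buf) - 1
    # Continuation bytes look like 0b10xxxxxx; walk back over them
    # until we are on the lead byte of the last character.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return buf[:i]


name = "caf\u00e9".encode("utf-8")  # the accented letter takes 2 bytes
shortened = utf8_backspace(name)    # removes both bytes of that letter
```

Removing a single byte instead would leave a dangling lead byte, which
is exactly the kind of string Python can no longer decode.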

@Diego B, yes, this is about limiting input; the storage in DNA won't
change.
- The patch should be quite small.
The thing I don't understand is that the character "é" renders OK in
the py-console but Python complains it's not supported, but you say it
only supports UTF-8.
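
One possible explanation, assuming the "é" ends up stored as a single
Latin-1 byte rather than as UTF-8: that byte (0xE9) is not a valid
UTF-8 sequence on its own, so decoding fails. A small Python sketch of
the difference (decodes_as_utf8 is just an illustrative helper):

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_bytes = "\u00e9".encode("latin-1")  # one byte: 0xE9
utf8_bytes = "\u00e9".encode("utf-8")      # two bytes: 0xC3 0xA9

latin1_ok = decodes_as_utf8(latin1_bytes)  # False, 0xE9 alone is invalid
utf8_ok = decodes_as_utf8(utf8_bytes)      # True
```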

The only changes needed are...
- UI input would ensure UTF-8 for all non-path names.
- Python ensures UTF-8 encoding when setting strings.
- Python returns UTF-8 strings, except for file paths, which use
  Py_FileSystemDefaultEncoding.
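
Roughly, the rules above could look like this from the Python side.
This is only a sketch of the idea, not the actual RNA/bpy code:
ensure_utf8 and get_string are hypothetical names, and os.fsdecode
stands in for decoding with the filesystem default encoding.

```python
import os


def ensure_utf8(data: bytes) -> bytes:
    """Hypothetical setter check: refuse byte strings that are not UTF-8."""
    data.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    return data


def get_string(data: bytes, is_path: bool = False) -> str:
    """Hypothetical getter: file paths use the filesystem encoding,
    every other string is decoded as UTF-8."""
    if is_path:
        return os.fsdecode(data)
    return data.decode("utf-8")


object_name = get_string(ensure_utf8(b"Cube"))
image_path = get_string(b"//textures/file.png", is_path=True)
```

With this split, names are always decodable by Python, while paths are
passed through in whatever encoding the OS uses.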

@Elia, yep, this proposal will hopefully make Blender work with
minimal effort, so we don't need to add encoding info to a
blend file.

You mention being smarter about using pathnames with the OS. I'm not
against this, but I think it could be done separately.
My proposal mainly deals with Python being compatible with existing
Blender data and preventing the user from setting names Python can't
decode.

On Sat, Aug 14, 2010 at 12:35 AM, Elia Sarti <vekoon at gmail.com> wrote:
> Let's clarify some things.
>
> First of all UTF-8 was specifically designed to replace ASCII
> painlessly, thus it does not contain zero (or null) bytes, meaning you
> can always have a C string to hold UTF-8 chars.
>
> Also UTF-8 can encode ANY character representable in the Unicode
> standard, which means most characters in the world can be encoded in UTF-8.
> The "é" character is simply not ASCII but latin-1. The fact that you
> don't notice the difference when using it in C is that it only takes 1
> byte to store this value, but ASCII is actually a 7-bit standard, you
> get the extra bit for free because the minimum data size in C is 8-bits.
>
> That being said, I don't think it's a good idea to have encoding stored
> in the .blend, this would be useless without full unicode support.
>
> The easiest thing to do is to simply assume we always use UTF-8
> internally, as it's ASCII compatible, any other encoding would require
> too much work.
>
> What I was suggesting is that we try to detect if certain strings (like
> file paths) are UTF-8 and if they aren't see if they are UTF-16/32 and
> convert them before storing. The same in reverse from a file path in the
> .blend to one for external usage.
> So if we have stored a relative path like //tèst/file.png (in UTF-8) and
> the OS wants UTF-16 we can convert this before passing the path to the
> OS and because most OSes use some version of UTF, by handling all of
> them we're reasonably safe we won't break too many file tree setups.
>
> Of course there are other apps already doing this, like all web
> browsers. For instance Firefox, which even provides a library for
> encoding detection (independent C++ library usable from C too):
>
> http://www.mozilla.org/projects/intl/detectorsrc.html
>
> Note that I've never actually tried it, but I guess it has to work
> considering it's used in Firefox. The article is dated but I think it's
> still valid, the source is here I believe:
>
> http://mxr.mozilla.org/mozilla/source/extensions/universalchardet/

-- 
- Campbell