[Bf-committers] Proposal for handling string encoding in blender.

Sat Aug 28 15:28:22 CEST 2010

Following up on this topic.
I committed utf8 limits for data names, with an exception for filepaths.
This applies to the UI and PyAPI, but I found some interesting things
while working on this.

Python has a number of error callbacks to handle incompatible chars
when encoding and decoding. They can be used from the C/API:
- PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u), PyUnicode_GET_SIZE(u),
"surrogateescape");
- PyUnicode_DecodeUTF8(str, strlen(str), "surrogateescape");
See http://docs.python.org/py3k/library/codecs.html#codecs.register
for more info.

This means an invalid byte gets converted to a lone surrogate such as
\udce9 (the byte 0xE9 becomes U+DCE9).
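
As a rough sketch in plain Python (not Blender code), the same handler can
be tried from the interpreter. The byte string below is only an
illustrative Latin-1 encoding of "numéro"; the variable names are made up
for this example.

```python
# Bytes that are not valid UTF-8: "numéro" encoded as Latin-1.
raw = b"num\xe9ro"

# Decoding: the stray byte 0xE9 is mapped to the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))  # 'num\udce9ro'

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")
print(restored == raw)  # True
```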

This makes the whole problem seem very simple: use these as fallbacks
for _PyUnicode_AsString and PyUnicode_FromString and we ALWAYS get a
valid string from ANY C char * array (tested with random byte arrays).
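
A test along those lines might look like the sketch below. It is not the
test that was actually run, just a plain Python approximation: the helper
name, the array size and the number of iterations are arbitrary, and NUL
bytes are removed since a C string cannot contain them.

```python
import os

def roundtrip_ok(data):
    # bytes -> str -> bytes, using surrogateescape in both directions.
    text = data.decode("utf-8", "surrogateescape")
    return text.encode("utf-8", "surrogateescape") == data

failures = 0
for _ in range(1000):
    # Random bytes, minus NUL, to mimic an arbitrary C char array.
    data = os.urandom(64).replace(b"\x00", b"")
    if not roundtrip_ok(data):
        failures += 1

print(failures)  # 0
```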

So I converted the rna api to use these and all was fine. You can
assign invalid unicode values like this:

  bpy.context.object.name = "num\udce9ro"
# rather than "numéro", but they are the same internally.

This works for getting and setting rna, but I wasn't able to use these
strings: print() or writing them to a file would raise errors. So
basically the problem is moved to python.
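
The failure can be reproduced without Blender. The sketch below uses an
in-memory stream as a stand-in for a real text file, and assumes the file
(or terminal) uses strict UTF-8; other encodings fail in a similar way.

```python
import io

name = "num\udce9ro"

# Strict UTF-8 encoding refuses lone surrogates.
try:
    name.encode("utf-8")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# A text stream opened as strict UTF-8 fails the same way.
stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
try:
    stream.write(name)
    stream.flush()
    write_failed = False
except UnicodeEncodeError:
    write_failed = True

print(encode_failed, write_failed)  # True True
```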

The simplest way I could find to print an object named
"num\udce9ro" was this...
  print(somestring.encode("ASCII", "surrogateescape").decode("ASCII", "ignore"))
...even this has the non-ASCII chars stripped, so it's not that useful if
you want a unique value.
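
To illustrate the uniqueness problem, the sketch below compares the
expression above with Python's standard "backslashreplace" handler, which
is one possible alternative and not something proposed in this thread. The
helper names and the second object name are made up for the example.

```python
a = "num\udce9ro"
b = "num\udce8ro"  # a different, equally invalid name

def lossy(s):
    # The expression from above: escaped bytes are dropped entirely.
    return s.encode("ascii", "surrogateescape").decode("ascii", "ignore")

def printable(s):
    # Alternative: keep every non-ASCII char as a visible escape sequence.
    return s.encode("ascii", "backslashreplace").decode("ascii")

print(lossy(a), lossy(b))          # numro numro
print(printable(a), printable(b))  # num\udce9ro num\udce8ro
```

With "backslashreplace" the two names stay distinct, at the cost of also
escaping valid non-ASCII chars such as "é".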

So unless I'm missing something, it seems these strings are such a pain
to deal with in python that it would be better to use byte
arrays - b"EveryStringHasA_b_prefix".
If we allow these strings, script writers would just ignore this
corner case and we'd get bug reports about it every so often.

This led me back to the conclusion that we should enforce utf8 for
all data names.

Nevertheless these annoying strings still have to be taken into
account with paths. An example of the problem: the OBJ exporter can
write to the path, but throws an error when trying to print() it or
write it to a file.
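
One way a script could write such a path into a file is to open the file
with the same error handler, so the original bytes come back out. This is
only a sketch: the path is hypothetical, an in-memory buffer stands in for
the real file, and the exporter scripts may need a different approach.

```python
import io

path = "/tmp/num\udce9ro.obj"  # hypothetical non-utf8 path

buffer = io.BytesIO()  # stand-in for a file opened in binary mode
log = io.TextIOWrapper(buffer, encoding="utf-8",
                       errors="surrogateescape", newline="\n")

# The surrogate is turned back into the original byte 0xE9 on write.
log.write("exported: " + path + "\n")
log.flush()

written = buffer.getvalue()
print(written)  # b'exported: /tmp/num\xe9ro.obj\n'
```

The resulting file is not valid utf8 either, so anything reading it back
needs the same handler.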

The only thing that's left to do is go over the scripts, make sure
they work with non-utf8 paths, and make sure new ID names derived from
paths are stripped of non-utf8 bytes.
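
Stripping a name derived from a path could be sketched like this. The
helper name_from_path is hypothetical and not part of Blender's API; it
assumes the path was decoded with "surrogateescape", so invalid bytes show
up as chars in the range U+DC80..U+DCFF.

```python
import os

def name_from_path(path):
    """Hypothetical helper: derive a utf8-safe name from a file path."""
    base = os.path.splitext(os.path.basename(path))[0]
    # Drop the lone surrogates (U+DC80..U+DCFF) left by surrogateescape.
    return "".join(c for c in base if not "\udc80" <= c <= "\udcff")

print(name_from_path("/tmp/num\udce9ro.obj"))  # numro
```

Valid non-ASCII chars are kept, so a correctly encoded "numéro.obj" still
gives "numéro".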

- Campbell

On Sat, Aug 14, 2010 at 4:30 AM, Roger Wickes <rogerwickes at yahoo.com> wrote:
> I think that if you save "numéro" in your .blend, it does not matter what
> the OS UTF is; when you enter it into like the text editor or a field
> within the Blender UI, it only matters what the str function in Blender
> (that is processing that field) does when it is reading that field and
> saving it.
>
> I suggest that all string functions in Blender use UTF8 encoding,
> and save strings internally as a UTF8 array,
> so that the accent is preserved if you enter it as say, a mesh name.
>
> OS dependency is only relevant when, say, creating a folder or file. For
> that, Blender should use the OS-default encoding as Campbell has said, when
> dealing with filenames and the absolutely idiotic slash/backslash conflict
> we have today. All OS encodings will respect your "numéro" as a
> filename/dirname/username, afaik.
>
>  --Roger
>
> ----- Original Message ----
> From: Elia Sarti <vekoon at gmail.com>
> To: bf-blender developers <bf-committers at blender.org>
> Sent: Fri, August 13, 2010 6:47:01 AM
> Subject: Re: [Bf-committers] Proposal for handling string encoding in blender.
>
> The point is that different systems use different encodings. UTF-8 is
> just one way to encode multibyte characters, UTF-16 is another for
> instance (and there are hundreds of others).
>
> Means if you save "numéro" in your .blend on an OS using utf-8 and
> someone opens it in one using utf-16 then the string is incompatible.
>
> I say +1 to this with an addendum.
> To some extent encoding can be detected and thus converted, would it be
> hard to do so for strings in the .blend? Of course only for a limited
> collection, I'd say utf-8 <-> utf-16 would probably suffice as I believe
> many linux distros use utf-8 while windows and mac use utf-16, so this
> would cover the majority of cases.