Thursday, September 5, 2013

Why is UTF-8 treated as not multibyte?



This is a long discussion, and largely a matter of terminology. “Multibyte” is a slippery term, and not the best one.
Visual Studio has three options for the character set:

a) Not set, which means it works OK with single-byte character sets (SBCS) like CP1251 (ru-RU) or CP1252 (en-US);
characters take 1 byte

b) MBCS, which means it works OK with multibyte character sets like CP936;
characters take 1 or 2 bytes, and the GUI accepts such characters if the appropriate locale is selected in Control Panel

c) Unicode, which means working with UTF-16LE;
characters take 2 bytes (4 for characters outside the Basic Multilingual Plane), and the selected locale doesn’t matter
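
To make the byte counts in the list above concrete, here is a small sketch in Python rather than Visual Studio C++. It assumes that Python's codec names (cp1251, cp1252, cp936, utf-16-le) are acceptable stand-ins for the Windows code pages; the sample characters and the helper name byte_width are my own choices for illustration.

```python
# Byte widths of sample characters under the encodings discussed above.
# Assumption: Python codec names stand in for the Windows code pages.
SAMPLES = {
    "cp1251": "Ж",      # SBCS, Cyrillic
    "cp1252": "é",      # SBCS, Western European
    "cp936": "中",      # MBCS, Simplified Chinese
    "utf-16-le": "中",  # the encoding Windows calls "Unicode"
    "utf-8": "中",      # not offered as a project option
}


def byte_width(ch: str, codec: str) -> int:
    """Return how many bytes one character occupies in the given codec."""
    return len(ch.encode(codec))


for codec, ch in SAMPLES.items():
    print(f"{codec:10} {ch!r}: {byte_width(ch, codec)} byte(s)")

# ASCII stays at one byte in CP936, which is why MBCS means "1 or 2 bytes".
print("cp936 'A':", byte_width("A", "cp936"), "byte(s)")
```

The same character can take a different number of bytes in each scheme, which is part of why the word “multibyte” alone does not say which encoding is meant.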

Note that there is no option to work with UTF-8.
There are conversion functions between UTF-8 and MBCS (on Windows, MultiByteToWideChar and WideCharToMultiByte, going through UTF-16).
In Microsoft documentation, the term “multibyte” refers to MBCS. It was hard for me to grasp, and I suppose there could be a misunderstanding among the team regarding this term.
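
As a rough illustration of such a conversion, the sketch below transcodes in two steps: bytes to Unicode text, then text to the target encoding. It is written in Python, not against the Windows API, and the function names utf8_to_mbcs and mbcs_to_utf8 are hypothetical. CP936 is assumed as the MBCS code page.

```python
def utf8_to_mbcs(data: bytes, code_page: str = "cp936") -> bytes:
    """Transcode UTF-8 bytes to an MBCS code page by way of Unicode text."""
    text = data.decode("utf-8")    # UTF-8 bytes -> Unicode text
    return text.encode(code_page)  # Unicode text -> code-page bytes


def mbcs_to_utf8(data: bytes, code_page: str = "cp936") -> bytes:
    """Transcode MBCS code-page bytes to UTF-8 by way of Unicode text."""
    text = data.decode(code_page)  # code-page bytes -> Unicode text
    return text.encode("utf-8")    # Unicode text -> UTF-8 bytes


utf8_bytes = "中文".encode("utf-8")
mbcs_bytes = utf8_to_mbcs(utf8_bytes)

print("UTF-8:", utf8_bytes.hex(" "))
print("CP936:", mbcs_bytes.hex(" "))

# The round trip is lossless as long as the code page can represent the text.
assert mbcs_to_utf8(mbcs_bytes) == utf8_bytes
```

A character that the code page cannot represent makes the encode step raise UnicodeEncodeError, so the conversion is only safe for text the chosen code page actually covers.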

Even though it is coded in a similar way and uses a variable number of bytes per character, UTF-8 is a way of encoding Unicode characters; it is not related to MBCS at all.
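
One way to see the difference: UTF-8 bytes are computed from the Unicode code point by bit arithmetic, while a code page such as CP936 is a lookup table of its own. The sketch below is a minimal hand-written UTF-8 encoder in Python, checked against the built-in codec. The name encode_utf8 is hypothetical, and surrogate code points are not rejected here, which a real encoder should do.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using only bit arithmetic.

    Simplification: surrogates (U+D800..U+DFFF) are not rejected.
    """
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The hand-written encoder agrees with Python's own UTF-8 codec.
for ch in ["A", "Ж", "中", "😀"]:
    assert encode_utf8(ord(ch)) == ch.encode("utf-8")

# Reading UTF-8 bytes as CP936 does not give the original character back.
misread = encode_utf8(ord("中")).decode("cp936", errors="replace")
print(repr(misread))
```

The last two lines show the practical consequence: the bytes are valid in both worlds, but they only mean “中” when read as UTF-8.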

In order to have our virare and maina (it.), we’ve agreed that “multibyte” refers to MBCS only.
