Wednesday, August 7, 2013

MultiByte, UTF-8 and Chinese Character Set



These notes are the result of a long discussion and some research into MultiByte encoding with Olga L. this morning. Keep the following in mind when processing strings for Chinese localization:
  • Multibyte is used in HyperLynx for localized strings.
  • Multibyte is not related to wide chars (wchar_t, UTF-16) at all, even though a Chinese character takes 2 bytes in Multibyte as well.
  • Multibyte is not related to UTF-8.
  • In the Visual Studio debugger you always see Multibyte characters when the Chinese Simplified locale is selected in Control Panel.
  • Multibyte (MBCS, DBCS) is the same as CodePage 936 or GB2312 when the Chinese Simplified locale is selected in Control Panel.
  • gettext's _(“Two beer or not to be”) returns a Multibyte string.
  • “tchar.h” routines like _tcsclen, _tcsncpy, etc. deal with Multibyte strings (when the project is built with _MBCS defined).
  • .po files are written in UTF-8 and converted to Multibyte on loading
  • .rc resource files are written in Win1251
  • Chinese .zh-CN.rc resources are written in CP936
  • Some parts of the MFC Windows GUI accept Multibyte strings; others accept only ANSI or wchar_t *.
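
The points above about byte counts and the UTF-8 to Multibyte conversion can be checked outside of Visual Studio. Below is a minimal sketch in Python, used here only as a convenient way to inspect bytes; it is not how HyperLynx or gettext do the conversion. The sample string and the helper count_cp936_chars are made up for illustration, and the helper assumes the usual CP936 rule that bytes 0x81..0xFE start a two-byte character.

```python
# Hypothetical sample string: ASCII text followed by the character U+963F.
text = "HyperLynx \u963f"

# A .po file stores the string as UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# Conversion on loading: decode the UTF-8 bytes, then re-encode as CP936.
mb_bytes = utf8_bytes.decode("utf-8").encode("cp936")


def count_cp936_chars(data: bytes) -> int:
    """Count characters in a CP936 byte string.

    Bytes 0x81..0xFE are treated as lead bytes that start a two-byte
    character; any other byte is treated as a single-byte character.
    """
    count = 0
    i = 0
    while i < len(data):
        i += 2 if 0x81 <= data[i] <= 0xFE else 1
        count += 1
    return count


print(len(utf8_bytes))              # number of bytes in the UTF-8 form
print(len(mb_bytes))                # number of bytes in the CP936 form
print(count_cp936_chars(mb_bytes))  # number of characters in the CP936 form
```

The same text has a different byte length in each encoding, and in the Multibyte form the character count (11) is smaller than the byte count (12). This is why byte-oriented routines and character-oriented routines give different answers on localized strings.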
Use this site to understand the different encodings better. Note that on the Chinese locale we deal with the CP936 encoding: http://www.kreativekorp.com/charset/encoding.php


For example, take the character U+963F:
  • Appearance: 阿
  • Unicode Block: CJK Unified Ideographs (4E00-9FFF)
  • Unicode Code Point, Decimal: 38463
  • Unicode Code Point, Hexadecimal: U+963F
  • HTML Character Entity, Decimal: &#38463;
  • HTML Character Entity, Hexadecimal: &#x963F;
  • Keystroke, Windows: Alt+038463
  • Keystroke, Macintosh, Unicode Hex Input: Option-963F
  • CP936 Encoding: B0 A2
  • UTF-8 Encoding: E9 98 BF
  • UTF-16BE Encoding: 96 3F
  • UTF-16LE Encoding: 3F 96
  • UTF-32BE Encoding: 00 00 96 3F
  • UTF-32LE Encoding: 3F 96 00 00
http://www.kreativekorp.com/charset/encoding.php?file=cp936.kte&char=B0A2
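
The byte sequences listed above can be reproduced with a short script. This is a minimal sketch in Python, assuming its built-in cp936 codec matches the Windows code page for this character; the names hex_bytes and table are illustrative only.

```python
# The character from the example above.
ch = "\u963f"

# Codec names as Python spells them.
encodings = ["cp936", "utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


# Map each encoding name to the hex dump of the encoded character.
table = {name: hex_bytes(ch.encode(name)) for name in encodings}

print(f"code point: U+{ord(ch):04X} (decimal {ord(ch)})")
for name, value in table.items():
    print(f"{name:10} {value}")
```

Swapping in another character or codec name is an easy way to see how the same code point maps to unrelated byte sequences in CP936, UTF-8 and UTF-16.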
