  • string wstring

    Excerpted from: Stack Overflow

    string? wstring?

    std::string is a basic_string templated on a char, and std::wstring on a wchar_t.

    char vs. wchar_t

    char is supposed to hold a character, usually a 1-byte character. wchar_t is supposed to hold a wide character, and then things get tricky: on Linux, a wchar_t is 4 bytes, while on Windows it is 2 bytes.
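
    These definitions can be checked at compile time. The following is a minimal sketch using only standard headers; code_unit_size is a hypothetical helper introduced here for illustration, not a standard function.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// std::string and std::wstring are aliases for two instantiations
// of the same class template, std::basic_string.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
              "std::wstring is basic_string<wchar_t>");

// Hypothetical helper: size in bytes of one element (code unit)
// of any basic_string instantiation.
template <typename String>
constexpr std::size_t code_unit_size() {
    return sizeof(typename String::value_type);
}
```

    code_unit_size<std::string>() is always 1, while code_unit_size<std::wstring>() equals sizeof(wchar_t), which is the value that differs between Linux and Windows.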

    what about Unicode, then?

    The problem is that neither char nor wchar_t is directly tied to Unicode.

    On Linux?

    Let's take a Linux OS: my Ubuntu system is already Unicode-aware. When I work with a char string, it is natively encoded in UTF-8 (i.e. a Unicode string stored as chars). The following code:

    #include <cstring>
    #include <cwchar>   // wcslen
    #include <iostream>

    int main(int argc, char* argv[])
    {
       const char text[] = "olé";
       const wchar_t wtext[] = L"olé";

       std::cout << "sizeof(char)    : " << sizeof(char) << std::endl;
       std::cout << "text            : " << text << std::endl;
       std::cout << "sizeof(text)    : " << sizeof(text) << std::endl;
       std::cout << "strlen(text)    : " << strlen(text) << std::endl;

       // Print each byte of the char string as a number.
       std::cout << "text(binary)    :";
       for (size_t i = 0, iMax = strlen(text); i < iMax; ++i)
       {
          std::cout << " " << static_cast<unsigned int>(static_cast<unsigned char>(text[i]));
       }

       std::cout << std::endl << std::endl;

       std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl;
       // std::cout << "wtext           : " << wtext << std::endl; <- error
       std::cout << "wtext           : UNABLE TO CONVERT NATIVELY." << std::endl;
       // Mixing std::cout and std::wcout on the same stdout is unreliable;
       // this line may print nothing or garbled text.
       std::wcout << L"wtext          : " << wtext << std::endl;

       std::cout << "sizeof(wtext)   : " << sizeof(wtext) << std::endl;
       std::cout << "wcslen(wtext)   : " << wcslen(wtext) << std::endl;

       // Print each wide character as a number.
       std::cout << "wtext(binary)   :";
       for (size_t i = 0, iMax = wcslen(wtext); i < iMax; ++i)
       {
          std::cout << " " << static_cast<unsigned int>(static_cast<unsigned short>(wtext[i]));
       }

       std::cout << std::endl << std::endl;
       return 0;
    }

    outputs the following text:

    sizeof(char)    : 1
    text            : olé
    sizeof(text)    : 5
    strlen(text)    : 4
    text(binary)    : 111 108 195 169

    sizeof(wchar_t) : 4
    wtext           : UNABLE TO CONVERT NATIVELY.
    sizeof(wtext)   : 16
    wcslen(wtext)   : 3
    wtext(binary)   : 111 108 233

    You'll see that the "olé" text in char is really made of four chars: 111, 108, 195 and 169 (not counting the trailing zero). (I'll let you study the wchar_t code as an exercise.)

    So, when working with a char on Linux, you will usually end up using Unicode without even knowing it. And since std::string works with char, std::string is already Unicode-ready.

    Note that std::string, like the C string API, will consider the "olé" string to have 4 characters, not 3. So be cautious when truncating or otherwise manipulating Unicode chars, because some byte sequences are invalid in UTF-8 (for example, a lead byte cut off from its continuation bytes).
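
    One way to stay cautious is to count and cut on code point boundaries rather than on raw chars. The sketch below assumes the input is already valid UTF-8; utf8_length and utf8_truncate are hypothetical helper names, not library functions, and they work on code points, not on user-perceived glyphs (combining sequences are not handled).

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation_byte(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Counts code points in a string assumed to contain valid UTF-8:
// every byte that is not a continuation byte starts a new code point.
std::size_t utf8_length(const std::string& text) {
    std::size_t count = 0;
    for (unsigned char byte : text) {
        if (!is_continuation_byte(byte)) ++count;
    }
    return count;
}

// Truncates to at most max_bytes without cutting a multi-byte sequence.
std::string utf8_truncate(const std::string& text, std::size_t max_bytes) {
    if (text.size() <= max_bytes) return text;
    std::size_t end = max_bytes;
    // Back up while the first dropped byte is a continuation byte.
    while (end > 0 && is_continuation_byte(static_cast<unsigned char>(text[end]))) {
        --end;
    }
    return text.substr(0, end);
}
```

    For the UTF-8 bytes of "olé" (111, 108, 195, 169), size() is 4 but utf8_length is 3, and truncating to 3 bytes yields "ol" instead of a string ending in a dangling 195.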

    On Windows?

    On Windows, this is a bit different. Win32 had to support a lot of applications working with char and with the different charsets/codepages produced all over the world, before the advent of Unicode.

    So their solution was an interesting one: if an application works with char, then the char strings are encoded/printed/shown on GUI labels using the local charset/codepage on the machine. For example, "olé" would be "olé" on a French-localized Windows, but would be something different on a Cyrillic-localized Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the same old way.
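
    The "olé" example comes down to one byte, 0xE9, being looked up in two different tables. A minimal sketch, assuming only the parts of each code page that map contiguously onto Unicode (0xA0..0xFF for Windows-1252, 0xC0..0xFF for Windows-1251); the function names are made up for illustration, and this is not a claim about how Windows itself implements the lookup.

```cpp
#include <cassert>

// Windows-1252: bytes 0xA0..0xFF map to the same code points as Latin-1.
char32_t decode_cp1252_high(unsigned char byte) {
    assert(byte >= 0xA0);  // precondition of this partial decoder
    return byte;
}

// Windows-1251: bytes 0xC0..0xFF map to the Cyrillic letters U+0410..U+044F.
char32_t decode_cp1251_letter(unsigned char byte) {
    assert(byte >= 0xC0);  // precondition of this partial decoder
    return 0x0410 + (byte - 0xC0);
}
```

    With byte 0xE9, the first function gives U+00E9 (é) and the second gives U+0439 (й): same char, different glyph.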

    For Unicode-based applications, Windows uses wchar_t, which is 2 bytes wide, and is encoded in UTF-16, which is Unicode encoded in 2-byte units, with characters outside the Basic Multilingual Plane taking two units (or at the very least, the mostly compatible UCS-2, which is almost the same thing IIRC).
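
    UTF-16 stores most characters in a single 2-byte unit and the rest in a pair of units (a surrogate pair). A minimal sketch of that encoding step follows. It uses the portable char16_t and std::u16string instead of wchar_t so that it behaves the same on Linux and Windows; append_utf16 is a hypothetical helper, and it assumes its input is a valid Unicode scalar value.

```cpp
#include <string>

// Appends one Unicode code point to a UTF-16 string.
// Assumes code_point is a valid scalar value (<= 0x10FFFF, not a surrogate).
void append_utf16(std::u16string& out, char32_t code_point) {
    if (code_point < 0x10000) {
        // Basic Multilingual Plane: one 2-byte unit.
        out.push_back(static_cast<char16_t>(code_point));
    } else {
        // Supplementary planes: split the 20-bit offset into two units.
        const char32_t offset = code_point - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));  // low surrogate
    }
}
```

    U+00E9 (é) becomes the single unit 0x00E9, while U+1F600 becomes the pair 0xD83D 0xDE00.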

    Applications using char are said to be "multibyte" (because each glyph is composed of one or more chars), while applications using wchar_t are said to be "widechar" (because each glyph is composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte Win32 conversion APIs for more info.

    Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework hiding that, like GTK+ or Qt...). The fact is that behind the scenes, Windows works with wchar_t strings, so even historical applications will have their char strings converted to wchar_t when using APIs like SetWindowText (a low-level API function to set the label on a Win32 GUI).

    Memory issues?

    UTF-32 is 4 bytes per character, so there is not much to add, except that a UTF-8 text and a UTF-16 text will always use less or the same amount of memory as a UTF-32 text (and usually less).

    If there is a memory issue, then you should know that for most Western languages, UTF-8 text will use less memory than the same UTF-16 one.

    Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same or larger for UTF-8 than for UTF-16.

    All in all, UTF-16 will mostly use 2 bytes per character (unless you're dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend from 1 to 4 bytes.

    See http://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.
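
    The comparison above can be made concrete by counting the bytes each encoding form needs per code point. A minimal sketch; EncodedSizes and encoded_sizes are hypothetical names, the input is assumed to contain only valid Unicode scalar values, and terminators are not counted.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to store a text in each Unicode encoding form.
struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Sums the per-code-point cost of each encoding form.
EncodedSizes encoded_sizes(const std::u32string& text) {
    EncodedSizes sizes{0, 0, 0};
    for (char32_t cp : text) {
        sizes.utf8  += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        sizes.utf16 += cp < 0x10000 ? 2 : 4;
        sizes.utf32 += 4;
    }
    return sizes;
}
```

    For "olé" this gives 4, 6 and 12 bytes (UTF-8, UTF-16, UTF-32); for the three-character Japanese word "日本語" it gives 9, 6 and 12.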

    Conclusion

    1. When should I use std::wstring over std::string?

    On Linux? Almost never (§).
    On Windows? Almost always (§).
    On cross-platform code? Depends on your toolkit...

    (§) : unless you use a toolkit/framework saying otherwise

    2. Can std::string hold all the ASCII character set including special characters?

    Notice: A std::string is suitable for holding a 'binary' buffer, where a std::wstring is not!

    On Linux? Yes.
    On Windows? Only special characters available for the current locale of the Windows user.

    Edit (after a comment from Johann Gerell): a std::string will be enough to handle all char-based strings (each char being a number from 0 to 255). But:

    1. ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
    2. A char from 0 to 127 will be held correctly.
    3. A char from 128 to 255 will have a meaning that depends on your encoding (Unicode, non-Unicode, etc.), but a std::string will be able to hold all Unicode glyphs as long as they are encoded in UTF-8.
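
    The three points above can be turned into a simple check. A minimal sketch; is_ascii is a hypothetical helper, not a standard function.

```cpp
#include <algorithm>
#include <string>

// True if every byte of the string is in the ASCII range 0..127.
bool is_ascii(const std::string& text) {
    return std::all_of(text.begin(), text.end(), [](char c) {
        // Cast first: plain char may be signed, so bytes >= 128 can be negative.
        return static_cast<unsigned char>(c) < 0x80;
    });
}
```

    is_ascii("ole") is true, while the UTF-8 bytes of "olé" fail the check because 195 and 169 are above 127. A buffer with an embedded zero, such as std::string("a\0b", 3), keeps all 3 bytes, which is why std::string also works as a 'binary' buffer.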

    3. Is std::wstring supported by almost all popular C++ compilers?

    Mostly, with the exception of GCC-based compilers that are ported to Windows.
    It works on my g++ 4.3.2 (under Linux), and I have used the Unicode API on Win32 since Visual C++ 6.

    4. What is exactly a wide character?

    In C/C++, it's a character type written wchar_t, which is larger than the simple char character type. It is supposed to hold characters whose indices (like Unicode glyphs) are larger than 255 (or 127, depending...).

  • Original article: https://www.cnblogs.com/coolbear/p/3096406.html