Making wchar_t work on Linux, OS X and Windows for CMarkup release 10.1, I learned a couple of humbling lessons, and I expect I'll be posting more here as I get feedback. To me the term wchar_t string is the same as C++ wide string, C++ wide char, C++ wchar, C++ wide character string, etc., which all come down to an array of wchar_t. The STL std::wstring class based on wchar_t characters is the wide version of the std::string class based on char characters.
Why wchar?
Using a wchar_t string (and STL std::wstring) on POSIX (Linux and OS X) has few advantages, if any, since nowadays a regular char string is in Unicode UTF-8 by default, including, I assume, in most system functions, file paths, and programming interfaces. Using wide strings therefore means an extra layer of UTF-8 to UTF-32 conversion on many operations. Nevertheless, I went ahead and implemented and tested wide char "MARKUP_WCHAR" support in CMarkup since a) it was already there for Windows UNICODE builds, and b) a customer expressed interest in doing a wide char build for Mac.
Note that the gcc 3.4.4 "cygming" compiler that comes with cygwin 1.5.25-15 doesn't seem to have std::wstring or even wprintf, though it does have wchar_t. Since CMarkup requires a wchar_t based string class, a wide char build is not supported there.
Compiling for wide char vs char
I took my cue from the VC++ _T macros such as _tcscpy, which switch based on the character set selected for the build. With CMarkup, you define MARKUP_WCHAR (or UNICODE) to compile for wide strings; otherwise it compiles for char strings. A set of macros is defined accordingly with the wide or non-wide versions of functions and types. Here are examples of the defines for character, constant character pointer and string copy, which differ based on MARKUP_WCHAR:
    #if defined(MARKUP_WCHAR)
    #define MCD_CHAR wchar_t
    #define MCD_PCSZ const wchar_t*
    #define MCD_PSZCPY wcscpy
    ... other wide functions
    #else // not MARKUP_WCHAR
    #define MCD_CHAR char
    #define MCD_PCSZ const char*
    #define MCD_PSZCPY strcpy
    ... other non-wide functions
    #endif
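As a rough illustration of how code written against such macros compiles either way, here is a minimal sketch. Only MCD_CHAR, MCD_PCSZ and MCD_PSZCPY come from the defines above; MCD_PSZLEN, MY_T, MyString and CopyViaBuffer are names assumed for this example and are not necessarily what CMarkup uses.

```cpp
#include <cassert>
#include <string.h>
#include <wchar.h>
#include <string>

// Stand-ins modeled on the defines above (not CMarkup's actual header).
#if defined(MARKUP_WCHAR)
#define MCD_CHAR wchar_t
#define MCD_PCSZ const wchar_t*
#define MCD_PSZCPY wcscpy
#define MCD_PSZLEN wcslen   // assumed name for this sketch
#define MY_T(s) L##s        // assumed literal macro for this sketch
#else // not MARKUP_WCHAR
#define MCD_CHAR char
#define MCD_PCSZ const char*
#define MCD_PSZCPY strcpy
#define MCD_PSZLEN strlen   // assumed name for this sketch
#define MY_T(s) s           // assumed literal macro for this sketch
#endif

// String type that follows the build's character type.
typedef std::basic_string<MCD_CHAR> MyString;

// Copy a short string through a fixed buffer using only the macros,
// so the same source compiles for char and for wchar_t.
MyString CopyViaBuffer(MCD_PCSZ src)
{
    MCD_CHAR buf[64];
    assert(MCD_PSZLEN(src) < sizeof(buf) / sizeof(buf[0]));
    MCD_PSZCPY(buf, src);
    return MyString(buf);
}
```

Calling CopyViaBuffer(MY_T("doc")) yields a std::string in a char build and a std::wstring when MARKUP_WCHAR is defined, without touching the calling code.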
sizeof wchar_t
Unlike Windows, where wide chars are 2-byte UTF-16, wchar_t on Linux and OS X is 4 bytes, UTF-32 (gcc/g++ and Xcode). On cygwin it is 2 bytes (cygwin uses Windows APIs).
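One quick way to see the difference is to count how many wchar_t units a character outside the Basic Multilingual Plane takes up. This is only a sketch; the function name UnitsForNonBmpChar is made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count the wchar_t units needed to store U+1F600, a character outside
// the Basic Multilingual Plane. With UTF-32 wchar_t it is one unit;
// with UTF-16 wchar_t it is a surrogate pair, so two units.
std::size_t UnitsForNonBmpChar()
{
    const std::wstring s = L"\U0001F600";
    assert(s.size() == 1 || s.size() == 2);
    return s.size();
}
```

The result follows sizeof(wchar_t): one unit where it is 4 bytes and two units where it is 2 bytes.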
At first I used runtime if statements like if ( sizeof(wchar_t) == 4 ), but aside from being bad style, that led to compiler warnings in the code meant for the other size of wchar_t. I wanted a way to determine the size of wchar_t automatically at compile time based on predefined macros (you can list the g++ predefined macros by running the command cpp -dM and pressing Ctrl+D). I settled on using __SIZEOF_WCHAR_T__ or, even better, __WCHAR_MAX__, which gcc provides on Linux, OS X, and cygwin.
    #if ! defined(MARKUP_SIZEOFWCHAR)
    #if __SIZEOF_WCHAR_T__ == 4 || __WCHAR_MAX__ > 0x10000
    #define MARKUP_SIZEOFWCHAR 4
    #else
    #define MARKUP_SIZEOFWCHAR 2
    #endif
    #endif
I left the option of setting it explicitly by defining MARKUP_SIZEOFWCHAR in case the predefined macros aren't available.
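If you do set MARKUP_SIZEOFWCHAR by hand, a compile-time check can catch a wrong value. The sketch below is my own addition rather than something in CMarkup; it assumes a C++11 compiler for static_assert, and MarkupSizeofWchar is a made-up helper name.

```cpp
#include <cassert>

// Same detection as above; an explicit -DMARKUP_SIZEOFWCHAR=... wins.
#if ! defined(MARKUP_SIZEOFWCHAR)
#if __SIZEOF_WCHAR_T__ == 4 || __WCHAR_MAX__ > 0x10000
#define MARKUP_SIZEOFWCHAR 4
#else
#define MARKUP_SIZEOFWCHAR 2
#endif
#endif

// Stop the build if the macro disagrees with the compiler (needs C++11).
static_assert(sizeof(wchar_t) == MARKUP_SIZEOFWCHAR,
              "MARKUP_SIZEOFWCHAR does not match sizeof(wchar_t)");

// Expose the detected value so it can be inspected at runtime.
int MarkupSizeofWchar()
{
    return MARKUP_SIZEOFWCHAR;
}
```

With this in place, a mismatch such as defining MARKUP_SIZEOFWCHAR as 2 on Linux fails at compile time instead of silently producing wrong conversions.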
Of course, everywhere you do conversions to and from wchar_t strings, you have to be aware of whether they are UTF-16 or UTF-32. So I differentiate as follows:
    #if MARKUP_SIZEOFWCHAR == 4 // sizeof(wchar_t) == 4
    ... treat wchar_t string as UTF-32
    #else // sizeof(wchar_t) == 2
    ... treat wchar_t string as UTF-16
    #endif
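To make the two branches concrete, here is a sketch of appending one Unicode code point to a wide string. It is not CMarkup's actual conversion code; AppendCodePoint is a name invented for this example, and it assumes the caller passes a valid code point.

```cpp
#include <cassert>
#include <string>

#if ! defined(MARKUP_SIZEOFWCHAR)
#if __SIZEOF_WCHAR_T__ == 4 || __WCHAR_MAX__ > 0x10000
#define MARKUP_SIZEOFWCHAR 4
#else
#define MARKUP_SIZEOFWCHAR 2
#endif
#endif

// Append one code point to a wide string.
// Assumes cp is valid: at most 0x10FFFF and not itself a surrogate.
void AppendCodePoint(std::wstring& str, unsigned long cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
#if MARKUP_SIZEOFWCHAR == 4 // sizeof(wchar_t) == 4
    // UTF-32: every code point is exactly one wchar_t.
    str += static_cast<wchar_t>(cp);
#else // sizeof(wchar_t) == 2
    // UTF-16: code points above 0xFFFF become a surrogate pair.
    if (cp < 0x10000)
    {
        str += static_cast<wchar_t>(cp);
    }
    else
    {
        cp -= 0x10000;
        str += static_cast<wchar_t>(0xD800 + (cp >> 10));
        str += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
    }
#endif
}
```

Appending U+0041 and then U+1F600 gives a string of length 2 in the UTF-32 case and length 3 in the UTF-16 case.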
sprintf wchar_t with "%ls"
In VC++, you can use "%s" in the format string of swprintf (or wprintf, fwprintf) to insert a wide string. But on POSIX you have to use "%ls". This may be compiler dependent rather than operating system dependent.
type | sprintf on Windows | sprintf on POSIX | swprintf on Windows | swprintf on POSIX |
---|---|---|---|---|
ls or lS | wchar_t | wchar_t | wchar_t | wchar_t |
s | char | char | wchar_t | char |
S | wchar_t | char | char | char |
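The one row that is the same everywhere is "%ls", so that is the portable way to insert a wide string. Here is a small sketch of it in both the narrow and the wide function; NarrowFormat and WideFormat are names made up for this example, and the narrow case assumes text that the current locale can convert (plain ASCII is safe).

```cpp
#include <cassert>
#include <stdio.h>
#include <wchar.h>
#include <string>

// Insert a wide string into a char buffer with "%ls" in snprintf.
// The wide characters are converted using the current locale, so
// non-ASCII text can fail in the default "C" locale.
std::string NarrowFormat(const wchar_t* wide)
{
    assert(wide != 0);
    char buf[64];
    int n = snprintf(buf, sizeof(buf), "[%ls]", wide);
    if (n < 0)
        return std::string(); // conversion or output error
    return std::string(buf);
}

// Insert a wide string into a wide buffer with "%ls" in swprintf.
std::wstring WideFormat(const wchar_t* wide)
{
    assert(wide != 0);
    wchar_t buf[64];
    int n = swprintf(buf, sizeof(buf) / sizeof(buf[0]), L"[%ls]", wide);
    if (n < 0)
        return std::wstring(); // error, or result did not fit
    return std::wstring(buf);
}
```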
The only way to switch seamlessly between sprintf and its wide char version swprintf on POSIX would be to use a macro in the middle of your format string. I was able to concatenate strings instead and avoid the whole issue of swprintf for strings.
Note also that gcc uses a safe form of swprintf with an extra argument to specify the length of the receiving buffer (VC++ 2005 and up has the safe string version swprintf_s). I was also confused when I accidentally googled wsprintf (first two letters swapped), which appears to be a version of this function that exists only on Windows.
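For anyone who does want to go the macro route, this is roughly what a macro in the middle of the format string could look like, together with the buffer length argument. It is a sketch for gcc on POSIX, not what CMarkup does (I concatenated strings instead); MY_T, MY_FMT_STR, MY_SNPRINTF, MyChar and FormatName are all names invented for the example.

```cpp
#include <cassert>
#include <stdio.h>
#include <wchar.h>
#include <string>

#if defined(MARKUP_WCHAR)
#define MY_T(s) L##s
#define MY_FMT_STR L"%ls"     // wide string argument in a wide format
#define MY_SNPRINTF swprintf  // takes the buffer length as 2nd argument
typedef wchar_t MyChar;
#else // not MARKUP_WCHAR
#define MY_T(s) s
#define MY_FMT_STR "%s"       // char string argument in a char format
#define MY_SNPRINTF snprintf  // takes the buffer length as 2nd argument
typedef char MyChar;
#endif

// Format "name=<value>;" for either character type. The adjacent
// string literals are joined by the compiler into one format string.
std::basic_string<MyChar> FormatName(const MyChar* name)
{
    assert(name != 0);
    MyChar buf[64];
    int n = MY_SNPRINTF(buf, sizeof(buf) / sizeof(buf[0]),
                        MY_T("name=") MY_FMT_STR MY_T(";"), name);
    if (n < 0)
        return std::basic_string<MyChar>(); // error or (wide) truncation
    return std::basic_string<MyChar>(buf);
}
```

Note the different truncation behavior: swprintf reports an error when the result does not fit, while snprintf truncates and returns the length it would have written.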
no wide filenames on POSIX
There is no wide fopen on POSIX like _wfopen on Windows (the same goes for open and stat). Filenames, whether received from system APIs or composed by your program, should be kept in "filesystem representation" (UTF-8), and you should avoid doing encoding conversions on pathnames because differences between Unicode decomposition implementations could subtly modify the pathname.
Therefore I had to implement special filename macros so that filenames are passed to the CMarkup functions without wide strings, even in a wide char build.
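The idea can be sketched like this: the filename character type is chosen separately from the document character type, so that a wide build uses wide filenames only on Windows. The macro names below (MY_FILENAME_CHAR, MY_T_FILENAME, MY_FOPEN) and the FileExists helper are hypothetical and are not the actual CMarkup macros.

```cpp
#include <cassert>
#include <stdio.h>
#include <wchar.h>
#include <string>

// Filenames are wide only when both a wide build and Windows apply.
#if defined(MARKUP_WCHAR) && defined(_WIN32)
#define MY_FILENAME_CHAR wchar_t
#define MY_T_FILENAME(s) L##s
#define MY_FOPEN _wfopen
#else // POSIX, or a char build on Windows
#define MY_FILENAME_CHAR char
#define MY_T_FILENAME(s) s
#define MY_FOPEN fopen
#endif

typedef std::basic_string<MY_FILENAME_CHAR> MyFilename;

// Try to open the file for reading; the path is passed through to the
// system untouched, with no encoding conversion.
bool FileExists(const MyFilename& path)
{
    assert(!path.empty());
    FILE* fp = MY_FOPEN(path.c_str(), MY_T_FILENAME("rb"));
    if (!fp)
        return false;
    fclose(fp);
    return true;
}
```

On Linux and OS X the call FileExists(MY_T_FILENAME("test.xml")) takes a plain char (UTF-8) path whether or not MARKUP_WCHAR is defined.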
iconv on OS X doesn't support "WCHAR_T"
This is more an issue of using iconv across different platforms and configurations, but I found that although iconv_open did not complain about "WCHAR_T" on OS X, it did not convert properly. So I switched to explicitly using "UTF-32" or "UTF-16" depending on MARKUP_SIZEOFWCHAR. I can't say I understand all of the iconv vs libiconv issues, but the way I used iconv with g++ was to link with the -liconv flag.
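Here is a sketch of a UTF-8 to wchar_t conversion that names the encoding explicitly instead of using "WCHAR_T". It is not the CMarkup code. The "LE"/"BE" suffix is an assumption I added for this example so that iconv writes in the machine's byte order without a byte order mark; Utf8ToWide and WcharEncodingName are made-up names, and on some platforms you need to link with -liconv.

```cpp
#include <cassert>
#include <stddef.h>
#include <string>
#include <iconv.h>

#if ! defined(MARKUP_SIZEOFWCHAR)
#if __SIZEOF_WCHAR_T__ == 4 || __WCHAR_MAX__ > 0x10000
#define MARKUP_SIZEOFWCHAR 4
#else
#define MARKUP_SIZEOFWCHAR 2
#endif
#endif

// Explicit encoding name for wchar_t in this machine's byte order.
static const char* WcharEncodingName()
{
    const unsigned short probe = 1;
    const bool little =
        *reinterpret_cast<const unsigned char*>(&probe) == 1;
#if MARKUP_SIZEOFWCHAR == 4
    return little ? "UTF-32LE" : "UTF-32BE";
#else
    return little ? "UTF-16LE" : "UTF-16BE";
#endif
}

// Convert UTF-8 to a wide string; returns an empty string on failure.
std::wstring Utf8ToWide(const std::string& utf8)
{
    // The result never needs more wchar_t units than UTF-8 bytes.
    std::wstring wide(utf8.size(), L'\0');
    iconv_t cd = iconv_open(WcharEncodingName(), "UTF-8");
    if (cd == (iconv_t)-1)
        return std::wstring();
    char* in = const_cast<char*>(utf8.data());
    size_t inLeft = utf8.size();
    char* out = reinterpret_cast<char*>(&wide[0]);
    size_t outLeft = wide.size() * sizeof(wchar_t);
    size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return std::wstring();
    assert(outLeft % sizeof(wchar_t) == 0);
    wide.resize(wide.size() - outLeft / sizeof(wchar_t));
    return wide;
}
```

The const_cast is there because some iconv declarations take char** and others const char** for the input buffer; the input is not modified.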
http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm