-
May 22nd, 2008, 05:55 PM
#1
A guideline for coding with strings in Windows
Here's what you can do to write programs that port cleanly between ANSI and UNICODE. Many questions on this forum come up because folks have trouble understanding the difference between a char* and a wchar_t* string, how to write code that compiles cleanly under either ANSI (MBCS) or UNICODE, and how to convert between the two.
The following are some guidelines that I've found useful:
1) Understand that C++ projects created with VC++ 2005 and above default to UNICODE, instead of ANSI (MBCS).
2) Don't write code that hardcodes strings with just quotes (i.e., " ")
Code:
"my string literal that isn't portable between ansi and unicode"
3) Do use the _T("") macro for string literals. The following will happily switch between ANSI (mbcs) and UNICODE:
Code:
_T("my string literal that is portable between ansi and unicode")
4) Don't use char* in your code. If you need a pointer to a string, use LPTSTR (TCHAR*) or LPCTSTR (const TCHAR*) instead; this makes it 'character' portable.
5) Don't declare buffers as
Code:
char buffer[MAX_PATH];
6) Do declare string buffers as
Code:
#include <tchar.h>
TCHAR buffer[MAX_PATH];
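Putting 3) through 6) together, here's a minimal sketch (ShowModulePath is just a made-up helper) that builds the same way under ANSI and UNICODE:
Code:
#include <windows.h>
#include <tchar.h>

// Hypothetical helper: fetches and shows the current module path without ever
// committing to char or wchar_t.
void ShowModulePath()
{
    TCHAR buffer[MAX_PATH];                               // guideline 6: TCHAR, not char
    ::GetModuleFileName(NULL, buffer, MAX_PATH);          // resolves to ...A or ...W at build time
    LPCTSTR path = buffer;                                // guideline 4: character-portable pointer
    ::MessageBox(NULL, path, _T("Module path"), MB_OK);   // guideline 3: _T("") literal
}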
7) Do use a string class instead of a raw string buffer. Use std::string or std::wstring. To make this work in either ANSI or UNICODE, use the old trick of defining it as a Tstring:
Code:
// The following STL components are available in either wchar_t or char forms.
#ifndef _TSTRING
#ifdef _UNICODE
#define Tstring std::wstring
#define Tstringstream std::wstringstream
#define Tfstream std::wfstream
#define Tiostream std::wiostream
#else // #ifdef _UNICODE
#define Tstring std::string
#define Tstringstream std::stringstream
#define Tfstream std::fstream
#define Tiostream std::iostream
#endif // #ifdef _UNICODE
#define _TSTRING
#endif // #ifndef _TSTRING
8) Do use CString rather than Tstring. CString isn't just for MFC anymore. If you are using a compiler newer than VC6, Microsoft has reworked MFC's CString class so you no longer need to include MFC to use it. Just include <atlstr.h> in your non-MFC project. P.S. this largely supersedes 7) above - it's called progress.
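For instance, a minimal non-MFC sketch (assuming a VC7 or newer toolset with the ATL headers available):
Code:
#include <atlstr.h>   // CString without MFC (VC7 and later)
#include <tchar.h>

void Greet()
{
    CString msg;
    msg.Format(_T("Hello, %s. You have %d new messages."), _T("Arjay"), 3);
    ::MessageBox(NULL, msg, _T("Greeting"), MB_OK);   // CString converts to LPCTSTR implicitly
}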
9) Don't use the old char-only runtime library string functions such as sprintf or sprintf_s.
10) Use the TCHAR (and new safe) equivalents, such as:
Code:
_stprintf_s
_stscanf_s
etc.
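For example, here's a small sketch of the TCHAR-aware CRT functions (the buffer and values are made up):
Code:
#include <tchar.h>
#include <stdlib.h>   // _countof

void FormatAndParse()
{
    TCHAR msg[64];
    int value = 42;
    _stprintf_s(msg, _countof(msg), _T("value = %d"), value);   // TCHAR-aware sprintf_s

    int parsed = 0;
    _stscanf_s(msg, _T("value = %d"), &parsed);                  // TCHAR-aware sscanf_s
}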
11) Do use the String Conversion Macros. When converting from ANSI (MBCS) to UNICODE, from UNICODE to ANSI (MBCS), or between either of those and a BSTR, leverage these macros.
Conversion Key
Code:
SourceType/DestinationType Description
A ANSI character string.
W Unicode character string.
T Generic character string (equivalent to W when _UNICODE
is defined, equivalent to A otherwise).
OLE OLE character string.
Here's an incomplete list:
Code:
T2A // When you need to convert a generic string (either ANSI or UNICODE) and always make it ANSI.
T2W // Same as above except it always makes it UNICODE.
A2T // When the string is ANSI and you need to make it a generic string
W2T // When the string is UNICODE and you need to make it a generic string
OLE2T // From a BSTR to a generic string
T2OLE // From a generic string to a BSTR
Note: In VC6, the conversion macros did stack-based allocations during the conversion process, so they didn't handle large strings very well. In the newer compilers, the conversion macros are smarter and allocate small strings on the stack and larger strings on the heap. Either way, don't worry about freeing memory allocated for the conversion; the macros do the right thing and free any allocated memory.
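Here's a minimal sketch of the macros in use (the function and parameter names are made up); the classic macros require USES_CONVERSION from <atlconv.h> to be declared first:
Code:
#include <atlconv.h>   // string conversion macros
#include <tchar.h>
#include <stdio.h>

// Hypothetical inputs: a BSTR from some COM call and an ANSI string from a legacy API.
void DumpStrings(BSTR comText, const char* ansiText)
{
    USES_CONVERSION;                        // the classic macros need this declared first

    LPCTSTR fromBstr = OLE2T(comText);      // BSTR -> generic string
    LPCTSTR fromAnsi = A2T(ansiText);       // ANSI -> generic string

    _tprintf(_T("%s %s\n"), fromBstr, fromAnsi);

    // The converted buffers are temporary (stack, or heap with newer ATL) and are
    // cleaned up automatically - just don't hold onto the pointers after this function returns.
}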
12) Do use CComBSTR or _bstr_t when working with BSTRs and COM. These classes are slick and take care of the drudgery of COM string allocation and freeing. As a bonus, these classes can do the correct string conversions as well.
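A small sketch (the COM call at the end is hypothetical and commented out):
Code:
#include <atlbase.h>    // CComBSTR
#include <comutil.h>    // _bstr_t (link with comsuppw.lib / comsuppwd.lib)
#include <tchar.h>

void BstrExample()
{
    CComBSTR b1(_T("hello"));   // allocates a BSTR; the destructor calls SysFreeString
    _bstr_t  b2("world");       // converts the char* to a BSTR and frees it automatically

    // Hypothetical COM call expecting a BSTR [in] parameter:
    // pSomething->put_Name(b1);
}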
13) If you are writing code with VC2005 or above, and that code is compiled for UNICODE and will never be built for ANSI, you can hardcode your string literals with L"" instead of _T(""). It's still okay to use _T(""), though.
Lastly, consider this programming mindset: keep all string manipulation within your program generic (in other words, write in terms of T, the generic string type), and convert string inputs and outputs as needed. For example, if a method hands me an ANSI string, I immediately convert it to a T (using the A2T macro) so I can work with it in the rest of the program. Similarly, if I need to call a function that only takes a wchar_t*, I simply convert from the generic T (using the T2W macro) inline as I make the call.
This allows all internal program manipulation to be in ANSI or UNICODE (based on the build settings), and the programmer only has to convert explicitly when data comes in from or goes out to the outside world. Note: Windows APIs that take strings generally do not need a conversion (because the appropriate xxxA or xxxW version gets called automatically).
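As a rough sketch of that boundary pattern (LegacyAnsiSource and WideOnlyApi are made-up stand-ins for external interfaces):
Code:
#include <atlconv.h>
#include <tchar.h>
#include <stdlib.h>   // _countof

// Made-up stand-ins for external interfaces: one hands us ANSI, one demands wide chars.
const char* LegacyAnsiSource();
void WideOnlyApi(const wchar_t* text);

void Process()
{
    USES_CONVERSION;

    TCHAR data[128];
    _tcscpy_s(data, _countof(data), A2T(LegacyAnsiSource()));  // convert once, at the input boundary
    _tcsupr_s(data, _countof(data));                           // internal work stays in T form
    WideOnlyApi(T2W(data));                                    // convert inline only where wchar_t* is required
}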
A final comment. If your program has a UI or uses Windows APIs and is targeted at the NT line (Win2K, XP, Vista, etc.), consider building your application as UNICODE. Understand that if you build your program as ANSI (MBCS), every call you make to a Windows API that takes a string will undergo a string conversion, because the xxxA APIs just convert everything to UNICODE internally and then call the xxxW versions.
Comments? Suggestions?
Last edited by Arjay; May 23rd, 2008 at 03:07 AM.
-
May 22nd, 2008, 07:26 PM
#2
Re: A guideline for coding with strings in Windows
Or you could just set MBCS and not worry about it. In 15 or so years of VC++, I've never had to do anything with Unicode. I guess some people do, but it seems like a lot of trouble to have to worry about it like that.
-
May 22nd, 2008, 07:59 PM
#3
Re: A guideline for coding with strings in Windows
You are correct, it depends on what you are doing. If you are one of those developers still using VC6 and working in ANSI, you may never see the need for this.
But it really isn't too difficult, and if you ever need to port your code to UNICODE (or maintain both ANSI and UNICODE), following these simple guidelines will make life easier.
-
May 22nd, 2008, 08:03 PM
#4
Re: A guideline for coding with strings in Windows
 Originally Posted by Arjay
Comments? Suggestions?
Regarding guideline #7, what about a simple set of TCHAR-dependent typedefs, instead?
Code:
typedef std::basic_string<TCHAR> Tstring;
typedef std::basic_stringstream<TCHAR> Tstringstream;
typedef std::basic_fstream<TCHAR> Tfstream;
//typedef std::basic_*<TCHAR> T*; etc...
The pre-processor is useful, but should probably be avoided when real language constructs can be used, don't you think?
-
May 22nd, 2008, 08:38 PM
#5
Re: A guideline for coding with strings in Windows
8) Pretty much makes 7) obsolete.
Back in the VC6 timeframe the Tstring approach was a good alternative to CString (which wasn't available for non-MFC).
Now that CString works in non-MFC code (just #include <atlstr.h>), I personally don't see much of a reason to use the STL versions, as CString offers more useful features such as Format, LoadString, GetBuffer, AllocSysString (an alternative to the OLE conversion macros), etc.
-
May 22nd, 2008, 09:08 PM
#6
Re: A guideline for coding with strings in Windows
 Originally Posted by Arjay
8) Pretty much makes 7) obsolete.
Back in the VC6 timeframe the Tstring approach was a good alternative to CString (which wasn't available for non-MFC).
Now that CString works in non-MFC code (just #include <atlstr.h>), I personally don't see much of a reason to use the STL versions, as CString offers more useful features such as Format, LoadString, GetBuffer, AllocSysString (an alternative to the OLE conversion macros), etc.
As always, there is no "single solution" that will work for everyone.
My post was meant to point out an alternative to using the pre-processor (at least in terms of direct symbol substitution) to people (who might not know better at the time) that will be reading this thread in the future. =)
-
May 22nd, 2008, 09:27 PM
#7
Re: A guideline for coding with strings in Windows
It seems to me that this approach makes code cross-character-set at the expense of making it very much *not* cross-platform.
I'd be interested to know if there's an easy way to write code which is both cross-character-set and cross-platform.
-
May 22nd, 2008, 09:33 PM
#8
Re: A guideline for coding with strings in Windows
 Originally Posted by Lindley
It seems to me that this approach makes code cross-character-set at the expense of making it very much *not* cross-platform.
I'd be interested to know if there's an easy way to write code which is both cross-character-set and cross-platform.
Sure there is, but it is not "standard"....
About 12 years ago, our company designed a complete string library which has evolved over time. It does everything the way we want, and is completely based on a "DynChar" object for handling the individual characters. (It is not currently available to the public.)
The reason I bring it up is that each "string class" (ours included) is designed to meet a certain set of goals. Individual vendors have each taken the approach of best meeting their own goals [I actually consider the different development/working groups within MSFT as different vendors, but that is another tale...]. This actually makes good fiscal sense.
Even our own class follows this paradigm, so it is really no better in the general sense, but it is perfect for us.... Certain things (such as interacting with native Win32Api) we simply do not do directly. Those are all wrapped, and if they take "character" type parameters, the wrappers handle the transformations. The same for all of our UI controls.
-
May 23rd, 2008, 12:22 AM
#9
Re: A guideline for coding with strings in Windows
I suppose you're going to post this to the FAQs, right?
-
May 23rd, 2008, 02:17 AM
#10
Re: A guideline for coding with strings in Windows
I have a personal standpoint on this subject that I'd like to run by everyone.
>> Here's what you can do to write programs that port cleanly between ANSI and UNICODE.
Windows 95 is the only OS that doesn't support UNICODE. So writing code that can be compiled for ANSI or UNICODE means you're writing code to support Windows 95 (ANSI) and UNICODE (anything non-95). So why even bother with the ANSI APIs these days? I don't see any other reason to write TCHAR code that can switch back and forth.
My opinion is to program against UNICODE API's exclusively.
Use L"" or "", char or wchar_t, string or wstring, CStringA or CStringW, .... depending on what you need.
On a side note, there is a separation between the Win32 API and CRT on how it handles tchar's - which may be useful for the FAQ:
- Win32 API uses "UNICODE" macro, "TEXT()", and "TCHAR"
- CRT uses "_UNICODE" macro, "_T()/_TEXT()", and "_TCHAR"
The CRT's <tchar.h> will typedef "TCHAR" if it hasn't been already, so TCHAR vs _TCHAR doesn't really matter.
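If you want to be defensive about it, a tiny guard like the following (just a sketch) catches the case where only one of the two macros got defined:
Code:
// UNICODE drives the Win32 headers (TCHAR, TEXT(), the xxxW vs. xxxA calls);
// _UNICODE drives the CRT's <tchar.h> (_TCHAR, _T()/_TEXT(), the _tcs* functions).
// Project settings normally define both together; this guard catches a mismatch.
#if defined(UNICODE) != defined(_UNICODE)
#error Define UNICODE and _UNICODE together so the Win32 API and CRT agree
#endif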
>> 4) Do use char* in your code
I think you mean "don't".
>> This allows all internal program manipulation to be in ANSI or UNICODE (based on the build settings)
But why? Why go through the trouble of using TCHAR's when you could be explicit in your types instead? Does anyone know of a good reason?
gg
-
May 23rd, 2008, 03:05 AM
#11
Re: A guideline for coding with strings in Windows
 Originally Posted by Lindley
It seems to me that this approach makes code cross-character-set at the expense of making it very much *not* cross-platform.
I'd be interested to know if there's an easy way to write code which is both cross-character-set and cross-platform.
This post is intended for Windows (as the title suggests). The target is being able to compile for ANSI and UNICODE on Win9x and NT-based platforms.
-
May 23rd, 2008, 03:36 AM
#12
Re: A guideline for coding with strings in Windows
 Originally Posted by Codeplug
My opinion is to program against UNICODE API's exclusively.
Use L"" or "", char or wchar_t, string or wstring, CStringA or CStringW, .... depending on what you need.
As I mentioned near the end of the post, I have a different philosophy on this. I prefer to keep all strings within the program in the T (generic) form, and convert only when necessary for data coming in or going out.
This makes calls to the Windows APIs cleaner (there's no A or W on the end), and it's easier to spot the places where you are converting. Strings are always stored in the generic form, CString, LPTSTR, or LPCTSTR is used to pass things around, and only the _T("") macro is used for literals.
I find code that intersperses L"", "", xxxxA, and xxxxW harder to follow because it's [to me] more difficult to spot when something has been converted.
 Originally Posted by Codeplug
On a side note, there is a separation between the Win32 API and CRT on how it handles tchar's - which may be useful for the FAQ:
- Win32 API uses "UNICODE" macro, "TEXT()", and "TCHAR"
- CRT uses "_UNICODE" macro, "_T()/_TEXT()", and "_TCHAR"
The CRT's <tchar.h> will typedef "TCHAR" if it hasn't been already, so TCHAR vs _TCHAR doesn't really matter.
For simplicity, I suggest only using the _T("") macro (although the others work as well).
 Originally Posted by Codeplug
>> 4) Do use char* in your code
I think you mean "don't".
Yes, thanks, I fixed it.
 Originally Posted by Codeplug
>> Here's what you can do to write programs that port cleanly between ANSI and UNICODE.
Windows 95 is the only OS that doesn't support UNICODE. So writing code that can be compiled for ANSI or UNICODE means you're writing code to support Windows 95 (ANSI) and UNICODE (anything non-95). So why even bother with the ANSI APIs these days? I don't see any other reason to write TCHAR code that can switch back and forth.
>> This allows all internal program manipulation to be in ANSI or UNICODE (based on the build settings)
But why? Why go through the trouble of using TCHAR's when you could be explicit in your types instead? Does anyone know of a good reason?
The intent is to write code that can switch from ANSI -> UNICODE.
Sometimes it's desirable to maintain a program that runs on Win9x (Win95, Win98, Win98SE, and WinME) and NT based platforms (NT4 and earlier, Win2K, XP, Win2K3, Vista and newer). As time goes on, this is less of a requirement, but quite a few large organizations still need to support both (Win9x and NT) platforms and build binaries for both.
Another reason would be if you work in a place that plans on moving to UNICODE in the future but is currently running a VC6, VC2002, or VC2003 compiler. If you start any new projects using these guidelines, your port will be much easier when you do go to UNICODE.
Part of some developers' confusion comes from taking code from a pre-VC2005 project and trying to use it in a newer project created with VC2005 or later. They run into all sorts of issues and wonder why the code doesn't compile anymore. Since VC2005 and newer projects default to UNICODE, they would have run into less trouble had they followed these guidelines.