A guideline for coding with strings in Windows

Thread: A guideline for coding with strings in Windows

  1. #1
    Arjay is offline Moderator / MS MVP Power Poster
    Join Date
    Aug 2004
    Posts
    10,973

    A guideline for coding with strings in Windows

    Here's what you can do to write programs that port cleanly between ANSI and UNICODE. Many questions on this forum come from folks who have trouble understanding the difference between a char* and a wchar_t* string, how to write code that compiles cleanly under ANSI (MBCS) or UNICODE, or how to convert between the two.

    The following are some guidelines that I've found useful:

    1) Understand that C++ projects created with VC++ 2005 and above default to UNICODE, instead of ANSI (MBCS).

    2) Don't hardcode string literals with bare quotes (i.e. " ")
    Code:
    "my string literal that isn't portable between ansi and unicode"
    3) Do use the _T("") macro (from <tchar.h>) for string literals. The following will happily switch between ANSI (MBCS) and UNICODE:
    Code:
    _T("my string literal that is portable between ansi and unicode")
    4) Don't use char* in your code. If you need a pointer to a string, use LPTSTR or LPCTSTR instead (this makes it 'character' portable).

    5) Don't declare buffers as
    Code:
    char buffer[MAX_PATH];
    6) Do declare string buffers as
    Code:
    #include <tchar.h>
    TCHAR buffer[MAX_PATH];
    7) Do use a string class instead of a raw string buffer. Use std::string or std::wstring. To make this work in ANSI or UNICODE, use the old trick of defining a Tstring name that resolves to one or the other:
    Code:
    // The following STL components are available in either wchar_t or char forms.
    #ifndef _TSTRING  
      #ifdef _UNICODE 
        #define Tstring std::wstring 
        #define Tstringstream std::wstringstream 
        #define Tfstream std::wfstream 
        #define Tiostream std::wiostream 
      #else // #ifdef _UNICODE 
        #define Tstring std::string 
        #define Tstringstream std::stringstream 
        #define Tfstream std::fstream 
        #define Tiostream std::iostream 
      #endif // #ifdef _UNICODE 
      #define _TSTRING
    #endif // #ifndef _TSTRING
    8) Do prefer CString over Tstring. CString isn't just for MFC anymore. If you are using a compiler newer than VC6, Microsoft has redefined MFC's CString class so you don't need to include MFC to use it anymore. Just include <atlstr.h> in your non-MFC project. P.S. This supersedes 7) above; it's called progress.

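    9) Don't call the char-only runtime library string functions such as sprintf or sprintf_s directly; they only accept char strings, so they won't compile against TCHAR data in a UNICODE build.

    10) Do use the TCHAR (and newer, safer) equivalents from <tchar.h>, such as: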
    Code:
    _stprintf_s
    _stscanf_s
    etc.
    11) Do use the String Conversion Macros. When converting from ANSI (MBCS) to UNICODE or UNICODE to ANSI (MBCS) or either to/from BSTR, leverage these macros.

    Conversion Key
    Code:
    Source/Destination  Description
    A                   ANSI character string.
    W                   Unicode character string.
    T                   Generic character string (equivalent to W when _UNICODE
                        is defined, equivalent to A otherwise).
    OLE                 OLE character string.
    Here's an incomplete list:
    Code:
    T2A    // Converts a generic string (either ANSI or UNICODE) and always makes it ANSI.
    T2W    // Same as above, except it always makes it UNICODE.
    A2T    // When the string is ANSI and you need to make it a generic string.
    W2T    // When the string is UNICODE and you need to make it a generic string.
    OLE2T  // From an OLE string (LPOLESTR; a BSTR can be passed in) to a generic string.
    T2OLE  // From a generic string to an OLE string (LPOLESTR). This is not an allocated BSTR; T2BSTR allocates one.
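    Note: In VC6, the conversion macros did stack-based allocations during the conversion process (and the classic macros, used with USES_CONVERSION, still do). As such they don't handle large strings, or use inside loops, very well. The newer compilers (ATL 7.0 and later) add conversion classes with matching names (CA2T, CT2A, CT2W, CW2T, etc.) that are smarter: they keep small strings in a buffer on the stack and allocate larger strings on the heap. You don't need to worry about freeing memory allocated for the conversion, because the conversion object frees it when it goes out of scope. Just don't hold on to the converted pointer after that point.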

    12) Do use CComBSTR or _bstr_t when working with BSTRs and COM. These classes are slick and take care of the drudgery of COM string allocation and freeing. As a bonus, these classes can do the correct string conversions as well.

    13) If you are writing code using VC2005 and above and the code is compiled for UNICODE and the code will never be run under ANSI, you can hardcode your string literals with L"" instead of _T(""). I should mention that it's still okay to use _T("").
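
    As a rough illustration of guidelines 3, 6 and 10 working together, here is a minimal sketch. MakeLabel and the 64-character buffer are made up for this example, and the #else branch is only a stand-in (an assumption, char flavor only) so the snippet also builds where <tchar.h> isn't available:

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
  #include <tchar.h>            // TCHAR, _T(), _stprintf_s
#else
  // Stand-ins (assumption): let the sketch build without <tchar.h>, char only.
  typedef char TCHAR;
  #define _T(x) x
  #define _stprintf_s snprintf
#endif

// Hypothetical helper: formats "<name> #<index>" into a generic string.
std::basic_string<TCHAR> MakeLabel(const TCHAR* name, int index)
{
    TCHAR buffer[64];           // size is counted in characters, not bytes
    _stprintf_s(buffer, sizeof(buffer) / sizeof(buffer[0]),
                _T("%s #%d"), name, index);
    return buffer;
}
```

    The same source should compile in either build: MakeLabel(_T("item"), 3) gives a char string in an ANSI build and a wchar_t string in a UNICODE build. Note that the buffer size is passed as a character count, which is what the _s functions expect.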

    Lastly, consider this programming mindset: keep all string manipulation inside your program generic (in other words, write in T, the generic string type), and convert string inputs and outputs only as needed. For example, if I call a method that hands me an ANSI string, I immediately convert it to a T string (using the A2T macro) so I can work with it in the rest of the program. Similarly, if I need to call a function that only takes a wchar_t*, I simply convert from the generic T (using the T2W macro) inline as I make the call.

    This allows all internal program manipulation to be in ANSI or UNICODE (based on the build settings), and the programmer only has to convert explicitly when data crosses the program's boundary. Note: strings passed to Windows APIs generally do not need to be converted (because the appropriate xxxA or xxxW version will automatically get called).
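
    One way to picture the convert-at-the-boundary idea is the sketch below. It is only an illustration: FromAnsi is a made-up helper, it calls the Win32 function MultiByteToWideChar directly instead of the A2T macro so that the snippet has no ATL dependency, and it assumes UNICODE and _UNICODE are either both defined or both undefined. The non-Windows branch is a stand-in so the snippet builds elsewhere.

```cpp
#include <string>

#ifdef _WIN32
  #include <windows.h>          // MultiByteToWideChar, CP_ACP
  #include <tchar.h>            // TCHAR, _T()
#else
  // Stand-ins (assumption): char-only build outside Windows.
  typedef char TCHAR;
  #define _T(x) x
#endif

typedef std::basic_string<TCHAR> Tstring;

// Hypothetical helper: convert an incoming ANSI string to the generic
// form once, at the point where it enters the program.
Tstring FromAnsi(const std::string& ansi)
{
#if defined(_WIN32) && defined(_UNICODE)
    if (ansi.empty()) return Tstring();
    const int srcLen = static_cast<int>(ansi.size());
    // First call asks how many wide characters are needed.
    const int len = MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), srcLen, NULL, 0);
    if (len <= 0) return Tstring();
    std::wstring wide(len, L'\0');
    // Second call performs the conversion into the sized buffer.
    MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), srcLen, &wide[0], len);
    return wide;
#else
    return ansi;                // generic string is already char based
#endif
}
```

    Everything downstream of FromAnsi then works with Tstring only, so the conversion shows up in exactly one place.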

    A final comment. If your program has a UI or uses Windows APIs and is targeted to run under NT-based systems (Win2K, XP, Vista, etc.), consider building your application as UNICODE. Understand that if you build your program as ANSI (MBCS), every call you make to a Windows API that takes a string will undergo a string conversion. This is because the xxxA APIs just convert everything to UNICODE internally and then call the xxxW version.
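
    The forwarding described above can be pictured with a toy pair of functions. DescribeA, DescribeW and the Describe macro are invented for illustration and are not Windows APIs. The widening step only handles plain ASCII, whereas the real xxxA functions convert using the active code page:

```cpp
#include <string>

// Hypothetical "W" version: does the real work on wide strings.
std::wstring DescribeW(const std::wstring& text)
{
    return L"[" + text + L"]";
}

// Hypothetical "A" version: converts its input, then forwards to the W version.
std::wstring DescribeA(const std::string& text)
{
    std::wstring wide(text.begin(), text.end());   // ASCII-only widening
    return DescribeW(wide);
}

// The generic name resolves to one of the two at compile time,
// in the same spirit as the Windows headers.
#ifdef UNICODE
  #define Describe DescribeW
#else
  #define Describe DescribeA
#endif
```

    In an ANSI build every call through Describe pays for the extra widening step, while a UNICODE build goes straight to DescribeW.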

    Comments? Suggestions?
    Last edited by Arjay; May 23rd, 2008 at 03:07 AM.

  2. #2
    GCDEF is offline Elite Member Power Poster
    Join Date
    Nov 2003
    Posts
    11,986

    Re: A guideline for coding with strings in Windows

    Or you could just set MBCS and not worry about it. In 15 or so years of VC++, I've never had to do anything with Unicode. I guess some people do, but it seems like a lot of trouble to have to worry about it like that.

  3. #3
    Arjay is offline Moderator / MS MVP Power Poster
    Join Date
    Aug 2004
    Posts
    10,973

    Re: A guideline for coding with strings in Windows

    You are correct, it depends what you are doing. If you are one of those developers still using VC 6, and working in ANSI, you may not ever see the need to do this.

    But it really isn't too difficult, and if you ever need to port your code to UNICODE (or maintain both ANSI and UNICODE), following these simple guidelines will make life easier.

  4. #4
    Join Date
    Jun 2006
    Location
    M31
    Posts
    885

    Re: A guideline for coding with strings in Windows

    Quote Originally Posted by Arjay
    Comments? Suggestions?
    Regarding guideline #7, what about a simple set of TCHAR-dependent typedefs, instead?
    Code:
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <tchar.h>   // TCHAR

    typedef std::basic_string<TCHAR> Tstring;
    typedef std::basic_stringstream<TCHAR> Tstringstream;
    typedef std::basic_fstream<TCHAR> Tfstream;
    
    //typedef std::basic_*<TCHAR> T*; etc...
    The pre-processor is useful, but should probably be avoided when real language constructs can be used, don't you think?
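
    For anyone who wants to try the typedef route, here is a minimal sketch of how it might be used. FormatCount and Tostringstream are names made up for this example, and the #else branch is a stand-in (an assumption, char only) so it builds without <tchar.h>. Unlike a #define, a typedef respects scope, so an unrelated identifier that happens to be called Tstring elsewhere is not silently rewritten:

```cpp
#include <sstream>
#include <string>

#ifdef _WIN32
  #include <tchar.h>            // TCHAR, _T()
#else
  // Stand-ins (assumption): char-only build outside Windows.
  typedef char TCHAR;
  #define _T(x) x
#endif

typedef std::basic_string<TCHAR>        Tstring;
typedef std::basic_ostringstream<TCHAR> Tostringstream;

// Hypothetical helper: builds "<name> x<count>" as a generic string.
Tstring FormatCount(const Tstring& name, int count)
{
    Tostringstream out;
    out << name << _T(" x") << count;
    return out.str();
}
```

    FormatCount(_T("bolt"), 4) produces bolt x4 as a std::string or a std::wstring, depending on the build.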

  5. #5
    Arjay is offline Moderator / MS MVP Power Poster
    Join Date
    Aug 2004
    Posts
    10,973

    Re: A guideline for coding with strings in Windows

    8) Pretty much makes 7) obsolete.

    Back in the VC6 timeframe the Tstring approach was a good alternative to CString (which wasn't available for non-MFC).

    Now that CString works in non-mfc (just #include <atlstr.h>), I personally don't see much of a reason to use the stl versions as CString offers more useful features such as Format, LoadString, GetBuffer, AllocSysString (conversion OLE macro alternative), etc.

  6. #6
    Join Date
    Jun 2006
    Location
    M31
    Posts
    885

    Re: A guideline for coding with strings in Windows

    Quote Originally Posted by Arjay
    8) Pretty much makes 7) obsolete.

    Back in the VC6 timeframe the Tstring approach was a good alternative to CString (which wasn't available for non-MFC).

    Now that CString works in non-mfc (just #include <atlstr.h>), I personally don't see much of a reason to use the stl versions as CString offers more useful features such as Format, LoadString, GetBuffer, AllocSysString (conversion OLE macro alternative), etc.
    As always, there is no "single solution" that will work for everyone.
    My post was meant to point out an alternative to using the pre-processor (at least in terms of direct symbol substitution) to people (who might not know better at the time) that will be reading this thread in the future. =)

  7. #7
    Lindley is offline Elite Member Power Poster
    Join Date
    Oct 2007
    Location
    Fairfax, VA
    Posts
    10,885

    Re: A guideline for coding with strings in Windows

    It seems to me that this approach makes code cross-character-set at the expense of making it very much *not* cross-platform.

    I'd be interested to know if there's an easy way to write code which is both cross-character-set and cross-platform.

  8. #8
    Join Date
    Mar 2002
    Location
    NY, USA
    Posts
    12,097

    Re: A guideline for coding with strings in Windows

    Quote Originally Posted by Lindley
    It seems to me that this approach makes code cross-character-set at the expense of making it very much *not* cross-platform.

    I'd be interested to know if there's an easy way to write code which is both cross-character-set and cross-platform.
    Sure there is, but it is not "standard"....

    About 12 years ago, our company designed a complete string library which has evolved over time. It does everything the way we want, and is completely based on a "DynChar" object for handling the individual characters. (It is not currently available to the public).

    The reason I bring it up is that each "string class" (ours included) is designed to meet a certain set of goals. Individual vendors have each taken the approach of best meeting their own goals [I actually consider the different development/working groups within MSFT as different vendors, but that is another tale...]. This actually makes good fiscal sense.

    Even our own class follows this paradigm, so it is really no better in the general sense, but it is perfect for us.... Certain things (such as interacting with native Win32Api) we simply do not do directly. Those are all wrapped, and if they take "character" type parameters, the wrappers handle the transformations. The same for all of our UI controls.
    TheCPUWizard is a registered trademark, all rights reserved. (If this post was helpful, please RATE it!)
    2008, 2009
    In theory, there is no difference between theory and paractice; in practice there is.


  9. #9
    Join Date
    Oct 2002
    Location
    Timisoara, Romania
    Posts
    14,360

    Re: A guideline for coding with strings in Windows

    Quote Originally Posted by Arjay
    Comments? Suggestions?
    I suppose you're going to post this to the FAQs, right?
    Marius Bancila

  10. #10
    Join Date
    Nov 2003
    Posts
    1,778

    Re: A guideline for coding with strings in Windows

    I have a personal standpoint on this subject that I'd like to run by everyone.

    >> Here's what you can do to to write programs that ports cleanly between ANSI and UNICODE.
    Windows 95 is the only OS that doesn't support UNICODE. So writing code that can be compiled for ANSI or UNICODE, means you're writing code to support Windows 95 (ANSI) and UNICODE (anything non-95). So why even bother with ANSI API's these days? I don't see any other reason to write TCHAR code that can switch back and forth.

    My opinion is to program against UNICODE API's exclusively.
    Use L"" or "", char or wchar_t, string or wstring, CStringA or CStringW, .... depending on what you need.

    On a side note, there is a separation between the Win32 API and CRT on how it handles tchar's - which may be useful for the FAQ:
    - Win32 API uses "UNICODE" macro, "TEXT()", and "TCHAR"
    - CRT uses "_UNICODE" macro, "_T()/_TEXT()", and "_TCHAR"

    The CRT's <tchar.h> will typedef "TCHAR" if it hasn't been already, so TCHAR vs _TCHAR doesn't really matter.
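
    Because the two sets of headers key off different macros, a build that defines only one of them can end up with mismatched types. A small guard like the following sketch can catch that early. IsUnicodeBuild is a made-up helper, and the #else branch is a stand-in (an assumption) for building without <tchar.h>:

```cpp
// Win32 headers test UNICODE, the CRT headers test _UNICODE.
// Refuse to build if only one of the two is defined.
#if defined(UNICODE) != defined(_UNICODE)
  #error "Define both UNICODE and _UNICODE, or neither"
#endif

#ifdef _WIN32
  #include <tchar.h>            // TCHAR
#else
  typedef char TCHAR;           // stand-in (assumption), char only
#endif

// Hypothetical helper: reports which flavor this file was compiled as.
bool IsUnicodeBuild()
{
#ifdef _UNICODE
    return true;
#else
    return false;
#endif
}
```

    With the guard in place, IsUnicodeBuild() and the width of TCHAR should always agree.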

    >> 4) Do use char* in your code
    I think you mean "don't".

    >> This allows all internal program manipulation to be in ANSI or UNICODE (based on the build settings)
    But why? Why go through the trouble of using TCHAR's when you could be explicit in your types instead? Does anyone know of a good reason?

    gg

  11. #11
    Arjay is offline Moderator / MS MVP Power Poster
    Join Date
    Aug 2004
    Posts
    10,973

    Re: A guideline for coding with strings in Windows

    Quote Originally Posted by Lindley
    It seems to me that this approach makes code cross-character-set at the expense of making it very much *not* cross-platform.

    I'd be interested to know if there's an easy way to write code which is both cross-character-set and cross-platform.
    This post is intended for Windows (as the title suggests). The goal is to be able to compile for ANSI and UNICODE on Win9x and NT-based platforms.

  12. #12
    Arjay is offline Moderator / MS MVP Power Poster
    Join Date
    Aug 2004
    Posts
    10,973

    Re: A guideline for coding with strings in Windows

    Quote Originally Posted by Codeplug
    I have a personal standpoint on this subject that I'd like to run by everyone.

    My opinion is to program against UNICODE API's exclusively.
    Use L"" or "", char or wchar_t, string or wstring, CStringA or CStringW, .... depending on what you need.
    As I mentioned near the end of the post, I have a different philosophy on this. I prefer to keep all strings inside the program in the T (generic) form, and convert only when necessary for data coming in or going out.

    This makes any calls to the Windows APIs cleaner (there's no A or W on the end) and makes it easier to spot the places where you are converting. Strings are always stored in the generic form; CString, LPTSTR, or LPCTSTR is used to pass things around, and only the _T("") macro is used for literals.

    I find code that uses L"", "" or xxxxA or xxxxW interspersed harder to follow because it's [to me] more difficult to spot when something has been converted.
    Quote Originally Posted by Codeplug
    On a side note, there is a separation between the Win32 API and CRT on how it handles tchar's - which may be useful for the FAQ:
    - Win32 API uses "UNICODE" macro, "TEXT()", and "TCHAR"
    - CRT uses "_UNICODE" macro, "_T()/_TEXT()", and "_TCHAR"

    The CRT's <tchar.h> will typedef "TCHAR" if it hasn't been already, so TCHAR vs _TCHAR doesn't really matter.
    For simplicity, I suggest only using the _T("") macro (although the others work as well).

    Quote Originally Posted by Codeplug
    >> 4) Do use char* in your code
    I think you mean "don't".
    Yes, thanks, I fixed it.

    Quote Originally Posted by Codeplug
    >> Here's what you can do to to write programs that ports cleanly between ANSI and UNICODE.
    Windows 95 is the only OS that doesn't support UNICODE. So writing code that can be compiled for ANSI or UNICODE, means you're writing code to support Windows 95 (ANSI) and UNICODE (anything non-95). So why even bother with ANSI API's these days? I don't see any other reason to write TCHAR code that can switch back and forth.

    >> This allows all internal program manipulation to be in ANSI or UNICODE (based on the build settings)
    But why? Why go through the trouble of using TCHAR's when you could be explicit in your types instead? Does anyone know of a good reason?
    The intent is to write code that can switch from ANSI -> UNICODE.

    Sometimes it's desirable to maintain a program that runs on Win9x (Win95, Win98, Win98SE, and WinME) and NT based platforms (NT4 and earlier, Win2K, XP, Win2K3, Vista and newer). As time goes on, this is less of a requirement, but quite a few large organizations still need to support both (Win9x and NT) platforms and build binaries for both.

    Another reason would be if you work in a place that plans on moving to UNICODE in the future, but is currently running a VC6, VC2002, or VC2003 compiler. If you start any new projects using these guidelines, your port will be much easier when you do go to UNICODE.

    Some developers' confusion arises when they take code from a pre-VC2005 project and try to build it in a project created with VC2005 or newer. They run into all sorts of issues and wonder why the code doesn't compile anymore. Since VC2005 and newer projects default to UNICODE, they would have run into less trouble had they followed these guidelines.
