Chapter 2. Strings and Text

Strings come in a number of different character sets. COM components often need to use multiple character sets and occasionally need to convert from one set to another. ATL provides a number of string conversion classes that convert from one character set to another when necessary and do nothing when no conversion is needed.

The CComBSTR class is a smart string class. This class properly allocates, copies, and frees a string according to the BSTR string semantics. CComBSTR instances can be used in most, but not all, of the places you would use a BSTR.

The CString class is a new addition to ATL, with roots in MFC. This class handles allocation, copying, and formatting, and offers a host of advanced string-processing features. It can manage ANSI and Unicode data, and convert strings to and from BSTRs for use in processing Automation method parameters. With CString, you can even control and customize the way memory is managed for the class’s string data.

String Data Types, Conversion Classes, and Helper Functions

A Review of Text Data Types

The text data type is somewhat of a pain to deal with in C++ programming. The main problem is that there isn’t just one text data type; there are many of them. I use the term text data type here in the general sense of an array of characters. Often, different operating systems and programming languages introduce additional semantics for an array of characters (for example, NUL character termination or a length prefix) before they consider an array of characters a text string.

When you select a text data type, you must make a number of decisions. First, you must decide what type of characters constitute the array. Some operating systems require you to use ANSI characters when you pass a string (such as a file name) to the operating system. Some operating systems prefer that you use Unicode characters but will accept ANSI characters. Other operating systems require you to use EBCDIC characters. Stranger character sets are in use as well, such as the Multi/Double Byte Character Sets (MBCS/DBCS); this book largely doesn’t discuss those details.

Second, you must consider what character set you want to use to manipulate text within your program. No requirement states that your source code must use the same character set that the operating system running your program prefers. Clearly, it’s more convenient when both use the same character set, but a program and the operating system can use different character sets. You “simply” must convert all text strings going to and coming from the operating system.

Third, you must determine the length of a text string. Some languages, such as C and C++, and some operating systems, such as Windows 9x/NT/XP and UNIX, use a terminating NUL character to delimit the end of a text string. Other languages, such as the Microsoft Visual Basic interpreter, Microsoft Java virtual machine, and Pascal, prefer an explicit length prefix specifying the number of characters in the text string.

Finally, in practice, a text string presents a resource-management issue. Text strings typically vary in length. This makes it difficult to allocate memory for the string on the stack, and the text string might not fit on the stack at all. Therefore, text strings are often dynamically allocated. Of course, this means that a text string must be freed eventually. Resource management introduces the idea of an owner of a text string. Only the owner frees the string, and frees it only once. Ownership becomes quite important when you pass a text string between components.

To make matters worse, two COM objects can reside on two different computers running two different operating systems that prefer two different character sets for a text string. For example, you can write one COM object in Visual Basic and run it on the Windows XP operating system. You might pass a text string to another COM object written in C++ running on an IBM mainframe. Clearly, we need some standard text data type that all COM objects in a heterogeneous environment can understand.

COM uses the OLECHAR character data type. A COM text string is a NUL-character-terminated array of OLECHAR characters; a pointer to such a string is an LPOLESTR. [1] As a rule, a text string parameter to a COM interface method should be of type LPOLESTR. When a method doesn’t change the string, the parameter should be of type LPCOLESTR (that is, a pointer to a constant array of OLECHAR characters).

Frequently, though not always, the OLECHAR type isn’t the same as the characters you use when writing your code. Sometimes, though not always, the OLECHAR type isn’t the same as the characters you must provide when passing a text string to the operating system. This means that, depending on context, sometimes you need to convert a text string from one character set to another, and sometimes you don’t.

Unfortunately, a change in compiler options (for example, a Windows XP Unicode build or a Windows CE build) can change this context. As a result, code that previously didn’t need to convert a string might require conversion, or vice versa. You don’t want to rewrite all string-manipulation code each time you change a compiler option. Therefore, ATL provides a number of string-conversion macros that convert a text string from one character set to another and are sensitive to the context in which you invoke the conversion.

Windows Character Data Types

Now let’s focus specifically on the Windows platform. Windows-based COM components typically use a mix of four text data types:

Unicode. A specification for representing a character as a “wide-character,” 16-bit multilingual character code. The Windows NT/XP operating system uses the Unicode character set internally. All characters used in modern computing worldwide, including technical symbols and special publishing characters, can be represented uniquely in Unicode. The fixed character size simplifies programming when using international character sets. In C/C++, you represent a wide-character string as a wchar_t array; a pointer to such a string is a wchar_t*.
MBCS/DBCS. The Multi-Byte Character Set is a mixed-width character set in which some characters consist of more than 1 byte. The Windows 9x operating systems, in general, use the MBCS to represent characters. The Double-Byte Character Set (DBCS) is a specific type of multibyte character set. It includes some characters that consist of 1 byte and some characters that consist of 2 bytes to represent the symbols for one specific locale, such as the Japanese, Chinese, and Korean languages. In C/C++, you represent an MBCS/DBCS string as an unsigned char array; a pointer to such a string is an unsigned char*. Sometimes a character is one unsigned char in length; sometimes, it’s more than one. This is loads of fun to deal with, especially when you’re trying to back up through a string. In Visual C++, MBCS always means DBCS. Character sets wider than 2 bytes are not supported.
ANSI. You can represent all characters in the English language, as well as many Western European languages, using only 8 bits. Versions of Windows that support such languages use a degenerate case of MBCS, called the Microsoft Windows ANSI character set, in which no multibyte characters are present. The Microsoft Windows ANSI character set, which is essentially ISO 8859/x plus additional characters, was originally based on an ANSI draft standard. The ANSI character set maps the letters and numerals in the same manner as ASCII. However, ANSI does not support control characters and maps many symbols, including accented letters, that are not mapped in standard ASCII. All Windows fonts are defined in the ANSI character set. This is also called the Single-Byte Character Set (SBCS), for symmetry. In C/C++, you represent an ANSI string as a char array; a pointer to such a string is a char*. A character is always one char in length. By default, a char is a signed char in Visual C++. Because MBCS characters are unsigned and ANSI characters are, by default, signed characters, expressions can evaluate differently when using ANSI characters, compared to using MBCS characters.
TCHAR / _TCHAR. This is a Microsoft-specific generic-text data type that you can map to a Unicode character, an MBCS character, or an ANSI character using compile-time options. You use this character type to write generic code that can be compiled for any of the three character sets. This simplifies code development for international markets. The C runtime library defines the _TCHAR type, and the Windows operating system defines the TCHAR type; they are synonymous. tchar.h, a Microsoft-specific C runtime library header file, defines the generic-text data type _TCHAR. ANSI C/C++ compiler compliance requires implementer-defined names to be prefixed by an underscore. When you do not define the __STDC__ preprocessor symbol (by default, this macro is not defined in Visual C++), you indicate that you don’t require ANSI compliance. In this case, the tchar.h header file also defines the symbol TCHAR as another alias for the generic-text data type if it isn’t already defined. winnt.h, a Microsoft-specific Win32 operating system header file, defines the generic-text data type TCHAR. This header file is operating system specific, so the symbol names don’t need the underscore prefix.
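
The signedness difference between ANSI and MBCS characters mentioned above is easy to trip over in comparisons. The following is a minimal sketch in portable C++, not code from ATL or the Windows headers; the function names are hypothetical, and it uses explicit signed char and unsigned char so that the result doesn't depend on a compiler's default for plain char.

```cpp
#include <cassert>

// Hypothetical helpers: does this byte lie outside the 7-bit ASCII range?
// The same bit pattern gives different answers depending on signedness.
inline bool IsHighByteSigned(signed char ch) {
    int value = ch;            // 0xE9 stored in a signed char promotes to -23
    return value > 0x7F;       // so this is never true
}

inline bool IsHighByteUnsigned(unsigned char ch) {
    int value = ch;            // 0xE9 stored in an unsigned char promotes to 233
    return value > 0x7F;       // true for 0x80 through 0xFF
}

// A signedness-independent test: convert to unsigned char before comparing.
inline bool IsHighByte(char ch) {
    return static_cast<unsigned char>(ch) > 0x7F;
}
```

Casting to unsigned char before comparing or indexing a table, as IsHighByte does, keeps an expression's meaning the same whichever character type the build selects.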

Win32 APIs and Strings

Each Win32 API that requires a string has two versions: one that requires a Unicode argument and another that requires an MBCS argument. On a non-MBCS-enabled version of Windows, the MBCS version of an API expects an ANSI argument. For example, the SetWindowText API doesn’t really exist. There are actually two functions: SetWindowTextW, which expects a Unicode string argument, and SetWindowTextA, which expects an MBCS/ANSI string argument.

The Windows NT/2000/XP operating systems internally use only Unicode strings. Therefore, when you call SetWindowTextA on Windows NT/2000/XP, the function translates the specified string to Unicode and then calls SetWindowTextW. The Windows 9x operating systems do not support Unicode directly. The SetWindowTextA function on the Windows 9x operating systems does the work, while SetWindowTextW returns an error. The MSLU library from Microsoft [2] provides implementations of almost all the Unicode functions on Win9x.

This gives you a difficult choice. You could write a performance-optimized component using Unicode character strings that runs on Windows 2000 but not on Windows 9x. You could use MSLU for Unicode strings on both families and lose performance on Windows 9x. You could write a more general component using MBCS/ANSI character strings that runs on both operating systems but not optimally on Windows 2000. Alternatively, you could hedge your bets by writing source code that enables you to decide at compile time what character set to support.

A little coding discipline and some preprocessor magic let you code as if there were a single API called SetWindowText that expects a TCHAR string argument. You specify at compile time which kind of component you want to build. For example, you write code that calls SetWindowText and specifies a TCHAR buffer. When compiling a component as Unicode, you call SetWindowTextW; the argument is a wchar_t buffer. When compiling an MBCS/ANSI component, you call SetWindowTextA; the argument is a char buffer.
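
To make the "preprocessor magic" concrete, here is a small portable sketch of the same trick. It is only an illustration: SetTitleA, SetTitleW, SetTitle, XCHAR, and X_TEXT are hypothetical stand-ins for SetWindowTextA, SetWindowTextW, SetWindowText, TCHAR, and __TEXT, and the two functions merely record which one was called.

```cpp
#include <cassert>
#include <string>

// Records the name of the most recently called stand-in function.
inline std::string& LastCall() {
    static std::string last;
    return last;
}

// Hypothetical A/W pair standing in for a real Win32 API pair.
inline void SetTitleA(const char* /*text*/)    { LastCall() = "SetTitleA"; }
inline void SetTitleW(const wchar_t* /*text*/) { LastCall() = "SetTitleW"; }

// The generic-text layer: one symbol selects the character type, the literal
// prefix, and the function name. These names are illustrative only.
#ifdef UNICODE
typedef wchar_t XCHAR;
#define X_TEXT(s) L##s
#define SetTitle  SetTitleW
#else
typedef char XCHAR;
#define X_TEXT(s) s
#define SetTitle  SetTitleA
#endif

// Code written once against the generic names.
inline void ShowGreeting() {
    const XCHAR* greeting = X_TEXT("Hello");
    SetTitle(greeting);
}
```

Compiling this file with UNICODE defined routes ShowGreeting to SetTitleW with a wide literal; compiling without it routes the same source to SetTitleA with a narrow literal.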

When you write a Windows-based COM component, you should typically use the TCHAR character type to represent characters used by the component internally. Additionally, you should use it for all characters used in interactions with the operating system. Similarly, you should use the TEXT or __TEXT macro to surround every literal character or string.

tchar.h defines the functionally equivalent macros _T, __T, and _TEXT, which all compile a character or string literal as a generic-text character or literal. winnt.h also defines the functionally equivalent macros TEXT and __TEXT, which are yet more synonyms for _T, __T, and _TEXT. (There’s nothing like five ways to do exactly the same thing.) The examples in this chapter use __TEXT because it’s defined in winnt.h. I actually prefer _T because it’s less clutter in my source code.

An operating-system-agnostic coding approach favors including tchar.h and using the _TCHAR generic-text data type because that’s somewhat less tied to the Windows operating systems. However, we’re discussing building components with text handling optimized at compile time for specific versions of the Windows operating systems. This argues that we should use TCHAR, the type defined in winnt.h. Plus, TCHAR isn’t as jarring to the eyes as _TCHAR and it’s easier to type. Most code already implicitly includes the winnt.h header file via windows.h, and you must explicitly include tchar.h. All sorts of good reasons support using TCHAR, so the examples in this book use this as the generic-text data type.

Using the generic-text type and macros means that you can compile specialized versions of the component for different markets or for performance reasons. TCHAR and the TEXT and __TEXT macros are defined in the winnt.h header file.

You also must use a different set of string runtime library functions when manipulating strings of TCHAR characters. The familiar functions strlen, strcpy, and so on operate only on char characters. The less familiar functions wcslen, wcscpy, and so on work on wchar_t characters. Moreover, the totally strange functions _mbslen, _mbscpy, and so on work on multibyte characters. Because TCHAR characters are sometimes wchar_t, sometimes char-holding ANSI characters, and sometimes char-holding (nominally unsigned) multibyte characters, you need an equivalent set of runtime library functions that work with TCHAR characters.

The tchar.h header file defines a number of useful generic-text mappings for string-handling functions. These functions expect TCHAR parameters, so all their function names use the _tcs (the _t character set) prefix. For example, _tcslen is equivalent to the C runtime library strlen function. The _tcslen function expects TCHAR characters, whereas the strlen function expects char characters.

Controlling Generic-Text Mapping Using the Preprocessor

The UNICODE/_UNICODE and _MBCS preprocessor symbols control the mapping of the TCHAR data type, along with the generic-text macros and functions, to the underlying character type the application uses.

UNICODE/_UNICODE. The header files for the Windows operating system APIs use the UNICODE preprocessor symbol. The C/C++ runtime library header files use the _UNICODE preprocessor symbol. Typically, you define either both symbols or neither of them. When you compile with the symbol _UNICODE defined, tchar.h maps all TCHAR characters to wchar_t characters. The _T, __T, and _TEXT macros prefix each character or string literal with a capital L (creating a Unicode character or literal, respectively). When you compile with the symbol UNICODE defined, winnt.h maps all TCHAR characters to wchar_t characters. The TEXT and __TEXT macros prefix each character or string literal with a capital L (creating a Unicode character or literal, respectively). The _tcsXXX functions are mapped to the corresponding wcsXXX functions (or _wcsXXX, for Microsoft-specific functions such as _wcsrev).
_MBCS. When you compile with the symbol _MBCS defined, all TCHAR characters map to char characters, and the preprocessor removes all the _T and __TEXT macro variations. It leaves the character or literal unchanged (creating an MBCS character or literal, respectively). The _tcsXXX functions are mapped to the corresponding _mbsXXX versions.
None of the above. When you compile with neither symbol defined, all TCHAR characters map to char characters and the preprocessor removes all the _T and __TEXT macro variations, leaving the character or literal unchanged (creating an ANSI character or literal, respectively). The _tcsXXX functions are mapped to the corresponding strXXX functions (or _strXXX, for Microsoft-specific functions such as _strrev).

You write generic-text-compatible code by using the generic-text data types and functions. An example of reversing and concatenating to a generic-text string follows:

TCHAR *reversedString, *sourceString, *completeString;
reversedString = _tcsrev (sourceString);
completeString = _tcscat (reversedString, __TEXT("suffix"));

When you compile the code without defining any preprocessor symbols, the preprocessor produces this output:

char *reversedString, *sourceString, *completeString;
reversedString = _strrev (sourceString);
completeString = strcat (reversedString, "suffix");

When you compile the code after defining the _UNICODE preprocessor symbol, the preprocessor produces this output:

wchar_t *reversedString, *sourceString, *completeString;
reversedString = _wcsrev (sourceString);
completeString = wcscat (reversedString, L"suffix");

When you compile the code after defining the _MBCS preprocessor symbol, the preprocessor produces this output:

char *reversedString, *sourceString, *completeString;
reversedString = _mbsrev (sourceString);
completeString = _mbscat (reversedString, "suffix");
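
The listings above leave buffer allocation out, and concatenating into a buffer that is too small is a classic bug. Here is a hedged, portable sketch of generic-text code that checks capacity first. Because tchar.h is Microsoft specific, the sketch defines its own two-way mapping: GCHAR, G_TEXT, gcslen, and gcscat are hypothetical names playing the roles of TCHAR, _T, _tcslen, and _tcscat, and the _MBCS case is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical generic-text mapping, modeled on the idea behind tchar.h.
#ifdef _UNICODE
typedef wchar_t GCHAR;
#define G_TEXT(s) L##s
#define gcslen std::wcslen
#define gcscat std::wcscat
#else
typedef char GCHAR;
#define G_TEXT(s) s
#define gcslen std::strlen
#define gcscat std::strcat
#endif

// Appends a fixed suffix to a NUL-terminated string held in a caller-supplied
// buffer of 'capacity' characters. Returns the new length in characters, or 0
// if the result (including the terminating NUL) would not fit.
inline std::size_t AppendSuffix(GCHAR* buffer, std::size_t capacity) {
    assert(buffer != nullptr);
    const GCHAR* suffix = G_TEXT("suffix");
    const std::size_t needed = gcslen(buffer) + gcslen(suffix) + 1;
    if (needed > capacity) return 0;
    gcscat(buffer, suffix);
    return needed - 1;
}
```

Lengths here are counted in characters, not bytes, which is what lets the same arithmetic hold whether GCHAR is char or wchar_t.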

COM Character Data Types

COM uses two character types:

OLECHAR. The character type COM uses on the operating system for which you compile your source code. For Win32 operating systems, this is the wchar_t character type. [3] For Win16 operating systems, this is the char character type. For the Mac OS, this is the char character type. For the Solaris OS, this is the wchar_t character type. For the as yet unknown operating system, this is who knows what. Let’s just pretend there is an abstract data type called OLECHAR. COM uses it. Don’t rely on it mapping to any specific underlying data type.
BSTR. A specialized string type some COM components use. A BSTR is a length-prefixed array of OLECHAR characters with numerous special semantics.

Now let’s complicate things a bit. You want to write code for which you can select, at compile time, the type of characters it uses. Therefore, you’re manipulating strictly TCHAR strings internally. You also want to call a COM method and pass it the same strings. You must pass the method either an OLECHAR string or a BSTR string, depending on its signature. The strings your component uses might or might not be in the correct character format, depending on your compilation options. This is a job for Supermacro!

ATL String-Conversion Classes

ATL provides a number of string-conversion classes that convert, when necessary, among the various character types described previously. The classes perform no conversion and, in fact, do nothing, when the compilation options make the source and destination character types identical. Six different classes in atlconv.h implement the real conversion logic, but this header also uses a number of typedefs and preprocessor #define statements to make using these converter classes syntactically more convenient.

These class names use a number of abbreviations for the various character data types:

T represents a pointer to the Win32 TCHAR character type; an LPTSTR parameter.
W represents a pointer to the Unicode wchar_t character type; an LPWSTR parameter.
A represents a pointer to the MBCS/ANSI char character type; an LPSTR parameter.
OLE represents a pointer to the COM OLECHAR character type; an LPOLESTR parameter.
C represents the C/C++ const modifier.

All class names use the form C<source-abbreviation>2<destination-abbreviation>. For example, the CA2W class converts an LPSTR to an LPWSTR. When there is a C in the name (not including the first C, which stands for “class”), add a const modification to the following abbreviation; for example, the CT2CW class converts an LPTSTR to an LPCWSTR.

The actual class behavior depends on which preprocessor symbols you define (see Table 2.1). Note that the ATL conversion classes and macros treat OLE and W as equivalent.

Table 2.1. Character Set Preprocessor Symbols

Preprocessor Symbol Defined	`T` Becomes…	`OLE` Becomes…
None	`A`	`W`
_UNICODE	`W`	`W`

Table 2.2 lists the ATL string-conversion classes.

Table 2.2. ATL String-Conversion Classes

`CA2W`	`COLE2T`	`CT2CA`	`CT2W`	`CW2T`
`CA2WEX`	`COLE2TEX`	`CT2CAEX`	`CT2WEX`	`CW2TEX`
`CA2T`	`COLE2CT`	`CT2OLE`	`CT2CW`	`CW2CT`
`CA2TEX`	`COLE2CTEX`	`CT2OLEEX`	`CT2CWEX`	`CW2CTEX`
`CA2CT`	`CT2A`	`CT2COLE`	`CW2A`
`CA2CTEX`	`CT2AEX`	`CT2COLEEX`	`CW2AEX`

As you can see, no BSTR conversion classes are listed in Table 2.2. The next section of this chapter introduces the CComBSTR class as the preferred mechanism for dealing with BSTR-type conversions.

When you look inside the atlconv.h header file, you’ll see that many of the definitions distill down to a fairly small set of six actual classes. For instance, when _UNICODE is defined, CT2A becomes CW2A, which is itself typedef’d to the CW2AEX template class. The type definition merely applies the default template parameters to CW2AEX. Additionally, all the previous class names always map OLE to W, so COLE2T becomes CW2T, which is defined as CW2W under Unicode builds. Because the source and destination types for CW2W are the same, this class performs no conversions. Ultimately, the only six classes defined are the template classes CA2AEX, CA2CAEX, CA2WEX, CW2AEX, CW2CWEX, and CW2WEX. Only CA2WEX and CW2AEX have different source and destination types, so these are the only two classes doing any real work. Thus, our expansive list of conversion classes in Table 2.2 has distilled down to only two interesting ones. These two classes are both defined and implemented similarly, so we look at only CA2WEX to glean an understanding of how they both work.

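template< int t_nBufferLength = 128 >
class CA2WEX {
public:
    CA2WEX( LPCSTR psz );                  // converts using the default ANSI code page
    CA2WEX( LPCSTR psz, UINT nCodePage );  // converts using the specified code page
    ...
public:
    LPWSTR m_psz;                          // points at m_szBuffer or at heap storage
    wchar_t m_szBuffer[t_nBufferLength];   // fixed-size buffer for short strings
    ...
};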

The class definition is actually pretty simple. The template parameter specifies the size, in characters, of a fixed-size buffer embedded in the object to hold the string data. This means that most string-conversion operations can be performed without allocating any dynamic storage. If the string to convert doesn’t fit in that buffer, CA2WEX uses malloc to allocate additional storage.
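
The buffer strategy just described can be sketched in portable C++. This is not the ATL implementation: WidenSketch and UsesHeap are hypothetical names, and the sketch widens each byte as-is (correct only for 7-bit ASCII) where the real class calls MultiByteToWideChar.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Illustrative stand-in for the CA2WEX buffer strategy: short strings live in
// a member buffer, longer ones fall back to the heap.
template <int t_nBufferLength = 128>
class WidenSketch {
public:
    explicit WidenSketch(const char* psz) : m_psz(m_szBuffer) {
        assert(psz != nullptr);
        const std::size_t nChars = std::strlen(psz) + 1;   // include the NUL
        if (nChars > static_cast<std::size_t>(t_nBufferLength)) {
            // Too long for the member buffer: allocate from the heap instead.
            m_psz = static_cast<wchar_t*>(std::malloc(nChars * sizeof(wchar_t)));
            if (m_psz == nullptr) throw std::bad_alloc();
        }
        for (std::size_t i = 0; i < nChars; ++i)
            m_psz[i] = static_cast<unsigned char>(psz[i]);  // ASCII-only widening
    }

    ~WidenSketch() {
        if (m_psz != m_szBuffer) std::free(m_psz);   // free only heap storage
    }

    // The object owns its buffer, so copying is disabled in this sketch.
    WidenSketch(const WidenSketch&) = delete;
    WidenSketch& operator=(const WidenSketch&) = delete;

    operator wchar_t*() const { return m_psz; }
    bool UsesHeap() const { return m_psz != m_szBuffer; }

private:
    wchar_t* m_psz;
    wchar_t  m_szBuffer[t_nBufferLength];
};
```

The trade-off is the usual one for a small-buffer design: a larger template argument avoids more heap allocations but makes every converter object, typically a temporary on the stack, that much bigger.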

Two constructors are provided for CA2WEX. The first constructor accepts an LPCSTR and uses the Win32 API function MultiByteToWideChar to perform the conversion. By default, the class uses the ANSI code page for the current thread’s locale to perform the conversion. The second constructor can be used to specify an alternate code page that governs how the conversion is performed. This value is passed directly to MultiByteToWideChar, so see the online documentation for details on code pages accepted by the various Win32 character conversion functions.

The simplest way to use this converter class is to accept the default value for the buffer size parameter. Thus, ATL provides a simple typedef to facilitate this:

typedef CA2WEX<> CA2W;

To use this converter class, you need to write only simple code such as the following:

void PutName (LPCWSTR lpwszName);

void RegisterName (LPCSTR lpsz) {
    PutName (CA2W(lpsz));
}

Two other use cases are also common in practice:

Receiving a generic-text string and passing to a method that expects an OLESTR as input
Receiving an OLESTR and passing it to a method that expects a generic-text string

The conversion classes are easily employed to deal with these cases:

void PutAddress(LPOLESTR lpszAddress);

void RegisterAddress(LPTSTR lpsz) {
    PutAddress(CT2OLE(lpsz));
}

void PutNickName(LPTSTR lpszName);

void RegisterNickName(LPOLESTR lpsz) {
    PutNickName(COLE2T(lpsz));
}

A Note on Memory Management

As convenient as the conversion classes are, you can run into some nasty pitfalls if you use them incorrectly. The conversion classes allocate the memory for the converted text automatically and clean it up in the class destructor. This is useful because you don’t have to worry about buffer management. However, it also means that code like this is a crash waiting to happen:

LPOLESTR ConvertString(LPTSTR lpsz) {
    return CT2OLE(lpsz);
}

You’ve just returned either a pointer to the stack of the called function (which is trashed when the function returns) if the string was short, or a pointer to an array on the heap that will be deallocated before the function returns.

The worst part is that, depending on your macro selection, the code might work just fine but will crash when you switch from ANSI to Unicode for the first time (usually two days before ship). To avoid this, make sure that you copy the converted string to a separate buffer (or use a string class) first if you need it for more than a single expression.
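
One way to follow that advice is to return a string object that copies the characters while the temporary converter is still alive. The sketch below is portable and deliberately simplified: WidenTemp is a hypothetical stand-in for a conversion class (ASCII-only, fixed capacity), and std::wstring plays the role of the "string class."

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Minimal stand-in for a conversion class: it owns its buffer, and the buffer
// dies with the object. Input longer than the buffer is truncated here, which
// a real converter would not do.
class WidenTemp {
public:
    explicit WidenTemp(const char* psz) {
        assert(psz != nullptr);
        std::size_t i = 0;
        for (; psz[i] != '\0' && i + 1 < kCapacity; ++i)
            m_buffer[i] = static_cast<unsigned char>(psz[i]);
        m_buffer[i] = L'\0';
    }
    operator const wchar_t*() const { return m_buffer; }

private:
    static const std::size_t kCapacity = 64;
    wchar_t m_buffer[kCapacity];
};

// Safe: the std::wstring constructor copies the characters before the
// temporary WidenTemp is destroyed at the end of the full expression, and the
// caller receives the copy rather than a pointer into the temporary.
inline std::wstring ConvertStringSafe(const char* psz) {
    return std::wstring(static_cast<const wchar_t*>(WidenTemp(psz)));
}
```

Returning by value costs a copy, but it makes ownership unambiguous: the caller's string lives as long as the caller needs it.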

ATL String-Helper Functions

Sometimes you want to copy a string of OLECHAR characters. You also happen to know that OLECHAR characters are wide characters on the Win32 operating system. When writing a Win32 version of your component, you might call the Win32 operating system function lstrcpyW, which copies wide characters. Unfortunately, Windows NT/2000, which supports Unicode, implements lstrcpyW, but Windows 95 does not. A component that uses the lstrcpyW API doesn’t work correctly on Windows 95.

Instead of lstrcpyW, use the ATL string-helper function ocscpy to copy an OLECHAR character string. It works properly on both Windows NT/2000 and Windows 95. The ATL string-helper function ocslen returns the length of an OLECHAR string. This is nice for symmetry, although the lstrlenW function it replaces does work on both operating systems.

OLECHAR* ocscpy(LPOLESTR dest, LPCOLESTR src);
size_t ocslen(LPCOLESTR s);

Similarly, the Win32 CharNextW operating system function doesn’t work on Windows 95, so ATL provides a CharNextO string-helper function that increments an OLECHAR* by one character and returns the next character pointer. It does not increment the pointer beyond a NUL termination character.

LPOLESTR CharNextO(LPCOLESTR lp);
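
To show what these helpers amount to, here is a portable sketch of equivalent functions. It assumes only the behavior described above and is not the ATL source; the Sketch-prefixed names and the OLECHAR_T typedef are hypothetical, and the sketch treats every wchar_t as one character.

```cpp
#include <cassert>
#include <cstddef>

typedef wchar_t OLECHAR_T;   // assumption: OLECHAR is a wide character here

// Counts characters up to, but not including, the terminating NUL.
inline std::size_t SketchOcsLen(const OLECHAR_T* s) {
    assert(s != nullptr);
    const OLECHAR_T* end = s;
    while (*end != L'\0') ++end;
    return static_cast<std::size_t>(end - s);
}

// Copies src, including its terminating NUL, into dest and returns dest.
// The caller must supply a destination that is large enough.
inline OLECHAR_T* SketchOcsCpy(OLECHAR_T* dest, const OLECHAR_T* src) {
    assert(dest != nullptr && src != nullptr);
    OLECHAR_T* out = dest;
    while ((*out++ = *src++) != L'\0') {}
    return dest;
}

// Advances by one character but never steps past the terminating NUL.
inline const OLECHAR_T* SketchCharNextO(const OLECHAR_T* p) {
    assert(p != nullptr);
    return (*p != L'\0') ? p + 1 : p;
}
```

Because SketchCharNextO stops at the terminator, a loop that keeps calling it cannot walk off the end of the string.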

ATL String-Conversion Macros

The string-conversion classes discussed previously were introduced in ATL 7. ATL 3 (and code written with ATL 3) used a set of macros instead. In fact, these macros are still in use in the ATL code base. For example, this code is in the atlctl.h header:

STDMETHOD(Help)(LPCOLESTR pszHelpDir) {
    T* pT = static_cast<T*>(this);
    USES_CONVERSION;
    ATLTRACE(atlTraceControls,2,
      _T("IPropertyPageImpl::Help\n"));
    CComBSTR szFullFileName(pszHelpDir);
    CComHeapPtr<OLECHAR>
      pszFileName(LoadStringHelper(pT->m_dwHelpFileID));
    if (pszFileName == NULL)
      return E_OUTOFMEMORY;
    szFullFileName.Append(OLESTR("\\"));
    szFullFileName.Append(pszFileName);
    WinHelp(pT->m_hWnd, OLE2CT(szFullFileName),
        HELP_CONTEXTPOPUP, NULL);
    return S_OK;
}

The macros behave much like the conversion classes, minus the leading C in the macro name. So, to convert from TCHAR to OLECHAR, you use T2OLE(s).

Two major differences arise between the macros and the conversion classes. First, the macros require some local variables to work; the USES_CONVERSION macro is required in any function that uses the conversion macros. (It declares these local variables.) The second difference is the location of the conversion buffer.

In the conversion classes, the buffer is stored either as a member variable on the stack (if the buffer is small) or on the heap (if the buffer is large). The conversion macros always use the stack. They call the runtime function _alloca, which allocates extra space on the local stack.

Although it is fast, _alloca has some serious downsides. The stack space isn’t freed until the function exits, which means that if you do conversion in a loop, you might end up blowing out your stack space. Another nasty problem is that if you use the conversion macros inside a C++ catch block, the _alloca call messes up the exception-tracking information on the stack and you crash. [4]

The ATL team apparently took two swipes at improving the conversion macros. The final solution is the conversion classes. However, a second set of conversion macros exists: the _EX flavor. These are used much like the original conversion macros; you put USES_CONVERSION_EX at the top of the function. The macros have an _EX suffix, as in T2A_EX. The _EX macros are different, however: They take two parameters, not one. The first parameter is the buffer to convert from as usual. The second parameter is a threshold value. If the converted buffer is smaller than this threshold, the memory is allocated via _alloca. If the buffer is larger, it is allocated on the heap instead. So, these macros give you a chance to avoid the stack overflow. (They still won’t help you in a catch block.) The ATL code uses the _EX macros extensively; the previous example is the only one left that still uses the old macros.

We don’t go into the details of either macro set here; the conversion classes are much safer to use and are preferred for new code. We mention them only so that you know what you’re looking at if you see them in older code or the ATL sources themselves.

The CComBSTR Smart BSTR Class

A Review of the COM String Data Type: BSTR

COM is a language-neutral, hardware-architecture-neutral model. Therefore, it needs a language-neutral, hardware-architecture-neutral text data type. COM defines a generic text data type, OLECHAR, that represents the text data COM uses on a specific platform. On most platforms, including all 32-bit Windows platforms, the OLECHAR data type is a typedef for the wchar_t data type. That is, on most platforms, the COM text data type is equivalent to the C/C++ wide-character data type, which contains Unicode characters. On some platforms, such as the 16-bit Windows operating system, OLECHAR is a typedef for the standard C char data type, which contains ANSI characters. Generally, you should define all string parameters used in a COM interface as OLECHAR* arguments.

COM also defines a text data type called BSTR. A BSTR is a length-prefixed string of OLECHAR characters. Most interpretive environments prefer length-prefixed strings for performance reasons. For example, a length-prefixed string does not require time-consuming scans for a NUL character terminator to determine the length of a string. Actually, the NUL-character-terminated string is a language-specific concept that was originally unique to the C/C++ language. The Microsoft Visual Basic interpreter, the Microsoft Java virtual machine, and most scripting languages, such as VBScript and JScript, internally represent a string as a BSTR.

Therefore, when you pass a string to, or receive a string from, a method of an interface defined by a C/C++ component, you’ll often use the OLECHAR* data type. However, if you need to use an interface defined by another language, string parameters will frequently be of the BSTR data type. The BSTR data type has a number of poorly documented semantics, which makes using BSTRs tedious and error prone for C++ developers.

A BSTR has the following attributes:

A BSTR is a pointer to a length-prefixed array of OLECHAR characters.
A BSTR is a pointer data type. It points at the first character in the array. The length prefix is stored as an integer immediately preceding the first character in the array.
The array of characters is NUL character terminated.
The length prefix is in bytes, not characters, and does not include the terminating NUL character.
The array of characters may contain embedded NUL characters.
A BSTR must be allocated and freed using the SysAllocString and SysFreeString family of functions.
A NULL BSTR pointer implies an empty string.
A BSTR is not reference counted; therefore, two references to the same string content must refer to separate BSTRs. In other words, copying a BSTR implies making a duplicate string, not simply copying the pointer.
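
The memory layout these attributes describe can be mimicked in portable C++. The sketch below is only a model for study, not how you allocate a real BSTR (that still requires the SysAllocString family): every Sketch-prefixed name is hypothetical, char16_t stands in for OLECHAR, and the length prefix is assumed to be a 4-byte integer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

typedef char16_t    SketchChar;   // stand-in for OLECHAR (16-bit)
typedef SketchChar* SketchBstr;   // stand-in for BSTR

// Allocates [4-byte length in bytes][nChars characters][NUL] and returns a
// pointer to the first character, not to the start of the block.
inline SketchBstr SketchAllocLen(const SketchChar* src, std::uint32_t nChars) {
    const std::uint32_t nBytes =
        static_cast<std::uint32_t>(nChars * sizeof(SketchChar));
    unsigned char* block = static_cast<unsigned char*>(
        std::malloc(sizeof(std::uint32_t) + nBytes + sizeof(SketchChar)));
    if (block == nullptr) return nullptr;
    std::memcpy(block, &nBytes, sizeof nBytes);          // length prefix
    SketchBstr str =
        reinterpret_cast<SketchBstr>(block + sizeof(std::uint32_t));
    if (src != nullptr) std::memcpy(str, src, nBytes);   // embedded NULs are kept
    else std::memset(str, 0, nBytes);
    str[nChars] = 0;                                     // terminating NUL
    return str;
}

// Length in bytes, read from the prefix; a null pointer is an empty string.
inline std::uint32_t SketchByteLen(SketchBstr str) {
    if (str == nullptr) return 0;
    std::uint32_t nBytes;
    std::memcpy(&nBytes,
                reinterpret_cast<unsigned char*>(str) - sizeof nBytes,
                sizeof nBytes);
    return nBytes;
}

// Length in characters, excluding the terminating NUL.
inline std::uint32_t SketchLen(SketchBstr str) {
    return SketchByteLen(str) / static_cast<std::uint32_t>(sizeof(SketchChar));
}

// Frees the whole block, which starts before the character pointer.
inline void SketchFree(SketchBstr str) {
    if (str != nullptr)
        std::free(reinterpret_cast<unsigned char*>(str) - sizeof(std::uint32_t));
}
```

Note that the length comes from the prefix, not from scanning for a NUL, which is why a string with embedded NUL characters still reports its full length.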

With all these special semantics, it would be useful to encapsulate these details in a reusable class. ATL provides such a class: CComBSTR.

The CComBSTR Class

The CComBSTR class is an ATL utility class that is a useful encapsulation for the COM string data type, BSTR. The atlcomcli.h file contains the definition of the CComBSTR class. The only state maintained by the class is a single public member variable, m_str, of type BSTR.

////////////////////////////////////////////////////
// CComBSTR


class CComBSTR {
public:
  BSTR m_str;
...
} ;

Constructors and Destructor

Eight constructors are available for CComBSTR objects. The default constructor simply initializes the m_str variable to NULL, which is equivalent to a BSTR that represents an empty string. The destructor destroys any BSTR contained in the m_str variable by calling SysFreeString. The SysFreeString function explicitly documents that the function simply returns when the input parameter is NULL so that the destructor can run on an empty object without a problem.

CComBSTR() { m_str = NULL; }
~CComBSTR() { ::SysFreeString(m_str); }

Later in this section, you will learn about numerous convenience methods that the CComBSTR class provides. However, one of the most compelling reasons for using the class is so that the destructor frees the internal BSTR at the appropriate time, so you don’t have to free a BSTR explicitly. This is exceptionally convenient during times such as stack frame unwinding when locating an exception handler.

Probably the most frequently used constructor initializes a CComBSTR object from a pointer to a NUL-character-terminated array of OLECHAR characters or, as it’s more commonly known, an LPCOLESTR.

CComBSTR(LPCOLESTR pSrc) {
    if (pSrc == NULL) m_str = NULL;
    else {
        m_str = ::SysAllocString(pSrc);
        if (m_str == NULL)
                AtlThrow(E_OUTOFMEMORY);
    }
}

You invoke the preceding constructor when you write code such as the following [5]:

CComBSTR str1 (OLESTR ("This is a string of OLECHARs"));

The previous constructor copies characters until it finds the end-of-string NUL character terminator. When you want some lesser number of characters copied, such as the prefix to a string, or when you want to copy from a string that contains embedded NUL characters, you must explicitly specify the number of characters to copy. In this case, use the following constructor:

CComBSTR(int nSize, LPCOLESTR sz);

This constructor creates a BSTR with room for the number of characters specified by nSize; copies the specified number of characters, including any embedded NUL characters, from sz; and then appends a terminating NUL character. When sz is NULL, SysAllocStringLen skips the copy step, creating an uninitialized BSTR of the specified size. You invoke the preceding constructor when you write code such as the following:

// str2 contains "This is a string"
CComBSTR str2 (16, OLESTR ("This is a string of OLECHARs"));

// Allocates an uninitialized BSTR with room for 64 characters
CComBSTR str3 (64, (LPCOLESTR) NULL);

// Allocates an uninitialized BSTR with room for 64 characters
CComBSTR str4 (64);

The CComBSTR class provides a special constructor for the str3 example in the preceding code, which doesn’t require you to provide the NULL argument. The preceding str4 example shows its use. Here’s the constructor:

CComBSTR(int nSize) {
    ...
    m_str = ::SysAllocStringLen(NULL, nSize);
    ...
}

One odd semantic feature of a BSTR is that a NULL pointer is a valid value for an empty BSTR string. For example, Visual Basic considers a NULL BSTR to be equivalent to a pointer to an empty string; that is, a string of zero length in which the first character is the terminating NUL character. To put it symbolically, Visual Basic considers IF p = "", where p is a BSTR set to NULL, to be true. The SysStringLen API properly implements the checks; CComBSTR provides the Length method as a wrapper:

unsigned int Length() const { return ::SysStringLen(m_str); }

You can also use the following copy constructor to create and initialize a CComBSTR object to be equivalent to an already initialized CComBSTR object:

CComBSTR(const CComBSTR& src) {
    m_str = src.Copy();
    ...
}

In the following code, creating the str5 variable invokes the preceding copy constructor to initialize the new object:

CComBSTR str1 (OLESTR("This is a string of OLECHARs")) ;
CComBSTR str5 = str1 ;

Note that the preceding copy constructor calls the Copy method on the source CComBSTR object. The Copy method makes a copy of its string and returns the new BSTR. Because the Copy method allocates the new BSTR using the length of the existing BSTR and copies the string contents for the specified length, the Copy method properly copies a BSTR that contains embedded NUL characters.

BSTR Copy() const {
    if (!*this) { return NULL; }
    return ::SysAllocStringByteLen((char*)m_str,
        ::SysStringByteLen(m_str));
}

Two constructors initialize a CComBSTR object from an LPCSTR string. The single argument constructor expects a NUL-terminated LPCSTR string. The two-argument constructor permits you to specify the length of the LPCSTR string. These two constructors are functionally equivalent to the two previously discussed constructors that accept an LPCOLESTR parameter. The following two constructors expect ANSI characters and create a BSTR that contains the equivalent string in OLECHAR characters:

CComBSTR(LPCSTR pSrc) {
     ...
     m_str = A2WBSTR(pSrc);
     ...
}
CComBSTR(int nSize, LPCSTR sz) {
     ...
     m_str = A2WBSTR(sz, nSize);
     ...
}

The final constructor is an odd one. It takes an argument that is a GUID and produces a string containing the string representation of the GUID.

CComBSTR(REFGUID src);

This constructor is quite useful when building strings used during component registration. In a number of situations, you need to write the string representation of a GUID to the Registry. Some code that uses this constructor follows:

// Define a GUID as a binary constant
static const GUID GUID_Sample = { 0x8a44e110, 0xf134, 0x11d1,
    { 0x96, 0xb1, 0xBA, 0xDB, 0xAD, 0xBA, 0xDB, 0xAD } };

// Convert the binary GUID to its string representation
CComBSTR str6 (GUID_Sample) ;
// str6 contains "{8A44E110-F134-11D1-96B1-BADBADBADBAD}"

Assignment

The CComBSTR class defines three assignment operators. The first one initializes a CComBSTR object using a different CComBSTR object. The second one initializes a CComBSTR object using an LPCOLESTR pointer. The third one initializes the object using an LPCSTR pointer. The following operator=() method initializes one CComBSTR object from another CComBSTR object:

CComBSTR& operator=(const CComBSTR& src) {
    if (m_str != src.m_str) {
        ::SysFreeString(m_str);
        m_str = src.Copy();
        if (!!src && !*this) { AtlThrow(E_OUTOFMEMORY); }
    }
    return *this;
}

Note that this assignment operator uses the Copy method, discussed earlier in this section, to make an exact copy of the specified CComBSTR instance. You invoke this operator when you write code such as the following:

CComBSTR str1 (OLESTR("This is a string of OLECHARs"));
CComBSTR str7 ;

str7 = str1; // str7 contains "This is a string of OLECHARs"
str7 = str7; // This is a NOP. Assignment operator
             // detects this case

The second operator=() method initializes one CComBSTR object from an LPCOLESTR pointer to a NUL-character-terminated string.

CComBSTR& operator=(LPCOLESTR pSrc) {
    if (pSrc != m_str) {
        ::SysFreeString(m_str);
        if (pSrc != NULL) {
            m_str = ::SysAllocString(pSrc);
            if (!*this) { AtlThrow(E_OUTOFMEMORY); }
        } else {
            m_str = NULL;
        }
    }
    return *this;
}

Note that this assignment operator uses the SysAllocString function to allocate a BSTR copy of the specified LPCOLESTR argument. You invoke this operator when you write code such as the following:

CComBSTR str8 ;

str8 = OLESTR ("This is a string of OLECHARs");

It’s quite easy to misuse this assignment operator when you’re dealing with strings that contain embedded NUL characters. For example, the following code demonstrates how to use and misuse this method:

CComBSTR str9 ;
str9 = OLESTR ("This works as expected");

// BSTR bstrInput contains "This is part one\0and here's part two"
CComBSTR str10 ;
str10 = bstrInput; // str10 now contains "This is part one"

To properly handle situations such as this one, you should turn to the AssignBSTR method. This method is implemented very much like operator=(LPCOLESTR), except that it uses SysAllocStringByteLen.

HRESULT AssignBSTR(const BSTR bstrSrc) {
    HRESULT hr = S_OK;
    if (m_str != bstrSrc) {
        ::SysFreeString(m_str);
        if (bstrSrc != NULL) {
            m_str = ::SysAllocStringByteLen((char*)bstrSrc,
                ::SysStringByteLen(bstrSrc));

            if (!*this) { hr = E_OUTOFMEMORY; }
        } else {
            m_str = NULL;
        }
    }


    return hr;
}

You can modify the code as follows:

CComBSTR str9 ;
str9 = OLESTR ("This works as expected");

// BSTR bstrInput contains
// "This is part one\0and here's part two"
CComBSTR str10 ;
str10.AssignBSTR(bstrInput);     // works properly

// str10 now contains "This is part one\0and here's part two"
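
The difference between the two assignment paths comes down to how the source length is determined. Here is a minimal portable sketch of that idea, using std::u16string in place of a BSTR. The helpers copy_by_scan and copy_by_length are hypothetical names introduced only for this illustration; they are not ATL functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Copy the way operator=(LPCOLESTR) behaves: the length is found by
// scanning for the first NUL character.
std::u16string copy_by_scan(const char16_t* src) {
    assert(src != nullptr);
    return std::u16string(src);
}

// Copy the way AssignBSTR behaves: the length comes from the caller
// (for a real BSTR, from the length prefix), so embedded NULs survive.
std::u16string copy_by_length(const char16_t* src, std::size_t len) {
    assert(src != nullptr);
    return std::u16string(src, len);
}
```

Given the 36-character input used above, copy_by_scan keeps only the first 16 characters, whereas copy_by_length keeps all 36, including the embedded NUL at index 16.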

The third operator=() method initializes one CComBSTR object using an LPCSTR pointer to a NUL-character-terminated string. The operator converts the input string, which is in ANSI characters, to a Unicode string; then it creates a BSTR containing the Unicode string.

CComBSTR& operator=(LPCSTR pSrc) {
    ::SysFreeString(m_str);
    m_str = A2WBSTR(pSrc);
    if (!*this && pSrc != NULL) { AtlThrow(E_OUTOFMEMORY); }
    return *this;
}

The final assignment methods are two overloaded methods called LoadString.

bool LoadString(HINSTANCE hInst, UINT nID) ;
bool LoadString(UINT nID) ;

The first loads the specified string resource nID from the specified module hInst (using the instance handle). The second loads the specified string resource nID from the current module using the global variable _AtlBaseModule.

CComBSTR Operations

Four methods give you access, in varying ways, to the internal BSTR string that is encapsulated by the CComBSTR class. The operator BSTR() method enables you to use a CComBSTR object in situations where a raw BSTR pointer is required. You invoke this method any time you cast a CComBSTR object to a BSTR implicitly or explicitly.

operator BSTR() const { return m_str; }

Frequently, you invoke this operator implicitly when you pass a CComBSTR object as a parameter to a function that expects a BSTR. The following code demonstrates this:

HRESULT put_Name (/* [in] */ BSTR pNewValue) ;

CComBSTR bstrName = OLESTR ("Frodo Baggins");
put_Name (bstrName); // Implicit cast to BSTR

The operator&() method returns the address of the internal m_str variable when you take the address of a CComBSTR object. Use care when taking the address of a CComBSTR object. Because the operator&() method returns the address of the internal BSTR variable, you can overwrite the internal variable without first freeing the string. This causes a memory leak. However, if you define the macro ATL_CCOMBSTR_ADDRESS_OF_ASSERT in your project settings, you get an assertion to help catch this error.

#ifndef ATL_CCOMBSTR_ADDRESS_OF_ASSERT
// Temp disable CComBSTR::operator& Assert
#define ATL_NO_CCOMBSTR_ADDRESS_OF_ASSERT
#endif

BSTR* operator&() {
#ifndef ATL_NO_CCOMBSTR_ADDRESS_OF_ASSERT
    ATLASSERT(!*this);
#endif
    return &m_str;
}

This operator is quite useful when you are receiving a BSTR pointer as the output of some method call. You can store the returned BSTR directly into a CComBSTR object so that the object manages the lifetime of the string.

HRESULT get_Name (/* [out] */ BSTR* pName);

CComBSTR bstrName ;
get_Name (&bstrName); // bstrName empty so no memory leak

The CopyTo method makes a duplicate of the string encapsulated by a CComBSTR object and copies the duplicate’s BSTR pointer to the specified location. You must free the returned BSTR explicitly by calling SysFreeString.

HRESULT CopyTo(BSTR* pbstr);

This method is handy when you need to return a copy of an existing BSTR property to a caller. For example:

STDMETHODIMP SomeClass::get_Name (/* [out] */ BSTR* pName) {
  // Name is maintained in variable m_strName of type CComBSTR
  return m_strName.CopyTo (pName);
}

The Detach method returns the BSTR contained by a CComBSTR object. It empties the object so that the destructor will not attempt to release the internal BSTR. You must free the returned BSTR explicitly by calling SysFreeString.

BSTR Detach() { BSTR s = m_str; m_str = NULL; return s; }

You use this method when you have a string in a CComBSTR object that you want to return to a caller and you no longer need to keep the string. In this situation, using the CopyTo method would be less efficient because you would make a copy of a string, return the copy, and then discard the original string. Use Detach as follows to return the original string directly:

STDMETHODIMP SomeClass::get_Label (/* [out] */ BSTR* pName) {
  CComBSTR strLabel;
  // Generate the returned string in strLabel here
  *pName = strLabel.Detach ();
  return S_OK;
}

The Attach method performs the inverse operation. It attaches a BSTR to an empty CComBSTR object. Ownership of the BSTR now resides with the CComBSTR object, and the object’s destructor will eventually free the string. Note that if the CComBSTR already contains a string, it releases the string before it takes control of the new BSTR.

void Attach(BSTR src) {
    if (m_str != src) {
        ::SysFreeString(m_str);
        m_str = src;
    }
}

Use care when using the Attach method. You must have ownership of the BSTR you are attaching to a CComBSTR object because eventually the object will attempt to destroy the BSTR. For example, the following code is incorrect:

STDMETHODIMP SomeClass::put_Name (/* [in] */ BSTR bstrName) {
  // Name is maintained in variable m_strName of type CComBSTR
  m_strName.Attach (bstrName); // Wrong! We don't own bstrName
  return E_BONEHEAD;
}

More often, you use Attach when you’re given ownership of a BSTR and you want a CComBSTR object to manage the lifetime of the string.

STDMETHODIMP SomeClass::get_Name (/* [out] */ BSTR* pName);
...
BSTR bstrName;
pObj->get_Name (&bstrName); // We own and must free the raw BSTR

CComBSTR strName;
strName.Attach(bstrName); // Attach raw BSTR to the object
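
The ownership rules behind CopyTo, Detach, Attach, and Empty are not specific to BSTRs. The following sketch restates them with an ordinary malloc-allocated C string so that it compiles anywhere. OwnedStr and make_owned are hypothetical names, and std::free plays the role that SysFreeString plays for CComBSTR; this is an illustration of the pattern, not ATL's code.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical owning wrapper over a malloc'd C string.
class OwnedStr {
public:
    OwnedStr() : m_p(nullptr) {}
    ~OwnedStr() { std::free(m_p); }              // free(nullptr) is a no-op
    OwnedStr(const OwnedStr&) = delete;          // keep the sketch single-owner
    OwnedStr& operator=(const OwnedStr&) = delete;

    // Take ownership of p, releasing any string already held.
    void Attach(char* p) {
        if (m_p != p) {
            std::free(m_p);
            m_p = p;
        }
    }

    // Give up ownership; the caller must now free the result.
    char* Detach() {
        char* p = m_p;
        m_p = nullptr;
        return p;
    }

    // Hand out a duplicate; the wrapper keeps its own string.
    char* Copy() const {
        if (m_p == nullptr) return nullptr;
        const std::size_t n = std::strlen(m_p) + 1;
        char* dup = static_cast<char*>(std::malloc(n));
        if (dup != nullptr) std::memcpy(dup, m_p, n);
        return dup;
    }

    void Empty() { std::free(m_p); m_p = nullptr; }
    bool IsEmpty() const { return m_p == nullptr; }
    const char* Get() const { return m_p; }

private:
    char* m_p;
};

// Hypothetical helper that allocates a string the caller owns.
char* make_owned(const char* text) {
    assert(text != nullptr);
    const std::size_t n = std::strlen(text) + 1;
    char* p = static_cast<char*>(std::malloc(n));
    if (p != nullptr) std::memcpy(p, text, n);
    return p;
}
```

After s.Attach(make_owned("Frodo")), the wrapper frees the string when it goes out of scope. Calling s.Detach() instead hands the pointer back and leaves the wrapper empty, so exactly one party is responsible for the string at any moment.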

You can explicitly free the string encapsulated in a CComBSTR object by calling Empty. The Empty method releases any internal BSTR and sets the m_str member variable to NULL. The SysFreeString function explicitly documents that the function simply returns when the input parameter is NULL so that you can call Empty on an empty object without a problem.

void Empty() { ::SysFreeString(m_str); m_str = NULL; }

CComBSTR supplies two additional interesting methods. These methods enable you to convert BSTR strings to and from SAFEARRAYs, which might be useful for converting to and from string data to adapt to a specific method signature. Chapter 3, “ATL Smart Types,” presents a smart class for handling SAFEARRAYs.

HRESULT BSTRToArray(LPSAFEARRAY *ppArray) {
    return VectorFromBstr(m_str, ppArray);
}

HRESULT ArrayToBSTR(const SAFEARRAY *pSrc) {
    ::SysFreeString(m_str);
    return BstrFromVector((LPSAFEARRAY)pSrc, &m_str);
}

As you can see, these methods merely serve as thin wrappers for the Win32 functions VectorFromBstr and BstrFromVector. BSTRToArray assigns each character of the encapsulated string to an element of a one-dimensional SAFEARRAY provided by the caller. Note that the caller is responsible for freeing the SAFEARRAY. ArrayToBSTR does just the opposite: It accepts a pointer to a one-dimensional SAFEARRAY and builds a BSTR in which each element of the SAFEARRAY becomes a character in the internal BSTR. CComBSTR frees the encapsulated BSTR before overwriting it with the values from the SAFEARRAY. ArrayToBSTR accepts only SAFEARRAYs that contain char type elements; otherwise, the function returns a type mismatch error.

String Concatenation Using CComBSTR

Eight methods concatenate a specified string with a CComBSTR object: six overloaded Append methods, one AppendBSTR method, and the operator+=() method.

HRESULT Append(LPCOLESTR lpsz, int nLen);
HRESULT Append(LPCOLESTR lpsz);
HRESULT Append(LPCSTR);
HRESULT Append(char ch);
HRESULT Append(wchar_t ch);

HRESULT Append(const CComBSTR& bstrSrc);
CComBSTR& operator+=(const CComBSTR& bstrSrc);

HRESULT AppendBSTR(BSTR p);

The Append(LPCOLESTR lpsz, int nLen) method computes the sum of the length of the current string plus the specified nLen value, and allocates an empty BSTR of the correct size. It copies the original string into the new BSTR and then concatenates nLen characters of the lpsz string onto the end of the new BSTR. Finally, it frees the original string and replaces it with the new BSTR.

CComBSTR strSentence = OLESTR("Now is ");
strSentence.Append(OLESTR("the time of day is 03:00 PM"), 9);
// strSentence contains "Now is the time "
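
The four steps just described (size, allocate, copy, replace) can be sketched in portable C++. This is an illustration only: Buf and append_n are hypothetical names, and malloc and free stand in for the BSTR allocation functions that the real Append method uses.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical length-tracked buffer standing in for the BSTR in m_str.
struct Buf {
    char16_t* p = nullptr;   // owned, NUL-terminated
    std::size_t len = 0;     // characters, excluding the terminator
};

// Append n characters of src to buf. Returns false and leaves buf
// untouched when the allocation fails.
bool append_n(Buf& buf, const char16_t* src, std::size_t n) {
    assert(src != nullptr || n == 0);
    const std::size_t total = buf.len + n;             // 1. combined length
    auto* fresh = static_cast<char16_t*>(
        std::malloc((total + 1) * sizeof(char16_t)));  // 2. new allocation
    if (fresh == nullptr) return false;
    if (buf.len != 0)                                  // 3a. copy the original
        std::memcpy(fresh, buf.p, buf.len * sizeof(char16_t));
    if (n != 0)                                        // 3b. add n new characters
        std::memcpy(fresh + buf.len, src, n * sizeof(char16_t));
    fresh[total] = 0;
    std::free(buf.p);                                  // 4. replace the original
    buf.p = fresh;
    buf.len = total;
    return true;
}
```

Because the count is explicit, the source string needs no terminator within the first n characters, and any embedded NUL characters in that range are copied like any other character.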

The remaining overloaded Append methods all use the first method to perform the real work. They differ only in the manner in which the method obtains the string and its length. The Append(LPCOLESTR lpsz) method appends the contents of a NUL-character-terminated string of OLECHAR characters. The Append(LPCSTR lpsz) method appends the contents of a NUL-character-terminated string of ANSI characters. Individual characters can be appended using either Append(char ch) or Append(wchar_t ch). The Append(const CComBSTR& bstrSrc) method appends the contents of another CComBSTR object. For notational and syntactic convenience, the operator+=() method also appends the specified CComBSTR to the current string.

CComBSTR str11 (OLESTR("for all good men "));
// calls Append(const CComBSTR& bstrSrc);
strSentence.Append(str11);
// strSentence contains "Now is the time for all good men "
// calls Append(LPCOLESTR lpsz);
strSentence.Append(OLESTR("to come "));
// strSentence contains "Now is the time for all good men to come "
// calls Append(LPCSTR lpsz);
strSentence.Append("to the aid ");
// strSentence contains
// "Now is the time for all good men to come to the aid "

CComBSTR str12 (OLESTR("of their country"));
strSentence += str12; // calls operator+=()
// "Now is the time for all good men to come to
// the aid of their country"

When you call Append using a BSTR parameter, you are actually calling the Append(LPCOLESTR lpsz) method because, to the compiler, the BSTR argument is an OLECHAR* argument. Therefore, the method appends characters from the BSTR until it encounters the first NUL character. When you want to append the contents of a BSTR that possibly contains embedded NUL characters, you must explicitly call the AppendBSTR method.

One additional method exists for appending an array that contains binary data:

HRESULT AppendBytes(const char* lpsz, int nLen);

AppendBytes does not perform a conversion from ANSI to Unicode. The method uses SysAllocStringByteLen to properly allocate a BSTR of nLen bytes (not characters) and append the result to the existing CComBSTR.

You can’t go wrong following these guidelines:

When the parameter is a BSTR, use the AppendBSTR method to append the entire BSTR, regardless of whether it contains embedded NUL characters.
When the parameter is an LPCOLESTR or an LPCSTR, use the Append method to append the NUL-character-terminated string.
So much for function overloading…

Character Case Conversion

The two character case-conversion methods, ToLower and ToUpper, convert the internal string to lowercase or uppercase, respectively. In Unicode builds, the conversion is actually performed in-place using the Win32 CharLowerBuff API. In ANSI builds, the internal character string first is converted to MBCS and then CharLowerBuff is invoked. The resulting string is then converted back to Unicode and stored in a newly allocated BSTR. Any string data stored in m_str is freed using SysFreeString before it is overwritten. When everything works, the new string replaces the original string as the contents of the CComBSTR object.

HRESULT ToLower() {
        if (m_str != NULL) {
#ifdef _UNICODE
               // Convert in place
               CharLowerBuff(m_str, Length());
#else
            UINT _acp = _AtlGetConversionACP();
            ...
            int nRet = WideCharToMultiByte(
                _acp, 0, m_str, Length(),
                pszA, _convert, NULL, NULL);
            ...

            CharLowerBuff(pszA, nRet);

            nRet = MultiByteToWideChar(_acp, 0, pszA, nRet,
                                    pszW, _convert);
                ...

            BSTR b = ::SysAllocStringByteLen(
                (LPCSTR) (LPWSTR) pszW,
                nRet * sizeof(OLECHAR));
            if (b == NULL)
                    return E_OUTOFMEMORY;
            SysFreeString(m_str);
            m_str = b;
#endif

            }
            return S_OK;
}

Note that these methods properly do case conversion even when the original string contains embedded NUL characters. Also note, however, that the conversion is potentially lossy, in the sense that it cannot convert a character when the local code page doesn’t contain a character equivalent to the original Unicode character.

CComBSTR Comparison Operators

The simplest comparison operator is operator!(). It returns true when the CComBSTR object is empty, and false otherwise.

bool operator!() const { return (m_str == NULL); }

There are four overloaded versions of the operator<() methods, four of the operator>() methods, and five of the operator==() and operator!=() methods. The additional overload for operator==() simply handles the special case of comparison to NULL. The code in all these methods is nearly the same, so I discuss only the operator<() methods; the comments apply equally to the operator>() and operator==() methods.

These operators internally use the VarBstrCmp function, so unlike previous versions of ATL that did not properly compare two CComBSTRs that contain embedded NUL characters, these new operators handle the comparison correctly most of the time. So, the following code works as expected. Later in this section, I discuss properly initializing CComBSTR objects with embedded NULs.

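// The lengths count every character, including the embedded NUL,
// but not the literal's own terminator
BSTR bstrIn1 =
    SysAllocStringLen(
        OLESTR("Here's part 1\0and here's part 2"), 31);
BSTR bstrIn2 =
    SysAllocStringLen(
        OLESTR("Here's part 1\0and here is part 2"), 32);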

CComBSTR bstr1(::SysStringLen(bstrIn1), bstrIn1);
CComBSTR bstr2(::SysStringLen(bstrIn2), bstrIn2);

bool b = bstr1 == bstr2; // correctly returns false

In the first overloaded version of the operator<() method, the operator compares against a provided CComBSTR argument.

bool operator<(const CComBSTR& bstrSrc) const {
    return VarBstrCmp(m_str, bstrSrc.m_str,
        LOCALE_USER_DEFAULT, 0) ==
    VARCMP_LT;
}

In the second overloaded version of the operator<() method, the operator compares against a provided LPCSTR argument. An LPCSTR isn’t the same character type as the internal BSTR string, which contains wide characters. Therefore, the method constructs a temporary CComBSTR and delegates the work to operator<(const CComBSTR& bstrSrc), just shown.

bool operator<(LPCSTR pszSrc) const {
    CComBSTR bstr2(pszSrc);
    return operator<(bstr2);
}

The third overload for the operator<() method accepts an LPCOLESTR and operates very much like the previous overload:

bool operator<(LPCOLESTR pszSrc) const {
    CComBSTR bstr2(pszSrc);
    return operator<(bstr2);
}

The fourth overload for the operator<() method accepts an LPOLESTR; the implementation does a quick cast and calls the LPCOLESTR version to do the work:

bool operator<(LPOLESTR pszSrc) const {
    return operator<((LPCOLESTR)pszSrc);
}

CComBSTR Persistence Support

The last two methods of the CComBSTR class read and write a BSTR string to and from a stream. The WriteToStream method writes a ULONG count containing the number of bytes in the BSTR to a stream. It writes the BSTR characters to the stream immediately following the count. Note that the method does not tag the stream with an indication of the byte order used to write the data. Therefore, as is frequently the case for stream data, a CComBSTR object writes its string to the stream in a hardware-architecture-specific format.

HRESULT WriteToStream(IStream* pStream) {
    ATLASSERT(pStream != NULL);
    if(pStream == NULL)
        return E_INVALIDARG;

    ULONG cb;
    ULONG cbStrLen = ULONG(m_str ?
        SysStringByteLen(m_str)+sizeof(OLECHAR) : 0);
    HRESULT hr = pStream->Write((void*) &cbStrLen,
        sizeof(cbStrLen), &cb);
    if (FAILED(hr))
        return hr;
    return cbStrLen ?
        pStream->Write((void*) m_str, cbStrLen, &cb) :
        S_OK;
}

The ReadFromStream method reads a ULONG count of bytes from the specified stream, allocates a BSTR of the correct size, and then reads the characters directly into the BSTR string. The CComBSTR object must be empty when you call ReadFromStream; otherwise, you will receive an assertion from a debug build or will leak memory in a release build.

HRESULT ReadFromStream(IStream* pStream) {
    ATLASSERT(pStream != NULL);
    ATLASSERT(!*this); // should be empty
    ULONG cbStrLen = 0;
    HRESULT hr = pStream->Read((void*) &cbStrLen,
        sizeof(cbStrLen), NULL);
    if ((hr == S_OK) && (cbStrLen != 0)) {
        //subtract size for terminating NULL which we wrote out
        //since SysAllocStringByteLen overallocates for the NULL
        m_str = SysAllocStringByteLen(NULL,
            cbStrLen-sizeof(OLECHAR));
        if (!*this) hr = E_OUTOFMEMORY;
        else hr = pStream->Read((void*) m_str, cbStrLen, NULL);
        ...
    }
    if (hr == S_FALSE) hr = E_FAIL;
    return hr;
}
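
The stream format these two methods share is simple enough to mimic in portable C++: a 32-bit byte count that includes the terminating NUL, followed by the characters and the terminator, all in native byte order. In the sketch below, Stream, write_string, and read_string are hypothetical names, and a std::vector of bytes stands in for IStream. Unlike CComBSTR, a std::u16string has no NULL state, so this sketch always writes the terminator and never writes a zero count.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

using Stream = std::vector<unsigned char>;   // hypothetical stand-in for IStream

// Write the byte count (characters plus terminating NUL), then the
// characters and terminator, in the host's native byte order.
void write_string(Stream& out, const std::u16string& s) {
    const std::uint32_t cb =
        static_cast<std::uint32_t>((s.size() + 1) * sizeof(char16_t));
    const auto* pcb = reinterpret_cast<const unsigned char*>(&cb);
    out.insert(out.end(), pcb, pcb + sizeof cb);
    const auto* pch = reinterpret_cast<const unsigned char*>(s.c_str());
    out.insert(out.end(), pch, pch + cb);
}

// Read one string starting at pos and advance pos past it.
// Returns false on a truncated or malformed stream.
bool read_string(const Stream& in, std::size_t& pos, std::u16string& s) {
    assert(pos <= in.size());
    std::uint32_t cb = 0;
    if (in.size() - pos < sizeof cb) return false;
    std::memcpy(&cb, in.data() + pos, sizeof cb);
    if (cb < sizeof(char16_t) || cb % sizeof(char16_t) != 0) return false;
    if (in.size() - pos - sizeof cb < cb) return false;
    const std::size_t nchars = cb / sizeof(char16_t) - 1;  // drop the terminator
    s.assign(nchars, u'\0');
    if (nchars != 0)
        std::memcpy(&s[0], in.data() + pos + sizeof cb,
                    nchars * sizeof(char16_t));
    pos += sizeof cb + cb;
    return true;
}
```

Because the count travels with the data, a string containing embedded NUL characters round-trips intact. The byte-order caveat applies to the sketch as well: a stream written on a little-endian machine will not read back correctly on a big-endian one.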

Minor Rant on BSTRs, Embedded NUL Characters in Strings, and Life in General

The compiler considers the types BSTR and OLECHAR* to be synonymous. In fact, the BSTR symbol is simply a typedef for OLECHAR*. For example, from wtypes.h:

typedef /* [wire_marshal] */ OLECHAR __RPC_FAR *BSTR;

This is more than somewhat brain damaged. An arbitrary BSTR is not an OLECHAR*, and an arbitrary OLECHAR* is not a BSTR. One is often misled in this regard because frequently a BSTR works just fine as an OLECHAR*.

STDMETHODIMP SomeClass::put_Name (LPCOLESTR pName) ;

BSTR bstrInput = ...
pObj->put_Name (bstrInput) ; // This works just fine... usually
SysFreeString (bstrInput) ;

In the previous example, because the bstrInput argument is defined to be a BSTR, it can contain embedded NUL characters within the string. The put_Name method, which expects an LPCOLESTR (a NUL-character-terminated string), will probably save only the characters preceding the first embedded NUL character. In other words, it will cut the string short.

You also cannot use a BSTR where an [out] OLECHAR* parameter is required. For example:

STDMETHODIMP SomeClass::get_Name(OLECHAR** ppName) {
  BSTR bstrOutput =... // Produce BSTR string to return
  *ppName = bstrOutput ; // This compiles just fine
  return S_OK ;          // but leaks memory as caller
                         // doesn't release BSTR
}

Conversely, you cannot use an OLECHAR* where a BSTR is required. When it does happen to work, it’s a latent bug. For example, the following code is incorrect:

STDMETHODIMP SomeClass::put_Name (BSTR bstrName) ;
// Wrong! Wrong! Wrong!
pObj->put_Name (OLESTR("This is not a BSTR!")) ;

If the put_Name method calls SysStringLen to obtain the length of the BSTR, it will try to get the length from the integer preceding the string, but there is no such integer. Things get worse if the put_Name method is remoted (that is, lives out-of-process). In this case, the marshaling code will call SysStringLen to obtain the number of characters to place in the request packet. This is usually a huge number (4 bytes from the preceding string in the literal pool, in this example) and often causes a crash while trying to copy the string.

Because the compiler cannot tell the difference between a BSTR and an OLECHAR*, it’s quite easy to accidentally call a method in CComBSTR that doesn’t work correctly when you are using a BSTR that contains embedded NUL characters. The following discussion shows exactly which methods you must use for these kinds of BSTRs.

To construct a CComBSTR, you must specify the length of the string:

BSTR bstrInput =
  SysAllocStringLen (
    OLESTR ("This is part one\0and here's part two"),
    36) ;

CComBSTR str8 (bstrInput) ; // Wrong! Unexpected behavior here
                            // Note: str8 contains only
                            // "This is part one"

CComBSTR str9 (::SysStringLen (bstrInput),
    bstrInput); // Correct!
// str9 contains "This is part one\0and here's part two"

Assigning a BSTR that contains embedded NUL characters to a CComBSTR object never works. For example:

// BSTR bstrInput contains
// "This is part one\0and here's part two"
CComBSTR str10;
str10 = bstrInput; // Wrong! Unexpected behavior here
                   // str10 now contains "This is part one"

The easiest way to perform an assignment of a BSTR is to use the Empty and AppendBSTR methods:

str10.Empty();                // Ensure object is initially empty
str10.AppendBSTR (bstrInput); // This works!

In practice, although a BSTR can potentially contain embedded NUL characters, most of the time it doesn’t. Of course, this means that, most of the time, you don’t see the latent bugs caused by incorrect BSTR use.

The CString Class

CString Overview

For years now, ATL programmers have glared longingly over the shoulders of their MFC brethren slinging character data about in their programs with the grace and dexterity of Baryshnikov himself. MFC developers have long enjoyed the ubiquitous CString class provided with the library; so much so that when they ventured into previous versions of ATL, they often found themselves tempted to check that wizard option named Support MFC and suck in a 1MB library just to allow them to continue working with their bread-’n’-butter string class. Sure, ATL programmers have CComBSTR, which is fine for code at the “edges” of a method’s implementation; that is, either receiving a BSTR input parameter at the beginning of a method or returning some sort of BSTR output parameter at the end of a method. But compared to CString’s extensive support for everything from sprintf-style formatting to search-and-replace, CComBSTR is woefully inadequate for any serious string processing. And, sure, ATL programmers have had STL’s string<> template class for years, but it also falls short of CString in functionality. In addition, because it is a standard, platform-independent class, it can’t possibly provide such useful functionality as integrating with the Windows resource architecture.

Well, the long wait is over: CString is available as of ATL 7. In fact, CString is a shared class between MFC and ATL, along with a number of other classes. You’ll note that there are no longer separate \MFC\Include and \ATL\Include directories within the Visual Studio file hierarchy. Instead, both libraries maintain code in \ATLMFC\Include. I think it’s extraordinarily insightful to examine just how and where the shared CString class is defined. First, all the header files are under a directory named \ATLMFC, not \MFCATL. CString used to be defined in afx.h, the prefix that has identified MFC from its earliest beginnings. Now the definition appears in a file that simply defines CString as a typedef to a template class called CStringT that does all the work. This template class is actually in the ATL namespace. That’s right: one of the last bastions of MFC supremacy is now found under the ATL moniker.

CString Anatomy

Now that CString is template-based, it follows the general ATL design pattern of supporting pluggable functionality through template parameters that specialize CString behavior. As the first sections of this chapter revealed, a number of different types of strings exist, with different mechanisms for manipulating them. Templates are very well suited to this kind of scenario, in which exposing flexibility is important. But usability is also important, so ATL uses a convenient combination of typedefs and default template parameters to simplify using CString.

Understanding what’s under the covers of a CString instance is important in understanding not only how the methods and operators work, but also how CString can be extended and specialized to fit particular requirements or to facilitate certain optimizations. When you declare an instance of CString, you are actually instantiating a template class called CStringT. The file atlstr.h provides typedefs for CString, as well as for the ANSI and Unicode versions, CStringA and CStringW, respectively.

typedef CStringT< wchar_t, StrTraitATL<
    wchar_t, ChTraitsCRT< wchar_t > > >
    CAtlStringW;
typedef CStringT< char, StrTraitATL<
    char, ChTraitsCRT< char > > >
    CAtlStringA;
typedef CStringT< TCHAR, StrTraitATL<
    TCHAR, ChTraitsCRT< TCHAR > > >
    CAtlString;

typedef CAtlStringW CStringW;
typedef CAtlStringA CStringA;
typedef CAtlString CString;

Strictly speaking, these typedefs are generated only if the ATL project is linking to the CRT, which ATL projects now do by default. Otherwise, the ChTraitsCRT template class is not used as a parameter to CStringT because it relies upon CRT functions to manage character-level manipulation.

Because the CStringT template class is the underlying class doing all the work, the remainder of the discussion is in terms of CStringT. This class is defined in cstringt.h as follows:

template< typename BaseType, class StringTraits >
class CStringT :
    public CSimpleStringT< BaseType > {
     // ...
};

The behavior of the CStringT class is governed largely by three things: 1) the CSimpleStringT base class, 2) the BaseType template parameter, and 3) the StringTraits template parameter. CSimpleStringT provides a lot of basic string functionality that CStringT inherits. The BaseType template parameter is used to establish the underlying character data type of the string. The only state CStringT holds is a pointer to a character string of the type BaseType. This data is held in the m_pszData private member defined in the CSimpleStringT base class. The StringTraits parameter is an interesting one. This parameter establishes three things: 1) the module from which resource strings will be loaded, 2) the string manager used to allocate string data, and 3) the class that will provide low-level character manipulation. The atlstr.h header file contains the definition of StrTraitATL, the traits class that the CString typedefs shown earlier pass as this parameter.

template< typename _BaseType = char, class StringIterator =
                                        ChTraitsOS< _BaseType > >
class StrTraitATL : public StringIterator {
public:
    static HINSTANCE FindStringResourceInstance(UINT nID) {
        return( AtlFindStringResourceInstance( nID ) );
    }

    static IAtlStringMgr* GetDefaultManager() {
        return( &g_strmgr );
    }
};

StrTraitATL derives from the StringIterator template parameter passed in. This parameter implements low-level character operations that CStringT ultimately will invoke when application code calls methods on instances of CString. Two choices of ATL-provided classes encapsulate the character traits: ChTraitsCRT and ChTraitsOS. The former uses functions that require you to link to the CRT in your project, so you would use it if you were already linking to the CRT. The latter does not require the CRT to implement its character-manipulation functions. Both expose a common set of functions that CStringT uses in its internal implementation.
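
The layering just described can be modeled in a few lines of portable C++. The following is only a minimal sketch under stated assumptions, not ATL's code: AsciiTraitsSketch, StrTraitSketch, MiniStringSketch, and DefaultManagerName are hypothetical names, and the character operation uses the standard library's toupper rather than any CRT- or OS-specific call.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical stand-in for a character traits class such as ChTraitsCRT:
// it supplies the low-level character operation.
struct AsciiTraitsSketch {
    static void StringUppercase(std::string& s) {
        for (char& ch : s)
            ch = static_cast<char>(
                std::toupper(static_cast<unsigned char>(ch)));
    }
};

// Hypothetical stand-in for StrTraitATL: it derives from the character
// traits passed in, so it exposes their functions plus its own policy.
template <class CharTraits>
struct StrTraitSketch : CharTraits {
    static const char* DefaultManagerName() { return "default"; }
};

// Hypothetical stand-in for CStringT: character-level work is forwarded
// to whatever traits class the template parameter names.
template <class StringTraits>
class MiniStringSketch {
public:
    explicit MiniStringSketch(const std::string& s) : m_data(s) {}
    MiniStringSketch& MakeUpper() {
        StringTraits::StringUppercase(m_data);
        return *this;
    }
    const std::string& Get() const { return m_data; }
private:
    std::string m_data;
};

typedef MiniStringSketch< StrTraitSketch<AsciiTraitsSketch> > MiniString;

// Usage: the string class never names the character routine directly.
inline void TraitsSketchDemo() {
    MiniString s("atl");
    assert(s.MakeUpper().Get() == "ATL");
}
```

Swapping in a different traits class changes how characters are manipulated without touching the string class itself, which is the point of passing the traits as a template parameter.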

Note that in the definition of StrTraitATL, we see the first evidence of the extensibility of CStringT. The GetDefaultManager method returns a pointer to a string manager via the IAtlStringMgr interface. This interface enforces a generic pattern for managing string memory. atlsimpstr.h provides the definition for this interface.

__interface IAtlStringMgr {
public:
    CStringData* Allocate( int nAllocLength, int nCharSize );
    void Free( CStringData* pData );
    CStringData* Reallocate( CStringData* pData,
        int nAllocLength, int nCharSize );

    CStringData* GetNilString();
    IAtlStringMgr* Clone();
};

ATL supplies a default string manager that is used if the user does not specify another. This default string manager is a concrete class called CAtlStringMgr that implements IAtlStringMgr. Abstracting string management into a separate class enables you to customize the behavior of the string-management functions to suit specific application requirements. Two mechanisms exist for customizing string management for CStringT. The first mechanism involves merely using CAtlStringMgr with a specific memory manager. Chapter 3, “ATL Smart Types,” discusses the IAtlMemMgr interface, a generic interface that encapsulates heap memory management. Associating a memory manager with CAtlStringMgr is as simple as passing a pointer to the memory manager to the CAtlStringMgr constructor. CStringT must be instructed to use this CAtlStringMgr in its internal implementation by passing the string manager pointer to the CStringT constructor. ATL provides five built-in heap managers that implement IAtlMemMgr. We use CWin32Heap to demonstrate how to use an alternate memory manager with CStringT.

// create a private, serialized (thread-safe) heap with zero
// initial size and no max size
// constructor parameters are explained later in this chapter
CWin32Heap heap(0, 0, 0);

// create a string manager that uses this memory manager
CAtlStringMgr strMgr(&heap);

// create a CString instance that uses this string manager
CString str(&strMgr);

// ... perform some string operations as usual

If you want more control over the string-management functions, you can supply your own custom string manager that fully implements IAtlStringMgr. Instead of passing a pointer to CAtlStringMgr to the CString constructor, as in the previous code, you would simply pass a pointer to your custom IAtlStringMgr implementation. This custom string manager might use one of the existing memory managers or a custom implementation of IAtlMemMgr. Additionally, a custom string manager might want to enforce a different buffer-sharing policy than CAtlStringMgr’s default copy-on-write policy. Copy-on-write allows multiple CStringT instances to read the same string memory, but a duplicate is created before any writes to the buffer are performed.

Of course, the simplest thing to do is to use the defaults that ATL chooses when you use a simple CString declaration, as in the following:

// declare an empty CString instance
CString str;

With this declaration, ATL will use CAtlStringMgr to manage the string data. CAtlStringMgr will use the built-in CWin32Heap heap manager for supplying string data storage.

Constructors

CStringT provides 19 different constructors, although one of the constructors is compiled into the class definition only if you are building a managed C++ project for the .NET platform. These types of ATL specializations are not discussed in this book. In general, however, the large number of constructors present represents the various different sources of string data with which a CString instance can be initialized, along with the additional options for supplying alternate string managers. We examine these constructors in related groups.

Before going further into the various methods, let’s look at some of the notational shortcuts that CStringT uses in its method signatures. To properly understand even the method declarations with CStringT, you must be comfortable with the typedefs used to represent the character types in CStringT. Because CStringT uses template parameters to represent the base character type, the syntax for expressing the various allowed character types can become cumbersome or unclear in places. For instance, when you declare a CStringW, you create an instance of CStringT that encapsulates a series of wchar_t characters. From the definition of the CStringT template class, you can easily see that the BaseType template parameter can be used in method signatures that need to specify a wchar_t type parameter, but how would you specify methods that need to accept a char type parameter? Certainly, I need to be able to append char strings to a wchar_t-based CString. Conversely, I must have the ability to append wchar_t strings to a char-based CString. Yet I have only one template class in which to accomplish all this. CStringT provides six type definitions to deal with this syntactic dichotomy. They might seem somewhat arbitrary at first, but you’ll see as we look closer into CStringT that their use actually makes a lot of sense. Table 2.3 summarizes these typedefs.

Table 2.3. CStringT Character Traits Type Definitions

Typedef	BaseType is `char`	BaseType is `wchar_t`	Meaning
`XCHAR`	`char`	`wchar_t`	Single character of the same type as the `CStringT` instance
`PXSTR`	`LPSTR`	`LPWSTR`	Pointer to character string of the same type as `CStringT` instance
`PCXSTR`	`LPCSTR`	`LPCWSTR`	Pointer to constant character string of the same type as the `CStringT` instance
`YCHAR`	`wchar_t`	`char`	Single character of the opposite type as the `CStringT` instance
`PYSTR`	`LPWSTR`	`LPSTR`	Pointer to character string of the opposite type as `CStringT` instance
`PCYSTR`	`LPCWSTR`	`LPCSTR`	Pointer to constant character string of the opposite type as the `CStringT` instance
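
One way to produce "same type" and "opposite type" names like those in Table 2.3 is to specialize a small traits template on the base character type. The sketch below is illustrative only: CharPairSketch is a hypothetical name, and plain pointer types stand in for the Windows LPSTR-style typedefs so that the example compiles without any Windows headers.

```cpp
#include <type_traits>

// Hypothetical traits sketch: maps a base character type to its
// "same" (X) and "opposite" (Y) character and pointer types.
template <typename BaseType> struct CharPairSketch;

template <> struct CharPairSketch<char> {
    typedef char           XCHAR;
    typedef char*          PXSTR;
    typedef const char*    PCXSTR;
    typedef wchar_t        YCHAR;
    typedef wchar_t*       PYSTR;
    typedef const wchar_t* PCYSTR;
};

template <> struct CharPairSketch<wchar_t> {
    typedef wchar_t        XCHAR;
    typedef wchar_t*       PXSTR;
    typedef const wchar_t* PCXSTR;
    typedef char           YCHAR;
    typedef char*          PYSTR;
    typedef const char*    PCYSTR;
};

// For a char-based string, the opposite character type is wchar_t,
// and the reverse holds for a wchar_t-based string.
static_assert(std::is_same<CharPairSketch<char>::YCHAR, wchar_t>::value,
              "opposite of char is wchar_t");
static_assert(std::is_same<CharPairSketch<wchar_t>::PCYSTR,
                           const char*>::value,
              "opposite constant string of wchar_t is const char*");
```

A single class template can then declare one overload taking PCXSTR and another taking PCYSTR, and both remain correct regardless of which character type the template is instantiated with.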

Two constructors enable you to initialize a CString to an empty string:

CStringT();
explicit CStringT( IAtlStringMgr* pStringMgr );

Recall that the data for the CString is kept in the m_pszData data member. These constructors simply initialize this member to point to an empty string, that is, a single NUL terminator of type BaseType (1 byte for char, 2 bytes for wchar_t). The second constructor accepts a pointer to a string manager to use with this CStringT instance. As stated previously, if the first constructor is used, the CStringT instance will use the default string manager CAtlStringMgr, which relies upon an underlying CWin32Heap heap manager to allocate storage from the process heap.

The next two constructors provide two different copy constructors that enable you to initialize a new instance from an existing CStringT or from an existing CSimpleStringT.

CStringT( const CStringT& strSrc );
CStringT( const CThisSimpleString& strSrc );

The second constructor accepts a CThisSimpleString reference, but this is simply a typedef to CSimpleStringT<BaseType>. Exactly what these copy constructors do depends upon the policy established by the string manager that is associated with the CStringT instance. Recall that if no string manager is specified (that is, if you do not use a constructor that accepts an IAtlStringMgr pointer, such as the one shown previously), CAtlStringMgr will be used to manage memory allocation for the instance’s string data. This default string manager implements a copy-on-write policy that allows multiple CStringT instances to share a string buffer for reading, but automatically creates a copy of the buffer whenever one of those instances tries to perform a write operation. The following code demonstrates how these copy semantics work in practice:

// "Fred" memcpy'd into strOrig buffer
CString strOrig("Fred");
// str1 points to strOrig buffer (no memcpy)
CString str1(strOrig);
// str2 points to strOrig buffer (no memcpy)
CString str2(str1);
// str3 points to strOrig buffer (no memcpy)
CString str3(str2);
// new buffer allocated for str2
// "John" memcpy'd into str2 buffer
str2 = "John";

As the comments indicate, CAtlStringMgr creates no additional copies of the internal string buffer until a write operation is performed with the assignment statement of str2. The storage to hold the new data in str2 is obtained from CAtlStringMgr. If we had specified another custom string manager to use via a constructor, that implementation would have determined how and when data is allocated. Actually, CAtlStringMgr simply increments str2’s buffer pointer to “allocate” memory within its internal heap. As long as there is room in the CAtlStringMgr’s heap, no expansion of the heap is required and the string allocation is fast and efficient.
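
To make the copy-on-write idea concrete, here is a minimal, single-threaded model written in portable C++. It is a sketch under stated assumptions, not how CAtlStringMgr or CStringT is implemented: CowStringSketch, Assign, and SharesBufferWith are hypothetical names, and a std::shared_ptr reference count stands in for ATL's own buffer bookkeeping.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical copy-on-write string; a model of the policy only.
class CowStringSketch {
public:
    explicit CowStringSketch(const char* psz)
        : m_data(std::make_shared<std::string>(psz)) {}

    // The compiler-generated copy constructor copies only the pointer,
    // so a copy shares the source's buffer.

    // A write first detaches from any buffer shared with other instances.
    void Assign(const char* psz) {
        if (m_data.use_count() > 1)
            m_data = std::make_shared<std::string>(psz); // new buffer
        else
            *m_data = psz;                               // sole owner
    }

    const std::string& Get() const { return *m_data; }

    bool SharesBufferWith(const CowStringSketch& other) const {
        return m_data == other.m_data;
    }

private:
    std::shared_ptr<std::string> m_data;
};

// Usage mirroring the listing above.
inline void CowSketchDemo() {
    CowStringSketch strOrig("Fred");
    CowStringSketch str1(strOrig);  // shares strOrig's buffer
    CowStringSketch str2(str1);     // shares strOrig's buffer
    assert(str2.SharesBufferWith(strOrig));

    str2.Assign("John");            // str2 gets its own buffer
    assert(!str2.SharesBufferWith(strOrig));
    assert(strOrig.Get() == "Fred" && str1.Get() == "Fred");
    assert(str2.Get() == "John");
}
```

The model ignores thread safety and allocation strategy entirely; it only shows why copies are cheap and why a write to one instance leaves the others untouched.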

Several constructors accept a pointer to a character string of the same type as the CStringT instance, that is, a character string of type BaseType.

CStringT( const XCHAR* pszSrc );
CStringT( const XCHAR* pch, int nLength );
CStringT( const XCHAR* pch, int nLength, IAtlStringMgr* pStringMgr );

The first constructor should be used when the character string provided is NUL terminated. CStringT determines the size of the buffer needed by simply looking for the terminating NUL. However, the second and third forms of the constructor can accept an array of characters that is not NUL terminated. In this case, the length of the character array (in characters, not bytes), not including the terminating NUL that will be added, must be provided. You can improperly initialize your CString if you don’t feed these constructors the proper length or if you use the first form with a string that’s not NUL terminated. For instance:

char rg[4] = { 'F', 'r', 'e', 'd' };

// Wrong! Wrong!  rg not NUL-terminated
// str1 contains junk
CString str1(rg);

// ok, length provided to invoke correct ctor
CString str2(rg, 4);

const char* sz = "Fred";
// ok, sz NUL-terminated => no length parameter needed
CString str3(sz);

You can also initialize a CStringT instance with a character string of the opposite type of BaseType.

CSTRING_EXPLICIT CStringT( const YCHAR* pszSrc );
CStringT( const YCHAR* pch, int nLength );
CStringT( const YCHAR* pch, int nLength,
    IAtlStringMgr* pStringMgr );

These constructors work in an analogous manner to the XCHAR-based constructors just shown. The difference is that these constructors convert the source string to the BaseType declared for the CStringT instance, if it is required. For example, if the BaseType is wchar_t, such as when you explicitly declare a CStringW instance, and you pass the constructor a char*, CStringT will use the Windows API function MultiByteToWideChar to convert the source string. Two further constructors accept a NUL-terminated ANSI or Unicode string together with the string manager to use:

CStringT( LPCSTR pszSrc, IAtlStringMgr* pStringMgr );
CStringT( LPCWSTR pszSrc, IAtlStringMgr* pStringMgr );

You can also initialize a CStringT instance with a repeated series of characters using the following constructors:

CSTRING_EXPLICIT CStringT( char ch, int nLength = 1 );
CSTRING_EXPLICIT CStringT( wchar_t ch, int nLength = 1 );

Here, the nLength specifies the number of copies of the ch character to replicate in the CStringT instance, as in the following:

CString str('z', 5); // str contains "zzzzz"

CStringT also enables you to initialize a CStringT instance from an unsigned char string, which is how MBCS strings are represented.

CSTRING_EXPLICIT CStringT( const unsigned char* pszSrc );
CStringT( const unsigned char* pszSrc,
    IAtlStringMgr* pStringMgr );

Finally, CStringT provides two constructors that accept a VARIANT as the string source:

CStringT( const VARIANT& varSrc );
CStringT( const VARIANT& varSrc, IAtlStringMgr* pStringMgr );

Internally, CStringT uses the COM API function VariantChangeType to attempt to convert varSrc to a BSTR. VariantChangeType handles simple conversion between basic types, such as numeric-to-string conversions. However, the varSrc VARIANT cannot contain a complex type, such as an array of double. In addition, these two constructors truncate a BSTR that contains an embedded NUL.

// BSTR bstr contains "This is part one\0and here's part two"
VARIANT var;
var.vt = VT_BSTR;
var.bstrVal = bstr;
// var contains "This is part one\0and here's part two"
CString str(var);   // str contains "This is part one"
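
The truncation comes from treating a length-counted string as a NUL-terminated one. The following portable sketch shows the same effect with std::wstring; it does not use VARIANT or BSTR, and CopyAsNulTerminatedSketch is a hypothetical helper introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Copies a counted wide string the way a NUL-terminated API would see
// it: everything after the first embedded NUL is lost.
inline std::wstring CopyAsNulTerminatedSketch(const wchar_t* pch,
                                              std::size_t nLength) {
    std::wstring counted(pch, nLength);   // keeps embedded NULs
    return std::wstring(counted.c_str()); // stops at the first NUL
}

inline void EmbeddedNulDemo() {
    const wchar_t data[] = L"part one\0part two";
    // Number of characters, excluding the final terminator.
    const std::size_t n = sizeof(data) / sizeof(data[0]) - 1;

    std::wstring counted(data, n);
    assert(counted.size() == 17);         // 8 + embedded NUL + 8

    assert(CopyAsNulTerminatedSketch(data, n) == L"part one");
}
```

If the data after an embedded NUL matters, it has to be carried with an explicit length at every step rather than passed through any interface that relies on the terminator.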

Assignment

CStringT defines eight assignment operators. The first two enable you to assign to an instance from an existing CStringT or CSimpleStringT:

CStringT& operator=( const CStringT& strSrc );
CStringT& operator=( const CThisSimpleString& strSrc );

With both of these operators, the copy policy of the string manager in use dictates how they behave. By default, CStringT instances use the copy-on-write policy of the CAtlStringMgr class. See the previous discussion of the CStringT constructors for more information.

The next two assignment operators accept pointers to string literals of the same type as the CStringT instance or of the opposite type, as indicated by the PCXSTR and PCYSTR source string types:

CStringT& operator=( PCXSTR pszSrc );
CStringT& operator=( PCYSTR pszSrc );

Of course, no conversions are necessary with the first operator. However, CStringT invokes the appropriate Win32 conversion function when the second operator is used, as in the following code:

CStringA str;         // declare an empty ANSI CString
str = L"Hello World"; // operator=(PCYSTR) invoked
                      // characters converted via
                      // WideCharToMultiByte

CStringT also enables you to assign individual characters to instances. In these cases, CStringT actually creates a string of one character and appends either a 1- or 2-byte NUL terminator, depending on the type of character specified and the BaseType of the CStringT instance. These operators then delegate to either operator=(PCXSTR) or operator=(PCYSTR) so that any necessary conversions are performed.

CStringT& operator=( char ch );
CStringT& operator=( wchar_t ch );

Yet another CStringT assignment operator accepts an unsigned char* as its argument to support MBCS strings. This operator simply casts pszSrc to a char* and invokes either operator=(PCXSTR) or operator=(PCYSTR):

CStringT& operator=( const unsigned char* pszSrc );

Finally, VARIANT values can be assigned to instances of CStringT. The use and behavior here are identical to that described previously for the corresponding CStringT constructor:

CStringT& operator=( const VARIANT& var );

String Concatenation Using CString

CStringT defines eight operators used to append string data to the end of an existing string buffer. In all cases, storage for the new data appended is allocated using the underlying string manager and its encapsulated heap. By default, this means that CAtlStringMgr is employed; its underlying CWin32Heap instance will be used to invoke the Win32 HeapReAlloc API function as necessary to grow the CStringT buffer to accommodate the data appended by these operators.

CStringT& operator+=( const CThisSimpleString& str );
CStringT& operator+=( PCXSTR pszSrc );
CStringT& operator+=( PCYSTR pszSrc );
template< int t_nSize >
CStringT& operator+=( const CStaticString<
    XCHAR, t_nSize >& strSrc );
CStringT& operator+=( char ch );
CStringT& operator+=( unsigned char ch );
CStringT& operator+=( wchar_t ch );
CStringT& operator+=( const VARIANT& var );

The first operator accepts an existing CStringT instance, and two operators accept PCXSTR strings or PCYSTR strings. Three other operators enable you to append individual characters to an existing CStringT. You can append a char, wchar_t, or unsigned char. One operator enables you to append the string contained in an instance of CStaticString. You can use this template class to efficiently store immutable string data; it performs no copying of the data with which it is initialized and merely serves as a convenient container for a string constant. Finally, you can append a VARIANT to an existing CStringT instance. As with the VARIANT constructor and assignment operator discussed previously, this operator relies upon VariantChangeType to convert the underlying VARIANT data into a BSTR. To the compiler, a BSTR looks just like an OLECHAR*, so this operator will ultimately end up calling either operator+=(PCXSTR) or operator+=(PCYSTR), depending on the BaseType of the CStringT instance. The same issues with embedded NULs in the source BSTR that we discussed earlier for the VARIANT constructors apply here.

Three overloads of operator+() enable you to concatenate multiple strings conveniently.

friend CSimpleStringT operator+(
    const CSimpleStringT& str1,
    const CSimpleStringT& str2 );
friend CSimpleStringT operator+(
    const CSimpleStringT& str1,
    PCXSTR psz2 );
friend CSimpleStringT operator+(
    PCXSTR psz1,
    const CSimpleStringT& str2 );

These operators are invoked when you write code such as the following:

CString str1("Every good "); // str1: "Every good "
CString str2("boy does ");   // str2: "boy does "
CString str3;                // str3: empty
str3 = str1 + str2 + "fine"; // str3: "Every good boy does fine"

String concatenation is also supported through several Append methods. Four of these methods are defined on the CSimpleStringT base class and actually do the real work for the operators just discussed. Indeed, the only additional functionality offered by these four Append methods over the operators appears in the overload that accepts an nLength parameter. This enables you to append only a portion of an existing string. If you specify an nLength greater than the length of the source string, space will be allocated to accommodate nLength characters. However, the resulting CStringT data will be NUL terminated in the same place as pszSrc.

void Append( PCXSTR pszSrc );
void Append( PCXSTR pszSrc, int nLength );
void AppendChar( XCHAR ch );
void Append( const CSimpleStringT& strSrc );

Three additional methods defined on CStringT enable you to append formatted strings to existing CStringT instances. Formatted strings are discussed more later in this section when we cover CStringT’s Format operation. In short, these types of operations enable you to employ sprintf-style formatting to CStringT instances. The three methods shown here differ from their Format counterparts only in that the constructed string is appended to the CStringT instance instead of overwriting it.

void __cdecl AppendFormat( UINT nFormatID, ... );
void __cdecl AppendFormat( PCXSTR pszFormat, ... );
void AppendFormatV( PCXSTR pszFormat, va_list args );

Character Case Conversion

Two CStringT methods support case conversion: MakeUpper and MakeLower.

CStringT& MakeUpper() {
    int nLength = GetLength();
    PXSTR pszBuffer = GetBuffer( nLength );
    StringTraits::StringUppercase( pszBuffer );
    ReleaseBufferSetLength( nLength );

    return( *this );
}

CStringT& MakeLower() {
    int nLength = GetLength();
    PXSTR pszBuffer = GetBuffer( nLength );
    StringTraits::StringLowercase( pszBuffer );
    ReleaseBufferSetLength( nLength );

    return( *this );
}

Both of these methods delegate their work to the ChTraitsOS or ChTraitsCRT class, depending on which of these was specified as the template parameter when the CStringT instance was declared. Simply instantiating a variable of type CString uses the default character traits class supplied in the typedef for CString. If the preprocessor symbol _ATL_CSTRING_NO_CRT is defined, the ChTraitsOS class is used; and the Win32 functions CharLower and CharUpper are invoked to perform the conversion. If _ATL_CSTRING_NO_CRT is not defined, the ChTraitsCRT class is used by default, and it uses the appropriate CRT function: _mbslwr, _mbsupr, _wcslwr, or _wcsupr.

CString Comparison Operators

CString defines a whole slew of comparison operators (that’s a metric slew, not an imperial slew). Seven versions of operator== enable you to compare CStringT instances with other instances, with ANSI and Unicode string literals, and with individual characters.

friend bool operator==( const CStringT& str1,
    const CStringT& str2 );
friend bool operator==( const CStringT& str1, PCXSTR psz2 );
friend bool operator==( PCXSTR psz1, const CStringT& str2 );
friend bool operator==( const CStringT& str1, PCYSTR psz2 );
friend bool operator==( PCYSTR psz1, const CStringT& str2 );
friend bool operator==( XCHAR ch1, const CStringT& str2 );
friend bool operator==( const CStringT& str1, XCHAR ch2 );

As you might expect, a corresponding set of overloads for operator!= is also provided.

friend bool operator!=( const CStringT& str1,
    const CStringT& str2 );
friend bool operator!=( const CStringT& str1, PCXSTR psz2 );
friend bool operator!=( PCXSTR psz1, const CStringT& str2 );
friend bool operator!=( const CStringT& str1, PCYSTR psz2 );
friend bool operator!=( PCYSTR psz1, const CStringT& str2 );
friend bool operator!=( XCHAR ch1, const CStringT& str2 );
friend bool operator!=( const CStringT& str1, XCHAR ch2 );

And, of course, a full battalion of relational comparison operators is available in CStringT.

friend bool operator<( const CStringT& str1,
    const CStringT& str2 );
friend bool operator<( const CStringT& str1, PCXSTR psz2 );
friend bool operator<( PCXSTR psz1, const CStringT& str2 );
friend bool operator>( const CStringT& str1,
    const CStringT& str2 );
friend bool operator>( const CStringT& str1, PCXSTR psz2 );
friend bool operator>( PCXSTR psz1, const CStringT& str2 );
friend bool operator<=( const CStringT& str1,
    const CStringT& str2 );
friend bool operator<=( const CStringT& str1, PCXSTR psz2 );
friend bool operator<=( PCXSTR psz1, const CStringT& str2 );
friend bool operator>=( const CStringT& str1,
    const CStringT& str2 );
friend bool operator>=( const CStringT& str1, PCXSTR psz2 );
friend bool operator>=( PCXSTR psz1, const CStringT& str2 );

All the operators use the same method to perform the actual comparison: CStringT::Compare. A brief inspection of the operator== overload that takes two CStringT instances reveals how this is accomplished:

friend bool operator==( const CStringT& str1,
    const CStringT& str2 ) {
    return( str1.Compare( str2 ) == 0 );
}

Similarly, the same overload for operator!= is defined as follows:

friend bool operator!=( const CStringT& str1,
    const CStringT& str2 ) {
    return( str1.Compare( str2 ) != 0 );
}

The relational operators use Compare like this:

friend bool operator<( const CStringT& str1,
    const CStringT& str2 ) {
    return( str1.Compare( str2 ) < 0 );
}

Compare returns -1 if str1 is lexicographically (say that ten times fast while standing on your head) less than str2, and 1 if str1 is lexicographically greater than str2. Strings are compared character by character until an inequality occurs or the end of one of the strings is reached. If no inequalities are detected and the strings are the same length, they are considered equal. Compare returns 0 in this case. If an inequality is found between two characters, the result of a lexical comparison between the two characters is returned as the result of the string comparison. If the characters in the strings are the same except that one string is longer, the shorter string is considered to be less than the longer string. It is important to note that all these comparisons are case-sensitive. If you want to perform case-insensitive comparisons, you must resort to using the CompareNoCase method directly, as discussed in a moment.
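
The comparison rules in the preceding paragraph can be written out directly. The function below is a hypothetical model, named CompareSketch for illustration; it works on plain single-byte strings and is not the routine that CStringT actually calls.

```cpp
#include <cassert>

// Hypothetical model of the comparison rules: returns -1, 0, or 1.
inline int CompareSketch(const char* a, const char* b) {
    // Advance while the characters match and neither string has ended.
    while (*a != '\0' && *a == *b) {
        ++a;
        ++b;
    }
    // Either a mismatch was found or one string ended; the terminating
    // NUL (value 0) makes the shorter string compare as "less".
    const unsigned char ca = static_cast<unsigned char>(*a);
    const unsigned char cb = static_cast<unsigned char>(*b);
    return (ca < cb) ? -1 : (ca > cb) ? 1 : 0;
}

inline void CompareSketchDemo() {
    assert(CompareSketch("Fred", "Fred") == 0);    // identical
    assert(CompareSketch("Fred", "John") == -1);   // 'F' < 'J'
    assert(CompareSketch("Fred", "Freddy") == -1); // shorter is less
    assert(CompareSketch("fred", "Fred") == 1);    // case-sensitive
}
```

Because uppercase letters sort before lowercase letters in ASCII, "fred" compares greater than "Fred", which is exactly the kind of result that makes a case-insensitive comparison necessary for user-facing sorting.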

As with many of the character-level operations invoked by various CStringT methods and operators, the character traits class does the real heavy lifting. The CStringT::Compare method delegates to either ChTraitsOS or ChTraitsCRT, as discussed previously.

int Compare( PCXSTR psz ) const {
    ATLASSERT( AtlIsValidString( psz ) );
    return( StringTraits::StringCompare( GetString(), psz ) );
}

int CompareNoCase( PCXSTR psz ) const {
    ATLASSERT( AtlIsValidString( psz ) );
    return( StringTraits::StringCompareIgnore(
        GetString(), psz ) );
}

Assuming that CString is used to declare the instance and the project defaults are in use (_ATL_CSTRING_NO_CRT is not defined), the Compare method delegates to ChTraitsCRT::StringCompare. This function uses one of the CRT functions _mbscmp or wcscmp. Correspondingly, CompareNoCase invokes either _mbsicmp or _wcsicmp.

Two additional comparison methods provide the same functionality as Compare and CompareNoCase, except that they perform the comparison using language rules. The CRT functions underlying these methods are _mbscoll and _mbsicoll, or their Unicode equivalents, depending again on the underlying character type of the CStringT.

int Collate( PCXSTR psz ) const
int CollateNoCase( PCXSTR psz ) const

One final operator that bears mentioning is operator[]. This operator enables you to use convenient arraylike syntax to access individual characters in the CStringT string buffer. This operator is defined on the CSimpleStringT base class as follows:

XCHAR operator[]( int iChar ) const {
    ATLASSERT( (iChar >= 0) && (iChar <= GetLength()) );
    return( m_pszData[iChar] );
}

This function merely does some simple bounds checking (note that you can index the NUL terminator if you want) and then returns the character located at the specified index. This enables you to write code like the following:

CString str("ATL Internals");
char c1 = str[2];    // 'L'
char c2 = str[5];    // 'n'
char c3 = str[13];   // '\0'

CString Operations

CStringT instances can be manipulated and searched in a variety of ways. This section briefly presents the methods CStringT exposes for performing various types of operations. Three methods are designed to facilitate searching for strings and characters within a CStringT instance.

int Find( XCHAR ch, int iStart = 0 ) const
int Find( PCXSTR pszSub, int iStart = 0 ) const
int FindOneOf( PCXSTR pszCharSet ) const
int ReverseFind( XCHAR ch ) const

The first version of Find accepts a single character of BaseType and returns the zero-based index of the first occurrence of ch within the CStringT instance. Find starts the search at the index specified by iStart. If the character is not found, -1 is returned. The second version of Find accepts a string of characters and returns either the index of the first character of pszSub within the CStringT or -1 if pszSub does not occur in its entirety within the instance. As with many character-level operations, the character traits class performs the real work. With ChTraitsCRT in use, the first two versions of Find delegate ultimately to the CRT functions _mbschr and _mbsstr, respectively. The FindOneOf method looks for the first occurrence of any character within the pszCharSet parameter. This method invokes the CRT function _mbspbrk to do the search. Finally, the ReverseFind method operates similarly to Find, except that it starts its search at the end of the CStringT and looks “backward.” Note that all these operations are case-sensitive. The following examples demonstrate the use of these search operations.

CString str("Show me the money!");

int n = str.Find('o');      // n = 2
n = str.Find('O');          // n = -1, case-sensitivity
n = str.ReverseFind('o');   // n = 13, 'o' in "money" found
                            // first
n = str.Find("the");        // n = 8
n = str.FindOneOf("aeiou"); // n = 2
n = str.Find('o', 4);       // n = 13, started search after
                            // first 'o'

Nine different trim functions enable you to remove characters from the beginning and/or end of a CStringT. The first Trim function removes all leading and trailing whitespace characters from the string. The second overload of Trim accepts a character and removes all leading and trailing instances of chTarget from the string; the third overload of Trim removes leading and trailing occurrences of any character in the pszTargets string parameter. The three overloads for TrimLeft behave similarly to Trim, except that they remove the desired characters only from the beginning of the string. As you might guess, TrimRight removes only trailing instances of the specified characters.

CStringT& Trim()
CStringT& Trim( XCHAR chTarget )
CStringT& Trim( PCXSTR pszTargets )
CStringT& TrimLeft()
CStringT& TrimLeft( XCHAR chTarget )
CStringT& TrimLeft( PCXSTR pszTargets )
CStringT& TrimRight()
CStringT& TrimRight( XCHAR chTarget )
CStringT& TrimRight( PCXSTR pszTargets )
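
The character-set overloads can be modeled with std::string. This is a rough sketch under stated assumptions: the function names are hypothetical, and unlike the CStringT methods, which modify the instance in place and return a reference to it, these helpers return a new string.

```cpp
#include <cassert>
#include <string>

// Removes leading characters that appear in targets.
inline std::string TrimLeftSketch(const std::string& s,
                                  const std::string& targets) {
    const std::string::size_type first = s.find_first_not_of(targets);
    return first == std::string::npos ? std::string() : s.substr(first);
}

// Removes trailing characters that appear in targets.
inline std::string TrimRightSketch(const std::string& s,
                                   const std::string& targets) {
    const std::string::size_type last = s.find_last_not_of(targets);
    return last == std::string::npos ? std::string()
                                     : s.substr(0, last + 1);
}

// Removes both leading and trailing characters that appear in targets.
inline std::string TrimSketch(const std::string& s,
                              const std::string& targets) {
    return TrimLeftSketch(TrimRightSketch(s, targets), targets);
}

inline void TrimSketchDemo() {
    assert(TrimSketch("  --ATL--  ", " -") == "ATL");
    assert(TrimLeftSketch("xxATLxx", "x") == "ATLxx");
    assert(TrimRightSketch("xxATLxx", "x") == "xxATL");
    assert(TrimSketch("----", "-").empty()); // nothing survives
}
```

Note that characters in the middle of the string are never touched, even if they appear in the target set; trimming stops at the first character that is not a target.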

CStringT provides two useful functions for extracting characters from the encapsulated string:

CStringT SpanIncluding( PCXSTR pszCharSet ) const
CStringT SpanExcluding( PCXSTR pszCharSet ) const

SpanIncluding starts from the beginning of the CStringT data and returns a new CStringT instance that contains the leading run of characters that all appear in the pszCharSet string parameter, stopping at the first character that is not in pszCharSet. If the first character of the string is not in pszCharSet, an empty CStringT is returned. Conversely, SpanExcluding returns a new CStringT that contains all the characters in the original CStringT, up to but not including the first one that appears in pszCharSet. In this case, if no character in pszCharSet is found, the entire original string is returned.
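
Because the two span functions are easy to confuse, a small model may help. The helpers below are hypothetical std::string equivalents written for illustration; they are not the CStringT implementation.

```cpp
#include <cassert>
#include <string>

// Leading run of characters that all belong to charSet.
inline std::string SpanIncludingSketch(const std::string& s,
                                       const std::string& charSet) {
    // find_first_not_of returns npos when every character is in the
    // set; substr(0, npos) then yields the whole string.
    return s.substr(0, s.find_first_not_of(charSet));
}

// Leading run of characters, none of which belongs to charSet.
inline std::string SpanExcludingSketch(const std::string& s,
                                       const std::string& charSet) {
    return s.substr(0, s.find_first_of(charSet));
}

inline void SpanSketchDemo() {
    const std::string digits("0123456789");
    assert(SpanIncludingSketch("12345abc678", digits) == "12345");
    assert(SpanIncludingSketch("abc", digits).empty());
    assert(SpanExcludingSketch("key=value", "=") == "key");
    assert(SpanExcludingSketch("novalue", "=") == "novalue");
}
```

SpanExcluding is the handier of the two for simple parsing, such as splitting a name from its value at a delimiter.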

You can insert individual characters or entire strings into a CStringT instance using the overloaded Insert method:

int Insert( int iIndex, PCXSTR psz )
int Insert( int iIndex, XCHAR ch )

These methods insert the specified character or string into the CStringT instance starting at iIndex. The string manager associated with the CStringT allocates additional storage to accommodate the new data. Similarly, you can delete a character or series of characters from a string using either the Delete or Remove methods:

int Delete( int iIndex, int nCount = 1 )
int Remove( XCHAR chRemove )

Delete removes from the CStringT nCount characters starting at iIndex. Remove deletes all occurrences of the single character specified by chRemove.

CString str("That's a spicy meatball!");
str.Remove('T');    // str contains "hat's a spicy meatball!"
str.Remove('a');    // str contains "ht's  spicy metbll!" (note the two
                    // spaces left where the word "a" was removed)
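
The difference between Delete (positional) and Remove (by value) can be modeled in a few lines of portable C++. DeleteAt and RemoveAll below are hypothetical stand-ins built on std::string, not ATL code. The return values shown (the new length for Delete, the number of characters removed for Remove) follow our reading of the CStringT documentation and should be treated as an assumption of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Models Delete(iIndex, nCount): erase up to count characters starting
// at index. Assumed to return the length of the string afterward.
int DeleteAt(std::string& s, int index, int count = 1) {
    if (index >= 0 && count > 0 &&
        static_cast<std::size_t>(index) < s.size()) {
        // erase clamps the count to the characters actually available
        s.erase(static_cast<std::size_t>(index),
                static_cast<std::size_t>(count));
    }
    return static_cast<int>(s.size());
}

// Models Remove(chRemove): erase every occurrence of ch. Assumed to
// return the number of characters removed.
int RemoveAll(std::string& s, char ch) {
    const std::size_t before = s.size();
    s.erase(std::remove(s.begin(), s.end(), ch), s.end());
    return static_cast<int>(before - s.size());
}
```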

Individual characters or strings can be replaced using the overloaded Replace method:

int Replace( XCHAR chOld, XCHAR chNew )
int Replace( PCXSTR pszOld, PCXSTR pszNew )

These methods search the CStringT instance for every occurrence of the specified character or string and replace each occurrence with the new character or string provided. The methods return the number of replacements performed, or 0 if no occurrences were found and the string is unchanged.

You can extract substrings of a CStringT using the Left, Mid, and Right functions:

CStringT Left( int nCount ) const
CStringT Mid( int iFirst ) const
CStringT Mid( int iFirst, int nCount ) const
CStringT Right( int nCount ) const

These functions are quite simple. Left returns in a new CStringT instance the first nCount characters of the original CStringT. Mid has two overloads. The first returns a new CStringT instance that contains all characters in the original starting at iFirst and continuing to the end. The second overload of Mid accepts an nCount parameter so that only the specified number of characters starting at iFirst are returned in the new CStringT. Finally, Right returns the rightmost nCount characters of the CStringT instance.
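
As a quick illustration of the index arithmetic, the following portable sketch models the three extraction functions with std::string. LeftOf, MidOf, and RightOf are hypothetical stand-ins, not ATL code. Clamping out-of-range arguments to the bounds of the string is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Models Left(nCount): the first count characters.
std::string LeftOf(const std::string& s, int count) {
    if (count <= 0) return std::string();
    return s.substr(0, static_cast<std::size_t>(count));
}

// Models Mid(iFirst): everything from first to the end of the string.
std::string MidOf(const std::string& s, int first) {
    if (first < 0) first = 0;
    if (static_cast<std::size_t>(first) >= s.size()) return std::string();
    return s.substr(static_cast<std::size_t>(first));
}

// Models Mid(iFirst, nCount): at most count characters starting at first.
std::string MidOf(const std::string& s, int first, int count) {
    if (count <= 0) return std::string();
    return MidOf(s, first).substr(0, static_cast<std::size_t>(count));
}

// Models Right(nCount): the last count characters.
std::string RightOf(const std::string& s, int count) {
    if (count <= 0) return std::string();
    const std::size_t n = std::min(static_cast<std::size_t>(count), s.size());
    return s.substr(s.size() - n);
}
```

With the string "Let's do some ATL", LeftOf(s, 5) gives "Let's", MidOf(s, 6, 2) gives "do", and RightOf(s, 3) gives "ATL".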

CStringT's MakeReverse method enables you to reverse the characters in a CStringT:

CStringT& MakeReverse();

CString str("Let's do some ATL");
str.MakeReverse(); // str contains "LTA emos od s'teL"

Tokenize is a very useful method for breaking a CStringT into tokens separated by user-specified delimiters:

CStringT Tokenize( PCXSTR pszTokens, int& iStart ) const

The pszTokens parameter can include any number of characters that will be interpreted as delimiters between tokens. The iStart parameter specifies the starting index of the tokenization process. Note that this parameter is passed by reference so that the Tokenize implementation can update its value to the index of the first character following a delimiter. The function returns a CStringT instance containing the string token found. When no more tokens are found, the function returns an empty CStringT and iStart is set to -1. Tokenize is typically used in code like the following:

CString str("Name=Jenny; Ph: 867-5309");
CString tok;
int nPos = 0;
LPCSTR pszDelims = "; =:-";
tok = str.Tokenize(pszDelims, nPos);
while (tok != "") {
    printf("Found token: %s\n", tok);
    tok = str.Tokenize(pszDelims, nPos);
}
// Prints the following:
// Found token: Name
// Found token: Jenny
// Found token: Ph
// Found token: 867
// Found token: 5309
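
The way iStart advances is the subtle part of Tokenize, so it might help to see the bookkeeping spelled out. The following portable sketch models it with std::string; NextToken is a hypothetical stand-in, not ATL's implementation, and it assumes the behavior described above: leading delimiters are skipped, the position moves just past the delimiter that ended the token, and the position becomes -1 once the input is exhausted.

```cpp
#include <cstddef>
#include <string>

// Models Tokenize(pszTokens, iStart) on a std::string.
std::string NextToken(const std::string& s, const std::string& delims,
                      int& start) {
    if (start < 0) return std::string();

    // Skip any run of delimiters in front of the next token.
    const std::string::size_type begin =
        s.find_first_not_of(delims, static_cast<std::size_t>(start));
    if (begin == std::string::npos) {
        start = -1;  // nothing left to tokenize
        return std::string();
    }

    // The token runs up to the next delimiter or the end of the string.
    std::string::size_type end = s.find_first_of(delims, begin);
    if (end == std::string::npos) end = s.size();

    // Resume just past the delimiter that terminated this token.
    start = static_cast<int>(end + 1);
    return s.substr(begin, end - begin);
}
```

Calling NextToken repeatedly on "Name=Jenny; Ph: 867-5309" with the delimiters "; =:-" produces the same five tokens as the listing above, followed by an empty string with the position set to -1.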

Three methods enable you to populate a CStringT with string data embedded in the component DLL (or EXE) as a Windows resource:

BOOL LoadString( UINT nID )
BOOL LoadString( HINSTANCE hInstance, UINT nID )
BOOL LoadString( HINSTANCE hInstance, UINT nID,
    WORD wLanguageID )

The first overload retrieves the string from the module containing the calling code and stores it in the CStringT. The second and third overloads enable you to explicitly pass in a handle to the module from which the resource string should be loaded. Additionally, the third overload enables you to load a string in a specific language by specifying the LANGID via the wLanguageID parameter. The function returns TRUE if the specified resource could be loaded into the CStringT instance; otherwise, it returns FALSE.

CStringT also provides a very thin wrapper function on top of the Win32 function GetEnvironmentVariable:

BOOL GetEnvironmentVariable( PCXSTR pszVar )

With this simple function, you can retrieve the value of the environment variable indicated by pszVar and store it in the CStringT instance. The function returns TRUE if it succeeds and FALSE otherwise.

Formatted Data

One of the most useful features of CStringT is its capability to construct formatted strings using sprintf-style format specifiers. CStringT exposes four methods for building formatted string data. The first two methods wrap underlying calls to the CRT function vsprintf or vswprintf, depending on whether the CStringT’s BaseType is char or wchar_t.

void __cdecl Format( PCXSTR pszFormat, ... );
void __cdecl Format( UINT nFormatID, ... );

The first overload for the Format method accepts a format string directly. The second overload retrieves the format string from the module’s string table by looking up the resource ID nFormatID.

Two other closely related methods enable you to build formatted strings with CStringT instances. These methods wrap the Win32 API function FormatMessage:

void __cdecl FormatMessage( PCXSTR pszFormat, ... );
void __cdecl FormatMessage( UINT nFormatID, ... );

As with the Format methods, FormatMessage enables you to directly specify the format string by using the first overload or to load it from the module's string table using the second overload. It is important to note that the format strings allowed for Format and FormatMessage are different. Format uses the format strings vsprintf allows; FormatMessage uses the format strings the Win32 function FormatMessage allows. The exact syntax and semantics of the various format specifiers are well documented in the online documentation, so they are not repeated here.

You use these methods in code like the following:

CString strFirst = "John";
CString strLast = "Doe";
CString str;

// str will contain "Doe, John: Age = 45"
str.Format("%s, %s: Age = %d", strLast, strFirst, 45);
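
To give a feel for what wrapping a vsprintf-style function involves, here is a minimal portable sketch that formats into a std::string. FormatString is a hypothetical helper, not ATL's implementation; it assumes a C++11 runtime and uses the standard vsnprintf, measuring the required length in a first pass and writing the text in a second.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sketch of a Format-like helper built on vsnprintf.
std::string FormatString(const char* fmt, ...) {
    va_list args;
    va_start(args, fmt);

    // A va_list cannot be reused after it has been consumed, so take a
    // copy for the second pass.
    va_list argsCopy;
    va_copy(argsCopy, args);

    // First pass: ask how many characters the result needs.
    const int needed = std::vsnprintf(nullptr, 0, fmt, args);
    va_end(args);
    if (needed < 0) {  // encoding or format error
        va_end(argsCopy);
        return std::string();
    }

    // Second pass: write into a buffer with room for the terminator.
    std::vector<char> buffer(static_cast<std::size_t>(needed) + 1);
    std::vsnprintf(buffer.data(), buffer.size(), fmt, argsCopy);
    va_end(argsCopy);

    return std::string(buffer.data(), static_cast<std::size_t>(needed));
}
```

A call such as FormatString("%s, %s: Age = %d", "Doe", "John", 45) builds the same text as the Format example above.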

Working with BSTRs and CString

You’ve seen that CStringT is great for manipulating char or wchar_t strings. Indeed, all the operations we’ve presented so far operate in terms of these two fundamental character types. However, we’re going to be using ATL to build COM components, and that means we’ll often be dealing with Automation types such as BSTR. So, we must have a convenient mechanism for returning a BSTR from a method while doing all the processing with our powerful CStringT class. As it happens, CStringT supplies two methods for precisely that purpose:

BSTR AllocSysString() const {
    BSTR bstrResult = StringTraits::AllocSysString( GetString(),
        GetLength() );
    if( bstrResult == NULL ) {
        ThrowMemoryException();
    }

    return( bstrResult );
}

BSTR SetSysString( BSTR* pbstr ) const {
    ATLASSERT( AtlIsValidAddress( pbstr, sizeof( BSTR ) ) );

    if( !StringTraits::ReAllocSysString( GetString(), pbstr,
        GetLength() ) ) {
        ThrowMemoryException();
    }

    ATLASSERT( *pbstr != NULL );
    return( *pbstr );
}

AllocSysString allocates a BSTR and copies the CStringT contents into it. CStringT delegates this work to the character traits class, which ultimately uses the COM API function SysAllocStringLen. The resulting BSTR is returned to the caller. Note that AllocSysString transfers ownership of the BSTR, so the burden is on the caller to eventually call SysFreeString. CStringT also provides SetSysString, which offers the same capability as AllocSysString, except that SetSysString works with an existing BSTR: it uses ReAllocSysString to resize the storage of the pbstr argument and then copies the CStringT data into it. This process also frees the original BSTR passed in.

The following example demonstrates how AllocSysString can be used to return a BSTR from a method call.

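STDMETHODIMP CPhoneBook::LookupName( BSTR* pbstrName ) {
    if( pbstrName == NULL ) return E_POINTER;

    // ... do some processing

    CString str("Kirk");

    // AllocSysString hands ownership of the new BSTR to the caller,
    // who must eventually call SysFreeString
    *pbstrName = str.AllocSysString(); // *pbstrName contains "Kirk"

    return S_OK;
}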

Summary

You must be especially careful when using the BSTR string type because it has numerous special semantics. The ATL CComBSTR class manages many of the special semantics for you and is quite useful. However, the class cannot compensate for the poor decision that, to the C++ compiler, equates the OLECHAR* and BSTR types. You must always use care when using the BSTR type because the compiler will not warn you of many pitfalls.

The CString class is poised to become the new workhorse for string processing in ATL. It is now a shared class with the MFC library and offers a host of powerful functions for manipulating strings in ways that would be very cumbersome and error prone with other string classes. Additionally, CString provides for the customization of string allocation via the IAtlStringMgr interface and a default implementation of that interface in CAtlStringMgr.