Shift State - The GNU C Library

Previous: Non-reentrant String Conversion, Up: Non-reentrant Conversion

6.4.3 States in Non-reentrant Functions

In some multibyte character codes, the meaning of any particular byte sequence is not fixed; it depends on what other sequences have come earlier in the same string. Typically there are just a few sequences that can change the meaning of other sequences; these few are called shift sequences and we say that they set the shift state for other sequences that follow.

To illustrate shift state and shift sequences, suppose we decide that the sequence 0200 (just one byte) enters Japanese mode, in which pairs of bytes in the range from 0240 to 0377 are single characters, while 0201 enters Latin-1 mode, in which single bytes in the range from 0240 to 0377 are characters, and interpreted according to the ISO Latin-1 character set. This is a multibyte code that has two alternative shift states (“Japanese mode” and “Latin-1 mode”), and two shift sequences that specify particular shift states.

When the multibyte character code in use has shift states, then mblen, mbtowc, and wctomb must maintain and update the current shift state as they scan the string. To make this work properly, you must follow these rules:

Before starting to scan a string, call the function with a null pointer for the multibyte character address—for example, mblen (NULL, 0). This initializes the shift state to its standard initial value.
Scan the string one character at a time, in order. Do not “back up” and rescan characters already scanned, and do not intersperse the processing of different strings.

Here is an example of using mblen following these rules:

     void
     scan_string (char *s)
     {
       int length = strlen (s);
     
       /* Initialize shift state.  */
       mblen (NULL, 0);
     
       while (1)
         {
           int thischar = mblen (s, length);
           /* Deal with end of string and invalid characters.  */
           if (thischar == 0)
             break;
           if (thischar == -1)
             {
               error ("invalid multibyte character");
               break;
             }
           /* Advance past this character.  */
           s += thischar;
           length -= thischar;
         }
     }

The functions mblen, mbtowc and wctomb are not reentrant when using a multibyte code that uses a shift state. However, no other library functions call these functions, so you don't have to worry that the shift state will be changed mysteriously.