GNU C source files are usually written in the ASCII character set, which was defined in the 1960s for English. However, they can also include Unicode characters represented in the UTF-8 multibyte encoding. This makes it possible to represent accented letters such as ‘á’, as well as other scripts such as Arabic, Chinese, Cyrillic, Hebrew, Japanese, and Korean.(1)
In C source code, non-ASCII characters are valid in comments, in wide character constants (see Wide Character Constants), and in string constants (see String Constants).
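As a minimal sketch of those three places, the following fragment puts non-ASCII characters in a comment, a string constant, and a wide character constant. The function names are hypothetical, chosen only for illustration, and the expected values assume the file is saved as UTF-8 and compiled with UTF-8 as the execution character set (GCC's default).

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Comments may contain non-ASCII text: café, Привет, 日本語.  */

/* Byte length of a string constant holding one accented letter.
   Assumes UTF-8 source and execution character sets.  */
size_t
accented_string_bytes (void)
{
  return strlen ("á");   /* á is U+00E1: two bytes in UTF-8.  */
}

/* Value of a wide character constant for the same letter.  */
wchar_t
accented_wide_char (void)
{
  return L'á';
}

/* Hypothetical self-check of the two functions above.  */
void
check_literal_characters (void)
{
  assert (accented_string_bytes () == 2);
  assert (accented_wide_char () == 0xE1);
}
```

Note that strlen counts bytes, not characters, so the one-letter string has length 2 under these assumptions.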
Another way to specify non-ASCII characters in constants (character or string) and identifiers is with an escape sequence starting with backslash, specifying the intended Unicode character. (See Unicode Character Codes.) This specifies non-ASCII characters without putting a real non-ASCII character in the source file itself.
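For illustration, here is a small sketch of the escape-sequence form in a string constant. The source text is pure ASCII, yet the string holds the same bytes that UTF-8 uses for ‘á’ (U+00E1). The names are hypothetical, and the comparison assumes UTF-8 is the execution character set.

```c
#include <assert.h>
#include <string.h>

/* "\u00E1" names U+00E1 using only ASCII source characters.  */
static const char escaped[] = "\u00E1";

/* The two bytes UTF-8 uses for U+00E1, written out explicitly.  */
static const char utf8_bytes[] = "\xC3\xA1";

/* Returns nonzero if both constants hold identical bytes,
   including the terminating null.  */
int
escape_matches_utf8 (void)
{
  return sizeof escaped == sizeof utf8_bytes
         && memcmp (escaped, utf8_bytes, sizeof escaped) == 0;
}

/* Hypothetical self-check.  */
void
check_escape (void)
{
  assert (escape_matches_utf8 ());
}
```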
C accepts two-character aliases called digraphs for certain characters. See Digraphs.
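As a brief sketch, the function below uses the standard digraphs ‘<:’ and ‘:>’ for square brackets and ‘<%’ and ‘%>’ for braces. The function name is hypothetical. Code like this is rarely written by hand; it is shown only to make the aliasing concrete.

```c
#include <assert.h>

/* Sums a small array, spelled with digraphs:
   <: :> stand for [ ], and <% %> stand for { }.  */
int
digraph_sum (void)
<%
  int values<:3:> = <% 1, 2, 3 %>;
  return values<:0:> + values<:1:> + values<:2:>;
%>

/* Hypothetical self-check.  */
void
check_digraphs (void)
{
  assert (digraph_sum () == 6);
}
```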
(1) On some obscure systems, GNU C uses UTF-EBCDIC instead of UTF-8, but that is not worth describing in this manual.