Code sets for character data
A character set is one or more natural-language alphabets together with additional symbols for digits, punctuation, and diacritical marks. Each character set has at least one code set, which maps its characters to unique bit patterns. These bit patterns are called code points.
ASCII, ISO8859-1, Windows™ Code Page 1252, and EBCDIC are examples of code sets that support the English language.
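As a rough illustration of how different code sets map the same character to different bit patterns, the following Python sketch encodes one character with four codecs from the Python standard library. The codec names are Python's names, not OneDB locale names, and `cp037` is assumed here as a representative EBCDIC code page; the helper `code_points` is hypothetical.

```python
# Python codec names; cp037 stands in for EBCDIC (an assumption for this example).
CODE_SETS = ["ascii", "iso8859-1", "cp1252", "cp037"]

def code_points(char: str) -> dict:
    """Return the code point (first encoded byte) of char in each code set."""
    return {name: char.encode(name)[0] for name in CODE_SETS}

# Print the code point of the letter A in each code set, in hexadecimal.
for name, point in code_points("A").items():
    print(f"{name:10} 0x{point:02X}")
```

The three ASCII-compatible code sets agree on the code point for A, while the EBCDIC code page assigns it a different value.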
The number of unique characters in the language determines the amount of storage that each character requires in a code set. Because a single byte can store values in the range 0 - 255, it can uniquely identify 256 characters. Most Western languages have fewer than 256 characters and therefore have code sets made up of single-byte characters. When an application handles data in such code sets, it can assume that 1 byte stores 1 character.
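The one-byte-per-character assumption can be checked with a short Python sketch. The helper name `is_one_byte_per_char` is hypothetical, and the check assumes that the text uses precomposed characters, so that Python's `len` counts one unit per character.

```python
def is_one_byte_per_char(text: str, code_set: str) -> bool:
    """True if encoding text in code_set yields exactly one byte per character."""
    return len(text.encode(code_set)) == len(text)

word = "se\u00f1or"  # "señor", written with an escape for the precomposed ñ

# Both single-byte code sets store the 5 characters in 5 bytes.
print(is_one_byte_per_char(word, "iso8859-1"))
print(is_one_byte_per_char(word, "cp1252"))
```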
The ASCII code set contains 128 characters. Therefore, the code point for each character requires only 7 bits of a byte. These single-byte characters with code points in the range 0 - 127 are sometimes called ASCII or 7-bit characters. The ASCII code set is a single-byte code set and is a subset of all code sets that HCL® OneDB® products support.
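A minimal Python sketch of a 7-bit check follows. It tests whether every byte in a buffer has its high-order bit clear, which is what makes the data pure ASCII. The function name `is_seven_bit` is hypothetical and is not part of any OneDB API.

```python
def is_seven_bit(data: bytes) -> bool:
    """True if every byte has its high bit clear (code point 0 - 127)."""
    return all(byte < 0x80 for byte in data)

ascii_bytes = "OneDB".encode("ascii")
latin_bytes = "\u00e9".encode("iso8859-1")  # é needs the eighth bit

print(is_seven_bit(ascii_bytes))   # only 7-bit code points
print(is_seven_bit(latin_bytes))   # contains a code point above 127
print((127).bit_length())          # bits needed for the largest ASCII code point
```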
- 8-bit characters
- The 8-bit characters are single-byte characters whose code points are in the range 128 - 255. Examples from the ISO8859-1 code set or Windows Code Page 1252 include the non-English é, ñ, and ö characters. Software can interpret these characters correctly only if it is 8-bit clean, that is, only if it preserves the high-order bit of each byte. For more information, see GLS8BITFSYS environment variable.
- Multibyte characters
- If a character set contains more than 256 characters, the code set must contain multibyte characters. A multibyte character might require 2 - 4 bytes of storage. Some East-Asian locales support character sets that can contain thousands of ideographic characters; GLS provides full support, for example, for the unified Chinese GB18030-2000 code set, which contains nearly 1.4 million code points. Such languages have code sets that include both single-byte and multibyte characters. These code sets are called multibyte code sets.
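The difference between 8-bit and multibyte characters can be sketched in Python with the standard-library codecs `iso8859-1` and `gb18030`. These are Python codec names, assumed here to correspond to the code sets named above, and the helper `byte_length` is hypothetical.

```python
def byte_length(char: str, code_set: str) -> int:
    """Return the number of bytes that char occupies in code_set."""
    return len(char.encode(code_set))

# 8-bit characters: one byte each, with code points in the range 128 - 255.
eight_bit = {c: c.encode("iso8859-1")[0] for c in "\u00e9\u00f1\u00f6"}  # é, ñ, ö
print(eight_bit)

# A multibyte code set mixes single-byte and multibyte characters.
samples = ["A", "\u4e2d", "\U00020000"]  # ASCII letter, common ideograph, rare ideograph
lengths = [byte_length(c, "gb18030") for c in samples]
print(lengths)
```

Because the storage size varies from character to character, an application that handles a multibyte code set cannot assume that 1 byte stores 1 character.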