Code sets for character data

A character set is one or more natural-language alphabets together with additional symbols for digits, punctuation, and diacritical marks. Each character set has at least one code set, which maps its characters to unique bit patterns. These bit patterns are called code points.

ASCII, ISO8859-1, Windows™ Code Page 1252, and EBCDIC are examples of code sets that support the English language.

The number of unique characters in the language determines the amount of storage that each character requires in a code set. Because a single byte can store values in the range 0 - 255, it can uniquely identify 256 characters. Most Western languages have fewer than 256 characters and therefore have code sets made up of single-byte characters. When an application handles data in such code sets, it can assume that 1 byte stores 1 character.
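The one-byte-per-character assumption can be checked with a short sketch. It uses Python's built-in codecs iso8859-1, cp1252, ascii, and cp037 (an EBCDIC code page) as stand-ins for the code sets named above. These codec names and the helper bytes_per_char are assumptions made for illustration; they are not GLS locale names or part of any HCL OneDB API.

```python
def bytes_per_char(text: str, code_set: str) -> float:
    """Average number of bytes used per character in the given code set."""
    return len(text.encode(code_set)) / len(text)


sample = "résumé"

# In a single-byte code set, every character occupies exactly 1 byte.
single_byte = {cs: bytes_per_char(sample, cs) for cs in ("iso8859-1", "cp1252")}

# The same character can map to different code points in different code sets.
code_point_of_A = {
    cs: "A".encode(cs)[0] for cs in ("ascii", "iso8859-1", "cp1252", "cp037")
}

print(single_byte)
print({cs: hex(cp) for cs, cp in code_point_of_A.items()})
```

In this sketch the ASCII-based code sets agree on the code point for A, while the EBCDIC code page assigns it a different one.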

The ASCII code set contains 128 characters. Therefore, the code point for each character requires only 7 bits of a byte. These single-byte characters, with code points in the range 0 - 127, are sometimes called ASCII or 7-bit characters. The ASCII code set is a single-byte code set and is a subset of all code sets that HCL® OneDB® products support.
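As a minimal sketch of the 7-bit property, the following test checks whether every byte in a buffer leaves the eighth bit clear. The function name is_seven_bit is hypothetical, and the Python codecs listed are assumed stand-ins for ASCII-compatible code sets, not a list of the code sets that the products support.

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if no byte has its eighth (high) bit set."""
    return all(byte < 0x80 for byte in data)


# All 128 ASCII characters, code points 0 - 127.
ascii_text = bytes(range(128)).decode("ascii")

# In an ASCII-compatible code set, ASCII characters keep the same code points.
ascii_preserved = all(
    ascii_text.encode(cs) == bytes(range(128))
    for cs in ("iso8859-1", "cp1252", "gb18030")
)

print(is_seven_bit(b"OneDB"))                   # plain ASCII data
print(is_seven_bit("é".encode("iso8859-1")))    # needs the eighth bit
print(ascii_preserved)
```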

If a code set contains more than 128 characters, some of its characters have code points that must set the eighth bit of the byte. These non-ASCII characters might be either of the following types of characters:

8-bit characters: The 8-bit characters are single-byte characters whose code points are in the range 128 - 255. Examples from the ISO8859-1 code set or Windows Code Page 1252 include the non-English é, ñ, and ö characters. Software can interpret these characters correctly only if it is 8-bit clean. For more information, see GLS8BITFSYS environment variable.
Multibyte characters: If a character set contains more than 256 characters, the code set must contain multibyte characters. A multibyte character might require 2 - 4 bytes of storage. Some East-Asian locales support character sets that can contain thousands of ideographic characters; GLS provides full support, for example, for the unified Chinese GB18030-2000 code set, which contains nearly 1.4 million code points. Such languages have code sets that include both single-byte and multibyte characters. These code sets are called multibyte code sets.

As another example, some characters in the Japanese SJIS code set occupy 2 or 3 bytes. Applications that handle data in multibyte code sets cannot assume that each character takes only 1 byte of storage.
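The difference between character count and byte count in a multibyte code set can be illustrated with a short sketch. It assumes Python's gb18030 codec as a stand-in for a multibyte code set; the helper byte_widths and the sample string are hypothetical and chosen only to show 1-byte, 2-byte, and 4-byte characters side by side.

```python
def byte_widths(text: str, code_set: str) -> list[int]:
    """Number of bytes each character occupies in the given code set."""
    return [len(ch.encode(code_set)) for ch in text]


# One ASCII letter, one common ideograph, and one supplementary ideograph.
mixed = "A中\U00020000"

widths = byte_widths(mixed, "gb18030")
encoded = mixed.encode("gb18030")

print(len(mixed))      # number of characters
print(len(encoded))    # number of bytes of storage
print(widths)          # bytes used by each character
```

Because the byte length and the character length differ, code that sizes buffers or truncates strings must work in terms of characters, not bytes, when the data is in a multibyte code set.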

Tip: In this publication, the term non-ASCII characters applies to all characters with a code point greater than 127. Non-ASCII characters include both 8-bit and multibyte characters.

HCL OneDB products can support single-byte or multibyte code sets. For some examples of GLS locales that support non-ASCII characters, see Support for non-ASCII characters.

Tip: Throughout this publication, examples show how single-byte and multibyte characters are displayed. Because multibyte characters are usually ideographic, this publication does not use the actual multibyte characters. Instead, it uses ASCII characters to represent both single-byte and multibyte characters.