The etx_CreateCharSet() routine
The etx_CreateCharSet() routine creates a user-defined character set.
Syntax
etx_CreateCharSet (charset_name, file_name)
Element | Purpose | Data type |
---|---|---|
charset_name | Name of your user-defined character set. If you enter a name longer than 18 characters, the name is silently truncated to 18 characters. | CHAR (18) |
file_name | Absolute path name of the operating system file from which the text search engine loads the character set. The file can be on either the server or the client machine. The client machine is searched first. | LVARCHAR |
Return type
None.
Usage
When you create an etx index on a column, you can specify the character set used to index the text data. The character set indicates which letters are to be indexed; any characters in the text data that are not listed in the character set are converted to blanks. Use the CHAR_SET index parameter to specify the name of the character set. You must create a user-defined character set before you use it to create an etx index.
The module provides three built-in character sets: ASCII, ISO, and OVERLAP_ISO. Each of these built-in character sets includes only alphanumeric characters and maps lowercase letters to uppercase. This is sufficient for most text searches. For a complete description of the three built-in character sets, see Character sets.
There are times, however, when you might want to index nonalphanumeric characters or distinguish between lowercase and uppercase letters. In these cases, you must define your own character set.
To define your own character set, first create an operating system file that specifies the characters you want to index. The next section describes in detail the structure of this operating system file.
To use the user-defined character set, specify its name in the CHAR_SET index parameter of the CREATE INDEX statement.
Structure of the operating system character set file
The operating system file consists of 16 lines of 16 hexadecimal numbers, plus optional lines that contain comments. Each position corresponds to an ASCII character. If you want the character in the position to be indexed, enter the hexadecimal value of the character. If you do not want the character to be indexed, enter 00.
The ISO 8859-1 table in Character sets lists the ISO 8859-1 character set that can be used as a reference when creating the operating system file.
Comments begin with a slash, a hyphen, or a pound sign, and they can appear anywhere in the file.
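The file layout described above can be checked before the file is handed to the search engine. The following is a minimal Python sketch, not part of the module: the function name parse_charset_file is hypothetical, and it assumes that comments occupy whole lines, that blank lines are ignored, and that a backslash also starts a comment (the sample file in this section uses one).

```python
# Assumption: comment lines start with one of these characters.
# The backslash is included because the sample file in this section uses it.
COMMENT_PREFIXES = ("/", "\\", "-", "#")


def parse_charset_file(text):
    """Parse a 16 x 16 character set file into a flat list of 256 integers.

    Index i of the result holds the value stored at position i of the matrix;
    0 means "do not index this character".
    """
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines (an assumption) and whole-line comments.
        if not stripped or stripped.startswith(COMMENT_PREFIXES):
            continue
        values = [int(token, 16) for token in stripped.split()]
        if len(values) != 16 or any(not 0 <= v <= 0xFF for v in values):
            raise ValueError(f"line {line_no}: expected 16 values in 00..FF")
        rows.append(values)
    if len(rows) != 16:
        raise ValueError(f"expected 16 data lines, found {len(rows)}")
    return [value for row in rows for value in row]
```

A file that parses cleanly here has the right shape; whether the search engine accepts it is still determined by the engine itself.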
For example, to create a character set that indexes hyphens (hexadecimal value 0x2D), underscores (hexadecimal value 0x5F), backslashes (hexadecimal value 0x5C), and forward slashes (hexadecimal value 0x2F), the alphanumeric characters 0 through 9, a through z, and A through Z, and that maps the lowercase letters a through z to uppercase, the operating system file would look like the following example:
# Character set that indexes hyphens and
/ alphanumeric characters. All lower case letters
\ are mapped to upper case.
- Note the different ways of specifying that a
# line is a comment.
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 2D 00 2F
30 31 32 33 34 35 36 37 38 39 00 00 00 00 00 00
00 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
50 51 52 53 54 55 56 57 58 59 5A 00 5C 00 00 5F
00 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
50 51 52 53 54 55 56 57 58 59 5A 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
This is similar to the built-in ASCII character set, except that hyphens, underscores, forward slashes, and backslashes are also indexed instead of being converted to blanks. These four characters are indexed because the position in the matrix for each character contains its hexadecimal representation: 0x2D, 0x5F, 0x2F, and 0x5C, respectively.
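Writing 256 values by hand is error-prone, so a file like the one in the example can also be generated. The following Python sketch is an illustration only: the helper names build_table and format_table are hypothetical, and it emits every comment with a pound sign.

```python
import string


def build_table(extra_chars="-_\\/", fold_lowercase=True):
    """Return 256 values; 0x00 at a position means the character is not indexed."""
    table = [0] * 256
    # Digits, uppercase letters, and the extra characters index as themselves.
    for ch in string.digits + string.ascii_uppercase + extra_chars:
        table[ord(ch)] = ord(ch)
    # Lowercase letters hold either the uppercase value or their own value.
    for ch in string.ascii_lowercase:
        table[ord(ch)] = ord(ch.upper()) if fold_lowercase else ord(ch)
    return table


def format_table(table, comment_lines=()):
    """Render the table as comment lines followed by 16 lines of 16 hex values."""
    lines = ["# " + comment for comment in comment_lines]
    for row in range(16):
        values = table[row * 16:(row + 1) * 16]
        lines.append(" ".join(f"{value:02X}" for value in values))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_table(build_table(), ["Generated character set file"]), end="")
```

With the default arguments, the 16 data lines match the matrix in the example. Passing fold_lowercase=False would instead produce a case-sensitive character set.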
All lowercase letters are mapped to uppercase by specifying the uppercase hexadecimal value in the lowercase letter position.
For example, uppercase letter A has a hexadecimal value of 0x41. The position in the matrix of uppercase A contains the hexadecimal value 0x41, thus uppercase A is indexed as uppercase A.
However, the position in the matrix of lowercase a also contains the hexadecimal value 0x41 (which represents uppercase A) instead of the actual hexadecimal representation of lowercase a, 0x61. Thus, lowercase a is mapped to uppercase A, or in other words, lowercase a is indexed as if it were the same as uppercase A. The same is true for all the letters a through z and A through Z.
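The effect of the matrix on text can be modeled in a few lines of Python. This is only a sketch of the behavior described in this section, not the search engine's actual implementation: the function name apply_charset is hypothetical, and the text is assumed to be encodable as single bytes (ISO 8859-1).

```python
def apply_charset(text, table):
    """Model the described mapping: 0 entries become blanks, others are substituted."""
    out = []
    for byte in text.encode("latin-1"):
        mapped = table[byte]
        out.append(mapped if mapped else 0x20)  # 0x20 is the blank character
    return bytes(out).decode("latin-1")


# Rebuild the table from the example: digits, uppercase letters, and the
# four extra characters index as themselves; lowercase maps to uppercase.
table = [0] * 256
for ch in "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-_\\/":
    table[ord(ch)] = ord(ch)
for ch in "abcdefghijklmnopqrstuvwxyz":
    table[ord(ch)] = ord(ch) - 0x20  # 0x61 ('a') is stored as 0x41 ('A')

print(apply_charset("read-me_v2.txt (draft)", table))
```

In this model the hyphen and underscore survive, the letters come out in uppercase, and the period, the space, and the parentheses are replaced by blanks.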
For more information about the ISO 8859-1 table, refer to ids_excal_144.html#ids_excal_144.
Example
EXECUTE PROCEDURE etx_CreateCharSet
('my_charset', '/local0/excal/my_char_set_file');
The search engine stores and loads the contents of my_charset from the file called /local0/excal/my_char_set_file on the operating system.