Creating data distributions

You can generate statistics for a table and you can build data distributions for each table that your query accesses.

About this task

(You do not need to run UPDATE STATISTICS operations when the statistics are generated automatically.)

The database server creates data distributions, which provide information to the optimizer, any time the UPDATE STATISTICS MEDIUM or UPDATE STATISTICS HIGH command is executed.

Important:

The database server creates data distributions by sampling a column's data, sorting the data, building distributions bins, and inserting the results into the sysdistrib system catalog table.

You can control the sample size for the scan through the keyword HIGH or MEDIUM. The difference between UPDATE STATISTICS HIGH and UPDATE STATISTICS MEDIUM is the number of rows sampled. UPDATE STATISTICS HIGH scans the entire table, while UPDATE STATISTICS MEDIUM samples only a subset of rows, based on the confidence and resolution used by the UPDATE STATISTICS statement.

You can use the LOW keyword with the UPDATE STATISTICS statement only for fully qualified index keys.

If a distribution has been generated for a column, the optimizer uses that information to estimate the number of rows that match a query against a column. Data distributions in sysdistrib supersede values in the colmin and colmax column of the syscolumns system catalog table when the optimizer estimates the number of rows returned.

When you use data-distribution statistics for the first time, try to update statistics in MEDIUM mode for all your tables and then update statistics in HIGH mode for all columns that head indexes. This strategy produces statistical query estimates for the columns that you specify. These estimates, on average, have a margin of error less than percent of the total number of rows in the table, where percent is the value that you specify in the RESOLUTION clause in the MEDIUM mode. The default percent value for MEDIUM mode is 2.5 percent. (For columns with HIGH mode distributions, the default resolution is 0.5 percent.)

With the DISTRIBUTIONS ONLY option, you can execute UPDATE STATISTICS MEDIUM at the table level or for the entire system because the overhead of the extra columns is not large.

The database server uses the storage locations that the DBSPACETEMP environment variable specifies only when you use the HIGH option of UPDATE STATISTICS.

You can prevent UPDATE STATISTICS operations from using indexes when sorting rows by setting the third parameter of the DBUPSPACE environment variable to a value of 1.

For each table that your query accesses, build data distributions according to the following guidelines. Also see the examples below the guidelines.

Procedure

To generate statistics on a table:

Identify the set of all columns that appear in any single-column or multi-column index on the table.
Identify the subset that includes all columns that are not the leading column of any index.
Run UPDATE STATISTICS LOW on each column in that subset.

Results

To build data distributions for each table that your query accesses:

Run a single UPDATE STATISTICS MEDIUM for all columns in a table that do not head an index.
Use the default parameters unless the table is very large, in which case you should use a resolution of 1.0 and confidence of 0.99.
Run the following UPDATE STATISTICS statement to create distributions for non-index join columns and non-index filter columns:
```
UPDATE STATISTICS MEDIUM DISTRIBUTIONS ONLY;
```
Run UPDATE STATISTICS HIGH for all columns that head an index. For the fastest execution time of the UPDATE STATISTICS statement, you must execute one UPDATE STATISTICS HIGH statement for each column, as shown in the example below this procedure.
If you have indexes that begin with the same subset of columns, run UPDATE STATISTICS HIGH for the first column in each index that differs, as shown in the second example below this procedure.
For each single-column or multi-column index on the table:
1. Identify the set of all columns that appear in the index.
2. Identify the subset that includes all columns that are not the leading column of any index.
3. Run UPDATE STATISTICS LOW on each column in that subset. (LOW is the default.)
For the tables on which indexes were created in Step 3, run an UPDATE STATISTICS statement to update the sysindexes and syscolumns system catalog tables, as shown in the following example:
```
UPDATE STATISTICS FOR TABLE t1(a,b,e,f);
```
For small tables, run UPDATE STATISTICS HIGH, for example:
```
UPDATE STATISTICS HIGH FOR TABLE t2;
```

Because the statement constructs the statistics only once for each index, these steps ensure that UPDATE STATISTICS executes rapidly.

Examples

Example of UPDATE STATISTICS HIGH statements for all columns that head an index

Suppose you have a table t1 with columns a, b, c, d, e, and f with the following indexes:

CREATE INDEX ix_1 ON t1 (a, b, c, d) ... 
CREATE INDEX ix_3 ON t1 (f) ...

Run the following UPDATE STATISTICS statements for the columns that head an index:

UPDATE STATISTICS HIGH FOR TABLE t1(a);
UPDATE STATISTICS HIGH FOR TABLE t1(f);

These UPDATE STATISTICS HIGH statements replace the distributions created with the UPDATE STATISTICS MEDIUM statements with high distributions for index columns.

Example of UPDATE STATISTICS HIGH statements for the first column in each index that differs:

For example, suppose you have the following indexes on table t1:

CREATE INDEX ix_1 ON t1 (a, b, c, d) ...
CREATE INDEX ix_2 ON t1 (a, b, e, f) ...
CREATE INDEX ix_3 ON t1 (f) ...

Step 3 executes UPDATE STATISTICS HIGH on column a and column f. Then run UPDATE STATISTICS HIGH on columns c and e.

UPDATE STATISTICS HIGH FOR TABLE t1(c);
UPDATE STATISTICS HIGH FOR TABLE t1(e);

In addition, you can run UPDATE STATISTICS HIGH on column b, although this is usually not necessary.