Specifying the SAMPLING SIZE
In MEDIUM mode, you can optionally use the SAMPLING SIZE keywords
to specify the minimum number of rows to sample for calculating column
distribution statistics. If the Resolution clause omits the RESOLUTION
keyword and specifies no confidence level and no percent value,
then the number of rows that HCL
OneDB™ samples
will be the larger of the following two values:
- The min value that you specify immediately after the SAMPLING SIZE keywords
- The sampling size that is required for the default percent of rows in each bin (2.5%) and for the minimum confidence level (0.80).
If a sampling size is specified in a Resolution clause that includes
explicit values for both the average percent of sampled rows
per bin and for the confidence level, then the number of sampled
rows will be the larger of these two values:
- The min value that you specify immediately after the SAMPLING SIZE keywords
- The sampling size that is required for the specified percent of rows and for the specified confidence level.
If a sampling size is specified in a Resolution clause that includes an average percentage value but sets no confidence level, then the minimum confidence value of 0.80 is used to calculate the actual sampling size for HCL OneDB to use if the specified size is smaller.
For example, the following statement calculates statistics for
two columns of the customer table, without updating index information.
At least 200 rows will be sampled, but the actual size of the sample
might be larger than 200 if more rows are required to provide the
default 0.80 confidence level for a sample distribution that uses
approximately 50 equivalence categories, with an average percentage
of 2% of the sampled values in each bin.
UPDATE STATISTICS MEDIUM FOR TABLE customer (city, state) SAMPLING SIZE 200 RESOLUTION 2 DISTRIBUTIONS ONLY;
Whether or not you include an explicit SAMPLING SIZE specification in the Resolution clause, HCL OneDB records in the system catalog the actual sampling size (as a percentage of the total number of rows in the table) at the time of MEDIUM mode UPDATE STATISTICS creation.