Display data distributions

You can use the dbschema utility to display data distributions.

Unless column values change considerably, you do not need to regenerate the data distributions. To verify the accuracy of the distribution, compare dbschema -hd output with the results of appropriately constructed SELECT statements.

For example, the following dbschema command produces a list of distributions for each column of table customer in database vjp_stores with the number of values in each bin, and the number of distinct values:

dbschema -hd customer -d vjp_stores

Displaying Data Distributions with dbschema -hd shows the data distributions for the column zipcode that this dbschema -hd command produces. Because this column heads the zip_ix index, UPDATE STATISTICS HIGH was run on it, as the following output line indicates:

High Mode, 0.500000 Resolution

Displaying Data Distributions with dbschema -hd shows 17 bins with one distinct zipcode value in each bin.

Figure 1: Displaying Data Distributions with dbschema -hd

dbschema -hd customer -d vjp_stores

...
Distribution for virginia.customer.zipcode

Constructed on 09/18/2000

High Mode, 0.500000 Resolution
  
--- DISTRIBUTION ---

     (          02135 )
  1: (  1,   1, 02135 )
  2: (  1,   1, 08002 )
  3: (  1,   1, 08540 )
  4: (  1,   1, 19898 )
  5: (  1,   1, 32256 )
  6: (  1,   1, 60406 )
  7: (  1,   1, 74006 )
  8: (  1,   1, 80219 )
  9: (  1,   1, 85008 )
 10: (  1,   1, 85016 )
 11: (  1,   1, 94026 )
 12: (  1,   1, 94040 )
 13: (  1,   1, 94085 )
 14: (  1,   1, 94117 )
 15: (  1,   1, 94303 )
 16: (  1,   1, 94304 )
 17: (  1,   1, 94609 )


--- OVERFLOW ---

  1: (  2,      94022 )
  2: (  2,      94025 )
  3: (  2,      94062 )
  4: (  3,      94063 )
  5: (  2,      94086 )

The OVERFLOW portion of the output shows the duplicate values that might skew the distribution data, so dbschema moves them from the distribution to a separate list. The number of duplicates in this overflow list must be greater than a critical amount that the following formula determines. Displaying Data Distributions with dbschema -hd shows a resolution value of .0050. Therefore, this formula determines that any value that is duplicated more than one time is listed in the overflow section.

Overflow = .25 * resolution * number_rows
      = .25 * .0050 * 28
      = .035

For more information about the dbschema utility, see the HCL OneDB™ Migration Guide.