As we saw in the previous section, the query planner needs to estimate the number of rows retrieved by a query in order to make good choices of query plans. This section provides a quick look at the statistics that the system uses for these estimates.
One component of the statistics is the total number of entries in
each table and index, as well as the number of disk blocks occupied
by each table and index. This information is kept in the table
pg_class
,
in the columns reltuples
and
relpages
. We can look at it with
queries similar to this one:
SELECT relname, relkind, reltuples, relpages FROM pg_class WHERE relname LIKE 'tenk1%'; relname | relkind | reltuples | relpages ----------------------+---------+-----------+---------- tenk1 | r | 10000 | 358 tenk1_hundred | i | 10000 | 30 tenk1_thous_tenthous | i | 10000 | 30 tenk1_unique1 | i | 10000 | 30 tenk1_unique2 | i | 10000 | 30 (5 rows)
Here we can see that tenk1
contains 10000
rows, as do its indexes, but the indexes are (unsurprisingly) much
smaller than the table.
For efficiency reasons, reltuples
and relpages
are not updated on-the-fly,
and so they usually contain somewhat out-of-date values.
They are updated by VACUUM
, ANALYZE
, and a
few DDL commands such as CREATE INDEX
. A VACUUM
or ANALYZE
operation that does not scan the entire table
(which is commonly the case) will incrementally update the
reltuples
count on the basis of the part
of the table it did scan, resulting in an approximate value.
In any case, the planner
will scale the values it finds in pg_class
to match the current physical table size, thus obtaining a closer
approximation.
Most queries retrieve only a fraction of the rows in a table, due
to WHERE
clauses that restrict the rows to be
examined. The planner thus needs to make an estimate of the
selectivity of WHERE
clauses, that is,
the fraction of rows that match each condition in the
WHERE
clause. The information used for this task is
stored in the
pg_statistic
system catalog. Entries in pg_statistic
are updated by the ANALYZE
and VACUUM
ANALYZE
commands, and are always approximate even when freshly
updated.
Rather than look at pg_statistic
directly,
it's better to look at its view
pg_stats
when examining the statistics manually. pg_stats
is designed to be more easily readable. Furthermore,
pg_stats
is readable by all, whereas
pg_statistic
is only readable by a superuser.
(This prevents unprivileged users from learning something about
the contents of other people's tables from the statistics. The
pg_stats
view is restricted to show only
rows about tables that the current user can read.)
For example, we might do:
SELECT attname, inherited, n_distinct, array_to_string(most_common_vals, E'\n') as most_common_vals FROM pg_stats WHERE tablename = 'road'; attname | inherited | n_distinct | most_common_vals ---------+-----------+------------+------------------------------------ name | f | -0.363388 | I- 580 Ramp+ | | | I- 880 Ramp+ | | | Sp Railroad + | | | I- 580 + | | | I- 680 Ramp name | t | -0.284859 | I- 880 Ramp+ | | | I- 580 Ramp+ | | | I- 680 Ramp+ | | | I- 580 + | | | State Hwy 13 Ramp (2 rows)
Note that two rows are displayed for the same column, one corresponding
to the complete inheritance hierarchy starting at the
road
table (inherited
=t
),
and another one including only the road
table itself
(inherited
=f
).
The amount of information stored in pg_statistic
by ANALYZE
, in particular the maximum number of entries in the
most_common_vals
and histogram_bounds
arrays for each column, can be set on a
column-by-column basis using the ALTER TABLE SET STATISTICS
command, or globally by setting the
default_statistics_target configuration variable.
The default limit is presently 100 entries. Raising the limit
might allow more accurate planner estimates to be made, particularly for
columns with irregular data distributions, at the price of consuming
more space in pg_statistic
and slightly more
time to compute the estimates. Conversely, a lower limit might be
sufficient for columns with simple data distributions.
Further details about the planner's use of statistics can be found in Chapter 72.
It is common to see slow queries running bad execution plans because multiple columns used in the query clauses are correlated. The planner normally assumes that multiple conditions are independent of each other, an assumption that does not hold when column values are correlated. Regular statistics, because of their per-individual-column nature, cannot capture any knowledge about cross-column correlation. However, PostgreSQL has the ability to compute multivariate statistics, which can capture such information.
Because the number of possible column combinations is very large, it's impractical to compute multivariate statistics automatically. Instead, extended statistics objects, more often called just statistics objects, can be created to instruct the server to obtain statistics across interesting sets of columns.
Statistics objects are created using the
CREATE STATISTICS command.
Creation of such an object merely creates a catalog entry expressing
interest in the statistics. Actual data collection is performed
by ANALYZE
(either a manual command, or background
auto-analyze). The collected values can be examined in the
pg_statistic_ext_data
catalog.
ANALYZE
computes extended statistics based on the same
sample of table rows that it takes for computing regular single-column
statistics. Since the sample size is increased by increasing the
statistics target for the table or any of its columns (as described in
the previous section), a larger statistics target will normally result in
more accurate extended statistics, as well as more time spent calculating
them.
The following subsections describe the kinds of extended statistics that are currently supported.
The simplest kind of extended statistics tracks functional
dependencies, a concept used in definitions of database normal forms.
We say that column b
is functionally dependent on
column a
if knowledge of the value of
a
is sufficient to determine the value
of b
, that is there are no two rows having the same value
of a
but different values of b
.
In a fully normalized database, functional dependencies should exist
only on primary keys and superkeys. However, in practice many data sets
are not fully normalized for various reasons; intentional
denormalization for performance reasons is a common example.
Even in a fully normalized database, there may be partial correlation
between some columns, which can be expressed as partial functional
dependency.
The existence of functional dependencies directly affects the accuracy of estimates in certain queries. If a query contains conditions on both the independent and the dependent column(s), the conditions on the dependent columns do not further reduce the result size; but without knowledge of the functional dependency, the query planner will assume that the conditions are independent, resulting in underestimating the result size.
To inform the planner about functional dependencies, ANALYZE
can collect measurements of cross-column dependency. Assessing the
degree of dependency between all sets of columns would be prohibitively
expensive, so data collection is limited to those groups of columns
appearing together in a statistics object defined with
the dependencies
option. It is advisable to create
dependencies
statistics only for column groups that are
strongly correlated, to avoid unnecessary overhead in both
ANALYZE
and later query planning.
Here is an example of collecting functional-dependency statistics:
CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes; ANALYZE zipcodes; SELECT stxname, stxkeys, stxddependencies FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid) WHERE stxname = 'stts'; stxname | stxkeys | stxddependencies ---------+---------+------------------------------------------ stts | 1 5 | {"1 => 5": 1.000000, "5 => 1": 0.423130} (1 row)
Here it can be seen that column 1 (zip code) fully determines column 5 (city) so the coefficient is 1.0, while city only determines zip code about 42% of the time, meaning that there are many cities (58%) that are represented by more than a single ZIP code.
When computing the selectivity for a query involving functionally dependent columns, the planner adjusts the per-condition selectivity estimates using the dependency coefficients so as not to produce an underestimate.
Functional dependencies are currently only applied when considering
simple equality conditions that compare columns to constant values,
and IN
clauses with constant values.
They are not used to improve estimates for equality conditions
comparing two columns or comparing a column to an expression, nor for
range clauses, LIKE
or any other type of condition.
When estimating with functional dependencies, the planner assumes that conditions on the involved columns are compatible and hence redundant. If they are incompatible, the correct estimate would be zero rows, but that possibility is not considered. For example, given a query like
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
the planner will disregard the city
clause as not
changing the selectivity, which is correct. However, it will make
the same assumption about
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
even though there will really be zero rows satisfying this query. Functional dependency statistics do not provide enough information to conclude that, however.
In many practical situations, this assumption is usually satisfied; for example, there might be a GUI in the application that only allows selecting compatible city and ZIP code values to use in a query. But if that's not the case, functional dependencies may not be a viable option.
Single-column statistics store the number of distinct values in each
column. Estimates of the number of distinct values when combining more
than one column (for example, for GROUP BY a, b
) are
frequently wrong when the planner only has single-column statistical
data, causing it to select bad plans.
To improve such estimates, ANALYZE
can collect n-distinct
statistics for groups of columns. As before, it's impractical to do
this for every possible column grouping, so data is collected only for
those groups of columns appearing together in a statistics object
defined with the ndistinct
option. Data will be collected
for each possible combination of two or more columns from the set of
listed columns.
Continuing the previous example, the n-distinct counts in a table of ZIP codes might look like the following:
CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes; ANALYZE zipcodes; SELECT stxkeys AS k, stxdndistinct AS nd FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid) WHERE stxname = 'stts2'; -[ RECORD 1 ]-------------------------------------------------------- k | 1 2 5 nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178} (1 row)
This indicates that there are three combinations of columns that have 33178 distinct values: ZIP code and state; ZIP code and city; and ZIP code, city and state (the fact that they are all equal is expected given that ZIP code alone is unique in this table). On the other hand, the combination of city and state has only 27435 distinct values.
It's advisable to create ndistinct
statistics objects only
on combinations of columns that are actually used for grouping, and
for which misestimation of the number of groups is resulting in bad
plans. Otherwise, the ANALYZE
cycles are just wasted.
Another type of statistic stored for each column are most-common value lists. This allows very accurate estimates for individual columns, but may result in significant misestimates for queries with conditions on multiple columns.
To improve such estimates, ANALYZE
can collect MCV
lists on combinations of columns. Similarly to functional dependencies
and n-distinct coefficients, it's impractical to do this for every
possible column grouping. Even more so in this case, as the MCV list
(unlike functional dependencies and n-distinct coefficients) does store
the common column values. So data is collected only for those groups
of columns appearing together in a statistics object defined with the
mcv
option.
Continuing the previous example, the MCV list for a table of ZIP codes might look like the following (unlike for simpler types of statistics, a function is required for inspection of MCV contents):
CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes; ANALYZE zipcodes; SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid), pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3'; index | values | nulls | frequency | base_frequency -------+------------------------+-------+-----------+---------------- 0 | {Washington, DC} | {f,f} | 0.003467 | 2.7e-05 1 | {Apo, AE} | {f,f} | 0.003067 | 1.9e-05 2 | {Houston, TX} | {f,f} | 0.002167 | 0.000133 3 | {El Paso, TX} | {f,f} | 0.002 | 0.000113 4 | {New York, NY} | {f,f} | 0.001967 | 0.000114 5 | {Atlanta, GA} | {f,f} | 0.001633 | 3.3e-05 6 | {Sacramento, CA} | {f,f} | 0.001433 | 7.8e-05 7 | {Miami, FL} | {f,f} | 0.0014 | 6e-05 8 | {Dallas, TX} | {f,f} | 0.001367 | 8.8e-05 9 | {Chicago, IL} | {f,f} | 0.001333 | 5.1e-05 ... (99 rows)
This indicates that the most common combination of city and state is Washington in DC, with actual frequency (in the sample) about 0.35%. The base frequency of the combination (as computed from the simple per-column frequencies) is only 0.0027%, resulting in two orders of magnitude under-estimates.
It's advisable to create MCV statistics objects only
on combinations of columns that are actually used in conditions together,
and for which misestimation of the number of groups is resulting in bad
plans. Otherwise, the ANALYZE
and planning cycles
are just wasted.