CREATE STATISTICS is definitely one of my personal favorite features of PostgreSQL 10.0 because it allows consultants to help customers in many real-world situations. So, what is it all about? When you run SQL, the optimizer has to make clever decisions to speed up your queries. To do so, it relies heavily on estimates of how many rows a certain clause or operation returns. Before version 10.0, PostgreSQL only had statistics about individual columns. Let's look at an example:
SELECT * FROM car WHERE vendor = 'Ford' AND model = 'Mini Clubman';
In version 9.6, PostgreSQL checks which fraction of the table matches Ford and which fraction matches Mini Clubman. Then, it tries to guess how many rows match both criteria. Remember, PostgreSQL 9.6 only has information about each individual column; it does not know that these columns are actually related. Therefore, it simply multiplies the odds of finding Ford by the odds of finding Mini Clubman and uses this number. However, Ford does not produce the Mini Clubman; only BMW does. Therefore, the estimate is wrong: the planner expects rows where none exist. The same cross-column correlation problem can happen in other cases too. The number of rows returned by a join might not be clear, and the number of groups returned by a GROUP BY clause might be hard to estimate.
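To make the multiplication visible, here is a minimal sketch. Only the table name car and the columns vendor and model come from the query above; the column types, the generated rows, and the 50/50 split between the two vendors are assumptions made purely for illustration.

```sql
-- Hypothetical demo data: half of the rows are Ford, half are BMW,
-- and each model belongs to exactly one vendor.
CREATE TABLE car (vendor text, model text);

INSERT INTO car
SELECT CASE WHEN i % 2 = 0 THEN 'Ford' ELSE 'BMW' END,
       CASE WHEN i % 2 = 0 THEN 'Focus' ELSE 'Mini Clubman' END
FROM generate_series(1, 100000) AS i;

-- Collect the usual per-column statistics.
ANALYZE car;

-- Each condition alone matches about half of the table. Without
-- cross-column knowledge, the planner multiplies the two fractions
-- (about 0.5 * 0.5) and expects roughly a quarter of the table,
-- although no row satisfies both conditions.
EXPLAIN SELECT * FROM car WHERE vendor = 'Ford' AND model = 'Mini Clubman';
```

Running the same statement with EXPLAIN ANALYZE puts the estimated and the actual row counts side by side, which makes the gap easy to spot.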
Consider the following example:
SELECT gender, age, count(*) FROM children_born GROUP BY gender, age;
The number of children born will definitely depend on the age of the parent. The likelihood that a 30-year-old woman will have a child is pretty high, and therefore there will be a count for that combination. However, if you happen to be 98, it is pretty unrealistic to have a baby, especially if you are a man (men tend not to give birth to children). In other words, not every combination of gender and age shows up in the data, so the number of groups is smaller than the number of genders multiplied by the number of ages.
CREATE STATISTICS will give the optimizer a chance to gain deeper insights into what is going on by storing multivariate statistics. The idea is to help the optimizer handle functional dependencies between columns as well as the number of distinct value combinations across several columns.
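As a rough sketch of how this looks in practice, the following statements define multivariate statistics for the two examples. The tables car and children_born and their columns are taken from the queries above and are assumed to exist; the statistics names are hypothetical and chosen only for this illustration.

```sql
-- Track functional dependencies between vendor and model
-- (the statistics name is made up for this example).
CREATE STATISTICS car_vendor_model_stats (dependencies)
    ON vendor, model FROM car;

-- Track the number of distinct (gender, age) combinations, which the
-- planner can use when estimating the number of groups for GROUP BY.
CREATE STATISTICS children_born_gender_age_stats (ndistinct)
    ON gender, age FROM children_born;

-- Extended statistics are only computed when the tables are analyzed.
ANALYZE car;
ANALYZE children_born;

-- List the statistics objects that have been defined.
SELECT stxname, stxkeys, stxkind FROM pg_statistic_ext;
```

One caveat is worth keeping in mind: functional dependency statistics are meant to improve estimates for combinations that actually occur together, such as BMW and Mini Clubman. They are not designed to detect impossible combinations, so a filter on Ford and Mini Clubman may still be misestimated.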