The notion that data must be denormalized for performance is one of the most common and entrenched misconceptions in data management. Less than full normalization has severe deleterious effects, particularly for analytics.
As the argument goes, normalization reduces data redundancy at such a heavy performance cost that data professionals trade off its “esthetics” for better performance and ease of use. A common example is:
"To be able to quickly retrieve total monthly product sales hundreds of times a month, instead of reading thousands of rows of order lines in a fully normalized table, a table keyed on Year, Month and Product holds total sales and is updated as orders are processed".
There are several fallacies in this argument.
(Note: There is an important and qualitative distinction between normalization (to 1NF) and further normalization (to 5NF), which is missed or misunderstood.)
First, the example is not one of denormalization. Denormalization introduces redundancy by merging or nesting simple facts into compound facts -- "bundling" multiple types of fact (e.g., about both orders and customers) into one relation -- not by storing derived (calculated) data, as in the example.
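A minimal sketch of that distinction, with made-up relation names: denormalization proper repeats customer facts on every order row, whereas the quoted example merely stores a derived total.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Denormalization proper: facts about orders AND about customers "bundled"
-- into one relation; each customer's name and city are repeated on every
-- one of that customer's orders (redundancy).
CREATE TABLE ORDERS_DENORMALIZED (
    order_no      INTEGER PRIMARY KEY,
    order_date    TEXT,
    customer_no   INTEGER,
    customer_name TEXT,   -- customer fact, repeated per order
    customer_city TEXT    -- customer fact, repeated per order
);

-- Fully normalized: one relation per type of fact.
CREATE TABLE CUSTOMERS (
    customer_no   INTEGER PRIMARY KEY,
    customer_name TEXT,
    customer_city TEXT
);
CREATE TABLE ORDERS (
    order_no    INTEGER PRIMARY KEY,
    order_date  TEXT,
    customer_no INTEGER REFERENCES CUSTOMERS
);
""")
```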
Second, even if the purpose of full normalization were solely to reduce redundancy -- it is not -- the associated data integrity risk is hardly sheer "esthetics". Denormalized databases:
- Defeat semantic correctness (i.e., risk anomalous query results; see the sketch below)
- Are more complex and harder to understand
- Make constraint and query formulation more complex, more error prone and harder to optimize
- Make query results harder to interpret
- Are biased for certain applications and against others
This belies the claims of ease of use and defeats the purpose of database application neutrality -- serving multiple applications well, not just one.
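To illustrate the first risk above (anomalous query results), here is a minimal sketch assuming a made-up denormalized orders relation in which a customer fact -- the customer's city -- is repeated on every order row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ORDERS_DENORMALIZED (
    order_no      INTEGER PRIMARY KEY,
    customer_no   INTEGER,
    customer_city TEXT,   -- customer fact, repeated per order
    amount        REAL
);
INSERT INTO ORDERS_DENORMALIZED VALUES
    (1, 101, 'Boston', 50.0),
    (2, 101, 'Boston', 75.0),   -- same customer, city fact repeated
    (3, 102, 'Boston', 20.0);
""")

# Intended question: how many customers are there in each city?
naive = con.execute(
    "SELECT customer_city, COUNT(*) FROM ORDERS_DENORMALIZED GROUP BY customer_city"
).fetchall()
# [('Boston', 3)] -- anomalous: there are only two customers

correct = con.execute(
    "SELECT customer_city, COUNT(DISTINCT customer_no) "
    "FROM ORDERS_DENORMALIZED GROUP BY customer_city"
).fetchall()
# [('Boston', 2)] -- and customers with no orders do not appear at all
```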
Third, the argument for denormalization is rooted in logical-physical confusion (LPC) of levels of representation. Consider the equality:
(a + b) × (a - b) = a² - b²
Suppose you plug numbers for a and b into a calculator, compute each of the two sides separately, and get the result for one side slower than for the other. Do you blame this on the "slower side of the equation", or on the calculator?
A fully normalized (5NF) database schema and a (properly) denormalized one are like the two sides of the equation -- different logical representations of the same information. But the logical representation (the side of the equation) does not determine performance; the physical implementation -- hardware, DBMS, OS, physical database design (the calculator) -- does, and exclusively so. While calculator implementations are fixed and users have no control over them, DBMS and database implementations can and should be diverse, variable and optimizable. It is always and only when either or both do not do a sufficiently good job that performance suffers.
Be that as it may, practitioners insist that performance improves when they denormalize databases, because "bundling" facts into fewer relations reduces joins. But even if this were always true -- it is not -- performance gains, if any, do not come from denormalization per se, but from trading off integrity for performance.
What many data professionals miss is that the redundancy introduced by denormalization must be controlled by the DBMS to ensure data integrity, which requires special integrity constraints that, it turns out, involve the very joins that denormalization is intended to avoid, defeating its purpose. These constraints are practically never declared and enforced, which creates the illusion that denormalization improves performance at no cost.
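As a rough sketch of what that control entails, assuming the same hypothetical SQLite schema as in the earlier sketch (ORDER_LINES and MONTHLY_PRODUCT_SALES): every change to the order lines must also maintain the stored totals, and verifying that the redundancy is still consistent re-runs the very aggregation the summary table was supposed to make unnecessary.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ORDER_LINES (order_no INTEGER, order_date TEXT,
                          product TEXT, amount REAL);
CREATE TABLE MONTHLY_PRODUCT_SALES (year INTEGER, month INTEGER,
                                    product TEXT, total REAL,
                                    PRIMARY KEY (year, month, product));

-- Every insert into ORDER_LINES must also update the redundant summary ...
CREATE TRIGGER maintain_monthly_totals
AFTER INSERT ON ORDER_LINES
BEGIN
    INSERT OR IGNORE INTO MONTHLY_PRODUCT_SALES (year, month, product, total)
    VALUES (CAST(strftime('%Y', NEW.order_date) AS INTEGER),
            CAST(strftime('%m', NEW.order_date) AS INTEGER),
            NEW.product, 0);
    UPDATE MONTHLY_PRODUCT_SALES
       SET total = total + NEW.amount
     WHERE year    = CAST(strftime('%Y', NEW.order_date) AS INTEGER)
       AND month   = CAST(strftime('%m', NEW.order_date) AS INTEGER)
       AND product = NEW.product;
END;
""")

# ... and checking that the summary is still consistent re-runs the very
# aggregation over ORDER_LINES the summary was supposed to avoid.
inconsistencies = con.execute("""
SELECT s.year, s.month, s.product, s.total AS stored, d.total AS derived
FROM MONTHLY_PRODUCT_SALES AS s
LEFT JOIN (SELECT CAST(strftime('%Y', order_date) AS INTEGER) AS year,
                  CAST(strftime('%m', order_date) AS INTEGER) AS month,
                  product, SUM(amount) AS total
           FROM ORDER_LINES
           GROUP BY strftime('%Y', order_date), strftime('%m', order_date), product
          ) AS d
  ON s.year = d.year AND s.month = d.month AND s.product = d.product
WHERE s.total IS NOT d.total
""").fetchall()   # any row returned indicates an inconsistency
```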
The real solution to unsatisfactory performance is better implementations, not integrity tradeoffs users should not have to make. But as long as the illusion persists, what incentive do the vendors have to offer them?
For a more in-depth discussion of this and other misconceptions, see my just-published DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.