The Costly Illusion of Denormalization for Performance


The notion that data must be denormalized for performance is one of the most common and entrenched misconceptions in data management. Anything less than full normalization has severe deleterious effects, particularly for analytics.

As the argument goes, normalization reduces data redundancy at such a heavy performance cost that data professionals trade off its “esthetics” for better performance and ease of use. A common example:

    "To be able to quickly retrieve total monthly product sales hundreds of times a month, instead of reading thousands of rows of order lines in a fully normalized table, a table keyed on Year, Month and Product holds total sales and is updated as orders are processed".

There are several fallacies in this argument.

(Note: There is an important, qualitative distinction between normalization (to 1NF) and further normalization (to 5NF) that is often missed or misunderstood.)

First, the example is not one of denormalization. Denormalization introduces redundancy by merging or nesting simple facts into compound facts -- "bundling" multiple types of fact (e.g., about both orders and customers) into one relation -- not by storing derived (calculated) data, as in the example.

Second, even if the sole purpose of full normalization were to reduce redundancy -- it is not -- the associated data integrity risk is hardly sheer "esthetics". Denormalized databases:

  • Defeat semantic correctness (i.e., risk anomalous query results! See the sketch below.)
  • Are more complex and harder to understand
  • Make constraint and query formulation more complex and error-prone, and harder to optimize
  • Make query results harder to interpret
  • Are biased toward certain applications and against others

This defies the claims of ease of use and defeats the purpose of database application neutrality -- serving multiple applications well, not just one.
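To illustrate the first point -- anomalous results -- here is a minimal sketch, assuming a hypothetical customer_orders relation that bundles a customer fact (a credit limit) with order facts (amounts):

    # Sketch of an anomalous result from "bundling" facts; names are hypothetical.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE customer_orders (
            order_no     INTEGER PRIMARY KEY,
            customer     TEXT    NOT NULL,
            credit_limit NUMERIC NOT NULL,  -- customer fact, repeated on every order row
            amount       NUMERIC NOT NULL   -- order fact
        )""")
    con.executemany(
        "INSERT INTO customer_orders VALUES (?, ?, ?, ?)",
        [(1, "Acme", 1000, 250), (2, "Acme", 1000, 300)],
    )

    # A seemingly reasonable query double-counts Acme's credit limit,
    # because a fact about the customer is stored once per order:
    print(con.execute("SELECT SUM(credit_limit) FROM customer_orders").fetchone())
    # -> (2000,) although Acme's credit limit is 1000

It is possible to formulate the query correctly against the bundled relation, but doing so is exactly the kind of extra complexity and error-proneness the list above refers to.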

Third, the argument for denormalization is rooted in logical-physical confusion (LPC) -- conflating the two levels of representation. Consider the algebraic identity:

    (a + b) × (a - b) = a² - b²

Suppose you plug numbers for a and b into a calculator, compute each of the two sides separately, and one side comes out more slowly than the other. Do you blame this on the "slower side of the equation", or on the calculator?
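For instance, with a = 7 and b = 3 both sides yield 40; which side a given calculator evaluates faster tells you nothing about the identity itself. A trivial, purely illustrative check:

    a, b = 7, 3                # any numbers will do
    lhs = (a + b) * (a - b)    # left-hand side:  10 * 4  = 40
    rhs = a**2 - b**2          # right-hand side: 49 - 9  = 40
    assert lhs == rhs          # same value, however it is computed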

A fully normalized (5NF) database schema and a (properly) denormalized one are like the two sides of the equation -- they are different logical representations of the same information. But the logical representation (the side of the equation) does not determine performance; the physical implementation -- hardware, DBMS, OS, physical database design (the calculator) -- does, and exclusively so. While calculator implementations are fixed and users have no control over them, DBMS and database implementations can and should be diverse, variable, and optimizable. It is always and only when either or both fail to do a sufficiently good job that performance suffers.

Be that as it may, practitioners insist that performance improves when they denormalize databases, because "bundling" facts into fewer relations reduces joins. But even if this were always true -- it is not -- performance gains, if any, do not come from denormalization per se, but from trading off integrity for performance. What many data professionals miss is that the redundancy introduced by denormalization must be controlled by the DBMS to ensure data integrity, which requires special integrity constraints that, it turns out, involve the very joins denormalization is intended to avoid, defeating its purpose. These constraints are practically never declared and enforced, which creates the illusion that denormalization improves performance at no cost.
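To make this concrete, here is a sketch of part of the machinery needed to control the redundancy in the earlier hypothetical order_lines / monthly_product_sales design (again SQLite via Python). Note that both the maintenance trigger and the consistency check involve the very aggregation over order lines that the summary table was supposed to spare us:

    # Sketch of (part of) the integrity machinery the redundancy requires;
    # the tables are the hypothetical ones from the earlier sketch.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE order_lines (
            order_no INTEGER, line_no INTEGER, product TEXT,
            year INTEGER, month INTEGER, amount NUMERIC,
            PRIMARY KEY (order_no, line_no)
        );
        CREATE TABLE monthly_product_sales (
            year INTEGER, month INTEGER, product TEXT, total_sales NUMERIC,
            PRIMARY KEY (year, month, product)
        );

        -- One of the triggers needed to keep the derived totals consistent;
        -- updates and deletes of order lines would each need analogous triggers.
        CREATE TRIGGER maintain_monthly_totals
        AFTER INSERT ON order_lines
        BEGIN
            UPDATE monthly_product_sales
               SET total_sales = total_sales + NEW.amount
             WHERE year = NEW.year AND month = NEW.month AND product = NEW.product;

            INSERT INTO monthly_product_sales (year, month, product, total_sales)
            SELECT NEW.year, NEW.month, NEW.product, NEW.amount
            WHERE NOT EXISTS (
                SELECT 1 FROM monthly_product_sales
                WHERE year = NEW.year AND month = NEW.month
                  AND product = NEW.product
            );
        END;
    """)

    # Checking that the stored totals still equal the true totals requires exactly
    # the join and aggregation over order lines that the summary table was meant
    # to spare us:
    drift = con.execute("""
        SELECT s.year, s.month, s.product
        FROM monthly_product_sales AS s
        JOIN order_lines AS o
          ON o.year = s.year AND o.month = s.month AND o.product = s.product
        GROUP BY s.year, s.month, s.product
        HAVING s.total_sales <> SUM(o.amount)
    """).fetchall()

Declared or not, that is the cost: either the DBMS does this work, or integrity is silently traded away.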

The real solution to unsatisfactory performance is better implementations, not integrity tradeoffs users should not have to make. But as long as the illusion persists, what incentive do the vendors have to offer them?

For a more in-depth discussion of this and other misconceptions, see my just-published DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS.

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.

Why You Always Need Primary Keys

Database pros should heed this warning: if you ignore that primary keys are mandatory, you can wreak havoc with inferences made from databases, including in analytics.

The Trouble with Data Warehouse Analytics

Data warehouses are essentially databases biased for particular data applications and against others. They are rooted in poor database foundation knowledge and logical-physical confusion.


Re: Good enough?
  • 11/1/2016 6:06:17 PM

Actually it is much worse than that.

One of the intended consequences of automating everything is to stop intellectual development, such that people no longer acquire the knowledge and ability to reason critically and independently. In such circumstances it is increasingly impossible to know, understand, and appreciate the difference between science and the pretense of science. If you try to make people aware of the difference, you're going to be dismissed as "pedantic", "purist", "dinosaur", or some such.

It's pretty distressing to see people actually DEFENDING unsound products and practices as solutions to problems and ATTACKING sound ones. See, for example:

https://news.ycombinator.com/item?id=12739632

https://news.ycombinator.com/item?id=12437389

This is where the speed comes in: by the time the consequences materialize -- for which others pay -- their authors/promoters have already produced and probably sold more of the same. Those responsible never pay -- in fact, the system rewards them. And those who end up paying are no longer able to figure out the source of their troubles.

Check and mate.

Re: Good enough?
  • 11/1/2016 5:45:51 PM

I was going to mention that point as well. Most of us have no idea whether the data we are pulling is really correct, unless we understand how it was coded or are so familiar with the field that we know something is off.

Re: Good enough?
  • 10/31/2016 11:52:58 PM

Of course. Except that today speed overrides correctness. What is more, there is not enough education, knowledge, and reasoning ability to comprehend that correctness is not automatic -- it's assumed that you always get the correct result, independent of the black box from which you get it. Hey, computers!

Good enough?
  • 10/31/2016 9:06:38 PM

Call me naive, but I always thought it was about getting the right answers, not about getting the wrong answers more quickly -- unless you have a situation where close enough is good enough.