Why Data Scientists Must Understand Normalization

We are constantly told that data scientists must be “jacks of many skills,” but one of the most important of those skills rarely makes the list.

Very few databases are properly designed. Many SQL databases are denormalized inadvertently, or intentionally for performance. They require special constraints to control data redundancy and prevent inconsistencies, and those constraints are practically never enforced. Analysts cannot, therefore, take database consistency for granted. Furthermore, to issue sensible queries and to ensure correct results and their correct interpretation, analysts must know not only the types of fact represented in the database, but also whether and how the database designer has chosen to bundle -- nest or merge -- those facts, and how to disentangle them for analysis.

A more common way to bundle facts than nesting (see Data Fundamentals for Analysts: Nested Facts and the (1st) Normal Form) is to merge them. For example, given facts about two classes of entities, departments and employees, instead of representing them in two relations, DEPARTMENTS and EMPLOYEES, each fact about a department is merged with every fact about each of the employees assigned to it, and the merged facts are represented by a single relation, DEPT_EMPS.
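As a hedged sketch of the two designs, here is an SQLite version in Python. Only the relation names DEPT_EMPS, DEPARTMENTS, and EMPLOYEES and the key attributes DEPT# and EMP# come from the article; the remaining column names and types are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")

con.executescript("""
-- Normalized design: one relation per type of fact.
CREATE TABLE departments (
    dept_no   TEXT PRIMARY KEY,   -- entity identifier (DEPT#)
    dept_name TEXT NOT NULL,
    budget    INTEGER NOT NULL
);
CREATE TABLE employees (
    emp_no   TEXT PRIMARY KEY,    -- entity identifier (EMP#)
    emp_name TEXT NOT NULL,
    dept_no  TEXT NOT NULL REFERENCES departments(dept_no)
);

-- Denormalized alternative: department and employee facts merged
-- into one relation with the composite key (dept_no, emp_no).
CREATE TABLE dept_emps (
    dept_no   TEXT NOT NULL,
    emp_no    TEXT NOT NULL,
    dept_name TEXT NOT NULL,
    budget    INTEGER NOT NULL,
    emp_name  TEXT NOT NULL,
    PRIMARY KEY (dept_no, emp_no)
);
""")

tables = {r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
```

Note that in the merged design every department attribute is dragged into each employee's tuple; the consequences are spelled out below.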

Database designs that merge facts of different types into single relations have drawbacks similar to those of designs that nest facts.

  • A database with relations such as DEPT_EMPS is more difficult to understand than one with DEPARTMENTS and EMPLOYEES
  • Constraints and queries are more complex and results harder to interpret
  • Errors are more likely

But merging causes additional problems. Within a single class, the entity identifier determines all other properties: If you know the employee number of an employee, you also know all his other properties, because there is no other employee with that number. In other words, for each identifier value there is exactly one value of every descriptive property. Formally, all descriptive properties are functionally dependent on the identifier.
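The definition can be illustrated with a small, hypothetical check in code: a functional dependency holds when each determinant value maps to exactly one combination of dependent values. The attribute names and data are invented for illustration.

```python
def holds_fd(rows, determinant, dependents):
    """Return True if the FD determinant -> dependents holds in rows
    (a list of dicts): each determinant value must map to exactly one
    combination of dependent-attribute values."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        val = tuple(row[a] for a in dependents)
        if seen.setdefault(key, val) != val:
            return False
    return True

employees = [
    {"emp_no": "E1", "emp_name": "Smith", "dept_no": "D1"},
    {"emp_no": "E2", "emp_name": "Jones", "dept_no": "D1"},
]
# EMP# determines every descriptive attribute: the FD holds.
assert holds_fd(employees, ["emp_no"], ["emp_name", "dept_no"])
```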

If a relation represents facts about a single class, the only attribute dependencies that hold in it are the functional dependencies (FDs) of the non-key attributes on the key.

Since a key value cannot repeat, neither can the non-key values dependent on it, and there is no redundancy.

This means, first, that a uniqueness constraint on the key is sufficient to ensure entity integrity, namely that there is a 1:1 correspondence between facts and relation tuples. Second, because the design keeps facts of different types in separate relations, they can be updated independently.

If a relation represents merged facts, that is no longer the case. The key of DEPT_EMPS is composite—(DEPT#,EMP#)—and the non-key attributes are functionally dependent on key components, which are not guaranteed to have unique values. They can repeat, in which case the values of the dependent attributes will repeat with them, producing redundancy. For example, a department's data will repeat for every employee assigned to it.
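Hypothetical tuples for such a merged relation make the redundancy concrete: the department fact is stored once per assigned employee.

```python
# Hypothetical DEPT_EMPS tuples:
# (dept_no, emp_no, dept_name, budget, emp_name)
dept_emps = [
    ("D1", "E1", "Sales", 100_000, "Smith"),
    ("D1", "E2", "Sales", 100_000, "Jones"),
    ("D2", "E3", "R&D",   250_000, "Brown"),
]

# Projecting out the department fact shows it repeated for every
# employee assigned to the department -- pure redundancy.
dept_facts = [(d, name, budget) for d, _, name, budget, _ in dept_emps]
assert dept_facts.count(("D1", "Sales", 100_000)) == 2
```

An update to the Sales budget must now change two tuples; miss one and the database contradicts itself.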

The key constraint is no longer sufficient to ensure entity integrity: additional constraints must be enforced to control the redundancy (i.e., to ensure that all redundant instances of department data are updated together). In practice such constraints are almost never enforced, putting databases at high risk of inconsistency. Neither can the facts be updated independently, because they are merged. This can produce anomalies, i.e., side-effects (e.g., deleting a department will also delete its employees) that may be unintended. Semantic correctness is no longer guaranteed.
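The deletion side-effect can be reproduced with a small SQLite sketch; the schema and data are hypothetical, with the merged relation modeled after the DEPT_EMPS design discussed in the article.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dept_emps (
    dept_no TEXT NOT NULL, emp_no TEXT NOT NULL,
    dept_name TEXT NOT NULL, emp_name TEXT NOT NULL,
    PRIMARY KEY (dept_no, emp_no))""")
# Sales has a single employee, Smith.
con.execute("INSERT INTO dept_emps VALUES ('D1', 'E1', 'Sales', 'Smith')")

# Deleting the employee fact also deletes the department fact,
# because the two facts are merged into one tuple.
con.execute("DELETE FROM dept_emps WHERE emp_no = 'E1'")
remaining_depts = con.execute(
    "SELECT DISTINCT dept_no FROM dept_emps").fetchall()
assert remaining_depts == []  # no trace of Sales is left
```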

There is little the data scientist can do about databases rendered inconsistent by updates to denormalized designs, but knowledge of the problem can alert him to potential data quality issues.

But if the analyst understands normalization, and whether and how facts were merged, he can minimize errors by deriving fully normalized views of the data for analysis: queries become simpler, it is easier to ensure they are the intended ones, and results are more likely to be both logically and semantically correct and correctly interpreted.
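One way to derive such views -- a sketch with hypothetical names, assuming the merged DEPT_EMPS design described earlier -- is to project one view per fact type, with DISTINCT collapsing the redundant copies:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dept_emps (
    dept_no TEXT NOT NULL, emp_no TEXT NOT NULL,
    dept_name TEXT NOT NULL, emp_name TEXT NOT NULL,
    PRIMARY KEY (dept_no, emp_no)
);
INSERT INTO dept_emps VALUES ('D1', 'E1', 'Sales', 'Smith');
INSERT INTO dept_emps VALUES ('D1', 'E2', 'Sales', 'Jones');

-- One view per fact type; DISTINCT removes the redundant copies.
CREATE VIEW departments AS
    SELECT DISTINCT dept_no, dept_name FROM dept_emps;
CREATE VIEW employees AS
    SELECT emp_no, emp_name, dept_no FROM dept_emps;
""")

departments = con.execute("SELECT * FROM departments").fetchall()
employees = con.execute(
    "SELECT * FROM employees ORDER BY emp_no").fetchall()
assert departments == [("D1", "Sales")]  # one tuple per department fact
```

Queries against these views are then phrased against one fact type at a time, which avoids double-counting the repeated department data.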

There are three types of merged facts and, therefore, three denormalized normal forms; for our purposes, fifth normal form (5NF) is considered full normalization (see The Costly Illusion: Normalization, Integrity and Performance).

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.


Re: Well said
  • 5/31/2016 5:00:28 PM

By data use I don't mean applications, I mean the kind of inferences to be made from the data (data operations). Consistency has to do with both the logical structure (the types of fact = relations) and the constraints on it. But it's important to keep them distinct in one's mind.


Re: Well said
  • 5/31/2016 4:42:32 PM

I include "structure" with "content" (logical structure, not just relation structure, e.g., your merged table represents a conjunctive structure, another table with the same tuples doesn't necessarily have the same structure). I don't object to separating the two, but I start with content as above, not proceeding further if structure isn't included plus dbms operations described algebraically, in other words what can be expressed and recorded and verified algebraically. Some people prefer FOL instead. That is the "data meaning" that can't be ignored, no matter the design philosophy. The "data meaning" of an end-application to a database user is something else, not usually formal, so I try to avoid generalizing without having requirements.

Re: Well said
  • 5/31/2016 4:13:31 PM

> attempts to make the database content self-contradictory, not contradictions among wishes for database behaviour.

Both the POFN and integrity enforcement are intended to ensure that information representation is consistent with data meaning and use. That includes both data structure and content. Both anomalies and corruption are inconsistencies. This is not understood in the industry.



Re: Well said
  • 5/31/2016 4:02:16 PM

I want to be clear that I don't dispute the POFN as a technique to follow. As far as I know there is no available dbms that supports relation updating to the extent that is logically possible. So it may be that POFN makes the best of a bad situation, likewise your article.

Re: Well said
  • 5/31/2016 3:30:17 PM

When I wrote of logical contradictions I intended them to mean attempts to make the database content self-contradictory, not contradictions among wishes for database behaviour. It seems premature to wish for certain behaviour before the logical consequences of a model of a theory are understood. That eliminates the possibility of a schema controlling database behaviour before the paint has even dried. The behaviour I gave is a logical consequence of the usual algebra, i.e., of the model, not wishful thinking. It is widely misunderstood, even by writers who start with the same definition of MINUS but then drop it to claim it's ambiguous. If the necessary logical behaviour is not what people wish for, they are free to make a different formal model of the relational theory.

Re: Well said
  • 5/31/2016 1:18:32 PM

Actually, they are both contradictions. In the latter case, between the desire that two types of fact should be updated independently and the design that defeats that desire.

What is not understood about the Principle of Full Normalization (POFN) is that it ensures there are no such contradictions.


Re: Well said
  • 5/31/2016 8:53:11 AM

The usual insertion anomaly is an example of a logical contradiction but the usual deletion anomaly is an example of desire. Insertion contradiction might be a better term for the former. If people don't like the logical deletion result as opposed to the one they desire, I'd say they should re-define what they think deletion means or invent a different algebra.

Re: Well said
  • 5/29/2016 5:13:26 PM

What you are expressing in a verbose way is what I state in the article as update anomalies--loss of guaranteed semantic correctness--exacerbated by poor relational fidelity of SQL and its implementations. Which, in fact, can also occur with queries of denormalized relations.

The problem is that most data professionals are unaware of the implications, or if they are aware, they view them as relational weaknesses, rather than SQL deficiencies due to relational violations. This confusion is mainly responsible for fads like XML and NoSQL that make the problem worse instead of resolving it, preventing the real solution--development of true RDBMS's.

That's the opposite of science.

Re: Well said
  • 5/29/2016 1:58:11 PM

Typical dbms'es require the user or programmer using that employees-departments merged table, when asked to remove employee Smith from the Sales department, to depart from the algebra and instead update the employee and/or departments table directly, not the merged table. I suspect most users would delete a row only from the employees table.

The merged table joins departments that have employees with employees that have departments. Suppose Smith is the only Sales employee. The logical result is the insertion of a department that has no employees to the database. So the deletion is also an insertion to a subset of departments that is not the original subset of departments with employees, so a superset of the original singleton departments subset that adds departments without employees has trespassed onto the picture. The dbms has lost integrity control.

But a logical dbms that implements the algebra would recalculate the subset of departments having employees in the absence of the merged row and union that with the original set of departments without employees to obtain the resulting value of the departments table. If that is not the desired result then obviously it is the data design that is wrong.

Re: Well said
  • 5/28/2016 12:48:51 PM

"Data science" is hot, but nobody's really interested in the true science of data -- the relational model. And that's because of the appalling state of education. Not even the authors of SQL and implementers at IBM had a good grasp of the RDM.

> "physical tables which represent relations that are supersets of the relations the applications are concerned with ... do not allow direct updating of the relations that programmers address for the purpose of replacement / updating."

Not sure I follow--can you elaborate what you mean?

>"the inability to design relational structures so that it is the dbms that enforces integrity, not application programmers"

Between the poor relational fidelity, weak support of integrity enforcement, and bad database designs, SQL systems guarantee neither logical nor semantic correctness of query--and therefore analytic--results. All the top programming, statistical, and mathematical skills mean little in this context--in fact, they serve to obscure and distract from the fundamental problems.
