We are constantly told how data scientists must be “jacks of many skills”, but one of the most important is rarely included in the list thereof.
Very few databases are properly designed. Many SQL databases are denormalized inadvertently, or intentionally for performance. They require special constraints to control data redundancy and prevent inconsistencies, which are practically never enforced. Analysts cannot, therefore, take database consistency for granted. Furthermore, to issue sensible queries and ensure correct results and interpretation thereof, it’s not enough for analysts to know the types of fact represented in the database, but also whether and how the database designer has chosen to bundle -- nest or merge -- those facts and how to disentangle them for analysis.
A more common way to bundle facts than nesting (See Data Fundamentals for Analysts: Nested Facts and the (1st) Normal Form) is to merge them. For example, if there are facts about two classes of entities, instead of representing them in two relations:
Each fact about a department is merged with every fact about each of the employees assigned to it and the merged facts are represented by one relation:
Database designs that merge facts of different types into single relations have drawbacks similar to those that nest facts.
- A database with relations such as DEPT_EMPTS is more difficult to understand than one with DEPARTMENTS and EMPLOYEES
- Constraints and queries are more complex and results harder to interpret
- Errors are more likely
But merging causes additional problems. Within a single class, the entity identifier determines all other properties: If you know the employee number of an employee, you also know all his other properties, because there is no other employee with that number. In other words, for each identifier value there is exactly one value of every descriptive property. Formally, all descriptive properties are functionally dependent on the identifier.
If a relation represents facts about a single class, the attribute dependencies that hold in them are exclusively the FD’s of the non-key attributes on the key.
Since a key value cannot repeat, neither do the non-key values dependent on it, and there is no redundancy.
This means, first, that a uniqueness constraint on the key is sufficient to ensure entity integrity, namely that there is a 1:1 correspondence between facts and relation tuples. Second, because the design keeps facts of different types in separate relations, they can be updated independently.
If a relation represents merged facts, that is no longer the case. The key of DEPT_EMPS is composite—(DEPT#,EMP#)—and the non-key attributes are functionally dependent on key components, which are not guaranteed to have unique values. They can repeat, in which case the values of the dependent attributes will repeat with them, producing redundancy. For example, a department's data will repeat for every employee assigned to it.
The key constraint is no longer sufficient to ensure entity integrity: additional constraints must be enforced to control the redundancy (i.e., ensure that all redundant instances of department data are updated). Such constraints are never enforced, putting databases at a high risk of inconsistency. Neither can the facts be independently updated, because they are merged. This can produce anomalies, i.e. side-effects (e.g., deleting a department will also delete its employees) that may be unintended. Semantic correctness is no longer guaranteed.
There is little the data scientist can do about databases rendered inconsistent by updates due to denormalization, except that the knowledge of it can alert him to potential data quality problems.
But if the analyst understands normalization and whether and how facts were merged, he can minimize errors -- simplify queries, ensure that they are the intended ones and the results are both logically and semantically correct and correctly interpreted -- by deriving fully normalized views of the data for analysis.
There are three types of merged facts and, therefore, three denormalized normal forms, with fifth normal form (5NF) considered for our purposes full normalization (See The Costly Illusion: Normalization, Integrity and Performance).