Data Fundamentals for Analysts: Nested Facts and the (1st) Normal Form


The design of a database determines formulation of queries that makes sense, constraints to be enforced (for data integrity), logical and semantic correctness of query results and correct interpretation thereof. Analysts often access databases designed by others. Unfortunately, all too often they are poorly designed, putting analytics at risk, of which the analyst without some basic understanding of design principles will be unaware.

Relational databases consist of relations -- sets of facts about property-sharing entities (relations are pictured as R-tables; R-table, row and column are shorthands for the underlying relation, tuple, and attribute). Every query executes an operation of the relational algebra that derives a new R-table from database R-tables. Its rows represent facts that are logical implications of the database facts. A RDBMS is, in fact, a logical inference engine that guarantees logical correctness of query results: if the database facts are true, so will be the derived facts. A core practical advantage of the relational data model (RDM) is physical data independence (PDI): queries and applications do not have to be revised when databases are reorganized physically (in storage).

Designs that adhere the Principle of Full Normalization (POFN) minimize integrity risk and ensure semantic correctness i.e., they do not produce query and update anomalies. Informally, a R-table is fully normalized -- which, for our purposes here means in fifth normal form (5NF) -- if it represents exactly one entity class. For example, two related entity classes, departments and employees, would be represented by two R-tables in a POFN-compliant database.

DEPTS {DEPT#,DNAME,…)
EMPS (EMP#,ENAME,DEPT#,…)

In practice, either by mistake or design, the POFN is violated by “bundling” multiple classes into single R-tables, in one of two ways. The first is to nest facts about one class within facts about another and using one R-table to represent them. For example, if employee facts are nested within department facts, one R-table can represent both.

DEPT_EMPS {DEPT#,DNAME,EMPLOYEES (EMP#,ENAME,DEPT#,…)}

EMPLOYEES is a column defined by the database designer on a non-simple domain, i.e., one that has components and, therefore, non-atomic values. In this particular case the values are R-tables and the components are columns and rows.

A data language with access to attribute components is very powerful (e.g., queries can restrict/project/join inner and outer tables), but inferencing from nested (compound) facts is prohibitively more complex and prone to error. This complicates table and constraint definitions, query expressions and interpretation of query results. What is more, such languages are not declarative -- they require logic higher than first order logic, which introduces self-referencing and undecidability -- defeating PDI. The RDM trades the expressive power of second order logic for the decidability and simplicity of first order languages and PDI.

A R-table with attributes defined only on simple domains takes a less convoluted form -- a normal form -- devoid of nesting. If R-tables are in the preferred normal form i.e., components meaningful to applications (here, employee attributes) are simple domains in their own right and a true RDBMS enforces value atomicity -- first order logic is sufficient. This imposes some limitations on the expressive power of data languages, but they are declarative and PDI and simplicity are preserved. A true RDBMS enforces atomicity via a data language that does not allow applications to access attribute components not explicitly defined on their own domains.

In the special case in which a R-table has an attribute defined on a non-simple domain that happens to be a relation-valued domain (RVD) like EMPLOYEES, it is possible to normalize it by un-nesting the inner R-tables into a separate R-table, as in the two-table design above. Note that both R-tables are in their normal form, devoid of non-simple domains (strictly speaking, this works only with RVD’s that have union-compatible R-tables i.e., with the same type and number of columns as values).

For various reasons, SQL DBMS’s are not truly relational. They:

  • Do not support true (user-defined) domains and—a blessing—RVD’s either (Oracle supports “nested tables” which are not true RVD’s);
  • Implement special functions that give access to components of domains defined as simple by database designers, a subversion of atomicity and, by creating domains and R-tables on the fly “behind the back” of the DBMS of the RDM too. Analysts should ascertain that this facility is not abused.

The normal form is currently known as first normal form, which has to do with the second way of bundling. On that and further normalization beyond 1NF in future posts.

References:

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.

Why You Always Need Primary Keys

Database pros should heed this warning: if you ignore that primary keys are mandatory, you can wreak havoc with inferences made from databases, including in analytics.

The Trouble with Data Warehouse Analytics

Data warehouses are essentially databases biased for particular data applications and against others. They are rooted in poor database foundation knowledge and logical-physical confusion.


Re: Had a laugh
  • 5/6/2016 11:30:05 AM
NO RATINGS

Let's hope the users will be too :).

Re: Had a laugh
  • 5/6/2016 2:53:15 AM
NO RATINGS

John Cleese explains how it could be: bigthink.com/videos/john-cleese-on-political-correctness

I'm just thankful that my next database project is building the database from scratch.

Re: Had a laugh
  • 5/1/2016 2:43:15 AM
NO RATINGS

It may have been in the past for you, but I assure you the vast majority of databases are not properly designed today. And that is not laughing matter.

Had a laugh
  • 4/30/2016 9:22:59 PM
NO RATINGS

I laughed when I saw your title because, years ago, I had to work with a database that was in dire need of normalization. It was hell. I wasn't laughing then, but I'm glad that it's now in the past.

INFORMATION RESOURCES
ANALYTICS IN ACTION
CARTERTOONS
VIEW ALL +
QUICK POLL
VIEW ALL +