Data Fundamentals for Analysts, Not Worth Repeating: Duplicates

A frequent source of hits here is the question “Are keys mandatory?” Puzzlingly, many data professionals do not seem to understand why duplicates should be prohibited. That should worry analysts.

Entities are distinguishable in the real world, or we would not be able to tell them apart (and count them). Those important enough to track individually, such as legal US residents, cars, or computers, are issued identifiers such as SSNs, license tags, or serial numbers. A real-world identifier captures an entity's identity within a class -- all its remaining descriptive (non-identifying) properties are "about the identifier", so to speak. Entities that are members of multiple classes have multiple identifiers.

The lack of a real-world identifier means that individual tracking is not meaningful. Such entities -- cans of cat food, for example -- are still distinguishable and countable in the real world, but only by virtue of their distinct locations in physical space.

Recall, now, that a relational database consists of logical relations. A relation is a set of tuples that represent facts about a class of property-sharing entities.

A real-world identifier is represented in the database by a natural key (NK), a relation attribute that has unique values. For example, an EMPLOYEES relation can be pictured as an R-table in which the employee number attribute (shown as the EMP# column) represents the real-world identifier.
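As a concrete sketch of a natural key at work (using SQLite from Python's standard library; the NAME column and the sample values are illustrative assumptions, only EMPLOYEES and EMP# come from the article), declaring EMP# as the key makes the DBMS itself reject a duplicate:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# EMP# is the natural key; PRIMARY KEY makes the DBMS enforce uniqueness.
conn.execute('CREATE TABLE EMPLOYEES ("EMP#" INTEGER PRIMARY KEY, NAME TEXT)')
conn.execute("INSERT INTO EMPLOYEES VALUES (?, ?)", (100, "Smith"))

rejected = False
try:
    # A second row with the same EMP# would state the same fact twice.
    conn.execute("INSERT INTO EMPLOYEES VALUES (?, ?)", (100, "Smith"))
except sqlite3.IntegrityError:
    rejected = True  # the DBMS refuses the duplicate
```

With the key declared, duplicate prevention is the DBMS's job, not the application's.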

An RDBMS does not permit keyless relations in the database. So what about entities that lack a real-world identifier?

SQL DBMSs permit keyless tables -- tables that do not picture relations -- such as this.

But the absence of an identifier means that individual rows are not meaningful, so this representation contradicts the real world. Note that the table represents the facts:

  • Can of Cat Food has price 0.39.
  • Can of Cat Food has price 0.39.
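The two facts above can be reproduced in any SQL DBMS. A minimal sketch using SQLite from Python's standard library (the table name CATFOOD and the column names are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No key declared: SQL accepts both the table and the duplicate row.
conn.execute("CREATE TABLE CATFOOD (ITEM TEXT, PRICE REAL)")
conn.executemany("INSERT INTO CATFOOD VALUES (?, ?)",
                 [("Can of Cat Food", 0.39),
                  ("Can of Cat Food", 0.39)])

rows = conn.execute("SELECT * FROM CATFOOD").fetchall()
# Two rows, yet nothing in the data distinguishes one from the other.
```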

“Stating the same fact more than once does not make it truer, only redundant,” as E. F. Codd used to say.

Contradictions produce problems. First, a DBMS cannot distinguish a data-entry duplication error from "valid" duplicates, which means a high risk of inconsistent databases, wrong counts, and other incorrect query results.
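To make the counting risk concrete, here is a sketch (SQLite via Python's standard library; the CATFOOD table and values are illustrative assumptions). If the same can is keyed in twice by mistake, the data itself offers no way to decide which count is right:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CATFOOD (ITEM TEXT, PRICE REAL)")
# Is this two cans, or one can entered twice? The data cannot say.
conn.executemany("INSERT INTO CATFOOD VALUES (?, ?)",
                 [("Can of Cat Food", 0.39),
                  ("Can of Cat Food", 0.39)])

raw_count = conn.execute("SELECT COUNT(*) FROM CATFOOD").fetchone()[0]
distinct_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM CATFOOD)").fetchone()[0]
# raw_count is 2, distinct_count is 1: which is the true inventory?
```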

Second, how should we interpret query results that differ only in the number of duplicates they contain? Either they are semantically equivalent -- in which case, why the difference? Or they are not, in which case, what does the difference mean? (SQL DBMSs do not treat them as equivalent, which inhibits performance optimization.)
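This interpretation problem is easy to demonstrate (SQLite via Python's standard library; table T and its contents are assumptions for illustration). Two formulations of the "same" union differ only in the duplicates they keep:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (X INTEGER)")
conn.executemany("INSERT INTO T VALUES (?)", [(1,), (1,)])

# UNION ALL keeps duplicates; UNION eliminates them.
bag = conn.execute("SELECT X FROM T UNION ALL SELECT X FROM T").fetchall()
as_set = conn.execute("SELECT X FROM T UNION SELECT X FROM T").fetchall()
# bag has 4 rows, as_set has 1: are these the "same" answer or not?
```

Because SQL must preserve the distinction, the optimizer cannot freely treat the two expressions as interchangeable.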

Third, keyless tables violate two of Codd's 12 rules and rob databases of the practical advantages they are intended to confer.

  • The Guaranteed Logical Access (GLA) rule mandates that every data value in a relational database be logically accessible via a combination of relation name + attribute name + key value.
  • The Physical Data Independence (PDI) rule mandates that queries and applications should not have to be rewritten just because the database is physically reorganized to maximize performance.

By forcing reliance on physical addressing schemes to access individual duplicates, keyless tables defeat both GLA and PDI (for examples of the complications, see here, here, here and references).
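A sketch of the problem in SQLite (via Python's standard library; the CATFOOD table is an illustrative assumption): with no key, there is no logical way to name just one of the two identical rows, so deleting a single duplicate forces us onto SQLite's hidden physical rowid -- exactly the addressing that GLA and PDI rule out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CATFOOD (ITEM TEXT, PRICE REAL)")
conn.executemany("INSERT INTO CATFOOD VALUES (?, ?)",
                 [("Can of Cat Food", 0.39),
                  ("Can of Cat Food", 0.39)])

# No combination of table name + column name + key value can single out
# one row; only the physical rowid can, tying the query to storage.
conn.execute("DELETE FROM CATFOOD WHERE rowid = "
             "(SELECT MIN(rowid) FROM CATFOOD)")
remaining = conn.execute("SELECT COUNT(*) FROM CATFOOD").fetchone()[0]
```

Any physical reorganization that renumbers rowids would silently change what such statements do -- the PDI violation in miniature.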

The design consistent with reality is, of course:

If a need arises to track cat food cans individually, they must be assigned some real world identifier:
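Continuing the sketch (SQLite via Python's standard library; CAN_ID is a hypothetical stand-in for, say, a serial number printed on each can), assigning the identifier and declaring it the key restores a relation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# CAN_ID represents the assigned real-world identifier.
conn.execute("CREATE TABLE CATFOOD ("
             "CAN_ID INTEGER PRIMARY KEY, ITEM TEXT, PRICE REAL)")
conn.executemany("INSERT INTO CATFOOD VALUES (?, ?, ?)",
                 [(1, "Can of Cat Food", 0.39),
                  (2, "Can of Cat Food", 0.39)])

cans = conn.execute("SELECT CAN_ID FROM CATFOOD ORDER BY CAN_ID").fetchall()
# Each row now states a distinct fact about a distinct, addressable can.
```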

Only SQL tables with

  • unique, unordered rows
  • uniquely named, unordered columns
  • no missing values
are R-tables that guarantee logically correct query results. Unfortunately, SQL DBMSs permit tables that violate these features, including keyless ones, so analysts had better exercise care.

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.

The Trouble with Data Warehouse Analytics

Data warehouses are essentially databases biased for particular data applications and against others. They are rooted in poor database foundation knowledge and logical-physical confusion.

Outsmarting the DBMS: Analysts Should Beware

Analysts should avoid relying on techniques that undermine the soundness of database design.

Re: also...
  • 3/28/2016 5:09:23 PM

> users/application programmers infer the theory from wrong as well as incomplete implementations.

Funny you should mention that. In fact, the two "data models" preceding the RDM, hierarchic and network (CODASYL), were actually abstracted from database practice, rather than developed from formal theory (as Codd developed the RDM from set theory and first-order predicate logic). That's why they were not well-defined and complete, which is one reason they proved prohibitively complex and inflexible.

> only possible way to replace relations is by updating base tables ... rather than by calculating the values/tables of the relational expressions ... Even so-called "truly relational" dbms'es duplicate this SQL inadequacy, so they really shouldn't be labelled "TRDBMS".

There are many reasons SQL DBMSs are not relational, but an important yet little-known one is that they do not operate on the predicates associated with tables (which are not necessarily R-tables), but on the table NAMES, which should be treated only as shorthands for the predicates. That is what true RDBMSs should do, and none does.

That is directly due to lack of foundation knowledge, and very few practitioners realize that the two core advantages of the RDM -- provably logically correct query results and full physical data independence -- are missing from current products.



  • 3/28/2016 4:14:16 PM

Not only do SQL products present physical phenomena that set theory denies, such as the duplicates you mention, they prevent the manipulation of many possible logical relation values, limiting their use to presentation as opposed to calculation (as McGoveran points out). Instead of complaining about what's missing from pretty much every one of today's products, users/application programmers infer the theory from wrong as well as incomplete implementations. In particular, they assume that the only possible way to replace relations is by updating base tables as if they were the physical files of forty years ago, rather than by calculating the values/tables of the relational expressions users actually specify. Calculation is entrusted to application code, just as it was then (application programming thus became more complicated with the advent of such systems, not simpler). Even so-called "truly relational" dbms'es duplicate this SQL inadequacy, so they really shouldn't be labelled "TRDBMS".