Why You Always Need Primary Keys

(Image: EtiAmmos/Shutterstock)

(Image: EtiAmmos/Shutterstock)

According to search queries hitting @dbdebunk.com, too many data professionals question the mandatory nature of primary keys, ask about changes to them, or prefer surrogate to natural keys. This indicates misunderstanding and misuse of a critical feature that can wreak havoc with inferences made from databases, including analytics. I have explained one function of keys, but there are several types of key that are poorly understood.

A tuple of a relation (depicted as a R-table row on paper or screen) represents formally in the database a description for a real world object. A statement that describes the object -- i.e., specifies the values of its properties -- is unequivocally true or false. Either there is an object with those property values, or there isn't. In other words, there is a fact.

A combination of one or more defining properties comprises an object's identity and, therefore, uniquely identifies that object. Defining properties inherent to objects such as fingerprints, dental records, or DNA for people, are immutable identifiers, and usually compound or complex. Defining properties of human-constructed objects such as organizations, projects, Products, are not inherent. Unless a combination of defining properties is found that, based on belief (from evidence, not whim) is stable for at least the intended life of the database and its use, there is no basis for the claim that an object represented at one time is the same as one observed at another time, and all bets are off for inferences made from the database.

For convenience, simpler "names" are often assigned to compound or complex identifiers. These names distinguish rather than identify, but may become identifiers after long-term use. Consider the social security number (SSN). The government has a well-defined procedure that assigns the SSN as a simple numeric name for the combination of properties that define persons eligible for US earnings. The properties identify, SSN is just a convenient substitute. Names are also assigned when defining properties that are implicit and hard to pinpoint.

An identifier -- a set of defining properties -- is represented in the database by a key, a combination of one or more relation attributes (depicted as R-table columns) that is system-guaranteed to have unique values.

A "natural key" (NK) represents an identifier that is either inherent, or assumed to be stable for the life of the database. Because NKs are usually combinations of real world properties, NKs are familiar and meaningful to users.

A "surrogate key" (SK) represents an assigned name whose values are in a 1:1 relationship with the values of some NKs (however abstract, complex, or difficult to measure/observe the NK values may be). Without such a relationship there is no object definition/identification, and, therefore, there is no reliable database representation. Because they do not represent object properties, SKs cannot be used for meaningful data access, so they can only be added and cannot replace specifiable NKs.

The relational model mandates that for every relation with multiple keys one should be selected as "primary key" (PK). The theory is mute on selection criteria, which are pragmatic: a meaningful NK is preferred, but if it is composite/complex, a "proper" simple SK -- unexposed to users/applications -- is a preferred target for foreign key (FK) references. Composite PKs should be "irreducible." The choice may involve tradeoffs specific to each database and its use.

SKs are indiscriminately deployed for the wrong reasons and do not always have the 1:1 relationships with NKs. If to emulate database-wide object-IDs, they serve no purpose; if to improve performance, it's a contamination of the model with physical implementation considerations that can actually defeat the purpose. For example, because SQL DBMSs implement keys by indexing, each SK adds not only data, but also an index that must be managed. If the SK is exposed to users in applications, the need to translate to and from a meaningful NK imposes joins that would otherwise be unnecessary, which makes queries more complex and can actually slow performance.

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.

The Importance of Understanding Classes, Sets, and Relations for Analytics

Failure to understand these fundamentals causes poor database designs and risks incorrect and/or improperly interpreted analytics results.

Understanding the Division of Labor between Analytics Applications and DBMS

Those who ignore data fundamentals will always risk costly mistakes and inhibit their own progress towards analytics goals. Here's why.

Re: Necessity of PKs
  • 4/12/2017 2:31:07 AM

Disposable data are not important enough for databases. Not the data that I am referring to here.

Be that as it may, identification of objects represented in the database is critical. If you can't identify objects reliably, all bets are off. This is fundamentals whether folks like it or not.



Re: Necessity of PKs
  • 4/11/2017 8:43:39 AM

Yes there would be some of the folks using disposable data to not be so readily keen for the permanant keys but as noted it's got to work among the relationships among all the available keys to make the most of security while keeping things as simple as possible for users I would think.

Re: Necessity of PKs
  • 4/10/2017 2:24:41 PM

I suggest you reread my post more carefully--the problems are worse than just poor grasp of key application. They don't understand what NKs and SKs are.

A NK represents a set of properties that define and identify objects of a type. If you don't have one, you have no definition of objects of that type and no identification--no basis for claiming that an object at time1 is the same object at time2. A SK is only a simpler name assigned to that set of defining properties that make up the NK--it is not an object property!!!! Without a 1:1 relationship between the SK and NK values, the SK means nothing. The validity of the SK as identifier is dependent on the 1:1 relationship with the NK. That is why it is so important to exercise care in how you conceptualize object types of interest, something which conceptual modelers and db designers do not appreciate.

A PK that changes during the life of the db means that the object definition has changed, which means a new object type. This means (1) a new conceptual model and a new logical model (2) the change is represented as a relationship between objects of the new type and the old type often via FKs (i.e.,the database gets a historic dimension). This is something that practically no data professional is aware of.




Re: Necessity of PKs
  • 4/10/2017 9:26:46 AM

I'll accept that either the data folks I talk to just agree with me while we're talking or they tend to be the types who have been bitten in the past and understand the pitfalls.  I do agree that there is poor understanding of how keys can be applied.  A  natural key is better than nothing and is what I typically see people using until they realize how simple it is to create a GUID for example as a surrogate key.  I've seen a fair number of tables accidently supported by a natural key so I see that as a way to introduce the concept.  It also makes it easy to explain why you need to give some consideration to the types of keys you'll use.  Things like SSN are not permanent nor are usernames or employee IDs so if you can explain how those things may be unique now and make records identifiable but also how a change in them will cause you data headaches it is a good launch pad to using surrogate keys.


Re: FK's not fundamental
  • 4/8/2017 2:08:05 PM

May I suggest my 3-part post @dbdebunk.com "Are FKs mandatory?". They were introduced in the context of normalization: if you unnest the relation values of attributes defined on relation-valued domains (RVD) as relations in their own right, you must have a value-based representation of the relationship that was embedded in the nesting. So in this context they ARE mandatory.

But they are a separate issue from PKs, SKs and NKs that my post is about.

Re: Necessity of PKs
  • 4/8/2017 2:03:21 PM

I'm afraid your sample is far from representative. Most of the hits I get @dbdebunk.com is by those who question keys.

And as my post argues, there is poor understanding -- if any -- of types of key. There are plenty of misconceptions: surrogate keys better than natural keys, database wide surrogates a la object ID, you name it.


FK's not fundamental
  • 4/7/2017 10:08:32 AM

Regarding FK's and database tradeoffs (2nd last paragraph), suggest that one of the practical tradeoffs is a dbms tradeoff not just a database tradeoff. Codd said in 1970 about qualification (eg. queries) of relations: "Thus, the class of expressions which can be used in a set specification must have the descriptive power of the class of well-formed formulas ... ". Without saying so explicitly, he offered the ForeignKey shortcut to represent a set specification of a special case of one such class, for example special in that it constrains rows of a referencing base table. However, the general class includes all the possible qualifications that produce the same relation value for given inputs. The general class can be written as a projection of a union of joins some of the latter being projections involving literals, a logical structure that is equivalent to that relation value. When a language supports nothing but union, difference, join and projection, plus literals, depending on those literals one such structure gives exactly the operational behaviour of a foreign key. I mention this not because queries should be more intricate than they already are but to point out that foreign keys are not fundamental in the way that primary keys are (for example, the above four operators seem not sufficient to define a PK algebraically). The general tradeoffs implied by the typical dbms include forcing users to update several base tables when a single view update suffices (all constraints are effectively update constraints). This is because the typical dbms is not capable of associating the set membership function aka predicate of a query with the target relation, instead it substitutes that of the relation represented by a base table value. Thinking about the foreign key example in terms that go beyond base table constraints is one pathway to a dbms that is capable of the above association, toward databases that are drastically more self-contained in that most of the set membership expressions used by applications are contained in the database itself rather than user code, specifically in the catalog which can then be thought of as an executable program in its own right. The result would be drastically increased productivity for the many application coders at the cost of acquiring deeper comprehension by the relatively fewer data designers.

Necessity of PKs
  • 4/7/2017 8:23:17 AM

I think the only people I've talked to who would disagree that primary keys are necessary are working with disposable data.  Unfortunately, they also realize they should have been using a PK after the damage is done and that data isn't so disposable any more or they apply that disposable data model to a table that isn't so disposable.  You don't need to get overly complex and as noted there are endless ways to build keys so the only reason for not having them is not knowing.