According to search queries hitting @dbdebunk.com, too many data professionals question the mandatory nature of primary keys, ask about changes to them, or prefer surrogate to natural keys. This indicates misunderstanding and misuse of a critical feature that can wreak havoc with inferences made from databases, including analytics. I have explained one function of keys, but there are several types of key that are poorly understood.
A tuple of a relation (depicted as a R-table row on paper or screen) represents formally in the database a description for a real world object. A statement that describes the object -- i.e., specifies the values of its properties -- is unequivocally true or false. Either there is an object with those property values, or there isn't. In other words, there is a fact.
A combination of one or more defining properties comprises an object's identity and, therefore, uniquely identifies that object. Defining properties inherent to objects such as fingerprints, dental records, or DNA for people, are immutable identifiers, and usually compound or complex. Defining properties of human-constructed objects such as organizations, projects, Products, are not inherent. Unless a combination of defining properties is found that, based on belief (from evidence, not whim) is stable for at least the intended life of the database and its use, there is no basis for the claim that an object represented at one time is the same as one observed at another time, and all bets are off for inferences made from the database.
For convenience, simpler "names" are often assigned to compound or complex identifiers. These names distinguish rather than identify, but may become identifiers after long-term use. Consider the social security number (SSN). The government has a well-defined procedure that assigns the SSN as a simple numeric name for the combination of properties that define persons eligible for US earnings. The properties identify, SSN is just a convenient substitute. Names are also assigned when defining properties that are implicit and hard to pinpoint.
An identifier -- a set of defining properties -- is represented in the database by a key, a combination of one or more relation attributes (depicted as R-table columns) that is system-guaranteed to have unique values.
A "natural key" (NK) represents an identifier that is either inherent, or assumed to be stable for the life of the database. Because NKs are usually combinations of real world properties, NKs are familiar and meaningful to users.
A "surrogate key" (SK) represents an assigned name whose values are in a 1:1 relationship with the values of some NKs (however abstract, complex, or difficult to measure/observe the NK values may be). Without such a relationship there is no object definition/identification, and, therefore, there is no reliable database representation. Because they do not represent object properties, SKs cannot be used for meaningful data access, so they can only be added and cannot replace specifiable NKs.
The relational model mandates that for every relation with multiple keys one should be selected as "primary key" (PK). The theory is mute on selection criteria, which are pragmatic: a meaningful NK is preferred, but if it is composite/complex, a "proper" simple SK -- unexposed to users/applications -- is a preferred target for foreign key (FK) references. Composite PKs should be "irreducible." The choice may involve tradeoffs specific to each database and its use.
SKs are indiscriminately deployed for the wrong reasons and do not always have the 1:1 relationships with NKs. If to emulate database-wide object-IDs, they serve no purpose; if to improve performance, it's a contamination of the model with physical implementation considerations that can actually defeat the purpose. For example, because SQL DBMSs implement keys by indexing, each SK adds not only data, but also an index that must be managed. If the SK is exposed to users in applications, the need to translate to and from a meaningful NK imposes joins that would otherwise be unnecessary, which makes queries more complex and can actually slow performance.