Redundancy, Consistency, and Integrity: Derivable Data

(Image: D3Damon/iStockphoto)

(Image: D3Damon/iStockphoto)

Database redundancy can wreak havoc with interpretation of analytics results, but it also poses consistency risks that can affect the correctness of the results themselves. The risks are too underappreciated for effective prevention. Given industry practices, analysts who use databases they did not design, or designed without sufficient foundation knowledge, should be on the alert.

A database represents a conceptual model of some segment of reality of interest. The model is informal and consists of business rules expressed in natural language that structure reality as groups of objects with properties. Database systems can only manipulate abstract symbols mathematically, so the model must be 'formalized' -- i.e., expressed symbolically in some formal data language -- for database representation. To that end, object groups are represented in the database by relations, facts about objects by tuples, properties by attributes, and business rules by integrity constraints. The DBMS enforces the constraints to ensure 'database consistency' -- i.e., that the formal database representation is faithful to the informal model of reality -- i.e., is consistent with the rules.

Reality, of course, is not redundant. Information about reality can be recorded redundantly due to database design -- often inadvertently, due to poor design skills, but sometimes intentionally, for user convenience.

'Derivable data' are one source of redundancy. It suffices for the purposes of this discussion to understand that querying relational databases means applying the set operations of the 'relational algebra' (RA), individually or in any sequence, to one or more relations, to produce -- derive -- relations as results, a property referred to as 'relational closure.' This is akin to the numeric algebra we call arithmetic, whose operations, when applied to numbers, produce numbers. The RA is closed to relations like arithmetic is closed to numbers.

A database that contains a relation that is derivable via some sequence of RA operations from the others is redundant. If a projection of that relation is derivable via some RA operations from other projections of relations, including self, it is 'strongly redundant.' A projection selects one or more attributes of a relation.

Consider the relation:


with EMP# as the primary key and MGR# as a foreign key referencing EMP#. Since managers are employees, the projection of EMPLOYEES on {MGR#,MGRNAME} is derivable (via a FK-PK self-join) from the projection on {EMP#,ENAME}.

Strong redundancy introduces inter-relation dependencies that, because they do not exist in the real world, are not reflected in the business rules and the DBMS is unaware of them. There is nothing to prevent updates that leave the database in an inconsistent state (e.g., a change of ENAME of a manager without a corresponding change of MGRNAME, or vice-versa). The responsibility for data integrity falls on users and applications, the unreliable approach that database management made obsolete.

Strong redundancy should be avoided by design, but if it is introduced, it must be 'declared to the DBMS' in the form of constraints, such that it can enforce database consistency. Consider, for example, three relations R, S, T, and two RA operations A, B:

  • Relation A(T) is derivable from relation T by operation A;
  • Relation B(R,S) is derivable from relations R and S via operation B;
  • Relation A(T) is derivable from relation B(R,S) -- i.e., A(T) = B(R,S).

A database that contains either A(T) or B(R,S) together with R, S, and T is strongly redundant. Consistency is guaranteed 'if and only if' the above equality is declared to the DBMS as a formal constraint, 'in addition to the constraints formalized from the rules.'

The reality is that many databases are strongly redundant, but devoid of the additional constraints. Even if database designers were aware of the need to add them -- they are not -- that requires a RDBMS with a relationally complete data language that supports fully and correctly all relational operations and constraints. Unfortunately, SQL DBMSs are not truly relational and SQL does not provide such support. (For relational constraints and how SQL compares, see To Understand Integrity, Don't Start with SQL.)

So analysts cannot take database consistency for granted and should not.

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.

Data Meaning: Analytics vs. Data Mining

Can you reconstruct the meaning for a database when you have no documentation? Here's a deep dive.

Redundancy, Consistency, and Integrity: Derivable Data

Analysts should not take database consistency for granted. Here's why.

Re: It will get worse
  • 6/12/2017 3:06:01 PM

Yes, but what I am trying to point out is that

1. Knowing the source of the data is necessary but insufficient;

2. Knowledge of what meaning in the database context IS cannot be expected without an education in data fundamentals.

Stay tuned for my next article.

Re: It will get worse
  • 6/9/2017 10:39:45 AM

If you don't have foundation knowledge, you are unaware that you should be on alert for anything and you dk what to be on alert for -- that's the problem.

For example, if you dk what meaning means in the database context -- which very few data professionals know -- you dk that without it you cannot query databases sensibly or interpret results correctly and you have no idea where to look for it and the fact that the meaning is not in the tables or even in the entire system. It's only in the database designer's mind so you're operating with blinders.

Re: It will get worse
  • 6/8/2017 8:12:29 AM

I fully agree. Sometimes the context of the data is critical.

Re: It will get worse
  • 6/8/2017 7:43:06 AM

Considering it's true that "analysts who use databases they did not design, or designed without sufficient foundation knowledge, should be on the alert," this indicates maybe a growing problem as new data is being collected and analyzed at increasingly larger volumes and the data personnel may not be up to speed on how to recognize and remedy the problems involved.

Re: It will get worse
  • 6/7/2017 6:51:27 PM

You got it.

If there are exceptions, they are in the institutions frecvented by the elites. Obama, for all its "leftism" where did he send his daughters?

What most Americans means is that this is just a matter of incompetence. It is, but it's intentional.





Re: It will get worse
  • 6/7/2017 4:27:37 PM

From the WSJ article: Some say these findings are a sign of the failure of America's higher-education system to arm graduates with analytical reasoning and problem-solving skills

Yes, I agree - there is clear evidence that education in the US is not providing graduates with the ability to think and to learn. There are exceptions in certain schools or in certain departments or programs, but the general trend is in the wrong direction.

And the problem starts before students even get to college. In the latest PISA test comparing Math performance of 15 year-olds from around the world, the US dropped to 31st of 35 OECD member nations.

(OECD is Organisation for Economic Co-operation and Development)

Re: It will get worse
  • 6/7/2017 2:10:04 PM

The concept of education has been utterly corrupted. It used to mean intellectual development: how to think independently and critically. Now it is preparation for the workforce and conformism and even that is being reduced to coding.

This is actually a natural regression: even in societies that start democratically there is an inherent aspiration to control and those who own the society realize that real education is not in their interest. So one of the first things they do is destroy it.

This is part of the process by which dominant civilizations peak--due to education-- and then collapse--due to its destruction.

Judging from what's happening in the US and the West in general, would you say there is evidence for this?


Re: It will get worse
  • 6/7/2017 1:59:16 PM

Thinking is hard work.

Colleges used to teach peope how to learn. It included logic, reason, and how to listen to, understand and consider a variety of sources. An ability to evaluate information mattered.

It seems to have changed dramatically. Discussion and reason don't matter as much as before. So I see what you mean by the "collapse of education".


Re: It will get worse
  • 6/5/2017 6:54:25 PM

I don't think you understood my point.

There is no "other deficit" -- the lack of understanding of what data mean IS the education deficit -- they are one and the same.

A true data professional who received proper foundation knowledge via education knows what meaning is and how to acquire and express it in practice. Current people who work in the industry without education are not even aware that this is something they must know. They only know how to apply tools and but not how to apply them properly and what conclusions to draw.

This is something that people are not born with, they must be taught. And if they are not, how can they even realize that they lack the necessary knowledge and acquire it, particularly since employers are as ignorant of this as they are?


Re: It will get worse
  • 6/5/2017 5:46:20 PM

I agree that the education is surely lacking. The other real deficit I see is not understanding wat the data represents and how it was gathered. Without this knowledge you have no way of knowing what an unreasonable value is in the data.

Page 1 / 2   >   >>