Fabian Pascal

Why Data Scientists Must Understand Normalization

Re: Well said
dbdebunker   5/31/2016 5:00:28 PM
By data use I don't mean applications; I mean the kind of inferences to be made from the data (data operations). Consistency has to do with both the logical structure (the types of fact = relations) and the constraints on it. But it's important to keep them distinct in one's mind.

Re: Well said
bobloblaw114   5/31/2016 4:42:32 PM
I include "structure" with "content" (logical structure, not just relation structure; e.g., your merged table represents a conjunctive structure, while another table with the same tuples doesn't necessarily have the same structure). I don't object to separating the two, but I start with content as above, not proceeding further unless structure is included and dbms operations are described algebraically: in other words, what can be expressed, recorded, and verified algebraically. Some people prefer FOL instead. That is the "data meaning" that can't be ignored, no matter the design philosophy. The "data meaning" of an end application to a database user is something else, not usually formal, so I try to avoid generalizing without having requirements.

Re: Well said
dbdebunker   5/31/2016 4:13:31 PM
> attempts to make the database content self-contradictory, not contradictions among wishes for database behaviour.

Both the POFN and integrity enforcement are intended to ensure that information representation is consistent with data meaning and use. That includes both data structure and content. Both anomalies and corruption are inconsistencies. This is not understood in the industry.

Re: Well said
bobloblaw114   5/31/2016 4:02:16 PM
I want to be clear that I don't dispute the POFN as a technique to follow. As far as I know, there is no available dbms that supports relation updating to the extent that is logically possible. So it may be that the POFN makes the best of a bad situation, as does your article.

Re: Well said
bobloblaw114   5/31/2016 3:30:17 PM
When I wrote of logical contradictions, I meant attempts to make the database content self-contradictory, not contradictions among wishes for database behaviour. It seems premature to wish for certain behaviour before the logical consequences of a model of a theory are understood. That eliminates the possibility of a schema controlling database behaviour before the paint has even dried. The behaviour I gave is a logical consequence of the usual algebra, i.e., of the model, not wishful thinking. It is widely misunderstood, even by writers who start with the same definition of MINUS but then drop it to claim it's ambiguous. If the necessary logical behaviour is not what people wish for, they are free to make a different formal model of the relational theory.
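The "usual algebra" definition of deletion referenced here can be sketched in a few lines: a relation is modeled as a set of tuples, and DELETE is just relational MINUS (set difference), which is entirely unambiguous. All names and data below are invented for illustration.

```python
# Sketch of deletion defined via relational MINUS (set difference),
# with a relation modeled as a Python set of tuples; data invented.
R = {("Smith", "Sales"), ("Jones", "Accounting"), ("Lee", "Sales")}

# "Delete the Sales employees"  ==  R MINUS (the tuples to remove)
to_delete = {t for t in R if t[1] == "Sales"}
result = R - to_delete  # set difference: nothing ambiguous about it

assert result == {("Jones", "Accounting")}
```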

Re: Well said
dbdebunker   5/31/2016 1:18:32 PM
Actually, they are both contradictions. In the latter case, between the desire that two types of fact should be updated independently and the design that defeats that desire.

What is not understood about the Principle of Full Normalization (POFN) is that it ensures there are no such contradictions.


Re: Well said
bobloblaw114   5/31/2016 8:53:11 AM
The usual insertion anomaly is an example of a logical contradiction, but the usual deletion anomaly is an example of desire. Insertion contradiction might be a better term for the former. If people don't like the logical deletion result as opposed to the one they desire, I'd say they should redefine what they think deletion means or invent a different algebra.
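The insertion-side contradiction can be sketched as follows. In a merged employees-departments table, each tuple asserts the conjunction "employee E works in department D", so a department with no employees simply cannot be recorded without violating the key. This is a hypothetical illustration; the table, function, and data are invented.

```python
# Hypothetical sketch of the insertion contradiction in a merged
# employees-departments table; all names and data are invented.

# Each tuple asserts the conjunction "employee E works in department D".
merged = {("Smith", "Sales"), ("Jones", "Accounting")}

def insert(table, employee, department):
    # Entity integrity: the key attribute (employee) may not be missing,
    # so a department that has no employees cannot be recorded at all.
    if employee is None:
        raise ValueError("insertion contradiction: this relation cannot "
                         "assert a department with no employees")
    table.add((employee, department))

insert(merged, "Lee", "Sales")   # fine: both conjuncts hold
try:
    insert(merged, None, "R&D")  # R&D has no employees yet
except ValueError as err:
    print(err)
```

In a normalized design, by contrast, the departments relation states only "D is a department", and R&D inserts without contradiction.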

Re: Well said
dbdebunker   5/29/2016 5:13:26 PM
What you are expressing in a verbose way is what I state in the article as update anomalies--loss of guaranteed semantic correctness--exacerbated by poor relational fidelity of SQL and its implementations. Which, in fact, can also occur with queries of denormalized relations.

The problem is that most data professionals are unaware of the implications, or if they are aware, they view them as relational weaknesses, rather than SQL deficiencies due to relational violations. This confusion is mainly responsible for fads like XML and NoSQL that make the problem worse instead of resolving it, preventing the real solution--development of true RDBMS's.

That's the opposite of science.

Re: Well said
bobloblaw114   5/29/2016 1:58:11 PM
Typical dbms's require the user or programmer who is using that employees-departments merged table, when asked to remove employee Smith from the Sales department, to depart from the algebra and instead update the employees and/or departments table directly, not the merged table. I suspect most users would delete a row only from the employees table.

The merged table joins departments that have employees with employees that have departments. Suppose Smith is the only Sales employee. The logical result is the insertion of a department that has no employees into the database. So the deletion is also an insertion into a subset of departments that is not the original subset of departments with employees; a superset of the original singleton departments subset, one that adds departments without employees, has trespassed onto the picture. The dbms has lost integrity control.

But a logical dbms that implements the algebra would recalculate the subset of departments having employees in the absence of the merged row and union that with the original set of departments without employees to obtain the resulting value of the departments table. If that is not the desired result, then obviously it is the data design that is wrong.
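The scenario above can be sketched concretely, modeling relations as sets of tuples. This is a minimal illustration under invented data: Smith is the only Sales employee, and R&D is a department with no employees.

```python
# Minimal sketch of the deletion scenario described above; relations
# modeled as sets of tuples, all names and data invented.

employees   = {("Smith", "Sales"), ("Jones", "Accounting")}
departments = {("Sales",), ("Accounting",), ("R&D",)}  # R&D: no employees

# The merged table: employees joined with their departments
merged = {(e, d) for (e, d) in employees if (d,) in departments}

# Delete employee Smith from the merged table (the only Sales employee)
merged = {(e, d) for (e, d) in merged if e != "Smith"}

# A dbms implementing the algebra recomputes the departments table as:
# (departments still having employees) UNION
# (the original departments that had no employees)
depts_with_emps    = {(d,) for (_, d) in merged}
depts_without_emps = departments - {(d,) for (_, d) in employees}
departments_new    = depts_with_emps | depts_without_emps

assert ("Sales",) not in departments_new  # Sales vanishes entirely
# If retaining Sales as an employee-less department was desired, it is
# the merged design, not the algebra, that is at fault.
```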

Re: Well said
dbdebunker   5/28/2016 12:48:51 PM
"Data science" is hot, but nobody's really interested in the true science of data -- the relational model. And that's because of the appalling state of education. Not even the authors of SQL and the implementers at IBM had a good grasp of the RDM.

> "physical tables which represent relations that are supersets of the relations the applications are concerned with ... do not allow direct updating of the relations that programmers address for the purpose of replacement / updating."

Not sure I follow--can you elaborate on what you mean?

>"the inability to design relational structures so that it is the dbms that enforces integrity, not application programmers"

Between the poor relational fidelity, weak support of integrity enforcement, and bad database designs, SQL systems guarantee neither logical nor semantic correctness of query--and therefore analytic--results. All the top programming, statistical, and mathematical skills mean little in this context--in fact, they serve to obscure and distract from the fundamental problems.
