DBMS for Analytics: Risky Business Without Foundation Knowledge, Part 2


In this two-part series I alert analysts that correct interpretation and assessment of media/industry claims without being misled requires a good grasp of data fundamentals. In Part 1, I discussed the logical-physical confusion and the erroneous misclassification of DBMSs as relational and non-relational underlying the argument that the latter are superior to the former for analytics applications. In Part 2, I discuss a third misconception behind the claim.

(Image: D3Damon/iStockphoto)

(Image: D3Damon/iStockphoto)

Consistency and Integrity

"RDBMS's greatest strength -- data integrity -- has now become its biggest weakness. To ensure consistent entry of data, (RDBMS) requires a strict data model enforced by tons of referential data relationship constraints ... [a] data model burden emerges as a serious limitation in cases where an organization requires flexibility ... enter the non-relational database."

Data practitioners who perceive integrity as a weakness are in the wrong field -- they do not understand what integrity and consistency are and, therefore, the meaning of data (What Meaning Means: Business Rules, Predicates, Integrity Constraints and Database Consistency, Redundancy, Consistency, and Integrity Derivable Data). By enforcing integrity constraints, a DBMS ensures consistency of the database with the business rules that denote the meaning of the data (i.e., the faithfulness of the database to the conceptual model of the segment of the real world it represents). No analyst should accept data that does not represent reality accurately, yet they do without being aware of it (Relational Fidelity & Analytics Integrity). A DBMS that doesn't enforce integrity cannot be considered a true DBMS. SQL DBMSs do a rather poor job of that (To Really Understand Integrity, Don't Start with SQL), non-SQL and other non-RDBMSs give it up altogether -- their data structures make it both theoretically and practically unfeasible. Trading off integrity for "flexibility" is a bad analytics strategy, but how can technology that had already been dropped as inflexible in favor of SQL several decades ago be a better option anyway? They who forget the past.

Giving up integrity is sometimes acceptable for data warehouses and other read-only databases, but, first, this is somewhat of an illusion: they must be populated and without constraint enforcement, what ensures consistency for all applications (including analytics) using them, particularly if the data come from SQL, or non-relational databases where constraints are not enforced either? (The Trouble with Data Warehouse Analytics). Second, beyond integrity, the theoretical foundation of the RDM guarantees 'logical validity' of the data sets retrieved for analysis; and adherence to relational design principles (Database Design: What It Is and Isn't) ensures their 'semantic correctness' (Object Orientation, Relational Database Design, Logical Validity and Semantic Correctness). Neither "modern" non-RDBMSs, nor "traditional" SQL DBMSs guarantee them.

"Unstructured Data"

I used quotes because it is a contradiction in terms ('Unstructured Data': Why This Popular Term Is Really a Contradiction, Structuring the World With 'Unstructured Data'). All data is structured 'by definition' -- if they weren't, they would be random noise, not data.

" ... if you're dealing with massive amounts of unstructured data, you may not have the luxury of developing a relational database with clearly defined schema ... related information of all types. Instead of relying on tables, non-RDBMS databases are document-oriented. This way, non-structured data -- such as articles, photos, social media data, videos, or content within a blog post -- can be stored in a single document that can be easily found, but isn't necessarily categorized into fields like a relational database does."

More fundamental misunderstanding. All data management involves manipulation of 'some' data structure (Analytics = Manipulation of Data Structure), which determines the kind of manipulation applicable to it (i.e., what kind of questions can be asked of the data and answers obtained). All the enumerated elements are structured, it's just that the structures are complex and, thus, so is their manipulation (All Structures Aren't Created Equal). Documents are multi-structured (Documents and Databases). Data are not "inherently" relational or non-relational -- it is a matter of modeling and design choices. Relational structure has advantages that are critical -- decidable data languages, guaranteed logical validity, semantic correctness, physical and logical independence -- that no non-RDBMS can deliver (Data Sublanguages, Programming, and Data Integrity, Physical Data Independence). The upfront effort is, thus, not a luxury, but it is not done either because the value is unappreciated, modeling and design skills are lacking, or out of expediency and laziness. Problem is, if you don't understand reality sufficiently to model it and design a relational database, you can't guarantee consistency, sensible queries, valid, semantically correct results and correct interpretation.

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.

DBMS for Analytics: Risky Business Without Foundation Knowledge (Part 1)

Claims that non-relational DBMSs for analytics are gaining in popularity, even if true, are not due to the superiority of non-relational DBMSs. These claims reflect poor foundation knowledge and technology regress in the industry.

The Importance of Understanding Classes, Sets, and Relations for Analytics

Failure to understand these fundamentals causes poor database designs and risks incorrect and/or improperly interpreted analytics results.


Pending Review
  • 1/19/2018 1:42:27 AM
NO RATINGS

This comment is waiting for review by our moderators.

Decision Making and Tools that lack Structural Integrity
  • 1/9/2018 11:56:24 PM
NO RATINGS

If every database and/or model is flawed from a lack of Integrity (to which I agree), how does it seem users ( management) for the most part are lucky with them ?  By luck I mean the use of these structurally unsound tools should bring more harm than we are aware of at least. 

Users seem to be able to get just enough out of their faulty databases/models to stay in business or even flourish.  Of course there are the companies that crash and burn, but was it due to databases/models with little integrity or something else ?   I would hazard to guess that the number of databases that fit the purist approach is equivalent to the number of wealthy in the States - a very small percentage. 

 

Given this, I wonder how businesses continue to be a going concerns when they are working with faulty facts ?  Poorly structured models and databases ?  Or maybe they rely on these tools less than we think ?   I don't know. Looking forward to hearing others thoughts on this issue.

Re: Data Integrity
  • 1/9/2018 10:38:49 PM
NO RATINGS

How many data professionals have a good grasp of first order predicate logic?

Maybe half do. Even worse, I'm afraid the number is declining.

Re: Data Integrity
  • 1/9/2018 6:53:20 PM
NO RATINGS

Even if you know reality well enough you need to know how to express it as a conceptual model and then as a logical model. The second step involves formalizing business rules as integrity constraints and you cannot do that without a good grasp of first order predicate logic. How many data professionals do you think qualify?

Re: Data Integrity
  • 1/9/2018 6:49:00 PM
NO RATINGS

I don't think the majority of data professionals understand the fundamentals enough to appreciate the importance of integrity and how to achieve it.

Re: Data Integrity
  • 1/9/2018 6:47:35 PM
NO RATINGS

The statement was not referring to end-user access to publicly available databases.

It was referring to those who create and manage databases for end users.

 

Re: Data Integrity
  • 1/3/2018 3:12:24 PM
NO RATINGS

How true as noted "if you don't understand reality sufficiently to model it," you're going to be in for some big problems. Of course that applies not only to data and DBMS but for lots of business plans taking note that "reality" is not always that easy to comprehend or figure out if we really are seeing things as they really are.

Re: Data Integrity
  • 1/3/2018 10:13:56 AM
NO RATINGS

There are many like this. The key is to know when you are taking a risk about integrety and understanding the consequences.

Data Integrity
  • 12/29/2017 11:33:33 PM
NO RATINGS

I'm not sure I agree with this statement-

Giving up integrity is sometimes acceptable for data warehouses and other read-only databases

There are some databases, like US Census data, that, while not perfect, are useable without me personally confirming the integrity.

INFORMATION RESOURCES
ANALYTICS IN ACTION
CARTERTOONS
VIEW ALL +
QUICK POLL
VIEW ALL +