DBMS for Analytics: Risky Business Without Foundation Knowledge (Part 1)

A new study, finding that "non-relational database management systems now comprising 70% of analytics data sources", attributes their popularity to their "superiority" over RDBMSs in satisfying analytics needs. There are good reasons to be skeptical of such findings, but even if this one were true, the arguments advanced in support of the attribution are rooted in misconceptions that stem from poor foundation knowledge and that are debunked by this blog (and in more depth at dbdebunk.com). Let's see.

(Image: D3Damon/iStockphoto)

Note: The RDM assumes conceptual models (i.e., that object groups, their properties and their relationships have been identified). Some analytical systems are designed to 'discover' the models (from data value relationships, sequence analysis, co-occurrences, etc.). These are two distinct practices that are commonly confused, but should not be, an issue I addressed in Data Meaning: Analytics vs. Data Mining and Data, Information, Knowledge Discovery, and Knowledge Representation.

"Only 30% of data analytics is still performed against traditional relational database management systems" while "approximately 70% are modern non-RDBMS sources" like Hadoop, NoSQL, in-memory, search, columnar/MPP analytic and cloud native databases.

As my readers should know, there is little to no understanding in the industry of what an RDBMS is (What Is a True Relational System and What Is Not). In fact, there are 'no' true RDBMSs, only SQL DBMSs wrongly alleged to be relational. They have limited relational fidelity and nowhere near the capabilities and advantages conferred by the RDM. Classifications of DBMSs as relational cannot and should not be trusted. Some of the criticisms of "relational" systems thus apply to SQL DBMSs, not RDBMSs, and in what follows I will therefore substitute [SQL] for 'relational'.

Note: Even criticisms of SQL DBMSs cannot be of their "analytics capabilities" (analytics is an 'application function'), but at best only of their data retrieval capabilities (i.e., data integrity and manipulation, their 'data management DBMS function'). There is little recognition of this distinction in the industry (Understanding the Division of Labor between Analytics Applications and DBMS).
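To make the division of labor concrete, here is a minimal sketch in Python using the standard sqlite3 and statistics modules. SQLite is a SQL DBMS, not a true RDBMS, and the table name, column names, and values are hypothetical, invented purely for illustration. The DBMS side enforces integrity and retrieves data; the application side does the analytics.

```python
import sqlite3
import statistics

# Hypothetical table and values, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        region TEXT NOT NULL,
        amount REAL NOT NULL CHECK (amount >= 0)  -- integrity: a DBMS function
    )
    """
)
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 10.0), ("east", 14.0), ("west", 30.0), ("east", 12.0)],
)


def retrieve_amounts(region):
    """Data management (DBMS) side: retrieve the amounts for one region."""
    rows = conn.execute(
        "SELECT amount FROM sales WHERE region = ? ORDER BY amount", (region,)
    ).fetchall()
    return [amount for (amount,) in rows]


def analyze(amounts):
    """Analytics (application) side: compute over the retrieved values."""
    return {
        "mean": statistics.mean(amounts),
        "median": statistics.median(amounts),
    }


east = retrieve_amounts("east")
summary = analyze(east)
print(east, summary)
```

A criticism of how `analyze` performs is a criticism of the application; a criticism of how `retrieve_amounts` or the CHECK constraint behaves is a criticism of the DBMS. The two should not be conflated.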

"For example, analytics users that understand how to leverage graph queries can derive deep network structure insight and wide relationship analysis over graphed data that simply can't be computed on relational schema structured data."

There are applications for which directed graph data structures are suitable. But they are much rarer than what their proponents would have you believe and have serious drawbacks, not the least of which are 'prohibitive complexity and inflexibility'. This is a core problem that the RDM was introduced to address, and it is a major reason hierarchic and network DBMSs were effectively dropped more than four decades ago in favor of even weakly relational SQL. So much for "modern". Them who forget the past …
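As one hedged illustration of why the "simply can't be computed" claim deserves skepticism, the sketch below computes network reachability (a transitive closure) over data structured as a plain table of edges, using a recursive query in SQLite via Python's standard sqlite3 module. SQLite is a SQL DBMS, not a true RDBMS, and the `follows` table and its contents are hypothetical. It shows only that such a query is expressible, not how it would perform at scale.

```python
import sqlite3

# Hypothetical "who follows whom" edges, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE follows ("
    " src TEXT NOT NULL, dst TEXT NOT NULL, PRIMARY KEY (src, dst))"
)
conn.executemany(
    "INSERT INTO follows (src, dst) VALUES (?, ?)",
    [
        ("ann", "bob"),
        ("bob", "cat"),
        ("cat", "dan"),
        ("dan", "bob"),  # a cycle: bob -> cat -> dan -> bob
        ("eve", "ann"),
    ],
)


def reachable_from(start):
    """Return every node reachable from `start` by following edges."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT dst FROM follows WHERE src = ?
            UNION  -- UNION discards duplicates, so cycles terminate
            SELECT f.dst
            FROM follows AS f
            JOIN reach AS r ON f.src = r.node
        )
        SELECT node FROM reach ORDER BY node
        """,
        (start,),
    ).fetchall()
    return [node for (node,) in rows]


print(reachable_from("ann"))
print(reachable_from("eve"))
```

The data stays in a single table of edges; no pointer structures or navigational access paths are exposed to the user.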

To the extent that there is anything to the often repeated claim that "[SQL] databases have a fraught relationship with applications written in object-oriented programming languages like Java, PHP and Python", it lies in those languages' affinity for directed graph structures. But (1) this has nothing to do with analytics per se and (2) a 'careful' separation between computationally complete programming languages (CCLs) and data languages is necessary to guarantee relational advantages (Data Sublanguages, Programming, and Data Integrity).

Logical-physical Confusion

In-memory, columnar, and cloud native DBMSs are types of DBMS implementation that say nothing about their underlying data models, which are practically ignored in DBMS reviews, evaluations, or comparisons (Structure, Integrity, Manipulation: How to Compare Data Models). No wonder that misconceptions proliferate, rather than true RDBMSs that would address the valid criticisms of SQL (Database Management: No Progress Without Data Fundamentals). This 'logical-physical confusion' (LPC) is rampant (Don't Mix Model with Implementation) and underlies most of the non-RDBMS superiority arguments:

  • "… scale horizontally";
  • "processing huge amounts of data in the cloud";
  • "allowing relatively low-cost servers to be combined into a single, powerful cluster";
  • "solve great performance challenges, tackle huge scales of data, help mine value from a wider variety of data".
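The distinction can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a model of any real DBMS: the attribute names and values are hypothetical, and the two "stores" are ordinary Python structures standing in for a row-oriented and a column-oriented physical layout. The same logical query is applied to both and yields the same result, because the query is defined on the logical structure rather than on the storage.

```python
# Row-oriented layout: one record per stored row (hypothetical data).
ROW_STORE = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 30},
    {"id": 3, "region": "east", "amount": 12},
]

# Column-oriented layout of the same data: one array per attribute.
COLUMN_STORE = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "amount": [10, 30, 12],
}


def rows_from_row_store(store):
    """Physical access path 1: read records as stored."""
    return (dict(record) for record in store)


def rows_from_column_store(store):
    """Physical access path 2: reassemble records from column arrays."""
    names = list(store)
    return (dict(zip(names, values)) for values in zip(*store.values()))


def restrict_project(rows, region, attrs):
    """Logical query: restrict by region, then project onto attrs.

    The result is a set of tuples, so it has no duplicates and no order.
    """
    return {tuple(row[a] for a in attrs) for row in rows if row["region"] == region}


from_rows = restrict_project(rows_from_row_store(ROW_STORE), "east", ("id", "amount"))
from_cols = restrict_project(rows_from_column_store(COLUMN_STORE), "east", ("id", "amount"))
print(from_rows == from_cols)
```

Whatever performance differences the two layouts may have, they are properties of the implementation; `restrict_project` and its result do not change.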

There's more. Stay tuned for Part 2.

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.
