Understanding the Division of Labor between Analytics Applications and DBMS

I am coming across, on the one hand, instructions on how to do "analytics with SQL" and, on the other, tools purporting to enable "analytics without SQL." They are an umpteenth iteration of essentially similar ideas during my 30-plus years in data management and reflect common and entrenched fundamental misconceptions that I have documented and analyzed the costly consequences of in my writings and teachings. They will keep repeating, inhibiting genuine progress, as long as data fundamentals are ignored or dismissed. One of the least understood is the distinction between DBMS and application functions.

DBMS functions are the responsibility of and performed by the DBMS 'for all applications'. There are several categories thereof, but here we will focus on the one including the three functions defined by a data model such as the RDM: structure, integrity and manipulation. A true RDBMS has a relational 'data sub-language' with which (1) relations and (2) integrity constraints are defined and data can be retrieved by (3) manipulating relations mathematically as sets -- i.e., applying the set operations of the relational algebra on them. Restricting the sub-language to the three data management functions makes first order predicate logic (FOPL) sufficient, with the following advantages:

[Note: Other DBMS functions are storage management, performance optimization, concurrency control, transaction management. They are not the purview of the data model, are performed by the DBMS, but not via the data sub-language, because they require higher logic than FOPL (see next).]

(Image: Marynchenko Oleksandr/Shutterstock)

(Image: Marynchenko Oleksandr/Shutterstock)

Application-specific functions are the responsibility of and performed by each application. They include (1) result preparation and (2) presentation and (3) communication with the DBMS. They are not limited to data management, so they require higher logic than FOPL and, thus, 'computationally complete' languages (CCL). A CCL (e.g., programming language), is expressively more powerful than a relational data sub-language, but:

  • Is imperative (procedural), prone to undecidability and significantly more complex;
  • Logical validity and semantic correctness are not system-guaranteed;
  • Physical and logical independence are not supported.

'Proper hosting' of relational data sub-languages in CCLs is the optimal division of labor between the DBMS and applications: relational advantages are preserved for data management functions performed by the DBMS and only application-specific functions are relegated to applications. Violating this arrangement by extending either (a) the data sub-language with application functions, or (b) the host CCL with data management functions destroys optimality and causes a plethora of costly complications

Consider now the above two "solutions" for analytics. SQL was intended to be a relational data sub-language and, as such, should have expressed strictly the three data management DBMS functions. So, first, forcing SQL to perform analytics functions is a type (a) violation. Second, analytic functions are application-specific and, therefore, tools that purport to "perform analytics without SQL" is a trivial promise that should be the case 'by definition.' The correct perspective for 'databased analytics' is, of course, that applications use CCLs that (1) host SQL for data management DBMS functions -- data retrieval -- and (2) perform analytics functions that do not require, or rely on SQL.

This should not be taken as a SQL recommendation for data management functions. It is neither truly and fully relational, nor a well designed language, nor a strictly data sub-language, as it haphazardly includes some application-specific functions. But unfortunately, SQL is the industry's standard database language that has no superior alternative. This is in itself due to its authors lacking a good grasp of data fundamentals and the RDM and contributes to the persistent failure to understand and appreciate the optimal division of labor and to provide tools that enforce it.

Therefore, to avoid the traps that industry practices and products impose, it behooves the analyst to acquire foundation knowledge.

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.

The Importance of Understanding Classes, Sets, and Relations for Analytics

Failure to understand these fundamentals causes poor database designs and risks incorrect and/or improperly interpreted analytics results.

Understanding the Division of Labor between Analytics Applications and DBMS

Those who ignore data fundamentals will always risk costly mistakes and inhibit their own progress towards analytics goals. Here's why.