Structure, Integrity, Manipulation: How to Compare Data Models


(Image: isak55/Shutterstock)

(Image: isak55/Shutterstock)

The IT industry operates like the fashion industry: every few years -- and the number keeps getting smaller -- a "new" data technology pops up, with vendors, the trade media and various "experts" all stepping over each other to claim that it'll "revolutionize your business" and unless you jump on the bandwagon, you'll be "left behind." But time and again these prove to be fads lacking a sound foundation. Huge resources are invested in migrations from fad to fad, rather than in productive work (Don't believe the hype about Hadoop usage, Basta, Big Data It's Time to Say Arrivederci). Remember?

"Hadoop seems to take over relational database, as Hbase can store even unstructured data whereas relational data warehouse limits to structured data ... handles traditional structured data just fine, albeit in a different way than a RDBMS ... EDW vendors [will] incorporate Hadoop framework into their core architectures to enable advanced and high performance analytics."

First, "unstructured data" is a contradiction in terms: all data is structured 'by definition' -- if they weren't, they would be meaningless random noise and, therefore, not data. Second, performance is determined exclusively by physical implementation (hardware, OS, DBMS and its optimizer, network load, concurrent access, and so on). Associating it with the data model, which is purely logical, is 'logical-physical confusion' (LPC). Third, practically all references to 'relational' are actually to SQL systems, which are far from relational. Fourth, data warehouses are non-relationally designed databases.

But, fifth and most important, "handling data in a different way" is so much handwaving. Data management IS (1) manipulation of some (2) data structure with 3) protection of data integrity, the three components of a data model -- i.e., there is no data management without one. So any claim for a data management technology of equivalence to, or superiority over an alternative involves a comparison of the formal data structure, manipulation and integrity components of the corresponding data models and their pros and cons.

The three components of the RDM can be specified precisely and succinctly:

  • Structure = relation, represents object group:

- attributes represent properties

- tuples represent (facts about) objects

- domain and attribute constraints represent property rules

- tuple constraints represent object rules (relationships among properties)

- multi-tuple constraints represent multi-object rules (relationships among objects)

- database (multi-relation) constraints represent multi-group rules (relationships among groups)

  • Manipulation = Relational algebra, represents inferencing--i.e., derivation of new facts from database facts

For a true RDBMS and fully normalized databases the exclusive advantages are:

  • Declarative, decidable data language -- i.e., queries are guaranteed to terminate;
  • Physical and logical independence -- i.e., queries and applications are insulated from physical and some logical reorganizations of the database;
  • System-guaranteed

- database consistency (with the business rules in effect)

- logical and semantic correctness of query results;

all due to the RDM's formal foundation in first order logic and set mathematics.

What exactly is there to gain in return for opting for a "different way" and giving them up? Not only don't you get an answer when you ask proponents. They cannot even specify the underlying data model. Instead, you get ramblings without any specifics e.g.,

"You can and do model data in Hadoop. Indeed, the first thing you do with HBase is decide how to organize your data into column families. Kind of hard to argue that doesn't happen because we and lots of others do exactly that. Ditto with Hive. Heck, even the choice of how you write data into HDFS requires some choice on organization even if it "just' mimics the what it was modeled the application/source it is being loaded from. Am I saying that is the same type and depth of modeling as an RDBMS, no. It is still data modeling, yes. The fact it isn't the same as a RDBMS is the whole point, there are design considerations where you want the HBase model instead of a RDBMS."

"Modeling columns" and "expressing relationships" is too little and too vague.

When somebody pushes an alternative to the RDM (including BigData/Hadoop), ask what

  • Its three formal components are (exactly, please!)?
  • In the real world they represent?
  • Is the underlying theoretical foundation? and
  • Advantages they offer to compensate for the loss of those of the RDM?

Let me know if you ever get a satisfactory answer.

AI, anybody?

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.

Don't Conflate or Confuse Database Consistency with Truth

In the database context both truth and consistency are critical, but they should not be confused or conflated. DBMSs guarantee database consistency with the conceptual model of the real world they represent. On the other hand, a DBMS cannot and should not be expected to ensure truth.

Structure, Integrity, Manipulation: How to Compare Data Models

Is that new data trend actually something that you really need, or you could risk being left behind? Or is it just a buzz word or a fad?


Re: Selling the Sizzle
  • 8/3/2017 6:56:08 PM
NO RATINGS

Data management is part and parcel of the IT industry, which is of business culture, which is of American culture. They are not separate things.

Selling the Sizzle
  • 8/3/2017 1:37:01 PM
NO RATINGS

How true it often is as noted that "Huge resources are invested in migrations from fad to fad, rather than in productive work." That's not an unfamiliar observation not just limited to business fads of course. Spend an hour on TV or radio and you'll certainly hear the same ploys trying to get us to buy the latest and greatest even without facts to back up the claims. 

INFORMATION RESOURCES
ANALYTICS IN ACTION
CARTERTOONS
VIEW ALL +
QUICK POLL
VIEW ALL +