The Importance of Understanding Classes, Sets, and Relations for Analytics


One of the clearest indications of poor foundation knowledge in data management practice is the misuse and abuse of terminology. Many data professionals enter the industry without a formal education, via programming and software tools, and use terms indiscriminately, as jargon, without understanding them. This has produced weak DBMS implementations and poorly designed databases that put the correctness of database analytics at risk. (My forthcoming DBDEBUNK Dictionary of Data Fundamentals is an effort to address this problem.)

For example, 'class' in data management is confused with the programming class, the important distinction between class and 'set' is missed, and a 'relation' -- which is a set -- is often thought of as a class.

In object-oriented programming, a class is a code template for creating objects that encapsulate data and behavior, each object being an instantiation of a class (i.e., a specific application of the template). In data management, however, a class is not code but a formal, well-defined concept from set theory, which is one-half of the theoretical foundation of the Relational Data Model (RDM). Understanding it is critical for proper conceptual modeling, database design, and valid analytics.

A class is a group of objects that share the properties required for group membership (i.e., are of the same type). As I explained in Data Meaning: Analytics vs. Data Mining, there are two kinds of properties required for group membership:

  • Individual properties shared by class members
  • Collective properties that arise from relationships among (1) individual properties and (2) all members.

Given some well-defined universe of objects and a class definition (the properties), when the definition is applied to the universe, it selects out those objects that satisfy the definition (i.e., have the required properties). Otherwise put, a class induces a set of members (the definition is the class 'intension', the set of members its 'extension').
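The intension/extension distinction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the universe (the integers 1 through 20) and the membership rule (even and no greater than 10) are hypothetical, chosen for brevity, and the names `universe`, `class_definition`, and `extension` are not from the article.

```python
# Hypothetical universe of objects: the integers 1 through 20.
universe = frozenset(range(1, 21))


def class_definition(n: int) -> bool:
    """Intension: the properties required for membership (illustrative rule)."""
    return n % 2 == 0 and n <= 10


# Extension: the set of members induced by applying the definition to the universe.
extension = frozenset(n for n in universe if class_definition(n))

# The same definition applied to a different universe induces a different set.
other_universe = frozenset(range(5, 9))
other_extension = frozenset(n for n in other_universe if class_definition(n))

print(sorted(extension))        # [2, 4, 6, 8, 10]
print(sorted(other_extension))  # [6, 8]
```

The definition stays fixed while the induced set depends on the universe it is applied to, which is why the class (the definition) and the set (its members) should not be conflated.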

Conceptual (or business) modeling formulates business rules that define object classes of interest, each of which is jointly defined by several types of rules that specify the properties required for membership. Applying the rules to corresponding object universes induces sets of members, and these sets are represented formally in the database by relations:

  • Facts about each group's members are represented by 'tuples' (displayed as rows);
  • Individual properties are represented by 'attributes' defined on 'domains' (displayed as columns);
  • Collective property rules are represented by 'constraints' on relations (that constrain them to be consistent with the rules).

For example, when an enterprise determines individual (e.g., education, skills, experience) and collective (e.g., uniqueness, a maximum number of hires) properties that data scientists must have individually and as a group to be hired for available positions, it specifies the rules defining the class of its 'data science employees.' This class definition is applied to a universe of applicants to hire those that satisfy the rules, inducing a set of employees represented by a relation (displayable as an R-table). The relation is subject to constraints that are formalizations of the rules, which ensure that it is consistent with the class definition rules.
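The hiring example can be sketched as follows. This is a minimal illustration, not a DBMS implementation: the attribute names, the specific individual rules (a graduate degree and at least three years of experience), and the limit `MAX_HIRES = 2` are all assumptions made up for the sketch. The relation is modeled as a set of tuples, and the constraints are checked before any change to it is accepted.

```python
from typing import FrozenSet, NamedTuple


class Employee(NamedTuple):
    """A tuple of the relation; each field is an attribute (names are hypothetical)."""
    emp_id: int
    name: str
    degree: str
    years_experience: int


Relation = FrozenSet[Employee]

MAX_HIRES = 2  # hypothetical collective rule: maximum number of hires


def meets_individual_rules(e: Employee) -> bool:
    # Hypothetical individual properties required for class membership.
    return e.degree in {"MS", "PhD"} and e.years_experience >= 3


def meets_collective_rules(r: Relation) -> bool:
    # Collective properties: uniqueness of emp_id and the cap on hires.
    ids = [e.emp_id for e in r]
    return len(ids) == len(set(ids)) and len(r) <= MAX_HIRES


def hire(r: Relation, applicant: Employee) -> Relation:
    """Return a new relation including the applicant, or raise if a rule is violated."""
    if not meets_individual_rules(applicant):
        raise ValueError("applicant lacks a required individual property")
    candidate = r | {applicant}
    if not meets_collective_rules(candidate):
        raise ValueError("hire would violate a collective property")
    return candidate


employees: Relation = frozenset()
employees = hire(employees, Employee(1, "Ann", "PhD", 5))
employees = hire(employees, Employee(2, "Bob", "MS", 4))
```

With these assumed rules, a third hire, a second tuple with `emp_id` 1, or an applicant without a graduate degree would each be rejected, so the set of tuples stays consistent with the class definition.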

Failure by data professionals to understand these fundamentals gives rise to a common mistake: asking whether one or more tables (not even guaranteed to be R-tables) are properly designed without specifying the rules (which denote the meaning assigned by the database designer to the relations represented by the tables) and the corresponding constraints that enforce the rules in the database. Knowledge of both is often insufficient.

In these circumstances, databases are not guaranteed to be properly designed and constrained, and the 'logical validity' and 'semantic correctness' of datasets retrieved by queries for analysis cannot be assumed, which means that, insofar as analytics are concerned, all bets are off.

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.



Re: What (You) meant Was ....
  • 10/31/2017 2:26:53 AM

That is a completely different issue.

Before deciding HOW you're gonna correct him you gotta know what is wrong and what is right. This is what this blog is about. The post addresses the issue of misuse and abuse of fundamentals for those INTERESTED IN AND CAPABLE OF educating themselves. For those who are neither...

 

Re: Oft go awry
  • 10/31/2017 2:09:03 AM

Well, the only people who can REALLY understand fundamentals are those who are educated in at least how RDM, logic and set theory apply to db mgmt. Since the number of db practitioners who go through this is too close to 0, this should answer your question.

It is actually worse than that. There is dismissiveness and increasing hostility to education. They are not even going through college -- they learn some coding and that's that. After all, Bill Gates and Zuckerberg became billionaires by dropping out, so who needs education.

If you read my "This Week" posts @dbdebunk every other week you can see the horrible and scary consequences of what these coders are doing way beyond the tech -- they destroy this country.

 

What (You) meant Was ....
  • 10/31/2017 1:55:32 AM

"My manager uses technical terms incorrectly all the time. How do I correct them?" -- Hacker News

 

Interesting question and something I run into all the time. But the question of how to correct them is very dicey. Often it depends on the context, but in general I try not to be abrasive. We all tend to "get over our skis" at times.

So unless you really dislike your manager and job, it is best to just be professionally polite in correcting the misconception. And hopefully they will thank you for it.

Would be interesting to hear how others in the community handle this potentially explosive situation as well. 

Re: Oft go awry
  • 10/31/2017 1:41:52 AM

Thanks Fabian for highlighting the issues that programmers and DBAs have with terms that on the surface could appear similar but in practice couldn't be further apart in actual meaning and therefore usage.

As with all of your posts, I am going to have to let this sink in, but in the meantime, are there any entities that understand and correctly use these concepts, which, if not used appropriately, will lead to a useless analytical mess?

Re: Oft go awry
  • 10/30/2017 11:20:54 PM

I think the post makes quite clear the nature of the problems to run into.

If the database designer does not know these fundamentals he cannot possibly do proper design. The relations will not be designed correctly -- they will not be fully normalized and properly constrained -- or will not even be relations. Consequently:

(1) They will not be an accurate representation of reality;

(2) When relational operations (restrict, project, union, join) are applied to them in order to retrieve datasets for analytics, the datasets are not guaranteed to be correct, because correctness is mathematically guaranteed only when those operations are applied to fully normalized and constrained relations;

(3) The interpretation of the datasets will not be what the analyst thinks it is.

That is why I used the term "all bets are off for analytics". And the problem is insidious because the analyst will get datasets but has no way of knowing if there are problems and what their effect is. But because so many designers do not know the fundamentals, he can't assume there aren't any.
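A small, purely hypothetical illustration of the point: the rows, names, and salary figures below are invented, and the sketch covers only one of many ways a table can fail to be a relation (a duplicate row that a key constraint would have rejected).

```python
# A table with no key constraint: the row for E1 was entered twice,
# so this list is not a relation (a relation is a set of tuples).
rows = [
    ("E1", "Ann", 100),
    ("E2", "Bob", 90),
    ("E1", "Ann", 100),  # duplicate that a uniqueness constraint would reject
]

# The analyst's query runs without error but silently double counts.
payroll_from_table = sum(salary for _, _, salary in rows)

# The same query over the corresponding relation (duplicates cannot exist in a set).
relation = set(rows)
payroll_from_relation = sum(salary for _, _, salary in relation)

print(payroll_from_table, payroll_from_relation)  # 290 190
```

Nothing in the first result signals a problem, which is what makes the error hard for the analyst to detect.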

Oft go awry
  • 10/30/2017 10:01:29 PM

@Fabian - is there an example you can offer of how these mistakes and misunderstandings have led to problems? Anything that is simple enough to share?
