Data Meaning: Analytics vs. Data Mining

(Image: 3alexd/iStockphoto)

(Image: 3alexd/iStockphoto)

Years back a client consulted me on a problem: they had an important database the meaning of which they did not know. Could the meaning be reconstructed? The question reflected common poor practices due to poor grasp of data fundamentals.

Data mining is distinct from analytics. The former is aimed at 'finding' meaningful data patterns -- i.e., knowledge 'discovery' -- while the latter derives new knowledge from 'existing' knowledge -- i.e., deduction (see Data, Information, Knowledge Discovery, and Knowledge Representation). 'Sensible' querying of databases to retrieve data for analytic applications and correct interpretation of results without a good grasp of data meaning is a fool's errand. Yet current database practices are extremely deficient in this respect.

Data professionals commonly believe they can infer meaning from sheer inspection of the tabular display of database relations, which is an illusion even for simple ones, particularly when, as is so often the case, they are poorly designed. For example, what does this relation mean?

Here's what an analyst needs to be aware of when it comes to database interpretation.

Consider the relation


Unless you are familiar with the real world context of its design, you just don't know. Usually the interpretation, if any is given, is something informal like

"Supplier with supplier number S# supplies part with part number P# in quantity QTY".

which (1) is ambiguous -- is QTY the packaging quantity? The only quantity ever supplied? Does it represent historical quantities? -- and (2) incomplete. Much more is required make all this clear in English.

Relations represent formally in the database facts about object groups in the real world -- in this case, supply objects are relationships between objects of two other types, suppliers and parts. The meaning assigned to them and the database as a whole 'by the database designer' is conveyed by a set of 'business rules' -- informal group descriptions in natural language that specify group properties of several types:

  1. 'Property rules' specify individual properties shared by group members independent of the other members (e.g., the three of supplies);
  2. 'Object rules' specify properties that arise from relationships between individual properties (e.g., the supply relationship);
  3. 'Multi-object rules' specify properties that objects have collectively as a group and arise from within-group relationships between all the group members (e.g., distinguishability of supplies);
  4. 'Multi-group rules' specify properties of the group of related groups -- which is what a database represents -- that arise from cross-group relationships between, members of distinct groups (e.g., between supplies and suppliers and parts);

The full meaning of a relation is conveyed by the conjunction of all the rules in the first three categories; that of the database by the conjunction of all the rules for all the relations with those in the fourth category.

A DBMS does not understand informal rules in natural language like users do, it can only manipulate abstract symbols mathematically. Rules must be 'formalized' -- i.e., expressed symbolically as  'constraints' in a formal data language that the DBMS "understands" 'algorithmically' -- i.e., as procedures to enforce database consistency with the corresponding rules.

  1. 'Domain' and 'attribute constraints' correspond to property rules;
  2. 'Tuple constraints' correspond to object rules;
  3. 'Multi-tuple constraints' correspond to multi-object rules;
  4. 'Multi-relation (or 'database') constraints' correspond to multi-group rules;

For full consistency, the constraints corresponding to all the rules 'must be declared -- explicitly spelled out -- to the DBMS in a formal data language and enforced by it'. They constrain relations to be consistent with the rules, but 'neither the rules, nor the constraints appear in the tabular display of relations'.

The above rule only specifies the three individual properties shared by supplies and the one arising from the relationship between them. Enforcement would be limited to the corresponding attribute constraints and one tuple constraint. Which means that tuples inserted in the relation will be consistent with the rule above, but may not be with any of the other rules.

Sound databased analytics require full consistency. They obtain 'if and only if':

  • Database designers have a good grasp of data fundamentals --information modeling and relational database design -- and are competent in formulation of rules and their formalization as relational constraints;
  • Databases are fully normalized;
  • Relationally complete declarative data languages express all the relational constraints;
  • 'True' RDBMSs enforce them, document the meaning 'in the system' and make it accessible to users on demand;

Sadly, none of this can be taken for granted in current industry practices. SQL DBMSs record very few constraints in the system catalog and no business rules. Caveat emptor.

"Not cost-effectively and with sufficient confidence in the result" I told the client.

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.

Redundancy, Consistency, and Integrity: Derivable Data

Analysts should not take database consistency for granted. Here's why.

The Necessity of Foreign Keys

A proper understanding of data fundamentals requires the understanding of the importance of keys and primary keys. This time we take a look at another important type of key -- foreign keys.

Re: Data Mining vs Analytics
  • 7/10/2017 10:42:49 PM

@Lyndon_Henry    It was this passage that caused me to ask the earlier question. While I understand Fabian's definition of databased Analytics yet it is still not completely clear to me that Data Mining is not in some sense inductive reasoning packaged as an database technique in this case - Data Mining.    

Really having trouble understanding how I am so off in this conclusion.

Re: Data Mining vs Analytics
  • 7/10/2017 4:43:48 PM


Fabian explains 

Data mining is distinct from analytics. The former is aimed at 'finding' meaningful data patterns -- i.e., knowledge 'discovery' -- while the latter derives new knowledge from 'existing' knowledge -- i.e., deduction ....

Very clearly, succinctly, and usefully stated. Thank you.


Re: Data Mining vs Analytics
  • 7/10/2017 3:05:06 PM

A DBMS together with databases comprise a logic inferencing (deductive) system.

The tuples in db relations represent facts -- axioms -- and when you query you derive (deduce, infer) new facts -- theorems -- represented by the tuples in the relations that are the results of the queries. They are logical implications of the facts recorded in the database.

The way databased analytics are intended to work is: query the db for a data set and use a specialized application to analyze it. So you've got two deductive steps: (1) to retrieve the dataset (2) to analyze it. The analysis is just another kind of inferencing.

The distinction between induction and deduction is a completely separate issue.

Data Mining vs Analytics
  • 7/10/2017 2:47:13 PM

Hi Fabian,  Would it be correct to infer that if analytics results in deductive reasoning, Data mining is analogous to Inductive reasoning ?