Data, Information, Knowledge Discovery, and Knowledge Representation

The use of SQL databases (often confused with relational databases) is frequently questioned for certain analytical purposes. One example is scientific research, which “requires assignment of data to tables, which is difficult when the scientists do not know ahead of time what analysis to run on the data, a lack of knowledge that severely limits the usefulness of relational databases.” NoSQL or XML docubases are recommended as a better solution.

According to David McGoveran, we can distinguish, roughly, among:

  • Data: Categorized sequences of values representing some properties of interest (e.g., research variables in scientific experiments), but if and how they are related is unknown and not representable in the database
  • Information: Properties further organized in combinations conceptualized as “entities” (e.g., ‘runs’, or ‘cases’ in scientific experiments), but how the “entities” are related is unknown and not representable in the database
  • Knowledge: Relationships between properties and entities of different types are known -- there is a conceptual model in database speak -- a theory in science, representable in the database by a logical model.

As any true scientist should know, science has a context of discovery and a context of validation (I wonder how many of those who call themselves “data scientists” know the concepts). In the former, experiments are usually done to discover relationships -- a theory (to be validated in the latter, in which further implications thereof can also be derived). So at that stage there is no theory/conceptual model/knowledge yet. Data or information can be represented by any database (e.g., a SQL database can represent variables and/or runs as columns and/or rows in tables), docubase, or even file -- none is superior or inferior for analytical purposes.
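To make the last point concrete, here is a minimal sketch, assuming a hypothetical experiment with three runs and two research variables (the names `runs`, `temperature`, and `yield_pct` are illustrative and not from any real study). It holds the same runs once as an unconstrained SQL table and once as JSON documents, and computes the same summary statistic from each.

```python
import json
import sqlite3
from statistics import mean

# Hypothetical experiment: each run records two research variables.
runs = [
    {"run_id": 1, "temperature": 20.0, "yield_pct": 61.5},
    {"run_id": 2, "temperature": 25.0, "yield_pct": 64.0},
    {"run_id": 3, "temperature": 30.0, "yield_pct": 70.5},
]

# Representation 1: a SQL table with runs as rows and variables as columns.
# No constraints are declared, because no theory exists yet to justify any.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run_id INTEGER, temperature REAL, yield_pct REAL)")
conn.executemany(
    "INSERT INTO runs VALUES (:run_id, :temperature, :yield_pct)", runs
)
sql_mean = conn.execute("SELECT AVG(yield_pct) FROM runs").fetchone()[0]

# Representation 2: the same runs as JSON documents, as a docubase or a file might hold them.
documents = [json.dumps(run) for run in runs]
doc_mean = mean(json.loads(doc)["yield_pct"] for doc in documents)

print(sql_mean, doc_mean)
```

Both representations yield the same mean. As long as nothing is known about how the variables relate, the choice of container adds or removes nothing for the analysis.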

Databases in general and relational databases in particular are intended to be most useful for knowledge representation in the context of validation -- when a theory has been formulated -- and the logical model representing it in the database can be used to validate it and can be further analyzed to derive additional implications of the theory. An RDBMS can then meet the requirements of knowledge representation by enforcing integrity constraints in the database (for consistency of the logical model with the conceptual model/theory) and by supporting queries that make logical inferences (derive implications).
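The two roles described above, constraint enforcement and inference by query, can be sketched with SQLite, which is a SQL DBMS and therefore only an approximation of an RDBMS. The "theory" here is an assumption made up for illustration: every run belongs to exactly one known experiment, and a yield is a percentage between 0 and 100. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked to

# The logical model: declared constraints stand in for the (assumed) theory.
conn.executescript("""
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    catalyst      TEXT NOT NULL
);
CREATE TABLE run (
    run_id        INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiment (experiment_id),
    yield_pct     REAL NOT NULL CHECK (yield_pct BETWEEN 0 AND 100)
);
""")

conn.execute("INSERT INTO experiment VALUES (1, 'A'), (2, 'B')")
conn.execute("INSERT INTO run VALUES (10, 1, 61.5), (11, 1, 70.5), (12, 2, 64.0)")

def rejected(sql):
    """Return True if the DBMS refuses the statement as violating a declared constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

# Facts inconsistent with the theory are refused by the DBMS, not by application code.
orphan_rejected = rejected("INSERT INTO run VALUES (13, 99, 50.0)")  # no such experiment
range_rejected = rejected("INSERT INTO run VALUES (14, 1, 140.0)")   # impossible yield

# A query derives an implication of the stored facts:
# which catalysts have produced at least one run with a yield above 65%?
catalysts_above_65 = [row[0] for row in conn.execute("""
    SELECT DISTINCT e.catalyst
    FROM experiment AS e
    JOIN run AS r ON r.experiment_id = e.experiment_id
    WHERE r.yield_pct > 65
    ORDER BY e.catalyst
""")]

print(orphan_rejected, range_rejected, catalysts_above_65)
```

The constraints are declared once, in the database, so every application that updates it is held to the same theory. The query result is a logical consequence of the stored facts rather than something computed ad hoc outside the DBMS.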

The dual theoretical foundation of set theory and first order predicate logic (FOPL) guarantees correct results. Unfortunately, true RDBMSs don’t exist. The closest the IT industry has ever come are SQL DBMSs, which -- the common misconception notwithstanding -- are not truly relational and meet the requirements only partially. NoSQL systems cannot do better and, as a rule, do much less.

In the context of discovery a theory/model is being sought and knowledge discovery (commonly referred to as data mining) is distinct from knowledge representation. That, however:

    Also requires a formal logical system supplemented with a theory about semantics and how properties and relationships are to be recognized in or inferred from a subject matter. Any database can be used for knowledge discovery, but not all DBMS's support any degree of automated knowledge discovery and, unfortunately, many even fail to enhance manual knowledge discovery to any substantial degree. That is certainly true of commercial NoSQL systems that lack formal logical underpinnings, so any knowledge discovery is ad-hoc and manual: it all occurs in the head of the user. -- David McGoveran
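As a very small illustration of what mechanical assistance to discovery could look like, the sketch below tests whether a candidate dependency between properties holds in a sample of runs. It is an assumption-laden toy, not McGoveran's method: the function `dependency_holds`, the sample data, and the property names are all hypothetical.

```python
from collections import defaultdict

def dependency_holds(rows, determinant, dependent):
    """Return True if, in these rows, each determinant value maps to exactly one dependent value."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[attr] for attr in determinant)
        seen[key].add(tuple(row[attr] for attr in dependent))
    return all(len(values) == 1 for values in seen.values())

# Hypothetical observations from the context of discovery.
runs = [
    {"catalyst": "A", "temperature": 20, "phase": "liquid"},
    {"catalyst": "A", "temperature": 30, "phase": "liquid"},
    {"catalyst": "B", "temperature": 20, "phase": "gas"},
    {"catalyst": "B", "temperature": 30, "phase": "gas"},
]

# In this sample, catalyst determines phase, but temperature does not.
catalyst_determines_phase = dependency_holds(runs, ["catalyst"], ["phase"])
temperature_determines_phase = dependency_holds(runs, ["temperature"], ["phase"])

print(catalyst_determines_phase, temperature_determines_phase)
```

A dependency that holds in a sample is only a hypothesis. It becomes part of a theory, and a candidate integrity constraint, only after it has been validated.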

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.
