According to David McGoveran, we can distinguish, roughly, among:
- Data: Categorized sequences of values representing some properties of interest (e.g., research variables in scientific experiments), but if and how they are related is unknown and not representable in the database
- Information: Properties further organized in combinations conceptualized as “entities” (e.g., ‘runs’, or ‘cases’ in scientific experiments), but how the “entities” are related is unknown and not representable in the database
- Knowledge: Relationships between properties and entities of different types are known -- there is a conceptual model in database speak, or a theory in science -- representable in the database by a logical model.
As any true scientist should know, science has a context of discovery and a context of validation (I wonder how many of those who call themselves “data scientists” know the concepts). In the former, experiments are usually done to discover relationships -- a theory (to be validated in the latter, in which further implications thereof can also be derived). So at that stage there is no theory/conceptual model/knowledge yet. Data or information can be represented in any database (e.g., a SQL database can represent variables and/or runs as columns and/or rows in tables), docubase, or even file -- none is superior or inferior for analytical purposes.
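To make the last point concrete, here is a minimal sketch in Python, assuming three hypothetical experimental runs with made-up values. The same runs are held in a file (CSV text), a document (JSON), and a SQL table (SQLite, in memory), and the same summary statistic is computed from each. The names `RUNS`, `dose_mg`, and `response` are illustrative assumptions, not taken from any real experiment.

```python
import csv
import io
import json
import sqlite3
from statistics import mean

# Hypothetical runs: no relationships among them are known or represented.
RUNS = [
    {"run_id": 1, "dose_mg": 5.0, "response": 0.42},
    {"run_id": 2, "dose_mg": 0.0, "response": 0.10},
    {"run_id": 3, "dose_mg": 5.0, "response": 0.38},
]
FIELDS = ["run_id", "dose_mg", "response"]

# 1. A "file": CSV text (kept in memory here to stay self-contained).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(RUNS)
csv_text = buf.getvalue()

# 2. A "docubase" entry: one JSON document holding all runs.
json_doc = json.dumps({"runs": RUNS})

# 3. A SQL table: runs as rows, variables as columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE run (run_id INTEGER, dose_mg REAL, response REAL)")
conn.executemany("INSERT INTO run VALUES (:run_id, :dose_mg, :response)", RUNS)


def mean_from_file(text):
    reader = csv.DictReader(io.StringIO(text))
    return mean(float(row["response"]) for row in reader)


def mean_from_document(doc):
    return mean(r["response"] for r in json.loads(doc)["runs"])


def mean_from_sql(connection):
    return connection.execute("SELECT AVG(response) FROM run").fetchone()[0]


results = [mean_from_file(csv_text), mean_from_document(json_doc), mean_from_sql(conn)]
print(results)
```

All three representations yield the same analytical result (up to floating-point rounding), because none of them encodes anything beyond the values themselves.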
Databases in general and relational databases in particular are intended to be most useful for knowledge representation in the context of validation -- when a theory has been formulated -- and the logical model representing it in the database can be used to validate it and can be further analyzed to derive additional implications of the theory. An RDBMS can then meet the requirements of knowledge representation by enforcing integrity constraints in the database (for consistency of the logical model with the conceptual model/theory) and by supporting queries that make logical inferences (derive implications).
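A rough sketch of both functions follows, using SQLite from Python purely for illustration (SQLite is a SQL DBMS, not a true RDBMS). The schema, the table and column names, and the toy "theory" (every run belongs to a known adult subject and has a non-negative dose) are all assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Logical model: constraints declare what the (hypothetical) theory asserts.
conn.executescript("""
    CREATE TABLE subject (
        subject_id INTEGER PRIMARY KEY,
        age        INTEGER NOT NULL CHECK (age >= 18)
    );
    CREATE TABLE run (
        run_id     INTEGER PRIMARY KEY,
        subject_id INTEGER NOT NULL REFERENCES subject(subject_id),
        dose_mg    REAL    NOT NULL CHECK (dose_mg >= 0),
        response   REAL    NOT NULL
    );
""")


def try_insert(connection, sql):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        connection.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Rows consistent with the model are accepted.
accepted = [
    try_insert(conn, "INSERT INTO subject VALUES (1, 34)"),
    try_insert(conn, "INSERT INTO subject VALUES (2, 51)"),
    try_insert(conn, "INSERT INTO run VALUES (10, 1, 5.0, 0.42)"),
    try_insert(conn, "INSERT INTO run VALUES (11, 2, 0.0, 0.10)"),
]

# Rows inconsistent with the model are rejected by the DBMS, not by the user.
unknown_subject_ok = try_insert(conn, "INSERT INTO run VALUES (12, 99, 5.0, 0.40)")
negative_dose_ok = try_insert(conn, "INSERT INTO run VALUES (13, 1, -1.0, 0.40)")

# A query derives an implication: which subjects actually received treatment.
treated = conn.execute("""
    SELECT s.subject_id
    FROM subject AS s
    JOIN run AS r ON r.subject_id = s.subject_id
    WHERE r.dose_mg > 0
    ORDER BY s.subject_id
""").fetchall()

print(accepted, unknown_subject_ok, negative_dose_ok, treated)
```

The constraints keep the database consistent with the stated model, and the join derives a fact that is stored nowhere explicitly. How faithfully any particular SQL product supports this is a separate question.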
The dual theoretical foundation of set theory and first order predicate logic (FOPL) guarantees correct results. Unfortunately, true RDBMSs don’t exist. The closest the IT industry has ever come are SQL DBMSs, which -- the common misconception notwithstanding -- are not truly relational and meet the requirements only partially. NoSQL systems cannot do better and as a rule do much less.
In the context of discovery a theory/model is being sought and knowledge discovery (commonly referred to as data mining) is distinct from knowledge representation. That, however:
“Also requires a formal logical system supplemented with a theory about semantics and how properties and relationships are to be recognized in or inferred from a subject matter. Any database can be used for knowledge discovery, but not all DBMS's support any degree of automated knowledge discovery and, unfortunately, many even fail to enhance manual knowledge discovery to any substantial degree. That is certainly true of commercial NoSQL systems that lack formal logical underpinnings, so any knowledge discovery is ad-hoc and manual: it all occurs in the head of the user.” -- David McGoveran