(Image: Andrey VP/Shutterstock)
You've probably heard the frequent argument that relational databases (which, unfortunately, in practice, means SQL ones) do not serve the performance, flexibility, and temporalization needs of analytical applications satisfactorily. Indeed, Anchor, Data Vault, and Dimensional Modeling techniques are promoted as solutions to the "problems" due to normalized databases. All this is rooted in certain fundamental misconceptions that can be costly for business intelligence, analytics, and data science.
What the industry calls a warehouse is essentially a materialized view of an operational database that typically doesn't get updated in full sync with the underlying database -- i.e., a time-specific snapshot database copy. Such views are produced by data manipulation -- applying some operations to the database. In effect they are query results. Many if not most data professionals overlook a unique and crucial feature of truly relational DBMSs and databases -- their formal dual theoretical foundation -- simple set theory (SST) and first order predicate logic (FOPL). If relational set operations are properly applied to fully normalized database relations, results are guaranteed to be logically and semantically correct.
Thus, (1) given a true RDBMS and a fully normalized relational database (2) a "transformation process" is triggered (by documented circumstances) that applies relational algebra operations to the database to produce the desired view, which (3) is materialized by persisting it in storage. If either the design of the database or the transformation applied by the DBMS violates the theory, correctness is no longer guaranteed.
While warehousing is often undertaken to improve performance, the transformation is purely logical and performance is determined exclusively at the physical implementation level -- how the data is represented and accessed in storage. The only physical implementation aspect of data warehousing is the materialization of the view, which is independent of warehousing techniques.
Warehouses are essentially databases biased for some data applications (and against others) and are rooted in poor database foundation knowledge and logical-physical confusion. Even when warehouses consist of relations, warehouse developers often do not understand their precise interpretation. The design is based on various unwarranted or false assumptions about what the data means. The above modeling methods do not allow documenting the transformation itself -- the relational algebra operations that comprise the transformation. But more often than not warehouses do not consist of relations, which are minimally required to be in first normal form (1NF), and are, therefore, not just denormalized, but non-relational. Consequently, all bets are off; sound derivations of correct analytical results are not guaranteed.
Neutral databases that serve different data views to multiple applications were introduced as a solution to the prohibitive problems caused by application-specific biased files. Application programs needing different views of the same data required differently structured files, proliferating redundancy and inconsistencies. Each and every application program had to enforce data integrity and security, and optimize performance -- functions now centralized in the DBMS. Application-based integrity enforcement created a redundant, complex, and error-prone maintenance burden that was so prohibitive it was mostly foregone. Due to lack of familiarity with history, application-biased databases are bringing those problems back. Those who forget the past are doomed to repeat it.
Because warehouses are read-only, the risks to data integrity may be smaller. They are confined and exclusive to the developer responsible for the transformation and the warehouse load procedure, an environment that is more controlled than the shared operational database that are usually updated by many applications/users. But warehouses are populated by SQL DBMSs that are not truly relational, from poorly designed operational databases, so all bets are off.
It sometimes makes sense to offload data and analytics from operational databases. But without the relational guarantee, caveat emptor.