The Trouble with Data Warehouse Analytics


(Image: Andrey VP/Shutterstock)

(Image: Andrey VP/Shutterstock)

You've probably heard the frequent argument that relational databases (which, unfortunately, in practice, means SQL ones) do not serve the performance, flexibility, and temporalization needs of analytical applications satisfactorily. Indeed, Anchor, Data Vault, and Dimensional Modeling techniques are promoted as solutions to the "problems" due to normalized databases. All this is rooted in certain fundamental misconceptions that can be costly for business intelligence, analytics, and data science.

What the industry calls a warehouse is essentially a materialized view of an operational database that typically doesn't get updated in full sync with the underlying database -- i.e., a time-specific snapshot database copy. Such views are produced by data manipulation -- applying some operations to the database. In effect they are query results. Many if not most data professionals overlook a unique and crucial feature of truly relational DBMSs and databases -- their formal dual theoretical foundation -- simple set theory (SST) and first order predicate logic (FOPL). If relational set operations are properly applied to fully normalized database relations, results are guaranteed to be logically and semantically correct.

Thus, (1) given a true RDBMS and a fully normalized relational database (2) a "transformation process" is triggered (by documented circumstances) that applies relational algebra operations to the database to produce the desired view, which (3) is materialized by persisting it in storage. If either the design of the database or the transformation applied by the DBMS violates the theory, correctness is no longer guaranteed.

While warehousing is often undertaken to improve performance, the transformation is purely logical and performance is determined exclusively at the physical implementation level -- how the data is represented and accessed in storage. The only physical implementation aspect of data warehousing is the materialization of the view, which is independent of warehousing techniques.

Warehouses are essentially databases biased for some data applications (and against others) and are rooted in poor database foundation knowledge and logical-physical confusion. Even when warehouses consist of relations, warehouse developers often do not understand their precise interpretation. The design is based on various unwarranted or false assumptions about what the data means. The above modeling methods do not allow documenting the transformation itself -- the relational algebra operations that comprise the transformation. But more often than not warehouses do not consist of relations, which are minimally required to be in first normal form (1NF), and are, therefore, not just denormalized, but non-relational. Consequently, all bets are off; sound derivations of correct analytical results are not guaranteed.

Neutral databases that serve different data views to multiple applications were introduced as a solution to the prohibitive problems caused by application-specific biased files. Application programs needing different views of the same data required differently structured files, proliferating redundancy and inconsistencies. Each and every application program had to enforce data integrity and security, and optimize performance -- functions now centralized in the DBMS. Application-based integrity enforcement created a redundant, complex, and error-prone maintenance burden that was so prohibitive it was mostly foregone. Due to lack of familiarity with history, application-biased databases are bringing those problems back. Those who forget the past are doomed to repeat it.

Because warehouses are read-only, the risks to data integrity may be smaller. They are confined and exclusive to the developer responsible for the transformation and the warehouse load procedure, an environment that is more controlled than the shared operational database that are usually updated by many applications/users. But warehouses are populated by SQL DBMSs that are not truly relational, from poorly designed operational databases, so all bets are off.

It sometimes makes sense to offload data and analytics from operational databases. But without the relational guarantee, caveat emptor.

Fabian Pascal, Founder, Editor & Publisher, Database Debunkings

Fabian Pascal is an independent writer, lecturer, and analyst specializing in database management, with emphasis on data fundamentals and the relational model. He was affiliated with Codd & Date and has taught and lectured at the business and academic levels. Clients include IBM, Census Bureau, CIA, Apple, UCSF, and IRS. He is founder, editor, and publisher of Database Debunkings, a Website dedicated to dispelling myths and misconceptions about database management; and the Practical Database Foundations series of papers. Pascal, author of three books, has contributed extensively to trade publications including DM Review, Database Programming and Design, DBMS, Byte, Infoworld, and Computerworld.

Redundancy, Consistency, and Integrity: Derivable Data

Analysts should not take database consistency for granted. Here's why.

The Necessity of Foreign Keys

A proper understanding of data fundamentals requires the understanding of the importance of keys and primary keys. This time we take a look at another important type of key -- foreign keys.


Re: Search for Data Warehousing 101
  • 4/4/2017 2:08:53 AM
NO RATINGS

It is NOT different -- it IS a database for analytical purposes, but a database nevertheless. Only those who don't understand fundamentals think it's different, because they have disconnected it from databases.

The title of something does not necessarily make it reliable. If you're not savvy, it's difficult for you to assess the accuracy and reliability by just a title. SO caveat emptor.

Re: Search for Data Warehousing 101
  • 4/4/2017 2:04:29 AM
NO RATINGS

With all due respect vendor sources are the last ones to consult when trying to educate yourself on matters that are mot product specific  for reasons that should be utterly obvious.

The industry produces tons of information on any subject you care to consider, but you can count on your fingers those that are sound--i.e., based on some modicum of science. A vast majority of the stuff is ad-hoc and very few professionals have a proper education--distinct from schooling--to appreciate the implications.

My post tried to stress those implications.

 

Re: logical architecture of a data warehouse
  • 4/4/2017 1:24:38 AM
NO RATINGS

From disparate data sources to a warehouse takes transformations. If both the sources and the DWH are relational, the transformations are based on relational theory and, therefore, are system-guaranteed to yield correct results. If they are ad-hoc and based on whatever the integrator perceives to be correct and lacks any formal theoretical foundations, all bets are off. So yes, people are taking that on, but because they lack a decent grasp of fundamentals they are oblivious to the caveats.

Re: Search for Data Warehousing 101
  • 3/15/2017 7:24:31 PM
NO RATINGS

Thank you Lyndon.  That helps a lot for those like myself who are not as savy. I especially like the one labeled "How is a data warehouse different from a regular database?"

Re: Search for Data Warehousing 101
  • 3/14/2017 8:19:52 AM
NO RATINGS

Interesting how the subtleties of data warehousing may be defined in different places and how differing views of just how to implement and manage the data come about depending on what our definitions may be. Almost a serious subject for doctors of philosophy to ponder?

Search for Data Warehousing 101
  • 3/10/2017 2:27:44 PM
NO RATINGS
1 saves

..

For those of us not so intimately involved with the achitecture and other more esoteric conceptual aspects of data warehousing, I went searching for something like an Introduction to Data Warehousing 101 to try to better understanding what data warehousing was and how it differed from plain old databases in general.

I couldn't find an actual Data Warehousing 101, but in addition to Fabian's definition/explanation, I found the following articles  (with excerpts below that I thought helpful to better understanding the issue):

https://www.tutorialspoint.com/dwh/dwh_data_warehousing.htm
What is Data Warehousing?
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidations

https://docs.oracle.com/cd/B10500_01/server.920/a96520/concept.htm
What is a Data Warehouse?
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

https://www.informatica.com/services-and-training/glossary-of-terms/data-warehousing-definition.html
How is a data warehouse different from a regular database?
Data warehouses use a different design from standard operational databases. The latter are optimized to maintain strict accuracy of data in the moment by rapidly updating real-time data. Data warehouses, by contrast, are designed to give a long-range view of data over time. They trade off transaction volume and instead specialize in data aggregation.
What are data warehouses used for?
Many types of business data are analyzed via data warehouses. The need for a data warehouse often becomes evident when analytic requirements run afoul of the ongoing performance of operational databases. Running a complex query on a database requires the database to enter a temporary fixed state. This is often untenable for transactional databases. A data warehouse is employed to do the analytic work, leaving the transactional database free to focus on transactions.

http://www.1keydata.com/datawarehousing/data-warehouse-definition.html
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.

Hope this may be helpful to others similarly curious ... 

..

Re: logical architecture of a data warehouse
  • 3/8/2017 2:43:14 PM
NO RATINGS

If, as the data modeler you want no part in the data warehouse ... or at least, the role of the data warehouse that deals with those performance and scalability issues which you consider out of your scope ... then you still have to deal with the issue of integrating various sources, including those sources outside the control of your organization, and keeping all necessary views/variations of time-variate data.  So you need to take on the responsibiliy of building in the ingestion and integration of all of that data to your master/transactional data model/database.  That's one possible solution, but it's not something that the people supporting most transactional systems are willing to take on.  

If all of that is done and available, then anything further that is done to enable reporting ... dealing with the physical performance/scalability realities ... using dimensional models, aggregates of various kinds, indexing, specialized database servers, etc ... could be considered simply part of the "freedom to do whatever they damn please at the physical level to maximize performance" ... not changing the logical model.  Still, it would be possible to screw up the results, but it's practically impossible to ensure that no client applications can screw up their view of the data after they query from your master copy, no matter whether they instantiate or persist anything on their end or not.  If all of the integration and other heavy lifting of applying business logic had been done at the source though, it's a lot easier to validate whether the clients have fumbled it downstream.

"The chance that any inferences made from DWHs are correct is purely coincidental."  - This sounds like a great disclaimer and I may consider it for placement on cube wall and email signature in the future.  However, my own opinion is that you can increase the odds to something beyond purely coincidental if you have the right people working on it.  

 

 

Re: logical architecture of a data warehouse
  • 3/8/2017 1:21:08 PM
NO RATINGS

> For the most part however those drivers relate to scalability and performance issues, so if looking at it from a logical purist standpoint those concerns are not going to rate consideration.

Logical-physical confusion (LPC): Performance and scalability are determined exclusively at the physical implementation level and the logical relational design has nothing to do with them. True RDBMSs support physical independence which give DBMS vendors complete freedom to do whatever they damn please at the physical level to maximize performance, without impairing applications. But because data professionals lack foundation knowledge, they confuse RDBMSs with SQL DBMSs, suffer from LPC and ensure that the correct solutions will never emerge.

 

Aside from the fact that the source databases are poorly designed and their data are not guaranteed to be correct, the transformation to DWH is ad-hoc, arbitrary and not based on the relational algebra or logically correct databases. The chance that any inferences made from DWHs are correct is purely coincidental.

Re: logical architecture of a data warehouse
  • 3/8/2017 9:27:51 AM
NO RATINGS

Fabian,

You said: "I suspect that the main driver for DWH is to bias databases for some applications (and against others)."

There is some truth to that, since in addition to the data integration aspect, a lot of data warehouse design is motivated in how the data is used (analysis and reporting) and how the characteristics of that use differ from the characteristics of transactional systems.  For the most part however those drivers relate to scalability and performance issues, so if looking at it from a logical purist standpoint those concerns are not going to rate consideration.  The best practices to handle those concerns vary depending on the situation ... they change over time as technology evolves and they are dependent on the relationships between the data/application and the infrastructure supporting those (or infrastructure budget.)  

All of that said, a data warehouse architecture/model done well will attempt to address those issues in a way that centers on the way the organization conceptualizes the things represented by the data, and the more-or-less constant demands of analysis (usually represented by dimensional modeling, at least for the layer that is accessed by end users).  The design should preferably not be dominated by the needs of specific client applications ... though certainly that happens.  Unfortunately too many people in the DW space are focused on certain vendor software stacks and their conceptual framework gets stuck within that specific vendor box.  (For example, Microsoft BI people who want to build lots of narrowly-purposed cubes.)

Finally, this aspect that you mentioned is truly very important: "Because warehouses are read-only, the risks to data integrity may be smaller. They are confined and exclusive to the developer responsible for the transformation and the warehouse load procedure, an environment that is more controlled than the shared operational database that are usually updated by many applications/users."  As you point out, integrity of the DW data is not guaranteed.  The work of pulling data from sources and integrating it into the data warehouse is definitely a point of vulnerability and needs to be taken very seriously.  Organizations that see it as simple data replication and have it being worked by the cheapest "qualified" resources available are asking for trouble (and usually get it.)  That said, since that function is (or should be) controlled by a specific team of professionals, and the clients of the data warehouse are read-only, the data integrity risks are limited to the actions of that specific data loading/integration (often called ETL) team.  

 

Re: logical architecture of a data warehouse
  • 3/7/2017 11:58:08 AM
NO RATINGS

Well, but then the issues that I raise in my post are much, much worse.

System-guaranteed correctness of results requires properly designed relational databases that adhere to the three design principles that are jointly satisfied by full normalization and truly relational databases. Since SQL DBMSs are not, databases are not fully normalized and the integration is not based on the relational algebra, DWHs are arbitrarily designed  and, therefore, all bets are off.

This is also the reason DWH integration is hard -- they lack a formal theoretical foundation. It's a return to something similar to the proliferation of files and application programs.

If data is poorly structured initially, the only sound way to integrate is by means of proper relational re-design. Since this is not what DWHs are about...

I suspect that the main driver for DWH is to bias databases for some applications (and against others). This is not much different than files&programs, except it's done with databases.

Unsound.

Page 1 / 2   >   >>
INFORMATION RESOURCES
ANALYTICS IN ACTION
CARTERTOONS
VIEW ALL +
QUICK POLL
VIEW ALL +