Fabian Pascal

The Trouble with Data Warehouse Analytics

NO RATINGS
View Comments: Newest First | Oldest First | Threaded View
SethBreedlove
User Rank
Data Doctor
Re: Search for Data Warehousing 101
SethBreedlove   3/15/2017 7:24:31 PM
NO RATINGS
Thank you Lyndon.  That helps a lot for those like myself who are not as savy. I especially like the one labeled "How is a data warehouse different from a regular database?"

kq4ym
User Rank
Data Doctor
Re: Search for Data Warehousing 101
kq4ym   3/14/2017 8:19:52 AM
NO RATINGS
Interesting how the subtleties of data warehousing may be defined in different places and how differing views of just how to implement and manage the data come about depending on what our definitions may be. Almost a serious subject for doctors of philosophy to ponder?

Lyndon_Henry
User Rank
Blogger
Search for Data Warehousing 101
Lyndon_Henry   3/10/2017 2:27:44 PM
NO RATINGS
1 saves
..

For those of us not so intimately involved with the achitecture and other more esoteric conceptual aspects of data warehousing, I went searching for something like an Introduction to Data Warehousing 101 to try to better understanding what data warehousing was and how it differed from plain old databases in general.

I couldn't find an actual Data Warehousing 101, but in addition to Fabian's definition/explanation, I found the following articles  (with excerpts below that I thought helpful to better understanding the issue):

https://www.tutorialspoint.com/dwh/dwh_data_warehousing.htm
What is Data Warehousing?
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidations

https://docs.oracle.com/cd/B10500_01/server.920/a96520/concept.htm
What is a Data Warehouse?
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

https://www.informatica.com/services-and-training/glossary-of-terms/data-warehousing-definition.html
How is a data warehouse different from a regular database?
Data warehouses use a different design from standard operational databases. The latter are optimized to maintain strict accuracy of data in the moment by rapidly updating real-time data. Data warehouses, by contrast, are designed to give a long-range view of data over time. They trade off transaction volume and instead specialize in data aggregation.
What are data warehouses used for?
Many types of business data are analyzed via data warehouses. The need for a data warehouse often becomes evident when analytic requirements run afoul of the ongoing performance of operational databases. Running a complex query on a database requires the database to enter a temporary fixed state. This is often untenable for transactional databases. A data warehouse is employed to do the analytic work, leaving the transactional database free to focus on transactions.

http://www.1keydata.com/datawarehousing/data-warehouse-definition.html
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.

Hope this may be helpful to others similarly curious ... 

..

JeffC44101
User Rank
Prospector
Re: logical architecture of a data warehouse
JeffC44101   3/8/2017 2:43:14 PM
NO RATINGS
If, as the data modeler you want no part in the data warehouse ... or at least, the role of the data warehouse that deals with those performance and scalability issues which you consider out of your scope ... then you still have to deal with the issue of integrating various sources, including those sources outside the control of your organization, and keeping all necessary views/variations of time-variate data.  So you need to take on the responsibiliy of building in the ingestion and integration of all of that data to your master/transactional data model/database.  That's one possible solution, but it's not something that the people supporting most transactional systems are willing to take on.  

If all of that is done and available, then anything further that is done to enable reporting ... dealing with the physical performance/scalability realities ... using dimensional models, aggregates of various kinds, indexing, specialized database servers, etc ... could be considered simply part of the "freedom to do whatever they damn please at the physical level to maximize performance" ... not changing the logical model.  Still, it would be possible to screw up the results, but it's practically impossible to ensure that no client applications can screw up their view of the data after they query from your master copy, no matter whether they instantiate or persist anything on their end or not.  If all of the integration and other heavy lifting of applying business logic had been done at the source though, it's a lot easier to validate whether the clients have fumbled it downstream.

"The chance that any inferences made from DWHs are correct is purely coincidental."  - This sounds like a great disclaimer and I may consider it for placement on cube wall and email signature in the future.  However, my own opinion is that you can increase the odds to something beyond purely coincidental if you have the right people working on it.  

 

 

dbdebunker
User Rank
Data Doctor
Re: logical architecture of a data warehouse
dbdebunker   3/8/2017 1:21:08 PM
NO RATINGS
> For the most part however those drivers relate to scalability and performance issues, so if looking at it from a logical purist standpoint those concerns are not going to rate consideration.

Logical-physical confusion (LPC): Performance and scalability are determined exclusively at the physical implementation level and the logical relational design has nothing to do with them. True RDBMSs support physical independence which give DBMS vendors complete freedom to do whatever they damn please at the physical level to maximize performance, without impairing applications. But because data professionals lack foundation knowledge, they confuse RDBMSs with SQL DBMSs, suffer from LPC and ensure that the correct solutions will never emerge.


 

Aside from the fact that the source databases are poorly designed and their data are not guaranteed to be correct, the transformation to DWH is ad-hoc, arbitrary and not based on the relational algebra or logically correct databases. The chance that any inferences made from DWHs are correct is purely coincidental.


JeffC44101
User Rank
Prospector
Re: logical architecture of a data warehouse
JeffC44101   3/8/2017 9:27:51 AM
NO RATINGS
Fabian,

You said: "I suspect that the main driver for DWH is to bias databases for some applications (and against others)."

There is some truth to that, since in addition to the data integration aspect, a lot of data warehouse design is motivated in how the data is used (analysis and reporting) and how the characteristics of that use differ from the characteristics of transactional systems.  For the most part however those drivers relate to scalability and performance issues, so if looking at it from a logical purist standpoint those concerns are not going to rate consideration.  The best practices to handle those concerns vary depending on the situation ... they change over time as technology evolves and they are dependent on the relationships between the data/application and the infrastructure supporting those (or infrastructure budget.)  

All of that said, a data warehouse architecture/model done well will attempt to address those issues in a way that centers on the way the organization conceptualizes the things represented by the data, and the more-or-less constant demands of analysis (usually represented by dimensional modeling, at least for the layer that is accessed by end users).  The design should preferably not be dominated by the needs of specific client applications ... though certainly that happens.  Unfortunately too many people in the DW space are focused on certain vendor software stacks and their conceptual framework gets stuck within that specific vendor box.  (For example, Microsoft BI people who want to build lots of narrowly-purposed cubes.)

Finally, this aspect that you mentioned is truly very important: "Because warehouses are read-only, the risks to data integrity may be smaller. They are confined and exclusive to the developer responsible for the transformation and the warehouse load procedure, an environment that is more controlled than the shared operational database that are usually updated by many applications/users."  As you point out, integrity of the DW data is not guaranteed.  The work of pulling data from sources and integrating it into the data warehouse is definitely a point of vulnerability and needs to be taken very seriously.  Organizations that see it as simple data replication and have it being worked by the cheapest "qualified" resources available are asking for trouble (and usually get it.)  That said, since that function is (or should be) controlled by a specific team of professionals, and the clients of the data warehouse are read-only, the data integrity risks are limited to the actions of that specific data loading/integration (often called ETL) team.  

 

dbdebunker
User Rank
Data Doctor
Re: logical architecture of a data warehouse
dbdebunker   3/7/2017 11:58:08 AM
NO RATINGS
Well, but then the issues that I raise in my post are much, much worse.

System-guaranteed correctness of results requires properly designed relational databases that adhere to the three design principles that are jointly satisfied by full normalization and truly relational databases. Since SQL DBMSs are not, databases are not fully normalized and the integration is not based on the relational algebra, DWHs are arbitrarily designed  and, therefore, all bets are off.

This is also the reason DWH integration is hard -- they lack a formal theoretical foundation. It's a return to something similar to the proliferation of files and application programs.

If data is poorly structured initially, the only sound way to integrate is by means of proper relational re-design. Since this is not what DWHs are about...

I suspect that the main driver for DWH is to bias databases for some applications (and against others). This is not much different than files&programs, except it's done with databases.

Unsound.

JeffC44101
User Rank
Prospector
logical architecture of a data warehouse
JeffC44101   3/6/2017 10:52:30 AM
NO RATINGS
Hi Fabian.  Your assumption seems to be that a data warehouse exists primarily for performance purposes, saying "what the industry calls a warehouse is essentially a materialized view of an operational database".  That is certainly one fuction of a DW, but I submit it is not the primary function.

The main effort that goes into building most data warehouses is data integration from multiple source systems.  In a theoretical perfect world, all of the data would be perfectly integrated in the transactional system, however that is almost never the case in real life for several reasons.  One reason is that even custom built internal systems often do not have data logically integrated ... using the same keys ... before being brought into a data warehouse.  Sometimes these systems are partially or entirely purchased from external vendors.  Some of the sources are entirely outside the corporate firewall and control of the organization ... either "cloud" applications or other data sources that the organization wants to integrate with their own internal data for analytical purposes.  There can also be integration from different sources of the same data, either concurrent or over different historical periods.   A good data warehouse architect will try to take these disparate sources and transform them into a model that best reflects the way his organization thinks about it.  

Definitely there are a lot of poor practices and poor designers out there.  A lot of people with impressive sounding titles in the data warehouse space don't actually have a good understanding of the theory or practice outside of a particular way they learned to do it 20 years ago that was based in overcoming performance limitations of certain software and infrastructure of the time at a certain business.  That leads to a lot of misunderstanding and misrepresentation of what a datawarehouse is and why it exists.  That's not to say that performance isn't still an issue ... performance and scalability are a real world consideration until you have infinite resources and finite data ... but there is a lot more to it. 

In my opinion, data warehousing suffers due to a general underestimation of the real complexity involved and how difficult it is to do well, which leads to a lot of it being done poorly, and gives the entire practice a bad name.  I suspect you are going to blast me as a heretic, but if you find yourself with interest and time to dig into the topic further, you might want to check out Bill Inmon if you have not.  There is an ancient divide in DW practice between Inmon and Kimball ... and personally I think there is room for the best of both ... but I definitely see you tolerating the Inmon view more, which holds that a data warehouse should always integrate the data in a centralized model in a normalized form.  In any case, any serious consideration of the data modeling, be it Inmon or Kimball or whatever, is better than the idea that modeling is outdated and not needed because all you need to do is shove a bunch of data into a "big data" solution of some sort and let the magic take care of it for you, "just like a Google search".

 

Information Resources
More Blogs from Fabian Pascal
Analysts should avoid relying on techniques that undermine the soundness of database design.
The emphasis on coding in place of education obscures and disregards the core practical objective of database management to minimize programming.
Those delving into data science have to learn the important difference between prediction and explanation, as we discovered in post-election discussions and finger-pointing.
Don't be fooled by the promises that denormalization in data management will provide performance gains at no cost. The real cost will be at the expense of analytics.
Radio Show
A2 Conversations
UPCOMING
James M. Connolly
Evolution of the Data Scientist Role


3/23/2017   REGISTER   0
UPCOMING
Jessica Davis
Monetize Your Data: Turning Insights Into Action


3/29/2017   REGISTER   0
ARCHIVE
Jessica Davis
Data Analysts in Training: Meeting Tomorrow's Demand


3/8/2017  LISTEN   63
ARCHIVE
Jessica Davis
Our Bodies, Our Data: Medical Records For Sale


2/21/2017  LISTEN   63
ARCHIVE
Jessica Davis
Energy Analytics: Using Data to Find Savings


2/14/2017  LISTEN   44
ARCHIVE
Jessica Davis
Sharpen Your Analytics & Data Management Strategy


2/8/2017  LISTEN   74
ARCHIVE
Jessica Davis
Analytics: Make the Most of Data's Potential in 2017


1/19/2017  LISTEN   19
ARCHIVE
Jessica Davis
A2 Radio: Can You Trust Your Data?


12/20/2016  LISTEN   70
ARCHIVE
James M. Connolly
Retail Analytics: See Where Style Meets Statistics


12/6/2016  LISTEN   53
ARCHIVE
James M. Connolly
Why the IoT Matters to Your Business


11/29/2016  LISTEN   45
ARCHIVE
James M. Connolly
Will Data and Humans Become Friends in 2017?


11/22/2016  LISTEN   40
ARCHIVE
James M. Connolly
We Can Build Smarter Cities


10/20/2016  LISTEN   31
Information Resources
Quick Poll
Quick Poll
About Us  |  Contact Us  |  Help  |  Register  |  Twitter  |  Facebook  |  RSS