Data Prep and Data Quality


After data acquisition, the early stages of the “data science process” include data preparation (also known as data transformation, data wrangling, and data munging) and data quality (also known as data cleansing). Historically, these stages have been arduous and time-consuming: most data scientists and data analysts spend the majority of their time just trying to prepare data for analysis.


Data scientists and analysts are a rare commodity today, and most command high salaries. As a consequence, most businesses would rather have these highly paid specialists figuring out what the data means than prepping and cleansing it. Well-honed tool sets are needed to make that possible.

Here’s a short list of data prep tasks commonly found in enterprise data pipelines (a brief sketch follows the list):

  • Combine data attributes across multiple data sets
  • Shape data to make it more suitable for analytics
  • Enrich data to provide the contexts needed for analytics
  • Transform data to ready it for machine learning algorithms
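As a hypothetical illustration only (the file and column names below are invented for the sketch), here is roughly what those four steps might look like in pandas:

import pandas as pd

# Combine: join attributes from two data sets on a shared key
customers = pd.read_csv("customers.csv")   # customer_id, region, signup_date
orders = pd.read_csv("orders.csv")         # customer_id, order_date, amount
combined = orders.merge(customers, on="customer_id", how="left")

# Shape: aggregate transactional rows into one row per customer
shaped = combined.pivot_table(index="customer_id", values="amount",
                              aggfunc=["sum", "mean", "count"])

# Enrich: derive context the raw data lacks, such as customer tenure
combined["tenure_days"] = (pd.to_datetime(combined["order_date"])
                           - pd.to_datetime(combined["signup_date"])).dt.days

# Transform: encode a categorical column for machine learning algorithms
model_ready = pd.get_dummies(combined, columns=["region"])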

On the data quality side, we also have a number of common tasks (again, a sketch follows the list):

  • Highlight inconsistencies, gaps, and duplications
  • Correct (or remove) corrupt, inaccurate, or improperly formatted data
  • Manage missing data by imputing new values
  • Perform harmonization and standardization processes
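Again as a hedged sketch (the input file and column names are invented), pandas handles the most common of these cleansing chores:

import pandas as pd

df = pd.read_csv("raw_data.csv")   # hypothetical input

# Highlight gaps and duplications before deciding how to treat them
print(f"{df.duplicated().sum()} duplicate rows")
print(df.isna().sum())             # missing-value count per column

# Correct or remove bad data: coerce malformed numbers, drop exact duplicates
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.drop_duplicates()

# Manage missing data by imputing new values (here, the column median)
df["amount"] = df["amount"].fillna(df["amount"].median())

# Harmonize and standardize inconsistent labels
df["state"] = df["state"].str.strip().str.upper().replace({"CALIF": "CA"})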

Fortunately, there are a number of vendors with products in the data prep space. These technology solutions can combine, clean, and shape data before any analytics is done. Most vendors bundle data cleansing with prep so that, seamlessly, the data comes out ready for feature engineering, model selection, and beyond. Some vendors offer “self-service” data prep solutions as a way for business users to rapidly access, blend, and prepare data for analysis without the help of IT; these solutions are designed to give data scientists, data analysts, and non-technical users alike a logical view of data enrichment. Although many data prep solutions include tools for ensuring data quality, it is still useful to distinguish between these two portions of the data pipeline.

Some companies in the big data space have gone a step further with data prep and data quality in the form of a process called “data blending.” Data blending is a quick and straightforward method for extracting value from multiple data sources. It can help discover correlations between different data sets without the time and expense of traditional data warehouse processes, yielding deeper business insights in hours rather than the weeks typical of traditional approaches. It also provides a way to automate time-consuming, manual data preparation tasks.
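In code, the core idea of blending is simply pulling heterogeneous sources into a common frame and correlating them straight away. A minimal sketch, assuming two invented daily-metrics files:

import pandas as pd

# Two heterogeneous sources, no data warehouse required (file names invented)
web_traffic = pd.read_json("daily_traffic.json")   # date, visits
sales = pd.read_csv("daily_sales.csv")             # date, revenue

# Blend on the shared key and look for a correlation immediately
blended = web_traffic.merge(sales, on="date", how="inner")
print(blended["visits"].corr(blended["revenue"]))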

Another important trend affecting the ability to transform and clean data is accommodating the needs of streaming applications. In these applications, data arrives in near real-time and requires low-latency handling; companies with real-time data streams need to transform and clean “data in motion.” A streaming data platform does two things:

  • Data integration: capture streams of events or data changes and feed them to other data systems such as relational databases, key-value stores, Hadoop, or the data warehouse
  • Stream processing: enable continuous, real-time processing and transformation (including data prep and cleansing) of these streams, and make the results available system-wide (see the sketch below). The goal of a streaming data platform is to process data streams as the data arrives.
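To make the stream-processing half concrete, here is a minimal, framework-free Python sketch; a production system would sit on a platform such as Kafka, and the event fields below are invented:

import json

def event_stream(lines):
    """Simulate raw JSON events arriving one at a time."""
    for line in lines:
        yield json.loads(line)

def clean_in_motion(events):
    """Prep and cleanse each event as it arrives, not as a batch job."""
    for event in events:
        if "user_id" not in event:                         # drop corrupt records
            continue
        event["amount"] = float(event.get("amount", 0.0))  # standardize types
        yield event

raw = ['{"user_id": 1, "amount": "9.99"}', '{"no_id": true}']
for cleaned in clean_in_motion(event_stream(raw)):
    print(cleaned)   # results available downstream with low latency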

In this post, I’ve mentioned just a few generalized data prep and data quality tasks. What specific tasks can you think of? Please reply with your own anecdotes. What tools did you use? Was your solution successful?

Daniel D. Gutierrez, Data Scientist

Daniel D. Gutierrez is a Data Scientist with Los Angeles-based Amulet Analytics, a service division of Amulet Development Corp. He's been involved with data science and big data since long before they came into vogue, so imagine his delight when the Harvard Business Review recently deemed "data scientist" the sexiest job of the 21st century. Previously, he taught computer science and database classes at UCLA Extension for over 15 years, and authored three computer industry books on database technology. He also served as technical editor, columnist, and writer at a major monthly computer industry publication for seven years. Follow his data science musings at @AMULETAnalytics.



Re: Stop munging my data
  • 9/15/2016 1:06:41 PM

I've seen that from time to time, too, Terry. It should be in the data cleansing category, but I'm not sure whether there's a specific difference in the terms' meanings. One challenge in data science is that a lot of terms are used interchangeably and can lose meaning in some ways.

Re: Stop munging my data
  • 9/13/2016 10:27:48 PM

Another term I've heard used is "data scrubbing"... because dirty, filthy data needs a long soak in a vat of lye and then a good going-over with some steel wool.

Is this related, separate, one and the same as munging, or something else?

Re: Stop munging my data
  • 9/13/2016 4:01:12 PM

Daniel writes:

    I always called the process "data transformation." I had never heard of either "munging" or "wrangling." But when I was acting as a Community TA for the data science certificate program on Coursera, the Johns Hopkins professors teaching the course used "munging," so I kind of adopted the term. I even have a chapter on "Data Munging" in my new machine learning book! Call it what you will, the process is the important part, not the term.

I agree the process is more important than what you call it. In any case, it's an issue involving data quality, including the quality of data input to analytical processes.

"Data munging" seems suspiciously similar to "data cooking", which is a term more commonly used in my own professional circles. The problem of data beeing "munged" or "cooked" or whatever was an issue in a big Austin-area politically tinged urban transit controversy I was involved in over 2 years ago. This is described in my  A2 blog article at the time: Analytics Fuel Transit Duel in Austin. Incidentally, my side ultimately was the victor ...

 

Re: Stop munging my data
  • 9/13/2016 12:25:46 PM

Lyndon - I love that you provided the definition and some background for a phrase; I really did not understand how it came into use. Thank you!

Re: Stop munging my data
  • 9/13/2016 12:24:22 PM

True. I have a problem wrapping my head around the term munging as well. Terms should describe the action, but the value of an action can change. Event tracking is a technical term from JavaScript development, but when web analytics became popular, its meaning seemed to drift from its original purpose - to tag media on a web page.

Re: Stop munging my data
  • 9/12/2016 10:05:54 PM

Yes, "data munging" is an odd bird. In the past, I always called the process "data transformation." I had never heard of either "munging" or "wrangling." But when I was acting as a Community TA for the data science certificate program on Coursera, the Johns Hopkins professors teaching the course used "munging," so I kind of adopted the term. I even have a chapter on "Data Munging" in my new machine learning book! Call it what you will, the process is the important part, not the term.

Stop munging my data
  • 9/12/2016 5:48:19 PM

Daniel writes:

    After data acquisition, the early stages of the "data science process" include data preparation (also known as data transformation, data wrangling, and data munging) and data quality (also known as data cleansing).

I'd never before encountered the term "data munging", so I Googled it. For those who may be similarly benighted, here's a definition of the word munge from WhatIs.com:


    According to The New Hacker's Dictionary, munge (pronounced MUHNJ) is (1) a verb, used in a derogatory sense, meaning to imperfectly transform information, or (2) a noun meaning a comprehensive rewrite of a routine, data structure, or the whole program.


 

Hopefully, one of these days I'll have an opportunity to say something like "Just munge your own data, and leave mine alone ..."

 

Re: Data Prep is the new SEO
  • 9/2/2016 7:59:21 AM

The one that I really like at the moment is an open source graph database called Neo4j. It treats each data element as a node, then draws a graph showing how the nodes relate based on each element's metadata. Very useful.
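For anyone curious what that looks like in practice, here is a minimal, hypothetical sketch using the official neo4j Python driver (the connection details, labels, and properties are invented):

from neo4j import GraphDatabase

# Hypothetical connection details for a local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Each data element becomes a node; the relationship captures how they connect
    session.run(
        "MERGE (c:Customer {id: $cid}) "
        "MERGE (p:Product {sku: $sku}) "
        "MERGE (c)-[:PURCHASED]->(p)",
        cid=42, sku="A-100",
    )
driver.close()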

Re: Data Prep is the new SEO
  • 8/31/2016 11:38:50 PM

Yes, I agree. Furthermore, I see a continued flow of new data prep, data integration, and data quality tools arriving in the marketplace. That seems to say these tasks are being taken seriously.

Re: Data Prep is the new SEO
  • 8/31/2016 3:57:38 PM

Today's analytics climate has made companies curious, even if they don't yet have the knack for identifying what they need from the data. That ability will be refined as new data prep tools come to market.
