After data acquisition, the early stages of the “data science process” include data preparation (also known as data transformation, data wrangling, and data munging) and data quality (also known as data cleansing). Historically, the data prep and data quality stages have been long, arduous, and time-consuming processes. Most data scientists and data analysts spend the majority of their time trying to prepare data for analysis.
Data scientists and analysts are a rare commodity today, so most of them command high salaries. As a consequence, most businesses would rather have these highly paid specialists figuring out what the data means than prepping and cleansing it. Well-honed tool sets are needed to cut down the time spent on prep and cleansing.
Here’s a short list of data prep tasks commonly found in enterprise data pipelines:
- Combine data attributes across multiple data sets
- Shape data to make it more suitable for analytics
- Enrich data to provide the context needed for analytics
- Transform data to ready it for machine learning algorithms
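To make these tasks concrete, here is a minimal pandas sketch of what they might look like in practice. The file names and column names (customers.csv, transactions.csv, customer_id, amount, region) are hypothetical placeholders for the example, not any particular product’s API.

```python
import pandas as pd

# Hypothetical inputs: a customer table and a transaction table sharing customer_id.
customers = pd.read_csv("customers.csv")        # assumed columns: customer_id, region, signup_date
transactions = pd.read_csv("transactions.csv")  # assumed columns: customer_id, amount

# Combine: join attributes across the two data sets on the shared key.
combined = transactions.merge(customers, on="customer_id", how="left")

# Shape: aggregate raw transactions into one analytics-friendly row per customer.
shaped = combined.groupby(["customer_id", "region"], as_index=False).agg(
    total_spend=("amount", "sum"),
    purchase_count=("amount", "count"),
)

# Enrich: derive additional context from the combined data.
shaped["avg_spend"] = shaped["total_spend"] / shaped["purchase_count"]

# Transform: one-hot encode the categorical column so ML algorithms can consume it.
model_ready = pd.get_dummies(shaped, columns=["region"])
```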
On the data quality side, we also have a number of common tasks:
- Highlight inconsistencies, gaps, and duplications
- Correct (or remove) corrupt, inaccurate, or improperly formatted data
- Manage missing data by imputing new values
- Perform harmonization and standardization processes
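A similar sketch for the data quality side, again with hypothetical file and column names (sensor_readings.csv, device_id, temp_c, recorded_at), shows how a few lines of pandas can highlight, correct, impute, and standardize:

```python
import pandas as pd

df = pd.read_csv("sensor_readings.csv")  # assumed columns: device_id, temp_c, recorded_at

# Highlight inconsistencies, gaps, and duplications before deciding how to handle them.
print(df.isna().sum())          # missing values per column
print(df.duplicated().sum())    # fully duplicated rows

# Correct or remove: drop exact duplicates and rows with improperly formatted timestamps.
df = df.drop_duplicates()
df["recorded_at"] = pd.to_datetime(df["recorded_at"], errors="coerce")
df = df.dropna(subset=["recorded_at"])

# Manage missing data: impute per-device medians instead of dropping rows.
df["temp_c"] = df.groupby("device_id")["temp_c"].transform(lambda s: s.fillna(s.median()))

# Harmonize and standardize a free-text identifier to a single representation.
df["device_id"] = df["device_id"].str.strip().str.upper()
```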
Fortunately, a number of vendors have products in the data prep space. These technology solutions can combine, clean, and shape data before any kind of analytics is done. Most vendors bundle data cleansing with prep so that a single workflow readies the data for feature engineering, model selection, and beyond. Some vendors offer “self-service” data prep solutions as a way for business users to rapidly access, blend, and prepare data for analysis without the help of IT. These solutions are designed to give data scientists, data analysts, and non-technical users a logical view of data enrichment. Although many data prep solutions include tools for ensuring data quality, it is worth distinguishing between these two portions of the data pipeline.
Some companies in the big data space have gone a step further with data prep and data quality in the form of a process called “data blending.” Data blending is a quick and straightforward method for extracting value from multiple data sources. It can help uncover correlations between different data sets without the time and expense of traditional data warehouse processes, yielding deeper business insights in hours rather than the weeks typical of traditional approaches. It also provides a way to automate time-consuming, manual data preparation tasks.
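As a rough illustration of the idea, the snippet below blends two hypothetical sources (a CRM export and web analytics data) on a shared key and checks for a correlation between the blended measures, with no warehouse staging involved. All file and column names here are assumptions made for the example.

```python
import pandas as pd

crm = pd.read_csv("crm_accounts.csv")    # assumed columns: account_id, annual_revenue
web = pd.read_json("web_sessions.json")  # assumed columns: account_id, monthly_visits

# Blend the two sources directly on a shared key, with no warehouse staging.
blended = crm.merge(web, on="account_id", how="inner")

# Check whether the blended measures are correlated.
print(blended["annual_revenue"].corr(blended["monthly_visits"]))
```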
Another important trend that affects the ability to transform and clean data is accommodating the needs of streaming applications. In these applications, data arrives in near real-time and requires low-latency handling. Companies with real-time data streams need to transform and clean “data in motion.” A streaming data platform does two things:
- Data integration: it captures streams of events or data changes and feeds them to other data systems such as relational databases, key-value stores, Hadoop, or the data warehouse
- Stream processing: it enables continuous, real-time processing and transformation (including data prep and cleansing) of these streams and makes the results available system-wide

The goal of streaming data platforms is to process data streams as the data arrives, as the sketch below illustrates.
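Here is a minimal, library-free Python sketch of the stream-processing half: events are parsed, cleansed, and transformed one at a time as they arrive. In a real platform the `lines` iterable would be a Kafka consumer or similar source rather than a local file, and the field names (user_id, amount) are hypothetical.

```python
import json
from datetime import datetime, timezone

def cleanse(record):
    """Prep and cleanse a single event in motion; return None to drop it."""
    if "user_id" not in record or record.get("amount") is None:
        return None                                   # drop corrupt or incomplete events
    record["amount"] = float(record["amount"])        # normalize types
    record["processed_at"] = datetime.now(timezone.utc).isoformat()
    return record

def process_stream(lines):
    """Continuously transform events as they arrive, yielding cleaned results."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue                                  # skip malformed messages
        cleaned = cleanse(event)
        if cleaned is not None:
            yield cleaned                             # hand results to downstream systems

# Usage sketch: `lines` could just as well be a Kafka consumer or a socket.
for event in process_stream(open("events.jsonl")):
    print(event)
```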
In this post, I’ve mentioned just a few generalized data prep and data quality tasks. What specific tasks can you think of? Please reply with your own anecdotes. What tools did you use? Was your solution successful?