Simple tools for cleaning unformatted data can help remind us.
Google Refine, formerly Freebase Gridworks, is one such tool. Google Refine is downloaded to and runs on your desktop, though the interface is a standard browser. The user loads data into a project, then splits text strings that recur throughout the dataset into separate columns, rows, or any other tabular arrangement. Refine permits this editing across large datasets, as you can see in the video below. Google even encourages users to approach Refine differently from a spreadsheet.
The feature I like is that Refine saves each edit, much like Apple’s Time Machine backup software. Users can move back and forth through the edit history to compare changes to the table. A user can also reuse a set of text-separation steps in another project by saving it as a JSON file, then importing it.
That leads to another nice aspect. You don't need expert knowledge of a programming language to use Google Refine. A much more valuable skill is imagination. What matters most is knowing how to refine data based on your understanding of what needs to be separated or combined.
Let's consider, for example, a GM analyst who is matching data related to the Chevrolet Corvette. If he understands that a 7.0-liter V-8 engine and a 427-cubic-inch V-8 engine describe essentially the same displacement, he can group the records that share it. Google Refine offers several features that let the user focus on the purpose behind the dataset rather than on how to express the arrangement of the data in a specific technical language.
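To make the Corvette example concrete, here is a minimal Python sketch of that kind of grouping. It is not how Google Refine works internally; the function names, the label formats, and the choice to round to one decimal place are all assumptions made for illustration. The only fixed fact is the unit conversion: one liter is about 61.0237 cubic inches, so 427 cubic inches comes out at roughly 7.0 liters.

```python
import re
from collections import defaultdict

# Standard unit conversion: 1 liter is about 61.0237 cubic inches.
CUBIC_INCHES_PER_LITER = 61.0237

# Hypothetical pattern for labels like "7.0 liter V-8" or "427 cubic inches V-8".
DISPLACEMENT = re.compile(
    r"(\d+(?:\.\d+)?)\s*(liters?|l|cubic inch(?:es)?|ci|cid)\b"
)


def displacement_in_liters(label):
    """Return the engine displacement in liters, or None if none is found."""
    match = DISPLACEMENT.search(label.lower())
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    if unit.startswith("l"):
        return value  # already in liters
    return value / CUBIC_INCHES_PER_LITER  # convert cubic inches to liters


def group_by_displacement(labels):
    """Group labels whose displacement rounds to the same tenth of a liter."""
    groups = defaultdict(list)
    for label in labels:
        liters = displacement_in_liters(label)
        key = None if liters is None else round(liters, 1)
        groups[key].append(label)
    return dict(groups)


if __name__ == "__main__":
    sample = ["7.0 liter V-8", "427 cubic inches V-8", "6.2 liter V-8"]
    for liters, labels in group_by_displacement(sample).items():
        print(liters, labels)
```

The point of the sketch is the analyst's insight, not the code: once you know the two labels mean the same thing, the grouping rule is a single conversion and a rounding step.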
Google has recently updated Refine with a new format. The improvements do not radically change the user interface, unlike the updates to Google Analytics and YouTube Analytics, which I discuss in a previous post, Google Spruces Up YouTube Analytics.
Additional tools for refining data include DataWrangler, created by the Visualization Group at Stanford University.
DataWrangler's developers still need to work on some features that aren't fully intuitive yet. For example, at first it appears that you can only select one of three example datasets; only later do you realize that you can paste in and edit your own data.
Tools like Google Refine and DataWrangler will not eliminate the challenges of finding and arranging data. But they do make imported data cleaner. As a result, analysts can more easily envision how a table might be built and, ultimately, how a dataset might help the business.
Do you have a data cleansing tool of choice? Share on the message board below.