Clustering: Knowing Which Birds Flock Together

(Image: Teguh Jati Prasetyo/Shutterstock)

Imagine a jigsaw puzzle. Usually you can associate pieces by image and shape. But suppose every piece is the same shape and is small enough to make images confusing at first look. You'd take a guess at how they fit, right?

Data can be that way. Fortunately, analysts are finding many advanced ways to bring data together. One technique receiving attention these days is clustering, an unsupervised machine learning method that calculates how unlabeled data should be grouped.

Clustering has been used in different industries and studies when classification is unclear. Medical researchers, for example, use clustering to match patients with similar symptoms or results to clinical trials. Marketers in particular value clustering for discovering customer groups in unlabeled data. Here, we will look at how to prepare for a cluster analysis.

To run a cluster analysis, you can use some basic R or Python programming steps. Python is fairly straightforward in its application, while R relies on libraries, which are sets of functions that perform specialized tasks within a program.

Once you have chosen your programming language, you can import data into your program via a local file or an API call to a database. Developers have built libraries and Python frameworks that make API calls easy.

The data is placed into an object. Doing so allows you to inspect the data and ensure it is free of anomalies. Both R and Python rely on objects to place data in a matrix format, which makes it easy to map data to graphs.
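As a rough illustration of that loading and inspection step, here is a minimal Python sketch using only the standard library. The customer data, column names, and variable names are all hypothetical; in practice the text would come from a local file or an API response, and many analysts would reach for a data-frame library instead.

```python
import csv
import io

# Hypothetical customer data standing in for a local file or API response.
raw = """customer_id,visits,spend
c1,4,120.5
c2,7,310.0
c3,,95.0
c4,2,40.25
"""

# Place the data into an object: a list of dictionaries, one per row.
rows = list(csv.DictReader(io.StringIO(raw)))
features = ["visits", "spend"]

# Inspect for one kind of anomaly: rows with a missing feature value.
incomplete = [r["customer_id"] for r in rows if any(r[f] == "" for f in features)]

# Keep only complete rows and convert them to a numeric matrix (list of lists).
matrix = [
    [float(r[f]) for f in features]
    for r in rows
    if r["customer_id"] not in incomplete
]

print("Rows with missing values:", incomplete)
print("Matrix:", matrix)
```

Dropping incomplete rows is only one option; whether to drop, fill, or investigate them depends on the data.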

The next step is running an initial clustering to examine a few statistical details, with the purpose of determining the number of observations in each cluster and seeing how observations are matched to the clusters. The most valuable result is being able to plot the sum of squared error (SSE) against potential values of K, the number of clusters in K-means. SSE is the sum of the squared differences between each observation and the mean of the observations in its cluster. Its purpose is to measure how tight the clusters are: a low number implies less variation within them.
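To make the SSE calculation concrete, here is a small Python sketch that computes it for a given assignment of points to clusters. The four points, the labels, and the function name `within_cluster_sse` are made up for illustration.

```python
from collections import defaultdict


def within_cluster_sse(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    total = 0.0
    for members in groups.values():
        # Cluster mean, computed coordinate by coordinate.
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum(
            (x - m) ** 2 for point in members for x, m in zip(point, centroid)
        )
    return total


# Two hypothetical clusters of two points each.
points = [(1, 1), (1, 3), (5, 5), (7, 5)]
labels = [0, 0, 1, 1]

total = within_cluster_sse(points, labels)
print(total)  # each point sits 1 unit from its cluster mean, so 1 + 1 + 1 + 1 = 4.0
```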

Analysts plot this curve and follow the decreasing SSE until reaching an "elbow," the point where adding more clusters yields little further reduction. The elbow represents the recommended value of K.

As an example, I created an SSE versus K graph in R using the factoextra library. In this example, 6 is selected as the value of K.

The analysis is then recalculated to get the cluster details with K set equal to 6.
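The elbow search and the recalculation described above can be sketched in Python. This is not the R and factoextra workflow behind the article's graph. It is a minimal sketch under several assumptions: the data is synthetic (three made-up blobs of 30 points), the K-means loop is a plain Lloyd's iteration, and the farthest-first seeding is a choice made here only to keep the run deterministic. All function and variable names are hypothetical.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def seed_centroids(points, k):
    """Farthest-first seeding (an assumption here, used for determinism)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, sse)."""
    centroids = seed_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = mean_point(members)
    sse = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse


# Synthetic data: three well-separated blobs of 30 points each.
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
points = [
    (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
    for cx, cy in blob_centers
    for _ in range(30)
]

# SSE for each candidate K; plotting these values gives the elbow curve.
sse_by_k = {k: kmeans(points, k)[2] for k in range(1, 7)}
for k, sse in sse_by_k.items():
    print(f"K={k}  SSE={sse:.1f}")

# Recalculate with the chosen K and count the observations per cluster.
chosen_k = 3
labels, centroids, _ = kmeans(points, chosen_k)
cluster_sizes = Counter(labels)
print(cluster_sizes)
```

In this made-up data the SSE falls steeply up to K = 3 and changes little afterward, so 3 plays the role that 6 plays in the example above. Real data rarely shows such a clean elbow.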

From an analytics perspective, how data is processed in clustering differs from the straightforward labels seen in analytics solutions. Consider how analysts have typically reported: they've explained a result based on how data is arranged in metrics and dimensions, such as the largest sources of referral traffic or the keyword phrases that brought the largest search traffic.

With clustering, the data is not set in a pre-arranged relationship. There's no response variable -- the dimension dependent on metrics. The cluster algorithm examines the dataset, and then arranges the partitioning rules based on the data parameters.

Advanced tools like SAS Visual Analytics and SAS Visual Statistics can provide additional insights on clustering results. These tools, for example, can highlight whether a correlation exists among some clusters. That can aid decisions on how to treat the customer segments that the clusters represent.

There are a number of ways to determine clusters, along with examining the correlation between cluster groups. K-means is the most commonly used technique when starting a cluster analysis. But other types of clustering exist, such as hierarchical, which processes each observation so that the results are mapped out as a hierarchy rather than as data points grouped together.
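To show how a hierarchy differs from flat groups, here is a minimal Python sketch of one hierarchical approach: agglomerative clustering with single linkage, where the two closest clusters are merged repeatedly until one remains. The choice of single linkage, the five points, and the function name `agglomerate` are assumptions made for illustration, and the brute-force search is only suitable for tiny datasets.

```python
import math


def agglomerate(points):
    """Single-linkage agglomerative clustering; returns the merge history.

    Each entry is (cluster_a, cluster_b, distance), where the clusters are
    lists of point indices and distance is the gap that was bridged.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


# Five hypothetical observations along a line.
points = [(1.0, 1.0), (2.0, 1.0), (9.0, 1.0), (10.5, 1.0), (25.0, 1.0)]

merges = agglomerate(points)
for left, right, distance in merges:
    print(f"merge {left} + {right} at distance {distance:.1f}")
```

The merge history is the hierarchy: reading it from the top down and stopping at any level gives a grouping with that many clusters, so the number of clusters does not have to be fixed in advance.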

The selection of an analysis technique depends upon the assumptions you place on the data. A good choice depends upon appreciating the math being applied in a program and translating the data assumptions into the programming language you use.

But overall, there is no one single playbook. That open sky opportunity is the best benefit cluster analysis offers. Unlabeled data sparks creativity in finding data patterns. Clustering can ultimately provide new views of product, service, and customer segments and make delivering solutions to those segments less of an enigma.

Pierre DeBois, Founder, Zimana

Pierre DeBois is the founder of Zimana, a small business analytics consultancy that reviews data from Web analytics and social media dashboard solutions, then provides recommendations and Web development action that improves marketing strategy and business profitability. He has conducted analysis for various small businesses and has also provided his business and engineering acumen at various corporations such as Ford Motor Co. He writes analytics articles for and Pitney Bowes Smart Essentials and contributes business book reviews for Small Business Trends. Pierre looks forward to providing All Analytics readers tips and insights tailored to small businesses as well as new insights from Web analytics practitioners around the world.

Re: Cluster Analysis: A Blast from the Past
  • 11/11/2017 9:59:42 AM

A very good question. Getting to the optimal number of clusters and figuring out the validity of the correlations may not be as simple as it seems. But it does look like it may be a solution to some problems where the ideas may bear fruit, if not "know which birds flock together."

Re: Cluster Analysis: A Blast from the Past
  • 11/9/2017 9:27:06 AM

Basically cluster analysis, factor analysis, path analysis, etc. - these are all parametric methods that look at how variances of independent variables are grouped or linked. These techniques have been used almost since fire was discovered, but fell out of favor because without modern technology, they were cumbersome/brutal to calculate (in fact, in 1981, we had to learn and do path analysis for causal models by hand). What the modern tools do is make it easier by providing GUIs and making them accessible by the web. I am happy to see Pierre surface this technique - it means that the old stuff is still the good stuff.

Re: Cluster Analysis: A Blast from the Past
  • 11/9/2017 9:07:49 AM

Not being familiar with this data analytics technique, I did a bit of Googling for info. I found the following article to be particularly helpful:

An Introduction to Clustering and different methods of clustering

It provides the following simply explained overview:

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group than those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

Let's understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for customers in each of these 10 groups. And this is what we call clustering.

I found this fairly helpful in getting a clearer understanding of clustering and its value.


Re: Cluster Analysis: A Blast from the Past
  • 11/7/2017 10:51:33 PM

To be clear, are there more advances in identifying clusters or more ability to avoid cluster illusion or bias, or maybe both?

Re: Cluster Analysis: A Blast from the Past
  • 11/2/2017 6:32:23 PM

Thanks - I've been seeing a few of these types of posts in R-Bloggers, but it's more technical and focused on programmer usage and observation. One aspect that has been changing in analytics is how much programmer perspective and ability has been infused into the tasks required to analyze data and provide the story.

Cluster Analysis: A Blast from the Past
  • 11/2/2017 12:42:36 PM

Wow - cluster analysis - that takes me back to circa 1980. Thanks for surfacing this topic. I can't recall the last time that I have seen anything on it explicitly.