Predictive Analytics Troubleshooting: Identify Categorical Data Bias

(Image: Jasni/Shutterstock)

When you walk into a Las Vegas casino, knowing where to place your bets can be complicated. Another complicated process is knowing where your data bias is getting out of hand.

Not knowing about your data bias can have consequences with the advent of chatbots. Just ask Microsoft. Microsoft Tay, a chatbot, posted inflammatory and offensive tweets through its Twitter account, forcing Microsoft to shut down the algorithm only 16 hours after its launch. The decision tree for that bot contained a vulnerability that trolling Twitter users employed to corrupt Tay's responses into racist commentary.

But analysts are gaining more knowledge about how to detect where bias occurs, particularly in clusters. They are learning to break down the contributing elements of predictive models into segments to get a better view of bias factors.

I'll share a small tip from my experience. The origin of the word "analysis" -- Greek for "break down" -- has been my rote explanation to business leaders for over eight years. In my early years in analytics, I faced a huge effort to get everyone on the same page about what web analytics was and the value it held for business.

Those initial discussions feel like a cakewalk now when I think about explaining the benefits of now widely available advanced analytics.

Analysts now must raise their capabilities as they dig into the data, identify errors, and put outliers into context against a model. The data hygiene needed for predictive analytics demands more than that of reporting dashboards. For example, very few models can handle empty fields. Predictive model errors scale quickly if the data used to train the model contains errors, so users should know how a model handles missing data before using it.
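To make that concrete, here is a minimal pure-Python sketch of the two common policies for missing fields -- drop the record, or impute a default such as the column mean. The field names (`visits`, `revenue`) are hypothetical; real models and toolkits differ in which policy they apply by default, which is exactly why it pays to check before training.

```python
# Sample records where a strict model would reject the rows with None values.
records = [
    {"visits": 120, "revenue": 35.0},
    {"visits": None, "revenue": 12.5},
    {"visits": 80,  "revenue": None},
]

# Policy 1: drop any record that has an empty field.
complete = [r for r in records if all(v is not None for v in r.values())]

# Policy 2: impute each missing value with the mean of the observed values.
def impute_mean(rows, field):
    observed = [r[field] for r in rows if r[field] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, **{field: r[field] if r[field] is not None else mean})
            for r in rows]

filled = impute_mean(impute_mean(records, "visits"), "revenue")
```

Dropping rows shrinks the training set; imputing keeps the rows but can mask real gaps, so either choice should be a deliberate one.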

Categorical variables are another aspect that can introduce data bias. Membership in a category is binary -- either an observation falls into a given category or it does not. Thus categorical variables should draw from a data source with little variability.
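That binary membership is usually expressed to a model through one-hot encoding: each category becomes its own 0/1 indicator. A small sketch, with hypothetical marketing-channel values:

```python
# One-hot encode categorical values: each known category becomes a 0/1 flag.
def one_hot(values, categories):
    """Map each value to a dict of 0/1 indicators, one per known category."""
    return [{c: int(v == c) for c in categories} for v in values]

channels = ["email", "search", "email", "social"]
encoded = one_hot(channels, categories=["email", "search", "social"])
```

Because exactly one flag per row is 1, any row where the flags sum to 0 signals a value outside the expected categories -- a quick bias and hygiene check.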

But bias can appear even with something as binary as categories. Bias has significant consequences and is increasingly influenced by real-world experiences. A decision that presents opportunities based on stereotypes can be disastrous for a customer-facing business relying on chatbots. Chatbot designers should understand which questions may be influenced by culture, gender, and race in order to prevent catastrophes like the Microsoft Tay fiasco.

To get a handle on bias, first consider this modeling question: how should you deal with outliers in a given dataset of expected categories? Is an outlier an error, or an unknown category that was not anticipated? Such questions help to frame the context around a category.
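One way to surface those questions is to list every observed value that falls outside the expected category set, then review each surprise by hand -- is it a typo (an error) or a genuinely new category? A minimal sketch, with hypothetical channel names:

```python
# Flag observed values that fall outside the expected category set so an
# analyst can decide: typo to correct, or new category to add?
EXPECTED = {"email", "search", "social"}

observations = ["email", "search", "emial", "podcast", "social"]

unexpected = [v for v in observations if v not in EXPECTED]
```

Here "emial" is probably a data-entry error, while "podcast" may be a legitimate category the model never anticipated -- two very different fixes.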

That leads into a key tip: know when data is influenced by conditions such as sales seasonality. That knowledge can highlight outliers, or even spark debate among a business team about whether the seasonal data is appropriate for modeling.
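A simple way to operationalize that tip is to tag each record with its season so seasonal rows can be reviewed (or excluded) before modeling. The month-to-season mapping below is a simplifying assumption -- a November-December holiday peak -- and the sales figures are made up for illustration:

```python
# Split sales rows ("YYYY-MM", amount) into assumed peak-season and baseline
# groups so the team can debate whether peak rows belong in the model.
HOLIDAY_MONTHS = {11, 12}  # assumed peak season: November-December

sales = [("2017-03", 100), ("2017-11", 340), ("2017-12", 410), ("2017-06", 95)]

def split_by_season(rows):
    peak = [r for r in rows if int(r[0].split("-")[1]) in HOLIDAY_MONTHS]
    base = [r for r in rows if int(r[0].split("-")[1]) not in HOLIDAY_MONTHS]
    return peak, base

peak, base = split_by_season(sales)
```

A holiday row that looks like an outlier against the baseline may be perfectly normal for its season -- the split makes that conversation possible.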

It's best to incorporate responses to categorical concerns when you are applying data hygiene. Fortunately, the newest hygiene approaches are making best practices easier to establish. One example is tidy data, a concept first advocated by Hadley Wickham. Tidy data is a process to ensure that dataset columns, rows, and values appear in a consistent way prior to conducting deep analysis. When preparing data for a cluster analysis, it can also help ensure that each cluster draws on a sufficient data population once the data has been arranged.
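The core tidy-data move is reshaping a "wide" table (one column per measurement) into a "long" one where each variable has its own column and each observation its own row. Here is a stdlib-only sketch of that reshape; the region and month fields are hypothetical, and tools like pandas offer this as a built-in (`melt`):

```python
# Reshape a wide table (one column per month) into tidy long rows:
# one (region, variable, sales) observation per row.
wide = [
    {"region": "east", "Jan": 10, "Feb": 12},
    {"region": "west", "Jan": 7,  "Feb": 9},
]

def melt(rows, id_field, value_name):
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_field:
                long_rows.append({id_field: row[id_field],
                                  "variable": key,
                                  value_name: value})
    return long_rows

tidy = melt(wide, id_field="region", value_name="sales")
```

Once every observation is its own row, counting how many observations land in each cluster or category becomes a one-liner, which is what makes the "sufficient population per cluster" check practical.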

Simple models flush out categorical problems more visibly when a model is run. A key question in clustering is determining the optimal number of clusters, a number that is often not clear from the dataset itself. A trial with a simple model makes it easier to discern whether the number of clusters should be adjusted. K-Nearest Neighbors, which labels data points based on their closest neighbors, is one example of a simple machine learning algorithm that can be used.
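To show how little machinery such a trial requires, here is a minimal k-nearest-neighbors classifier in pure Python. It uses one-dimensional distance and made-up training points for simplicity; in practice a library implementation and real features would replace both:

```python
from collections import Counter

# Minimal k-nearest-neighbors: label a new point by the majority label
# among its k closest training points (1-D distance for simplicity).
def knn_predict(train, x, k=3):
    """train is a list of (value, label) pairs; returns the majority label."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
         (8.0, "high"), (9.0, "high")]

label = knn_predict(train, 1.8, k=3)
```

Because the model is this transparent, a misassigned point is easy to trace back to its neighbors -- which is exactly the visibility you want when hunting for categorical bias.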

A professional once wrote that the mystique of black box predictive modeling is giving way as more data -- and consequently, data bias techniques -- are being revealed. Analytics professionals can today decide the right technique to keep bias low and make the next Microsoft Tay a less likely bet.

Pierre DeBois, Founder, Zimana

Pierre DeBois is the founder of Zimana, a small business analytics consultancy that reviews data from Web analytics and social media dashboard solutions, then provides recommendations and Web development action that improves marketing strategy and business profitability. He has conducted analysis for various small businesses and has also provided his business and engineering acumen at various corporations such as Ford Motor Co. He writes analytics articles for Pitney Bowes Smart Essentials and contributes business book reviews for Small Business Trends. Pierre looks forward to providing All Analytics readers tips and insights tailored to small businesses as well as new insights from Web analytics practitioners around the world.

Clustering: Knowing Which Birds Flock Together

Analytics pros from many different industries employ clustering when classification is unclear. Here's how they do it.

How Analytics Can Help Marketers Reuse Content to Boost Sales

As content marketing becomes important, aging content becomes a concern. Here are some ways of using analytic reporting to develop content ideas and to align content with customers in the sales cycle.

Re: Open data
  • 6/27/2017 11:04:14 PM

@Predictable Chaos - Indeed, Tay could be considered an unintended success. It could be used to flush out sex offenders or terrorists, though there are lots of arguments on both sides about whether it would be effective or might even have negative results.

Re: Open data
  • 6/22/2017 5:40:32 PM

Anytime there is an interaction with a human, there is an opportunity to offend. Chatbots are linear and do not have the breadth of understanding of human innuendo and sarcasm, so the opportunity to offend does exist based on their programmed responses.

Re: Open data
  • 6/22/2017 11:44:48 AM

Microsoft now has a replacement for Tay, and it will be interesting to see how this one behaves compared to the first version of its chatbot, which was shut down last year after the embarrassing replies it made. Nonetheless, it's important, as noted, to remember that "A decision that presents opportunities based on stereotypes can be disastrous for a customer-facing business relying on chatbots."

Re: Accuracy and integrity
  • 6/16/2017 1:48:11 PM

Very true. Open data is another area with its own challenges. So many organizations are still sorting out their own data after years of dealing with challenges, system migrations, and critical changes that impacted the collection of data. That overall cleaning and unification effort still hinders many organizations.

Open data
  • 6/16/2017 1:35:40 PM

Tay on Twitter is an interesting topic.

From one point of view, maybe it was simply a success - within less than 24 hours, Tay sounded like a millennial on Twitter. It was generating, or echoing, the same pointless and offensive content as everyone who was interacting with it.

Of course this isn't exactly what Microsoft had in mind.

Re: Accuracy and integrity
  • 6/14/2017 10:40:07 PM

True about integrity. I'm also seeing some discussion about integrity with respect to open data - there's not a benchmark for the civic organizations that are relying on data, even though more are placing data and developer initiatives online.

Accuracy and integrity
  • 6/14/2017 12:34:02 PM

You highlight a key issue in all data analytics: integrity still gets us all. Until we achieve a high level of integrity in our data, all our analytics are flawed and can produce erroneous results. I am not sure we can ever fully address the issue, but hopefully, with better criteria in place, we can make the data cleaner and eliminate data sets that are known to lack the integrity needed to yield accurate results.