Let's Analyze 32 Years of Basketball Tournament Data!

Here in the US, it's the nationwide men's college basketball tournament season! So let's use some data from previous years' tournaments to sharpen our analytics & visualization skills...

But before we get started, I must mention (brag?) that my alma mater, NC State University, won this tournament in 1983. They were somewhat of an underdog team, which made the win even sweeter! And to get you into the mood for some college sports, here is a picture of my friend Jennifer's girls, all dressed in their NCSU colors. I was probably about this age when NCSU won the championship ... yeah, that's it -- that's the ticket! ;-)

Problems with the Original Graph

And now, let's get to the analytics! ... I had seen a visualization created using Tableau software. It was interesting, but I had to study it for quite a while before I "got" what it was showing. Once I "got it" I thought it was pretty interesting data, and I wanted to create my own version, and try to make things a little more intuitive, etc. Here are a few of the things in the Tableau version I thought could be improved:

  • There were too many shades of color to visually discern.
  • A diverging gradient color ramp was used, but there didn't seem to be enough games with the neutral/middle color.
  • None of the title or legend text above the graph explained clearly what the graph was showing.
  • The regions were not in a geographically logical order.
  • The graph didn't fit within my screen, and I had to scroll down to the bottom of the graph (which was off my screen) to scroll the graphic left/right.
  • And the title claims 30 years of data are being shown ... but it's actually 32.

Tracking Down the Data

There was a link in the original graph to the data source, but as I read on that page, it wasn't the original source; it was a copy of the data they had imported from this pdf document. The tool they had used to import the data from the pdf wouldn't run on my PC, so I did some Google searching and found an alternate way to read the data from a pdf. I imported it using a web service called pdftables.com, and I was very impressed -- they imported the data very cleanly, and let me save it as an Excel spreadsheet. I cleaned up the spreadsheet a bit (deleting the extraneous year headers I didn't need), and used SAS' Proc Import to get the data into a SAS dataset. I then used a few data step tricks to do things like rename some of the variables and fill in (retain) the dates for all the rows of data that had games on the same day.
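If you'd like to see the "fill in the dates" idea outside of SAS, here is a minimal Python sketch of it. This is not my SAS data step; the column names (date, winner, loser) and the sample rows are made up for illustration, and I'm assuming a blank or missing date means "same day as the row above."

```python
def fill_down(rows, key="date"):
    """Carry the last non-blank value of `key` forward to the rows below it."""
    filled = []
    last_seen = None
    for row in rows:
        value = row.get(key)
        if value not in (None, ""):
            last_seen = value          # a new date starts here
        # Copy the row so the original list is left untouched.
        filled.append({**row, key: last_seen})
    return filled


# Hypothetical rows: only the first game of each day carries its date.
games = [
    {"date": "3/14", "winner": "Team A", "loser": "Team B"},
    {"date": "", "winner": "Team C", "loser": "Team D"},
    {"date": "3/15", "winner": "Team E", "loser": "Team F"},
    {"date": None, "winner": "Team G", "loser": "Team H"},
]

filled_games = fill_down(games)
for game in filled_games:
    print(game["date"], game["winner"], "beat", game["loser"])
```

The loop remembers the most recent date it has seen and writes it onto every row that lacks one, which is the same job the retain trick does in a data step.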

Creating my Improved Visualization

As with many of my visualizations, I created a custom Gmap of simple squares, and then programmatically annotated text labels around them (some of the positions are data-driven, and some are hard-coded). It's all a matter of sorting your data in a logical way, and then adding x/y offsets for the polygons based on the data. I can't stress enough how useful and flexible this technique is! Here's a link to the SAS code if you'd like to see all the details -- it's a bit tedious, but nothing really difficult to follow.
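To give a feel for the "sort, then add x/y offsets" idea, here is a rough Python sketch. It is only an illustration, not my SAS code: the field names (year, round, label), the square size and gap, and the choice of one column per year with games stacked downward are all assumptions made for this example.

```python
def square(col, row, size=1.0, gap=0.25):
    """Corners of one square; col/row are grid positions, not map coordinates."""
    x0 = col * (size + gap)      # x offset: one column per year
    y0 = -row * (size + gap)     # y offset: later games are drawn lower
    return [(x0, y0), (x0 + size, y0), (x0 + size, y0 - size), (x0, y0 - size)]


def layout(games):
    """Sort the games, then give each one a square polygon at its grid cell."""
    years = sorted({g["year"] for g in games})
    col_of = {year: i for i, year in enumerate(years)}
    next_row = {year: 0 for year in years}
    polygons = []
    for g in sorted(games, key=lambda g: (g["year"], g["round"])):
        row = next_row[g["year"]]
        next_row[g["year"]] += 1
        polygons.append({"label": g["label"], "corners": square(col_of[g["year"]], row)})
    return polygons


# Hypothetical games, deliberately out of order to show the sort matters.
sample_games = [
    {"year": 1986, "round": 1, "label": "game 1"},
    {"year": 1985, "round": 2, "label": "game 2"},
    {"year": 1985, "round": 1, "label": "game 3"},
]
polygons = layout(sample_games)
for p in polygons:
    print(p["label"], p["corners"])
```

Each polygon is just four corner points, so once the sort order is right, the whole layout comes down to multiplying a column and row number by a fixed step.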

And here's an image of my final graph. I encourage you to click on it, to see the full size interactive version, with html mouse-over text for each of the boxes/games.


Here are the changes I made, to overcome the problems in the original graph:

  • I only used 5 shades of color, which is a number that is easy to visually discern.
  • Rather than making my neutral/middle color represent games in which the seed values were exactly equal, I had it represent games in which the seed values were within 1.
  • My main title tells you clearly what the graph represents.
  • I ordered my regions West-to-East, as they would be arranged in a geographical map.
  • I made my graph a little smaller, so it will hopefully fit on your screen -- but if it doesn't fit, you can use the browser's scroll bars, without having to first scroll to the bottom of the graph to scroll left/right.
  • My title doesn't claim that only 30 years of data are being analyzed.
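The 5-shade idea from the list above can be sketched in a few lines of Python. Only two things here come from my graph: there are 5 shades, and the neutral/middle one covers games where the seed values were within 1. Everything else is an assumption for illustration, including the cutoff of 8 for a "big" seed gap, the bucket names, and the function name.

```python
def shade(winner_seed, loser_seed, big_gap=8):
    """Sort a game into one of 5 buckets based on the seed difference.

    big_gap is a hypothetical cutoff, not a value taken from the graph.
    """
    diff = loser_seed - winner_seed    # positive: the better-seeded team won
    if abs(diff) <= 1:
        return "neutral"               # seeds within 1 of each other
    if diff >= big_gap:
        return "strong favorite won"
    if diff > 1:
        return "favorite won"
    if diff <= -big_gap:
        return "big upset"
    return "upset"


# Hypothetical matchups, written as (winner seed, loser seed).
for winner, loser in [(1, 16), (8, 9), (3, 6), (10, 7), (15, 2)]:
    print(winner, "over", loser, "->", shade(winner, loser))
```

Widening the neutral bucket from "exactly equal" to "within 1" is what lets the middle color show up often enough to be useful, since closely seeded pairings such as 8 vs 9 land there.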

Your Insight?

Does this graph give you any insight into the basketball tournament? Do the predicted winners usually win? Have the trends generally stayed the same from year to year? Would it be useful to color the games by some other variable or calculated value? I'd love to hear your thoughts and suggestions in the comment section!

This content was reposted from the SAS Learning Post. Go there to view the original.

Robert Allison, The Graph Guy!, SAS

Robert Allison has worked at SAS for more than 20 years and is perhaps the foremost expert in creating custom graphs using SAS/GRAPH. His educational background is in computer science, and he holds a BS, MS, and PhD from North Carolina State University. He is the author of several conference papers, has won a few graphic competitions, and has written a book called SAS/GRAPH: Beyond the Basics.

Re: Cinderella Chart
  • 4/5/2017 5:43:58 PM

Yes, it might be interesting to see if the "black swan" effect could be visualized: the outliers that aren't likely to win but every so often will, surprising everyone and putting a smile on the face of the lucky prognosticator who predicted the Cinderella team.

Cinderella Chart
  • 4/3/2017 1:23:09 PM

@Robert -

Here is a suggestion for a different, but related chart. Every year, there is a lot of attention paid to the possible upsets, and to the Cinderella teams that will somehow keep winning upsets deep into the tournament. (Remember Butler in 2010 and 2011?)

So you could take the same two-tone green-red squares, but instead of showing the whole tournament, just show the single team that was the best Cinderella team for that year.


Re: great comments!
  • 3/31/2017 11:00:29 AM

Maybe a couple of charts: one for all the teams, one for the Sweet Sixteen, and one for winning teams. Different charts might appeal to different levels of fans and allow them to compare their initial picks with their picks as the teams evolve.

great comments!
  • 3/31/2017 7:38:19 AM

I like the comments & suggestions I'm hearing! They're much better than, say, a year or so ago ... which I think means that all my blog readers have been paying close attention to the suggestions and things I try to teach in my blogs! :)

Re: Some insight
  • 3/30/2017 10:30:43 PM

We've all seen enough of Robert's handiwork to know that simple is better. Maybe this could have been pared down to the 10 most winning teams that play in March. 

Re: Conten Conclusions
  • 3/30/2017 10:22:25 PM

I'll pick up the chorus... this isn't as easily glean-able as most of your excellent re-do's, Robert. But I bet the March Madness maniacs among us are eating this up!

Re: Some insight
  • 3/30/2017 10:09:03 PM

PredictableChaos - interesting observations!

Re: Conten Conclusions
  • 3/30/2017 9:45:09 PM

I think the whole tournament is just too much to see at a glance. It would work much better if you limited it to the Sweet 16 and better.

Conten Conclusions
  • 3/30/2017 3:40:23 PM

I agree the second chart is easier to read, but I do think most people would have a hard time drawing conclusions quickly. A summary or bullets might help them understand how to interpret what they see with more clarity.

Re: Some insight
  • 3/29/2017 9:29:49 PM

@ Robert -  You must be the most interesting person to overhear in an elevator talking about sports, because you can speak about it in a way that most people can't.

And if you are in a heated debate about a particular play or fact you can pull out a graph. 
