Here in the US, it's the nationwide men's college basketball tournament season! So let's use some data from previous years' tournaments to sharpen our analytics & visualization skills...
But before we get started, I must mention (brag?) that my alma mater, NC State University, won this tournament in 1983. They were somewhat of an underdog team, which made the win even sweeter! And to get you into the mood for some college sports, here is a picture of my friend Jennifer's girls, all dressed in their NCSU colors. I was probably about this age when NCSU won the championship ... yeah, that's it -- that's the ticket! ;-)
Problems with the Original Graph
And now, let's get to the analytics! ... I had seen a visualization created using Tableau software. It was interesting, but I had to study it for quite a while before I "got" what it was showing. Once I "got it," I thought the data was pretty interesting, and I wanted to create my own version that would be a little more intuitive. Here are a few of the things in the Tableau version I thought could be improved:
- There were too many shades of color to visually discern.
- A diverging gradient color ramp was used, but there didn't seem to be enough games with the neutral/middle color.
- None of the title or legend text above the graph explained clearly what the graph was showing.
- The regions were not in a geographically logical order.
- The graph didn't fit within my screen, and I had to scroll down to the bottom of the graph (which was off my screen) to scroll the graphic left/right.
- And the title claims 30 years of data are being shown ... but it's actually 32.
Tracking Down the Data
There was a link in the original graph to the data source, but as I read on that page, it wasn't the original source, but rather a copy of the data they had imported from this pdf document. The tool they had used to import the data from the pdf wouldn't run on my PC, so I did some Google searching and found an alternate way to read the data from a pdf. I imported it using a web service called pdftables.com, and I was very impressed -- it imported the data very cleanly, and let me save it as an Excel spreadsheet. I cleaned up the spreadsheet a bit (deleting the extraneous year headers I didn't need), and used SAS' Proc Import to get the data into a SAS dataset. I then used a few data step tricks to do things like rename some of the variables and fill in (retain) the dates for all the rows of data that had games on the same day.
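For readers who don't use SAS, here is a rough Python sketch of the two clean-up steps just described: renaming columns, and carrying the date down to rows where it was left blank. It is not the SAS code from this post. The column names (`Date`, `game_date`) and the sample rows are made-up placeholders, and the function names are hypothetical.

```python
def rename_columns(rows, mapping):
    """Return new rows with keys renamed per mapping (other keys unchanged)."""
    return [{mapping.get(key, key): value for key, value in row.items()}
            for row in rows]


def fill_down_dates(rows, date_key="game_date"):
    """Carry the last non-blank date forward onto rows where it is blank.

    This mimics the idea of a retained variable in a SAS data step:
    remember the most recent value and reuse it until a new one appears.
    """
    filled = []
    last_date = None
    for row in rows:
        new_row = dict(row)              # copy, so the input is not modified
        if new_row.get(date_key):        # non-blank date: remember it
            last_date = new_row[date_key]
        else:                            # blank date: reuse the remembered one
            new_row[date_key] = last_date
        filled.append(new_row)
    return filled


# Placeholder rows (not real tournament data): only the first game of
# each day carries a date, as in the imported spreadsheet.
raw_rows = [
    {"Date": "Day 1", "Winner": "Team A"},
    {"Date": "",      "Winner": "Team B"},
    {"Date": "Day 2", "Winner": "Team C"},
]

clean_rows = fill_down_dates(rename_columns(raw_rows, {"Date": "game_date"}))
for row in clean_rows:
    print(row)
```

The key idea is simply to keep the last date seen in a variable and reuse it until a new one shows up.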
Creating my Improved Visualization
As with many of my visualizations, I created a custom Gmap of simple squares, and then programmatically annotated text labels around them (some of the positions are data-driven, and some are hard-coded). It's all a matter of sorting your data in a logical way, and then adding x/y offsets for the polygons based on the data. I can't stress enough how useful and flexible this technique is! Here's a link to the SAS code if you'd like to see all the details -- it's a bit tedious, but nothing really difficult to follow.
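To illustrate the "sort, then add x/y offsets" idea outside of SAS, here is a minimal Python sketch that turns a list of games into square polygons on a grid. It is only an assumption about one possible layout (one column per year, one row per game within that year), not the layout used in my SAS code. The field names `year` and `order`, and the `size` and `gap` parameters, are hypothetical.

```python
def square_polygon(col, row, size=10, gap=2):
    """Return the four corners of a square placed at grid cell (col, row)."""
    x0 = col * (size + gap)   # x offset grows with the column index
    y0 = row * (size + gap)   # y offset grows with the row index
    return [(x0, y0), (x0 + size, y0), (x0 + size, y0 + size), (x0, y0 + size)]


def layout_squares(games, size=10, gap=2):
    """Sort the games, then give each one a square at a data-driven offset."""
    years = sorted({game["year"] for game in games})
    column_of = {year: index for index, year in enumerate(years)}
    next_row = {}             # next free row in each column
    polygons = []
    for game in sorted(games, key=lambda g: (g["year"], g["order"])):
        col = column_of[game["year"]]
        row = next_row.get(col, 0)
        next_row[col] = row + 1
        polygons.append({
            "id": (game["year"], game["order"]),
            "points": square_polygon(col, row, size, gap),
        })
    return polygons


# Placeholder games (years and order values are illustrative only).
games = [
    {"year": 2001, "order": 2},
    {"year": 2000, "order": 1},
    {"year": 2001, "order": 1},
]
for polygon in layout_squares(games):
    print(polygon["id"], polygon["points"])
```

Each polygon is just a list of corner points, so the same output could feed any tool that draws polygons, and text labels can be positioned using the same offsets.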
And here's an image of my final graph. I encourage you to click on it, to see the full size interactive version, with html mouse-over text for each of the boxes/games.
Here are the changes I made, to overcome the problems in the original graph:
- I only used 5 shades of color, which is a number that is easy to visually discern.
- Rather than making my neutral/middle color represent games in which the seed values were exactly equal, I had it represent games in which the seed values were within 1.
- My main title tells you clearly what the graph represents.
- I ordered my regions West-to-East, as they would be arranged in a geographical map.
- I made my graph a little smaller, so it will hopefully fit on your screen -- but if it doesn't fit, you can use the browser's scroll bars to scroll, without having to first scroll to the bottom of the graph to scroll left/right.
- My title doesn't claim that only 30 years of data are being analyzed.
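The color changes in the list above boil down to binning each game into one of 5 categories, with the middle category covering seed values within 1 of each other. Here is a hedged Python sketch of that kind of binning. Only the "5 shades" and "within 1" parts come from this post; the `big_gap` cutoff of 8, the category names, and the function name are assumptions made up for illustration.

```python
def seed_bucket(winner_seed, loser_seed, big_gap=8):
    """Bin a game into one of 5 categories based on the seed difference.

    A lower seed number means a higher-ranked team. The big_gap cutoff
    is an assumed value for illustration, not taken from the graph.
    """
    diff = loser_seed - winner_seed   # positive: the better-seeded team won
    if abs(diff) <= 1:
        return "neutral"              # seeds within 1: the middle color
    if diff > 0:
        return "strong favorite won" if diff >= big_gap else "favorite won"
    return "big upset" if -diff >= big_gap else "upset"


# Example seed pairings (winner seed, loser seed), not real game results.
for winner, loser in [(1, 16), (3, 6), (8, 9), (12, 5), (15, 2)]:
    print(winner, loser, seed_bucket(winner, loser))
```

Widening the middle bucket from "exactly equal" to "within 1" is what gives the neutral color enough games to be visible in a diverging color ramp.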
Does this graph give you any insight into the basketball tournament? Do the predicted winners usually win? Have the trends generally stayed the same from year to year? Would it be useful to color the games by some other variable or calculated value? I'd love to hear your thoughts and suggestions in the comment section!
This content was reposted from the SAS Learning Post. Go there to view the original.