Ask A Data Scientist: How Does Clustering in Tableau 10 Work?
by Pat Lapomarda, on September 22, 2016
Katie Paige recently listed the 10 Things We Like About Tableau 10 here at Arkatechture. I've been a fan of Tableau as the "easy button" for BI for many years, and I was really excited when I read about #5: Clustering!
I did some digging to see how it was implemented and learned that Tableau uses K-Means, and that it takes care of the hardest part of using K-Means by recommending a value of k based on the Calinski-Harabasz Index. At its core, K-Means clustering splits a population into k groups on the basis of their features, so that the members of each group sit as close as possible to their group's mean.
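To make the idea concrete, here is a minimal pure-Python sketch of the textbook K-Means loop (Lloyd's algorithm). It is not Tableau's implementation: the function names, the random initialization, and the toy data are all assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Component-wise mean (centroid) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Textbook Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: naive initialization with k randomly chosen data points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = mean_point(members)
    return labels, centroids

# Hypothetical toy data: two loose groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
```

Note that k is an input here. The loop will happily produce any number of groups you ask for, which is exactly why choosing k is the hard part.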
Katie recommended a Tableau blog post by Tableau Zen Master Andy Cotgreave and it's a great primer on how & why to use clustering in Tableau. My research started with Tableau's product documentation. As expected, Tableau does a great job identifying the constraints they've applied in their implementation:
Clustering is not available when any of the following conditions apply:
- When you are using a cube (multidimensional) data source.
- When there is a blended dimension in the view.
- When there are no fields that can be used as variables (inputs) for clustering in the view.
- When there are no dimensions present in an aggregated view.
When any of those conditions apply, you will not be able to drag Clusters from the Analytics pane to the view.
In addition, the following field types cannot be used as variables (inputs) for clustering:
- Table calculations
- Blended calculations
- Ad-hoc calculations
- Generated latitude/longitude values
- Measure Names/Measure Values
After a little thought, these all make sense, but I could see myself trying several times to use a binned measure in clustering, since that's what I'd do in R or SAS (binning is the one Tableau feature I find truly dissatisfying). From this documentation, you can dig into How Clustering Works in Tableau and find Determining the optimal number of clusters with the Calinski-Harabasz criterion:
$$ \mathrm{CH}(k) = \frac{SSB / (k - 1)}{SSW / (N - k)} $$
where SSB is the overall between-cluster variance, SSW the overall within-cluster variance, k the number of clusters, and N the number of observations.
The math behind this is solid: the index is the ratio of between-cluster variance to within-cluster variance, with each term scaled by its degrees of freedom (k - 1 and N - k), so a higher value means tighter, better-separated clusters. I found this great explanation of the index on StackExchange.
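For a hands-on feel, here is a small pure-Python sketch that computes the index for a given labeling and uses it to compare two candidate clusterings. It assumes the standard sum-of-squares definitions of SSB and SSW; the helper names and the one-dimensional toy data are made up for illustration, and this is not Tableau's code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def calinski_harabasz(points, labels):
    """Return (SSB / (k - 1)) / (SSW / (N - k)) for one labeling."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the index needs 2 <= k < N")
    overall = centroid(points)
    ssb = 0.0  # between-cluster sum of squares
    ssw = 0.0  # within-cluster sum of squares
    for members in clusters.values():
        c = centroid(members)
        ssb += len(members) * sq_dist(c, overall)
        ssw += sum(sq_dist(p, c) for p in members)
    # Raises ZeroDivisionError if every point sits exactly on its centroid.
    return (ssb / (k - 1)) / (ssw / (n - k))

# Hypothetical 1-D data with two obvious groups.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]
candidates = {
    2: [0, 0, 1, 1],  # a two-cluster labeling
    3: [0, 0, 1, 2],  # a three-cluster labeling
}
scores = {k: calinski_harabasz(points, labs) for k, labs in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores)  # {2: 50.0, 3: 25.5}
print(best_k)  # 2
```

For the two-cluster labeling, SSB is 100 and SSW is 4, giving (100 / 1) / (4 / 2) = 50. Splitting the second group into singletons lowers SSW to 2 but costs a degree of freedom on each side, giving (102 / 2) / (2 / 1) = 25.5, so k = 2 wins.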
By recommending k, Tableau has made the hardest part of clustering easy, just like they've been doing for vizzing for years. It's no wonder we love Tableau here at Arkatechture.