Ask A Data Scientist: How Does Clustering in R vs. Tableau Work?

by Pat Lapomarda, on November 17, 2016

Have a pressing Data Science question?  Click Here to Ask A Data Scientist!

 

In the last installment of AADS, about the Top Algorithms Used by Data Scientists, I closed with a teaser:

In the next installment of Ask a Data Scientist we will compare two approaches: Tableau, the "new kid" on the clustering block, and R, a tried-and-true clustering stalwart. So stay tuned, and thanks for reading!

Ben Richey has been working with me on this.  We decided that the best way to understand the inner workings of Tableau's clustering would be to use the classic Iris dataset, which is built into R and also available from the UCI Machine Learning Repository. We used it to see how the k-means implementations in R and Tableau differ by trying to recover the actual iris species from clusters built on the petal and sepal lengths and widths.  The species separate clearly when plotted by petal length vs. petal width, so that plot is our comparison point:

 

[Figure: actual iris species plotted by petal length vs. petal width]
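If you want to follow along in R, the data and the species plot above take only a few lines. This is a minimal sketch using ggplot2, not necessarily the exact code behind our screenshots:

```r
# The iris data ships with base R (it can also be downloaded from the
# UCI Machine Learning Repository)
data(iris)
str(iris)  # 150 rows: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species

library(ggplot2)

# Actual species plotted by petal length vs. petal width -- the comparison point
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()
```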

 

Boulder Insights published a nice YouTube tutorial on getting started with clustering on the Iris data.  There are four features to cluster on in the Iris data set: Petal Length, Petal Width, Sepal Length, and Sepal Width. The basic Petal Length vs. Petal Width clustering is the same in both tools:

R:

[Figure: R k-means clusters on two variables (petal length vs. petal width)]

 

Tableau:

[Figure: Tableau clusters on two variables (petal length vs. petal width)]
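For reference, the two-variable R clustering can be reproduced along these lines (a sketch with k = 3 to match the three species; the exact options we used are in the R markdown linked at the end):

```r
set.seed(10)  # any fixed seed keeps the run reproducible

# k-means on petal length & width only, with k = 3 to match the three species
km2 <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3)

library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = factor(km2$cluster))) +
  geom_point() +
  labs(color = "Cluster")
```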

 

What's really interesting is that when you add a third feature (Sepal Width), the clusters start to diverge significantly:

 

R:

[Figure: R k-means clusters on three variables]

 

Tableau:

[Figure: Tableau clusters on three variables]
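Adding Sepal Width in R is the same call with one more column; a quick cross-tab against the two-variable run makes the divergence visible (again a sketch, building on the snippet above):

```r
set.seed(10)

# Add Sepal.Width as a third clustering variable
km3 <- kmeans(iris[, c("Petal.Length", "Petal.Width", "Sepal.Width")], centers = 3)

# Cross-tabulate against the two-variable clusters; cluster labels are arbitrary,
# so look for points that move between runs rather than matching label numbers
table(two_vars = km2$cluster, three_vars = km3$cluster)
```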

 

Finally, with the addition of the fourth feature (Sepal Length), R stabilizes; however, if we change the random starting point, we can see its effect on cluster assignment:

 

R:

[Figure: R k-means clusters on all four variables, seed 888]

[Figure: R k-means clusters on all four variables, seed 20]
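The two R plots above differ only in the random seed. Something like the following reproduces that effect; the seed values 888 and 20 are taken from the figure names, so treat them as illustrative:

```r
vars <- c("Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width")

set.seed(888)
km4_a <- kmeans(iris[, vars], centers = 3)

set.seed(20)
km4_b <- kmeans(iris[, vars], centers = 3)

# Different starting centroids can converge to different local optima,
# so some observations swap clusters between the two runs
table(seed_888 = km4_a$cluster, seed_20 = km4_b$cluster)
```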

Tableau:

[Figure: Tableau clusters on all four variables]

 

The drift between the two R runs has now become very noticeable, while Tableau remains quite stable; however, its second cluster has actually grown slightly with the addition of the fourth feature. This divergence between the two tools comes down to several differences in their technical implementations of k-means:

 

 
Distance Measurement

Tableau: Euclidean distance only.

R: The base kmeans function is also Euclidean only, but the alternative Kmeans implementation in the amap package adds Maximum, Manhattan, Canberra, Binary, Pearson, Correlation, Spearman, and Kendall distances.

Centroid Initialization

Tableau: Uses the Howard-Harris method to divide the original data into two parts, then repeats the split on the part with the highest distance variance until k clusters are reached. Bora Beran from Tableau does a great job explaining this.

R: Picks the k starting centroids either randomly (controlled with set.seed) or deterministically (by supplying them through the centers argument).

Categorical Variable Use

Tableau: Built-in transformation using Multiple Correspondence Analysis (MCA) to convert each category into a distance.

R: Requires either a separate function for categorical data (kmodes, which uses the mode rather than the mean) or pre-processing to convert categories into numbers; FactoMineR is the most popular package for that pre-processing.
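For example, the alternative R implementation mentioned above lives in the amap package; a sketch of a run with Manhattan distance (assuming amap is installed) looks like this:

```r
# install.packages("amap")  # alternative k-means implementation
library(amap)

# Same four features, but clustered with Manhattan distance instead of Euclidean
km_manhattan <- Kmeans(iris[, 1:4], centers = 3, method = "manhattan")
table(km_manhattan$cluster, iris$Species)
```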

 

Tableau does a much better job matching the actual Iris species, regardless of the features clustered. Here is a comparison of their misclassification rates (R on the left, Tableau on the right):

[Figure: misclassification comparison, first run (R left, Tableau right)]

[Figure: misclassification comparison, second run (R left, Tableau right)]

[Figure: misclassification comparison, third run (R left, Tableau right)]
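For the R side, the misclassification rate can be computed roughly like this: treat the majority species in each cluster as that cluster's label and count everything else as misclassified (a sketch; the linked R markdown shows the actual calculation):

```r
# Cross-tabulate cluster assignments against the true species
confusion <- table(cluster = km4_a$cluster, species = iris$Species)
confusion

# Majority species per cluster counts as "correct"; everything else is misclassified
misclassified <- sum(confusion) - sum(apply(confusion, 1, max))
misclassified / nrow(iris)  # misclassification rate
```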

 

Therefore, Tableau can really help speed up your analytics! Ready to test it out yourself? Check out our R markdown and Tableau workbook here:

View The R Markdown | View The Tableau Workbook

 

Have a feature you'd like us to test? Need some data solution recommendations? Curious about the Data Science industry?
Have a pressing Data Science question?  Click Here to Ask A Data Scientist!

 

Topics: Data Visualization, Data Science
