In this example we will demonstrate the k-means and hierarchical clustering algorithms. The libraries we are going to use are listed below.

Load the libraries.

library(cluster)
library(psych)
library(datasets)
library(pvclust)

Let’s set a path that we are going to use whenever we load or save files. You can replace it with your own.

path="~/Dropbox/MST_COURSES/SS16/practicals/"
setwd(path)

K-means

We will start with the k-means algorithm. The first step is to read a sample file from our computer and convert it to a data matrix. You should now be able to do this and also run some basic statistics to get a first idea of what your data look like.

data1 <- read.table(file='kmeans_data.csv', sep=',', header=T, row.names=1)
str(data1)
summary(data1)
data.p <- as.matrix(data1)

The next step is to “transform” our data into a form that is more suitable for clustering (you should be able to see that the variation between the different attributes is large).

#Convert data from counts to percents
data.p <- prop.table(data.p,1)*100

#Remove rows with missing values and Z-score standardize the data
kdata <- na.omit(data.p) 
kdata <- scale(kdata)

Next up, we select the maximum number of clusters that we are going to check. I will run the k-means algorithm several times in order to identify which solution gives the “best” results (i.e. which is the best k).

#Maximum number of clusters that I am going to check
n.lev = 15

# Calculate the within-groups sum of squared errors (SSE) for each cluster solution up to n.lev

wss <- rnorm(10) #placeholder values so that the while loop below runs at least once

# Repeat until the SSE curve is monotonically decreasing (a single k-means run can
# get stuck in a poor local optimum, which would distort the curve)
while (prod(wss==sort(wss,decreasing=T))==0) {
  wss <- (nrow(kdata)-1)*sum(apply(kdata,2,var))   # SSE for the 1-cluster solution
  for (i in 2:n.lev) wss[i] <- sum(kmeans(kdata, centers=i)$withinss)}

Next, we plot the SSE against all of the tested cluster solutions for the actual data.

xrange <- range(1:n.lev)
yrange <- range(min(wss),max(wss))
plot(xrange,yrange, type='n', xlab="Cluster Solution", ylab="Within Groups SSE", main="Cluster Solutions against SSE")
lines(wss, type="b", col='blue')

Using the “elbow” (shoulder) method (i.e. is there a point where the improvement of the algorithm suddenly levels off?), can you decide what the optimal number of clusters is?
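As a rough, optional aid for spotting the elbow, you can look at how much the SSE drops from one solution to the next (this is only a heuristic, and the object name sse.drop is just an example):

sse.drop <- -diff(wss)   # improvement gained by adding one more cluster
round(sse.drop, 2)       # the drops usually level off after the "best" k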

Next, choose an appropriate number of clusters and run the algorithm once more.

clust.level = ?   #set this number correctly :)
fit5 <- kmeans(kdata, clust.level)

Now, we can also write the results to the output and process them further (if needed).

aggregate(kdata, by=list(fit5$cluster), FUN=mean)
clust.out <- fit5$cluster
kclust <- as.matrix(clust.out)
kclust.out <- cbind(kclust, data1)
write.table(kclust.out, file=paste0(path,"kmeans_out_5.csv"), sep=",")

Further analysis with PCA plot

Let’s display the Principal Components plot of the data with the identified clusters.

clusplot(kdata, fit5$cluster, shade=F, labels=2, lines=0, color=T, lty=4, main='Principal Components plot showing K-means clusters')

And then save the output to a file.

kclust.out.p <- prop.table(as.matrix(kclust.out),1)*100
out <- capture.output(describeBy(kclust.out.p,kclust))
cat(out,file="Kmeans_out.txt", sep='\n', append=F)

Now repeat the process with another value of K. How are the results different? Was your choice of K successful or not?
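For instance, one quick way to see how two solutions relate is to cross-tabulate their cluster memberships. This is only a sketch: the second value of K (8) and the object name fit.b are purely illustrative.

fit.b <- kmeans(kdata, 8)   # 8 is just an illustrative second value of K
table(first = fit5$cluster, second = fit.b$cluster)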

Experimenting with another dataset

We loaded the datasets library, which contains the “attitude” dataset.

attitude

A little task for you :)

The first step would be to explore the dataset and look at basic statistics (how many attributes, what kind of attributes, whether there are large variances, outliers, etc.). Then try to apply the k-means algorithm and determine the best value for K. Finally, project the result with PCA in order to assess the quality of the clustering. A possible starting point is sketched below.
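If you get stuck, here is a minimal sketch of one possible starting point. The maximum number of clusters (10) and the final K (3) are only placeholders; adjust them after inspecting the SSE plot.

data(attitude)
str(attitude)
summary(attitude)
att <- scale(attitude)   # standardize the attributes
att.wss <- (nrow(att)-1)*sum(apply(att,2,var))
for (i in 2:10) att.wss[i] <- sum(kmeans(att, centers=i)$withinss)
plot(att.wss, type="b", xlab="Number of clusters", ylab="Within Groups SSE")
att.fit <- kmeans(att, 3)   # 3 is only a placeholder value for K
clusplot(att, att.fit$cluster, color=T, shade=F, labels=2, lines=0)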

Hierarchical Clustering

Now let’s try using a hierarchical clustering approach on the same dataset (the first one you loaded for kmeans).

In R, the “cluster” library implements hierarchical clustering using the agglomerative nesting algorithm (“agnes”).

The first argument x in agnes specifies the input data matrix or the dissimilarity matrix, depending on the value of the diss argument. If diss=TRUE, x is assumed to be a dissimilarity matrix. If diss=FALSE, x is treated as a matrix of observations. The argument stand = TRUE indicates that the data matrix is standardized before calculating the dissimilarities.

Each variable (a column in the data matrix) is standardized by first subtracting the mean value of the variable and then dividing the result by the mean absolute deviation of the variable. If x is already a dissimilarity matrix, this argument will be ignored.

To decide which two clusters to merge into a new cluster, the argument method specifies the measure of between-cluster distance. method=“single” is for single linkage clustering, method=“complete” for complete linkage clustering, and method=“average” for average linkage clustering. The default is method=“average”.
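For illustration, a minimal agnes call on our data could look like the sketch below. Here we pass the percentage matrix data.p from the k-means section and let stand=TRUE do the standardization; the object name agn.avg is just an example.

agn.avg <- agnes(data.p, diss=FALSE, stand=TRUE, method="average")
pltree(agn.avg, main="AGNES dendrogram (average linkage)")   # plot the dendrogram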

With the rect.hclust function we can draw the clusters on the dendrogram and see whether the result satisfies us. Choose an appropriate value for k in order to draw the clusters on the figure. Below we test four different linkage approaches for HAC, here using the base-R hclust function on the standardized data. Remember to set k to a correct value before running each rect.hclust statement.

agn <- hclust(dist(kdata), method = "single")
plot(agn, hang=-1)
rect.hclust(agn, k=?, border = "red")

agn <- hclust(dist(kdata), method = "complete")
plot(agn, hang=-1)
rect.hclust(agn, k=?, border = "red")

agn <- hclust(dist(kdata), method = "average")
plot(agn, hang=-1)
rect.hclust(agn, k=?, border = "red")

agn <- hclust(dist(kdata), method = "ward.D2")
plot(agn, hang=-1)
rect.hclust(agn, k=?, border = "red")

Can you (intuitively) compare the results of the four different approaches? Which one seems to be performing better?

Can you (intuitively) compare the results of k-means with the results of HAC for the same dataset?
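One way to compare the two solutions numerically (a sketch, assuming you kept fit5 and clust.level from the k-means part and the last hclust object agn from above) is to cut the dendrogram into the same number of clusters and cross-tabulate the memberships:

hac.clusters <- cutree(agn, k = clust.level)   # cut the dendrogram at clust.level clusters
table(kmeans = fit5$cluster, hac = hac.clusters)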