In this example we will demonstrate random forests and simple decision trees. The libraries we are going to use are listed below.

Load the libraries

library(rpart)
library(rpart.plot)
library(randomForest)
library(e1071)
library(caret)
library(rattle)

Help on the randomForest package and function (or any other package)

library(help=randomForest)
help(randomForest)

Business Scenario and dataset

A marketing department of a bank runs various marketing campaigns for cross-selling products, improving customer retention and customer services.

In this example, the bank wanted to cross-sell a term deposit product to its customers. Contacting all customers is costly and does not create a good customer experience. So, the bank wanted to build a predictive model that identifies the customers who are more likely to respond to a term deposit cross-sell campaign.

We will use a sample of marketing data to build a Random Forest based model in R.

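First, we read the file. Note that you might need to adjust the path.

path="~/Dropbox/MST_COURSES/SS16/practicals/"
# Use the path when reading; stringsAsFactors=TRUE keeps text columns (including y) as factors in R >= 4.0
termCrosssell<-read.csv(paste0(path,"bank-full.csv"),sep=";", header = TRUE, stringsAsFactors = TRUE)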

Next, we explore the data frame

names(termCrosssell)
str(termCrosssell)

How many variables are there? What are their types? What is the target variable? What is its type? What is the problem that we will need to solve?

Let’s do some statistics on the percentage of positive/negative samples in our dataset

table(termCrosssell$y)/nrow(termCrosssell)

What is the percentage of positive / negative data points?

Training and validation sets

Now, we will split the data sample into development and validation samples.

set.seed(123)  # arbitrary seed so that the split is reproducible
sample.ind <- sample(2, 
                     nrow(termCrosssell),
                     replace = TRUE,
                     prob = c(0.6,0.4))
cross.sell.dev <- termCrosssell[sample.ind==1,]
cross.sell.val <- termCrosssell[sample.ind==2,]

We need to make sure that we also have positive and negative samples in both sets. How can you check this, using the code from above (using table)?
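One possible way to do this check is sketched below on made-up data, so that it runs without the bank file. The data frame `toy`, its columns and the seed are assumptions for illustration only; the splitting scheme is the same as above.

```r
# Toy stand-in for the bank data: a two-level factor target named y
set.seed(42)                                   # arbitrary seed, for reproducibility only
toy <- data.frame(x = rnorm(1000),
                  y = factor(sample(c("no", "yes"), 1000,
                                    replace = TRUE, prob = c(0.9, 0.1))))

# Same splitting scheme as above: roughly 60% development, 40% validation
ind <- sample(2, nrow(toy), replace = TRUE, prob = c(0.6, 0.4))
toy.dev <- toy[ind == 1, ]
toy.val <- toy[ind == 2, ]

# Class proportions in each subset; prop.table() divides the counts by their total
dev.prop <- prop.table(table(toy.dev$y))
val.prop <- prop.table(table(toy.val$y))
rbind(development = dev.prop, validation = val.prop)
```

If the two rows of the result are close to each other, the random split has preserved the class balance reasonably well.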

Both the development and validation samples should now have a similar target variable distribution. This is just a simple check of the split. If the target variable is a factor, a classification tree is built. We can check the type of the response variable.

class(cross.sell.dev$y)

Notice that categorical variables are stored as factors in R (read.csv creates them only when stringsAsFactors = TRUE, which is no longer the default since R 4.0). If your target variable is a factor, the problem is classification; if it is numeric, it is regression.
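The link between the response type and the kind of tree can be seen in a small sketch. It uses the kyphosis data that ships with rpart instead of the bank data (an assumption made only to keep the example self-contained) and relies on rpart choosing its method from the response when none is given.

```r
library(rpart)

# kyphosis ships with rpart: Kyphosis is a factor, Age is numeric (integer)
class(kyphosis$Kyphosis)
class(kyphosis$Age)

# Factor response: rpart picks method "class" (classification tree)
fit.class <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Numeric response: rpart picks method "anova" (regression tree)
fit.reg <- rpart(Age ~ Number + Start, data = kyphosis)

fit.class$method
fit.reg$method
```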

Prepare the formula

varNames <- names(cross.sell.dev)
# Exclude ID or Response variable
varNames <- varNames[!varNames %in% c("y")]

# add + sign between explanatory variables
varNames1 <- paste(varNames, collapse = "+")

# Add response variable and convert to a formula object
rf.form <- as.formula(paste("y", varNames1, sep = " ~ "))
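To see what this construction produces, here is a minimal sketch with hypothetical column names (`age`, `job`, `balance` are placeholders, not a claim about the full data set). It also shows base R's reformulate() as an alternative way to build the same formula.

```r
# Hypothetical column names standing in for names(cross.sell.dev)
toyNames <- c("age", "job", "balance", "y")
predictors <- setdiff(toyNames, "y")

# Approach used above: paste the names together, then convert to a formula
f1 <- as.formula(paste("y", paste(predictors, collapse = "+"), sep = " ~ "))

# Base R alternative: reformulate() builds the same formula directly
f2 <- reformulate(predictors, response = "y")

f1
f2
```

When every remaining column is a predictor, the shorthand `y ~ .` is commonly used as well.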

Classification Trees

We will first fit a single classification tree and look at the results. The rpart function, which implements a simple decision tree algorithm, accepts control parameters such as minsplit (the minimum number of observations that must exist in a node for a split to be attempted) and minbucket (the minimum number of observations in any terminal node). If only one of minbucket or minsplit is specified, the code sets either minsplit to minbucket*3 or minbucket to minsplit/3, as appropriate. More parameters are described in ?rpart.control.

ctree=rpart(rf.form,method="class",data=cross.sell.dev, control=rpart.control(minsplit=10))
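The interplay between minsplit and minbucket can be inspected directly on the object returned by rpart.control. The sketch below also fits two trees on the kyphosis data that ships with rpart (used here only as a small stand-in for the bank data) so that the number of nodes can be compared for two settings.

```r
library(rpart)

# Only minsplit given: minbucket defaults to round(minsplit / 3)
ctl1 <- rpart.control(minsplit = 10)
ctl1$minbucket

# Only minbucket given: minsplit is set to minbucket * 3
ctl2 <- rpart.control(minbucket = 5)
ctl2$minsplit

# Compare the number of nodes for two minsplit settings on a small data set
fit40 <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
               control = rpart.control(minsplit = 40))
fit10 <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
               control = rpart.control(minsplit = 10))
c(minsplit40 = nrow(fit40$frame), minsplit10 = nrow(fit10$frame))
```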

We can then print some statistics for the decision tree and actually visualize it!

printcp(ctree)
plotcp(ctree)
fancyRpartPlot(ctree)

Be sure to study this tree and understand how it classifies the samples.

Evaluation of the algorithm

We evaluate the tree first on the training set and then on the validation set. We use the predict function with type = "class" to get the class each sample is assigned to. The type argument can take other values (see ?predict.rpart): for example, "prob" returns the probability that a data point belongs to each class.

p1prob=predict(ctree, type = "prob")  # matrix of class probabilities
p1=predict(ctree, type = "class")  # factor of predicted classes

ckx=confusionMatrix(data=p1,
                reference=cross.sell.dev$y,
                positive="yes")
                
#Then we predict for the validation set
p2=predict(ctree, newdata=cross.sell.val, type="class")

ckx2=confusionMatrix(data=p2,
                reference=cross.sell.val$y,
                positive="yes")

# Print both confusion matrices
ckx
ckx2
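The difference between the two prediction types, and what a confusion matrix contains, can be sketched with base R only. The example uses the kyphosis data shipped with rpart rather than the bank data, and a plain table() in place of caret's confusionMatrix; both are simplifications for illustration.

```r
library(rpart)

fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")

prob <- predict(fit, type = "prob")    # matrix: one row per sample, one column per class
cls  <- predict(fit, type = "class")   # factor: the predicted class label

head(prob, 3)
head(cls, 3)

# The predicted class is the column with the highest probability
from.prob <- colnames(prob)[max.col(prob, ties.method = "first")]

# Base R confusion table and accuracy (training data, so optimistic)
cm <- table(predicted = cls, actual = kyphosis$Kyphosis)
accuracy <- sum(diag(cm)) / sum(cm)
cm
accuracy
```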

Random Forests

Some of the commonly used parameters of the randomForest function are:

  - x (or formula): a data frame or matrix of predictors, or a formula describing the model
  - data: input data frame containing the variables in the formula
  - ntree: number of decision trees to be grown
  - replace: TRUE or FALSE, indicating whether cases are sampled with or without replacement
  - sampsize: size of the sample drawn from the input data for growing each decision tree
  - importance: whether the importance of the independent variables should be assessed
  - proximity: whether to calculate proximity measures between rows of the data frame
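To get a feeling for what ntree, replace and sampsize control, the following sketch bags rpart trees by hand. This is a simplified stand-in and not what randomForest does internally: in particular, it does not draw a random subset of variables at each split. The kyphosis data, the seed and the small value of ntree are assumptions chosen to keep the example short.

```r
library(rpart)

set.seed(1)                        # arbitrary seed, for reproducibility only

ntree    <- 25                     # number of trees (small, hypothetical value)
sampsize <- nrow(kyphosis)         # rows drawn for each tree
votes <- matrix(NA_character_, nrow = nrow(kyphosis), ncol = ntree)

for (b in seq_len(ntree)) {
  # Bootstrap sample: draw sampsize rows with replacement
  idx  <- sample(nrow(kyphosis), sampsize, replace = TRUE)
  tree <- rpart(Kyphosis ~ ., data = kyphosis[idx, ], method = "class")
  votes[, b] <- as.character(predict(tree, newdata = kyphosis, type = "class"))
}

# Majority vote across the trees gives the ensemble prediction
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == kyphosis$Kyphosis)  # in-sample agreement, so optimistic
```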

Random Forest can be used for both classification and regression problems. Depending on the type of the target (response) variable, the corresponding kind of decision tree is built.

Now we have sample data and a formula for building the Random Forest model. Let’s build 500 decision trees using Random Forest. Notice that this might take some time.

cross.sell.rf <- randomForest(rf.form,
                              data=cross.sell.dev,
                              ntree=500,
                              importance=TRUE)

Random Forest Error Rate and Importance Plot

A forest of 500 decision trees has now been built using the Random Forest algorithm. We can plot the error rate across decision trees.

Some things to notice here:

  1. You will see three lines in the plot. The solid black line denotes the overall (out-of-bag) error and the other (coloured) lines denote the per-class error (in our example, the class error for the positive and negative class). Can you guess for which class the error is higher?

  2. Check the error plot and see what happens to the error as the number of trees increases. Can we decide whether 500 trees are necessary?

Try rerunning the code for a number of trees that you consider optimal.

plot(cross.sell.rf)
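The plot is drawn from the err.rate matrix stored in a classification forest, which has one row per number of trees and the columns "OOB" plus one column per class. The sketch below shows one possible way to read such a matrix numerically. The numbers are made up, and the tolerance rule is only a hypothetical heuristic, not a recommendation from the randomForest package.

```r
# Made-up error-rate matrix in the shape of cross.sell.rf$err.rate:
# row i holds the error rates of a forest with i trees
err <- cbind(OOB = c(0.120, 0.105, 0.100, 0.099, 0.099),
             no  = c(0.040, 0.035, 0.032, 0.031, 0.031),
             yes = c(0.700, 0.640, 0.610, 0.605, 0.606))

# Smallest number of trees whose OOB error is within a tolerance of the minimum
tol <- 0.002                                   # hypothetical tolerance
enough <- which(err[, "OOB"] <= min(err[, "OOB"]) + tol)[1]
enough
```

With the real model, `err` would be replaced by `cross.sell.rf$err.rate`.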

The variable importance plot is also a useful tool and can be drawn with the varImpPlot function. Here the top 5 variables are plotted for each of two measures: mean decrease in accuracy and mean decrease in Gini. We can also get a table sorted in decreasing order of importance for one measure (type = 1 for mean decrease in accuracy, type = 2 for mean decrease in node impurity).

varImpPlot(cross.sell.rf,
           sort = TRUE,
           main="Variable Importance",
           n.var=5)

Get the Variable Importance Table

var.imp <- data.frame(importance(cross.sell.rf,
                                 type=2))

Turn the row names into a column

var.imp$Variables <- row.names(var.imp)
var.imp[order(var.imp$MeanDecreaseGini,decreasing = TRUE),]

Based on the Random Forest variable importance, variables can be selected for use in other predictive modelling or machine learning techniques.

Now, we want to measure the accuracy of the Random Forest model. Other model performance statistics include the KS statistic, the lift chart and the ROC curve.

Predict Response Variable Value using Random Forest

Now that we have a model, we can predict the response variable for any other data point that has the same variables.

cross.sell.dev$predicted.response <- predict(cross.sell.rf ,cross.sell.dev)

The caret package can be used to create a confusion matrix from the actual response variable and the predicted values.

library(e1071)
library(caret)

confusionMatrix(data=cross.sell.dev$predicted.response,
                reference=cross.sell.dev$y,
                positive="yes")

What is the accuracy of the model?

Accuracy on the development sample is optimistic, because the model was fitted to those same rows. Now we can predict the response for the validation sample and calculate the model accuracy there.

Predicting response variable

cross.sell.val$predicted.response <- predict(cross.sell.rf ,cross.sell.val)

Create Confusion Matrix

confusionMatrix(data=cross.sell.val$predicted.response,
                reference=cross.sell.val$y,
                positive="yes")

What is the accuracy level for the validation set?

Precision, Recall and F-measure computation

Given the lecture notes and the confusion matrices, can you compute the precision, recall and F1-measure? Write a few simple lines of code to get these.
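As a starting point, here is a minimal sketch on made-up labels, using the standard definitions precision = TP / (TP + FP), recall = TP / (TP + FN) and F1 = 2 * precision * recall / (precision + recall). The vectors `actual` and `predicted` are invented for illustration; "yes" is treated as the positive class, as in the confusion matrices above.

```r
# Made-up labels for illustration; "yes" is the positive class
actual    <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no", "yes", "no"),
                    levels = c("no", "yes"))
predicted <- factor(c("yes", "no",  "yes", "no", "no", "yes", "no", "no", "no",  "no"),
                    levels = c("no", "yes"))

cm <- table(predicted = predicted, actual = actual)

tp <- cm["yes", "yes"]   # predicted yes, actually yes
fp <- cm["yes", "no"]    # predicted yes, actually no
fn <- cm["no",  "yes"]   # predicted no, actually yes

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, f1 = f1)
```

For the models above, the same counts can be read from the `table` component of the object returned by confusionMatrix, where rows are predictions and columns are the reference classes.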