In this example we will demonstrate random forests and simple decision trees. The libraries we are going to use are listed below.
Load the libraries
library(rpart)
library(rpart.plot)
library(randomForest)
library(e1071)
library(caret)
library(rattle)
Help on the randomForest package and function (or any other package)
library(help=randomForest)
help(randomForest)
The marketing department of a bank runs various marketing campaigns for cross-selling products and for improving customer retention and customer service.
In this example, the bank wants to cross-sell a term deposit product to its customers. Contacting all customers is costly and does not create a good customer experience. So, the bank wants to build a predictive model that identifies the customers who are more likely to respond to the term deposit cross-sell campaign.
We will use a sample of marketing data to build a Random Forest based model in R.
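First, we read the file. Note that you might need to adjust the path.
path="~/Dropbox/MST_COURSES/SS16/practicals/"
# stringsAsFactors = TRUE keeps text columns as factors (the default became FALSE in R 4.0)
termCrosssell<-read.csv(paste0(path, "bank-full.csv"), sep=";", header = TRUE, stringsAsFactors = TRUE)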
Next, we explore the data frame
names(termCrosssell)
str(termCrosssell)
How many variables are there? What are their types? What is the target variable? What is its type? What is the problem that we will need to solve?
Let’s do some statistics on the percentage of positive/negative samples in our dataset
table(termCrosssell$y)/nrow(termCrosssell)
What is the percentage of positive / negative data points?
Now, we will split the data sample into development and validation samples.
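# Fix the random seed so the split is reproducible (the value 123 is arbitrary)
set.seed(123)
# Assign each row to group 1 (about 60%) or group 2 (about 40%)
sample.ind <- sample(2,
                     nrow(termCrosssell),
                     replace = TRUE,
                     prob = c(0.6,0.4))
cross.sell.dev <- termCrosssell[sample.ind==1,]
cross.sell.val <- termCrosssell[sample.ind==2,]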
We need to make sure that we also have positive and negative samples in both sets. How can you check this, using the code from above (using table)?
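A purely random split does not guarantee identical class proportions in the two samples. As an alternative, a stratified split can be sketched with createDataPartition from caret, which samples within each class. The toy data frame below is hypothetical and only stands in for termCrosssell; the names toy, dev.idx, toy.dev and toy.val are illustrative.

```r
library(caret)

set.seed(42)  # arbitrary seed, only to make the split reproducible

# Hypothetical toy data: 10% "yes", 90% "no" (stands in for termCrosssell)
toy <- data.frame(x = rnorm(1000),
                  y = factor(c(rep("yes", 100), rep("no", 900))))

# Stratified 60/40 split: rows are sampled separately within each level of y
dev.idx <- createDataPartition(toy$y, p = 0.6, list = FALSE)
toy.dev <- toy[dev.idx, ]
toy.val <- toy[-dev.idx, ]

# Class proportions in each sample
prop.dev <- prop.table(table(toy.dev$y))
prop.val <- prop.table(table(toy.val$y))
print(prop.dev)
print(prop.val)
```

With a stratified split the two tables should be nearly identical, whereas the random split above only makes them similar on average.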
Both the development and validation samples should now have a similar target variable distribution; this is just a sanity check of the sampling. If the target variable is a factor, a classification tree is built. We can check the type of the response variable.
class(cross.sell.dev$y)
Notice that categorical variables in R are stored as factors. So if your target variable is a factor, the problem is classification; if it is numeric, it is regression. Next, we build the model formula from the variable names.
varNames <- names(cross.sell.dev)
# Exclude ID or Response variable
varNames <- varNames[!varNames %in% c("y")]
# Add a + sign between the explanatory variables
varNames1 <- paste(varNames, collapse = "+")
# Add response variable and convert to a formula object
rf.form <- as.formula(paste("y", varNames1, sep = " ~ "))
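The same formula can also be built with reformulate() from the stats package, which avoids pasting strings by hand. The sketch below uses a hypothetical three-row data frame (toy) purely for illustration and checks that both approaches give the same formula.

```r
# Hypothetical toy data frame with a response column named "y"
toy <- data.frame(age = c(30, 45, 52),
                  balance = c(100, 250, 80),
                  y = factor(c("no", "yes", "no")))

# All columns except the response are predictors
predictors <- setdiff(names(toy), "y")

# Build the formula by pasting strings, as in the text
form.paste <- as.formula(paste("y", paste(predictors, collapse = " + "), sep = " ~ "))

# Build the same formula with reformulate()
form.reform <- reformulate(predictors, response = "y")

print(form.paste)
print(form.reform)
```

When every remaining column is a predictor, the shorthand y ~ . is also accepted by both rpart and randomForest.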
We will first run a single classification tree and see the results. rpart, which implements a simple decision tree algorithm, accepts control parameters such as minsplit (the minimum number of observations that must exist in a node in order for a split to be attempted) and minbucket (the minimum number of observations in any terminal node).
ctree=rpart(rf.form,method="class",data=cross.sell.dev, control=rpart.control(minsplit=10))
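To get a feel for how the control parameters shape the tree, the following sketch fits two trees with different rpart.control settings. It uses the kyphosis data set that ships with rpart as a small stand-in for the bank data; the helper leaves() and the specific parameter values are illustrative assumptions, not recommended settings.

```r
library(rpart)

# Permissive settings: tiny nodes may be split and no complexity penalty is applied
fit.loose <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))

# Restrictive settings: larger nodes are required and the depth is capped
fit.strict <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                    method = "class",
                    control = rpart.control(minsplit = 20, minbucket = 7, maxdepth = 2))

# Hypothetical helper: count the terminal nodes (leaves) of a fitted tree
leaves <- function(fit) sum(fit$frame$var == "<leaf>")

cat("loose tree leaves: ", leaves(fit.loose), "\n")
cat("strict tree leaves:", leaves(fit.strict), "\n")
```

The permissive tree grows many more leaves and is likely to overfit; printcp and plotcp help to choose a complexity value between the two extremes.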
We can then print some statistics for the decision tree and actually visualize it!
printcp(ctree)
plotcp(ctree)
fancyRpartPlot(ctree)
Be sure to study this tree and understand how it classifies the samples.
We now evaluate the tree, first on the training (development) set and then on the validation set. We use the predict function with type "class" to get the class to which each sample is assigned. The type argument can take other values (see ?predict.rpart); for example, "prob" gives the probability that a data point belongs to each class.
p1prob=predict(ctree, type = "prob") # matrix of class probabilities
p1=predict(ctree, type = "class") # factor of predicted classes
ckx=confusionMatrix(data=p1,
                    reference=cross.sell.dev$y,
                    positive="yes")
ckx # print the confusion matrix for the development set
# Then we predict for the validation set
p2=predict(ctree, newdata=cross.sell.val, type="class")
ckx2=confusionMatrix(data=p2,
                     reference=cross.sell.val$y,
                     positive="yes")
ckx2 # print the confusion matrix for the validation set
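The relation between the "prob" and "class" prediction types can be sketched as follows. The example again uses the kyphosis data from rpart as a stand-in; the 0.3 cut-off is a hypothetical value chosen only to show how probabilities can be turned into labels with a threshold other than the default.

```r
library(rpart)

# kyphosis (shipped with rpart) stands in for the bank data here
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

prob <- predict(fit, type = "prob")    # matrix: one column per class
cls  <- predict(fit, type = "class")   # factor: predicted class per row

# Each row of probabilities sums to 1
print(head(prob))

# Hypothetical custom cut-off: flag "present" when its probability exceeds 0.3
threshold <- 0.3
cls.custom <- factor(ifelse(prob[, "present"] > threshold, "present", "absent"),
                     levels = levels(kyphosis$Kyphosis))

# Compare the default labels with the labels from the custom cut-off
print(table(default = cls, custom = cls.custom))
```

Lowering the cut-off flags more cases as positive, which can be useful when the positive class is rare, at the cost of more false positives.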
Some of the commonly used parameters of the randomForest function are:
x (or formula): the model formula, or a data frame or matrix of predictors
data: input data frame
ntree: number of decision trees to grow
replace: TRUE or FALSE, indicating whether rows are sampled with or without replacement
sampsize: size of the sample drawn from the input data to grow each decision tree
importance: whether the importance of the independent variables should be assessed
proximity: whether to calculate proximity measures between rows of the data frame
Random Forest can be used for classification and regression problems. Depending on the type of the target (response) variable, the corresponding kind of decision trees will be built.
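As a small illustration of these parameters, the sketch below fits one classification forest and one regression forest on the built-in iris data (a stand-in, not the bank data). The parameter values are arbitrary assumptions chosen to keep the example fast.

```r
library(randomForest)

set.seed(1)  # arbitrary seed for reproducibility

# Factor response (Species) -> classification forest
rf.class <- randomForest(Species ~ ., data = iris,
                         ntree = 100,        # number of trees
                         replace = TRUE,     # sample rows with replacement
                         sampsize = 100,     # rows drawn for each tree
                         importance = TRUE,  # assess variable importance
                         proximity = TRUE)   # compute row-by-row proximities

# Numeric response (Sepal.Length) -> regression forest
rf.reg <- randomForest(Sepal.Length ~ ., data = iris, ntree = 100)

print(rf.class$type)
print(rf.reg$type)
print(dim(rf.class$proximity))
```

The type element of the fitted object records which kind of forest was built, so it is a quick way to confirm that the response variable was interpreted as intended.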
We now have sample data and a formula for building the Random Forest model. Let’s build 500 decision trees using Random Forest. Note that this might take some time.
cross.sell.rf <- randomForest(rf.form,
cross.sell.dev,
ntree=500,
importance=T)
A forest of 500 decision trees has now been built using the Random Forest algorithm. We can plot the error rate against the number of trees.
Some things to notice here: you will see three lines in the plot. The solid black line denotes the overall (out-of-bag) error, and the other (coloured) lines denote the per-class error (in our example, the class error for the positive and the negative class). Can you guess for which class the error is higher?
Try rerunning the code for a number of trees that you consider optimal.
plot(cross.sell.rf)
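The numbers behind this plot are stored in the err.rate element of the fitted forest, which can help when choosing the number of trees. The sketch below uses iris as a stand-in for the bank data; the same element should be available on cross.sell.rf. The name best.ntree is illustrative.

```r
library(randomForest)

set.seed(1)  # arbitrary seed for reproducibility
rf <- randomForest(Species ~ ., data = iris, ntree = 200)

# err.rate has one row per forest size and one column per curve:
# "OOB" (overall out-of-bag error) followed by one column per class
print(colnames(rf$err.rate))
print(tail(rf$err.rate, 3))

# Smallest forest size that reaches the minimum OOB error
best.ntree <- which.min(rf$err.rate[, "OOB"])
cat("OOB error is lowest at", best.ntree, "trees\n")

# Plot with a legend so that each line can be identified
plot(rf, main = "Error rate vs. number of trees")
legend("topright", legend = colnames(rf$err.rate),
       col = 1:ncol(rf$err.rate), lty = 1:ncol(rf$err.rate))
```

The minimum is often reached by chance at some particular tree count, so it is usually safer to pick a value where the curve has clearly flattened out rather than the exact minimum.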
A variable importance plot is also a useful tool and can be produced with the varImpPlot function. The top 5 variables are plotted for each of two measures: mean decrease in accuracy and mean decrease in Gini. We can also get a table in decreasing order of importance for one measure (type 1 for mean decrease in accuracy, type 2 for mean decrease in node impurity).
varImpPlot(cross.sell.rf,
sort = T,
main="Variable Importance",
n.var=5)
Get the Variable Importance Table
var.imp <- data.frame(importance(cross.sell.rf,
type=2))
Turn the row names into a column
var.imp$Variables <- row.names(var.imp)
var.imp[order(var.imp$MeanDecreaseGini,decreasing = T),]
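The table above uses type 2 (mean decrease in Gini). A table that also includes type 1 (mean decrease in accuracy) can be sketched as follows, again on iris as a stand-in data set. Note that type 1 is only available when the forest was fitted with importance = TRUE, as was done for cross.sell.rf.

```r
library(randomForest)

set.seed(1)  # arbitrary seed for reproducibility
# importance = TRUE is needed for the permutation-based measure (type = 1)
rf <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)

imp.acc  <- importance(rf, type = 1)  # mean decrease in accuracy
imp.gini <- importance(rf, type = 2)  # mean decrease in Gini (node impurity)

# Combine both measures into one table, sorted by mean decrease in accuracy
imp <- data.frame(Variables = rownames(imp.acc),
                  MeanDecreaseAccuracy = as.numeric(imp.acc[, 1]),
                  MeanDecreaseGini = as.numeric(imp.gini[, 1]))
imp <- imp[order(imp$MeanDecreaseAccuracy, decreasing = TRUE), ]
print(imp)
```

The two measures do not always rank the variables in the same order, so it is worth looking at both before dropping any variable.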
Based on the Random Forest variable importance, variables can be selected for use in other predictive modelling or machine learning techniques.
Now, we want to measure the accuracy of the Random Forest model. Other commonly used model performance statistics are the KS statistic, the lift chart and the ROC curve.
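As a minimal sketch of one of these statistics, the area under the ROC curve (AUC) can be computed in base R from predicted scores using the rank (Mann-Whitney) formulation. The function name auc.rank and the toy scores and labels are hypothetical and serve only as an illustration.

```r
# AUC via ranks: the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case
auc.rank <- function(scores, labels, positive = "yes") {
  is.pos <- labels == positive
  n.pos <- sum(is.pos)
  n.neg <- sum(!is.pos)
  r <- rank(scores)  # average ranks are used for tied scores
  (sum(r[is.pos]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# Hypothetical toy scores and labels, for illustration only
toy.scores <- c(0.10, 0.40, 0.35, 0.80)
toy.labels <- factor(c("no", "no", "yes", "yes"))
toy.auc <- auc.rank(toy.scores, toy.labels)
print(toy.auc)  # 0.75: three of the four positive/negative pairs are ordered correctly

# With a fitted forest, the scores could be the predicted probabilities, e.g.
# predict(cross.sell.rf, cross.sell.val, type = "prob")[, "yes"]
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect separation of the two classes.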
Since we have a model, we can now predict the response variable for any other data point that has the same variables.
cross.sell.dev$predicted.response <- predict(cross.sell.rf ,cross.sell.dev)
The caret package can be used to create a confusion matrix from the actual response variable and the predicted values.
library(e1071)
library(caret)
confusionMatrix(data=cross.sell.dev$predicted.response,
reference=cross.sell.dev$y,
positive="yes")
What is the accuracy of the model?
Now we can predict response for the validation sample and calculate model accuracy for the validation sample.
cross.sell.val$predicted.response <- predict(cross.sell.rf ,cross.sell.val)
confusionMatrix(data=cross.sell.val$predicted.response,
reference=cross.sell.val$y,
positive="yes")
What is the accuracy level for the validation set?
Given the lecture notes and the confusion matrices, can you compute the precision, recall and F1-measure? Write some simple lines of code to get these.
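As a starting point, the three measures can be sketched on a hypothetical 2x2 confusion table. The counts below are made up for illustration; the layout (rows are predictions, columns are reference labels) is assumed to match the table element of a confusionMatrix result such as ckx2$table.

```r
# Hypothetical counts, for illustration only
tp <- 30   # predicted yes, actually yes
fp <- 10   # predicted yes, actually no
fn <- 20   # predicted no,  actually yes
tn <- 140  # predicted no,  actually no

# Rows: prediction, columns: reference (matrix() fills column by column)
cm <- matrix(c(tp, fn, fp, tn), nrow = 2,
             dimnames = list(Prediction = c("yes", "no"),
                             Reference  = c("yes", "no")))
print(cm)

precision <- cm["yes", "yes"] / sum(cm["yes", ])  # TP / (TP + FP)
recall    <- cm["yes", "yes"] / sum(cm[, "yes"])  # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

cat("precision:", precision, "\n")
cat("recall:   ", recall, "\n")
cat("F1:       ", round(f1, 4), "\n")
```

Replacing cm with the table from one of the confusion matrices above should give the corresponding values for the tree or the forest.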