Task 1: The Titanic Dataset

 

The Titanic passengers dataset is one of the most famous datasets in Machine Learning. In the folder "titanic" you will find a version of the dataset split into training and test sets.

Here is a description of the variables.

survival        Survival   (0 = No; 1 = Yes)

pclass          Passenger Class   (1 = 1st; 2 = 2nd; 3 = 3rd)

name            Name

sex             Sex

age             Age

sibsp           Number of Siblings/Spouses Aboard

parch           Number of Parents/Children Aboard

ticket          Ticket Number

fare            Passenger Fare

cabin           Cabin

embarked        Port of Embarkation
              (C = Cherbourg; Q = Queenstown; S = Southampton)

Preprocessing:

Perform a pre-processing analysis (you can limit it to the training set) and report your findings. Write a small report with several quantitative and qualitative characteristics: how many variables there are, what kind of variables they are, whether there are any outliers, whether there are correlations between variables, and some plots that are representative of your dataset. Decide whether you need to normalize your data.
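As a starting point, the checks above can be sketched with a few base-R calls. The data frame below is a toy sample with made-up values in some of the Titanic columns; in practice you would read the training CSV from the "titanic" folder instead.

```r
# Toy sample with a subset of the Titanic columns (illustrative values only)
df <- data.frame(
  survived = c(0, 1, 1, 0, 1),
  pclass   = c(3, 1, 3, 3, 2),
  sex      = c("male", "female", "female", "male", "female"),
  age      = c(22, 38, 26, NA, 35),
  fare     = c(7.25, 71.28, 7.92, 8.05, 53.10)
)

str(df)                          # how many variables, and of what kind
summary(df)                      # quantitative overview; also reveals NAs
colSums(is.na(df))               # missing values per column
cor(df$fare, df$pclass)          # correlation between two numeric variables
boxplot(df$fare, main = "Fare")  # boxplots help flag potential outliers
```

The same calls scale directly to the full training set; differences in variable scale (e.g. fare vs. age) are what the normalization decision should be based on.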

 

Classification:

Then prepare the dataset for classification, targeting passenger survival (this is your output class). Make sure you understand the hypothesis space of the problem. Try all the algorithms we have already practiced (decision tree, random forest, k-NN, Naïve Bayes, Support Vector Machine) and report which one(s) have the best performance.
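The compare-the-algorithms loop has the same shape for every method: split, fit, predict, measure accuracy. As a minimal self-contained sketch, the snippet below uses a hand-rolled 1-nearest-neighbour classifier on simulated data as a stand-in for the library calls (e.g. `class::knn`, `rpart`, `randomForest`, `e1071::naiveBayes`, `e1071::svm`); only the fit/predict step changes per algorithm.

```r
set.seed(42)
# Simulated binary-class data standing in for the prepared Titanic features
n <- 100
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] > 0, "yes", "no"))

# Hold-out split: 70% train, 30% test
idx <- sample(n, 0.7 * n)
xtr <- x[idx, , drop = FALSE];  ytr <- y[idx]
xte <- x[-idx, , drop = FALSE]; yte <- y[-idx]

# Hand-rolled 1-NN: predict the label of the closest training point
pred <- apply(xte, 1, function(p) {
  d <- colSums((t(xtr) - p)^2)      # squared distances to all training points
  as.character(ytr[which.min(d)])
})

acc <- mean(pred == as.character(yte))  # the figure to compare across algorithms
```

Report this accuracy (or a confusion matrix) for each of the five algorithms and pick the winner(s).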

 

Task 2: Document classification

As we already discussed during the lecture, documents are usually represented by the bag-of-words model with tf-idf weighting. In a classification task, these weights are the features of the problem.
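To make the weighting concrete, here is a hand computation of one common tf-idf variant (term frequency normalized by document length, times log2 inverse document frequency; as far as I am aware, this matches the default convention of tm's `weightTfIdf`) on a toy 3-document count matrix:

```r
# Term counts for 3 toy documents over 4 terms
counts <- matrix(c(2, 1, 1, 0,
                   0, 1, 1, 0,
                   1, 1, 0, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("ship", "sea", "port", "crew")))

tf    <- counts / rowSums(counts)                   # normalized term frequency
idf   <- log2(nrow(counts) / colSums(counts > 0))   # inverse document frequency
tfidf <- sweep(tf, 2, idf, `*`)                     # tf-idf weight per cell
```

Note that "sea", which appears in every document, gets idf 0 and therefore carries no weight: terms shared by all documents cannot discriminate between classes.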

 

The file R4-counts.csv contains the term counts for 818 documents. R4-class.csv contains the corresponding class information (the documents belong to four different classes). Be sure to import the necessary files, load the dataset with the following statements, and perform some preliminary analysis:

 

mat=read.csv("R4-counts.csv", header=TRUE)   # document-term count matrix

class=readLines("R4-class.csv")              # one class label per line, plus a header line

class=factor(class[-1])                      # drop the header line; readLines already returns a character vector

 

How sparse is the matrix (i.e. how many zeros are there)? The matrix has not yet been weighted with the tf-idf measure, so we can do this using some of the functions of the "tm" package (or you are free to implement your own method):
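Measuring sparsity is a one-liner in base R; the toy matrix below stands in for `mat`:

```r
# Toy count matrix standing in for mat (2 documents x 3 terms)
m <- matrix(c(0, 2, 0, 0, 1, 0), nrow = 2)

nzeros   <- sum(m == 0)    # absolute number of zero entries
sparsity <- mean(m == 0)   # proportion of zero entries
```

For a real document-term matrix the proportion of zeros is typically well above 90%, which is worth stating explicitly in your analysis.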

 

library(tm)   # provides as.DocumentTermMatrix and weightTfIdf

mat=as.DocumentTermMatrix(mat, weightTfIdf)

mat=as.matrix(mat)

 

Split your dataset into training and testing subsets and apply some of the classification algorithms we have already worked with. Which one performs best? Is it the same as in the previous task?
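A plain `sample()` split works, but with four classes a stratified split keeps the class proportions the same in both subsets, which makes the accuracy comparison fairer. A sketch on toy labels (stand-ins for `class`):

```r
set.seed(7)
# Toy labels standing in for class (four classes, as in R4-class.csv)
y <- factor(rep(c("c1", "c2", "c3", "c4"), times = c(40, 30, 20, 10)))

# Stratified 70/30 split: sample within each class separately
tr_idx <- unlist(lapply(split(seq_along(y), y),
                        function(i) sample(i, round(0.7 * length(i)))))
ytr <- y[tr_idx]; yte <- y[-tr_idx]

table(ytr)   # class proportions mirror those of the full set
```

The same index vector `tr_idx` is then used to split the rows of the tf-idf matrix before fitting the classifiers.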