Task 1: The Titanic Dataset
The Titanic passengers dataset is one of the most famous datasets in machine learning. In the folder "titanic" you will find a version of the dataset split into training and testing sets.
Here is a description of the variables.
survival   Survival (0 = No; 1 = Yes)
pclass     Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name       Name
sex        Sex
age        Age
sibsp      Number of Siblings/Spouses Aboard
parch      Number of Parents/Children Aboard
ticket     Ticket Number
fare       Passenger Fare
cabin      Cabin
embarked   Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Preprocessing:
Perform a pre-processing analysis (you may limit it to the training set) and report on your findings. This should be a short report covering several quantitative and qualitative characteristics: how many variables there are, what kinds of variables they are, whether there are any outliers, whether there are correlations between variables, some plots that are representative of your dataset, etc. Decide whether you need to normalize your data. A minimal starting point is sketched below.
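For illustration, a small exploratory sketch in R (the file path and the lower-case column names follow the description above but are assumptions; adjust them to your copy of the data):

# Load the training set (path is an assumption; adjust to your folder layout)
train = read.csv("titanic/train.csv", header=TRUE)

# Number and types of variables, plus missing values
str(train)
summary(train)

# Outlier inspection for a numeric variable
boxplot(train$fare, main="Passenger Fare")

# Correlations between the numeric variables (complete cases only)
cor(train[, c("age", "sibsp", "parch", "fare")], use="complete.obs")

# Class balance of the target variable
table(train$survival)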
Classification:
Then prepare the dataset for classification, targeting the surviving passengers (this is your output class). Make sure you understand the hypothesis space of the problem. Try all the algorithms we have already practiced (decision tree, random forest, k-NN, Naïve Bayes, Support Vector Machine) and report which one(s) have the best performance. A sketch of fitting all five is given below.
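A minimal sketch, assuming cleaned data frames train and test (missing values already handled during preprocessing) with a factor target column survival; rpart, randomForest, class, and e1071 are standard CRAN packages, and k=5 is an arbitrary choice:

library(rpart)         # decision tree
library(randomForest)  # random forest
library(class)         # k-NN
library(e1071)         # Naive Bayes and SVM

train$survival = factor(train$survival)
test$survival = factor(test$survival)

# Decision tree
tree = rpart(survival ~ ., data=train, method="class")
tree.pred = predict(tree, test, type="class")

# Random forest
rf = randomForest(survival ~ ., data=train)
rf.pred = predict(rf, test)

# k-NN on the numeric features only
num = c("age", "sibsp", "parch", "fare")
knn.pred = knn(train[, num], test[, num], train$survival, k=5)

# Naive Bayes
nb = naiveBayes(survival ~ ., data=train)
nb.pred = predict(nb, test)

# Support Vector Machine
sv = svm(survival ~ ., data=train)
sv.pred = predict(sv, test)

# Test-set accuracy of each model
sapply(list(tree=tree.pred, rf=rf.pred, knn=knn.pred, nb=nb.pred, svm=sv.pred),
       function(p) mean(p == test$survival))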
Task 2: Document Classification
As we already discussed during the lecture, documents are usually represented by the bag-of-words (tf-idf) model. In a classification task, these weights are the features of the problem.
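For reference, the standard tf-idf weight of a term t in a document d is

\[ \operatorname{tf\text{-}idf}(t,d) = \operatorname{tf}(t,d)\cdot\log\frac{N}{\operatorname{df}(t)}, \]

where tf(t,d) is the count of t in d, N is the total number of documents, and df(t) is the number of documents containing t. (tm's weightTfIdf, used below, additionally normalizes counts by document length and uses a base-2 logarithm by default.)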
The R4-counts.csv file contains 818 documents with their term-count values. R4-class.csv contains the corresponding class information (the documents belong to four different classes). Be sure the necessary files are available, load the dataset with the following statements, and perform some preliminary analysis:
# Term-count matrix: one row per document, one column per term
mat=read.csv("R4-counts.csv", header=TRUE)
# Class labels: drop the header line and convert to a factor
class=readLines("R4-class.csv")
class=factor(unlist(class[-1]))
How sparse is the matrix (i.e. how many zeros are there)?
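One quick way to check (a sketch; it assumes every column of mat is a numeric count, so if the first column holds document names, drop it first):

m = as.matrix(mat)            # data frame -> numeric matrix
sum(m == 0)                   # number of zero entries
sum(m == 0) / length(m)       # fraction of zeros, i.e. the sparsity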
The matrix is not yet weighted with the tf-idf measure, so we can do it using some of the functions of the "tm" package (or you are free to implement your own method):
library(tm)  # provides as.DocumentTermMatrix() and weightTfIdf
# Re-weight the raw counts with tf-idf
mat=as.DocumentTermMatrix(mat, weightTfIdf)
# Convert back to a plain matrix for the classifiers
mat=as.matrix(mat)
Split your dataset into training and testing subsets and apply some of the classification algorithms we have already worked with. Which one performs best? Is it the same as in the previous task? One such experiment is sketched below.
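A minimal sketch of one such experiment (the 70/30 split ratio and the choice of classifiers are illustrative assumptions):

# Reproducible 70/30 train/test split over the documents
set.seed(1)
idx = sample(nrow(mat), floor(0.7 * nrow(mat)))
train.x = mat[idx, ];  train.y = class[idx]
test.x = mat[-idx, ];  test.y = class[-idx]

# SVM on the tf-idf features
library(e1071)
sv = svm(train.x, train.y)
mean(predict(sv, test.x) == test.y)   # test-set accuracy

# k-NN for comparison (k=5 chosen arbitrarily)
library(class)
knn.pred = knn(train.x, test.x, train.y, k=5)
mean(knn.pred == test.y)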