Task 1: The Titanic Dataset
The Titanic passengers dataset is one of the most famous datasets in machine learning. In the folder "titanic" you will find a version of the dataset split into training and testing sets.
Here is a description of the variables.
survival   Survival (0 = No; 1 = Yes)
pclass     Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name       Name
sex        Sex
age        Age
sibsp      Number of Siblings/Spouses Aboard
parch      Number of Parents/Children Aboard
ticket     Ticket Number
fare       Passenger Fare
cabin      Cabin
embarked   Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Preprocessing:
Perform a pre-processing analysis (you may limit it to the training set) and report on your findings. This should be a short report covering several quantitative and qualitative characteristics: how many variables there are, what kinds of variables they are, whether there are any outliers, whether there are correlations between variables, some plots that are representative of your dataset, etc. Decide whether you need to normalize your data. A minimal starting point is sketched below.
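For illustration, a small exploratory sketch in R (the file path and the lower-case column names follow the description above but are assumptions; adjust them to your copy of the data):

# Load the training set (path is an assumption; adjust to your folder layout)
train = read.csv("titanic/train.csv", header=TRUE)

# Number and types of variables, plus missing values
str(train)
summary(train)

# Outlier inspection for a numeric variable
boxplot(train$fare, main="Passenger Fare")

# Correlations between the numeric variables (complete cases only)
cor(train[, c("age", "sibsp", "parch", "fare")], use="complete.obs")

# Class balance of the target variable
table(train$survival)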
Classification:
Then prepare the dataset for classification, targeting the surviving passengers (this is your output class). Make sure you understand the hypothesis space of the problem. Try all the algorithms we have already practiced (decision tree, random forest, k-NN, Naïve Bayes, Support Vector Machine) and report which one(s) have the best performance. A sketch of fitting all five is given below.
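A minimal sketch, assuming cleaned data frames train and test (missing values already handled during preprocessing) with a factor target column survival; rpart, randomForest, class, and e1071 are standard CRAN packages, and k=5 is an arbitrary choice:

library(rpart)         # decision tree
library(randomForest)  # random forest
library(class)         # k-NN
library(e1071)         # Naive Bayes and SVM

train$survival = factor(train$survival)
test$survival = factor(test$survival)

# Decision tree
tree = rpart(survival ~ ., data=train, method="class")
tree.pred = predict(tree, test, type="class")

# Random forest
rf = randomForest(survival ~ ., data=train)
rf.pred = predict(rf, test)

# k-NN on the numeric features only
num = c("age", "sibsp", "parch", "fare")
knn.pred = knn(train[, num], test[, num], train$survival, k=5)

# Naive Bayes
nb = naiveBayes(survival ~ ., data=train)
nb.pred = predict(nb, test)

# Support Vector Machine
sv = svm(survival ~ ., data=train)
sv.pred = predict(sv, test)

# Test-set accuracy of each model
sapply(list(tree=tree.pred, rf=rf.pred, knn=knn.pred, nb=nb.pred, svm=sv.pred),
       function(p) mean(p == test$survival))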
Task 2: Document Classification
As we already discussed during the lecture, documents are usually represented by the bag-of-words (tf-idf) model. In a classification task, these weights are the features of the problem.
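For reference, the standard tf-idf weight of a term t in a document d is

\[ \operatorname{tf\text{-}idf}(t,d) = \operatorname{tf}(t,d)\cdot\log\frac{N}{\operatorname{df}(t)}, \]

where tf(t,d) is the count of t in d, N is the total number of documents, and df(t) is the number of documents containing t. (tm's weightTfIdf, used below, additionally normalizes counts by document length and uses a base-2 logarithm by default.)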
The R4-counts.csv file contains 818 documents with their term-count values. R4-class.csv contains the corresponding class information (the documents belong to four different classes). Be sure the necessary files are available, load the dataset with the following statements, and perform some preliminary analysis:
# Term-count matrix: one row per document, one column per term
mat=read.csv("R4-counts.csv", header=TRUE)
# Class labels: drop the header line and convert to a factor
class=readLines("R4-class.csv")
class=factor(unlist(class[-1]))
How sparse is the matrix (i.e. how many zeros are there)?
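One quick way to check (a sketch; it assumes every column of mat is a numeric count, so if the first column holds document names, drop it first):

m = as.matrix(mat)            # data frame -> numeric matrix
sum(m == 0)                   # number of zero entries
sum(m == 0) / length(m)       # fraction of zeros, i.e. the sparsity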
The matrix is not yet weighted with the tf-idf measure, so we can do it using some of the functions of the "tm" package (or you are free to implement your own method):
library(tm)  # provides as.DocumentTermMatrix() and weightTfIdf
# Re-weight the raw counts with tf-idf
mat=as.DocumentTermMatrix(mat, weightTfIdf)
# Convert back to a plain matrix for the classifiers
mat=as.matrix(mat)
Split your dataset into training and testing subsets and apply some of the classification algorithms we have already worked with. Which one performs best? Is it the same as in the previous task? One such experiment is sketched below.
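A minimal sketch of one such experiment (the 70/30 split ratio and the choice of classifiers are illustrative assumptions):

# Reproducible 70/30 train/test split over the documents
set.seed(1)
idx = sample(nrow(mat), floor(0.7 * nrow(mat)))
train.x = mat[idx, ];  train.y = class[idx]
test.x = mat[-idx, ];  test.y = class[-idx]

# SVM on the tf-idf features
library(e1071)
sv = svm(train.x, train.y)
mean(predict(sv, test.x) == test.y)   # test-set accuracy

# k-NN for comparison (k=5 chosen arbitrarily)
library(class)
knn.pred = knn(train.x, test.x, train.y, k=5)
mean(knn.pred == test.y)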