Task 3: Clustering & Factorization
Load the ratings dataset (ratings.csv). After reading the table, convert it (just to be on the safe side) to a data matrix using the following statement:
rat=as.matrix(rat)
NB: The matrix is the original MovieLens dataset, so it may be too large to practice on. If you want, reduce the number of users and/or movies by selecting a subset, e.g.
rat1 = rat[1:200, 1:500] # will select 200 users and 500 movies
3.1 Now try to apply NMF. If you apply the nnmf method (as presented in the practicals), does it work?
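To see what a factorization of this kind computes, here is a minimal base-R sketch of NMF using multiplicative updates. It is not the nnmf method from the practicals; nmf_mu is a hypothetical helper, the toy matrix V is made up, and the sketch assumes the input has no missing or negative entries.

```r
set.seed(1)
# Toy non-negative "ratings" matrix (made-up values, not ratings.csv)
V <- matrix(sample(1:5, 20 * 30, replace = TRUE), nrow = 20, ncol = 30)

# nmf_mu is a hypothetical helper (multiplicative updates for squared error),
# not the nnmf function used in the practicals
nmf_mu <- function(V, r, iters = 200, eps = 1e-9) {
  n <- nrow(V)
  m <- ncol(V)
  W <- matrix(runif(n * r), n, r)  # users x factors
  H <- matrix(runif(r * m), r, m)  # factors x movies
  for (i in seq_len(iters)) {
    # eps only guards against division by zero
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  list(W = W, H = H)
}

fit <- nmf_mu(V, r = 5)
# Root mean squared reconstruction error of V by W %*% H
err <- sqrt(mean((V - fit$W %*% fit$H)^2))
print(err)
```

Rows of W give a compressed representation of the users, and columns of H give one for the movies. If the real matrix contains missing values, this sketch would propagate them, so check the data before factorizing.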
3.2 Try to apply k-means to the compressed representation (both for users and movies). Try different values of k (like we did in the practicals). Can you find a good value of k for the clustering?
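One possible way to compare values of k is to record the total within-cluster sum of squares for each k and look for a bend in the curve. The sketch below uses base R kmeans on a made-up matrix W that stands in for a compressed user representation (users in rows, factors in columns); the range of k is an arbitrary choice.

```r
set.seed(1)
# Stand-in for a compressed user representation (users x factors); toy data
W <- rbind(matrix(rnorm(50 * 3, mean = 0), ncol = 3),
           matrix(rnorm(50 * 3, mean = 4), ncol = 3))

ks <- 2:8
# Total within-cluster sum of squares (SSE) for each k, best of 10 random starts
sse <- sapply(ks, function(k) kmeans(W, centers = k, nstart = 10)$tot.withinss)

print(data.frame(k = ks, sse = sse))
# plot(ks, sse, type = "b") draws the curve used to look for an "elbow"
```

For the movies, the same loop would run on a matrix with movies in rows, for example the transpose of a factors-by-movies matrix.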
Task 4: Unsupervised Document Analysis
Remember the text dataset we used in Task 2? Let's work a bit more on that but in an unsupervised manner. Let's load R4-counts.csv.
4.1 Perform clustering (use k=4) on the counts (original matrix) and then on the tf-idf representation. Compare the SSE of the two.
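A minimal sketch of the comparison, assuming k-means as the clustering method and one common tf-idf variant (row-normalized term frequency times log inverse document frequency); the definition used in Task 2 may differ. The count matrix here is made up and only stands in for R4-counts.csv.

```r
set.seed(1)
# Toy document-term count matrix (rows = documents, columns = terms); made-up data
counts <- matrix(rpois(40 * 12, lambda = 2), nrow = 40,
                 dimnames = list(NULL, paste0("term", 1:12)))

# One common tf-idf variant (an assumption, other weightings exist)
tf    <- counts / pmax(rowSums(counts), 1)                # term frequency per document
idf   <- log(nrow(counts) / pmax(colSums(counts > 0), 1)) # inverse document frequency
tfidf <- sweep(tf, 2, idf, "*")

km_counts <- kmeans(counts, centers = 4, nstart = 10)
km_tfidf  <- kmeans(tfidf,  centers = 4, nstart = 10)

# Raw SSE, and SSE as a fraction of the total sum of squares
print(c(counts = km_counts$tot.withinss, tfidf = km_tfidf$tot.withinss))
print(c(counts = km_counts$tot.withinss / km_counts$totss,
        tfidf  = km_tfidf$tot.withinss / km_tfidf$totss))
```

The two matrices live on different scales, so the raw SSE values are not directly comparable; the ratio to the total sum of squares is one way to put them on a common footing.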
4.2 Let's apply the LDA algorithm to the same dataset. Try different numbers of topics. If the number of topics is 4, how do the topic descriptions fit with the clusters you got in 4.1? To compare the clusters and topics, get the 10 most important terms for each cluster and, similarly, for each topic. Do they agree?
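One way to get the most important terms per cluster is to rank the terms by their weight in each k-means center. The sketch below assumes k-means clusters on a made-up count matrix; top_terms is a hypothetical helper, not part of any package.

```r
set.seed(1)
# Toy document-term count matrix (rows = documents, columns = terms); made-up data
counts <- matrix(rpois(40 * 15, lambda = 2), nrow = 40,
                 dimnames = list(NULL, paste0("term", 1:15)))
km <- kmeans(counts, centers = 4, nstart = 10)

# top_terms is a hypothetical helper: rows of m are clusters (or topics),
# columns are terms, and entries are the weight of a term in that row
top_terms <- function(m, n = 10) {
  t(apply(m, 1, function(w) colnames(m)[order(w, decreasing = TRUE)[seq_len(n)]]))
}

cluster_terms <- top_terms(km$centers)  # one row of 10 term names per cluster
print(cluster_terms)
```

The same helper could be applied to a topic-by-term weight matrix taken from an LDA fit; how to extract that matrix depends on the LDA implementation used in the practicals. Note that cluster and topic numbering is arbitrary, so match them by their term lists rather than by index.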