Task 3: Clustering & Factorization

Load the ratings dataset (ratings.csv). After reading the table, convert it to a matrix (just to be on the safe side) using the following statement:

rat <- as.matrix(rat)
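
As a minimal sketch of the loading step: the snippet below assumes that ratings.csv is comma-separated, has a header row, and holds one row per user. None of this is stated in the task, so a small temporary file stands in for the real data to keep the example runnable on its own.

```r
# Stand-in for ratings.csv (assumption: comma-separated, header row, one row per user).
tmp <- tempfile(fileext = ".csv")
writeLines(c("m1,m2,m3", "5,3,NA", "4,NA,1"), tmp)

rat <- read.csv(tmp)     # read.csv returns a data frame
rat <- as.matrix(rat)    # convert it to a numeric matrix

print(dim(rat))          # number of rows (users) and columns (movies)
print(is.numeric(rat))   # TRUE when every column was read as a number
```

With the real file, replace `tmp` by the path to ratings.csv and check the dimensions before going further.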

NB: The matrix is the original MovieLens dataset, so it may be too large to practice on. If you want, reduce the number of users and/or movies by selecting a subset of them, e.g.

rat1 <- rat[1:200, 1:500]  # selects the first 200 users and the first 500 movies

3.1 Now try to apply NMF (non-negative matrix factorization). If you apply the nnmf method as presented in the practicals, does it work?
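
Before factorizing, it may help to inspect the matrix. NMF requires non-negative input, and some implementations do not accept missing values. The sketch below runs on a toy matrix standing in for rat1 (an assumption; the real data may look different) and only reports what it finds.

```r
# Toy stand-in for rat1 (assumption: rows are users, columns are movies).
set.seed(1)
rat1 <- matrix(sample(c(NA, 1:5), 200 * 50, replace = TRUE), nrow = 200)

n_missing     <- sum(is.na(rat1))            # entries with no rating
n_negative    <- sum(rat1 < 0, na.rm = TRUE) # NMF requires non-negative input
share_missing <- n_missing / length(rat1)    # fraction of missing entries

cat("missing:", n_missing,
    " negative:", n_negative,
    " share missing:", round(share_missing, 3), "\n")
```

What these counts mean for the factorization depends on the routine used in the practicals, so check its documentation for how it treats missing entries.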

3.2 Apply k-means to the compressed representation (for both users and movies). Try different values of k, as we did in the practicals. Can you find a good value of k for the clustering?
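
One possible way to compare values of k is to record the total within-cluster sum of squares for each one. The sketch below uses a toy matrix W as a stand-in for the compressed user representation (a hypothetical name; one row per user, one column per latent factor) and the base R kmeans function.

```r
# Toy stand-in for the compressed user representation
# (assumption: one row per user, one column per latent factor).
set.seed(42)
W <- rbind(matrix(rnorm(100 * 5, mean = 0), ncol = 5),
           matrix(rnorm(100 * 5, mean = 4), ncol = 5))

ks  <- 2:8
sse <- sapply(ks, function(k) {
  # nstart = 10 restarts k-means from 10 random initializations
  kmeans(W, centers = k, nstart = 10)$tot.withinss
})

# Look for an "elbow": the k after which the SSE stops dropping sharply.
print(data.frame(k = ks, sse = sse))
```

The same loop can be run on the movie-side factor matrix, arranged so that each row is a movie. Plotting sse against ks makes the elbow easier to spot.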

Task 4: Unsupervised Document analysis

Remember the text dataset we used in Task 2? Let’s work on it a bit more, this time in an unsupervised manner. Start by loading R4-counts.csv.

4.1 Perform clustering (use k=4) on the raw counts (the original matrix) and then on the tf-idf representation. Compare the SSE of the two clusterings.
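
A minimal sketch of this comparison, assuming k-means as the clustering method and a count matrix with documents in rows and terms in columns (a toy matrix stands in for R4-counts.csv). The tf-idf weighting shown is one common variant; the practicals may use a different one.

```r
# Toy stand-in for the count matrix (assumption: rows are documents, columns are terms).
set.seed(7)
counts <- matrix(rpois(40 * 12, lambda = 2), nrow = 40,
                 dimnames = list(NULL, paste0("term", 1:12)))

# tf-idf: term frequency within each document times log inverse document frequency.
tf    <- counts / pmax(rowSums(counts), 1)                # guard against empty documents
idf   <- log(nrow(counts) / pmax(colSums(counts > 0), 1)) # guard against unused terms
tfidf <- sweep(tf, 2, idf, `*`)                           # scale each column by its idf

km_counts <- kmeans(counts, centers = 4, nstart = 10)
km_tfidf  <- kmeans(tfidf,  centers = 4, nstart = 10)

# Total within-cluster SSE for each representation.
print(c(counts = km_counts$tot.withinss, tfidf = km_tfidf$tot.withinss))
```

The two SSE values live on different scales, so a relative measure such as tot.withinss divided by totss (both returned by kmeans) may be easier to compare.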

4.2 Let’s apply the LDA algorithm to the same dataset. Try different numbers of topics. With 4 topics, how well do the topic descriptions match the clusters you got in 4.1? To compare clusters and topics, extract the 10 most important terms for each cluster and for each topic. Do they agree?
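
The term comparison could be sketched as follows. Several things here are assumptions rather than part of the task: that LDA means latent Dirichlet allocation as provided by the topicmodels package, that "most important" terms for a cluster are those with the largest centroid values, and that the count matrix has documents in rows and terms in columns. A toy matrix stands in for R4-counts.csv, and the LDA step is skipped when topicmodels is not installed.

```r
# Toy stand-in for the count matrix (assumption: rows are documents, columns are terms).
set.seed(11)
counts <- matrix(rpois(60 * 30, lambda = 1), nrow = 60,
                 dimnames = list(paste0("doc", 1:60), paste0("term", 1:30)))

# Top terms per k-means cluster: the terms with the largest centroid values.
km <- kmeans(counts, centers = 4, nstart = 10)
top_cluster_terms <- apply(km$centers, 1, function(center) {
  names(sort(center, decreasing = TRUE))[1:10]
})
print(top_cluster_terms)   # 10 x 4 matrix, one column per cluster

# Top terms per LDA topic (only runs if the topicmodels package is available).
if (requireNamespace("topicmodels", quietly = TRUE)) {
  lda <- topicmodels::LDA(counts, k = 4, control = list(seed = 1))
  top_topic_terms <- topicmodels::terms(lda, 10)  # 10 most probable terms per topic
  print(top_topic_terms)
}
```

Cluster and topic numbers are arbitrary labels, so the columns need to be matched by their content rather than by their index.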