Now let’s examine how GloVe embeddings work. As is well known, word2vec word vectors capture many linguistic regularities. To give the canonical example, if we take the word vectors for the words “paris”, “france” and “germany” and perform the following operation:

vector(“paris”)-vector(“france”)+vector(“germany”)

the resulting vector will be close to the vector for “berlin”.

Let’s download the same Wikipedia data used as a demo by word2vec:

library(text2vec)
text8_file = "~/text8"
if (!file.exists(text8_file)) {
  download.file("http://mattmahoney.net/dc/text8.zip", "~/text8.zip")
  unzip ("~/text8.zip", files = "text8", exdir = "~/")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)

In the next step we will create a vocabulary, the set of words for which we want to learn word vectors. Note that all of text2vec’s functions which operate on raw text data (create_vocabulary, create_corpus, create_dtm, create_tcm) have a streaming API, and you should pass an iterator over tokens as the first argument to these functions.

# Create iterator over tokens
tokens <- space_tokenizer(wiki)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)

These words should not be too uncommon. For example, we cannot compute a meaningful word vector for a word we saw only once in the entire corpus. Here we will take only words which appear at least five times. text2vec provides additional options to filter the vocabulary (see ?prune_vocabulary).

vocab <- prune_vocabulary(vocab, term_count_min = 5L)
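
prune_vocabulary supports further filters besides the minimum term count; the exact argument names depend on your installed text2vec version, so check ?prune_vocabulary. As a purely illustrative sketch (text8 is read as a single line, i.e. one “document”, so document-proportion filters are not useful here, but capping the vocabulary size is):

# Illustrative sketch only; argument names follow recent text2vec releases
# keep at most the 30,000 most frequent terms that occur at least 5 times
vocab_small <- prune_vocabulary(vocab, term_count_min = 5L, vocab_term_max = 30000L)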

Now we have 71,290 terms in the vocabulary and are ready to construct the term-co-occurrence matrix (TCM).

# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab, 
                               # don't vectorize input
                               grow_dtm = FALSE, 
                               # use window of 5 for context words
                               skip_grams_window = 5L)
tcm <- create_tcm(it, vectorizer)
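
Note that the vectorizer call above follows the older (0.3-era) text2vec interface. In more recent releases vocab_vectorizer takes only the vocabulary, and the context window is passed to create_tcm instead. A rough equivalent, to be checked against the documentation of your installed version:

# Sketch for newer text2vec releases (verify with packageVersion("text2vec"))
vectorizer <- vocab_vectorizer(vocab)
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)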

Now we have a TCM matrix and can factorize it via the GloVe algorithm. text2vec uses a parallel stochastic gradient descent algorithm. By default it will use all cores on your machine, but you can specify the number of cores if you wish. For example, to use 4 threads call RcppParallel::setThreadOptions(numThreads = 4).

Let’s fit our model. (It can take several minutes to fit!)

# 50-dimensional word vectors; x_max caps the co-occurrence weighting function
glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 20)

And now we get the word vectors:

word_vectors <- glove$get_word_vectors()
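
If you are running a newer text2vec release (0.5+), the GloVe interface changed as well: the model is created without the vocabulary, fitted with fit_transform, and the context vectors are stored in $components. A sketch using the 0.6-style argument names (double-check against your version’s documentation):

# Sketch for newer text2vec releases
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 20)
wv_context <- glove$components
# summing the main and context vectors usually gives slightly better embeddings
word_vectors <- wv_main + t(wv_context)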

We can find the closest word vectors for our paris - france + germany example:

berlin <- word_vectors["paris", , drop = FALSE] - 
  word_vectors["france", , drop = FALSE] + 
  word_vectors["germany", , drop = FALSE]
cos_sim = sim2(x = word_vectors, y = berlin, method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 5)

Can you try two or three more examples of these analogies?
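
For instance, the classic gender analogy can be checked with the same pattern (assuming all four words survived the vocabulary pruning):

# "king" - "man" + "woman" should be close to "queen"
queen <- word_vectors["king", , drop = FALSE] - 
  word_vectors["man", , drop = FALSE] + 
  word_vectors["woman", , drop = FALSE]
cos_sim_queen = sim2(x = word_vectors, y = queen, method = "cosine", norm = "l2")
head(sort(cos_sim_queen[,1], decreasing = TRUE), 5)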

Your task

Can you apply a matrix factorization method to the word_vectors matrix? The goal is to obtain compressed (lower-dimensional) representations of the word vectors and to reproduce some vector analogies with them (like the ones we saw in the lecture).
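
One possible starting point is PCA via base R’s prcomp; the number of components kept below is illustrative:

# Sketch: PCA as one possible factorization of the word_vectors matrix
pca <- prcomp(word_vectors, center = TRUE, scale. = FALSE)
compressed <- pca$x[, 1:10]  # row names (the terms) are preserved

# repeat the analogy in the compressed space
berlin_pca <- compressed["paris", , drop = FALSE] - 
  compressed["france", , drop = FALSE] + 
  compressed["germany", , drop = FALSE]
cos_sim_pca = sim2(x = compressed, y = berlin_pca, method = "cosine", norm = "l2")
head(sort(cos_sim_pca[,1], decreasing = TRUE), 5)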