Introduction to the mxnet package

First, let’s load the necessary packages.

library(mlbench)
library(mxnet)

We will load a simple dataset and first try to train a multi-layer neural network with it. As usual, we need to process the dataset and split it into training and test sets.

data(Sonar, package="mlbench")
Sonar[,61] = as.numeric(Sonar[,61])-1
train.ind = c(1:50, 100:150)
train.x = data.matrix(Sonar[train.ind, 1:60])
train.y = Sonar[train.ind, 61]
test.x = data.matrix(Sonar[-train.ind, 1:60])
test.y = Sonar[-train.ind, 61]
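As a quick sanity check (an optional addition, not part of the original walkthrough), you can look at the dimensions of the split and the class balance of the labels:

dim(train.x)     # number of training examples and number of features
table(train.y)   # counts of the two classes, coded as 0 and 1
dim(test.x)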

In mxnet, we offer a function called mx.mlp that lets users build a general multi-layer neural network for classification or regression.

There are several parameters we have to feed to mx.mlp:

- Training data and label.
- Number of hidden nodes in each hidden layer.
- Number of nodes in the output layer.
- Type of the activation.
- Type of the output loss.
- The device to train on (GPU or CPU).
- Other parameters for mx.model.FeedForward.create.

The following shows one way to create and train a multi-layer neural network.

mx.set.seed(0)
model <- mx.mlp(train.x, train.y, hidden_node=10, out_node=2, out_activation="softmax",
                num.round=20, array.batch.size=15, learning.rate=0.07, momentum=0.9, 
                eval.metric=mx.metric.accuracy)

You can see the accuracy reported during the training process. It is also easy to make predictions on the test set and evaluate them:

preds = predict(model, test.x)
## Auto detect layout of input matrix, use rowmajor..
pred.label = max.col(t(preds))-1
table(pred.label, test.y)
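If you also want a single accuracy number (a small addition to the original code), compare the predicted labels with the true ones:

mean(pred.label == test.y)   # fraction of correctly classified test examples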

A regression model

We use the following code to load and process the data:

data(BostonHousing, package="mlbench")
train.ind = seq(1, 506, 3)
train.x = data.matrix(BostonHousing[train.ind, -14])
train.y = BostonHousing[train.ind, 14]
test.x = data.matrix(BostonHousing[-train.ind, -14])
test.y = BostonHousing[-train.ind, 14]

Although we could use mx.mlp again to do regression by changing the out_activation, this time we are going to introduce a more flexible way to configure neural networks in mxnet: the "Symbol" system, which takes care of the links among nodes, the activations, the dropout ratio, and so on. A multi-layer neural network can be configured as follows:

# Define the input data
data <- mx.symbol.Variable("data")
# A fully connected hidden layer
# data: input source
# num_hidden: number of neurons in this layer
fc1 <- mx.symbol.FullyConnected(data, num_hidden=1)
# Use linear regression for the output layer
lro <- mx.symbol.LinearRegressionOutput(fc1)
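To see why the Symbol system is more flexible than mx.mlp, here is a hedged sketch of a deeper variant of the same regression network, with one hidden ReLU layer in front of the linear output; the names and layer sizes are arbitrary illustrative choices, and this network is not used in the rest of the walkthrough.

# A hypothetical deeper regression network (illustration only)
data2 <- mx.symbol.Variable("data")
h1   <- mx.symbol.FullyConnected(data2, name="h1", num_hidden=10)
a1   <- mx.symbol.Activation(h1, name="relu_h1", act_type="relu")
out  <- mx.symbol.FullyConnected(a1, name="out", num_hidden=1)
lro2 <- mx.symbol.LinearRegressionOutput(out, name="lro2")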

What matters most for a regression task is the last function, mx.symbol.LinearRegressionOutput, which makes the network optimize the squared loss. We can now train on this simple data set. You can always visualize your configuration with the following code (applied here to the last layer):

graph.viz(lro$as.json())

Now, let’s train the model with the following code and parameters:

mx.set.seed(0)
model <- mx.model.FeedForward.create(lro, X=train.x, y=train.y,
                                     ctx=mx.cpu(), num.round=50, array.batch.size=20,
                                     learning.rate=2e-6, momentum=0.9, eval.metric=mx.metric.rmse)

It is also easy to make predictions:

preds = predict(model, test.x)
## Auto detect layout of input matrix, use rowmajor..
sqrt(mean((preds-test.y)^2))
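Beyond the single RMSE number, a quick optional check is to plot predicted against observed values; points near the diagonal indicate a good fit:

plot(test.y, as.vector(preds),
     xlab = "observed", ylab = "predicted")
abline(0, 1)   # perfect predictions would lie on this line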

Building a more complex network

MNIST is a handwritten digit image data set created by Yann LeCun. Every digit is represented by a 28x28 greyscale image. It has become a standard data set for testing classifiers on simple image input. You can find more information about the dataset here: http://yann.lecun.com/exdb/mnist/

Let’s load it into R.

train <- read.csv('day3/train.csv', header=TRUE)
test <- read.csv('day3/test.csv', header=TRUE)
train <- data.matrix(train)
test <- data.matrix(test)

train.x <- train[,-1]
train.y <- train[,1]

Here every image is represented as a single row in train/test. The greyscale value of each pixel falls in the range [0, 255]; we can linearly transform it into [0, 1] by

train.x <- t(train.x/255)
test <- t(test/255)

We also transpose the input matrix to npixel x nexamples, which is the column-major format accepted by mxnet (and the convention of R). Looking at the labels, we see that the number of images for each digit is fairly even:

table(train.y)
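To convince yourself that each column of train.x really is a flattened 28x28 image, you can reshape one column and plot it (an optional check, not part of the original code; depending on how image() orients the matrix, the digit may appear rotated or flipped):

first.digit <- matrix(train.x[, 1], nrow = 28, ncol = 28)
image(first.digit, col = grey.colors(256))
train.y[1]   # the label that should correspond to the plotted digit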

Network Configuration

Now we have the data. The next step is to configure the structure of our network.

data <- mx.symbol.Variable("data")
fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
fc2 <- mx.symbol.FullyConnected(act1, name="fc2", num_hidden=64)
act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
softmax <- mx.symbol.SoftmaxOutput(fc3, name="sm")

Try to visualize this network and inspect its structure.
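Mirroring the graph.viz call used earlier for the regression symbol, something like the following should draw the graph (the exact call may differ between mxnet versions):

graph.viz(softmax$as.json())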

In mxnet, we use its own data type, the symbol, to configure the network. data <- mx.symbol.Variable("data") uses data to represent the input data, i.e. the input layer. Then we set the first hidden layer with fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128). This layer takes data as its input, and we specify its name and its number of hidden neurons.

The activation is set by act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu"). The activation function takes the output of the first hidden layer fc1. The second hidden layer takes the result of act1 as its input, with its name set to "fc2" and 64 hidden neurons. The second activation is almost the same as act1, except that it has a different input source and name.

Here comes the output layer. Since there are only 10 digits, we set the number of neurons to 10. Finally, we apply a softmax activation to get a probabilistic prediction.

Training

We are almost ready for the training process. Before we start the computation, let’s decide which device to use.

devices <- mx.cpu()
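If your mxnet build includes GPU support, you could select a GPU context instead; it is shown commented out here because the rest of this walkthrough assumes the CPU:

# devices <- mx.gpu()   # requires an mxnet build compiled with GPU support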

After all this preparation, you can run the following command to train the neural network!

mx.set.seed(0)
model <- mx.model.FeedForward.create(softmax, X=train.x, y=train.y,
                                     ctx=devices, num.round=10, array.batch.size=100,
                                     learning.rate=0.07, momentum=0.9,  eval.metric=mx.metric.accuracy,
                                     initializer=mx.init.uniform(0.07),
                                     epoch.end.callback=mx.callback.log.train.metric(100))
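If you want to track performance on held-out data while training, mx.model.FeedForward.create also accepts an eval.data argument; a minimal sketch, assuming you carve a validation set out of the training matrix (the split below is an arbitrary illustrative choice):

# Hypothetical validation split: hold out the last 2000 examples (columns)
val.ind <- (ncol(train.x) - 1999):ncol(train.x)
model <- mx.model.FeedForward.create(softmax, X=train.x[, -val.ind], y=train.y[-val.ind],
                                     ctx=devices, num.round=10, array.batch.size=100,
                                     learning.rate=0.07, momentum=0.9, eval.metric=mx.metric.accuracy,
                                     eval.data=list(data=train.x[, val.ind], label=train.y[val.ind]))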

Prediction

To make predictions, we can simply write:

preds <- predict(model, test)
dim(preds)

It is a matrix with 10 rows and 28,000 columns, containing the classification probabilities from the output layer; each column corresponds to one test example. To extract the predicted label for each example, we transpose the matrix and use max.col in R:

pred.label <- max.col(t(preds)) - 1
table(pred.label)
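If this is the Kaggle digit-recognizer data (the train.csv/test.csv layout suggests it), you may want to save the predictions in the expected submission format; a minimal sketch, assuming one prediction per test row and ImageId/Label column names:

submission <- data.frame(ImageId = 1:length(pred.label), Label = pred.label)
write.csv(submission, file = "submission.csv", row.names = FALSE, quote = FALSE)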