A little H2O deeplearning experiment on the MNIST data set

Introduction

H2O is a fast and scalable opensource machine learning platform. Several algorithms are available, for example neural networks, random forests, linear models and gradient boosting. See the complete list here. Recently the H2O world conference was held, unfortunately I was not there. Luckily there is a lot of material available, videos and slides, it triggered me to try the software.

The software is easy to set up on my laptop. Download the software from the H2O download site, it is a zip file that needs to be unzipped. It contains (among other files) a jar file that needs to be run from the command line:

java -jar h20.jar

After H2O has started, you can browse to localhost:54321 (the default port number can be changed, specify: -port 65432) and within the browser you can use H2O via the flow interface. In this blog post I will not use the flow interface but I will use the R interface.

H2O flow web interface

The H2O R interface

To install the H2O R interface you can follow the instructions provided here. Its a script that checks if there is already a H2O R package installed, if needed it installs packages that the H2O package depends on, and it installs the H2O R package. Start the interface to H2O from R. If H2O was already started from the command line you can connect to the same H2O instance by specifying the same port and use startH2O = FALSE.


library(h2o)

localH2O =  h2o.init(nthreads = -1, port = 54321, startH2O = FALSE)

MNIST handwritten digits

The data I have used for my little experiment is the famous handwritten digits data from MNIST. The data in CSV format can be downloaded from Kaggle. The train data set has 42.000 rows and 785 columns, each row represents a digit, a digit is made up of 28 by 28 pixels, in total 784 columns, plus one additional label column. The first column in the CSV file is called ‘label’, the rest of the columns are called called pixel0, pixel1,….,pixel783. The following code imports the data and plots the first 100 digits, together with the label.


MNIST_DIGITStrain = read.csv( 'D:/R_Projects/MNIST/MNIST_DIGITStrain.csv' )
par( mfrow = c(10,10), mai = c(0,0,0,0))
for(i in 1:100){
  y = as.matrix(MNIST_DIGITStrain[i, 2:785])
  dim(y) = c(28, 28)
  image( y[,nrow(y):1], axes = FALSE, col = gray(255:0 / 255))
  text( 0.2, 0, MNIST_DIGITStrain[i,1], cex = 3, col = 2, pos = c(3,4))
}

The first 100 MNIST handwritten digits and the corresponding label

The data is imported into R, its a local R data frame. To apply machine learning techniques on the MNIST digits, the data needs to be available on the H2O platform. From R you can either import a CSV file directly into the H2O platform or you can import an existing R object into the H2O platform.


mfile = 'D:\\R_Projects\\MNIST\\MNIST_DIGITStrain.csv'
MDIG = h2o.importFile(path = mfile,sep=',')

# Show the data objects on the H2O platform
h2o.ls()

key
1 MNIST_DIGITStrain.hex_3

Deep learning autoencoder

Now that the data is in H2O we can apply machine learning techniques on the data. One type of analysis that interested me the most is the ability to train autoencoders. The idea is to use the input data to predict the input data by means of a ‘bottle-neck’ network.

autoencoder

The middle layer can be regarded as a compressed representation of the input. In H2O R, a deep learning autoencoder can be trained as follows.

NN_model = h2o.deeplearning(
  x = 2:785,
  training_frame = MDIG,
  hidden = c(400, 200, 2, 200, 400 ),
  epochs = 600,
  activation = 'Tanh',
  autoencoder = TRUE
)

So there is one input layer with 784 neurons, a second layer with 400 neurons, a third layer with 200, the middle layer with 2 neurons, etc. The middle layer is a 2-dimensional representation of a 784 dimensional digit. The 42.000 2-dimensional representations of the digits are just points that we can plot. To extract the data from the middle layer we need to use the function h20.deepfeatures.


train_supervised_features2 = h2o.deepfeatures(NN_model, MDIG, layer=3)

plotdata2 = as.data.frame(train_supervised_features2)
plotdata2$label = as.character(as.vector(MDIG[,1]))

qplot(DF.L3.C1, DF.L3.C2, data = plotdata2, color = label, main = 'Neural network: 400 - 200 - 2 - 200 - 400')

NNdeep

In training the autoencoder network I have not used the label, this is not a supervised training exercise. However, I have used the label in the plot above. We can see the ‘1’ digits clearly on the left-hand side, while the ‘7’ digits are more on the right-hand side, and the pink ‘8’ digits are more in the center. It’s far from a perfect, I need to explore more options in the deep learning functionality to achieve a better separation in 2 dimensions.

Comparison with a 2 dimensional SVD data reduction

Autoencoders use nonlinear transformations to compress high dimensional data to a lower dimensional space. Singular Value decomposition on the other hand can be used to compress data to a lower dimensional space by using only linear transformations. See my earlier blog post on SVD. The following picture shows the MNIST digits projected to 2 dimensions using SVD.
SVD2

There is a good separation between the 1’s and the 0’s, but the rest of the digits are much less separated than the autoencoder. There is of course a time benefit for the SVD. It takes around 6.5 seconds to calculate a SVD on the MNIST data while it took around 350 seconds for the autoencoder.

Conclusion

With this little autoencoder example, I have just scratched the surface of what is possible in H2O. There is much more to discover, many supervised learning algorithms, and also within the deep learning functionality of H2O there are a lot of settings which I have not explored further.

8 thoughts on “A little H2O deeplearning experiment on the MNIST data set”

Ирина Голощапова

December 14, 2015 at 7:59 pm

For me, your code leaves an error while trying to train NN_model:
Error in .verify_dataxy(training_frame, x, y, autoencoder) :
out of range explanatory variable 2,3,4,5,6,7,8,9,10,11,12,13, … etc.

LikeLike

rrrventures

March 31, 2016 at 7:12 pm

Hey, I am using this post to learn a bit about h2o. Did the training of the model take really long? I am using the same default h2o.init parameters and seems to be taking forever. Not sure if it is expected or I should change the instance config.

LikeLiked by 1 person

- Longhow Lam
  
  March 31, 2016 at 7:15 pm
  
  Hi,
  
  The calculation of the dimension reduction did take some time, it run for an hour or so on my laptop
  
  Cheers,
  Longhow
  
  LikeLike
  
  - rrrventures
    
    March 31, 2016 at 8:18 pm
    
    Thanks for the quick reply and blog post again. Cheers
    
    LikeLike
Neal

February 1, 2017 at 3:38 pm

Hello,
I think we need a test set to be created. Your code only loads the ‘train.csv’ file. I think you need to load the ‘test.csv’ file, put that into an h2o object then use code like this:

model_01 <- h2o.deeplearning(
x = names(h2oDataAll),
training_frame = h2oDataTrain,
validation_frame = h2oDataTest,
ETC …
)

LikeLike

usct

March 1, 2017 at 9:52 am

Hi, How can we use the output of “train_supervised_features2” to get the most important inputs. E.g., If there are 20 inputs to DL and we want top 5 inputs, how we can extract that

LikeLike

Jeff

March 23, 2017 at 7:08 pm

My projection does not look anything like the projection in the tutorial.

LikeLike

- VeilleData (@VeilleData)
  
  March 24, 2017 at 9:31 am
  
  Same here, the results is worse than SVD. With other parameters you can do better though.
  
  LikeLike