Some insights in soccer transfers using Market Basket Analysis



Although more than 20 years old, Market Basket Analysis (MBA), also known as association rules mining, can still be a very useful technique for gaining insights into large transactional data sets. The classical example is transactional data from a supermarket: for each customer we know which individual products (items) they put in their basket and bought. Other use cases for MBA include web click data, log files, and even questionnaires.

With market basket analysis we can identify items that are frequently bought together. Usually the results of an MBA are presented in the form of rules. The rules can be as simple as {A ==> B}: when a customer buys item A, it is (very) likely that the customer also buys item B. More complex rules are also possible, such as {A, B ==> D, F}: when a customer buys items A and B, it is likely that the customer also buys items D and F.
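To make "likely" a bit more concrete, here is a minimal base R sketch of the two measures that are commonly used to score a rule: support and confidence. These measures are not discussed further in this post, and the baskets and the helper functions `support` and `confidence` are made up for illustration only.

```r
# Toy baskets: each element holds the items of one transaction (made-up data).
baskets <- list(
  c("A", "B", "C"),
  c("A", "B"),
  c("A", "D"),
  c("B", "C"),
  c("A", "B", "D")
)

# Support: the fraction of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Confidence of the rule lhs ==> rhs: support(lhs and rhs) / support(lhs).
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

supp_AB <- support(c("A", "B"), baskets)   # baskets with both A and B
conf_AB <- confidence("A", "B", baskets)   # share of the A-baskets that also hold B
```

In this toy example three of the five baskets contain both A and B, and three of the four baskets with A also contain B.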


Soccer transactional data

To perform MBA you of course need data, but I don’t have real transactional data from a retailer that I can present here, so I am using soccer data instead 🙂 From the Kaggle site you can download some soccer data, thanks to Hugo Mathien. The data contains around 25,000 matches from eleven European soccer leagues, from season 2008/2009 until season 2015/2016. After some data wrangling I was able to generate a transactional data set suitable for market basket analysis. The data structure is very simple; some records are given in the figure below:

So we do not have customers but soccer players, and we do not have products but soccer clubs. In total, my soccer transactional data set contains around 18,000 records. Obviously, these records include not only the multi-million transfers covered in the media, but also all the transfers of players nobody has ever heard of 🙂

Market basket results

In R you can use the arules package for MBA / association rules mining. Alternatively, when the order of the transactions is important, like my soccer transfers, you should use the arulesSequences package. After running the algorithm I got some interesting results. The figure below shows the most frequently occurring transfers between clubs:

So in this data set the most frequently occurring transfer is from Fiorentina to Genoa (12 transfers in total). I have published the entire table with the rules on RPubs, where you can look up the transfer activity of your favorite soccer club.
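The counting behind such a table can be sketched in a few lines of base R. This is not the arulesSequences code that produced the results above, only an illustration of the idea: the data frame `careers` and its columns are hypothetical, and each player's rows are assumed to be in chronological order.

```r
# Made-up career data: one row per (player, club), in chronological order.
careers <- data.frame(
  player = c("p1", "p1", "p1", "p2", "p2", "p3", "p3"),
  club   = c("Fiorentina", "Genoa", "Roma",
             "Fiorentina", "Genoa",
             "Roma", "Sampdoria"),
  stringsAsFactors = FALSE
)

# For each player, pair every club with the next one: a "from ==> to" transfer.
transfers <- do.call(rbind, lapply(split(careers, careers$player), function(d) {
  if (nrow(d) < 2) return(NULL)
  data.frame(from = head(d$club, -1), to = tail(d$club, -1),
             stringsAsFactors = FALSE)
}))

# Count how often each from/to pair occurs, most frequent first.
counts <- as.data.frame(table(from = transfers$from, to = transfers$to),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
counts <- counts[order(-counts$Freq), ]
```

In this toy data the pair Fiorentina ==> Genoa comes out on top because two players made that move.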

Network graph visualization

All the rules that we get from the association rules mining form a network graph. The individual soccer clubs are the nodes of the graph and each rule “from ==> to” is an edge of the graph. In R, network graphs can be visualized nicely by means of the visNetwork package. The network is shown in the picture below.

An interactive version can be found on RPubs. The different colors represent the different soccer leagues. There are eleven leagues in this data set (Europe has more, of course). We can see that the Polish league is quite isolated from the rest, while the German, Spanish, English and French leagues almost blend into each other. Less connected are the Scottish and Portuguese leagues, but even in the big English Premier League and the German league you will find less connected clubs like Bournemouth, Reading or Arminia Bielefeld.

The size of a node in the above graph represents its betweenness centrality, an indicator of how central a node is in a network. It is equal to the number of shortest paths from all vertices to all others that pass through that node. In R, betweenness measures can be calculated with the igraph package. The most central clubs in the transfers of players are Sporting CP, Lechia Gdansk, Sunderland, FC Porto and PSV Eindhoven.

Virtual Items

An old trick among marketeers is to use virtual items in a market basket analysis. Besides the ‘physical’ items that a customer has in the basket, a marketeer can add extra virtual items to the basket. These could be, for example, customer characteristics like age class and sex, but also things like day of the week, region, etc. The transactional data with virtual items might look like:

If you run an MBA on the transactional data with virtual items, interesting rules might appear. For example:

  • {Chocolate, Female ==> Eggs}
  • {Chocolate, Male ==> Apples}
  • {Beer, Friday, Male, Age[18:23] ==> sausages}.

The virtual items that I can add to my soccer transactional data are an age class (four classes: 1: players younger than 25, 2: [25, 29), 3: [29, 33) and 4: players that are 33 or older), the preferred foot (two classes: left or right) and a height class (four classes: 1: players shorter than 178 cm, 2: [178, 183), 3: [183, 186) and 4: players of 186 cm or taller).
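Such classes are easy to derive in base R with `cut`. The sketch below only illustrates the binning; the data frame `players` and its column names are made up, and I am assuming the highest height class starts at 186 cm.

```r
# Made-up player attributes.
players <- data.frame(
  player = c("p1", "p2", "p3", "p4"),
  age    = c(22, 25, 31, 35),
  height = c(175, 178, 185, 190),
  foot   = c("left", "right", "right", "left"),
  stringsAsFactors = FALSE
)

# right = FALSE gives left-closed intervals such as [25, 29).
players$age_class <- cut(players$age, breaks = c(-Inf, 25, 29, 33, Inf),
                         labels = 1:4, right = FALSE)
players$height_class <- cut(players$height, breaks = c(-Inf, 178, 183, 186, Inf),
                            labels = 1:4, right = FALSE)

# The virtual items of each player as one labelled string.
virtual_items <- paste0("age=", players$age_class,
                        ", height=", players$height_class,
                        ", foot=", players$foot)
```

Each string in `virtual_items` can then be added to the basket of the corresponding player before running the algorithm again.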

After running the algorithm again, the results let you find the frequently occurring transfers of left-footers. For example, I can see 4 left-footers that transferred from Roma to Sampdoria. More rules can be seen on my RPubs site.


When you have transactional data, even as small as the soccer transfers, market basket analysis is definitely one of the techniques you should try to get some first insights. Feel free to look at my R code on GitHub to experiment with the soccer transfers data.

Cheers Longhow.

New chapters for 50 shades of grey….



Some time ago I had the honor to follow an interesting talk from Tijmen Blankevoort on neural networks and deep learning. Convolutional and recurrent neural networks were topics that had already caught my interest, and this talk inspired me to dive deeper into them and do some more experiments.

In the same session, organized by Martin de Lusenet for Ziggo (a Dutch cable company), I also had the honor to give a talk. My presentation contained a text mining experiment that I did earlier on the Dutch TV soap GTST, “Goede Tijden Slechte Tijden”. A nice idea by Tijmen was: why not use deep learning to generate new episode plots for GTST?

So I did that, see my LinkedIn post on GTST. However, these episodes are in Dutch and I guess only interesting for people here in the Netherlands. So to make things more international and spicier, I generated some new texts based on deep learning and the erotic romance novel 50 shades of grey 🙂

More than plain vanilla networks

In R or SAS you have been able to train plain vanilla neural networks for a long time. These are the so-called fully connected networks, where all input nodes are connected to all nodes in the following hidden layer, and all nodes in a hidden layer are connected to all nodes in the following hidden layer or output layer.


In more recent years deep learning frameworks have become very popular, for example Caffe, Torch, CNTK, TensorFlow and MXNet. The added value of these frameworks compared to, for example, SAS is:

  • They support more network types than plain vanilla networks, for example convolutional networks, where not all input nodes are connected to a next layer, and recurrent networks, where loops are present. A nice introduction to these networks can be found here and here.
  • They support computations on GPUs, which could speed up things dramatically.
  • They are open-source and free. No need for long sales and implementation cycles 🙂 Just download it and use it!

recurrent neural network

My 50 Shades of Grey experiment

For my experiment I used the text of the erotic romance novel 50 shades of grey. A pdf can be found here; I used xpdfbin to extract all the words into a plain text file. I trained a Long Short Term Memory network (LSTM, a special type of recurrent network) with MXNet. The reason to use MXNet is that it has a nice R interface, so that I can just stay in my comfortable RStudio environment.

Moreover, the R example script of MXNet is ready to run. I just changed the input data and used more rounds of training and more hidden layers. The script and the data can be found on GitHub.

The LSTM model is fit on character level. The complete romance novel contains 817,204 characters, and each of these characters is mapped to a number (91 unique numbers). The first few numbers are shown in the following figure.
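The mapping from characters to numbers can be sketched in base R as follows. This is not the MXNet example script itself, only an illustration of the idea; the short string `txt` stands in for the full text of the novel and the variable names are my own.

```r
txt <- "Miss Steele"   # stand-in for the full text of the novel

chars   <- strsplit(txt, "")[[1]]          # split into single characters
vocab   <- sort(unique(chars))             # the distinct characters
char2id <- setNames(seq_along(vocab), vocab)

encoded <- unname(char2id[chars])                 # the text as integers
decoded <- paste(vocab[encoded], collapse = "")   # and back to text again
```

The model is trained on sequences like `encoded`; text generation predicts the next number, which is mapped back to a character through `vocab`.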


Once the model has been trained it can generate new text, character by character!

arsess whatever
yuu’re still expeliar a sally. Reftion while break in a limot.”
“Yes, ald what’s at my artmer and brow maned, but I’m so then for a
dinches suppretion. If you think vining. “Anastasia, and depregineon posing rave.
He’d deharing minuld, him drits.

“Miss Steele
“Fasting at liptfel, Miss I’ve dacind her leaches reme,” he knimes.
“I want to blight on to the wriptions of my great. I find sU she asks the stroke, to read with what’s old both – in our fills into his ear, surge • whirl happy, this is subconisue. Mrs. I can say about the battractive see. I slues
is her ever returns. “Anab.

It’s too even ullnes. “By heaven. Grey
about his voice. “Rest of the meriction.”
He scrompts to the possible. I shuke my too sucking four finishessaures. I need to fush quint the only more eat at me.
“Oh my. Kate. He’s follower socks?
“Lice in Quietly. In so morcieut wait to obsed teach beside my tired steately liked trying that.”
Kate for new of its street of confcinged. I haven’t Can regree.
“Where.” I fluscs up hwindwer-and I have

I’ll staring for conisure, pain!”
I know he’s just doesn’t walk to my backeting on Kate has hotelby of confidered Christaal side, supproately. Elliot, but it’s the ESca, that feel posing, it make my just drinking my eyes bigror on my head. S I’ll tratter topality butterch,” I mud
a nevignes, bleamn. “It’s not by there soup. He’s washing, and I arms and have. I wave to make my eyes. It’s forgately? Dash I’d desire to come your drink my heathman legt
you hay D1 Eyep, Christian Gry, husder with a truite sippking, I coold behind, it didn’t want to mive not to my stop?”

“Sire, stcaring it was do and he licks his viice ever.”
I murmurs, most stare thut’s the then staraline for neced outsive. She
so know what differ at,” he murmurs?
“I shake my headanold.” Jeez.
“Are you?” Eviulder keep “Oh,_ I frosing gylaced in – angred. I am most drink to start and try aparts through. I really thrial you, dly woff you stund, there, I care an right dains to rainer.” He likes his eye finally finally my eyes to over opper heaven, places my trars his Necked her jups.
“Do you think your or Christian find at me, is so with that stand at my mouth sait the laxes any litee, this is a memory rude. It
flush,” He says usteer?” “Are so that front up.
I preparraps. I don’t scomine Kneat for from Christian.
“Christian,’! he leads the acnook. I can’t see. I breathing Kate’ve bill more over keen by. He releases?”
“I’m kisses take other in to peekies my tipgents my


The generated text does not make any sense, nor will it win any literature prize soon 🙂 Keep in mind that the model is based ‘only’ on 817,204 characters (which is considered a small number), and I did not bother to fine-tune the model at all. Still, it is funny and remarkable to see that when you use it to generate text, character by character, it can produce a lot of correct English words and even some correct basic grammar patterns!

cheers, Longhow.


The Eurovision 2016 song contest in an R Shiny app


In just a few weeks the Eurovision 2016 song contest will be held again. There are 43 participants, two semi-finals on the 10th and 12th of May and a final on the 14th of May. It’s going to be a long watch in front of the television…. 🙂 Who is going to win? Well, you could ask experts, look up the number of tweets on the different participants, count YouTube likes or go to bookmakers’ sites. At the time of writing Russia was the favorite among the bookmakers, according to this overview of bookmakers.

Spotify data

As an alternative, I used Spotify data. There is a Spotify API which allows you to get information on playlists, artists, tracks, etc. It is not difficult to extract interesting information from the API:

  • Sign up for a (Premium or Free) Spotify account
  • Register a new application on the ‘My Applications‘ site
  • You will then get a client ID and a client Secret

In R you can use the httr library to make API calls. First, with the client ID and secret you need to retrieve a token, then with the token you can call one of the Spotify API endpoints, for example information on a specific artist, see the R code snippet below.


library(httr)

clientID = '12345678910'
secret = 'your client secret'

## request an access token with the client ID and secret
response = POST(
  url = '',
  authenticate(clientID, secret),
  body = list(grant_type = 'client_credentials'),
  encode = 'form'
)

mytoken = content(response)$access_token

## Frank Sinatra spotify artist ID
artistID = '1Mxqyy3pSjf8kZZL4QVxS0'

HeaderValue = paste0('Bearer ', mytoken)

## call the artist endpoint with the token in the Authorization header
URI = paste0('', artistID)
response2 = GET(url = URI, add_headers(Authorization = HeaderValue))
Artist = content(response2)

The content of the second response object is a nested list with information on the artist, for example URLs of images, the number of followers, the popularity of the artist, etc.

Track popularity

An interesting API endpoint is the track API, especially its information on track popularity. What is track popularity? Taken from the Spotify web site:

The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.

I wrote a small R script to retrieve, every hour, the track popularity of each of the 43 tracks that participate in this year’s Eurovision song contest. The picture below lists the top 10 most popular tracks of the song contest participants.


At the time of writing the most popular track was “If I Were Sorry” by Frans (Sweden), which is ranked third by the bookmakers. The least popular track was “The Real Thing” by Highway (Montenegro), corresponding to the last place at the bookmakers.

There is not a lot of movement in the track popularity; it is very stable over time. Maybe when we get nearer to the song contest final in May we’ll see some more movement. I have also kept track of the number of followers that an artist has. There is much more movement here. See the figure below.


Every day between 5 pm and 6 pm Frans gets around 10 to 12 new followers on Spotify! Artists may of course also lose some followers, for example Douwe Bob in the above picture.

Audio features and related artists

Audio features of tracks, like loudness, danceability, tempo etc., can also be retrieved from the API. A simple scatter plot of the 43 songs reveals loud and undanceable songs, for example the one by Francesca Michielin (Italy). She is one of the six lucky artists that already have a place in the final!


Every artist on Spotify also has a set of related artists. This set can be retrieved from the API and can be viewed nicely in a network graph.


The green nodes are the 43 song contest participants. Many of them are ‘isolated’ but some of them are related to each other or connected through a common related artist.


I have created a small Eurovision 2016 Shiny app that summarizes the above information so you can see and listen for yourself. On May the 14th we will find out how strongly the Spotify track popularity is correlated with the final ranking of the Eurovision song contest!

Cheers, Longhow.

Delays on the Dutch railway system

I almost never travel by train; the last time was years ago. However, recently I had to take the train from Amsterdam and it was delayed by 5 minutes. No big deal, but I was just curious how often these delays occur on the Dutch railway system. I couldn’t quickly find a historical data set with information on delays, so I decided to gather my own data.

The Dutch Railways provide an API (De NS API) that returns current departure and delay data for a given train station. I have written a small R script that calls this API for each of the 400 train stations in The Netherlands. This script is scheduled to run every 10 minutes. The API returns data in XML format, and the basic entity is “a departing train”. For each departing train we know its departure time, the destination, the departing train station, the type of train, the delay (if there is any), etc. So what to do with all these departing trains? Throw them all into MongoDB. Why?

  • Not for any particular reason :-).
  • It’s easy to install and set up on my little Ubuntu server.
  • There is a nice R interface to MongoDB.
  • The response structure (see picture below) from the API is not that difficult to flatten to a table, but NoSQL sounds more sexy than MySQL nowadays 🙂


I started collecting train departure data on the 4th of January. Per day there are around 48,000 train departures in The Netherlands, and I can see how many of them are delayed per day, per station or per hour. Of course, since the collection started only a few days ago, it’s hard to use these data to estimate long-term delay rates of the Dutch railway system. But it is a start.
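Once the departures are flattened to a table, the delay rates are a simple aggregation. Here is a minimal base R sketch with made-up data; the column names `station`, `hour` and `delay_min` are hypothetical and not the field names of the NS API.

```r
# Made-up departures; `delay_min` is 0 when the train left on time.
departures <- data.frame(
  station   = c("Amsterdam", "Amsterdam", "Amsterdam", "Groningen", "Groningen"),
  hour      = c(8, 8, 17, 8, 17),
  delay_min = c(0, 5, 0, 12, 3)
)

departures$delayed <- departures$delay_min > 0

# Share of delayed departures per station and per hour of the day.
rate_station <- aggregate(delayed ~ station, data = departures, FUN = mean)
rate_hour    <- aggregate(delayed ~ hour,    data = departures, FUN = mean)
```

The same aggregation could also be pushed down to the database, but for a few days of data doing it in R is quick enough.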

To present this delay information in an interactive way to others I have created an R Shiny app that queries the MongoDB database. The picture below from my Shiny app shows the delay rates per train station on the 4th of January 2016, an icy day especially in the north of the Netherlands.




Analyzing “Twitter faces” in R with Microsoft Project Oxford


In my previous blog post I used the Microsoft Translator API in my BonAppetit Shiny app to recommend restaurants to tourists. I’m getting a little bit addicted to the Microsoft APIs; they can be fun to use :-). In this blog post I will briefly describe some of the Project Oxford APIs of Microsoft.

The APIs can be called from within R, and if you combine them with other APIs, for example Twitter, interesting “Twitter face” analyses can be done. See my “TweetFace” Shiny app to analyse faces that can be found on Twitter.

Project Oxford

The APIs of Project Oxford can be categorized into:

  • Computer Vision,
  • Face,
  • Video,
  • Speech and
  • Language.

The free tier subscription provides 5000 API calls per month (with a rate limit of 20 calls per minute). I focused my experiments on the computer vision and face APIs, where a lot of functionality is available to analyze images, for example categorization of images, adult content detection, OCR, face recognition, gender analysis, age estimation and emotion detection.

Calling the APIs from R

The httr package provides very convenient functions to call the Microsoft APIs. You need to sign up first and obtain a key. Let’s do a simple test on Angelina Jolie by using the face detect API.


Angelina Jolie, picture link


library(httr)

faceURL = ",gender,smile,facialHair"
img.url = ''

faceKEY = '123456789101112131415'

mybody = list(url = img.url)

faceResponse = POST(
  url = faceURL, 
  content_type('application/json'), add_headers(.headers = c('Ocp-Apim-Subscription-Key' = faceKEY)),
  body = mybody,
  encode = 'json'
)

faceResponse
Response [,gender,smile,facialHair]
Date: 2015-12-16 10:13
Status: 200
Content-Type: application/json; charset=utf-8
Size: 1.27 kB

If the call was successful a “Status: 200” is returned and the response object is filled with interesting information. The API returns the information as JSON which is parsed by R into nested lists.

AngelinaFace = content(faceResponse)[[1]]
names(AngelinaFace)
[1] "faceId"  "faceRectangle" "faceLandmarks" "faceAttributes"

[1] "female"

[1] 32.6

[1] 0

[1] 0

[1] 0

Well, the API recognized the gender and that there is no facial hair :-), but her age is underestimated: Angelina is 40, not 32.6! Let’s look at emotions. The emotion API has its own key and URL.

URL.emoface = ''

emotionKey = 'ABCDEF123456789101112131415'

mybody = list(url = img.url)

faceEMO = POST(
  url = URL.emoface,
  content_type('application/json'), add_headers(.headers = c('Ocp-Apim-Subscription-Key' = emotionKey)),
  body = mybody,
  encode = 'json'
)

AngelinaEmotions = content(faceEMO)[[1]]
[1] 4.573111e-05

[1] 0.001244121

[1] 0.0001096572

[1] 1.256477e-06

[1] 0.0004313129

[1] 0.9977798

[1] 0.0003823086

[1] 5.75276e-06

A fairly neutral face. Let’s test some other Angelina faces.


Find similar faces

A nice piece of functionality of the API is finding similar faces. First a list of faces needs to be created, then with a ‘query face’ you can search for similar-looking faces in the list of faces. Let’s look at the most sexy actresses.

library(rvest)

## Scrape the image URLs of the actresses

linksactresses = ''

out = read_html(linksactresses)
images = html_nodes(out, '.zero-z-index')
imglinks = html_nodes(out, xpath = "//img[@class='zero-z-index']/@src") %>% html_text()

## additional information, the name of the actress
imgalts = html_nodes(out, xpath = "//img[@class='zero-z-index']/@alt") %>% html_text()

Create an empty list by calling the facelist API. You should specify a facelistID, which is placed as a request parameter behind the facelist URL. My facelistID is “listofsexyactresses”, as shown in the code below.

### create an id and name for the face list
URL.face = ""

mybody = list(name = 'top 100 of sexy actresses')

faceLIST = PUT(
  url = URL.face,
  content_type('application/json'), add_headers(.headers = c('Ocp-Apim-Subscription-Key' = faceKEY)),
  body = mybody,
  encode = 'json'
)

faceLIST
Response []
Date: 2015-12-17 15:10
Status: 200
Content-Type: application/json; charset=utf-8
Size: 108 B

Now fill the list with images. The API allows you to provide user data with each image, which can be handy to insert names or other info. For one image this works as follows:

userdata = imgalts[i]
linkie = imglinks[i]
face.uri = paste(
  sep = ";"
)
face.uri = URLencode(face.uri)
mybody = list(url = linkie )

faceLISTadd = POST(
  url = face.uri,
  content_type('application/json'), add_headers(.headers = c('Ocp-Apim-Subscription-Key' = faceKEY)),
  body = mybody,
  encode = 'json'
)

faceLISTadd
Response []
Date: 2015-12-17 15:58
Status: 200
Content-Type: application/json; charset=utf-8
Size: 58 B

[1] '32fa4d1c-da68-45fd-9818-19a10beea1c2'

## status 200 is OK

Just loop over the 100 faces to complete the face list. With the list of images we can now perform a query with a new ‘query face’. Two steps are needed. First, call the face detect API to obtain a face ID. I am going to use an image of Angelina, but a different one than the image on IMDB.

faceDetectURL = ',gender,smile,facialHair'
img.url = ''

mybody = list(url = img.url)

faceRESO = POST(
  url = faceDetectURL,
  content_type('application/json'), add_headers(.headers =  c('Ocp-Apim-Subscription-Key' = faceKEY)),
  body = mybody,
  encode = 'json'
)

fID = content(faceRESO)[[1]]$faceId

With the face ID, query the face list with the “find similar” API. There is a confidence of almost 60%.

sim.URI = ''

mybody = list(faceID = fID, faceListID = 'listofsexyactresses' )

faceSIM = POST(
  url = sim.URI,
  content_type('application/json'), add_headers(.headers = c('Ocp-Apim-Subscription-Key' = faceKEY)),
  body = mybody,
  encode = 'json'
)

yy = content(faceSIM)
[1] "6b4ff942-b216-4817-9739-3653a467a594"

[1] 0.5980769

The picture below shows some other matches…..



The APIs of Microsoft’s Project Oxford provide nice functionality for computer vision and face analysis. They are fun to use; see my ‘TweetFace’ Shiny app to analyse images on Twitter.




Bon Appetit: A restaurant recommender for tourists visiting the Netherlands


More and more tourists are visiting The Netherlands, which becomes very clear if you walk through the center of Amsterdam on a sunny day. All those tourists need to eat somewhere, in some restaurant. You can see their sad faces as they have no clue where to go. Well, with the aid of a little data science I have made it easy for them :-). I built a small R Shiny app that tells tourists which restaurant they should go to in The Netherlands. In this blog post I will describe the different steps that I have taken.


Tourists in Amsterdam wondering where to eat……

Iens reviews

In an earlier blog post I wrote about scraping restaurant review data from Iens and how to use that to generate restaurant recommendations. The technique was based on the restaurant ratings given by the reviewers. To generate personal recommendations you need to rate some restaurants first. But as a tourist visiting The Netherlands for the first time this might be difficult.

So I have made it a little bit easier. Enter your idea of food in my Bon Appetit Shiny app. It will translate the text to Dutch if needed, calculate the similarity between your translated text and all reviews from Iens, and then give you the top ten restaurants whose reviews match best.

The Microsoft translator API

Almost all of the reviews on the Iens restaurant website are in Dutch, and I assume that most tourists from outside The Netherlands do not speak Dutch. That is not a large problem: I can translate non-Dutch text to Dutch by using a translator. Google and Microsoft offer translation APIs. I chose the Microsoft API because it offers a free tier: the first 2 million characters per month are free. Sign up and get started here. And because the API supports the Klingon language….. 🙂

The R franc package can recognize the language of the input text:

lang = franc(InputText)
ISO2 = speakers$iso6391[speakers$language==lang]
from = ISO2

The ISO 2 letter language code is needed in the call to the Microsoft translator API. I am making use of the httr package to set up the call. With your clientID and client secret a token must be retrieved. Then with this token the actual translation is done.

#Set up call to retrieve token

clientIDEncoded = URLencode("your microsoft client ID")

client_SecretEncoded = URLencode("your client secret")
Uri = ""

MyBody = paste(
   "grant_type=client_credentials&client_id=",
   clientIDEncoded,
   "&client_secret=",
   client_SecretEncoded,
   sep = ""
)

r = POST(url=Uri, body = MyBody, content_type("application/x-www-form-urlencoded"))
response = content(r)

Now that you have the token, make a call to translate the text.

HeaderValue = paste("Bearer ", response$access_token, sep="")

TextEncoded = URLencode(InputText)

to = "nl"

## translate URL with the encoded text and the from / to language codes
uri2 = paste(
)

resp2 = GET(url = uri2, add_headers(Authorization = HeaderValue))
Translated = content(resp2)

#### dig out the text from the xml object
library(rvest)
TranslatedText = as(Translated, "character") %>% read_html() %>% html_text()

Some example translations

Louis van Gaal is notorious for his Dutch to English (or any other language for that matter) translations. Let’s see how the Microsoft API performs on some of his sentences.

  • Dutch: “Dat is hele andere koek”, van Gaal: “That is different cook”, Microsoft: “That is a whole different kettle of fish”.
  • Dutch: “de dood of de gladiolen”, van Gaal: “the dead or the gladiolus”, Microsoft: “the dead or the gladiolus”.
  • Dutch: “Het is een kwestie van tijd”, van Gaal: “It’s a question of time”, Microsoft: “It’s a matter of time”.

The Cosine similarity

The distance or similarity between two documents (texts) can be measured by means of the cosine similarity. When you have a collection of reviews (texts), this collection can be represented by a term document matrix. A row of this matrix is one review: it’s a vector of word counts. Another review or text is also a vector of word counts. Given two vectors A and B, the cosine similarity is given by:

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$


Now the input text that is translated to Dutch is also a vector of word counts, so we can calculate the cosine similarity between each restaurant review and the input text. The restaurants corresponding to the most similar reviews are returned as recommended restaurants, bon appetit 🙂
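As a small illustration of this ranking step, here is a base R sketch with a made-up term document matrix of three reviews and four words. The matrix `tdm`, the vector `query` and the function `cosine` are toy examples, not the code of the Shiny app.

```r
# Made-up term document matrix: one row per review, one column per word.
tdm <- rbind(
  review1 = c(pizza = 2, kip = 1, kaas = 1, vis = 0),
  review2 = c(pizza = 0, kip = 0, kaas = 0, vis = 3),
  review3 = c(pizza = 1, kip = 0, kaas = 2, vis = 0)
)

# Word counts of the (translated) input text, using the same words.
query <- c(pizza = 1, kip = 1, kaas = 1, vis = 0)

# Cosine similarity: dot product divided by the product of the vector lengths.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sims   <- apply(tdm, 1, cosine, b = query)       # one similarity per review
ranked <- names(sort(sims, decreasing = TRUE))   # most similar review first
```

Here the review about fish shares no words with the query and ends up last, while the review that mentions pizza, kip and kaas comes first.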

Putting all together in a Shiny app

The above steps are implemented in my bon appetit Shiny app. Try out your thoughts and idea of food and get restaurant recommendations! Here is an example:

Input text: Large pizza with chicken and cheese that is tasty.


Input text translated to Dutch


The top ten restaurants corresponding to the translated input text


And for the German tourist: “Ich suche eines schnelles leckeres Hahnchen”, this gets translated to Dutch “ik ben op zoek naar een snelle heerlijke kip” and the ten restaurant recommendations you get are given in the following figure.



— Longhow —

A little H2O deeplearning experiment on the MNIST data set


H2O is a fast and scalable open-source machine learning platform. Several algorithms are available, for example neural networks, random forests, linear models and gradient boosting. See the complete list here. Recently the H2O World conference was held; unfortunately I was not there. Luckily there is a lot of material available (videos and slides), which triggered me to try the software.

The software is easy to set up on my laptop. Download the software from the H2O download site, it is a zip file that needs to be unzipped. It contains (among other files) a jar file that needs to be run from the command line:

java -jar h2o.jar

After H2O has started, you can browse to localhost:54321 (the default port number can be changed, specify: -port 65432) and within the browser you can use H2O via the flow interface. In this blog post I will not use the flow interface but I will use the R interface.


H2O flow web interface

The H2O R interface

To install the H2O R interface you can follow the instructions provided here. It’s a script that checks if there is already an H2O R package installed, installs the packages that the H2O package depends on if needed, and installs the H2O R package. Then start the interface to H2O from R. If H2O was already started from the command line, you can connect to the same H2O instance by specifying the same port and using startH2O = FALSE.


library(h2o)
localH2O = h2o.init(nthreads = -1, port = 54321, startH2O = FALSE)

MNIST handwritten digits

The data I have used for my little experiment is the famous handwritten digits data from MNIST. The data in CSV format can be downloaded from Kaggle. The train data set has 42,000 rows and 785 columns. Each row represents a digit, and a digit is made up of 28 by 28 pixels, so 784 columns in total, plus one additional label column. The first column in the CSV file is called ‘label’, and the rest of the columns are called pixel0, pixel1,….,pixel783. The following code imports the data and plots the first 100 digits, together with the label.

MNIST_DIGITStrain = read.csv( 'D:/R_Projects/MNIST/MNIST_DIGITStrain.csv' )
par( mfrow = c(10,10), mai = c(0,0,0,0))
for(i in 1:100){
  y = as.matrix(MNIST_DIGITStrain[i, 2:785])
  dim(y) = c(28, 28)
  image( y[,nrow(y):1], axes = FALSE, col = gray(255:0 / 255))
  text( 0.2, 0, MNIST_DIGITStrain[i,1], cex = 3, col = 2, pos = c(3,4))
}


The first 100 MNIST handwritten digits and the corresponding label

The data is imported into R as a local R data frame. To apply machine learning techniques on the MNIST digits, the data needs to be available on the H2O platform. From R you can either import a CSV file directly into the H2O platform or import an existing R object into the H2O platform.

mfile = 'D:\\R_Projects\\MNIST\\MNIST_DIGITStrain.csv'
MDIG = h2o.importFile(path = mfile,sep=',')

# Show the data objects on the H2O platform
h2o.ls()
1 MNIST_DIGITStrain.hex_3

Deep learning autoencoder

Now that the data is in H2O we can apply machine learning techniques on the data. The type of analysis that interested me the most is the ability to train autoencoders. The idea is to use the input data to predict the input data by means of a ‘bottle-neck’ network.


The middle layer can be regarded as a compressed representation of the input. In H2O R, a deep learning autoencoder can be trained as follows.

NN_model = h2o.deeplearning(
  x = 2:785,
  training_frame = MDIG,
  hidden = c(400, 200, 2, 200, 400 ),
  epochs = 600,
  activation = 'Tanh',
  autoencoder = TRUE
)

So there is one input layer with 784 neurons, a second layer with 400 neurons, a third layer with 200, the middle layer with 2 neurons, etc. The middle layer is a 2-dimensional representation of a 784-dimensional digit. The 42,000 2-dimensional representations of the digits are just points that we can plot. To extract the data from the middle layer we need to use the function h2o.deepfeatures.

library(ggplot2)

train_supervised_features2 = h2o.deepfeatures(NN_model, MDIG, layer=3)

plotdata2 = as.data.frame(train_supervised_features2)
plotdata2$label = as.character(as.vector(MDIG[,1]))

qplot(DF.L3.C1, DF.L3.C2, data = plotdata2, color = label, main = 'Neural network: 400 - 200 - 2 - 200 - 400')


In training the autoencoder network I have not used the label; this is not a supervised training exercise. However, I have used the label in the plot above. We can see the ‘1’ digits clearly on the left-hand side, while the ‘7’ digits are more on the right-hand side, and the pink ‘8’ digits are more in the center. It’s far from perfect, and I need to explore more options in the deep learning functionality to achieve a better separation in 2 dimensions.

Comparison with a 2 dimensional SVD data reduction

Autoencoders use nonlinear transformations to compress high-dimensional data to a lower-dimensional space. Singular value decomposition (SVD), on the other hand, can be used to compress data to a lower-dimensional space by using only linear transformations. See my earlier blog post on SVD. The following picture shows the MNIST digits projected to 2 dimensions using SVD.

There is a good separation between the 1’s and the 0’s, but the rest of the digits are much less separated than with the autoencoder. There is of course a time benefit for the SVD. It takes around 6.5 seconds to calculate an SVD on the MNIST data, while it took around 350 seconds for the autoencoder.
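A projection to 2 dimensions with SVD can be sketched with the base R function `svd`. The random matrix `X` below is only a stand-in for the 784 pixel columns, and centering the columns first is my own assumption; it is not necessarily how the picture above was made.

```r
set.seed(1)
X <- matrix(runif(50 * 20), nrow = 50)   # stand-in for the pixel columns

Xc <- scale(X, center = TRUE, scale = FALSE)   # center each column
s  <- svd(Xc, nu = 2, nv = 2)                  # keep two singular vectors

# Coordinates of every row in the 2-dimensional space.
proj <- s$u %*% diag(s$d[1:2])
```

The two columns of `proj` can be plotted against each other and colored by label, in the same way as the two middle-layer features of the autoencoder.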


With this little autoencoder example, I have just scratched the surface of what is possible in H2O. There is much more to discover, many supervised learning algorithms, and also within the deep learning functionality of H2O there are a lot of settings which I have not explored further.