Because it’s Friday… The IKEA Billy index



Because it is Friday, another ‘playful and frivolous’ data exercise 🙂

IKEA is more than a store; it is a very nice experience to go through. I can drop off my two kids at Småland, have some ‘quality time’ walking around the store with my wife, and eat some delicious Swedish meatballs. Back at home, IKEA furniture is a good ‘relation-tester’: try building a big wardrobe together with your wife…..

The nice thing about IKEA is that you don’t have to come to the store for nothing: you can check the availability of an item on the IKEA website.

According to the website this gets refreshed every 1.5 hours. This gave me an idea: if I check the availability every 1.5 hours, I could get an idea of the number of items sold for a particular item.

The IKEA Billy index

Probably the most iconic item of IKEA is the Billy bookcase. Just in case you don’t know what this bookcase looks like, below is a picture: simplicity in its most elegant form….

Every 1.5 hours over the last few months I have checked the Dutch IKEA website for the availability of this famous item in the 13 stores in the Netherlands, and calculated the negative difference between consecutive values.

The data that you get from this little playful exercise do not necessarily represent the number of Billy bookcases really sold. Maybe the stock got replenished in between, maybe items were moved internally to other stores. For example, if there are 50 Billys available in Amsterdam and 1.5 hours later there are 45 Billys, maybe 5 were sold, or maybe 6 were sold and 1 got returned or replenished? I just don’t know!

All I see are movements in availability that might have been caused by products sold. But anyway, let’s call the movements of availability of the Billy’s the IKEA Billy index.
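To make the "negative difference between consecutive values" concrete, here is a minimal sketch in Python (not the code used for this post). The function name `billy_index` and the stock readings are made up for illustration. Increases in stock are simply ignored, which is exactly why the index is only a proxy for sales.

```python
def billy_index(stock_levels):
    """Return the drop in availability between consecutive snapshots.

    Increases (replenishment, returns) count as zero, so only downward
    movements in the reported stock are kept.
    """
    drops = []
    for previous, current in zip(stock_levels, stock_levels[1:]):
        drops.append(max(previous - current, 0))
    return drops


# Hypothetical availability readings for one store, one per website refresh.
amsterdam = [50, 45, 45, 60, 52]
drops = billy_index(amsterdam)
print(drops)       # [5, 0, 0, 8]
print(sum(drops))  # 13
```

Summing these drops per day and per store would give one daily index value per store.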

Some graphs of the Billy Index

Trends and forecasts

Facebook released a nice R package, called prophet. It can be used to perform forecasting on time series, and it is used internally by Facebook across many applications. I ran the prophet forecasting algorithm on the IKEA Billy index. The graph below shows the result.

There are some high peaks at the end of October and the end of December. We can also clearly see the Saturday peaks that the algorithm has picked up from the historic data and projected into its forecasts.

Weekday and color

The graph above already showed that the Billy index is high on Saturdays, but what about the other days? The graph below depicts the sum of the IKEA Billy index per weekday since I started to collect this data (end of September). Wednesdays and Thursdays are less active days.


Clearly most of the Billy’s are white.


Does the daily Billy Index correlate with other data? I have used some Dutch weather data that can be downloaded from the Royal Netherlands Meteorological Institute (KNMI). The data consists of many daily weather variables. The graph below shows a correlation matrix of the IKEA Billy Index and only some of these weather variables.


The only somewhat meaningful correlation between the IKEA Billy Index and a weather variable is with wind speed (-0.19). Increasing wind speed means fewer Billys.


It’s an explainable correlation of course…. 🙂 You wouldn’t want to go to IKEA on (very) windy days, it is not easy to drive through strong winds with your Billy on top of your car.
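For readers who want to see what such a correlation coefficient measures, here is a small Python sketch of the Pearson correlation. The wind speed and index values are made up for illustration and are not the KNMI or IKEA data, so the result will not match the -0.19 quoted above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up daily values: average wind speed (m/s) and daily Billy index.
wind_speed = [2.0, 3.5, 5.0, 6.5, 8.0]
billy = [40, 42, 35, 30, 33]

r = pearson(wind_speed, billy)
print(round(r, 2))  # negative: windier days go with a lower index
```

A value near 0 means hardly any linear relation, so -0.19 is a weak effect at best.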


Cheers, Longhow.

R formulas in Spark and un-nesting data in SparklyR: Nice and handy!



In an earlier post I talked about Spark and sparklyR and did some experiments. At my work here at RTL Nederland we have a Spark cluster on Amazon EMR to do some serious heavy lifting on click and video-on-demand data. For an R user it makes perfect sense to use Spark through the sparklyR interface. However, using Spark through the pySpark interface certainly has its benefits. It exposes much more of the Spark functionality and I find the concept of ML Pipelines in Spark very elegant.

I would like to share two little Spark tricks with you, described below.

The RFormula feature selector

As an R user you have to get used to using Spark through pySpark; moreover, I had to brush up some of my rusty Python knowledge. For training machine learning models there is some help, though, in the form of an RFormula 🙂

R users know the concept of model formulae in R; it can be a handy way to formulate predictive models concisely. In Spark you can also use this concept. Only a limited set of R operators is available (+, -, . and :), but it is enough to be useful. The two figures below show a simple example.

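from pyspark.ml.feature import RFormula

f1 = "Targetf ~ paidDuration + Gender"
formula = RFormula(formula = f1)
# assumes `train` is the Spark DataFrame that holds the training data
train2 = formula.fit(train).transform(train)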


A handy thing about an RFormula in Spark is that (just like a formula in lm and some other modeling functions in R) string features used in an RFormula are automatically one-hot encoded, so that they can be used directly in the Spark machine learning algorithms.

Nested (hierarchical) data in sparklyR

Sometimes you may find yourself with nested hierarchical data. In pySpark you can flatten this hierarchy if needed. A simple example: suppose you read in a parquet file and it has the following structure:

Then to flatten the data you could use:

In sparklyR, however, reading the same parquet file results in something that isn’t useful to work with at first sight. If you open the table viewer to see the data, you will see rows with: <environment>.

Fortunately, the facilities used internally by sparklyR to call Spark are available to the end user. You can invoke more methods in Spark if needed. So we can invoke the select and col methods ourselves to flatten the hierarchy.

After registering the output object, it is visible in the Spark interface and you can view the content.

Thanks for reading my two tricks. Cheers, Longhow.

Did you say SQL Server? Yes I did….



My last blog post in 2016 on SQL Server 2016….. Some years ago, I heard predictions from ‘experts‘ that within a few years Hadoop / Spark systems would take over traditional RDBMSs like SQL Server. I don’t think that has happened (yet). Moreover, what some people don’t realize is that at least half of the world still depends on good old SQL Server. If tomorrow all the Transact-SQL stored procedures somehow magically failed to run, I think our society as we know it would collapse…..


OK, I might be exaggerating a little bit. The point is, there are still a lot of companies and use cases out there that are running SQL Server without the need for something else. And now with the integrated R services in SQL Server 2016 that might not be necessary at all 🙂

Deploying Predictive models created in R

From a business standpoint, creating a good predictive model and spending time on this, is only useful if you can deploy such a model in a system where the business can make use of the predictions in their ‘day-to-day operations’. Otherwise creating a predictive model is just an academic exercise / experiment….

Many predictive models are created in R on a ‘stand-alone’ laptop or server. There are different ways to deploy such models. Among others:

  • Re-build the scoring logic ‘by hand’ in the operational system. I did this in the past; it can be a little bit cumbersome and it’s not what you really want to do. If you do not have many data prep steps and your model is a logistic regression or a single tree, this is doable 🙂
  • Make use of PMML scoring. The idea is to create a model (in R), transform that to PMML and import the PMML in the operational system where you need the predictions. Unfortunately, not all models are supported and not all systems support importing (the latest version of) PMML.
  • Create APIs (automatically) with technology like Azure ML, DeployR, or openCPU, so that the application that needs the prediction can call the API.

SQL Server 2016 R services

If your company is running SQL Server (2016) there is another nice alternative to deploy R models: the SQL Server R services. At my work at RTL Nederland [Oh btw we are looking for data engineers and data scientists :-)] we are using this technology to deploy the predictive churn and response models created in R. The process is not difficult; the few steps that are needed are demonstrated below.

Create any model in R

I am using an extreme gradient boosting algorithm to fit a classification model on the Titanic data set. Instead of calling xgboost directly I am using the mlr package to train the model. Mlr provides a unified interface to machine learning in R; it takes care of some of the frequently used steps in creating a predictive model, regardless of the underlying machine learning algorithm. So your code can become very compact and uniform.


Push the (xgboost) predictive model to SQL Server

Once you are satisfied with the predictive model (on your R laptop), you need to bring that model over to SQL Server so that you can use it there. This consists of the following steps:

SQL code in SQL Server: write a stored procedure in SQL Server that accepts a predictive R model and some metadata, and saves them into a table in SQL Server.


This stored procedure can then be called from your R session.

Bring the model from R to SQL: to make this a little bit easier you can write a small helper function.


So what is the result? In SQL Server I now have a table (dbo.R_Models) with predictive models. My xgboost model to predict survival on the Titanic is now added as an extra row. Such a table becomes a sort of model store in SQL Server.
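The idea of a "model store" table does not depend on SQL Server or R. As a rough illustration only, the Python sketch below serializes a model object and stores it as a binary column next to some metadata. SQLite stands in for SQL Server, a plain dictionary stands in for the xgboost model, and the table and function names (`r_models`, `save_model`, `load_model`) are hypothetical.

```python
import pickle
import sqlite3
from datetime import datetime, timezone


def save_model(conn, name, model, description=""):
    """Serialize the model and add it as a new row in the model table."""
    blob = pickle.dumps(model)
    conn.execute(
        "INSERT INTO r_models (name, description, created_at, model) "
        "VALUES (?, ?, ?, ?)",
        (name, description, datetime.now(timezone.utc).isoformat(), blob),
    )
    conn.commit()


def load_model(conn, name):
    """Fetch the most recently stored model with this name and deserialize it."""
    row = conn.execute(
        "SELECT model FROM r_models WHERE name = ? ORDER BY id DESC LIMIT 1",
        (name,),
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return pickle.loads(row[0])


# In-memory SQLite database as a stand-in for the SQL Server table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE r_models ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, "
    "description TEXT, created_at TEXT, model BLOB)"
)

# A toy 'model': any picklable object works the same way.
toy_model = {"intercept": -1.2, "coefficients": {"age": -0.03, "fare": 0.01}}
save_model(conn, "titanic_survival", toy_model, "toy stand-in for a real model")

restored = load_model(conn, "titanic_survival")
print(restored == toy_model)  # True
```

In the R and SQL Server setup the same roles are played by a serialized R object and a varbinary column.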


Apply the predictive model in SQL Server.

Now that we have a model we can use it to calculate model scores on data in SQL Server. With the new R services in SQL Server 2016 there is a stored procedure called sp_execute_external_script. In this procedure you can call R to calculate model scores.


The scores (and the inputs) are stored in a table.


The code is very generic: it works not only for xgboost models but for any model. The scoring can (and should) be done inside a stored procedure so that it can run at regular intervals or be triggered by certain events.


Deploying predictive models (that are created in R) in SQL Server has become easy with the new SQL R services. It does not require new technology or specialized data engineers. If your company is already making use of SQL Server then integrated R services are definitely something to look at if you want to deploy predictive models!

Some more examples with code can be found on the Microsoft GitHub pages.

Cheers, Longhow

Don’t give up on single trees yet…. An interactive tree with Microsoft R



A few days ago Microsoft announced their new Microsoft R Server 9.0 version. Among a lot of new things, it includes some new and improved machine learning algorithms in their MicrosoftML package.

  • Fast linear learner, with support for L1 and L2 regularization.
  • Fast boosted decision tree.
  • Fast random forest.
  • Logistic regression, with support for L1 and L2 regularization.
  • GPU-accelerated Deep Neural Networks (DNNs) with convolutions.
  • Binary classification using a One-Class Support Vector Machine.

And the nice thing is, the MicrosoftML package is now also available in the Microsoft R client version, which you can download and use for free.

Don’t give up on single trees yet….

Despite all the more modern machine learning algorithms, a good old single decision tree can still be useful. Moreover, in a business analytics context single trees can still keep up in predictive power. In the last few months I have created different predictive response and churn models. I usually just try different learners: logistic regression models, single trees, boosted trees, several neural nets, random forests. In my experience a single decision tree is usually ‘not bad’, often with only slightly less predictive power than the fancier algorithms.

An important thing in analytics is that you can ‘sell‘ your predictive model to the business. A single decision tree is a good way to do just that, and with an interactive decision tree (created by Microsoft R) this becomes even easier.

Here is an example: a decision tree to predict the survival of Titanic passengers.

The interactive version of the decision tree can be found on my GitHub.

Cheers, Longhow

Don’t buy a brand new Porsche 911 or Audi Q7!!



Many people know that nasty feeling when buying a brand new car. The minute that you have left the dealer, your car has lost a substantial amount of value. Unfortunately this depreciation is inevitable; however, the amount depends heavily on the car make and model. A small analysis of data from (used) cars shows these differences.

Car Data

I have used rvest to scrape data from a Dutch website that combines cars-for-sale data from several other sites. The script to get the data is not that difficult; it can be found on my GitHub, together with my analysis script. There are around 435,000 cars. The data for each car consists of: make, model, price, fuel type, transmission and age. There are many different car makes and models; the most frequently occurring cars in my data set are:


Car age vs. Kilometers

Obviously, there is a clear relation between the age of a car and the number of kilometers driven. An interesting pattern to see is that this relation depends on the car make (and model). The following figure shows a few car brands.

Large differences in the amount of driving between car brands start after 18 months. On average, Jaguars are not made for driving: after 60 months their owners have driven only around 83,000 km, while Mercedes-Benz owners have driven around 120,000 km.

A more extreme difference is between the Volvo V50 and the Hyundai i10. Between six and ten years, a Volvo V50 has driven on average 178K kilometers while a Hyundai i10 has driven only 75K kilometers.


A simple depreciation model is just linear depreciation. Per car brand, model, and transmission type, I can fit a straight line through price and kilometers driven. The slope of the line is the depreciation for every kilometer driven. An elegant way to obtain the depreciation per car type is by using the purrr and broom packages.



First, some outlying values are removed, and only car types with enough data points are considered. Then I have grouped the data by brand, model and transmission type, so that for each group a simple linear regression model can be fitted:

Price = Intercept + depreciation * KM
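The original analysis uses purrr and broom in R. As a language-neutral illustration of the same idea, the Python sketch below groups a handful of made-up listings by make, model and transmission and fits an ordinary least squares line per group. The numbers are invented for the example and the helper name `fit_line` is hypothetical.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for price = intercept + slope * km."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up listings: (make, model, transmission, km driven, price in euro).
cars = [
    ("Porsche", "911", "manual", 10_000, 110_000),
    ("Porsche", "911", "manual", 50_000, 90_000),
    ("Porsche", "911", "manual", 90_000, 70_000),
    ("Hyundai", "i10", "manual", 20_000, 8_000),
    ("Hyundai", "i10", "manual", 60_000, 6_000),
    ("Hyundai", "i10", "manual", 100_000, 4_000),
]

# Group the (km, price) pairs per car type.
groups = defaultdict(list)
for make, model, transmission, km, price in cars:
    groups[(make, model, transmission)].append((km, price))

# One straight line per group; the negated slope is the loss per kilometer.
fits = {key: fit_line(points) for key, points in groups.items()}
for key, (intercept, slope) in fits.items():
    print(key, round(intercept), round(-slope * 100, 2), "cents per km")
```

The intercept plays the role of the price at zero kilometers, and the negated slope is the depreciation per kilometer.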

The following table shows the results:


So, on average a new Porsche 911 costs 117,278.60 Euro, and every kilometer you drive will cost you around 49.75 cents in loss of value. The complete table with all car types can be found on RPubs. Although its parameters are simple and easy to interpret, a straight-line model is not realistic, as can be seen in the following figure:


A better model to fit would be a non-linear depreciation model. For example, exponential depreciation or, if you don’t want to specify a specific function, some kind of smoothing spline. The R code only needs to be modified slightly; the code below fits a natural cubic spline per car type.


It is a better model (in terms of R-squared): it follows the non-linear depreciation that we can see in the data. However, we no longer have a single depreciation value. How much value a certain car will lose when driving 1 kilometer now depends on the number of kilometers already driven. It is the derivative of the fitted spline curve. For example, the spline curves fitted for a Renault Clio are given in the figure below. A Clio with automatic transmission hardly loses any value after 100,000 km.


I have created a small shiny app so that you can see the curves of all the car types.


Despite my data science exercise and beautiful natural cubic smoothing splines models, buying a brand new car involves a lot of emotion. My wife wants a blue Citroen C4 Picasso, no matter what cubic spline model and R-squared I show to her!

So just ignore my analysis and buy the car that feels good to you!! Cheers, Longhow.

Danger, Caution H2O steam is very hot!!


H2O has recently released its steam AI engine, a fully open source engine that supports the management and deployment of machine learning models. Both H2O on R and H2O steam are easy to set up and use. And both complement each other perfectly.

A very simple example

Use H2O on R to create some predictive models. Well, due to lack of inspiration I just used the iris set to create some binary classifiers.


Once these models are trained, they are available for use in the H2O steam engine. A nice web interface allows you to set up a project in H2O steam to manage and display summary information of the models.


In H2O steam you can select a model that you want to deploy. It becomes a service with a REST API, and a page is created to test the service.


And that is it! Your predictive model is up and running and waiting to be called from any application that can make REST API calls.

There is a lot more to explore in H2O steam, but be careful H2O steam is very hot!

Some insights in soccer transfers using Market Basket Analysis



Although more than 20 years old, Market Basket Analysis (MBA) (or association rules mining) can still be a very useful technique to gain insights into large transactional data sets. The classical example is transactional data in a supermarket. For each customer we know which individual products (items) he has put in his basket and bought. Other use cases for MBA could be web click data, log files, and even questionnaires.

With market basket analysis we can identify items that are frequently bought together. Usually the results of an MBA are presented in the form of rules. The rules can be as simple as {A ==> B}, when a customer buys item A then it is (very) likely that the customer buys item B. More complex rules are also possible {A, B ==> D, F}, when a customer buys items A and B then it is likely that he buys items D and F.


Soccer transactional data

To perform MBA you of course need data, but I don’t have real transactional data from a retailer that I can present here. So I am using soccer data instead 🙂 From the Kaggle site you can download some soccer data, thanks to Hugo Mathien. The data contains around 25,000 matches from eleven European soccer leagues, starting from season 2008/2009 until season 2015/2016. After some data wrangling I was able to generate a transactional data set suitable for market basket analysis. The data structure is very simple; some records are given in the figure below:

So we do not have customers but soccer players, and we do not have products but soccer clubs. In total, my soccer transactional data set contains around 18,000 records. Obviously, these records do not only include the multi-million transfers covered in the media, but also all the transfers of players nobody has ever heard of 🙂

Market basket results

In R you can use the arules package for MBA / association rules mining. Alternatively, when the order of the transactions is important, like my soccer transfers, you should use the arulesSequences package. After running the algorithm I got some interesting results. The figure below shows the most frequently occurring transfers between clubs:

So in this data set the most frequently occurring transfer is from Fiorentina to Genoa (12 transfers in total). I have published the entire table with the rules on RPubs, where you can look up the transfer activity of your favorite soccer club.
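The simplest "from ==> to" rules are just counts of consecutive club pairs in each player's history. The Python sketch below shows that counting step on a few made-up careers. It is a much simpler stand-in for what arulesSequences does, and the player names and club sequences are invented for illustration.

```python
from collections import Counter

# Made-up records: for each player, the clubs in chronological order.
careers = {
    "player_a": ["Fiorentina", "Genoa", "Roma"],
    "player_b": ["Fiorentina", "Genoa"],
    "player_c": ["Roma", "Sampdoria"],
    "player_d": ["Fiorentina", "Genoa", "Sampdoria"],
}

# Count every consecutive (origin, destination) pair as one transfer.
transfers = Counter()
for clubs in careers.values():
    for origin, destination in zip(clubs, clubs[1:]):
        transfers[(origin, destination)] += 1

for (origin, destination), count in transfers.most_common(1):
    print(f"{origin} ==> {destination}: {count}")  # Fiorentina ==> Genoa: 3
```

Sorting these counts gives a table like the one described above, with the most frequent transfers on top.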

Network graph visualization

All the rules that we get from the association rules mining form a network graph. The individual soccer clubs are the nodes of the graph and each rule “from ==> to” is an edge of the graph. In R, network graphs can be visualized nicely by means of the visNetwork package. The network is shown in the picture below.

An interactive version can be found on RPubs. The different colors represent the different soccer leagues. There are eleven leagues in this data (there are more leagues in Europe), and we see that the Polish league is quite isolated from the rest. The German, Spanish, English and French leagues are almost blended into each other. Less connected are the Scottish and Portuguese leagues, but even in the big English Premier League and German league you will find less connected clubs like Bournemouth, Reading or Arminia Bielefeld.

The size of a node in the above graph represents its betweenness centrality, an indicator of a node’s centrality in a network. It is equal to the number of shortest paths from all vertices to all others that pass through that node. In R, betweenness measures can be calculated with the igraph package. The most central clubs in the transfers of players are Sporting CP, Lechia Gdansk, Sunderland, FC Porto and PSV Eindhoven.
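To show what betweenness centrality computes, here is a small Python sketch of Brandes' algorithm for an unweighted graph. Two assumptions to note: the graph is treated as undirected (the transfer rules themselves are directed), and the five-node example graph is made up. When several shortest paths exist between two nodes, each one counts for an equal fraction.

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality for an unweighted, undirected graph.

    `graph` maps each node to the set of its neighbours.
    """
    scores = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        predecessors = {node: [] for node in graph}
        paths = {node: 0 for node in graph}
        distance = {node: -1 for node in graph}
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in graph[node]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    paths[neighbour] += paths[node]
                    predecessors[neighbour].append(node)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in graph}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = paths[pred] / paths[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Every undirected pair was visited from both of its endpoints.
    return {node: score / 2 for node, score in scores.items()}


# Made-up network: A is a hub, D connects E to the rest.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
}
result = betweenness(graph)
print(result)  # A scores 5.0, D scores 3.0, the leaves score 0.0
```

Node A sits on the shortest path of five node pairs and D on three, which is why they would be drawn as the largest nodes.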

Virtual Items

An old trick among marketeers is to use virtual items in a market basket analysis. Besides the ‘physical’ items that a customer has in his basket, a marketeer can add extra virtual items to the basket. These could be, for example, customer characteristics like age class and sex, but also things like day of the week, region, etc. The transactional data with virtual items might look like:

If you run a MBA on the transactional data with virtual items, interesting rules might appear. For example:

  • {Chocolate, Female ==> Eggs}
  • {Chocolate, Male ==> Apples}
  • {Beer, Friday, Male, Age[18:23] ==> sausages}.

Virtual items that I can add to my soccer transactional data are:

  • Age class, four classes: 1: players younger than 25, 2: [25, 29), 3: [29, 33) and 4: players that are 33 or older.
  • Preferred foot, two classes: left or right.
  • Height class, four classes: 1: players shorter than 178 cm, 2: [178, 183), 3: [183, 186) and 4: players of 186 cm or taller.
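Turning player attributes into virtual items is a simple binning step. The Python sketch below shows one way to do it with the class boundaries listed above. The function names and the item labels such as `age_class=2` are made up for illustration, and a player of exactly 186 cm is assumed to fall in height class 4.

```python
from bisect import bisect_right

# Lower bounds of classes 2, 3 and 4; anything below the first is class 1.
AGE_BOUNDS = [25, 29, 33]
HEIGHT_BOUNDS = [178, 183, 186]


def to_class(value, bounds):
    """Return the 1-based class for half-open intervals [lower, upper)."""
    return bisect_right(bounds, value) + 1


def virtual_items(age, height_cm, preferred_foot):
    """Build the virtual items that are added to a player's transactions."""
    return [
        f"age_class={to_class(age, AGE_BOUNDS)}",
        f"height_class={to_class(height_cm, HEIGHT_BOUNDS)}",
        f"foot={preferred_foot}",
    ]


print(virtual_items(27, 180, "left"))
# ['age_class=2', 'height_class=2', 'foot=left']
```

These labels are then appended to each transfer record before running the rules algorithm again.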

After running the algorithm again, the results allow you to find the frequently occurring transfers of left-footers. I can see 4 left-footers that transferred from Roma to Sampdoria; more rules can be seen on my RPubs site.


When you have transactional data, even as small as the soccer transfers, market basket analysis is definitely one of the techniques you should try to get some first insights. Feel free to look at my R code on GitHub to experiment with the soccer transfers data.

Cheers Longhow.

New chapters for 50 shades of grey….



Some time ago I had the honor to follow an interesting talk from Tijmen Blankevoort on neural networks and deep learning. Convolutional and recurrent neural networks were topics that had already caught my interest, and this talk inspired me to dive deeper into these topics and do some more experiments with them.

In the same session organized by Martin de Lusenet for Ziggo (a Dutch cable company) I also had the honor to give a talk, my presentation contained a text mining experiment that I did earlier on the Dutch TV soap GTST “Goede Tijden Slechte Tijden”. A nice idea by Tijmen was: Why not use deep learning to generate new episode plots for GTST?

So I did that, see my LinkedIn post on GTST. However, these episodes are in Dutch and I guess only interesting for people here in the Netherlands. So to make things more international and spicier, I generated some new texts based on deep learning and the erotic romance novel 50 shades of grey 🙂

More than plain vanilla networks

In R or SAS you have been able to train plain vanilla neural networks for a long time. These are the so-called fully connected networks, where all input nodes are connected to all nodes in the following hidden layer, and all nodes in a hidden layer are connected to all nodes in the following hidden layer or output layer.


In more recent years deep learning frameworks have become very popular. For example Caffe, Torch, CNTK, Tensorflow and MXNET. The added value of these frameworks compared to SAS, for example, is:

  • They support more network types than plain vanilla networks. For example, convolutional networks, where not all input nodes are connected to a next layer. And recurrent networks, where loops are present. A nice introduction to these networks can be found here and here.
  • They support computations on GPU’s, which could speed up things dramatically.
  • They are open-source and free. No need for long sales and implementation cycles 🙂 Just download it and use it!

recurrent neural network

My 50 Shades of Grey experiment

For my experiment I used the text of the erotic romance novel 50 shades of grey. A pdf can be found here; I used xpdfbin to extract all the words into a plain text file. I trained a Long Short Term Memory network (LSTM, a special type of recurrent network) with MXNET. The reason to use MXNET is that it has a nice R interface, so that I can just stay in my comfortable RStudio environment.

Moreover, the R example script of MXNET is ready to run, I just changed the input data and used more rounds of training and more hidden layers. The script and the data can be found on Github.

The LSTM model is fit on character level. The complete romance novel contains 817,204 characters, and each character is mapped to a number (91 unique numbers). The first few numbers are shown in the following figure.
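The character-to-number mapping is easy to picture in code. Below is a minimal Python sketch of that preprocessing step, not the MXNET example script itself. The sample text is a placeholder and the variable names are made up. With the full novel the vocabulary would contain the 91 unique characters mentioned above.

```python
# Placeholder text; in the experiment this would be the whole novel.
text = "hello world"

# One id per unique character, starting at 1.
vocabulary = sorted(set(text))
char_to_id = {ch: i + 1 for i, ch in enumerate(vocabulary)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# Encode the text as a sequence of numbers and decode it again.
encoded = [char_to_id[ch] for ch in text]
decoded = "".join(id_to_char[i] for i in encoded)

print(len(vocabulary))  # 8 unique characters in the placeholder text
print(decoded == text)  # True
```

The network is trained on the encoded sequence, and generated numbers are mapped back to characters in the same way.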


Once the model has been trained it can generate new text, character by character!

arsess whatever
yuu’re still expeliar a sally. Reftion while break in a limot.”
“Yes, ald what’s at my artmer and brow maned, but I’m so then for a
dinches suppretion. If you think vining. “Anastasia, and depregineon posing rave.
He’d deharing minuld, him drits.

“Miss Steele
“Fasting at liptfel, Miss I’ve dacind her leaches reme,” he knimes.
“I want to blight on to the wriptions of my great. I find sU she asks the stroke, to read with what’s old both – in our fills into his ear, surge • whirl happy, this is subconisue. Mrs. I can say about the battractive see. I slues
is her ever returns. “Anab.

It’s too even ullnes. “By heaven. Grey
about his voice. “Rest of the meriction.”
He scrompts to the possible. I shuke my too sucking four finishessaures. I need to fush quint the only more eat at me.
“Oh my. Kate. He’s follower socks?
“Lice in Quietly. In so morcieut wait to obsed teach beside my tired steately liked trying that.”
Kate for new of its street of confcinged. I haven’t Can regree.
“Where.” I fluscs up hwindwer-and I have

I’ll staring for conisure, pain!”
I know he’s just doesn’t walk to my backeting on Kate has hotelby of confidered Christaal side, supproately. Elliot, but it’s the ESca, that feel posing, it make my just drinking my eyes bigror on my head. S I’ll tratter topality butterch,” I mud
a nevignes, bleamn. “It’s not by there soup. He’s washing, and I arms and have. I wave to make my eyes. It’s forgately? Dash I’d desire to come your drink my heathman legt
you hay D1 Eyep, Christian Gry, husder with a truite sippking, I coold behind, it didn’t want to mive not to my stop?”

“Sire, stcaring it was do and he licks his viice ever.”
I murmurs, most stare thut’s the then staraline for neced outsive. She
so know what differ at,” he murmurs?
“I shake my headanold.” Jeez.
“Are you?” Eviulder keep “Oh,_ I frosing gylaced in – angred. I am most drink to start and try aparts through. I really thrial you, dly woff you stund, there, I care an right dains to rainer.” He likes his eye finally finally my eyes to over opper heaven, places my trars his Necked her jups.
“Do you think your or Christian find at me, is so with that stand at my mouth sait the laxes any litee, this is a memory rude. It
flush,” He says usteer?” “Are so that front up.
I preparraps. I don’t scomine Kneat for from Christian.
“Christian,’! he leads the acnook. I can’t see. I breathing Kate’ve bill more over keen by. He releases?”
“I’m kisses take other in to peekies my tipgents my


The generated text does not make any sense, nor will it win any literature prize soon 🙂 Keep in mind that the model is based ‘only’ on 817,204 characters (which is considered a small number), and I did not bother to fine-tune the model at all. But still it is funny and remarkable to see that when you use it to generate text, character by character, it can still produce a lot of correct English words and even some correct basic grammar patterns!

Cheers, Longhow.


The Eurovision 2016 song contest in an R Shiny app


In just a few weeks the Eurovision 2016 song contest will be held again. There are 43 participants, two semi-finals on the 10th and 12th of May and a final on the 14th of May. It’s going to be a long watch in front of the television…. 🙂 Who is going to win? Well, you could ask experts, look up the number of tweets on the different participants, count YouTube likes or go to bookmakers’ sites. At the time of writing Russia was the favorite among the bookmakers according to this overview of bookmakers.

Spotify data

As an alternative, I used Spotify data. There is a Spotify API which allows you to get information on playlists, artists, tracks, etc. It is not difficult to extract interesting information from the API:

  • Sign up for a (Premium or Free) Spotify account
  • Register a new application on the ‘My Applications‘ site
  • You will then get a client ID and a client Secret

In R you can use the httr library to make API calls. First, with the client ID and secret you need to retrieve a token, then with the token you can call one of the Spotify API endpoints, for example information on a specific artist, see the R code snippet below.


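library(httr)

clientID = '12345678910'
secret = 'your-client-secret'   # placeholder, use the secret of your own application

## Request an access token with the client credentials flow.
## The URL is missing from the original listing; this is the token
## endpoint given in the Spotify Web API documentation.
response = POST(
  'https://accounts.spotify.com/api/token',
  authenticate(clientID, secret),
  body = list(grant_type = 'client_credentials'),
  encode = 'form'
)

mytoken = content(response)$access_token

## Frank Sinatra spotify artist ID
artistID = '1Mxqyy3pSjf8kZZL4QVxS0'

HeaderValue = paste0('Bearer ', mytoken)

## Artist endpoint, likewise taken from the Spotify Web API documentation.
URI = paste0('https://api.spotify.com/v1/artists/', artistID)
response2 = GET(url = URI, add_headers(Authorization = HeaderValue))
Artist = content(response2)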

The content of the second response object is a nested list with information on the artist. For example url links to images, the number of followers, the popularity of an artist, etc.

Track popularity

An interesting API endpoint is the track API. Especially the information on the track popularity. What is the track popularity? Taken from the Spotify web site:

The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.

I wrote a small R script to retrieve, every hour, the track popularity of each of the 43 tracks that participate in this year’s Eurovision song contest. The picture below lists the top 10 most popular tracks of the song contest participants.


At the time of writing the most popular track was “If I Were Sorry” by Frans (Sweden), which is placed third by the bookmakers. The least popular track was “The Real Thing” by Highway (Montenegro), corresponding to the last place with the bookmakers.

There is not a lot of movement in the track popularity; it is very stable over time. Maybe when we get nearer to the song contest final in May we’ll see some more movement. I have also kept track of the number of followers that an artist has. There is much more movement here. See the figure below.


Every day around 5 pm – 6 pm Frans gets around 10 to 12 new followers on Spotify! Artists may of course also lose some followers, for example Douwe Bob in the above picture.

Audio features and related artists

Audio features of tracks like loudness, danceability, tempo, etc. can also be retrieved from the API. A simple scatter plot of the 43 songs reveals loud and undanceable songs, for example the one by Francesca Michielin (Italy); she is one of the six lucky artists that already have a place in the final!


Every artist on Spotify also has a set of related artists; this set can be retrieved from the API and can be viewed nicely in a network graph.


The green nodes are the 43 song contest participants. Many of them are ‘isolated’ but some of them are related to each other or connected through a common related artist.


I have created a small Eurovision 2016 Shiny app that summarizes the above information so you can see and listen for yourself. We will find out how strongly the Spotify track popularity is correlated with the final ranking of the Eurovision song contest on May the 14th!

Cheers, Longhow.

Delays on the Dutch railway system

I almost never travel by train, the last time was years ago. However, recently I had to take the train from Amsterdam and it was delayed for 5 minutes. No big deal, but I was just curious how often these delays occur on the Dutch railway system. I couldn’t quickly find a historical data set with information on delays, so I decided to gather my own data.

The Dutch Railways provide an API (De NS API) that returns current departure and delay data for a given train station. I have written a small R script that calls this API for each of the 400 train stations in The Netherlands. This script is then scheduled to run every 10 minutes. The API returns data in XML format; the basic entity is “a departing train”. For each departing train we know its departure time, the destination, the departing train station, the type of train, the delay (if there is any), etc. So what to do with all these departing trains? Throw it all into MongoDB. Why?

  • Not for any particular reason :-).
  • It’s easy to install and setup on my little Ubuntu server.
  • There is a nice R interface to MongoDB.
  • The response structure (see picture below) from the API is not that difficult to flatten to a table, but NoSQL sounds more sexy than MySQL nowadays 🙂


I started to collect train departure data on the 4th of January; per day there are around 48,000 train departures in The Netherlands. I can see how many of them are delayed, per day, per station or per hour. Of course, since the collection started only a few days ago it’s hard to use these data for long-term delay rates of the Dutch railway system. But it is a start.
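Once the departures are flattened to simple records, a delay rate per station is a matter of counting. The Python sketch below illustrates this on a few made-up records. It is not the R and MongoDB code behind the app, and the function name `delay_rates` is hypothetical.

```python
from collections import defaultdict

# Made-up departures: (departing station, delay in minutes; 0 means on time).
departures = [
    ("Amsterdam Centraal", 0),
    ("Amsterdam Centraal", 5),
    ("Groningen", 12),
    ("Groningen", 3),
    ("Utrecht Centraal", 0),
    ("Utrecht Centraal", 0),
]


def delay_rates(records):
    """Return the fraction of delayed departures per station."""
    totals = defaultdict(int)
    delayed = defaultdict(int)
    for station, delay_minutes in records:
        totals[station] += 1
        if delay_minutes > 0:
            delayed[station] += 1
    return {station: delayed[station] / totals[station] for station in totals}


rates = delay_rates(departures)
print(rates["Amsterdam Centraal"])  # 0.5
print(rates["Groningen"])           # 1.0
```

Grouping by day or by hour instead of by station works the same way with a different key.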

To present this delay information in an interactive way to others I have created an R Shiny app that queries the MongoDB database. The picture below from my Shiny app shows the delay rates per train station on the 4th of January 2016, an icy day especially in the north of the Netherlands.