Dutch data science poetry


Sorry, hebben jullie heel even? Een dag uit een data saai-entist leven.

Dit is dan weer zo’n dag, waar helemaal niets meer mag. Wacht al een uur op mijn query, crasht ie!!! en doet ie het weer nie.

Ik train een mooie decision tree, Maar mijn model redt het nie. Doe dan maar mijn neuraal netwerk, Hmm, voorspellend vermogen, ook niet sterk.

Dan maar een rondje langs de business, ze snappen me niet, er is iets mis! Het regent buiten echt pijpenstelen, tijd om mijn werk maar op Git te delen.

Recruiters staan op je voice mail te gillen, Thuis ben je gewoon de aardappels aan het schillen. Nou ja, Tijd om naar bed te gaan, morgen maar weer verder met mijn sexy baan!



Echt HEMA….

Er zijn zoveel interessante technieken in het data science vakgebied, het is moeilijk om dat allemaal bij te houden. Werd laatst getriggered door UMAP (Uniform Manifold Approximation and Projection) voor dimensie reductie. 

Als ik het artikel (https://lnkd.in/eGPxspP) doorlees krijg ik nostalgische herinneringen aan mijn Analyse I, II en III vakken aan de VU over o.a. Riemann manifolds!

Een friovole/ludieke vingeroefening met HEMA data om eens te spelen met UMAP. De stappen zijn simpel:

  1. Scrape wat Hema product plaatjes van de site.
  2. Haal deze plaatjes door een pre-trained deeplearning netwerk waar de top layer vanaf is gehaald.
  3. Elk plaatje is nu omgezet in een hoog dimensionele feature vector.
  4. Pas UMAP toe om de dimensie naar drie te reduceren.
  5. Zet in een interactieve Shiny app….. see here


Amsterdam in an R leaflet nutshell


The municipal services of Amsterdam (The Netherlands) is providing open panorama images. See here and here. A camera car has driven around in the city, and now you can download these images.

Per neighborhood of Amsterdam  I randomly sampled 20 images and put them in an animated gif using R magick and the put it on a interactive leaflet map.

Before you book your tickets to Amsterdam, have a quick look here on the leaflet first 🙂


Is that a BMW or a Peugeot?



My son is 8 years old and he has shown a lot of interest in cars, which is strange because I have zero interest in cars. But he is driving me crazy when we have a car ride: “dad is that an Peugeot?“, “dad, that is an Audi” and “that is a BMW, right?“, “That is another cool BMW, why don’t we have a BMW?“. He is pretty accurate, close to 100%! I was curious how accurate a very simple model could get. Just a re-use of a pre-trained image model approach on my laptop without any GPU’s.

Image Data

There is a nice python package google-images-download to help you download certain images.

from google_images_download import google_images_download 
response = google_images_download.googleimagesdownload()  

arguments = {
  "keywords": "BMW,PEUGEOT",
  "print_urls": False,
  "suffix_keywords": "car",
  "output_directory": "TMP",
   "format": "png"


The above code will get you images of BMW’s and Peugeots, the problem though is that not all images are actually cars. You’ll see scooters, navigation systems and garages. Moreover, some downloaded files do not open at all.


So first, we can use a pre-trained resnet50 or vgg16 image classifier and run the downloaded files through this classifier and keep only the images that keras can open and were classified as car or wagon. Then the images are organized in the following folder structure

├── training
│   ├── bmw (150 images)
│   └── peugeot(150 images)
└── validation
    ├── bmw (50 images)
    └── peugeot(50 images)

Predictive model

I am using the most simple approach, both in terms of modeling and computational effort. It is described in section 5.3 of this fantastic book “Deep Learning in R” by François Chollet and J. J. Allaire.

  • Take a pretrained network, say VGG16, remove the top so that you only have a convolutional base.
  • Now run your images trough this base so that each image is a tensor.
  • Treat these tensors as input for a complete separate neural network classifier. For example a simple one hidden fully connected layer with 256 neurons, shown in the code snippet below.
model <- keras_model_sequential() %>%
    units = 256, 
    activation = "relu", 
    input_shape = 4 * 4 * 512
  ) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = "sigmoid")

The nice thing is that once you have put your images in the proper folder structure you can just ‘shamelessly’ copy/paste the code from the accompanying markdown of the book and start training a BMW-Peugeot model.



After 15 epochs or so the accuracy on the validation images flattens of to around 80% which is not super good and not even close to what my son can achieve 🙂 But it is not too bad either for just 30 minutes of work in R, mostly copy pasting code….. Cheers, Longhow.

t-sne dimension reduction on Spotify mp3 samples

Schermafdruk van 2018-01-30 12-25-08


Not long ago I was reading on t-Distributed Stochastic Neighbor Embedding (t-sne), a very interesting dimension reduction technique, and on Mel frequency cepstrum a sound processing technique. Details of both techniques can be found here and here. Can we combine the two in a data analysis exercise? Yes, and with not too much R code you can already quickly create some visuals to get ‘musical’ insights.

Spotify Data

Where can you get some sample audio files? Spotify! There is a Spotify API which allows you to get information on playlists, artists, tracks, etc. Moreover, for many songs (not all though) Spotify provides downloadable preview mp3’s of 30 seconds. The link to the preview mp3 can be retrieved from the API. I am going to use some of these mp3’s for analysis.

In the web interface of Spotify you can look for interesting playlists. In the search field type in for example ‘Bach‘ (my favorite classical composer). In the search results go to the playlists tab, you’ll find many ‘Bach’ playlists from different users, including the ‘user’ Spotify itself. Now, given the user_id (spotify) and the specific playlist_id (37i9dQZF1DWZnzwzLBft6A for the Bach playlist from Spotify) we can extract all the songs using the API:

 GET https://api.spotify.com/v1/users/{user_id}/playlists/{playlist_id}

You will get the 50 Bach songs from the playlist, most of them have a preview mp3. Let’s also get the songs from a Heavy Metal play list, and a Michael Jackson play list. In total I have 146 songs with preview mp3’s in three ‘categories’:

  • Bach,
  • Heavy Metal,
  • Michael Jackson.

Transforming audio mp3’s to features

The mp3 files need to be transformed to data that I can use for machine learning, I am going to use the Python librosa package to do this. It is easy to call it from R using the reticulate package.

librosa = import("librosa")

#### python environment with librosa module installed
use_python(python = "/usr/bin/python3")

The downloaded preview mp3’s have a sample rate of 22.050. So a 30 second audio file has in total 661.500 raw audio data points.

onemp3 = librosa$load("mp3songs/bach1.mp3")

length(onemp3[[1]])/onemp3[[2]]  # ~30 seconds sound

## 5 seconds plot
pp = 5*onemp3[[2]]
plot(onemp3[[1]][1:pp], type="l")

A line plot of the raw audio values will look like.


For sound processing, features extraction on the raw audio signal is often applied first. A commonly used feature extraction method is Mel-Frequency Cepstral Coefficients (MFCC). We can calculate the MFCC for a song with librosa.

ff = librosa$feature
mel = librosa$logamplitude(
    sr = onemp3[[2]],

Each mp3 is now a matrix of MFC Coefficients as shown in the figure above. We have less data points than the original 661.500 data points but still quit a lot. In our example the MFCC are a 96 by 1292 matrix, so 124.032 values. We apply a the t-sne dimension reduction on the MFCC values.

Calculating t-sne

A simple and easy approach, each matrix is just flattened. So a song becomes a vector of length 124.032. The data set on which we apply t-sne consist of 146 records with 124.032 columns, which we will reduce to 3 columns with the Rtsne package:

tsne_out = Rtsne(AllSongsMFCCMatrix, dims=3) 

The output object contains the 3 columns, I have joined it back with the data of the artists and song names so that I can create an interactive 3D scatter plot with R plotly. Below is a screen shot, the interactive one can be found here.



It is obvious that Bach music, heavy metal and Michael Jackson are different, you don’t need machine learning to hear that. So as expected, it turns out that a straight forward dimension reduction on these songs with MFCC and t-sne clearly shows the differences in a 3D space. Some Michael Jackson songs are very close to heavy metal 🙂 The complete R code can be found here.

Cheers, Longhow

The ‘I-Love-IKEA’ – web app, built at the IKEA Hackaton with R and Shiny



On the 8th, 9th and 10th of December I participated at the IKEA hackaton. In one word it was just FANTASTIC! Well organized, good food, and participants from literally all over the world, even the heavy snow fall on Sunday did not stop us from getting there!



I formed a team with Jos van Dongen and his son Thomas van Dongen and we created the “I-Love-IKEA” app to help customers find IKEA products. And of course using R.


The “I-Love-IKEA” Shiny R app

The idea is simple. Suppose you are in the unfortunate situation that you are not in an IKEA store, and you see a chair, or nice piece of furniture, or something completely else…. Now does IKEA have something similar? Just take a picture, upload it using the I-Love-IKEA R Shiny app and get the best matching IKEA products back.

Implementation in R

How was this app created? The following steps outline steps that we took during the creation of the web app for the hackaton.

First, we have scraped 9000 IKEA product images using Rvest, then each image is ‘scored’ using a pre-trained VGG16 network, where the top layers are removed.



That means that for each IKEA image we have a 7*7*512 tensor, we flattened this tensor to a 25088 dimensional vector. Putting all these vectors in a matrix we have a 9000 by 25088 matrix.

If you have a new image, we use the same pre-trained VGG16 network to generate a 25088 dimensional vector for the new image. Now we can calculate all the 9000 distances (for example cosine similarity) between this new image and the 9000 IKEA images. We select, say, the top 7 matches.

A few examples


A Shiny web app

To make this useful for an average consumer, we have put it all in an R Shiny app using the library ‘minUI‘ so that the web site is mobile friendly. A few screen shots:



The web app starts with an ‘IKEA-style’ instruction, then it allows you to take a picture with your phone, or use one that you already have on your phone. It uploads the image and searches for the best matching IKEA products.


The R code is available from my GitHub, and a live running Shiny app can be found here.


Obviously, there are still many adjustments you can make to the app to improve the matching. For example pre process the images before they are sent through the VGG network. But there was no more time.

Unfortunately, we did not win a price during the hackaton, the jury did however find our idea very interesting. More importantly, we had a lot of fun. In Dutch “Het waren 3 fantastische dagen!”.

Cheers, Longhow.

Association rules using FPGrowth in Spark MLlib through SparklyR



Market Basket Analysis or association rules mining can be a very useful technique to gain insights in transactional data sets, and it can be useful for product recommendation. The classical example is data in a supermarket. For each customer we know what the individual products (items) are that he has bought. With association rules mining we can identify items that are frequently bought together. Other use cases for MBA could be web click data, log files, and even questionnaires.

In R there is a package arules to calculate association rules, it makes use of the so-called Apriori algorithm. For data sets that are not too big, calculating rules with arules in R (on a laptop) is not a problem. But when you have very huge data sets, you need to do something else, you can:

  • use more computing power (or cluster of computing nodes).
  • use another algorithm, for example FP Growth, which is more scalable. See this blog for some details on Apriori vs. FP Growth.

Or do both of the above points by using FPGrowth in Spark MLlib on a cluster. And the nice thing is: you can stay in your familiar R Studio environment!

Spark MLlib and sparklyr

Example Data set

We use the example groceries transactions data in the arules package. It is not a big data set and you would definitely not need more than a laptop, but it is much more realistic than the example given in the Spark MLlib documentation :-).

Preparing the data

I am a fan of sparklyr 🙂 It offers a good R interface to Spark and MLlib. You can use dplyr syntax to prepare data on Spark, it exposes many of the MLlib machine learning algorithms in a uniform way. Moreover, it is nicely integrated into the RStudio environment offering the user views on Spark data and a way to manage the Spark connection.


First connect to spark and read in the groceries transactional data, and upload the data to Spark. I am just using a local spark install on my Ubuntu laptop.

###### sparklyr code to perform FPGrowth algorithm ############


#### spark connect #########################################
sc <- spark_connect(master = "local")

#### first create some dummy data ###########################
transactions = readRDS("transactions.RDs")

#### upload to spark #########################################  
trx_tbl  = copy_to(sc, transactions, overwrite = TRUE)

For demonstration purposes, data is copied in this example from the local R session to Spark. For large data sets this is not feasible anymore, in that case data can come from hive tables (on the cluster).

The figure above shows the products purchased by the first four customers in Spark in an RStudio grid. Although transactional systems will often output the data in this structure, it is not what the FPGrowth model in MLlib expects. It expects the data aggregated by id (customer) and the products inside an array. So there is one more preparation step.

# data needs to be aggregated by id, the items need to be in a list
trx_agg = trx_tbl %>% 
   group_by(id) %>% 
      items = collect_list(item)

The figure above shows the aggregated data, customer 12, has a list of 9 items that he has purchased.

Running the FPGrowth algorithm

We can now run the FPGrowth algorithm, but there is one more thing. Sparklyr does not expose the FPGrowth algorithm (yet), there is no R interface to the FPGrowth algorithm. Luckily, sparklyr allows the user to invoke the underlying Scala methods in Spark. We can define an new object with invoke_new

  uid = sparklyr:::random_string("fpgrowth_")
  jobj = invoke_new(sc, "org.apache.spark.ml.fpm.FPGrowth", uid) 

Now jobj is an object of class FPGrowth in Spark.

  class org.apache.spark.ml.fpm.FPGrowth

And by looking at the Scala documentation of FPGrowth we see that there are more methods that you can use. We need to use the function invoke, to specify which column contains the list of items, to specify the minimum confidence and to specify the minimum support.

jobj %>% 
    invoke("setItemsCol", "items") %>%
    invoke("setMinConfidence", 0.03) %>%
    invoke("setMinSupport", 0.01)  %>%
    invoke("fit", spark_dataframe(trx_agg))

By invoking fit, the FPGrowth algorithm is fitted and an FPGrowthModel object is returned where we can invoke associationRules to get the calculated rules in a spark data frame

rules = FPGmodel %>% invoke("associationRules")
The rules in the spark data frame consists of an antecedent column (the left hand side of the rule), a consequent column (the right hand side of the rule) and a column with the confidence of the rule. Note that the antecedent and consequent are lists of items! If needed we can split these lists and collect them to R for plotting for further analysis.

The invoke statements and rules extractions statements can of course be wrapped inside functions to make it more reusable. So given the aggregated transactions in a spark table trx_agg, you can get something like:

GroceryRules =  ml_fpgrowth(
) %>%



The complete R script can be found on my GitHub. If arules in R on your laptop is not workable anymore because of the size of your data, consider FPGrowth in Spark through sparklyr.

cheers, Longhow