Is that a BMW or a Peugeot?



My son is 8 years old and he has shown a lot of interest in cars, which is strange because I have zero interest in cars. But he is driving me crazy when we have a car ride: “dad is that an Peugeot?“, “dad, that is an Audi” and “that is a BMW, right?“, “That is another cool BMW, why don’t we have a BMW?“. He is pretty accurate, close to 100%! I was curious how accurate a very simple model could get. Just a re-use of a pre-trained image model approach on my laptop without any GPU’s.

Image Data

There is a nice python package google-images-download to help you download certain images.

from google_images_download import google_images_download 
response = google_images_download.googleimagesdownload()  

arguments = {
  "keywords": "BMW,PEUGEOT",
  "print_urls": False,
  "suffix_keywords": "car",
  "output_directory": "TMP",
   "format": "png"

The above code will get you images of BMW’s and Peugeots, the problem though is that not all images are actually cars. You’ll see scooters, navigation systems and garages. Moreover, some downloaded files do not open at all.


So first, we can use a pre-trained resnet50 or vgg16 image classifier and run the downloaded files through this classifier and keep only the images that keras can open and were classified as car or wagon. Then the images are organized in the following folder structure

├── training
│   ├── bmw (150 images)
│   └── peugeot(150 images)
└── validation
    ├── bmw (50 images)
    └── peugeot(50 images)

Predictive model

I am using the most simple approach, both in terms of modeling and computational effort. It is described in section 5.3 of this fantastic book “Deep Learning in R” by François Chollet and J. J. Allaire.

  • Take a pretrained network, say VGG16, remove the top so that you only have a convolutional base.
  • Now run your images trough this base so that each image is a tensor.
  • Treat these tensors as input for a complete separate neural network classifier. For example a simple one hidden fully connected layer with 256 neurons, shown in the code snippet below.
model <- keras_model_sequential() %>%
    units = 256, 
    activation = "relu", 
    input_shape = 4 * 4 * 512
  ) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1, activation = "sigmoid")

The nice thing is that once you have put your images in the proper folder structure you can just ‘shamelessly’ copy/paste the code from the accompanying markdown of the book and start training a BMW-Peugeot model.



After 15 epochs or so the accuracy on the validation images flattens of to around 80% which is not super good and not even close to what my son can achieve 🙂 But it is not too bad either for just 30 minutes of work in R, mostly copy pasting code….. Cheers, Longhow.


t-sne dimension reduction on Spotify mp3 samples

Schermafdruk van 2018-01-30 12-25-08


Not long ago I was reading on t-Distributed Stochastic Neighbor Embedding (t-sne), a very interesting dimension reduction technique, and on Mel frequency cepstrum a sound processing technique. Details of both techniques can be found here and here. Can we combine the two in a data analysis exercise? Yes, and with not too much R code you can already quickly create some visuals to get ‘musical’ insights.

Spotify Data

Where can you get some sample audio files? Spotify! There is a Spotify API which allows you to get information on playlists, artists, tracks, etc. Moreover, for many songs (not all though) Spotify provides downloadable preview mp3’s of 30 seconds. The link to the preview mp3 can be retrieved from the API. I am going to use some of these mp3’s for analysis.

In the web interface of Spotify you can look for interesting playlists. In the search field type in for example ‘Bach‘ (my favorite classical composer). In the search results go to the playlists tab, you’ll find many ‘Bach’ playlists from different users, including the ‘user’ Spotify itself. Now, given the user_id (spotify) and the specific playlist_id (37i9dQZF1DWZnzwzLBft6A for the Bach playlist from Spotify) we can extract all the songs using the API:


You will get the 50 Bach songs from the playlist, most of them have a preview mp3. Let’s also get the songs from a Heavy Metal play list, and a Michael Jackson play list. In total I have 146 songs with preview mp3’s in three ‘categories’:

  • Bach,
  • Heavy Metal,
  • Michael Jackson.

Transforming audio mp3’s to features

The mp3 files need to be transformed to data that I can use for machine learning, I am going to use the Python librosa package to do this. It is easy to call it from R using the reticulate package.

librosa = import("librosa")

#### python environment with librosa module installed
use_python(python = "/usr/bin/python3")

The downloaded preview mp3’s have a sample rate of 22.050. So a 30 second audio file has in total 661.500 raw audio data points.

onemp3 = librosa$load("mp3songs/bach1.mp3")

length(onemp3[[1]])/onemp3[[2]]  # ~30 seconds sound

## 5 seconds plot
pp = 5*onemp3[[2]]
plot(onemp3[[1]][1:pp], type="l")

A line plot of the raw audio values will look like.


For sound processing, features extraction on the raw audio signal is often applied first. A commonly used feature extraction method is Mel-Frequency Cepstral Coefficients (MFCC). We can calculate the MFCC for a song with librosa.

ff = librosa$feature
mel = librosa$logamplitude(
    sr = onemp3[[2]],


Each mp3 is now a matrix of MFC Coefficients as shown in the figure above. We have less data points than the original 661.500 data points but still quit a lot. In our example the MFCC are a 96 by 1292 matrix, so 124.032 values. We apply a the t-sne dimension reduction on the MFCC values.

Calculating t-sne

A simple and easy approach, each matrix is just flattened. So a song becomes a vector of length 124.032. The data set on which we apply t-sne consist of 146 records with 124.032 columns, which we will reduce to 3 columns with the Rtsne package:

tsne_out = Rtsne(AllSongsMFCCMatrix, dims=3) 

The output object contains the 3 columns, I have joined it back with the data of the artists and song names so that I can create an interactive 3D scatter plot with R plotly. Below is a screen shot, the interactive one can be found here.


It is obvious that Bach music, heavy metal and Michael Jackson are different, you don’t need machine learning to hear that. So as expected, it turns out that a straight forward dimension reduction on these songs with MFCC and t-sne clearly shows the differences in a 3D space. Some Michael Jackson songs are very close to heavy metal 🙂 The complete R code can be found here.

Cheers, Longhow

Dataiku 4.1.0: More support for R users!



Recently, Dataiku 4.1.0 was released, it now offers much more support for R users. But wait a minute, Data-what? I guess some of you do not know Dataiku, so what is Dataiku in the first place? It is a collaborative data science platform created to design and run data products at scale. The main themes of the product are:

Collaboration & Orchestration: A data science project often involves a team of people with different skills and interests. To name a few, we have data engineers, data scientists, business analysts, business stakeholders, hardcore coders, R users and Python users. Dataiku provides a platform to accommodate the needs of these different roles to work together on data science projects.

Productivity: Whether you like hard core coding or are more GUI oriented, the platform offers an environment for both. A flow interface can handle most of the steps needed in a data science project, and this can be enriched by Python or R recipes. Moreover, a managed notebook environment is integrated in Dataiku to do whatever you want with code.

Deployment of data science products: As a data scientist you can produce many interesting stuff, i.e. graphs, data transformations, analysis, predictive models. The Dataiku platform facilitates the deployment of these deliverables, so that others (in your organization) can consume them. There are dashboards, web-apps, model API’s, productionized model API’s and data pipelines.


There is a free version which contains already a lot of features to be very useful, and there is an paid version, with “enterprise features“. See for the Dataiku website for more info.

Improved R Support in 4.1.0

Among many new features, and the one that interests me the most as an R user, is the improved support for R. In previous versions of Dataiku there was already some support for R, this version has the following improvements. There is now support for:

R Code environments

In Dataiku you can now create so-called code environments for R (and Python). A code environment is a standalone and self-contained environment to run R code. Each environment can have its own set of packages (and specific versions of packages). Dataiku provides a handy GUI to manage different code environments. The figure below shows an example code environment with specific packages.


In Dataiku whenever you make use of R –> in R recipes, Shiny, R Markdown or creating R API’s you can select a specific R code environment to use.

R Markdown reports & Shiny applications

If you are working in RStudio you most likely already know R Markdown documents and Shiny applications. In this version, you can also create them in Dataiku. Now, why would you do that and not just create them in RStudio? Well, the reports and shiny apps become part of the Dataiku environment and so:

  • They are managed in the environment. You will have a good overview of all reports and apps and see who has created/edited them.
  • You can make use of all data that is already available in the Dataiku environment.
  • Moreover, the resulting reports and Shiny apps can be embedded inside Dataiku dashboards.

The figure above shows a R markdown report in Dataiku, the interface provides a nice way to edit the report, alter settings and publish the report. Below is an example dashboard in Dataiku with a markdown and Shiny report.


Creating R API’s

Once you created an analytical model in R, you want to deploy it and make use of its predictions. With Dataiku you can now easily expose R prediction models as an API. In fact, you can expose any R function as an API. The Dataiku GUI provides an environment where you can easily set up and test an R API’s. Moreover, The Dataiku API Node, which can be installed on a (separate) production server imports the R models that you have created in the GUI and can take care of load balancing, high availability and scaling of real-time scoring.

The following three figures give you an overview of how easy it is to work with the R API functionality.

First, define an API endpoint and R (prediction) function.


Then, define the R function, it can make use of data in Dataiku, R objects created earlier or any R library you need.


Then, test and deploy the R function. Dataiku provides a handy interface to test your function/API.


Finally, once you are satisfied with the R API you can make a package of the API, that package can then be imported on a production server with Dataiku API node installed. From which you can then serve API requests.


The new Dataiku 4.1.0 version has a lot to offer for anyone involved in a data science project. The system already has a wide range support for Python, but now with the improved support for R, the system is even more useful to a very large group of data scientists.

Cheers, Longhow.

Test driving Python integration in R, using the ‘reticulate’ package



Not so long ago RStudio released the R package ‘reticulate‘, it is an R interface to Python. Of course, it was already possible to execute python scripts from within R, but this integration takes it one step further. Imported Python modules, classes and functions can be called inside an R session as if it were just native R functions.

Below you’ll find some screen shot code snippets of using certain Python modules within R with the reticulate package. On my GitHub page you’ll find the R files from which these snippets were taken from.

Using python packages

The nice thing about reticulate in RStudio is the support for code completion. When you have imported a python module, RStudio will recognize the methods that are available in the python module:


The clarifai module

Clarifai provides a set of computer vision API’s for image recognition, face detection, extracting tags, etc. There is an official python module and there is also an R package by Guarav Sood, but it exposes less functionality. So I am going to use the python module in R. The following code snippet shows how easy it is to call python functions.


The output returned from the clarifai call is a nested list and can be quit intimidating at first sight. To browse trough these nested lists and to get a better idea of what is in those lists, you can use the package listviewer:


The module

The module contains a fast part-of-speech tagger for Dutch, sentiment analysis, and tools for Dutch verb conjugation and noun singularization & pluralization. At the moment it does not support python 3. That is not a big deal, I am using Anaconda and created a Python 2.7 environment to install

The nice thing of the reticulate package is that it allows you to choose a specific Python environment to be used.


The pytorch module

pytorch is a python package that provides tensor computations and deep neural networks. There is no ‘R torch’ equivalent, but we can use reticulate in R. There is an example of training a logistic regression in pytorch, see the code here. It takes just a little rewrite of this code to make this work in R. See the first few lines in the figure below.



As a data scientist you should know both R and Python, the reticulate package is no excuse for not learning Python! However, the reticulate package can be very useful if you want to do all your analysis in the RStudio environment. It works very well.

For example, I have used rvest to scrape some Dutch news texts, then used the Python module for Dutch sentiment and wrote an R Markdown document to present the results. Then the reticulate package is a nice way to keep everything in one environment.

Cheers, Longhow