Are you leaking h2o? Call plumber!

Create a predictive model with the h2o package.

H2O is a fantastic open source machine learning platform with many different algorithms. There is a graphical user interface, a Python interface and an R interface. Suppose you want to create a predictive model and you are lazy: then just run automl.

Let's say we have both train and test data sets, where the first column is the target and columns 2 to 50 are the input features. Then we can use the following code in R:

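library(h2o)
h2o.init()   # start or connect to a local h2o cluster

# TrainData and TestData are H2OFrames (e.g. created with as.h2o or h2o.importFile)
out = h2o.automl(
   x = 2:50,
   y = 1,
   training_frame = TrainData,
   validation_frame = TestData,
   max_runtime_secs = 1800   # stop after 30 minutes
)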

According to the help documentation: The current version of automl trains and cross-validates a Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, and then trains a Stacked Ensemble using all of the models.

After a time period that you can set, automl terminates, having tried literally hundreds of models. You can get the top models, ranked by a certain performance metric. If you go for the champion just use:

championModel = out@leader

Now the championModel object can be used to score / predict new data with the simple call:

predict(championModel, newdata)
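Before settling on the champion it can help to look at the other top models too. Below is a minimal sketch, assuming the h2o package is installed, a cluster is running via h2o.init() and out is the object returned by h2o.automl above. The helper names top_models and score_new_data are made up for illustration.

```r
# Sketch only; top_models() and score_new_data() are hypothetical helper names.

# Return the first n rows of the AutoML leaderboard as a plain data.frame.
top_models <- function(aml, n = 5) {
  head(as.data.frame(aml@leaderboard), n)
}

# Score an ordinary R data.frame with an h2o model and return a data.frame.
score_new_data <- function(model, newdata) {
  newdata_h2o <- h2o::as.h2o(newdata)   # upload the rows to the h2o cluster
  as.data.frame(h2o::h2o.predict(model, newdata_h2o))
}

# Usage (not run):
# top_models(out)
# score_new_data(out@leader, newdata)
```

Converting the result back to a data.frame keeps the rest of your R code independent of h2o objects.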

The champion model now lives on my laptop, which makes it hard, if not impossible, to use in production. Someone or some process that wants a model score should not depend on you and your laptop!

Instead, the model should be available on a server where model prediction requests can be handled 24/7. This is where plumber comes in handy. First save the champion model to disk so that we can use it later.

h2o.saveModel(championModel, path = "myChampionmodel")
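One detail that is easy to miss: h2o.saveModel treats path as a directory and returns the full path of the file it wrote, and h2o.loadModel expects that full path. A small sketch, assuming a running h2o cluster; save_and_reload is a hypothetical helper name.

```r
# Sketch only; save_and_reload() is a hypothetical helper name.
save_and_reload <- function(model, dir = "myChampionmodel") {
  # force = TRUE overwrites an earlier export of the same model id
  model_path <- h2o::h2o.saveModel(model, path = dir, force = TRUE)
  message("Model written to: ", model_path)
  # load the model back from the exact file that was written
  h2o::h2o.loadModel(model_path)
}

# Usage (not run):
# reloaded <- save_and_reload(championModel)
```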

The plumber package: bring your model into production.

In a very simple and concise way, the plumber package allows users to expose existing R code as an API service available to others on the web or intranet. Now suppose you have a saved h2o model with three input features, how can we create an API from it? You decorate your R scoring code with special comments, as shown in the example code below.

# This is a Plumber API. In RStudio 1.2 or newer you can run the API by
# clicking the 'Run API' button above.

library(h2o)
h2o.init()   # the API process needs its own connection to an h2o cluster

mymodel = h2o.loadModel("mySavedModel")

#* @apiTitle My model API engine

#### expose my model #################################
#* Return the prediction given three input features
#* @param input1 description for inp1
#* @param input2 description for inp2
#* @param input3 description for inp3
#* @post /mypredictivemodel
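function( input1, input2, input3){
   # note: values sent in the query string arrive as character strings;
   # convert them (e.g. with as.numeric) if the model expects numeric features
   scoredata = as.h2o(
      data.frame(input1, input2, input3 )
   )
   # score with the loaded model and return a plain data.frame
   as.data.frame(predict(mymodel, scoredata))
}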

To create and start the API service, you have to put the above code in a file and call the following code in R.

rapi <- plumber::plumb("plumber.R")  # Where 'plumber.R' is the location of the code file shown above
rapi$run(port = 8000)                # start serving requests on port 8000

The output you see looks like:

Starting server to listen on port 8000 
Running the swagger UI at

And if you go to the swagger UI you can test the API in a web interface where you can enter values for the three input parameters.
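Other processes will of course not use the web interface but call the API directly. Here is a minimal client sketch in R, assuming the httr package is installed and the API runs locally on port 8000 with the /mypredictivemodel route from the code above. The function names are hypothetical, and the shape of the response depends on what your scoring function returns.

```r
# Sketch only; endpoint_url() and call_model_api() are hypothetical helper names.

# Build the full URL of the prediction route, tolerating a trailing slash.
endpoint_url <- function(base_url = "http://localhost:8000") {
  paste0(sub("/+$", "", base_url), "/mypredictivemodel")
}

# POST the three inputs as a JSON body and return the parsed response.
call_model_api <- function(input1, input2, input3,
                           base_url = "http://localhost:8000") {
  resp <- httr::POST(
    endpoint_url(base_url),
    body = list(input1 = input1, input2 = input2, input3 = input3),
    encode = "json"
  )
  httr::stop_for_status(resp)   # turn HTTP errors into R errors
  httr::content(resp)           # parsed JSON
}

# Usage (not run):
# call_model_api(1.2, 3.4, 5.6)
```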

What about Performance?

At first glance I thought there might be quite some overhead: calling the h2o library, loading the h2o predictive model and then using the h2o predict function to calculate a prediction given the input features.

But I think it is workable. I installed R, h2o and plumber on a small n1-standard-2 Linux server on the Google Cloud Platform. One API call via plumber to calculate a prediction with an h2o random forest model with 100 trees took around 0.3 seconds to finish.
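If you want to measure this on your own machine, a small timing helper in base R is enough. This is only a sketch and not the way the 0.3 seconds above was measured; time_calls is a hypothetical helper name.

```r
# Sketch only; time_calls() is a hypothetical helper name.
# Call f() n times and report the median and maximum elapsed time in seconds.
time_calls <- function(f, n = 20) {
  elapsed <- vapply(seq_len(n), function(i) {
    system.time(f())[["elapsed"]]
  }, numeric(1))
  c(median = median(elapsed), max = max(elapsed))
}

# Usage (not run), with a hypothetical one-row data.frame `one_row`:
# time_calls(function() predict(mymodel, as.h2o(one_row)))
```

Looking at the median and the maximum, rather than a single call, avoids being misled by a slow first request.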

There is much more that plumber has to offer, see the full documentation.

Cheers, Longhow.


Selecting ‘special’ photos on your phone

At the beginning of the new year I always want to clean up my photos on my phone. It just never happens.

So now (like so many others I think) I have a lot of photos on my phone from the last 3.5 years. The iPhone photos app helps you a bit to go through your photos. But which ones are really special and you definitely want to keep?

Well, just apply some machine learning.

  1. I run all my photos through a VGG16 deep learning model to generate high dimensional features per photo (on my laptop without GPU this takes about 15 minutes for 2000 photos).
  2. The dimension is 25,088, which is difficult to visualize. I apply a UMAP dimension reduction to bring it down to 3.
  3. In R you can create an interactive 3D plot with plotly where each point corresponds to a photo. Using crosstalk, you can link it to the corresponding image. The photo appears when you hover over the point.
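The three steps could look roughly like the sketch below. It assumes the keras, uwot and plotly R packages with a working TensorFlow backend, and all function names are hypothetical; the scripts in my repo may differ in the details. The crosstalk part that shows the actual photo on hover is left out here, so only the file name appears as hover text.

```r
# Sketch only; extract_features() and plot_photos_3d() are hypothetical names.

# VGG16 without its top layers maps a 224 x 224 image to a 7 x 7 x 512 block.
vgg16_feature_dim <- 7 * 7 * 512

# Step 1: one feature vector per photo.
extract_features <- function(image_paths) {
  model <- keras::application_vgg16(weights = "imagenet", include_top = FALSE)
  feats <- lapply(image_paths, function(p) {
    img <- keras::image_load(p, target_size = c(224, 224))
    x <- keras::image_to_array(img)
    x <- keras::array_reshape(x, c(1, dim(x)))   # add the batch dimension
    x <- keras::imagenet_preprocess_input(x)
    as.vector(predict(model, x))                 # flatten the 7 x 7 x 512 block
  })
  do.call(rbind, feats)                          # one row per photo
}

# Steps 2 and 3: reduce to 3 dimensions with UMAP and draw a 3D scatter plot.
plot_photos_3d <- function(features, image_paths) {
  emb <- uwot::umap(features, n_components = 3)
  df <- data.frame(x = emb[, 1], y = emb[, 2], z = emb[, 3],
                   file = basename(image_paths))
  plotly::plot_ly(df, x = ~x, y = ~y, z = ~z, text = ~file,
                  type = "scatter3d", mode = "markers")
}

# Usage (not run):
# paths <- list.files("photos", full.names = TRUE)
# plot_photos_3d(extract_features(paths), paths)
```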

Well, one “special” outlying picture in the scatter plot is of my two children with a dog during a holiday a few years ago. I would never have found it that fast otherwise. There are some other notable things that I can see, but I won’t bother you with them here 🙂

Link: GitHub repo with two R scripts to cluster your own photos. Cheers, Longhow

An R Shiny app to recognize flower species


Playing around with PyTorch and R Shiny resulted in a simple Shiny app where the user can upload a flower image; the system will then predict the flower species.

Steps that I took

  1. Download labeled flower data from the Visual Geometry Group,
  2. Install PyTorch and download their transfer learning tutorial script,
  3. You need to slightly adjust the script to work on the flower data,
  4. Train and save the model as a (*.pt) file,
  5. Using the R reticulate package you can call Python code from within R so that you can use PyTorch models in R,
  6. Create a Shiny app that allows the user to upload an image and display the predicted flower species.
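Step 5 is the part that glues R and Python together. A rough sketch is shown below, assuming reticulate can find a Python environment with torch, torchvision and Pillow, and that the whole model object (not only its weights) was saved to the .pt file. The function names are hypothetical, and the preprocessing values are the ones from the PyTorch transfer learning tutorial.

```r
# Sketch only; label_from_index() and predict_flower() are hypothetical names.

# Python class indices start at 0, R vectors start at 1.
label_from_index <- function(py_index, class_names) {
  class_names[py_index + 1L]
}

predict_flower <- function(image_path, model_path, class_names) {
  torch <- reticulate::import("torch")
  transforms <- reticulate::import("torchvision.transforms")
  pil_image <- reticulate::import("PIL.Image")

  # Recent PyTorch versions may additionally need weights_only = FALSE here.
  model <- torch$load(model_path, map_location = "cpu")
  model$eval()

  # Same preprocessing as the validation transform in the tutorial.
  prep <- transforms$Compose(list(
    transforms$Resize(256L),
    transforms$CenterCrop(224L),
    transforms$ToTensor(),
    transforms$Normalize(c(0.485, 0.456, 0.406), c(0.229, 0.224, 0.225))
  ))

  img <- pil_image$open(image_path)$convert("RGB")
  batch <- prep(img)$unsqueeze(0L)   # add the batch dimension
  logits <- model(batch)
  label_from_index(logits$argmax(dim = 1L)$item(), class_names)
}

# Usage (not run), with hypothetical file and class names:
# predict_flower("rose.jpg", "flower_model.pt", c("daisy", "rose"))
```

In the Shiny app such a function would be called on the uploaded file, with class_names in the same order as the labels used during training.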

Some links

GitHub repo with: Python notebook to fine-tune the resnet18 model, R script with Shiny app, data folder with images.

Live running shiny app can be found here. 

Cheers, Longhow

XSV tool

From time to time you get large csv files. Take for example the open data from RDW with all the vehicles in The Netherlands. The size is ~7.6 GB (14 million rows and 64 columns). It's not even really that large, but large enough for Notepad, WordPad and Excel to hang….

There is a nice and handy tool XSV, see the github repo.

You can use it for some quick stats of the csv file and even some basic manipulations. For my csv file, it takes around 17 seconds to count the number of records and around 18 seconds to aggregate on a column.

In R, data.table took 2 minutes to import the set, and Python pandas took 2.5 minutes.
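If you live in R anyway, you can call xsv from there without importing the file first. A small sketch, assuming the xsv binary is on your PATH; the wrapper names are hypothetical, and for the aggregation I assume something like xsv frequency, which counts the values of one column.

```r
# Sketch only; xsv_count() and xsv_frequency() are hypothetical wrapper names.

# Number of records in the file (the header row is not counted).
xsv_count <- function(csv_path) {
  out <- system2("xsv", c("count", shQuote(csv_path)), stdout = TRUE)
  as.numeric(out)
}

# Most frequent values of one column, returned as lines of csv text.
xsv_frequency <- function(csv_path, column, limit = 10) {
  system2(
    "xsv",
    c("frequency", "--select", shQuote(column),
      "--limit", limit, shQuote(csv_path)),
    stdout = TRUE
  )
}

# Usage (not run), with hypothetical file and column names:
# xsv_count("vehicles.csv")
# xsv_frequency("vehicles.csv", "Merk")
```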

#CommandlineToolsAreStillCool #XSV #CSV


Deploy machine learning models with GKE and Dataiku



In a previous post I described how easy it is to create and deploy machine learning models (exposing them as REST APIs) with Dataiku. In particular, it was an XGBoost model predicting home values. Now, suppose my model for predicting home values becomes so successful that I need to serve millions of requests per hour. Then it would be very handy if my back end scales easily.

In this brief post I outline the few steps you need to take to deploy machine learning models created with Dataiku on a scalable Kubernetes cluster on Google Kubernetes Engine (GKE).

Create a Kubernetes cluster

There is a nice GKE quickstart that demonstrates the creation of a Kubernetes cluster on Google Cloud Platform (GCP). The cluster can be created by using the GUI on the Google cloud console. Alternatively, if you are making use of the Google cloud SDK, it basically boils down to creating the cluster and getting credentials with two commands:

gcloud container clusters create myfirst-cluster
gcloud container clusters get-credentials myfirst-cluster

When creating a cluster, there are many options that you can set. I left all options at their default value. It means that only a small cluster of 3 nodes of machine type n1-standard-1 will be created. We can now see the cluster in the Google cloud console.


Setup the Dataiku API Deployer

Now that you have a Kubernetes cluster, you can easily deploy predictive models with Dataiku. First, you need to create a predictive model. As described in my previous blog, you can do this with the Dataiku software. The Dataiku API Deployer is then the component that takes care of the management and actual deployment of your models onto the Kubernetes cluster.

The machine where the Dataiku API Deployer is installed must be able to push docker images to your Google cloud environment and must be able to interact with the kubernetes cluster (through the kubectl command).


Deploy your stuff……

My XGBoost model created in Dataiku is now pushed to the Dataiku API Deployer. From the GUI of the API Deployer you are now able to select the XGBoost model and deploy it on your Kubernetes cluster.

The API Deployer is a management environment: it shows which models (and model versions) are already deployed, it checks whether the models are up and running, and it manages your infrastructure (Kubernetes clusters or normal machines).


When you select a model that you wish to deploy, you can click deploy and select a cluster. It will take a minute or so to package that model into a Docker image and push it to GKE. You will see a progress window.


When the process is finished you will see the new service on your Kubernetes Engine on GCP.


The model is up and running, waiting to be called. You could call it via curl for example:

curl -X POST \
  <endpoint URL of your deployed service> \
  --data '{ "features" : {
      "HouseType": "Tussenwoning",
      "kamers": 6,
      "Oppervlakte": 134,
      "VON": 0,
      "PC": "16"
  }}'

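The same request can be sent from R. A minimal sketch, assuming the httr package is installed; the function names are hypothetical, the endpoint URL is whatever your own deployment exposes, and the structure of the response depends on the Dataiku service, so it is returned as parsed JSON without further processing.

```r
# Sketch only; home_features() and predict_home_value() are hypothetical names.

# The example house from the curl call, as a nested R list.
home_features <- function() {
  list(features = list(
    HouseType   = "Tussenwoning",
    kamers      = 6,
    Oppervlakte = 134,
    VON         = 0,
    PC          = "16"
  ))
}

# POST the features as JSON to the deployed endpoint and return the parsed reply.
predict_home_value <- function(endpoint_url, payload = home_features()) {
  resp <- httr::POST(endpoint_url, body = payload, encode = "json")
  httr::stop_for_status(resp)   # turn HTTP errors into R errors
  httr::content(resp)
}

# Usage (not run); fill in the URL of your own service:
# predict_home_value(my_endpoint_url)
```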

That’s all it was! You now have a scalable model serving engine. Ready to be easily resized when the millions of requests start to come in….. Besides predictive models you can also deploy/expose any R or Python function via the Dataiku API Deployer. Don’t forget to shut down the cluster to avoid incurring charges to your Google Cloud Platform account.

gcloud container clusters delete myfirst-cluster

Cheers, Longhow.