XSV tool

From time to time you do get large CSV files. Take for example the open data from RDW with all the vehicles in The Netherlands. At ~7.6 GB (14 million rows and 64 columns) it is not even really that large, but large enough for Notepad, WordPad and Excel to hang….

There is a nice and handy command line tool called xsv, see the GitHub repo.

You can use it for some quick stats on a CSV file and even some basic manipulations. For my CSV file, it takes around 17 seconds to count the number of records and around 18 seconds to aggregate on a column.

In R data.table it took 2 minutes to import the set, and in Python pandas 2.5 minutes.
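The gap makes sense: counting records only needs a single streaming pass over the file, while data.table and pandas parse and keep every value in memory. A minimal sketch of such a streaming count in Python (the file name `vehicles.csv` is just a placeholder for the RDW export):

```python
import csv

def count_records(path):
    """Stream over the CSV and count data rows (header excluded),
    without loading anything into memory."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)          # skip the header row
        return sum(1 for _ in reader)
```

xsv can afford to be even faster, since it is compiled Rust, but the principle is the same: you never materialize the whole table just to count or aggregate it.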

#CommandlineToolsAreStillCool #XSV #CSV



A pretty useless Raspberry Pi application

What if you are not good at remembering faces? Well, buy a Raspberry Pi, a camera and a LED matrix display. Install OpenCV and the face_recognition library, and set it up with the important faces to be recognized.

Point your camera at people, and if there is a hit the LED display will show the name.

Deploy machine learning models with GKE and Dataiku



In a previous post I described how easy it is to create and deploy machine learning models (exposing them as REST APIs) with Dataiku. In particular, it was an XGBoost model predicting home values. Now suppose my model for predicting home values becomes so successful that I need to serve millions of requests per hour; then it would be very handy if my back end scales easily.

In this brief post I outline the few steps you need to take to deploy machine learning models created with Dataiku on a scalable Kubernetes cluster on Google Kubernetes Engine (GKE).

Create a Kubernetes cluster

There is a nice GKE quickstart that demonstrates the creation of a Kubernetes cluster on Google Cloud Platform (GCP). The cluster can be created using the GUI in the Google Cloud console. Alternatively, if you are using the Google Cloud SDK, it basically boils down to creating the cluster and getting its credentials with two commands:

gcloud container clusters create myfirst-cluster
gcloud container clusters get-credentials myfirst-cluster

When creating a cluster there are many options you can set. I left all options at their default values, which means that only a small cluster of 3 nodes of machine type n1-standard-1 is created. We can now see the cluster in the Google Cloud console.


Set up the Dataiku API Deployer

Now that you have a Kubernetes cluster, we can easily deploy predictive models with Dataiku. First, you need to create a predictive model; as described in my previous blog, you can do this with the Dataiku software. The Dataiku API Deployer is then the component that takes care of the management and actual deployment of your models onto the Kubernetes cluster.

The machine where the Dataiku API Deployer is installed must be able to push Docker images to your Google Cloud environment and must be able to interact with the Kubernetes cluster (through the kubectl command).


Deploy your stuff……

My XGBoost model created in Dataiku is now pushed to the Dataiku API Deployer. From the GUI of the API Deployer you can select the XGBoost model and deploy it on your Kubernetes cluster.

The API Deployer is a management environment: you can see which models (and model versions) are already deployed, it checks whether the models are up and running, and it manages your infrastructure (Kubernetes clusters or normal machines).


When you select a model that you wish to deploy, you click deploy and select a cluster. It takes a minute or so to package the model into a Docker image and push it to GKE; you will see a progress window.


When the process is finished you will see the new service in your Kubernetes Engine on GCP.


The model is up and running, waiting to be called. You could call it via curl for example:

curl -X POST \
  <your API endpoint URL> \
  --data '{ "features" : {
      "HouseType": "Tussenwoning",
      "kamers": 6,
      "Oppervlakte": 134,
      "VON": 0,
      "PC": "16"
    }
  }'


That’s all there is to it! You now have a scalable model serving engine, ready to be easily resized when the millions of requests start to come in….. Besides predictive models, you can also deploy/expose any R or Python function via the Dataiku API Deployer. Don’t forget to shut down the cluster to avoid incurring charges to your Google Cloud Platform account.

gcloud container clusters delete myfirst-cluster

Cheers, Longhow.

Google AutoML rocks!

Wow, Google AutoML Vision rocks!

A few months ago I performed a simple test with Keras to create a Peugeot – BMW image classifier on my laptop.

See Is that a BMW or a Peugeot?

A friendly encouragement from Erwin Huizenga to try Google AutoML Vision resulted in a very good BMW-Peugeot classifier 10 minutes later without a single line of code. Just a few simple steps were needed.

  • Upload your car images and label them.
  • Click train (the first hour of training is free!).
  • After 10 minutes the model was trained, with very good precision and recall.


And the nice thing: the model is up and running for anyone or any client application that needs an image prediction.


Just give it a try… Google AutoML Vision!

Little image experiment on my son

Please don’t report me to the authorities for conducting a little frivolous experiment on my son. 

There is this nice Python package face_recognition; it can recognize and manipulate faces in pictures, and it can calculate distances between faces. What is the distance between my son and me (when I was 9 years old), and how does that compare with other children?

The picture below shows two youth soccer teams, my son and me (35 years ago). I have Chinese roots, so to make it hard for the algorithm I included a Chinese soccer team. I am happy with the results of the experiment: with a distance of 0.601, my son had almost the smallest distance to me of all the 28 faces…
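Under the hood, face_recognition maps each face to a 128-dimensional encoding, and the distance between two faces is the Euclidean norm of the difference between their encodings; smaller means more similar. A toy sketch of that computation (the encodings below are made-up 4-dimensional stand-ins, real encodings have 128 values):

```python
import math

def face_distance(enc_a, enc_b):
    """Euclidean distance between two face encodings;
    face_recognition uses 0.6 as its default match tolerance."""
    return math.dist(enc_a, enc_b)

# Made-up stand-ins for real 128-dimensional face encodings
me    = [0.10, 0.40, 0.20, 0.80]
son   = [0.12, 0.35, 0.25, 0.75]
other = [0.90, 0.10, 0.70, 0.20]

print(face_distance(me, son) < face_distance(me, other))  # prints True
```

So a result like 0.601 sits right around the library's usual match threshold, which is why a parent-child pair can end up as a near-hit among 28 faces.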

Try out the face_recognition Python package for yourself.


Cucumber time, food on a 2D plate / plane



It is 35 degrees Celsius outside and we are in the middle of the ‘slow news season’, in many countries also called cucumber time: a period typified by the appearance of less informative and frivolous news in the media.

Did you know that 100 g of cucumber contains 0.28 mg of iron and 1.67 g of sugar? You can find all the nutrient values of a cucumber in the USDA food databases.

Food Data

There is more data: for many thousands of products you can retrieve nutrient values through an API (you need to register for a free key). So besides the cucumber, I extracted data for different types of food, for example

  • Beef products
  • Dairy & Egg products
  • Vegetables
  • Fruits
  • etc.

And as a comparison, I retrieved the nutrient values for some fast food products from McDonald’s and Pizza Hut, just to see if pizza can be classified as a vegetable from a data point of view 🙂 So the data looks like:


I sampled 1500 products, and per product we have 34 nutrient values.
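To feed this to a dimension reduction method, the nutrient records have to be reshaped into one row per product and one column per nutrient. A minimal sketch with toy data (the cucumber values come from the text above; the pizza values and record layout are made up for illustration, the real set has 1500 products and 34 nutrients):

```python
# Toy nutrient records, one (product, nutrient, value) triple per measurement
records = [
    ("cucumber", "iron_mg", 0.28),
    ("cucumber", "sugar_g", 1.67),
    ("pizza",    "iron_mg", 2.10),
    ("pizza",    "sugar_g", 3.60),
]

# Collect the distinct products and nutrients
nutrients = sorted({n for _, n, _ in records})
products  = sorted({p for p, _, _ in records})

# Build a product x nutrient matrix, filling missing values with 0.0
values = {(p, n): v for p, n, v in records}
matrix = [[values.get((p, n), 0.0) for n in nutrients] for p in products]
```

Each row of `matrix` is then one product described by its nutrient values, exactly the shape a 2D projection needs.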


The 34-dimensional data is then compressed / projected onto a two-dimensional plane using UMAP (Uniform Manifold Approximation and Projection). There are Python and R packages for this.


An interactive map can be found here, and the R code to retrieve and plot the data here. Cheers, Longhow.