# Oil leakage… those old BMW’s are bad :-)

## Introduction

My first car was a 13-year-old Mitsubishi Colt; I paid 3000 Dutch guilders for it. I can still remember a friend who did not want me to park this car in front of his house because of possible oil leakage.

Can you get an idea of which cars are likely to leak oil? Well, with open car data from the Dutch RDW you can. The RDW is the Netherlands Vehicle Authority in the mobility chain.

## RDW Data

There are many data sets that you can download. I have used the following:

• Observed defects. This set contains 22 million records on observed defects at car level (license plate number). Cars in the Netherlands have to be inspected yearly, and the findings of each inspection are submitted to the RDW.
• Basic car details. This set contains 9 million records, one for every car in the Netherlands: license plate number, brand, make, weight and type of car.
• Defect codes. This little table provides a description of all possible defect codes, so I know that code ‘RA02’ in the observed-defects data set represents ‘oil leakage’.

## Simple Analysis in R

I imported the data into R and, with some simple dplyr statements, determined per car make and age (in years) the number of cars with an observed oil-leakage defect. I then determined how many cars there are per make and age; dividing those two numbers results in a so-called oil leak percentage.

For example, in the Netherlands there are 2043 Opel Astra’s that are four years old; three of them were observed with an oil leak, so we have an oil leak percentage of 0.15%.
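The aggregation itself is only a few lines. Here is a minimal sketch in Python (the post uses dplyr in R; the records below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical defect records: (make, age_in_years, has_oil_leak_defect)
cars = [
    ("Opel", 4, False), ("Opel", 4, True), ("Opel", 4, False),
    ("BMW", 15, True), ("BMW", 15, True), ("BMW", 15, False),
]

# (make, age) -> [number of cars, number with an oil leak]
counts = defaultdict(lambda: [0, 0])
for make, age, leaks in cars:
    counts[(make, age)][0] += 1
    counts[(make, age)][1] += int(leaks)

# Oil leak percentage per make and age
leak_pct = {key: 100.0 * n_leak / n_cars
            for key, (n_cars, n_leak) in counts.items()}
```

With the real data, `cars` would hold one record per license plate, but the group-count-divide logic is the same.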

The graph below shows the oil leak percentages for different car brands and ages. Obviously, the older the car, the higher the leak percentage. But look at BMW: waaauwww, those old BMW’s are leaking oil like crazy… 🙂 The few lines of R code can be found here.

## Conclusion

There is a lot in the open car data from the RDW; you can look at many more aspects / defects of cars. Regarding my old car: according to this data, Mitsubishi’s have a low oil leak percentage, even the older ones.

Cheers, Longhow

# Test driving Python integration in R, using the ‘reticulate’ package

## Introduction

Not so long ago RStudio released the R package ‘reticulate’, an R interface to Python. Of course, it was already possible to execute Python scripts from within R, but this integration takes it one step further. Imported Python modules, classes and functions can be called inside an R session as if they were native R functions.

Below you’ll find some screenshot code snippets of using certain Python modules within R with the reticulate package. On my GitHub page you’ll find the R files from which these snippets were taken.

## Using Python packages

The nice thing about reticulate in RStudio is the support for code completion. When you have imported a Python module, RStudio will recognize the methods that are available in it:

### The clarifai module

Clarifai provides a set of computer vision APIs for image recognition, face detection, extracting tags, etc. There is an official Python module, and there is also an R package by Gaurav Sood, but it exposes less functionality. So I am going to use the Python module in R. The following code snippet shows how easy it is to call Python functions.

The output returned from the clarifai call is a nested list and can be quite intimidating at first sight. To browse through these nested lists and to get a better idea of what is in them, you can use the package listviewer:

### The pattern.nl module

The pattern.nl module contains a fast part-of-speech tagger for Dutch, sentiment analysis, and tools for Dutch verb conjugation and noun singularization & pluralization. At the moment it does not support Python 3. That is not a big deal: I am using Anaconda and created a Python 2.7 environment to install pattern.nl.

The nice thing about the reticulate package is that it allows you to choose a specific Python environment to use.

### The pytorch module

pytorch is a Python package that provides tensor computations and deep neural networks. There is no ‘R torch’ equivalent, but we can use reticulate in R. There is an example of training a logistic regression in pytorch; see the code here. It takes just a little rewriting of this code to make it work in R. See the first few lines in the figure below.

## Conclusion

As a data scientist you should know both R and Python; the reticulate package is no excuse for not learning Python! However, the reticulate package can be very useful if you want to do all your analysis in the RStudio environment. It works very well.

For example, I used rvest to scrape some Dutch news texts, then used the Python module pattern.nl for Dutch sentiment, and wrote an R Markdown document to present the results. The reticulate package is then a nice way to keep everything in one environment.

Cheers, Longhow

# New chapters for 50 shades of grey….

## Introduction

Some time ago I had the honor to attend an interesting talk by Tijmen Blankevoort on neural networks and deep learning. Convolutional and recurrent neural networks were topics that had already caught my interest, and this talk inspired me to dive deeper into them and do some more experiments.

In the same session, organized by Martin de Lusenet for Ziggo (a Dutch cable company), I also had the honor to give a talk; my presentation contained a text mining experiment that I did earlier on the Dutch TV soap GTST, “Goede Tijden Slechte Tijden”. A nice idea by Tijmen was: why not use deep learning to generate new episode plots for GTST?

So I did; see my LinkedIn post on GTST. However, those episodes are in Dutch and I guess only interesting for people here in the Netherlands. So to make things more international and spicier, I generated some new texts based on deep learning and the erotic romance novel 50 Shades of Grey 🙂

## More than plain vanilla networks

In R or SAS you have long been able to train plain vanilla neural networks: the so-called fully connected networks, where all input nodes are connected to all nodes in the following hidden layer, and all nodes in a hidden layer are connected to all nodes in the following hidden layer or output layer.

In recent years deep learning frameworks have become very popular, for example Caffe, Torch, CNTK, Tensorflow and MXNET. The added value of these frameworks compared to, for example, SAS is:

• They support more network types than plain vanilla networks, for example convolutional networks, where not all input nodes are connected to the next layer, and recurrent networks, where loops are present. A nice introduction to these networks can be found here and here.
• They support computations on GPUs, which can speed things up dramatically.
• They are open source and free. No need for long sales and implementation cycles 🙂 Just download it and use it!

recurrent neural network

## My 50 Shades of Grey experiment

For my experiment I used the text of the erotic romance novel 50 Shades of Grey. A PDF can be found here; I used xpdfbin to extract all the words into a plain text file. I trained a Long Short-Term Memory network (LSTM, a special type of recurrent network) with MXNET. The reason to use MXNET is that it has a nice R interface, so that I can just stay in my comfortable RStudio environment.

Moreover, the R example script of MXNET is ready to run; I just changed the input data and used more rounds of training and more hidden layers. The script and the data can be found on GitHub.

The LSTM model is fit at character level; the complete romance novel contains 817,204 characters, and all these characters are mapped to a number (91 unique numbers). The first few numbers are shown in the following figure.
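In code, that character-to-number mapping is a small vocabulary lookup. A Python sketch, using a short stand-in string instead of the full novel:

```python
text = "He leads the way"  # stand-in for the full novel text

# Map every unique character to an integer, and back again
vocab = sorted(set(text))
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
ix_to_char = {i: ch for ch, i in char_to_ix.items()}

# The model trains on this sequence of integers
encoded = [char_to_ix[ch] for ch in text]
```

For the novel, `vocab` would have the 91 unique characters mentioned above.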

Once the model has been trained it can generate new text, character by character!
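Mechanically, generation is a loop: feed in the last character, get a probability distribution over the next character, sample, repeat. A toy Python sketch (the probabilities below are made up; a trained LSTM would supply them):

```python
import random

# Hypothetical next-character distributions, standing in for LSTM output
probs = {
    "h": {"e": 0.9, "a": 0.1},
    "a": {"h": 1.0},
    "e": {"l": 1.0},
    "l": {"l": 0.5, "o": 0.5},
    "o": {" ": 1.0},
    " ": {"h": 1.0},
}

def generate(seed_char, n_chars, rng):
    out = [seed_char]
    for _ in range(n_chars):
        dist = probs[out[-1]]  # distribution given the last character
        chars = list(dist)
        weights = [dist[c] for c in chars]
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

sample = generate("h", 20, random.Random(1))
```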

arsess whatever
yuu’re still expeliar a sally. Reftion while break in a limot.”
“Yes, ald what’s at my artmer and brow maned, but I’m so then for a
dinches suppretion. If you think vining. “Anastasia, and depregineon posing rave.
He’d deharing minuld, him drits.

“Miss Steele
“Fasting at liptfel, Miss I’ve dacind her leaches reme,” he knimes.
“I want to blight on to the wriptions of my great. I find sU she asks the stroke, to read with what’s old both – in our fills into his ear, surge • whirl happy, this is subconisue. Mrs. I can say about the battractive see. I slues
is her ever returns. “Anab.

It’s too even ullnes. “By heaven. Grey
about his voice. “Rest of the meriction.”
He scrompts to the possible. I shuke my too sucking four finishessaures. I need to fush quint the only more eat at me.
“Oh my. Kate. He’s follower socks?
“Lice in Quietly. In so morcieut wait to obsed teach beside my tired steately liked trying that.”
Kate for new of its street of confcinged. I haven’t Can regree.
“Where.” I fluscs up hwindwer-and I have

I’ll staring for conisure, pain!”
I know he’s just doesn’t walk to my backeting on Kate has hotelby of confidered Christaal side, supproately. Elliot, but it’s the ESca, that feel posing, it make my just drinking my eyes bigror on my head. S I’ll tratter topality butterch,” I mud
a nevignes, bleamn. “It’s not by there soup. He’s washing, and I arms and have. I wave to make my eyes. It’s forgately? Dash I’d desire to come your drink my heathman legt
you hay D1 Eyep, Christian Gry, husder with a truite sippking, I coold behind, it didn’t want to mive not to my stop?”
“Yes.”

“Sire, stcaring it was do and he licks his viice ever.”
I murmurs, most stare thut’s the then staraline for neced outsive. She
so know what differ at,” he murmurs?
“I shake my headanold.” Jeez.
“Are you?” Eviulder keep “Oh,_ I frosing gylaced in – angred. I am most drink to start and try aparts through. I really thrial you, dly woff you stund, there, I care an right dains to rainer.” He likes his eye finally finally my eyes to over opper heaven, places my trars his Necked her jups.
“Do you think your or Christian find at me, is so with that stand at my mouth sait the laxes any litee, this is a memory rude. It
flush,” He says usteer?” “Are so that front up.
I preparraps. I don’t scomine Kneat for from Christian.
“Christian,’! he leads the acnook. I can’t see. I breathing Kate’ve bill more over keen by. He releases?”
“I’m kisses take other in to peekies my tipgents my

## Conclusion

The generated text does not make any sense, nor will it win any literature prize soon 🙂 Keep in mind that the model is based ‘only’ on 817,204 characters (which is considered a small number), and I did not bother to fine-tune the model at all. But still, it is funny and remarkable to see that when you use it to generate text, character by character, it can produce a lot of correct English words and even some correct basic grammar patterns!

cheers, Longhow.

# A little H2O deeplearning experiment on the MNIST data set

## Introduction

H2O is a fast and scalable open-source machine learning platform. Several algorithms are available, for example neural networks, random forests, linear models and gradient boosting; see the complete list here. Recently the H2O World conference was held; unfortunately I was not there. Luckily there is a lot of material available, videos and slides, and it triggered me to try the software.

The software is easy to set up on my laptop: download it from the H2O download site, it is a zip file that needs to be unzipped. It contains (among other files) a jar file that needs to be run from the command line:

java -jar h2o.jar

After H2O has started, you can browse to localhost:54321 (the default port number can be changed by specifying, for example, -port 65432) and use H2O via the Flow web interface. In this blog post I will not use the Flow interface but the R interface.

H2O flow web interface

## The H2O R interface

To install the H2O R interface you can follow the instructions provided here. It’s a script that checks whether an H2O R package is already installed, installs the packages that the H2O package depends on if needed, and installs the H2O R package. Then start the interface to H2O from R. If H2O was already started from the command line, you can connect to the same H2O instance by specifying the same port and using startH2O = FALSE.

library(h2o)

localH2O = h2o.init(nthreads = -1, port = 54321, startH2O = FALSE)

## MNIST handwritten digits

The data I used for my little experiment is the famous handwritten digits data from MNIST. The data in CSV format can be downloaded from Kaggle. The train data set has 42,000 rows and 785 columns; each row represents a digit, and a digit is made up of 28 by 28 pixels, 784 columns in total, plus one additional label column. The first column in the CSV file is called ‘label’, the rest of the columns are called pixel0, pixel1, …, pixel783. The following code imports the data and plots the first 100 digits, together with their labels.

MNIST_DIGITStrain = read.csv('D:/R_Projects/MNIST/MNIST_DIGITStrain.csv')
par(mfrow = c(10, 10), mai = c(0, 0, 0, 0))
for (i in 1:100) {
  y = as.matrix(MNIST_DIGITStrain[i, 2:785])
  dim(y) = c(28, 28)
  image(y[, nrow(y):1], axes = FALSE, col = gray(255:0 / 255))
  text(0.2, 0, MNIST_DIGITStrain[i, 1], cex = 3, col = 2, pos = c(3, 4))
}

The first 100 MNIST handwritten digits and the corresponding label

The data is imported into R; it’s a local R data frame. To apply machine learning techniques to the MNIST digits, the data needs to be available on the H2O platform. From R you can either import a CSV file directly into the H2O platform or import an existing R object into the H2O platform.

mfile = 'D:\\R_Projects\\MNIST\\MNIST_DIGITStrain.csv'
MDIG = h2o.importFile(path = mfile, sep = ',')

# Show the data objects on the H2O platform
h2o.ls()

key
1 MNIST_DIGITStrain.hex_3

## Deep learning autoencoder

Now that the data is in H2O we can apply machine learning techniques to it. One type of analysis that interested me most is the ability to train autoencoders: the idea is to use the input data to predict the input data by means of a ‘bottleneck’ network.

The middle layer can be regarded as a compressed representation of the input. In H2O R, a deep learning autoencoder can be trained as follows.

NN_model = h2o.deeplearning(
  x = 2:785,
  training_frame = MDIG,
  hidden = c(400, 200, 2, 200, 400),
  epochs = 600,
  activation = 'Tanh',
  autoencoder = TRUE
)

So there is one input layer with 784 neurons, a second layer with 400 neurons, a third layer with 200, the middle layer with 2 neurons, etc. The middle layer is a 2-dimensional representation of a 784-dimensional digit. The 42,000 2-dimensional representations of the digits are just points that we can plot. To extract the data from the middle layer we need the function h2o.deepfeatures.

library(ggplot2)  # qplot comes from ggplot2

train_supervised_features2 = h2o.deepfeatures(NN_model, MDIG, layer = 3)

plotdata2 = as.data.frame(train_supervised_features2)
plotdata2$label = as.character(as.vector(MDIG[,1]))

qplot(DF.L3.C1, DF.L3.C2, data = plotdata2, color = label,
      main = 'Neural network: 400 - 200 - 2 - 200 - 400')

In training the autoencoder network I did not use the label; this is not a supervised training exercise. However, I did use the label in the plot above. We can see the ‘1’ digits clearly on the left-hand side, while the ‘7’ digits are more on the right-hand side, and the pink ‘8’ digits are more in the center. It’s far from perfect; I need to explore more options in the deep learning functionality to achieve a better separation in two dimensions.

#### Comparison with a 2 dimensional SVD data reduction

Autoencoders use nonlinear transformations to compress high dimensional data to a lower dimensional space. Singular value decomposition, on the other hand, compresses data to a lower dimensional space using only linear transformations. See my earlier blog post on SVD. The following picture shows the MNIST digits projected to 2 dimensions using SVD.

There is a good separation between the 1’s and the 0’s, but the rest of the digits are much less separated than with the autoencoder. There is of course a time benefit for the SVD: it takes around 6.5 seconds to calculate an SVD on the MNIST data, while the autoencoder took around 350 seconds.

## Conclusion

With this little autoencoder example, I have just scratched the surface of what is possible in H2O. There is much more to discover, many supervised learning algorithms, and also within the deep learning functionality of H2O there are a lot of settings which I have not explored further.

# Some (spatial) analytics on primary school exams (Cito toets) in The Netherlands

## Predicting the Cito Score

Recently my six-year-old son had his first Cito test; during his career at primary school he will have many more such Cito tests. It will end when he is around twelve and has his final Cito exam. A primary school in the Netherlands does not have to participate in the Cito exams; there are around 6800 primary schools in the Netherlands, of which around 5300 participate. For each of those schools the average Cito score (average over all pupils) on the final Cito exam is available. Let’s just call this the Cito score.

As a data scientist, I like to know whether there are factors that can be used to predict the Cito score and, more importantly, whether we can get that data. A little search on the internet resulted in a few interesting data sets that might be useful to predict the Cito score:

• School details, like the type of school (public, Christian, Jewish, Muslim, Montessori, etc.) and the location
• Financial results, like the total assets, liabilities, working capital, etc.
• Staff & pupils: number of FTEs, number of pupils, part-time personnel, management FTEs
• Data on the education level of the parents of the pupils

I also have the house for sale data from Funda, which can tell me something about the prices of the houses close to the school. There is also open data on the demographics of a neighborhood, like number of people, number of immigrants, number of children, number of males, etc.

Let’s have a look at the Cito score first: how is it distributed, and what are low and high values? The figure below shows the distribution; the average of the Cito scores is 534.7, the median is 535, the 10% percentile is 529.3 and the 90% percentile is 539.7.

Now let’s predict the Cito score. In SAS Enterprise Miner I fitted a simple linear regression ($r^2 = 0.35$), a random forest ($r^2 = 0.39$) and a neural network ($r^2 = 0.38$), so the random forest has the best predictive power. Although a random forest is difficult to interpret, it does identify the factors that are important to predict the Cito score. The three most important factors are:

• Percentage of pupils of a school whose parents both have an education
• Percentage of pupils of a school whose parents both are uneducated
• Average sales price of the 9 houses closest to the school

A plot of the three factors against the Cito score is given in the plots below, click on them to enlarge.

From the graph above we can see that schools in an area where house prices are 300K have a Cito score around 1 point higher than schools in an area where house prices are 200K. So every 100K increase in house prices corresponds to a 1-point increase in the Cito score. Conclusion: with open data we can predict the Cito score with some accuracy. See my Shiny app for the random forest Cito score prediction for each individual school.

## Spatial AutoCorrelation

Spatial autocorrelation occurs when points close to each other have similar (correlated) values. So do schools that are close to each other have similar Cito scores? We can analyse this by means of a variogram or correlogram, which describes how data are related (correlated) with distance $h$.

The variogram is calculated by

$\gamma(h) = \frac{1}{2 |N(h)|} \sum_{(i,j) \in N(h)} (z_i - z_j)^2$

where $z_i$ and $z_j$ are the values (Cito scores) at locations $i$ and $j$, and $N(h)$ is the set of all pairs of locations separated by distance $h$. So it calculates the average squared difference of values (Cito scores) separated by distance $h$. It is often easier to interpret a correlation (coefficient) than squared differences: for example, a correlation coefficient larger than 0.8 means strongly correlated values, while a correlation lower than 0.1 means weakly correlated values. A correlogram $\rho(h)$ can be calculated from the variogram:

$\rho(h) = 1 - \frac{\gamma(h)}{C(0)}$

where $C(0)$ is the variance of the data. In SAS we can estimate the empirical variogram and correlogram from the Cito data with proc variogram; an example call is given in the figure below.
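For readers without SAS: the empirical variogram is straightforward to compute directly from its definition. A Python sketch on a toy 1-D example (the points and distance tolerance are made up; proc variogram handles the binning for you):

```python
from itertools import combinations

# Toy data: (location, value) pairs standing in for (school, Cito score)
points = [(0.0, 535.0), (1.0, 534.0), (2.0, 536.0), (10.0, 529.0)]

def empirical_variogram(points, h, tol=0.5):
    """gamma(h): half the average squared difference of values
    over all pairs whose distance is within tol of h."""
    sq_diffs = [(zi - zj) ** 2
                for (xi, zi), (xj, zj) in combinations(points, 2)
                if abs(abs(xi - xj) - h) <= tol]
    return sum(sq_diffs) / (2 * len(sq_diffs)) if sq_diffs else float("nan")

gamma_1 = empirical_variogram(points, h=1.0)
```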

The resulting correlogram is given in the figure below.

Conclusion: schools that are close to each other (within 1.5 km) have strongly correlated Cito scores. So if your son’s school has a bad Cito score, you can take him to another school, but be sure the other school is at least 7 km away from the current one.

# Car depreciation and regression splines….

You might have that terrible feeling when buying a new car: after picking it up and driving it out of the showroom, it immediately loses value! The question is: how much? Of course this car depreciation will depend on the make and model of the car.

In order to get some idea of the depreciation I extracted data from a Dutch used car sales site, www.autotrader.nl, so the amounts are in euros. There are some features that you can scrape for every car, for example the make, model, fuel type, transmission, energy label and age. To get an idea of the data, the figure below displays around 2000 Renault Clios extracted from the site. On the x-axis we have mileage (in this case kilometers driven), and on the y-axis we have the price in euros (the price that is displayed in the ad, so not the price that is actually paid).

A simple linear regression model was fitted on these 2000 Renault Clios. The parameters are given in the following figure.

So on average a new Clio will cost around 15,082 euros (Clios with automatic transmission are 1989 euros more expensive), and every kilometer you drive a Clio costs you 7.28 cents in loss of value. The R-squared of this simple regression model is 0.66. Some other cars to compare the depreciation with are given in the table below.
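Written as a function, the fitted model above is just (coefficients copied from the regression output; a sketch, not the exact SAS model):

```python
def clio_price(km, automatic=False):
    """Predicted Renault Clio asking price in euros:
    intercept 15,082; +1,989 for automatic; -7.28 cents per km."""
    return 15082 + (1989 if automatic else 0) - 0.0728 * km

price_at_50k = clio_price(50_000)
```

So after 50,000 km a manual Clio is predicted to be worth around 11,442 euros.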

Looking at the plot above, you can already see that a straight line is probably not the best curve to fit the data points. Hmmm, so what other curves can we try? Splines!

Splines can be seen as piecewise polynomials glued together: for example, from 0 to 25,000 kilometers one polynomial is used to predict the price, and from 25,000 km to 75,000 km another polynomial is used. The points at which the polynomials are glued together are called knots, and splines are constructed so that the curve is smooth at the knots. The term comes from the tool used by shipbuilders and drafters to construct smooth shapes with desired properties: a bendable strip fixed in position at a number of points that relaxes to form a smooth curve passing through those points.
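The simplest case, a piecewise-linear spline with one knot, can be written with a truncated power basis. A Python sketch with hypothetical coefficients chosen to mimic the Clio behaviour described below (17 cents/km before a knot at 55,000 km, 4 cents/km after):

```python
def linear_spline(x, intercept, slopes, knots):
    """f(x) = intercept + b0*x + sum_k bk * max(x - knot_k, 0):
    a piecewise-linear curve that changes slope at each knot."""
    y = intercept + slopes[0] * x
    for b, knot in zip(slopes[1:], knots):
        y += b * max(x - knot, 0.0)
    return y

# Hypothetical depreciation curve: -0.17 euro/km up to the knot at
# 55,000 km, then -0.17 + 0.13 = -0.04 euro/km beyond it
def clio_value(km):
    return linear_spline(km, 16000.0, [-0.17, 0.13], [55000.0])
```

Note how adding a `max(x - knot, 0)` term changes the slope at the knot without introducing a jump, which is exactly the "glued together smoothly" property.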

In SAS the adaptivereg procedure can fit splines. It has some handy features: it constructs spline basis functions adaptively by automatically selecting appropriate knot values for different variables, and it obtains reduced models by applying model selection techniques. Let’s fit a spline model on the Renault Clios using this procedure.

The spline model has an R-squared of 0.76, a big improvement over the 0.66 of the simple linear regression model. What does the car value prediction look like? Look at the figure below.

We see that new Clios with an automatic transmission are around 5000 euros more expensive than Clios with a manual transmission; however, these automatics lose value much faster. There is a turning point at around 55,000 km: the rate at which a Clio loses value (around 17 cents/km) declines after 55,000 km to around 4 cents/km. Interested in the depreciation of other car makes? Have a look at my little Shiny app.

Cheers,

A happy Clio driver!

# Text mining basics: The IMDB reviews

Within big data analytics, the analysis of unstructured data has received a lot of attention recently. Text mining can be seen as a collection of techniques to analyze unstructured data (i.e. documents, e-mails, tweets, product reviews or complaints). Two main applications of text mining are:

1. Categorization. Text mining can categorize (large amounts of) documents into different topics. I think we all have created dozens of sub folders in Outlook to organize our e-mails, and manually moved e-mails to these folders. No need to do that anymore, let text mining create and name the folders and automatically categorize e-mails into those folders.
2. Prediction. With text mining you can predict whether a document belongs to a certain category. For example, of all the e-mails that a company receives, which belong to the category “negative sentiment”, or which come from people who are likely to terminate their subscription?

How does text mining work? Let’s analyze some IMDB reviews: 50,000 reviews that are already classified into positive or negative sentiment; the data can be found here. There are a few steps to undertake, quite easily performed in SAS Text Miner.

#### 1. Parse / Filter

First I imported the reviews into a SAS data set. One column contains all the reviews, each review in a separate record. There is also a second column, the binary target, each review is classified as POSITIVE or NEGATIVE sentiment. The reviews need to be parsed and filtered.

The parsing and filtering nodes parse the collection of reviews to quantify information about the terms in all the reviews. The parsing can be adjusted to include stemming (treat house and houses as one term), synonym lists (treat movie and film as one term) and stop lists (ignore the, in and with). The result is a so-called term-document matrix, representing the frequency with which a term occurs in a document.
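For a toy corpus, a term-document matrix is easy to build by hand. A Python sketch with three made-up “reviews” and a tiny stop list:

```python
from collections import Counter

reviews = ["boring and long movie", "great plot and great acting",
           "boring acting"]
stop_words = {"the", "in", "with", "and"}

# Tokenize and drop stop words
docs = [[w for w in r.split() if w not in stop_words] for r in reviews]

# Rows are terms, columns are documents; each cell is a term frequency
terms = sorted({w for doc in docs for w in doc})
counts = [Counter(doc) for doc in docs]
tdm = [[c[t] for c in counts] for t in terms]
```

Here the row for “great” is [0, 2, 0]: the term appears twice in the second review and nowhere else.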

#### 2. Apply Singular Value Decomposition

A problem that arises when there are many reviews is that the term-document matrix can become very large. Other problems that may arise are sparseness (many zeros in the matrix) and term dependency (the terms boring and long may often occur together in reviews and so are strongly correlated). To resolve these problems, a singular value decomposition (SVD) is applied to the term-document matrix. In an earlier blog post I described the SVD. The term-document matrix A is factorized as

$A = U \Sigma V^T$

Instead of using all the singular values, we now use only the largest k singular values. In the term-document matrix each review R is represented by an m-dimensional vector of terms; using the SVD this can be projected onto a lower dimensional subspace with

$\hat{R} = U^T_k R$

So our three reviews will look like
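The projection $\hat{R} = U^T_k R$ is just a matrix-vector product. A Python sketch with a hypothetical 4-term vocabulary and k = 2 (the numbers in U_k are made up; a real SVD would supply them):

```python
def project(U_k, r):
    """Compute U_k^T r: map an m-term frequency vector to k dimensions."""
    return [sum(col[i] * r[i] for i in range(len(r))) for col in zip(*U_k)]

# Hypothetical U_k: 4 terms (rows) by k = 2 singular vectors (columns)
U_k = [[0.5,  0.5],
       [0.5, -0.5],
       [0.5,  0.5],
       [0.5, -0.5]]

review = [1, 0, 2, 0]       # term frequencies of one review
r_hat = project(U_k, review)  # its 2-dimensional representation
```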

#### 3. Categorization or Prediction

Now that each review is projected onto a lower dimensional subspace, we can apply data mining techniques. For categorization we can apply clustering (for example k-means or hierarchical clustering). Each review will be assigned to a cluster; for clustering we do not need the sentiment column. The next figure shows an example of a hierarchical clustering in SAS Enterprise Miner.

The clustering resulted in 13 clusters, the 13 leaf nodes. Each cluster is described by descriptive terms: for example, one cluster contains reviews of people talking about the fantastic script and plot, while another cluster talks about the bad acting.

To predict the sentiment we do need the sentiment column; in Enterprise Miner I set that column as a target and projected the reviews onto a 300-dimensional subspace, so 300 inputs: SVD1, SVD2, …, SVD300. I tried several machine learning methods: random forests, gradient boosting and neural networks.

It turns out that a neural network with 1 hidden layer of 50 neurons works quite well: an area under the ROC curve of 0.945 on a holdout set.