# Don’t buy a brand new Porsche 911 or Audi Q7!!

## Introduction

Many people know that nasty feeling when buying a brand new car. The minute you have left the dealer, your car has already lost a substantial amount of value. Unfortunately, this depreciation is inevitable; however, how much value is lost depends heavily on the make and model. A small analysis of (used) car data shows these differences.

## Car Data

I have used rvest to scrape data from www.gaspedaal.nl, a Dutch website that aggregates cars-for-sale listings from several other sites. The scraping script is not that difficult (a minimal sketch is shown below); the full version can be found on my GitHub, together with my analysis script. There are around 435,000 cars, and the data for each car consists of make, model, price, fuel type, transmission and age. There are many different makes and models; the figure below lists the most common ones in my data set.
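The idea of the scraping step is sketched below. The results-page URL and the CSS selectors are placeholders of my own (the site's actual markup is not shown here); the real ones are in the script on GitHub.

```
library(rvest)

## Parse one (hypothetical) results page and pull a few fields.
## The selectors ".occasion-title", ".occasion-price" and ".occasion-km"
## are placeholders; use the selectors that match the actual page markup.
page <- read_html("https://www.gaspedaal.nl/porsche/911")

cars_page <- data.frame(
  title = page %>% html_nodes(".occasion-title") %>% html_text(trim = TRUE),
  price = page %>% html_nodes(".occasion-price") %>% html_text(trim = TRUE),
  km    = page %>% html_nodes(".occasion-km")    %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)
```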

### Car age vs. Kilometers

Obviously, there is a clear relation between the age of a car and the number of kilometers driven. An interesting pattern is that this relation depends on the car make (and model). The following figure shows a few car brands.

Large differences in the amount of driving between car types start to appear after 18 months. Jaguars, apparently, are not made for driving: after 60 months their owners have driven only around 83,000 km on average. Mercedes-Benz owners, on the other hand, have driven around 120,000 km after 60 months.

A more extreme difference is between the Volvo V50 and the Hyundai i10: between six and ten years of age, a Volvo V50 has on average driven 178K kilometers, while a Hyundai i10 has driven only 75K kilometers.

## Depreciation

A simple depreciation model is just linear depreciation. Per car brand, model, and transmission type, I can fit a straight line through price and kilometers driven. The slope of the line is the depreciation for every kilometer driven. An elegant way to obtain the depreciation per car type is by using the purrr and broom packages.

First, some outlying values are removed and only car types with enough data points are kept. Then I group the data by brand, model and transmission type, so that for each group a simple linear regression model can be fitted:

Price = Intercept + depreciation * KM
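A minimal sketch of this step, assuming a data frame `cars` with columns brand, model, transmission, price and km:

```
library(dplyr)
library(purrr)
library(broom)

## Fit one straight line per car type and collect the coefficients:
## the intercept is the price of a new car, the slope the value lost per km.
results <- cars %>%
  split(list(cars$brand, cars$model, cars$transmission), drop = TRUE) %>%
  map_df(~ tidy(lm(price ~ km, data = .x)), .id = "car_type")

head(results)
```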

The following table shows the results:

So, on average a new Porsche 911 costs 117,278.60 Euro, and every kilometer you drive will cost you around 49.75 cents in loss of value. The complete table with all car types can be found on RPubs. Although a straight line is simple and has easy-to-interpret parameters, it is not a realistic model, as can be seen in the following figure:

A better fit would be a non-linear depreciation model: for example, exponential depreciation, or, if you don’t want to specify a particular functional form, some kind of smoothing spline. The R code only needs to be modified slightly; the sketch below fits a natural cubic spline per car type.
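A minimal sketch, reusing the assumed `cars` data frame from above; the degrees of freedom of the spline are a choice I have not tuned here.

```
library(splines)
library(dplyr)
library(purrr)
library(broom)

## Fit a natural cubic spline in km per car type.
spline_fits <- cars %>%
  split(list(cars$brand, cars$model, cars$transmission), drop = TRUE) %>%
  map(~ lm(price ~ ns(km, df = 4), data = .x))

## R-squared per car type
r2 <- map_df(spline_fits, glance, .id = "car_type")
head(r2[, c("car_type", "r.squared")])
```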

It is a better model (in terms of R-squared): it follows the non-linear depreciation that we can see in the data. However, we no longer have a single depreciation value; how much value a certain car loses per kilometer now depends on the number of kilometers already driven. It is the derivative of the fitted spline curve. For example, the spline curves fitted for a Renault Clio are shown in the figure below. A Clio with automatic transmission hardly loses any value after 100,000 km.

I have created a small shiny app so that you can see the curves of all the car types.

## Conclusion

Despite my data science exercise and beautiful natural cubic spline models, buying a brand new car involves a lot of emotion. My wife wants a blue Citroen C4 Picasso, no matter what cubic spline model and R-squared I show her!

So just ignore my analysis and buy the car that feels good to you!! Cheers, Longhow.

# Danger, Caution H2O steam is very hot!!

H2O has recently released its steam AI engine, a fully open-source engine that supports the management and deployment of machine learning models. Both H2O on R and H2O steam are easy to set up and use, and the two complement each other perfectly.

## A very simple example

First, use H2O on R to create some predictive models. Due to a lack of inspiration, I just used the iris data set to create some binary classifiers; a minimal sketch is given below.
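The binary target and the model ids in this sketch are my own choices, just for illustration.

```
library(h2o)
h2o.init()

## A binary target just for illustration: virginica yes/no.
iris2 <- iris
iris2$virginica <- as.factor(iris2$Species == "virginica")
iris2$Species   <- NULL

iris.hex <- as.h2o(iris2)

## Two simple classifiers; the model_id makes them easy to find back in steam.
gbm1 <- h2o.gbm(x = 1:4, y = "virginica", training_frame = iris.hex,
                model_id = "iris_gbm")
glm1 <- h2o.glm(x = 1:4, y = "virginica", training_frame = iris.hex,
                family = "binomial", model_id = "iris_glm")
```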

Once these models are trained, they are available for use in the H2O steam engine. A nice web interface allows you to set up a project in H2O steam to manage and display summary information of the models.

In H2O steam you can select a model that you want to deploy. It then becomes a service with a REST API, and a page is created to test the service.

And that is it! Your predictive model is up and running and waiting to be called from any application that can make REST API calls.
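Calling such a service from R is just an HTTP POST. The sketch below is hypothetical: the host, port, path and payload format depend on how the service was deployed, so check the test page of your deployment for the exact URL and request body.

```
library(httr)
library(jsonlite)

## Hypothetical scoring endpoint of the deployed iris model.
scoreURL <- "http://localhost:8080/iris_gbm/predict"

newflower <- list(Sepal.Length = 5.1, Sepal.Width = 3.5,
                  Petal.Length = 1.4, Petal.Width = 0.2)

resp <- POST(scoreURL,
             body = toJSON(newflower, auto_unbox = TRUE),
             content_type_json())
content(resp)
```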

There is a lot more to explore in H2O steam, but be careful: H2O steam is very hot!

# Some insights in soccer transfers using Market Basket Analysis

## Introduction

Although more than 20 years old, Market Basket Analysis (MBA) (or association rules mining) can still be a very useful technique to gain insights into large transactional data sets. The classical example is transactional data from a supermarket: for each customer we know which individual products (items) he put in his basket and bought. Other use cases for MBA could be web click data, log files, and even questionnaires.

With market basket analysis we can identify items that are frequently bought together. Usually the results of an MBA are presented in the form of rules. The rules can be as simple as {A ==> B}, when a customer buys item A then it is (very) likely that the customer buys item B. More complex rules are also possible {A, B ==> D, F}, when a customer buys items A and B then it is likely that he buys items D and F.

## Soccer transactional data

To perform MBA you of course need data, but I don’t have real transactional data from a retailer that I can present here, so I am using soccer data instead 🙂 From the Kaggle site you can download some soccer data, thanks to Hugo Mathien. The data contains around 25,000 matches from eleven European soccer leagues, from season 2008/2009 until season 2015/2016. After some data wrangling I was able to generate a transactional data set suitable for market basket analysis. The data structure is very simple; some records are given in the figure below:

So we do not have customers but soccer players, and we do not have products but soccer clubs. In total, my soccer transactional data set contains around 18,000 records. Obviously, these records do not only include the multi-million transfers covered in the media, but also all the transfers of players nobody has ever heard of 🙂

In R you can use the arules package for MBA / association rules mining. Alternatively, when the order of the transactions is important, as with my soccer transfers, you should use the arulesSequences package. After running the algorithm I got some interesting results; a minimal sketch of the mining step is given a bit further below. The figure below shows the most frequently occurring transfers between clubs:

So in this data set the most frequently occurring transfer is from Fiorentina to Genoa (12 transfers in total). I have published the entire table with the rules on RPubs, where you can look up the transfer activity of your favorite soccer club.
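For completeness, a minimal sketch of the mining step with the arules package; the column names of the input file are assumptions of mine, and the sequential variant with arulesSequences follows the same pattern but additionally needs sequence and event identifiers.

```
library(arules)

## transfers.csv is assumed to contain one row per player-club combination,
## with columns playerID and club.
transfers <- read.csv("transfers.csv", stringsAsFactors = FALSE)

## One "basket" of clubs per player, coerced to a transactions object.
baskets <- split(transfers$club, transfers$playerID)
trans   <- as(baskets, "transactions")

## Mine rules with a low support: a transfer between two specific clubs is rare.
rules <- apriori(trans,
                 parameter = list(supp = 0.0005, conf = 0.05, minlen = 2))
inspect(head(sort(rules, by = "support"), 10))
```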

### Network graph visualization

All the rules that we get from the association rules mining form a network graph: the individual soccer clubs are the nodes of the graph and each rule “from ==> to” is an edge. In R, network graphs can be visualized nicely by means of the visNetwork package; a sketch of building such a graph is given below. The network itself is shown in the picture below.
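A minimal sketch, assuming a data frame `rules_df` with columns from, to and support (one row per mined rule) and a named vector `club_league` that maps each club to its league:

```
library(visNetwork)

nodes <- data.frame(id = unique(c(rules_df$from, rules_df$to)),
                    stringsAsFactors = FALSE)
nodes$label <- nodes$id
nodes$group <- club_league[nodes$id]           # color the nodes by league

edges <- data.frame(from  = rules_df$from,
                    to    = rules_df$to,
                    value = rules_df$support)   # edge width follows rule support

visNetwork(nodes, edges) %>%
  visOptions(highlightNearest = TRUE)
```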

An interactive version can be found on RPubs. The different colors represent the different soccer leagues. There are eleven leagues in this data set (Europe has more, of course), and we see that the Polish league is quite isolated from the rest. Almost blended into each other are the German, Spanish, English and French leagues. Less connected are the Scottish and Portuguese leagues, but also in the big English Premier League and the German league you will find less connected clubs like Bournemouth, Reading or Arminia Bielefeld.

The size of a node in the above graph represents its betweenness centrality, an indicator of how central a node is in a network: it counts the shortest paths between all pairs of vertices that pass through that node. In R, betweenness measures can be calculated with the igraph package (see the sketch below). The most central clubs in the transfers of players are Sporting CP, Lechia Gdansk, Sunderland, FC Porto and PSV Eindhoven.
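A minimal sketch, again assuming the `rules_df` data frame with the mined from/to pairs:

```
library(igraph)

## Build a directed graph from the "from ==> to" rules and
## compute the betweenness centrality per club.
g   <- graph_from_data_frame(rules_df[, c("from", "to")], directed = TRUE)
btw <- betweenness(g)

head(sort(btw, decreasing = TRUE), 5)    # the most central clubs
```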

## Virtual Items

An old trick among marketeers is to use virtual items in a market basket analysis. Besides the ‘physical’ items that a customer has in his basket, a marketeer can add extra virtual items. These could be, for example, customer characteristics like age class or sex, but also things like day of week, region, etc. The transactional data with virtual items might look like:

If you run an MBA on the transactional data with virtual items, interesting rules might appear. For example:

• {Chocolate, Female ==> Eggs}
• {Chocolate, Male ==> Apples}
• {Beer, Friday, Male, Age[18:23] ==> sausages}.

Virtual items that I can add to my soccer transactional data are:

• Age class (four classes): 1: players younger than 25, 2: [25, 29), 3: [29, 33), and 4: players who are 33 or older.
• Preferred foot (two classes): left or right.
• Height class (four classes): 1: players smaller than 178 cm, 2: [178, 183), 3: [183, 186), and 4: players taller than 186 cm.
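A minimal sketch of constructing these virtual items, assuming a data frame `players` with columns playerID, age, preferred_foot and height (in cm), and the `baskets` list of clubs per player from the earlier sketch:

```
players$age_class <- cut(players$age, breaks = c(-Inf, 25, 29, 33, Inf),
                         labels = c("age<25", "age[25,29)", "age[29,33)", "age>=33"),
                         right = FALSE)

players$height_class <- cut(players$height, breaks = c(-Inf, 178, 183, 186, Inf),
                            labels = c("h<178", "h[178,183)", "h[183,186)", "h>=186"),
                            right = FALSE)

players$foot_item <- paste0("foot=", players$preferred_foot)

## Append the three virtual items to each player's basket of clubs.
extra <- split(with(players, c(as.character(age_class),
                               as.character(height_class), foot_item)),
               rep(players$playerID, 3))
baskets_virtual <- mapply(c, baskets[names(extra)], extra, SIMPLIFY = FALSE)
```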

After running the algorithm again, the results allow you to find, for example, the frequently occurring transfers of left-footers. I can see four left-footers that transferred from Roma to Sampdoria; more rules can be seen on my RPubs site.

## Conclusion

When you have transactional data, even a set as small as these soccer transfers, market basket analysis is definitely one of the techniques you should try to get some first insights. Feel free to look at my R code on GitHub and experiment with the soccer transfers data.

Cheers Longhow.

# New chapters for 50 shades of grey….

## Introduction

Some time ago I had the honor of attending an interesting talk by Tijmen Blankevoort on neural networks and deep learning. Convolutional and recurrent neural networks were topics that had already caught my interest, and this talk inspired me to dive deeper into them and do some more experiments.

In the same session, organized by Martin de Lusenet for Ziggo (a Dutch cable company), I also had the honor of giving a talk; my presentation contained a text mining experiment that I did earlier on the Dutch TV soap GTST (“Goede Tijden Slechte Tijden”). A nice idea from Tijmen was: why not use deep learning to generate new episode plots for GTST?

So I did that; see my LinkedIn post on GTST. However, those episodes are in Dutch and, I guess, only interesting for people here in the Netherlands. So to make things more international and spicier, I generated some new text based on deep learning and the erotic romance novel 50 shades of grey 🙂

## More than plain vanilla networks

In R or SAS you have been able to train plain vanilla neural networks for a long time: the so-called fully connected networks, where all input nodes are connected to all nodes in the first hidden layer, and all nodes in a hidden layer are connected to all nodes in the following hidden layer or the output layer.

In more recent years, deep learning frameworks have become very popular, for example Caffe, Torch, CNTK, TensorFlow and MXNET. The added value of these frameworks compared to, say, SAS is:

• They support more network types than plain vanilla networks. For example, convolutional networks, where not all input nodes are connected to a next layer. And recurrent networks, where loops are present. A nice introduction to these networks can be found here and here.
• They support computations on GPUs, which can speed up things dramatically.
• They are open-source and free. No need for long sales and implementation cycles 🙂 Just download it and use it!

recurrent neural network

## My 50 Shades of Grey experiment

For my experiment I used the text of the erotic romance novel 50 shades of grey. A PDF can be found here; I used xpdfbin to extract all the words into a plain text file. I trained a Long Short-Term Memory network (LSTM, a special type of recurrent network) with MXNET. The reason for using MXNET is that it has a nice R interface, so I can just stay in my comfortable RStudio environment.

Moreover, the R example script of MXNET is ready to run; I just changed the input data and used more rounds of training and more hidden layers. The script and the data can be found on GitHub.

The LSTM model is fitted at the character level. The complete romance novel contains 817,204 characters, and each character is mapped to a number (there are 91 unique characters). The first few numbers are shown in the following figure; a small sketch of this mapping step is given below.
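A small sketch of that mapping step (the file name of the extracted text is my own):

```
## Read the extracted plain text and map every character to an integer id.
text  <- readChar("fiftyshades.txt", file.info("fiftyshades.txt")$size)
chars <- strsplit(text, "")[[1]]

vocab   <- sort(unique(chars))            # the 91 unique characters
char_id <- setNames(seq_along(vocab), vocab)

ids <- char_id[chars]                     # the novel as a sequence of integers
length(ids)                               # 817,204
head(ids)
```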

Once the model has been trained it can generate new text, character by character!

```
arsess whatever
yuu’re still expeliar a sally. Reftion while break in a limot.”
“Yes, ald what’s at my artmer and brow maned, but I’m so then for a
dinches suppretion. If you think vining. “Anastasia, and depregineon posing rave.
He’d deharing minuld, him drits.

“Miss Steele
“Fasting at liptfel, Miss I’ve dacind her leaches reme,” he knimes.
“I want to blight on to the wriptions of my great. I find sU she asks the stroke, to read with what’s old both – in our fills into his ear, surge • whirl happy, this is subconisue. Mrs. I can say about the battractive see. I slues
is her ever returns. “Anab.

It’s too even ullnes. “By heaven. Grey
about his voice. “Rest of the meriction.”
He scrompts to the possible. I shuke my too sucking four finishessaures. I need to fush quint the only more eat at me.
“Oh my. Kate. He’s follower socks?
“Lice in Quietly. In so morcieut wait to obsed teach beside my tired steately liked trying that.”
Kate for new of its street of confcinged. I haven’t Can regree.
“Where.” I fluscs up hwindwer-and I have

I’ll staring for conisure, pain!”
I know he’s just doesn’t walk to my backeting on Kate has hotelby of confidered Christaal side, supproately. Elliot, but it’s the ESca, that feel posing, it make my just drinking my eyes bigror on my head. S I’ll tratter topality butterch,” I mud
a nevignes, bleamn. “It’s not by there soup. He’s washing, and I arms and have. I wave to make my eyes. It’s forgately? Dash I’d desire to come your drink my heathman legt
you hay D1 Eyep, Christian Gry, husder with a truite sippking, I coold behind, it didn’t want to mive not to my stop?”
“Yes.”

“Sire, stcaring it was do and he licks his viice ever.”
I murmurs, most stare thut’s the then staraline for neced outsive. She
so know what differ at,” he murmurs?
“Are you?” Eviulder keep “Oh,_ I frosing gylaced in – angred. I am most drink to start and try aparts through. I really thrial you, dly woff you stund, there, I care an right dains to rainer.” He likes his eye finally finally my eyes to over opper heaven, places my trars his Necked her jups.
“Do you think your or Christian find at me, is so with that stand at my mouth sait the laxes any litee, this is a memory rude. It
flush,” He says usteer?” “Are so that front up.
I preparraps. I don’t scomine Kneat for from Christian.
“Christian,’! he leads the acnook. I can’t see. I breathing Kate’ve bill more over keen by. He releases?”
“I’m kisses take other in to peekies my tipgents my
```

## Conclusion

The generated text does not make any sense, nor will it win any literature prize soon 🙂 Keep in mind that the model is based on ‘only’ 817,204 characters (which is considered a small number), and I did not bother to fine-tune the model at all. Still, it is funny and remarkable to see that, when generating text character by character, the model can produce a lot of correct English words and even some basic grammar patterns!

cheers, Longhow.

# The Eurovision 2016 song contest in an R Shiny app

## Introduction

In just a few weeks the Eurovision 2016 song contest will be held again. There are 43 participants, two semi-finals on the 10th and 12th of May, and a final on the 14th of May. It’s going to be a long watch in front of the television… 🙂 Who is going to win? Well, you could ask experts, look up the number of tweets on the different participants, count YouTube likes, or go to bookmakers’ sites. At the time of writing, Russia was the favorite among the bookmakers according to this overview of bookmakers.

## Spotify data

As an alternative, I used Spotify data. There is a Spotify API that allows you to get information on playlists, artists, tracks, etc. It is not difficult to extract interesting information from the API:

• Register a new application on the ‘My Applications’ site
• You will then get a client ID and a client secret

In R you can use the httr library to make API calls. First, with the client ID and secret, you retrieve a token; then, with the token, you can call one of the Spotify API endpoints, for example to get information on a specific artist. See the R code snippet below.

```
library(httr)

clientID = '12345678910'
secret = 'ABCDEFGHIJKLMNOPQR'

## retrieve an access token with the client credentials flow
response = POST(
  'https://accounts.spotify.com/api/token',
  accept_json(),
  authenticate(clientID, secret),
  body = list(grant_type = 'client_credentials'),
  encode = 'form',
  verbose()
)

mytoken = content(response)$access_token

## Frank Sinatra spotify artist ID
artistID = '1Mxqyy3pSjf8kZZL4QVxS0'

## call the artist endpoint with the token
URI = paste0('https://api.spotify.com/v1/artists/', artistID)
response2 = GET(URI, add_headers(Authorization = paste('Bearer', mytoken)))
Artist = content(response2)
```

The content of the second response object is a nested list with information on the artist: for example, links to images, the number of followers, the popularity of the artist, etc.

## Track popularity

An interesting API endpoint is the track API, especially the information on track popularity. What is the track popularity? Taken from the Spotify website:

The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.

I wrote a small R script to retrieve, every hour, the track popularity of each of the 43 tracks that participate in this year’s Eurovision song contest. The picture below lists the top 10 most popular tracks of the song contest participants; a minimal sketch of a single popularity call is given first.
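In this sketch the track ID is a placeholder of mine; the actual script loops over the 43 track IDs and is run hourly by a scheduler such as cron, reusing the token from the snippet above.

```
library(httr)

trackID  <- "0123456789abcdefghijkl"     # placeholder for a real Spotify track ID
trackURI <- paste0("https://api.spotify.com/v1/tracks/", trackID)

resp  <- GET(trackURI, add_headers(Authorization = paste("Bearer", mytoken)))
track <- content(resp)

track$name
track$popularity                          # integer between 0 and 100
```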

At the time of writing, the most popular track was “If I Were Sorry” by Frans (Sweden), which is placed at number three by the bookmakers. The least popular track was “The Real Thing” by Highway (Montenegro), corresponding to the last place at the bookmakers.

There is not a lot of movement in the track popularity; it is very stable over time. Maybe when we get nearer to the song contest final in May we’ll see some more movement. I have also kept track of the number of followers that each artist has. There is much more movement there; see the figure below.

Every day around 5 pm – 6 pm, Frans gets around 10 to 12 new followers on Spotify! Artists may of course also lose followers, for example Douwe Bob in the above picture.

## Audio features and related artists

Audio features of tracks, like loudness, danceability, tempo, etc., can also be retrieved from the API (a sketch is given below). A simple scatter plot of the 43 songs reveals the loud and undanceable songs, for example the entry of Francesca Michielin (Italy); she is one of the six lucky artists that already have a place in the final!
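A minimal sketch, reusing the token and the placeholder track ID from the snippets above:

```
library(httr)

featURI  <- paste0("https://api.spotify.com/v1/audio-features/", trackID)
resp     <- GET(featURI, add_headers(Authorization = paste("Bearer", mytoken)))
features <- content(resp)

features$loudness
features$danceability
features$tempo
```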

Every artist on Spotify also has a set of related artists; this set can be retrieved from the API and can be viewed nicely in a network graph.

The green nodes are the 43 song contest participants. Many of them are ‘isolated’ but some of them are related to each other or connected through a common related artist.

## Conclusion

I have created a small Eurovision 2016 Shiny app that summarizes the above information so you can see and listen for yourself. We will find out how strongly the Spotify track popularity is correlated with the final ranking of the Eurovision song contest on May the 14th!

Cheers, Longhow.

# Delays on the Dutch railway system

I almost never travel by train; the last time was years ago. However, recently I had to take the train from Amsterdam and it was delayed by 5 minutes. No big deal, but it made me curious how often such delays occur on the Dutch railway system. I couldn’t quickly find a historical data set with information on delays, so I decided to gather my own data.

The Dutch Railways provide an API (the NS API) that returns current departure and delay data for a given train station. I have written a small R script that calls this API for each of the 400 train stations in the Netherlands; this script is scheduled to run every 10 minutes (a sketch of one collection run follows after the list below). The API returns data in XML format; the basic entity is “a departing train”. For each departing train we know its departure time, the destination, the departing train station, the type of train, the delay (if there is any), etc. So what to do with all these departing trains? Throw it all into MongoDB. Why?

• Not for any particular reason :-).
• It’s easy to install and setup on my little Ubuntu server.
• There is a nice R interface to MongoDB.
• The response structure (see picture below) from the API is not that difficult to flatten into a table, but NoSQL sounds sexier than MySQL nowadays 🙂
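A sketch of one collection run for a single station is shown below. The endpoint and XML field names are written down from memory of the (old) NS API and may need adjusting; you also need to request API credentials from NS first.

```
library(httr)
library(xml2)
library(mongolite)

## One call to the 'actuele vertrektijden' endpoint, here for station ASD (Amsterdam Centraal).
resp <- GET("http://webservices.ns.nl/ns-api-avt?station=ASD",
            authenticate("your@email.com", "your-api-key"))

xml    <- read_xml(content(resp, as = "text", encoding = "UTF-8"))
trains <- xml_find_all(xml, ".//VertrekkendeTrein")

departures <- data.frame(
  station     = "ASD",
  time        = xml_text(xml_find_first(trains, "./VertrekTijd")),
  destination = xml_text(xml_find_first(trains, "./EindBestemming")),
  type        = xml_text(xml_find_first(trains, "./TreinSoort")),
  delay       = xml_text(xml_find_first(trains, "./VertrekVertraging")),
  stringsAsFactors = FALSE
)

## Append the departures to MongoDB: one document per departing train.
m <- mongo(collection = "departures", db = "trains")
m$insert(departures)
```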

I started to collect train departure data on the 4th of January; per day there are around 48,000 train departures in the Netherlands. I can see how many of them are delayed, per day, per station or per hour. Of course, since the collection started only a few days ago, it is hard to use these data to say anything about long-term delay rates of the Dutch railway system. But it is a start.

To present this delay information in an interactive way to others I have created an R Shiny app that queries the MongoDB database. The picture below from my Shiny app shows the delay rates per train station on the 4th of January 2016, an icy day especially in the north of the Netherlands.

Cheers,

Longhow

# Analyzing “Twitter faces” in R with Microsoft Project Oxford

## Introduction

In my previous blog post I used the Microsoft Translator API in my BonAppetit Shiny app to recommend restaurants to tourists. I’m getting a little bit addicted to the Microsoft APIs; they can be fun to use :-). In this blog post I will briefly describe some of Microsoft’s Project Oxford APIs.

The APIs can be called from within R, and if you combine them with other APIs, for example Twitter, interesting “Twitter face” analyses can be done. See my “TweetFace” Shiny app to analyse faces that can be found on Twitter.

## Project Oxford

The APIs of Project Oxford can be categorized into:

• Computer Vision,
• Face,
• Video,
• Speech and
• Language.

The free tier subscription provides 5,000 API calls per month (with a rate limit of 20 calls per minute). I focused my experiments on the computer vision and face APIs; a lot of functionality is available to analyze images, for example categorization of images, adult content detection, OCR, face recognition, gender analysis, age estimation and emotion detection.

## Calling the APIs from R

The httr package provides very convenient functions to call the Microsoft APIs. You need to sign up first and obtain a key. Let’s do a simple test on Angelina Jolie using the face detect API.

```
library(httr)

faceURL = "https://api.projectoxford.ai/face/v1.0/detect?returnFaceId=true&returnFaceLandmarks=true&returnFaceAttributes=age,gender,smile,facialHair"
img.url = 'http://www.buro247.com/images/Angelina-Jolie-2.jpg'

faceKEY = '123456789101112131415'

mybody = list(url = img.url)

faceResponse = POST(
  url = faceURL,
  body = mybody,
  encode = 'json',
  add_headers('Ocp-Apim-Subscription-Key' = faceKEY)  # the subscription key must be sent with each call
)
faceResponse
Response [https://api.projectoxford.ai/face/v1.0/detect?returnFaceId=true&returnFaceLandmarks=true&returnFaceAttributes=age,gender,smile,facialHair]
Date: 2015-12-16 10:13
Status: 200
Content-Type: application/json; charset=utf-8
Size: 1.27 kB
```

If the call was successful a “Status: 200” is returned and the response object is filled with interesting information. The API returns the information as JSON which is parsed by R into nested lists.

```
AngelinaFace = content(faceResponse)[[1]]
names(AngelinaFace)
[1] "faceId"  "faceRectangle" "faceLandmarks" "faceAttributes"

AngelinaFace$faceAttributes
$gender
[1] "female"

$age
[1] 32.6

$facialHair
$facialHair$moustache
[1] 0

$facialHair$beard
[1] 0

$facialHair$sideburns
[1] 0

```

Well, the API recognized the gender and that there is no facial hair :-), but her age is underestimated: Angelina is 40, not 32.6! Let’s look at emotions; the emotion API has its own key and URL.

```
URL.emoface = 'https://api.projectoxford.ai/emotion/v1.0/recognize'

emotionKey = 'ABCDEF123456789101112131415'

mybody = list(url = img.url)

faceEMO = POST(
  url = URL.emoface,
  body = mybody,
  encode = 'json',
  add_headers('Ocp-Apim-Subscription-Key' = emotionKey)
)
faceEMO
AngelinaEmotions = content(faceEMO)[[1]]
AngelinaEmotions$scores
$anger
[1] 4.573111e-05

$contempt
[1] 0.001244121

$disgust
[1] 0.0001096572

$fear
[1] 1.256477e-06

$happiness
[1] 0.0004313129

$neutral
[1] 0.9977798

$sadness
[1] 0.0003823086

$surprise
[1] 5.75276e-06
```

A fairly neutral face. Let’s test some other Angelina faces.

## Find similar faces

A nice piece of functionality of the API is finding similar faces. First, a list of faces needs to be created; then, with a ‘query face’, you can search for similar-looking faces in that list. Let’s look at the most sexy actresses.

```
## Scrape the image URLs of the actresses
library(rvest)

## 'out' is the parsed HTML of the page that lists the actresses
## (the URL of that list page is not shown here)
out = read_html(listURL)

images = html_nodes(out, '.zero-z-index')
imglinks = html_nodes(out, xpath = "//img[@class='zero-z-index']/@src") %>% html_text()

## additional information: the name of the actress
imgalts = html_nodes(out, xpath = "//img[@class='zero-z-index']/@alt") %>% html_text()

```

Create an empty face list by calling the facelist API. You should specify a facelistID, which is placed as a request parameter at the end of the facelist URL. My facelistID is “listofsexyactresses”, as shown in the code below.

```
### create an id and name for the face list
URL.face = "https://api.projectoxford.ai/face/v1.0/facelists/listofsexyactresses"

mybody = list(name = 'top 100 of sexy actresses')

faceLIST = PUT(
  url = URL.face,
  body = mybody,
  encode = 'json',
  add_headers('Ocp-Apim-Subscription-Key' = faceKEY)
)
faceLIST
Response [https://api.projectoxford.ai/face/v1.0/facelists/listofsexyactresses]
Date: 2015-12-17 15:10
Status: 200
Content-Type: application/json; charset=utf-8
Size: 108 B

```

Now fill the list with images. The API allows you to provide user data with each image; this can be handy for inserting names or other info. For one image this works as follows:

```
i = 1
userdata = imgalts[i]
linkie   = imglinks[i]      # the image URL of this actress

face.uri = paste0(
  'https://api.projectoxford.ai/face/v1.0/facelists/listofsexyactresses/persistedFaces?userData=',
  userdata
)
face.uri = URLencode(face.uri)
mybody = list(url = linkie)

## add the face to the face list
faceADD = POST(
  url = face.uri,
  body = mybody,
  encode = 'json',
  add_headers('Ocp-Apim-Subscription-Key' = faceKEY)
)
faceADD
Response [https://api.projectoxford.ai/face/v1.0/facelists/listofsexyactresses/persistedFaces?userData=Image%20of%20Naomi%20Watts]
Date: 2015-12-17 15:58
Status: 200
Content-Type: application/json; charset=utf-8
Size: 58 B

content(faceADD)
$persistedFaceId
[1] '32fa4d1c-da68-45fd-9818-19a10beea1c2'

## status 200 is OK
```

Just loop over the 100 faces to complete the face list. With the list of images in place, we can now perform a query with a new ‘query face’. Two steps are needed: first, call the face detect API to obtain a face ID. I am going to use an image of Angelina, but a different one than the image on IMDB.

```
faceDetectURL = 'https://api.projectoxford.ai/face/v1.0/detect?returnFaceId=true&returnFaceLandmarks=true&returnFaceAttributes=age,gender,smile,facialHair'

## img.url now points to the new query image of Angelina
mybody = list(url = img.url)

faceRESO = POST(
  url = faceDetectURL,
  body = mybody,
  encode = 'json',
  add_headers('Ocp-Apim-Subscription-Key' = faceKEY)
)
faceRESO
fID = content(faceRESO)[[1]]$faceId

```

With the face ID, query the face list with the “find similar” API. The best match comes back with a confidence of almost 60%.

```
sim.URI = 'https://api.projectoxford.ai/face/v1.0/findsimilars'

mybody = list(faceID = fID, faceListID = 'listofsexyactresses')

faceSIM = POST(
  url = sim.URI,
  body = mybody,
  encode = 'json',
  add_headers('Ocp-Apim-Subscription-Key' = faceKEY)
)
faceSIM
yy = content(faceSIM)
yy
[[1]]
[[1]]$persistedFaceId
[1] "6b4ff942-b216-4817-9739-3653a467a594"

[[1]]$confidence
[1] 0.5980769
```

The picture below shows some other matches…..

## Conclusion

The APIs of Microsoft’s Project Oxford provide nice functionality for computer vision and face analysis. They are fun to use; see my ‘TweetFace’ Shiny app to analyse images on Twitter.

Cheers,

Longhow