Oil leakage… those old BMWs are bad :-)


Introduction

My first car was a 13-year-old Mitsubishi Colt, for which I paid 3000 Dutch guilders. I can still remember a friend who did not want me to park this car in front of his house because of possible oil leakage.

[Image: a red 1984 Mitsubishi Colt Turbo]

Can you get an idea of which cars are likely to leak oil? Well, with open car data from the Dutch RDW you can. The RDW is the Netherlands Vehicle Authority in the mobility chain.

RDW Data

There are many data sets that you can download. I have used the following:

  • Observed defects. This set contains 22 million records on observed defects at car level (license plate number). Cars in the Netherlands have to be checked yearly, and the findings of each check are submitted to the RDW.
  • Basic car details. This set contains 9 million records, one for every car in the Netherlands: license plate number, brand, make, weight and type of car.
  • Defect codes. This little table provides a description of all the possible defect codes, so I know that code ‘RA02’ in the observed defects data set represents ‘oil leakage’.

Simple Analysis in R

I imported the data into R and, with some simple dplyr statements, determined per car make and age (in years) the number of cars with an observed oil leakage defect. I also determined how many cars there are per make and age; dividing those two numbers gives a so-called oil leak percentage.
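To give an idea of how such a computation could look, here is a rough dplyr sketch. The data frame and column names (defects, cars, license_plate, brand, age, defect_code) are assumptions for illustration; the actual RDW column names differ.

library(dplyr)

# cars with an observed oil leakage defect, counted per brand and age
oil_leaks <- defects %>%
  filter(defect_code == "RA02") %>%                 # 'RA02' = oil leakage
  inner_join(cars, by = "license_plate") %>%
  group_by(brand, age) %>%
  summarise(n_leaking = n())

# total number of cars per brand and age
fleet <- cars %>%
  group_by(brand, age) %>%
  summarise(n_total = n())

# oil leak percentage per brand and age
leak_pct <- oil_leaks %>%
  inner_join(fleet, by = c("brand", "age")) %>%
  mutate(oil_leak_percentage = 100 * n_leaking / n_total)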

 

For example, in the Netherlands there are 2043 Opel Astras that are four years old; three of them have an observed oil leak, so the oil leak percentage is 0.15%.

The graph below shows the oil leak percentages for different car brands and ages. Obviously, the older the car, the higher the leak percentage. But look at BMW: wow, those old BMWs are leaking oil like crazy… 🙂 The few lines of R code can be found here.

[Figure: oil leak percentages per car brand and age]

Conclusion

There is a lot in the open car data from the RDW; you can look at many more aspects and defects of cars. Regarding my old car: according to this data, Mitsubishis have a low oil leak percentage, even the older ones.

Cheers, Longhow

 

Interactive sunbuRst graphs in Power BI in 5 minutes!!


Introduction

If I mention Power BI to fellow data scientists, I often get strange looks. However, I quite like the tool; it is an easy and fast way to share results, KPIs and graphs with others. With the latest release, Power BI supports interactive R graphs, and they are easy to create as well.

Steps to follow

1. Install Node.js from here, then install the Power BI visuals tools with:

>npm install -g powerbi-visuals-tools

2. Create a new custom R visual:

>pbiviz new sunburstRHTMLVisual -t rhtml

3. This will create a directory sunburstRHTMLVisual. In that directory, edit the R script file script.r. It’s a one-liner to create a sunburst graph with the sunburstR package.

 

Values is the name of the input data frame: the data that is received from the Power BI Desktop application.
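Stripped to its essentials, script.r could look roughly like the sketch below. The one-liner is the sunburst() call; the source() line and the internalSaveWidget() helper come with the rhtml template that pbiviz generates.

source('./r_files/flatten_HTML.r')   # helpers provided by the rhtml template
library(sunburstR)

# Values is the data frame handed over by Power BI (sequences and counts)
p <- sunburst(Values)

# save the htmlwidget so that Power BI can render it
internalSaveWidget(p, 'out.html')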

4. Now package the custom R visual with the following command (issue this command inside the sunburstRHTMLVisual directory):

>pbiviz package

5. Inside the subfolder dist you will now find the file sunburstRHTMLVisual.pbiviz, which can be used in Power BI. Open the Power BI Desktop application, import a custom visual from file and select the sunburstRHTMLVisual.pbiviz file.

 

That’s it, you’re done!

The resulting graph in a dashboard

To use the visual you need a data set in Power BI with two columns: one with the sequences and one with the number of occurrences of the corresponding sequence.
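For example, a (hypothetical) data set could look like this, with the steps in each sequence separated by dashes, the format that sunburstR expects:

sequences <- data.frame(
  sequence = c("home-search-product-checkout",
               "home-search-product",
               "home-product"),
  count    = c(120, 310, 45)
)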

 

Click on the icon of the custom R visual you’ve just imported and select the two columns to get the interactive sunburst graph. Once the graph is created, you can hover over the rings to get more info, and you can turn on/off the legend.

 

Cheers, Longhow.

A “poor man’s video analyzer”…


Introduction

Not so long ago there was a nice Dataiku meetup with Pierre Gutierrez talking about transfer learning. RStudio recently released the keras package, an R interface to work with Keras for deep learning and transfer learning. Both events inspired me to do some experiments at my work here at RTL and explore how usable this is for us. I would like to share the slides of the presentation that I gave internally; you can find them on SlideShare.

As a side effect, another experiment that I would like to share is the “poor man’s video analyzer“. There are several vendors now that offer APIs to analyze videos, see for example the one that Microsoft offers. With just a few lines of R code I came up with a Shiny app that is a very cheap imitation 🙂

Setup of the R Shiny app

To run the Shiny app a few things are needed. Make sure that ffmpeg is installed; it is used to extract images from a video. TensorFlow and keras need to be installed as well. The extracted images from the video are passed through a pre-trained VGG16 network so that each image is tagged.
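The core of the app boils down to something like the sketch below; the file paths and the frames-per-second value are illustrative, and inside the Shiny app this all happens in the server function.

library(keras)

# extract one frame per second from the uploaded video with ffmpeg
system("ffmpeg -i video.mp4 -vf fps=1 frames/img_%04d.jpg")

# pre-trained VGG16 with ImageNet weights
model <- application_vgg16(weights = "imagenet")

tag_image <- function(path) {
  img <- image_load(path, target_size = c(224, 224)) %>%
    image_to_array() %>%
    array_reshape(c(1, 224, 224, 3)) %>%
    imagenet_preprocess_input()
  preds <- predict(model, img)
  imagenet_decode_predictions(preds, top = 3)[[1]]
}

# tag every extracted frame
tags <- lapply(list.files("frames", full.names = TRUE), tag_image)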

After this tagging a data table will appear with the images and their tags. That’s it! I am sure there are better visualizations than a data table to show a lot of images; if you have a better idea, just adjust my Shiny app on GitHub…. 🙂

Using the app, some screenshots

There is a simple interface: specify the number of frames per second that you want to analyze, and then upload a video. Many formats are supported (by ffmpeg), like *.mp4, *.mpeg and *.mov.

Click on ‘video images’ to start the analysis process. This can take a few minutes; when it is finished you will see a data table with the extracted images and their VGG16 tags.

Click on ‘info on extracted classes’ to see an overview of the classes. You will see a bar chart of the tags that were found and the output of ffmpeg, which shows some info on the video.

If you have code to improve the data table output into a fancier visualization, just go to my GitHub. For those who want to play around, there is a live video analyzer Shiny app here.

A Shiny app version using miniUI will be a better fit for small mobile screens.

Cheers, Longhow

Test driving Python integration in R, using the ‘reticulate’ package


Introduction

Not so long ago RStudio released the R package ‘reticulate‘, an R interface to Python. Of course, it was already possible to execute Python scripts from within R, but this integration takes it one step further: imported Python modules, classes and functions can be called inside an R session as if they were native R functions.
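A tiny illustration of the idea, using numpy as a stand-in module (assuming numpy is available in the Python installation that reticulate picks up):

library(reticulate)

np <- import("numpy")          # import a Python module into the R session
np$median(c(1, 3, 7, 20))      # call its functions as if they were R functions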

Below you’ll find some screenshots of code snippets that use certain Python modules within R through the reticulate package. On my GitHub page you’ll find the R files from which these snippets were taken.

Using python packages

The nice thing about reticulate in RStudio is the support for code completion. When you have imported a Python module, RStudio will recognize the methods that are available in it:

[Screenshot: RStudio code completion for an imported Python module]

The clarifai module

Clarifai provides a set of computer vision APIs for image recognition, face detection, extracting tags, etc. There is an official Python module, and there is also an R package by Gaurav Sood, but it exposes less functionality. So I am going to use the Python module in R. The following code snippet shows how easy it is to call Python functions.

[Screenshot: calling the clarifai Python module from R]

The output returned from the clarifai call is a nested list and can be quite intimidating at first sight. To browse through these nested lists and get a better idea of what is in them, you can use the listviewer package:

[Screenshot: browsing the nested clarifai output with listviewer]

The pattern.nl module

The pattern.nl module contains a fast part-of-speech tagger for Dutch, sentiment analysis, and tools for Dutch verb conjugation and noun singularization & pluralization. At the moment it does not support Python 3. That is not a big deal: I am using Anaconda and created a Python 2.7 environment to install pattern.nl in.

The nice thing about the reticulate package is that it allows you to choose a specific Python environment to be used.
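A small sketch of how that looks; the environment name "py27" is an assumption, use whatever name you gave your conda environment with pattern.nl installed.

library(reticulate)
use_condaenv("py27", required = TRUE)   # point reticulate to the Python 2.7 environment

pattern_nl <- import("pattern.nl")
pattern_nl$sentiment("De presentatie was erg goed!")   # returns polarity and subjectivity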

[Screenshot: calling pattern.nl sentiment from R]

The pytorch module

pytorch is a Python package that provides tensor computations and deep neural networks. There is no ‘R torch’ equivalent, but we can use reticulate in R. There is an example of training a logistic regression in pytorch, see the code here. It takes just a little rewriting to make this work in R; see the first few lines in the figure below.

[Screenshot: the pytorch logistic regression example rewritten in R]
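As a rough impression, the first lines of such a rewrite could look like this; a sketch only, assuming torch is installed in the active Python environment.

library(reticulate)

torch <- import("torch")
nn    <- import("torch.nn")

# logistic regression as a single linear layer followed by a sigmoid
model <- nn$Sequential(nn$Linear(2L, 1L), nn$Sigmoid())

# wrap some input data in a torch tensor
x <- torch$Tensor(matrix(rnorm(20), ncol = 2))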

Conclusion

As a data scientist you should know both R and Python; the reticulate package is no excuse for not learning Python! However, the reticulate package can be very useful if you want to do all your analysis in the RStudio environment. It works very well.

For example, I have used rvest to scrape some Dutch news texts, used the Python module pattern.nl for Dutch sentiment, and wrote an R Markdown document to present the results. The reticulate package is a nice way to keep everything in one environment.

Cheers, Longhow

Because it’s Friday… The IKEA Billy index


Introduction

Because it is Friday, another ‘playful and frivolous’ data exercise 🙂

IKEA is more than a store; it is a very nice experience to go through. I can drop off my two kids at Småland, have some ‘quality time’ walking around the store with my wife and eat some delicious Swedish meatballs. Back at home, the IKEA furniture is a good ‘relationship tester’: try building a big wardrobe together with your wife…..

The nice thing about IKEA is that you don’t have to come to the store for nothing: you can check the availability of an item on the IKEA website.

According to the website, this availability is refreshed every 1.5 hours. This gave me an idea: if I check the availability every 1.5 hours, I can get an idea of the number of items sold for a particular item.

The IKEA Billy index

Probably the most iconic IKEA item is the Billy bookcase. Just in case you don’t know what this bookcase looks like, below is a picture: simplicity in its most elegant form….

Every 1.5 hours over the last few months I have checked the Dutch IKEA website for the availability of this famous item in the 13 stores in the Netherlands, and calculated the negative difference between consecutive values.

The data that you get from this little playful exercise does not necessarily represent the number of Billy bookcases really sold. Maybe the stock got replenished in between, maybe items were moved internally to other stores. For example, if there are 50 Billys available in Amsterdam and 1.5 hours later there are 45, maybe 5 were sold, or 6 were sold and 1 was returned or replenished? I just don’t know!

All I see are movements in availability that might have been caused by products sold. But anyway, let’s call the movements in availability of the Billys the IKEA Billy index.

Some graphs of the Billy Index

Trends and forecasts

Facebook released a nice R package called prophet. It can be used to forecast time series, and it is used internally by Facebook across many applications. I ran the prophet forecasting algorithm on the IKEA Billy index; the graph below shows the result.

There are some high peaks at the end of October and the end of December. We can also clearly see the Saturday peaks that the algorithm has picked up from the historic data and projected into its future forecasts.
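For reference, running prophet on such a series requires little more than the following sketch, assuming the Billy index lives in a data frame billy_index with the columns ds (date) and y (value) that prophet expects.

library(prophet)

m        <- prophet(billy_index)
future   <- make_future_dataframe(m, periods = 60)   # extend 60 days into the future
forecast <- predict(m, future)

plot(m, forecast)                        # forecast with uncertainty intervals
prophet_plot_components(m, forecast)     # trend and weekly seasonality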

Weekday and color

The graph above already showed that the Billy index is high on Saturdays; what about the other days? The graph below depicts the sum of the IKEA Billy index per weekday since I started to collect this data (end of September). Wednesdays and Thursdays are less active days.

 

Clearly most of the Billys are white.

Correlations

Does the daily Billy index correlate with other data? I have used some Dutch weather data that can be downloaded from the Royal Netherlands Meteorological Institute (KNMI). The data consists of many daily weather variables. The graph below shows a correlation matrix of the IKEA Billy index and some of these weather variables.

 

The only correlation of the IKEA Billy index with a weather variable that has some meaning is the one with wind speed (-0.19): increasing wind speed means decreasing Billys.

 

It’s an explainable correlation of course…. 🙂 You wouldn’t want to go to IKEA on (very) windy days; it is not easy to drive through strong winds with a Billy on top of your car.

 

Cheers, Longhow.

R formulas in Spark and un-nesting data in SparklyR: Nice and handy!


Intro

In an earlier post I talked about Spark and sparklyr and did some experiments. At my work here at RTL Nederland we have a Spark cluster on Amazon EMR to do some serious heavy lifting on click and video-on-demand data. For an R user it makes perfect sense to use Spark through the sparklyr interface. However, using Spark through the pySpark interface certainly has its benefits: it exposes much more of the Spark functionality, and I find the concept of ML Pipelines in Spark very elegant.

While using Spark I picked up two little tricks that I would like to share with you below.

The RFormula feature selector

As an R user you have to get used to using Spark through pySpark; moreover, I had to brush up some of my rusty Python knowledge. For training machine learning models there is some help though, in the form of an RFormula 🙂

R users know the concept of model formulae in R; it can be a handy way to formulate predictive models in a concise way. In Spark you can also use this concept. Only a limited set of R operators is available (+, -, . and :), but it is enough to be useful. The two figures below show a simple example.


from pyspark.ml.feature import RFormula
f1 = "Targetf ~ paidDuration + Gender "
formula = RFormula(formula = f1)
train2 = formula.fit(train).transform(train)

[Screenshot: the transformed Spark data frame with the features column created by RFormula]

A handy thing about an RFormula in Spark is that, just like a formula in R’s lm and some other modeling functions, string features used in an RFormula are automatically one-hot encoded, so that they can be used directly in the Spark machine learning algorithms.

Nested (hierarchical) data in sparklyR

Sometimes you may find yourself with nested, hierarchical data. In pySpark you can flatten this hierarchy if needed. A simple example: suppose you read in a parquet file and it has the following structure:

[Screenshot: the nested schema of the parquet file]

Then to flatten the data you could use:

[Screenshot: flattening the data frame in pySpark]

In sparklyr, however, reading the same parquet file results in something that isn’t useful to work with at first sight. If you open the table viewer to see the data, you will see rows with: <environment>.

[Screenshot: the nested data shown in the sparklyr table viewer]

Fortunately, the facilities used internally by sparklyr to call Spark are available to the end user, so you can invoke more Spark methods if needed. We can invoke the select and col methods ourselves to flatten the hierarchy.

[Screenshot: invoking the select method from sparklyr]

After registering the output object, it is visible in the Spark interface and you can view the content.

[Screenshot: the un-nested table in the Spark interface]
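To give an impression of what this could look like in code, here is a sketch; the field names person.name and person.age are made up, and the exact invoke arguments depend on the schema of your parquet file.

library(sparklyr)
library(dplyr)

sc     <- spark_connect(master = "local")
nested <- spark_read_parquet(sc, name = "nested", path = "path/to/file.parquet")

# drop down to the underlying Spark DataFrame and call its select method,
# using the select(col, cols*) string variant to pick the nested fields
flat <- nested %>%
  spark_dataframe() %>%
  invoke("select", "person.name", list("person.age")) %>%
  sdf_register("flattened")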

Thanks for reading my two tricks. Cheers, Longhow.

Did you say SQL Server? Yes I did….


Introduction

My last blog post of 2016, on SQL Server 2016….. Some years ago I heard predictions from ‘experts‘ that within a few years Hadoop/Spark systems would take over traditional RDBMSs like SQL Server. I don’t think that has happened (yet). Moreover, what some people don’t realize is that at least half of the world still depends on good old SQL Server. If tomorrow all the Transact-SQL stored procedures would somehow magically fail to run, I think our society as we know it would collapse…..


OK, I might be exaggerating a little bit. The point is, there are still a lot of companies and use cases out there that are running SQL Server without the need for something else. And with the integrated R services in SQL Server 2016, that might not be necessary at all 🙂

Deploying Predictive models created in R

From a business standpoint, creating a good predictive model and spending time on this is only useful if you can deploy such a model in a system where the business can make use of the predictions in their day-to-day operations. Otherwise, creating a predictive model is just an academic exercise/experiment….

Many predictive models are created in R on a ‘stand-alone’ laptop/server. There are different ways to deploy such models, among others:

  • Re-build the scoring logic ‘by hand’ in the operational system. I did this in the past; it can be a little bit cumbersome and it’s not what you really want to do. If you do not have many data prep steps and your model is a logistic regression or a single tree, it is doable 🙂
  • Make use of PMML scoring. The idea is to create a model (in R), transform it to PMML and import the PMML into the operational system where you need the predictions. Unfortunately, not all models are supported and not all systems support importing (the latest version of) PMML.
  • Create APIs (automatically) with technology like Azure ML, DeployR, sense.io or OpenCPU, so that the application that needs the prediction can call the API.

SQL Server 2016 R services

If your company is running SQL Server (2016), there is another nice alternative for deploying R models: the SQL Server R services. At my work at RTL Nederland [oh, by the way, we are looking for data engineers and data scientists :-)] we are using this technology to deploy the predictive churn and response models created in R. The process is not difficult; the few steps that are needed are demonstrated below.

Create any model in R

I am using an extreme gradient boosting algorithm to fit a classification model on the Titanic data set. Instead of calling xgboost directly I am using the mlr package to train the model. mlr provides a unified interface to machine learning in R; it takes care of some of the frequently used steps in creating a predictive model, regardless of the underlying machine learning algorithm. So your code can become very compact and uniform.

[Screenshot: training an xgboost model on the Titanic data with mlr]
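Roughly, the training code looks like the sketch below; it assumes a prepared titanic data frame with a factor column Survived as the target.

library(mlr)

task  <- makeClassifTask(data = titanic, target = "Survived")
lrn   <- makeLearner("classif.xgboost", predict.type = "prob", nrounds = 100)
model <- train(lrn, task)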

Push the (xgboost) predictive model to SQL Server

Once you are satisfied with the predictive model (on your R laptop), you need to bring that model over to SQL Server so that you can use it there. This consists of the following steps.

SQL code in SQL Server: write a stored procedure in SQL Server that accepts a predictive R model and some metadata, and saves them into a table in SQL Server.

[Screenshot: the SQL Server stored procedure that stores a serialized R model]

This stored procedure can then be called from your R session.

Bring the model from R to SQL Server: to make it a little bit easier you can write a small helper function.

[Screenshot: an R helper function that pushes a model to SQL Server]
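As an impression, such a helper could look like the sketch below. The connection string and the dbo.save_r_model stored procedure are assumptions; they should match whatever you created in the previous step.

library(RODBC)

save_model_to_sql <- function(channel, model, model_name) {
  # serialize the R model object to a hexadecimal string
  hex_model <- paste(serialize(model, connection = NULL), collapse = "")
  query <- sprintf(
    "EXEC dbo.save_r_model @model_name = '%s', @model = 0x%s",
    model_name, hex_model
  )
  sqlQuery(channel, query)
}

ch <- odbcDriverConnect("driver={SQL Server};server=MYSERVER;database=MYDB;trusted_connection=true")
save_model_to_sql(ch, model, "xgboost_titanic_mlr")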

So what is the result? In SQL Server I now have a table (dbo.R_Models) with predictive models. My xgboost model to predict survival on the Titanic has been added as an extra row. Such a table becomes a sort of model store in SQL Server.

[Screenshot: the dbo.R_Models table in SQL Server]

Apply the predictive model in SQL Server.

Now that we have a model we can use it to calculate model scores on data in SQL Server. The new R services in SQL Server 2016 provide a function called sp_execute_external_script, in which you can call R to calculate model scores.

[Screenshot: calling sp_execute_external_script to score with the xgboost model]

The scores (and the inputs) are stored in a table.

[Screenshot: the table with the scored records]

The code is very generic; instead of xgboost models it works for any model. The scoring can (and should) be done inside a stored procedure, so that scoring can run at regular intervals or be triggered by certain events.

Conclusion

Deploying predictive models created in R in SQL Server has become easy with the new SQL Server R services. It does not require new technology or specialized data engineers. If your company is already making use of SQL Server, then the integrated R services are definitely something to look at if you want to deploy predictive models!

Some more examples with code can be found on the Microsoft GitHub pages.

Cheers, Longhow