The Bold & Beautiful Character Similarities using Word Embeddings


Introduction

I often see advertisements for The Bold and The Beautiful, but I have never watched a single episode of the series. Still, even as a data scientist you might wonder how the beautiful ladies and gentlemen from the show are related to each other. I do not have the time to watch all those episodes to find out, so I am going to use word embeddings on recaps instead…

Calculating word embeddings

First, we need some data. The first few Google hits led me to the site Soap Central, where recaps of the show dating back to 1997 can be found. I then used a little bit of rvest code to scrape the daily recaps into an R data set.

Word embedding is a technique to transform a word into a vector of numbers; there are several approaches to do this. I have used the so-called Global Vectors (GloVe) word embedding, see here for details. It makes use of word co-occurrences determined from a (large) collection of documents, and there is a fast implementation in the R text2vec package.
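To give an idea of what this looks like in code, here is a minimal sketch with text2vec (argument names differ slightly between text2vec versions, and recaps is assumed to be a character vector holding the scraped recaps):

library(text2vec)

# tokenize the recaps and build a vocabulary
tokens     <- word_tokenizer(tolower(recaps))
it         <- itoken(tokens)
vocab      <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
vectorizer <- vocab_vectorizer(vocab)

# term co-occurrence matrix and GloVe embeddings of length 250
tcm          <- create_tcm(it, vectorizer, skip_grams_window = 5)
glove        <- GlobalVectors$new(rank = 250, x_max = 10)
word_vectors <- glove$fit_transform(tcm, n_iter = 25)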

Once words are transformed into vectors, you can calculate distances (similarities) between them; for a specific word you can, for example, calculate the top 10 closest words. Moreover, linguistic regularities can be determined, for example:

amsterdam - netherlands + germany

would result in a vector that would be close to the vector for berlin.
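As an illustration, a sketch of that query with text2vec's sim2 function, assuming the word_vectors matrix from the snippet above and that these words are present in the vocabulary:

library(text2vec)   # for sim2()

# vector arithmetic on the embeddings, then cosine similarity against all words
query <- word_vectors["amsterdam", , drop = FALSE] -
         word_vectors["netherlands", , drop = FALSE] +
         word_vectors["germany", , drop = FALSE]

sims <- sim2(x = word_vectors, y = query, method = "cosine", norm = "l2")
head(sort(sims[, 1], decreasing = TRUE), 10)   # 'berlin' should rank near the top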

Results for The B&B recaps

It takes about an hour on my laptop to determine the word vectors (length 250) from 3,645 B&B recaps (15 seasons). After removing some common stop words, I have 10,293 unique words; text2vec puts the embeddings in a matrix (10,293 by 250).

Let's take the lovely steffy; the ten closest words are:

    from     to     value
    <chr>  <chr>     <dbl>
 1 steffy steffy 1.0000000
 2 steffy   liam 0.8236346
 3 steffy   hope 0.7904697
 4 steffy   said 0.7846245
 5 steffy  wyatt 0.7665321
 6 steffy   bill 0.6978901
 7 steffy  asked 0.6879022
 8 steffy  quinn 0.6781523
 9 steffy agreed 0.6563833
10 steffy   rick 0.6506576

Let's take the vector steffy – liam; the closest words we get are:

       death     furious      lastly     excused frustration       onset 
   0.2237339   0.2006695   0.1963466   0.1958089   0.1950601   0.1937230 

and for bill – anger we get:

     liam     katie     wyatt    steffy     quinn      said 
0.5550065 0.4845969 0.4829327 0.4645065 0.4491479 0.4201712 

The following figure shows some other B&B characters and their closest matches.

[figure: other B&B characters and their closest matches]

If you want to see the top n closest characters for other B&B characters, use my little Shiny app. The R code for scraping B&B recaps, calculating GloVe word embeddings and the small Shiny app can be found on my GitHub.

Conclusion

This is a Mickey Mouse use case, but it might come in handy if you are on the train and hear people next to you talking about the B&B: now you can join their conversation, especially if you have had a look at my B&B Shiny app……

Cheers, Longhow


The one function call you need to know as a data scientist: h2o.automl


Introduction

Two things recently came to my attention: AutoML (Automatic Machine Learning) by h2o.ai and the Fashion MNIST data set by Zalando Research. So as a test, I ran AutoML on the Fashion MNIST data set.

H2o AutoML

As you all know, a large part of the work in predictive modeling is preparing the data. But once you have done that, ideally you don't want to spend too much effort trying out many different machine learning models. That's where AutoML from h2o.ai comes in. With one function call you automate the process of training a large and diverse selection of candidate models.

AutoML trains and cross-validates a Random Forest, an Extremely Randomized Forest, GLMs, Gradient Boosting Machines (GBMs) and Neural Nets, and then as a "bonus" it trains a Stacked Ensemble using all of the models. The function to use in the h2o R interface is h2o.automl (there is also a Python interface).

library(h2o)
h2o.init()

# the first 784 columns are the pixel inputs, column 785 holds the (factor) labels
FashionMNIST_Benchmark = h2o.automl(
  x = 1:784,
  y = 785,
  training_frame = fashionmnist_train,
  validation_frame = fashionmnist_test
)

So the first 784 columns in the data set are used as inputs and column 785 is the column with the labels. There are more input arguments that you can use, for example the maximum running time, the maximum number of models to train, or a stopping metric.
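For example, a sketch of the same call with some of these optional arguments (argument names as in the h2o R package, the values are just illustrative, and the frames from the call above are reused):

FashionMNIST_Benchmark = h2o.automl(
  x = 1:784,
  y = 785,
  training_frame   = fashionmnist_train,
  validation_frame = fashionmnist_test,
  max_runtime_secs = 3600,                 # stop searching after an hour
  max_models       = 25,                   # or after 25 models
  stopping_metric  = "misclassification",  # early stopping criterion
  seed             = 1234
)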

It can take some time to run all these models, so I spun up a so-called high-CPU droplet on DigitalOcean: 32 dedicated cores ($0.92/h).

[screenshot: h2o utilizing all 32 cores to create models]

The output in R is an object containing the models and a 'leaderboard' ranking the different models (the snippet after the list shows how to inspect it). I got the following accuracies on the Fashion MNIST test set:

  1. Gradient Boosting (0.90)
  2. Deep learning (0.89)
  3. Random forests (0.89)
  4. Extremely randomized forests (0.88)
  5. GLM (0.86)
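A sketch of how to inspect the results (slot names as in the h2o R package, reusing the objects from above):

lb <- FashionMNIST_Benchmark@leaderboard    # ranked overview of all trained models
print(lb, n = nrow(lb))

best <- FashionMNIST_Benchmark@leader       # top model from the leaderboard
h2o.performance(best, newdata = fashionmnist_test)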

There is no ensemble model because stacked ensembles are not yet supported for multiclass classification. The deep learning models in h2o use fully connected hidden layers; for this specific Zalando image data set you are better off pursuing fancier convolutional neural networks. As a comparison, I ran a simple two-layer CNN with keras, resulting in a test accuracy of 0.92. It outperforms all the models here!

Conclusion

Once you have prepared your modeling data set, the first thing you can now always do is run h2o.automl.

Cheers, Longhow.

Oil leakage… those old BMWs are bad :-)


Introduction

My first car was a 13-year-old Mitsubishi Colt; I paid 3,000 Dutch guilders for it. I can still remember a friend who did not want me to park this car in front of his house because of possible oil leakage.


Can you get an idea of which cars are likely to leak oil? Well, with open car data from the Dutch RDW you can. The RDW is the Netherlands Vehicle Authority in the mobility chain.

RDW Data

There are many data sets that you can download. I have used the following:

  • Observed defects. This set contains 22 million records on observed defects at car level (license plate number). Cars in the Netherlands have to be checked yearly, and the findings of each check are submitted to the RDW.
  • Basic car details. This set contains 9 million records: all the cars in the Netherlands, with license plate number, brand, make, weight and type of car.
  • Defect codes. This little table provides a description of all the possible defect codes, so I know that code ‘RA02’ in the observed defects data set represents ‘oil leakage’.

Simple Analysis in R

I imported the data into R and with some simple dplyr statements determined, per car make and age (in years), the number of cars with an observed oil leakage defect. Then I determined how many cars there are per make and age; dividing those two numbers results in a so-called oil leak percentage.
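Roughly, the calculation looks like the sketch below; the column names are made up (the actual RDW sets use Dutch field names).

library(dplyr)

# cars:    one row per car (license_plate, brand, age, ...)
# defects: one row per observed defect (license_plate, defect_code, ...)
leaking <- defects %>%
  filter(defect_code == "RA02") %>%                 # 'RA02' = oil leakage
  inner_join(cars, by = "license_plate") %>%
  distinct(license_plate, brand, age) %>%
  count(brand, age, name = "n_leaking")

oil_leak_pct <- cars %>%
  count(brand, age, name = "n_total") %>%
  left_join(leaking, by = c("brand", "age")) %>%
  mutate(n_leaking = coalesce(n_leaking, 0L),
         leak_pct  = 100 * n_leaking / n_total)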


For example, in the Netherlands there are 2,043 Opel Astras that are four years old; three of them have an observed oil leak, so we get an oil leak percentage of 0.15%.

The graph below shows the oil leak percentages for different car brands and ages. Obviously, the older the car, the higher the leak percentage. But look at BMW: waaauwww, those old BMWs are leaking oil like crazy… 🙂 The few lines of R code can be found here.

[figure: oil leak percentage by car brand and age]

Conclusion

There is a lot in the open car data from the RDW; you can look at many more aspects and defects of cars. Regarding my old car: according to this data, Mitsubishis have a low oil leak percentage, even the older ones.

Cheers, Longhow


Interactive sunbuRst graphs in Power BI in 5 minutes!!


Introduction

If I mention Power BI to fellow data scientists I often get strange looks. However, I quite like the tool: it is an easy and fast way to share results, KPIs and graphs with others. With the latest release, Power BI now supports interactive R graphs, and they are easy to create as well.

Steps to follow

1. Install Node.js from here; then you can install the Power BI tools with:

>npm install -g powerbi-visuals-tools

2. Create a new custom R visual:

>pbiviz new sunburstRHTMLVisual -t rhtml

3. This will create a directory sunburstRHTMLVisual. In that directory, edit the R script file script.r; it only takes a one-liner to create a sunburst graph with the sunburstR package (see the sketch below).


Values is the name of the input data frame; it holds the data that is received from the Power BI desktop application.
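The edit boils down to something like this sketch; it assumes the helper functions that the rhtml template generates for script.r (sourced from flatten_HTML.r):

source('./r_files/flatten_HTML.r')        # helpers generated by the rhtml template
libraryRequireInstall("sunburstR")

# 'Values' is the data frame Power BI passes in:
# first column the sequences, second column the counts
p <- sunburst(Values)

internalSaveWidget(p, 'out.html')         # hand the htmlwidget back to Power BI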

4. Now package the custom R visual with the following command (issue this command inside the directory sunburstRHTMLVisual):

>pbiviz package

5. Inside the subfolder dist you will now find the file sunburstRHTMLVisual.pbiviz, which can be used in Power BI. Open the Power BI desktop application and import a custom visual from file, selecting the sunburstRHTMLVisual.pbiviz file.


That’s it, you’re done!

The resulting graph in a dashboard

To use the visual you will need a data set in Power BI with two columns: one with the sequences and one with the number of occurrences of the corresponding sequence.
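To give an idea of the expected shape, here is a small made-up example of such a data set (sequences are dash-delimited, as sunburstR expects):

Values <- data.frame(
  sequence = c("home-search-product", "home-product", "search-product-checkout"),
  count    = c(120, 80, 45)
)
sunburstR::sunburst(Values)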


Click on the icon of the custom R visual you’ve just imported and select the two columns to get the interactive sunburst graph. Once the graph is created, you can hover over the rings to get more info, and you can turn on/off the legend.

[screenshot: the interactive sunburst graph in a Power BI dashboard]

Cheers, Longhow.

A “poor man’s video analyzer”…


Introduction

Not so long ago there was a nice Dataiku meetup with Pierre Gutierrez talking about transfer learning. RStudio recently released the keras package, an R interface to Keras for deep learning and transfer learning. Both events inspired me to do some experiments at my work here at RTL and explore its usability for us. I would like to share the slides of the presentation that I gave internally at RTL; you can find them on SlideShare.

As a side effect, another experiment that I would like to share is the "poor man's video analyzer". Several vendors now offer APIs to analyze videos, see for example the one that Microsoft offers. With just a few lines of R code I came up with a Shiny app that is a very cheap imitation 🙂

Set up of the R Shiny app

To run the Shiny app a few things are needed. Make sure that ffmpeg is installed; it is used to extract images from a video. TensorFlow and keras need to be installed as well. The images extracted from the video are passed through a pre-trained VGG16 network so that each image is tagged.
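The core of the app is roughly the sketch below; this is not the exact code of the app, and the file paths and frame rate are made up.

library(keras)

# extract one frame per second from the uploaded video with ffmpeg
system("ffmpeg -i video.mp4 -vf fps=1 frames/img_%04d.jpg")

# pre-trained VGG16 with ImageNet weights
model <- application_vgg16(weights = "imagenet")

tag_image <- function(path) {
  img   <- image_load(path, target_size = c(224, 224))
  x     <- image_to_array(img)
  x     <- array_reshape(x, c(1, dim(x)))
  x     <- imagenet_preprocess_input(x)
  preds <- predict(model, x)
  imagenet_decode_predictions(preds, top = 3)[[1]]   # top 3 tags per frame
}

tags <- lapply(list.files("frames", full.names = TRUE), tag_image)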

After this tagging a data table will appear with the images and their tags. That’s it! I am sure there are better visualizations than a data table to show a lot of images. If you have a better idea just adjust my shiny app on GitHub…. 🙂

Using the app, some screenshots

There is a simple interface: specify the number of frames per second that you want to analyse, and then upload a video. Many formats are supported (by ffmpeg), like *.mp4, *.mpeg and *.mov.

Click on video images to start the analysis process. This can take a few minutes; when it is finished you will see a data table with the extracted images and their tags from VGG16.

Click on ‘info on extracted classes’ to see an overview of the classes. You will see a bar chart of the tags that were found, and the output of ffmpeg, which shows some info on the video.

If you have code to improve the data table output with a fancier visualization, just go to my GitHub. For those who want to play around, have a look at a live video analyzer Shiny app here.

A Shiny app version using miniUI will be a better fit for small mobile screens.

Cheers, Longhow

Test driving Python integration in R, using the ‘reticulate’ package


Introduction

Not so long ago RStudio released the R package ‘reticulate‘, an R interface to Python. Of course, it was already possible to execute Python scripts from within R, but this integration takes it one step further: imported Python modules, classes and functions can be called inside an R session as if they were native R functions.
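A minimal sketch of what that looks like, using numpy purely as an example:

library(reticulate)

np <- import("numpy")         # import a Python module into R
m  <- np$arange(12L)          # call its functions with $ instead of .
np$mean(m)
np$reshape(m, c(3L, 4L))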

Below you’ll find some screenshots of code snippets using certain Python modules within R with the reticulate package. On my GitHub page you’ll find the R files from which these snippets were taken.

Using Python packages

A nice thing about reticulate in RStudio is the support for code completion. When you have imported a Python module, RStudio will recognize the methods that are available in it:

[screenshot: RStudio code completion for an imported Python module]

The clarifai module

Clarifai provides a set of computer vision APIs for image recognition, face detection, extracting tags, etc. There is an official Python module and there is also an R package by Gaurav Sood, but it exposes less functionality. So I am going to use the Python module in R. The following code snippet shows how easy it is to call Python functions.

[screenshot: calling the clarifai Python module from R]

The output returned from the clarifai call is a nested list and can be quite intimidating at first sight. To browse through these nested lists and get a better idea of what is in them, you can use the package listviewer:

[screenshot: browsing the nested clarifai result with listviewer]
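That call is essentially a one-liner (the object name below is made up):

library(listviewer)

# interactively browse the nested list returned by the clarifai call
jsonedit(clarifai_result)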

The pattern.nl module

The pattern.nl module contains a fast part-of-speech tagger for Dutch, sentiment analysis, and tools for Dutch verb conjugation and noun singularization & pluralization. At the moment it does not support Python 3. That is not a big deal: I am using Anaconda and created a Python 2.7 environment to install pattern.nl in.

A nice feature of the reticulate package is that it allows you to choose a specific Python environment to use.

[screenshot: using the pattern.nl module from R via reticulate]
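A sketch of how such an environment can be selected before importing the module; the environment name 'py27' is made up:

library(reticulate)

use_condaenv("py27", required = TRUE)     # point reticulate at the Python 2.7 environment
pattern_nl <- import("pattern.nl")

pattern_nl$sentiment("Wat een prachtige dag!")   # returns (polarity, subjectivity)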

The pytorch module

pytorch is a Python package that provides tensor computations and deep neural networks. There is no ‘R torch’ equivalent, but we can use reticulate in R. There is an example of training a logistic regression in pytorch, see the code here. It takes just a little rewriting to make this code work in R; see the first few lines in the figure below.

[screenshot: the pytorch logistic regression example rewritten in R]
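To give a flavor of that rewrite, a tiny sketch: Python attribute access becomes $, and integer arguments need the L suffix.

library(reticulate)

torch <- import("torch")
x <- torch$randn(100L, 2L)            # a 100 x 2 tensor of random normals
s <- torch$sigmoid(x$sum(dim = 1L))   # tensor methods work the same way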

Conclusion

As a data scientist you should know both R and Python; the reticulate package is no excuse for not learning Python! However, it can be very useful if you want to do all your analysis in the RStudio environment. It works very well.

For example, I used rvest to scrape some Dutch news texts, then used the Python module pattern.nl for Dutch sentiment, and wrote an R Markdown document to present the results. The reticulate package is a nice way to keep everything in one environment.

Cheers, Longhow

Because it's Friday… The IKEA Billy index


Introduction

Because it is Friday, another ‘playful and frivolous’ data exercise 🙂

IKEA is more than a store; it is a very nice experience to go through. I can drop off my two kids at Småland, have some ‘quality time’ walking around the store with my wife, and eat some delicious Swedish meatballs. Back at home, the IKEA furniture is a good ‘relationship tester’: try building a big wardrobe together with your wife…..

The nice thing about IKEA is that you never have to come to the store for nothing: you can check the availability of an item on the IKEA website.

According to the website, this availability gets refreshed every 1.5 hours. This gave me an idea: if I check the availability every 1.5 hours, I can get an idea of the number of units sold of a particular item.

The IKEA Billy index

Probably the most iconic IKEA item is the Billy bookcase. Just in case you don't know what this bookcase looks like, below is a picture: simplicity in its most elegant form….

Every 1.5 hours over the last few months I checked the Dutch IKEA website for the availability of this famous item in the 13 stores in the Netherlands, and calculated the negative differences between consecutive values.
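The core of that calculation is a simple sketch like the one below; availability is a made-up data frame with one row per store per check.

library(dplyr)

# availability: columns store, timestamp, stock (number of Billys available)
billy_drops <- availability %>%
  arrange(store, timestamp) %>%
  group_by(store) %>%
  mutate(drop = pmax(lag(stock, default = first(stock)) - stock, 0)) %>%  # keep only decreases
  ungroup()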

The data you get from this playful little exercise does not necessarily represent the number of Billy bookcases actually sold. Maybe the stock got replenished in between, maybe items were moved internally to other stores. For example, if there are 50 Billys available in Amsterdam and 1.5 hours later there are 45, maybe 5 were sold, or 6 were sold and 1 got returned? Replenished? I just don't know!

All I see are movements in availability that might have been caused by products being sold. But anyway, let's call these movements in the availability of Billys the IKEA Billy index.

Some graphs of the Billy Index

Trends and forecasts

Facebook released a nice R package called prophet. It can be used to forecast time series, and it is used internally by Facebook across many applications. I ran the prophet forecasting algorithm on the IKEA Billy index; the graph below shows the result.

There are some high peaks at the end of October and the end of December. We can also clearly see the Saturday peaks that the algorithm has picked up from the historic data and projected into its future forecasts.
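A sketch of that forecast, assuming a data frame billy_daily with the columns ds (date) and y (daily index value) that prophet expects:

library(prophet)

m        <- prophet(billy_daily)                   # fits trend plus weekly seasonality
future   <- make_future_dataframe(m, periods = 60)
forecast <- predict(m, future)

plot(m, forecast)
prophet_plot_components(m, forecast)               # shows the Saturday effect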

Weekday and color

The graph above already showed that the Billy index is high on Saturdays, but what about the other days? The graph below depicts the sum of the IKEA Billy index per weekday since I started collecting this data (end of September). Wednesdays and Thursdays are less active days.

[figures: Billy index per weekday and per colour]

Clearly, most of the Billys are white.

Correlations

Does the daily Billy index correlate with other data? I used some Dutch weather data that can be downloaded from the Royal Netherlands Meteorological Institute (KNMI). The data consists of many daily weather variables. The graph below shows a correlation matrix of the IKEA Billy index and a selection of these weather variables.

[figure: correlation matrix of the Billy index and KNMI weather variables]

The only somewhat meaningful correlation between the IKEA Billy index and a weather variable is with wind speed (-0.19): increasing wind speeds mean decreasing Billys.
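For reference, such a correlation matrix takes only a few lines; the sketch below uses the corrplot package and made-up column names for the joined daily data.

library(corrplot)

# daily: one row per day with the Billy index and selected KNMI variables
M <- cor(daily[, c("billy_index", "wind_speed", "temperature", "sunshine")],
         use = "pairwise.complete.obs")
corrplot(M, method = "color", addCoef.col = "black")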


It’s an explainable correlation of course…. 🙂 You wouldn't want to go to IKEA on (very) windy days; it is not easy to drive through strong winds with a Billy on top of your car.


Cheers, Longhow.