Bon Appetit: A restaurant recommender for tourists visiting the Netherlands

Introduction

More and more tourists are visiting the Netherlands; that becomes very clear when you walk through the center of Amsterdam on a sunny day. All those tourists need to eat somewhere, in some restaurant, and you can see their sad faces as they have no clue where to go. Well, with the aid of a little data science I have made it easy for them :-). I built a small R Shiny app that tells tourists which restaurant in the Netherlands they should go to. In this blog post I will describe the different steps that I have taken.

[Figure: tourists in Amsterdam wondering where to eat……]

Iens reviews

In an earlier blog post I wrote about scraping restaurant review data from www.iens.nl and how to use that to generate restaurant recommendations. The technique was based on the restaurant ratings given by the reviewers. To generate personal recommendations you need to rate some restaurants first. But as a tourist visiting The Netherlands for the first time this might be difficult.

So I have made it a little bit easier: enter your idea of food in my Bon Appetit Shiny app, and it will translate the text to Dutch if needed, calculate the similarity between your translated text and all reviews on Iens, and give you the top ten restaurants whose reviews match best.

The Microsoft translator API

Almost all of the reviews on the Iens restaurant website are in Dutch, and I assume that most tourists from outside the Netherlands do not speak Dutch. That is not a big problem: I can translate non-Dutch text to Dutch by using a translator. Google and Microsoft offer translation APIs. I chose the Microsoft API because it offers a free tier (the first 2 million characters per month are free; sign up and get started here), and because the API supports the Klingon language….. 🙂

The R franc package can recognize the language of the input text:


library(franc)

# detect the language of the input text and look up its two-letter ISO code
lang = franc(InputText)
ISO2 = speakers$iso6391[speakers$language == lang]
from = ISO2

The two-letter ISO language code is needed in the call to the Microsoft translator API. I use the httr package to set up the calls: with your client ID and client secret a token must be retrieved, and with that token the actual translation is done.


library(httr)

# Set up the call to retrieve a token
clientIDEncoded = URLencode("your microsoft client ID")
client_SecretEncoded = URLencode("your client secret")
Uri = "https://datamarket.accesscontrol.windows.net/v2/OAuth2-13"

MyBody = paste(
   "grant_type=client_credentials&client_id=",
   clientIDEncoded,
   "&client_secret=",
   client_SecretEncoded,
   "&scope=http://api.microsofttranslator.com",
   sep=""
)

r = POST(url = Uri, body = MyBody, content_type("application/x-www-form-urlencoded"))
response = content(r)

Now that you have the token, make a call to translate the text


HeaderValue = paste("Bearer ", response$access_token, sep="")

TextEncoded = URLencode(InputText)

to = "nl"

uri2 = paste(
   "http://api.microsofttranslator.com/v2/Http.svc/Translate?text=",
   TextEncoded,
   "&from=",
   from,
   "&to=",
   to,
   sep=""
)

resp2 = GET(url = uri2, add_headers(Authorization = HeaderValue))
Translated = content(resp2)

# dig out the translated text from the XML response (requires the rvest package)
TranslatedText = as(Translated, "character") %>% read_html() %>% html_text()

Some example translations. Louis van Gaal is notorious for his Dutch-to-English (or any other language, for that matter) translations. Let’s see how the Microsoft API performs on some of his sentences.

  • Dutch: “Dat is hele andere koek”, van Gaal: “That is different cook”, Microsoft: “That is a whole different kettle of fish”.
  • Dutch: “de dood of de gladiolen”, van Gaal: “the dead or the gladiolus”, Microsoft: “the dead or the gladiolus”.
  • Dutch: “Het is een kwestie van tijd”, van Gaal: “It’s a question of time”, Microsoft: “It’s a matter of time”.

The Cosine similarity

The distance or similarity between two documents (texts) can be measured by means of the cosine similarity. When you have a collection of reviews (texts), this collection can be represented by a term-document matrix. A row of this matrix is one review; it is a vector of word counts. Another review or text is also a vector of word counts. Given two vectors A and B, the cosine similarity is given by:

\cos(A,B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}

Now the input text that is translated to Dutch is also a vector of word counts, so we can calculate the cosine similarity between each restaurant review and the input text. The restaurants corresponding to the most similar reviews are returned as recommended restaurants, bon appetit 🙂
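
As a minimal sketch (not the exact app code), the cosine similarity between a query and a set of review texts can be computed in R with the tm package; the toy reviews and object names below are made up.

library(tm)

# toy review corpus and a (translated) query
reviews = c("heerlijke pizza met kip en kaas",
            "lange wachttijd maar vriendelijke bediening",
            "verse vis en een mooi terras")
query   = "grote pizza met kip en kaas die lekker is"

# term-document matrix over the reviews plus the query (last column)
corpus = VCorpus(VectorSource(c(reviews, query)))
tdm    = as.matrix(TermDocumentMatrix(corpus))

q = tdm[, ncol(tdm)]                  # word counts of the query
R = tdm[, -ncol(tdm), drop = FALSE]   # word counts of the reviews

# cosine similarity between the query and every review
cossim = as.vector(crossprod(R, q)) / (sqrt(colSums(R^2)) * sqrt(sum(q^2)))
order(cossim, decreasing = TRUE)      # most similar reviews first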

Putting it all together in a Shiny app

The above steps are implemented in my Bon Appetit Shiny app. Try out your thoughts and ideas about food and get restaurant recommendations! Here is an example:

Input text: Large pizza with chicken and cheese that is tasty.

[Figure: the input text translated to Dutch]

[Figure: the top ten restaurants corresponding to the translated input text]

 

And for the German tourist: “Ich suche eines schnelles leckeres Hahnchen” (“I am looking for a quick tasty chicken”). This gets translated to Dutch as “ik ben op zoek naar een snelle heerlijke kip”, and the ten restaurant recommendations you get are given in the following figure.

[Figure: the ten restaurant recommendations for the German input text]

cheers,

— Longhow —

A little H2O deeplearning experiment on the MNIST data set

Introduction

H2O is a fast and scalable open-source machine learning platform. Several algorithms are available, for example neural networks, random forests, linear models and gradient boosting; see the complete list here. Recently the H2O World conference was held; unfortunately I was not there. Luckily there is a lot of material available, videos and slides, and it triggered me to try the software.

The software is easy to set up on my laptop. Download it from the H2O download site; it is a zip file that needs to be unzipped. It contains (among other files) a jar file that needs to be run from the command line:

java -jar h2o.jar

After H2O has started, you can browse to localhost:54321 (the default port number can be changed, specify: -port 65432) and within the browser you can use H2O via the flow interface. In this blog post I will not use the flow interface but I will use the R interface.

[Figure: the H2O Flow web interface]

The H2O R interface

To install the H2O R interface you can follow the instructions provided here. It’s a script that checks whether an H2O R package is already installed, installs the packages that the H2O package depends on if needed, and then installs the H2O R package. Start the interface to H2O from R. If H2O was already started from the command line, you can connect to the same H2O instance by specifying the same port and using startH2O = FALSE.


library(h2o)

localH2O =  h2o.init(nthreads = -1, port = 54321, startH2O = FALSE)

MNIST handwritten digits

The data I have used for my little experiment is the famous handwritten digits data from MNIST. The data in CSV format can be downloaded from Kaggle. The train data set has 42,000 rows and 785 columns. Each row represents a digit; a digit is made up of 28 by 28 pixels, in total 784 columns, plus one additional label column. The first column in the CSV file is called ‘label’, the rest of the columns are called pixel0, pixel1, …, pixel783. The following code imports the data and plots the first 100 digits, together with the label.


# import the train data and plot the first 100 digits in a 10 x 10 grid
MNIST_DIGITStrain = read.csv('D:/R_Projects/MNIST/MNIST_DIGITStrain.csv')
par(mfrow = c(10, 10), mai = c(0, 0, 0, 0))
for (i in 1:100) {
  y = as.matrix(MNIST_DIGITStrain[i, 2:785])
  dim(y) = c(28, 28)
  image(y[, nrow(y):1], axes = FALSE, col = gray(255:0 / 255))
  text(0.2, 0, MNIST_DIGITStrain[i, 1], cex = 3, col = 2, pos = c(3, 4))
}

[Figure: the first 100 MNIST handwritten digits and the corresponding labels]

The data is imported into R as a local R data frame. To apply machine learning techniques on the MNIST digits, the data needs to be available on the H2O platform. From R you can either import a CSV file directly into the H2O platform or push an existing R object to the H2O platform (a sketch of the latter follows the code below).


mfile = 'D:\\R_Projects\\MNIST\\MNIST_DIGITStrain.csv'
MDIG = h2o.importFile(path = mfile,sep=',')

# Show the data objects on the H2O platform
h2o.ls()

key
1 MNIST_DIGITStrain.hex_3
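
The alternative route, pushing the local R data frame to H2O, is roughly the following (a minimal sketch; the destination frame name is arbitrary and argument names may differ slightly between h2o versions):

# push the local R data frame that was read with read.csv to the H2O platform
MDIG2 = as.h2o(MNIST_DIGITStrain, destination_frame = "MNIST_DIGITStrain_from_R")
h2o.ls()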

Deep learning autoencoder

Now that the data is in H2O we can apply machine learning techniques on the data. One type of analysis that interested me the most is the ability to train autoencoders. The idea is to use the input data to predict the input data by means of a ‘bottle-neck’ network.

[Figure: schematic of an autoencoder network]

The middle layer can be regarded as a compressed representation of the input. In H2O R, a deep learning autoencoder can be trained as follows.

NN_model = h2o.deeplearning(
  x = 2:785,
  training_frame = MDIG,
  hidden = c(400, 200, 2, 200, 400 ),
  epochs = 600,
  activation = 'Tanh',
  autoencoder = TRUE
)

So there is an input layer with 784 neurons, a second layer with 400 neurons, a third layer with 200, the middle layer with 2 neurons, etc. The middle layer is a 2-dimensional representation of a 784-dimensional digit. The 42,000 2-dimensional representations of the digits are just points that we can plot. To extract the data from the middle layer we need to use the function h2o.deepfeatures.


library(ggplot2)

# extract the 2-dimensional middle layer (hidden layer 3) for all digits
train_supervised_features2 = h2o.deepfeatures(NN_model, MDIG, layer = 3)

plotdata2 = as.data.frame(train_supervised_features2)
plotdata2$label = as.character(as.vector(MDIG[, 1]))

qplot(DF.L3.C1, DF.L3.C2, data = plotdata2, color = label,
      main = 'Neural network: 400 - 200 - 2 - 200 - 400')

[Figure: the MNIST digits projected onto 2 dimensions by the autoencoder]

In training the autoencoder network I have not used the label; this is not a supervised training exercise. However, I have used the label in the plot above. We can see the ‘1’ digits clearly on the left-hand side, while the ‘7’ digits are more on the right-hand side and the pink ‘8’ digits are more in the center. It’s far from perfect; I need to explore more options in the deep learning functionality to achieve a better separation in 2 dimensions.

Comparison with a 2 dimensional SVD data reduction

Autoencoders use nonlinear transformations to compress high-dimensional data to a lower-dimensional space. Singular value decomposition, on the other hand, can be used to compress data to a lower-dimensional space by using only linear transformations; see my earlier blog post on SVD. The following picture shows the MNIST digits projected to 2 dimensions using SVD.
[Figure: the MNIST digits projected onto 2 dimensions using SVD]

There is a good separation between the 1’s and the 0’s, but the rest of the digits are much less separated than with the autoencoder. There is of course a time benefit for the SVD: it takes around 6.5 seconds to calculate an SVD on the MNIST data, while the autoencoder took around 350 seconds.
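
The SVD projection itself is a one-liner in base R; a minimal sketch (not necessarily the exact call behind the figure) looks like this:

# project the 784 pixel columns onto the first two singular vectors
X    = as.matrix(MNIST_DIGITStrain[, 2:785])
sv   = svd(scale(X, center = TRUE, scale = FALSE), nu = 2, nv = 2)
proj = sv$u %*% diag(sv$d[1:2])   # 42,000 x 2 matrix of coordinates

plot(proj, col = MNIST_DIGITStrain$label + 1, pch = ".",
     xlab = "SVD 1", ylab = "SVD 2")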

Conclusion

With this little autoencoder example, I have just scratched the surface of what is possible in H2O. There is much more to discover, many supervised learning algorithms, and also within the deep learning functionality of H2O there are a lot of settings which I have not explored further.

Restaurants as atomic nuclei with the R visNetwork package

[Figure: atomic nuclei]

Recently the R package visNetwork was brought to my attention. It is an R interface to the ‘vis.js’ JavaScript charting library with which you can create interactive network graphs. To try out the package I need some data; luckily I still have some restaurant review data from an earlier exercise, see my previous post. You need two data frames to plot a network in visNetwork. One is a data set with nodes, in my case a bunch of restaurants that have been reviewed. The second is a data set with edges; it has two columns (from and to), and each row in this set is a link between two restaurants in the network graph. How are two restaurants linked in my case? When a reviewer visited restaurant A and also visited restaurant B, then the two restaurants are linked. In the edges data set I would then have a row:  A   B.

Both the nodes and edges data sets can contain some extra columns which can be used in the network graph. For example, in the nodes data set I have used the restaurant type (French, Chinese, Indian, …) to color my nodes, and in the edges data set I have specified the strength of the edge/link. In my case I have defined the strength as the number of reviewers: if reviewers x, y and z have reviewed restaurants A and B, then the strength of edge A-B is 3. A toy version of these two data sets is sketched below.
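
As a minimal sketch (with made-up restaurants, not my scraped data), the two data frames could look like this:

library(visNetwork)

# nodes: one row per restaurant; 'group' (the restaurant type) is used for the colors
RestaurantNodes = data.frame(
  id    = 1:4,
  label = c("Golden Chopsticks", "La Trattoria", "Chez Pierre", "De Eethoek"),
  group = c("Chinese", "Italian", "French", "Hollands")
)

# edges: one row per link; 'value' is the strength (number of shared reviewers)
RestaurantEdges = data.frame(
  from  = c(1, 1, 2),
  to    = c(2, 3, 4),
  value = c(3, 1, 2)
)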

Before I went to university I lived in the lovely city of Hoorn in the Netherlands. Let’s look at the restaurants in Hoorn and how they are connected. The restaurant nodes data set looks like:

[Figure: list of restaurants in Hoorn (the nodes data set)]

The restaurant edges data set looks like:

[Figure: links between restaurants (the edges data set)]

Given the two data sets, the R code to create the network is given by


visNetwork( RestaurantNodes, RestaurantEdges, legend = TRUE) %>%
  visOptions( highlightNearest = TRUE, nodesIdSelection = TRUE) %>%
  visInteraction( navigationButtons = TRUE) %>%
  visPhysics( maxVelocity = 35)

There are many options you can set; see the help for more information. I have just used a couple. A handy option is highlightNearest: when there are many nodes you can select a node, and only the nodes nearest to the selected one are highlighted while the rest is grayed out. Here are some screenshots of my network graph.

[Figure: network graph of restaurants]

[Figure: zooming in on the core]

Don’t they look lovely? In the interactive graph the nodes can ‘vibrate’, just like atomic nuclei. I have published the network graph on RPubs so that you can interact with the graph yourself. Apart from the nostalgic reason to choose restaurants in Hoorn, there is another reason: when there are too many nodes and edges the graph does not perform well. Trying to create this graph for all restaurants in the Netherlands is not feasible.

Deploying a car price model using R and AzureML

Recently Microsoft released the AzureML R package, which allows R users to publish their R models (or any R function) as a web service on the Microsoft Azure Machine Learning platform. Of course I wanted to test the new package, so I performed the following steps.

Sign up for Azure Machine Learning

To make use of this service you will need to sign up for an account. There is a free tier account; go to the Azure Machine Learning website.

Get the AzureML package

The AzureML package can be installed from GitHub; follow the instructions here.

Create a model in R to deploy

In a previous blog post I described how to estimate the value of a car. I scraped the Dutch site www.autotrader.nl where used cars are sold. For more than 100,000 used cars I have their make, model, sales price, mileage (kilometers driven) and age. Let’s focus again on Renault; instead of only using mileage I am also going to use the age and the car model in my predictive splines model. For those who are interested, the Renault data (4MB, Rda format, with many more variables) is available here.


library(splines)
RenaultPriceModel = lm(Price ~ ns(Kilometers, df=5)*model + ns(Age,df=3) , data = RenaultCars)

Adding an interaction between Kilometers and car model increases the predictive power. Suppose I am satisfied with the model; it has a nice R-squared (0.92) on a hold-out set. Now I am going to create a scoring function from the model object.



RenaultScorer = function(Kilometers, Age, model)
{
  require(splines)
  return(
    predict.lm(
      RenaultPriceModel, 
      data.frame("Kilometers" = Kilometers, "Age" = Age, "model" = model)
   )
  )
}

# predict the price for a 1 year old Renault Espace that has driven 50000 KM
RenaultScorer(50000,365,"ESPACE")
33616.91 

To see how estimated car prices vary with age and kilometers, I created some contour plots (a sketch of the kind of code that can generate such a plot follows the figure below). It seems that Clios lose value over the years, while Espaces lose value over kilometers. Sell your five-year-old Espace with 100K kilometers for 20,000 Euro, so that you can buy a new Clio (0 years and 0 kilometers)..

[Figure: contour plots of the estimated prices over age and kilometers]
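
A minimal sketch of how such a contour plot can be produced from the model above (the grid ranges are arbitrary):

library(lattice)

# score a grid of kilometers and age values for one car model
grid = expand.grid(
  Kilometers = seq(0, 200000, length.out = 50),
  Age        = seq(0, 10 * 365, length.out = 50),
  model      = "ESPACE"
)
grid$Price = predict(RenaultPriceModel, newdata = grid)

# contour plot of the predicted price
contourplot(Price ~ Kilometers * Age, data = grid, region = TRUE)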

Before you can publish R functions on Azure you need to obtain the security credentials of your Azure Machine Learning workspace; see the instructions here. The function RenaultScorer can now be published as a web service on the AzureML platform using the publishWebService function.


library(AzureML)

myWsID = "123456789101112131415"
myAuth = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# publish the R function to AzureML
RenaultWebService = publishWebService(
  "RenaultScorer",
  "RenaultPricerOnline",
  list("Kilometers" = "float", "Age" = "float", "model" = "string"),
  list("prijs" = "float"),
  myWsID,
  myAuth
)

Test the web service

In R, the AzureML package contains the consumeLists function to call the web service we have just created. The key and the location of the web service that we want to call are contained in the return object of publishWebService.


### predict the price of a one year old Renault Espace using the web service
RenaultPrice = consumeLists(
  RenaultWebService[[2]][[1]]["PrimaryKey"],
  RenaultWebService[[2]][[1]]$ApiLocation,
  list("Kilometers" = 50000, "Age" = 365, "model" = "ESPACE")
)
RenaultPrice

 prijs
1 33616.90625

In Azure Machine Learning Studio you can also test the web service.

[Figure: Azure ML Studio]

My web service (called RenaultPricerOnline) is now visible in Azure Machine Learning Studio; click on it and it will launch a dialog box to enter kilometers, age and model.

[Figure: testing the web service in Azure ML Studio]

The call results again in a price of 33616.91. The call takes some time, but that’s because the performance is throttled for free tier accounts.

Conclusion

In R I have created a car price prediction model, but more likely than not the predictive model will be consumed by an application other than R. It is not difficult to publish an R model as a web service on the Azure Machine Learning platform. Once you have done that, AzureML will generate the API key and the link to the web service, so that you can provide them to anyone who wants to consume the predictive model in their applications.

Soap analytics: Text mining “Goede tijden slechte tijden” plot summaries….

Sorry for the local nature of this blog post. I was watching Dutch television and zapping between channels the other day when I stumbled upon “Goede Tijden Slechte Tijden” (GTST). This is a Dutch soap series broadcast by RTL Nederland. I must confess, I watched (had to watch) this years ago because my wife was watching it…… My gut feeling with these daily soap series is that missing a few months or even years does not matter: once you’ve seen some GTST episodes you’ve seen them all, because the story line is always very similar. Can we use some data science to test whether this gut feeling makes sense? I am using R and SAS to investigate this and to see if more interesting soap analytics can be derived.

Scraping episode plot summaries

First, data is needed. All GTST episode plot summaries are available, from the very first episode in October 1990.

[Figure: the GTST episode plot summaries website]

A great R package to scrape data from websites is rvest by Hadley Wickham; I can advise anyone to learn it. Luckily, the website structure of the GTST plot summaries is not that difficult. With the following R code I was able to extract all episodes. First I wrote a function that extracts the plot summaries of one specific month.

library(rvest)
library(stringr)

getGTSTdata = function(httploc){
  tryCatch({
    gtst = html(httploc) %>%
     html_nodes(".mainarticle_body") %>%
     html_children()

    # The dates are in bold, the plot summaries are normal text
    texts = names(gtst[[1]]) == "text"
    datumsel = names(gtst[[1]]) == "b"

    episodeplot = gtst[[1]][texts] %>%
     sapply(html_text) %>%
     str_replace_all(pattern='\n'," ")

    episodedate = gtst[[1]][datumsel] %>%
     sapply(html_text)

    # put data in a data frame and return as results
    return(data.frame(episodeplot, episodedate))
  },
    error = function(cond) {
      message(paste("URL does not seem to exist:", httploc))
      message("Here's the original error message:")
      message(cond)
      # Choose a return value in case of error
      return(NULL)
    }
  )
}

This function is then used inside a loop over all months to get all the episode summaries. Some months do not have episodes, and the corresponding link does not exist (actors are on summer holiday!). So I used the tryCatch function inside my function which will continue the loop if a link does not exist.

months = c("januari", "februari","maart","april","mei","juni","juli","augustus","september","oktober","november","december")
years = 1990:2012

GTST_Allplots = data.frame()

for (j in years){
  for(m in months){
    httploc = paste("http://www.rtl.nl/soaps/gtst/meerdijk/tvgids/", m, j, ".xml", sep="")
    out = getGTSTdata(httploc)

    if (!is.null(out)){
      out$episodedate = paste(out$episodedate, j)
      GTST_Allplots = rbind(GTST_Allplots,out)
    }
  }
}

As a result of the scraping I got 4090 episode plot summaries. The real GTST fans know that there are more episodes; the more recent episode summaries (from 2013 onwards) are on a different site. For this analysis I did not bother to use them.

Text mining the plot summaries

The episode summaries are now imported in SAS Enterprise Guide, the following figure shows the first few rows of the data.

[Figure: the first few rows of the episode data in SAS Enterprise Guide]

With SAS Text Miner it is very easy and fast to analyze unstructured data; see my earlier blog post. What are the most important topics that can be found in all the episodes? Let’s generate them. I can use either the text topic node or the text cluster node in SAS. The difference is that with the topic node an episode can belong to more than one topic, while with the cluster node an episode belongs to only one cluster.

I have selected 30 topics to be generated; you can experiment with different numbers of topics. The resulting topics can be found in the following figure.

[Figure: GTST topics]

Looking at the different clusters found, it turns out that many topics are described by the names of the characters in the soap series. For example, topic number 2 is described by terms like “Anton”, “Bianca”, “Lucas”, “Maxime” and “Sjoerd”; they often occur together, in 418 episodes. And topic number 25 involves the terms “Ludo”, “Isabelle”, “Martine” and “Janine”. Here is a picture collage to show this in a more visually appealing way. I have only attached faces for six clusters; the other clusters are still the colored squares. The clusters that you see in the graph are positioned based on the underlying distances between the clusters.

[Figure: collage of the character clusters]

Zoom in on a topic

Now let’s say I am interested in the characters of topic 25 (the Ludo-Isabelle story line). Can I discover sub-topics in this story line? I apply a filter on topic 25, so only the episodes that belong to the Ludo-Isabelle story line are selected, and I generate a new set of topics (call them sub-topics to distinguish them from the original topics).

[Figure: Text Miner flow to investigate a specific topic]

What are some of the subtopics of the Ludo Isabelle story line?

  • Getting money back
  • George, Annie Big panic, very worried
  • Writing farewell letter
  • Plan of Jack brings people in danger

As a data scientist I should now reach out or go back to the business and ask for validation. So I am reaching out now: are there any real hardcore GTST fans out there who can explain:

  • why certain characters are grouped together?
  • why certain groups of characters are closer to each other than other groups?
  • which sub-topics they recognize in the Ludo-Isabelle story lines?

Text profiling

To come back to my gut feeling: can I see whether the different story lines remain similar over time? I can use the Text Profile node in SAS Text Miner to investigate this. The Text Profile node enables you to profile a target variable using terms found in the documents. For each level of a target variable, the node outputs a list of terms from the collection that characterize or describe that level. The approach uses a hierarchical Bayesian belief model to predict which terms are the most likely to describe the level. In order to avoid merely selecting the most common terms, prior probabilities are used to down-weight terms that are common in more than one level of the target variable.

The target variable can also be a date variable. In my scraped episode data I have the date of each episode, and there is a nice feature that allows the user to set the aggregation level. Let’s look at years as the aggregation level.

[Figure: Text Profile node in SAS]

The output consists of different tables and plots; two interesting plots are the Term Time Series plot and the Target Similarity plot. The first one shows, for a selected year, the most important terms and how these terms evolve over the other years. Suppose we select 1991; then we get the following graph.

[Figure: terms over the years]

Myriam was an important term (character) in 1991, but we see that her role stopped in 1995. Jef was a slightly less important term, but his character continued for a very long time in the series. The similarity plot shows the similarities between the sets of episodes of the different years. The distances are calculated on the term beliefs of the chosen terms.

[Figure: GTST similarities by year]

We can see very strong similarities between the years 2000 and 2001; two other years that are very similar are 1996 and 1997. Very isolated years are 2009 and 2012; maybe the writers tried a new story line that was unsuccessful and canceled it. Now let’s focus on one season of GTST, season 22 from September 2011 to May 2012, and look at the similarities between months. They are shown in the following figure.

[Figure: season 22 of GTST, monthly similarities]

It seems that the months September ’11, November ’11 and May ’12 are very similar but the rest of the months are quite separate.

Conclusion

The combination of R/rvest and SAS Text Miner is an ideal toolset to quickly scrape (text) data from sites and easily get first insights into those texts. Applying this to Goede Tijden Slechte Tijden (GTST) episode summaries rejects my gut feeling that once you’ve seen some GTST episodes you’ve seen them all. It turns out that story lines between years, and story lines within one year, can be quite dissimilar!

Without watching thousands of episodes I can quickly get an idea of which characters belong together, what the main topics of a season are, the rise and fall of characters in the series, and how topics change over time. But keep in mind: if you are in a train and hear two nice ladies talking about GTST, will this analysis be enough to have a meaningful conversation with them? …… I don’t think so!

Thanks @ErwinHuizenga and @josvandongen for reviewing.

Combining Hadoop, Spark, R, SparkR and Shiny…. and it works :-)

A long time ago, in 1991, I had my first programming course (Modula-2) at the Vrije Universiteit in Amsterdam. I spent months behind a terminal with a green monochrome display doing the programming exercises in VI. Do you remember Shift ZZ, and :q!… 🙂 After my university period I did not use VI often… Recently, I got hooked again!

[Figure: a system very similar to the one where I did my first text editing in VI]

I was excited to hear the announcement of SparkR, an R interface to Spark, and was eager to do some experiments with the software. Unfortunately, none of the Hadoop sandboxes have Spark 1.4 and SparkR pre-installed to play with, so I needed to undertake some steps myself. Luckily, all steps are beautifully described in great detail on different sites.

Spin up a VPS

At Argeweb I rented an Ubuntu VPS, 4 cores and 8 GB. That is a very small environment for Hadoop, and of course a 1-node environment does not show the full potential of Hadoop/Spark. However, I am not trying to do performance or stress tests on very large data sets, just some functional tests. Moreover, I don’t want to spend more money :-), though the VPS can nicely be used to install nginx and host my little website -> www.longhowlam.nl

Install R, RStudio and Shiny

A very nice blog post by Dean Attali, which I am not going to repeat here, describes how easy it is to set up R, RStudio and Shiny. I followed steps 6, 7 and 8 of his blog post and the result is a running Shiny server on my VPS environment. In addition to my local RStudio application on my laptop, I can now also use R on my iPhone through RStudio Server on my VPS. That can be quite handy in a crowded bar when I need to run some R commands….

[Figure: using R on my iPhone. You need good eyes!]

Install Hadoop 2.7 and Spark

To run Hadoop, you need to install Java first, configure SSH, fetch the Hadoop tar.gz file, install it, set environment variables in the ~/.bashrc file, modify the Hadoop configuration files, format the Hadoop file system and start it. All steps are described in full detail here. Then, in addition to that, download the latest version of Spark; the pre-built version for Hadoop 2.6 or later worked fine for me. You can just extract the tgz file, set the SPARK_HOME variable and you are done!

In each of the above steps different configuration files needed to be edited. Luckily I can still remember my basic VI skills……

Creating a very simple Shiny App

The SparkR package is already available when Spark is installed; its location is inside the Spark directory. So, when attaching the SparkR library to your R session, specify its location using the lib.loc argument of the library function. Alternatively, add the location of the SparkR library to the search paths for packages in R using the .libPaths function. See some example code below.

library(SparkR, lib.loc = "/usr/local/spark/R/lib")

## initialize the SparkR environment
sc = sparkR.init(sparkHome = '/usr/local/spark')
sqlContext = sparkRSQL.init(sc)

## convert the local R 'faithful' data frame to a Spark data frame
df = createDataFrame(sqlContext, faithful)

## apply a filter waiting > 50 and show the first few records
df1 = filter(df, df$waiting > 50)
head(df1)

## aggregate and collect the results in a local R data set
df2 = summarize(groupBy(df1, df1$waiting), count = n(df1$waiting))
df3.R = collect(df2)

Now create a new Shiny app, copy the R code above into the server.R file and, instead of the hard-coded value 50, make this an input using a slider. That’s basically it, my first Shiny app calling SparkR…… A minimal sketch of what such an app could look like is given below.
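
A minimal single-file sketch (not the exact app; the UI layout and variable names are my own):

library(shiny)
library(SparkR, lib.loc = "/usr/local/spark/R/lib")

sc = sparkR.init(sparkHome = '/usr/local/spark')
sqlContext = sparkRSQL.init(sc)
df = createDataFrame(sqlContext, faithful)

ui = fluidPage(
  sliderInput("waiting", "Minimum waiting time", min = 40, max = 100, value = 50),
  tableOutput("counts")
)

server = function(input, output) {
  output$counts = renderTable({
    df1 = filter(df, df$waiting > input$waiting)
    df2 = summarize(groupBy(df1, df1$waiting), count = n(df1$waiting))
    collect(df2)   # bring the aggregated result back to R for display
  })
}

shinyApp(ui, server)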

Cheers, Longhow

Restaurant analytics…. Now it becomes personal!

It’s personal

In my previous blog post I performed a path analysis in SAS on restaurant reviewers. It turned out that after a visit to a Chinese restaurant, reviewers on Iens tend to go to an “International” restaurant. But which one should I visit? A recommendation engine can answer that question. Everyone who has visited an e-commerce website, for example Amazon, has experienced the results of a recommendation engine: based on your click/purchase history new products are recommended. I have a Netflix subscription, and based on my viewing behavior I get recommendations for new movies; see my recommendations below.

[Figure: my Netflix recommendations. Obviously these recommendations are based on the viewing behavior of my son and daughter, who spend too much time behind Netflix…. 🙂]

Collaborative Filtering

How does it work? Let’s first look at the data that is needed. In the world of recommendation engines people often speak about users, items and the user-item rating matrix. In my scraped restaurant review data, this corresponds to reviewers, restaurants and their scores/ratings. See the figure below.

[Figure: example user-item rating matrix of reviewers and restaurants]

The question now is: how can we fill in the blanks? For example, in the data above Sarah likes Fussia and Jimmie’s Kitchen, but she has not rated the other restaurants. Can we (the computer) do this for her? Yes, we can fill in the blanks with a predicted rating and recommend the restaurant with the highest predicted rating to Sarah as the restaurant to visit next. A term you often hear in this context is collaborative filtering: a class of techniques based on the belief that a person gets the most relevant recommendations from people with ‘similar’ tastes. I am not going to write about the techniques here; a nice overview paper is Collaborative Filtering Recommender Systems by Michael D. Ekstrand, John T. Riedl and Joseph A. Konstan. It can be found here. A small toy illustration of the idea is sketched below.
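
As a toy illustration (not the SAS approach used later in this post), a simple user-based collaborative filter can be written in a few lines of R; the names and ratings below are made up.

# user-item matrix with missing ratings (rows = users, columns = restaurants)
ratings = matrix(c( 8, 7, NA, NA,
                    7, 6,  9,  5,
                    6, 7,  8, NA),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("Sarah", "John", "Peter"),
                                 c("Fussia", "Jimmies", "Milo", "Brazz")))

# replace missing ratings by 0 to compute a simple cosine similarity between users
R0 = ratings; R0[is.na(R0)] = 0
usersim = (R0 %*% t(R0)) / (sqrt(rowSums(R0^2)) %o% sqrt(rowSums(R0^2)))

# predict a missing rating as the similarity-weighted average of the other users' ratings
predict_rating = function(user, item) {
  r  = ratings[, item]
  ok = !is.na(r) & names(r) != user
  sum(usersim[user, ok] * r[ok]) / sum(usersim[user, ok])
}
predict_rating("Sarah", "Milo")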

Iens restaurant reviewers

The review data that I have scraped from the Iens website is of course much larger than the matrix shown above. There are 8,900 items (restaurants) and 100,889 users (reviewers), so a completely filled user-item matrix would contain 8,900 x 100,889 (= 897,912,100) ratings. That would mean that every reviewer has rated every restaurant, and obviously that is not the case. In fact, the user-item matrix is often very sparse: the Iens data consists of 211,143 ratings, which is only about 0.02% of the completely filled matrix.
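
The sparsity is a quick back-of-the-envelope calculation:

8900 * 100889             # 897,912,100 possible ratings
211143 / (8900 * 100889)  # ~0.00024, i.e. roughly 0.02% of the matrix is filled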

In SAS I can use the recommend procedure to create recommendation engines. The procedure supports different techniques:

  • Average, SlopeOne,
  • KNN, Association Rules
  • SVD, Ensemble, Cluster

The rating data that is needed to run the procedure should be given in a different form than the user-item matrix: a SAS data set with three columns (user, item and rating) is needed. A snippet of the data is shown below.

[Figure: snippet of the user, item, rating data set]

If I want the system to generate “personal” restaurant recommendations for me, I should also provide some personal ratings. Well, I liked Golden Chopsticks (an 8 out of 10); a few months ago I was at Fussia, which was OK (a 7 out of 10); and for SAS I was at a client in Eindhoven, so I also ate at “Van der Valk Eindhoven”, which I did not really like (a 4 out of 10). So I created a small data set with my ratings and added that to the Iens ratings.

[Figure: my own ratings added to the Iens ratings]

After that I used the recommend procedure to try different techniques and chose the one with the smallest error on a hold-out set. The workflow is given in the following screenshot.

[Figure: workflow around the recommend procedure]

To zoom in on the recommend procedure: it starts with the specification of the rating data set and of the user, item and rating columns. Then a method and its corresponding options need to be set. The following figure shows an example call.

[Figure: example call of the recommend procedure]

My personal recommendations

After the procedure has finished, a recommendation engine is available. In the above example an engine with two methods (SVD and ARM) is available, and recommendations can be generated for each user. The code below shows how to do this.

[Figure: code to generate recommendations for a user]

And the top five restaurants I should visit are (with their predicted rating)……

  1. ‘T Stiefkwartierke (9.61)
  2. Brazz (9.19)
  3. Bandoeng (9.05)
  4. De Burgermeester (9.00)
  5. Argentinos (9.00)

The first restaurant, ‘T Stiefkwartierke, is in Breda, in the south of the Netherlands. I am going to visit it when I am in the neighborhood….

Restaurant analytics: Text mining, Path analysis, Sankey, Sunbursts and Chord plots

Last month I visited my favorite Chinese restaurant in the center of Amsterdam, Golden Chopsticks, with some friends and had some nice food. I was wondering if other people shared the same opinion, so I visited iens.nl, a Dutch restaurant review site, and looked up the Golden Chopsticks restaurant. It has fairly good reviews, so that was good. For our next restaurant dinner my friends wanted to try something else: we had Chinese, so do we want to go to a Chinese restaurant again, or something else?

Let’s use some analytics to help me decide. With R and the package rvest it is not difficult to retrieve data from the Iens restaurant site. Some knowledge of CSS, XPath and regular expressions is needed, but then you can scrape away…. Inside a for loop over i, I have code fragments like the ones given below.


start   = "http://www.iens.nl/restaurant/?q=&perPagina=20&pagina="
httploc = paste(start,i,sep="")

restaurants     = html(httploc)
recensieinfo    = html_text(html_nodes(restaurants,xpath='//div[@class="listerItem_score"]'))
restaurantLabel = html_nodes(restaurants,".listerItem")
restName        = html_text(html_nodes(reslabel2,".fontSerif"))
restAddress     = html_nodes(restarantLabel ,"div address")

Nreviews        = str_extract(str_extract(recensieinfo,pattern="[0-9]* recensies"), pattern = "[0-9]*")
averageScore    = str_extract(recensieinfo,pattern="[0-9].[0-9]")

# more statements

I have retrieved information on around 11,000 restaurants; for these restaurants there are around 215,000 reviews (taking only the reviews that also have numeric scores for food, service and scenery, and ignoring older reviews). The data I scraped is in two tables, one at restaurant level and one at review level.

[Figures: the restaurant-level and review-level tables]

First, some interesting facts on the scraped data.

Chinese restaurant names

It is true that certain names of Chinese restaurants occur more often. There are around 1,600 Chinese restaurants; the most frequent names are Kota Radja (39), Peking (36), Lotus (33), De Chinese Muur (32), De Lange Muur (25), China Garden (25), Hong Kong (22) and Ni Hao (16). My father once had a Chinese restaurant called Hong Kong! See the dashboard below. This kind of recurring name is quite specific to Chinese restaurants; if we look at other kitchen types we find more unique restaurant names. You’ll find the names per kitchen in a little Shiny app here.

Kitchen/restaurant type and number of reviews

Of the 11,000 restaurants, there are around 8,900 that have reviews, with the restaurants El Olivo, Rhodos and Floor17 having the most reviews. There are around 215,000 reviews written by 103,800 reviewers, with MartinK having written the most reviews.

The five types of restaurants that occur most often in all the reviews are “International”, “Hollands”, French, Italian and Chinese, as shown in the tree map below. The color of the tree map represents the average number of reviews per restaurant type. So although Chinese restaurants form a large fraction (8.8%) of all restaurants, there are only on average 8.3 reviews per Chinese restaurant. On the other hand, Italian restaurants form 8.4% of all restaurants and have more than 18 reviews per Italian restaurant. Conclusion: People eat at Chinese restaurants, but they just don’t write about it…..

[Figure: tree map of the restaurant types]

Review dates

The following figure shows a screenshot of a SAS Visual Analytics dashboard that I made from the Iens data. It shows a couple of things:

  • The uniqueness of restaurant names per kitchen type.
  • The number of reviews per day; we can see a Valentine’s Day peak on 14-Feb-2015.
  • The average score per day; it looks like the scores in the month of July are a little bit lower.
  • Saturdays and Fridays are busier for restaurants. Everybody knows that, and the Iens review data confirms it.

[Figure: SAS Visual Analytics dashboard of the Iens data]

Review topics

If you look at the Iens site, you’ll see that almost all of the reviews are written in Dutch, which you would expect. But if you run a language detection on the reviews first, it’s funny to see that there are some reviews in other languages. English is the second language (300 or so reviews), and there are a few Italian reviews. It turns out that the language detection (R textcat package) on the “Italian” reviews is fooled by some particular Italian words like buonissimo, cappuccino, carprese and piza. A minimal sketch of such a language detection is given below.
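
A minimal sketch of such a language detection, assuming the review texts are in a character vector review_text (a hypothetical name):

library(textcat)

# detect the most likely language of each review and tabulate the result
review_language = textcat(review_text)
table(review_language)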

What are the topics that we can extract from the 215,000 reviews? In an earlier blog post I wrote about text mining; here I have used SAS Text Miner to generate topics from the Iens reviews. So what are people writing about? The picture below shows five topics that occur a lot in all the reviews. A topic in SAS Text Miner is characterized by key terms (in Dutch), which are given below.

[Figure: five frequent review topics and their key terms]

So it’s always the same: complaining about the long wait for the food…. but on the other hand a lot of reviews on how great and fantastic the evening was…

Next restaurant type to visit

To come back to my first question: what type of restaurant should I visit next? I can perform a path analysis to answer that question. Scraping the restaurant site not only resulted in the texts of the reviews but also in the reviewer, the date and the restaurant (type). So in the data below, for example, Paul’s path consists of a first visit to Sisters, then to Fussia and then to Milo. A sketch of how such transitions can be derived is given below the figure.

[Figure: example reviewer path data]
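
As a rough sketch (with hypothetical column names, not the SAS path analysis itself), the visit-to-visit transitions could be derived in R like this:

library(dplyr)

# order the reviews per reviewer by date and pair each visit with the next one
transitions = reviews %>%
  arrange(reviewer, reviewdate) %>%
  group_by(reviewer) %>%
  mutate(from_type = kitchentype, to_type = lead(kitchentype)) %>%
  ungroup() %>%
  filter(!is.na(to_type)) %>%
  count(from_type, to_type, sort = TRUE)   # most frequent transitions first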

With SAS software I can either use Visual Analytics to generate a Sankey diagram that shows the most frequent paths, or perform a path analysis on this data in Enterprise Miner and then visualize the most frequent paths in a sunburst or chord diagram. See the pictures below.

[Figure: path analysis and Sankey diagram]

[Figure: sunburst and chord diagrams]

Interactive versions of the Iens sunburst and chord diagrams can be found here and here. Conclusion: it turns out that after a Chinese restaurant visit most reviewers will visit an “International” restaurant type…. In a next blog post I will go one step further and show how to use techniques from recommendation engines to make recommendations at restaurant level.

Cheers,

Longhow

Some simple facial analytics on actors (and my manager)

Some time ago I was at a party and, inevitably, a question that came up was: “Longhow, what kind of work are you doing?” I answered: I am a data scientist, I have the sexiest job, do you want me to show you how to use deep learning for facial analytics…… Oops, it became very quiet. Warning: don’t google the words sexy, deep and facial at your work!

But anyway, I was triggered by the Face++ website and thought: can we do some simple facial analytics? The Face++ site offers an API where you can upload images (or use links to images). For each image that you upload, the Face++ engine can detect whether there are one or more faces. If a face is detected, you can also let the engine return 83 facial landmarks. Example landmarks are nose_tip, left_eye_left_corner and contour_chin; see the picture below. So an image is now reduced to 83 landmarks.

[Figure: example of the 83 facial landmarks]

What are some nice faces to analyse? Well, I just googled “top 200 actors” and got a nice list of actor names, and with the permission of my manager Rein Mertens I added his picture as well. These are the steps I took next:

Data preparation

1. I downloaded import.io, a very handy tool to turn web pages into data. In the tool, enter the link to the list of 200 actors. It will then extract all the links to the 200 pictures (together with more data) and conveniently export them to a csv file.

[Figure: import.io, from web page to data]

2. I wrote a small script in R to import the csv file from step 1, so for each of the 200 actors I have the link to his image. Now I can call the Face++ API for every link and let Face++ return the 83 facial landmarks. The result is a data set with 200 rows, 166 columns (the x and y positions of the landmarks) and an ID column containing the name of the actor.

3. I added the 83 landmarks of my manager to the data set. So my set now looks like:

[Figure: analytical base table of actor faces]

There were some images where Face++ could not detect a face; I removed those images.

[Figure: faces not recognized by the Face++ engine]

Now that I have this so-called facial analytical base table of actors, I can apply some nice analytics.

Deep learning, autoencoders

What are some strange faces? To answer that from an objective point of view, I used deep learning autoencoders. The idea of an autoencoder is to learn a compressed representation for a set of data, typically for the purpose of dimensionality reduction (Wikipedia). The trick is to use a neural network where the output target is the same as the input variables. The actor faces are observations in a 166-dimensional space; using proc neural in SAS I have reduced the faces to points in a two-dimensional space. I have used 5 hidden layers where the middle layer consists of two neurons, as shown in the SAS code snippet below (an R sketch of the same idea follows it). More information on autoencoders in SAS can be found here.

[Figure: the proc neural code for the autoencoder]
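
The original SAS proc neural code is only shown as an image; as a rough alternative (not the author's SAS code), the same idea can be sketched in R with h2o, as in the MNIST post above. The table name faces, its column layout and the layer sizes are assumptions.

library(h2o)
h2o.init()

# faces: an ID column with the actor name plus 166 landmark columns (x/y coordinates)
faces_h2o = as.h2o(faces)

face_autoencoder = h2o.deeplearning(
  x = 2:167,                        # the 166 landmark columns
  training_frame = faces_h2o,
  hidden = c(80, 40, 2, 40, 80),    # 5 hidden layers with a 2-neuron bottleneck
  activation = "Tanh",
  autoencoder = TRUE
)

# the 2-dimensional representation of every face (hidden layer 3)
face_2d = as.data.frame(h2o.deepfeatures(face_autoencoder, faces_h2o, layer = 3))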

The faces can now be plotted in a scatter plot (as shown below), with the two dimensions corresponding to the middle layer.

[Figure: scatter plot of actor faces]

A more interactive version of this plot can be found here; it is a simple D3.js script where you can hover over the dots and click to see the underlying picture.

Hierarchical Cluster analysis

Can we group together faces that are similar (i.e. have similar facial landmarks)? Yes, we can apply a clustering technique to do that. There are many different techniques; I have used an agglomerative hierarchical clustering technique because the algorithm can be nicely visualized in a so-called dendrogram. The algorithm starts with each observation (face) as its own cluster; in each following iteration pairs of clusters are merged until all observations form one cluster. In SAS you can use proc cluster to perform hierarchical clustering; see the code and dendrogram below, and an R sketch of the same idea after them.

[Figure: the proc cluster code]

[Figure: dendrogram of the actor faces]
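
For reference, a minimal R sketch of the same kind of agglomerative clustering (not the SAS proc cluster code from the figure), again assuming the landmark table faces with an ID column followed by the 166 landmark columns:

# pairwise distances between faces on the 166 landmark columns
face_dist = dist(faces[, 2:167])

# agglomerative hierarchical clustering and the corresponding dendrogram
face_hc = hclust(face_dist, method = "ward.D2")
plot(face_hc, labels = faces$ID, cex = 0.6)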

Using D3.js I have also created a more interactive dendrogram.

— L. —


Some (spatial) analytics on primary school exams (Cito toets) in The Netherlands

Predicting the Cito Score

Recently my six-year-old son had his first Cito test; during his career at primary school he will have many more such Cito tests. It will end when he is around twelve and has his final Cito exam. A primary school in the Netherlands does not have to participate in the Cito exams; there are around 6,800 primary schools in the Netherlands, of which around 5,300 participate in the Cito exam. For each of those schools the average Cito score (the average over all pupils) of the final Cito exam is available. Let’s just call this the Cito score.

As a data scientist, I like to know whether there are factors that can be used to predict the Cito score and, more importantly, whether we can get that data. A little search on the internet resulted in a few interesting data sets that might be useful to predict the Cito score:

  • School details, like the type of school (public, Christian, Jewish, Muslim, Montessori, etc.) and the location
  • Financial results, like the total assets, liabilities, working capital etc.
  • Staff & pupils, number of FTE’s, number of pupils, part-time personnel, management FTE
  • Data on the education level of the parents of the pupils

I also have the house for sale data from Funda, which can tell me something about the prices of the houses close to the school. There is also open data on the demographics of a neighborhood, like number of people, number of immigrants, number of children, number of males, etc.

Let’s have a look at the Cito score first: how is it distributed, and what are low and high values? The figure below shows the distribution. The average of the Cito scores is 534.7, the median is 535, the 10% percentile is 529.3 and the 90% percentile is 539.7.

[Figure: distribution of the Cito scores]

Now let’s predict the Cito score. In SAS Enterprise Miner I have fitted a simple linear regression (r^2 = 0.35), a random forest (r^2 = 0.39) and a neural network (r^2 = 0.38), so the random forest has the best predictive power. Although it is difficult to interpret a random forest, it does identify the factors that are important for predicting the Cito score. The three most important factors are:

  • Percentage of pupils of a school whose parents both have an education
  • Percentage of pupils of a school whose parents both are uneducated
  • Average sales price of the 9 houses closest to the school

A plot of the three factors against the Cito score is given in the plots below, click on them to enlarge.

[Figures: the three most important factors plotted against the Cito score]

From the graphs above we can see that schools in an area where house prices are 300K have a Cito score around 1 point higher than schools in an area where house prices are 200K. So every 100K increase in house prices means roughly a 1-point increase in the Cito score. Conclusion: with open data we can predict the Cito score with reasonable accuracy. See my Shiny app for the random forest Cito score prediction for each individual school.

Spatial AutoCorrelation

Spatial autocorrelation occurs when points that are close to each other have similar (correlated) values. So do schools that are close to each other have similar Cito scores? We can analyse this by means of a variogram or correlogram, which describes how data are related (correlated) as a function of the distance h.

The variogram is calculated by

\gamma(h) = \frac{1}{2|N(h)|} \sum_{N(h)} (z_i - z_j)^2

where z_i and z_j are the values (Cito scores) at locations i and j, and N(h) is the set of all pairs of locations separated by distance h. So it calculates the average squared difference of values (Cito scores) separated by distance h. It is often easier to interpret a correlation (coefficient) instead of squared differences; for example, a correlation coefficient larger than 0.8 means strongly correlated values, while a correlation lower than 0.1 means weakly correlated values. A correlogram \rho(h) can be calculated from a variogram:

\rho(h) = 1 - \frac{\gamma(h)}{C(0)}

where C(0) is the variance of the data. In SAS we can estimate the empirical variogram and correlogram from the Cito data with proc variogram. An example call is given in the figure below; an R sketch of the same idea follows after it.

[Figure: example call of proc variogram]
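
As a rough alternative to the SAS call shown above (not the author's code), the empirical variogram can also be estimated in R with the gstat package; the school coordinates and column names below are assumptions.

library(sp)
library(gstat)

# schools: a data frame with x/y coordinates (e.g. in meters) and the Cito score
coordinates(schools) = ~ x + y

# empirical variogram of the Cito scores and a simple correlogram derived from it
v = variogram(citoscore ~ 1, schools)
v$correlation = 1 - v$gamma / var(schools$citoscore)

plot(v$dist, v$correlation, type = "b",
     xlab = "distance h", ylab = "correlation")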

The resulting correlogram is given in the figure below.

[Figure: empirical correlogram of the Cito scores]

Conclusion: Schools that are close to each other (within 1.5 KM) have strongly correlated Cito scores. So if your son’s school has a bad Cito score, you can take him to another school, but be sure the other school is at least 7 KM away from the current school.