New chapters for 50 shades of grey….

WP

Introduction

Some time ago I had the honor to follow an interesting talk from Tijmen Blankevoort on neural networks and deeplearning. Convolutional and recurrent neural networks were topics that already caught my interest and this talk inspired me to dive into these topics deeper and do some more experiments with it.

In the same session organized by Martin de Lusenet for Ziggo (a Dutch cable company) I also had the honor to give a talk, my presentation contained a text mining experiment that I did earlier on the Dutch TV soap GTST “Goede Tijden Slechte Tijden”. A nice idea by Tijmen was: Why not use deep learning to generate new episode plots for GTST?

So I did that, see my LinkedIn post on GTST. However, these episodes are in Dutch and I guess only interesting for people here in the Netherlands. So to make things more international and more spicier I generated some new texts based on deep learning and the erotic romance novel 50 shades of grey 🙂

More than plain vanilla networks

In R or SAS you could already train plain vanilla neural networks for a long time. The so-called fully connected networks where all input nodes are connected to all nodes in the following hidden layer.And all nodes in a hidden layer are connected to all nodes in the following hidden layer or output layer.

neural

In more recent years deep learning frame works have become very popular. For example Caffe, Torch, CTNK, Tensorflow and MXNET. The additional value of these frame works compared to SAS for example are:

  • They support more network types than plain vanilla networks. For example, convolutional networks, where not all input nodes are connected to a next layer. And recurrent networks, where loops are present. A nice introduction to these networks can be found here and here.
  • They support computations on GPU’s, which could speed up things dramatically.
  • They are open-source and free. No need for long sales and implementation cycles 🙂 Just download it and use it!
RNN

recurrent neural network

My 50 Shades of Grey experiment

For my experiment I used the text of the erotic romance novel 50 shades of grey. A pdf can be found here, I used xpdfbin to extract all the words into a plain text file. I trained a Long Short Term Memory network (LSTM, a special type of recurrent networks), with MXNET. The reason to use MXNET is that they have a nice R interface, so that I can just stay in my comfortable RStudio environment.

Moreover, the R example script of MXNET is ready to run, I just changed the input data and used more rounds of training and more hidden layers. The script and the data can be found on Github.

The LSTM model is fit on character level, the complete romance novel contains 817,204 characters, all these characters are mapped to a number (91 unique numbers). The first few numbers are shown in the following figure.

numbers

Once the model has been trained it can generate new text, character by character!

arsess whatever
yuu’re still expeliar a sally. Reftion while break in a limot.”
“Yes, ald what’s at my artmer and brow maned, but I’m so then for a
dinches suppretion. If you think vining. “Anastasia, and depregineon posing rave.
He’d deharing minuld, him drits.

“Miss Steele
“Fasting at liptfel, Miss I’ve dacind her leaches reme,” he knimes.
“I want to blight on to the wriptions of my great. I find sU she asks the stroke, to read with what’s old both – in our fills into his ear, surge • whirl happy, this is subconisue. Mrs. I can say about the battractive see. I slues
is her ever returns. “Anab.

It’s too even ullnes. “By heaven. Grey
about his voice. “Rest of the meriction.”
He scrompts to the possible. I shuke my too sucking four finishessaures. I need to fush quint the only more eat at me.
“Oh my. Kate. He’s follower socks?
“Lice in Quietly. In so morcieut wait to obsed teach beside my tired steately liked trying that.”
Kate for new of its street of confcinged. I haven’t Can regree.
“Where.” I fluscs up hwindwer-and I have

I’ll staring for conisure, pain!”
I know he’s just doesn’t walk to my backeting on Kate has hotelby of confidered Christaal side, supproately. Elliot, but it’s the ESca, that feel posing, it make my just drinking my eyes bigror on my head. S I’ll tratter topality butterch,” I mud
a nevignes, bleamn. “It’s not by there soup. He’s washing, and I arms and have. I wave to make my eyes. It’s forgately? Dash I’d desire to come your drink my heathman legt
you hay D1 Eyep, Christian Gry, husder with a truite sippking, I coold behind, it didn’t want to mive not to my stop?”
“Yes.”

“Sire, stcaring it was do and he licks his viice ever.”
I murmurs, most stare thut’s the then staraline for neced outsive. She
so know what differ at,” he murmurs?
“I shake my headanold.” Jeez.
“Are you?” Eviulder keep “Oh,_ I frosing gylaced in – angred. I am most drink to start and try aparts through. I really thrial you, dly woff you stund, there, I care an right dains to rainer.” He likes his eye finally finally my eyes to over opper heaven, places my trars his Necked her jups.
“Do you think your or Christian find at me, is so with that stand at my mouth sait the laxes any litee, this is a memory rude. It
flush,” He says usteer?” “Are so that front up.
I preparraps. I don’t scomine Kneat for from Christian.
“Christian,’! he leads the acnook. I can’t see. I breathing Kate’ve bill more over keen by. He releases?”
“I’m kisses take other in to peekies my tipgents my

Conclusion

The generated text does not make any sense, nor will it win any literature prize soon 🙂 Keep in mind, that the model is based ‘only’ on 817,204 characters  (which is considered a small number), and I did not bother to fine-tune the model at all. But still it is funny and remarkable to see that when you use it to generate text, character by character, it can still produce a lot of correct English words and even some correct basic grammar patterns!

cheers, Longhow.

 

Advertisements

Bon Appetit: A restaurant recommender for tourists visiting the Netherlands

Introduction

More and more tourists are visiting The Netherlands, this will become very clear if you walk through the center of Amsterdam on a sunny day. All those tourists need to eat somewhere, in some restaurant. You can see their sad faces as they have no clue where to go. Well, with the aid of a little data science I have made it easy for them :-). A small R Shiny app for tourists to inform them to which restaurant they should go in The Netherlands. In this blog post I will describe the different steps that I have taken.

iamsterdam

Tourists in Amsterdam wondering where to eat……

Iens reviews

In an earlier blog post I wrote about scraping restaurant review data from www.iens.nl and how to use that to generate restaurant recommendations. The technique was based on the restaurant ratings given by the reviewers. To generate personal recommendations you need to rate some restaurants first. But as a tourist visiting The Netherlands for the first time this might be difficult.

So I have made it a little bit easier, enter your idea of food in my Bon Appetit Shiny app, it will translate the text to Dutch if needed, then calculate the similarity of your translated text and all reviews from Iens, and then give you the top ten restaurants whose reviews matches best.

The Microsoft translator API

Almost all of the reviews on the Iens restaurant website are in Dutch, I assume that most tourists from outside The Netherlands do not speak Dutch. That is not a large problem, I can translate non Dutch text to Dutch by using a translator. Google and Microsoft offer translation API’s. I have chosen for the Microsoft API because they offer a free tier. The first 2 million characters are free per month. Sign-up and get started here. And because the API supports the Klingon language….. 🙂

The R franc package can recognize the language of the input text:


lang = franc(InputText)
ISO2 = speakers$iso6391[speakers$language==lang]
from = ISO2

The ISO 2 letter language code is needed in the call to the Microsoft translator API. I am making use of the httr package to set up the call. With your clientID and client secret a token must be retrieved. Then with this token the actual translation is done.


#Set up call to retrieve token

clientIDEncoded = URLencode("your microsoft client ID")

client_SecretEncoded = URLencode("your client secret")
Uri = "https://datamarket.accesscontrol.windows.net/v2/OAuth2-13"

MyBody = paste(
   "grant_type = client_credentials&client_id=",
   clientIDEncoded,
   "client_secret=",
   client_SecretEncoded,
   "&scope=http://api.microsofttranslator.com",
   sep="";
)

r = POST(url=Uri, body = MyBody, content_type("application/x-www-form-urlencoded"))
response = content(r)

Now that you have the token, make a call to translate the text


HeaderValue = paste("Bearer ", response$access_token, sep="")

TextEncoded = URLencode(InputText)

to = "nl"

uri2 = paste(
   "http://api.microsofttranslator.com/v2/Http.svc/Translate?text=",
   TextEncoded,
   "&from=",
   from,
   "&to=",
   to,
   sep=""
)

resp2 = GET(url = uri2, add_headers(Authorization = HeaderValue))
Translated = content(resp2)

#### dig out the text from the xml object
TranslatedText  = as(Translated , "character") %>% read_html(pp) %>% html_text()

Some example translations,

Louis van Gaal is notorious for his Dutch to English (or any other language for that matter) translations. Let’s see how the Microsoft API performs on some of his sentences.

  • Dutch: “Dat is hele andere koek”, van Gaal: That is different cook”, Microsoft: That is a whole different kettle of fish”.
  • Dutch: “de dood of de gladiolen”, van Gaal: “the dead or the gladiolus”, Microsoft: “the dead or the gladiolus”. 
  • Dutch: “Het is een kwestie van tijd”, van Gaal: “It’s a question of time”, Microsoft: “It’s a matter of time”.

The Cosine similarity

The distance or similarity between two documents (texts) can be measured by means of the cosine similarity. When you have a collection of reviews (texts), then this collection can be represented by a term document matrix. A row of this matrix is one review, its a vector of word counts. Another review or text is also a vector of word counts, given two vectors A and B the cosine similarity  is given by:

cosine

Now the input text that is translated to Dutch is also a vector of word counts and so can calculate the cosine similarity between each restaurant review and the input text. The restaurants corresponding to the most similar reviews are returned as recommended restaurants, bon appetit 🙂

Putting all together in a Shiny app

The above steps are implemented in my bon appetit Shiny app. Try out your thoughts and idea of food and get restaurant recommendations! Here is an example:

Input text: Large pizza with chicken and cheese that is tasty.

shiny

Input text translated to Dutch

shiny2

The top ten restaurants corresponding to the translated input text

 

And for the German tourist: “Ich suche eines schnelles leckeres Hahnchen”, this gets translated to Dutch “ik ben op zoek naar een snelle heerlijke kip” and the ten restaurant recommendations you get are given in the following figure.

shiny3

cheers,

— Longhow —

Soap analytics: Text mining “Goede tijden slechte tijden” plot summaries….

Sorry for the local nature of this blog post. I was watching Dutch television and zapping between channels the other day and I stumbled upon “Goede Tijden Slechte Tijden” (GTST). This is a Dutch soap series broadcast by RTL Nederland. I must confess, I was watching (had to watch) this years ago because my wife was watching it…… My gut feeling with these daily soap series is that missing a few months or even years does not matter. Once you’ve seen some GTST episodes you’ve seen them all, the story line is always very similar. Can we use some data science to test if this gut feeling makes sense? I am using R and SAS to investigate this and see if more interesting soap analytics can be derived.

Scraping episode plot summaries

First, data is needed. All GTST episode plot summaries are available, from the very first episode in October 1990.blogpic01

A great R package to scrape data from web sites is rvest by Hadley Wickham, I can advise anyone to learn this. Luckily, the website structure of the GTST plot summaries is not that difficult. With the following R code I was able to extract all episodes. First I wrote a function that extracts the plot summaries of one specific month.

getGTSTdata = function(httploc){
  tryCatch({
    gtst = html(httploc) %>%
     html_nodes(".mainarticle_body") %>%
     html_children()

    # The dates are in bold, the plot summaries are normal text
    texts = names(gtst[[1]]) == "text"
    datumsel = names(gtst[[1]]) == "b"

    episodeplot = gtst[[1]][texts] %>%
     sapply(html_text) %>%
     str_replace_all(pattern='\n'," ")

    episodedate = gtst[[1]][datumsel] %>%
     sapply(html_text)

    # put data in a data frame and return as results
    return(data.frame(episodeplot, episodedate))
  },
    error = function(cond) {
      message(paste("URL does not seem to exist:", httploc))
      message("Here's the original error message:")
      message(cond)
      # Choose a return value in case of error
      return(NULL)
    }
  )
}

This function is then used inside a loop over all months to get all the episode summaries. Some months do not have episodes, and the corresponding link does not exist (actors are on summer holiday!). So I used the tryCatch function inside my function which will continue the loop if a link does not exist.

months = c("januari", "februari","maart","april","mei","juni","juli","augustus","september","oktober","november","december")
years = 1990:2012

GTST_Allplots = data.frame()

for (j in years){
  for(m in months){
    httploc = paste("http://www.rtl.nl/soaps/gtst/meerdijk/tvgids/", m, j, ".xml", sep="")
    out = getGTSTdata(httploc)

    if (!is.null(out)){
      out$datums = paste(out$datums, j)
      GTST_Allplots = rbind(GTST_Allplots,out)
    }
  }
}

A a result of the scraping, I got 4090 episode plot summaries. The real GTST fans know that there are more episodes, the more recent episode summaries (from 2013) are on a different site. For this analysis I did not bother to use them.

Text mining the plot summaries

The episode summaries are now imported in SAS Enterprise Guide, the following figure shows the first few rows of the data.

blogpic02

Click to enlarge

With SAS Text miner it is very easy and fast to analyze unstructured data, see my earlier blog post. What are the most important topics that can be found in all the episodes? Let’s generate them, I can use either the text topic node or the text cluster node in SAS. The difference is that with the topic node an episode can belong to more than one topic, while with the cluster node an episode belongs only to one cluster.

blogpic03I have selected 30 topics to be generated, you can experiment with different numbers of topics. The resulting topics can be found in the following figure.

blogpic04

GTST topics. click to enlarge

Looking at the different clusters found, it turns out that many topics are described by the names of the characters in the soap series. For example topic number 2 is described by terms like, “Anton“, “Bianca“, “Lucas“, ‘Maxime“, and “Sjoerd“, they occur often together in 418 episodes. And topic number 25 involves the terms “Ludo“, “Isabelle“, “Martine“, “Janine“. Here is a picture collage to show this in a more visual appealing way. I have only attached faces for six clusters, the other clusters are still the colored squares. The clusters that you see in the graph are based on the underlying distances between the clusters.

GTST_COLLAGE

click to enlarge

Zoom in on a topic

Now let’s say I am interested in the characters of topic 25 (the Ludo Isabelle story line). Can I discover sub-topics in this story line? So, I apply a filter on topic 25, only the episodes that belong to the Ludo Isabelle story line are selected and I generate a new set of topics (call them sub topics to distinguish them form the original topics) .

blogpic05

Text miner flow to investigate a specific topic

What are some of the subtopics of the Ludo Isabelle story line?

  • Getting money back
  • George, Annie Big panic, very worried
  • Writing farewell letter
  • Plan of Jack brings people in danger

As a data scientist I should now reach out or go back to the business and ask for validation. So I am reaching out now: Are there any real hardcore GTST fans out there that can explain:

  • why certain characters are grouped together?
  • why certain groups of characters are closer to each other than other groups?
  • recognize subtopics in Ludo Isabelle story lines?

Text profiling

To comeback to my gut feeling, can I see if the different story lines remain similar? I can use a the Text profile node in SAS Text miner to investigate this. The Text Profile node enables you to profile a target variable using terms found in the documents. For each level of a target variable, the node outputs a list of terms from the collection that characterize or describe that level. The approach uses a hierarchical Bayesian belief model to predict which terms are the most likely to describe the level. In order to avoid merely selecting the most common terms, prior probabilities are used to down-weight terms that are common in more than one level of the target variable.

The target variable can also be a date variable. In my scraped episode data I have the date of the episodes, there is a nice feature that allows the user to set the aggregation level. Let’s look at years as aggregation level.

blogpic06

Text profile node in SAS

The output consists of different tables and plots, two interesting plots are the Term Time series plot and the Target Similarity plot. The first one shows for a selected year the most important terms and how these terms evolve over the other years. Suppose we select 1991 then we get the following graph.

blogpic07

Click to enlarge. Terms over years

Myriam was an important term (character) in 1991, but we see that here role stopped in 1995. Jef was a slightly less important term, but his character continued for a very long time in the series. The similarity plot shows the similarities between the sets of episodes by the different years. The distances are calculated on the term beliefs of the chosen terms.

blogpic08

GTST similarities by year

We can see very strong similarities between years 2000 and 2001, two years that are also very similar are 1996 and 1997. Very isolated years are 2009 and 2012, maybe the writers tried a new story line that was unsuccessful and canceled it. Now let’s focus on one season of GTST, season 22 from September 2011 to May 2012, and look at similarities between months. They are shown in the following figure.

blogpic09

Season 22 of GTST: monthly similarities

It seems that the months September ’11, November ’11 and May ’12 are very similar but the rest of the months are quite separate.

Conclusion

The combination R / rvest and SAS Text miner is an ideal tool combination to quickly scrape (text) data from sites and easily get first insights into those texts. Applying this on Goede Tijden Slechte Tijden (GTST) episode summaries rejects my gut feeling that once you’ve seen some GTST episodes you’ve seen them all. It turns out that story lines between years, and story lines within one year can be quite dissimilar!

Without watching thousands of episodes I can quickly get an idea of which characters belong together, what the main topics are of a season, the rise and fall of characters in the series and how topics are changing over time. But keep in mind: If you are in a train and hear two nice lady’s talking about GTST, will this analysis be enough to have meaningful conversations with them?  …… I don’t think so!

Thanks @ErwinHuizenga and @josvandongen for reviewing.

Text mining basics: The IMDB reviews

Within Big Data analytics, the analysis of unstructured data has received a lot of attention recently. text mining can be seen as a collection of techniques to analyze unstructured data (i.e. documents, e-mails, tweets, product reviews or complaints). Two main applications of text mining:

  1. Categorization. Text mining can categorize (large amounts of) documents into different topics. I think we all have created dozens of sub folders in Outlook to organize our e-mails, and manually moved e-mails to these folders. No need to do that anymore, let text mining create and name the folders and automatically categorize e-mails into those folders.
  2. Prediction. With text mining you can predict if a document belongs to a certain category. For example, all the e-mails that a company receives, which of those e-mails belong to the category “negative sentiment”, or which of those e-mails belong to people who are likely to terminate their subscription?

How does text mining work? Let’s analyze some IMDB reviews, 50.000 reviews that are already classified into positive or negative sentiment, the data can be found here. There are a few steps to undertake, quit easily performed in SAS Text Miner.

1. Parse / Filter

First I imported the reviews into a SAS data set. One column contains all the reviews, each review in a separate record. There is also a second column, the binary target, each review is classified as POSITIVE or NEGATIVE sentiment. The reviews need to be parsed and filtered.

PARSEFILTER

The parsing and filtering nodes parse the collection of reviews to quantify information about the terms in all the reviews.The parsing can adjusted to include stemming (treat house and houses as one term), synonym lists (treat movie and film as one term) , stop lists (ignore the, in and with). The result is a so-called term document matrix, representing the frequency that a term occurs in a document.

TermDoc

2. Apply Singular Value Decomposition

A problem that arises when there are many reviews is that the term document matrix can become very large, other problems that may arise in term document matrices are sparseness (many zeros in the matrix) and term dependency (the term boring and long may occur often together in reviews and so are strongly correlated). To resolve these problems a singular value decomposition (SVD) is applied on the term document matrix. In an earlier blog post I described the SVD. The term document matrix A is factorized into

A = U \Sigma V^T

Instead of using all the singular values we now only use the largest k singular values.In the term document each review R is represented by an m dimensional vector of terms, using the SVD this can be projected onto a lower dimensional sub space with

\hat{R} = U^T_k R

So our three reviews will look like

SVDspace

3. Categorization or Prediction

Now that each review is projected onto a lower dimensional subspace, we can apply data mining techniques. For categorization we can apply clustering ( for example k-means or hierarchical clustering). Each review will will be segmented into a cluster, for clustering we do not need the Sentiment column. The next figure shows an example of a hierarchical clustering in SAS Enterprise Miner.

Clusters

The clustering resulted in 13 clusters, the 13 node leaves. Each cluster is described by descriptive terms. For example, one cluster of reviews contains reviews of people talking about the fantastic script and plot, and another cluster talks about the bad acting.

To predict the sentiment we need the sentiment column, in Enterprise Miner I have set that column as a Target and projected the reviews onto a 300 dimensional subspace, so 300 inputs, SVD1, SVD2,…,SVD300. I have tried several models machine learning methods, random forests, gradient boosting, neural networks. traintextminer

It turns out that a neural network with 1 layer with 50 neurons works quit well, an area under ROC of 0.945 on a holdout out set.