Search the Bold

Looking for specific episodes of The Bold and The Beautiful?

There is a nice Python package, “holmes-extractor”, that builds upon the spacy package. It supports several use cases:

  • Chatbot
  • Structural matching
  • Topic matching
  • Supervised document classification

I have used 4000 recaps of The Bold and the Beautiful from the last 16 years to test topic matching. I am using it to find texts in the recaps whose meaning is close to that of another query document or a query phrase entered by the user.

Just one example: looking for “she wanted to be close to him” gave me the recap of 2013-12-09: “It had only just begun, and he wanted to be closer to her. He had wanted to touch her, grab her, and he knew she wanted the same thing.”

The holmes-extractor code is concise and runs out of the box with little tweaking and tuning. See the Python module here. Try it out yourself and find your favourite TBATB episode; my example notebook and the recaps are available as well.

The Bold & Beautiful Character Similarities using Word Embeddings


Introduction

I often see advertisements for The Bold and The Beautiful, but I have never watched a single episode of the series. Still, even as a data scientist you might wonder how the beautiful ladies and gentlemen from the show are related to each other. I do not have the time to watch all these episodes to find out, so I am going to use word embeddings on recaps instead…

Calculating word embeddings

First, we need some data. The first few Google hits brought me to the site Soap Central, where recaps of the show date back to 1997. I then used a little bit of rvest code to scrape the daily recaps into an R data set.
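Purely as an illustration, such a scraping step could look roughly like the sketch below; the URL and the CSS selector are placeholders, not the actual Soap Central page structure (the real scraping code is on my GitHub).

# Hypothetical sketch of scraping one recap page with rvest.
# The URL and the selector are placeholders, not the real site structure.
library(rvest)

recap_url  <- "http://www.example.com/bb/recaps/2013-12-09"   # placeholder URL
recap_text <- read_html(recap_url) %>%
  html_nodes("div.recap-body") %>%                            # placeholder CSS selector
  html_text()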

Word embedding is a technique that transforms a word into a vector of numbers; there are several approaches to do this. I have used the so-called Global Vectors (GloVe) word embedding. See here for details: it makes use of word co-occurrences determined from a (large) collection of documents, and there is a fast implementation in the R text2vec package.

Once words are transformed into vectors, you can calculate distances (similarities) between the words; for a specific word you can, for example, calculate the top 10 closest words. Moreover, linguistic regularities can be determined, for example:

amsterdam - netherlands + germany

would result in a vector that would be close to the vector for berlin.
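To make this concrete, here is a minimal sketch of how such embeddings can be fitted and queried with text2vec. It is not the exact code behind this post (that is on my GitHub); it assumes a character vector recaps with one recap per element, a reasonably recent text2vec API, and illustrative parameter values.

# Minimal sketch: fit GloVe vectors on the recaps and query nearest words.
# Assumes 'recaps' is a character vector with one recap per element.
library(text2vec)

it    <- itoken(tolower(recaps), tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
tcm   <- create_tcm(it, vocab_vectorizer(vocab), skip_grams_window = 5)

glove        <- GlobalVectors$new(rank = 250, x_max = 10)    # 250-dimensional vectors
wv_main      <- glove$fit_transform(tcm, n_iter = 25)
word_vectors <- wv_main + t(glove$components)                # combine main and context vectors

# the ten closest words to "steffy"
sims <- sim2(word_vectors, word_vectors["steffy", , drop = FALSE], method = "cosine")
head(sort(sims[, 1], decreasing = TRUE), 10)

# a regularity like amsterdam - netherlands + germany needs embeddings trained on a
# general corpus (the recaps are unlikely to contain these words):
# query <- word_vectors["amsterdam", ] - word_vectors["netherlands", ] + word_vectors["germany", ]
# sim2(word_vectors, t(as.matrix(query)), method = "cosine")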

Results for The B&B recaps

It takes about an hour on my laptop to determine the word vectors (length 250) from 3645 B&B recaps (15 seasons). After removing some common stop words, I have 10,293 unique words; text2vec puts the embeddings in a matrix (10,293 by 250).

Let's take the lovely Steffy; the ten closest words are:
    from     to     value
    <chr>  <chr>     <dbl>
 1 steffy steffy 1.0000000
 2 steffy   liam 0.8236346
 3 steffy   hope 0.7904697
 4 steffy   said 0.7846245
 5 steffy  wyatt 0.7665321
 6 steffy   bill 0.6978901
 7 steffy  asked 0.6879022
 8 steffy  quinn 0.6781523
 9 steffy agreed 0.6563833
10 steffy   rick 0.6506576

Let's take the vector steffy – liam; the closest words we get are:

       death     furious      lastly     excused frustration       onset 
   0.2237339   0.2006695   0.1963466   0.1958089   0.1950601   0.1937230 

and for the vector bill – anger we get:

     liam     katie     wyatt    steffy     quinn      said 
0.5550065 0.4845969 0.4829327 0.4645065 0.4491479 0.4201712 

The following figure shows some other B&B characters and their closest matches.

Figure: some other B&B characters and their closest matches

If you want to see the top n closest characters for other B&B characters, use my little Shiny app. The R code for scraping B&B recaps, calculating GloVe word embeddings and a small Shiny app can be found on my GitHub.

Conclusion

This is a Mickey Mouse use case, but it might be handy if you are on the train and hear people next to you talking about the B&B: you can join their conversation, especially if you have had a look at my B&B Shiny app……

Cheers, Longhow

New chapters for 50 shades of grey….


Introduction

Some time ago I had the honor of attending an interesting talk by Tijmen Blankevoort on neural networks and deep learning. Convolutional and recurrent neural networks were topics that had already caught my interest, and this talk inspired me to dive deeper into these topics and do some more experiments with them.

In the same session, organized by Martin de Lusenet for Ziggo (a Dutch cable company), I also had the honor of giving a talk; my presentation contained a text mining experiment that I did earlier on the Dutch TV soap GTST, “Goede Tijden Slechte Tijden”. A nice idea from Tijmen was: why not use deep learning to generate new episode plots for GTST?

So I did that; see my LinkedIn post on GTST. However, these episodes are in Dutch and I guess only interesting for people here in the Netherlands. So to make things more international and spicier, I generated some new texts based on deep learning and the erotic romance novel 50 shades of grey 🙂

More than plain vanilla networks

In R or SAS you have been able to train plain vanilla neural networks for a long time: the so-called fully connected networks, where all input nodes are connected to all nodes in the following hidden layer, and all nodes in a hidden layer are connected to all nodes in the following hidden layer or the output layer.

Figure: a plain vanilla (fully connected) neural network

In more recent years, deep learning frameworks have become very popular, for example Caffe, Torch, CNTK, TensorFlow and MXNet. The additional value of these frameworks compared to, for example, SAS is:

  • They support more network types than plain vanilla networks. For example, convolutional networks, where not all input nodes are connected to a next layer. And recurrent networks, where loops are present. A nice introduction to these networks can be found here and here.
  • They support computations on GPUs, which can speed things up dramatically.
  • They are open-source and free. No need for long sales and implementation cycles 🙂 Just download it and use it!

Figure: a recurrent neural network

My 50 Shades of Grey experiment

For my experiment I used the text of the erotic romance novel 50 shades of grey. A PDF can be found here; I used xpdfbin to extract all the words into a plain text file. I trained a Long Short-Term Memory network (LSTM, a special type of recurrent network) with MXNet. The reason to use MXNet is that it has a nice R interface, so I can just stay in my comfortable RStudio environment.

Moreover, the R example script of MXNet is ready to run; I just changed the input data and used more rounds of training and more hidden layers. The script and the data can be found on GitHub.

The LSTM model is fit at character level. The complete romance novel contains 817,204 characters, and every character is mapped to a number (91 unique numbers). The first few mappings are shown in the following figure.

Figure: the first few character-to-number mappings
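Purely as an illustration of this preprocessing step, the character-to-number mapping could be sketched in R as follows; the file name is an assumption, and the actual preprocessing is done in the MXNet example script on GitHub.

# Sketch of mapping the novel to a sequence of integers, one per character.
# The file name is a placeholder; the real preprocessing is in the MXNet example script.
raw_text <- readChar("fifty_shades.txt", nchars = file.info("fifty_shades.txt")$size)
chars    <- strsplit(raw_text, split = "")[[1]]     # all 817,204 individual characters
vocab    <- sort(unique(chars))                     # the ~91 unique characters
char2id  <- setNames(seq_along(vocab), vocab)       # character -> integer lookup table
ids      <- unname(char2id[chars])                  # the novel as a sequence of integers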

Once the model has been trained it can generate new text, character by character!

arsess whatever
yuu’re still expeliar a sally. Reftion while break in a limot.”
“Yes, ald what’s at my artmer and brow maned, but I’m so then for a
dinches suppretion. If you think vining. “Anastasia, and depregineon posing rave.
He’d deharing minuld, him drits.

“Miss Steele
“Fasting at liptfel, Miss I’ve dacind her leaches reme,” he knimes.
“I want to blight on to the wriptions of my great. I find sU she asks the stroke, to read with what’s old both – in our fills into his ear, surge • whirl happy, this is subconisue. Mrs. I can say about the battractive see. I slues
is her ever returns. “Anab.

It’s too even ullnes. “By heaven. Grey
about his voice. “Rest of the meriction.”
He scrompts to the possible. I shuke my too sucking four finishessaures. I need to fush quint the only more eat at me.
“Oh my. Kate. He’s follower socks?
“Lice in Quietly. In so morcieut wait to obsed teach beside my tired steately liked trying that.”
Kate for new of its street of confcinged. I haven’t Can regree.
“Where.” I fluscs up hwindwer-and I have

I’ll staring for conisure, pain!”
I know he’s just doesn’t walk to my backeting on Kate has hotelby of confidered Christaal side, supproately. Elliot, but it’s the ESca, that feel posing, it make my just drinking my eyes bigror on my head. S I’ll tratter topality butterch,” I mud
a nevignes, bleamn. “It’s not by there soup. He’s washing, and I arms and have. I wave to make my eyes. It’s forgately? Dash I’d desire to come your drink my heathman legt
you hay D1 Eyep, Christian Gry, husder with a truite sippking, I coold behind, it didn’t want to mive not to my stop?”
“Yes.”

“Sire, stcaring it was do and he licks his viice ever.”
I murmurs, most stare thut’s the then staraline for neced outsive. She
so know what differ at,” he murmurs?
“I shake my headanold.” Jeez.
“Are you?” Eviulder keep “Oh,_ I frosing gylaced in – angred. I am most drink to start and try aparts through. I really thrial you, dly woff you stund, there, I care an right dains to rainer.” He likes his eye finally finally my eyes to over opper heaven, places my trars his Necked her jups.
“Do you think your or Christian find at me, is so with that stand at my mouth sait the laxes any litee, this is a memory rude. It
flush,” He says usteer?” “Are so that front up.
I preparraps. I don’t scomine Kneat for from Christian.
“Christian,’! he leads the acnook. I can’t see. I breathing Kate’ve bill more over keen by. He releases?”
“I’m kisses take other in to peekies my tipgents my

Conclusion

The generated text does not make any sense, nor will it win any literature prize soon 🙂 Keep in mind that the model is based on ‘only’ 817,204 characters (which is considered a small number), and I did not bother to fine-tune the model at all. But it is still funny and remarkable to see that when you use it to generate text, character by character, it can produce a lot of correct English words and even some correct basic grammar patterns!

cheers, Longhow.

 

Bon Appetit: A restaurant recommender for tourists visiting the Netherlands

Introduction

More and more tourists are visiting The Netherlands; this becomes very clear if you walk through the center of Amsterdam on a sunny day. All those tourists need to eat somewhere, in some restaurant, and you can see their sad faces as they have no clue where to go. Well, with the aid of a little data science I have made it easy for them :-): a small R Shiny app that tells tourists which restaurant in The Netherlands they should go to. In this blog post I will describe the different steps that I have taken.

Figure: tourists in Amsterdam wondering where to eat……

Iens reviews

In an earlier blog post I wrote about scraping restaurant review data from www.iens.nl and how to use that to generate restaurant recommendations. The technique was based on the restaurant ratings given by the reviewers. To generate personal recommendations you need to rate some restaurants first. But as a tourist visiting The Netherlands for the first time this might be difficult.

So I have made it a little bit easier: enter your idea of food in my Bon Appetit Shiny app, and it will translate the text to Dutch if needed, calculate the similarity between your translated text and all reviews from Iens, and then give you the top ten restaurants whose reviews match best.

The Microsoft translator API

Almost all of the reviews on the Iens restaurant website are in Dutch, and I assume that most tourists from outside The Netherlands do not speak Dutch. That is not a large problem: I can translate non-Dutch text to Dutch by using a translator. Google and Microsoft offer translation APIs. I have chosen the Microsoft API because it offers a free tier: the first 2 million characters per month are free. Sign up and get started here. And because the API supports the Klingon language….. 🙂

The R franc package can recognize the language of the input text:


library(franc)

lang = franc(InputText)                             # detect the language (ISO 639-3 code)
ISO2 = speakers$iso6391[speakers$language == lang]  # look up the 2-letter ISO 639-1 code
from = ISO2

The ISO 2 letter language code is needed in the call to the Microsoft translator API. I am making use of the httr package to set up the call. With your clientID and client secret a token must be retrieved. Then with this token the actual translation is done.


library(httr)
library(rvest)   # read_html()/html_text() are used further below to parse the response

# Set up the call to retrieve an access token
clientIDEncoded = URLencode("your microsoft client ID")
client_SecretEncoded = URLencode("your client secret")
Uri = "https://datamarket.accesscontrol.windows.net/v2/OAuth2-13"

MyBody = paste(
   "grant_type=client_credentials&client_id=",
   clientIDEncoded,
   "&client_secret=",
   client_SecretEncoded,
   "&scope=http://api.microsofttranslator.com",
   sep=""
)

r = POST(url = Uri, body = MyBody, content_type("application/x-www-form-urlencoded"))
response = content(r)

Now that you have the token, make a call to translate the text:


HeaderValue = paste("Bearer ", response$access_token, sep = "")

TextEncoded = URLencode(InputText)

to = "nl"

uri2 = paste(
   "http://api.microsofttranslator.com/v2/Http.svc/Translate?text=",
   TextEncoded,
   "&from=",
   from,
   "&to=",
   to,
   sep=""
)

resp2 = GET(url = uri2, add_headers(Authorization = HeaderValue))
Translated = content(resp2)

# dig out the translated text from the XML response
TranslatedText = as(Translated, "character") %>% read_html() %>% html_text()

Some example translations:

Louis van Gaal is notorious for his Dutch to English (or any other language for that matter) translations. Let’s see how the Microsoft API performs on some of his sentences.

  • Dutch: “Dat is hele andere koek”, van Gaal: “That is different cook”, Microsoft: “That is a whole different kettle of fish”.
  • Dutch: “de dood of de gladiolen”, van Gaal: “the dead or the gladiolus”, Microsoft: “the dead or the gladiolus”. 
  • Dutch: “Het is een kwestie van tijd”, van Gaal: “It’s a question of time”, Microsoft: “It’s a matter of time”.

The Cosine similarity

The distance or similarity between two documents (texts) can be measured by means of the cosine similarity. When you have a collection of reviews (texts), this collection can be represented by a term document matrix. A row of this matrix is one review: a vector of word counts. Another review or text is also a vector of word counts; given two vectors A and B, the cosine similarity is given by:

cosine similarity = (A · B) / (||A|| ||B||)

Now the input text that is translated to Dutch is also a vector of word counts, so we can calculate the cosine similarity between each restaurant review and the input text. The restaurants corresponding to the most similar reviews are returned as recommended restaurants, bon appetit 🙂
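As a hedged illustration (not the app's actual code), the cosine similarities between the translated input text and the reviews could be computed in R roughly like this; the two reviews below are made up.

# Illustrative sketch: cosine similarity between a translated input text and reviews
# via a term document matrix. The example reviews are made up.
library(text2vec)

reviews <- c("heerlijke grote pizza met kip en kaas",
             "lang wachten maar vriendelijke bediening")    # made-up example reviews
input   <- "grote pizza met kip en kaas die lekker is"      # the translated input text

it    <- itoken(c(reviews, input), tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- create_vocabulary(it)
dtm   <- create_dtm(it, vocab_vectorizer(vocab))            # rows = texts, columns = terms

# cosine similarity of the input text (last row) with every review
sims <- sim2(dtm[seq_along(reviews), , drop = FALSE],
             dtm[nrow(dtm), , drop = FALSE], method = "cosine")
reviews[order(sims[, 1], decreasing = TRUE)]                # best matching reviews first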

Putting all together in a Shiny app

The above steps are implemented in my Bon Appetit Shiny app. Try out your thoughts and ideas of food and get restaurant recommendations! Here is an example:

Input text: Large pizza with chicken and cheese that is tasty.

Figure: the input text translated to Dutch

Figure: the top ten restaurants corresponding to the translated input text

 

And for the German tourist: “Ich suche eines schnelles leckeres Hahnchen” (“I am looking for a quick tasty chicken”); this gets translated to Dutch as “ik ben op zoek naar een snelle heerlijke kip”, and the ten restaurant recommendations you get are given in the following figure.

Figure: the ten recommendations for the German input text

cheers,

— Longhow —

Soap analytics: Text mining “Goede tijden slechte tijden” plot summaries….

Sorry for the local nature of this blog post. I was watching Dutch television and zapping between channels the other day, and I stumbled upon “Goede Tijden Slechte Tijden” (GTST). This is a Dutch soap series broadcast by RTL Nederland. I must confess, I was watching (had to watch) this years ago because my wife was watching it…… My gut feeling with these daily soap series is that missing a few months or even years does not matter: once you’ve seen some GTST episodes you’ve seen them all, because the story line is always very similar. Can we use some data science to test whether this gut feeling makes sense? I am using R and SAS to investigate this and to see if more interesting soap analytics can be derived.

Scraping episode plot summaries

First, data is needed. All GTST episode plot summaries are available, from the very first episode in October 1990 onwards.

A great R package to scrape data from websites is rvest by Hadley Wickham; I can advise anyone to learn it. Luckily, the website structure of the GTST plot summaries is not that difficult, and with the following R code I was able to extract all episodes. First I wrote a function that extracts the plot summaries of one specific month.

library(rvest)
library(stringr)

getGTSTdata = function(httploc){
  tryCatch({
    gtst = html(httploc) %>%
     html_nodes(".mainarticle_body") %>%
     html_children()

    # The dates are in bold, the plot summaries are normal text
    texts = names(gtst[[1]]) == "text"
    datumsel = names(gtst[[1]]) == "b"

    episodeplot = gtst[[1]][texts] %>%
     sapply(html_text) %>%
     str_replace_all(pattern='\n'," ")

    episodedate = gtst[[1]][datumsel] %>%
     sapply(html_text)

    # put data in a data frame and return as results
    return(data.frame(episodeplot, episodedate))
  },
    error = function(cond) {
      message(paste("URL does not seem to exist:", httploc))
      message("Here's the original error message:")
      message(cond)
      # Choose a return value in case of error
      return(NULL)
    }
  )
}

This function is then used inside a loop over all the months to get all the episode summaries. Some months do not have episodes, and the corresponding link does not exist (the actors are on summer holiday!), so I used tryCatch inside my function: if a link does not exist the function returns NULL and the loop simply continues.

months = c("januari", "februari","maart","april","mei","juni","juli","augustus","september","oktober","november","december")
years = 1990:2012

GTST_Allplots = data.frame()

for (j in years){
  for(m in months){
    httploc = paste("http://www.rtl.nl/soaps/gtst/meerdijk/tvgids/", m, j, ".xml", sep="")
    out = getGTSTdata(httploc)

    if (!is.null(out)){
      out$episodedate = paste(out$episodedate, j)   # append the year to the episode date
      GTST_Allplots = rbind(GTST_Allplots,out)
    }
  }
}

As a result of the scraping, I got 4090 episode plot summaries. The real GTST fans know that there are more episodes; the more recent episode summaries (from 2013 onwards) are on a different site. For this analysis I did not bother to use them.

Text mining the plot summaries

The episode summaries are now imported into SAS Enterprise Guide; the following figure shows the first few rows of the data.

Figure: the first few rows of the episode data

With SAS Text Miner it is very easy and fast to analyze unstructured data; see my earlier blog post. What are the most important topics that can be found in all the episodes? Let's generate them. I can use either the Text Topic node or the Text Cluster node in SAS. The difference is that with the Topic node an episode can belong to more than one topic, while with the Cluster node an episode belongs to only one cluster.
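The post generates these topics with the SAS Text Miner Text Topic node. Purely as a hypothetical sketch of a comparable analysis in open source, one could fit an LDA topic model on the scraped summaries with text2vec; the object GTST_Allplots and the parameter values below are assumptions.

# Hypothetical sketch (not the SAS workflow of the post): LDA topics on the summaries.
# Assumes the 'GTST_Allplots' data frame scraped above.
library(text2vec)

it    <- itoken(tolower(as.character(GTST_Allplots$episodeplot)),
                tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 10)
dtm   <- create_dtm(it, vocab_vectorizer(vocab))

lda        <- LDA$new(n_topics = 30)                  # number of topics is illustrative
doc_topics <- lda$fit_transform(dtm, n_iter = 500)    # topic distribution per episode
lda$get_top_words(n = 10)                             # key terms that describe each topic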

I have selected 30 topics to be generated; you can experiment with different numbers of topics. The resulting topics can be found in the following figure.

Figure: GTST topics

Looking at the different clusters found, it turns out that many topics are described by the names of the characters in the soap series. For example, topic number 2 is described by terms like “Anton”, “Bianca”, “Lucas”, “Maxime” and “Sjoerd”; they often occur together, in 418 episodes. And topic number 25 involves the terms “Ludo”, “Isabelle”, “Martine” and “Janine”. Here is a picture collage to show this in a more visually appealing way. I have only attached faces for six clusters; the other clusters are still the colored squares. The clusters that you see in the graph are positioned based on the underlying distances between the clusters.

Figure: collage of GTST character clusters

Zoom in on a topic

Now let's say I am interested in the characters of topic 25 (the Ludo and Isabelle story line). Can I discover sub-topics in this story line? I apply a filter on topic 25, so that only the episodes belonging to the Ludo and Isabelle story line are selected, and I generate a new set of topics (call them sub-topics to distinguish them from the original topics).

Figure: Text Miner flow to investigate a specific topic

What are some of the subtopics of the Ludo Isabelle story line?

  • Getting money back
  • George, Annie Big panic, very worried
  • Writing farewell letter
  • Plan of Jack brings people in danger

As a data scientist I should now reach out or go back to the business and ask for validation. So I am reaching out now: Are there any real hardcore GTST fans out there that can explain:

  • why certain characters are grouped together?
  • why certain groups of characters are closer to each other than other groups?
  • or recognize the subtopics in the Ludo and Isabelle story line?

Text profiling

To come back to my gut feeling: can I see whether the different story lines remain similar? I can use the Text Profile node in SAS Text Miner to investigate this. The Text Profile node enables you to profile a target variable using terms found in the documents. For each level of a target variable, the node outputs a list of terms from the collection that characterize or describe that level. The approach uses a hierarchical Bayesian belief model to predict which terms are the most likely to describe the level. In order to avoid merely selecting the most common terms, prior probabilities are used to down-weight terms that are common in more than one level of the target variable.

The target variable can also be a date variable. In my scraped episode data I have the date of each episode, and there is a nice feature that allows the user to set the aggregation level. Let's use years as the aggregation level.

Figure: the Text Profile node in SAS

The output consists of different tables and plots; two interesting plots are the Term Time Series plot and the Target Similarity plot. The first one shows, for a selected year, the most important terms and how these terms evolve over the other years. Suppose we select 1991; then we get the following graph.

Figure: terms over the years

Myriam was an important term (character) in 1991, but we see that her role stopped in 1995. Jef was a slightly less important term, but his character continued for a very long time in the series. The similarity plot shows the similarities between the sets of episodes of the different years. The distances are calculated on the term beliefs of the chosen terms.

Figure: GTST similarities by year

We can see very strong similarities between the years 2000 and 2001; two other years that are very similar are 1996 and 1997. Very isolated years are 2009 and 2012; maybe the writers tried a new story line that was unsuccessful and canceled it. Now let's focus on one season of GTST, season 22 from September 2011 to May 2012, and look at the similarities between months. They are shown in the following figure.

Figure: season 22 of GTST, monthly similarities

It seems that the months September ’11, November ’11 and May ’12 are very similar but the rest of the months are quite separate.

Conclusion

The combination of R/rvest and SAS Text Miner is an ideal tool combination to quickly scrape (text) data from sites and easily get first insights into those texts. Applying this to Goede Tijden Slechte Tijden (GTST) episode summaries rejects my gut feeling that once you’ve seen some GTST episodes you’ve seen them all. It turns out that story lines between years, and story lines within one year, can be quite dissimilar!

Without watching thousands of episodes I can quickly get an idea of which characters belong together, what the main topics of a season are, the rise and fall of characters in the series, and how topics change over time. But keep in mind: if you are on a train and hear two nice ladies talking about GTST, will this analysis be enough to have meaningful conversations with them? …… I don't think so!

Thanks @ErwinHuizenga and @josvandongen for reviewing.

Restaurant analytics: Text mining, Path analysis, Sankey, Sunbursts and Chord plots

Last month I visited my favorite Chinese restaurant in the center of Amsterdam, Golden Chopsticks, with some friends and had some nice food. I was wondering if other people shared the same opinion, so I visited iens.nl, a Dutch restaurant review site, and looked up the Golden Chopsticks restaurant. It has fairly good reviews, so that was good. For our next restaurant dinner my friends wanted to try something else; we just had Chinese, so do we want to go to a Chinese restaurant again, or something else?

Let's use some analytics to help me decide. With R and the rvest package it is not difficult to retrieve data from the Iens restaurant site. Some knowledge of CSS, XPath and regular expressions is needed, but then you can scrape away…. Inside a for loop over i, I have code fragments like the ones given below.


start   = "http://www.iens.nl/restaurant/?q=&amp;amp;perPagina=20&amp;amp;pagina="
httploc = paste(start,i,sep="")

restaurants     = html(httploc)
recensieinfo    = html_text(html_nodes(restaurants,xpath='//div[@class="listerItem_score"]'))
restaurantLabel = html_nodes(restaurants,".listerItem")
restName        = html_text(html_nodes(reslabel2,".fontSerif"))
restAddress     = html_nodes(restarantLabel ,"div address")

Nreviews        = str_extract(str_extract(recensieinfo,pattern="[0-9]* recensies"), pattern = "[0-9]*")
averageScore    = str_extract(recensieinfo,pattern="[0-9].[0-9]")

# more statements

I have retrieved information on around 11,000 restaurants; for these restaurants there are around 215,000 reviews (taking only the reviews that also have numeric scores for food, service and scenery, and ignoring older reviews). The data I scraped are in two tables: one table at restaurant level and one at review level.

Figure: the two scraped tables (restaurant level and review level)

First, some interesting facts on the scraped data.

Chinese restaurant names

It is true that certain names of Chinese restaurants occur more often. There are around 1600 Chinese restaurants; the most frequent names are: Kota Radja (39), Peking (36), Lotus (33), De Chinese Muur (32), De Lange Muur (25), China Garden (25), Hong Kong (22), Ni Hao (16). My father once had a Chinese restaurant called Hong Kong! See the dashboard below. This kind of recurring name is quite specific to Chinese restaurants; if we look at other kitchen types we find more unique restaurant names. You'll find the names per kitchen in a little Shiny app here.
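As a hypothetical illustration of how such counts can be obtained (the table and column names below are assumptions, not the actual scraped data set):

# Hypothetical sketch: most frequent restaurant names among the Chinese restaurants.
# 'restdata', 'kitchenType' and 'restName' are assumed names, not the real columns.
library(dplyr)

restdata %>%
  filter(kitchenType == "Chinees") %>%
  count(restName, sort = TRUE) %>%
  head(10)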

Kitchen/restaurant type and number of reviews

Of the 11,000 restaurants, around 8900 have reviews, with the restaurants El Olivo, Rhodos and Floor17 having the most reviews. There are around 215,000 reviews written by 103,800 reviewers, with MartinK having written the most reviews.

The five types of restaurants that occur most often in all the reviews are “International”, “Hollands”, French, Italian and Chinese, as shown in the tree map below. The color of the tree map represents the average number of reviews per restaurant type. So although Chinese restaurants form a large fraction (8.8%) of all restaurants, there are on average only 8.3 reviews per Chinese restaurant. On the other hand, Italian restaurants form 8.4% of all restaurants and have more than 18 reviews per Italian restaurant. Conclusion: people eat at Chinese restaurants, but they just don't write about it…..

Figure: tree map of restaurant types
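The tree map in the post is made with SAS Visual Analytics. As a rough, hypothetical R alternative, a similar chart could be sketched with the treemap package; the numbers below are only the approximate figures quoted above, not the actual scraped values.

# Hypothetical R alternative to the SAS tree map; the figures are illustrative.
library(treemap)

kitchen <- data.frame(
  type        = c("Internationaal", "Hollands", "Frans", "Italiaans", "Chinees"),
  share_pct   = c(12, 10, 9, 8.4, 8.8),    # share of all restaurants (illustrative)
  avg_reviews = c(15, 12, 20, 18, 8.3)     # average number of reviews per restaurant
)

treemap(kitchen,
        index  = "type",         # one rectangle per kitchen type
        vSize  = "share_pct",    # rectangle area: share of restaurants
        vColor = "avg_reviews",  # rectangle color: average reviews per restaurant
        type   = "value",
        title  = "Restaurant types: share and average number of reviews")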

Review dates

The following figure shows a screenshot of a SAS Visual Analytics dashboard that I made from the Iens data. It shows a couple of things:

  • Uniqueness of restaurant names per kitchen type
  • The number of reviews per day; we can see a Valentine's Day peak on 14-Feb-2015.
  • The average score per day; it looks like the scores in the month of July are a little bit lower.
  • Saturdays and Fridays are more crowded for restaurants, everybody knows that, and the Iens review data confirms this.

Figure: SAS Visual Analytics dashboard of the Iens review data

Review topics

If you look at the Iens site, you'll see that almost all of the reviews are written in Dutch, which you would expect. But if you first run a language detection on the reviews, it's funny to see that there are some reviews in other languages. English is the second language (300 or so reviews), and there are a few Italian reviews. It turns out that the language detection (R textcat package) on the “Italian” reviews is fooled by some particular Italian words like buonissimo, cappuccino, caprese and pizza.

What are the topics that we can extract from the 215,000 reviews? In an earlier blog post I wrote about text mining; here I have used SAS Text Miner to generate topics from the Iens reviews. So what are people writing about? The picture below shows five topics that occur a lot in all the reviews. A topic in SAS Text Miner is characterized by key terms (in Dutch), which are given below.

Figure: five frequent review topics and their key terms

So it’s always the same: complaining about the long wait for the food…. but on the other hand a lot of reviews on how great and fantastic the evening was…

Next restaurant type to visit

To come back to my first question: what type of restaurant should I visit next? I can perform a path analysis to answer that question. Scraping the restaurant site not only resulted in the texts of the reviews, but also in the reviewer, the date and the restaurant (type). So in the data below, for example, Paul's path consists of a first visit to Sisters, then to Fussia and then to Milo.

Figure: example of the reviewer path data

With SAS software I can either use Visual Analytics to generate a Sankey diagram that shows the most occurring paths, or, alternatively, perform a path analysis on this data in Enterprise Miner and then visualize the most occurring paths in a sunburst or chord diagram. See the pictures below.

Figure: path analysis and Sankey diagram

Figure: sunburst and chord diagram
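The diagrams above are built with SAS Visual Analytics and Enterprise Miner. Purely as a hypothetical alternative sketch, the transition counts and a Sankey diagram could also be produced in R with dplyr and networkD3; the tiny example data and the column names below are made up.

# Hypothetical R alternative (not the SAS workflow used in the post): derive reviewer
# transitions between restaurant types and plot them as a Sankey diagram.
library(dplyr)
library(networkD3)

reviews <- data.frame(                     # made-up example data
  reviewer = c("Paul", "Paul", "Paul", "Kim", "Kim"),
  date     = as.Date(c("2015-01-03", "2015-02-10", "2015-03-05", "2015-01-20", "2015-02-14")),
  type     = c("Chinees", "Internationaal", "Italiaans", "Chinees", "Internationaal")
)

transitions <- reviews %>%
  arrange(reviewer, date) %>%
  group_by(reviewer) %>%
  mutate(next_type = lead(type)) %>%       # the restaurant type visited next
  ungroup() %>%
  filter(!is.na(next_type)) %>%
  count(type, next_type, name = "n")

nodes <- data.frame(name = unique(c(transitions$type, transitions$next_type)))
links <- transitions %>%
  mutate(source = match(type, nodes$name) - 1,        # networkD3 uses 0-based indices
         target = match(next_type, nodes$name) - 1) %>%
  select(source, target, value = n) %>%
  as.data.frame()

sankeyNetwork(Links = links, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name")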

Interactive versions of the Iens sunburst and chord diagrams can be found here and here. Conclusion: it turns out that after a visit to a Chinese restaurant, most reviewers will visit an “International” restaurant type…. In a next blog post I will go one step further and show how to use techniques from recommendation engines to make recommendations at restaurant level.

Cheers,

Longhow

Barack Obama: Is he speeching like Frank Sinatra or Elvis Presley?

In my previous post I wrote about the basics of text mining. How can you apply this? Well, an urgent question you may have is this: is Barack Obama speeching like Frank Sinatra or Elvis Presley?


Text mining can answer that question! So, how did I apply text mining? I scraped some lyrics from LyricWikia: 604 lyrics from Elvis and 672 lyrics from Frank. I imported these lyrics into a very simple data set with two columns and 1276 rows. The first column contains the lyrics; the second column is a binary target variable with levels Elvis and Frank. In SAS Enterprise Miner you can then easily create a classifier. In this case a neural network with one layer of 50 neurons works reasonably well: on a 30% holdout set we get a Gini coefficient of 0.78.

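The classifier in the post is built with SAS Enterprise Miner. Purely as a hedged illustration of the same idea, a comparable workflow in R could look roughly like this; a regularized logistic regression stands in for the neural network, and the objects lyrics, artist and speeches are assumptions.

# Hypothetical R sketch of the same idea: learn "Elvis vs. Frank" from lyrics and
# score Obama's speeches. glmnet replaces the SAS neural network; 'lyrics' (character
# vector), 'artist' (factor with levels Elvis/Frank) and 'speeches' are assumptions.
library(text2vec)
library(glmnet)

it_lyrics  <- itoken(tolower(lyrics), tokenizer = word_tokenizer, progressbar = FALSE)
vocab      <- prune_vocabulary(create_vocabulary(it_lyrics), term_count_min = 5)
vectorizer <- vocab_vectorizer(vocab)
dtm_lyrics <- create_dtm(it_lyrics, vectorizer)

# binary classifier: probability that a lyric is by Frank Sinatra
fit <- cv.glmnet(x = dtm_lyrics, y = factor(artist), family = "binomial")

# score the speeches with the same vocabulary and look at the "Sinatra score"
it_speech     <- itoken(tolower(speeches), tokenizer = word_tokenizer, progressbar = FALSE)
dtm_speech    <- create_dtm(it_speech, vectorizer)
sinatra_score <- predict(fit, newx = dtm_speech, type = "response", s = "lambda.min")
hist(as.numeric(sinatra_score), main = "Sinatra score of Obama's speeches")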

Although the neural network classifier has the best predictive power, it is difficult to interpret. In SAS, we can build a classifier directly on the terms of the term-document matrix, instead of on the SVDs. This is the so-called Text Rule Builder; in this case it results in a less predictive classifier, but it shows some nice interpretable rules. The Elvis lyrics can be characterized by the words: dog, bloom, lord, yeah, lovin, rock, dark and pretty. The Sinatra lyrics are characterized by words like: smile, writer, winter, song and light.

The next step is to score Obama's speeches with the neural network classifier that was just built. I extracted 90 speeches from Obama, mostly from his period as senator, and gave each speech a “Sinatra” score (i.e. the probability that a particular speech is classified as Sinatra). A histogram of all 90 Sinatra scores is given below.

Figure: histogram of the Sinatra scores of Obama's speeches

The average score over all 90 speeches is 50.2%. So, to answer the main question: Obama can't make up his mind; half of the time he is talking like Sinatra, the other half he is talking like Presley!