Sorry for the local nature of this blog post. I was watching Dutch television and zapping between channels the other day and I stumbled upon “Goede Tijden Slechte Tijden” (GTST). This is a Dutch soap series broadcast by RTL Nederland. I must confess, I was watching (had to watch) this years ago because my wife was watching it…… My gut feeling with these daily soap series is that missing a few months or even years does not matter. Once you’ve seen some GTST episodes you’ve seen them all, the story line is always very similar. Can we use some data science to test if this gut feeling makes sense? I am using R and SAS to investigate this and see if more interesting soap analytics can be derived.

Scraping episode plot summaries

First, data is needed. All GTST episode plot summaries are available, from the very first episode in October 1990.

A great R package for scraping data from websites is rvest by Hadley Wickham; I can advise anyone to learn it. Luckily, the website structure of the GTST plot summaries is not that difficult. With the following R code I was able to extract all episodes. First I wrote a function that extracts the plot summaries of one specific month.

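library(rvest)    # html(), html_nodes(), html_children(), html_text()
library(stringr)  # str_replace_all()

getGTSTdata = function(httploc){
  tryCatch({
    # html() comes from early rvest versions; it was later deprecated in favour of read_html()
    gtst = html(httploc) %>%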
     html_nodes(".mainarticle_body") %>%
     html_children()

    # The dates are in bold, the plot summaries are normal text
    texts = names(gtst[[1]]) == "text"
    datumsel = names(gtst[[1]]) == "b"

    episodeplot = gtst[[1]][texts] %>%
     sapply(html_text) %>%
     str_replace_all(pattern='\n'," ")

    episodedate = gtst[[1]][datumsel] %>%
     sapply(html_text)

    # put data in a data frame and return as results
    return(data.frame(episodeplot, episodedate))
  },
    error = function(cond) {
      message(paste("URL does not seem to exist:", httploc))
      message("Here's the original error message:")
      message(cond)
      # Choose a return value in case of error
      return(NULL)
    }
  )
}
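
The idea behind the function (bold text holds the dates, plain text holds the summaries) can also be sketched in Python with only the standard library. This is a minimal illustration and not a port of the R code: the class name EpisodeParser and the sample HTML fragment are made up, and the sketch assumes the fragment passed in is already the body of the article.

```python
from html.parser import HTMLParser


class EpisodeParser(HTMLParser):
    """Collect text inside <b> tags as dates and all other text as plots."""

    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.dates = []
        self.plots = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        # Collapse newlines and repeated spaces, like str_replace_all in the R code.
        text = " ".join(data.split())
        if not text:
            return
        if self.in_bold:
            self.dates.append(text)
        else:
            self.plots.append(text)


# Hypothetical fragment that mimics the assumed page layout.
sample = "<div><b>1 oktober</b>Plot one.\n<br><b>2 oktober</b>Plot two.</div>"

parser = EpisodeParser()
parser.feed(sample)
parser.close()
print(list(zip(parser.dates, parser.plots)))
```

As in the R version, the dates and plots only line up if every bold date is followed by exactly one block of plain text.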

This function is then used inside a loop over all months to get all the episode summaries. Some months do not have episodes, and the corresponding link does not exist (actors are on summer holiday!). That is why the function wraps its work in tryCatch: for a missing link it prints a message and returns NULL, so the loop simply continues with the next month.

months = c("januari", "februari","maart","april","mei","juni","juli","augustus","september","oktober","november","december")
years = 1990:2012

GTST_Allplots = data.frame()

for (j in years){
  for(m in months){
    httploc = paste("http://www.rtl.nl/soaps/gtst/meerdijk/tvgids/", m, j, ".xml", sep="")
    out = getGTSTdata(httploc)

    if (!is.null(out)){
      # append the year to the scraped date; the column is called episodedate
      out$episodedate = paste(out$episodedate, j)
      GTST_Allplots = rbind(GTST_Allplots,out)
    }
  }
}
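
The same "try every month, skip the ones that fail" pattern looks like this in Python. It is a sketch under assumptions: the helper names month_urls, fetch_page and collect_pages are invented for the example, and the URL pattern is copied from the R loop (it may well no longer resolve). The fetch function is passed in as a parameter so the loop can be tried out without touching the network.

```python
from urllib.error import URLError
from urllib.request import urlopen

MONTHS = ["januari", "februari", "maart", "april", "mei", "juni",
          "juli", "augustus", "september", "oktober", "november", "december"]
BASE_URL = "http://www.rtl.nl/soaps/gtst/meerdijk/tvgids/"


def month_urls(first_year, last_year):
    """Yield (year, month, url) for every month in the inclusive year range."""
    for year in range(first_year, last_year + 1):
        for month in MONTHS:
            yield year, month, f"{BASE_URL}{month}{year}.xml"


def fetch_page(url):
    """Download one page as bytes; raises URLError (an OSError) on failure."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def collect_pages(first_year, last_year, fetch=fetch_page):
    """Fetch every monthly page, skipping months whose page cannot be fetched."""
    pages = []
    for year, month, url in month_urls(first_year, last_year):
        try:
            pages.append((year, month, fetch(url)))
        except OSError as err:
            # Same role as tryCatch in the R code: report and move on.
            print(f"URL does not seem to exist: {url} ({err})")
    return pages


def fake_fetch(url):
    """Stand-in for fetch_page: pretends the July page is missing."""
    if "juli" in url:
        raise URLError("not found")
    return b"<html></html>"


pages = collect_pages(1990, 1990, fetch=fake_fetch)
print(len(pages), "of 12 months fetched")
```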

As a result of the scraping, I got 4090 episode plot summaries. The real GTST fans know that there are more episodes; the more recent episode summaries (from 2013) are on a different site. For this analysis I did not bother to use them.

Text mining the plot summaries

The episode summaries are now imported in SAS Enterprise Guide, the following figure shows the first few rows of the data.

With SAS Text Miner it is very easy and fast to analyze unstructured data, see my earlier blog post. What are the most important topics that can be found in all the episodes? Let’s generate them. I can use either the text topic node or the text cluster node in SAS. The difference is that with the topic node an episode can belong to more than one topic, while with the cluster node an episode belongs to exactly one cluster.
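
The difference between the two kinds of assignment can be shown with a toy example in Python. This only illustrates the "several topics versus exactly one cluster" distinction; it is not how SAS computes topics or clusters, and the weights, the threshold and the function names are all made up.

```python
def topics_for(weights, threshold):
    """Soft assignment: every topic whose weight reaches the threshold."""
    return [topic for topic, weight in weights.items() if weight >= threshold]


def cluster_for(weights):
    """Hard assignment: the single group with the largest weight."""
    return max(weights, key=weights.get)


# Hypothetical weights of one episode summary on three topics.
episode_weights = {"topic 2": 0.45, "topic 25": 0.40, "topic 7": 0.05}

print(topics_for(episode_weights, threshold=0.3))  # two topics qualify
print(cluster_for(episode_weights))                # exactly one winner
```

An episode in which both the Anton and the Ludo story lines appear would show up under both topics, but in only one cluster.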

I have selected 30 topics to be generated, you can experiment with different numbers of topics. The resulting topics can be found in the following figure.

GTST topics

Looking at the topics found, it turns out that many of them are described by the names of the characters in the soap series. For example, topic number 2 is described by terms like “Anton”, “Bianca”, “Lucas”, “Maxime” and “Sjoerd”, which often occur together, in 418 episodes. And topic number 25 involves the terms “Ludo”, “Isabelle”, “Martine” and “Janine”. Here is a picture collage to show this in a more visually appealing way. I have only attached faces for six clusters; the other clusters are still the colored squares. The positions of the clusters in the graph are based on the underlying distances between them.

Zoom in on a topic

Now let’s say I am interested in the characters of topic 25 (the Ludo Isabelle story line). Can I discover subtopics in this story line? I apply a filter on topic 25 so that only the episodes that belong to the Ludo Isabelle story line are selected, and I generate a new set of topics (call them subtopics to distinguish them from the original topics).

Text miner flow to investigate a specific topic

What are some of the subtopics of the Ludo Isabelle story line?

Getting money back
George, Annie Big panic, very worried
Writing farewell letter
Plan of Jack brings people in danger

As a data scientist I should now reach out or go back to the business and ask for validation. So I am reaching out now: Are there any real hardcore GTST fans out there that can explain:

why certain characters are grouped together?
why certain groups of characters are closer to each other than other groups?
recognize subtopics in Ludo Isabelle story lines?

Text profiling

To come back to my gut feeling, can I see if the different story lines remain similar? I can use the Text Profile node in SAS Text Miner to investigate this. The Text Profile node enables you to profile a target variable using terms found in the documents. For each level of a target variable, the node outputs a list of terms from the collection that characterize or describe that level. The approach uses a hierarchical Bayesian belief model to predict which terms are the most likely to describe the level. In order to avoid merely selecting the most common terms, prior probabilities are used to down-weight terms that are common in more than one level of the target variable.
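
To get a feeling for what "down-weighting terms that are common in more than one level" does, here is a small Python sketch. It deliberately swaps in a much simpler TF-IDF-style score (term share within a level times the log of how few levels use the term); it is not the hierarchical Bayesian belief model of the Text Profile node. The function name profile_terms and the two mini "summaries" per year are invented for the example.

```python
import math
from collections import Counter


def profile_terms(docs_by_level, top_n=3):
    """Rank terms per level, down-weighting terms that occur in many levels."""
    counts = {
        level: Counter(word for doc in docs for word in doc.lower().split())
        for level, docs in docs_by_level.items()
    }
    n_levels = len(counts)
    # In how many levels does each term occur at least once?
    levels_with_term = Counter(term for c in counts.values() for term in c)

    profiles = {}
    for level, c in counts.items():
        total = sum(c.values())
        scores = {
            term: (n / total) * math.log(n_levels / levels_with_term[term])
            for term, n in c.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        profiles[level] = ranked[:top_n]
    return profiles


# Hypothetical mini corpus: two made-up summaries per year.
summaries = {
    1991: ["myriam argues with jef", "myriam leaves town"],
    2005: ["ludo threatens jef", "ludo makes a plan"],
}

profiles = profile_terms(summaries)
print(profiles)
```

Because "jef" occurs in both years its score drops to zero, so it never makes it into either profile, while "myriam" and "ludo" each top their own year.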

The target variable can also be a date variable. In my scraped episode data I have the date of the episodes, there is a nice feature that allows the user to set the aggregation level. Let’s look at years as aggregation level.

Text profile node in SAS

The output consists of different tables and plots, two interesting plots are the Term Time series plot and the Target Similarity plot. The first one shows for a selected year the most important terms and how these terms evolve over the other years. Suppose we select 1991 then we get the following graph.

Terms over years

Myriam was an important term (character) in 1991, but we see that her role stopped in 1995. Jef was a slightly less important term, but his character continued for a very long time in the series. The similarity plot shows the similarities between the sets of episodes by the different years. The distances are calculated on the term beliefs of the chosen terms.

GTST similarities by year

We can see very strong similarities between years 2000 and 2001, two years that are also very similar are 1996 and 1997. Very isolated years are 2009 and 2012, maybe the writers tried a new story line that was unsuccessful and canceled it. Now let’s focus on one season of GTST, season 22 from September 2011 to May 2012, and look at similarities between months. They are shown in the following figure.

Season 22 of GTST: monthly similarities

It seems that the months September ’11, November ’11 and May ’12 are very similar but the rest of the months are quite separate.
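
How a similarity between two periods can be computed from their terms is easy to sketch in Python. The version below uses plain cosine similarity on term counts as a stand-in; SAS calculates its distances on the term beliefs from the profile model, which this sketch does not reproduce. The term counts per year are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two {term: weight} dictionaries."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical term counts per year.
year_terms = {
    2000: {"ludo": 5, "janine": 3},
    2001: {"ludo": 4, "janine": 4},
    2009: {"sjoerd": 6},
}

for other in (2001, 2009):
    sim = cosine_similarity(year_terms[2000], year_terms[other])
    print(f"2000 vs {other}: {sim:.2f}")
```

Years that share their main characters end up close to 1, and a year with a completely different cast ends up at 0, which is the kind of isolation visible for some years in the plot.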

Conclusion

The combination R / rvest and SAS Text miner is an ideal tool combination to quickly scrape (text) data from sites and easily get first insights into those texts. Applying this on Goede Tijden Slechte Tijden (GTST) episode summaries rejects my gut feeling that once you’ve seen some GTST episodes you’ve seen them all. It turns out that story lines between years, and story lines within one year can be quite dissimilar!

Without watching thousands of episodes I can quickly get an idea of which characters belong together, what the main topics are of a season, the rise and fall of characters in the series and how topics are changing over time. But keep in mind: If you are in a train and hear two nice ladies talking about GTST, will this analysis be enough to have meaningful conversations with them? …… I don’t think so!

Thanks @ErwinHuizenga and @josvandongen for reviewing.

It’s personal

In my previous blog post I performed a path analysis in SAS on restaurant reviewers. It turned out that after a visit to a Chinese restaurant, reviewers on Iens tend to go to an “International” restaurant. But which one should I visit? A recommendation engine can answer that question. Everyone who has visited an e-commerce website, for example Amazon, has experienced the results of a recommendation engine. Based on your click/purchase history, new products are recommended. I have a Netflix subscription; based on my viewing behavior I get recommendations for new movies, see my recommendations below.

Obviously these recommendations are based on the viewing behavior of my son and daughter, who spend too much time behind Netflix…. 🙂

Collaborative Filtering

How does it work? Let’s first look at the data that is needed. In the world of recommendation engines people often speak about users, items and the user-item rating matrix. In my scraped restaurant review data, this corresponds to reviewers, restaurants and their scores / ratings. See the figure below.

The question now is, how can we fill in the blanks? For example, in the data above Sarah likes Fussia and Jimmie’s Kitchen but she has not rated the other restaurants. Can we (the computer) do this for her? Yes, we can fill in the blanks with a predicted rating and recommend the restaurant with the highest rating to Sarah as the restaurant to visit next. A term you often hear in this context is collaborative filtering: a class of techniques based on the belief that a person gets the most relevant recommendations from people with ‘similar’ tastes. I am not going to write about the techniques here; a nice overview paper is: Collaborative Filtering Recommender Systems by Michael D. Ekstrand, John T. Riedl and Joseph A. Konstan. It can be found here.
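
To make "filling in the blanks" concrete, here is a tiny user-based example in Python. It is only a sketch of the collaborative filtering idea, with a deliberately simple similarity (1 / (1 + mean absolute rating difference) on restaurants both users rated) and a similarity-weighted average as prediction. Sarah, Fussia and Jimmie’s Kitchen come from the example above; the rating values and the reviewers Tom and Anna are made up.

```python
def similarity(a, b):
    """Similarity of two users based on the items both have rated."""
    common = a.keys() & b.keys()
    if not common:
        return 0.0
    mean_abs_diff = sum(abs(a[item] - b[item]) for item in common) / len(common)
    return 1.0 / (1.0 + mean_abs_diff)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[item]
        weight_total += weight
    if weight_total == 0:
        return None  # nobody comparable has rated this item
    return weighted_sum / weight_total


# Hypothetical ratings on a 1-10 scale.
ratings = {
    "Sarah": {"Fussia": 8, "Jimmie's Kitchen": 9},
    "Tom": {"Fussia": 8, "Jimmie's Kitchen": 9, "Golden Chopsticks": 7},
    "Anna": {"Fussia": 4, "Jimmie's Kitchen": 5, "Golden Chopsticks": 9},
}

prediction = predict(ratings, "Sarah", "Golden Chopsticks")
print(round(prediction, 2))
```

Tom rates exactly like Sarah and Anna does not, so the predicted rating lands much closer to Tom’s 7 than to Anna’s 9.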

Iens restaurant reviewers

The review data that I have scraped from the Iens website is of course much larger than the matrix shown above. There are 8,900 items (restaurants) and 100,889 users (reviewers), so a completely filled user-item matrix would hold 8,900 × 100,889 (= 897,912,100) ratings. That would require every reviewer to have rated every restaurant, which is obviously not the case. In practice the user-item matrix is very sparse: the Iens data consists of 211,143 ratings, which is only about 0.02% of the completely filled matrix.
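
The arithmetic is easy to check in a few lines of Python; the three counts are the ones quoted above, the variable names are mine.

```python
n_items = 8_900      # restaurants
n_users = 100_889    # reviewers
n_ratings = 211_143  # ratings actually present

cells = n_items * n_users           # size of the completely filled matrix
density = n_ratings / cells         # fraction of cells that hold a rating
ratings_per_user = n_ratings / n_users

print(f"{cells:,} cells")
print(f"{density:.2%} filled")
print(f"{ratings_per_user:.1f} ratings per reviewer on average")
```

So the average reviewer has rated only about two restaurants, which is why a recommendation engine has to lean on what other reviewers did.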

In SAS I can use the recommend procedure to create recommendation engines. The procedure supports different techniques:

Average
SlopeOne
KNN
Association Rules
SVD
Ensemble
Cluster

The rating data that is needed to run the procedure should be given in a different form than the user-item matrix. A SAS data set with three columns, user, item and rating is needed. A snippet of the data is shown below.
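
Going from a user-item matrix to this three-column form is a simple reshaping step. Here is a Python sketch of it; the function name to_long and the sample ratings are made up, and blanks in the matrix are represented by None.

```python
def to_long(matrix):
    """Flatten {user: {item: rating}} into (user, item, rating) rows, skipping blanks."""
    return [
        (user, item, rating)
        for user, row in matrix.items()
        for item, rating in row.items()
        if rating is not None
    ]


# Hypothetical user-item matrix; None marks a restaurant that was not rated.
matrix = {
    "Sarah": {"Fussia": 8, "Jimmie's Kitchen": 9, "Golden Chopsticks": None},
    "Tom": {"Fussia": None, "Jimmie's Kitchen": 6, "Golden Chopsticks": 7},
}

rows = to_long(matrix)
for row in rows:
    print(row)
```

The long form only stores the ratings that exist, which is what makes it practical for a matrix that is almost entirely empty.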

If I want the system to generate “personal” restaurant recommendations for me, I should also provide some personal ratings. Well, I liked Golden chopsticks (an 8 out of 10); a few months ago I was at Fussia, which was OK (a 7 out of 10); and for SAS I was at a client in Eindhoven, so I also ate at “Van der Valk Eindhoven”, which I did not really like (a 4 out of 10). So I have created a small data set with my ratings and added that to the Iens ratings.

After that I used the recommend procedure to try different techniques and choose the one with the smallest error on a hold-out set. The workflow is given in the following screenshot.
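
Choosing a technique by its error on a hold-out set can be sketched in Python as follows. The two "techniques" here are simple baselines that I wrote for the illustration (a global average and a per-restaurant average); they are not the implementations inside the recommend procedure, and the ratings are made up. The error measure is the root mean squared error (RMSE).

```python
import math
from collections import defaultdict


def fit_global_mean(train):
    """Predict the overall average rating for every user and item."""
    mean = sum(rating for _, _, rating in train) / len(train)
    return lambda user, item: mean


def fit_item_mean(train):
    """Predict the item's average rating; fall back to the global average."""
    by_item = defaultdict(list)
    for _, item, rating in train:
        by_item[item].append(rating)
    item_means = {item: sum(rs) / len(rs) for item, rs in by_item.items()}
    fallback = fit_global_mean(train)
    return lambda user, item: item_means.get(item, fallback(user, item))


def rmse(predict, holdout):
    """Root mean squared error of a predictor on held-out ratings."""
    squared = [(predict(user, item) - rating) ** 2 for user, item, rating in holdout]
    return math.sqrt(sum(squared) / len(squared))


def pick_best(fitters, train, holdout):
    """Fit every method on train and return the name with the smallest RMSE."""
    errors = {name: rmse(fit(train), holdout) for name, fit in fitters.items()}
    return min(errors, key=errors.get), errors


# Hypothetical (user, item, rating) rows.
train = [("u1", "A", 8), ("u2", "A", 8), ("u1", "B", 4), ("u3", "B", 4)]
holdout = [("u3", "A", 8), ("u2", "B", 4)]

best, errors = pick_best(
    {"global mean": fit_global_mean, "item mean": fit_item_mean}, train, holdout
)
print(best, errors)
```

The hold-out ratings are never shown to the methods while fitting, so the error says something about ratings the engine has not seen yet.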

To zoom in on the recommend procedure, it starts with the specification of the rating data set, and the specification of the user, item and rating columns. Then a method and its corresponding options need to be set. The following figure shows an example call.

My personal recommendations

After the procedure has finished, a recommendation engine is available, in the above code example an engine with two methods (SVD and ARM) is available and recommendations can be generated for each user. The code below shows how to do this.

And the top five restaurants I should visit are (with their predicted rating)……

‘T Stiefkwartierke (9.61)
Brazz (9.19)
Bandoeng (9.05)
De Burgermeester (9.00)
Argentinos (9.00)

The first restaurant, ‘T Stiefkwartierke, is in Breda, in the south of the Netherlands. I am going to visit it when I am in the neighborhood….
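
Once every restaurant has a predicted rating, producing a top five is just sorting and leaving out what I have already rated. Here is that last step as a Python sketch. It is not the SAS call used above; the function name top_n is mine, the five highest predictions are the ones listed above, and the two lower predictions are made up to have something to filter out.

```python
def top_n(predicted, already_rated, n=5):
    """Return the n best (restaurant, predicted rating) pairs not yet rated."""
    candidates = [
        (name, rating)
        for name, rating in predicted.items()
        if name not in already_rated
    ]
    # sorted() is stable, so ties keep the order in which they were supplied.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]


predicted = {
    "Fussia": 7.40,            # hypothetical value, already rated by me
    "Brazz": 9.19,
    "'T Stiefkwartierke": 9.61,
    "De Burgermeester": 9.00,
    "Some other place": 6.20,  # hypothetical restaurant and value
    "Bandoeng": 9.05,
    "Argentinos": 9.00,
}
already_rated = {"Golden chopsticks", "Fussia", "Van der Valk Eindhoven"}

for name, rating in top_n(predicted, already_rated):
    print(f"{name} ({rating:.2f})")
```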

Longhow Lam's Blog

Data Scientist, Machine learning, R, SAS, Python – Amsterdam (NL)

Soap analytics: Text mining “Goede tijden slechte tijden” plot summaries….