In my previous post I wrote about the basics of text mining. How can you apply this? Well, an urgent question you may have is this: does Barack Obama speak more like Frank Sinatra or like Elvis Presley?
Text mining can answer that question! So how did I apply it? I scraped lyrics from LyricWikia: 604 by Elvis and 672 by Frank. I imported these lyrics into a very simple data set with 1,276 rows and two columns. The first column contains the lyrics, the second column is a binary target variable with levels Elvis and Frank. In SAS Enterprise Miner you can then easily create a classifier. In this case a neural network with one layer of 50 neurons works reasonably well: on a 30% holdout set it reaches a Gini coefficient of 0.78.
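To give a feel for what a Gini coefficient on a holdout set measures, here is a minimal Python sketch. It assumes the common definition Gini = 2 × AUC − 1, where AUC is the area under the ROC curve; how SAS Enterprise Miner computes it internally is not shown here. The function names, labels and scores are made-up toy values, not the lyrics data.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini(labels, scores):
    """Gini coefficient, assuming the definition Gini = 2 * AUC - 1."""
    return 2.0 * auc(labels, scores) - 1.0


# Toy holdout set (hypothetical): 1 = Frank, 0 = Elvis,
# scores are predicted probabilities of "Frank".
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(f"Gini on toy data: {gini(labels, scores):.2f}")
```

A Gini of 0 means the scores rank the two classes no better than chance, and a Gini of 1 means every Frank lyric scores higher than every Elvis lyric.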
Although the neural network classifier has the best predictive power, it is difficult to interpret. In SAS, we can also build a classifier directly on the terms of the term-document matrix, instead of on the SVD dimensions. This is the so-called Text Rule Builder. In this case it results in a less predictive classifier, but it shows some nice interpretable rules. The Elvis lyrics are characterized by the words dog, bloom, lord, yeah, lovin, rock, dark and pretty, while the Sinatra lyrics are characterized by words like smile, writer, winter, song and light.
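The idea of finding words that characterize one class can be illustrated with a small Python sketch. This is not the algorithm behind the SAS Text Rule Builder; it is a much simpler stand-in that ranks terms by the share of documents containing the term that belong to the target class. The function names and the four "lyrics" are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase terms in a document (presence, not counts)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def characteristic_terms(docs, labels, target, min_docs=2):
    """Rank terms by precision for `target`: of the documents containing
    the term, the fraction that carry the target label."""
    in_target = Counter()
    overall = Counter()
    for text, label in zip(docs, labels):
        for term in tokenize(text):
            overall[term] += 1
            if label == target:
                in_target[term] += 1
    ranked = [
        (in_target[term] / overall[term], overall[term], term)
        for term in overall
        if overall[term] >= min_docs  # skip terms that are too rare to trust
    ]
    # Highest precision first, then most frequent, then alphabetical.
    ranked.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [(term, precision) for precision, _, term in ranked]


# Invented toy documents, not real lyrics.
docs = [
    "rock the dog yeah",
    "dog lovin rock",
    "smile in the winter light",
    "a winter song and a smile",
]
labels = ["Elvis", "Elvis", "Frank", "Frank"]

for term, precision in characteristic_terms(docs, labels, "Elvis")[:3]:
    print(term, precision)
```

Rules of this kind are easy to read because each one is just a word and a class, which is exactly what the neural network on SVD dimensions does not give you.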
The next step is to score Obama’s speeches with the neural network classifier that was just built. I extracted 90 of Obama’s speeches, mostly from his period as senator, and gave each speech a “Sinatra” score (i.e. the probability that a particular speech is classified as Sinatra). A histogram of all 90 Sinatra scores is given below.
The average score of all 90 speeches is 50.2%. So, to answer the main question: Obama can’t make up his mind. Half of the time he is talking like Sinatra, the other half he is talking like Presley!
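Summarizing a set of scores like this takes only a few lines of Python. The sketch below bins scores into a text histogram and computes their average and the share above 0.5. The function names are hypothetical and the scores are toy values, not the 90 real Sinatra scores.

```python
def histogram(scores, bins=10):
    """Count scores in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
        # A score of exactly 1.0 falls into the last bin.
        counts[min(int(s * bins), bins - 1)] += 1
    return counts


def mean(scores):
    """Arithmetic mean of the scores."""
    return sum(scores) / len(scores)


def share_above(scores, threshold=0.5):
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


# Hypothetical toy scores (probability of "Sinatra" per speech).
toy_scores = [0.05, 0.42, 0.48, 0.55, 0.51, 0.97, 1.0]

for i, count in enumerate(histogram(toy_scores)):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * count}")
print(f"mean score: {mean(toy_scores):.3f}")
print(f"share above 0.5: {share_above(toy_scores):.3f}")
```

The average score and the share of speeches above 0.5 are two different summaries, which is why it is worth looking at the histogram and not only at the mean.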