Barack Obama: Is he speaking like Frank Sinatra or Elvis Presley?

In my previous post I wrote about the basics of text mining. How can you apply this? Well, an urgent question you may have is this: is Barack Obama speaking like Frank Sinatra or Elvis Presley?

[Figure: Presley or Sinatra]

Text mining can answer that question! So, how did I apply text mining? I scraped some lyrics from LyricWikia: 604 lyrics from Elvis and 672 lyrics from Frank. I imported these lyrics into a very simple data set with two columns and 1276 rows. The first column contains the lyrics; the second column is a binary target variable with the levels Elvis and Frank. In SAS Enterprise Miner you can then easily create a classifier. In this case a neural network with one layer of 50 neurons works reasonably well: on a 30% holdout set we get a Gini coefficient of 0.78.
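For readers without SAS Enterprise Miner, a minimal sketch of a comparable pipeline in Python with scikit-learn could look like the following. The file lyrics.csv and its columns lyrics and artist are hypothetical placeholders for the scraped data; this is not the exact Enterprise Miner flow.

    # Sketch of an Elvis-vs-Sinatra lyrics classifier (hypothetical lyrics.csv
    # with columns "lyrics" and "artist"); not the exact SAS flow.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import roc_auc_score

    df = pd.read_csv("lyrics.csv")                    # 1276 rows: lyrics, artist
    X_train, X_test, y_train, y_test = train_test_split(
        df["lyrics"], df["artist"], test_size=0.30, stratify=df["artist"],
        random_state=1)

    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),        # term weights
        TruncatedSVD(n_components=100),               # compress the term space
        MLPClassifier(hidden_layer_sizes=(50,), max_iter=500))  # 1 layer, 50 neurons
    model.fit(X_train, y_train)

    # Gini = 2 * AUC - 1 on the 30% holdout set
    auc = roc_auc_score(y_test == "Frank", model.predict_proba(X_test)[:, 1])
    print("Gini:", 2 * auc - 1)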

[Figure: SAS Enterprise Miner process flow]

Although the neural network classifier has the best predictive power, it is difficult to interpret. In SAS we can also build a classifier directly on the terms of the term-document matrix, instead of on the SVDs. This is the so-called Text Rule Builder; in this case it results in a less predictive classifier, but it produces some nicely interpretable rules. The Elvis lyrics are characterized by the words dog, bloom, lord, yeah, lovin, rock, dark and pretty, while the Sinatra lyrics are characterized by words like smile, writer, winter, song and light.
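The Text Rule Builder itself is SAS-specific; a rough open-source stand-in, sketched below under that assumption, is to fit a sparse (L1-penalized) linear model on raw term counts and read the most Elvis-like and Sinatra-like terms off the coefficients. It reuses the df from the previous sketch.

    # Rough stand-in for the Text Rule Builder: a sparse linear model on term
    # counts whose largest coefficients point to artist-specific words.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    vec = CountVectorizer(stop_words="english", min_df=5)
    X = vec.fit_transform(df["lyrics"])               # term-document counts (df as above)
    y = (df["artist"] == "Frank").values

    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
    terms = np.array(vec.get_feature_names_out())
    order = np.argsort(clf.coef_[0])
    print("Elvis-flavoured terms  :", terms[order[:8]])    # most negative weights
    print("Sinatra-flavoured terms:", terms[order[-8:]])   # most positive weights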

The next step is to score Obama's speeches with the neural network classifier that was just built. I extracted 90 speeches from Obama, mostly from his period as senator, and gave each speech a "Sinatra" score (i.e. the probability that a particular speech is classified as Sinatra). A histogram of all 90 Sinatra scores is given below.
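A hedged sketch of this scoring step, reusing the model from the first sketch; the file speeches.csv and its text column are hypothetical placeholders for the extracted speeches.

    # Score each Obama speech with the trained model and plot the histogram of
    # "Sinatra" probabilities (hypothetical speeches.csv with a "text" column).
    import pandas as pd
    import matplotlib.pyplot as plt

    speeches = pd.read_csv("speeches.csv")
    sinatra_score = model.predict_proba(speeches["text"])[:, 1]   # P(Sinatra)

    plt.hist(sinatra_score, bins=20)
    plt.xlabel("Sinatra score")
    plt.ylabel("Number of speeches")
    plt.show()
    print("Average Sinatra score:", sinatra_score.mean())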

[Figure: histogram of the Sinatra scores]

The average score over all 90 speeches is 50.2%. So, to answer the main question: Obama can't make up his mind; half of the time he is talking like Sinatra, the other half like Presley!


Text mining basics: The IMDB reviews

Within Big Data analytics, the analysis of unstructured data has received a lot of attention recently. Text mining can be seen as a collection of techniques to analyze unstructured data (i.e. documents, e-mails, tweets, product reviews or complaints). There are two main applications of text mining:

  1. Categorization. Text mining can categorize (large amounts of) documents into different topics. I think we all have created dozens of subfolders in Outlook to organize our e-mails, and manually moved e-mails to these folders. No need to do that anymore: let text mining create and name the folders and automatically categorize e-mails into those folders.
  2. Prediction. With text mining you can predict whether a document belongs to a certain category. For example, of all the e-mails that a company receives, which belong to the category "negative sentiment", or which come from people who are likely to terminate their subscription?

How does text mining work? Let's analyze some IMDB reviews: 50,000 reviews that are already classified into positive or negative sentiment; the data can be found here. There are a few steps to undertake, quite easily performed in SAS Text Miner.

1. Parse / Filter

First I imported the reviews into a SAS data set. One column contains all the reviews, each review in a separate record. There is also a second column, the binary target: each review is classified as POSITIVE or NEGATIVE sentiment. The reviews need to be parsed and filtered.

[Figure: parse and filter nodes]

The parsing and filtering nodes parse the collection of reviews to quantify information about the terms in all the reviews. The parsing can be adjusted to include stemming (treat house and houses as one term), synonym lists (treat movie and film as one term) and stop lists (ignore the, in and with). The result is a so-called term-document matrix, representing the frequency with which a term occurs in a document.
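A minimal illustration of such a matrix on three toy reviews, sketched in Python rather than SAS; stemming and synonym mapping would need an extra preprocessing step and are omitted here.

    # Minimal term-document matrix on three toy reviews; only lower-casing and
    # English stop-word removal are applied (documents in rows, terms in columns).
    from sklearn.feature_extraction.text import CountVectorizer

    reviews = ["The movie was boring and long",
               "A fantastic film with a great plot",
               "Long, boring and badly acted"]
    vec = CountVectorizer(stop_words="english")
    tdm = vec.fit_transform(reviews)                  # sparse count matrix

    print(vec.get_feature_names_out())
    print(tdm.toarray())                              # term frequencies per review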

[Figure: term-document matrix]

2. Apply Singular Value Decomposition

A problem that arises when there are many reviews is that the term-document matrix can become very large. Other problems that may arise in term-document matrices are sparseness (many zeros in the matrix) and term dependency (the terms boring and long may often occur together in reviews and are therefore strongly correlated). To resolve these problems a singular value decomposition (SVD) is applied to the term-document matrix. In an earlier blog post I described the SVD. The term-document matrix A is factorized into

A = U \Sigma V^T

Instead of using all the singular values we now only use the largest k singular values. In the term-document matrix each review R is represented by an m-dimensional vector of terms; using the SVD this can be projected onto a lower-dimensional subspace with

\hat{R} = U^T_k R
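A small numerical sketch of this projection, continuing the toy term-document matrix tdm from the sketch above (the choice k = 2 is just for illustration).

    # SVD of the (terms x documents) matrix and projection of one review onto
    # the first k left singular vectors.
    import numpy as np

    A = tdm.toarray().T                 # terms x documents, as in the text
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2
    R = A[:, 0]                         # term vector of the first review
    R_hat = U[:, :k].T @ R              # k-dimensional representation of the review
    print(R_hat)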

So our three reviews will look like this:

[Figure: the reviews in the SVD space]

3. Categorization or Prediction

Now that each review is projected onto a lower-dimensional subspace, we can apply data mining techniques. For categorization we can apply clustering (for example k-means or hierarchical clustering). Each review will be assigned to a cluster; for clustering we do not need the Sentiment column. The next figure shows an example of a hierarchical clustering in SAS Enterprise Miner.
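A sketch of the same idea outside SAS, with k-means standing in for the hierarchical clustering used in Enterprise Miner; imdb is a hypothetical DataFrame holding the 50,000 review texts in a review column.

    # Cluster the reviews in the reduced SVD space (k-means as a stand-in for
    # the hierarchical clustering in Enterprise Miner).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans

    svd_features = TruncatedSVD(n_components=50).fit_transform(
        TfidfVectorizer(stop_words="english").fit_transform(imdb["review"]))
    clusters = KMeans(n_clusters=13, n_init=10).fit_predict(svd_features)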

[Figure: hierarchical clustering of the reviews]

The clustering resulted in 13 clusters, the 13 leaf nodes. Each cluster is described by descriptive terms. For example, one cluster contains reviews of people talking about the fantastic script and plot, and another cluster talks about the bad acting.

To predict the sentiment we need the sentiment column. In Enterprise Miner I have set that column as a target and projected the reviews onto a 300-dimensional subspace, so 300 inputs: SVD1, SVD2, ..., SVD300. I have tried several machine learning methods: random forests, gradient boosting and neural networks.

[Figure: model training flow in Enterprise Miner]

It turns out that a neural network with one layer of 50 neurons works quite well: an area under the ROC curve of 0.945 on a holdout set.
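A hedged sketch of this sentiment model, again assuming a hypothetical imdb DataFrame with review and sentiment columns; the exact AUC will of course depend on the data and settings.

    # Sentiment model on 300 SVD components with one hidden layer of 50 neurons,
    # evaluated with area under the ROC curve on a 30% holdout set.
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import roc_auc_score

    X_train, X_test, y_train, y_test = train_test_split(
        imdb["review"], imdb["sentiment"] == "POSITIVE", test_size=0.30,
        random_state=1)

    sentiment_model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        TruncatedSVD(n_components=300),               # SVD1 ... SVD300
        MLPClassifier(hidden_layer_sizes=(50,), max_iter=500))
    sentiment_model.fit(X_train, y_train)

    print("AUC:", roc_auc_score(y_test, sentiment_model.predict_proba(X_test)[:, 1]))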

Big Data: Home values and Chinese restaurants

Hmm, did you ever wonder what factors might influence the value of your house? With big data analysis we can now do some interesting investigations...

From a Dutch housing site we extracted houses for sale, so that we got a sample of the following form:

[Figure: sample of houses for sale]

Then from a restaurant site we extracted a list of Chinese restaurant details in the form:

[Figure: sample of Chinese restaurant details]

In SAS Enterprise Miner we combined these two sets to determine for each house:

  1. the nearest Chinese restaurant
  2. the distance from a house to its nearest Chinese restaurant

[Figure: Enterprise Miner flow combining houses and restaurants]
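Outside Enterprise Miner, the same nearest-restaurant lookup can be sketched with the haversine formula; houses and restaurants below are hypothetical DataFrames with lat, lon (in degrees) and, for the restaurants, a name column.

    # For every house, find the nearest Chinese restaurant and the distance to it.
    import numpy as np

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance in kilometres between two points."""
        lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
        a = (np.sin((lat2 - lat1) / 2) ** 2 +
             np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
        return 6371.0 * 2 * np.arcsin(np.sqrt(a))

    # distance matrix: rows = houses, columns = restaurants
    d = haversine_km(houses["lat"].values[:, None], houses["lon"].values[:, None],
                     restaurants["lat"].values[None, :], restaurants["lon"].values[None, :])
    houses["nearest_restaurant"] = restaurants["name"].values[d.argmin(axis=1)]
    houses["distance_km"] = d.min(axis=1)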

Then we plotted the relation between house value and the distance to the nearest Chinese restaurant.

[Figure: house value versus distance to the nearest Chinese restaurant]

Conclusion

There appears to be a weak correlation: for every kilometer your home is further away from a Chinese restaurant, its value increases by 912 euros and 12 cents!
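The euros-per-kilometer figure is simply the slope of a linear fit of house value on distance; a minimal sketch, assuming the houses DataFrame from above also has a hypothetical value_eur column:

    # Slope of a simple linear fit: extra value per km from the nearest restaurant.
    import numpy as np

    slope, intercept = np.polyfit(houses["distance_km"], houses["value_eur"], 1)
    print("Extra value per km:", slope)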

SVD data compression, using my son as an experiment :-)

SVD stands for singular value decomposition; it is a matrix factorization method, and more details can be found on Wikipedia. Any matrix A can be decomposed into three matrices:

A = U \Sigma V^T

where \Sigma is a diagonal matrix with the r singular values. The decomposition will look like the following figure:

[Figure: the decomposition A = U \Sigma V^T]

Instead of using all r singular values, select only the largest k, and instead of multiplying U, \Sigma and V^T to get matrix A, now multiply U_k, \Sigma_k and V^T_k to get an approximation A_k of matrix A.

[Figure: the truncated approximation A_k]

So this approximation can be created using far less data than the original matrix. How good is the approximation? Well, you could look at the Frobenius norm, or any other matrix norm that measures the distance between the approximation and the original matrix, see http://en.wikipedia.org/wiki/Matrix_norm.
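A short numpy sketch of the whole idea, applied to a greyscale photo: truncate the SVD, measure the Frobenius-norm error, and compute how much data the approximation actually uses (photo.jpg is a placeholder for any image file).

    # Low-rank approximation of a greyscale photo with the top-k singular values.
    import numpy as np
    from PIL import Image

    A = np.asarray(Image.open("photo.jpg").convert("L"), dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 15
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # U_k Sigma_k V_k^T

    m, n = A.shape
    print("Frobenius error :", np.linalg.norm(A - A_k))
    print("Fraction of data:", k * (m + n + 1) / (m * n))   # ~1% for k = 15
    Image.fromarray(A_k.clip(0, 255).astype("uint8")).save("photo_k15.jpg")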

Instead, to get things more visual you can look at pictures. I am using an old picture of me and my son in the following “SVD experiment”.

Original picture: 2448 × 3264 pixels ~ a matrix with 8 million numbers

[Figure: original picture]

Now I am taking only the 15 largest singular values and reconstructing the picture with U_{15}\Sigma_{15}V^T_{15}. This uses only 1% of the data, and I get the following picture:

[Figure: reconstruction with 15 singular values]

Now I am taking only the 75 largest singular values and reconstructing the picture with U_{75}\Sigma_{75}V^T_{75}. This uses around 5% of the data, and I get:

[Figure: reconstruction with 75 singular values]

In a next post I’ll show how this can be used in text mining….

Cheers, Longhow.