Text mining basics: The IMDB reviews

Within Big Data analytics, the analysis of unstructured data has received a lot of attention recently. Text mining can be seen as a collection of techniques to analyze unstructured data (e.g. documents, e-mails, tweets, product reviews or complaints). There are two main applications of text mining:

  1. Categorization. Text mining can categorize (large amounts of) documents into different topics. I think we have all created dozens of subfolders in Outlook to organize our e-mails, and manually moved e-mails to these folders. There is no need to do that anymore: let text mining create and name the folders and automatically categorize e-mails into them.
  2. Prediction. With text mining you can predict whether a document belongs to a certain category. For example, of all the e-mails that a company receives, which ones belong to the category “negative sentiment”, and which ones come from people who are likely to terminate their subscription?

How does text mining work? Let’s analyze some IMDB reviews: 50,000 reviews that are already classified as having positive or negative sentiment. The data can be found here. There are a few steps to undertake, all quite easily performed in SAS Text Miner.

1. Parse / Filter

First I imported the reviews into a SAS data set. One column contains all the reviews, each review in a separate record. A second column holds the binary target: each review is classified as having POSITIVE or NEGATIVE sentiment. The reviews need to be parsed and filtered.

[Figure: parsing and filtering nodes]

The parsing and filtering nodes parse the collection of reviews to quantify information about the terms in all the reviews. The parsing can be adjusted to include stemming (treat house and houses as one term), synonym lists (treat movie and film as one term) and stop lists (ignore the, in and with). The result is a so-called term document matrix, which records how often each term occurs in each document.

[Figure: term document matrix]
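To make the parsing step concrete, here is a minimal Python sketch of building a term document matrix. It is not what SAS Text Miner does internally: the three reviews, the tiny stop list, the one-entry synonym list and the crude `stem` helper are all invented for illustration.

```python
from collections import Counter
import re

# Toy reviews, invented for illustration (not taken from the IMDB data).
reviews = [
    "The movie was boring and long",
    "A fantastic film with a fantastic plot",
    "Boring plot in a long movie",
]

STOP_WORDS = {"the", "a", "and", "in", "with", "was"}  # tiny stop list
SYNONYMS = {"film": "movie"}                            # tiny synonym list


def stem(token):
    """Very crude stemmer: strip a trailing 's' (houses -> house)."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def parse(review):
    """Lower-case, tokenize, drop stop words, stem, then map synonyms."""
    tokens = re.findall(r"[a-z]+", review.lower())
    stems = [stem(t) for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(s, s) for s in stems]


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(parse(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


terms, tdm = term_document_matrix(reviews)
for term, row in zip(terms, tdm):
    print(f"{term:10s} {row}")
```

Each row is a term and each column a review, so "film" in the second review is counted under "movie" because of the synonym list.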

2. Apply Singular Value Decomposition

A problem that arises when there are many reviews is that the term document matrix can become very large. Other problems that may arise in term document matrices are sparseness (many zeros in the matrix) and term dependency (the terms boring and long may often occur together in reviews and so are strongly correlated). To resolve these problems, a singular value decomposition (SVD) is applied to the term document matrix. In an earlier blog post I described the SVD. The term document matrix A is factorized into

A = U \Sigma V^T

Instead of using all the singular values, we now only use the largest k singular values. In the term document matrix each review R is represented by an m-dimensional vector of terms. Using the SVD, and writing U_k for the first k columns of U, this vector can be projected onto a lower-dimensional subspace with

\hat{R} = U^T_k R
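The projection can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the small matrix `A`, the term labels in the comments and the choice k = 2 are invented, and `project` is a hypothetical helper name.

```python
import numpy as np

# Toy term document matrix A (terms x reviews); the counts are invented.
A = np.array([
    [1, 0, 1],   # boring
    [0, 2, 0],   # fantastic
    [1, 0, 1],   # long
    [1, 1, 1],   # movie
    [0, 1, 1],   # plot
], dtype=float)

# Thin SVD: A = U @ diag(singular_values) @ Vt
U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                # keep only the k largest singular values
U_k = U[:, :k]       # first k left singular vectors (m x k)


def project(review_vectors):
    """Project m-dimensional term vectors onto the k-dimensional subspace."""
    return U_k.T @ review_vectors


R_hat = project(A)   # shape (k, number of reviews), one column per review
print(R_hat)

# A review that was not part of A can be projected in the same way.
new_review = np.array([1, 0, 0, 1, 0], dtype=float)   # "boring movie"
print(project(new_review))
```

Because the columns of U are orthonormal, projecting the columns of A itself gives the first k rows of \Sigma V^T, which is a handy sanity check on the result.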

So our three reviews will look like

[Figure: the reviews in the SVD space]

3. Categorization or Prediction

Now that each review is projected onto a lower-dimensional subspace, we can apply data mining techniques. For categorization we can apply clustering (for example k-means or hierarchical clustering). Each review will be assigned to a cluster. For clustering we do not need the Sentiment column. The next figure shows an example of a hierarchical clustering in SAS Enterprise Miner.

[Figure: hierarchical clustering of the reviews]

The clustering resulted in 13 clusters, the 13 leaf nodes. Each cluster is described by descriptive terms. For example, one cluster contains reviews of people talking about the fantastic script and plot, and another cluster talks about the bad acting.
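As a rough illustration of clustering in the reduced space, here is a small k-means sketch in NumPy. Note that it uses k-means rather than the hierarchical clustering shown in the figure, and that the six two-dimensional points are invented stand-ins for projected reviews.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Invented stand-ins for reviews projected onto a 2-dimensional SVD space.
projected = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

labels, centers = kmeans(projected, k=2)
print(labels)
```

No sentiment labels are used here: the grouping comes only from the distances between the projected vectors.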

To predict the sentiment we do need the Sentiment column. In Enterprise Miner I set that column as the target and projected the reviews onto a 300-dimensional subspace, which gives 300 inputs: SVD1, SVD2, …, SVD300. I tried several machine learning methods: random forests, gradient boosting and neural networks.

[Figure: training the text mining models]

It turns out that a neural network with one hidden layer of 50 neurons works quite well: an area under the ROC curve of 0.945 on a holdout set.
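For readers who want to see what such a model looks like outside Enterprise Miner, here is a minimal NumPy sketch of a network with one hidden layer of 50 neurons, evaluated with the area under the ROC curve on a holdout set. Everything about the data is an assumption: the inputs are random numbers standing in for SVD features, the target is synthetic, and the learning rate and number of iterations are arbitrary choices. It will not reproduce the 0.945 reported above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the SVD inputs: 400 "reviews", 5 features, and a
# binary sentiment that depends on the first two features (invented data).
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
X_train, y_train = X[:300], y[:300]
X_hold, y_hold = X[300:], y[300:]

n_hidden = 50
W1 = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=n_hidden)
b2 = 0.0


def predict(X):
    """Forward pass: tanh hidden layer, sigmoid output probability."""
    hidden = np.tanh(X @ W1 + b1)
    return hidden, 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))


def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


# Full-batch gradient descent on the log loss.
learning_rate = 0.1
for _ in range(1000):
    hidden, p = predict(X_train)
    grad_out = (p - y_train) / len(y_train)            # d(loss)/d(logit)
    grad_hidden = np.outer(grad_out, W2) * (1.0 - hidden ** 2)
    W2 -= learning_rate * (hidden.T @ grad_out)
    b2 -= learning_rate * grad_out.sum()
    W1 -= learning_rate * (X_train.T @ grad_hidden)
    b1 -= learning_rate * grad_hidden.sum(axis=0)

_, holdout_scores = predict(X_hold)
auc = roc_auc(y_hold, holdout_scores)
print(f"holdout AUC: {auc:.3f}")
```

The `roc_auc` helper uses the ranking interpretation of the measure: the probability that a randomly chosen positive review gets a higher score than a randomly chosen negative one.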
