Some simple facial analytics on actors (and my manager)

Some time ago I was at a party, inevitably, a question that came up was: “Longhow what kind of work are you doing?” I answered: I am a data scientist I have the most sexy job, do you want me to show you how to use deep learning for facial analytics…… Oops, it became very quiet. Warning: don’t google on the words sexy deep and facial at your work!

But anyway, I was triggered by the Face++ website and thought can we do some simple facial analytics? The Face++ site offers an API where you can upload images (or use links to images). For each image that you upload the Face++ engine can detect if there are one or more faces. If a face is detected, you can also let the engine return 83 facial landmarks. Example landmarks are nose_tip, left_eye_left corner, contour_chin, see the picture below. So an image is now reduced to 83 landmarks.

face_WP

What are some nice faces to analyse? Well, I just googled on “top 200 actors” and got a nice list of actor names, and with the permission of my manager Rein Mertens, I added his picture as well. So what steps did I do next:

Data preparation

1. I have downloaded import.io, this a very handy tool to turn web pages into data. In the tool enter the link to the list of 200 actors. It will then extract all the links to the 200 pictures (together with more data) and conveniently export it to a csv file.

import.io from web page to data

Click to enlarge

2. I wrote a small script in R to import the csv file from step 1, so for each of the 200 actors I have the link to his image. Now I can call the face++ api for every link and let Face++ return the 83 facial landmarks. The result is a data set with 200 rows and 166 columns (x and y position of a landmark) and a ID column containing the name of the actor.

3. I added the 83 land marks of my manager to the data set. So my set now looks like

Analytical base table of actor faces

Click to enlarge

There were some images where Face++ could not detect a face, I removed those images.

nofaces

faces not recognized by the Face++ engine

Now that I have the so-called facial analytical base table of actors I can apply some nice analytics

Deep learning, autoencoders

What are some strange faces? To answer that with an objective point of view, I used deep learning autoencoders. The idea of an autoencoder is to learn a compressed representation for a set of data typically for the purpose of dimensionality reduction (Wikipedia). The trick is to use a neural network where the output target is the same as the input variables. I have actor faces which are observations in a 166 dimensional space, using proc neural in SAS I have reduced the faces to points in a two dimensional space. I have used 5 hidden layers where the middle layer consist of two neurons, as shown in the sas code snippet below. More information on auto encoders in SAS can be found here.

procneural

The faces can now be plotted in a scatter plot, (as shown below) the two dimensions corresponding to the middle layer.

scatter plot of actor faces

click to enlarge

A more interactive version of this plot can be found here, it is a simple D3.js script where you can hover over the dots and click to see the underlying picture.

Hierarchical Cluster analysis

Can we group together faces that are similar (similar facial landmarks)? Yes, we can apply a cluster technique to do that. There are many different techniques, I have used an agglomerative hierarchical cluster technique because the algorithm that is used to cluster the faces can be nicely visualized in a so called dendrogram. The algorithm starts with each observation (face) as its own cluster, in each following iteration pairs of clusters are merged until all observations form one cluster. In SAS you can use proc cluster to perform hierarchical clustering, see the code and dendrogram below.

proccluster

dendrogram faces

Click to enlarge

Using D3.js I have also created a more interactive dendrogram.

— L. —

.

Advertisements

Some (spatial) analytics on primary school exams (Cito toets) in The Netherlands

Predicting the Cito Score

Recently my six year old son had his first Cito test, during his career at primary school he will have much more of such Cito tests. It will end when he is around twelve and will have his final Cito exam. A primary school in the Netherlands does not have to participate in the Cito exams, there are around 6800 primary schools in the Netherlands of which around 5300 participate in the Cito exam. For each of those schools the average Cito score (average of all pupils) of the final Cito exam is available. Lets just call this the Cito score.

As a data scientist, I like to know if there are factors that can be used to predict the Cito score, and more importantly can we get that data. A little search on internet resulted in a few interesting data sets that might be useful to predict the Cito score:

  • School details, like the type of school (public, christian, Jewish, Muslim Montessori, etc) and the location
  • Financial results, like the total assets, liabilities, working capital etc.
  • Staff & pupils, number of FTE’s, number of pupils, part-time personnel, management FTE
  • Data on the education level of the parents of the pupils

I also have the house for sale data from Funda, which can tell me something about the prices of the houses close to the school. There is also open data on the demographics of a neighborhood, like number of people, number of immigrants, number of children, number of males, etc.

Lets have a look at the Cito score first, how is it distributed, what are low and high values. The figure below shows the distribution, the average of the Cito scores is 534.7, the median is 535, the 10% percentile is 529.3 and the 90% percentile is 539.7.

CitoScoreVerdeling

Now lets predict the Cito score, in SAS Enterprise Miner I have fitted a simple linear regression (r^2 =0.35), a random forest (r^2 = 0.39) and a neural network (r^2 = 0.38), so the random forest has the best predictive power. Although it is difficult to interpreted a random forest, it does identify the factors that are important to predict the Cito score. The three most important factors are:

  • Percentage of pupils of a school whose parents both have an education
  • Percentage of pupils of a school whose parents both are uneducated
  • Average sales price of the 9 houses closest to the school

A plot of the three factors against the Cito score is given in the plots below, click on them to enlarge.

Gewicht000Gewicht030houseprice

From the graph above we can see that schools in an area where houses prices are 300K have a Cito score of around 1 point more than schools in an area where the house prices are 200K. So every 100K increase in house prices means a 1 point increase in the Cito score. Conclusion: With open data we can reasonably predict the Cito score with some accuracy. See my Shiny app for the random forest Cito score prediction for each individual school.

Spatial AutoCorrelation

Spatial autocorrelation occurs when points close to each others have similar (correlated) values. So do schools that are close to each other have similar Cito scores? We can analyse this by means of a variogram or correlogram. It provides a description of how data are related (correlated) with distance h.

The variogram is calculated by

\gamma(h) = \frac{1}{2 |N(h)|} \Sigma_{N(h)} (z_i - z_j)^2

where z_i and z_j are values (Cito scores) at locations i and j, and N(h) is the set of all pairwise distances i-j=h. So it calculates the average squared difference of values (Cito scores) separated by distance h. It is often easier to interpret a correlation (coefficient), instead of squared differences. For example, a correlation coefficient larger than 0.8 means strongly correlated values while a correlation lower than 0.1 means weakly correlated values. A correlogram \rho(h) can be calculated from a variogram:

\rho(h) = 1 - \frac{\gamma(h)}{C(0)}

where C(0) is the variance in the data. In SAS we can estimate the empirical variogram and correlogram from the Cito data with proc variogram. An example call is given in the figure below.

proc variogram

The resulting correlogram is given in the figure below.

correlogram

Conclusion: Schools that are close to each other (within 1.5 KM) have strongly correlated Cito scores. So if your son’s school has a bad Cito score, you can take him to another school, but be sure the other school is at least 7 KM away from the current school.