Some (spatial) analytics on primary school exams (Cito toets) in The Netherlands

Predicting the Cito Score

Recently my six-year-old son had his first Cito test. During his career at primary school he will take many more such tests, ending with the final Cito exam when he is around twelve. Primary schools in the Netherlands do not have to participate in the Cito exams: of the roughly 6800 primary schools, around 5300 participate. For each of those schools the average score of all pupils on the final Cito exam is available. Let's just call this the Cito score.

As a data scientist, I would like to know whether there are factors that can be used to predict the Cito score and, more importantly, whether we can get that data. A short search on the internet turned up a few interesting data sets that might be useful to predict the Cito score:

  • School details, like the type of school (public, Christian, Jewish, Muslim, Montessori, etc.) and the location
  • Financial results, like the total assets, liabilities, working capital etc.
  • Staff & pupils, like the number of FTEs, number of pupils, part-time personnel and management FTEs
  • Data on the education level of the parents of the pupils

I also have houses-for-sale data from Funda, which can tell me something about the prices of the houses close to the school. There is also open data on the demographics of a neighborhood, like the number of people, number of immigrants, number of children, number of males, etc.

Let's have a look at the Cito score first: how is it distributed, and what are low and high values? The figure below shows the distribution. The average of the Cito scores is 534.7, the median is 535, the 10th percentile is 529.3 and the 90th percentile is 539.7.

[Figure: distribution of the Cito scores (CitoScoreVerdeling)]

Now let's predict the Cito score. In SAS Enterprise Miner I fitted a simple linear regression (r^2 = 0.35), a random forest (r^2 = 0.39) and a neural network (r^2 = 0.38), so the random forest has the best predictive power, though only by a small margin. Although a random forest is difficult to interpret, it does identify the factors that are most important for predicting the Cito score. The three most important factors are:

  • Percentage of pupils of a school whose parents both have an education
  • Percentage of pupils of a school whose parents both are uneducated
  • Average sales price of the 9 houses closest to the school
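The r^2 values used above to compare the three models measure the share of the variance in the Cito score that a model explains. As a minimal sketch of that calculation in Python (the scores and predictions below are made up for illustration and are not the real school data, and `r_squared` is a hypothetical helper, not the SAS Enterprise Miner routine):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: what the model fails to explain.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: spread of the scores around their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical school-level Cito scores and model predictions.
actual = [530.0, 533.0, 535.0, 537.0, 540.0]
predicted = [532.0, 534.0, 535.0, 536.0, 538.0]

print(round(r_squared(actual, predicted), 3))
```

A model that always predicts the overall mean gets r^2 = 0, so the values of 0.35 to 0.39 reported above mean that a little over a third of the variation between schools is explained.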

The three factors are plotted against the Cito score in the figures below.

[Figures: Gewicht000, Gewicht030 and houseprice, each plotted against the Cito score]

From the house price plot above we can see that schools in an area where house prices are around 300K have a Cito score about 1 point higher than schools in an area where house prices are around 200K. So in that range, every 100K increase in house prices goes together with roughly a 1 point increase in the Cito score. Conclusion: with open data we can predict the Cito score with moderate accuracy (r^2 of about 0.39 for the random forest). See my Shiny app for the random forest Cito score prediction for each individual school.

Spatial Autocorrelation

Spatial autocorrelation occurs when points close to each other have similar (correlated) values. So do schools that are close to each other have similar Cito scores? We can analyse this by means of a variogram or a correlogram, which describe how the data are related (correlated) as a function of the distance h.

The variogram is calculated by

\gamma(h) = \frac{1}{2 |N(h)|} \sum_{(i,j) \in N(h)} (z_i - z_j)^2

where z_i and z_j are the values (Cito scores) at locations i and j, and N(h) is the set of all pairs of locations (i, j) that are separated by distance h. So \gamma(h) is half the average squared difference between values (Cito scores) separated by distance h. It is often easier to interpret a correlation (coefficient) than squared differences. For example, a correlation coefficient larger than 0.8 means strongly correlated values, while a correlation lower than 0.1 means weakly correlated values. A correlogram \rho(h) can be calculated from a variogram:

\rho(h) = 1 - \frac{\gamma(h)}{C(0)}

where C(0) is the variance in the data. In SAS we can estimate the empirical variogram and correlogram from the Cito data with proc variogram. An example call is given in the figure below.

[Figure: example proc variogram call]
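For readers without SAS, the two formulas above can also be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the school locations and scores are synthetic (a smooth spatial trend plus noise, not the real Cito data), distances are grouped into bins of a chosen width, and the function names `empirical_variogram` and `correlogram` are hypothetical helpers, not a port of proc variogram.

```python
import math
import random
import statistics


def empirical_variogram(points, values, bin_width, n_bins):
    """Classical estimator: gamma(h) = sum (z_i - z_j)^2 / (2 |N(h)|) per distance bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            h = math.dist(points[i], points[j])
            k = int(h // bin_width)  # index of the distance bin for this pair
            if k < n_bins:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    centres = [(k + 0.5) * bin_width for k in range(n_bins)]
    # Bins without any pairs get NaN rather than a misleading zero.
    gamma = [s / (2 * c) if c else float("nan") for s, c in zip(sums, counts)]
    return centres, gamma, counts


def correlogram(gamma, variance):
    """rho(h) = 1 - gamma(h) / C(0), with C(0) the variance of the data."""
    return [1.0 - g / variance for g in gamma]


# Synthetic example: 300 "schools" on a 50 km x 50 km area (assumption).
random.seed(1)
points = [(random.uniform(0, 50), random.uniform(0, 50)) for _ in range(300)]
values = [
    535 + 3 * math.sin(x / 8) + 3 * math.cos(y / 8) + random.gauss(0, 1)
    for x, y in points
]

centres, gamma, counts = empirical_variogram(points, values, bin_width=2.0, n_bins=10)
rho = correlogram(gamma, statistics.pvariance(values))

for h, r, c in zip(centres, rho, counts):
    print(f"h = {h:4.1f} km  rho = {r:5.2f}  pairs = {c}")
```

Because the synthetic scores vary smoothly over space, the correlation is highest in the first distance bin and drops as h grows, which is the same kind of pattern the correlogram is used to detect in the real data.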

The resulting correlogram is given in the figure below.

[Figure: correlogram of the Cito scores]

Conclusion: schools that are close to each other (within 1.5 km) have strongly correlated Cito scores. So if your son’s school has a bad Cito score, you can take him to another school, but be sure the other school is at least 7 km away from the current school.
