A long time ago, in 1991, I took my first programming course (Modula 2) at the Vrije Universiteit in Amsterdam. I spent months behind a terminal with a green monochrome display, doing the programming exercises in VI. Do you remember Shift ZZ, and :q!… 🙂 After my university period I did not use VI often… Recently, I got hooked again!
I was excited to hear the announcement of SparkR, an R interface to Spark, and was eager to experiment with the software. Unfortunately, none of the Hadoop sandboxes have Spark 1.4 and SparkR pre-installed to play with, so I needed to take some steps myself. Luckily, all the steps are described in great detail on different sites.
Spin up a VPS
At argeweb I rented an Ubuntu VPS with 4 cores and 8 GB of memory. That is a very small environment for Hadoop, and of course a single-node environment does not show the full potential of Hadoop / Spark. However, I am not trying to do performance or stress tests on very large data sets, just some functional tests. Moreover, I don’t want to spend more money :-), though the VPS can nicely be used to install nginx and host my little website -> www.longhowlam.nl
Install R, RStudio and Shiny
A very nice blog post by Dean Attali, which I am not going to repeat here, describes how easy it is to set up R, RStudio and Shiny. I followed steps 6, 7 and 8 of his blog post, and the result is a running Shiny server on my VPS environment. In addition to the local RStudio application on my laptop, I can now also use R on my iPhone through RStudio Server on my VPS. That can be quite handy in a crowded bar when I need to run some R commands….
Install Hadoop 2.7 and Spark
To run Hadoop, you first need to install Java, then configure SSH, fetch the Hadoop tar.gz file, install it, set environment variables in the ~/.bashrc file, modify the Hadoop configuration files, format the Hadoop file system and start it. All steps are described in full detail here. After that, download the latest version of Spark; the package pre-built for Hadoop 2.6 and later worked fine for me. You can just extract the tgz file, set the SPARK_HOME variable and you are done!
Each of the steps above required editing different configuration files. Luckily, I can still remember my basic VI skills…
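To confirm from within R that the Spark installation is where you expect it, a small check like the one below can help. This is only a sketch: the helper name check_spark_home is hypothetical, and it assumes the SparkR package sits under R/lib inside the Spark directory, as it does in the Spark 1.4 download.

```r
# Hypothetical helper: report whether a Spark installation looks usable from R.
# Assumes the SparkR package lives in <SPARK_HOME>/R/lib/SparkR.
check_spark_home <- function(spark_home = Sys.getenv("SPARK_HOME")) {
  if (!nzchar(spark_home)) {
    return("SPARK_HOME is not set")
  }
  if (!dir.exists(spark_home)) {
    return(paste("directory does not exist:", spark_home))
  }
  sparkr_dir <- file.path(spark_home, "R", "lib", "SparkR")
  if (!dir.exists(sparkr_dir)) {
    return(paste("SparkR package not found in", sparkr_dir))
  }
  "ok"
}

# Check the value of SPARK_HOME that this R session actually sees
print(check_spark_home())
```

Variables set in ~/.bashrc may not be visible to R sessions started by RStudio Server or Shiny Server, so a "SPARK_HOME is not set" result there does not necessarily mean the installation failed. In that case, passing the path explicitly to the SparkR functions is the simpler route.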
Creating a very simple Shiny App
The SparkR package is already available once Spark is installed; it lives inside the Spark directory. So, when attaching the SparkR library to your R session, specify its location using the lib.loc argument of the library function. Alternatively, add the location of the SparkR library to R's search paths for packages, using the .libPaths function. See some example code below.
    library(SparkR, lib.loc = "/usr/local/spark/R/lib")

    ## initialize the SparkR environment
    sc = sparkR.init(sparkHome = '/usr/local/spark')
    sqlContext = sparkRSQL.init(sc)

    ## convert the local R 'faithful' data frame to a Spark data frame
    df = createDataFrame(sqlContext, faithful)

    ## apply a filter waiting > 50 and show the first few records
    df1 = filter(df, df$waiting > 50)
    head(df1)

    ## aggregate and collect the results in a local R data frame
    df2 = summarize(groupBy(df1, df1$waiting), count = n(df1$waiting))
    df3.R = collect(df2)
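The .libPaths alternative mentioned above could look roughly like this. It is a sketch that assumes Spark is installed in /usr/local/spark (the path used in this post) unless SPARK_HOME says otherwise; the variable names are just for illustration.

```r
# Assumed install location; falls back to the path used in this post
spark_home <- Sys.getenv("SPARK_HOME", unset = "/usr/local/spark")
spark_lib <- file.path(spark_home, "R", "lib")

if (dir.exists(spark_lib)) {
  # Put the SparkR library directory in front of the existing search paths
  .libPaths(c(spark_lib, .libPaths()))
  # No lib.loc needed any more
  library(SparkR)
} else {
  message("SparkR library directory not found: ", spark_lib)
}
```

With the path registered this way, every later library(SparkR) call in the same session works without the lib.loc argument.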
Now create a new Shiny app, copy the R code above into the server.R file and, instead of the hard-coded value 50, make the threshold an input controlled by a slider. That’s basically it: my first Shiny app calling SparkR…
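For illustration, here is a minimal sketch of such an app, not the exact app I built. It is written as a single app.R file for brevity; the same server function would go into server.R. The input id minWaiting, the output id counts and the slider range are made up for this example, and the code assumes Spark 1.4 is installed in /usr/local/spark.

```r
library(shiny)
library(SparkR, lib.loc = "/usr/local/spark/R/lib")

# Initialize Spark once, when the app starts (SparkR 1.4 API)
sc <- sparkR.init(sparkHome = "/usr/local/spark")
sqlContext <- sparkRSQL.init(sc)

# Convert the local R 'faithful' data frame to a Spark data frame
df <- createDataFrame(sqlContext, faithful)

ui <- fluidPage(
  titlePanel("SparkR on the faithful data"),
  sidebarLayout(
    sidebarPanel(
      # The slider replaces the hard-coded value 50
      sliderInput("minWaiting", "Minimum waiting time",
                  min = 40, max = 100, value = 50)
    ),
    mainPanel(tableOutput("counts"))
  )
)

server <- function(input, output) {
  output$counts <- renderTable({
    # Filter and aggregate in Spark, then collect into a local data frame
    df1 <- filter(df, df$waiting > input$minWaiting)
    df2 <- summarize(groupBy(df1, df1$waiting), count = n(df1$waiting))
    res <- collect(df2)
    # Sort locally so the table reads top to bottom
    res[order(res$waiting), ]
  })
}

shinyApp(ui = ui, server = server)
```

Every time the slider moves, the filter and aggregation run again in Spark and only the small aggregated result is pulled back into R.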