Combining Hadoop, Spark, R, SparkR and Shiny…. and it works :-)

A long time ago, in 1991, I had my first programming course (Modula-2) at the Vrije Universiteit in Amsterdam. I spent months behind a terminal with a green monochrome display doing the programming exercises in VI. Do you remember Shift ZZ, and :q!… 🙂 After my university period I did not use VI often… Recently, I got hooked again!

A system very similar to this one where I did my first text editing in VI

I was excited to hear the announcement of SparkR, an R interface to Spark, and was eager to do some experiments with the software. Unfortunately, none of the Hadoop sandboxes have Spark 1.4 and SparkR pre-installed to play with, so I needed to undertake some steps myself. Luckily, all steps are described in great detail on different sites.

Spin up a VPS

At Argeweb I rented an Ubuntu VPS with 4 cores and 8 GB. That is a very small environment for Hadoop, and of course a single-node environment does not show the full potential of Hadoop / Spark. However, I am not trying to do performance or stress tests on very large data sets, just some functional tests. Moreover, I don’t want to spend more money :-), though the VPS can also nicely be used to install nginx and host my little website -> www.longhowlam.nl

Install R, RStudio and Shiny

A very nice blog post by Dean Attali, which I am not going to repeat here, describes how easy it is to set up R, RStudio and Shiny. I followed steps 6, 7 and 8 of his blog post, and the result is a running Shiny server on my VPS environment. In addition to my local RStudio application on my laptop, I can now also use R on my iPhone through RStudio Server on my VPS. That can be quite handy in a crowded bar when I need to run some R commands….

Using R on my iPhone. You need good eyes!

Install Hadoop 2.7 and Spark

To run Hadoop, you need to install Java first, configure SSH, fetch the Hadoop tar.gz file, install it, set environment variables in the ~/.bashrc file, modify the Hadoop configuration files, format the Hadoop file system and start it. All steps are described in full detail here. After that, download the latest version of Spark; the build pre-built for Hadoop 2.6 or later worked fine for me. You can just extract the tgz file, set the SPARK_HOME variable and you are done!

In each of the above steps different configuration files needed to be edited. Luckily I can still remember my basic VI skills……

Creating a very simple Shiny App

The SparkR package is already available once Spark is installed; it lives inside the Spark directory. So, when attaching the SparkR library to your R session, specify its location using the lib.loc argument of the library function. Alternatively, add the location of the SparkR library to R's search paths for packages, using the .libPaths function. See some example code below.

library(SparkR, lib.loc = "/usr/local/spark/R/lib")

## initialize the SparkR environment
sc = sparkR.init(sparkHome = '/usr/local/spark')
sqlContext = sparkRSQL.init(sc)

## convert the local R 'faithful' data frame to a Spark data frame
df = createDataFrame(sqlContext, faithful)

## apply a filter waiting > 50 and show the first few records
df1 = filter(df, df$waiting > 50)
head(df1)

## aggregate and collect the results in a local R data frame
df2 = summarize(groupBy(df1, df1$waiting), count = n(df1$waiting))
df3.R = collect(df2)
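
The first line of the code above uses the lib.loc route. The .libPaths alternative could look roughly like the sketch below. It assumes the same /usr/local/spark location; reading SPARK_HOME first and the names spark_home and spark_lib are illustrative additions, not part of the original setup.

```r
## minimal sketch: put the SparkR library on R's package search path
## assumption: Spark lives in /usr/local/spark unless SPARK_HOME says otherwise
spark_home = Sys.getenv("SPARK_HOME", unset = "/usr/local/spark")
spark_lib  = file.path(spark_home, "R", "lib")

## .libPaths() silently drops directories that do not exist, so check first
if (dir.exists(spark_lib)) {
  .libPaths(c(spark_lib, .libPaths()))
  library(SparkR)
} else {
  message("SparkR library not found in ", spark_lib)
}
```

After this, a plain library(SparkR) works in the rest of the session without the lib.loc argument.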

Now create a new Shiny app, copy the R code above into the server.R file and, instead of the hard-coded value 50, make the threshold an input using a slider. That’s basically it, my first Shiny app calling SparkR……
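
As a rough illustration of that idea, here is a minimal sketch written as a single-file app (app.R) rather than separate server.R and ui.R files. It assumes Spark 1.4 in /usr/local/spark, as in the code above. The input and output ids (minWaiting, waitingCounts), the slider range and the bar chart are made up for this example and are not taken from the original app.

```r
## app.R: minimal sketch of a Shiny app that calls SparkR (Spark 1.4 API)
## assumptions: Spark in /usr/local/spark; ids and plot are illustrative
library(shiny)
library(SparkR, lib.loc = "/usr/local/spark/R/lib")

## start Spark once per R process, not on every slider move
sc = sparkR.init(sparkHome = "/usr/local/spark")
sqlContext = sparkRSQL.init(sc)
df = createDataFrame(sqlContext, faithful)

ui = fluidPage(
  titlePanel("Old Faithful waiting times via SparkR"),
  ## faithful$waiting runs from 43 to 96, so a maximum of 95 always
  ## leaves at least one record after the filter
  sliderInput("minWaiting", "Minimum waiting time (minutes)",
              min = 43, max = 95, value = 50),
  plotOutput("waitingCounts")
)

server = function(input, output) {
  output$waitingCounts = renderPlot({
    ## the slider value replaces the hard-coded 50
    df1 = filter(df, df$waiting > input$minWaiting)
    df2 = summarize(groupBy(df1, df1$waiting), count = n(df1$waiting))
    ## bring the small aggregated result back into a local R data frame
    counts = collect(df2)
    counts = counts[order(counts$waiting), ]
    barplot(counts$count, names.arg = counts$waiting,
            xlab = "waiting", ylab = "count")
  })
}

shinyApp(ui = ui, server = server)
```

Only the filter, aggregation and collect run inside the reactive block; the Spark context and the Spark data frame are created once when the app starts, so moving the slider does not restart Spark.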

Cheers, Longhow

2 thoughts on “Combining Hadoop, Spark, R, SparkR and Shiny…. and it works :-)”

jewel makerman

September 12, 2016 at 12:00 am

Helpful commentary – I loved the analysis . Does someone know if my business could locate a template AU QCAT Form 9 document to complete ?

Riju Bhattacharyya

March 14, 2017 at 8:29 am

Hi,

I was wondering, if I have a server that runs SparkR, and I have a ready code using Shiny on Rstudio to create a tool, what steps would I have to take to transition the same to SparkR? Also, is it possible to host a shiny app on the server if it has no way to provide an R server, only sparkR? Kinda urgent requirement, would be grateful for an early reply. Thanks!

Longhow Lam's Blog

Data Scientist, Machine learning, R, SAS, Python – Amsterdam (NL)
