Association rules using FPGrowth in Spark MLlib through SparklyR



Market Basket Analysis or association rules mining can be a very useful technique to gain insights in transactional data sets, and it can be useful for product recommendation. The classical example is data in a supermarket. For each customer we know what the individual products (items) are that he has bought. With association rules mining we can identify items that are frequently bought together. Other use cases for MBA could be web click data, log files, and even questionnaires.

In R there is a package arules to calculate association rules, it makes use of the so-called Apriori algorithm. For data sets that are not too big, calculating rules with arules in R (on a laptop) is not a problem. But when you have very huge data sets, you need to do something else, you can:

  • use more computing power (or cluster of computing nodes).
  • use another algorithm, for example FP Growth, which is more scalable. See this blog for some details on Apriori vs. FP Growth.

Or do both of the above points by using FPGrowth in Spark MLlib on a cluster. And the nice thing is: you can stay in your familiar R Studio environment!

Spark MLlib and sparklyr

Example Data set

We use the example groceries transactions data in the arules package. It is not a big data set and you would definitely not need more than a laptop, but it is much more realistic than the example given in the Spark MLlib documentation :-).

Preparing the data

I am a fan of sparklyr 🙂 It offers a good R interface to Spark and MLlib. You can use dplyr syntax to prepare data on Spark, it exposes many of the MLlib machine learning algorithms in a uniform way. Moreover, it is nicely integrated into the RStudio environment offering the user views on Spark data and a way to manage the Spark connection.

First connect to spark and read in the groceries transactional data, and upload the data to Spark. I am just using a local spark install on my Ubuntu laptop.

###### sparklyr code to perform FPGrowth algorithm ############


#### spark connect #########################################
sc <- spark_connect(master = "local")

#### first create some dummy data ###########################
transactions = readRDS("transactions.RDs")

#### upload to spark #########################################  
trx_tbl  = copy_to(sc, transactions, overwrite = TRUE)

For demonstration purposes, data is copied in this example from the local R session to Spark. For large data sets this is not feasible anymore, in that case data can come from hive tables (on the cluster).

The figure above shows the products purchased by the first four customers in Spark in an RStudio grid. Although transactional systems will often output the data in this structure, it is not what the FPGrowth model in MLlib expects. It expects the data aggregated by id (customer) and the products inside an array. So there is one more preparation step.

# data needs to be aggregated by id, the items need to be in a list
trx_agg = trx_tbl %>% 
   group_by(id) %>% 
      items = collect_list(item)

The figure above shows the aggregated data, customer 12, has a list of 9 items that he has purchased.

Running the FPGrowth algorithm

We can now run the FPGrowth algorithm, but there is one more thing. Sparklyr does not expose the FPGrowth algorithm (yet), there is no R interface to the FPGrowth algorithm. Luckily, sparklyr allows the user to invoke the underlying Scala methods in Spark. We can define an new object with invoke_new

  uid = sparklyr:::random_string("fpgrowth_")
  jobj = invoke_new(sc, "", uid) 

Now jobj is an object of class FPGrowth in Spark.


And by looking at the Scala documentation of FPGrowth we see that there are more methods that you can use. We need to use the function invoke, to specify which column contains the list of items, to specify the minimum confidence and to specify the minimum support.

jobj %>% 
    invoke("setItemsCol", "items") %>%
    invoke("setMinConfidence", 0.03) %>%
    invoke("setMinSupport", 0.01)  %>%
    invoke("fit", spark_dataframe(trx_agg))

By invoking fit, the FPGrowth algorithm is fitted and an FPGrowthModel object is returned where we can invoke associationRules to get the calculated rules in a spark data frame

rules = FPGmodel %>% invoke("associationRules")

The rules in the spark data frame consists of an antecedent column (the left hand side of the rule), a consequent column (the right hand side of the rule) and a column with the confidence of the rule. Note that the antecedent and consequent are lists of items! If needed we can split these lists and collect them to R for plotting for further analysis.

The invoke statements and rules extractions statements can of course be wrapped inside functions to make it more reusable. So given the aggregated transactions in a spark table trx_agg, you can get something like:

GroceryRules =  ml_fpgrowth(
) %>%



The complete R script can be found on my GitHub. If arules in R on your laptop is not workable anymore because of the size of your data, consider FPGrowth in Spark through sparklyr.

cheers, Longhow


Some insights in soccer transfers using Market Basket Analysis



Although more than 20 years old, Market Basket Analysis (MBA) (or association rules mining) can still be a very useful technique to gain insights in large transactional data sets. The classical example is transactional data in a supermarket. For each customer we know what the individual products (items) are that he has put in his basket and bought. Other use cases for MBA could be web click data, log files, and even questionnaires.

With market basket analysis we can identify items that are frequently bought together. Usually the results of an MBA are presented in the form of rules. The rules can be as simple as {A ==> B}, when a customer buys item A then it is (very) likely that the customer buys item B. More complex rules are also possible {A, B ==> D, F}, when a customer buys items A and B then it is likely that he buys items D and F.


Soccer transactional data

To perform MBA you need of course data, but I don’t have real transactional data from a retailer that I can present here. So I am using soccer data instead 🙂 From the Kaggle site you can download some soccer data, thanks to Hugo Mathien. The data contains around 25.000 matches from eleven European soccer leagues starting from season 2008/2009 until season 2015/2016. After some data wrangling I was able to generate a transactional data set suitable for market basket analysis. The data structure is very simple, some records are given in the figure below:

So we do not have customers but soccer players, and we do not have products but soccer clubs. In total, my soccer transactional data set contains around 18.000 records. Obviously, these records do not only include the multi-million transfers covered in the media, but also all the transfers of players nobody has ever heard of 🙂

Market basket results

In R you can use the arules package for MBA / association rules mining. Alternatively, when the order of the transactions is important, like my soccer transfers, you should use the arulesSequences package. After running the algorithm I got some interesting results. The figure below shows the most frequently occurring transfers between clubs:

So in this data set the most frequently occurring transfer is from Fiorentina to Genoa (12 transfers in total). I have published the entire table with the rules on RPubs, where you can look up the transfer activity of your favorite soccer club.

Network graph visualization

All the rules that we get from the association rules mining form a network graph. The individual soccer clubs are the nodes of the graph and each rule “from ==> to” is an edge of the graph. In R, network graphs can be visualized nicely by means of the visNetwork package. The network is shown in the picture below.

An interactive version can be found on RPubs. The different colors represent the different soccer leagues. There are eleven leagues in this data, there are more leagues in Europe, but in this data we see that the Polish league is quite isolated from the rest. Almost blended in each other are the German, Spanish, English and French leagues. Less connected are the Scottish and Portuguese leagues, but also in the big English Premier and German leagues you will find less connected clubs like Bournemouth, Reading or Arminia Bielefeld.

The size of a node in the above graph represents it’s betweenness centrality, it is an indicator of a node’s centrality in a network. It is equal to the number of shortest paths from all vertices to all others that pass through that node. In R betweenness measures can be calculated with the igraph package. The most central clubs in the transfers of players are Sporting CP, Lechia Gdansk, Sunderland, FC Porto and PSV Eindhoven.

Virtual Items

An old trick among marketeers is to use virtual items in a market basket analysis. Besides the ‘physical’ items that a customer has in his basket, a marketeer can add extra virtual items in the basket. These could be for example customer characteristic like age-class, sex, but also things like day of week, region etc. The transactional data with virtual items might look like:

If you run a MBA on the transactional data with virtual items, interesting rules might appear. For example:

  • {Chocolate, Female ==> Eggs}
  • {Chocolate, Male ==> Apples}
  • {Beer, Friday, Male, Age[18:23] ==> sausages}.

Virtual items that I can add to my soccer transactional data are: age-class, four classes: 1: players younger than 25, 2: [25, 29), 3: [29, 33) and 4: the players that are 33 or older. Preferred foot, two classes: left or right. Height class, four classes: 1; players smaller than 178 cm, 2: [178, 183), 3: [183, 186), and 4: players taller than 186 cm.

After running the algorithm again the results allow you to find out the frequently occuring transfers of let-footers. I can see 4 left footers that transferred from Roma to Sampdoria, more rules can be seen on my RPubs site.


When you have transactional data, even as small as the soccer transfers, market basket analysis is definitely one of the techniques you should try to get some first insights. Feel free to look at my R code on GitHub to experiment with the soccer transfers data.

Cheers Longhow.