Association rules using FPGrowth in Spark MLlib through SparklyR


Introduction

Market Basket Analysis (MBA), or association rules mining, can be a very useful technique to gain insights into transactional data sets, and it can be useful for product recommendation. The classical example is supermarket data. For each customer we know which individual products (items) he or she has bought. With association rules mining we can identify items that are frequently bought together. Other use cases for MBA could be web click data, log files, and even questionnaires.
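To make the idea concrete, here is a minimal base R sketch on four made-up baskets. The data and the helper name `support` are assumptions for illustration only; they are not part of the groceries data used later. The sketch computes the support of an item set (the share of baskets that contain it) and the confidence of a rule (the share of baskets with the antecedent that also contain the consequent).

```r
# Toy baskets, invented for illustration only
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("beer", "crisps"),
  c("bread", "butter")
)

# Support: fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule {bread} => {milk}
sup_rule   <- support(c("bread", "milk"), baskets)  # baskets with both items
confidence <- sup_rule / support("bread", baskets)  # share of bread baskets that also have milk
```

Minimum support and minimum confidence are the two thresholds that are set on the FPGrowth object further down.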

In R, the arules package calculates association rules using the so-called Apriori algorithm. For data sets that are not too big, calculating rules with arules in R (on a laptop) is not a problem. But when you have very large data sets, you need to do something else. You can:

- use more computing power (or a cluster of computing nodes);
- use another algorithm, for example FP Growth, which is more scalable. See this blog for some details on Apriori vs. FP Growth.

Or do both by using FPGrowth in Spark MLlib on a cluster. And the nice thing is: you can stay in your familiar RStudio environment!

Spark MLlib and sparklyr

Example Data set

We use the example groceries transaction data from the arules package. It is not a big data set and you would definitely not need more than a laptop, but it is much more realistic than the example given in the Spark MLlib documentation :-).

Preparing the data

I am a fan of sparklyr 🙂 It offers a good R interface to Spark and MLlib. You can use dplyr syntax to prepare data on Spark, and it exposes many of the MLlib machine learning algorithms in a uniform way. Moreover, it is nicely integrated into the RStudio environment, offering the user views on Spark data and a way to manage the Spark connection.


First connect to Spark, read in the groceries transactional data, and upload the data to Spark. I am just using a local Spark install on my Ubuntu laptop.

###### sparklyr code to perform FPGrowth algorithm ############

library(sparklyr)
library(dplyr)

#### spark connect #########################################
sc <- spark_connect(master = "local")

#### read the example transactions (one row per id / item) ###
transactions = readRDS("transactions.RDs")

#### upload to spark #########################################  
trx_tbl  = copy_to(sc, transactions, overwrite = TRUE)

For demonstration purposes, the data in this example is copied from the local R session to Spark. For large data sets this is no longer feasible; in that case the data can come from Hive tables (on the cluster).

The figure above shows the products purchased by the first four customers in Spark in an RStudio grid. Although transactional systems will often output the data in this structure, it is not what the FPGrowth model in MLlib expects. It expects the data aggregated by id (customer) and the products inside an array. So there is one more preparation step.

# data needs to be aggregated by id, the items need to be in a list
trx_agg = trx_tbl %>% 
   group_by(id) %>% 
   summarise(
      items = collect_list(item)
   )

The figure above shows the aggregated data: customer 12 has a list of 9 purchased items.
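The same reshaping can be tried locally in base R on a toy table (invented here for illustration) to see what the aggregated structure looks like. MLlib's FPGrowth requires the items within one transaction to be unique, so this sketch also drops repeats; on Spark, `collect_set` is the de-duplicating counterpart of `collect_list`.

```r
# Toy long-format transactions, invented for illustration only
trx <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  item = c("milk", "bread", "milk", "beer", "crisps"),
  stringsAsFactors = FALSE
)

# One list element per id; unique() drops repeated items within a basket
items_by_id <- lapply(split(trx$item, trx$id), unique)
```

Here customer 1 bought milk twice, but the aggregated basket contains it only once.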

Running the FPGrowth algorithm

We can now run the FPGrowth algorithm, but there is one more thing. Sparklyr does not expose the FPGrowth algorithm (yet); there is no R interface to it. Luckily, sparklyr allows the user to invoke the underlying Scala methods in Spark. We can define a new object with invoke_new:

  uid = sparklyr:::random_string("fpgrowth_")
  jobj = invoke_new(sc, "org.apache.spark.ml.fpm.FPGrowth", uid)

Now jobj is an object of class FPGrowth in Spark.

jobj
<jobj[457]>
  class org.apache.spark.ml.fpm.FPGrowth
  fpgrowth_d4d41f71f3e0

By looking at the Scala documentation of FPGrowth we see that there are more methods that you can use. We use the function invoke to specify which column contains the list of items, the minimum confidence, and the minimum support.

# fit returns an FPGrowthModel; store it so the rules can be extracted later
FPGmodel = jobj %>% 
    invoke("setItemsCol", "items") %>%
    invoke("setMinConfidence", 0.03) %>%
    invoke("setMinSupport", 0.01)  %>%
    invoke("fit", spark_dataframe(trx_agg))

By invoking fit, the FPGrowth algorithm is fitted and an FPGrowthModel object is returned, on which we can invoke associationRules to get the calculated rules in a Spark data frame:

rules = FPGmodel %>% invoke("associationRules")

The rules in the Spark data frame consist of an antecedent column (the left-hand side of the rule), a consequent column (the right-hand side of the rule) and a column with the confidence of the rule. Note that the antecedent and consequent are lists of items! If needed we can split these lists and collect them to R for plotting or further analysis.
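As a rough sketch of that last step, the base R snippet below turns list columns into readable rule strings. It works on a small stand-in data frame with made-up values, and it assumes that the array columns arrive in R as list columns once the rules are collected; the helper name `flatten` is hypothetical.

```r
# Stand-in for a collected rules table; values are made up
rules_local <- data.frame(confidence = c(0.40, 0.25))
rules_local$antecedent <- list(c("bread", "butter"), "beer")
rules_local$consequent <- list("milk", "crisps")

# Collapse each item list into one comma-separated string
flatten <- function(x) vapply(x, paste, character(1), collapse = ", ")

rules_local$rule <- paste0(
  "{", flatten(rules_local$antecedent), "} => {",
  flatten(rules_local$consequent), "}"
)
```

The resulting `rule` column is a plain character vector, which is easier to print, sort, or use as labels in a plot.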

The invoke statements and rule extraction statements can of course be wrapped inside functions to make them more reusable. So given the aggregated transactions in a Spark table trx_agg, you can get something like:

GroceryRules =  ml_fpgrowth(
  trx_agg
) %>%
  ml_fpgrowth_extract_rules()

plot_rules(GroceryRules)
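A minimal sketch of what such wrappers could look like is given below. The function names `fpgrowth_fit` and `fpgrowth_rules` and their arguments are hypothetical and are not the functions from the script on GitHub; they only bundle the invoke calls shown earlier. Registering the result with `sdf_register` so it can be used with dplyr is likewise an assumption about how one might want to consume the rules.

```r
# Hypothetical helper: fit FPGrowth on a Spark table that has an array column
fpgrowth_fit <- function(x, items_col = "items", support = 0.01, confidence = 0.01) {
  sc   <- sparklyr::spark_connection(x)
  uid  <- sparklyr:::random_string("fpgrowth_")
  jobj <- sparklyr::invoke_new(sc, "org.apache.spark.ml.fpm.FPGrowth", uid)
  jobj <- sparklyr::invoke(jobj, "setItemsCol", items_col)
  jobj <- sparklyr::invoke(jobj, "setMinConfidence", confidence)
  jobj <- sparklyr::invoke(jobj, "setMinSupport", support)
  # Returns the fitted FPGrowthModel as a Spark object reference
  sparklyr::invoke(jobj, "fit", sparklyr::spark_dataframe(x))
}

# Hypothetical helper: pull the rules out of a fitted model as a Spark table
fpgrowth_rules <- function(model, name = "rules") {
  rules <- sparklyr::invoke(model, "associationRules")
  sparklyr::sdf_register(rules, name)
}
```

With an open Spark connection and the aggregated table from above, `fpgrowth_rules(fpgrowth_fit(trx_agg))` would then play the role of the two piped calls.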

Conclusion

The complete R script can be found on my GitHub. If arules in R on your laptop is not workable anymore because of the size of your data, consider FPGrowth in Spark through sparklyr.

cheers, Longhow

18 thoughts on “Association rules using FPGrowth in Spark MLlib through SparklyR”


Hi Longhow, thanks for this interesting post. I got one question about calculating the rule lift value. In order to do this, we need the support value of the rule consequent. How could we get this value, or how could we get the support of different frequent itemsets? Any insights would be appreciated. Thanks!


Longhow Lam

November 24, 2017 at 7:23 am

The current version of Spark does not include the support in the association rules data frame. But it looks like work has already been done to include it in the data frame:

https://github.com/apache/spark/pull/17280/files


Hello Longhow, thank you for this. In Apriori in R, the number of repeat items in a basket is ignored, so five beers and ten crisps (ratio 1:2) is just {beer, crisps}, as if it were 1:1 every time. Can FP-Growth use the quantity of each item as part of the algorithm to determine patterns of association that reflect the ratios of items in the basket?


Hello Allan,

I don’t know the details of the algorithm. But the algorithm in Spark MLlib is described in this paper:

2008_recsys_pfp.pdf


Thank you for this interesting article.
You said that for large data sets copying the data to Spark is not feasible anymore.
Besides using data coming from hive tables (on the cluster) how else can I upload a large dataset (30 million transactions csv file) to spark using R?


Longhow Lam

November 24, 2017 at 1:42 pm

It depends on how large the node is where you have R running; if it is big enough, even 30 million records should be OK. I have run arules with 50 million transactions on a laptop with 32GB RAM.

Otherwise you could upload the data in chunks to Spark if you really need to go through local R.

- JD
  
  November 24, 2017 at 5:18 pm
  
  Thank you
  

Reblogged this on paulvanderlaken.com and commented:
Great tutorial on how to conduct simple market basket analysis on your laptop either with association rules through the arules package or with frequent pattern mining (FPGrowth) in Spark via sparklyr!


What are the memory limitations for a transaction set? I would like to use this for maintenance actions: treat the customer ID as the work order ID, with a set of parts for each work order. I would like to understand my memory limitations better for setting this up.


Longhow Lam

April 14, 2018 at 4:42 pm

How big is your transaction data set? In a way, there is no limitation. Spark scales horizontally so you could extend the number of nodes as needed.


Hi,
I am getting this error “Items in a transaction must be unique but got WrappedArray(30535931, 30536336, 30536336).” when I run ‘FPGmodel = ml_fpgrowth(trx_agg, “items”, support = 0.01, confidence = 0.01)’ How to fix this?


Longhow Lam

April 14, 2018 at 5:13 pm

You need to de-duplicate (make unique) your transaction set. If person A buys product X, that results in a line with A and X in the data set. That line cannot occur again.

- Rishab Oberoi
  
  April 14, 2018 at 7:12 pm
  
  but I can use distinct on trx_agg?
  

I think you want collect_set instead of collect_list. This worked for me.


Hi Longhow, thanks for the great post, I am only getting started with Spark and this is a great help. I believe I have followed your example correctly but when I run “rules = FPGmodel %>% invoke(“associationRules”)” I get the following error “Error in eval(lhs, parent, parent) : object ‘FPGmodel’ not found”. Where does FPGmodel come from in your example? Thanks in advance.



Longhow Lam's Blog

Data Scientist, Machine learning, R, SAS, Python – Amsterdam (NL)
