Dataiku 4.1.0: More support for R users!

ddsR

Introduction

Recently, Dataiku 4.1.0 was released, it now offers much more support for R users. But wait a minute, Data-what? I guess some of you do not know Dataiku, so what is Dataiku in the first place? It is a collaborative data science platform created to design and run data products at scale. The main themes of the product are:

Collaboration & Orchestration: A data science project often involves a team of people with different skills and interests. To name a few, we have data engineers, data scientists, business analysts, business stakeholders, hardcore coders, R users and Python users. Dataiku provides a platform to accommodate the needs of these different roles to work together on data science projects.

Productivity: Whether you like hard core coding or are more GUI oriented, the platform offers an environment for both. A flow interface can handle most of the steps needed in a data science project, and this can be enriched by Python or R recipes. Moreover, a managed notebook environment is integrated in Dataiku to do whatever you want with code.

Deployment of data science products: As a data scientist you can produce many interesting stuff, i.e. graphs, data transformations, analysis, predictive models. The Dataiku platform facilitates the deployment of these deliverables, so that others (in your organization) can consume them. There are dashboards, web-apps, model API’s, productionized model API’s and data pipelines.

 

There is a free version which contains already a lot of features to be very useful, and there is an paid version, with “enterprise features“. See for the Dataiku website for more info.

Improved R Support in 4.1.0

Among many new features, and the one that interests me the most as an R user, is the improved support for R. In previous versions of Dataiku there was already some support for R, this version has the following improvements. There is now support for:

R Code environments

In Dataiku you can now create so-called code environments for R (and Python). A code environment is a standalone and self-contained environment to run R code. Each environment can have its own set of packages (and specific versions of packages). Dataiku provides a handy GUI to manage different code environments. The figure below shows an example code environment with specific packages.

 

In Dataiku whenever you make use of R –> in R recipes, Shiny, R Markdown or creating R API’s you can select a specific R code environment to use.

R Markdown reports & Shiny applications

If you are working in RStudio you most likely already know R Markdown documents and Shiny applications. In this version, you can also create them in Dataiku. Now, why would you do that and not just create them in RStudio? Well, the reports and shiny apps become part of the Dataiku environment and so:

  • They are managed in the environment. You will have a good overview of all reports and apps and see who has created/edited them.
  • You can make use of all data that is already available in the Dataiku environment.
  • Moreover, the resulting reports and Shiny apps can be embedded inside Dataiku dashboards.

 

The figure above shows a R markdown report in Dataiku, the interface provides a nice way to edit the report, alter settings and publish the report. Below is an example dashboard in Dataiku with a markdown and Shiny report.

Creating R API’s

Once you created an analytical model in R, you want to deploy it and make use of its predictions. With Dataiku you can now easily expose R prediction models as an API. In fact, you can expose any R function as an API. The Dataiku GUI provides an environment where you can easily set up and test an R API’s. Moreover, The Dataiku API Node, which can be installed on a (separate) production server imports the R models that you have created in the GUI and can take care of load balancing, high availability and scaling of real-time scoring.

The following three figures give you an overview of how easy it is to work with the R API functionality.

First, define an API endpoint and R (prediction) function.

 

Then, define the R function, it can make use of data in Dataiku, R objects created earlier or any R library you need.

 

Then, test and deploy the R function. Dataiku provides a handy interface to test your function/API.

 

Finally, once you are satisfied with the R API you can make a package of the API, that package can then be imported on a production server with Dataiku API node installed. From which you can then serve API requests.

Conclusion

The new Dataiku 4.1.0 version has a lot to offer for anyone involved in a data science project. The system already has a wide range support for Python, but now with the improved support for R, the system is even more useful to a very large group of data scientists.

Cheers, Longhow.

Advertisements

Test driving Python integration in R, using the ‘reticulate’ package

RP

Introduction

Not so long ago RStudio released the R package ‘reticulate‘, it is an R interface to Python. Of course, it was already possible to execute python scripts from within R, but this integration takes it one step further. Imported Python modules, classes and functions can be called inside an R session as if it were just native R functions.

Below you’ll find some screen shot code snippets of using certain Python modules within R with the reticulate package. On my GitHub page you’ll find the R files from which these snippets were taken from.

Using python packages

The nice thing about reticulate in RStudio is the support for code completion. When you have imported a python module, RStudio will recognize the methods that are available in the python module:

clarifai_code_comp

The clarifai module

Clarifai provides a set of computer vision API’s for image recognition, face detection, extracting tags, etc. There is an official python module and there is also an R package by Guarav Sood, but it exposes less functionality. So I am going to use the python module in R. The following code snippet shows how easy it is to call python functions.

clarifaicode

The output returned from the clarifai call is a nested list and can be quit intimidating at first sight. To browse trough these nested lists and to get a better idea of what is in those lists, you can use the package listviewer:

listviewer

The pattern.nl module

The pattern.nl module contains a fast part-of-speech tagger for Dutch, sentiment analysis, and tools for Dutch verb conjugation and noun singularization & pluralization. At the moment it does not support python 3. That is not a big deal, I am using Anaconda and created a Python 2.7 environment to install pattern.nl.

The nice thing of the reticulate package is that it allows you to choose a specific Python environment to be used.

pattern

The pytorch module

pytorch is a python package that provides tensor computations and deep neural networks. There is no ‘R torch’ equivalent, but we can use reticulate in R. There is an example of training a logistic regression in pytorch, see the code here. It takes just a little rewrite of this code to make this work in R. See the first few lines in the figure below.

pytorch

Conclusion

As a data scientist you should know both R and Python, the reticulate package is no excuse for not learning Python! However, the reticulate package can be very useful if you want to do all your analysis in the RStudio environment. It works very well.

For example, I have used rvest to scrape some Dutch news texts, then used the Python module pattern.nl for Dutch sentiment and wrote an R Markdown document to present the results. Then the reticulate package is a nice way to keep everything in one environment.

Cheers, Longhow