Last month I visited my favorite Chinese restaurant with some friends in the center of Amsterdam, Golden chopsticks, and had some nice food. I was wondering if other people shared the same opinion. So I visited iens.nl, a Dutch restaurant review site and looked up Golden chopsticks restaurant. They have fairly good reviews, so that was good. For our next restaurant diner my friends wanted to try something else, we had Chinese, so do we want to go to a Chinese restaurant again, or something else?
Let’s use some analytics to help me decide. With R and the package rVest it is not difficult to retrieve data from the Iens restaurant site. Some knowledge on CSS, Xpath and regular expressions is needed but then you can scrape away…. Inside a for loop over i, I have code fragments like the ones given below.
start = "http://www.iens.nl/restaurant/?q=&amp;perPagina=20&amp;pagina=" httploc = paste(start,i,sep="") restaurants = html(httploc) recensieinfo = html_text(html_nodes(restaurants,xpath='//div[@class="listerItem_score"]')) restaurantLabel = html_nodes(restaurants,".listerItem") restName = html_text(html_nodes(reslabel2,".fontSerif")) restAddress = html_nodes(restarantLabel ,"div address") Nreviews = str_extract(str_extract(recensieinfo,pattern="[0-9]* recensies"), pattern = "[0-9]*") averageScore = str_extract(recensieinfo,pattern="[0-9].[0-9]") # more statements
I have retrieved information of around 11.000 restaurants, for these restaurants there are around 215.000 reviews (taking only the reviews that also have numeric scores for food, service and scenery, and ignoring older reviews). The data I scraped are in two tables, one table at restaurant level and one on review level.
First, some interesting facts on the scraped data.
Chinese restaurant names
It is true that certain names of Chinese restaurants occur more often. There are around 1600 Chinese restaurants, the most frequent names: Kota Radja (39), Peking (36), Lotus (33), De Chinese Muur (32), De Lange Muur (25), China Garden (25), Hong kong (22), Ni Hao (16). My father once had a Chinese restaurant called Hong Kong! See the dashboard below. These kind of more occurring names is quit specific to Chinese restaurants, if we look at other kitchen types we find more unique restaurant names. You’ll find the names per kitchen in a little Shiny app here.
Kitchen/restaurant type and number of reviews
Of the 11.000 restaurants, there are around 8900 restaurants that have reviews, with restaurants El Olivo, Rhodos, Floor17 having the most reviews. There are around 215.000 reviews written by 103.800 reviewers, with MartinK having the most reviews.
The five types of restaurants that occur most often in all the reviews are “International”, “Hollands”, French, Italian and Chinese, as shown in the tree map below. The color of the tree map represents the average number of reviews per restaurant type. So although Chinese restaurants form a large fraction (8.8%) of all restaurants, there are only on average 8.3 reviews per Chinese restaurant. On the other hand, Italian restaurants form 8.4% of all restaurants and have more than 18 reviews per Italian restaurant. Conclusion: People eat at Chinese restaurants, but they just don’t write about it…..
The following figure shows a screen shot of a SAS Visual Analytics dashboard that I made from the Iens data, it shows a couple of things.
- Uniqueness of restaurant names per kitchen type
- The number of reviews per day, we can see a Valentin peak at 14-Feb-2015,
- The avergare score per day, it looks like the scores in the month July are a little bit lower.
- Saturdays and Fridays are more crowded for restaurants, everybody knows that, and the Iens review data confirms this.
If you look at the Iens site, then you’ll see that almost all of the reviews are written in Dutch, which you would expect. But if you run a language detection first on the reviews, its funny to see that here are some reviews in other languages. English is the second language (300 or so reviews) and a few Italian reviews. It turns out that the language detection (R textcat package) on “Italian” reviews is fooled by some particular Italian words like Buonissimo, cappuccino carprese and piza.
What are the topics that we can extract from the 215.000 reviews, In an earlier blog post I wrote about text mining, I have used SAS Text Miner to generate topics from the Iens reviews. So what are people writing about? The picture below shows the five topics that occur a lot in all the reviews. A topic in SAS Text Miner is characterized by key terms (in Dutch) which are given below.
So it’s always the same: complaining about the long wait for the food…. but on the other hand a lot of reviews on how great and fantastic the evening was…
Next restaurant type to visit
To come back to my first question: what type of restaurant should I visit next? I can perform a path analysis to answer that question. Scraping the restaurant site not only resulted in the texts of the reviews but also the reviewer, the date and the restaurant(type). So in the data below for example Pauls path consists of a first visited to Sisters, then to Fussia and then to Milo.
With SAS software I can either use Visual Analytics to generate a Sankey diagram that will show the most occurring paths. Or alternatively, I can perform a path analysis on this data in Enterprise Miner and then visualize the most occurring paths in a sunburst or chord diagram. See the pictures below.
Interactive versions of the IENS sunburst and chord diagrams can be found here and here. Conclusion: It turns out that after a Chinese restaurant visit most reviewers will visit an “International” restaurant type…. In a next blog post I will go one step further and show how to use techniques from recommendation engines to recommend at restaurant level.