This is a well-known dataset provided by Yelp for education and research purposes. Despite having been explored extensively, this dataset is still a good playground for testing one's knowledge and practicing analysis skills, including statistics, programming, and visualization with R or Python.
In the present project, we will begin with the construction of a local MySQL database, then move on to exploratory analyses inspired by a nice and extensive tutorial, and finally use the data to answer some questions that arose in my mind. The project covers the application of R, Python, SQL, Tableau, and statistics.
Note that some code was adapted from that tutorial with a small amount of modification, and some extensions have been made to go further from there.
The entire dataset contains 5 JSON files with a total size of over 8 GB. It is possible to read the data in Python, because Python can read it line by line instead of all at once. R, by contrast, imports all the data into memory, which makes the dataset difficult to handle. One solution is to build a local MySQL database, from which R can submit queries and retrieve only the subset of the data it needs.
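As a minimal sketch of that workflow (assuming the yelp database and a MySQL account like the ones set up below; the password is a placeholder), R connects to the server and pulls back only the rows it needs:
library(DBI)
# Fetch a small slice of the huge review table instead of loading
# the whole 5+ GB JSON file into memory (stars is stored as text here)
conn <- dbConnect(RMySQL::MySQL(), dbname='yelp', host='localhost',
                  user='cchen', password='...')
five_star <- dbGetQuery(conn, "SELECT business_id, stars FROM review WHERE stars = '5' LIMIT 1000")
dbDisconnect(conn)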
The following steps are involved in the present database construction; this might not be the most efficient way, but at least it worked for me.
Generate CSV files from JSON files in Python:
import json, csv
import pandas as pd

workdir = '/Users/cchen/work/yelp/'
ds_keys = ['business', 'tip', 'checkin', 'review', 'user']

for key in ds_keys:
    fname_json = f'yelp_academic_dataset_{key}.json'
    fpath_json = workdir + fname_json
    # each line of the file is one standalone JSON record
    with open(fpath_json, 'r') as f:
        data = [json.loads(line) for line in f]
    df = pd.DataFrame.from_dict(data)
    fname_csv = f'{key}.csv'
    fpath_csv = workdir + fname_csv
    # remove '\' from the text to avoid import errors in MySQL
    df = df.replace('\\\\', ' ', regex=True)
    df.to_csv(fpath_csv, index=False, quoting=csv.QUOTE_ALL)
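Before moving to MySQL, it is worth a quick sanity check that the generated CSV files look right. A minimal peek in R (using data.table, which we load later anyway):
library(data.table)
# Read only the first few rows to confirm headers and quoting survived
head(fread('/Users/cchen/work/yelp/business.csv', nrows=5))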
Build database in MySQL:
-- Create database for yelp
CREATE DATABASE yelp;
USE yelp;
-- Create table "business" for csv file importing
DROP TABLE IF EXISTS `business`;
CREATE TABLE `business` (
`business_id` text,
`name` text,
`address` text,
`city` text,
`state` text,
`postal_code` text,
`latitude` text,
`longitude` text,
`stars` text,
`review_count` text,
`is_open` text,
`attributes` text,
`categories` text,
`hours` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
LOAD DATA
LOCAL INFILE '/Users/cchen/work/yelp/business.csv'
INTO TABLE business
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 ROWS;
-- Create table "review" for csv file importing
DROP TABLE IF EXISTS `review`;
CREATE TABLE `review` (
`review_id` text,
`user_id` text,
`business_id` text,
`stars` text,
`useful` text,
`funny` text,
`cool` text,
`text` text,
`date` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
LOAD DATA
LOCAL INFILE '/Users/cchen/work/yelp/review.csv'
INTO TABLE review
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 ROWS;
-- Create table "checkin" for csv file importing
-- Note: the date field can be longer than 65,535 bytes (the limit of text), so longtext is used
DROP TABLE IF EXISTS `checkin`;
CREATE TABLE `checkin` (
`business_id` text,
`date` longtext
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
LOAD DATA
LOCAL INFILE '/Users/cchen/work/yelp/checkin.csv'
INTO TABLE checkin
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 ROWS;
-- Create table "tip" for csv file importing
DROP TABLE IF EXISTS `tip`;
CREATE TABLE `tip` (
`user_id` text,
`business_id` text,
`text` text,
`date` text,
`compliment_count` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
LOAD DATA
LOCAL INFILE '/Users/cchen/work/yelp/tip.csv'
INTO TABLE tip
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 ROWS;
-- Create table "user" for csv file importing
DROP TABLE IF EXISTS `user`;
CREATE TABLE `user` (
`user_id` text,
`name` text,
`review_count` text,
`yelping_since` text,
`useful` text,
`funny` text,
`cool` text,
`elite` text,
`friends` longtext,
`fans` text,
`average_stars` text,
`compliment_hot` text,
`compliment_more` text,
`compliment_profile` text,
`compliment_cute` text,
`compliment_list` text,
`compliment_note` text,
`compliment_plain` text,
`compliment_cool` text,
`compliment_funny` text,
`compliment_writer` text,
`compliment_photos` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
LOAD DATA
LOCAL INFILE '/Users/cchen/work/yelp/user.csv'
INTO TABLE user
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 ROWS;
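After all five imports, a quick sanity check from R (a sketch; fill in your own MySQL credentials) confirms that each table actually received rows:
library(DBI)
conn <- dbConnect(RMySQL::MySQL(), dbname='yelp', host='localhost',
                  user='cchen', password='...')
# Count the rows in each of the five freshly imported tables
for (tbl in c('business', 'review', 'checkin', 'tip', 'user')) {
  n <- dbGetQuery(conn, paste0('SELECT COUNT(*) AS n FROM `', tbl, '`'))$n
  cat(tbl, ':', n, 'rows\n')
}
dbDisconnect(conn)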
Many packages are required to process the dataset, perform analysis, and produce images.
required_packages <-
c("tidyverse", "stringr", "lubridate", "wordcloud", "tidytext", "DT",
"leaflet", "igraph", "ggraph", "topicmodels", "tm", "SnowballC", "textcat",
"rjson", "jsonlite", "scales", "ggthemes", "RMySQL", "data.table")
lapply(required_packages, library, character.only=TRUE)
setwd("/Users/cchen/work/yelp")
palname <- "Tableau 10"
pal_tab <- tableau_color_pal(palname)(10)
color_tab <- scale_color_tableau(palname, direction=-1)
fill_tab <- scale_fill_tableau(palname, direction=-1)
# Since the business table is only 287 MB, we can load it directly.
fname_business <- "yelp_academic_dataset_business.json"
business <- jsonlite::stream_in(file(fname_business), flatten=TRUE)
categories <- str_split(business$categories, ",")
categories <- as.data.frame(unlist(categories))
colnames(categories) <- c("Name")
categories$Name <- str_squish(categories$Name)
categories %>%
group_by(Name) %>%
summarise(Count=n()) %>%
arrange(desc(Count)) %>%
ungroup() %>%
mutate(Name=reorder(Name, Count)) %>%
head(10) %>%
ggplot(aes(x=Name, y=Count)) +
geom_bar(stat="identity", fill=pal_tab[1]) +
geom_text(aes(x=Name, y=1, label=paste0(Count, sep=" ")),
hjust=-0.5, vjust=0.5, size=3, colour='white') +
labs(x="Category", y="Count",
title="Top 10 categories of business") +
coord_flip() +
theme_bw()
The top 10 categories of business are listed above, where the ranking is based on the number of businesses within each category. Restaurant is the most popular one, followed by Food and Shopping, which makes sense because, from my personal point of view, these are the categories I would look at before going out. A bit surprising is that Home Services is ranked 4th, so we can find out what kind of businesses fall into this category.
hs <- business[business$categories %like% "Home Services", ]
head(hs$name, n=10)
## [1] "Jennie Deckert"
## [2] "AM&PM Locksmith"
## [3] "Absolutely Perfect Inc"
## [4] "Premier Mortgage Resources"
## [5] "Mighty Dustless"
## [6] "Tinkle Belle Diaper Service"
## [7] "Cook's Glass & Mirror"
## [8] "Leslie's"
## [9] "David Gower, Jr. - Coldwell Banker Preferred"
## [10] "Nash Painting"
city <- str_split(business$city, ",")
city <- as.data.frame(unlist(city))
colnames(city) <- c("Name")
city$Name <- str_squish(city$Name)
city %>%
group_by(Name) %>%
summarise(Count=n()) %>%
arrange(desc(Count)) %>%
ungroup() %>%
mutate(Name=reorder(Name, Count)) %>%
head(10) %>%
ggplot(aes(x=Name, y=Count)) +
geom_bar(stat="identity", fill=pal_tab[1]) +
geom_text(aes(x=Name, y=1, label=paste0(round(Count/1e3), "K", sep="")),
hjust=-0.5, vjust=0.5, size=3, colour='white') +
labs(x="City", y="Count",
title="Top 10 cities with most mentioned business") +
coord_flip() +
theme_bw()
The top 10 cities with the most businesses mentioned on Yelp are listed above. I don't think this represents the popularity of cities around the U.S., since the Yelp dataset we are querying only contains data from certain states over a limited time range. But from what is shown here, Philadelphia is the most popular one.
My family likes Orlando a lot, and Tampa is one city we would include in a vacation to Florida. Therefore, I am personally interested in learning more about the businesses in Tampa.
target_city <- "Tampa"
CityCoords <- business %>% filter(city == target_city)
center_lon <- median(CityCoords$longitude, na.rm=T)
center_lat <- median(CityCoords$latitude, na.rm=T)
leaflet(CityCoords) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
addCircles(lng=~longitude, lat=~latitude,radius=~sqrt(review_count)) %>%
setView(lng=center_lon, lat=center_lat, zoom=10)
It seems that the businesses are quite evenly distributed across Tampa, with a bit more density in the downtown area and around Tampa Bay.
As the JSON file containing the review data is over 5 GB, my laptop's memory can easily be drained by importing it directly. The following code demonstrates how to query the local MySQL database we constructed previously.
# MySQL connection
conn <- dbConnect(RMySQL::MySQL(),
dbname='yelp',
host='localhost',
port=3306,
user='cchen',
password='20151228')
dbListTables(conn) # Show tables
query <- "SELECT * FROM review LIMIT 20"
res <- dbSendQuery(conn, query)
result <- dbFetch(res, n=5)
dbClearResult(res)
query <- "SELECT business_id, stars, date, text FROM review;"
res <- dbSendQuery(conn, query)
review <- dbFetch(res, n=-1)
dbClearResult(res)
review$stars <- as.numeric(review$stars)
most5StarsReviews <- review %>%
filter(stars == 5) %>%
group_by(business_id) %>%
summarise(Count=n()) %>%
arrange(desc(Count)) %>%
ungroup() %>%
mutate(BusinessID=reorder(business_id, Count)) %>%
head(10)
most5StarsReviews <- inner_join(most5StarsReviews, business)
most5StarsReviews %>%
mutate(name=reorder(name, Count)) %>%
ggplot(aes(x=name,y=Count)) +
geom_bar(stat="identity", fill=pal_tab[1]) +
geom_text(aes(x=name, y=1, label=paste0(Count, sep="")),
hjust=-0.5, vjust=0.5, size=3, colour='white') +
labs(x="Name of the business", y="Count",
title="Top 10 business with most five star reviews") +
coord_flip() +
theme_bw()
First, the top 10 businesses with the most five-star reviews are listed above. Reading Terminal Market is the top one, which is a bit unfair because I imagine it is a shopping plaza comprising a lot of other businesses. So let's look at Oceana Grill this time.
As a business owner, spending some time reading and mining the reviews can be beneficial. It gives you feedback from your customers and tells you why your business is doing well or poorly, as well as which parts need improvement.
We can first look at the high-frequency words in the reviews through a word cloud. Word clouds might not be new to you: they use font size to indicate how often a word appears in a body of text.
# Get the business_id for "Oceana Grill"
bid_hattie <- business$business_id[business$name %like% "Oceana Grill"]
createWordCloud <- function(train) {
  train %>%
    unnest_tokens(word, text) %>%
    filter(!word %in% stop_words$word) %>%
    count(word, sort=TRUE) %>%
    ungroup() %>%
    head(100) %>%
    with(wordcloud(word, n, max.words=100, colors=rev(pal_tab)))
}
createWordCloud(review %>%
filter(business_id == bid_hattie))
From the word cloud shown above, we can get an overview of what customers are talking about in the reviews. For example, "oysters", "crab", and "shrimp" are mentioned often, indicating popular dishes people are interested in trying, whereas "amazing", "delicious", and "recommend" are words that usually appear in positive reviews. "Service" is also mentioned a lot; on its own the word does not imply anything, so the owner should spend some time checking whether those mentions are compliments or criticisms.
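One crude way to check (a sketch using the bing sentiment lexicon shipped with tidytext): keep only this business's reviews that mention "service" and count positive versus negative words in them.
review %>%
  filter(business_id == bid_hattie,
         str_detect(text, regex("service", ignore_case=TRUE))) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by="word") %>%
  count(sentiment)  # rough balance of praise vs. complaints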
ggplot(review, aes(x=stars)) +
geom_bar(width=0.5, fill=pal_tab[1]) +
labs(x="Stars", y="Count",
title="Review stars histogram") +
theme_bw()
The histogram of review stars gives us some hints about when customers tend to leave reviews: when they are very satisfied or very unsatisfied, which is why 1-star and 5-star reviews are the most common. From my personal perspective, when there is little to say about a business I would rather keep silent, and when something is bad I would choose the lowest score instead of rationally giving a 2-star review.
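To check that reading with actual numbers, here is a quick tally of each star level's share of all reviews:
# Share of each star level among all reviews
review %>%
  count(stars) %>%
  mutate(share = scales::percent(n / sum(n)))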
review$year <- year(ymd_hms(review$date))
ggplot(review, aes(x=year)) +
geom_bar(fill=pal_tab[1]) +
labs(x="Year", y="Count",
title="Review year histogram") +
theme_bw()
The histogram of review years shows some interesting trends. First, from 2005 to 2019 the number of reviews grows year after year, reflecting the fact that more and more businesses and customers are using the Yelp platform. There is then a clear drop from 2019 to 2020, followed by a slow recovery into 2021, which is likely due to the impact of SARS-CoV-2: a lot of businesses did not survive, and a lot of places were under lockdown. We can probably predict that the total number of reviews will come back to the level of 2018 and 2019, or even higher.
One of the questions I had in mind before exploring the dataset is: how is the number of reviews related to a business's star rating? Another question: would more reviews bring in more business? The answers could give business owners evidence about whether it is worth the effort to take care of their online reviews, for example by replying to reviews or contacting customers to turn weaknesses into opportunities.
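Before pulling the checkin data, a first-order look at the first question is a simple correlation on the business table already in memory (stars and review_count are numeric there):
# Crude check: do businesses with more reviews tend to have
# higher (or lower) star ratings?
cor(business$review_count, business$stars, use="complete.obs")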
To investigate, we first need to retrieve the data from the "checkin" table.
query <- "SELECT * FROM checkin;"
res <- dbSendQuery(conn, query)
checkin <- dbFetch(res, n=-1)
dbClearResult(res)
Then we can count the number of checkins for each business and make the plot.
checkin$count <- lengths(str_split(checkin$date, ","))
checkin_ijoin_business <- inner_join(checkin, business)
## Joining with `by = join_by(business_id)`
checkin_ijoin_business %>%
ggplot(aes(x=stars, y=count)) +
geom_point( color=pal_tab[1]) +
labs(x="Star", y="Checkin count",
title="Relationship between checkins and the business stars") +
theme_bw()
From the plot above, a prominent peak can be observed around 4-star businesses, and the distribution diminishes at both ends. This makes a lot of sense because people like to go to businesses with better reviews. However, there are a few outliers that can hardly be ignored, because they have a really high number of checkins while their ratings are pretty low. To figure out why, we can extract those businesses with some filters.
checkin_ijoin_business$name[checkin_ijoin_business$count > 20000 & checkin_ijoin_business$stars < 4]
## [1] "Philadelphia International Airport - PHL"
## [2] "Nashville International Airport - BNA"
## [3] "St. Louis Lambert International Airport - STL"
## [4] "Louis Armstrong New Orleans International Airport MSY"
Once the results appear on the screen, you will be like, "Aha, I get it now." These businesses represent services you might not like but have to live with.
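With the airports identified, one might redraw the scatter plot without them (a sketch reusing the same 20,000-checkin cutoff) to see the main trend more clearly:
checkin_ijoin_business %>%
  filter(count <= 20000) %>%   # drop the airport-like outliers
  ggplot(aes(x=stars, y=count)) +
  geom_point(color=pal_tab[1]) +
  labs(x="Star", y="Checkin count",
       title="Checkins vs. stars, outliers removed") +
  theme_bw()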
Finally, we would like to look at the geographic dependence of the number of reviews by aggregating over all the businesses in each state.
review_geo <- checkin_ijoin_business %>%
group_by(state) %>%
summarise(Count=n(), TotChk=sum(count), TotRev=sum(review_count), AveStar=mean(stars)) %>%
ungroup()
write.csv(review_geo, 'review_geo.csv')
We can load the exported CSV file into Tableau and map the results for
different states onto the U.S. map.
In the figure above, we plotted, for each available state, the total number of checkins (circle size) and the average stars of all businesses (color). Apparently, the present Yelp dataset only contains data for a few states, and even for those states the data might not be complete. With that being said, there is not enough data to draw any conclusion about the distribution across states. However, the combination of R and Tableau can be a very powerful way to illustrate findings and keep track of changes in a large dataset.
In the present project, we've accomplished the following tasks: building a local MySQL database from the raw JSON files, exploring the business, review, and checkin data with R and SQL, and mapping state-level summaries in Tableau.
There is actually much more that can be done with this dataset, for example using natural language processing models to understand the sentiment of reviews; the evolution of business performance could also be tracked if the data were more complete.
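As a teaser for that NLP direction, a crude per-review sentiment score (positive minus negative bing-lexicon words, sketched on a 10,000-review sample) should already track the star ratings:
set.seed(42)
review %>%
  slice_sample(n=10000) %>%
  mutate(review_no=row_number()) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by="word") %>%
  count(review_no, stars, sentiment) %>%
  pivot_wider(names_from=sentiment, values_from=n, values_fill=0) %>%
  mutate(score=positive - negative) %>%   # net sentiment per review
  group_by(stars) %>%
  summarise(mean_score=mean(score))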