This is a well-known dataset provided by Yelp for education and research purposes. Despite having been explored extensively, this dataset is still a good playground for testing one's knowledge and practicing analysis skills, including statistics, programming, and visualization with R or Python.
In the present project, we will begin with the construction of a local MySQL database, then move on to exploratory analyses inspired by a nice and extensive tutorial, and finally use the data to answer some questions that arose in my mind. The project covers the application of R, Python, SQL, Tableau, and statistics.
Note that some code was adapted from that tutorial with a small amount of modification, and some extensions have been made to go further from there.
The entire dataset contains 5 JSON files with a total size of over 8 GB. It is possible to read the data in Python, because Python can read it line by line instead of all at once. R, by contrast, imports all the data into memory, which makes the dataset difficult to handle. One solution is to build a local MySQL database, from which R can submit queries and retrieve only the subset of the data it needs.
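As a minimal sketch of that workflow (assuming the yelp database and a MySQL account like the ones set up below; the password is a placeholder), R connects to the server and pulls back only the rows it needs:
library(DBI)
# Fetch a small slice of the huge review table instead of loading
# the whole 5+ GB JSON file into memory (stars is stored as text here)
conn <- dbConnect(RMySQL::MySQL(), dbname='yelp', host='localhost',
                  user='cchen', password='...')
five_star <- dbGetQuery(conn, "SELECT business_id, stars FROM review WHERE stars = '5' LIMIT 1000")
dbDisconnect(conn)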
The following steps are involved in the present database construction; this might not be the most efficient way, but at least it worked for me.
Generate CSV files from JSON files in Python:
import json, csv
import pandas as pd

workdir = '/Users/cchen/work/yelp/'
ds_keys = ['business', 'tip', 'checkin', 'review', 'user']

for key in ds_keys:
    fname_json = f'yelp_academic_dataset_{key}.json'
    fpath_json = workdir + fname_json
    # each line of the file is one standalone JSON record
    with open(fpath_json, 'r') as f:
        data = [json.loads(line) for line in f]
    df = pd.DataFrame.from_dict(data)
    fname_csv = f'{key}.csv'
    fpath_csv = workdir + fname_csv
    # remove '\' from the text to avoid import errors in MySQL
    df = df.replace('\\\\', ' ', regex=True)
    df.to_csv(fpath_csv, index=False, quoting=csv.QUOTE_ALL)
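Before moving to MySQL, it is worth a quick sanity check that the generated CSV files look right. A minimal peek in R (using data.table, which we load later anyway):
library(data.table)
# Read only the first few rows to confirm headers and quoting survived
head(fread('/Users/cchen/work/yelp/business.csv', nrows=5))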
Build database in MySQL:
-- Create database for yelp
CREATE DATABASE yelp;
USE yelp;
-- Create table "business" for csv file importing
DROP TABLE IF EXISTS `business`;
CREATE TABLE `business` (
`business_id` text,
`name` text,
`address` text,
`city` text,
`state` text,
`postal_code` text,
`latitude` text,
`longitude` text,
`stars` text,
`review_count` text,
`is_open` text,
`attributes` text,
`categories` text,
`hours` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
LOAD DATA
LOCAL INFILE '/Users/cchen/work/yelp/business.csv'
INTO TABLE business
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 ROWS;
-- Create table "review" for csv file importing
DROP TABLE IF EXISTS `review`;
CREATE TABLE `review` (
`review_id` text,
`user_id` text,
`business_id` text,
`stars` text,
`useful` text,
`funny` text,
`cool` text,
`text` text,
`date` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
LOAD DATA
LOCAL INFILE '/Users/cchen/work/yelp/review.csv'
INTO TABLE review
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 ROWS;
-- Create table "checkin" for csv file importing
-- Note: the date field can be longer than 65,535 bytes (the limit of text), so longtext is used
DROP TABLE IF EXISTS `checkin`;
CREATE TABLE `checkin` (
`business_id` text,
`date` longtext
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
LOAD DATA
LOCAL INFILE '/Users/cchen/work/yelp/checkin.csv'
INTO TABLE checkin
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 ROWS;
-- Create table "tip" for csv file importing
DROP TABLE IF EXISTS `tip`;
CREATE TABLE `tip` (
`user_id` text,
`business_id` text,
`text` text,
`date` text,
`compliment_count` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
LOAD DATA
LOCAL INFILE '/Users/cchen/work/yelp/tip.csv'
INTO TABLE tip
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 ROWS;
-- Create table "user" for csv file importing
DROP TABLE IF EXISTS `user`;
CREATE TABLE `user` (
`user_id` text,
`name` text,
`review_count` text,
`yelping_since` text,
`useful` text,
`funny` text,
`cool` text,
`elite` text,
`friends` longtext,
`fans` text,
`average_stars` text,
`compliment_hot` text,
`compliment_more` text,
`compliment_profile` text,
`compliment_cute` text,
`compliment_list` text,
`compliment_note` text,
`compliment_plain` text,
`compliment_cool` text,
`compliment_funny` text,
`compliment_writer` text,
`compliment_photos` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
LOAD DATA
LOCAL INFILE '/Users/cchen/work/yelp/user.csv'
INTO TABLE user
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
IGNORE 1 ROWS;
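After all five imports, a quick sanity check from R (a sketch; fill in your own MySQL credentials) confirms that each table actually received rows:
library(DBI)
conn <- dbConnect(RMySQL::MySQL(), dbname='yelp', host='localhost',
                  user='cchen', password='...')
# Count the rows in each of the five freshly imported tables
for (tbl in c('business', 'review', 'checkin', 'tip', 'user')) {
  n <- dbGetQuery(conn, paste0('SELECT COUNT(*) AS n FROM `', tbl, '`'))$n
  cat(tbl, ':', n, 'rows\n')
}
dbDisconnect(conn)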
Many packages are required to process the dataset, perform analysis, and produce images.
required_packages <-
c("tidyverse", "stringr", "lubridate", "wordcloud", "tidytext", "DT",
"leaflet", "igraph", "ggraph", "topicmodels", "tm", "SnowballC", "textcat",
"rjson", "jsonlite", "scales", "ggthemes", "RMySQL", "data.table")
lapply(required_packages, library, character.only=TRUE)
setwd("/Users/cchen/work/yelp")
palname <- "Tableau 10"
pal_tab <- tableau_color_pal(palname)(10)
color_tab <- scale_color_tableau(palname, direction=-1)
fill_tab <- scale_fill_tableau(palname, direction=-1)
# Since the business table is only 287 MB, we can load it directly.
fname_business <- "yelp_academic_dataset_business.json"
business <- jsonlite::stream_in(file(fname_business), flatten=TRUE)
categories <- str_split(business$categories, ",")
categories <- as.data.frame(unlist(categories))
colnames(categories) <- c("Name")
categories$Name <- str_squish(categories$Name)
categories %>%
group_by(Name) %>%
summarise(Count=n()) %>%
arrange(desc(Count)) %>%
ungroup() %>%
mutate(Name=reorder(Name, Count)) %>%
head(10) %>%
ggplot(aes(x=Name, y=Count)) +
geom_bar(stat="identity", fill=pal_tab[1]) +
geom_text(aes(x=Name, y=1, label=paste0(Count, sep=" ")),
hjust=-0.5, vjust=0.5, size=3, colour='white') +
labs(x="Category", y="Count",
title="Top 10 categories of business") +
coord_flip() +
theme_bw()
The top 10 categories of business are listed above, where the ranking is based on the number of businesses within each category. Restaurant is the most popular one, followed by Food and Shopping, which makes sense because, from my personal point of view, these are the categories I would look at before going out. A bit surprising is that Home Services is ranked 4th, so we can find out what kind of businesses fall into this category.
hs <- business[business$categories %like% "Home Services", ]
head(hs$name, n=10)
## [1] "Jennie Deckert"
## [2] "AM&PM Locksmith"
## [3] "Absolutely Perfect Inc"
## [4] "Premier Mortgage Resources"
## [5] "Mighty Dustless"
## [6] "Tinkle Belle Diaper Service"
## [7] "Cook's Glass & Mirror"
## [8] "Leslie's"
## [9] "David Gower, Jr. - Coldwell Banker Preferred"
## [10] "Nash Painting"
city <- str_split(business$city, ",")
city <- as.data.frame(unlist(city))
colnames(city) <- c("Name")
city$Name <- str_squish(city$Name)
city %>%
group_by(Name) %>%
summarise(Count=n()) %>%
arrange(desc(Count)) %>%
ungroup() %>%
mutate(Name=reorder(Name, Count)) %>%
head(10) %>%
ggplot(aes(x=Name, y=Count)) +
geom_bar(stat="identity", fill=pal_tab[1]) +
geom_text(aes(x=Name, y=1, label=paste0(round(Count/1e3), "K", sep="")),
hjust=-0.5, vjust=0.5, size=3, colour='white') +
labs(x="City", y="Count",
title="Top 10 cities with most mentioned business") +
coord_flip() +
theme_bw()
The top 10 cities with the most businesses mentioned on Yelp are listed above. I don't think this represents the popularity of cities around the U.S., since the Yelp dataset we are querying only contains data from certain states over a limited time range. But from what is shown here, Philadelphia is the most popular one.
My family likes Orlando a lot, and Tampa is one city we would include in a vacation to Florida. Therefore, I am personally interested in learning more about the businesses in Tampa.
target_city <- "Tampa"
CityCoords <- business %>% filter(city == target_city)
center_lon <- median(CityCoords$longitude, na.rm=T)
center_lat <- median(CityCoords$latitude, na.rm=T)
leaflet(CityCoords) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
addCircles(lng=~longitude, lat=~latitude,radius=~sqrt(review_count)) %>%
setView(lng=center_lon, lat=center_lat, zoom=10)
It seems that the businesses are quite evenly distributed across Tampa, with a bit more density in the downtown area and around Tampa Bay.
As the JSON file containing the review data is over 5 GB, my laptop's memory can easily be drained by importing it directly. The following code demonstrates how to query the local MySQL database we constructed previously.
# MySQL connection
conn <- dbConnect(RMySQL::MySQL(),
dbname='yelp',
host='localhost',
port=3306,
user='cchen',
password='20151228')
dbListTables(conn) # Show tables
query <- "SELECT * FROM review LIMIT 20"
res <- dbSendQuery(conn, query)
result <- dbFetch(res, n=5)
dbClearResult(res)
query <- "SELECT business_id, stars, date, text FROM review;"
res <- dbSendQuery(conn, query)
review <- dbFetch(res, n=-1)
dbClearResult(res)
review$stars <- as.numeric(review$stars)
most5StarsReviews <- review %>%
filter(stars == 5) %>%
group_by(business_id) %>%
summarise(Count=n()) %>%
arrange(desc(Count)) %>%
ungroup() %>%
mutate(BusinessID=reorder(business_id, Count)) %>%
head(10)
most5StarsReviews <- inner_join(most5StarsReviews, business)
most5StarsReviews %>%
mutate(name=reorder(name, Count)) %>%
ggplot(aes(x=name,y=Count)) +
geom_bar(stat="identity", fill=pal_tab[1]) +
geom_text(aes(x=name, y=1, label=paste0(Count, sep="")),
hjust=-0.5, vjust=0.5, size=3, colour='white') +
labs(x="Name of the business", y="Count",
title="Top 10 business with most five star reviews") +
coord_flip() +
theme_bw()
First, the top 10 businesses with the most five-star reviews are listed above. Reading Terminal Market is the top one, which is a bit unfair because I imagine it is a shopping plaza comprising a lot of other businesses. So let's look at Oceana Grill this time.
As a business owner, spending some time reading and mining the reviews can be beneficial. It gives you feedback from your customers and tells you why your business is doing well or poorly, as well as which parts need improvement.
We can first look at the high-frequency words in the reviews through a word cloud. Word clouds might not be new to you: they use font size to indicate how often a word appears in a body of text.
# Get the business_id for "Oceana Grill"
bid_hattie <- business$business_id[business$name %like% "Oceana Grill"]
createWordCloud <- function(train) {
  train %>%
    unnest_tokens(word, text) %>%
    filter(!word %in% stop_words$word) %>%
    count(word, sort=TRUE) %>%
    ungroup() %>%
    head(100) %>%
    with(wordcloud(word, n, max.words=100, colors=rev(pal_tab)))
}
createWordCloud(review %>%
filter(business_id == bid_hattie))
From the word cloud shown above, we can get an overview of what customers are talking about in the reviews. For example, "oysters", "crab", and "shrimp" are mentioned often, indicating popular dishes people are interested in trying, whereas "amazing", "delicious", and "recommend" are words that usually appear in positive reviews. "Service" is also mentioned a lot; on its own the word does not imply anything, so the owner should spend some time checking whether those mentions are compliments or criticisms.
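One crude way to check (a sketch using the bing sentiment lexicon shipped with tidytext): keep only this business's reviews that mention "service" and count positive versus negative words in them.
review %>%
  filter(business_id == bid_hattie,
         str_detect(text, regex("service", ignore_case=TRUE))) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by="word") %>%
  count(sentiment)  # rough balance of praise vs. complaints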
ggplot(review, aes(x=stars)) +
geom_bar(width=0.5, fill=pal_tab[1]) +
labs(x="Stars", y="Count",
title="Review stars histogram") +
theme_bw()
The histogram of review stars gives us some hints about when customers tend to leave reviews: when they are very satisfied or very unsatisfied, which is why 1-star and 5-star reviews are the most common. From my personal perspective, when there is little to say about a business I would rather keep silent, and when something is bad I would choose the lowest score instead of rationally giving a 2-star review.
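To check that reading with actual numbers, here is a quick tally of each star level's share of all reviews:
# Share of each star level among all reviews
review %>%
  count(stars) %>%
  mutate(share = scales::percent(n / sum(n)))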
review$year <- year(ymd_hms(review$date))
ggplot(review, aes(x=year)) +
geom_bar(fill=pal_tab[1]) +
labs(x="Year", y="Count",
title="Review year histogram") +
theme_bw()
The histogram of review years shows some interesting trends. First, from 2005 to 2019 the number of reviews grows year after year, reflecting the fact that more and more businesses and customers are using the Yelp platform. There is then a clear drop from 2019 to 2020, followed by a slow recovery into 2021, which is likely due to the impact of SARS-CoV-2: a lot of businesses did not survive, and a lot of places were under lockdown. We can probably predict that the total number of reviews will come back to the level of 2018 and 2019, or even higher.
One of the questions I had in mind before exploring the dataset is: how is the number of reviews related to a business's star rating? Another question: would more reviews bring in more business? The answers could give business owners evidence about whether it is worth the effort to take care of their online reviews, for example by replying to reviews or contacting customers to turn weaknesses into opportunities.
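Before pulling the checkin data, a first-order look at the first question is a simple correlation on the business table already in memory (stars and review_count are numeric there):
# Crude check: do businesses with more reviews tend to have
# higher (or lower) star ratings?
cor(business$review_count, business$stars, use="complete.obs")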
To investigate, we first need to retrieve the data from the "checkin" table.
query <- "SELECT * FROM checkin;"
res <- dbSendQuery(conn, query)
checkin <- dbFetch(res, n=-1)
dbClearResult(res)
Then we can count the number of checkins for each business and make the plot.
checkin$count <- lengths(str_split(checkin$date, ","))
checkin_ijoin_business <- inner_join(checkin, business)
## Joining with `by = join_by(business_id)`
checkin_ijoin_business %>%
ggplot(aes(x=stars, y=count)) +
geom_point( color=pal_tab[1]) +
labs(x="Star", y="Checkin count",
title="Relationship between checkins and the business stars") +
theme_bw()
From the plot above, a prominent peak can be observed around 4-star businesses, and the distribution diminishes at both ends. This makes a lot of sense because people like to go to businesses with better reviews. However, there are a few outliers that can hardly be ignored, because they have a really high number of checkins while their ratings are pretty low. To figure out why, we can extract those businesses with some filters.
checkin_ijoin_business$name[checkin_ijoin_business$count > 20000 & checkin_ijoin_business$stars < 4]
## [1] "Philadelphia International Airport - PHL"
## [2] "Nashville International Airport - BNA"
## [3] "St. Louis Lambert International Airport - STL"
## [4] "Louis Armstrong New Orleans International Airport MSY"
Once the results appear on the screen, you will be like, "Aha, I get it now." These businesses represent services you might not like but have to live with.
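With the airports identified, one might redraw the scatter plot without them (a sketch reusing the same 20,000-checkin cutoff) to see the main trend more clearly:
checkin_ijoin_business %>%
  filter(count <= 20000) %>%   # drop the airport-like outliers
  ggplot(aes(x=stars, y=count)) +
  geom_point(color=pal_tab[1]) +
  labs(x="Star", y="Checkin count",
       title="Checkins vs. stars, outliers removed") +
  theme_bw()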
Finally, we would like to look at the geographic dependence of the number of reviews by aggregating over all the businesses in each state.
review_geo <- checkin_ijoin_business %>%
group_by(state) %>%
summarise(Count=n(), TotChk=sum(count), TotRev=sum(review_count), AveStar=mean(stars)) %>%
ungroup()
write.csv(review_geo, 'review_geo.csv')
We can load the exported CSV file into Tableau and map the results for
different states onto the U.S. map.
In the figure above, we plotted, for each available state, the total number of checkins (circle size) and the average stars of all businesses (color). Apparently, the present Yelp dataset only contains data for a few states, and even for those states the data might not be complete. With that being said, there is not enough data to draw any conclusion about the distribution across states. However, the combination of R and Tableau can be a very powerful way to illustrate findings and keep track of changes in a large dataset.
In the present project, we've accomplished the following tasks: building a local MySQL database from the raw JSON files, exploring the business, review, and checkin data with R and SQL, and mapping state-level summaries in Tableau.
There is actually much more that can be done with this dataset, for example using natural language processing models to understand the sentiment of reviews; the evolution of business performance could also be tracked if the data were more complete.
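As a teaser for that NLP direction, a crude per-review sentiment score (positive minus negative bing-lexicon words, sketched on a 10,000-review sample) should already track the star ratings:
set.seed(42)
review %>%
  slice_sample(n=10000) %>%
  mutate(review_no=row_number()) %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by="word") %>%
  count(review_no, stars, sentiment) %>%
  pivot_wider(names_from=sentiment, values_from=n, values_fill=0) %>%
  mutate(score=positive - negative) %>%   # net sentiment per review
  group_by(stars) %>%
  summarise(mean_score=mean(score))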