Yelp Data Analysis

Naive Bayes - Predict Ratings based on Yelp Restaurant Reviews

Do ratings and reviews match on Yelp? On popular websites like Amazon and Yelp, many products and businesses receive tens or hundreds of reviews, making it impractical for readers to read all of them. Readers generally look at the star ratings only and ignore the text. However, the relationship between the text and the rating is not obvious. In particular, several questions may be asked: why exactly did this reviewer give the restaurant 3/5 stars? In addition to the quality of food, variety, size and service time, what other features of the restaurant did the user implicitly consider, and what relative importance did they give to each? How does this relationship change if we consider a different user's rating and text review? The process of predicting this relationship for a generic user is called Review Rating Prediction. The main challenge we address is building a good predictor that effectively extracts useful features of the product from the text reviews and then quantifies their relative importance with respect to the rating. You can find the code I used on my GitHub repo.

Requirements

R==4.0.3

caret==6.0-86

tm==0.7 (provides DocumentTermMatrix)

wordcloud==2.6

Data set

About 331K rows (331,400 reviews) with 17 columns. You can download the data from https://www.kaggle.com/shikhar42/yelps-dataset

| Names | Description |
|---|---|
| business_id | ID related to each business |
| name | Name of the business |
| address | Street address of the business |
| postal_code | ZIP code of the business |
| latitude | Latitude of the business |
| longitude | Longitude of the business |
| stars | Rating given by user to the business |
| review_count | Total number of reviews a user had posted at the time of data collection |
| is_open | Restaurant open or closed |
| review_id | Unique review ID |
| user_id | User ID of the reviewer |


library(knitr)
library(kableExtra)
knitr::opts_chunk$set(message = FALSE, warning = FALSE)

# Column descriptions, rendered as a table with kableExtra
df <- data.frame(
  Names = c("business_id", "name", "address",
            "postal_code", "latitude", "longitude",
            "stars", "review_count", "is_open",
            "review_id", "user_id"),
  Description = c("ID related to each business", "Name of the business",
                  "Street address of the business", "ZIP code of the business",
                  "Latitude of the business", "Longitude of the business",
                  "Rating given by user to the business",
                  "Total number of reviews a user had posted at the time of data collection",
                  "Restaurant open or closed", "Unique review ID", "User ID of the reviewer")
)
kbl(df) %>%
  kable_paper(full_width = FALSE) %>%
  column_spec(2, width = "30em")

Using Naive Bayes to validate the review and ratings

Step 1: Import dataset

library(class)
library(knitr)
library(kableExtra)
library(caret)
library(tidyverse)   # also attaches dplyr
library(tokenizers)
library(tidytext)
library(wordcloud)
library(tm)
library(naivebayes)

# Read the cleaned review data (adjust the path to your local copy)
yelpdataset <- read.csv(file = "/Users/nselvarajan/Desktop/test/archive/cleaned.csv", sep = ",")
yelpdataset <- data.frame(yelpdataset, stringsAsFactors = FALSE)
head(yelpdataset)

##              business_id         name                address postal_code
## 1 rDMptJYWtnMhpQu_rRXHng "McDonald's" "719 E Thunderbird Rd"       85022
## 2 rDMptJYWtnMhpQu_rRXHng "McDonald's" "719 E Thunderbird Rd"       85022
## 3 rDMptJYWtnMhpQu_rRXHng "McDonald's" "719 E Thunderbird Rd"       85022
## 4 rDMptJYWtnMhpQu_rRXHng "McDonald's" "719 E Thunderbird Rd"       85022
## 5 rDMptJYWtnMhpQu_rRXHng "McDonald's" "719 E Thunderbird Rd"       85022
## 6 rDMptJYWtnMhpQu_rRXHng "McDonald's" "719 E Thunderbird Rd"       85022
##   latitude longitude stars_res review_count is_open              review_id
## 1 33.60707 -112.0644         1           10       1 bABGON0ehmb7MBJrI02l7Q
## 2 33.60707 -112.0644         1           10       1 zn7bEYAVzwWSJdSd2a4zoQ
## 3 33.60707 -112.0644         1           10       1 ONnRwv_KOLRyKyk72SzTHg
## 4 33.60707 -112.0644         1           10       1 wlcWp7STNY0Ccnpap2_Nzw
## 5 33.60707 -112.0644         1           10       1 0BsbVLK2dLyT55Nw-omXRA
## 6 33.60707 -112.0644         1           10       1 nSq8oldCoOHxhvfvc2D7SQ
##                  user_id stars    date
## 1 Ck73f1qtZbu68F_vjzsBrQ     1 2/25/16
## 2 u0JoB0Vm1ZhwF8nysxPnfg     2  6/6/11
## 3 F95NFEFwuwA__SIRt9IJNA     1 11/5/15
## 4 uHZxYHgjxhXY7PS6g2rFsA     1 6/17/12
## 5 Akt0llUBaVa1Qxi8Ogdv4Q     1 8/12/11
## 6 2gWCW1oEuyhaxrlTTghvtQ     1 8/27/17
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    text
## 1 The speed of delivery of my food order was terrible.  It took 10 minutes from the time of my order.  My order was two salads and Quarter Pounder combo and there was no waiting line inside.  The store manager was the real problem.  He was checking his Iphone constantly, distracting employees and flirting with a female employee about 20 years younger.  He had no sense of urgency whatsoever for my order or anyone else.  \n\nI worked at McDonald's during high school and college to pay for my tuition.  I believe in McDonald's as a great institution.  This was very disappointing.\n\nAvoid the night shift time...
## 2                                                                                                                                                                                                                                                                                                                                                        McDonald's is McDonald's. My "beef" with this place is they stopped selling ice and I prefer not to have to drive all over town in search of ice while wearing pajamas! \n\n(See http:\\/\\/www.yelp.com\\/biz\\/mcdonalds-restaurants-phoenix-44#hrid:jpfxQUBZMuBoIIFn-MWp7g)
## 3                                                                                                                                                                                                                                                                                                                                        I stopped by for a double quarter pounder with cheese,no pickles no onions, after work at 2 p.m. The bun was hard and crusty, the cheese was two different shades and hard at the tips. I was so disappointed when I got home and realized what I had purchased.   \n\n\nA customer no more!!!
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Was there Friday morning around 9am- rudest young guy working the order window and took our payment. That guy has no business in the customer service industry.
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                  went back there today and asked for two simple burgers with nothing but ketchup and cheese, not hard, right? apparently it is if your english is as good as your order taking skills
## 6                                                                                                                                                                                                                                                                                                                             I was told tonight at 8:30 pm that they were not serving breakfast. Excuse me? Why does their menu say all day breakfast and why was I able to order breakfast items in the evening a week before? This location is in serious need of some actual managers. Ridiculous. We left and went somewhere else.
##   useful funny cool
## 1      3     0    0
## 2      6     7    4
## 3      2     0    0
## 4      1     0    0
## 5      0     0    0
## 6      0     0    0

Step 2: Clean the Data
- Create an outcome variable: a TRUE/FALSE indicator of whether the sentiment of the review is positive, defined here as a rating above 3 stars.

yelpdataset$positive <- as.factor(yelpdataset$stars > 3)
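
As a minimal sketch of the labelling rule, assuming a toy data frame (the `toy` name and its ratings are made up for illustration), the same comparison turns star ratings into a two-level factor:

```r
# Hypothetical toy ratings, not taken from the Yelp data
toy <- data.frame(stars = c(1, 3, 4, 5, 2))

# Ratings above 3 stars are labelled positive (TRUE), the rest FALSE
toy$positive <- as.factor(toy$stars > 3)

print(toy)
print(table(toy$positive))  # class balance of the toy labels
```

Note that a 3-star review falls on the non-positive side under this rule.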

Step 3: Features and Preprocessing
- Load the data into a Corpus (a collection of documents), which is the main data structure used by tm.
- Review texts are cleaned with the tm package, which provides several cleaning functions that are applied through tm_map().
- Cleaning removes numbers, punctuation and extra whitespace, and converts all text to lowercase. Word stemming uses the Porter stemming algorithm, which strips word suffixes to retrieve the root or stem. Stopwords, that is, very common words that carry little information, are removed using tm's built-in English list (stopwords('english')). Because stemming runs before stopword removal, stemmed forms such as "veri" (from "very") no longer match the stopword list and remain in the vocabulary.
```
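myCorpus <- Corpus(VectorSource(yelpdataset$text))               # one document per review
corpus <- tm_map(myCorpus, removeNumbers)                        # drop digits
corpus <- tm_map(corpus, removePunctuation)                      # drop punctuation
corpus <- tm_map(corpus, tolower)                                # convert to lowercase
corpus <- tm_map(corpus, stemDocument, language = 'english')     # reduce words to their stems
corpus <- tm_map(corpus, removeWords, stopwords('english'))      # drop English stopwords
corpus <- tm_map(corpus, stripWhitespace)                        # collapse repeated whitespace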
```
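
The effect of the non-stemming steps can be sketched in base R on a toy string. This is an approximation for illustration, not tm's implementation: the `clean_text` helper and the tiny stopword vector are assumptions, stemming is omitted because base R has no stemmer, and the helper also trims leading and trailing spaces.

```r
# Hypothetical helper approximating removeNumbers, removePunctuation,
# tolower, removeWords and stripWhitespace with regular expressions
clean_text <- function(x, stop_words = c("the", "was", "and", "a")) {
  x <- gsub("[[:digit:]]+", "", x, perl = TRUE)   # drop numbers
  x <- gsub("[[:punct:]]+", "", x, perl = TRUE)   # drop punctuation
  x <- tolower(x)                                 # lowercase
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)          # drop whole-word stopwords
  x <- gsub("\\s+", " ", x, perl = TRUE)          # collapse whitespace runs
  trimws(x)
}

cleaned <- clean_text("The bun was hard and crusty at 2 p.m.!!!")
print(cleaned)
```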

Step 4: Making a document-term matrix

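A document-term matrix represents each document as a row and each term as a column; each cell holds the number of times the term occurs in the document.
This is the bag-of-words representation: it counts terms across a collection of documents and ignores word order.
After building the matrix, the cleaned corpus is converted back into a data frame so that the cleaned text replaces the raw review text.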

   bag_of_words <- DocumentTermMatrix(corpus)
   inspect(bag_of_words)

## <<DocumentTermMatrix (documents: 331400, terms: 158826)>>
## Non-/sparse entries: 14800215/52620136185
## Sparsity           : 100%
## Maximal term length: 256
## Weighting          : term frequency (tf)
## Sample             :
##         Terms
## Docs     food good great just like order place servic time veri
##   122288    4    1     1    2    6     1     1      0    0    1
##   126026    2    2     2    4    7     1     1      0    4    2
##   195082    1    3     2    1    3     2     4      0    4    2
##   280215    1    3     2    0    2     1     0      0    0    1
##   31177     3    6     3    2    2     7     6      6    3    2
##   330939    1    1     3    1    1     1     0      2    0    2
##   50554     2    1     2    2    4     3     3      0    4    0
##   75719     0    4     2    1    7     2     0      0    1    2
##   80183     0    1     0    5    8     1     1      1    3    1
##   91714     6    4     0   16    4     9     6      2    6    0

   # Replace the raw review text with the cleaned text from the corpus
   dataframe <- data.frame(text = unlist(sapply(corpus, `[`)), stringsAsFactors = FALSE)
   yelpdataset$text <- dataframe$text
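
To make the matrix layout concrete, here is a small base-R sketch that builds a document-term matrix by hand for three toy documents. The `docs` vector is made up, and tm's DocumentTermMatrix applies its own tokenization and minimum word length defaults, so treat this as an illustration of the idea only.

```r
# Hypothetical, already-cleaned toy documents
docs <- c("food great servic great", "food bad", "servic slow food cold")

tokens <- strsplit(docs, " ", fixed = TRUE)   # one character vector per document
vocab  <- sort(unique(unlist(tokens)))        # column names: every distinct term

# One row per document, one column per term, cells hold term counts
dtm <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(dtm) <- seq_along(docs)

print(dtm)
```

Most cells are zero even in this tiny example, which is why the real matrix above reports a sparsity of 100% after rounding.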

Step 5: Build Word Cloud
- A word cloud is a quick way to visualize the most frequent words or phrases in a body of text.
- wordcloud() takes the review text and draws the words that occur most frequently in the reviews, sized by frequency. Only the first 100 reviews are used here.
```
   y<-head(yelpdataset,100)
   library(wordcloud)
   wordcloud(y$text)
```
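
The quantity a word cloud visualizes is a term-frequency table. A minimal base-R sketch, assuming a made-up `reviews` vector of already-cleaned text, is shown below; words and counts like these are what determine the size of each word in the plot.

```r
# Hypothetical cleaned review text
reviews <- c("great food great servic", "food cold", "great place")

words <- unlist(strsplit(reviews, " ", fixed = TRUE))
freq  <- sort(table(words), decreasing = TRUE)   # most frequent terms first

print(head(freq, 3))

# With the wordcloud package attached, these counts could be plotted directly:
# wordcloud(names(freq), as.numeric(freq), min.freq = 1)
```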

Step 6: Build Word Cloud For 5 Star Reviews

The same cleaning steps are applied to the 5-star reviews only, and a word cloud is drawn from the first 100 of them.

rating5 <- subset(yelpdataset, stars == 5)  # keep only 5-star reviews
myCorpusRating5 <- Corpus(VectorSource(rating5$text))
myCorpusRating5 <- tm_map(myCorpusRating5, removeNumbers)
myCorpusRating5 <- tm_map(myCorpusRating5, removePunctuation)
myCorpusRating5 <- tm_map(myCorpusRating5, tolower)
myCorpusRating5 <- tm_map(myCorpusRating5, stemDocument, language = 'english')
myCorpusRating5 <- tm_map(myCorpusRating5, removeWords, stopwords('english'))
myCorpusRating5 <- tm_map(myCorpusRating5, stripWhitespace)
# Create a DTM to get term frequencies
bag_of_words_rating_5 <- DocumentTermMatrix(myCorpusRating5)
inspect(bag_of_words_rating_5)

## <<DocumentTermMatrix (documents: 140341, terms: 77011)>>
## Non-/sparse entries: 5285796/10802514955
## Sparsity           : 100%
## Maximal term length: 180
## Weighting          : term frequency (tf)
## Sample             :
##         Terms
## Docs     food friend good great love order place servic time veri
##   106225    2      0    4     1    4     4     5      0    5    7
##   117499    2      0    1     0    0     2     3      0    2    1
##   134018    1      1    2     3    1     2     3      0    2    1
##   140126    1      1    1     3    2     1     0      2    0    2
##   32730     0      0    4     2    3     2     0      0    1    2
##   42102     3      2    2     4    4     2     7      1    2    0
##   51703     4      0    1     1    1     1     1      0    0    1
##   55581     1      1    6     2    2     1     1      0    0    1
##   90524     0      0    3     2    0     1     4      2    4    1
##   97647     4      0    0     4    5     2     7      0    2    8

# Convert the cleaned corpus back to a data frame
dataframeRating5 <- data.frame(text = unlist(sapply(myCorpusRating5, `[`)), stringsAsFactors = FALSE)
yFiveStar <- head(dataframeRating5, 100)
# Word cloud visualization
wordcloud(yFiveStar$text)

Step 7: Model Training and Testing
- The first 24,000 rows are used for training and rows 24,000 through 331,400 for testing. This is roughly a 7% / 93% split, not a 75% / 25% split.
- After obtaining the training and test sets, we build a separate document-term matrix for each and drop the sparsest terms (sparsity threshold 0.95). Predictions on the test matrix are later compared with the actual labels. Note that the two matrices are built independently, so their term columns can differ.

# Divide the data into training and test sets.
# Note: row 24000 falls in both sets as written; use 24001:331400 for a strict split.
dataset_train <- yelpdataset[1:24000,]
dataset_test <- yelpdataset[24000:331400,]
# Create the training corpus; the text was already cleaned, so go straight to the DTM
myCorpus_model_train <- Corpus(VectorSource(dataset_train$text))
dtm_train <- DocumentTermMatrix(myCorpus_model_train)
dtm_train <- removeSparseTerms(dtm_train, 0.95)
# Create the test corpus the same way
myCorpus_model_test <- Corpus(VectorSource(dataset_test$text))
dtm_test <- DocumentTermMatrix(myCorpus_model_test)
dtm_test <- removeSparseTerms(dtm_test, 0.95)
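
If a 75% / 25% split is wanted, one option is a seeded random split instead of slicing by row position. The sketch below uses base R on a made-up data frame; the `toy` data, the seed and the variable names are assumptions for illustration, and it is not the split that produced the results in this document.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the review data: 1000 rows with a binary label
toy <- data.frame(id = 1:1000,
                  positive = factor(rep(c(TRUE, FALSE), times = c(700, 300))))

# Draw 75% of the row indices at random for training
train_idx <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]   # remaining 25%, disjoint from training

print(c(train = nrow(toy_train), test = nrow(toy_test)))
print(round(prop.table(table(toy_train$positive)), 2))  # class balance in training
```

The first rows of the source file all belong to the same business, so the data appears to be grouped by business; a random split would avoid training on only the first few businesses.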

Step 8: Making predictions
We train a Naive Bayes classifier on the training set and evaluate it on the test set.
We apply Laplace smoothing, which is a technique for smoothing categorical data.
A small-sample correction, or pseudo-count, is added to every count before the probabilities are estimated, so no estimated probability is zero. This is a way of regularizing Naive Bayes; when the pseudo-count is one, it is called Laplace (add-one) smoothing.
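
As a worked illustration of the pseudo-count, the sketch below estimates P(word | class) for a categorical bag-of-words model with and without add-one smoothing. The counts are made up, and the `smoothed_prob` helper is hypothetical; it is not how the naivebayes package computes its tables.

```r
# Hypothetical word counts within one class; "awful" was never seen
counts <- c(great = 3, food = 2, awful = 0)

# Additive smoothing: (count + alpha) / (total + alpha * vocabulary size)
smoothed_prob <- function(counts, alpha) {
  (counts + alpha) / (sum(counts) + alpha * length(counts))
}

print(smoothed_prob(counts, alpha = 0))  # unsmoothed: "awful" gets probability 0
print(smoothed_prob(counts, alpha = 1))  # Laplace: every word gets a non-zero share
```

Without smoothing, a single unseen word would force the whole class likelihood to zero; with alpha = 1 the probabilities still sum to one but none is zero.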

model <- naive_bayes(as.data.frame(as.matrix(dtm_train)), dataset_train$positive, laplace = 1)
model

##
## ================================== Naive Bayes ==================================
##
##  Call:
## naive_bayes.default(x = as.data.frame(as.matrix(dtm_train)),
##     y = dataset_train$positive, laplace = 1)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 1
##
## ---------------------------------------------------------------------------------
##
##  A priori probabilities:
##
##     FALSE      TRUE
## 0.2692917 0.7307083
##
## ---------------------------------------------------------------------------------
##
##  Tables:
##
## ---------------------------------------------------------------------------------
##  ::: check (Gaussian)
## ---------------------------------------------------------------------------------
##
## check       FALSE       TRUE
##   mean 0.10165558 0.05628101
##   sd   0.40198599 0.26564318
##
## ---------------------------------------------------------------------------------
##  ::: disappoint (Gaussian)
## ---------------------------------------------------------------------------------
##
## disappoint      FALSE       TRUE
##       mean 0.14265821 0.05730741
##       sd   0.39347611 0.24275615
##
## ---------------------------------------------------------------------------------
##  ::: food (Gaussian)
## ---------------------------------------------------------------------------------
##
## food       FALSE      TRUE
##   mean 0.9413585 0.6349433
##   sd   1.1963510 0.8701281
##
## ---------------------------------------------------------------------------------
##  ::: great (Gaussian)
## ---------------------------------------------------------------------------------
##
## great      FALSE      TRUE
##   mean 0.2846975 0.5872156
##   sd   0.6195211 0.8539856
##
## ---------------------------------------------------------------------------------
##  ::: high (Gaussian)
## ---------------------------------------------------------------------------------
##
## high        FALSE       TRUE
##   mean 0.07086492 0.07692308
##   sd   0.28025496 0.28708026
##
## ---------------------------------------------------------------------------------
##
## # ... and 176 more tables
##
## ---------------------------------------------------------------------------------
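
The Gaussian tables above can be read as class-conditional densities. As a rough sketch, the following combines the printed priors with the printed mean and sd for the term "great" to score a review in which that term appears twice, ignoring all other features. The choice of two occurrences and the single-feature simplification are assumptions for illustration.

```r
# Values copied from the model printout above
prior <- c("FALSE" = 0.2692917, "TRUE" = 0.7307083)
mu    <- c("FALSE" = 0.2846975, "TRUE" = 0.5872156)   # mean count of "great"
sigma <- c("FALSE" = 0.6195211, "TRUE" = 0.8539856)   # sd of the count of "great"

x <- 2  # hypothetical review in which "great" occurs twice

# Bayes' rule with a Gaussian likelihood for the single feature
unnormalised <- prior * dnorm(x, mean = mu, sd = sigma)
posterior <- unnormalised / sum(unnormalised)

print(round(posterior, 3))
```

The full model multiplies one such likelihood per term, so a real prediction depends on all of the tables, not just this one.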

Interpretation of the results and prediction accuracy achieved

Evaluate the model performance using confusionMatrix
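The accuracy of our model on the test set is 72.8%, compared with a no-information rate of 68.1% (always predicting the majority class).
We can visualise the model's performance using a confusion matrix.
The output also reports sensitivity, specificity and predictive values on the test set. Note that caret treats FALSE (a non-positive review) as the 'Positive' class here, so sensitivity measures how well non-positive reviews are detected.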

model_predict <- predict(model, as.data.frame(as.matrix(dtm_test)))
confusionMatrix(model_predict, dataset_test$positive)

## Confusion Matrix and Statistics
##
##           Reference
## Prediction  FALSE   TRUE
##      FALSE  46645  32147
##      TRUE   51353 177256
##
##                Accuracy : 0.7284
##                  95% CI : (0.7268, 0.7299)
##     No Information Rate : 0.6812
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.3402
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
##             Sensitivity : 0.4760
##             Specificity : 0.8465
##          Pos Pred Value : 0.5920
##          Neg Pred Value : 0.7754
##              Prevalence : 0.3188
##          Detection Rate : 0.1517
##    Detection Prevalence : 0.2563
##       Balanced Accuracy : 0.6612
##
##        'Positive' Class : FALSE
##
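
The headline statistics can be reproduced by hand from the four cell counts, which helps clarify what each one measures. The sketch below uses the counts printed above with FALSE as the positive class, as caret does here; the variable names are illustrative.

```r
# Cell counts from the confusion matrix above (positive class = FALSE)
tp <- 46645    # predicted FALSE, actual FALSE
fp <- 32147    # predicted FALSE, actual TRUE
fn <- 51353    # predicted TRUE,  actual FALSE
tn <- 177256   # predicted TRUE,  actual TRUE
n  <- tp + fp + fn + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)              # share of actual FALSE correctly found
specificity <- tn / (tn + fp)              # share of actual TRUE correctly found
nir         <- max(tp + fn, tn + fp) / n   # always predict the majority class

# Cohen's kappa: agreement beyond what the marginal totals give by chance
expected    <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, nir = nir, kappa = cohen_kappa), 4))
```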

Evaluate the model performance using CrossTable

library(gmodels)
CrossTable(model_predict, dataset_test$positive,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))

##
##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table:  307401
##
##
##              | actual
##    predicted |     FALSE |      TRUE | Row Total |
## -------------|-----------|-----------|-----------|
##        FALSE |     46645 |     32147 |     78792 |
##              |     0.476 |     0.154 |           |
## -------------|-----------|-----------|-----------|
##         TRUE |     51353 |    177256 |    228609 |
##              |     0.524 |     0.846 |           |
## -------------|-----------|-----------|-----------|
## Column Total |     97998 |    209403 |    307401 |
##              |     0.319 |     0.681 |           |
## -------------|-----------|-----------|-----------|
##
##

Overall insights obtained from the implemented project

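Overall accuracy of the model is 72.8%, modestly above the 68.1% no-information rate, with a Kappa of 0.34. This suggests that a Naive Bayes model trained on review text captures real but limited signal about the rating.
Sensitivity is 0.4760: fewer than half of the non-positive reviews (3 stars or below) are identified.
Specificity is 0.8465: most of the positive reviews (above 3 stars) are identified.
Adding the Laplace smoothing factor did not make much difference to the accuracy. A likely reason is that every term count was modelled as a Gaussian feature (see the model tables), whereas Laplace smoothing adjusts probability estimates for categorical features.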
