- Setup
- Load packages
- Load data
- Part 1: Data
- Part 2: Research question
- Part 3: Exploratory data analysis
- Visualizing spread of critics and audience scores by movie genres
- Summary statistics of critics and audience scores by genre
- Part 4: Modeling
- Model Selection
- Model Diagnostics
- Part 5: Prediction
- Part 6: Conclusion
Setup
Load packages
library(ggplot2)
library(dplyr)
library(statsr)
Load data
load("movies.Rdata")
Part 1: Data
The dataset comprises 651 randomly sampled movies produced and released before 2016. Information on these movies was obtained from the Internet Movie Database (IMDb) and Rotten Tomatoes.
The data collection method involved random sampling but no random assignment. Therefore, the results of this project are generalizable, but we cannot infer causation from them, only association.
Part 2: Research question
The aim of my research question is to identify movie attributes that are significantly associated with higher or lower critics/audience scores. These attributes will be used to develop two models to predict critics and audience scores respectively.
Movies popular with critics may not be popular with audiences and vice versa. I would like to:
- Identify movie attributes that are associated with higher or lower critics/audience scores.
- Predict the critics and audience score for a movie so we can determine whether or not a movie will be a hit with either critics or audiences (or both).
Please note that the dataset contains some missing values. For my regression analysis, I chose to drop the observations that contain them.
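As a quick check (a one-line sketch; movies is the data frame loaded above), we can count how many rows contain at least one missing value before dropping them:
#Count rows with at least one missing value
sum(!complete.cases(movies))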
Response variables
- critics_score: Critics score on Rotten Tomatoes
- audience_score: Audience score on Rotten Tomatoes
Explanatory variables
- genre: Genre of movie
- runtime: Runtime of movie (in minutes)
- best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
- best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
- best_actor_win: Whether or not one of the main actors in the movie has ever won an Oscar (no, yes)
- best_actress_win: Whether or not one of the main actresses in the movie has ever won an Oscar (no, yes)
- best_dir_win: Whether or not the director of the movie has ever won an Oscar (no, yes)
- top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
Part 3: Exploratory data analysis
Distribution of critics and audience scores
There is a duplicate row in the dataset; let's drop it.
#Drop the duplicate row (row 244)
unique_mov <- movies[-244, ]
#Drop missing data from the analysis
unique_mov <- na.omit(unique_mov)
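Rather than hardcoding row 244, the duplicate row could also be located programmatically; a minimal sketch using base R:
#Indices of rows that are exact duplicates of an earlier row
which(duplicated(movies))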
#Histogram of audience scores
ggplot(data = unique_mov, aes(x = audience_score)) + geom_histogram(binwidth = 10) + scale_x_continuous("Audience Score", breaks = seq(0, 100, by = 10))
#Histogram of critics scores
ggplot(data = unique_mov, aes(x = critics_score)) + geom_histogram(binwidth = 10) + scale_x_continuous("Critics Score", breaks = seq(0, 100, by = 10))
The distribution of audience scores is bimodal, with two roughly equal peaks around 80 and 85. The majority of scores are concentrated in the range 30 to 90, with 20 or more movies in each score bin over that interval. The distribution also appears to be slightly left skewed.
The distribution of critics scores is unimodal with a peak at 90. Furthermore, critics' scores appear to be roughly uniformly distributed between 10 and 60. This distribution does not seem to have any significant skew to the left or right.
Barplot of Genres
ggplot(data = unique_mov, aes(x = genre)) + geom_bar() + theme(text = element_text(size = 12), axis.text.x = element_text(angle = 90, vjust = 1))
The barplot shows that the Drama genre has the highest number of movies in the dataset.
Visualizing spread of critics and audience scores by movie genres
#Boxplot showing variability of audience scores by movie genre
ggplot(data = unique_mov, aes(x = genre, y = audience_score)) + geom_boxplot() + theme(text = element_text(size = 12), axis.text.x = element_text(angle = 90, vjust = 1)) + scale_y_continuous("Audience score")
#Boxplot showing variability of critics' scores by movie genre
ggplot(data = unique_mov, aes(x = genre, y = critics_score)) + geom_boxplot() + theme(text = element_text(size = 12), axis.text.x = element_text(angle = 90, vjust = 1)) + scale_y_continuous("Critics score")
From the boxplots above, we can see that critics scores tend to have more variability than audience scores for most movie genres.
Summary statistics of critics and audience scores by genre
#Summary statistics of audience scores by genre
unique_mov %>%
group_by(genre) %>%
summarise(mean_score = mean(audience_score), median_score = median(audience_score), sd_score = sd(audience_score), n = n())
## Source: local data frame [11 x 5]
##
## genre mean_score median_score sd_score n
## <fctr> <dbl> <dbl> <dbl> <int>
## 1 Action & Adventure 53.91935 51.5 20.006801 62
## 2 Animation 62.37500 67.5 20.982561 8
## 3 Art House & International 69.50000 71.0 13.641514 12
## 4 Comedy 52.24419 49.5 19.201676 86
## 5 Documentary 83.23077 86.0 8.594556 39
## 6 Drama 65.28859 70.0 18.630483 298
## 7 Horror 45.90909 42.0 16.480606 22
## 8 Musical & Performing Arts 80.16667 80.5 11.360484 12
## 9 Mystery & Suspense 55.19643 51.5 19.277325 56
## 10 Other 69.73333 74.0 18.491182 15
## 11 Science Fiction & Fantasy 55.12500 53.0 25.759534 8
#Summary statistics of critics scores by genre
unique_mov %>%
group_by(genre) %>%
summarise(mean_score = mean(critics_score), median_score = median(critics_score), sd_score = sd(critics_score), n = n())
## Source: local data frame [11 x 5]
##
## genre mean_score median_score sd_score n
## <fctr> <dbl> <dbl> <dbl> <int>
## 1 Action & Adventure 42.24194 37.0 27.83599 62
## 2 Animation 51.12500 58.0 32.25983 8
## 3 Art House & International 56.91667 65.5 31.06725 12
## 4 Comedy 40.93023 36.0 27.62386 86
## 5 Documentary 86.33333 92.0 17.20516 39
## 6 Drama 61.98993 67.0 25.02019 298
## 7 Horror 42.04545 38.0 25.24636 22
## 8 Musical & Performing Arts 76.66667 89.0 26.53071 12
## 9 Mystery & Suspense 54.89286 60.5 27.56393 56
## 10 Other 67.73333 72.0 26.45337 15
## 11 Science Fiction & Fantasy 55.37500 68.0 28.50031 8
The summary statistics indicate that genres such as Action & Adventure, Animation, Comedy, and Art House & International are more popular with audiences than with critics. The only genre with higher summary statistics for critics scores is Documentary. Other genres have similar summary statistics for both critics and audience scores.
Part 4: Modeling
#Model to predict audience scores
audience_score_model <- lm(audience_score ~ genre + runtime + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = unique_mov)
summary(audience_score_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + best_pic_nom +
## best_pic_win + best_actor_win + best_actress_win + best_dir_win +
## top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.914 -12.090 0.903 12.667 40.294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.77370 4.92429 8.483 < 2e-16 ***
## genreAnimation 11.77660 6.68231 1.762 0.07852 .
## genreArt House & International 16.96864 5.59278 3.034 0.00252 **
## genreComedy 0.07318 2.98850 0.024 0.98047
## genreDocumentary 30.93089 3.63738 8.504 < 2e-16 ***
## genreDrama 10.74960 2.52580 4.256 2.42e-05 ***
## genreHorror -5.84919 4.42274 -1.323 0.18650
## genreMusical & Performing Arts 26.10499 5.59825 4.663 3.84e-06 ***
## genreMystery & Suspense 1.28949 3.31620 0.389 0.69753
## genreOther 12.99451 5.12437 2.536 0.01147 *
## genreScience Fiction & Fantasy 0.39472 6.64330 0.059 0.95264
## runtime 0.10547 0.04290 2.458 0.01424 *
## best_pic_nomyes 19.58898 4.54392 4.311 1.90e-05 ***
## best_pic_winyes -1.22161 8.05756 -0.152 0.87955
## best_actor_winyes -1.85801 2.12682 -0.874 0.38268
## best_actress_winyes -2.53213 2.34515 -1.080 0.28070
## best_dir_winyes 5.34985 3.03261 1.764 0.07822 .
## top200_boxyes 12.45183 4.74057 2.627 0.00884 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.65 on 600 degrees of freedom
## Multiple R-squared: 0.2551, Adjusted R-squared: 0.234
## F-statistic: 12.09 on 17 and 600 DF, p-value: < 2.2e-16
#Model to predict critics scores
critics_score_model <- lm(critics_score ~ genre + runtime + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = unique_mov)
summary(critics_score_model)
##
## Call:
## lm(formula = critics_score ~ genre + runtime + best_pic_nom +
## best_pic_win + best_actor_win + best_actress_win + best_dir_win +
## top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.294 -20.706 1.971 19.085 58.933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.5993 6.9668 4.105 4.60e-05 ***
## genreAnimation 12.9685 9.4540 1.372 0.17066
## genreArt House & International 16.9204 7.9126 2.138 0.03289 *
## genreComedy 0.7965 4.2281 0.188 0.85065
## genreDocumentary 46.6674 5.1461 9.068 < 2e-16 ***
## genreDrama 18.8383 3.5735 5.272 1.89e-07 ***
## genreHorror 2.7291 6.2572 0.436 0.66288
## genreMusical & Performing Arts 34.6387 7.9203 4.373 1.44e-05 ***
## genreMystery & Suspense 12.1806 4.6917 2.596 0.00966 **
## genreOther 22.0525 7.2499 3.042 0.00245 **
## genreScience Fiction & Fantasy 11.7458 9.3988 1.250 0.21189
## runtime 0.1102 0.0607 1.815 0.06998 .
## best_pic_nomyes 22.2377 6.4287 3.459 0.00058 ***
## best_pic_winyes 0.4397 11.3997 0.039 0.96925
## best_actor_winyes -1.1342 3.0090 -0.377 0.70635
## best_actress_winyes -0.7529 3.3179 -0.227 0.82057
## best_dir_winyes 11.8695 4.2905 2.766 0.00584 **
## top200_boxyes 18.6753 6.7069 2.785 0.00553 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.97 on 600 degrees of freedom
## Multiple R-squared: 0.2425, Adjusted R-squared: 0.2211
## F-statistic: 11.3 on 17 and 600 DF, p-value: < 2.2e-16
Model Selection
Since I am trying to find the statistically significant predictors of critics/audience scores, I use backward elimination with the p-value criterion: at each step, drop the predictor with the highest p-value, refit, and repeat until all remaining predictors are significant.
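This procedure can be sketched as a loop. The sketch below is illustrative rather than the exact code used in this section: it relies on drop1() to test each term as a whole, so a multi-level factor such as genre is kept or dropped as a unit, whereas below I step through the eliminations manually.
#Illustrative backward elimination by p-value, testing whole terms
backward_eliminate <- function(formula, data, alpha = 0.05) {
  model <- lm(formula, data = data)
  repeat {
    #One F test per term; the first row ("<none>") is the current model
    tests <- drop1(model, test = "F")[-1, ]
    worst <- which.max(tests[["Pr(>F)"]])
    if (tests[["Pr(>F)"]][worst] <= alpha) break  #all terms significant
    #Refit without the least significant term
    model <- update(model, paste(". ~ . -", rownames(tests)[worst]))
  }
  model
}
For example, backward_eliminate(audience_score ~ genre + runtime + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, unique_mov) would apply the same criterion automatically.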
Audience scores model
Setting aside the individual genre levels, which belong to a single factor and are not dropped one by one, the predictor best_pic_win has the highest p-value, so it will be dropped first. Note that this variable has only two levels (yes/no), so its single p-value speaks for the whole variable and it is safe to drop.
best_aud_model <- lm(audience_score ~ genre + runtime + best_pic_nom + best_actor_win + best_actress_win + best_dir_win + top200_box, data = unique_mov)
summary(best_aud_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + best_pic_nom +
## best_actor_win + best_actress_win + best_dir_win + top200_box,
## data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.921 -12.097 0.921 12.641 40.295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.81206 4.91379 8.509 < 2e-16 ***
## genreAnimation 11.76300 6.67628 1.762 0.07859 .
## genreArt House & International 16.96196 5.58806 3.035 0.00251 **
## genreComedy 0.05819 2.98443 0.019 0.98445
## genreDocumentary 30.92160 3.63391 8.509 < 2e-16 ***
## genreDrama 10.74987 2.52375 4.259 2.38e-05 ***
## genreHorror -5.85474 4.41899 -1.325 0.18571
## genreMusical & Performing Arts 26.10848 5.59365 4.668 3.76e-06 ***
## genreMystery & Suspense 1.28250 3.31319 0.387 0.69883
## genreOther 13.03691 5.11257 2.550 0.01102 *
## genreScience Fiction & Fantasy 0.40627 6.63746 0.061 0.95121
## runtime 0.10518 0.04283 2.456 0.01434 *
## best_pic_nomyes 19.28849 4.08558 4.721 2.92e-06 ***
## best_actor_winyes -1.83185 2.11809 -0.865 0.38746
## best_actress_winyes -2.54992 2.34030 -1.090 0.27634
## best_dir_winyes 5.21783 2.90253 1.798 0.07273 .
## top200_boxyes 12.42088 4.73233 2.625 0.00889 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.64 on 601 degrees of freedom
## Multiple R-squared: 0.2551, Adjusted R-squared: 0.2352
## F-statistic: 12.86 on 16 and 601 DF, p-value: < 2.2e-16
#Next we drop 'best_actor_win'. Note that even though genreComedy has a higher p-value, we don't drop it: genre is a single factor and other levels of it are significant (see the drop1() check after this output).
best_aud_model <- lm(audience_score ~ genre + runtime + best_pic_nom + best_actress_win + best_dir_win + top200_box, data = unique_mov)
summary(best_aud_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + best_pic_nom +
## best_actress_win + best_dir_win + top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.650 -12.151 1.022 12.692 40.358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.32265 4.87718 8.678 < 2e-16 ***
## genreAnimation 11.64108 6.67339 1.744 0.08160 .
## genreArt House & International 17.16705 5.58186 3.076 0.00220 **
## genreComedy 0.05448 2.98381 0.018 0.98544
## genreDocumentary 31.05200 3.63002 8.554 < 2e-16 ***
## genreDrama 10.70198 2.52261 4.242 2.56e-05 ***
## genreHorror -5.73540 4.41591 -1.299 0.19451
## genreMusical & Performing Arts 26.22260 5.59093 4.690 3.38e-06 ***
## genreMystery & Suspense 1.03635 3.30025 0.314 0.75361
## genreOther 12.98759 5.11119 2.541 0.01130 *
## genreScience Fiction & Fantasy 0.60163 6.63222 0.091 0.92775
## runtime 0.09839 0.04209 2.337 0.01974 *
## best_pic_nomyes 18.98872 4.07000 4.666 3.80e-06 ***
## best_actress_winyes -2.66234 2.33620 -1.140 0.25490
## best_dir_winyes 5.15994 2.90115 1.779 0.07581 .
## top200_boxyes 12.35954 4.73080 2.613 0.00921 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.63 on 602 degrees of freedom
## Multiple R-squared: 0.2542, Adjusted R-squared: 0.2356
## F-statistic: 13.68 on 15 and 602 DF, p-value: < 2.2e-16
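Because genre enters the model as a single factor, its levels are kept or dropped together. Base R's drop1() makes this explicit by reporting one F test per term rather than per level; a quick check on the current model:
#F test for removing each remaining term as a unit
drop1(best_aud_model, test = "F")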
#Drop best_actress_win
best_aud_model <- lm(audience_score ~ genre + runtime + best_pic_nom + best_dir_win + top200_box, data = unique_mov)
summary(best_aud_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + best_pic_nom +
## best_dir_win + top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.297 -12.780 1.107 12.619 40.548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.95766 4.84644 8.864 < 2e-16 ***
## genreAnimation 11.18871 6.66323 1.679 0.09364 .
## genreArt House & International 16.91334 5.57880 3.032 0.00254 **
## genreComedy -0.28256 2.96985 -0.095 0.92423
## genreDocumentary 30.93374 3.62943 8.523 < 2e-16 ***
## genreDrama 10.34021 2.50318 4.131 4.13e-05 ***
## genreHorror -5.83222 4.41619 -1.321 0.18712
## genreMusical & Performing Arts 26.25191 5.59225 4.694 3.31e-06 ***
## genreMystery & Suspense 0.60188 3.27896 0.184 0.85442
## genreOther 12.76553 5.10874 2.499 0.01273 *
## genreScience Fiction & Fantasy 0.60429 6.63387 0.091 0.92745
## runtime 0.09259 0.04179 2.215 0.02711 *
## best_pic_nomyes 18.31730 4.02812 4.547 6.57e-06 ***
## best_dir_winyes 5.10450 2.90146 1.759 0.07904 .
## top200_boxyes 12.03445 4.72336 2.548 0.01109 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.64 on 603 degrees of freedom
## Multiple R-squared: 0.2525, Adjusted R-squared: 0.2352
## F-statistic: 14.55 on 14 and 603 DF, p-value: < 2.2e-16
#Finally drop best_dir_win
best_aud_model <- lm(audience_score ~ genre + runtime + best_pic_nom + top200_box, data = unique_mov)
summary(best_aud_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + best_pic_nom +
## top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.643 -12.477 0.953 12.581 40.516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.69153 4.80101 8.684 < 2e-16 ***
## genreAnimation 11.09461 6.67456 1.662 0.09699 .
## genreArt House & International 16.58775 5.58539 2.970 0.00310 **
## genreComedy -0.27064 2.97499 -0.091 0.92755
## genreDocumentary 30.65603 3.63228 8.440 2.36e-16 ***
## genreDrama 10.29897 2.50740 4.107 4.55e-05 ***
## genreHorror -5.74769 4.42358 -1.299 0.19433
## genreMusical & Performing Arts 26.20247 5.60187 4.677 3.59e-06 ***
## genreMystery & Suspense 0.70227 3.28415 0.214 0.83075
## genreOther 12.56243 5.11628 2.455 0.01435 *
## genreScience Fiction & Fantasy 0.95383 6.64238 0.144 0.88587
## runtime 0.10789 0.04095 2.635 0.00863 **
## best_pic_nomyes 18.97423 4.01773 4.723 2.90e-06 ***
## top200_boxyes 12.01338 4.73153 2.539 0.01137 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.67 on 604 degrees of freedom
## Multiple R-squared: 0.2487, Adjusted R-squared: 0.2325
## F-statistic: 15.38 on 13 and 604 DF, p-value: < 2.2e-16
For audiences, the significant predictors are:
- genre (5 out of 10 levels are significant predictors). They are:
- Art House & International
- Documentary
- Drama
- Musical & Performing Arts
- Other
- runtime
- best_pic_nom
- top200_box
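To quantify these associations, we can also compute 95% confidence intervals for the coefficients of the final audience model:
#95% confidence intervals for the coefficients
confint(best_aud_model, level = 0.95)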
Critics scores model
I again use the p-value criterion to drop predictors from the model. The predictor best_pic_win has the highest p-value, so it will be dropped from the model first.
best_critics_model <- lm(critics_score ~ genre + runtime + best_pic_nom + best_actor_win + best_actress_win + best_dir_win + top200_box, data = unique_mov)
summary(best_critics_model)
##
## Call:
## lm(formula = critics_score ~ genre + runtime + best_pic_nom +
## best_actor_win + best_actress_win + best_dir_win + top200_box,
## data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.293 -20.744 1.958 19.065 58.937
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.58553 6.95182 4.112 4.47e-05 ***
## genreAnimation 12.97335 9.44532 1.374 0.170102
## genreArt House & International 16.92276 7.90576 2.141 0.032711 *
## genreComedy 0.80185 4.22225 0.190 0.849443
## genreDocumentary 46.67071 5.14110 9.078 < 2e-16 ***
## genreDrama 18.83821 3.57049 5.276 1.84e-07 ***
## genreHorror 2.73109 6.25181 0.437 0.662379
## genreMusical & Performing Arts 34.63741 7.91367 4.377 1.42e-05 ***
## genreMystery & Suspense 12.18313 4.68736 2.599 0.009575 **
## genreOther 22.03728 7.23306 3.047 0.002415 **
## genreScience Fiction & Fantasy 11.74162 9.39040 1.250 0.211645
## runtime 0.11029 0.06059 1.820 0.069203 .
## best_pic_nomyes 22.34584 5.78010 3.866 0.000123 ***
## best_actor_winyes -1.14362 2.99658 -0.382 0.702862
## best_actress_winyes -0.74646 3.31096 -0.225 0.821705
## best_dir_winyes 11.91697 4.10638 2.902 0.003843 **
## top200_boxyes 18.68647 6.69510 2.791 0.005420 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.95 on 601 degrees of freedom
## Multiple R-squared: 0.2425, Adjusted R-squared: 0.2224
## F-statistic: 12.03 on 16 and 601 DF, p-value: < 2.2e-16
#Drop best_actress_win
best_critics_model <- lm(critics_score ~ genre + runtime + best_pic_nom + best_actor_win + best_dir_win + top200_box, data = unique_mov)
summary(best_critics_model)
##
## Call:
## lm(formula = critics_score ~ genre + runtime + best_pic_nom +
## best_actor_win + best_dir_win + top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.289 -20.623 2.059 19.125 58.910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.75257 6.90678 4.163 3.60e-05 ***
## genreAnimation 12.84941 9.42187 1.364 0.173145
## genreArt House & International 16.84765 7.89250 2.135 0.033194 *
## genreComedy 0.70772 4.19824 0.169 0.866187
## genreDocumentary 46.63499 5.13460 9.082 < 2e-16 ***
## genreDrama 18.73807 3.53996 5.293 1.69e-07 ***
## genreHorror 2.70158 6.24551 0.433 0.665486
## genreMusical & Performing Arts 34.64327 7.90738 4.381 1.39e-05 ***
## genreMystery & Suspense 12.06673 4.65516 2.592 0.009770 **
## genreOther 21.97622 7.22228 3.043 0.002446 **
## genreScience Fiction & Fantasy 11.73837 9.38298 1.251 0.211410
## runtime 0.10881 0.06018 1.808 0.071108 .
## best_pic_nomyes 22.16431 5.71923 3.875 0.000118 ***
## best_actor_winyes -1.18114 2.98960 -0.395 0.692921
## best_dir_winyes 11.90266 4.10265 2.901 0.003853 **
## top200_boxyes 18.59686 6.67802 2.785 0.005525 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.93 on 602 degrees of freedom
## Multiple R-squared: 0.2425, Adjusted R-squared: 0.2236
## F-statistic: 12.85 on 15 and 602 DF, p-value: < 2.2e-16
#Drop best_actor_win
best_critics_model <- lm(critics_score ~ genre + runtime + best_pic_nom + best_dir_win + top200_box, data = unique_mov)
summary(best_critics_model)
##
## Call:
## lm(formula = critics_score ~ genre + runtime + best_pic_nom +
## best_dir_win + top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.304 -20.700 2.277 19.060 58.995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.09907 6.84607 4.250 2.47e-05 ***
## genreAnimation 12.75848 9.41247 1.355 0.175770
## genreArt House & International 16.97298 7.88060 2.154 0.031654 *
## genreComedy 0.69615 4.19520 0.166 0.868260
## genreDocumentary 46.71584 5.12693 9.112 < 2e-16 ***
## genreDrama 18.69734 3.53598 5.288 1.73e-07 ***
## genreHorror 2.77589 6.23831 0.445 0.656498
## genreMusical & Performing Arts 34.71765 7.89961 4.395 1.31e-05 ***
## genreMystery & Suspense 11.89619 4.63186 2.568 0.010458 *
## genreOther 21.93837 7.21659 3.040 0.002468 **
## genreScience Fiction & Fantasy 11.86440 9.37099 1.266 0.205974
## runtime 0.10428 0.05904 1.766 0.077858 .
## best_pic_nomyes 21.95274 5.69012 3.858 0.000127 ***
## best_dir_winyes 11.86382 4.09860 2.895 0.003934 **
## top200_boxyes 18.54846 6.67222 2.780 0.005606 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.92 on 603 degrees of freedom
## Multiple R-squared: 0.2423, Adjusted R-squared: 0.2247
## F-statistic: 13.77 on 14 and 603 DF, p-value: < 2.2e-16
#Drop runtime
best_critics_model <- lm(critics_score ~ genre + best_pic_nom + best_dir_win + top200_box, data = unique_mov)
summary(best_critics_model)
##
## Call:
## lm(formula = critics_score ~ genre + best_pic_nom + best_dir_win +
## top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.333 -21.017 2.147 18.909 58.229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.77134 3.22424 12.335 < 2e-16 ***
## genreAnimation 11.35366 9.39524 1.208 0.227348
## genreArt House & International 17.14532 7.89382 2.172 0.030244 *
## genreComedy 0.02477 4.18527 0.006 0.995279
## genreDocumentary 46.56199 5.13518 9.067 < 2e-16 ***
## genreDrama 19.31990 3.52454 5.482 6.20e-08 ***
## genreHorror 1.66635 6.21748 0.268 0.788782
## genreMusical & Performing Arts 35.78110 7.89044 4.535 6.96e-06 ***
## genreMystery & Suspense 12.47174 4.62848 2.695 0.007244 **
## genreOther 22.52639 7.22155 3.119 0.001899 **
## genreScience Fiction & Fantasy 11.43997 9.38433 1.219 0.223301
## best_pic_nomyes 24.11224 5.56696 4.331 1.74e-05 ***
## best_dir_winyes 13.37070 4.01585 3.329 0.000923 ***
## top200_boxyes 19.93878 6.63724 3.004 0.002774 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.96 on 604 degrees of freedom
## Multiple R-squared: 0.2383, Adjusted R-squared: 0.222
## F-statistic: 14.54 on 13 and 604 DF, p-value: < 2.2e-16
For critics, the significant predictors are:
- genre (6 out of 10 levels are significant predictors). They are:
- Art House & International
- Documentary
- Drama
- Musical & Performing Arts
- Mystery & Suspense
- Other
- best_pic_nom
- best_dir_win
- top200_box
Model Diagnostics
Next, I ran model diagnostics on both models to check whether the conditions for linear regression are satisfied.
Diagnostics for audience scores model
#Residuals vs observed audience scores
plot(unique_mov$audience_score, best_aud_model$residuals)
#Add red horizontal line to the plot at zero
abline(h = 0, col = "red")
#Histogram of residuals
hist(best_aud_model$residuals)
qqnorm(best_aud_model$residuals)
qqline(best_aud_model$residuals)
Diagnostics for critics scores model
#Residuals vs observed critics scores
plot(unique_mov$critics_score, best_critics_model$residuals)
#Add red horizontal line to the plot at zero
abline(h = 0, col = "red")
#Histogram of residuals
hist(best_critics_model$residuals)
qqnorm(best_critics_model$residuals)
qqline(best_critics_model$residuals)
Based on these plots, the conditions for regression (linearity, nearly normal residuals, and constant variability) appear to be fairly satisfied for both models.
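As a complementary check, base R's built-in diagnostics for lm objects plot residuals against fitted values (rather than against the observed response), along with a normal Q-Q plot, a scale-location plot, and residuals versus leverage:
#Built-in lm diagnostics on a 2x2 grid
par(mfrow = c(2, 2))
plot(best_aud_model)   #repeat with best_critics_model for the critics model
par(mfrow = c(1, 1))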
Part 5: Prediction
I predicted the critics and audience scores for the movie X-Men: Apocalypse, using data from its Wikipedia article. Although the article calls it a superhero film, the Action & Adventure genre is the closest match in the dataset.
#Attributes for the movie
genre <- c("Action & Adventure")
best_pic_nom <- c("no")
best_pic_win <- c("no")
best_actor_win <- c("no")
best_actress_win <- c("yes") #Jennifer Lawrence
best_dir_win <- c("no")
top200_box <- c("yes")
runtime <- c(144)
xmen <- data.frame(genre, runtime, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)
#Predict audience score with 95% confidence interval
predict(best_aud_model, xmen, interval = "confidence", level = 0.95)
## fit lwr upr
## 1 69.24129 59.44201 79.04057
#Predict critics score with 95% confidence interval
predict(best_critics_model, xmen, interval = "confidence", level = 0.95)
## fit lwr upr
## 1 59.71012 46.18609 73.23415
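Note that interval = "confidence" quantifies uncertainty about the average score of movies sharing these attributes. For the score of this one movie, a prediction interval, which is wider because it also accounts for movie-to-movie variability, may be more appropriate:
#95% prediction intervals for a single movie's scores
predict(best_aud_model, xmen, interval = "prediction", level = 0.95)
predict(best_critics_model, xmen, interval = "prediction", level = 0.95)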
The predicted audience score for X-Men: Apocalypse is higher, so we predict that the movie will be a bigger hit with audiences than with critics.
Part 6: Conclusion
In this project we built two models for predicting the critics and audience scores of a movie. The models had adjusted \(R^2\) values of 22.2% and 23.25% respectively.
I have learnt that whether a movie was nominated for a best picture Oscar, and whether its director has ever won an Oscar, are significantly associated with critics scores. This makes sense: a critic may score a movie more favorably if its director has won an Oscar, and movies with favorable critics scores tend to receive more Oscar nominations.
The runtime of a movie is also significantly associated with audience scores.
Limitations of the Project and Future Research
One main limitation of this project is that many factors (and combinations of factors) are associated with critics and audience scores, and some of them were not included in the models we developed. Predictors that appear significant here may no longer be significant once those other factors are accounted for. I would therefore recommend that future efforts look for ways to build models that include these factors.
I would also recommend that future efforts focus on study designs that allow us to infer causation, not just association. That is, it would be good if we could identify the factors that cause favorable or unfavorable critics or audience scores.