- Setup
- Load packages
- Load data
- Part 1: Data
- Part 2: Research question
- Part 3: Exploratory data analysis
- Visualizing spread of critics and audience scores by movie genres
- Summary statistics of critics and audience scores by genre
- Part 4: Modeling
- Model Selection
- Model Diagnostics
- Part 5: Prediction
- Part 6: Conclusion
Setup
Load packages
library(ggplot2)
library(dplyr)
library(statsr)
Load data
load("movies.Rdata")
Part 1: Data
The dataset comprises 651 randomly sampled movies produced and released before 2016. Information on these movies was obtained from the Internet Movie Database (IMDb) and Rotten Tomatoes.
The data collection method involved random sampling but no random assignment. Therefore, the results of this project are generalizable, but we cannot infer causation from them, only association.
Part 2: Research question
The aim of my research question is to identify movie attributes that are significantly associated with higher or lower critics/audience scores. These attributes will be used to develop two models to predict critics and audience scores respectively.
Movies popular with critics may not be popular with audiences and vice versa. I would like to:
- Identify movie attributes that are associated with higher or lower critics/audience scores.
- Predict the critics and audience score for a movie so we can determine whether or not a movie will be a hit with either critics or audiences (or both).
Please note that the dataset contains some missing values. For my regression analysis, I chose to drop the observations that contain them.
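As a quick check (a one-line sketch; movies is the data frame loaded above), we can count how many rows contain at least one missing value before dropping them:
#Count rows with at least one missing value
sum(!complete.cases(movies))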
Response variables
- critics_score: Critics score on Rotten Tomatoes
- audience_score: Audience score on Rotten Tomatoes
Explanatory variables
- genre: Genre of movie
- runtime: Runtime of movie (in minutes)
- best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
- best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
- best_actor_win: Whether or not one of the main actors in the movie has ever won an Oscar (no, yes)
- best_actress_win: Whether or not one of the main actresses in the movie has ever won an Oscar (no, yes)
- best_dir_win: Whether or not the director of the movie has ever won an Oscar (no, yes)
- top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
Part 3: Exploratory data analysis
Distribution of critics and audience scores
There is a duplicate row in the dataset; let's drop it.
#Drop the duplicate row (row 244)
unique_mov <- movies[-244, ]
#Drop missing data from the analysis
unique_mov <- na.omit(unique_mov)
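Rather than hardcoding row 244, the duplicate row could also be located programmatically; a minimal sketch using base R:
#Indices of rows that are exact duplicates of an earlier row
which(duplicated(movies))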
#Histogram of audience scores
ggplot(data = unique_mov, aes(x = audience_score)) + geom_histogram(binwidth = 10) + scale_x_continuous("Audience Score", breaks = seq(0, 100, by = 10))
#Histogram of critics scores
ggplot(data = unique_mov, aes(x = critics_score)) + geom_histogram(binwidth = 10) + scale_x_continuous("Critics Score", breaks = seq(0, 100, by = 10))
The distribution of audience scores is bimodal, with two roughly equal peaks around 80 and 85. The majority of scores are concentrated in the range 30 to 90, with 20 or more movies in each score bin over that interval. The distribution also appears to be slightly left skewed.
The distribution of critics scores is unimodal with a peak at 90. Furthermore, critics' scores appear to be roughly uniformly distributed between 10 and 60. This distribution does not seem to have any significant skew to the left or right.
Barplot of Genres
ggplot(data = unique_mov, aes(x = genre)) + geom_bar() + theme(text = element_text(size = 12), axis.text.x = element_text(angle = 90, vjust = 1))
The barplot shows that the Drama genre has the highest number of movies in the dataset.
Visualizing spread of critics and audience scores by movie genres
#Boxplot showing variability of audience scores by movie genre
ggplot(data = unique_mov, aes(x = genre, y = audience_score)) + geom_boxplot() + theme(text = element_text(size = 12), axis.text.x = element_text(angle = 90, vjust = 1)) + scale_y_continuous("Audience score")
#Boxplot showing variability of critics' scores by movie genre
ggplot(data = unique_mov, aes(x = genre, y = critics_score)) + geom_boxplot() + theme(text = element_text(size = 12), axis.text.x = element_text(angle = 90, vjust = 1)) + scale_y_continuous("Critics score")
From the boxplots above, we can see that critics scores tend to have more variability than audience scores for most movie genres.
Summary statistics of critics and audience scores by genre
#Summary statistics of audience scores by genre
unique_mov %>%
group_by(genre) %>%
summarise(mean_score = mean(audience_score), median_score = median(audience_score), sd_score = sd(audience_score), n = n())
## Source: local data frame [11 x 5]
##
## genre mean_score median_score sd_score n
## <fctr> <dbl> <dbl> <dbl> <int>
## 1 Action & Adventure 53.91935 51.5 20.006801 62
## 2 Animation 62.37500 67.5 20.982561 8
## 3 Art House & International 69.50000 71.0 13.641514 12
## 4 Comedy 52.24419 49.5 19.201676 86
## 5 Documentary 83.23077 86.0 8.594556 39
## 6 Drama 65.28859 70.0 18.630483 298
## 7 Horror 45.90909 42.0 16.480606 22
## 8 Musical & Performing Arts 80.16667 80.5 11.360484 12
## 9 Mystery & Suspense 55.19643 51.5 19.277325 56
## 10 Other 69.73333 74.0 18.491182 15
## 11 Science Fiction & Fantasy 55.12500 53.0 25.759534 8
#Summary statistics of critics scores by genre
unique_mov %>%
group_by(genre) %>%
summarise(mean_score = mean(critics_score), median_score = median(critics_score), sd_score = sd(critics_score), n = n())
## Source: local data frame [11 x 5]
##
## genre mean_score median_score sd_score n
## <fctr> <dbl> <dbl> <dbl> <int>
## 1 Action & Adventure 42.24194 37.0 27.83599 62
## 2 Animation 51.12500 58.0 32.25983 8
## 3 Art House & International 56.91667 65.5 31.06725 12
## 4 Comedy 40.93023 36.0 27.62386 86
## 5 Documentary 86.33333 92.0 17.20516 39
## 6 Drama 61.98993 67.0 25.02019 298
## 7 Horror 42.04545 38.0 25.24636 22
## 8 Musical & Performing Arts 76.66667 89.0 26.53071 12
## 9 Mystery & Suspense 54.89286 60.5 27.56393 56
## 10 Other 67.73333 72.0 26.45337 15
## 11 Science Fiction & Fantasy 55.37500 68.0 28.50031 8
The summary statistics indicate that genres such as Action & Adventure, Animation, Comedy, and Art House & International are more popular with audiences than with critics. The only genre with higher summary statistics for critics scores is Documentary. Other genres have similar summary statistics for both critics and audience scores.
Part 4: Modeling
#Model to predict audience scores
audience_score_model <- lm(audience_score ~ genre + runtime + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = unique_mov)
summary(audience_score_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + best_pic_nom +
## best_pic_win + best_actor_win + best_actress_win + best_dir_win +
## top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.914 -12.090 0.903 12.667 40.294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.77370 4.92429 8.483 < 2e-16 ***
## genreAnimation 11.77660 6.68231 1.762 0.07852 .
## genreArt House & International 16.96864 5.59278 3.034 0.00252 **
## genreComedy 0.07318 2.98850 0.024 0.98047
## genreDocumentary 30.93089 3.63738 8.504 < 2e-16 ***
## genreDrama 10.74960 2.52580 4.256 2.42e-05 ***
## genreHorror -5.84919 4.42274 -1.323 0.18650
## genreMusical & Performing Arts 26.10499 5.59825 4.663 3.84e-06 ***
## genreMystery & Suspense 1.28949 3.31620 0.389 0.69753
## genreOther 12.99451 5.12437 2.536 0.01147 *
## genreScience Fiction & Fantasy 0.39472 6.64330 0.059 0.95264
## runtime 0.10547 0.04290 2.458 0.01424 *
## best_pic_nomyes 19.58898 4.54392 4.311 1.90e-05 ***
## best_pic_winyes -1.22161 8.05756 -0.152 0.87955
## best_actor_winyes -1.85801 2.12682 -0.874 0.38268
## best_actress_winyes -2.53213 2.34515 -1.080 0.28070
## best_dir_winyes 5.34985 3.03261 1.764 0.07822 .
## top200_boxyes 12.45183 4.74057 2.627 0.00884 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.65 on 600 degrees of freedom
## Multiple R-squared: 0.2551, Adjusted R-squared: 0.234
## F-statistic: 12.09 on 17 and 600 DF, p-value: < 2.2e-16
#Model to predict critics scores
critics_score_model <- lm(critics_score ~ genre + runtime + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = unique_mov)
summary(critics_score_model)
##
## Call:
## lm(formula = critics_score ~ genre + runtime + best_pic_nom +
## best_pic_win + best_actor_win + best_actress_win + best_dir_win +
## top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.294 -20.706 1.971 19.085 58.933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.5993 6.9668 4.105 4.60e-05 ***
## genreAnimation 12.9685 9.4540 1.372 0.17066
## genreArt House & International 16.9204 7.9126 2.138 0.03289 *
## genreComedy 0.7965 4.2281 0.188 0.85065
## genreDocumentary 46.6674 5.1461 9.068 < 2e-16 ***
## genreDrama 18.8383 3.5735 5.272 1.89e-07 ***
## genreHorror 2.7291 6.2572 0.436 0.66288
## genreMusical & Performing Arts 34.6387 7.9203 4.373 1.44e-05 ***
## genreMystery & Suspense 12.1806 4.6917 2.596 0.00966 **
## genreOther 22.0525 7.2499 3.042 0.00245 **
## genreScience Fiction & Fantasy 11.7458 9.3988 1.250 0.21189
## runtime 0.1102 0.0607 1.815 0.06998 .
## best_pic_nomyes 22.2377 6.4287 3.459 0.00058 ***
## best_pic_winyes 0.4397 11.3997 0.039 0.96925
## best_actor_winyes -1.1342 3.0090 -0.377 0.70635
## best_actress_winyes -0.7529 3.3179 -0.227 0.82057
## best_dir_winyes 11.8695 4.2905 2.766 0.00584 **
## top200_boxyes 18.6753 6.7069 2.785 0.00553 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.97 on 600 degrees of freedom
## Multiple R-squared: 0.2425, Adjusted R-squared: 0.2211
## F-statistic: 11.3 on 17 and 600 DF, p-value: < 2.2e-16
Model Selection
Since I am trying to find the statistically significant predictors of critics/audience scores, I use backward elimination with the p-value criterion: at each step, drop the predictor with the highest p-value, refit, and repeat until all remaining predictors are significant.
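This procedure can be sketched as a loop. The sketch below is illustrative rather than the exact code used in this section: it relies on drop1() to test each term as a whole, so a multi-level factor such as genre is kept or dropped as a unit, whereas below I step through the eliminations manually.
#Illustrative backward elimination by p-value, testing whole terms
backward_eliminate <- function(formula, data, alpha = 0.05) {
  model <- lm(formula, data = data)
  repeat {
    #One F test per term; the first row ("<none>") is the current model
    tests <- drop1(model, test = "F")[-1, ]
    worst <- which.max(tests[["Pr(>F)"]])
    if (tests[["Pr(>F)"]][worst] <= alpha) break  #all terms significant
    #Refit without the least significant term
    model <- update(model, paste(". ~ . -", rownames(tests)[worst]))
  }
  model
}
For example, backward_eliminate(audience_score ~ genre + runtime + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, unique_mov) would apply the same criterion automatically.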
Audience scores model
Setting aside the individual genre levels, which belong to a single factor and are not dropped one by one, the predictor best_pic_win has the highest p-value, so it will be dropped first. Note that this variable has only two levels (yes/no), so its single p-value speaks for the whole variable and it is safe to drop.
best_aud_model <- lm(audience_score ~ genre + runtime + best_pic_nom + best_actor_win + best_actress_win + best_dir_win + top200_box, data = unique_mov)
summary(best_aud_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + best_pic_nom +
## best_actor_win + best_actress_win + best_dir_win + top200_box,
## data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.921 -12.097 0.921 12.641 40.295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.81206 4.91379 8.509 < 2e-16 ***
## genreAnimation 11.76300 6.67628 1.762 0.07859 .
## genreArt House & International 16.96196 5.58806 3.035 0.00251 **
## genreComedy 0.05819 2.98443 0.019 0.98445
## genreDocumentary 30.92160 3.63391 8.509 < 2e-16 ***
## genreDrama 10.74987 2.52375 4.259 2.38e-05 ***
## genreHorror -5.85474 4.41899 -1.325 0.18571
## genreMusical & Performing Arts 26.10848 5.59365 4.668 3.76e-06 ***
## genreMystery & Suspense 1.28250 3.31319 0.387 0.69883
## genreOther 13.03691 5.11257 2.550 0.01102 *
## genreScience Fiction & Fantasy 0.40627 6.63746 0.061 0.95121
## runtime 0.10518 0.04283 2.456 0.01434 *
## best_pic_nomyes 19.28849 4.08558 4.721 2.92e-06 ***
## best_actor_winyes -1.83185 2.11809 -0.865 0.38746
## best_actress_winyes -2.54992 2.34030 -1.090 0.27634
## best_dir_winyes 5.21783 2.90253 1.798 0.07273 .
## top200_boxyes 12.42088 4.73233 2.625 0.00889 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.64 on 601 degrees of freedom
## Multiple R-squared: 0.2551, Adjusted R-squared: 0.2352
## F-statistic: 12.86 on 16 and 601 DF, p-value: < 2.2e-16
#Next we drop 'best_actor_win'. Note that even though genreComedy has a higher p-value, we don't drop it: genre is a single factor and other levels of it are significant (see the drop1() check after this output).
best_aud_model <- lm(audience_score ~ genre + runtime + best_pic_nom + best_actress_win + best_dir_win + top200_box, data = unique_mov)
summary(best_aud_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + best_pic_nom +
## best_actress_win + best_dir_win + top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.650 -12.151 1.022 12.692 40.358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.32265 4.87718 8.678 < 2e-16 ***
## genreAnimation 11.64108 6.67339 1.744 0.08160 .
## genreArt House & International 17.16705 5.58186 3.076 0.00220 **
## genreComedy 0.05448 2.98381 0.018 0.98544
## genreDocumentary 31.05200 3.63002 8.554 < 2e-16 ***
## genreDrama 10.70198 2.52261 4.242 2.56e-05 ***
## genreHorror -5.73540 4.41591 -1.299 0.19451
## genreMusical & Performing Arts 26.22260 5.59093 4.690 3.38e-06 ***
## genreMystery & Suspense 1.03635 3.30025 0.314 0.75361
## genreOther 12.98759 5.11119 2.541 0.01130 *
## genreScience Fiction & Fantasy 0.60163 6.63222 0.091 0.92775
## runtime 0.09839 0.04209 2.337 0.01974 *
## best_pic_nomyes 18.98872 4.07000 4.666 3.80e-06 ***
## best_actress_winyes -2.66234 2.33620 -1.140 0.25490
## best_dir_winyes 5.15994 2.90115 1.779 0.07581 .
## top200_boxyes 12.35954 4.73080 2.613 0.00921 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.63 on 602 degrees of freedom
## Multiple R-squared: 0.2542, Adjusted R-squared: 0.2356
## F-statistic: 13.68 on 15 and 602 DF, p-value: < 2.2e-16
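Because genre enters the model as a single factor, its levels are kept or dropped together. Base R's drop1() makes this explicit by reporting one F test per term rather than per level; a quick check on the current model:
#F test for removing each remaining term as a unit
drop1(best_aud_model, test = "F")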
#Drop best_actress_win
best_aud_model <- lm(audience_score ~ genre + runtime + best_pic_nom + best_dir_win + top200_box, data = unique_mov)
summary(best_aud_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + best_pic_nom +
## best_dir_win + top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.297 -12.780 1.107 12.619 40.548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.95766 4.84644 8.864 < 2e-16 ***
## genreAnimation 11.18871 6.66323 1.679 0.09364 .
## genreArt House & International 16.91334 5.57880 3.032 0.00254 **
## genreComedy -0.28256 2.96985 -0.095 0.92423
## genreDocumentary 30.93374 3.62943 8.523 < 2e-16 ***
## genreDrama 10.34021 2.50318 4.131 4.13e-05 ***
## genreHorror -5.83222 4.41619 -1.321 0.18712
## genreMusical & Performing Arts 26.25191 5.59225 4.694 3.31e-06 ***
## genreMystery & Suspense 0.60188 3.27896 0.184 0.85442
## genreOther 12.76553 5.10874 2.499 0.01273 *
## genreScience Fiction & Fantasy 0.60429 6.63387 0.091 0.92745
## runtime 0.09259 0.04179 2.215 0.02711 *
## best_pic_nomyes 18.31730 4.02812 4.547 6.57e-06 ***
## best_dir_winyes 5.10450 2.90146 1.759 0.07904 .
## top200_boxyes 12.03445 4.72336 2.548 0.01109 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.64 on 603 degrees of freedom
## Multiple R-squared: 0.2525, Adjusted R-squared: 0.2352
## F-statistic: 14.55 on 14 and 603 DF, p-value: < 2.2e-16
#Finally drop best_dir_win
best_aud_model <- lm(audience_score ~ genre + runtime + best_pic_nom + top200_box, data = unique_mov)
summary(best_aud_model)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + best_pic_nom +
## top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.643 -12.477 0.953 12.581 40.516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.69153 4.80101 8.684 < 2e-16 ***
## genreAnimation 11.09461 6.67456 1.662 0.09699 .
## genreArt House & International 16.58775 5.58539 2.970 0.00310 **
## genreComedy -0.27064 2.97499 -0.091 0.92755
## genreDocumentary 30.65603 3.63228 8.440 2.36e-16 ***
## genreDrama 10.29897 2.50740 4.107 4.55e-05 ***
## genreHorror -5.74769 4.42358 -1.299 0.19433
## genreMusical & Performing Arts 26.20247 5.60187 4.677 3.59e-06 ***
## genreMystery & Suspense 0.70227 3.28415 0.214 0.83075
## genreOther 12.56243 5.11628 2.455 0.01435 *
## genreScience Fiction & Fantasy 0.95383 6.64238 0.144 0.88587
## runtime 0.10789 0.04095 2.635 0.00863 **
## best_pic_nomyes 18.97423 4.01773 4.723 2.90e-06 ***
## top200_boxyes 12.01338 4.73153 2.539 0.01137 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.67 on 604 degrees of freedom
## Multiple R-squared: 0.2487, Adjusted R-squared: 0.2325
## F-statistic: 15.38 on 13 and 604 DF, p-value: < 2.2e-16
For audiences, the significant predictors are:
- genre (5 out of 10 levels are significant predictors). They are:
- Art House & International
- Documentary
- Drama
- Musical & Performing Arts
- Other
- runtime
- best_pic_nom
- top200_box
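To quantify these associations, we can also compute 95% confidence intervals for the coefficients of the final audience model:
#95% confidence intervals for the coefficients
confint(best_aud_model, level = 0.95)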
Critics scores model
I again use the p-value criterion to drop predictors from the model. The predictor best_pic_win has the highest p-value, so it will be dropped from the model first.
best_critics_model <- lm(critics_score ~ genre + runtime + best_pic_nom + best_actor_win + best_actress_win + best_dir_win + top200_box, data = unique_mov)
summary(best_critics_model)
##
## Call:
## lm(formula = critics_score ~ genre + runtime + best_pic_nom +
## best_actor_win + best_actress_win + best_dir_win + top200_box,
## data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.293 -20.744 1.958 19.065 58.937
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.58553 6.95182 4.112 4.47e-05 ***
## genreAnimation 12.97335 9.44532 1.374 0.170102
## genreArt House & International 16.92276 7.90576 2.141 0.032711 *
## genreComedy 0.80185 4.22225 0.190 0.849443
## genreDocumentary 46.67071 5.14110 9.078 < 2e-16 ***
## genreDrama 18.83821 3.57049 5.276 1.84e-07 ***
## genreHorror 2.73109 6.25181 0.437 0.662379
## genreMusical & Performing Arts 34.63741 7.91367 4.377 1.42e-05 ***
## genreMystery & Suspense 12.18313 4.68736 2.599 0.009575 **
## genreOther 22.03728 7.23306 3.047 0.002415 **
## genreScience Fiction & Fantasy 11.74162 9.39040 1.250 0.211645
## runtime 0.11029 0.06059 1.820 0.069203 .
## best_pic_nomyes 22.34584 5.78010 3.866 0.000123 ***
## best_actor_winyes -1.14362 2.99658 -0.382 0.702862
## best_actress_winyes -0.74646 3.31096 -0.225 0.821705
## best_dir_winyes 11.91697 4.10638 2.902 0.003843 **
## top200_boxyes 18.68647 6.69510 2.791 0.005420 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.95 on 601 degrees of freedom
## Multiple R-squared: 0.2425, Adjusted R-squared: 0.2224
## F-statistic: 12.03 on 16 and 601 DF, p-value: < 2.2e-16
#Drop best_actress_win
best_critics_model <- lm(critics_score ~ genre + runtime + best_pic_nom + best_actor_win + best_dir_win + top200_box, data = unique_mov)
summary(best_critics_model)
##
## Call:
## lm(formula = critics_score ~ genre + runtime + best_pic_nom +
## best_actor_win + best_dir_win + top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.289 -20.623 2.059 19.125 58.910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.75257 6.90678 4.163 3.60e-05 ***
## genreAnimation 12.84941 9.42187 1.364 0.173145
## genreArt House & International 16.84765 7.89250 2.135 0.033194 *
## genreComedy 0.70772 4.19824 0.169 0.866187
## genreDocumentary 46.63499 5.13460 9.082 < 2e-16 ***
## genreDrama 18.73807 3.53996 5.293 1.69e-07 ***
## genreHorror 2.70158 6.24551 0.433 0.665486
## genreMusical & Performing Arts 34.64327 7.90738 4.381 1.39e-05 ***
## genreMystery & Suspense 12.06673 4.65516 2.592 0.009770 **
## genreOther 21.97622 7.22228 3.043 0.002446 **
## genreScience Fiction & Fantasy 11.73837 9.38298 1.251 0.211410
## runtime 0.10881 0.06018 1.808 0.071108 .
## best_pic_nomyes 22.16431 5.71923 3.875 0.000118 ***
## best_actor_winyes -1.18114 2.98960 -0.395 0.692921
## best_dir_winyes 11.90266 4.10265 2.901 0.003853 **
## top200_boxyes 18.59686 6.67802 2.785 0.005525 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.93 on 602 degrees of freedom
## Multiple R-squared: 0.2425, Adjusted R-squared: 0.2236
## F-statistic: 12.85 on 15 and 602 DF, p-value: < 2.2e-16
#Drop best_actor_win
best_critics_model <- lm(critics_score ~ genre + runtime + best_pic_nom + best_dir_win + top200_box, data = unique_mov)
summary(best_critics_model)
##
## Call:
## lm(formula = critics_score ~ genre + runtime + best_pic_nom +
## best_dir_win + top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.304 -20.700 2.277 19.060 58.995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.09907 6.84607 4.250 2.47e-05 ***
## genreAnimation 12.75848 9.41247 1.355 0.175770
## genreArt House & International 16.97298 7.88060 2.154 0.031654 *
## genreComedy 0.69615 4.19520 0.166 0.868260
## genreDocumentary 46.71584 5.12693 9.112 < 2e-16 ***
## genreDrama 18.69734 3.53598 5.288 1.73e-07 ***
## genreHorror 2.77589 6.23831 0.445 0.656498
## genreMusical & Performing Arts 34.71765 7.89961 4.395 1.31e-05 ***
## genreMystery & Suspense 11.89619 4.63186 2.568 0.010458 *
## genreOther 21.93837 7.21659 3.040 0.002468 **
## genreScience Fiction & Fantasy 11.86440 9.37099 1.266 0.205974
## runtime 0.10428 0.05904 1.766 0.077858 .
## best_pic_nomyes 21.95274 5.69012 3.858 0.000127 ***
## best_dir_winyes 11.86382 4.09860 2.895 0.003934 **
## top200_boxyes 18.54846 6.67222 2.780 0.005606 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.92 on 603 degrees of freedom
## Multiple R-squared: 0.2423, Adjusted R-squared: 0.2247
## F-statistic: 13.77 on 14 and 603 DF, p-value: < 2.2e-16
#Drop runtime
best_critics_model <- lm(critics_score ~ genre + best_pic_nom + best_dir_win + top200_box, data = unique_mov)
summary(best_critics_model)
##
## Call:
## lm(formula = critics_score ~ genre + best_pic_nom + best_dir_win +
## top200_box, data = unique_mov)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.333 -21.017 2.147 18.909 58.229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.77134 3.22424 12.335 < 2e-16 ***
## genreAnimation 11.35366 9.39524 1.208 0.227348
## genreArt House & International 17.14532 7.89382 2.172 0.030244 *
## genreComedy 0.02477 4.18527 0.006 0.995279
## genreDocumentary 46.56199 5.13518 9.067 < 2e-16 ***
## genreDrama 19.31990 3.52454 5.482 6.20e-08 ***
## genreHorror 1.66635 6.21748 0.268 0.788782
## genreMusical & Performing Arts 35.78110 7.89044 4.535 6.96e-06 ***
## genreMystery & Suspense 12.47174 4.62848 2.695 0.007244 **
## genreOther 22.52639 7.22155 3.119 0.001899 **
## genreScience Fiction & Fantasy 11.43997 9.38433 1.219 0.223301
## best_pic_nomyes 24.11224 5.56696 4.331 1.74e-05 ***
## best_dir_winyes 13.37070 4.01585 3.329 0.000923 ***
## top200_boxyes 19.93878 6.63724 3.004 0.002774 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.96 on 604 degrees of freedom
## Multiple R-squared: 0.2383, Adjusted R-squared: 0.222
## F-statistic: 14.54 on 13 and 604 DF, p-value: < 2.2e-16
For critics, the significant predictors are:
- genre (6 out of 10 levels are significant predictors). They are:
- Art House & International
- Documentary
- Drama
- Musical & Performing Arts
- Mystery & Suspense
- Other
- best_pic_nom
- best_dir_win
- top200_box
Model Diagnostics
Next, I ran model diagnostics on both models to check whether the conditions for linear regression are satisfied.
Diagnostics for audience scores model
#Residuals vs observed audience scores
plot(unique_mov$audience_score, best_aud_model$residuals)
#Add red horizontal line to the plot at zero
abline(h = 0, col = "red")
#Histogram of residuals
hist(best_aud_model$residuals)
qqnorm(best_aud_model$residuals)
qqline(best_aud_model$residuals)
Diagnostics for critics scores model
#Residuals vs observed critics scores
plot(unique_mov$critics_score, best_critics_model$residuals)
#Add red horizontal line to the plot at zero
abline(h = 0, col = "red")
#Histogram of residuals
hist(best_critics_model$residuals)
qqnorm(best_critics_model$residuals)
qqline(best_critics_model$residuals)
Based on these plots, the conditions for regression (linearity, nearly normal residuals, and constant variability) appear to be fairly satisfied for both models.
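As a complementary check, base R's built-in diagnostics for lm objects plot residuals against fitted values (rather than against the observed response), along with a normal Q-Q plot, a scale-location plot, and residuals versus leverage:
#Built-in lm diagnostics on a 2x2 grid
par(mfrow = c(2, 2))
plot(best_aud_model)   #repeat with best_critics_model for the critics model
par(mfrow = c(1, 1))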
Part 5: Prediction
I predicted the critics and audience scores for the movie X-Men: Apocalypse, using data from its Wikipedia article. Although the article calls it a superhero film, the Action & Adventure genre is the closest match in the dataset.
#Attributes for the movie
genre <- c("Action & Adventure")
best_pic_nom <- c("no")
best_pic_win <- c("no")
best_actor_win <- c("no")
best_actress_win <- c("yes") #Jennifer Lawrence
best_dir_win <- c("no")
top200_box <- c("yes")
runtime <- c(144)
xmen <- data.frame(genre, runtime, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)
#Predict audience score with 95% confidence interval
predict(best_aud_model, xmen, interval = "confidence", level = 0.95)
## fit lwr upr
## 1 69.24129 59.44201 79.04057
#Predict critics score with 95% confidence interval
predict(best_critics_model, xmen, interval = "confidence", level = 0.95)
## fit lwr upr
## 1 59.71012 46.18609 73.23415
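Note that interval = "confidence" quantifies uncertainty about the average score of movies sharing these attributes. For the score of this one movie, a prediction interval, which is wider because it also accounts for movie-to-movie variability, may be more appropriate:
#95% prediction intervals for a single movie's scores
predict(best_aud_model, xmen, interval = "prediction", level = 0.95)
predict(best_critics_model, xmen, interval = "prediction", level = 0.95)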
The predicted audience score for X-Men: Apocalypse is higher, so we predict that the movie will be a bigger hit with audiences than with critics.
Part 6: Conclusion
In this project we built two models for predicting the critics and audience scores of a movie. The models had adjusted \(R^2\) values of 22.2% and 23.25% respectively.
I have learnt that whether a movie was nominated for a best picture Oscar, and whether its director has ever won an Oscar, are significantly associated with critics scores. This makes sense: a critic may score a movie more favorably if its director has won an Oscar, and movies with favorable critics scores tend to receive more Oscar nominations.
The runtime of a movie is also significantly associated with audience scores.
Limitations of the Project and Future Research
One main limitation of this project is that many factors (and combinations of factors) are associated with critics and audience scores, and some of them were not included in the models we developed. Predictors that appear significant here may no longer be significant once those other factors are accounted for. I would therefore recommend that future efforts look for ways to build models that include these factors.
I would also recommend that future efforts focus on study designs that allow us to infer causation, not just association. That is, it would be good if we could identify the factors that cause favorable or unfavorable critics or audience scores.