#LOAD PACKAGES
library(tidyverse)
library(openintro) #data
library(broom) # tidy() for regression output
library(ggfortify) #for autoplots for checking assumptions and residuals
#LOAD DATA
data(evals)
STAT 118: Notes N
Multiple Regression
The data are gathered from end of semester student evaluations for 463 courses taught by a sample of 94 professors from the University of Texas at Austin.
Recall: Simple Linear Regression
Predicting Professor’s Teaching Evaluation Scores from their Average Beauty Score:
evals %>%
  ggplot(aes(x = bty_avg, y = score)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  theme_minimal() +
  labs(x = "Average Beauty Score", y = "Teaching Score")
`geom_smooth()` using formula = 'y ~ x'
# build model lm(y ~ x)
fit <- lm(evals$score ~ evals$bty_avg)
tidy(fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.88 0.0761 51.0 1.56e-191
2 evals$bty_avg 0.0666 0.0163 4.09 5.08e- 5
\[ \hat{y} = 3.88 + 0.067 (btyavg) \]
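To make the fitted equation concrete, here is a quick by-hand prediction (a sketch; the coefficient values are copied from the tidy() output above):

```r
# coefficients from the fitted model above
b0 <- 3.88    # intercept
b1 <- 0.0666  # slope for bty_avg

# predicted teaching score for a professor with an average beauty score of 5
y_hat <- b0 + b1 * 5
y_hat  # 3.88 + 0.333 = 4.213
```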
#check residual plots
autoplot(fit, which=1)
autoplot(fit, which=2)
Simple Linear Regression with a Categorical Variable as the Predictor
What if we wanted to use a categorical variable (like gender) instead of a quantitative variable (like bty_avg)?
#fit model
fit2 <- lm(evals$bty_avg ~ evals$gender)
tidy(fit2)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.66 0.108 42.9 3.25e-163
2 evals$gendermale -0.416 0.143 -2.92 3.67e- 3
\[ \hat{bty} = 4.66 - 0.416(gendermale) \]
Where did females go? female is the first “level” of gender (levels are ordered alphabetically), so it is our base level that all other levels are compared against.
intercept (\(b_0\)): When gendermale = 0 (the professor is female), the predicted average beauty score is 4.66.
gendermale (\(b_1\)): Compared to female faculty, we predict male faculty to score 0.416 lower on beauty, on average.
For female faculty: we predict a beauty score of 4.66. For male faculty: we predict a beauty score of 4.66 - 0.416 = 4.24.
evals %>%
  group_by(gender) %>%
  summarize(avg_bty_score = mean(bty_avg))
# A tibble: 2 × 2
gender avg_bty_score
<fct> <dbl>
1 female 4.66
2 male 4.24
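With a single binary predictor, the fitted values are exactly the two group means, so we can recover both predictions from the coefficients by hand (values copied from the tidy() output above):

```r
b0 <- 4.66    # intercept: mean beauty score for female faculty
b1 <- -0.416  # difference in means for male faculty

female_pred <- b0        # 4.66, matching the female group mean
male_pred   <- b0 + b1   # 4.66 - 0.416 = 4.244, matching the male group mean
c(female_pred, male_pred)
```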
Simple Linear Regression with a Categorical Variable as the Predictor (more than two levels)
rank has three levels: teaching, tenure track, and tenured. Usually levels are ordered alphabetically, but you can check how they are ordered using the code:
levels(evals$rank)
[1] "teaching" "tenure track" "tenured"
#if you want to reorder them you can use the code:
# evals$rank <- factor(evals$rank, levels = c("tenured", "tenure track", "teaching"))
Since teaching is first, it will be the base level when we run the model.
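As a small self-contained illustration of reordering (using a toy vector rather than the evals data), relevel() promotes one chosen level to the front without retyping the full level list:

```r
# factors order their levels alphabetically by default
rank_demo <- factor(c("teaching", "tenured", "tenure track"))
levels(rank_demo)
#> [1] "teaching"     "tenure track" "tenured"

# relevel() makes "tenured" the base (first) level, keeping the rest in order
levels(relevel(rank_demo, ref = "tenured"))
#> [1] "tenured"      "teaching"     "tenure track"
```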
#fit model
fit3 <- lm(evals$score ~ evals$rank)
tidy(fit3)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.28 0.0537 79.9 1.02e-271
2 evals$ranktenure track -0.130 0.0748 -1.73 8.37e- 2
3 evals$ranktenured -0.145 0.0636 -2.28 2.28e- 2
\[ \hat{score} = 4.28 - 0.13(ranktenuretrack) - 0.15(ranktenured) \]
intercept: For teaching faculty, we predict the average teaching evaluation score to be 4.28.
ranktenure track: Compared to teaching faculty, we predict tenure track faculty on average to have a teaching score 0.13 lower.
ranktenured: Compared to teaching faculty, we predict tenured faculty on average to have a teaching score 0.15 lower.
For teaching faculty: average 4.28
For tenure track faculty: average 4.28 - 0.13 = 4.15
For tenured faculty: average 4.28 - 0.15 = 4.13
Multiple Linear Regression
Sometimes we may want to incorporate multiple variables into the same model (in order to get the best prediction of the y variable possible).
Parallel Slopes Model
#fit model
fit5 <- lm(evals$score ~ evals$gender + evals$bty_avg)
tidy(fit5)
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.75 0.0847 44.3 6.23e-168
2 evals$gendermale 0.172 0.0502 3.43 6.52e- 4
3 evals$bty_avg 0.0742 0.0163 4.56 6.48e- 6
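Reading the estimates off the tidy() output above, the fitted parallel slopes model is:

\[ \hat{score} = 3.75 + 0.172(gendermale) + 0.074(btyavg) \]

Both genders share the same slope for bty_avg (0.074); the gendermale term shifts the entire male line up by 0.172.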
library(moderndive)
Attaching package: 'moderndive'
The following object is masked _by_ '.GlobalEnv':
evals
The following objects are masked from 'package:openintro':
babies, evals
evals %>%
  ggplot(aes(x = bty_avg, y = score, color = gender)) +
  geom_point() +
  labs(x = "Average Beauty Score", y = "Teaching Score", color = "Gender") +
  geom_parallel_slopes(se = FALSE) +
  theme_minimal()
Interaction Terms Model
#fit model
fit6 <- lm(evals$score ~ evals$gender + evals$bty_avg + evals$gender*evals$bty_avg)
tidy(fit6)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.95 0.118 33.5 2.92e-125
2 evals$gendermale -0.184 0.153 -1.20 2.32e- 1
3 evals$bty_avg 0.0306 0.0240 1.28 2.02e- 1
4 evals$gendermale:evals$bty_avg 0.0796 0.0325 2.45 1.46e- 2
\[ \hat{score} = 3.95 - 0.18(gendermale) + 0.03(btyavg) + 0.08(btyavg)(gendermale) \]
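The interaction term lets each gender have its own slope. A quick arithmetic sketch (coefficients copied from the tidy() output above) recovers the two separate lines:

```r
b0    <- 3.95     # intercept (female line)
b_m   <- -0.184   # shift in intercept for male faculty
b_bty <- 0.0306   # slope for female faculty
b_int <- 0.0796   # extra slope for male faculty

female_line <- c(intercept = b0,       slope = b_bty)          # 3.95, 0.031
male_line   <- c(intercept = b0 + b_m, slope = b_bty + b_int)  # 3.77, 0.110
```

So the male line starts lower but rises more steeply with beauty score, which is why the two fitted lines in the plot below the equation are not parallel.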
evals %>%
  ggplot(aes(x = bty_avg, y = score, color = gender)) +
  geom_point() +
  labs(x = "Average Beauty Score", y = "Teaching Score", color = "Gender") +
  geom_smooth(method = 'lm', se = FALSE) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'