STAT 118: Notes M

Simple Linear Regression

Author
Affiliation

Emily Malcolm-White

Middlebury College

#LOAD PACKAGES
library(tidyverse)
library(scales)

#LOAD DATA
housingdata <- read.csv("https://raw.githubusercontent.com/abhiram-ds/housing_price_adv_regression/main/train.csv")

The data comes from Kaggle. It describing the sale of individual residential property in Ames, Iowa from 2006 to 2010.

Motivation

housingdata %>% 
  ggplot(aes(x=GrLivArea, y=SalePrice)) + 
  geom_point(pch=16) + 
  geom_smooth(method=lm, se=FALSE) +
  scale_y_continuous(labels = dollar_format(prefix="$")) + 
  labs(x = "Above Ground Living Area (sqft)", y = "Sale Price") + 
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Correlation

Correlation always takes on values between -1 and 1, describes the strength of the linear relationship between two variables.

cor(housingdata$GrLivArea, housingdata$SalePrice)
[1] 0.7086245

Correlation Does Not Imply Causation

Could be:

  • Random coincidence

  • Reverse causality

  • Confounding variable

Equations and Interpretation of Coefficients

Want to fit a linear model:

\[ \hat{y} = b_0 + b_1x \]

To fit a linear regression model we need the lm() function.

#lm(dependent_variable ~ independent_variable, data = dataset)
lm(SalePrice ~ GrLivArea, data=housingdata)

Call:
lm(formula = SalePrice ~ GrLivArea, data = housingdata)

Coefficients:
(Intercept)    GrLivArea  
    18569.0        107.1  
#OR

lm(SalePrice ~ GrLivArea, data=housingdata) %>% 
  summary()

Call:
lm(formula = SalePrice ~ GrLivArea, data = housingdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-462999  -29800   -1124   21957  339832 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 18569.026   4480.755   4.144 3.61e-05 ***
GrLivArea     107.130      2.794  38.348  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56070 on 1458 degrees of freedom
Multiple R-squared:  0.5021,    Adjusted R-squared:  0.5018 
F-statistic:  1471 on 1 and 1458 DF,  p-value: < 2.2e-16

Tidy it up with broom

library(broom)
library(kableExtra)

#prettier
lm(SalePrice ~ GrLivArea, data=housingdata) %>% 
  tidy() %>% 
  kbl() %>% 
  kable_styling()
term estimate std.error statistic p.value
(Intercept) 18569.0259 4480.754549 4.144174 3.61e-05
GrLivArea 107.1304 2.793621 38.348207 0.00e+00

To interpret the y-intercept coefficient (\(b_0\)):

  • When x is 0, we predict y to be \(b_0\).
  • In this case, when a house has NO above ground square footage, we predict it will cost $18,569.
Tip

Note that the interpretation of the y-intercept doesn’t always make sense.

In this case, what does it mean for a house to have no above ground square footage? Is it only a basement? Only a plot of land with no house?

To interpret the slope coefficient (\(b_1\)):

  • For every 1 unit increase in x, we expect a \(b_1\) unit increase in y, on average.
  • In this case, for every 1 additional square foot above ground in a house, we expect an $107 increase in the sales price of the house.

Assumptions, Residuals, and Residual Plots

Assumptions of Linear Regression:

  • Linearity: The relationship between X and the mean of Y is linear – look at scatterplot
  • Homoscedasticity: The variance of residual is the same for any value of X – look at fitted vs. residuals plot
  • Independence: Observations are independent of each other – think about context
  • Normality: For any fixed value of X, Y is normally distributed. – check the QQ plot

Check the scatterplot

housingdata %>% 
  ggplot(aes(x=GrLivArea, y=SalePrice)) + 
  geom_point(pch=16) + 
  geom_smooth(method=lm, se=FALSE) +
  scale_y_continuous(labels = dollar_format(prefix="$")) + 
  labs(x = "Above Ground Living Area (sqft)", y = "Sale Price") + 
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Some examples of nonlinear plots:

Credit: chartio

fitted vs. residuals plot and QQplots

# save our lm() as fit
fit <- lm(SalePrice ~ GrLivArea, data=housingdata)
library(ggfortify)
autoplot(fit, which=1)

Some other examples:

QQ plot

autoplot(fit, which=2)

Some other examples:

Prediction

Suppose you are selling your 2000 sqft house. Use the model above to make a prediction for the

newdata <- data.frame(GrLivArea = c(2000))
predict(fit, newdata)
       1 
232829.7 

Interpolation vs. Extrapolation

Interpolation is a method of estimating a hypothetical value that exists within a data set.

Extrapolation is a method of estimation for hypothetical values that fall outside a data set.