2.5 Linear Models
Up to now, we have looked at descriptive statistics, and evaluated summaries and correlations in the data (with p values).
We are now interested in looking at group differences. But before moving
into linear models, let's assume we simply wanted to use a Student's
t-test to evaluate the differences between two groups. In our case, we
have two age groups within the variable AgeSubject
(young vs old).
Before running a t-test, there are some assumptions one needs to take
into account:
2.5.1 Normality of distribution
2.5.1.1 Shapiro test
To check normality of distribution, we use shapiro.test on the
numeric outcome, given that our predictor has only two levels.
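A sketch of the call producing the output below (assuming the english dataset from languageR is loaded):

shapiro.test(english$RTlexdec)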
##
## Shapiro-Wilk normality test
##
## data: english$RTlexdec
## W = 0.98595, p-value < 2.2e-16
How do we interpret this statistically significant result? It tells us that the distribution of the data is significantly different from a normal distribution.
2.5.1.2 Density plot
We can also use a density plot to evaluate normality. The results show that both levels have bell-shaped distributions, but both are skewed to the right (positively skewed data). See this website for a nice example of left- and right-skewed data distributions.
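A minimal sketch of such a density plot with ggplot2 (one density per level of AgeSubject):

english %>%
  ggplot(aes(x = RTlexdec, fill = AgeSubject)) +
  geom_density(alpha = 0.5) +
  theme_bw()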
2.5.2 Homogeneity of variance
If our data were normally distributed, we could use the Bartlett
test.
If our data were non-normally distributed, we would use the Levene test
(either with var.test
from base R or leveneTest
from the car package).
We can check both.
2.5.2.1 Bartlett test
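A sketch of the call producing the output below:

bartlett.test(RTlexdec ~ AgeSubject, data = english)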
##
## Bartlett test of homogeneity of variances
##
## data: RTlexdec by AgeSubject
## Bartlett's K-squared = 15.883, df = 1, p-value = 6.737e-05
2.5.2.2 Levene test
2.5.2.2.1 Base R
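A sketch of the base-R call producing the output below:

var.test(RTlexdec ~ AgeSubject, data = english)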
##
## F test to compare two variances
##
## data: RTlexdec by AgeSubject
## F = 1.1817, num df = 2283, denom df = 2283, p-value = 6.737e-05
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.088581 1.282737
## sample estimates:
## ratio of variances
## 1.181678
2.5.2.2.2 From the car package
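A sketch of the call (assuming the car package is loaded):

leveneTest(RTlexdec ~ AgeSubject, data = english)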
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 5.1615 0.02314 *
## 4566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In all cases, the statistically significant result indicates that there
is evidence that the variances of the two levels of the factor AgeSubject
differ significantly; i.e., the variances are not equal. This is
important, as we will use this in our t-test later on.
2.5.3 First steps
Up to now, we have looked at descriptive statistics, evaluated summaries and correlations in the data (with p values), and checked the normality of distribution of our data.
We are now interested in looking at group differences.
Let us start with a simple t-test.
2.5.3.1 T-test
We then run a t-test on the data. We specify the formula as y ~ x and
add var.equal = FALSE
(based on the homogeneity-of-variance tests above).
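A sketch of the call producing the Welch test output below:

t.test(RTlexdec ~ AgeSubject, data = english, var.equal = FALSE)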
##
## Welch Two Sample t-test
##
## data: RTlexdec by AgeSubject
## t = 67.468, df = 4534.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group old and group young is not equal to 0
## 95 percent confidence interval:
## 0.2152787 0.2281642
## sample estimates:
## mean in group old mean in group young
## 6.660958 6.439237
To interpret the t-test, we say that there is evidence for a
statistically significant difference between the two groups, young
and old: t(4534.6) = 67.468, p < 2.2e-16. The mean of
RTlexdec
for the young group is significantly lower than that of the
old group.
2.5.3.2 Linear Model
Let us run a linear model on the same data.
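A sketch of the call (here on the original english data, where "old" is the alphabetical reference level):

english %>%
  lm(RTlexdec ~ AgeSubject, data = .) %>%
  summary()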
##
## Call:
## lm(formula = RTlexdec ~ AgeSubject, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25776 -0.08339 -0.01669 0.06921 0.52685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.660958 0.002324 2866.44 <2e-16 ***
## AgeSubjectyoung -0.221721 0.003286 -67.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1111 on 4566 degrees of freedom
## Multiple R-squared: 0.4992, Adjusted R-squared: 0.4991
## F-statistic: 4552 on 1 and 4566 DF, p-value: < 2.2e-16
Any comments? Discuss with your neighbour.
The results of the linear model are exactly the same, albeit with a
difference in the sign of the difference. This indicates that the
\(\beta\) coefficient for young
is significantly lower than that
of old, by -0.221721: t = -67.47, p < 2e-16.
The dataset english
contains two levels for AgeSubject.
We will run
an ANOVA and a linear model on the data.
2.5.3.3 Basic ANOVA
We can use the function aov
to run an Analysis of Variance on the full
dataset.
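A sketch of the call producing the output below:

summary(aov(RTlexdec ~ AgeSubject, data = english))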
## Df Sum Sq Mean Sq F value Pr(>F)
## AgeSubject 1 56.14 56.14 4552 <2e-16 ***
## Residuals 4566 56.31 0.01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2.5.3.4 Linear model
We can use the function lm
to run a linear model (the same call as shown above).
##
## Call:
## lm(formula = RTlexdec ~ AgeSubject, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25776 -0.08339 -0.01669 0.06921 0.52685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.660958 0.002324 2866.44 <2e-16 ***
## AgeSubjectyoung -0.221721 0.003286 -67.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1111 on 4566 degrees of freedom
## Multiple R-squared: 0.4992, Adjusted R-squared: 0.4991
## F-statistic: 4552 on 1 and 4566 DF, p-value: < 2.2e-16
But wait… How is the linear model comparable to the analysis of
variance we ran above? The analysis of variance we saw above can be
derived from this linear model: use anova
on your linear model.
Here are the results of the initial Analysis of variance:
## Df Sum Sq Mean Sq F value Pr(>F)
## AgeSubject 1 56.14 56.14 4552 <2e-16 ***
## Residuals 4566 56.31 0.01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And here are the results of the linear model with the anova
function
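A sketch of the call (assuming the model was stored as mdl.lm, as in section 2.5.4.1):

anova(mdl.lm)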
## Analysis of Variance Table
##
## Response: RTlexdec
## Df Sum Sq Mean Sq F value Pr(>F)
## AgeSubject 1 56.141 56.141 4552 < 2.2e-16 ***
## Residuals 4566 56.314 0.012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
They are exactly the same… Underlying an Analysis of Variance is a linear model. We will continue with a linear model to understand it better.
2.5.4 Introduction to Linear models
A linear model is, at its core, a regression
analysis of the data. We have an outcome (or dependent variable) and a
predictor (or an independent variable). The formula of a linear model is
as follows: outcome ~ predictor,
which can be read as "outcome as a
function of the predictor". We can add "1" to specify an intercept explicitly, but
this is added to the model by default.
2.5.4.1 Model estimation
english2 <- english %>%
  mutate(AgeSubject = factor(AgeSubject, levels = c("young", "old")))
mdl.lm <- english2 %>%
  lm(RTlexdec ~ AgeSubject, data = .)
# equivalent to: lm(RTlexdec ~ AgeSubject, data = english2)
mdl.lm # also print(mdl.lm)
summary(mdl.lm) # the detailed summary shown below
##
## Call:
## lm(formula = RTlexdec ~ AgeSubject, data = .)
##
## Coefficients:
## (Intercept) AgeSubjectold
## 6.4392 0.2217
##
## Call:
## lm(formula = RTlexdec ~ AgeSubject, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25776 -0.08339 -0.01669 0.06921 0.52685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.439237 0.002324 2771.03 <2e-16 ***
## AgeSubjectold 0.221721 0.003286 67.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1111 on 4566 degrees of freedom
## Multiple R-squared: 0.4992, Adjusted R-squared: 0.4991
## F-statistic: 4552 on 1 and 4566 DF, p-value: < 2.2e-16
2.5.5 Role of coding schemes
2.5.5.1 Intro
There are other coding schemes that can be used. See Schad et al. (2020): "How to capitalize on a priori contrasts in linear (mixed) models: A tutorial". Journal of Memory and Language, vol. 110, 104038.
We have:
- Treatment coding
- Contrast (or sum) coding
- Polynomial coding
- Repeated coding
- Helmert coding
The first two are the most commonly used coding schemes, but see the paper for why one can use the others.
2.5.5.2 Treatment coding
By default, R
uses a treatment coding scheme. By this we mean that the
intercept corresponds to a specific level in the dataset (usually,
the first in alphabetical order, unless you have changed that!). In our
example above, we changed the reference level of the variable
AgeSubject
to be "young" vs "old". What we see as a result in our LM
is the intercept (= young) vs "AgeSubjectold". This means we are
looking at the difference between the reference level and all other
levels.
The code below allows you to see what the coding scheme is.
english2$AgeSubject2 <- english2$AgeSubject
contrasts(english2$AgeSubject2) <- contr.treatment(2)
contrasts(english2$AgeSubject2)
## 2
## young 0
## old 1
mdl.lm.T <- english2 %>%
  lm(RTlexdec ~ AgeSubject2, data = .)
mdl.lm.T # also print(mdl.lm.T)
summary(mdl.lm.T) # the detailed summary shown below
##
## Call:
## lm(formula = RTlexdec ~ AgeSubject2, data = .)
##
## Coefficients:
## (Intercept) AgeSubject22
## 6.4392 0.2217
##
## Call:
## lm(formula = RTlexdec ~ AgeSubject2, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25776 -0.08339 -0.01669 0.06921 0.52685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.439237 0.002324 2771.03 <2e-16 ***
## AgeSubject22 0.221721 0.003286 67.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1111 on 4566 degrees of freedom
## Multiple R-squared: 0.4992, Adjusted R-squared: 0.4991
## F-statistic: 4552 on 1 and 4566 DF, p-value: < 2.2e-16
2.5.5.3 Contrast (or sum) coding
2.5.5.3.1 Default
Let’s change the coding scheme and see the difference in the output.
english2 <- english2 %>%
  mutate(AgeSubject2 = factor(AgeSubject2, levels = c("old", "young")))
contrasts(english2$AgeSubject2) <- contr.sum(2)
contrasts(english2$AgeSubject2)
## [,1]
## old 1
## young -1
mdl.lm.C <- english2 %>%
  lm(RTlexdec ~ AgeSubject2, data = .)
mdl.lm.C # also print(mdl.lm.C)
summary(mdl.lm.C) # the detailed summary shown below
##
## Call:
## lm(formula = RTlexdec ~ AgeSubject2, data = .)
##
## Coefficients:
## (Intercept) AgeSubject21
## 6.5501 0.1109
##
## Call:
## lm(formula = RTlexdec ~ AgeSubject2, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25776 -0.08339 -0.01669 0.06921 0.52685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.550097 0.001643 3986.29 <2e-16 ***
## AgeSubject21 0.110861 0.001643 67.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1111 on 4566 degrees of freedom
## Multiple R-squared: 0.4992, Adjusted R-squared: 0.4991
## F-statistic: 4552 on 1 and 4566 DF, p-value: < 2.2e-16
With contrast (or sum) coding, the intercept is no longer the mean of the reference level (treatment = 6.439237 vs contrast = 6.550097). The intercept is now the average of all of the data points.
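A quick sanity check (a sketch):

mean(english2$RTlexdec) # grand mean across both age groups: matches the sum-coded intercept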
## meanRTlexdec
## 1 6.550097
## 2 6.550097
## 3 6.550097
## 4 6.550097
## 5 6.550097
## 6 6.550097
## 7 6.550097
## 8 6.550097
## 9 6.550097
## 10 6.550097
The coefficient for "old" is different: it is now exactly half of the one in the treatment coding (0.221721 vs 0.110861). Why is this the case? In treatment coding, the distance between old and young was 1 (1 - 0 = 1); in contrast coding, it is 2 (1 - (-1) = 2). The estimated difference between the groups is divided by this distance: 0.221721 / 1 = 0.221721, but 0.221721 / 2 = 0.110861!
How do we interpret the coefficient for the level "old"? It is the distance from the grand mean!
Let's change the coding scheme slightly.
2.5.5.3.2 Modified
english2 <- english2 %>%
  mutate(AgeSubject2 = factor(AgeSubject, levels = c("young", "old")))
contrasts(english2$AgeSubject2) <- c(-0.5, 0.5)
contrasts(english2$AgeSubject2)
## [,1]
## young -0.5
## old 0.5
mdl.lmC.2 <- english2 %>%
  lm(RTlexdec ~ AgeSubject2, data = .)
mdl.lmC.2 # also print(mdl.lmC.2)
summary(mdl.lmC.2) # the detailed summary shown below
##
## Call:
## lm(formula = RTlexdec ~ AgeSubject2, data = .)
##
## Coefficients:
## (Intercept) AgeSubject21
## 6.5501 0.2217
##
## Call:
## lm(formula = RTlexdec ~ AgeSubject2, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25776 -0.08339 -0.01669 0.06921 0.52685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.550097 0.001643 3986.29 <2e-16 ***
## AgeSubject21 0.221721 0.003286 67.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1111 on 4566 degrees of freedom
## Multiple R-squared: 0.4992, Adjusted R-squared: 0.4991
## F-statistic: 4552 on 1 and 4566 DF, p-value: < 2.2e-16
Our intercept is still the average of all data points of the dependent variable; the coefficient for "old" is now 0.221721, as expected: the distance between young and old is 1 (0.5 - (-0.5) = 1).
When and why do we use the different coding schemes? We use them to make the intercept more interpretable. This is especially the case when we have more than two categorical predictors, or interactions between a numeric and a categorical predictor. Read the paper cited above and Bodo Winter's book for more details.
2.5.6 Further steps
2.5.6.1 Tidying the output
We use our original model with treatment coding.
# from library(broom)
tidy(mdl.lm) %>%
  select(term, estimate) %>%
  mutate(estimate = round(estimate, 3))
## # A tibble: 2 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 6.44
## 2 AgeSubjectold 0.222
We can also obtain the group means from our model.
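A sketch of how these values could be reconstructed from the coefficients (with treatment coding, the intercept is the reference-level mean):

unname(coef(mdl.lm)[1])                    # mean for young: the intercept
unname(coef(mdl.lm)[1] + coef(mdl.lm)[2])  # mean for old: intercept + AgeSubjectold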
## [1] 6.439237
## [1] 6.660958
2.5.6.2 Nice table of our model summary
We can also obtain a nice table of our model summary, using the
package knitr
or xtable.
2.5.6.2.1 Directly from model summary
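One way to obtain such a table (a sketch using knitr):

knitr::kable(summary(mdl.lm)$coefficients, digits = 3)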
|               | Estimate | Std. Error | t value  | Pr(>\|t\|) |
|---------------|----------|------------|----------|------------|
| (Intercept)   | 6.439    | 0.002      | 2771.027 | 0          |
| AgeSubjectold | 0.222    | 0.003      | 67.468   | 0          |
2.5.6.3 Dissecting the model
Let us dissect the model. If you use "str", you will be able to see what is available under our linear model, and how to access particular pieces of information from it.
2.5.6.3.1 “str” and “coef”
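The two calls behind the output below:

str(mdl.lm)
coef(mdl.lm)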
## List of 13
## $ coefficients : Named num [1:2] 6.439 0.222
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "AgeSubjectold"
## $ residuals : Named num [1:4568] 0.1045 -0.0416 -0.1343 -0.015 0.0114 ...
## ..- attr(*, "names")= chr [1:4568] "1" "2" "3" "4" ...
## $ effects : Named num [1:4568] -442.7013 7.4927 -0.1352 -0.0159 0.0105 ...
## ..- attr(*, "names")= chr [1:4568] "(Intercept)" "AgeSubjectold" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:4568] 6.44 6.44 6.44 6.44 6.44 ...
## ..- attr(*, "names")= chr [1:4568] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:4568, 1:2] -67.587 0.0148 0.0148 0.0148 0.0148 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:4568] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "AgeSubjectold"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## .. ..- attr(*, "contrasts")=List of 1
## .. .. ..$ AgeSubject: chr "contr.treatment"
## ..$ qraux: num [1:2] 1.01 1.01
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 4566
## $ contrasts :List of 1
## ..$ AgeSubject: chr "contr.treatment"
## $ xlevels :List of 1
## ..$ AgeSubject: chr [1:2] "young" "old"
## $ call : language lm(formula = RTlexdec ~ AgeSubject, data = .)
## $ terms :Classes 'terms', 'formula' language RTlexdec ~ AgeSubject
## .. ..- attr(*, "variables")= language list(RTlexdec, AgeSubject)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "RTlexdec" "AgeSubject"
## .. .. .. ..$ : chr "AgeSubject"
## .. ..- attr(*, "term.labels")= chr "AgeSubject"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: 0x000001ac6b9735b8>
## .. ..- attr(*, "predvars")= language list(RTlexdec, AgeSubject)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
## .. .. ..- attr(*, "names")= chr [1:2] "RTlexdec" "AgeSubject"
## $ model :'data.frame': 4568 obs. of 2 variables:
## ..$ RTlexdec : num [1:4568] 6.54 6.4 6.3 6.42 6.45 ...
## ..$ AgeSubject: Factor w/ 2 levels "young","old": 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language RTlexdec ~ AgeSubject
## .. .. ..- attr(*, "variables")= language list(RTlexdec, AgeSubject)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "RTlexdec" "AgeSubject"
## .. .. .. .. ..$ : chr "AgeSubject"
## .. .. ..- attr(*, "term.labels")= chr "AgeSubject"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: 0x000001ac6b9735b8>
## .. .. ..- attr(*, "predvars")= language list(RTlexdec, AgeSubject)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
## .. .. .. ..- attr(*, "names")= chr [1:2] "RTlexdec" "AgeSubject"
## - attr(*, "class")= chr "lm"
## (Intercept) AgeSubjectold
## 6.4392366 0.2217215
2.5.6.3.2 “coef” and “coefficients”
What if I want to obtain the "Intercept"? Or the coefficient for AgeSubjectold? What if I want the full row for AgeSubjectold?
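A sketch of how each piece could be extracted (the last line rounds the p value, which is one way the 0 below could arise):

coef(mdl.lm)["(Intercept)"]
coef(mdl.lm)["AgeSubjectold"]
summary(mdl.lm)$coefficients["AgeSubjectold", ]
round(summary(mdl.lm)$coefficients["AgeSubjectold", "Pr(>|t|)"], 3)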
## (Intercept)
## 6.439237
## AgeSubjectold
## 0.2217215
## Estimate Std. Error t value Pr(>|t|)
## 0.22172146 0.00328631 67.46820211 0.00000000
## [1] 0
2.5.6.3.3 Residuals
What about residuals (the difference between the observed values and the values estimated by the model) and fitted values? These allow us to evaluate how normal our residuals are and how much they deviate from a normal distribution.
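A minimal diagnostic sketch using base R graphics:

qqnorm(residuals(mdl.lm)) # quantile-quantile plot of the residuals
qqline(residuals(mdl.lm)) # reference line for a normal distribution
hist(residuals(mdl.lm))   # histogram of the residuals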
2.5.6.3.4 Goodness of fit?
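A sketch of the extractor calls behind the three values below:

AIC(mdl.lm)    # Akaike Information Criterion
BIC(mdl.lm)    # Bayesian Information Criterion
logLik(mdl.lm) # log-likelihood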
## [1] -7110.962
## [1] -7091.682
## 'log Lik.' 3558.481 (df=3)
Or use the following from broom
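That is (a sketch, assuming broom is loaded):

glance(mdl.lm)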
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.499 0.499 0.111 4552. 0 1 3558. -7111. -7092.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
2.5.6.3.5 Significance testing
Are the above informative? Not directly, of course. If we want to test for the overall significance of the model, we run a null model (aka an intercept-only model) and compare the two models.
mdl.lm.Null <- english %>%
  lm(RTlexdec ~ 1, data = .)
mdl.comp <- anova(mdl.lm.Null, mdl.lm)
mdl.comp
## Analysis of Variance Table
##
## Model 1: RTlexdec ~ 1
## Model 2: RTlexdec ~ AgeSubject
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4567 112.456
## 2 4566 56.314 1 56.141 4552 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show that adding the variable "AgeSubject" improves the model fit. We can write this as follows: Model comparison showed that the addition of AgeSubject significantly improved the model fit when compared with an intercept-only model (\(F\)(1) = 4552, p < 2.2e-16).
2.5.7 Plotting fitted values
2.5.7.1 Trend line
Let's plot our fitted values, but only for the trend line.
english %>%
  ggplot(aes(x = AgeSubject, y = RTlexdec)) +
  geom_boxplot() +
  theme_bw() + theme(text = element_text(size = 15)) +
  geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
  labs(x = "Age", y = "RTLexDec", title = "Boxplot and predicted trend line", subtitle = "with ggplot2")
This allows us to plot the fitted values from our model with the predicted linear trend. The pattern is exactly the same as in our original data.
2.5.7.2 Predicted means and the trend line
We can also plot the predicted means and linear trend
english %>%
  ggplot(aes(x = AgeSubject, y = predict(mdl.lm))) +
  geom_boxplot(color = "blue") +
  theme_bw() + theme(text = element_text(size = 15)) +
  geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
  labs(x = "Age", y = "RTLexDec", title = "Predicted means and trend line", subtitle = "with ggplot2")
2.5.7.3 Raw data, predicted means and the trend line
We can also plot the actual data, the predicted means, and the linear trend.
english %>%
  ggplot(aes(x = AgeSubject, y = RTlexdec)) +
  geom_boxplot() +
  geom_boxplot(aes(x = AgeSubject, y = predict(mdl.lm)), color = "blue") +
  theme_bw() + theme(text = element_text(size = 15)) +
  geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
  labs(x = "Age", y = "RTLexDec", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with ggplot2")
2.5.7.4 Add significance levels and trend line on a plot?
We can use the p values generated from our linear model to add significance levels to a plot. We use the code from above and add the significance level, as well as a trend line.
english %>%
  ggplot(aes(x = AgeSubject, y = RTlexdec)) +
  geom_boxplot() +
  geom_boxplot(aes(x = AgeSubject, y = predict(mdl.lm)), color = "blue") +
  theme_bw() + theme(text = element_text(size = 15)) +
  geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
  labs(x = "Age", y = "RTLexDec", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with significance testing") +
  # geom_signif() comes from the ggsignif package
  geom_signif(comparisons = list(c("old", "young")),
              map_signif_level = TRUE, test = function(a, b) {
                list(p.value = summary(mdl.lm)$coefficients[2, 4])
              })
2.5.8 What about pairwise comparison?
When having three or more levels in our predictor, we can use pairwise comparisons, with corrections, to evaluate the differences between each pair of levels.
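A sketch of the calls behind the output below (assuming the emmeans package is loaded):

summary(mdl.lm)
emmeans(mdl.lm, pairwise ~ AgeSubject)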
##
## Call:
## lm(formula = RTlexdec ~ AgeSubject, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25776 -0.08339 -0.01669 0.06921 0.52685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.439237 0.002324 2771.03 <2e-16 ***
## AgeSubjectold 0.221721 0.003286 67.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1111 on 4566 degrees of freedom
## Multiple R-squared: 0.4992, Adjusted R-squared: 0.4991
## F-statistic: 4552 on 1 and 4566 DF, p-value: < 2.2e-16
## $emmeans
## AgeSubject emmean SE df lower.CL upper.CL
## young 6.439 0.00232 4566 6.435 6.444
## old 6.661 0.00232 4566 6.656 6.666
##
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df t.ratio p.value
## young - old -0.222 0.00329 4566 -67.468 <.0001
How to interpret the output? Discuss with your neighbour and share with the group.
Hint… Look at the emmeans values for each level of our factor AgeSubject and at the contrasts.
2.5.9 Multiple predictors?
Linear models require a numeric outcome, but the predictor can be either numeric or a factor. We can have more than one predictor. The only issue is that this complicates the interpretation of the results.
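A sketch of a model with two predictors and their interaction (here on the original english data, where "old" is the reference level):

english %>%
  lm(RTlexdec ~ AgeSubject * WordCategory, data = .) %>%
  summary()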
##
## Call:
## lm(formula = RTlexdec ~ AgeSubject * WordCategory, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25079 -0.08273 -0.01516 0.06940 0.52285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.664955 0.002911 2289.950 <2e-16 ***
## AgeSubjectyoung -0.220395 0.004116 -53.545 <2e-16 ***
## WordCategoryV -0.010972 0.004822 -2.275 0.0229 *
## AgeSubjectyoung:WordCategoryV -0.003642 0.006820 -0.534 0.5934
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1109 on 4564 degrees of freedom
## Multiple R-squared: 0.5008, Adjusted R-squared: 0.5005
## F-statistic: 1526 on 3 and 4564 DF, p-value: < 2.2e-16
And with an ANOVA:
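A sketch of the call:

english %>%
  lm(RTlexdec ~ AgeSubject * WordCategory, data = .) %>%
  anova()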
## Analysis of Variance Table
##
## Response: RTlexdec
## Df Sum Sq Mean Sq F value Pr(>F)
## AgeSubject 1 56.141 56.141 4564.2810 < 2.2e-16 ***
## WordCategory 1 0.173 0.173 14.0756 0.0001778 ***
## AgeSubject:WordCategory 1 0.004 0.004 0.2851 0.5933724
## Residuals 4564 56.138 0.012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results above tell us that the two main effects (AgeSubject and WordCategory) are statistically significant, but that their interaction is not.