## Use the code below to check whether you have all required packages installed. Any packages that are missing will be installed automatically; if all packages are already installed, the same loop simply loads them.
requiredPackages <- c('tidyverse', 'ordinal', 'broom', 'emmeans', 'knitr', 'Hmisc', 'corrplot', 'psycho', 'PresenceAbsence', 'lme4', 'lmerTest', 'ggsignif')
for (p in requiredPackages) {
  if (!require(p, character.only = TRUE)) install.packages(p)
  library(p, character.only = TRUE)
}
In this session, we will look at basic functions in R that help us run inferential statistics. These allow us to evaluate the relationship between one or more predictors (independent variables) and an outcome (dependent variable).
It is important to know the class of the outcome before doing any pre-data analyses or inferential statistics. Outcome classes can be one of:

Numeric
: As examples, we have the length/width of a leaf; the height of a mountain; the fundamental frequency of the voice; etc. These are true numbers, and we can use summaries, t-tests, linear models, etc.

Categorical (Unordered)
: Observations for two or more categories, e.g., gender of a speaker (male or female); responses to a True vs False perception test; colour categorisation (red, blue, yellow, orange, etc.). For these we can use a Generalised Linear Model (binomial or multinomial) or a simple chi-square test. Count data are numbers associated with a category; these should be analysed using Poisson regression.

Categorical (Ordered)
: Obtained when you run a rating experiment, where the outcome is either numeric (i.e., 1, 2, 3, 4, 5) or categorical (i.e., disagree, neutral, agree). The numeric option is NOT a true number: for the participant, these are ordered categories. Cumulative Logit models (Generalised Linear Models with a cumulative link function) are used. The mean is meaningless here; the median is the preferred descriptive statistic.
The content will be split into five sections. Some of you will only ever look at some of these, while others may need to use all. Look at the section that works best for your data.

1. Pre-data analyses: summaries, plotting, correlations, checking normality of distribution and homogeneity of variance (for t-tests)
2. Statistical analyses: from t-tests and ANOVA to Linear Models: model estimation, understanding coefficients, residuals, and predictions
3. Generalised Linear Models: Logistic Regression (with a binomial categorical outcome), touching on Signal Detection Theory for estimating accuracy and biases, with sensitivity and specificity measures
4. Rating data: Generalised Linear Models with a Cumulative Logit function
5. Linear Mixed effects Models: an introduction to random effects and how to deal with them
We will use one of the datasets built into R. You can check all available datasets in R using the following:
data()
# or below for all datasets available in all installed packages
data(package = .packages(all.available = TRUE))
We will use the iris dataset, which ships with base R (in the datasets package). Let's inspect its structure:
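str(iris)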
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We have a data frame with 150 observations and 5 variables: 4 numeric variables and 1 factor with 3 levels.
We summarise the data to see the trends:
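summary(iris)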
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
So we have a balanced data frame (50 observations under each level of the factor Species), with no missing values (aka NA).
What if you want to summarise the data and obtain the mean and SD of Sepal.Length for each level of the factor?
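One tidyverse sketch (any grouping-and-summarising approach works):

iris %>%
  group_by(Species) %>%
  summarise(mean = mean(Sepal.Length),
            SD = sd(Sepal.Length))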
Do some additional summaries: you may want to check the median, range, etc.
We can make a boxplot of the data and add a trend line. This allows us to visualise the median, the quartiles, the extremes and any outliers… all in the same plot!
iris %>%
ggplot(aes(x = Species, y = Sepal.Length)) +
geom_boxplot() +
geom_smooth(aes(x = as.numeric(Species), y = Sepal.Length), method = "lm") +
labs(x = "Species", y = "Length", title = "Boxplot and trend line", subtitle = "with ggplot2") +
theme_bw() + theme(text = element_text(size = 15))
Here I have used the variable Sepal.Length. You can use any of the other variables to plot the data.
We use the function cor to obtain the Pearson correlation, and cor.test to run a basic correlation test on our data with significance testing:
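cor(iris$Sepal.Length, iris$Petal.Length)
cor.test(iris$Sepal.Length, iris$Petal.Length)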
[1] 0.8717538
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8270363 0.9055080
sample estimates:
cor
0.8717538
Can you check whether there is a correlation between Sepal.Length and Petal.Width? What about Petal.Length and Petal.Width?
corrplot

Above, we ran a correlation test on two predictors. With corrplot, we can run multiple correlations on multiple predictors at once.
## correlation using "corrplot"
## based on the function `rcorr` from the `Hmisc` package
## Need to change dataframe into a matrix
corr <- as.matrix(iris[-5]) %>%
rcorr(type="pearson")
print(corr)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.00 -0.12 0.87 0.82
Sepal.Width -0.12 1.00 -0.43 -0.37
Petal.Length 0.87 -0.43 1.00 0.96
Petal.Width 0.82 -0.37 0.96 1.00
n= 150
P
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.1519 0.0000 0.0000
Sepal.Width 0.1519 0.0000 0.0000
Petal.Length 0.0000 0.0000 0.0000
Petal.Width 0.0000 0.0000 0.0000
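To draw the plot itself, one possible call is the following (a sketch; see ?corrplot for the many options):

corrplot(corr$r, type = "upper", p.mat = corr$P, sig.level = 0.05, insig = "blank")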
Look into the corrplot specification, using ?corrplot, and amend some of the criteria. Run a correlation plot while filtering the data according to Species.
Up to now, we have done some basic summaries and checked the correlations in the data. The Pearson correlations provided significance levels for the correlation between two numeric variables. We continue by examining the normality of the distribution of our data.
In the iris dataset, we have a categorical predictor, Species, which has three levels:
[1] "setosa" "versicolor" "virginica"
Let's subset the data to setosa and versicolor. We will also check the normality and homogeneity of variance in the data.
To check normality of distribution, we use shapiro.test on the numeric outcome. Given that our predictor now has two levels, we need to subset the data again to check the normality of the outcome Sepal.Length under each level of our factor Species.
irisSubSet <- iris %>%
filter(Species == "setosa")
irisSubVers <- iris %>%
filter(Species == "versicolor")
shapiro.test(irisSubSet$Sepal.Length)
Shapiro-Wilk normality test
data: irisSubSet$Sepal.Length
W = 0.9777, p-value = 0.4595
shapiro.test(irisSubVers$Sepal.Length)
	Shapiro-Wilk normality test
data: irisSubVers$Sepal.Length
W = 0.97784, p-value = 0.4647
How do we interpret these non-significant results? They tell us that the distribution of the data is not statistically different from a normal distribution.
We can also use a density plot to evaluate normality. The results show that both levels have bell-shaped distributions.
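A sketch of such a density plot (using the two subsets created above):

bind_rows(irisSubSet, irisSubVers) %>%
  ggplot(aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.5) +
  theme_bw()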
Because our data are normally distributed, we can use the Bartlett test of homogeneity of variances. If our data were not normally distributed, we would use Levene's test (leveneTest from the car package); base R's var.test provides an F test comparing two variances. We can check all three.
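The tests below can be reproduced as follows (a sketch: irisSub is an assumed name for the two-species subset, with the unused factor level dropped):

irisSub <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  mutate(Species = droplevels(Species))
bartlett.test(Sepal.Length ~ Species, data = irisSub)
var.test(Sepal.Length ~ Species, data = irisSub)
car::leveneTest(Sepal.Length ~ Species, data = irisSub)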
Bartlett test of homogeneity of variances
data: Sepal.Length by Species
Bartlett's K-squared = 6.8917, df = 1, p-value = 0.00866
F test to compare two variances
data: Sepal.Length by Species
F = 0.46634, num df = 49, denom df = 49, p-value = 0.008657
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2646385 0.8217841
sample estimates:
ratio of variances
0.4663429
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 8.1727 0.005196 **
98
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In all cases, the statistically significant result indicates that the variances of the two levels of the factor Species are significantly different, i.e., the variances are not equal. This is important, as we will use it in our t-test below.
Up to now, we have looked at descriptive statistics, and evaluated summaries, correlations in the data (with p values), and checked the normality of distribution of our data.
We are now interested in looking at group differences.
Let us start with a simple t-test. We specify the formula as y ~ x and add var.equal = FALSE (based on the variance tests above).
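A sketch of the call (irisSub is the assumed two-species subset from above):

t.test(Sepal.Length ~ Species, data = irisSub, var.equal = FALSE)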
Welch Two Sample t-test
data: Sepal.Length by Species
t = -10.521, df = 86.538, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.1057074 -0.7542926
sample estimates:
mean in group setosa mean in group versicolor
5.006 5.936
To interpret the t-test, we say that there is evidence for a statistically significant difference between the two groups, setosa and versicolor: t(86.54) = -10.52, p < 2.2e-16. The mean of setosa is significantly lower than that of versicolor.
Let us run a linear model on the same data.
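A sketch of the call, using the same assumed two-species subset:

irisSub %>%
  lm(Sepal.Length ~ Species, data = .) %>%
  summary()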
Call:
lm(formula = Sepal.Length ~ Species, data = .)
Residuals:
Min 1Q Median 3Q Max
-1.0360 -0.3135 -0.0060 0.2715 1.0640
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.00600 0.06250 80.09 <2e-16 ***
Speciesversicolor 0.93000 0.08839 10.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.442 on 98 degrees of freedom
Multiple R-squared: 0.5304, Adjusted R-squared: 0.5256
F-statistic: 110.7 on 1 and 98 DF, p-value: < 2.2e-16
Any comments? Discuss with your neighbour.
The results of the linear model are exactly the same, albeit with a difference in the sign. This indicates that the \(\beta\) coefficient for versicolor is significantly higher than that of setosa, by 0.930: t(98) = 10.52, p < 2e-16.
The dataset iris contains three species; we will run an ANOVA and a linear model on the full data. We can use the function aov to run an Analysis of Variance on the full dataset:
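iris %>%
  aov(Sepal.Length ~ Species, data = .) %>%
  summary()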
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.21 31.606 119.3 <2e-16 ***
Residuals 147 38.96 0.265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can use the function lm to run a linear model:
Call:
lm(formula = Sepal.Length ~ Species, data = .)
Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
But wait… how is the linear model comparable to the analysis of variance we ran above? The analysis of variance can be derived from this linear model: use anova on your linear model.
Here are the results of the initial Analysis of variance:
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.21 31.606 119.3 <2e-16 ***
Residuals 147 38.96 0.265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And here are the results of the linear model with the anova function:
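anova(lm(Sepal.Length ~ Species, data = iris))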
Analysis of Variance Table
Response: Sepal.Length
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.212 31.606 119.26 < 2.2e-16 ***
Residuals 147 38.956 0.265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
They are exactly the same… Underlying an Analysis of Variance is a linear model. We will continue with the linear model to understand it better.
A linear model runs a regression analysis on the data: we have an outcome (or dependent variable) and a predictor (or independent variable). The formula of a linear model is outcome ~ predictor, which can be read as "outcome as a function of the predictor". We can add "1" to specify the intercept explicitly, but this is added to the model by default.
mdl.lm <- iris %>%
lm(Sepal.Length ~ Species, data = .)
# same as below.
#mdl.lm <- lm(Sepal.Length ~ 1 + Species, data = iris)
mdl.lm #also print(mdl.lm)
Call:
lm(formula = Sepal.Length ~ Species, data = .)
Coefficients:
(Intercept) Speciesversicolor Speciesvirginica
5.006 0.930 1.582
Call:
lm(formula = Sepal.Length ~ Species, data = .)
Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
# from library(broom)
tidy(mdl.lm) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
To interpret the model, we need to look at the coefficients. The Intercept (= setosa) is 5.006, and the coefficients for versicolor and virginica are respectively 0.930 and 1.582. This tells us that, compared with setosa, moving to versicolor leads to a significant increase in Sepal.Length of 0.930, and moving to virginica to a significant increase of 1.582.
But where are the actual values, i.e., the means in the table above?
We run a model that suppresses the intercept (i.e., adding 0 instead of 1); this allows us to obtain the "true" coefficient for each level of our predictor. This is also known as a no-intercept (or cell-means) model.
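A sketch of the call:

iris %>%
  lm(Sepal.Length ~ 0 + Species, data = .) %>%
  summary()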
Call:
lm(formula = Sepal.Length ~ 0 + Species, data = .)
Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Speciessetosa 5.0060 0.0728 68.76 <2e-16 ***
Speciesversicolor 5.9360 0.0728 81.54 <2e-16 ***
Speciesvirginica 6.5880 0.0728 90.49 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared: 0.9925, Adjusted R-squared: 0.9924
F-statistic: 6522 on 3 and 147 DF, p-value: < 2.2e-16
This matches the original data: setosa has a mean of 5.006, versicolor 5.936, and virginica 6.588. See the table above and the coefficients below.
[1] 5.006
[1] 5.936
[1] 6.588
These are the same as:
[1] 5.006
[1] 5.936
[1] 6.588
We can also obtain a nice table of our model summary, using the package knitr or xtable:
|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 5.006 | 0.073 | 68.762 | 0 |
| Speciesversicolor | 0.930 | 0.103 | 9.033 | 0 |
| Speciesvirginica | 1.582 | 0.103 | 15.366 | 0 |
tidy output:

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 5.006 | 0.073 | 68.762 | 0 |
| Speciesversicolor | 0.930 | 0.103 | 9.033 | 0 |
| Speciesvirginica | 1.582 | 0.103 | 15.366 | 0 |
Let us dissect the model. Using str shows everything available under our linear model object; we can then access individual components:
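str(mdl.lm)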
List of 13
$ coefficients : Named num [1:3] 5.01 0.93 1.58
..- attr(*, "names")= chr [1:3] "(Intercept)" "Speciesversicolor" "Speciesvirginica"
$ residuals : Named num [1:150] 0.094 -0.106 -0.306 -0.406 -0.006 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ effects : Named num [1:150] -71.5659 0.8025 7.91 -0.3826 0.0174 ...
..- attr(*, "names")= chr [1:150] "(Intercept)" "Speciesversicolor" "Speciesvirginica" "" ...
$ rank : int 3
$ fitted.values: Named num [1:150] 5.01 5.01 5.01 5.01 5.01 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ assign : int [1:3] 0 1 1
$ qr :List of 5
..$ qr : num [1:150, 1:3] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:3] "(Intercept)" "Speciesversicolor" "Speciesvirginica"
.. ..- attr(*, "assign")= int [1:3] 0 1 1
.. ..- attr(*, "contrasts")=List of 1
.. .. ..$ Species: chr "contr.treatment"
..$ qraux: num [1:3] 1.08 1.05 1.09
..$ pivot: int [1:3] 1 2 3
..$ tol : num 1e-07
..$ rank : int 3
..- attr(*, "class")= chr "qr"
$ df.residual : int 147
$ contrasts :List of 1
..$ Species: chr "contr.treatment"
$ xlevels :List of 1
..$ Species: chr [1:3] "setosa" "versicolor" "virginica"
$ call : language lm(formula = Sepal.Length ~ Species, data = .)
$ terms :Classes 'terms', 'formula' language Sepal.Length ~ Species
.. ..- attr(*, "variables")= language list(Sepal.Length, Species)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "Sepal.Length" "Species"
.. .. .. ..$ : chr "Species"
.. ..- attr(*, "term.labels")= chr "Species"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: 0x000002715c292770>
.. ..- attr(*, "predvars")= language list(Sepal.Length, Species)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
.. .. ..- attr(*, "names")= chr [1:2] "Sepal.Length" "Species"
$ model :'data.frame': 150 obs. of 2 variables:
..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language Sepal.Length ~ Species
.. .. ..- attr(*, "variables")= language list(Sepal.Length, Species)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "Sepal.Length" "Species"
.. .. .. .. ..$ : chr "Species"
.. .. ..- attr(*, "term.labels")= chr "Species"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: 0x000002715c292770>
.. .. ..- attr(*, "predvars")= language list(Sepal.Length, Species)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
.. .. .. ..- attr(*, "names")= chr [1:2] "Sepal.Length" "Species"
- attr(*, "class")= chr "lm"
(Intercept) Speciesversicolor Speciesvirginica
5.006 0.930 1.582
What if I want to obtain the "Intercept"? Or the coefficient for versicolor? What if I want the full row for versicolor?
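One way to pull these out (a sketch):

coef(mdl.lm)["(Intercept)"]
coef(mdl.lm)["Speciesversicolor"]
summary(mdl.lm)$coefficients["Speciesversicolor", ]
summary(mdl.lm)$coefficients["Speciesversicolor", "Pr(>|t|)"]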
(Intercept)
5.006
Speciesversicolor
0.93
Estimate Std. Error t value Pr(>|t|)
9.300000e-01 1.029579e-01 9.032819e+00 8.770194e-16
[1] 8.770194e-16
Play around with the model summary and obtain the t values for the three levels. You can do this by referring to the coefficients as above.
What about residuals (the difference between the observed and fitted values) and fitted values? These are available via residuals(mdl.lm) and fitted(mdl.lm). We can also obtain overall goodness-of-fit measures:
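AIC(mdl.lm)    # Akaike Information Criterion
BIC(mdl.lm)    # Bayesian Information Criterion
logLik(mdl.lm) # log-likelihood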
[1] 231.452
[1] 243.4945
'log Lik.' -111.726 (df=4)
Or use glance from broom, e.g., glance(mdl.lm).
Are the above informative? Of course, not directly. If we want to test the overall significance of the model, we run a null model (aka intercept-only) and compare the two models.
mdl.lm.Null <- iris %>%
lm(Sepal.Length ~ 1, data = .)
mdl.comp <- anova(mdl.lm.Null, mdl.lm)
mdl.comp
Analysis of Variance Table
Model 1: Sepal.Length ~ 1
Model 2: Sepal.Length ~ Species
Res.Df RSS Df Sum of Sq F Pr(>F)
1 149 102.168
2 147 38.956 2 63.212 119.26 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The results show that adding the factor "Species" improves the model fit. We can write this as follows: model comparison showed that the addition of Species significantly improved the model fit when compared with an intercept-only model (\(F(2) = 119.26\), \(p = 1.67 \times 10^{-31}\)).
Let's plot our data, using the model's fitted values for the trend line only.
iris %>%
ggplot(aes(x = Species, y = Sepal.Length))+
geom_boxplot() +
labs(x = "Species", y = "Length", title = "Boxplot and predicted trend line", subtitle = "with ggplot2") +
theme_bw() + theme(text = element_text(size = 15))+
geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue")
This allows us to plot the fitted values from our model with the predicted linear trend. This is exactly the same as our original data.
We can also plot the predicted means and linear trend
iris %>%
ggplot(aes(x = Species, y = predict(mdl.lm)))+
geom_boxplot(color = "blue") +
labs(x = "Species", y = "Length", title = "Predicted means and trend line", subtitle = "with ggplot2") +
theme_bw() + theme(text = element_text(size = 15))+
geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue")
We can also plot the actual data, the predicted means and linear trend
iris %>%
ggplot(aes(x = Species, y = Sepal.Length))+
geom_boxplot()+
geom_boxplot(aes(x = Species, y = predict(mdl.lm)), color = "blue") +
labs(x = "Species", y = "Length", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with ggplot2") +
theme_bw() + theme(text = element_text(size = 15))+
geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue")
Based on our model’s summary, can you tell me if there is a difference between Versicolor and Virginica?
Call:
lm(formula = Sepal.Length ~ Species, data = .)
Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
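The pairwise comparisons below come from emmeans; a sketch of the call (the object name mdl.emmeans is assumed, matching its use in the plotting code further down):

mdl.emmeans <- emmeans(mdl.lm, pairwise ~ Species, adjust = "fdr")
mdl.emmeans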
$emmeans
Species emmean SE df lower.CL upper.CL
setosa 5.01 0.0728 147 4.86 5.15
versicolor 5.94 0.0728 147 5.79 6.08
virginica 6.59 0.0728 147 6.44 6.73
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
setosa - versicolor -0.930 0.103 147 -9.033 <.0001
setosa - virginica -1.582 0.103 147 -15.366 <.0001
versicolor - virginica -0.652 0.103 147 -6.333 <.0001
P value adjustment: fdr method for 3 tests
How to interpret the output? Discuss with your neighbour and share with the group.
Hint… Look at the emmeans values for each level of our factor “Species” and the contrasts.
We can use the p values generated from either our linear model or the pairwise comparisons to add significance levels to a plot. We use the code from above and add the significance levels.
iris %>%
ggplot(aes(x = Species, y = Sepal.Length))+
geom_boxplot()+
geom_boxplot(aes(x = Species, y = predict(mdl.lm)), color = "blue") +
labs(x = "Species", y = "Length", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with ggplot2") +
theme_bw() + theme(text = element_text(size = 15))+
geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue") +
geom_signif(comparison = list(c("setosa", "versicolor")),
map_signif_level = TRUE, test = function(a, b) {
list(p.value = summary(mdl.emmeans)$contrasts[1, 6])}) +
geom_signif(comparison = list(c("versicolor", "virginica")),
map_signif_level = TRUE, test = function(a, b) {
list(p.value = summary(mdl.emmeans)$contrasts[3, 6])})
So far, we have only looked at "Sepal.Length". What about the other outcomes? How informative are they? Do we have a statistical difference between the three levels of our predictor? You can check this in your spare time.
We have so far looked at the Linear Model. The underlying assumption of a linear model is that we have a normal (Gaussian) distribution; this is the model to use when we have a numeric outcome. What if our outcome is not numeric? What if we have two categories, e.g., black vs white, correct vs incorrect, yes vs no? These are categorical binary outcomes; we look at Logistic Regression in the next section.
Here we will look at an example where the outcome is binary. The simulated data are structured as follows: we asked one participant to listen to 165 sentences and to judge whether these were "grammatical" or "ungrammatical". There were 105 sentences that were "grammatical" and 60 that were "ungrammatical". This fictitious example can apply to any other situation. Think geography: 165 terrains, 105 "flat" and 60 "non-flat". It applies to any case where you need to categorise the outcome into two groups.
Let’s load in the data and do some basic summaries
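A sketch of the import (the file name is hypothetical; readr's read_csv produces the spec_tbl_df shown below):

grammatical <- read_csv("grammatical.csv") # hypothetical file name
str(grammatical)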
spec_tbl_df [165 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ grammaticality: chr [1:165] "grammatical" "grammatical" "grammatical" "grammatical" ...
$ response : chr [1:165] "yes" "yes" "yes" "yes" ...
- attr(*, "spec")=
.. cols(
.. grammaticality = col_character(),
.. response = col_character()
.. )
Let's run a first GLM (Generalised Linear Model). A GLM uses the family "binomial", as it assumes the outcome has a binomial distribution. In general, results from a Logistic Regression are close to what we get from SDT (see below).
Before running the model, we change the reference level for both response and grammaticality. The reference level of our GLM is the "no" response to the "ungrammatical" category; any change from this reference will show in the coefficients as "yes" responses to the "grammatical" category.
The results below show the log-odds for our model.
grammatical <- grammatical %>%
mutate(response = factor(response, levels = c("no", "yes")),
grammaticality = factor(grammaticality, levels = c("ungrammatical", "grammatical")))
grammatical %>%
group_by(grammaticality, response) %>%
table()
response
grammaticality no yes
ungrammatical 50 10
grammatical 5 100
mdl.glm <- grammatical %>%
glm(response ~ grammaticality, data = ., family = binomial)
summary(mdl.glm)
Call:
glm(formula = response ~ grammaticality, family = binomial, data = .)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4676 -0.6039 0.3124 0.3124 1.8930
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6094 0.3464 -4.646 3.38e-06 ***
grammaticalitygrammatical 4.6052 0.5744 8.017 1.08e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 210.050 on 164 degrees of freedom
Residual deviance: 94.271 on 163 degrees of freedom
AIC: 98.271
Number of Fisher Scoring iterations: 5
The results show that moving from "ungrammatical" to "grammatical" increases the log-odds of a "yes" response by 4.605 (the intercept shows that for "ungrammatical" items, the log-odds of a "yes" are -1.609). The actual log-odds of a "yes" response to a grammatical item are therefore -1.609 + 4.605 = 2.996.
Log-odds can be converted into the odds of an event. For our model, the odds of an "ungrammatical" item receiving a "yes" response are a mere 0.2; the odds of a "grammatical" item receiving a "yes" are 20, i.e., a "yes" is 20 times more likely than a "no".
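A sketch of the computation, exponentiating the coefficients:

exp(coef(mdl.glm)[[1]]) # odds of "yes" for ungrammatical
exp(coef(mdl.glm)[[1]] + coef(mdl.glm)[[2]]) # odds of "yes" for grammatical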
[1] 0.2
[1] 20
If you want to talk about the percentage "accuracy" of our model, we can transform the log-odds into probabilities. The probability of a "yes" response is 16.7% for an ungrammatical item and 95.2% for a grammatical item.
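A sketch using plogis to convert log-odds into probabilities:

plogis(coef(mdl.glm)[[1]]) # probability of "yes" for ungrammatical
plogis(coef(mdl.glm)[[1]] + coef(mdl.glm)[[2]]) # probability of "yes" for grammatical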
[1] 0.1666667
[1] 0.952381
grammatical <- grammatical %>%
mutate(prob = predict(mdl.glm, type = "response"))
grammatical %>%
ggplot(aes(x = as.numeric(grammaticality), y = prob)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = T) + theme_bw(base_size = 20)+
labs(y = "Probability", x = "")+
coord_cartesian(ylim = c(0,1))+
scale_x_discrete(limits = c("Ungrammatical", "Grammatical"))
We are generally interested in performance, i.e., whether we have "accurately" categorised the outcome, and at the same time we want to evaluate biases in responses. When deciding on categories, we are usually biased in our selection.
Let’s ask the question: How many of you have a Mac laptop and how many a Windows laptop? For those with a Mac, what was the main reason for choosing it? Are you biased in anyway by your decision?
To correct for these biases, we use some variants from Signal Detection Theory to obtain the true estimates without being influenced by the biases.
Let’s do some stats on this
|  | Yes (Response) | No (Response) | Total |
|---|---|---|---|
| Grammatical (Yes Actual) | TP = 100 | FN = 5 | 105 |
| Ungrammatical (No Actual) | FP = 10 | TN = 50 | 60 |
| Total | 110 | 55 | 165 |
grammatical <- grammatical %>%
mutate(response = factor(response, levels = c("yes", "no")),
grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))
## TP = True Positive (Hit); FP = False Positive; FN = False Negative; TN = True Negative
TP <- nrow(grammatical %>%
filter(grammaticality == "grammatical" &
response == "yes"))
FN <- nrow(grammatical %>%
filter(grammaticality == "grammatical" &
response == "no"))
FP <- nrow(grammatical %>%
filter(grammaticality == "ungrammatical" &
response == "yes"))
TN <- nrow(grammatical %>%
filter(grammaticality == "ungrammatical" &
response == "no"))
TP
FN
FP
TN
TP + FN + FP + TN # total number of trials
[1] 100
[1] 5
[1] 10
[1] 50
[1] 165
# proportion correct and proportion incorrect
(TP + TN) / (TP + FN + FP + TN)
[1] 0.9090909
(FP + FN) / (TP + FN + FP + TN)
[1] 0.09090909
# When stimulus = yes, how many times response = yes?
TP/(TP+FN) # True Positive Rate (or Sensitivity/Recall)
[1] 0.952381
# When stimulus = no, how many times response = yes?
FP/(FP+TN) # False Positive Rate
[1] 0.1666667
# When stimulus = no, how many times response = no?
TN/(FP+TN) # True Negative Rate (or Specificity)
[1] 0.8333333
# When response = yes, how many times stimulus = yes?
TP/(TP+FP) # Precision (here numerically equal to overall accuracy)
[1] 0.9090909
# getting dprime (the sensitivity index); beta (bias criterion, 0-1; lower = increase in "yes");
# Aprime (estimate of discriminability, 0-1; 1 = good discrimination, 0.5 = chance);
# bppd (b prime prime d, -1 to 1; 0 = no bias, negative = tendency to respond "yes",
# positive = tendency to respond "no"); c (index of bias, in standard-deviation units)
# (see also https://www.r-bloggers.com/compute-signal-detection-theory-indices-with-r/amp/)
psycho::dprime(TP, FP, FN, TN,
n_targets = TP+FN,
n_distractors = FP+TN,
adjust=F)
$dprime
[1] 2.635813
$beta
[1] 0.3970026
$aprime
[1] 0.9419643
$bppd
[1] -0.5076923
$c
[1] -0.3504848
The most important measure above is d-prime. It models the difference between the rate of "True Positive" responses and "False Positive" responses in standard units (z-scores). The formula can be written as:

\(d' = Z(\textrm{True Positive Rate}) - Z(\textrm{False Positive Rate})\)
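We can verify this by hand, converting the two rates into z-scores with qnorm:

qnorm(TP/(TP+FN)) - qnorm(FP/(FP+TN))
# [1] 2.635813, matching the psycho::dprime output above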
The code below demonstrates the links between our GLM model and what we obtained above from SDT. The predictions table shows that our GLM was successful at obtaining predictions that are identical to our initial data setup. Look at the table here and the table above. Once we have created our table of outcomes, we can compute percent correct, specificity, sensitivity, the Kappa score, etc. This yields the actual values, with the SDs related to variation in responses.
## predict(mdl.glm)>0.5 is identical to
## predict(glm(response~grammaticality,data=grammatical,family = binomial),type="response")
grammatical <- grammatical %>%
mutate(response = factor(response, levels = c("yes", "no")),
grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))
mdl.glm.C <- grammatical %>%
glm(response ~ grammaticality, data = .,family = binomial)
tbl.glm <- table(grammatical$response, predict(mdl.glm.C, type = "response")>0.5)
colnames(tbl.glm) <- c("grammatical", "ungrammatical")
tbl.glm
grammatical ungrammatical
yes 100 10
no 5 50
If you look at the results from SDT above, these results are the same as the following:
Accuracy: (TP+TN)/Total (0.9090909)
True Positive Rate (or Sensitivity): TP/(TP+FN) (0.952381)
True Negative Rate (or Specificity): TN/(FP+TN) (0.8333333)
The values obtained here match those obtained from SDT. For d-prime, the difference stems from the use of the logit variant of the binomial family. By using a probit variant, one obtains the same values (see here for more details). A probit variant models the z-score differences in the outcome, evaluated as change in one standard unit: it models the change from "no" responses to "ungrammatical" items into "yes" responses to "grammatical" items in z-scores, which is the same conceptual underpinning as d-prime in Signal Detection Theory.
## d prime
psycho::dprime(TP, FP, FN, TN,
n_targets = TP+FN,
n_distractors = FP+TN,
adjust=F)$dprime
[1] 2.635813
## GLM with probit
coef(glm(response ~ grammaticality, data = grammatical, family = binomial(probit)))[2]
grammaticalityungrammatical
2.635813
If your data do not fit a binomial distribution but are multinomial (i.e., three or more response categories) or Poisson (count data), then you need a model with the appropriate family: family = poisson in glm, or a dedicated multinomial model (e.g., nnet::multinom).
These models work perfectly with rating data. Ratings are inherently ordered, 1, 2, … n, and we expect an increase (or decrease) in overall ratings from 1 to n. To demonstrate this, we will work through an example using the package "ordinal". The data come from a rating experiment where six participants rated the percept of nasality in the production of particular consonants in Arabic, produced by nine subjects. The ratings were from 1 to 5. This example can apply to any study, e.g., rating the grammaticality of sentences, rating how positive the sentiments in an article are, interview responses, etc.
We start by importing the data and processing it; we change the reference level of the predictor:
rating <- rating %>%
mutate(Response = factor(Response),
Context = factor(Context)) %>%
mutate(Context = relevel(Context, "isolation"))
rating
We run our first clm as a simple model, i.e., with no random effects:
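mdl.clm <- rating %>%
  clm(Response ~ Context, data = .)
summary(mdl.clm)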
formula: Response ~ Context
data: .
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Context3--3 -0.1384 0.5848 -0.237 0.8130
Context3-n 3.5876 0.4721 7.600 2.96e-14 ***
Context3-o -0.4977 0.3859 -1.290 0.1971
Context7-n 2.3271 0.5079 4.582 4.60e-06 ***
Context7-o 0.2904 0.4002 0.726 0.4680
Contextn-3 2.8957 0.6685 4.331 1.48e-05 ***
Contextn-7 2.2678 0.4978 4.556 5.22e-06 ***
Contextn-n 2.8697 0.4317 6.647 2.99e-11 ***
Contextn-o 3.5152 0.4397 7.994 1.30e-15 ***
Contexto-3 -0.2540 0.4017 -0.632 0.5272
Contexto-7 -0.6978 0.3769 -1.851 0.0641 .
Contexto-n 2.9640 0.4159 7.126 1.03e-12 ***
Contexto-o -0.6147 0.3934 -1.562 0.1182
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -1.4615 0.2065 -7.077
2|3 0.4843 0.1824 2.655
3|4 1.5492 0.2044 7.578
4|5 3.1817 0.2632 12.089
We can evaluate whether "Context" improves the model fit by comparing a null model with our model. Of course, "Context" improves the model fit.
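A sketch of the comparison (the object names match the output below):

mdl.clm.Null <- rating %>%
  clm(Response ~ 1, data = .)
anova(mdl.clm.Null, mdl.clm)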
Likelihood ratio tests of cumulative link models:
no.par AIC logLik LR.stat df Pr(>Chisq)
mdl.clm.Null 4 1281.1 -636.56
mdl.clm 17 1086.3 -526.16 220.8 13 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As a way to interpret the model, we can look at the coefficients and make sense of the results. A CLM is a logistic model with a cumulative link. The "Coefficients" are the estimates for each level of the fixed effect; the "Threshold coefficients" are those of the response. For the former, a negative coefficient indicates a negative association with the response, and a positive one a positive association; the p values indicate the significance of each level. The "Threshold coefficients" show the cumulative cut-points between ratings 1|2, 2|3, 3|4 and 4|5, which indicate an overall increase in the ratings from 1 to 5.
We use a modified version of a plotting function that allows us to visualise the effects. For this, we use base R plotting functions:
par(oma=c(1, 0, 0, 3),mgp=c(2, 1, 0))
xlimNas = c(min(mdl.clm$beta), max(mdl.clm$beta))
ylimNas = c(0,1)
plot(0,0,xlim=xlimNas, ylim=ylimNas, type="n", ylab=expression(Probability), xlab="", xaxt = "n",main="Predicted curves - Nasalisation",cex=2,cex.lab=1.5,cex.main=1.5,cex.axis=1.5)
axis(side = 1, at = c(0,mdl.clm$beta),labels = levels(rating$Context), las=2,cex=2,cex.lab=1.5,cex.axis=1.5)
xsNas = seq(xlimNas[1], xlimNas[2], length.out=100)
lines(xsNas, plogis(mdl.clm$Theta[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2] - xsNas)-plogis(mdl.clm$Theta[1] - xsNas), col='red')
Let’s generate a new dataframe that we will use later on for our mixed models
## Courtesy of Bodo Winter
set.seed(666)
#we create 6 subjects
subjects <- paste0('S', 1:6)
#here we add repetitions within speakers
subjects <- rep(subjects, each = 20)
items <- paste0('Item', 1:20)
#below repeats
items <- rep(items, 6)
#below generates random numbers (exponentially distributed) to serve as simulated log frequencies
logFreq <- round(rexp(20)*5, 2)
#below we are repeating the logFreq 6 times to fit with the number of speakers and items
logFreq <- rep(logFreq, 6)
xdf <- data.frame(subjects, items, logFreq)
#below removes the individual variables we had created because they are already in the dataframe
rm(subjects, items, logFreq)
xdf$Intercept <- 300
submeans <- rep(rnorm(6, sd = 40), 20)
#sorting ensures each subject keeps the same mean across repetitions
submeans <- sort(submeans)
xdf$submeans <- submeans
#we create the same thing for items... we allow the items mean to vary between words...
itsmeans <- rep(rnorm(20, sd = 20), 6)
xdf$itsmeans <- itsmeans
xdf$error <- rnorm(120, sd = 20)
#here we create an effect column:
#for each unit of logFreq, duration decreases by 5
xdf$effect <- -5 * xdf$logFreq
xdf$dur <- xdf$Intercept + xdf$submeans + xdf$itsmeans + xdf$error + xdf$effect
#below subsets the data to keep only a few columns: -c(4:8) removes columns 4 to 8
xreal <- xdf[,-c(4:8)]
head(xreal)
Let’s start by doing a correlation test and plotting the data. Our results show that there is a negative correlation between duration and LogFrequency, and the plot shows this decrease.
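A sketch, using rcorr from Hmisc on the two numeric columns:

rcorr(as.matrix(xreal[, c("logFreq", "dur")]))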
logFreq dur
logFreq 1.00 -0.54
dur -0.54 1.00
n= 120
P
logFreq dur
logFreq 0
dur 0
Let's run a simple linear model on the data. As we can see below, there are some issues with the "simple" linear model: we had set our SD for subjects to 40, but this was picked up as around 120 (see the histogram of residuals), and the QQ plot is not "normal".
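A sketch of the call:

xreal %>%
  lm(dur ~ logFreq, data = .) %>%
  summary()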
Call:
lm(formula = dur ~ logFreq, data = .)
Residuals:
Min 1Q Median 3Q Max
-94.322 -35.465 -4.364 33.020 123.955
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 337.9730 6.2494 54.081 < 2e-16 ***
logFreq -5.4601 0.7846 -6.959 2.06e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 48.29 on 118 degrees of freedom
Multiple R-squared: 0.291, Adjusted R-squared: 0.285
F-statistic: 48.43 on 1 and 118 DF, p-value: 2.057e-10
Our Linear Mixed effects Model takes into account the random effects we added, following our model specifications. We use Maximum Likelihood estimation (REML = FALSE), as this is what we will use for model comparison. The Linear Mixed Model reflects our model specifications: the SD of our subjects is picked up correctly (38.36, close to the 40 we simulated). The model results are "almost" the same as in our linear model above: the coefficient for the "Intercept" is 337.973 and the coefficient for LogFrequency is -5.460. This indicates that for each unit of increase in LogFrequency, there is a decrease in duration of 5.460 (ms).
mdl.lmer.xreal <- xreal %>%
lmer(dur ~ logFreq +(1|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal)
Linear mixed model fit by maximum likelihood . t-tests use
Satterthwaite's method [lmerModLmerTest]
Formula: dur ~ logFreq + (1 | subjects) + (1 | items)
Data: .
AIC BIC logLik deviance df.resid
1105.8 1119.8 -547.9 1095.8 115
Scaled residuals:
Min 1Q Median 3Q Max
-2.06735 -0.60675 0.07184 0.61122 2.39854
Random effects:
Groups Name Variance Std.Dev.
items (Intercept) 589.8 24.29
subjects (Intercept) 1471.7 38.36
Residual 284.0 16.85
Number of obs: 120, groups: items, 20; subjects, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 337.973 17.587 9.126 19.218 1.08e-08 ***
logFreq -5.460 1.004 19.215 -5.436 2.92e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
logFreq -0.322
This second model adds a by-subject random slope. Random slopes allow the variation that exists in the random effects to be taken into account; an intercept-only model assigns an averaged effect to all participants.
mdl.lmer.xreal.2 <- xreal %>%
lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal.2)
Linear mixed model fit by maximum likelihood . t-tests use
Satterthwaite's method [lmerModLmerTest]
Formula: dur ~ logFreq + (logFreq | subjects) + (1 | items)
Data: .
AIC BIC logLik deviance df.resid
1109.5 1129.0 -547.7 1095.5 113
Scaled residuals:
Min 1Q Median 3Q Max
-2.1087 -0.6067 0.0623 0.5828 2.4564
Random effects:
Groups Name Variance Std.Dev. Corr
items (Intercept) 5.897e+02 24.2838
subjects (Intercept) 1.400e+03 37.4229
logFreq 2.902e-02 0.1704 1.00
Residual 2.829e+02 16.8196
Number of obs: 120, groups: items, 20; subjects, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 337.973 17.245 9.093 19.598 9.50e-09 ***
logFreq -5.460 1.007 19.361 -5.424 2.92e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
logFreq -0.267
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular
But where are our p values? The lme4 developers decided not to include p values, due to various issues with estimating the degrees of freedom. What we can do instead is compare models: we create a null model to allow for significance testing. As expected, our predictor contributes significantly to the model fit.
mdl.lmer.xreal.Null <- xreal %>%
lmer(dur ~ 1 + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
anova(mdl.lmer.xreal.Null, mdl.lmer.xreal.2)
Data: .
Models:
mdl.lmer.xreal.Null: dur ~ 1 + (logFreq | subjects) + (1 | items)
mdl.lmer.xreal.2: dur ~ logFreq + (logFreq | subjects) + (1 | items)
npar AIC BIC logLik deviance Chisq Df
mdl.lmer.xreal.Null 6 1125.4 1142.1 -556.68 1113.4
mdl.lmer.xreal.2 7 1109.5 1129.0 -547.73 1095.5 17.892 1
Pr(>Chisq)
mdl.lmer.xreal.Null
mdl.lmer.xreal.2 2.339e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Also, do we really need random slopes? From the result below, we don't seem to need them at all: adding random slopes does not improve the model fit. I always recommend testing this; most of the time I keep random slopes.
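A sketch of the comparison:

anova(mdl.lmer.xreal, mdl.lmer.xreal.2)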
Data: .
Models:
mdl.lmer.xreal: dur ~ logFreq + (1 | subjects) + (1 | items)
mdl.lmer.xreal.2: dur ~ logFreq + (logFreq | subjects) + (1 | items)
npar AIC BIC logLik deviance Chisq Df
mdl.lmer.xreal 5 1105.8 1119.8 -547.92 1095.8
mdl.lmer.xreal.2 7 1109.5 1129.0 -547.73 1095.5 0.3788 2
Pr(>Chisq)
mdl.lmer.xreal
mdl.lmer.xreal.2 0.8274
But if you are really (really!!!) obsessed with p values, then you can also use lmerTest. BUT use it only after comparing models to evaluate the contribution of predictors.
mdl.lmer.xreal.lmerTest <- xreal %>%
lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = TRUE)
summary(mdl.lmer.xreal.lmerTest)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: dur ~ logFreq + (logFreq | subjects) + (1 | items)
Data: .
REML criterion at convergence: 1086.1
Scaled residuals:
Min 1Q Median 3Q Max
-2.09691 -0.60118 0.06418 0.58483 2.46245
Random effects:
Groups Name Variance Std.Dev. Corr
items (Intercept) 629.5679 25.0912
subjects (Intercept) 1651.2357 40.6354
logFreq 0.0342 0.1849 1.00
Residual 282.8593 16.8184
Number of obs: 120, groups: items, 20; subjects, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 337.973 18.526 7.396 18.24 2.03e-07 ***
logFreq -5.460 1.038 18.136 -5.26 5.18e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
logFreq -0.250
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular
Our final model uses REML (Restricted Maximum Likelihood) to estimate the variance components.
mdl.lmer.xreal.Full <- xreal %>%
lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = TRUE)
summary(mdl.lmer.xreal.Full)
Linear mixed model fit by REML ['lmerMod']
Formula: dur ~ logFreq + (logFreq | subjects) + (1 | items)
Data: .
REML criterion at convergence: 1086.1
Scaled residuals:
Min 1Q Median 3Q Max
-2.09691 -0.60118 0.06418 0.58483 2.46245
Random effects:
Groups Name Variance Std.Dev. Corr
items (Intercept) 629.5679 25.0912
subjects (Intercept) 1651.2357 40.6354
logFreq 0.0342 0.1849 1.00
Residual 282.8593 16.8184
Number of obs: 120, groups: items, 20; subjects, 6
Fixed effects:
Estimate Std. Error t value
(Intercept) 337.973 18.526 18.24
logFreq -5.460 1.038 -5.26
Correlation of Fixed Effects:
(Intr)
logFreq -0.250
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular
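The remaining outputs below can be obtained as follows (a sketch):

anova(mdl.lmer.xreal.Full) # F test for the fixed effect
coef(mdl.lmer.xreal.Full)  # per-item and per-subject coefficients
fixef(mdl.lmer.xreal.Full) # fixed-effect estimates (and single elements thereof)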
Analysis of Variance Table
npar Sum Sq Mean Sq F value
logFreq 1 7826.9 7826.9 27.671
$items
(Intercept) logFreq
Item1 352.3567 -5.460115
Item10 331.7618 -5.460115
Item11 324.7269 -5.460115
Item12 350.2318 -5.460115
Item13 353.1174 -5.460115
Item14 311.8355 -5.460115
Item15 354.0591 -5.460115
Item16 353.9389 -5.460115
Item17 288.7843 -5.460115
Item18 362.4702 -5.460115
Item19 338.1424 -5.460115
Item2 325.1855 -5.460115
Item20 359.7414 -5.460115
Item3 370.1804 -5.460115
Item4 302.4265 -5.460115
Item5 350.0499 -5.460115
Item6 338.9482 -5.460115
Item7 362.8402 -5.460115
Item8 295.5943 -5.460115
Item9 333.0693 -5.460115
$subjects
(Intercept) logFreq
S1 314.4694 -5.567073
S2 303.9037 -5.615155
S3 314.2920 -5.567881
S4 318.4282 -5.549058
S5 373.3006 -5.299350
S6 403.4443 -5.162175
attr(,"class")
[1] "coef.mer"
(Intercept) logFreq
337.973044 -5.460115
(Intercept)
337.973
logFreq
-5.460115
In general, I use the predictions from my final model in any plots. To generate these, we can use the following.
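A minimal sketch (assuming ggplot2; the aesthetic choices here are illustrative only):

xreal %>%
  mutate(pred = predict(mdl.lmer.xreal.Full)) %>%
  ggplot(aes(x = logFreq, y = pred)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Log Frequency", y = "Predicted duration (ms)",
       title = "Predictions from the final model") +
  theme_bw()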
The code above used Linear Mixed effects Modelling with a numeric outcome. In some cases (as we have seen above), we may have a binary, count, multinomial, or ordinal (rating) outcome. The code below gives you an idea of how to specify these models:
## Binomial family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=binomial)
## Poisson family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=poisson)
## Multinomial family
## a bit complicated as there is a need to use Bayesian approaches, see e.g.,
## glmmADMB
## mixcat
## MCMCglmm
## see https://gist.github.com/casallas/8263818
## Rating data, use following
## ordinal::clmm(outcome~predictor(s)+(1|subject)+(1|items)..., data=data)
## Remember to test for random effects and whether slopes are needed.
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] PresenceAbsence_1.1.9 psycho_0.6.1 corrplot_0.84
[4] Hmisc_4.5-0 Formula_1.2-4 survival_3.2-7
[7] lattice_0.20-41 knitr_1.31 emmeans_1.5.4
[10] broom_0.7.5 ordinal_2019.12-10 ggsignif_0.6.1
[13] palmerpenguins_0.1.0 forcats_0.5.1 stringr_1.4.0
[16] dplyr_1.0.5 purrr_0.3.4 readr_1.4.0
[19] tidyr_1.1.3 tibble_3.1.0 ggplot2_3.3.3
[22] tidyverse_1.3.0 lme4_1.1-26 Matrix_1.3-2
loaded via a namespace (and not attached):
[1] TH.data_1.0-10 minqa_1.2.4 colorspace_2.0-0
[4] ellipsis_0.3.1 rio_0.5.26 htmlTable_2.1.0
[7] estimability_1.3 base64enc_0.1-3 fs_1.5.0
[10] rstudioapi_0.13 farver_2.1.0 fansi_0.4.2
[13] mvtnorm_1.1-1 lubridate_1.7.10 xml2_1.3.2
[16] codetools_0.2-18 splines_4.0.3 jsonlite_1.7.2
[19] nloptr_1.2.2.2 cluster_2.1.1 dbplyr_2.1.0
[22] png_0.1-7 compiler_4.0.3 httr_1.4.2
[25] backports_1.2.1 assertthat_0.2.1 cli_2.3.1
[28] htmltools_0.5.1.1 tools_4.0.3 gtable_0.3.0
[31] glue_1.4.2 Rcpp_1.0.6 carData_3.0-4
[34] jquerylib_0.1.3 cellranger_1.1.0 vctrs_0.3.6
[37] nlme_3.1-152 xfun_0.21 openxlsx_4.2.3
[40] rvest_0.3.6 lifecycle_1.0.0 pacman_0.5.1
[43] statmod_1.4.35 MASS_7.3-53.1 zoo_1.8-8
[46] scales_1.1.1 hms_1.0.0 sandwich_3.0-0
[49] RColorBrewer_1.1-2 yaml_2.2.1 curl_4.3
[52] gridExtra_2.3 sass_0.3.1 rpart_4.1-15
[55] latticeExtra_0.6-29 stringi_1.5.3 highr_0.8
[58] ucminf_1.1-4 checkmate_2.0.0 boot_1.3-27
[61] zip_2.1.1 rlang_0.4.10 pkgconfig_2.0.3
[64] evaluate_0.14 htmlwidgets_1.5.3 labeling_0.4.2
[67] tidyselect_1.1.0 plyr_1.8.6 magrittr_2.0.1
[70] R6_2.5.0 generics_0.1.0 multcomp_1.4-16
[73] DBI_1.1.1 pillar_1.5.1 haven_2.3.1
[76] foreign_0.8-81 withr_2.4.1 mgcv_1.8-34
[79] abind_1.4-5 nnet_7.3-15 modelr_0.1.8
[82] crayon_1.4.1 car_3.0-10 utf8_1.1.4
[85] rmarkdown_2.7 jpeg_0.1-8.1 grid_4.0.3
[88] readxl_1.3.1 data.table_1.14.0 reprex_1.0.0
[91] digest_0.6.27 xtable_1.8-4 numDeriv_2016.8-1.1
[94] munsell_0.5.0 bslib_0.2.4