In this session, we will look at basic functions in R that will help us run some inferential statistics. These will help us evaluate the relationship between one (or more) predictor(s) and an outcome. We will start with a basic example using one of R's built-in dataframes.
## Use the code below to check whether you have all required packages installed.
## Any that are missing will be installed; the loop then loads all of them.
requiredPackages = c('tidyverse', 'psycho', 'ordinal', 'PresenceAbsence', 'broom', 'emmeans', 'car','knitr')
for (p in requiredPackages) {
  if (!require(p, character.only = TRUE)) install.packages(p)
  library(p, character.only = TRUE)
}
We will use one of the built-in R datasets. You can check all available datasets in R using the following:
data()
# or below for all datasets available in all installed packages
data(package = .packages(all.available = TRUE))
We will use the iris dataset, which ships with R in the package datasets. Its structure:
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We have a dataframe with 150 observations and 5 variables; 4 numeric and 1 factor with 3 levels.
We summarise the data to see the trends:
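The output below comes from:

summary(iris)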
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
So we have a balanced dataframe (50 observations under each level of the factor Species), with no missing values (aka NA).
What if you want to summarise the data and get the mean and SD by each level of the factor? Try some additional summaries, as in the sketch below.
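A minimal sketch with dplyr, summarising Sepal.Length (any numeric column works the same way):

iris %>%
  group_by(Species) %>%
  summarise(mean = mean(Sepal.Length), SD = sd(Sepal.Length))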
We can make a boxplot of the data and add a trend line. This allows us to visualise the median and quartiles, the spread, and any outliers… all in the same plot!
iris %>%
ggplot(aes(x = Species, y = Sepal.Length))+
geom_boxplot() +
labs(x = "Species", y = "Length", title = "Boxplot and trend line", subtitle="with ggplot2") +
theme_bw() + theme(text = element_text(size = 15))+
geom_smooth(aes(x = as.numeric(Species), y = Sepal.Length), method = "lm")
Here I have used the variable Sepal.Length. You can use any of the additional variables to plot the data.
We use the function cor to obtain the Pearson correlation, and cor.test to run a basic correlation test on our data with significance testing.
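For example:

cor(iris$Sepal.Length, iris$Petal.Length)
cor.test(iris$Sepal.Length, iris$Petal.Length)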
[1] 0.8717538
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8270363 0.9055080
sample estimates:
cor
0.8717538
Can you check whether there is a correlation between Sepal.Length and Petal.Width? What about Petal.Length and Petal.Width?
Above, we ran a correlation test on two predictors. With the package psycho, we can run multiple correlations on multiple predictors, with corrections for multiple comparisons. You can get a table and a plot (see https://www.r-bloggers.com/beautiful-and-powerful-correlation-tables-in-r/amp/).
## correlation using "psycho"
## use "Alt" + "-" to add "<-"
## use "Ctrl" + "Shift" + "M" to obtain "%>%"
## note: avoid calling the result "cor", which would mask the base function cor()
corrRes <- iris %>%
  correlation()
summary(corrRes)
Pearson Full correlation (p value correction: holm):
- Sepal.Length / Sepal.Width: Results of the Pearson correlation showed a non significant small, and negative association between Sepal.Length and Sepal.Width (r(148) = -0.12, p > .1).
- Sepal.Length / Petal.Length: Results of the Pearson correlation showed a significant large, and positive association between Sepal.Length and Petal.Length (r(148) = 0.87, p < .001***).
- Sepal.Width / Petal.Length: Results of the Pearson correlation showed a significant moderate, and negative association between Sepal.Width and Petal.Length (r(148) = -0.43, p < .001***).
- Sepal.Length / Petal.Width: Results of the Pearson correlation showed a significant large, and positive association between Sepal.Length and Petal.Width (r(148) = 0.82, p < .001***).
- Sepal.Width / Petal.Width: Results of the Pearson correlation showed a significant moderate, and negative association between Sepal.Width and Petal.Width (r(148) = -0.37, p < .001***).
- Petal.Length / Petal.Width: Results of the Pearson correlation showed a significant large, and positive association between Petal.Length and Petal.Width (r(148) = 0.96, p < .001***).
You can change the method if you are interested in a Spearman rather than a Pearson correlation: method = "spearman" (method = "pearson", "spearman", or "kendall").
The default corrects for multiple comparisons using the holm method; you can use adjust = "bonferroni" for a Bonferroni correction (adjust = "holm" (default), "bonferroni", "fdr", or "none").
Run multiple correlations below by changing the method or the adjustments, as in the sketch below. Any comments?
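For example, a sketch using the arguments described above:

iris %>%
  correlation(method = "spearman", adjust = "bonferroni") %>%
  summary()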
Up to now, we have done some basic summaries and checked the correlations in the data. The Pearson correlations provided us with significance levels for the correlations between two numeric outcomes. We do not know whether the levels of our variable Species are statistically different based on one of the predictors. We use inferential statistics for that.
T-tests are used when the outcome (DV) is numeric and the predictor is either numeric or categorical (with two levels only).
We can run a basic t-test on our data using the function t.test. This compares the two numeric outcomes and checks whether they differ from each other.
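That is:

t.test(iris$Sepal.Length, iris$Petal.Length)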
Welch Two Sample t-test
data: iris$Sepal.Length and iris$Petal.Length
t = 13.098, df = 211.54, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.771500 2.399166
sample estimates:
mean of x mean of y
5.843333 3.758000
In the iris dataset, we have a categorical predictor, Species, which has three levels.
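For example:

levels(iris$Species)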
[1] "setosa" "versicolor" "virginica"
Let's subset the data and evaluate the differences between setosa and versicolor using a t-test and a linear model. We will also check the normality and homogeneity of variance in the data.
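A sketch of the subsetting (the exact call is not shown in the original; the object name irisSub reappears in the lm() call later):

irisSub <- iris %>%
  filter(Species != "virginica")
levels(irisSub$Species)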
[1] "setosa" "versicolor" "virginica"
But wait a minute… We just subsetted the data, yet levels still shows a three-level factor… why? Let's run a summary:
setosa versicolor virginica
50 50 0
Aha! So the subsetting worked, as there are 0 virginica cases. We need to tell R to drop the unused factor level.
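One way to do this:

irisSub$Species <- droplevels(irisSub$Species) # or factor(irisSub$Species)
levels(irisSub$Species)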
[1] "setosa" "versicolor"
Now this works better.
To check normality of distribution, we use shapiro.test on the numeric outcome. Given that our predictor now has two levels, we need to subset the data again to check the normality of the outcome Sepal.Length for each level of our factor Species.
irisSubSet <- iris %>%
filter(Species == "setosa")
irisSubVers <- iris %>%
filter(Species == "versicolor")
shapiro.test(irisSubSet$Sepal.Length)
Shapiro-Wilk normality test
data: irisSubSet$Sepal.Length
W = 0.9777, p-value = 0.4595
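And the same test for the versicolor subset:

shapiro.test(irisSubVers$Sepal.Length)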
Shapiro-Wilk normality test
data: irisSubVers$Sepal.Length
W = 0.97784, p-value = 0.4647
How do we interpret this non-statistically-significant result? It tells us that the distribution of the data is not statistically different from a normal distribution.
We can also use a density plot to evaluate normality (a sketch below). The results show that both levels have bell-shaped distributions.
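A sketch of such a density plot, using the subset created above:

irisSub %>%
  ggplot(aes(x = Sepal.Length, colour = Species)) +
  geom_density() +
  theme_bw()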
Because our data is normally distributed, we can use the Bartlett test. If our data were non-normally distributed, we would use the Levene test (leveneTest from car). We can check both, plus an F test to compare the two variances.
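The outputs below come from calls of this form (bartlett.test() and var.test() are in base R; leveneTest() is in car):

bartlett.test(Sepal.Length ~ Species, data = irisSub)
var.test(Sepal.Length ~ Species, data = irisSub)
leveneTest(Sepal.Length ~ Species, data = irisSub)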
Bartlett test of homogeneity of variances
data: Sepal.Length by Species
Bartlett's K-squared = 6.8917, df = 1, p-value = 0.00866
F test to compare two variances
data: Sepal.Length by Species
F = 0.46634, num df = 49, denom df = 49, p-value = 0.008657
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.2646385 0.8217841
sample estimates:
ratio of variances
0.4663429
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 8.1727 0.005196 **
98
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In all cases, the statistically significant result indicates that the variances of the two levels of the factor Species differ significantly; i.e., the variances are not equal.
We then run a t-test on the data. We specify the formula as y ~ x and, because the variances are not equal, add var.equal = FALSE.
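That is:

t.test(Sepal.Length ~ Species, data = irisSub, var.equal = FALSE)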
Welch Two Sample t-test
data: Sepal.Length by Species
t = -10.521, df = 86.538, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.1057074 -0.7542926
sample estimates:
mean in group setosa mean in group versicolor
5.006 5.936
To interpret the t-test, we say that there is evidence for a statistically significant difference between the two groups, setosa and versicolor: t(86.54) = -10.52, p < 2.2e-16. The mean of setosa is significantly lower than that of versicolor.
Let us run a linear model on the same data.
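A sketch of the call (the model object name is an assumption; the data object is the irisSub subset from above):

mdl.lm.sub <- lm(Sepal.Length ~ Species, data = irisSub)
summary(mdl.lm.sub)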
Call:
lm(formula = Sepal.Length ~ Species, data = irisSub)
Residuals:
Min 1Q Median 3Q Max
-1.0360 -0.3135 -0.0060 0.2715 1.0640
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.00600 0.06250 80.09 <2e-16 ***
Speciesversicolor 0.93000 0.08839 10.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.442 on 98 degrees of freedom
Multiple R-squared: 0.5304, Adjusted R-squared: 0.5256
F-statistic: 110.7 on 1 and 98 DF, p-value: < 2.2e-16
Any comments? Discuss with your neighbour.
The results of the linear model are exactly the same, albeit with a difference in the sign of the estimate. This indicates that the \(\beta\) coefficient for versicolor is significantly higher than that of setosa by 0.93: t(98) = 10.52, p < 2e-16.
The dataset iris contains three species. We will run an ANOVA and a linear model on the data.
We can use the function aov to run an Analysis of Variance.
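The output below comes from:

summary(aov(Sepal.Length ~ Species, data = iris))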
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.21 31.606 119.3 <2e-16 ***
Residuals 147 38.96 0.265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can use the function lm to run a linear model.
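The output below comes from:

summary(lm(Sepal.Length ~ Species, data = iris))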
Call:
lm(formula = Sepal.Length ~ Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
But wait… how is the linear model comparable to the analysis of variance we ran above? The analysis of variance we saw above can be derived from this linear model: use anova on your linear model.
Here are the results of the initial Analysis of variance:
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.21 31.606 119.3 <2e-16 ***
Residuals 147 38.96 0.265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And here are the results of the linear model with the anova function.
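That is (equivalently, anova(mdl.lm) once the model object is created below):

anova(lm(Sepal.Length ~ Species, data = iris))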
Analysis of Variance Table
Response: Sepal.Length
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.212 31.606 119.26 < 2.2e-16 ***
Residuals 147 38.956 0.265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
They are exactly the same… The model underlying an Analysis of Variance is a linear model. We will continue with a linear model to understand it better.
A linear model runs a regression analysis on the data. We have an outcome (or dependent variable) and a predictor (or independent variable). The formula of a linear model is outcome ~ predictor, which can be read as "outcome as a function of the predictor". We can add "1" to specify an intercept, but this is added to the model by default.
mdl.lm <- lm(Sepal.Length ~ Species, data = iris)
# same as below.
#mdl.lm <- lm(Sepal.Length~1+Species, data=iris)
mdl.lm#also print(mdl.lm)
Call:
lm(formula = Sepal.Length ~ Species, data = iris)
Coefficients:
(Intercept) Speciesversicolor Speciesvirginica
5.006 0.930 1.582
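And the fuller summary below comes from:

summary(mdl.lm)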
Call:
lm(formula = Sepal.Length ~ Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
# from library(broom)
tidy(mdl.lm) %>%
select(term, estimate) %>%
mutate(estimate = round(estimate, 3))
To interpret the model, we need to look at the coefficients. The "Intercept" (= setosa) is 5.006, and the coefficients for versicolor and virginica are respectively 0.930 and 1.582. This tells us that, compared with setosa, moving to versicolor leads to a significant increase of 0.930 (a mean of 5.936), and moving to virginica to a significant increase of 1.582 (a mean of 6.588).
But where are the actual values based on the means in the table above? We run a model that suppresses the intercept (i.e., adding 0 instead of 1), which allows us to obtain the "true" coefficients for each level of our predictor.
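A sketch of the call:

summary(lm(Sepal.Length ~ 0 + Species, data = iris))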
Call:
lm(formula = Sepal.Length ~ 0 + Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Speciessetosa 5.0060 0.0728 68.76 <2e-16 ***
Speciesversicolor 5.9360 0.0728 81.54 <2e-16 ***
Speciesvirginica 6.5880 0.0728 90.49 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared: 0.9925, Adjusted R-squared: 0.9924
F-statistic: 6522 on 3 and 147 DF, p-value: < 2.2e-16
This matches the original data. Setosa has a mean of 5.006, versicolor 5.936, and virginica 6.588. See the table above and the coefficients below.
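A sketch of how the first set of values can be computed from the coefficients of the model with an intercept (the second set, after "The same as", presumably comes from the group means, e.g. mean(iris$Sepal.Length[iris$Species == "setosa"])):

coef(mdl.lm)[[1]]                     # setosa
coef(mdl.lm)[[1]] + coef(mdl.lm)[[2]] # versicolor
coef(mdl.lm)[[1]] + coef(mdl.lm)[[3]] # virginica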
[1] 5.006
[1] 5.936
[1] 6.588
The same as
[1] 5.006
[1] 5.936
[1] 6.588
We can also obtain a nice table of our model summary. We can use the package knitr or xtable.
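For example, with knitr (a sketch):

kable(summary(mdl.lm)$coefficients, digits = 3)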
                    Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)            5.006        0.073    68.762          0
Speciesversicolor      0.930        0.103     9.033          0
Speciesvirginica       1.582        0.103    15.366          0
And the tidy output:

term                 estimate   std.error   statistic   p.value
(Intercept)             5.006       0.073      68.762         0
Speciesversicolor       0.930       0.103       9.033         0
Speciesvirginica        1.582       0.103      15.366         0
Let us dissect the model. If you use str, you will be able to see what is available under our linear model, and then access specific pieces of information from it.
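str(mdl.lm)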
List of 13
$ coefficients : Named num [1:3] 5.01 0.93 1.58
..- attr(*, "names")= chr [1:3] "(Intercept)" "Speciesversicolor" "Speciesvirginica"
$ residuals : Named num [1:150] 0.094 -0.106 -0.306 -0.406 -0.006 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ effects : Named num [1:150] -71.5659 0.8025 7.91 -0.3826 0.0174 ...
..- attr(*, "names")= chr [1:150] "(Intercept)" "Speciesversicolor" "Speciesvirginica" "" ...
$ rank : int 3
$ fitted.values: Named num [1:150] 5.01 5.01 5.01 5.01 5.01 ...
..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
$ assign : int [1:3] 0 1 1
$ qr :List of 5
..$ qr : num [1:150, 1:3] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:3] "(Intercept)" "Speciesversicolor" "Speciesvirginica"
.. ..- attr(*, "assign")= int [1:3] 0 1 1
.. ..- attr(*, "contrasts")=List of 1
.. .. ..$ Species: chr "contr.treatment"
..$ qraux: num [1:3] 1.08 1.05 1.09
..$ pivot: int [1:3] 1 2 3
..$ tol : num 1e-07
..$ rank : int 3
..- attr(*, "class")= chr "qr"
$ df.residual : int 147
$ contrasts :List of 1
..$ Species: chr "contr.treatment"
$ xlevels :List of 1
..$ Species: chr [1:3] "setosa" "versicolor" "virginica"
$ call : language lm(formula = Sepal.Length ~ Species, data = iris)
$ terms :Classes 'terms', 'formula' language Sepal.Length ~ Species
.. ..- attr(*, "variables")= language list(Sepal.Length, Species)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "Sepal.Length" "Species"
.. .. .. ..$ : chr "Species"
.. ..- attr(*, "term.labels")= chr "Species"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(Sepal.Length, Species)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
.. .. ..- attr(*, "names")= chr [1:2] "Sepal.Length" "Species"
$ model :'data.frame': 150 obs. of 2 variables:
..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language Sepal.Length ~ Species
.. .. ..- attr(*, "variables")= language list(Sepal.Length, Species)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "Sepal.Length" "Species"
.. .. .. .. ..$ : chr "Species"
.. .. ..- attr(*, "term.labels")= chr "Species"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(Sepal.Length, Species)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
.. .. .. ..- attr(*, "names")= chr [1:2] "Sepal.Length" "Species"
- attr(*, "class")= chr "lm"
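The coefficients below come from:

coef(mdl.lm)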
(Intercept) Speciesversicolor Speciesvirginica
5.006 0.930 1.582
What if I want to obtain the "Intercept"? Or the coefficient for versicolor? What if I want the full row for versicolor?
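For example:

coef(mdl.lm)["(Intercept)"]
coef(mdl.lm)["Speciesversicolor"]
summary(mdl.lm)$coefficients["Speciesversicolor", ]
summary(mdl.lm)$coefficients["Speciesversicolor", "Pr(>|t|)"]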
(Intercept)
5.006
Speciesversicolor
0.93
Estimate Std. Error t value Pr(>|t|)
9.300000e-01 1.029579e-01 9.032819e+00 8.770194e-16
[1] 8.770194e-16
Play around with the model summary and obtain the t values for the three levels. You can do this by referring to the coefficient as above
What about the residuals (the differences between the observed and fitted values) and fitted values? And overall fit measures such as the AIC, BIC, and log-likelihood?
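The values below are the AIC, BIC, and log-likelihood of the model (residuals() and fitted() give the per-observation values):

AIC(mdl.lm)
BIC(mdl.lm)
logLik(mdl.lm)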
[1] 231.452
[1] 243.4945
'log Lik.' -111.726 (df=4)
Or use the following from broom
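Presumably glance(), which returns a one-row summary of model fit (a sketch):

glance(mdl.lm)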
Are the above informative? Of course, not directly. If we want to test the overall significance of the model, we run a null model (aka intercept-only) and compare the models.
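A sketch of the comparison (the null-model object name is an assumption):

mdl.lm.Null <- lm(Sepal.Length ~ 1, data = iris)
anova(mdl.lm.Null, mdl.lm)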
Analysis of Variance Table
Model 1: Sepal.Length ~ 1
Model 2: Sepal.Length ~ Species
Res.Df RSS Df Sum of Sq F Pr(>F)
1 149 102.168
2 147 38.956 2 63.212 119.26 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The results show that adding the factor "Species" improves the model fit. We can write this as follows: model comparison showed that the addition of Species significantly improved the model fit when compared with an intercept-only model (\(F(2, 147) = 119.26\), \(p = 1.67 \times 10^{-31}\)).
Let’s plot our fitted values but only for the trend line
iris %>%
ggplot(aes(x = Species, y = Sepal.Length))+
geom_boxplot() +
labs(x = "Species", y = "Length", title = "Boxplot and predicted trend line", subtitle = "with ggplot2") +
theme_bw() + theme(text = element_text(size = 15))+
geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue")
This allows us to plot the fitted values from our model with the predicted linear trend. This is exactly the same as our original data.
We can also plot the predicted means and linear trend
iris %>%
ggplot(aes(x = Species, y = predict(mdl.lm)))+
geom_boxplot(color = "blue") +
labs(x = "Species", y = "Length", title = "Predicted means and trend line", subtitle = "with ggplot2") +
theme_bw() + theme(text = element_text(size = 15))+
geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue")
We can also plot the actual data, the predicted means and linear trend
iris %>%
ggplot(aes(x = Species, y = Sepal.Length))+
geom_boxplot()+
geom_boxplot(aes(x = Species, y = predict(mdl.lm)), color = "blue") +
labs(x = "Species", y = "Length", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with ggplot2") +
theme_bw() + theme(text = element_text(size = 15))+
geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue")
Based on our model’s summary, can you tell me if there is a difference between Versicolor and Virginica?
Call:
lm(formula = Sepal.Length ~ Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
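The output below comes from emmeans; a sketch matching the fdr adjustment shown:

emmeans(mdl.lm, pairwise ~ Species, adjust = "fdr")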
$emmeans
Species emmean SE df lower.CL upper.CL
setosa 5.01 0.0728 147 4.86 5.15
versicolor 5.94 0.0728 147 5.79 6.08
virginica 6.59 0.0728 147 6.44 6.73
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
setosa - versicolor -0.930 0.103 147 -9.033 <.0001
setosa - virginica -1.582 0.103 147 -15.366 <.0001
versicolor - virginica -0.652 0.103 147 -6.333 <.0001
P value adjustment: fdr method for 3 tests
How to interpret the output? Discuss with your neighbour and share with the group.
Hint… Look at the emmeans values for each level of our factor “Species” and the contrasts.
So far, we have only looked at "Sepal.Length". What about the other outcomes? How informative are they? Do we have statistical differences between the three levels of our predictor? You can explore this in your spare time.
We have so far looked at the Linear Model. The underlying assumption of linear models is a normal (Gaussian) distribution; this is the model to use when we have a numeric outcome. What if our outcome is not numeric? What if we have two categories, i.e., black vs white? Correct vs incorrect? Yes vs no? These are categorical binary outcomes. In the next section, we look at Logistic Regression.
Here we will look at an example where the outcome is binary. This simulated data is structured as follows: we asked one participant to listen to 165 sentences and to judge whether these are "grammatical" or "ungrammatical". There were 105 sentences that were "grammatical" and 60 "ungrammatical". This fictitious example can apply to any other situation. Think geography: 165 lands, 105 "flat" and 60 "non-flat", etc. This applies to any case where you need to categorise the outcome into two groups.
Let’s load in the data and do some basic summaries
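The import step is omitted here; the structure below comes from:

str(grammatical)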
'data.frame': 165 obs. of 2 variables:
$ grammaticality: Factor w/ 2 levels "grammatical",..: 1 1 1 1 1 1 1 1 1 1 ...
$ response : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
grammatical$response <- relevel(grammatical$response, ref = "yes")
table(grammatical$grammaticality, grammatical$response)
yes no
grammatical 100 5
ungrammatical 10 50
We are generally interested in performance, i.e., whether we have "accurately" categorised the outcome or not, and at the same time we want to evaluate our biases in responses. When deciding on categories, we are usually biased in our selection.
Let's ask the question: How many of you have a Mac laptop and how many a Windows laptop? For those with a Mac, what was the main reason for choosing it? Are you biased in any way by your decision?
To correct for these biases, we use some variants from Signal Detection Theory to obtain the true estimates without being influenced by the biases.
Let’s do some stats on this
                             Yes response   No response   Total
Grammatical (Yes actual)     TP = 100       FN = 5          105
Ungrammatical (No actual)    FP = 10        TN = 50          60
Total                        110            55              165
## TP = True Positive (Hit); FP = False Positive; FN = False Negative; TN = True Negative
TP = 100
FP = 10
FN = 5
TN = 50
Total = 165
(TP+TN)/Total # accuracy
[1] 0.9090909
(FP+FN)/Total # error rate
[1] 0.09090909
# When stimulus = yes, how many times is response = yes?
TP/(TP+FN) # True Positive Rate (Sensitivity)
[1] 0.952381
FP/(FP+TN) # False Positive Rate
[1] 0.1666667
TN/(FP+TN) # True Negative Rate (Specificity)
[1] 0.8333333
TP/(TP+FP) # Precision
[1] 0.9090909
# getting dprime (the sensitivity index); beta (bias criterion, 0-1, lower = increase in "yes");
# Aprime (estimate of discriminability, 0-1, 1 = good discrimination, 0.5 = chance);
# bppd (b prime prime d, -1 to 1; 0 = no bias, negative = tendency to respond "yes",
# positive = tendency to respond "no"); c (index of bias, in standard-deviation units)
# (see also https://www.r-bloggers.com/compute-signal-detection-theory-indices-with-r/amp/)
psycho::dprime(TP, FP, FN, TN,
n_targets = TP+FN,
n_distractors = FP+TN,
adjust=F)
$dprime
[1] 2.635813
$beta
[1] 0.3970026
$aprime
[1] 0.9419643
$bppd
[1] -0.5076923
$c
[1] -0.3504848
The most important value above is d-prime. It models the difference between the rate of "True Positive" responses and the rate of "False Positive" responses in standard units (z-scores). The formula can be written as:
d' (d prime) = Z(True Positive Rate) - Z(False Positive Rate)
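We can verify this by hand with the counts defined above, z-transforming the hit rate and the false-alarm rate via qnorm:

qnorm(TP / (TP + FN)) - qnorm(FP / (FP + TN)) # 2.635813, matching dprime below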
Let's run a first GLM (Generalised Linear Model). A GLM uses a special family, "binomial", as it assumes the outcome has a binomial distribution. In general, results from a Logistic Regression are close to what we get from SDT (see above).
Before running the model, we change the reference level for both response and grammaticality. The baseline of the GLM is then our reference level, the "no" responses to the "ungrammatical" category; any changes from this reference will be seen in the coefficients as "yes" responses to the "grammatical" category.
The results below show the logodds for our model.
[1] "yes" "no"
[1] "grammatical" "ungrammatical"
grammatical$response <- relevel(grammatical$response, ref = "no")
grammatical$grammaticality <- relevel(grammatical$grammaticality, ref = "ungrammatical")
mdl.glm <- glm(response ~ grammaticality, data = grammatical, family = binomial)
summary(mdl.glm)
Call:
glm(formula = response ~ grammaticality, family = binomial, data = grammatical)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4676 -0.6039 0.3124 0.3124 1.8930
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6094 0.3464 -4.646 3.38e-06 ***
grammaticalitygrammatical 4.6052 0.5744 8.017 1.08e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 210.050 on 164 degrees of freedom
Residual deviance: 94.271 on 163 degrees of freedom
AIC: 98.271
Number of Fisher Scoring iterations: 5
The results show that moving from "ungrammatical" to "grammatical" increases the logodds of a "yes" response by 4.6051702 (the intercept shows that for "ungrammatical", the logodds of a "yes" response are -1.6094379). The actual logodds of a "yes" response to "grammatical" are therefore -1.6094379 + 4.6051702 = 2.9957323.
How to get the true coefficients? How to suppress the Intercept?
Here we will run the same GLM model as above but suppressing the "Intercept". The idea is to get the "true" coefficient for a "yes" response to each of "ungrammatical" and "grammatical". We can use the code above, i.e., "mycoef2[1]+mycoef2[2]", or as below.
# An intercept is always included in any regression, but you can specify it with "1"
## glm(response~1+grammaticality,data=grammatical,family = binomial)
mdl.glm2 <- glm(response ~ 0 + grammaticality, data = grammatical, family = binomial)
summary(mdl.glm2)
Call:
glm(formula = response ~ 0 + grammaticality, family = binomial,
data = grammatical)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4676 -0.6039 0.3124 0.3124 1.8930
Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
grammaticalityungrammatical  -1.6094     0.3464  -4.646 3.38e-06 ***
grammaticalitygrammatical     2.9957     0.4582   6.538 6.25e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 228.739 on 165 degrees of freedom
Residual deviance: 94.271 on 163 degrees of freedom
AIC: 98.271
Number of Fisher Scoring iterations: 5
Logodds can be transformed into odds. For our model above, moving from "ungrammatical" to "grammatical" multiplies the odds of a "yes" response by 100, while the odds of "yes" for the reference level are a mere 0.2. Using the second model (i.e., without an intercept) gives us the odds for each level directly: 0.2 for "ungrammatical" and 20 for "grammatical", i.e., 20-to-1 odds of a "yes" response to a grammatical sentence.
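The odds below come from exponentiating the coefficients:

exp(coef(mdl.glm))  # model with intercept
exp(coef(mdl.glm2)) # model without intercept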
(Intercept) grammaticalitygrammatical
0.2 100.0
grammaticalityungrammatical grammaticalitygrammatical
0.2 20.0
If you want to talk about the percentage "accuracy" of our model, we can transform our logodds into proportions. This gives 99% when transforming the first model's coefficient alone, or, more meaningfully, 95% for "yes" responses to "grammatical" based on our "true" coefficients.
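Via the inverse logit (plogis):

plogis(coef(mdl.glm))
plogis(coef(mdl.glm2))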
(Intercept) grammaticalitygrammatical
0.1666667 0.9900990
grammaticalityungrammatical grammaticalitygrammatical
0.1666667 0.9523810
The code below demonstrates the links between our GLM model and what we obtained above from SDT. The prediction table shows that our GLM succeeded in obtaining predictions identical to our initial data setup (compare the table here with the table above). Once we have created our table of outcomes, we can compute percent correct, specificity, sensitivity, the Kappa score, etc.; these yield the actual values together with the variability related to variations in responses.
## predict(mdl.glm) returns predictions on the logodds scale;
## predict(..., type = "response") returns probabilities, which we threshold at 0.5
grammatical$response <- relevel(grammatical$response, "yes")
grammatical$grammaticality <- relevel(grammatical$grammaticality, "grammatical")
mdl.glm.C <- glm(response ~ grammaticality, data = grammatical,family = binomial)
tbl.glm <- table(grammatical$response, predict(mdl.glm.C, type = "response")>0.5)
colnames(tbl.glm) <- c("grammatical", "ungrammatical")
tbl.glm
grammatical ungrammatical
yes 100 10
no 5 50
If you look at the results from SDT above, these results are the same as the following:
Accuracy: (TP+TN)/Total (0.9090909)
True Positive Rate (Sensitivity): TP/(TP+FN) (0.952381)
True Negative Rate (Specificity): TN/(FP+TN) (0.8333333)
grammatical$prob <- predict(glm(response ~ grammaticality, data = grammatical, family = binomial), type = "response")
grammatical %>%
ggplot(aes(x = as.numeric(grammaticality), y = prob)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = T) + theme_bw(base_size = 20)+
labs(y = "Probability", x = "")+
coord_cartesian(ylim = c(0,1))+
scale_x_discrete(limits = c("Ungrammatical", "Grammatical"))
The values obtained here match those obtained from SDT. For d-prime, the difference stems from the use of the logit variant of the binomial family. By using a probit variant, one obtains the same values (see here for more details). A probit variant models the z-score differences in the outcome, evaluated as change in 1 standard unit; that is, it models the change from "no" responses to "ungrammatical" into "yes" responses to "grammatical" in z-scores. These are the same conceptual underpinnings as d-prime from Signal Detection Theory.
## d prime
psycho::dprime(TP, FP, FN, TN,
n_targets = TP+FN,
n_distractors = FP+TN,
adjust=F)$dprime
[1] 2.635813
## GLM with probit
coef(glm(response ~ grammaticality, data = grammatical, family = binomial(probit)))[2]
grammaticalityungrammatical
2.635813
If your data does not fit a binomial distribution and is multinomial (i.e., three or more response categories) or Poisson (count data), then you need to use the glm function with the appropriate family.
These models work perfectly with rating data. Ratings are inherently ordered (1, 2, … n), and we expect an increase (or decrease) in overall ratings from 1 to n. To demonstrate this, we will use an example with the package "ordinal". The data were from a rating experiment where six participants rated the percept of nasality in the production of particular consonants in Arabic. The data came from nine producing subjects. The ratings were from 1 to 5. This example can apply to any study, e.g., rating the grammaticality of sentences, rating how positive the sentiments are in an article, interview responses, etc.
We start by importing the data and processing it. We change the reference level in the predictor.
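The import step is omitted here; the structure below comes from:

str(rating)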
'data.frame': 405 obs. of 6 variables:
$ X : int 406 407 408 409 410 411 412 413 414 415 ...
$ Response: int 2 4 2 4 3 3 1 2 2 1 ...
$ Context : Factor w/ 14 levels "3--3","3-n","3-o",..: 7 6 11 6 8 6 1 6 12 11 ...
$ Subject : Factor w/ 9 levels "p01","p02","p03",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Item : Factor w/ 45 levels "3aaf-w","3aam-w",..: 38 37 24 23 41 40 8 7 27 19 ...
$ Rater : Factor w/ 1 level "R01": 1 1 1 1 1 1 1 1 1 1 ...
We run our first clm as a simple model, i.e., with no random effects.
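A sketch of the call (clm() needs the response as an ordered factor; this conversion is presumably part of the processing above):

rating$Response <- factor(rating$Response, ordered = TRUE)
mdl.clm <- clm(Response ~ Context, data = rating)
summary(mdl.clm)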
formula: Response ~ Context
data: rating
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Context3--3 -0.1384 0.5848 -0.237 0.8130
Context3-n 3.5876 0.4721 7.600 2.96e-14 ***
Context3-o -0.4977 0.3859 -1.290 0.1971
Context7-n 2.3271 0.5079 4.582 4.60e-06 ***
Context7-o 0.2904 0.4002 0.726 0.4680
Contextn-3 2.8957 0.6685 4.331 1.48e-05 ***
Contextn-7 2.2678 0.4978 4.556 5.22e-06 ***
Contextn-n 2.8697 0.4317 6.647 2.99e-11 ***
Contextn-o 3.5152 0.4397 7.994 1.30e-15 ***
Contexto-3 -0.2540 0.4017 -0.632 0.5272
Contexto-7 -0.6978 0.3769 -1.851 0.0641 .
Contexto-n 2.9640 0.4159 7.126 1.03e-12 ***
Contexto-o -0.6147 0.3934 -1.562 0.1182
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -1.4615 0.2065 -7.077
2|3 0.4843 0.1824 2.655
3|4 1.5492 0.2044 7.578
4|5 3.1817 0.2632 12.089
We can evaluate whether "Context" improves the model fit by comparing a null model with our model. Of course, "Context" improves the model fit.
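A sketch of the comparison (the null-model name matches the output below):

mdl.clm.Null <- clm(Response ~ 1, data = rating)
anova(mdl.clm.Null, mdl.clm)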
Likelihood ratio tests of cumulative link models:
no.par AIC logLik LR.stat df Pr(>Chisq)
mdl.clm.Null 4 1281.1 -636.56
mdl.clm 17 1086.3 -526.16 220.8 13 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As a way to interpret the model, we can look at the coefficients and make sense of the results. A CLM is a logistic model with a cumulative effect. The "Coefficients" are the estimates for each level of the fixed effect; the "Threshold coefficients" are those of the response. For the former, a negative coefficient indicates a negative association with the response, and a positive one a positive association; the p values indicate the significance of each level. For the "Threshold coefficients", the cut-points 1|2, 2|3, 3|4 and 4|5 reflect the cumulative transitions in the ratings from 1 to 5.
We use a modified version of a plotting function that allows us to visualise the effects. For this, we use the base R plotting functions
par(oma = c(1, 0, 0, 3), mgp = c(2, 1, 0))
xlimNas <- c(min(mdl.clm$beta), max(mdl.clm$beta))
ylimNas <- c(0, 1)
plot(0, 0, xlim = xlimNas, ylim = ylimNas, type = "n", ylab = expression(Probability),
     xlab = "", xaxt = "n", main = "Predicted curves - Nasalisation",
     cex = 2, cex.lab = 1.5, cex.main = 1.5, cex.axis = 1.5)
axis(side = 1, at = c(0, mdl.clm$beta), labels = levels(rating$Context),
     las = 2, cex = 2, cex.lab = 1.5, cex.axis = 1.5)
xsNas <- seq(xlimNas[1], xlimNas[2], length.out = 100)
lines(xsNas, plogis(mdl.clm$Theta[1] - xsNas), col = "black")                                  # P(rating = 1)
lines(xsNas, plogis(mdl.clm$Theta[2] - xsNas) - plogis(mdl.clm$Theta[1] - xsNas), col = "red") # P(rating = 2)