In this session, we will look at basic functions in R that will help us in running some inferential statistics. These will help us to evaluate the relationship between one (or more) predictor(s) and an outcome. We will start with a basic example. Using the dataframe “cars”.
## Use the code below to check if you have all required packages installed. If some are not installed already, the code below will install these. If you have all packages installed, then you could load them with the second code.
requiredPackages = c('ggplot2','tidyverse','psycho','lme4','ordinal','lmerTest','PresenceAbsence')
for(p in requiredPackages){
if(!require(p,character.only = TRUE)) install.packages(p)
library(p,character.only = TRUE)
}
detach("package:lmerTest", unload=TRUE)
# library(ggplot2);library(tidyverse);library(psycho);library(lme4);library(ordinal);library(lmerTest)
head(cars)
str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
speed dist
Min. : 4.0 Min. : 2.00
1st Qu.:12.0 1st Qu.: 26.00
Median :15.0 Median : 36.00
Mean :15.4 Mean : 42.98
3rd Qu.:19.0 3rd Qu.: 56.00
Max. :25.0 Max. :120.00
ggplot.cars1 <- ggplot(cars,aes(x=speed,y=dist))+
geom_point()+theme_bw(base_size = 20)+
labs(y = "Stopping distance (ft)", x = "Speed (mph)")
ggplot.cars1
Let us do some basic correlation analysis using base R and package psycho. And let’s visualise this correlation using a linear regression line
cor(cars)
speed dist
speed 1.0000000 0.8068949
dist 0.8068949 1.0000000
## correlation using "psycho"
cars %>%
correlation()
Pearson Full Correlation (p value correction: holm):
- speed - dist: Results of the Pearson correlation showed a significant and strong negative association between speed and dist (r(48) = 0.81, p < .001***).
ggplot.cars2 <- ggplot(cars,aes(x=speed,y=dist))+
geom_point()+theme_bw(base_size = 20)+
labs(y = "Stopping distance (ft)", x = "Speed (mph)")+
geom_smooth(method = lm, se=F)
ggplot.cars2
What about running our first statistical analysis of the data? We will use a Linear regression model. We can visualise the results using “summary”. If you are new to linear models and prefer an ANOVA-style analysis, then use the function “ANOVA” on the linear model. We’ll come back to the linear model later on.
mdl.lm <- lm(dist~speed, data=cars)
mdl.lm#also print(mdl.lm)
Call:
lm(formula = dist ~ speed, data = cars)
Coefficients:
(Intercept) speed
-17.579 3.932
summary(mdl.lm)
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
anova(mdl.lm)
Analysis of Variance Table
Response: dist
Df Sum Sq Mean Sq F value Pr(>F)
speed 1 21186 21185.5 89.567 1.49e-12 ***
Residuals 48 11354 236.5
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
To interpret the model, we need look at the coefficients. The “Intercept” is -17.5791 and the coefficient for “speed” is 3.9324. This tells us that when “speed” = 0, the “distance” is at -17.5791; with one unit increase in “speed”, the “distance” increases by 3.9324 (ft). There may be some issues with interpretation of this model and we need to center, standardise scores. We are not going to cover this here, but you can center and/or scale with this formula:
Let us dissect the model. If you use “str”, you will be able to see what is available under our linear model. To access some info from the model
str(mdl.lm)
List of 12
$ coefficients : Named num [1:2] -17.58 3.93
..- attr(*, "names")= chr [1:2] "(Intercept)" "speed"
$ residuals : Named num [1:50] 3.85 11.85 -5.95 12.05 2.12 ...
..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
$ effects : Named num [1:50] -303.914 145.552 -8.115 9.885 0.194 ...
..- attr(*, "names")= chr [1:50] "(Intercept)" "speed" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:50] -1.85 -1.85 9.95 9.95 13.88 ...
..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:50, 1:2] -7.071 0.141 0.141 0.141 0.141 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:50] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "speed"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.14 1.27
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 48
$ xlevels : Named list()
$ call : language lm(formula = dist ~ speed, data = cars)
$ terms :Classes 'terms', 'formula' language dist ~ speed
.. ..- attr(*, "variables")= language list(dist, speed)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "dist" "speed"
.. .. .. ..$ : chr "speed"
.. ..- attr(*, "term.labels")= chr "speed"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(dist, speed)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
$ model :'data.frame': 50 obs. of 2 variables:
..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language dist ~ speed
.. .. ..- attr(*, "variables")= language list(dist, speed)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "dist" "speed"
.. .. .. .. ..$ : chr "speed"
.. .. ..- attr(*, "term.labels")= chr "speed"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(dist, speed)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
- attr(*, "class")= chr "lm"
coef(mdl.lm)
(Intercept) speed
-17.579095 3.932409
## same as
## mdl.lm$coefficients
What if I want to obtain the “Intercept”? Or the coefficient for distance? What if I want the full row for distance?
coef(mdl.lm)[1] # same as mdl.lm$coefficients[1]
(Intercept)
-17.57909
coef(mdl.lm)[2] # same as mdl.lm$coefficients[2]
speed
3.932409
summary(mdl.lm)$coefficients[2, ] # full row
Estimate Std. Error t value Pr(>|t|)
3.932409e+00 4.155128e-01 9.463990e+00 1.489836e-12
summary(mdl.lm)$coefficients[2, 4] #for p value
[1] 1.489836e-12
What about residuals (difference between the observed value and the estimated value of the quantity) and fitted values?
residuals(mdl.lm)
1 2 3 4 5 6 7
3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584 -3.744993
8 9 10 11 12 13 14
4.255007 12.255007 -8.677401 2.322599 -15.609810 -9.609810 -5.609810
15 16 17 18 19 20 21
-1.609810 -7.542219 0.457781 0.457781 12.457781 -11.474628 -1.474628
22 23 24 25 26 27 28
22.525372 42.525372 -21.407036 -15.407036 12.592964 -13.339445 -5.339445
29 30 31 32 33 34 35
-17.271854 -9.271854 0.728146 -11.204263 2.795737 22.795737 30.795737
36 37 38 39 40 41 42
-21.136672 -11.136672 10.863328 -29.069080 -13.069080 -9.069080 -5.069080
43 44 45 46 47 48 49
2.930920 -2.933898 -18.866307 -6.798715 15.201285 16.201285 43.201285
50
4.268876
fitted(mdl.lm)
1 2 3 4 5 6 7 8
-1.849460 -1.849460 9.947766 9.947766 13.880175 17.812584 21.744993 21.744993
9 10 11 12 13 14 15 16
21.744993 25.677401 25.677401 29.609810 29.609810 29.609810 29.609810 33.542219
17 18 19 20 21 22 23 24
33.542219 33.542219 33.542219 37.474628 37.474628 37.474628 37.474628 41.407036
25 26 27 28 29 30 31 32
41.407036 41.407036 45.339445 45.339445 49.271854 49.271854 49.271854 53.204263
33 34 35 36 37 38 39 40
53.204263 53.204263 53.204263 57.136672 57.136672 57.136672 61.069080 61.069080
41 42 43 44 45 46 47 48
61.069080 61.069080 61.069080 68.933898 72.866307 76.798715 76.798715 76.798715
49 50
76.798715 80.731124
AIC(mdl.lm) # Akaike's Information Criterion, lower values are better
[1] 419.1569
BIC(mdl.lm) # Bayesian AIC
[1] 424.8929
logLik(mdl.lm) # log likelihood
'log Lik.' -206.5784 (df=3)
Are the above informative? of course not directly. If we want to test for overall significance of model. We run a null model (aka intercept only) and compare models.
mdl.lm.Null <- lm(dist~1, data=cars)
anova(mdl.lm,mdl.lm.Null)
Analysis of Variance Table
Model 1: dist ~ speed
Model 2: dist ~ 1
Res.Df RSS Df Sum of Sq F Pr(>F)
1 48 11354
2 49 32539 -1 -21186 89.567 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Let’s plot our fitted values. Any problems with this?
ggplot.cars3 <- ggplot(cars,aes(x=speed,y=predict(mdl.lm)))+
geom_point()+theme_bw(base_size = 20)+
labs(y = "Stopping distance (ft)", x = "Speed (mph)")+
geom_smooth(method = lm, se=F)
ggplot.cars3
Above, we had two predictors and we checked for correlations. What if we have more predictors and we want to have cool correlation tables and plots? (see here https://www.r-bloggers.com/beautiful-and-powerful-correlation-tables-in-r/amp/)
cor <- psycho::affective %>%
correlation()
summary(cor)
plot(cor)
print(cor)
Pearson Full Correlation (p value correction: holm):
- Age - Life_Satisfaction: Results of the Pearson correlation showed a non significant and weak negative association between Age and Life_Satisfaction (r(1249) = 0.030, p > .1).
- Age - Concealing: Results of the Pearson correlation showed a non significant and weak positive association between Age and Concealing (r(1249) = -0.050, p > .1).
- Life_Satisfaction - Concealing: Results of the Pearson correlation showed a non significant and weak positive association between Life_Satisfaction and Concealing (r(1249) = -0.063, p > .1).
- Age - Adjusting: Results of the Pearson correlation showed a non significant and weak negative association between Age and Adjusting (r(1249) = 0.027, p > .1).
- Life_Satisfaction - Adjusting: Results of the Pearson correlation showed a significant and moderate negative association between Life_Satisfaction and Adjusting (r(1249) = 0.36, p < .001***).
- Concealing - Adjusting: Results of the Pearson correlation showed a significant and weak negative association between Concealing and Adjusting (r(1249) = 0.22, p < .001***).
- Age - Tolerating: Results of the Pearson correlation showed a non significant and weak negative association between Age and Tolerating (r(1249) = 0.031, p > .1).
- Life_Satisfaction - Tolerating: Results of the Pearson correlation showed a significant and weak negative association between Life_Satisfaction and Tolerating (r(1249) = 0.15, p < .001***).
- Concealing - Tolerating: Results of the Pearson correlation showed a non significant and weak negative association between Concealing and Tolerating (r(1249) = 0.074, p = 0.05°).
- Adjusting - Tolerating: Results of the Pearson correlation showed a significant and weak negative association between Adjusting and Tolerating (r(1249) = 0.29, p < .001***).
What of we don’t want to use corrections for multiple comparisons?
cor <- psycho::affective %>%
correlation(adjust="none")
summary(cor)
plot(cor)
Let’s generate a dataframe with 11 predictors and 1000 rows each. Because these are simulated data, there is no underlying correlation between these.
df_with_11_vars <- data.frame(replicate(11, rnorm(1000)))
cor2 <- correlation(df_with_11_vars, adjust="none")
summary(cor2)
Deactivate this by using “i_am_cheating”!
cor3 <- correlation(df_with_11_vars, adjust="none",i_am_cheating=TRUE)
summary(cor3)
cor4 <- correlation(df_with_11_vars)
summary(cor4)
Here we will look at an example when the outcome is a binary outcome. This simulated data is structured as follows. We asked one participant to listen to 165 sentences, and to judge whether these are “grammatical” or “ungrammatical”. There were 105 sentences that were “grammatical” and 60 “ungrammatical”. This fictitious example can apply in any other situation. Let’s think Geography: 165 lands: 105 “flat” and 60 “non-flat”, etc. This applies to any case where you need to “categorise” the outcome into two groups.
Let’s load in the data and do some basic summaries
grammatical <- read.csv("grammatical.csv")
str(grammatical)
'data.frame': 165 obs. of 2 variables:
$ grammaticality: Factor w/ 2 levels "grammatical",..: 1 1 1 1 1 1 1 1 1 1 ...
$ response : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
head(grammatical)
grammatical$response <- relevel(grammatical$response,ref="yes")
table(grammatical$grammaticality,grammatical$response)
yes no
grammatical 100 5
ungrammatical 10 50
We are generally interested in performance, i.e., whether the we have “accurately” categorised the outcome or not. Let’s do some stats on this
Yes | No | Total | |
---|---|---|---|
Grammatical (Yes Actual) | TP = 100 | FN = 5 | (Yes Actual) 105 |
Ungrammatical (No Actual) | FP = 10 | TN = 50 | (No Actual) 60 |
Total | (Yes Response) 110 | (No Response) 55 | 165 |
## TP = True Positive (Hit); FP = False Positive; FN = False Negative; TN = True Negative
TP = 100
FP = 10
FN = 5
TN = 50
Total = 165
(TP+TN)/Total # accuracy
[1] 0.9090909
(FP+FN)/Total # error, also 1-accuracy
[1] 0.09090909
# When stimulus = yes, how many times response = yes?
TP/(TP+FN) # also True Positive Rate:
[1] 0.952381
# When stimulus = no, how many times response = yes?
FP/(FP+TN) # False Positive Rate,
[1] 0.1666667
# When stimulus = no, how many times response = no?
TN/(FP+TN) # specificity, also 1-sensetivity
[1] 0.8333333
# When subject responds "yes" how many times is (s)he correct?
TP/(TP+FP) # precision
[1] 0.9090909
# getting dprime (or the sensetivity index); beta (bias criterion, 0-1, lower=increase in "yes"); Aprime (estimate of discriminability, 0-1, 1=good discrimination; 0 at chance); bppd (b prime prime d, -1 to 1; 0 = no bias, negative = tendency to respond "yes", positive = tendency to respond "no"); c (index of bias, equals to SD)
#(see also https://www.r-bloggers.com/compute-signal-detection-theory-indices-with-r/amp/)
psycho::dprime(TP, FP, FN, TN)
$`dprime`
[1] 2.537557
$beta
[1] 0.4467859
$aprime
[1] 0.7892857
$bppd
[1] -0.9512195
$c
[1] -0.3175006
Let’s run a first GLM (Generalised Linear Model). A GLM uses a special family “binomial” as it assumes the outcome has a binomial distribution. In general, results from a Logistic Regression are close to what we get from SDT (see above).
To run the results, we will change the reference level for both response and grammaticality.
The results below show the logodds for our model. The results show that for one unit increase in the response (i.e., from no to yes), the logodds of being “grammatical” is increased by 4.61 (the intercept shows that when the response is “no”, the logodds are -1.61).
levels(grammatical$response)
[1] "yes" "no"
levels(grammatical$grammaticality)
[1] "grammatical" "ungrammatical"
grammatical$response <- relevel(grammatical$response, ref="no")
grammatical$grammaticality <- relevel(grammatical$grammaticality, ref="ungrammatical")
mdl.glm <- glm(response~grammaticality,data=grammatical,family = binomial)
summary(mdl.glm)
Call:
glm(formula = response ~ grammaticality, family = binomial, data = grammatical)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4676 -0.6039 0.3124 0.3124 1.8930
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.6094 0.3464 -4.646 3.38e-06 ***
grammaticalitygrammatical 4.6052 0.5744 8.017 1.08e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 210.050 on 164 degrees of freedom
Residual deviance: 94.271 on 163 degrees of freedom
AIC: 98.271
Number of Fisher Scoring iterations: 5
Here we will run the same GLM model above but by supressing the “Intercept”. The idea here is to get the “true” coefficient for a “grammatical” and a “yes” response.
# An intercept is always included in any regression, but you can specify it with "1"
## glm(response~1+grammaticality,data=grammatical,family = binomial)
mdl.glm2 <- glm(response~-1+grammaticality,data=grammatical,family = binomial)
summary(mdl.glm2)
Call:
glm(formula = response ~ -1 + grammaticality, family = binomial,
data = grammatical)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4676 -0.6039 0.3124 0.3124 1.8930
Coefficients:
Estimate Std. Error z value Pr(>|z|)
grammaticalityungrammatical -1.6094 0.3464 -4.646 3.38e-06 ***
grammaticalitygrammatical 2.9957 0.4582 6.538 6.25e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 228.739 on 165 degrees of freedom
Residual deviance: 94.271 on 163 degrees of freedom
AIC: 98.271
Number of Fisher Scoring iterations: 5
Logodds can be modified to talk about the odds of an event. For our model above, the odds of “grammatical” receiving a “yes” response increase by 100; whereas receiving a “no” is a mere 0.2. Using the second model (i.e., without an Intercept) allows us to get the odd rations for each of responses to “Ungrammatical” and “Grammatical”
exp(coef(mdl.glm))
(Intercept) grammaticalitygrammatical
0.2 100.0
exp(coef(mdl.glm2))
grammaticalityungrammatical grammaticalitygrammatical
0.2 20.0
If you want to talk about the percentage “accuracy” of our model, then we can transform our loggodds into proportions. This shows that the proportion of “grammatical” receiving a “yes” response increase by 99%
plogis(coef(mdl.glm))
(Intercept) grammaticalitygrammatical
0.1666667 0.9900990
plogis(coef(mdl.glm2))
grammaticalityungrammatical grammaticalitygrammatical
0.1666667 0.9523810
The code below demosntrates the links between our GLM model and what we had obtained above from SDT. The predictions’ table shows that our GLM was successful at obtaining prediction that are identical to our initial data setup. Look at the table here and the table above. Once we have created our table of outcome, we can compute percent correct, the specificity, the sensetivity, the Kappa score, etc.. this yields the actual value with the SD that is related to variations in responses
## predict(mdl.glm)>0.5 is identical to
## predict(glm(response~grammaticality,data=grammatical,family = binomial),type="response")
tbl.glm <- table(grammatical$response,predict(mdl.glm,type="response")>0.5)
tbl.glm
FALSE TRUE
no 50 5
yes 10 100
PresenceAbsence::pcc(tbl.glm)
PresenceAbsence::specificity(tbl.glm)
PresenceAbsence::sensitivity(tbl.glm)
PresenceAbsence::Kappa(tbl.glm)
###etc..
grammatical$prob <- predict(glm(response~grammaticality,data=grammatical,family = binomial),type="response")
ggplot(grammatical, aes(x=as.numeric(grammaticality), y=prob)) +
geom_point() +
geom_smooth(method = "glm",
method.args = list(family = "binomial"),
se = T) + theme_bw(base_size = 20)+
labs(y = "Probability", x = "")+
coord_cartesian(ylim=c(0,1))+
scale_x_discrete(limits = c("Ungrammatical","Grammatical"))
If your data does not fit a binomial distribution, and is a multinomial (i.e., three or more response categories) or poisson (count data), then you need to use the glm function with a specific family function.
These models work perfectly with rating data. Ratings are inherently ordered, 1, 2, … n, and expect to observe an increase (or decrease) in overall ratings from 1 to n. To demonstrate this, we will use an example using the package “ordinal”. Data were from a rating experiment where six participants rated the percept of nasality in the production of particular consonants in Arabic. The data came from nine producing subjects. The ratings were from 1 to 5. This example can apply to any study, e.g., rating grammaticality of sentences, rating how positive the sentiments are in a article, interview responses, etc.
We start by importing the data and process it. We change the reference level in the predictor
rating <- read.csv("rating.csv")
str(rating)
'data.frame': 405 obs. of 6 variables:
$ X : int 406 407 408 409 410 411 412 413 414 415 ...
$ Response: int 2 4 2 4 3 3 1 2 2 1 ...
$ Context : Factor w/ 14 levels "3--3","3-n","3-o",..: 7 6 11 6 8 6 1 6 12 11 ...
$ Subject : Factor w/ 9 levels "p01","p02","p03",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Item : Factor w/ 45 levels "3aaf-w","3aam-w",..: 38 37 24 23 41 40 8 7 27 19 ...
$ Rater : Factor w/ 1 level "R01": 1 1 1 1 1 1 1 1 1 1 ...
## we need to convert "Response" to a factor
rating$Response <- as.factor(rating$Response)
rating$Context <- relevel(rating$Context, ref="isolation")
We run our first clm model as a simple, i.e., with no random effects
mdl.clm <- clm(Response~Context, data=rating)
summary(mdl.clm)
formula: Response ~ Context
data: rating
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Context3--3 -0.1384 0.5848 -0.237 0.8130
Context3-n 3.5876 0.4721 7.600 2.96e-14 ***
Context3-o -0.4977 0.3859 -1.290 0.1971
Context7-n 2.3271 0.5079 4.582 4.60e-06 ***
Context7-o 0.2904 0.4002 0.726 0.4680
Contextn-3 2.8957 0.6685 4.331 1.48e-05 ***
Contextn-7 2.2678 0.4978 4.556 5.22e-06 ***
Contextn-n 2.8697 0.4317 6.647 2.99e-11 ***
Contextn-o 3.5152 0.4397 7.994 1.30e-15 ***
Contexto-3 -0.2540 0.4017 -0.632 0.5272
Contexto-7 -0.6978 0.3769 -1.851 0.0641 .
Contexto-n 2.9640 0.4159 7.126 1.03e-12 ***
Contexto-o -0.6147 0.3934 -1.562 0.1182
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -1.4615 0.2065 -7.077
2|3 0.4843 0.1824 2.655
3|4 1.5492 0.2044 7.578
4|5 3.1817 0.2632 12.089
We can evaluate whether “Context” improves the model fit, by comparing a null model with our model. Of course “Context” is improving the model fit.
mdl.clm.Null <- clm(Response~1, data=rating)
anova(mdl.clm,mdl.clm.Null)
Likelihood ratio tests of cumulative link models:
no.par AIC logLik LR.stat df Pr(>Chisq)
mdl.clm.Null 4 1281.1 -636.56
mdl.clm 17 1086.3 -526.16 220.8 13 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As a way to interpret the model, we can look at the coefficients and make sense of the results. A CLM model is a Logistic model with a cumulative effect. The “Coefficients” are the estimates for each level of the fixed effect; the “Threshold coefficients” are those of the response. For the former, a negative coefficient indicates a negative association with the response; and a positive is positively associated with the response. The p values are indicating the significance of each level. For the “Threshold coefficients”, we can see the cumulative effects of ratings 1|2, 2|3, 3|4 and 4|5 which indicate an overall increase in ht eratings from 1 to 5.
We use a modified version of a plotting function that allows us to visualise the effects. For this, we use the base R plotting functions
par(oma=c(1, 0, 0, 3),mgp=c(2, 1, 0))
xlimNas = c(min(mdl.clm$beta), max(mdl.clm$beta))
ylimNas = c(0,1)
plot(0,0,xlim=xlimNas, ylim=ylimNas, type="n", ylab=expression(Probability), xlab="", xaxt = "n",main="Predicted curves - Nasalisation",cex=2,cex.lab=1.5,cex.main=1.5,cex.axis=1.5)
axis(side = 1, at = c(0,mdl.clm$beta),labels = levels(rating$Context), las=2,cex=2,cex.lab=1.5,cex.axis=1.5)
xsNas = seq(xlimNas[1], xlimNas[2], length.out=100)
lines(xsNas, plogis(mdl.clm$Theta[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2] - xsNas)-plogis(mdl.clm$Theta[1] - xsNas), col='red')
lines(xsNas, plogis(mdl.clm$Theta[3] - xsNas)-plogis(mdl.clm$Theta[2] - xsNas), col='green')
lines(xsNas, plogis(mdl.clm$Theta[4] - xsNas)-plogis(mdl.clm$Theta[3] - xsNas), col='orange')
lines(xsNas, 1-(plogis(mdl.clm$Theta[4] - xsNas)), col='blue')
abline(v=c(0,mdl.clm$beta),lty=3)
abline(h=0, lty="dashed")
abline(h=1, lty="dashed")
legend(par('usr')[2], par('usr')[4], bty='n', xpd=NA,lty=1, col=c("black", "red", "green", "orange", "blue"),
legend=c("Oral", "2", "3", "4", "Nasal"),cex=0.75)
Let’s generate a new dataframe that we will use later on for our mixed models
set.seed(666)
#we create 6 subjects
subjects <- paste0('S', 1:6)
#here we ad repetitions within speakers
subjects <- rep(subjects, each = 20)
items <- paste0('Item', 1:20)
#below repeates
items <- rep(items, 6)
#below is to generate random numbers that are log values
logFreq <- round(rexp(20)*5,2)
#below we are repeating the logFreq 6 times to fit with the number of speakers and items
logFreq <- rep(logFreq,6)
xdf <- data.frame(subjects, items, logFreq)
#below removes the individual variables we had created because they are already in the dataframe
rm(subjects,items,logFreq)
xdf$Intercept <- 300
submeans <- rep(rnorm(6,sd = 40),20)
#sort make the means for each subject is the same...
submeans <- sort(submeans)
xdf$submeans <- submeans
#we create the same thing for items... we allow the items mean to vary between words...
itsmeans <- rep(rnorm(20,sd=20),6)
xdf$itsmeans <- itsmeans
xdf$error <- rnorm(120,sd=20)
#here we create an effect column,
#here for each logFreq, we have a decrease of -5 of that particular logFreq
xdf$effect <- -5*xdf$logFreq
xdf$dur <- xdf$Intercept +xdf$submeans +xdf$itsmeans +xdf$error + xdf$effect
#below is to subset the data and get only a few columns.. the -c(4:8) removes the columns 4 to 8..
xreal <- xdf[,-c(4:8)]
head(xreal)
rm(xdf,submeans,itsmeans)
Let’s start by doing a correlation test and plotting the data. Our results show that there is a negative correlation between duration and LogFrequency, and the plot shows this decrease.
cor5 <- xreal %>%
correlation()
summary(cor5)
ggplot.xreal <- ggplot(xreal,aes(x=logFreq,y=dur))+
geom_point()+theme_bw(base_size = 20)+
labs(y = "Duration", x = "Frequency (Log)")+
geom_smooth(method = lm, se=F)
ggplot.xreal
Let’s run a simple linear model on the data. As we can see below, thereare some issues with the “simple” linear model: we had set our SD for subjects to be 40, but this was picked up as 120 (see histogram of residuals). The QQ Plot is not “normal”.
mdl.lm.xreal <- lm(dur~logFreq, data=xreal)
summary(mdl.lm.xreal)
Call:
lm(formula = dur ~ logFreq, data = xreal)
Residuals:
Min 1Q Median 3Q Max
-94.322 -35.465 -4.364 33.020 123.955
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 337.9730 6.2494 54.081 < 2e-16 ***
logFreq -5.4601 0.7846 -6.959 2.06e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 48.29 on 118 degrees of freedom
Multiple R-squared: 0.291, Adjusted R-squared: 0.285
F-statistic: 48.43 on 1 and 118 DF, p-value: 2.057e-10
hist(residuals(mdl.lm.xreal))
qqnorm(residuals(mdl.lm.xreal)); qqline(residuals(mdl.lm.xreal))
plot(fitted(mdl.lm.xreal),residuals(mdl.lm.xreal), cex = 4)
Our Linear Mixed effects Model will take into account the random effects we added and also our model specifications. We use a Maximum Likelihood estimate (REML=FALSE) as this is what we will use for model comparison. The Linear Mixed Model is reflecting our model specifications The SD of our subjects is picked up correctly. The model results are “almost” the same as our linear model above. The coefficient for the “Intercept” is at 337.973 and the coefficient for LogFrequency is at -5.460. This indicates that for each unit of increase in the LogFrequency, there is a decrease by 5.460 (ms).
mdl.lmer.xreal <- lmer(dur~logFreq+(1|subjects)+(1|items), data=xreal,REML=FALSE)
summary(mdl.lmer.xreal)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: dur ~ logFreq + (1 | subjects) + (1 | items)
Data: xreal
AIC BIC logLik deviance df.resid
1105.8 1119.8 -547.9 1095.8 115
Scaled residuals:
Min 1Q Median 3Q Max
-2.06735 -0.60675 0.07184 0.61122 2.39854
Random effects:
Groups Name Variance Std.Dev.
items (Intercept) 589.8 24.29
subjects (Intercept) 1471.7 38.36
Residual 284.0 16.85
Number of obs: 120, groups: items, 20; subjects, 6
Fixed effects:
Estimate Std. Error t value
(Intercept) 337.973 17.587 19.217
logFreq -5.460 1.004 -5.436
Correlation of Fixed Effects:
(Intr)
logFreq -0.322
hist(residuals(mdl.lmer.xreal))
qqnorm(residuals(mdl.lmer.xreal)); qqline(residuals(mdl.lmer.xreal))
plot(fitted(mdl.lmer.xreal),residuals(mdl.lmer.xreal), cex = 4)
This second model add a by-subject random slope. Random slopes allow for the variation that exists in the random effects to be taken into account. An intercept only model provides an averaged values to our participants.
mdl.lmer.xreal.2 <- lmer(dur~logFreq+(logFreq|subjects)+(1|items), data=xreal,REML=FALSE)
summary(mdl.lmer.xreal.2)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: dur ~ logFreq + (logFreq | subjects) + (1 | items)
Data: xreal
AIC BIC logLik deviance df.resid
1109.5 1129.0 -547.7 1095.5 113
Scaled residuals:
Min 1Q Median 3Q Max
-2.1087 -0.6068 0.0624 0.5827 2.4565
Random effects:
Groups Name Variance Std.Dev. Corr
items (Intercept) 5.900e+02 24.2890
subjects (Intercept) 1.399e+03 37.4028
logFreq 2.898e-02 0.1702 1.00
Residual 2.829e+02 16.8195
Number of obs: 120, groups: items, 20; subjects, 6
Fixed effects:
Estimate Std. Error t value
(Intercept) 337.973 17.239 19.605
logFreq -5.460 1.007 -5.423
Correlation of Fixed Effects:
(Intr)
logFreq -0.267
hist(residuals(mdl.lmer.xreal.2))
qqnorm(residuals(mdl.lmer.xreal.2)); qqline(residuals(mdl.lmer.xreal.2))
plot(fitted(mdl.lmer.xreal.2),residuals(mdl.lmer.xreal.2), cex = 4)
But where are our p values? The lme4 developers decided not to include p values due to various issues with estimating df. What we can do instead is to compare models. We need to create a null model to allow for significance testing. As expected our predictor is significantly contributing to the difference.
mdl.lmer.xreal.Null <- lmer(dur~1+(logFreq|subjects)+(1|items), data=xreal,REML=FALSE)
anova(mdl.lmer.xreal.Null,mdl.lmer.xreal.2)
Data: xreal
Models:
mdl.lmer.xreal.Null: dur ~ 1 + (logFreq | subjects) + (1 | items)
mdl.lmer.xreal.2: dur ~ logFreq + (logFreq | subjects) + (1 | items)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
mdl.lmer.xreal.Null 6 1131.9 1148.6 -559.93 1119.9
mdl.lmer.xreal.2 7 1109.5 1129.0 -547.73 1095.5 24.394 1 7.852e-07
mdl.lmer.xreal.Null
mdl.lmer.xreal.2 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Also, do we really need random slopes? From the result below, we don’t seem to need random slopes at all, given that adding random slopes does not improve the model fit. I always recommend testing this. Most of the time I keep random slopes.
anova(mdl.lmer.xreal,mdl.lmer.xreal.2)
Data: xreal
Models:
mdl.lmer.xreal: dur ~ logFreq + (1 | subjects) + (1 | items)
mdl.lmer.xreal.2: dur ~ logFreq + (logFreq | subjects) + (1 | items)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
mdl.lmer.xreal 5 1105.8 1119.8 -547.92 1095.8
mdl.lmer.xreal.2 7 1109.5 1129.0 -547.73 1095.5 0.3788 2 0.8274
But if you are really (really!!!) obsessed by p values, then you can also use lmerTest. BUT use after comparing models to evaluate contribution of predictors
library(lmerTest)
mdl.lmer.xreal.lmerTest <- lmer(dur~logFreq+(logFreq|subjects)+(1|items), data=xreal,REML=TRUE)
summary(mdl.lmer.xreal.lmerTest)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: dur ~ logFreq + (logFreq | subjects) + (1 | items)
Data: xreal
REML criterion at convergence: 1086.1
Scaled residuals:
Min 1Q Median 3Q Max
-2.09692 -0.60118 0.06414 0.58479 2.46241
Random effects:
Groups Name Variance Std.Dev. Corr
items (Intercept) 6.294e+02 25.088
subjects (Intercept) 1.651e+03 40.637
logFreq 3.421e-02 0.185 1.00
Residual 2.829e+02 16.819
Number of obs: 120, groups: items, 20; subjects, 6
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 337.973 18.526 7.394 18.243 2.04e-07 ***
logFreq -5.460 1.038 18.142 -5.261 5.17e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
logFreq -0.249
detach("package:lmerTest", unload=TRUE)
Our final model uses REML (or Restricted Maximum Likelihood Estimate of Variance Component) to estimate the model.
mdl.lmer.xreal.Full <- lmer(dur~logFreq+(logFreq|subjects)+(1|items), data=xreal,REML=TRUE)
summary(mdl.lmer.xreal.Full)
Linear mixed model fit by REML ['lmerMod']
Formula: dur ~ logFreq + (logFreq | subjects) + (1 | items)
Data: xreal
REML criterion at convergence: 1086.1
Scaled residuals:
Min 1Q Median 3Q Max
-2.09692 -0.60118 0.06414 0.58479 2.46241
Random effects:
Groups Name Variance Std.Dev. Corr
items (Intercept) 6.294e+02 25.088
subjects (Intercept) 1.651e+03 40.637
logFreq 3.421e-02 0.185 1.00
Residual 2.829e+02 16.819
Number of obs: 120, groups: items, 20; subjects, 6
Fixed effects:
Estimate Std. Error t value
(Intercept) 337.973 18.526 18.243
logFreq -5.460 1.038 -5.261
Correlation of Fixed Effects:
(Intr)
logFreq -0.249
anova(mdl.lmer.xreal.Full)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
logFreq 1 7828.8 7828.8 27.677
hist(residuals(mdl.lmer.xreal.Full))
qqnorm(residuals(mdl.lmer.xreal.Full)); qqline(residuals(mdl.lmer.xreal.Full))
plot(fitted(mdl.lmer.xreal.Full),residuals(mdl.lmer.xreal.Full), cex = 4)
coef(mdl.lmer.xreal.Full)
$`items`
(Intercept) logFreq
Item1 352.3564 -5.460115
Item10 331.7619 -5.460115
Item11 324.7272 -5.460115
Item12 350.2316 -5.460115
Item13 353.1171 -5.460115
Item14 311.8359 -5.460115
Item15 354.0588 -5.460115
Item16 353.9386 -5.460115
Item17 288.7852 -5.460115
Item18 362.4698 -5.460115
Item19 338.1424 -5.460115
Item2 325.1858 -5.460115
Item20 359.7410 -5.460115
Item3 370.1798 -5.460115
Item4 302.4272 -5.460115
Item5 350.0497 -5.460115
Item6 338.9482 -5.460115
Item7 362.8397 -5.460115
Item8 295.5951 -5.460115
Item9 333.0694 -5.460115
$subjects
(Intercept) logFreq
S1 314.4695 -5.567091
S2 303.9038 -5.615180
S3 314.2921 -5.567898
S4 318.4283 -5.549073
S5 373.3005 -5.299324
S6 403.4440 -5.162127
attr(,"class")
[1] "coef.mer"
fixef(mdl.lmer.xreal.Full)
(Intercept) logFreq
337.973044 -5.460115
fixef(mdl.lmer.xreal.Full)[1]
(Intercept)
337.973
fixef(mdl.lmer.xreal.Full)[2]
logFreq
-5.460115
coef(mdl.lmer.xreal.Full)$`subjects`[1]
coef(mdl.lmer.xreal.Full)$`subjects`[2]
coef(mdl.lmer.xreal.Full)$`items`[1]
coef(mdl.lmer.xreal.Full)$`items`[2]
In general, I use the prediction from my final model in any plots. To generate this, we can use the following
xreal$Pred_Dur <- predict(lmer(dur~logFreq+(logFreq|subjects)+(1|items), data=xreal,REML=TRUE))
ggplot.xreal.Pred <- ggplot(xreal,aes(x=logFreq,y=Pred_Dur))+
geom_point()+theme_bw(base_size = 20)+
labs(y = "Duration", x = "Frequency (Log)",title="Predicted")+
geom_smooth(method = lm, se=F)+coord_cartesian(ylim=c(200,450))
ggplot.xreal.Pred
## original plot
ggplot.xreal.Original <- ggplot(xreal,aes(x=logFreq,y=dur))+
geom_point()+theme_bw(base_size = 20)+
labs(y = "Duration", x = "Frequency (Log)",title="Original")+
geom_smooth(method = lm, se=F)+coord_cartesian(ylim=c(200,450))
ggplot.xreal.Original
The code above was using a Linear Mixed Effects Modelling. The outcome was a numeric object. In some cases (as we have seen above), we may have:
The code below gives you an idea of how to specificy these models
## Binomial family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=binomial)
## Poisson family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=poisson)
## Multinomial family
## a bit complicated as tehre is a need to use Bayesian approaches, see e.g.,
## glmmADMB
## mixcat
## MCMCglmm
## see https://gist.github.com/casallas/8263818
## Rating data, use following
## ordinal::clmm(outcome~predictor(s)+(1|subject)+(1|items)..., data=data)
## Remember to test for random effects and whether slopes are needed.
See this website for how to run and automatically draw tables for your publications. I haven’t tested this but it is in development: 7) https://www.r-bloggers.com/elegant-regression-results-tables-and-plots-in-r-the-finalfit-package/amp