In this session, we will look at basic functions in R that help us run some inferential statistics. These allow us to evaluate the relationship between one (or more) predictor(s) and an outcome. We will start with a basic example using the dataframe “cars”.

1 Loading packages

## Use the code below to check if you have all required packages installed. If some are not installed already, the code below will install them. If you already have all packages installed, then you can load them with the commented line at the end.
requiredPackages = c('ggplot2','tidyverse','psycho','lme4','ordinal','lmerTest','PresenceAbsence')
for(p in requiredPackages){
  if(!require(p,character.only = TRUE)) install.packages(p)
  library(p,character.only = TRUE)
}
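# lmerTest overrides lme4's lmer(); we detach it for now and reload it later when we want p values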
detach("package:lmerTest", unload=TRUE)
# library(ggplot2);library(tidyverse);library(psycho);library(lme4);library(ordinal);library(lmerTest)

2 Starting with a linear model

head(cars)
str(cars)
'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00  
ggplot.cars1 <- ggplot(cars,aes(x=speed,y=dist))+
                geom_point()+theme_bw(base_size = 20)+
                labs(y = "Stopping distance (ft)", x = "Speed (mph)")
ggplot.cars1

2.1 Basic correlations and plotting the linear relationship

Let us do some basic correlation analysis using base R and the package psycho, and let’s visualise this correlation using a linear regression line.

cor(cars)
          speed      dist
speed 1.0000000 0.8068949
dist  0.8068949 1.0000000
## correlation using "psycho"
cars %>% 
  correlation()
Pearson Full Correlation (p value correction: holm):
   - speed - dist:   Results of the Pearson correlation showed a significant and strong negative association between speed and dist (r(48) = 0.81, p < .001***).
ggplot.cars2 <- ggplot(cars,aes(x=speed,y=dist))+
                geom_point()+theme_bw(base_size = 20)+
                labs(y = "Stopping distance (ft)", x = "Speed (mph)")+
                geom_smooth(method = lm, se=F)
ggplot.cars2

2.2 Our first statistical model

What about running our first statistical analysis of the data? We will use a linear regression model. We can visualise the results using “summary”. If you are new to linear models and prefer an ANOVA-style analysis, then use the function “anova” on the linear model. We’ll come back to the linear model later on.

mdl.lm <- lm(dist~speed, data=cars)
mdl.lm#also print(mdl.lm)

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
    -17.579        3.932  
summary(mdl.lm)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
anova(mdl.lm)
Analysis of Variance Table

Response: dist
          Df Sum Sq Mean Sq F value   Pr(>F)    
speed      1  21186 21185.5  89.567 1.49e-12 ***
Residuals 48  11354   236.5                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

To interpret the model, we need to look at the coefficients. The “Intercept” is -17.5791 and the coefficient for “speed” is 3.9324. This tells us that when “speed” = 0, the “distance” is at -17.5791; with one unit increase in “speed”, the “distance” increases by 3.9324 (ft). There may be some issues with the interpretation of this model, and we may need to center and/or standardise the scores. We are not going to cover this here, but you can center and/or scale with these formulas:

  1. Center = scale(x, center = TRUE, scale = FALSE)
  2. Standardise = scale(x, center = TRUE, scale = TRUE)
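As a minimal sketch (staying with “cars”; the variable name speed.c is ours), centring the predictor leaves the slope untouched but makes the intercept the predicted distance at the average speed:

## centre "speed" and refit: the slope is unchanged, but the intercept
## is now the predicted stopping distance at the mean speed (15.4 mph)
cars$speed.c <- scale(cars$speed, center = TRUE, scale = FALSE)
mdl.lm.c <- lm(dist ~ speed.c, data = cars)
coef(mdl.lm.c)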

2.2.1 Dissecting the model

Let us dissect the model. If you use “str”, you will be able to see what is available under our linear model. Below we access some of this information from the model.

2.2.2 “str” and “coef”

str(mdl.lm)
List of 12
 $ coefficients : Named num [1:2] -17.58 3.93
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "speed"
 $ residuals    : Named num [1:50] 3.85 11.85 -5.95 12.05 2.12 ...
  ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
 $ effects      : Named num [1:50] -303.914 145.552 -8.115 9.885 0.194 ...
  ..- attr(*, "names")= chr [1:50] "(Intercept)" "speed" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:50] -1.85 -1.85 9.95 9.95 13.88 ...
  ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:50, 1:2] -7.071 0.141 0.141 0.141 0.141 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:50] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "speed"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.14 1.27
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 48
 $ xlevels      : Named list()
 $ call         : language lm(formula = dist ~ speed, data = cars)
 $ terms        :Classes 'terms', 'formula'  language dist ~ speed
  .. ..- attr(*, "variables")= language list(dist, speed)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "dist" "speed"
  .. .. .. ..$ : chr "speed"
  .. ..- attr(*, "term.labels")= chr "speed"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(dist, speed)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
 $ model        :'data.frame':  50 obs. of  2 variables:
  ..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
  ..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language dist ~ speed
  .. .. ..- attr(*, "variables")= language list(dist, speed)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "dist" "speed"
  .. .. .. .. ..$ : chr "speed"
  .. .. ..- attr(*, "term.labels")= chr "speed"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(dist, speed)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
 - attr(*, "class")= chr "lm"
coef(mdl.lm)
(Intercept)       speed 
 -17.579095    3.932409 
## same as 
## mdl.lm$coefficients

2.2.2.1 “coef” and “coefficients”

What if I want to obtain the “Intercept”? Or the coefficient for speed? What if I want the full row for speed?

coef(mdl.lm)[1] # same as mdl.lm$coefficients[1]
(Intercept) 
  -17.57909 
coef(mdl.lm)[2] # same as mdl.lm$coefficients[2]
   speed 
3.932409 
summary(mdl.lm)$coefficients[2, ] # full row
    Estimate   Std. Error      t value     Pr(>|t|) 
3.932409e+00 4.155128e-01 9.463990e+00 1.489836e-12 
summary(mdl.lm)$coefficients[2, 4] #for p value
[1] 1.489836e-12
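If we also want confidence intervals around these estimates (a small addition, not in the original output), base R’s confint() works directly on the model:

## 95% confidence intervals for the intercept and slope
confint(mdl.lm)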

What about residuals (the difference between the observed values and the values estimated by the model) and fitted values?

residuals(mdl.lm)
         1          2          3          4          5          6          7 
  3.849460  11.849460  -5.947766  12.052234   2.119825  -7.812584  -3.744993 
         8          9         10         11         12         13         14 
  4.255007  12.255007  -8.677401   2.322599 -15.609810  -9.609810  -5.609810 
        15         16         17         18         19         20         21 
 -1.609810  -7.542219   0.457781   0.457781  12.457781 -11.474628  -1.474628 
        22         23         24         25         26         27         28 
 22.525372  42.525372 -21.407036 -15.407036  12.592964 -13.339445  -5.339445 
        29         30         31         32         33         34         35 
-17.271854  -9.271854   0.728146 -11.204263   2.795737  22.795737  30.795737 
        36         37         38         39         40         41         42 
-21.136672 -11.136672  10.863328 -29.069080 -13.069080  -9.069080  -5.069080 
        43         44         45         46         47         48         49 
  2.930920  -2.933898 -18.866307  -6.798715  15.201285  16.201285  43.201285 
        50 
  4.268876 
fitted(mdl.lm)
        1         2         3         4         5         6         7         8 
-1.849460 -1.849460  9.947766  9.947766 13.880175 17.812584 21.744993 21.744993 
        9        10        11        12        13        14        15        16 
21.744993 25.677401 25.677401 29.609810 29.609810 29.609810 29.609810 33.542219 
       17        18        19        20        21        22        23        24 
33.542219 33.542219 33.542219 37.474628 37.474628 37.474628 37.474628 41.407036 
       25        26        27        28        29        30        31        32 
41.407036 41.407036 45.339445 45.339445 49.271854 49.271854 49.271854 53.204263 
       33        34        35        36        37        38        39        40 
53.204263 53.204263 53.204263 57.136672 57.136672 57.136672 61.069080 61.069080 
       41        42        43        44        45        46        47        48 
61.069080 61.069080 61.069080 68.933898 72.866307 76.798715 76.798715 76.798715 
       49        50 
76.798715 80.731124 
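As a quick sanity check (a sketch added here), the residuals are exactly the observed values minus the fitted values:

## residuals = observed - fitted, by definition
all.equal(cars$dist - fitted(mdl.lm), residuals(mdl.lm))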

2.2.2.2 Goodness of fit?

AIC(mdl.lm) # Akaike's Information Criterion, lower values are better
[1] 419.1569
BIC(mdl.lm) # Bayesian Information Criterion, lower values are better
[1] 424.8929
logLik(mdl.lm)  # log likelihood
'log Lik.' -206.5784 (df=3)

2.2.2.3 Significance testing

Are the above informative? Of course, not directly. If we want to test for the overall significance of the model, we run a null model (aka intercept-only model) and compare the two models.

mdl.lm.Null <- lm(dist~1, data=cars)
anova(mdl.lm,mdl.lm.Null)
Analysis of Variance Table

Model 1: dist ~ speed
Model 2: dist ~ 1
  Res.Df   RSS Df Sum of Sq      F   Pr(>F)    
1     48 11354                                 
2     49 32539 -1    -21186 89.567 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
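The same comparison can be read off the information criteria (a small sketch): AIC() accepts several models at once, and the model including “speed” wins comfortably:

## lower AIC = better; compare the full and the null model
AIC(mdl.lm, mdl.lm.Null)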

2.2.2.4 Plotting fitted values

Let’s plot our fitted values. Any problems with this?

ggplot.cars3 <- ggplot(cars,aes(x=speed,y=predict(mdl.lm)))+
                geom_point()+theme_bw(base_size = 20)+
                labs(y = "Stopping distance (ft)", x = "Speed (mph)")+
                geom_smooth(method = lm, se=F)
ggplot.cars3

2.3 Some cool correlations

Above, we had two predictors and we checked for correlations. What if we have more predictors and we want to have cool correlation tables and plots? (see here https://www.r-bloggers.com/beautiful-and-powerful-correlation-tables-in-r/amp/)

2.3.1 With corrections

cor <- psycho::affective %>% 
  correlation()
summary(cor)
plot(cor)

print(cor)
Pearson Full Correlation (p value correction: holm):
   - Age - Life_Satisfaction:   Results of the Pearson correlation showed a non significant and weak negative association between Age and Life_Satisfaction (r(1249) = 0.030, p > .1).
   - Age - Concealing:   Results of the Pearson correlation showed a non significant and weak positive association between Age and Concealing (r(1249) = -0.050, p > .1).
   - Life_Satisfaction - Concealing:   Results of the Pearson correlation showed a non significant and weak positive association between Life_Satisfaction and Concealing (r(1249) = -0.063, p > .1).
   - Age - Adjusting:   Results of the Pearson correlation showed a non significant and weak negative association between Age and Adjusting (r(1249) = 0.027, p > .1).
   - Life_Satisfaction - Adjusting:   Results of the Pearson correlation showed a significant and moderate negative association between Life_Satisfaction and Adjusting (r(1249) = 0.36, p < .001***).
   - Concealing - Adjusting:   Results of the Pearson correlation showed a significant and weak negative association between Concealing and Adjusting (r(1249) = 0.22, p < .001***).
   - Age - Tolerating:   Results of the Pearson correlation showed a non significant and weak negative association between Age and Tolerating (r(1249) = 0.031, p > .1).
   - Life_Satisfaction - Tolerating:   Results of the Pearson correlation showed a significant and weak negative association between Life_Satisfaction and Tolerating (r(1249) = 0.15, p < .001***).
   - Concealing - Tolerating:   Results of the Pearson correlation showed a non significant and weak negative association between Concealing and Tolerating (r(1249) = 0.074, p = 0.05°).
   - Adjusting - Tolerating:   Results of the Pearson correlation showed a significant and weak negative association between Adjusting and Tolerating (r(1249) = 0.29, p < .001***).

2.3.2 No corrections

What if we don’t want to use corrections for multiple comparisons?

cor <- psycho::affective %>% 
  correlation(adjust="none")
summary(cor)
plot(cor)

2.3.3 Issues with no corrections

Let’s generate a dataframe with 11 predictors, each with 1000 observations. Because these are simulated data, there is no underlying correlation between them.

df_with_11_vars <- data.frame(replicate(11, rnorm(1000)))
cor2 <- correlation(df_with_11_vars, adjust="none")
summary(cor2)

The correlation function warns us when we run this many uncorrected correlations; deactivate the warning by using “i_am_cheating”!

cor3 <- correlation(df_with_11_vars, adjust="none",i_am_cheating=TRUE)
summary(cor3)

2.3.4 Same model with corrections

cor4 <- correlation(df_with_11_vars)
summary(cor4)

3 From Linear to Logistic models

Here we will look at an example where the outcome is binary. This simulated data is structured as follows. We asked one participant to listen to 165 sentences and to judge whether these are “grammatical” or “ungrammatical”. There were 105 sentences that were “grammatical” and 60 that were “ungrammatical”. This fictitious example can apply to many other situations. Think geography: 165 landscapes, 105 “flat” and 60 “non-flat”, etc. This applies to any case where you need to “categorise” the outcome into two groups.

3.1 Load and summaries

Let’s load in the data and do some basic summaries

grammatical <- read.csv("grammatical.csv")
str(grammatical)
'data.frame':   165 obs. of  2 variables:
 $ grammaticality: Factor w/ 2 levels "grammatical",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ response      : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
head(grammatical)
grammatical$response <- relevel(grammatical$response,ref="yes")
table(grammatical$grammaticality,grammatical$response)
               
                yes  no
  grammatical   100   5
  ungrammatical  10  50

3.2 Accuracy and Signal Detection Theory

We are generally interested in performance, i.e., whether we have “accurately” categorised the outcome or not. Let’s do some stats on this.

|                            | Yes                | No               | Total            |
|----------------------------|--------------------|------------------|------------------|
| Grammatical (Yes Actual)   | TP = 100           | FN = 5           | (Yes Actual) 105 |
| Ungrammatical (No Actual)  | FP = 10            | TN = 50          | (No Actual) 60   |
| Total                      | (Yes Response) 110 | (No Response) 55 | 165              |
## TP = True Positive (Hit); FP = False Positive; FN = False Negative; TN = True Negative
TP = 100
FP = 10
FN = 5
TN = 50
Total = 165
(TP+TN)/Total # accuracy
[1] 0.9090909
(FP+FN)/Total # error, also 1-accuracy
[1] 0.09090909
# When stimulus = yes, how many times response = yes?
TP/(TP+FN) # sensitivity (recall), also True Positive Rate
[1] 0.952381
# When stimulus = no, how many times response = yes?
FP/(FP+TN) # False Positive Rate, 
[1] 0.1666667
# When stimulus = no, how many times response = no?
TN/(FP+TN) # specificity, also 1 - False Positive Rate
[1] 0.8333333
# When subject responds "yes" how many times is (s)he correct?
TP/(TP+FP) # precision
[1] 0.9090909
# getting dprime (the sensitivity index); beta (bias criterion, 0-1, lower = increase in "yes");
# aprime (estimate of discriminability, 0-1, 1 = good discrimination, 0.5 at chance);
# bppd (b prime prime d, -1 to 1; 0 = no bias, negative = tendency to respond "yes", positive = tendency to respond "no");
# c (index of bias, in standard deviation units)
#(see also https://www.r-bloggers.com/compute-signal-detection-theory-indices-with-r/amp/) 
psycho::dprime(TP, FP, FN, TN)
$`dprime`
[1] 2.537557

$beta
[1] 0.4467859

$aprime
[1] 0.7892857

$bppd
[1] -0.9512195

$c
[1] -0.3175006

3.3 GLM

Let’s run a first GLM (Generalised Linear Model). A GLM uses a special family “binomial” as it assumes the outcome has a binomial distribution. In general, results from a Logistic Regression are close to what we get from SDT (see above).

Before running the model, we change the reference level for both response and grammaticality.

3.3.1 Results

The results below show the logodds for our model. Moving from the reference level “ungrammatical” to “grammatical” increases the logodds of a “yes” response by 4.61; the intercept shows that for “ungrammatical” items, the logodds of a “yes” response are -1.61.

levels(grammatical$response)
[1] "yes" "no" 
levels(grammatical$grammaticality)
[1] "grammatical"   "ungrammatical"
grammatical$response <- relevel(grammatical$response, ref="no")
grammatical$grammaticality <- relevel(grammatical$grammaticality, ref="ungrammatical")
mdl.glm <- glm(response~grammaticality,data=grammatical,family = binomial)
summary(mdl.glm)

Call:
glm(formula = response ~ grammaticality, family = binomial, data = grammatical)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4676  -0.6039   0.3124   0.3124   1.8930  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -1.6094     0.3464  -4.646 3.38e-06 ***
grammaticalitygrammatical   4.6052     0.5744   8.017 1.08e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 210.050  on 164  degrees of freedom
Residual deviance:  94.271  on 163  degrees of freedom
AIC: 98.271

Number of Fisher Scoring iterations: 5

3.3.2 Getting “true” coefficients from GLM

Here we will run the same GLM model as above but with the “Intercept” suppressed. The idea here is to get the “true” coefficient for each level of grammaticality, i.e., the logodds of a “yes” response for “ungrammatical” and “grammatical” items separately.

# An intercept is always included in any regression; you can specify it explicitly with "1" (and suppress it with "-1")
## glm(response~1+grammaticality,data=grammatical,family = binomial)
mdl.glm2 <- glm(response~-1+grammaticality,data=grammatical,family = binomial)
summary(mdl.glm2)

Call:
glm(formula = response ~ -1 + grammaticality, family = binomial, 
    data = grammatical)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4676  -0.6039   0.3124   0.3124   1.8930  

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
grammaticalityungrammatical  -1.6094     0.3464  -4.646 3.38e-06 ***
grammaticalitygrammatical     2.9957     0.4582   6.538 6.25e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 228.739  on 165  degrees of freedom
Residual deviance:  94.271  on 163  degrees of freedom
AIC: 98.271

Number of Fisher Scoring iterations: 5

3.3.3 Logodds to Odds ratios

Logodds can be exponentiated to talk about the odds of an event. For our model above, the odds of a “yes” response are 100 times higher for “grammatical” items than for “ungrammatical” items, whose odds of a “yes” are a mere 0.2. Using the second model (i.e., without an Intercept) gives us the odds of a “yes” response for “ungrammatical” (0.2) and “grammatical” (20) separately.

exp(coef(mdl.glm))
              (Intercept) grammaticalitygrammatical 
                      0.2                     100.0 
exp(coef(mdl.glm2))
grammaticalityungrammatical   grammaticalitygrammatical 
                        0.2                        20.0 

3.3.4 LogOdds to proportions

If you want to talk about the probability of a “yes” response, then we can transform our logodds into proportions using the inverse logit. For the intercept-free model, this shows that “grammatical” items receive a “yes” response with a probability of 0.95, against 0.17 for “ungrammatical” items.

plogis(coef(mdl.glm))
              (Intercept) grammaticalitygrammatical 
                0.1666667                 0.9900990 
plogis(coef(mdl.glm2))
grammaticalityungrammatical   grammaticalitygrammatical 
                  0.1666667                   0.9523810 
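To see how the two models connect (a small sketch using the coefficients above), adding the intercept and slope of the first model recovers the “grammatical” estimate of the intercept-free model:

## -1.6094 + 4.6052 = 2.9957, the "grammatical" logodds in mdl.glm2;
## its inverse logit is the probability of a "yes" for grammatical items
plogis(coef(mdl.glm)[[1]] + coef(mdl.glm)[[2]]) # ~0.95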

3.3.5 GLM as a classification tool

The code below demonstrates the links between our GLM model and what we had obtained above from SDT. The predictions table shows that our GLM was successful at obtaining predictions that are identical to our initial data setup. Look at the table here and the table above. Once we have created our table of outcomes, we can compute percent correct, specificity, sensitivity, the Kappa score, etc.; these functions also return the standard deviation associated with each estimate, reflecting variation in the responses.

## predict(mdl.glm) returns logodds; predict(mdl.glm, type="response") returns probabilities
## thresholding the probabilities at 0.5 classifies each observation as "yes" (TRUE) or "no" (FALSE)
tbl.glm <- table(grammatical$response,predict(mdl.glm,type="response")>0.5)
tbl.glm
     
      FALSE TRUE
  no     50    5
  yes    10  100
PresenceAbsence::pcc(tbl.glm)
PresenceAbsence::specificity(tbl.glm)
PresenceAbsence::sensitivity(tbl.glm)
PresenceAbsence::Kappa(tbl.glm)
###etc..
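We can also recover the overall accuracy directly from this table (a quick check against the 0.909 we computed by hand in the SDT section):

## proportion of correct classifications (diagonal of the confusion table)
sum(diag(tbl.glm))/sum(tbl.glm) # 0.909, as before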

3.3.6 Plotting

grammatical$prob <- predict(glm(response~grammaticality,data=grammatical,family = binomial),type="response")
ggplot(grammatical, aes(x=as.numeric(grammaticality), y=prob)) +
  geom_point() +
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = T) + theme_bw(base_size = 20)+
    labs(y = "Probability", x = "")+
    coord_cartesian(ylim=c(0,1))+
    scale_x_discrete(limits = c("Ungrammatical","Grammatical"))

3.4 GLM: Other distributions

If your data does not fit a binomial distribution and is instead multinomial (i.e., three or more response categories) or poisson (count data), then you need a model with the appropriate family; the commented specifications below give pointers.
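## For a multinomial (3 or more response categories), see the link below and use the following specification
## https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/
## mdl.multi <- nnet::multinom(outcome~predictor, data=data)
## For a poisson (count data), see the link below and use the following specification
## https://stats.idre.ucla.edu/r/dae/poisson-regression/
## mdl.poisson <- glm(outcome~predictor, data=data, family="poisson")

4 Cumulative Link Models

These models work perfectly with rating data. Ratings are inherently ordered, 1, 2, … n, and we expect to observe an increase (or decrease) in overall ratings from 1 to n. To demonstrate this, we will use an example with the package “ordinal”. The data come from a rating experiment where six participants rated the percept of nasality in the production of particular consonants in Arabic, as produced by nine subjects. The ratings were from 1 to 5. This example can apply to any study, e.g., rating the grammaticality of sentences, rating how positive the sentiments are in an article, interview responses, etc.

4.1 Importing and pre-processing

We start by importing the data and processing it. We change the reference level in the predictor.

rating <- read.csv("rating.csv")
str(rating)
## we need to convert "Response" to a factor
rating$Response <- as.factor(rating$Response)
rating$Context <- relevel(rating$Context, ref="isolation")

4.2 Our first model

We run our first clm model as a simple model, i.e., with no random effects.

mdl.clm <- clm(Response~Context, data=rating)
summary(mdl.clm)

4.3 Testing significance

We can evaluate whether “Context” improves the model fit by comparing a null model with our model. Of course, “Context” improves the model fit.

mdl.clm.Null <- clm(Response~1, data=rating)
anova(mdl.clm,mdl.clm.Null)

4.4 Interpreting a cumulative model

As a way to interpret the model, we can look at the coefficients and make sense of the results. A CLM is a Logistic model with a cumulative effect. The “Coefficients” are the estimates for each level of the fixed effect; the “Threshold coefficients” are those of the response. For the former, a negative coefficient indicates a negative association with the response, and a positive one a positive association. The p values indicate the significance of each level. For the “Threshold coefficients”, we can see the cumulative effects of ratings 1|2, 2|3, 3|4 and 4|5, which indicate an overall increase in the ratings from 1 to 5.

4.5 Plotting

We use a modified version of a plotting function that allows us to visualise the effects. For this, we use the base R plotting functions.

par(oma=c(1, 0, 0, 3),mgp=c(2, 1, 0))
xlimNas = c(min(mdl.clm$beta), max(mdl.clm$beta))
ylimNas = c(0,1)
plot(0,0,xlim=xlimNas, ylim=ylimNas, type="n", ylab=expression(Probability), xlab="", xaxt = "n",main="Predicted curves - Nasalisation",cex=2,cex.lab=1.5,cex.main=1.5,cex.axis=1.5)
axis(side = 1, at = c(0,mdl.clm$beta),labels = levels(rating$Context), las=2,cex=2,cex.lab=1.5,cex.axis=1.5)
xsNas = seq(xlimNas[1], xlimNas[2], length.out=100)
lines(xsNas, plogis(mdl.clm$Theta[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2] - xsNas)-plogis(mdl.clm$Theta[1] - xsNas), col='red')
lines(xsNas, plogis(mdl.clm$Theta[3] - xsNas)-plogis(mdl.clm$Theta[2] - xsNas), col='green')
lines(xsNas, plogis(mdl.clm$Theta[4] - xsNas)-plogis(mdl.clm$Theta[3] - xsNas), col='orange')
lines(xsNas, 1-(plogis(mdl.clm$Theta[4] - xsNas)), col='blue')
abline(v=c(0,mdl.clm$beta),lty=3)
abline(h=0, lty="dashed")
abline(h=1, lty="dashed")
legend(par('usr')[2], par('usr')[4], bty='n', xpd=NA,lty=1, col=c("black", "red", "green", "orange", "blue"), legend=c("Oral", "2", "3", "4", "Nasal"),cex=0.75)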

5 From Linear to Mixed models. Why random effects matter

Let’s generate a new dataframe that we will use later on for our mixed models

set.seed(666)
#we create 6 subjects
subjects <- paste0('S', 1:6)
#here we add repetitions within speakers
subjects <- rep(subjects, each = 20)
items <- paste0('Item', 1:20)
#below repeats the items for the 6 subjects
items <- rep(items, 6)
#below generates random numbers to act as log frequencies
logFreq <- round(rexp(20)*5,2)
#below we are repeating the logFreq 6 times to fit with the number of speakers and items
logFreq <- rep(logFreq,6)
xdf <- data.frame(subjects, items, logFreq)
#below removes the individual variables we had created because they are already in the dataframe
rm(subjects,items,logFreq)
xdf$Intercept <- 300
submeans <- rep(rnorm(6,sd = 40),20)
#sorting ensures that each subject gets the same mean across all of their items
submeans <- sort(submeans)
xdf$submeans <- submeans
#we create the same thing for items... we allow the item means to vary between words...
itsmeans <- rep(rnorm(20,sd=20),6)
xdf$itsmeans <- itsmeans
xdf$error <- rnorm(120,sd=20)
#here we create an effect column:
#for each unit of logFreq, duration decreases by 5
xdf$effect <- -5*xdf$logFreq
xdf$dur <- xdf$Intercept +xdf$submeans +xdf$itsmeans +xdf$error + xdf$effect
#below we subset the data and keep only a few columns: -c(4:8) removes columns 4 to 8
xreal <- xdf[,-c(4:8)]
head(xreal)
rm(xdf,submeans,itsmeans)

5.1 Correlations and plots

Let’s start by doing a correlation test and plotting the data. Our results show that there is a negative correlation between duration and LogFrequency, and the plot shows this decrease.

cor5 <- xreal %>% 
  correlation()
summary(cor5)
ggplot.xreal <- ggplot(xreal,aes(x=logFreq,y=dur))+
                geom_point()+theme_bw(base_size = 20)+
                labs(y = "Duration", x = "Frequency (Log)")+
                geom_smooth(method = lm, se=F)
ggplot.xreal

5.2 Linear model

Let’s run a simple linear model on the data. As we can see below, there are some issues with the “simple” linear model: we had set the SD for our subjects to 40, but this variability is lumped into the residuals, which spread to around 120 (see the histogram of residuals). The QQ plot is not “normal”.

mdl.lm.xreal <- lm(dur~logFreq, data=xreal)
summary(mdl.lm.xreal)

Call:
lm(formula = dur ~ logFreq, data = xreal)

Residuals:
    Min      1Q  Median      3Q     Max 
-94.322 -35.465  -4.364  33.020 123.955 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 337.9730     6.2494  54.081  < 2e-16 ***
logFreq      -5.4601     0.7846  -6.959 2.06e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 48.29 on 118 degrees of freedom
Multiple R-squared:  0.291, Adjusted R-squared:  0.285 
F-statistic: 48.43 on 1 and 118 DF,  p-value: 2.057e-10
hist(residuals(mdl.lm.xreal))

qqnorm(residuals(mdl.lm.xreal)); qqline(residuals(mdl.lm.xreal))

plot(fitted(mdl.lm.xreal),residuals(mdl.lm.xreal), cex = 4)

5.3 Linear Mixed Model

Our Linear Mixed effects Model will take into account the random effects we added and also our model specifications. We use a Maximum Likelihood estimate (REML=FALSE) as this is what we will use for model comparison. The Linear Mixed Model reflects our model specifications: the SD of our subjects is picked up correctly (38.36, close to the 40 we simulated). The fixed-effect results are “almost” the same as in our linear model above: the coefficient for the “Intercept” is at 337.973 and the coefficient for LogFrequency is at -5.460. This indicates that for each unit of increase in LogFrequency, there is a decrease in duration of 5.460 (ms).

mdl.lmer.xreal <- lmer(dur~logFreq+(1|subjects)+(1|items), data=xreal,REML=FALSE)
summary(mdl.lmer.xreal)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: dur ~ logFreq + (1 | subjects) + (1 | items)
   Data: xreal

     AIC      BIC   logLik deviance df.resid 
  1105.8   1119.8   -547.9   1095.8      115 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.06735 -0.60675  0.07184  0.61122  2.39854 

Random effects:
 Groups   Name        Variance Std.Dev.
 items    (Intercept)  589.8   24.29   
 subjects (Intercept) 1471.7   38.36   
 Residual              284.0   16.85   
Number of obs: 120, groups:  items, 20; subjects, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)  337.973     17.587  19.217
logFreq       -5.460      1.004  -5.436

Correlation of Fixed Effects:
        (Intr)
logFreq -0.322
hist(residuals(mdl.lmer.xreal))

qqnorm(residuals(mdl.lmer.xreal)); qqline(residuals(mdl.lmer.xreal))

plot(fitted(mdl.lmer.xreal),residuals(mdl.lmer.xreal), cex = 4)

5.4 Our second Mixed model

This second model adds a by-subject random slope. Random slopes allow the variation that exists in the random effects to be taken into account. An intercept-only model assigns one averaged effect to all participants.

mdl.lmer.xreal.2 <- lmer(dur~logFreq+(logFreq|subjects)+(1|items), data=xreal,REML=FALSE)
summary(mdl.lmer.xreal.2)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: dur ~ logFreq + (logFreq | subjects) + (1 | items)
   Data: xreal

     AIC      BIC   logLik deviance df.resid 
  1109.5   1129.0   -547.7   1095.5      113 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.1087 -0.6068  0.0624  0.5827  2.4565 

Random effects:
 Groups   Name        Variance  Std.Dev. Corr
 items    (Intercept) 5.900e+02 24.2890      
 subjects (Intercept) 1.399e+03 37.4028      
          logFreq     2.898e-02  0.1702  1.00
 Residual             2.829e+02 16.8195      
Number of obs: 120, groups:  items, 20; subjects, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)  337.973     17.239  19.605
logFreq       -5.460      1.007  -5.423

Correlation of Fixed Effects:
        (Intr)
logFreq -0.267
hist(residuals(mdl.lmer.xreal.2))

qqnorm(residuals(mdl.lmer.xreal.2)); qqline(residuals(mdl.lmer.xreal.2))

plot(fitted(mdl.lmer.xreal.2),residuals(mdl.lmer.xreal.2), cex = 4)

5.5 Model comparison

But where are our p values? The lme4 developers decided not to include p values due to various issues with estimating the degrees of freedom. What we can do instead is to compare models. We need to create a null model to allow for significance testing. As expected, our predictor contributes significantly to the model fit.

mdl.lmer.xreal.Null <- lmer(dur~1+(logFreq|subjects)+(1|items), data=xreal,REML=FALSE)
anova(mdl.lmer.xreal.Null,mdl.lmer.xreal.2)
Data: xreal
Models:
mdl.lmer.xreal.Null: dur ~ 1 + (logFreq | subjects) + (1 | items)
mdl.lmer.xreal.2: dur ~ logFreq + (logFreq | subjects) + (1 | items)
                    Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
mdl.lmer.xreal.Null  6 1131.9 1148.6 -559.93   1119.9                         
mdl.lmer.xreal.2     7 1109.5 1129.0 -547.73   1095.5 24.394      1  7.852e-07
                       
mdl.lmer.xreal.Null    
mdl.lmer.xreal.2    ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Also, do we really need random slopes? From the result below, we don’t seem to need random slopes at all, given that adding them does not improve the model fit. I always recommend testing this; most of the time I keep random slopes.

anova(mdl.lmer.xreal,mdl.lmer.xreal.2)
Data: xreal
Models:
mdl.lmer.xreal: dur ~ logFreq + (1 | subjects) + (1 | items)
mdl.lmer.xreal.2: dur ~ logFreq + (logFreq | subjects) + (1 | items)
                 Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
mdl.lmer.xreal    5 1105.8 1119.8 -547.92   1095.8                         
mdl.lmer.xreal.2  7 1109.5 1129.0 -547.73   1095.5 0.3788      2     0.8274

But if you are really (really!!!) obsessed with p values, then you can also use lmerTest. BUT use it only after comparing models to evaluate the contribution of predictors.

library(lmerTest)
mdl.lmer.xreal.lmerTest <- lmer(dur~logFreq+(logFreq|subjects)+(1|items), data=xreal,REML=TRUE)
summary(mdl.lmer.xreal.lmerTest)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: dur ~ logFreq + (logFreq | subjects) + (1 | items)
   Data: xreal

REML criterion at convergence: 1086.1

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.09692 -0.60118  0.06414  0.58479  2.46241 

Random effects:
 Groups   Name        Variance  Std.Dev. Corr
 items    (Intercept) 6.294e+02 25.088       
 subjects (Intercept) 1.651e+03 40.637       
          logFreq     3.421e-02  0.185   1.00
 Residual             2.829e+02 16.819       
Number of obs: 120, groups:  items, 20; subjects, 6

Fixed effects:
            Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)  337.973     18.526   7.394  18.243 2.04e-07 ***
logFreq       -5.460      1.038  18.142  -5.261 5.17e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
        (Intr)
logFreq -0.249
detach("package:lmerTest", unload=TRUE)

5.6 Our final Mixed model

Our final model uses REML (Restricted Maximum Likelihood) to estimate the variance components.

mdl.lmer.xreal.Full <- lmer(dur~logFreq+(logFreq|subjects)+(1|items), data=xreal,REML=TRUE)
summary(mdl.lmer.xreal.Full)
Linear mixed model fit by REML ['lmerMod']
Formula: dur ~ logFreq + (logFreq | subjects) + (1 | items)
   Data: xreal

REML criterion at convergence: 1086.1

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.09692 -0.60118  0.06414  0.58479  2.46241 

Random effects:
 Groups   Name        Variance  Std.Dev. Corr
 items    (Intercept) 6.294e+02 25.088       
 subjects (Intercept) 1.651e+03 40.637       
          logFreq     3.421e-02  0.185   1.00
 Residual             2.829e+02 16.819       
Number of obs: 120, groups:  items, 20; subjects, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)  337.973     18.526  18.243
logFreq       -5.460      1.038  -5.261

Correlation of Fixed Effects:
        (Intr)
logFreq -0.249
anova(mdl.lmer.xreal.Full)
Analysis of Variance Table
        Df Sum Sq Mean Sq F value
logFreq  1 7828.8  7828.8  27.677
hist(residuals(mdl.lmer.xreal.Full))

qqnorm(residuals(mdl.lmer.xreal.Full)); qqline(residuals(mdl.lmer.xreal.Full))

plot(fitted(mdl.lmer.xreal.Full),residuals(mdl.lmer.xreal.Full), cex = 4)

5.7 Dissecting the model

coef(mdl.lmer.xreal.Full)
$`items`
       (Intercept)   logFreq
Item1     352.3564 -5.460115
Item10    331.7619 -5.460115
Item11    324.7272 -5.460115
Item12    350.2316 -5.460115
Item13    353.1171 -5.460115
Item14    311.8359 -5.460115
Item15    354.0588 -5.460115
Item16    353.9386 -5.460115
Item17    288.7852 -5.460115
Item18    362.4698 -5.460115
Item19    338.1424 -5.460115
Item2     325.1858 -5.460115
Item20    359.7410 -5.460115
Item3     370.1798 -5.460115
Item4     302.4272 -5.460115
Item5     350.0497 -5.460115
Item6     338.9482 -5.460115
Item7     362.8397 -5.460115
Item8     295.5951 -5.460115
Item9     333.0694 -5.460115

$subjects
   (Intercept)   logFreq
S1    314.4695 -5.567091
S2    303.9038 -5.615180
S3    314.2921 -5.567898
S4    318.4283 -5.549073
S5    373.3005 -5.299324
S6    403.4440 -5.162127

attr(,"class")
[1] "coef.mer"
fixef(mdl.lmer.xreal.Full)
(Intercept)     logFreq 
 337.973044   -5.460115 
fixef(mdl.lmer.xreal.Full)[1]
(Intercept) 
    337.973 
fixef(mdl.lmer.xreal.Full)[2]
  logFreq 
-5.460115 
coef(mdl.lmer.xreal.Full)$`subjects`[1]
coef(mdl.lmer.xreal.Full)$`subjects`[2]
coef(mdl.lmer.xreal.Full)$`items`[1]
coef(mdl.lmer.xreal.Full)$`items`[2]
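We can also extract the random-effect deviations themselves (a small sketch; the per-subject and per-item values returned by coef() above are simply fixef() plus these deviations):

## by-subject and by-item deviations from the fixed effects
ranef(mdl.lmer.xreal.Full)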

5.8 Using predictions from our model

In general, I use the predictions from my final model in any plots. To generate these, we can use the following.

xreal$Pred_Dur <- predict(lmer(dur~logFreq+(logFreq|subjects)+(1|items), data=xreal,REML=TRUE))
ggplot.xreal.Pred <- ggplot(xreal,aes(x=logFreq,y=Pred_Dur))+
                geom_point()+theme_bw(base_size = 20)+
                labs(y = "Duration", x = "Frequency (Log)",title="Predicted")+
                geom_smooth(method = lm, se=F)+coord_cartesian(ylim=c(200,450))
ggplot.xreal.Pred

## original plot
ggplot.xreal.Original <- ggplot(xreal,aes(x=logFreq,y=dur))+
                geom_point()+theme_bw(base_size = 20)+
                labs(y = "Duration", x = "Frequency (Log)",title="Original")+
                geom_smooth(method = lm, se=F)+coord_cartesian(ylim=c(200,450))
ggplot.xreal.Original

5.9 GLMM and CLMM

The code above used Linear Mixed Effects Modelling; the outcome was a numeric object. In some cases (as we have seen above), we may have:

  1. Binary outcome (binomial)
  2. Count data (poisson)
  3. Multi-category outcome (multinomial)
  4. Rating data (cumulative function)

The code below gives you an idea of how to specify these models.

## Binomial family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=binomial)
## Poisson family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=poisson)
## Multinomial family
## a bit complicated as there is a need to use Bayesian approaches, see e.g.,
## glmmADMB
## mixcat
## MCMCglmm
## see https://gist.github.com/casallas/8263818
## Rating data, use following
## ordinal::clmm(outcome~predictor(s)+(1|subject)+(1|items)..., data=data)
## Remember to test for random effects and whether slopes are needed.

6 Running your code and automatically drawing quality tables for publications

See this website for how to run models and automatically draw tables for your publications. I haven’t tested this but it is in development: https://www.r-bloggers.com/elegant-regression-results-tables-and-plots-in-r-the-finalfit-package/amp
