1 Loading packages

## Use the code below to check whether all required packages are installed.
## Any missing packages will be installed automatically; all packages are then loaded.
requiredPackages = c('tidyverse', 'ordinal', 'broom', 'emmeans', 'knitr', 'Hmisc', 'corrplot', 'psycho', 'PresenceAbsence', 'lme4', 'lmerTest')
for(p in requiredPackages){
  if(!require(p,character.only = TRUE)) install.packages(p)
  library(p,character.only = TRUE)
}

In this session, we will look at basic functions in R that will help us in running some inferential statistics. These will help us to evaluate the relationship between one (or more) predictor(s) (independent variable) and an outcome (dependent variable).

It is important to know the class of the outcome before doing any pre-data analyses or inferential statistics. Outcome classes can be one of:

  1. Numeric: As an example, we have length/width of leaf; height of mountain; fundamental frequency of the voice; etc. These are true numbers and we can use summaries, t-tests, linear models, etc.

  2. Categorical (Unordered): Observations for two or more categories. As an example, we can have gender of a speaker (male or female); responses to a True vs False perception tests; Colour (unordered) categorisation, e.g., red, blue, yellow, orange, etc.. For these we can use a Generalised Linear Model (binomial or multinomial) or a simple chi-square test. Count data are numbers related to a category. But these should be analysed using a poisson logistic regression

  3. Categorical (Ordered): When you run a rating experiment, where the outcome is either numeric (i.e., 1, 2, 3, 4, 5) or categories (i.e., disagree, neutral, agree). The numeric option is NOT a true number as for the participant, these are categories. Cumulative Logit models (or Generalised Linear Model with a cumulative function) are used. The mean is meaningless here, and the median is a preferred descriptive statistic.
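Before any analysis, it is worth verifying the class of each variable in R. A quick base-R check on the built-in iris dataset (used later in this session):

```r
# Check the class of a numeric outcome and a categorical one
class(iris$Sepal.Length)  # "numeric": true numbers, suitable for t-tests/linear models
class(iris$Species)       # "factor": unordered categories
is.ordered(iris$Species)  # FALSE: not an ordered factor, so no cumulative logit needed
```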

The content will be split into 5 sections. Some of you will only ever look at some of these, while others may need to use all. Look at the section that works best for your data.

  1. Pre-data analyses: We will look at summaries, plotting, correlations, checking normality of distribution and homogeneity of variance (for t-tests)

  2. Statistical Analyses: from t-test and ANOVA, to Linear Models: model estimation, understanding coefficients, residuals, and predictions

  3. We then move to Generalised Linear Models, with Logistic Regression (with a binomial categorical outcome) and touch upon Signal Detection Theory for estimating accuracy, biases, with sensitivity and specificity measures.

  4. Dealing with rating data using Generalised Linear Models, with a Cumulative Logit function

  5. Linear Mixed effects Models: Introduction to random effects and how to deal with these.

2 Pre-data analyses

2.1 Built-in datasets

We will use one of the datasets built into R. You can check all datasets available in R using the following:

data()
# or below for all datasets available in all installed packages
data(package = .packages(all.available = TRUE))

We will use the iris dataset from the built-in datasets package (loaded by default in R).

2.2 Checking structure and summaries

2.2.1 Structure

iris %>% 
  str()  
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We have a dataframe with 150 observations and 5 variables; 4 numeric and 1 factor with 3 levels.
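The same facts can be confirmed programmatically (a quick base-R check):

```r
dim(iris)             # 150 rows, 5 columns
sapply(iris, class)   # four "numeric" columns and one "factor"
nlevels(iris$Species) # 3 levels
```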

2.2.2 Summary

We summarise the data to see the trends:

iris %>% 
  summary()
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

So we have a balanced dataframe (50 observations under each level of the factor Species), with no missing values (aka NA).

2.2.3 Advanced

2.2.3.1 For a specific variable

What if you want to summarise the data and get the mean, SD, by each level of the factor for Sepal.Length?

iris %>% 
  group_by(Species) %>% 
  summarise(
  SL.Mean = mean(Sepal.Length),
  SL.SD = sd(Sepal.Length)
  )

2.2.3.2 For all variables

iris %>% 
  group_by(Species) %>% 
  summarise_all(list(mean = mean, sd = sd)
  )

2.2.4 Up to you

Do some additional summaries. You may want to check the median, range, etc.
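One possible answer (a base-R sketch using tapply, equivalent to the group_by/summarise pattern above):

```r
# Median and range of Sepal.Length per species
tapply(iris$Sepal.Length, iris$Species, median)
tapply(iris$Sepal.Length, iris$Species, range)
```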

2.3 Plot

We can make a boxplot of the data and add a trend line. This allows us to visualise the median and quartiles, in addition to the spread of the data and any outliers… All in the same plot!

iris %>% 
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  geom_smooth(aes(x = as.numeric(Species), y = Sepal.Length), method = "lm") +
  labs(x = "Species", y = "Length", title = "Boxplot and trend line", subtitle = "with ggplot2") + 
  theme_bw() + theme(text = element_text(size = 15))

Here I have used the variable Sepal.Length. You can use any of the additional variables to plot the data.

2.4 Correlation tests

2.4.1 Basic correlations

We use the function cor to obtain the Pearson correlation coefficient, and cor.test to run a correlation test on our data with significance testing

cor(iris$Sepal.Length, iris$Petal.Length, method = "pearson")
[1] 0.8717538
cor.test(iris$Sepal.Length, iris$Petal.Length)

    Pearson's product-moment correlation

data:  iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8270363 0.9055080
sample estimates:
      cor 
0.8717538 

2.4.1.1 Up to you

Can you check whether there is a correlation between the Sepal.Length and the Petal.Width? What about Petal.Length and Petal.Width?
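A sketch of the first check (the second follows exactly the same pattern):

```r
# Correlation between Sepal.Length and Petal.Width, with significance testing
cor(iris$Sepal.Length, iris$Petal.Width, method = "pearson")
cor.test(iris$Sepal.Length, iris$Petal.Width)
```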

2.4.2 Using the package corrplot

Above, we ran a correlation test on two predictors. We can also run multiple correlations on multiple predictors at once.

2.4.2.1 Correlations

## correlation using "corrplot"
## based on the function `rcorr` from the `Hmisc` package
## Need to change dataframe into a matrix
corr <- as.matrix(iris[-5]) %>% 
  rcorr(type="pearson")
print(corr)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         1.00       -0.12         0.87        0.82
Sepal.Width         -0.12        1.00        -0.43       -0.37
Petal.Length         0.87       -0.43         1.00        0.96
Petal.Width          0.82       -0.37         0.96        1.00

n= 150 


P
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length              0.1519      0.0000       0.0000     
Sepal.Width  0.1519                   0.0000       0.0000     
Petal.Length 0.0000       0.0000                   0.0000     
Petal.Width  0.0000       0.0000      0.0000                  
corrplot(corr$r, p.mat = corr$P,
         addCoef.col = "black", diag = FALSE, type = "upper", tl.srt = 55)

2.4.2.2 Up to you

Look into the corrplot specification, using ?corrplot and amend some of the criteria. Run a correlation plot while filtering the data according to the species.

# hint 
# create a new dataframe by filtering `Species`, then compute correlations and plot
  
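One way to follow the hint (a sketch: the object names are mine, base-R subsetting is used so the chunk runs on its own, and the corrplot call assumes the package is loaded as above):

```r
# Filter to one species, then correlate the four numeric columns
irisSetosa <- subset(iris, Species == "setosa")
corrSetosa <- cor(irisSetosa[-5], method = "pearson")
round(corrSetosa, 2)
# with corrplot loaded:
# corrplot(corrSetosa, addCoef.col = "black", diag = FALSE, type = "upper")
```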

Up to now, we have done some basic summaries and checked the correlations in the data. The Pearson correlations we ran provided significance levels for the correlations between pairs of numeric variables. We continue by examining the normality of the distribution of our data.

2.5 Normality of distribution

2.5.1 Subsetting data

In the iris dataset, we have a categorical predictor: Species which has three levels

levels(iris$Species)
[1] "setosa"     "versicolor" "virginica" 

Let’s subset the data to setosa and versicolor. We will also check the normality and homogeneity of variance in the data

irisSub <- iris %>% 
  filter(Species %in% c("setosa", "versicolor"))
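A note of caution: after filtering, the factor Species still remembers all three levels, and empty levels can trip up later tests and plots. A quick sketch of the fix with droplevels (shown with base-R subset so it runs on its own; filter() behaves the same way):

```r
# After filtering, the factor keeps the unused "virginica" level
irisSub <- subset(iris, Species %in% c("setosa", "versicolor"))
levels(irisSub$Species)           # still three levels
irisSub$Species <- droplevels(irisSub$Species)
levels(irisSub$Species)           # now only "setosa" and "versicolor"
```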

2.5.2 Shapiro test

To check normality of distribution, we use shapiro.test on the numeric outcome. Given that our predictor now has two levels, we need to subset the data again to check the normality of the outcome Sepal.Length within each level of our factor Species

irisSubSet <- iris %>% 
  filter(Species == "setosa")
irisSubVers <- iris %>% 
  filter(Species == "versicolor")
  
shapiro.test(irisSubSet$Sepal.Length)

    Shapiro-Wilk normality test

data:  irisSubSet$Sepal.Length
W = 0.9777, p-value = 0.4595
shapiro.test(irisSubVers$Sepal.Length)

    Shapiro-Wilk normality test

data:  irisSubVers$Sepal.Length
W = 0.97784, p-value = 0.4647

How do we interpret this non-statistically-significant result? It tells us that the distribution of the data is not statistically different from a normal distribution.

2.5.3 Density plot

We can also use a density plot to evaluate normality. The results show that both levels have bell shaped distributions.

irisSub %>% 
  ggplot(aes(x = Sepal.Length))+
  geom_density()+
  facet_wrap(~Species, scales = "free_x")

2.5.4 Homogeneity of variance

Because our data are normally distributed, we can use the Bartlett test. If our data were non-normally distributed, we would use the Levene test (leveneTest from the car package) instead; base R's var.test provides a related F test for comparing two variances. We can check all three.

2.5.4.1 Bartlett test

irisSub %>% 
  bartlett.test(Sepal.Length ~ Species, data = .)

    Bartlett test of homogeneity of variances

data:  Sepal.Length by Species
Bartlett's K-squared = 6.8917, df = 1, p-value = 0.00866

2.5.4.2 Levene test

irisSub %>% 
  var.test(Sepal.Length ~ Species, data = .)

    F test to compare two variances

data:  Sepal.Length by Species
F = 0.46634, num df = 49, denom df = 49, p-value = 0.008657
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.2646385 0.8217841
sample estimates:
ratio of variances 
         0.4663429 
irisSub %>% 
  car::leveneTest(Sepal.Length ~ Species, data = .)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value   Pr(>F)   
group  1  8.1727 0.005196 **
      98                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In all cases, the statistically significant result provides evidence that the variances of the two levels of the factor Species are not equal. This is important, as we will use it in our t-test later on

3 Linear Models

Up to now, we have looked at descriptive statistics, and evaluated summaries, correlations in the data (with p values), and checked the normality of distribution of our data.

We are now interested in looking at group differences.

Let us start with a simple t-test

3.1 First steps

3.1.1 T-test

We then run a t-test on the data. We specify the formula as y ~ x and add var.equal = FALSE (based on the homogeneity-of-variance tests above)

irisSub %>% 
  t.test(Sepal.Length ~ Species, data = ., var.equal = FALSE)

    Welch Two Sample t-test

data:  Sepal.Length by Species
t = -10.521, df = 86.538, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.1057074 -0.7542926
sample estimates:
    mean in group setosa mean in group versicolor 
                   5.006                    5.936 

To interpret the t-test, we say that there is evidence for a statistically significant difference between the two groups, setosa and versicolor: t(86.54) = -10.52, p < 2.2e-16. The mean of setosa is significantly lower than that of versicolor.

3.1.2 Linear Model

Let us run a linear model on the same data

irisSub %>% 
  lm(Sepal.Length ~ Species, data = .) %>% 
  summary()

Call:
lm(formula = Sepal.Length ~ Species, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0360 -0.3135 -0.0060  0.2715  1.0640 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        5.00600    0.06250   80.09   <2e-16 ***
Speciesversicolor  0.93000    0.08839   10.52   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.442 on 98 degrees of freedom
Multiple R-squared:  0.5304,    Adjusted R-squared:  0.5256 
F-statistic: 110.7 on 1 and 98 DF,  p-value: < 2.2e-16

Any comments? discuss with your neighbour.

The results of the linear model are exactly the same, albeit with the sign of the difference reversed. This indicates that the \(\beta\) coefficient for versicolor is significantly higher than that of setosa by 0.93: t(98) = 10.52, p < 2e-16.

The dataset iris contains three species. We will run an ANOVA and a linear model on the data.

3.1.3 Basic ANOVA

We can use the function aov to run an Analysis of Variance on the full dataset

iris %>% 
  aov(Sepal.Length ~ Species, data = .) %>% 
  summary()
             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

3.1.4 Linear model

We can use the function lm to run a linear model

iris %>% 
  lm(Sepal.Length ~ Species, data = .) %>% 
  summary()

Call:
lm(formula = Sepal.Length ~ Species, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,    Adjusted R-squared:  0.6135 
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

But wait… How is the linear model comparable to the analysis of variance we ran above? The analysis of variance can be derived from this linear model: use anova() on your linear model.

Here are the results of the initial Analysis of variance:

iris %>% 
  aov(Sepal.Length ~ Species, data = .) %>% 
  summary()
             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

And here are the results of the linear model with the anova function

iris %>% 
  lm(Sepal.Length ~ Species, data = .) %>% 
  anova()
Analysis of Variance Table

Response: Sepal.Length
           Df Sum Sq Mean Sq F value    Pr(>F)    
Species     2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals 147 38.956   0.265                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

They are exactly the same… Underlying an Analysis of Variance is a linear model. We will continue with the linear model to understand it better
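We can verify the equivalence directly: the F statistic from anova() on the linear model matches the one from aov() (a quick sketch; the object names are mine):

```r
# Fit the same model both ways and compare the F statistics
mdl <- lm(Sepal.Length ~ Species, data = iris)
aov.mdl <- aov(Sepal.Length ~ Species, data = iris)
anova(mdl)$`F value`[1]            # from the linear model
summary(aov.mdl)[[1]]$`F value`[1] # from the ANOVA: identical
```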

3.2 Linear Model

A linear model performs a regression analysis on the data. We have an outcome (or dependent variable) and a predictor (or independent variable). The formula of a linear model is outcome ~ predictor, which can be read as “outcome as a function of the predictor”. We can add “1” to specify an intercept explicitly, but the intercept is added to the model by default

3.2.1 Model estimation

mdl.lm <- iris %>% 
  lm(Sepal.Length ~ Species, data = .)
# same as below.
#mdl.lm <- lm(Sepal.Length ~ 1 + Species, data = iris)
mdl.lm #also print(mdl.lm)

Call:
lm(formula = Sepal.Length ~ Species, data = .)

Coefficients:
      (Intercept)  Speciesversicolor   Speciesvirginica  
            5.006              0.930              1.582  
summary(mdl.lm)

Call:
lm(formula = Sepal.Length ~ Species, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,    Adjusted R-squared:  0.6135 
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

3.2.2 Tidying the output

# from library(broom)
tidy(mdl.lm) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
mycoefE <- tidy(mdl.lm) %>% pull(estimate)

To interpret the model, we look at the coefficients. The Intercept (= setosa) is 5.006, and the coefficients for versicolor and virginica are respectively 0.930 and 1.582. This tells us that, compared with setosa, moving to versicolor leads to a significant increase in Sepal.Length of 0.930, and moving to virginica leads to a significant increase of 1.582.

3.2.3 Obtaining our “true” coefficients

But where are our actual values based on the means in the table above?

We run a model that suppresses the intercept (i.e., adding 0 instead of 1), which allows us to obtain the “true” coefficient (the mean) for each level of our predictor. This is also known as a cell-means model

mdl.lm.2 <- iris %>% 
  lm(Sepal.Length ~ 0 + Species, data = .)
summary(mdl.lm.2)

Call:
lm(formula = Sepal.Length ~ 0 + Species, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
Speciessetosa       5.0060     0.0728   68.76   <2e-16 ***
Speciesversicolor   5.9360     0.0728   81.54   <2e-16 ***
Speciesvirginica    6.5880     0.0728   90.49   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.9925,    Adjusted R-squared:  0.9924 
F-statistic:  6522 on 3 and 147 DF,  p-value: < 2.2e-16

This matches the original data. Setosa has a mean of 5.006, versicolor 5.936, and virginica 6.588. See table above and coefficients below

#Setosa
mycoefE[1]
[1] 5.006
#Versicolor
mycoefE[1] + mycoefE[2]
[1] 5.936
#Virginica
mycoefE[1] + mycoefE[3]
[1] 6.588

The same as

tidy(mdl.lm.2) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
mycoefE <- tidy(mdl.lm.2) %>%
  pull(estimate)

#Setosa
mycoefE[1]
[1] 5.006
#Versicolor
mycoefE[2]
[1] 5.936
#Virginica
mycoefE[3]
[1] 6.588

3.2.4 Nice table of our model summary

We can also obtain a nice table of our model summary. We can use the package knitr or xtable

3.2.4.1 Directly from model summary

kable(summary(mdl.lm)$coef, digits = 3)

                  Estimate Std. Error t value Pr(>|t|)
(Intercept)          5.006      0.073  68.762        0
Speciesversicolor    0.930      0.103   9.033        0
Speciesvirginica     1.582      0.103  15.366        0


3.2.4.2 From the tidy output

mdl.lmT <- tidy(mdl.lm)
kable(mdl.lmT, digits = 3)
term              estimate std.error statistic p.value
(Intercept)          5.006     0.073    68.762       0
Speciesversicolor    0.930     0.103     9.033       0
Speciesvirginica     1.582     0.103    15.366       0

3.2.5 Dissecting the model

Let us dissect the model. If you use str(), you will see everything that is available under our linear model object, and you can then access specific pieces of information from it.

3.2.5.1 “str” and “coef”

str(mdl.lm)
List of 13
 $ coefficients : Named num [1:3] 5.01 0.93 1.58
  ..- attr(*, "names")= chr [1:3] "(Intercept)" "Speciesversicolor" "Speciesvirginica"
 $ residuals    : Named num [1:150] 0.094 -0.106 -0.306 -0.406 -0.006 ...
  ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
 $ effects      : Named num [1:150] -71.5659 0.8025 7.91 -0.3826 0.0174 ...
  ..- attr(*, "names")= chr [1:150] "(Intercept)" "Speciesversicolor" "Speciesvirginica" "" ...
 $ rank         : int 3
 $ fitted.values: Named num [1:150] 5.01 5.01 5.01 5.01 5.01 ...
  ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
 $ assign       : int [1:3] 0 1 1
 $ qr           :List of 5
  ..$ qr   : num [1:150, 1:3] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:3] "(Intercept)" "Speciesversicolor" "Speciesvirginica"
  .. ..- attr(*, "assign")= int [1:3] 0 1 1
  .. ..- attr(*, "contrasts")=List of 1
  .. .. ..$ Species: chr "contr.treatment"
  ..$ qraux: num [1:3] 1.08 1.05 1.09
  ..$ pivot: int [1:3] 1 2 3
  ..$ tol  : num 1e-07
  ..$ rank : int 3
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 147
 $ contrasts    :List of 1
  ..$ Species: chr "contr.treatment"
 $ xlevels      :List of 1
  ..$ Species: chr [1:3] "setosa" "versicolor" "virginica"
 $ call         : language lm(formula = Sepal.Length ~ Species, data = .)
 $ terms        :Classes 'terms', 'formula'  language Sepal.Length ~ Species
  .. ..- attr(*, "variables")= language list(Sepal.Length, Species)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "Sepal.Length" "Species"
  .. .. .. ..$ : chr "Species"
  .. ..- attr(*, "term.labels")= chr "Species"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x00000267f1c2e870> 
  .. ..- attr(*, "predvars")= language list(Sepal.Length, Species)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
  .. .. ..- attr(*, "names")= chr [1:2] "Sepal.Length" "Species"
 $ model        :'data.frame':  150 obs. of  2 variables:
  ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
  ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language Sepal.Length ~ Species
  .. .. ..- attr(*, "variables")= language list(Sepal.Length, Species)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "Sepal.Length" "Species"
  .. .. .. .. ..$ : chr "Species"
  .. .. ..- attr(*, "term.labels")= chr "Species"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x00000267f1c2e870> 
  .. .. ..- attr(*, "predvars")= language list(Sepal.Length, Species)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
  .. .. .. ..- attr(*, "names")= chr [1:2] "Sepal.Length" "Species"
 - attr(*, "class")= chr "lm"
coef(mdl.lm)
      (Intercept) Speciesversicolor  Speciesvirginica 
            5.006             0.930             1.582 
## same as 
## mdl.lm$coefficients

3.2.5.2 “coef” and “coefficients”

What if I want to obtain the “(Intercept)”? Or the coefficient for versicolor? What if I want the full row for versicolor?

coef(mdl.lm)[1] # same as mdl.lm$coefficients[1]
(Intercept) 
      5.006 
coef(mdl.lm)[2] # same as mdl.lm$coefficients[2]
Speciesversicolor 
             0.93 
summary(mdl.lm)$coefficients[2, ] # full row
    Estimate   Std. Error      t value     Pr(>|t|) 
9.300000e-01 1.029579e-01 9.032819e+00 8.770194e-16 
summary(mdl.lm)$coefficients[2, 4] #for p value
[1] 8.770194e-16

3.2.5.3 Up to you

Play around with the model summary and obtain the t values for the three levels. You can do this by indexing the coefficient table as above
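One possible route (a sketch: the model is refitted here so the chunk runs on its own):

```r
# Extract all t values at once from the third column of the coefficient table
mdl.lm <- lm(Sepal.Length ~ Species, data = iris)
summary(mdl.lm)$coefficients[, "t value"]
# or row by row:
summary(mdl.lm)$coefficients[1, 3]  # intercept (setosa)
```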

3.2.5.4 Residuals

What about the residuals (the differences between the observed values and the values fitted by the model) and the fitted values themselves?

hist(residuals(mdl.lm))

qqnorm(residuals(mdl.lm)); qqline(residuals(mdl.lm))

plot(fitted(mdl.lm), residuals(mdl.lm), cex = 4)

3.2.5.5 Goodness of fit?

AIC(mdl.lm) # Akaike's Information Criterion, lower values are better
[1] 231.452
BIC(mdl.lm) # Bayesian Information Criterion, lower values are better
[1] 243.4945
logLik(mdl.lm)  # log likelihood
'log Lik.' -111.726 (df=4)

Or use the following from broom

glance(mdl.lm)

3.2.5.6 Significance testing

Are the above informative on their own? Not directly. If we want to test the overall significance of the model, we run a null model (aka intercept-only model) and compare the two models.

mdl.lm.Null <- iris %>% 
  lm(Sepal.Length ~ 1, data = .)
mdl.comp <- anova(mdl.lm.Null, mdl.lm)
mdl.comp
Analysis of Variance Table

Model 1: Sepal.Length ~ 1
Model 2: Sepal.Length ~ Species
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    149 102.168                                  
2    147  38.956  2    63.212 119.26 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The results show that adding the factor “Species” improves the model fit. We can write this as follows: model comparison showed that the addition of Species significantly improved the model fit when compared with an intercept-only model (\(F\)(2, 147) = 119.26, \(p\) = 1.67 × 10^{-31})

3.2.5.7 Plotting fitted values

3.2.5.7.1 Trend line

Let’s plot our fitted values but only for the trend line

iris %>% 
  ggplot(aes(x = Species, y = Sepal.Length))+
  geom_boxplot() +
  labs(x = "Species", y = "Length", title = "Boxplot and predicted trend line", subtitle = "with ggplot2") + 
  theme_bw() + theme(text = element_text(size = 15))+
  geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue")

This allows us to plot the fitted values from our model with the predicted linear trend. This is exactly the same as our original data.

3.2.5.7.2 Predicted means and the trend line

We can also plot the predicted means and linear trend

iris %>% 
  ggplot(aes(x = Species, y = predict(mdl.lm)))+
  geom_boxplot(color = "blue") +
  labs(x = "Species", y = "Length", title = "Predicted means and trend line", subtitle = "with ggplot2") + 
  theme_bw() + theme(text = element_text(size = 15))+
  geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue")

3.2.5.7.3 Raw data, predicted means and the trend line

We can also plot the actual data, the predicted means and linear trend

iris %>% 
  ggplot(aes(x = Species, y = Sepal.Length))+
  geom_boxplot()+
  geom_boxplot(aes(x = Species, y = predict(mdl.lm)), color = "blue") +
  labs(x = "Species", y = "Length", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with ggplot2") + 
  theme_bw() + theme(text = element_text(size = 15))+
  geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue")

3.2.6 What about pairwise comparison?

Based on our model’s summary, can you tell me if there is a difference between Versicolor and Virginica?

summary(mdl.lm)

Call:
lm(formula = Sepal.Length ~ Species, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,    Adjusted R-squared:  0.6135 
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16
mdl.lm %>% emmeans(pairwise ~ Species, adjust = "fdr")
$emmeans
 Species    emmean     SE  df lower.CL upper.CL
 setosa       5.01 0.0728 147     4.83     5.18
 versicolor   5.94 0.0728 147     5.76     6.11
 virginica    6.59 0.0728 147     6.41     6.76

Confidence level used: 0.95 
Conf-level adjustment: bonferroni method for 3 estimates 

$contrasts
 contrast               estimate    SE  df t.ratio p.value
 setosa - versicolor      -0.930 0.103 147  -9.033 <.0001 
 setosa - virginica       -1.582 0.103 147 -15.366 <.0001 
 versicolor - virginica   -0.652 0.103 147  -6.333 <.0001 

P value adjustment: fdr method for 3 tests 

How to interpret the output? Discuss with your neighbour and share with the group.

Hint… Look at the emmeans values for each level of our factor “Species” and the contrasts.

3.3 Other outcomes?

So far, we have only looked at “Sepal.Length”. What about the other outcomes? How informative are they? Is there a statistical difference between the three levels of our predictor? You can explore these in your spare time.

3.4 Conclusion

We have so far looked at the Linear Model. The underlying assumption of linear models is that the outcome has a normal (Gaussian) distribution. This is the model to use when we have a numeric outcome. What if our outcome is not numeric? What if we have two categories, i.e., black vs white? correct vs incorrect? yes vs no? These are categorical binary outcomes. In the next section, we look at Logistic Regression

4 Generalised Linear Models

Here we will look at an example where the outcome is binary. The simulated data are structured as follows: we asked one participant to listen to 165 sentences and to judge whether each is “grammatical” or “ungrammatical”. There were 105 sentences that were “grammatical” and 60 that were “ungrammatical”. This fictitious example applies to any situation where you need to “categorise” the outcome into two groups. Think geography: 165 areas of land, 105 “flat” and 60 “non-flat”, etc.

4.1 Load and summaries

Let’s load in the data and do some basic summaries

grammatical <- read_csv("grammatical.csv")
grammatical
str(grammatical)
tibble [165 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ grammaticality: chr [1:165] "grammatical" "grammatical" "grammatical" "grammatical" ...
 $ response      : chr [1:165] "yes" "yes" "yes" "yes" ...
 - attr(*, "spec")=
  .. cols(
  ..   grammaticality = col_character(),
  ..   response = col_character()
  .. )
head(grammatical)

4.2 GLM

Let’s run a first GLM (Generalised Linear Model). This GLM uses the “binomial” family, as it assumes the outcome has a binomial distribution. In general, results from a Logistic Regression are close to what we get from Signal Detection Theory (SDT; see below).

Before running the model, we change the reference levels for both response and grammaticality. The reference level in this GLM is the “no” response to the “ungrammatical” category; the coefficients then express the change associated with a “yes” response to the “grammatical” category.

4.2.1 Model estimation and results

The results below show the logodds for our model.

grammatical <- grammatical %>% 
  mutate(response = factor(response, levels = c("no", "yes")),
         grammaticality = factor(grammaticality, levels = c("ungrammatical", "grammatical")))

grammatical %>% 
  group_by(grammaticality, response) %>% 
  table()
               response
grammaticality   no yes
  ungrammatical  50  10
  grammatical     5 100
mdl.glm <- grammatical %>% 
  glm(response ~ grammaticality, data = ., family = binomial)
summary(mdl.glm)

Call:
glm(formula = response ~ grammaticality, family = binomial, data = .)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4676  -0.6039   0.3124   0.3124   1.8930  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -1.6094     0.3464  -4.646 3.38e-06 ***
grammaticalitygrammatical   4.6052     0.5744   8.017 1.08e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 210.050  on 164  degrees of freedom
Residual deviance:  94.271  on 163  degrees of freedom
AIC: 98.271

Number of Fisher Scoring iterations: 5
tidy(mdl.glm) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm) %>% pull(estimate)

The results show that moving from "ungrammatical" to "grammatical" increases the log-odds of a "yes" response by 4.6051702 (the intercept shows that for "ungrammatical" items, the log-odds of a "yes" response are -1.6094379). The actual log-odds of a "yes" response to "grammatical" are therefore -1.6094379 + 4.6051702 = 2.9957323.
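Because we only have one categorical predictor, these coefficients can be recovered directly from the counts in the table above; a quick sanity check:

```r
# log-odds of "yes" for "ungrammatical" items: 10 yes vs 50 no
log(10 / 50)                 # -1.609438, the intercept
# log-odds of "yes" for "grammatical" items: 100 yes vs 5 no
log(100 / 5)                 # 2.995732
# their difference is the grammaticality coefficient
log(100 / 5) - log(10 / 50)  # 4.60517
```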

4.2.2 Logodds to Odd ratios

Log-odds can be transformed into the odds of an event. For our model above, the odds of a "yes" response to an "ungrammatical" item are a mere 0.2; the odds of a "yes" response to a "grammatical" item are 20, i.e., for grammatical items a "yes" is 20 times more likely than a "no".

exp(mycoef2[1])
[1] 0.2
exp(mycoef2[1] + mycoef2[2])
[1] 20

4.2.3 LogOdds to proportions

If you want to talk about the predicted probabilities from our model, we can transform the log-odds into proportions. This shows that the proportion of "yes" responses is about 17% for "ungrammatical" items and rises to about 95% for "grammatical" items.

plogis(mycoef2[1])
[1] 0.1666667
plogis(mycoef2[1] + mycoef2[2])
[1] 0.952381

4.2.4 Plotting

grammatical <- grammatical %>% 
  mutate(prob = predict(mdl.glm, type = "response"))
grammatical %>% 
  ggplot(aes(x = as.numeric(grammaticality), y = prob)) +
  geom_point() +
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = T) + theme_bw(base_size = 20)+
    labs(y = "Probability", x = "")+
    coord_cartesian(ylim = c(0,1))+
    scale_x_discrete(limits = c("Ungrammatical", "Grammatical"))

4.3 Accuracy and Signal Detection Theory

4.3.1 Rationale

We are generally interested in performance, i.e., whether we have "accurately" categorised the outcome or not, and at the same time we want to evaluate any biases in responses. When deciding on categories, we are usually biased in our selection.

Let’s ask the question: How many of you have a Mac laptop and how many a Windows laptop? For those with a Mac, what was the main reason for choosing it? Are you biased in anyway by your decision?

To correct for these biases, we use some variants from Signal Detection Theory to obtain the true estimates without being influenced by the biases.

4.3.2 Running stats

Let’s do some stats on this

                            Yes        No        Total
Grammatical (Yes Actual)    TP = 100   FN = 5    105
Ungrammatical (No Actual)   FP = 10    TN = 50   60
Total                       110        55        165
grammatical <- grammatical %>% 
  mutate(response = factor(response, levels = c("yes", "no")),
         grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))

## TP = True Positive (Hit); FP = False Positive; FN = False Negative; TN = True Negative


TP <- nrow(grammatical %>% 
             filter(grammaticality == "grammatical" &
                      response == "yes"))
FN <- nrow(grammatical %>% 
             filter(grammaticality == "grammatical" &
                      response == "no"))
FP <- nrow(grammatical %>% 
             filter(grammaticality == "ungrammatical" &
                      response == "yes"))
TN <- nrow(grammatical %>% 
             filter(grammaticality == "ungrammatical" &
                      response == "no"))
TP
[1] 100
FN
[1] 5
FP
[1] 10
TN
[1] 50
Total <- nrow(grammatical)
Total
[1] 165
(TP+TN)/Total # accuracy
[1] 0.9090909
(FP+FN)/Total # error, also 1-accuracy
[1] 0.09090909
# When stimulus = yes, how many times response = yes?
TP/(TP+FN) # True Positive Rate, also called Sensitivity (or Recall)
[1] 0.952381
# When stimulus = no, how many times response = yes?
FP/(FP+TN) # False Positive Rate, 
[1] 0.1666667
# When stimulus = no, how many times response = no?
TN/(FP+TN) # True Negative Rate, also called Specificity
[1] 0.8333333
# When subject responds "yes" how many times is (s)he correct?
TP/(TP+FP) # precision
[1] 0.9090909
# getting dprime (the sensitivity index); beta (bias criterion; lower values = more "yes" responses);
# Aprime (estimate of discriminability, 0-1; 1 = good discrimination, 0.5 = chance);
# bppd (b double prime d, -1 to 1; 0 = no bias, negative = tendency to respond "yes",
# positive = tendency to respond "no"); c (criterion location, in standard deviation units)
# (see also https://www.r-bloggers.com/compute-signal-detection-theory-indices-with-r/amp/) 
psycho::dprime(TP, FP, FN, TN, 
               n_targets = TP+FN, 
               n_distractors = FP+TN,
               adjust=F)
$dprime
[1] 2.635813

$beta
[1] 0.3970026

$aprime
[1] 0.9419643

$bppd
[1] -0.5076923

$c
[1] -0.3504848

The most important measure above is d-prime. It models the difference between the "True Positive" and "False Positive" response rates in standard units (z-scores). The formula can be written as:

d' (d prime) = Z(True Positive Rate) - Z(False Positive Rate)
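We can verify this formula by hand with `qnorm`, which converts rates into z-scores, using the rates computed above:

```r
TPR <- 100 / (100 + 5)   # True Positive Rate, 0.952381
FPR <- 10 / (10 + 50)    # False Positive Rate, 0.1666667
qnorm(TPR) - qnorm(FPR)  # 2.635813, the d-prime reported above
```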

4.3.3 GLM as a classification tool

The code below demonstrates the links between our GLM model and what we obtained above from SDT. The predictions table shows that our GLM produces predictions identical to our initial data setup; compare the table here with the table above. Once we have created our table of outcomes, we can compute percent correct, specificity, sensitivity, the Kappa score, etc. These come with standard deviations reflecting variation in the responses.

## Note: for a fitted GLM, predict(mdl) > 0 (on the log-odds scale) is equivalent to
## predict(mdl, type = "response") > 0.5 (on the probability scale)
grammatical <- grammatical %>% 
  mutate(response = factor(response, levels = c("yes", "no")),
         grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))



mdl.glm.C <- grammatical %>% 
  glm(response ~ grammaticality, data = .,family = binomial)

tbl.glm <- table(grammatical$response, predict(mdl.glm.C, type = "response")>0.5)
colnames(tbl.glm) <- c("grammatical", "ungrammatical")
tbl.glm
     
      grammatical ungrammatical
  yes         100            10
  no            5            50
PresenceAbsence::pcc(tbl.glm)
PresenceAbsence::specificity(tbl.glm)
PresenceAbsence::sensitivity(tbl.glm)
###etc..

If you look at the results from SDT above, these results are the same as the following:

Accuracy: (TP+TN)/Total (0.9090909)

True Positive Rate (Sensitivity): TP/(TP+FN) (0.952381)

True Negative Rate (Specificity): TN/(FP+TN) (0.8333333)

4.3.4 GLM and d prime

The values obtained here match those obtained from SDT. For d-prime, the logit link does not return it directly; using the probit variant of the binomial family yields exactly the same value (see here for more details). A probit model works on the z-score scale, evaluating change in units of one standard deviation: it models the change from "no" responses to "ungrammatical" items into "yes" responses to "grammatical" items in z-scores, the same conceptual underpinning as d-prime in Signal Detection Theory.

## d prime
psycho::dprime(TP, FP, FN, TN, 
               n_targets = TP+FN, 
               n_distractors = FP+TN,
               adjust=F)$dprime
[1] 2.635813
## GLM with probit
coef(glm(response ~ grammaticality, data = grammatical, family = binomial(probit)))[2]
grammaticalityungrammatical 
                   2.635813 

4.4 GLM: Other distributions

If your data does not fit a binomial distribution, because it is multinomial (i.e., three or more response categories) or poisson (count data), then you need to specify the appropriate family in the glm function (or use a dedicated multinomial function).
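As a minimal sketch (with simulated data and hypothetical variable names), a poisson GLM is specified just like the binomial one, only with a different family:

```r
# simulated count data, for illustration only
set.seed(42)
d <- data.frame(group = rep(c("a", "b"), each = 50),
                count = c(rpois(50, 2), rpois(50, 5)))
mdl.pois <- glm(count ~ group, data = d, family = poisson)
coef(mdl.pois)       # coefficients are on the log scale
exp(coef(mdl.pois))  # exponentiate to get rates and rate ratios
```

For three or more unordered categories, a function such as `nnet::multinom` can be used instead, as `glm` has no multinomial family.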

6 Linear Mixed-effects Models: Why random effects matter

Let’s generate a new dataframe that we will use later on for our mixed models

## Courtesy of Bodo Winter
set.seed(666)
#we create 6 subjects
subjects <- paste0('S', 1:6)
#here we add repetitions within speakers
subjects <- rep(subjects, each = 20)
items <- paste0('Item', 1:20)
#below repeats the items for each of the 6 subjects
items <- rep(items, 6)
#below generates random positive numbers to serve as simulated log frequencies
logFreq <- round(rexp(20)*5, 2)
#below we are repeating the logFreq 6 times to fit with the number of speakers and items
logFreq <- rep(logFreq, 6)
xdf <- data.frame(subjects, items, logFreq)
#below removes the individual variables we had created because they are already in the dataframe
rm(subjects, items, logFreq)

xdf$Intercept <- 300
submeans <- rep(rnorm(6, sd = 40), 20)
#sorting ensures that each subject is assigned one constant mean across its items
submeans <- sort(submeans)
xdf$submeans <- submeans
#we create the same thing for items... we allow the items mean to vary between words...
itsmeans <- rep(rnorm(20, sd = 20), 6)
xdf$itsmeans <- itsmeans
xdf$error <- rnorm(120, sd = 20)
#here we create an effect column:
#for each unit of logFreq, duration decreases by 5
xdf$effect <- -5 * xdf$logFreq

xdf$dur <- xdf$Intercept + xdf$submeans + xdf$itsmeans + xdf$error + xdf$effect
#below subsets the data to keep only a few columns; -c(4:8) removes columns 4 to 8
xreal <- xdf[,-c(4:8)]
head(xreal)
rm(xdf, submeans, itsmeans)

6.1 Plots

Let’s start by doing a correlation test and plotting the data. Our results show that there is a negative correlation between duration and LogFrequency, and the plot shows this decrease.

corrMixed <- as.matrix(xreal[-c(1:2)]) %>% 
  rcorr(type="pearson")
print(corrMixed)
        logFreq   dur
logFreq    1.00 -0.54
dur       -0.54  1.00

n= 120 


P
        logFreq dur
logFreq          0 
dur      0         
corrplot(corrMixed$r, method = "circle", type = "upper", tl.srt = 45,
         addCoef.col = "black", diag = FALSE,
         p.mat = corrMixed$p, sig.level = 0.05)




ggplot.xreal <- xreal %>% 
  ggplot(aes(x = logFreq, y = dur)) +
  geom_point()+ theme_bw(base_size = 20) +
  labs(y = "Duration", x = "Frequency (Log)") +
  geom_smooth(method = lm, se=F)
ggplot.xreal

6.2 Linear model

Let's run a simple linear model on the data. As we can see below, there are some issues with the "simple" linear model: we had set the SD for our subjects to 40, but the model lumps the subject and item variability into a single residual term (residual standard error of 48.29; see the histogram of residuals), and the QQ plot is not "normal".

mdl.lm.xreal <- xreal %>% 
  lm(dur ~ logFreq, data = .)
summary(mdl.lm.xreal)

Call:
lm(formula = dur ~ logFreq, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-94.322 -35.465  -4.364  33.020 123.955 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 337.9730     6.2494  54.081  < 2e-16 ***
logFreq      -5.4601     0.7846  -6.959 2.06e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 48.29 on 118 degrees of freedom
Multiple R-squared:  0.291, Adjusted R-squared:  0.285 
F-statistic: 48.43 on 1 and 118 DF,  p-value: 2.057e-10
hist(residuals(mdl.lm.xreal))

qqnorm(residuals(mdl.lm.xreal)); qqline(residuals(mdl.lm.xreal))

plot(fitted(mdl.lm.xreal), residuals(mdl.lm.xreal), cex = 4)

6.3 Linear Mixed Model

Our Linear Mixed-effects Model takes into account the random effects we added, following our model specifications. We use a Maximum Likelihood estimate (REML = FALSE) as this is what we will use for model comparison. The mixed model reflects our simulation: the SD of our subjects is picked up correctly (38.36, close to the 40 we set). The fixed-effect results are "almost" the same as in our linear model above: the coefficient for the "Intercept" is at 337.973 and the coefficient for LogFrequency is at -5.460. This indicates that for each unit of increase in LogFrequency, there is a decrease of 5.460 (ms).

mdl.lmer.xreal <- xreal %>% 
  lmer(dur ~ logFreq  +(1|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal)
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: dur ~ logFreq + (1 | subjects) + (1 | items)
   Data: .

     AIC      BIC   logLik deviance df.resid 
  1105.8   1119.8   -547.9   1095.8      115 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.06735 -0.60675  0.07184  0.61122  2.39854 

Random effects:
 Groups   Name        Variance Std.Dev.
 items    (Intercept)  589.8   24.29   
 subjects (Intercept) 1471.7   38.36   
 Residual              284.0   16.85   
Number of obs: 120, groups:  items, 20; subjects, 6

Fixed effects:
            Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)  337.973     17.587   9.126  19.218 1.08e-08 ***
logFreq       -5.460      1.004  19.215  -5.436 2.92e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
        (Intr)
logFreq -0.322
hist(residuals(mdl.lmer.xreal))

qqnorm(residuals(mdl.lmer.xreal)); qqline(residuals(mdl.lmer.xreal))

plot(fitted(mdl.lmer.xreal), residuals(mdl.lmer.xreal), cex = 4)

6.4 Our second Mixed model

This second model adds a by-subject random slope. Random slopes allow the effect of logFreq to vary across subjects, rather than assuming it is identical for everyone; an intercept-only model only lets subjects differ in their average level.

mdl.lmer.xreal.2 <- xreal %>% 
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal.2)
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: dur ~ logFreq + (logFreq | subjects) + (1 | items)
   Data: .

     AIC      BIC   logLik deviance df.resid 
  1109.5   1129.0   -547.7   1095.5      113 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.10907 -0.60681  0.06198  0.58272  2.45718 

Random effects:
 Groups   Name        Variance  Std.Dev. Corr
 items    (Intercept) 5.892e+02 24.2736      
 subjects (Intercept) 1.419e+03 37.6676      
          logFreq     2.917e-02  0.1708  1.00
 Residual             2.827e+02 16.8144      
Number of obs: 120, groups:  items, 20; subjects, 6

Fixed effects:
            Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)  337.973     17.332   8.882  19.500 1.35e-08 ***
logFreq       -5.460      1.006  19.384  -5.426 2.89e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
        (Intr)
logFreq -0.265
convergence code: 0
boundary (singular) fit: see ?isSingular
hist(residuals(mdl.lmer.xreal.2))

qqnorm(residuals(mdl.lmer.xreal.2)); qqline(residuals(mdl.lmer.xreal.2))

plot(fitted(mdl.lmer.xreal.2), residuals(mdl.lmer.xreal.2), cex = 4)

6.5 Model comparison

But where are our p values? The lme4 developers decided not to include p values due to various issues with estimating the degrees of freedom. What we can do instead is compare models: we create a null model to allow for significance testing. As expected, our predictor contributes significantly to model fit.

mdl.lmer.xreal.Null <- xreal %>% 
  lmer(dur ~ 1 + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
anova(mdl.lmer.xreal.Null, mdl.lmer.xreal.2)
Data: .
Models:
mdl.lmer.xreal.Null: dur ~ 1 + (logFreq | subjects) + (1 | items)
mdl.lmer.xreal.2: dur ~ logFreq + (logFreq | subjects) + (1 | items)
                    npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
mdl.lmer.xreal.Null    6 1125.4 1142.1 -556.68   1113.4                         
mdl.lmer.xreal.2       7 1109.5 1129.0 -547.73   1095.5 17.891  1  2.339e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Also, do we really need random slopes? From the result below, we don't seem to need them here, given that adding random slopes does not improve the model fit. I always recommend testing this; most of the time I keep random slopes.

anova(mdl.lmer.xreal, mdl.lmer.xreal.2)
Data: .
Models:
mdl.lmer.xreal: dur ~ logFreq + (1 | subjects) + (1 | items)
mdl.lmer.xreal.2: dur ~ logFreq + (logFreq | subjects) + (1 | items)
                 npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
mdl.lmer.xreal      5 1105.8 1119.8 -547.92   1095.8                     
mdl.lmer.xreal.2    7 1109.5 1129.0 -547.73   1095.5 0.3782  2     0.8277

But if you are really (really!!!) obsessed with p values, then you can also use lmerTest. BUT use it only after comparing models to evaluate the contribution of predictors.

mdl.lmer.xreal.lmerTest <- xreal %>% 
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = TRUE)
summary(mdl.lmer.xreal.lmerTest)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: dur ~ logFreq + (logFreq | subjects) + (1 | items)
   Data: .

REML criterion at convergence: 1086.1

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.09668 -0.60121  0.06407  0.58451  2.46140 

Random effects:
 Groups   Name        Variance  Std.Dev. Corr
 items    (Intercept) 6.284e+02 25.0682      
 subjects (Intercept) 1.639e+03 40.4785      
          logFreq     3.402e-02  0.1845  1.00
 Residual             2.831e+02 16.8244      
Number of obs: 120, groups:  items, 20; subjects, 6

Fixed effects:
            Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)  337.973     18.465   7.494  18.303 1.72e-07 ***
logFreq       -5.460      1.037  18.180  -5.265 5.09e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
        (Intr)
logFreq -0.250
convergence code: 0
boundary (singular) fit: see ?isSingular
detach("package:lmerTest", unload = TRUE)

6.6 Our final Mixed model

Our final model uses REML (Restricted Maximum Likelihood) to estimate the variance components.

mdl.lmer.xreal.Full <- xreal %>% 
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = TRUE)
summary(mdl.lmer.xreal.Full)
Linear mixed model fit by REML ['lmerMod']
Formula: dur ~ logFreq + (logFreq | subjects) + (1 | items)
   Data: .

REML criterion at convergence: 1086.1

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.09668 -0.60121  0.06407  0.58451  2.46140 

Random effects:
 Groups   Name        Variance  Std.Dev. Corr
 items    (Intercept) 6.284e+02 25.0682      
 subjects (Intercept) 1.639e+03 40.4785      
          logFreq     3.402e-02  0.1845  1.00
 Residual             2.831e+02 16.8244      
Number of obs: 120, groups:  items, 20; subjects, 6

Fixed effects:
            Estimate Std. Error t value
(Intercept)  337.973     18.465  18.303
logFreq       -5.460      1.037  -5.265

Correlation of Fixed Effects:
        (Intr)
logFreq -0.250
convergence code: 0
boundary (singular) fit: see ?isSingular
anova(mdl.lmer.xreal.Full)
Analysis of Variance Table
        npar Sum Sq Mean Sq F value
logFreq    1 7845.7  7845.7  27.717
hist(residuals(mdl.lmer.xreal.Full))

qqnorm(residuals(mdl.lmer.xreal.Full)); qqline(residuals(mdl.lmer.xreal.Full))

plot(fitted(mdl.lmer.xreal.Full), residuals(mdl.lmer.xreal.Full), cex = 4)

6.7 Dissecting the model

coef(mdl.lmer.xreal.Full)
$items
       (Intercept)   logFreq
Item1     352.3541 -5.460115
Item10    331.7629 -5.460115
Item11    324.7293 -5.460115
Item12    350.2297 -5.460115
Item13    353.1147 -5.460115
Item14    311.8401 -5.460115
Item15    354.0563 -5.460115
Item16    353.9361 -5.460115
Item17    288.7930 -5.460115
Item18    362.4659 -5.460115
Item19    338.1424 -5.460115
Item2     325.1878 -5.460115
Item20    359.7375 -5.460115
Item3     370.1747 -5.460115
Item4     302.4329 -5.460115
Item5     350.0478 -5.460115
Item6     338.9481 -5.460115
Item7     362.8357 -5.460115
Item8     295.6018 -5.460115
Item9     333.0702 -5.460115

$subjects
   (Intercept)   logFreq
S1    314.4719 -5.567204
S2    303.9072 -5.615345
S3    314.2944 -5.568013
S4    318.4301 -5.549168
S5    373.2970 -5.299153
S6    403.4376 -5.161809

attr(,"class")
[1] "coef.mer"
fixef(mdl.lmer.xreal.Full)
(Intercept)     logFreq 
 337.973044   -5.460115 
fixef(mdl.lmer.xreal.Full)[1]
(Intercept) 
    337.973 
fixef(mdl.lmer.xreal.Full)[2]
  logFreq 
-5.460115 
coef(mdl.lmer.xreal.Full)$`subjects`[1]
coef(mdl.lmer.xreal.Full)$`subjects`[2]

coef(mdl.lmer.xreal.Full)$`items`[1]
coef(mdl.lmer.xreal.Full)$`items`[2]
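The per-group lines returned by coef above decompose into the shared fixed effects plus each group's random deviation, which ranef returns directly:

```r
# random deviations of each subject from the fixed effects
ranef(mdl.lmer.xreal.Full)$subjects
# by-subject intercepts = fixed intercept + random deviation
fixef(mdl.lmer.xreal.Full)[1] + ranef(mdl.lmer.xreal.Full)$subjects[, "(Intercept)"]
```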

6.8 Using predictions from our model

In general, I use the predictions from my final model in any plots. To generate these, we can use the following

xreal <- xreal %>% 
  mutate(Pred_Dur = predict(mdl.lmer.xreal.Full))

xreal %>% 
  ggplot(aes(x = logFreq, y = Pred_Dur)) +
  geom_point() + theme_bw(base_size = 20) +
  labs(y = "Duration", x = "Frequency (Log)", title = "Predicted") +
  geom_smooth(method = lm, se = F) + coord_cartesian(ylim = c(200,450))


## original plot
xreal %>% 
  ggplot(aes(x = logFreq , y = dur)) +
  geom_point() + theme_bw(base_size = 20)+
  labs(y = "Duration", x = "Frequency (Log)", title = "Original")+
  geom_smooth(method = lm, se = F) + coord_cartesian(ylim = c(200,450))

6.9 GLMM and CLMM

The code above was using a Linear Mixed Effects Modelling. The outcome was a numeric object. In some cases (as we have seen above), we may have:

  1. Binary outcome (binomial)
  2. Count data (poisson),
  3. Multi-category outcome (multinomial)
  4. Rating data (cumulative function)

The code below gives you an idea of how to specify these models


## Binomial family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=binomial)

## Poisson family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=poisson)

## Multinomial family
## a bit complicated as there is a need to use Bayesian approaches, see e.g., 
## glmmADMB
## mixcat
## MCMCglmm
## see https://gist.github.com/casallas/8263818

## Rating data, use following
## ordinal::clmm(outcome~predictor(s)+(1|subject)+(1|items)..., data=data)


## Remember to test for random effects and whether slopes are needed.
---
title: "Session_3-AnalysingData"
output: 
  html_notebook:
    highlight: pygments
    number_sections: yes
    toc: yes
    toc_depth: 6
    toc_float:
      collapsed: yes
  
---


# Loading packages 
```{r warning=FALSE, message=FALSE, error=FALSE}
## Use the code below to check if you have all required packages installed. If some are not installed already, the code below will install these. If you have all packages installed, then you could load them with the second code.
requiredPackages = c('tidyverse', 'ordinal', 'broom', 'emmeans', 'knitr', 'Hmisc', 'corrplot', 'psycho', 'PresenceAbsence', 'lme4', 'lmerTest')
for(p in requiredPackages){
  if(!require(p,character.only = TRUE)) install.packages(p)
  library(p,character.only = TRUE)
}
```


In this session, we will look at basic functions in R that will help us in running some inferential statistics. These will help us to evaluate the relationship between one (or more) predictor(s) (independent variable) and an outcome (dependent variable). 

It is important to know the class of the outcome before doing any pre-data analyses or inferential statistics. Outcome classes can be one of:

1. `Numeric`: As an example, we have length/width of leaf; height of mountain; fundamental frequency of the voice; etc. These are `true` numbers and we can use summaries, t-tests, linear models, etc. 

2. `Categorical` (Unordered): Observations for two or more categories. As an example, we can have gender of a speaker (male or female); responses to a True vs False perception tests; Colour (unordered) categorisation, e.g., red, blue, yellow, orange, etc.. For these we can use a Generalised Linear Model (binomial or multinomial) or a simple chi-square test. Count data are numbers related to a category. But these should be analysed using a poisson logistic regression

3. `Categorical` (Ordered): When you run a rating experiment, where the outcome is either `numeric` (i.e., 1, 2, 3, 4, 5) or `categories` (i.e., disagree, neutral, agree). The `numeric` option is NOT a true number as for the participant, these are categories. Cumulative Logit models (or Generalised Linear Model with a cumulative function) are used. The mean is meaningless here, and the median is a preferred descriptive statistic.
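To make the ordering explicit in `R`, we can declare the outcome as an ordered factor; a small sketch with made-up ratings (the values below are purely illustrative):

```{r}
ratings <- factor(c("disagree", "neutral", "agree", "agree", "neutral"),
                  levels = c("disagree", "neutral", "agree"), ordered = TRUE)
ratings
# the median category (here the 2nd level, "neutral")
levels(ratings)[median(as.numeric(ratings))]
```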

The content will be split into 5 sections. Some of you will only ever look at some of these, while others may need to use all. Look at the section that works best for your data.

1. Pre-data analyses: We will look at summaries, plotting, correlations, checking normality of distribution and homogeneity of variance (for t-tests)

2. Statistical Analyses: from t-test and ANOVA, to Linear Models: model estimation, understanding coefficients, residuals, and predictions

3. We then move to Generalised Linear Models, with Logistic Regression (with a binomial categorical outcome) and touch upon Signal Detection Theory for estimating accuracy, biases, with sensitivity and specificity measures.

4. Dealing with rating data using Generalised Linear Models, with a Cumulative Logit function

5. Linear Mixed effects Models: Introduction to random effects and how to deal with these. 


# Pre-data analyses

## Built-in datasets

We will use one of the datasets built into `R`. You can check all available datasets in `R` using the following:

```{r}
data()
# or below for all datasets available in all installed packages
data(package = .packages(all.available = TRUE))
```

We will use the built-in `iris` dataset (from the `datasets` package)

## Checking structure and summaries

### Structure

```{r}
iris %>% 
  str()  
```

We have a dataframe with 150 observations and 5 variables; 4 numeric and 1 factor with 3 levels. 

### Summary

We summarise the data to see the trends:

```{r}
iris %>% 
  summary()
```

So we have a balanced dataframe (50 observations under each level of the factor `Species`), with no missing values (aka `NA`).


### Advanced


#### For a specific variable

What if you want to summarise the data and get the mean, SD, by each level of the factor for `Sepal.Length`? 

```{r}
iris %>% 
  group_by(Species) %>% 
  summarise(
  SL.Mean = mean(Sepal.Length),
  SL.SD = sd(Sepal.Length)
  )
```

#### For all variables

```{r}
iris %>% 
  group_by(Species) %>% 
  summarise_all(list(mean = mean, sd = sd)
  )
```


### Up to you

Do some additional summaries.. You may want to check the `median`, `range`, etc..

```{r}

```


## Plot

We can make a boxplot of the data and add a trend line. This allows us to visualise the median, and quantiles in addition to the standard deviation and any outliers... All in the same plot!

```{r,warning=FALSE,message=FALSE}
iris %>% 
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  geom_smooth(aes(x = as.numeric(Species), y = Sepal.Length), method = "lm") +
  labs(x = "Species", y = "Length", title = "Boxplot and trend line", subtitle = "with ggplot2") + 
  theme_bw() + theme(text = element_text(size = 15))
```

Here I have used the variable `Sepal.Length`. You can use any of the additional variables to plot the data.

## Correlation tests

### Basic correlations

We use the function `cor` to obtain the Pearson correlation and `cor.test` to run a basic correlation test on our data with significance testing.

```{r}
cor(iris$Sepal.Length, iris$Petal.Length, method = "pearson")
cor.test(iris$Sepal.Length, iris$Petal.Length)
```

#### Up to you

Can you check whether there is a correlation between the `Sepal.Length` and the `Petal.Width`? What about `Petal.Length` and `Petal.Width`? 


```{r}

```



### Using the package `corrplot`

Above, we ran a correlation test on two predictors. Here, we run multiple correlations on multiple predictors at once. 

#### Correlations

```{r}
## correlation using "corrplot"
## based on the function `rcorr' from the `Hmisc` package
## Need to change dataframe into a matrix
corr <- as.matrix(iris[-5]) %>% 
  rcorr(type="pearson")
print(corr)
corrplot(corr$r, p.mat = corr$P,
         addCoef.col = "black", diag = FALSE, type = "upper", tl.srt = 55)
```




#### Up to you

Look into the `corrplot` specification, using `?corrplot` and amend some of the criteria. Run a correlation plot while filtering the data according to the `species`.

```{r}
# hint 
# create a new dataframe by filtering `Species`, then compute correlations and plot
  

```


Up to now, we have done some basic summaries and checked the correlations in the data. The pearson correlations we have done provided us with significance levels related to the correlations between two *numeric* outcomes. We continue by examining the normality of distribution of our data. 

## Normality of distribution

### Subsetting data

In the `iris` dataset, we have a categorical predictor: `Species` which has three levels

```{r}
levels(iris$Species)
```

Let's subset the data to `setosa` and `versicolor`. We will also check the normality and homogeneity of variance in the data


```{r}
irisSub <- iris %>% 
  filter(Species %in% c("setosa", "versicolor"))
```


### Shapiro test

To check normality of distribution, we use the `shapiro.test` on the numeric outcome. Given that our predictor now has two levels, we need to subset the data again to check the normality of the outcome `Sepal.Length` for each level of our factor `Species`

```{r}
irisSubSet <- iris %>% 
  filter(Species == "setosa")
irisSubVers <- iris %>% 
  filter(Species == "versicolor")
  
shapiro.test(irisSubSet$Sepal.Length)
shapiro.test(irisSubVers$Sepal.Length)
```

How to interpret this non-statistically significant result? This tells us that the distribution of the data is not statistically different from a normal distribution.

### Density plot

We can also use a density plot to evaluate normality. The results show that both levels have bell shaped distributions.

```{r}
irisSub %>% 
  ggplot(aes(x = Sepal.Length))+
  geom_density()+
  facet_wrap(~Species, scales = "free_x")
```


### Homogeneity of variance

Because our data is normally distributed, we can use the `bartlett.test`. If our data were non-normally distributed, we would use the Levene test (`leveneTest` from the `car` package) or an F test of variances (`var.test` from base R). We can check both.

#### Bartlett test

```{r}
irisSub %>% 
  bartlett.test(Sepal.Length ~ Species, data = .)
```

#### Levene test

```{r}
irisSub %>% 
  var.test(Sepal.Length ~ Species, data = .)
irisSub %>% 
  car::leveneTest(Sepal.Length ~ Species, data = .)
```

In all cases, the statistically significant result indicates that there is evidence that the variances of the two levels of the factor `Species` differ significantly; i.e., the variances are not equal. This is important as we will use it in our t-test later on

# Linear Models

Up to now, we have looked at descriptive statistics, and evaluated summaries, correlations in the data (with p values), and checked the normality of distribution of our data.

We are now interested in looking at group differences. 

Let us start with a simple t-test

## First steps

### T-test

We then run a t-test on the data. We specify the formula as `y ~ x` and add `var.equal = FALSE` (based on the normality tests above)

```{r}
irisSub %>% 
  t.test(Sepal.Length ~ Species, data = ., var.equal = FALSE)
```

To interpret the t-test, we say that there is evidence for a statistically significant difference between the two groups: `setosa` and `versicolor`: `t(86) = -10.521, p < 2.2e-16`. The mean of `setosa` is significantly lower than that of `versicolor`.

### Linear Model

Let us run a linear model on the same data


```{r}
irisSub %>% 
  lm(Sepal.Length ~ Species, data = .) %>% 
  summary()
```

Any comments? discuss with your neighbour.

The results of the linear model are exactly the same, albeit with a difference in the sign of the difference. This indicates that the $\beta$ coefficient for `versicolor` is significantly higher than that of `setosa` by `0.93000`: `t(98) = 10.52, p < 2e-16`.
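The coefficient is simply the difference between the two group means, which we can check directly:

```{r}
irisSub %>% 
  group_by(Species) %>% 
  summarise(Mean = mean(Sepal.Length))
# 5.936 (versicolor) - 5.006 (setosa) = 0.93, the coefficient above
```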


The dataset `iris` contains three species. We will run an ANOVA and a linear model on the data.

### Basic ANOVA

We can use the function `aov` to run an Analysis of Variance on the full dataset

```{r}
iris %>% 
  aov(Sepal.Length ~ Species, data = .) %>% 
  summary()
```

### Linear model

We can use the function `lm` to run a linear model

```{r}
iris %>% 
  lm(Sepal.Length ~ Species, data = .) %>% 
  summary()
```

But wait... How is the linear model comparable to the analysis of variance we ran above? An analysis of variance can be derived from a linear model: use `anova` on your linear model.

Here are the results of the initial Analysis of variance:

```{r}
iris %>% 
  aov(Sepal.Length ~ Species, data = .) %>% 
  summary()
```

And here are the results of the linear model with the `anova` function

```{r}
iris %>% 
  lm(Sepal.Length ~ Species, data = .) %>% 
  anova()
```

They are exactly the same... The model underlying an Analysis of Variance is a linear model. We will continue with a linear model to understand it better

## Linear Model

A linear model is a regression analysis: we have an outcome (or dependent variable) and a predictor (or independent variable). The formula of a linear model is `outcome ~ predictor`, which can be read as "outcome as a function of the predictor". We can add "1" to specify an intercept explicitly, but the intercept is added to the model by default

### Model estimation

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lm <- iris %>% 
  lm(Sepal.Length ~ Species, data = .)
# same as below.
#mdl.lm <- lm(Sepal.Length ~ 1 + Species, data = iris)
mdl.lm #also print(mdl.lm)
summary(mdl.lm)
```

### Tidying the output

```{r}
# from library(broom)
tidy(mdl.lm) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
mycoefE <- tidy(mdl.lm) %>% pull(estimate)

```



To interpret the model, we need to look at the coefficients. The `Intercept` (= `Setosa`) is `r mycoefE[1]`, and the coefficients for `Versicolor` and `Virginica` are respectively `r mycoefE[2]` and `r mycoefE[3]`. This tells us that, compared to `Setosa`, moving to `Versicolor` leads to a significant increase of `r mycoefE[2]`, and moving to `Virginica` leads to a significant increase of `r mycoefE[3]`. 

### Obtaining our "true" coefficients

But where are our actual values based on the means in the table above?

We run a model that suppresses the intercept (i.e., adding 0 instead of 1); this allows us to obtain the "true" coefficients (i.e., the group means) for each level of our predictor. This is sometimes called a cell-means model

```{r}
mdl.lm.2 <- iris %>% 
  lm(Sepal.Length ~ 0 + Species, data = .)
summary(mdl.lm.2)
```

This matches the original data. `Setosa` has a mean of `r mycoefE[1]`, `Versicolor` of `r mycoefE[1] + mycoefE[2]`, and `Virginica` of `r mycoefE[1] + mycoefE[3]`. See the table above and the coefficients below 

```{r}
#Setosa
mycoefE[1]
#Versicolor
mycoefE[1] + mycoefE[2]
#Virginica
mycoefE[1] + mycoefE[3]
```
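
As a sanity check, we can compute the group means directly from the raw data; these should match the coefficients of the intercept-suppressed model:

```{r}
# mean Sepal.Length per species, computed directly from the data
tapply(iris$Sepal.Length, iris$Species, mean)
# setosa ~ 5.006, versicolor ~ 5.936, virginica ~ 6.588
```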

The same as

```{r}
tidy(mdl.lm.2) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
mycoefE <- tidy(mdl.lm.2) %>%
  pull(estimate)

#Setosa
mycoefE[1]
#Versicolor
mycoefE[2]
#Virginica
mycoefE[3]
```


### Nice table of our model summary

We can also obtain a nice table of our model summary. We can use the package `knitr` or `xtable`

#### Directly from model summary

```{r}
kable(summary(mdl.lm)$coef, digits = 3)

```

#### From the `tidy` output

```{r}
mdl.lmT <- tidy(mdl.lm)
kable(mdl.lmT, digits = 3)
```


### Dissecting the model

Let us dissect the model. If you use `str()`, you will see everything that is stored in our linear model object. Below, we access some of this information

#### "str" and "coef"

```{r warning=FALSE, message=FALSE, error=FALSE}
str(mdl.lm)
coef(mdl.lm)
## same as 
## mdl.lm$coefficients
```

#### "coef" and "coefficients"

What if I want to obtain the "Intercept"? Or the coefficient for `versicolor`? What if I want the full row for `versicolor`?

```{r warning=FALSE, message=FALSE, error=FALSE}
coef(mdl.lm)[1] # same as mdl.lm$coefficients[1]
coef(mdl.lm)[2] # same as mdl.lm$coefficients[2]

summary(mdl.lm)$coefficients[2, ] # full row
summary(mdl.lm)$coefficients[2, 4] #for p value

```


#### Up to you

Play around with the model summary and obtain the t values for the three levels. You can do this by referring to the coefficient as above

```{r}

```


#### Residuals

What about residuals (difference between the observed value and the estimated value of the quantity) and fitted values?

```{r warning=FALSE, message=FALSE, error=FALSE}
hist(residuals(mdl.lm))
qqnorm(residuals(mdl.lm)); qqline(residuals(mdl.lm))
plot(fitted(mdl.lm), residuals(mdl.lm), cex = 4)
```

#### Goodness of fit?

```{r warning=FALSE, message=FALSE, error=FALSE}
AIC(mdl.lm)	# Akaike's Information Criterion, lower values are better
BIC(mdl.lm)	# Bayesian AIC
logLik(mdl.lm)	# log likelihood
```


Or use the following from `broom`

```{r}
glance(mdl.lm)
```


#### Significance testing

Are the above informative on their own? Not directly. If we want to test for the overall significance of the model, we run a null model (aka an intercept-only model) and compare the two models.

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lm.Null <- iris %>% 
  lm(Sepal.Length ~ 1, data = .)
mdl.comp <- anova(mdl.lm.Null, mdl.lm)
mdl.comp
```

The results show that adding the factor "Species" improves the model fit. We can report this as follows: model comparison showed that the addition of Species significantly improved the model fit when compared with an intercept-only model ($F$(`r mdl.comp[2,3]`) = `r round(mdl.comp[2,5], 2)`, *p* = `r mdl.comp[2,6]`) 

#### Plotting fitted values

##### Trend line

Let's plot our fitted values but only for the trend line

```{r warning=FALSE, message=FALSE, error=FALSE}
iris %>% 
  ggplot(aes(x = Species, y = Sepal.Length))+
  geom_boxplot() +
  labs(x = "Species", y = "Length", title = "Boxplot and predicted trend line", subtitle = "with ggplot2") + 
  theme_bw() + theme(text = element_text(size = 15))+
  geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue")
```

This allows us to plot the fitted values from our model with the predicted linear trend. This is exactly the same as our original data.

##### Predicted means and the trend line

We can also plot the predicted means and linear trend

```{r warning=FALSE, message=FALSE, error=FALSE}
iris %>% 
  ggplot(aes(x = Species, y = predict(mdl.lm)))+
  geom_boxplot(color = "blue") +
  labs(x = "Species", y = "Length", title = "Predicted means and trend line", subtitle = "with ggplot2") + 
  theme_bw() + theme(text = element_text(size = 15))+
  geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue")
```


##### Raw data, predicted means and the trend line

We can also plot the actual data, the predicted means and linear trend

```{r warning=FALSE, message=FALSE, error=FALSE}
iris %>% 
  ggplot(aes(x = Species, y = Sepal.Length))+
  geom_boxplot()+
  geom_boxplot(aes(x = Species, y = predict(mdl.lm)), color = "blue") +
  labs(x = "Species", y = "Length", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with ggplot2") + 
  theme_bw() + theme(text = element_text(size = 15))+
  geom_smooth(aes(x = as.numeric(Species), y = predict(mdl.lm)), method = "lm", color = "blue")
```


### What about pairwise comparison?

Based on our model's summary, can you tell me if there is a difference between Versicolor and Virginica?

```{r}
summary(mdl.lm)
```


```{r}
mdl.lm %>% emmeans(pairwise ~ Species, adjust = "fdr")
```

How to interpret the output? Discuss with your neighbour and share with the group.

Hint... Look at the emmeans values for each level of our factor "Species" and the contrasts. 

## Other outcomes?

So far, we only looked at "Sepal.Length". What about the other outcomes? how informative are they? Do we have statistical difference between the three levels of our predictor? You can do this in your spare time.

## Conclusion

We have so far looked at the Linear Model. The underlying assumption of linear models is that the outcome has a normal (Gaussian) distribution; this is the model to use when we have a `numeric` outcome. What if our outcome is not numeric? What if we have two categories, i.e., black vs white? correct vs incorrect? yes vs no? These are categorical **binary** outcomes. We look at Logistic Regression in the next section

# Generalised Linear Models

Here we will look at an example when the outcome is binary. This simulated data is structured as follows. We asked one participant to listen to 165 sentences, and to judge whether these are "grammatical" or "ungrammatical". There were 105 sentences that were "grammatical" and 60 "ungrammatical". This fictitious example can apply in any other situation. Let's think Geography: 165 lands: 105 "flat" and 60 "non-flat", etc. This applies to any case where you need to "categorise" the outcome into two groups. 

## Load and summaries

Let's load in the data and do some basic summaries

```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- read_csv("grammatical.csv")
grammatical
str(grammatical)
head(grammatical)
```

## GLM

Let's run a first GLM (Generalised Linear Model). A GLM uses the "binomial" family, as it assumes the outcome has a binomial distribution. In general, results from a Logistic Regression are close to what we get from Signal Detection Theory (SDT, see below).

Before running the model, we change the reference levels for both response and grammaticality. Our reference level is the "no" response to the "ungrammatical" category, so the coefficients describe the change towards "yes" responses to the "grammatical" category.

### Model estimation and results

The results below show the logodds for our model. 

```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- grammatical %>% 
  mutate(response = factor(response, levels = c("no", "yes")),
         grammaticality = factor(grammaticality, levels = c("ungrammatical", "grammatical")))

grammatical %>% 
  group_by(grammaticality, response) %>% 
  table()

mdl.glm <- grammatical %>% 
  glm(response ~ grammaticality, data = ., family = binomial)
summary(mdl.glm)

tidy(mdl.glm) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm) %>% pull(estimate)
```


The results show that moving from "ungrammatical" to "grammatical" increases the logodds of a "yes" response by `r mycoef2[2]` (the intercept shows that for "ungrammatical", the logodds of a "yes" response are `r mycoef2[1]`). The actual logodds of a "yes" response to "grammatical" are therefore `r mycoef2[1]+mycoef2[2]` 

### Logodds to Odds ratios

Logodds can be exponentiated to talk about the odds of an event. For our model above, the odds of an "ungrammatical" item receiving a "yes" response are a mere 0.2; the odds of a "grammatical" item receiving a "yes" response are 20, i.e., "yes" is 20 times more likely than "no" 


```{r warning=FALSE, message=FALSE, error=FALSE}
exp(mycoef2[1])
exp(mycoef2[1] + mycoef2[2])

```
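
We can verify these odds directly from the raw counts (100 of the 105 grammatical items received a "yes" response; 10 of the 60 ungrammatical items received a "yes"):

```{r}
# odds of a "yes" response within each category
odds_ungram <- 10 / 50    # "yes" vs "no" for ungrammatical -> 0.2
odds_gram   <- 100 / 5    # "yes" vs "no" for grammatical -> 20
odds_ungram
odds_gram
# their ratio is the odds ratio, i.e., exp() of the grammaticality coefficient
odds_gram / odds_ungram   # 100
```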

### Logodds to proportions

If we want to talk about probabilities rather than logodds, we can transform our logodds into proportions with the inverse logit. This shows that the probability of a "yes" response is about 0.17 for "ungrammatical" items and about 0.95 for "grammatical" items

```{r warning=FALSE, message=FALSE, error=FALSE}
plogis(mycoef2[1])
plogis(mycoef2[1] + mycoef2[2])
```

### Plotting

```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- grammatical %>% 
  mutate(prob = predict(mdl.glm, type = "response"))
grammatical %>% 
  ggplot(aes(x = as.numeric(grammaticality), y = prob)) +
  geom_point() +
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = T) + theme_bw(base_size = 20)+
    labs(y = "Probability", x = "")+
    coord_cartesian(ylim = c(0,1))+
    scale_x_discrete(limits = c("Ungrammatical", "Grammatical"))
```

## Accuracy and Signal Detection Theory

### Rationale

We are generally interested in performance, i.e., whether we have "accurately" categorised the outcome, and at the same time we want to evaluate our response biases. When deciding on categories, we are usually biased in our selection. 

Let's ask the question: How many of you have a Mac laptop and how many a Windows laptop? For those with a Mac, what was the main reason for choosing it? Are you biased in anyway by your decision? 

To correct for these biases, we use some variants from Signal Detection Theory to obtain the true estimates without being influenced by the biases. 

### Running stats

Let's do some stats on this 

|  | Yes | No | Total |
|----------------------------|--------------------|------------------|------------------|
| Grammatical (Yes Actual) | TP = 100 | FN = 5 | (Yes Actual) 105 |
| Ungrammatical (No Actual)  | FP = 10 | TN = 50 | (No Actual) 60 |
| Total | (Yes Response) 110 | (No Response) 55 | 165 |

```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- grammatical %>% 
  mutate(response = factor(response, levels = c("yes", "no")),
         grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))

## TP = True Positive (Hit); FP = False Positive; FN = False Negative; TN = True Negative


TP <- nrow(grammatical %>% 
             filter(grammaticality == "grammatical" &
                      response == "yes"))
FN <- nrow(grammatical %>% 
             filter(grammaticality == "grammatical" &
                      response == "no"))
FP <- nrow(grammatical %>% 
             filter(grammaticality == "ungrammatical" &
                      response == "yes"))
TN <- nrow(grammatical %>% 
             filter(grammaticality == "ungrammatical" &
                      response == "no"))
TP
FN
FP
TN

Total <- nrow(grammatical)
Total
(TP+TN)/Total # accuracy
(FP+FN)/Total # error, also 1-accuracy

# When stimulus = yes, how many times response = yes?
TP/(TP+FN) # True Positive Rate, also known as Sensitivity (or Recall)

# When stimulus = no, how many times response = yes?
FP/(FP+TN) # False Positive Rate

# When stimulus = no, how many times response = no?
TN/(FP+TN) # True Negative Rate, also known as Specificity

# When subject responds "yes" how many times is (s)he correct?
TP/(TP+FP) # precision

# getting d-prime (the sensitivity index);
# beta (bias criterion, 0-1; lower = increase in "yes" responses);
# Aprime (estimate of discriminability, 0-1; 1 = good discrimination, 0.5 = chance);
# bppd (b double prime d, -1 to 1; 0 = no bias,
#   negative = tendency to respond "yes", positive = tendency to respond "no");
# c (index of bias, in standard deviation units)
#(see also https://www.r-bloggers.com/compute-signal-detection-theory-indices-with-r/amp/) 
psycho::dprime(TP, FP, FN, TN, 
               n_targets = TP+FN, 
               n_distractors = FP+TN,
               adjust=F)

```

The most important value above is d-prime. It models the difference between the rate of "True Positive" and "False Positive" responses in standard units (i.e., z-scores). The formula can be written as:

`d' (d prime) = Z(True Positive Rate) - Z(False Positive Rate)`
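
Using the counts from the table above, we can compute d-prime by hand with `qnorm()` (the z-transform); the result should match `psycho::dprime()`:

```{r}
# counts from the contingency table above
TP <- 100; FN <- 5; FP <- 10; TN <- 50
# z(True Positive Rate) - z(False Positive Rate)
dprime_hand <- qnorm(TP / (TP + FN)) - qnorm(FP / (FP + TN))
dprime_hand   # approximately 2.64
```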

### GLM as a classification tool

The code below demonstrates the links between our GLM model and what we obtained above from SDT. The predictions table shows that our GLM was successful at obtaining predictions identical to our initial data setup (compare the table here with the table above). Once we have created our table of outcomes, we can compute percent correct, the specificity, the sensitivity, the Kappa score, etc. This yields the actual value together with the SD related to variation in responses. 

```{r}
## predict(mdl.glm)>0.5 is identical to 
## predict(glm(response~grammaticality,data=grammatical,family = binomial),type="response")
grammatical <- grammatical %>% 
  mutate(response = factor(response, levels = c("yes", "no")),
         grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))



mdl.glm.C <- grammatical %>% 
  glm(response ~ grammaticality, data = .,family = binomial)

tbl.glm <- table(grammatical$response, predict(mdl.glm.C, type = "response")>0.5)
colnames(tbl.glm) <- c("grammatical", "ungrammatical")
tbl.glm
PresenceAbsence::pcc(tbl.glm)
PresenceAbsence::specificity(tbl.glm)
PresenceAbsence::sensitivity(tbl.glm)
###etc..
```

If you look at the results from SDT above, these results are the same as
the following

Accuracy: (TP+TN)/Total (`r (TP+TN)/Total`) 

True Positive Rate (or Sensitivity): TP/(TP+FN) (`r TP/(TP+FN)`)

True Negative Rate (or Specificity): TN/(FP+TN) (`r TN/(FP+TN)`) 

### GLM and d prime

The values obtained here match those obtained from SDT. For d-prime, the difference stems from the use of the logit variant of the binomial family. By using a probit variant, one obtains the same values ([see here](https://stats.idre.ucla.edu/r/dae/probit-regression/) for more details). A probit variant models the outcome in z-scores, evaluated as change in 1 standard unit: it models the change from "no" responses to "ungrammatical" items into "yes" responses to "grammatical" items in z-scores, which shares the conceptual underpinnings of d-prime from Signal Detection Theory.

```{r}
## d prime
psycho::dprime(TP, FP, FN, TN, 
               n_targets = TP+FN, 
               n_distractors = FP+TN,
               adjust=F)$dprime

## GLM with probit
coef(glm(response ~ grammaticality, data = grammatical, family = binomial(probit)))[2]

```




## GLM: Other distributions

If your data do not fit a binomial distribution and are multinomial (i.e., three or more response categories) or Poisson (count data), then you need to use the `glm` function with the appropriate family. 

```{r warning=FALSE, message=FALSE, error=FALSE, echo=FALSE}
## For a multinomial (3 or more response categories), see below and use the following specification
## https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/
## mdl.multi <- nnet::multinom(outcome~predictor, data=data)

## For a poisson (count data), see below and use the following specification
## https://stats.idre.ucla.edu/r/dae/poisson-regression/

## mdl.poisson <- glm(outcome~predictor, data = data, family = "poisson")


```



# Cumulative Logit Link Models

These models work perfectly with rating data. Ratings are inherently ordered (1, 2, ... n), and we expect an overall increase (or decrease) in ratings from 1 to n. To demonstrate this, we will use an example with the package "ordinal". The data come from a rating experiment where six participants rated the percept of nasality in the production of particular consonants in Arabic, produced by nine subjects. The ratings ranged from 1 to 5. This example can apply to any study, e.g., rating the grammaticality of sentences, rating how positive the sentiments are in an article, interview responses, etc.

## Importing and pre-processing

We start by importing the data and process it. We change the reference level in the predictor

```{r warning=FALSE, message=FALSE, error=FALSE}
rating <- read_csv("rating.csv")
rating
rating <- rating %>% 
  mutate(Response = factor(Response),
         Context = factor(Context)) %>% 
  mutate(Context = relevel(Context, "isolation"))
rating
```

## Our first model

We run our first clm model as a simple model, i.e., with no random effects

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.clm <- rating %>% 
  clm(Response ~ Context, data = .)
summary(mdl.clm)
```


## Testing significance 

We can evaluate whether "Context" improves the model fit by comparing our model with a null model. As expected, "Context" improves the model fit.

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.clm.Null <- rating %>% 
  clm(Response ~ 1, data = .)
anova(mdl.clm, mdl.clm.Null)

```

## Interpreting a cumulative model

As a way to interpret the model, we can look at the coefficients and make sense of the results. A CLM is a logistic model with a cumulative link. The "Coefficients" are the estimates for each level of the fixed effect; the "Threshold coefficients" are those of the response. For the former, a negative coefficient indicates a negative association with the response, and a positive one a positive association; the p values indicate the significance of each level. The "Threshold coefficients" show the cumulative thresholds between ratings 1|2, 2|3, 3|4 and 4|5, which reflect the overall increase in ratings from 1 to 5. 
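
As a sketch (this assumes the `mdl.clm` object fitted above), the threshold coefficients can be turned into cumulative probabilities for the reference level `isolation`, where the predictor contributes 0:

```{r}
# P(Response <= k) at the reference level (predictor contribution = 0)
plogis(mdl.clm$Theta)
# successive differences give the predicted probability of each rating 1-5
diff(c(0, plogis(mdl.clm$Theta), 1))
```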

## Plotting 

We use a modified version of a plotting function that allows us to visualise the effects. For this, we use the base R plotting functions

```{r warning=FALSE, message=FALSE, error=FALSE}
par(oma=c(1, 0, 0, 3),mgp=c(2, 1, 0))
xlimNas = c(min(mdl.clm$beta), max(mdl.clm$beta))
ylimNas = c(0,1)
plot(0,0,xlim=xlimNas, ylim=ylimNas, type="n", ylab=expression(Probability), xlab="", xaxt = "n",main="Predicted curves - Nasalisation",cex=2,cex.lab=1.5,cex.main=1.5,cex.axis=1.5)
axis(side = 1, at = c(0,mdl.clm$beta),labels = levels(rating$Context), las=2,cex=2,cex.lab=1.5,cex.axis=1.5)
xsNas = seq(xlimNas[1], xlimNas[2], length.out=100)
lines(xsNas, plogis(mdl.clm$Theta[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2] - xsNas)-plogis(mdl.clm$Theta[1] - xsNas), col='red')
lines(xsNas, plogis(mdl.clm$Theta[3] - xsNas)-plogis(mdl.clm$Theta[2] - xsNas), col='green')
lines(xsNas, plogis(mdl.clm$Theta[4] - xsNas)-plogis(mdl.clm$Theta[3] - xsNas), col='orange')
lines(xsNas, 1-(plogis(mdl.clm$Theta[4] - xsNas)), col='blue')
abline(v=c(0,mdl.clm$beta),lty=3)
abline(h=0, lty="dashed")
abline(h=1, lty="dashed")
legend(par('usr')[2], par('usr')[4], bty='n', xpd=NA,lty=1, col=c("black", "red", "green", "orange", "blue"), 
       legend=c("Oral", "2", "3", "4", "Nasal"),cex=0.75)

```

# Linear Mixed-effects Models. Why random effects matter

Let's generate a new dataframe that we will use later on for our mixed models

```{r warning=FALSE, message=FALSE, error=FALSE}
## Courtesy of Bodo Winter
set.seed(666)
#we create 6 subjects
subjects <- paste0('S', 1:6)
#here we add repetitions within speakers
subjects <- rep(subjects, each = 20)
items <- paste0('Item', 1:20)
#below repeats
items <- rep(items, 6)
#below is to generate random numbers that are log values
logFreq <- round(rexp(20)*5, 2)
#below we are repeating the logFreq 6 times to fit with the number of speakers and items
logFreq <- rep(logFreq, 6)
xdf <- data.frame(subjects, items, logFreq)
#below removes the individual variables we had created because they are already in the dataframe
rm(subjects, items, logFreq)

xdf$Intercept <- 300
submeans <- rep(rnorm(6, sd = 40), 20)
#sorting groups the repeated values so that each subject gets a single mean
submeans <- sort(submeans)
xdf$submeans <- submeans
#we create the same thing for items... we allow the items mean to vary between words...
itsmeans <- rep(rnorm(20, sd = 20), 6)
xdf$itsmeans <- itsmeans
xdf$error <- rnorm(120, sd = 20)
#here we create an effect column,  
#here for each logFreq, we have a decrease of -5 of that particular logFreq 
xdf$effect <- -5 * xdf$logFreq

xdf$dur <- xdf$Intercept + xdf$submeans + xdf$itsmeans + xdf$error + xdf$effect
#below is to subset the data and get only a few columns.. the -c(4:8) removes the columns 4 to 8..
xreal <- xdf[,-c(4:8)]
head(xreal)
rm(xdf, submeans, itsmeans)
```

## Plots
Let's start by doing a correlation test and plotting the data. Our results show that there is a negative correlation between duration and LogFrequency, and the plot shows this decrease. 

```{r warning=FALSE, message=FALSE, error=FALSE}
corrMixed <- as.matrix(xreal[-c(1:2)]) %>% 
  rcorr(type="pearson")
print(corrMixed)
corrplot(corrMixed$r, method = "circle", type = "upper", tl.srt = 45,
         addCoef.col = "black", diag = FALSE,
         p.mat = corrMixed$p, sig.level = 0.05)



ggplot.xreal <- xreal %>% 
  ggplot(aes(x = logFreq, y = dur)) +
  geom_point()+ theme_bw(base_size = 20) +
  labs(y = "Duration", x = "Frequency (Log)") +
  geom_smooth(method = lm, se=F)
ggplot.xreal
```


## Linear model

Let's run a simple linear model on the data. As we can see below, there are some issues with the "simple" linear model: we had set the subject SD to 40, but the residuals show much more spread than that (see the histogram of residuals), and the QQ plot is not "normal". 

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lm.xreal <- xreal %>% 
  lm(dur ~ logFreq, data = .)
summary(mdl.lm.xreal)
hist(residuals(mdl.lm.xreal))
qqnorm(residuals(mdl.lm.xreal)); qqline(residuals(mdl.lm.xreal))
plot(fitted(mdl.lm.xreal), residuals(mdl.lm.xreal), cex = 4)
```

## Linear Mixed Model

Our Linear Mixed effects Model takes into account the random effects we added, following our model specifications. We use a Maximum Likelihood estimate (REML = FALSE), as this is what we will use for model comparison. The Linear Mixed Model reflects our model specifications: the SD of our subjects is now picked up correctly. The fixed-effect results are "almost" the same as in our linear model above. The coefficient for the "Intercept" is 337.973 and the coefficient for LogFrequency is -5.460, which indicates that for each unit of increase in LogFrequency, there is a decrease of 5.460 (ms).

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal <- xreal %>% 
  lmer(dur ~ logFreq  +(1|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal)
hist(residuals(mdl.lmer.xreal))
qqnorm(residuals(mdl.lmer.xreal)); qqline(residuals(mdl.lmer.xreal))
plot(fitted(mdl.lmer.xreal), residuals(mdl.lmer.xreal), cex = 4)
```

## Our second Mixed model

This second model adds a by-subject random slope. Random slopes allow the effect of the predictor to vary across the levels of a random effect; an intercept-only model assigns an averaged effect to all participants.

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal.2 <- xreal %>% 
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal.2)
hist(residuals(mdl.lmer.xreal.2))
qqnorm(residuals(mdl.lmer.xreal.2)); qqline(residuals(mdl.lmer.xreal.2))
plot(fitted(mdl.lmer.xreal.2), residuals(mdl.lmer.xreal.2), cex = 4)
```

## Model comparison

But where are our p values? The lme4 developers decided not to include p values due to various issues with estimating the degrees of freedom. What we can do instead is compare models: we create a null model to allow for significance testing. As expected, our predictor contributes significantly to the model fit. 

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal.Null <- xreal %>% 
  lmer(dur ~ 1 + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
anova(mdl.lmer.xreal.Null, mdl.lmer.xreal.2)
```

Also, do we really need random slopes? From the result below, we do not seem to need them, given that adding random slopes does not improve the model fit. I always recommend testing this; most of the time I keep random slopes.

```{r warning=FALSE, message=FALSE, error=FALSE}
anova(mdl.lmer.xreal, mdl.lmer.xreal.2)
```

But if you are really (really!!!) obsessed with p values, then you can also use lmerTest. BUT use it only after comparing models to evaluate the contribution of predictors

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal.lmerTest <- xreal %>% 
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = TRUE)
summary(mdl.lmer.xreal.lmerTest)
detach("package:lmerTest", unload = TRUE)
```


## Our final Mixed model

Our final model uses REML (or Restricted Maximum Likelihood Estimate of Variance Component) to estimate the model. 

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal.Full <- xreal %>% 
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = TRUE)
summary(mdl.lmer.xreal.Full)
anova(mdl.lmer.xreal.Full)
hist(residuals(mdl.lmer.xreal.Full))
qqnorm(residuals(mdl.lmer.xreal.Full)); qqline(residuals(mdl.lmer.xreal.Full))
plot(fitted(mdl.lmer.xreal.Full), residuals(mdl.lmer.xreal.Full), cex = 4)
```


## Dissecting the model

```{r warning=FALSE, message=FALSE, error=FALSE}
coef(mdl.lmer.xreal.Full)
fixef(mdl.lmer.xreal.Full)
fixef(mdl.lmer.xreal.Full)[1]
fixef(mdl.lmer.xreal.Full)[2]

coef(mdl.lmer.xreal.Full)$`subjects`[1]
coef(mdl.lmer.xreal.Full)$`subjects`[2]

coef(mdl.lmer.xreal.Full)$`items`[1]
coef(mdl.lmer.xreal.Full)$`items`[2]

```

## Using predictions from our model
In general, I use the prediction from my final model in any plots. To generate this, we can use the following

```{r warning=FALSE, message=FALSE, error=FALSE}
xreal <- xreal %>% 
  mutate(Pred_Dur = predict(mdl.lmer.xreal.Full))

xreal %>% 
  ggplot(aes(x = logFreq, y = Pred_Dur)) +
  geom_point() + theme_bw(base_size = 20) +
  labs(y = "Duration", x = "Frequency (Log)", title = "Predicted") +
  geom_smooth(method = lm, se = F) + coord_cartesian(ylim = c(200,450))

## original plot
xreal %>% 
  ggplot(aes(x = logFreq , y = dur)) +
  geom_point() + theme_bw(base_size = 20)+
  labs(y = "Duration", x = "Frequency (Log)", title = "Original")+
  geom_smooth(method = lm, se = F) + coord_cartesian(ylim = c(200,450))

```

## GLMM and CLMM

The code above was using a Linear Mixed Effects Modelling. The outcome was a numeric object. In some cases (as we have seen above), we may have: 

1. Binary outcome (binomial)
2. Count data (poisson), 
3. Multi-category outcome (multinomial)
4. Rating data (cumulative function)

The code below gives you an idea of how to specify these models

```{r warning=FALSE, message=FALSE, error=FALSE}

## Binomial family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=binomial)

## Poisson family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=poisson)

## Multinomial family
## a bit complicated as there is a need to use Bayesian approaches, see e.g., 
## glmmADMB
## mixcat
## MCMCglmm
## see https://gist.github.com/casallas/8263818

## Rating data, use following
## ordinal::clmm(outcome~predictor(s)+(1|subject)+(1|items)..., data=data)


## Remember to test for random effects and whether slopes are needed.

```
