1 Loading packages

## Use the code below to check whether all required packages are installed. Any package that is missing will be installed and then loaded. If all packages are already installed, you can simply load them with the second code chunk.
requiredPackages <- c('tidyverse', 'broom', 'knitr', 'Hmisc', 'corrplot', 'lme4', 'lmerTest', 'party', 'ranger', 'doFuture', 'tidymodels', 'pROC', 'varImp', 'lattice', 'vip', 'emmeans', 'ggsignif', 'PresenceAbsence', 'languageR', 'FactoMineR', 'factoextra', 'RColorBrewer', 'scatterplot3d', 'cowplot', 'psycho', 'ordinal')
for (p in requiredPackages) {
  # require() returns FALSE when a package is missing, so install it in that case
  if (!require(p, character.only = TRUE)) install.packages(p)
  # attach the package (this does nothing new if require() already attached it)
  library(p, character.only = TRUE)
}
Loading required package: tidyverse
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
-- Attaching packages -----------
v ggplot2 3.3.5     v purrr   0.3.4
v tibble  3.1.5     v dplyr   1.0.7
v tidyr   1.1.4     v stringr 1.4.0
v readr   2.0.2     v forcats 0.5.1
-- Conflicts --------------------
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Loading required package: broom
Loading required package: knitr
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Registered S3 method overwritten by 'htmlwidgets':
  method           from         
  print.htmlwidget tools:rstudio
Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table     

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:dplyr’:

    src, summarize

The following objects are masked from ‘package:base’:

    format.pval, units

Loading required package: corrplot
corrplot 0.90 loaded
Loading required package: lme4
Loading required package: Matrix

Attaching package: ‘Matrix’

The following objects are masked from ‘package:tidyr’:

    expand, pack, unpack

Loading required package: lmerTest

Attaching package: ‘lmerTest’

The following object is masked from ‘package:lme4’:

    lmer

The following object is masked from ‘package:stats’:

    step

Loading required package: party
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4

Attaching package: ‘modeltools’

The following object is masked from ‘package:lme4’:

    refit

Loading required package: strucchange
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: sandwich

Attaching package: ‘strucchange’

The following object is masked from ‘package:stringr’:

    boundary

Loading required package: ranger
Loading required package: doFuture
Loading required package: foreach

Attaching package: ‘foreach’

The following objects are masked from ‘package:purrr’:

    accumulate, when

Loading required package: future

Attaching package: ‘future’

The following object is masked from ‘package:survival’:

    cluster

Loading required package: tidymodels
Registered S3 method overwritten by 'tune':
  method                   from   
  required_pkgs.model_spec parsnip
-- Attaching packages -----------
v dials        0.0.10     v rsample      0.1.0 
v infer        1.0.0      v tune         0.1.6 
v modeldata    0.1.1      v workflows    0.2.4 
v parsnip      0.1.7      v workflowsets 0.1.0 
v recipes      0.1.17     v yardstick    0.0.8 
-- Conflicts --------------------
x foreach::accumulate() masks purrr::accumulate()
x scales::discard()     masks purrr::discard()
x Matrix::expand()      masks tidyr::expand()
x dplyr::filter()       masks stats::filter()
x parsnip::fit()        masks infer::fit(), party::fit(), modeltools::fit()
x recipes::fixed()      masks stringr::fixed()
x dplyr::lag()          masks stats::lag()
x Matrix::pack()        masks tidyr::pack()
x tune::parameters()    masks dials::parameters(), modeltools::parameters()
x yardstick::spec()     masks readr::spec()
x Hmisc::src()          masks dplyr::src()
x recipes::step()       masks lmerTest::step(), stats::step()
x Hmisc::summarize()    masks dplyr::summarize()
x parsnip::translate()  masks Hmisc::translate()
x Matrix::unpack()      masks tidyr::unpack()
x recipes::update()     masks stats4::update(), Matrix::update(), stats::update()
x foreach::when()       masks purrr::when()
* Dig deeper into tidy modeling with R at https://www.tmwr.org
Loading required package: pROC
Type 'citation("pROC")' for a citation.

Attaching package: ‘pROC’

The following objects are masked from ‘package:stats’:

    cov, smooth, var

Loading required package: varImp
Loading required package: measures
Loading required package: vip

Attaching package: ‘vip’

The following object is masked from ‘package:utils’:

    vi

Loading required package: emmeans
Loading required package: ggsignif
Loading required package: PresenceAbsence

Attaching package: ‘PresenceAbsence’

The following object is masked from ‘package:pROC’:

    auc

The following objects are masked from ‘package:yardstick’:

    sensitivity, specificity

Loading required package: languageR
Loading required package: FactoMineR
Loading required package: factoextra
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Loading required package: RColorBrewer
Loading required package: scatterplot3d
Loading required package: cowplot
Loading required package: psycho

Attaching package: ‘psycho’

The following object is masked from ‘package:future’:

    values

The following object is masked from ‘package:lme4’:

    golden

Loading required package: ordinal

Attaching package: ‘ordinal’

The following objects are masked from ‘package:lme4’:

    ranef, VarCorr

The following object is masked from ‘package:dplyr’:

    slice
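
The loading loop above prints a long stream of startup messages. As a rough sketch only (this is not the tutorial's own code), the helper below, given the hypothetical name load_quietly, installs just the packages that are missing and attaches everything without those messages. The demonstration uses two packages that ship with R, so running it installs nothing.

```r
# Hypothetical helper (not part of the original tutorial): install whatever
# is missing, then attach every package without the startup messages.
load_quietly <- function(pkgs, install_missing = TRUE) {
  # Packages that are not yet present in any library on this machine
  missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
  if (install_missing && length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)
  }
  # Attach each package; TRUE means it was attached successfully
  loaded <- vapply(
    pkgs,
    function(p) suppressPackageStartupMessages(
      require(p, character.only = TRUE, quietly = TRUE)
    ),
    logical(1)
  )
  invisible(loaded)
}

# Demonstration with packages that ship with R, so nothing gets installed
status <- load_quietly(c("stats", "utils"), install_missing = FALSE)
print(status)
```

With the tutorial's own vector, the call would be load_quietly(requiredPackages). Note that suppressing the messages also hides the masking notes shown above, so it is worth reading them at least once.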

2 Correlation tests

2.1 Basic correlations

Let us start with a basic correlation test. We want to evaluate if two numeric variables are correlated with each other.

We use the function cor to obtain the Pearson correlation and cor.test to run a basic correlation test on our data with significance testing.

cor(english$RTlexdec, english$RTnaming, method = "pearson")
[1] 0.7587033
cor.test(english$RTlexdec, english$RTnaming)

    Pearson's product-moment
    correlation

data:  english$RTlexdec and english$RTnaming
t = 78.699, df = 4566,
p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7461195 0.7707453
sample estimates:
      cor 
0.7587033 

What are these results telling us? There is a positive correlation between RTlexdec and RTnaming. The correlation coefficient (r) is 0.76 (r ranges between -1 and 1); squaring it gives R² ≈ 0.58. This correlation is statistically significant with a t value of 78.699, 4566 degrees of freedom, and a p-value < 2.2e-16.
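
Each number in the printed output can also be pulled out of the object that cor.test returns. The sketch below uses simulated data as a stand-in for english$RTlexdec and english$RTnaming (an assumption made so the example runs without the languageR package); the variable names are illustrative.

```r
# Simulated data stand in for english$RTlexdec and english$RTnaming
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)

res <- cor.test(x, y)  # Pearson correlation by default

r_value  <- unname(res$estimate)   # correlation coefficient r
t_value  <- unname(res$statistic)  # t value
df_value <- unname(res$parameter)  # degrees of freedom (n - 2)
p_value  <- res$p.value            # p value
ci       <- res$conf.int           # 95% confidence interval for r

cat("r =", round(r_value, 3), " t =", round(t_value, 3),
    " df =", df_value, "\n")
```

Replacing x and y with the two columns of the english dataset should give the values reported above.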

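What are the degrees of freedom? These relate to the total number of observations minus the number of parameters estimated from the data. Here we have 4568 observations in the dataset, and a correlation between two variables uses up two degrees of freedom (one for each variable's mean), hence 4568 - 2 = 4566.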

For the p value, we usually use a threshold of p = 0.05. Results with a p value below this threshold are treated as statistically significant. The p value is the probability of observing a result at least as extreme as ours if there were no true correlation, so a threshold of 0.05 means we accept a 5% chance of calling a result significant when there is in fact no effect. In our case, the p value is lower than 2.2e-16. How to interpret this number? The notation tells us to put 15 zeros between the decimal point and the first 2, i.e., 0.00000000000000022. This probability is very (very!!) low, so we conclude that there is a statistically significant correlation between the two variables.
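
To see what 2.2e-16 looks like without scientific notation, it can be printed in fixed notation. The short sketch below is illustrative and uses base R only; it also prints .Machine$double.eps, the machine precision that R's default p-value formatting uses as the cut-off for reporting "< 2.2e-16".

```r
# Write 2.2e-16 in fixed notation with 17 decimal places
fixed <- sprintf("%.17f", 2.2e-16)
print(fixed)

# Count the zeros between the decimal point and the first non-zero digit
n_zeros <- nchar(sub("^0\\.(0*).*$", "\\1", fixed))
print(n_zeros)

# Machine precision: smaller p values are reported as "< 2.2e-16"
print(.Machine$double.eps)
```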

The formula to calculate the t value (in its one-sample form) is below.

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where x̄ = sample mean, μ0 = population mean, s = sample standard deviation, and n = sample size.
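
For a Pearson correlation specifically, the t value can be obtained directly from the correlation coefficient and the sample size, as t = r · sqrt(n − 2) / sqrt(1 − r²). A minimal sketch, plugging in the r and n reported earlier (the variable names are illustrative), reproduces the t value, the degrees of freedom and the very small p value from the cor.test output:

```r
r <- 0.7587033  # correlation coefficient reported by cor()
n <- 4568       # number of observations in the english dataset

df <- n - 2                               # degrees of freedom
t_value <- r * sqrt(df) / sqrt(1 - r^2)   # t statistic for a Pearson correlation
p_value <- 2 * pt(-abs(t_value), df)      # two-sided p value

print(df)
print(round(t_value, 3))
print(p_value < 2.2e-16)
```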

The p value is influenced by various factors: the number of observations, the strength of the difference, the mean values, etc. You should always interpret p values with care, taking all of these into account.

2.2 Using the package corrplot

Above, we did a correlation test on two predictors. What if we want to obtain a nice plot of all numeric predictors and add significance levels?

2.2.1 Correlation plots

corr <- 
  english %>% 
  select(where(is.numeric)) %>% 
  cor() %>% 
  print()
                                    RTlexdec
RTlexdec                         1.000000000
RTnaming                         0.758703280
Familiarity                     -0.444409734
WrittenFrequency                -0.434814982
WrittenSpokenFrequencyRatio      0.039820007
FamilySize                      -0.349595853
DerivationalEntropy             -0.161164620
InflectionalEntropy             -0.088418681
NumberSimplexSynsets            -0.309008140
NumberComplexSynsets            -0.328613209
LengthInLetters                  0.049747275
Ncount                          -0.065726313
MeanBigramFrequency              0.002633525
FrequencyInitialDiphone         -0.074452719
ConspelV                        -0.032867467
ConspelN                        -0.107538023
ConphonV                        -0.021747588
ConphonN                        -0.080930543
ConfriendsV                     -0.025720833
ConfriendsN                     -0.117532883
ConffV                          -0.016494945
ConffN                          -0.005679088
ConfbV                          -0.022515482
ConfbN                          -0.018539499
NounFrequency                   -0.167189500
VerbFrequency                   -0.076388309
FrequencyInitialDiphoneWord     -0.042640861
FrequencyInitialDiphoneSyllable -0.035503708
CorrectLexdec                   -0.253188184
                                    RTnaming
RTlexdec                         0.758703280
RTnaming                         1.000000000
Familiarity                     -0.094793069
WrittenFrequency                -0.095994313
WrittenSpokenFrequencyRatio      0.036592754
FamilySize                      -0.088037010
DerivationalEntropy             -0.049456670
InflectionalEntropy             -0.022110376
NumberSimplexSynsets            -0.071900207
NumberComplexSynsets            -0.076384846
LengthInLetters                  0.094497065
Ncount                          -0.094669618
MeanBigramFrequency              0.048459360
FrequencyInitialDiphone         -0.057216874
ConspelV                        -0.025165185
ConspelN                        -0.034801239
ConphonV                         0.001175572
ConphonN                        -0.014364896
ConfriendsV                     -0.027741071
ConfriendsN                     -0.044064997
ConffV                           0.007924551
ConffN                           0.011407182
ConfbV                           0.019159417
ConfbN                           0.017011731
NounFrequency                   -0.043148572
VerbFrequency                   -0.024593780
FrequencyInitialDiphoneWord      0.020488545
FrequencyInitialDiphoneSyllable  0.026897756
CorrectLexdec                    0.151348043
                                Familiarity
RTlexdec                        -0.44440973
RTnaming                        -0.09479307
Familiarity                      1.00000000
WrittenFrequency                 0.79125559
WrittenSpokenFrequencyRatio     -0.18989881
FamilySize                       0.59191973
DerivationalEntropy              0.22071588
InflectionalEntropy              0.10795420
NumberSimplexSynsets             0.51065170
NumberComplexSynsets             0.51001913
LengthInLetters                 -0.08215272
Ncount                           0.09650461
MeanBigramFrequency              0.02962138
FrequencyInitialDiphone          0.12847193
ConspelV                         0.07451417
ConspelN                         0.21437628
ConphonV                         0.05408975
ConphonN                         0.16180440
ConfriendsV                      0.04584224
ConfriendsN                      0.20673977
ConffV                           0.05610343
ConffN                           0.02945722
ConfbV                           0.05687343
ConfbN                           0.05199653
NounFrequency                    0.38119070
VerbFrequency                    0.23817700
FrequencyInitialDiphoneWord      0.09106333
FrequencyInitialDiphoneSyllable  0.07354114
CorrectLexdec                    0.52685458
                                WrittenFrequency
RTlexdec                             -0.43481498
RTnaming                             -0.09599431
Familiarity                           0.79125559
WrittenFrequency                      1.00000000
WrittenSpokenFrequencyRatio           0.07158067
FamilySize                            0.66253864
DerivationalEntropy                   0.25522889
InflectionalEntropy                  -0.04048005
NumberSimplexSynsets                  0.55874958
NumberComplexSynsets                  0.59105478
LengthInLetters                      -0.06663196
Ncount                                0.10492564
MeanBigramFrequency                   0.07758879
FrequencyInitialDiphone               0.16748670
ConspelV                              0.05864228
ConspelN                              0.28248908
ConphonV                              0.08201255
ConphonN                              0.22245283
ConfriendsV                           0.02146455
ConfriendsN                           0.26326498
ConffV                                0.08162166
ConffN                                0.05028101
ConfbV                                0.11975724
ConfbN                                0.10409106
NounFrequency                         0.46955152
VerbFrequency                         0.27879235
FrequencyInitialDiphoneWord           0.10827895
FrequencyInitialDiphoneSyllable       0.09111661
CorrectLexdec                         0.45797185
                                WrittenSpokenFrequencyRatio
RTlexdec                                        0.039820007
RTnaming                                        0.036592754
Familiarity                                    -0.189898811
WrittenFrequency                                0.071580669
WrittenSpokenFrequencyRatio                     1.000000000
FamilySize                                     -0.108801543
DerivationalEntropy                            -0.010987756
InflectionalEntropy                            -0.118804044
NumberSimplexSynsets                           -0.085088573
NumberComplexSynsets                           -0.104598273
LengthInLetters                                 0.204091196
Ncount                                         -0.188200595
MeanBigramFrequency                             0.191513493
FrequencyInitialDiphone                         0.020799272
ConspelV                                       -0.157689304
ConspelN                                       -0.057307387
ConphonV                                       -0.034796879
ConphonN                                       -0.025856126
ConfriendsV                                    -0.136348207
ConfriendsN                                    -0.055568525
ConffV                                         -0.026731676
ConffN                                         -0.003403134
ConfbV                                          0.093614167
ConfbN                                          0.070239322
NounFrequency                                   0.012482293
VerbFrequency                                  -0.096554415
FrequencyInitialDiphoneWord                     0.002627022
FrequencyInitialDiphoneSyllable                 0.012872673
CorrectLexdec                                   0.008564774
                                  FamilySize
RTlexdec                        -0.349595853
RTnaming                        -0.088037010
Familiarity                      0.591919733
WrittenFrequency                 0.662538635
WrittenSpokenFrequencyRatio     -0.108801543
FamilySize                       1.000000000
DerivationalEntropy              0.692088896
InflectionalEntropy              0.101743523
NumberSimplexSynsets             0.590763556
NumberComplexSynsets             0.645411663
LengthInLetters                 -0.122995009
Ncount                           0.174107015
MeanBigramFrequency             -0.001056468
FrequencyInitialDiphone          0.126804334
ConspelV                         0.110812602
ConspelN                         0.249522442
ConphonV                         0.050171973
ConphonN                         0.161531494
ConfriendsV                      0.079271194
ConfriendsN                      0.242711641
ConffV                           0.080362377
ConffN                           0.059199476
ConfbV                           0.050020822
ConfbN                           0.038457171
NounFrequency                    0.417794301
VerbFrequency                    0.114132925
FrequencyInitialDiphoneWord      0.096705342
FrequencyInitialDiphoneSyllable  0.086426850
CorrectLexdec                    0.360613035
                                DerivationalEntropy
RTlexdec                               -0.161164620
RTnaming                               -0.049456670
Familiarity                             0.220715880
WrittenFrequency                        0.255228886
WrittenSpokenFrequencyRatio            -0.010987756
FamilySize                              0.692088896
DerivationalEntropy                     1.000000000
InflectionalEntropy                    -0.050795034
NumberSimplexSynsets                    0.223943027
NumberComplexSynsets                    0.331924999
LengthInLetters                        -0.104860729
Ncount                                  0.123827732
MeanBigramFrequency                    -0.020837051
FrequencyInitialDiphone                 0.083835479
ConspelV                                0.046117273
ConspelN                                0.137519738
ConphonV                               -0.002754648
ConphonN                                0.074468614
ConfriendsV                             0.028929529
ConfriendsN                             0.131756437
ConffV                                  0.041097365
ConffN                                  0.040363830
ConfbV                                  0.007743200
ConfbN                                  0.011462290
NounFrequency                           0.172254519
VerbFrequency                          -0.019725738
FrequencyInitialDiphoneWord             0.029479534
FrequencyInitialDiphoneSyllable         0.027755605
CorrectLexdec                           0.188753214
                                InflectionalEntropy
RTlexdec                               -0.088418681
RTnaming                               -0.022110376
Familiarity                             0.107954197
WrittenFrequency                       -0.040480046
WrittenSpokenFrequencyRatio            -0.118804044
FamilySize                              0.101743523
DerivationalEntropy                    -0.050795034
InflectionalEntropy                     1.000000000
NumberSimplexSynsets                    0.398736053
NumberComplexSynsets                    0.005589502
LengthInLetters                         0.052485031
Ncount                                 -0.003252708
MeanBigramFrequency                     0.024789643
FrequencyInitialDiphone                -0.034461207
ConspelV                                0.140798520
ConspelN                                0.046826086
ConphonV                                0.082962738
ConphonN                                0.031410725
ConfriendsV                             0.131247972
ConfriendsN                             0.067675205
ConffV                                  0.014388810
ConffN                                  0.010758578
ConfbV                                  0.002863801
ConfbN                                 -0.007583063
NounFrequency                          -0.114401007
VerbFrequency                           0.094002603
FrequencyInitialDiphoneWord             0.052469468
FrequencyInitialDiphoneSyllable         0.050939450
CorrectLexdec                           0.182065382
                                NumberSimplexSynsets
RTlexdec                               -0.3090081404
RTnaming                               -0.0719002065
Familiarity                             0.5106516971
WrittenFrequency                        0.5587495840
WrittenSpokenFrequencyRatio            -0.0850885729
FamilySize                              0.5907635556
DerivationalEntropy                     0.2239430269
InflectionalEntropy                     0.3987360535
NumberSimplexSynsets                    1.0000000000
NumberComplexSynsets                    0.5245365002
LengthInLetters                        -0.0063644110
Ncount                                  0.1129586209
MeanBigramFrequency                     0.0539689516
FrequencyInitialDiphone                 0.0571276406
ConspelV                                0.1660590990
ConspelN                                0.2275859556
ConphonV                                0.0747186906
ConphonN                                0.1443620165
ConfriendsV                             0.1546554020
ConfriendsN                             0.2524518522
ConffV                                  0.0362906719
ConffN                                  0.0227783558
ConfbV                                  0.0215807967
ConfbN                                  0.0002057108
NounFrequency                           0.2380855612
VerbFrequency                           0.1887961418
FrequencyInitialDiphoneWord             0.1270081956
FrequencyInitialDiphoneSyllable         0.1160585801
CorrectLexdec                           0.3500774208
                                NumberComplexSynsets
RTlexdec                                -0.328613209
RTnaming                                -0.076384846
Familiarity                              0.510019126
WrittenFrequency                         0.591054783
WrittenSpokenFrequencyRatio             -0.104598273
FamilySize                               0.645411663
DerivationalEntropy                      0.331924999
InflectionalEntropy                      0.005589502
NumberSimplexSynsets                     0.524536500
NumberComplexSynsets                     1.000000000
LengthInLetters                         -0.120445975
Ncount                                   0.137748482
MeanBigramFrequency                     -0.023604116
FrequencyInitialDiphone                  0.103684145
ConspelV                                 0.071760775
ConspelN                                 0.193733204
ConphonV                                 0.047783270
ConphonN                                 0.142498720
ConfriendsV                              0.037761099
ConfriendsN                              0.180629522
ConffV                                   0.082102822
ConffN                                   0.057010830
ConfbV                                   0.052172715
ConfbN                                   0.049175194
NounFrequency                            0.349469930
VerbFrequency                            0.092248597
FrequencyInitialDiphoneWord              0.058217178
FrequencyInitialDiphoneSyllable          0.047009152
CorrectLexdec                            0.329011088
                                LengthInLetters
RTlexdec                            0.049747275
RTnaming                            0.094497065
Familiarity                        -0.082152716
WrittenFrequency                   -0.066631955
WrittenSpokenFrequencyRatio         0.204091196
FamilySize                         -0.122995009
DerivationalEntropy                -0.104860729
InflectionalEntropy                 0.052485031
NumberSimplexSynsets               -0.006364411
NumberComplexSynsets               -0.120445975
LengthInLetters                     1.000000000
Ncount                             -0.625129141
MeanBigramFrequency                 0.790492091
FrequencyInitialDiphone            -0.060443836
ConspelV                           -0.226416938
ConspelN                           -0.170022083
ConphonV                           -0.202368726
ConphonN                           -0.205167896
ConfriendsV                        -0.192199942
ConfriendsN                        -0.156898314
ConffV                             -0.019244458
ConffN                              0.010765359
ConfbV                             -0.040037290
ConfbN                             -0.069985486
NounFrequency                      -0.035331865
VerbFrequency                      -0.083729951
FrequencyInitialDiphoneWord         0.155454553
FrequencyInitialDiphoneSyllable     0.150391668
CorrectLexdec                       0.046317578
                                      Ncount
RTlexdec                        -0.065726313
RTnaming                        -0.094669618
Familiarity                      0.096504609
WrittenFrequency                 0.104925644
WrittenSpokenFrequencyRatio     -0.188200595
FamilySize                       0.174107015
DerivationalEntropy              0.123827732
InflectionalEntropy             -0.003252708
NumberSimplexSynsets             0.112958621
NumberComplexSynsets             0.137748482
LengthInLetters                 -0.625129141
Ncount                           1.000000000
MeanBigramFrequency             -0.387546284
FrequencyInitialDiphone          0.135888890
ConspelV                         0.474710938
ConspelN                         0.346547943
ConphonV                         0.210193628
ConphonN                         0.190732679
ConfriendsV                      0.436821402
ConfriendsN                      0.340831193
ConffV                           0.076838593
ConffN                           0.069850580
ConfbV                          -0.036034465
ConfbN                          -0.042132633
NounFrequency                    0.035870248
VerbFrequency                    0.053361797
FrequencyInitialDiphoneWord      0.007890710
FrequencyInitialDiphoneSyllable  0.021719522
CorrectLexdec                    0.016048288
                                MeanBigramFrequency
RTlexdec                                0.002633525
RTnaming                                0.048459360
Familiarity                             0.029621385
WrittenFrequency                        0.077588795
WrittenSpokenFrequencyRatio             0.191513493
FamilySize                             -0.001056468
DerivationalEntropy                    -0.020837051
InflectionalEntropy                     0.024789643
NumberSimplexSynsets                    0.053968952
NumberComplexSynsets                   -0.023604116
LengthInLetters                         0.790492091
Ncount                                 -0.387546284
MeanBigramFrequency                     1.000000000
FrequencyInitialDiphone                 0.324815461
ConspelV                               -0.091270605
ConspelN                                0.060952203
ConphonV                               -0.122457211
ConphonN                               -0.064645303
ConfriendsV                            -0.078215883
ConfriendsN                             0.049075415
ConffV                                  0.072593437
ConffN                                  0.113704972
ConfbV                                  0.002954953
ConfbN                                 -0.019550727
NounFrequency                           0.043361959
VerbFrequency                          -0.045835069
FrequencyInitialDiphoneWord             0.214165942
FrequencyInitialDiphoneSyllable         0.201570329
CorrectLexdec                           0.063566285
                                FrequencyInitialDiphone
RTlexdec                                  -0.0744527186
RTnaming                                  -0.0572168742
Familiarity                                0.1284719337
WrittenFrequency                           0.1674867041
WrittenSpokenFrequencyRatio                0.0207992723
FamilySize                                 0.1268043344
DerivationalEntropy                        0.0838354786
InflectionalEntropy                       -0.0344612070
NumberSimplexSynsets                       0.0571276406
NumberComplexSynsets                       0.1036841450
LengthInLetters                           -0.0604438357
Ncount                                     0.1358888899
MeanBigramFrequency                        0.3248154611
FrequencyInitialDiphone                    1.0000000000
ConspelV                                  -0.0557304958
ConspelN                                   0.0309540623
ConphonV                                  -0.0142352920
ConphonN                                   0.0104023606
ConfriendsV                               -0.0495263185
ConfriendsN                                0.0362062483
ConffV                                    -0.0001697975
ConffN                                     0.0021493798
ConfbV                                     0.0334157038
ConfbN                                     0.0191463198
NounFrequency                              0.0985964971
VerbFrequency                              0.0557187478
FrequencyInitialDiphoneWord                0.1310981285
FrequencyInitialDiphoneSyllable            0.1188976490
CorrectLexdec                              0.0486800603
                                   ConspelV
RTlexdec                        -0.03286747
RTnaming                        -0.02516518
Familiarity                      0.07451417
WrittenFrequency                 0.05864228
WrittenSpokenFrequencyRatio     -0.15768930
FamilySize                       0.11081260
DerivationalEntropy              0.04611727
InflectionalEntropy              0.14079852
NumberSimplexSynsets             0.16605910
NumberComplexSynsets             0.07176077
LengthInLetters                 -0.22641694
Ncount                           0.47471094
MeanBigramFrequency             -0.09127061
FrequencyInitialDiphone         -0.05573050
ConspelV                         1.00000000
ConspelN                         0.64214341
ConphonV                         0.54021641
ConphonN                         0.41727634
ConfriendsV                      0.91949493
ConfriendsN                      0.62267751
ConffV                           0.23618800
ConffN                           0.16655427
ConfbV                           0.04946527
ConfbN                           0.02724069
NounFrequency                   -0.01696005
VerbFrequency                    0.06291967
FrequencyInitialDiphoneWord      0.11861458
FrequencyInitialDiphoneSyllable  0.12276477
CorrectLexdec                    0.04934274
                                   ConspelN
RTlexdec                        -0.10753802
RTnaming                        -0.03480124
Familiarity                      0.21437628
WrittenFrequency                 0.28248908
WrittenSpokenFrequencyRatio     -0.05730739
FamilySize                       0.24952244
DerivationalEntropy              0.13751974
InflectionalEntropy              0.04682609
NumberSimplexSynsets             0.22758596
NumberComplexSynsets             0.19373320
LengthInLetters                 -0.17002208
Ncount                           0.34654794
MeanBigramFrequency              0.06095220
FrequencyInitialDiphone          0.03095406
ConspelV                         0.64214341
ConspelN                         1.00000000
ConphonV                         0.38047467
ConphonN                         0.65365781
ConfriendsV                      0.55820343
ConfriendsN                      0.88292615
ConffV                           0.27788182
ConffN                           0.34213915
ConfbV                           0.14221895
ConfbN                           0.14531495
NounFrequency                    0.11924516
VerbFrequency                    0.12533768
FrequencyInitialDiphoneWord      0.11832011
FrequencyInitialDiphoneSyllable  0.11843351
CorrectLexdec                    0.10432858
                                    ConphonV
RTlexdec                        -0.021747588
RTnaming                         0.001175572
Familiarity                      0.054089750
WrittenFrequency                 0.082012553
WrittenSpokenFrequencyRatio     -0.034796879
FamilySize                       0.050171973
DerivationalEntropy             -0.002754648
InflectionalEntropy              0.082962738
NumberSimplexSynsets             0.074718691
NumberComplexSynsets             0.047783270
LengthInLetters                 -0.202368726
Ncount                           0.210193628
MeanBigramFrequency             -0.122457211
FrequencyInitialDiphone         -0.014235292
ConspelV                         0.540216414
ConspelN                         0.380474673
ConphonV                         1.000000000
ConphonN                         0.665883587
ConfriendsV                      0.533170763
ConfriendsN                      0.378854310
ConffV                           0.039617689
ConffN                           0.060092579
ConfbV                           0.741851531
ConfbN                           0.609947163
NounFrequency                   -0.008769463
VerbFrequency                    0.064268066
FrequencyInitialDiphoneWord      0.029920110
FrequencyInitialDiphoneSyllable  0.033639918
CorrectLexdec                    0.020750103
                                   ConphonN
RTlexdec                        -0.08093054
RTnaming                        -0.01436490
Familiarity                      0.16180440
WrittenFrequency                 0.22245283
WrittenSpokenFrequencyRatio     -0.02585613
FamilySize                       0.16153149
DerivationalEntropy              0.07446861
InflectionalEntropy              0.03141073
NumberSimplexSynsets             0.14436202
NumberComplexSynsets             0.14249872
LengthInLetters                 -0.20516790
Ncount                           0.19073268
MeanBigramFrequency             -0.06464530
FrequencyInitialDiphone          0.01040236
ConspelV                         0.41727634
ConspelN                         0.65365781
ConphonV                         0.66588359
ConphonN                         1.00000000
ConfriendsV                      0.38675454
ConfriendsN                      0.65028040
ConffV                           0.08270634
ConffN                           0.09538521
ConfbV                           0.56997101
ConfbN                           0.66832337
NounFrequency                    0.08519330
VerbFrequency                    0.10336888
FrequencyInitialDiphoneWord      0.05363076
FrequencyInitialDiphoneSyllable  0.05429738
CorrectLexdec                    0.06878849
                                ConfriendsV
RTlexdec                        -0.02572083
RTnaming                        -0.02774107
Familiarity                      0.04584224
WrittenFrequency                 0.02146455
WrittenSpokenFrequencyRatio     -0.13634821
FamilySize                       0.07927119
DerivationalEntropy              0.02892953
InflectionalEntropy              0.13124797
NumberSimplexSynsets             0.15465540
NumberComplexSynsets             0.03776110
LengthInLetters                 -0.19219994
Ncount                           0.43682140
MeanBigramFrequency             -0.07821588
FrequencyInitialDiphone         -0.04952632
ConspelV                         0.91949493
ConspelN                         0.55820343
ConphonV                         0.53317076
ConphonN                         0.38675454
ConfriendsV                      1.00000000
ConfriendsN                      0.64901860
ConffV                          -0.12054529
ConffN                          -0.10390690
ConfbV                           0.01094067
ConfbN                          -0.01072111
NounFrequency                   -0.02244854
VerbFrequency                    0.00393842
FrequencyInitialDiphoneWord      0.12811075
FrequencyInitialDiphoneSyllable  0.13787711
CorrectLexdec                    0.04575846
                                 ConfriendsN
RTlexdec                        -0.117532883
RTnaming                        -0.044064997
Familiarity                      0.206739773
WrittenFrequency                 0.263264980
WrittenSpokenFrequencyRatio     -0.055568525
FamilySize                       0.242711641
DerivationalEntropy              0.131756437
InflectionalEntropy              0.067675205
NumberSimplexSynsets             0.252451852
NumberComplexSynsets             0.180629522
LengthInLetters                 -0.156898314
Ncount                           0.340831193
MeanBigramFrequency              0.049075415
FrequencyInitialDiphone          0.036206248
ConspelV                         0.622677513
ConspelN                         0.882926151
ConphonV                         0.378854310
ConphonN                         0.650280396
ConfriendsV                      0.649018597
ConfriendsN                      1.000000000
ConffV                           0.006536121
ConffN                           0.020956643
ConfbV                           0.083786596
ConfbN                           0.089220835
NounFrequency                    0.120108606
VerbFrequency                    0.118902818
FrequencyInitialDiphoneWord      0.115161203
FrequencyInitialDiphoneSyllable  0.121747881
CorrectLexdec                    0.124420710
                                       ConffV
RTlexdec                        -0.0164949450
RTnaming                         0.0079245511
Familiarity                      0.0561034304
WrittenFrequency                 0.0816216587
WrittenSpokenFrequencyRatio     -0.0267316761
FamilySize                       0.0803623769
DerivationalEntropy              0.0410973651
InflectionalEntropy              0.0143888100
NumberSimplexSynsets             0.0362906719
NumberComplexSynsets             0.0821028216
LengthInLetters                 -0.0192444580
Ncount                           0.0768385931
MeanBigramFrequency              0.0725934375
FrequencyInitialDiphone         -0.0001697975
ConspelV                         0.2361879962
ConspelN                         0.2778818176
ConphonV                         0.0396176894
ConphonN                         0.0827063427
ConfriendsV                     -0.1205452871
ConfriendsN                      0.0065361208
ConffV                           1.0000000000
ConffN                           0.8241820547
ConfbV                           0.0729283492
ConfbN                           0.0683948055
NounFrequency                    0.0367115079
VerbFrequency                    0.1198303241
FrequencyInitialDiphoneWord      0.0058281749
FrequencyInitialDiphoneSyllable -0.0110108598
CorrectLexdec                    0.0072904730
                                      ConffN
RTlexdec                        -0.005679088
RTnaming                         0.011407182
Familiarity                      0.029457215
WrittenFrequency                 0.050281007
WrittenSpokenFrequencyRatio     -0.003403134
FamilySize                       0.059199476
DerivationalEntropy              0.040363830
InflectionalEntropy              0.010758578
NumberSimplexSynsets             0.022778356
NumberComplexSynsets             0.057010830
LengthInLetters                  0.010765359
Ncount                           0.069850580
MeanBigramFrequency              0.113704972
FrequencyInitialDiphone          0.002149380
ConspelV                         0.166554266
ConspelN                         0.342139148
ConphonV                         0.060092579
ConphonN                         0.095385208
ConfriendsV                     -0.103906902
ConfriendsN                      0.020956643
ConffV                           0.824182055
ConffN                           1.000000000
ConfbV                           0.114815595
ConfbN                           0.093364309
NounFrequency                    0.010288796
VerbFrequency                    0.082163492
FrequencyInitialDiphoneWord      0.003618182
FrequencyInitialDiphoneSyllable -0.008732692
CorrectLexdec                   -0.007205900
                                      ConfbV
RTlexdec                        -0.022515482
RTnaming                         0.019159417
Familiarity                      0.056873430
WrittenFrequency                 0.119757242
WrittenSpokenFrequencyRatio      0.093614167
FamilySize                       0.050020822
DerivationalEntropy              0.007743200
InflectionalEntropy              0.002863801
NumberSimplexSynsets             0.021580797
NumberComplexSynsets             0.052172715
LengthInLetters                 -0.040037290
Ncount                          -0.036034465
MeanBigramFrequency              0.002954953
FrequencyInitialDiphone          0.033415704
ConspelV                         0.049465271
ConspelN                         0.142218950
ConphonV                         0.741851531
ConphonN                         0.569971012
ConfriendsV                      0.010940674
ConfriendsN                      0.083786596
ConffV                           0.072928349
ConffN                           0.114815595
ConfbV                           1.000000000
ConfbN                           0.842446966
NounFrequency                    0.021685118
VerbFrequency                    0.050519739
FrequencyInitialDiphoneWord     -0.019915368
FrequencyInitialDiphoneSyllable -0.027020380
CorrectLexdec                    0.005393638
                                       ConfbN
RTlexdec                        -0.0185394993
RTnaming                         0.0170117309
Familiarity                      0.0519965301
WrittenFrequency                 0.1040910628
WrittenSpokenFrequencyRatio      0.0702393218
FamilySize                       0.0384571715
DerivationalEntropy              0.0114622898
InflectionalEntropy             -0.0075830633
NumberSimplexSynsets             0.0002057108
NumberComplexSynsets             0.0491751942
LengthInLetters                 -0.0699854865
Ncount                          -0.0421326334
MeanBigramFrequency             -0.0195507270
FrequencyInitialDiphone          0.0191463198
ConspelV                         0.0272406932
ConspelN                         0.1453149511
ConphonV                         0.6099471634
ConphonN                         0.6683233736
ConfriendsV                     -0.0107211109
ConfriendsN                      0.0892208353
ConffV                           0.0683948055
ConffN                           0.0933643088
ConfbV                           0.8424469664
ConfbN                           1.0000000000
NounFrequency                    0.0252276045
VerbFrequency                    0.0329567506
FrequencyInitialDiphoneWord     -0.0188858839
FrequencyInitialDiphoneSyllable -0.0244143624
CorrectLexdec                    0.0039391579
                                NounFrequency
RTlexdec                         -0.167189500
RTnaming                         -0.043148572
Familiarity                       0.381190698
WrittenFrequency                  0.469551521
WrittenSpokenFrequencyRatio       0.012482293
FamilySize                        0.417794301
DerivationalEntropy               0.172254519
InflectionalEntropy              -0.114401007
NumberSimplexSynsets              0.238085561
NumberComplexSynsets              0.349469930
LengthInLetters                  -0.035331865
Ncount                            0.035870248
MeanBigramFrequency               0.043361959
FrequencyInitialDiphone           0.098596497
ConspelV                         -0.016960053
ConspelN                          0.119245162
ConphonV                         -0.008769463
ConphonN                          0.085193295
ConfriendsV                      -0.022448538
ConfriendsN                       0.120108606
ConffV                            0.036711508
ConffN                            0.010288796
ConfbV                            0.021685118
ConfbN                            0.025227604
NounFrequency                     1.000000000
VerbFrequency                    -0.003117231
FrequencyInitialDiphoneWord       0.047626002
FrequencyInitialDiphoneSyllable   0.034300335
CorrectLexdec                     0.128263251
                                VerbFrequency
RTlexdec                         -0.076388309
RTnaming                         -0.024593780
Familiarity                       0.238176996
WrittenFrequency                  0.278792355
WrittenSpokenFrequencyRatio      -0.096554415
FamilySize                        0.114132925
DerivationalEntropy              -0.019725738
InflectionalEntropy               0.094002603
NumberSimplexSynsets              0.188796142
NumberComplexSynsets              0.092248597
LengthInLetters                  -0.083729951
Ncount                            0.053361797
MeanBigramFrequency              -0.045835069
FrequencyInitialDiphone           0.055718748
ConspelV                          0.062919672
ConspelN                          0.125337679
ConphonV                          0.064268066
ConphonN                          0.103368885
ConfriendsV                       0.003938420
ConfriendsN                       0.118902818
ConffV                            0.119830324
ConffN                            0.082163492
ConfbV                            0.050519739
ConfbN                            0.032956751
NounFrequency                    -0.003117231
VerbFrequency                     1.000000000
FrequencyInitialDiphoneWord       0.069596145
FrequencyInitialDiphoneSyllable   0.055821617
CorrectLexdec                     0.050165423
                                FrequencyInitialDiphoneWord
RTlexdec                                       -0.042640861
RTnaming                                        0.020488545
Familiarity                                     0.091063334
WrittenFrequency                                0.108278953
WrittenSpokenFrequencyRatio                     0.002627022
FamilySize                                      0.096705342
DerivationalEntropy                             0.029479534
InflectionalEntropy                             0.052469468
NumberSimplexSynsets                            0.127008196
NumberComplexSynsets                            0.058217178
LengthInLetters                                 0.155454553
Ncount                                          0.007890710
MeanBigramFrequency                             0.214165942
FrequencyInitialDiphone                         0.131098129
ConspelV                                        0.118614576
ConspelN                                        0.118320106
ConphonV                                        0.029920110
ConphonN                                        0.053630763
ConfriendsV                                     0.128110751
ConfriendsN                                     0.115161203
ConffV                                          0.005828175
ConffN                                          0.003618182
ConfbV                                         -0.019915368
ConfbN                                         -0.018885884
NounFrequency                                   0.047626002
VerbFrequency                                   0.069596145
FrequencyInitialDiphoneWord                     1.000000000
FrequencyInitialDiphoneSyllable                 0.978742189
CorrectLexdec                                   0.062039751
                                FrequencyInitialDiphoneSyllable
RTlexdec                                           -0.035503708
RTnaming                                            0.026897756
Familiarity                                         0.073541144
WrittenFrequency                                    0.091116609
WrittenSpokenFrequencyRatio                         0.012872673
FamilySize                                          0.086426850
DerivationalEntropy                                 0.027755605
InflectionalEntropy                                 0.050939450
NumberSimplexSynsets                                0.116058580
NumberComplexSynsets                                0.047009152
LengthInLetters                                     0.150391668
Ncount                                              0.021719522
MeanBigramFrequency                                 0.201570329
FrequencyInitialDiphone                             0.118897649
ConspelV                                            0.122764768
ConspelN                                            0.118433514
ConphonV                                            0.033639918
ConphonN                                            0.054297378
ConfriendsV                                         0.137877114
ConfriendsN                                         0.121747881
ConffV                                             -0.011010860
ConffN                                             -0.008732692
ConfbV                                             -0.027020380
ConfbN                                             -0.024414362
NounFrequency                                       0.034300335
VerbFrequency                                       0.055821617
FrequencyInitialDiphoneWord                         0.978742189
FrequencyInitialDiphoneSyllable                     1.000000000
CorrectLexdec                                       0.057000795
                                CorrectLexdec
RTlexdec                         -0.253188184
RTnaming                          0.151348043
Familiarity                       0.526854585
WrittenFrequency                  0.457971849
WrittenSpokenFrequencyRatio       0.008564774
FamilySize                        0.360613035
DerivationalEntropy               0.188753214
InflectionalEntropy               0.182065382
NumberSimplexSynsets              0.350077421
NumberComplexSynsets              0.329011088
LengthInLetters                   0.046317578
Ncount                            0.016048288
MeanBigramFrequency               0.063566285
FrequencyInitialDiphone           0.048680060
ConspelV                          0.049342737
ConspelN                          0.104328581
ConphonV                          0.020750103
ConphonN                          0.068788492
ConfriendsV                       0.045758455
ConfriendsN                       0.124420710
ConffV                            0.007290473
ConffN                           -0.007205900
ConfbV                            0.005393638
ConfbN                            0.003939158
NounFrequency                     0.128263251
VerbFrequency                     0.050165423
FrequencyInitialDiphoneWord       0.062039751
FrequencyInitialDiphoneSyllable   0.057000795
CorrectLexdec                     1.000000000
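
A matrix this wide is hard to scan by eye for strong relationships such as FrequencyInitialDiphoneWord and FrequencyInitialDiphoneSyllable (about 0.98) or ConspelV and ConfriendsV (about 0.92). The sketch below shows one possible way to list only the variable pairs above a cutoff. The helper `top_pairs` and the cutoff of 0.7 are assumptions made for illustration and are not part of the original analysis; the toy matrix uses three values from the output above, rounded to four decimals.

```r
# Hypothetical helper (not from the original analysis): return the variable
# pairs whose absolute correlation exceeds `cutoff`, strongest first.
top_pairs <- function(m, cutoff = 0.7) {
  # Look only at the upper triangle so each pair is reported once
  # and the diagonal (always 1) is skipped.
  idx <- which(upper.tri(m) & abs(m) > cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(m)[idx[, "row"]],
    var2 = colnames(m)[idx[, "col"]],
    r    = m[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Toy matrix built from three correlations printed above (rounded).
vars <- c("ConphonV", "ConfbV", "ConfbN")
toy <- matrix(c(1,      0.7419, 0.6099,
                0.7419, 1,      0.8424,
                0.6099, 0.8424, 1),
              nrow = 3, dimnames = list(vars, vars))

res <- top_pairs(toy)
print(res)

# With the full matrix from this notebook, the call would presumably be:
# top_pairs(corr, cutoff = 0.7)
```

Restricting the search to the upper triangle is a design choice that relies on the correlation matrix being symmetric, which the output above confirms.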
print(corr)
                                    RTlexdec
RTlexdec                         1.000000000
RTnaming                         0.758703280
Familiarity                     -0.444409734
WrittenFrequency                -0.434814982
WrittenSpokenFrequencyRatio      0.039820007
FamilySize                      -0.349595853
DerivationalEntropy             -0.161164620
InflectionalEntropy             -0.088418681
NumberSimplexSynsets            -0.309008140
NumberComplexSynsets            -0.328613209
LengthInLetters                  0.049747275
Ncount                          -0.065726313
MeanBigramFrequency              0.002633525
FrequencyInitialDiphone         -0.074452719
ConspelV                        -0.032867467
ConspelN                        -0.107538023
ConphonV                        -0.021747588
ConphonN                        -0.080930543
ConfriendsV                     -0.025720833
ConfriendsN                     -0.117532883
ConffV                          -0.016494945
ConffN                          -0.005679088
ConfbV                          -0.022515482
ConfbN                          -0.018539499
NounFrequency                   -0.167189500
VerbFrequency                   -0.076388309
FrequencyInitialDiphoneWord     -0.042640861
FrequencyInitialDiphoneSyllable -0.035503708
CorrectLexdec                   -0.253188184
                                    RTnaming
RTlexdec                         0.758703280
RTnaming                         1.000000000
Familiarity                     -0.094793069
WrittenFrequency                -0.095994313
WrittenSpokenFrequencyRatio      0.036592754
FamilySize                      -0.088037010
DerivationalEntropy             -0.049456670
InflectionalEntropy             -0.022110376
NumberSimplexSynsets            -0.071900207
NumberComplexSynsets            -0.076384846
LengthInLetters                  0.094497065
Ncount                          -0.094669618
MeanBigramFrequency              0.048459360
FrequencyInitialDiphone         -0.057216874
ConspelV                        -0.025165185
ConspelN                        -0.034801239
ConphonV                         0.001175572
ConphonN                        -0.014364896
ConfriendsV                     -0.027741071
ConfriendsN                     -0.044064997
ConffV                           0.007924551
ConffN                           0.011407182
ConfbV                           0.019159417
ConfbN                           0.017011731
NounFrequency                   -0.043148572
VerbFrequency                   -0.024593780
FrequencyInitialDiphoneWord      0.020488545
FrequencyInitialDiphoneSyllable  0.026897756
CorrectLexdec                    0.151348043
                                Familiarity
RTlexdec                        -0.44440973
RTnaming                        -0.09479307
Familiarity                      1.00000000
WrittenFrequency                 0.79125559
WrittenSpokenFrequencyRatio     -0.18989881
FamilySize                       0.59191973
DerivationalEntropy              0.22071588
InflectionalEntropy              0.10795420
NumberSimplexSynsets             0.51065170
NumberComplexSynsets             0.51001913
LengthInLetters                 -0.08215272
Ncount                           0.09650461
MeanBigramFrequency              0.02962138
FrequencyInitialDiphone          0.12847193
ConspelV                         0.07451417
ConspelN                         0.21437628
ConphonV                         0.05408975
ConphonN                         0.16180440
ConfriendsV                      0.04584224
ConfriendsN                      0.20673977
ConffV                           0.05610343
ConffN                           0.02945722
ConfbV                           0.05687343
ConfbN                           0.05199653
NounFrequency                    0.38119070
VerbFrequency                    0.23817700
FrequencyInitialDiphoneWord      0.09106333
FrequencyInitialDiphoneSyllable  0.07354114
CorrectLexdec                    0.52685458
                                WrittenFrequency
RTlexdec                             -0.43481498
RTnaming                             -0.09599431
Familiarity                           0.79125559
WrittenFrequency                      1.00000000
WrittenSpokenFrequencyRatio           0.07158067
FamilySize                            0.66253864
DerivationalEntropy                   0.25522889
InflectionalEntropy                  -0.04048005
NumberSimplexSynsets                  0.55874958
NumberComplexSynsets                  0.59105478
LengthInLetters                      -0.06663196
Ncount                                0.10492564
MeanBigramFrequency                   0.07758879
FrequencyInitialDiphone               0.16748670
ConspelV                              0.05864228
ConspelN                              0.28248908
ConphonV                              0.08201255
ConphonN                              0.22245283
ConfriendsV                           0.02146455
ConfriendsN                           0.26326498
ConffV                                0.08162166
ConffN                                0.05028101
ConfbV                                0.11975724
ConfbN                                0.10409106
NounFrequency                         0.46955152
VerbFrequency                         0.27879235
FrequencyInitialDiphoneWord           0.10827895
FrequencyInitialDiphoneSyllable       0.09111661
CorrectLexdec                         0.45797185
                                WrittenSpokenFrequencyRatio
RTlexdec                                        0.039820007
RTnaming                                        0.036592754
Familiarity                                    -0.189898811
WrittenFrequency                                0.071580669
WrittenSpokenFrequencyRatio                     1.000000000
FamilySize                                     -0.108801543
DerivationalEntropy                            -0.010987756
InflectionalEntropy                            -0.118804044
NumberSimplexSynsets                           -0.085088573
NumberComplexSynsets                           -0.104598273
LengthInLetters                                 0.204091196
Ncount                                         -0.188200595
MeanBigramFrequency                             0.191513493
FrequencyInitialDiphone                         0.020799272
ConspelV                                       -0.157689304
ConspelN                                       -0.057307387
ConphonV                                       -0.034796879
ConphonN                                       -0.025856126
ConfriendsV                                    -0.136348207
ConfriendsN                                    -0.055568525
ConffV                                         -0.026731676
ConffN                                         -0.003403134
ConfbV                                          0.093614167
ConfbN                                          0.070239322
NounFrequency                                   0.012482293
VerbFrequency                                  -0.096554415
FrequencyInitialDiphoneWord                     0.002627022
FrequencyInitialDiphoneSyllable                 0.012872673
CorrectLexdec                                   0.008564774
                                  FamilySize
RTlexdec                        -0.349595853
RTnaming                        -0.088037010
Familiarity                      0.591919733
WrittenFrequency                 0.662538635
WrittenSpokenFrequencyRatio     -0.108801543
FamilySize                       1.000000000
DerivationalEntropy              0.692088896
InflectionalEntropy              0.101743523
NumberSimplexSynsets             0.590763556
NumberComplexSynsets             0.645411663
LengthInLetters                 -0.122995009
Ncount                           0.174107015
MeanBigramFrequency             -0.001056468
FrequencyInitialDiphone          0.126804334
ConspelV                         0.110812602
ConspelN                         0.249522442
ConphonV                         0.050171973
ConphonN                         0.161531494
ConfriendsV                      0.079271194
ConfriendsN                      0.242711641
ConffV                           0.080362377
ConffN                           0.059199476
ConfbV                           0.050020822
ConfbN                           0.038457171
NounFrequency                    0.417794301
VerbFrequency                    0.114132925
FrequencyInitialDiphoneWord      0.096705342
FrequencyInitialDiphoneSyllable  0.086426850
CorrectLexdec                    0.360613035
                                DerivationalEntropy
RTlexdec                               -0.161164620
RTnaming                               -0.049456670
Familiarity                             0.220715880
WrittenFrequency                        0.255228886
WrittenSpokenFrequencyRatio            -0.010987756
FamilySize                              0.692088896
DerivationalEntropy                     1.000000000
InflectionalEntropy                    -0.050795034
NumberSimplexSynsets                    0.223943027
NumberComplexSynsets                    0.331924999
LengthInLetters                        -0.104860729
Ncount                                  0.123827732
MeanBigramFrequency                    -0.020837051
FrequencyInitialDiphone                 0.083835479
ConspelV                                0.046117273
ConspelN                                0.137519738
ConphonV                               -0.002754648
ConphonN                                0.074468614
ConfriendsV                             0.028929529
ConfriendsN                             0.131756437
ConffV                                  0.041097365
ConffN                                  0.040363830
ConfbV                                  0.007743200
ConfbN                                  0.011462290
NounFrequency                           0.172254519
VerbFrequency                          -0.019725738
FrequencyInitialDiphoneWord             0.029479534
FrequencyInitialDiphoneSyllable         0.027755605
CorrectLexdec                           0.188753214
                                InflectionalEntropy
RTlexdec                               -0.088418681
RTnaming                               -0.022110376
Familiarity                             0.107954197
WrittenFrequency                       -0.040480046
WrittenSpokenFrequencyRatio            -0.118804044
FamilySize                              0.101743523
DerivationalEntropy                    -0.050795034
InflectionalEntropy                     1.000000000
NumberSimplexSynsets                    0.398736053
NumberComplexSynsets                    0.005589502
LengthInLetters                         0.052485031
Ncount                                 -0.003252708
MeanBigramFrequency                     0.024789643
FrequencyInitialDiphone                -0.034461207
ConspelV                                0.140798520
ConspelN                                0.046826086
ConphonV                                0.082962738
ConphonN                                0.031410725
ConfriendsV                             0.131247972
ConfriendsN                             0.067675205
ConffV                                  0.014388810
ConffN                                  0.010758578
ConfbV                                  0.002863801
ConfbN                                 -0.007583063
NounFrequency                          -0.114401007
VerbFrequency                           0.094002603
FrequencyInitialDiphoneWord             0.052469468
FrequencyInitialDiphoneSyllable         0.050939450
CorrectLexdec                           0.182065382
                                NumberSimplexSynsets
RTlexdec                               -0.3090081404
RTnaming                               -0.0719002065
Familiarity                             0.5106516971
WrittenFrequency                        0.5587495840
WrittenSpokenFrequencyRatio            -0.0850885729
FamilySize                              0.5907635556
DerivationalEntropy                     0.2239430269
InflectionalEntropy                     0.3987360535
NumberSimplexSynsets                    1.0000000000
NumberComplexSynsets                    0.5245365002
LengthInLetters                        -0.0063644110
Ncount                                  0.1129586209
MeanBigramFrequency                     0.0539689516
FrequencyInitialDiphone                 0.0571276406
ConspelV                                0.1660590990
ConspelN                                0.2275859556
ConphonV                                0.0747186906
ConphonN                                0.1443620165
ConfriendsV                             0.1546554020
ConfriendsN                             0.2524518522
ConffV                                  0.0362906719
ConffN                                  0.0227783558
ConfbV                                  0.0215807967
ConfbN                                  0.0002057108
NounFrequency                           0.2380855612
VerbFrequency                           0.1887961418
FrequencyInitialDiphoneWord             0.1270081956
FrequencyInitialDiphoneSyllable         0.1160585801
CorrectLexdec                           0.3500774208
                                NumberComplexSynsets
RTlexdec                                -0.328613209
RTnaming                                -0.076384846
Familiarity                              0.510019126
WrittenFrequency                         0.591054783
WrittenSpokenFrequencyRatio             -0.104598273
FamilySize                               0.645411663
DerivationalEntropy                      0.331924999
InflectionalEntropy                      0.005589502
NumberSimplexSynsets                     0.524536500
NumberComplexSynsets                     1.000000000
LengthInLetters                         -0.120445975
Ncount                                   0.137748482
MeanBigramFrequency                     -0.023604116
FrequencyInitialDiphone                  0.103684145
ConspelV                                 0.071760775
ConspelN                                 0.193733204
ConphonV                                 0.047783270
ConphonN                                 0.142498720
ConfriendsV                              0.037761099
ConfriendsN                              0.180629522
ConffV                                   0.082102822
ConffN                                   0.057010830
ConfbV                                   0.052172715
ConfbN                                   0.049175194
NounFrequency                            0.349469930
VerbFrequency                            0.092248597
FrequencyInitialDiphoneWord              0.058217178
FrequencyInitialDiphoneSyllable          0.047009152
CorrectLexdec                            0.329011088
                                LengthInLetters
RTlexdec                            0.049747275
RTnaming                            0.094497065
Familiarity                        -0.082152716
WrittenFrequency                   -0.066631955
WrittenSpokenFrequencyRatio         0.204091196
FamilySize                         -0.122995009
DerivationalEntropy                -0.104860729
InflectionalEntropy                 0.052485031
NumberSimplexSynsets               -0.006364411
NumberComplexSynsets               -0.120445975
LengthInLetters                     1.000000000
Ncount                             -0.625129141
MeanBigramFrequency                 0.790492091
FrequencyInitialDiphone            -0.060443836
ConspelV                           -0.226416938
ConspelN                           -0.170022083
ConphonV                           -0.202368726
ConphonN                           -0.205167896
ConfriendsV                        -0.192199942
ConfriendsN                        -0.156898314
ConffV                             -0.019244458
ConffN                              0.010765359
ConfbV                             -0.040037290
ConfbN                             -0.069985486
NounFrequency                      -0.035331865
VerbFrequency                      -0.083729951
FrequencyInitialDiphoneWord         0.155454553
FrequencyInitialDiphoneSyllable     0.150391668
CorrectLexdec                       0.046317578
                                      Ncount
RTlexdec                        -0.065726313
RTnaming                        -0.094669618
Familiarity                      0.096504609
WrittenFrequency                 0.104925644
WrittenSpokenFrequencyRatio     -0.188200595
FamilySize                       0.174107015
DerivationalEntropy              0.123827732
InflectionalEntropy             -0.003252708
NumberSimplexSynsets             0.112958621
NumberComplexSynsets             0.137748482
LengthInLetters                 -0.625129141
Ncount                           1.000000000
MeanBigramFrequency             -0.387546284
FrequencyInitialDiphone          0.135888890
ConspelV                         0.474710938
ConspelN                         0.346547943
ConphonV                         0.210193628
ConphonN                         0.190732679
ConfriendsV                      0.436821402
ConfriendsN                      0.340831193
ConffV                           0.076838593
ConffN                           0.069850580
ConfbV                          -0.036034465
ConfbN                          -0.042132633
NounFrequency                    0.035870248
VerbFrequency                    0.053361797
FrequencyInitialDiphoneWord      0.007890710
FrequencyInitialDiphoneSyllable  0.021719522
CorrectLexdec                    0.016048288
                                MeanBigramFrequency
RTlexdec                                0.002633525
RTnaming                                0.048459360
Familiarity                             0.029621385
WrittenFrequency                        0.077588795
WrittenSpokenFrequencyRatio             0.191513493
FamilySize                             -0.001056468
DerivationalEntropy                    -0.020837051
InflectionalEntropy                     0.024789643
NumberSimplexSynsets                    0.053968952
NumberComplexSynsets                   -0.023604116
LengthInLetters                         0.790492091
Ncount                                 -0.387546284
MeanBigramFrequency                     1.000000000
FrequencyInitialDiphone                 0.324815461
ConspelV                               -0.091270605
ConspelN                                0.060952203
ConphonV                               -0.122457211
ConphonN                               -0.064645303
ConfriendsV                            -0.078215883
ConfriendsN                             0.049075415
ConffV                                  0.072593437
ConffN                                  0.113704972
ConfbV                                  0.002954953
ConfbN                                 -0.019550727
NounFrequency                           0.043361959
VerbFrequency                          -0.045835069
FrequencyInitialDiphoneWord             0.214165942
FrequencyInitialDiphoneSyllable         0.201570329
CorrectLexdec                           0.063566285
                                FrequencyInitialDiphone
RTlexdec                                  -0.0744527186
RTnaming                                  -0.0572168742
Familiarity                                0.1284719337
WrittenFrequency                           0.1674867041
WrittenSpokenFrequencyRatio                0.0207992723
FamilySize                                 0.1268043344
DerivationalEntropy                        0.0838354786
InflectionalEntropy                       -0.0344612070
NumberSimplexSynsets                       0.0571276406
NumberComplexSynsets                       0.1036841450
LengthInLetters                           -0.0604438357
Ncount                                     0.1358888899
MeanBigramFrequency                        0.3248154611
FrequencyInitialDiphone                    1.0000000000
ConspelV                                  -0.0557304958
ConspelN                                   0.0309540623
ConphonV                                  -0.0142352920
ConphonN                                   0.0104023606
ConfriendsV                               -0.0495263185
ConfriendsN                                0.0362062483
ConffV                                    -0.0001697975
ConffN                                     0.0021493798
ConfbV                                     0.0334157038
ConfbN                                     0.0191463198
NounFrequency                              0.0985964971
VerbFrequency                              0.0557187478
FrequencyInitialDiphoneWord                0.1310981285
FrequencyInitialDiphoneSyllable            0.1188976490
CorrectLexdec                              0.0486800603
                                   ConspelV
RTlexdec                        -0.03286747
RTnaming                        -0.02516518
Familiarity                      0.07451417
WrittenFrequency                 0.05864228
WrittenSpokenFrequencyRatio     -0.15768930
FamilySize                       0.11081260
DerivationalEntropy              0.04611727
InflectionalEntropy              0.14079852
NumberSimplexSynsets             0.16605910
NumberComplexSynsets             0.07176077
LengthInLetters                 -0.22641694
Ncount                           0.47471094
MeanBigramFrequency             -0.09127061
FrequencyInitialDiphone         -0.05573050
ConspelV                         1.00000000
ConspelN                         0.64214341
ConphonV                         0.54021641
ConphonN                         0.41727634
ConfriendsV                      0.91949493
ConfriendsN                      0.62267751
ConffV                           0.23618800
ConffN                           0.16655427
ConfbV                           0.04946527
ConfbN                           0.02724069
NounFrequency                   -0.01696005
VerbFrequency                    0.06291967
FrequencyInitialDiphoneWord      0.11861458
FrequencyInitialDiphoneSyllable  0.12276477
CorrectLexdec                    0.04934274
                                   ConspelN
RTlexdec                        -0.10753802
RTnaming                        -0.03480124
Familiarity                      0.21437628
WrittenFrequency                 0.28248908
WrittenSpokenFrequencyRatio     -0.05730739
FamilySize                       0.24952244
DerivationalEntropy              0.13751974
InflectionalEntropy              0.04682609
NumberSimplexSynsets             0.22758596
NumberComplexSynsets             0.19373320
LengthInLetters                 -0.17002208
Ncount                           0.34654794
MeanBigramFrequency              0.06095220
FrequencyInitialDiphone          0.03095406
ConspelV                         0.64214341
ConspelN                         1.00000000
ConphonV                         0.38047467
ConphonN                         0.65365781
ConfriendsV                      0.55820343
ConfriendsN                      0.88292615
ConffV                           0.27788182
ConffN                           0.34213915
ConfbV                           0.14221895
ConfbN                           0.14531495
NounFrequency                    0.11924516
VerbFrequency                    0.12533768
FrequencyInitialDiphoneWord      0.11832011
FrequencyInitialDiphoneSyllable  0.11843351
CorrectLexdec                    0.10432858
                                    ConphonV
RTlexdec                        -0.021747588
RTnaming                         0.001175572
Familiarity                      0.054089750
WrittenFrequency                 0.082012553
WrittenSpokenFrequencyRatio     -0.034796879
FamilySize                       0.050171973
DerivationalEntropy             -0.002754648
InflectionalEntropy              0.082962738
NumberSimplexSynsets             0.074718691
NumberComplexSynsets             0.047783270
LengthInLetters                 -0.202368726
Ncount                           0.210193628
MeanBigramFrequency             -0.122457211
FrequencyInitialDiphone         -0.014235292
ConspelV                         0.540216414
ConspelN                         0.380474673
ConphonV                         1.000000000
ConphonN                         0.665883587
ConfriendsV                      0.533170763
ConfriendsN                      0.378854310
ConffV                           0.039617689
ConffN                           0.060092579
ConfbV                           0.741851531
ConfbN                           0.609947163
NounFrequency                   -0.008769463
VerbFrequency                    0.064268066
FrequencyInitialDiphoneWord      0.029920110
FrequencyInitialDiphoneSyllable  0.033639918
CorrectLexdec                    0.020750103
                                   ConphonN
RTlexdec                        -0.08093054
RTnaming                        -0.01436490
Familiarity                      0.16180440
WrittenFrequency                 0.22245283
WrittenSpokenFrequencyRatio     -0.02585613
FamilySize                       0.16153149
DerivationalEntropy              0.07446861
InflectionalEntropy              0.03141073
NumberSimplexSynsets             0.14436202
NumberComplexSynsets             0.14249872
LengthInLetters                 -0.20516790
Ncount                           0.19073268
MeanBigramFrequency             -0.06464530
FrequencyInitialDiphone          0.01040236
ConspelV                         0.41727634
ConspelN                         0.65365781
ConphonV                         0.66588359
ConphonN                         1.00000000
ConfriendsV                      0.38675454
ConfriendsN                      0.65028040
ConffV                           0.08270634
ConffN                           0.09538521
ConfbV                           0.56997101
ConfbN                           0.66832337
NounFrequency                    0.08519330
VerbFrequency                    0.10336888
FrequencyInitialDiphoneWord      0.05363076
FrequencyInitialDiphoneSyllable  0.05429738
CorrectLexdec                    0.06878849
                                ConfriendsV
RTlexdec                        -0.02572083
RTnaming                        -0.02774107
Familiarity                      0.04584224
WrittenFrequency                 0.02146455
WrittenSpokenFrequencyRatio     -0.13634821
FamilySize                       0.07927119
DerivationalEntropy              0.02892953
InflectionalEntropy              0.13124797
NumberSimplexSynsets             0.15465540
NumberComplexSynsets             0.03776110
LengthInLetters                 -0.19219994
Ncount                           0.43682140
MeanBigramFrequency             -0.07821588
FrequencyInitialDiphone         -0.04952632
ConspelV                         0.91949493
ConspelN                         0.55820343
ConphonV                         0.53317076
ConphonN                         0.38675454
ConfriendsV                      1.00000000
ConfriendsN                      0.64901860
ConffV                          -0.12054529
ConffN                          -0.10390690
ConfbV                           0.01094067
ConfbN                          -0.01072111
NounFrequency                   -0.02244854
VerbFrequency                    0.00393842
FrequencyInitialDiphoneWord      0.12811075
FrequencyInitialDiphoneSyllable  0.13787711
CorrectLexdec                    0.04575846
                                 ConfriendsN
RTlexdec                        -0.117532883
RTnaming                        -0.044064997
Familiarity                      0.206739773
WrittenFrequency                 0.263264980
WrittenSpokenFrequencyRatio     -0.055568525
FamilySize                       0.242711641
DerivationalEntropy              0.131756437
InflectionalEntropy              0.067675205
NumberSimplexSynsets             0.252451852
NumberComplexSynsets             0.180629522
LengthInLetters                 -0.156898314
Ncount                           0.340831193
MeanBigramFrequency              0.049075415
FrequencyInitialDiphone          0.036206248
ConspelV                         0.622677513
ConspelN                         0.882926151
ConphonV                         0.378854310
ConphonN                         0.650280396
ConfriendsV                      0.649018597
ConfriendsN                      1.000000000
ConffV                           0.006536121
ConffN                           0.020956643
ConfbV                           0.083786596
ConfbN                           0.089220835
NounFrequency                    0.120108606
VerbFrequency                    0.118902818
FrequencyInitialDiphoneWord      0.115161203
FrequencyInitialDiphoneSyllable  0.121747881
CorrectLexdec                    0.124420710
                                       ConffV
RTlexdec                        -0.0164949450
RTnaming                         0.0079245511
Familiarity                      0.0561034304
WrittenFrequency                 0.0816216587
WrittenSpokenFrequencyRatio     -0.0267316761
FamilySize                       0.0803623769
DerivationalEntropy              0.0410973651
InflectionalEntropy              0.0143888100
NumberSimplexSynsets             0.0362906719
NumberComplexSynsets             0.0821028216
LengthInLetters                 -0.0192444580
Ncount                           0.0768385931
MeanBigramFrequency              0.0725934375
FrequencyInitialDiphone         -0.0001697975
ConspelV                         0.2361879962
ConspelN                         0.2778818176
ConphonV                         0.0396176894
ConphonN                         0.0827063427
ConfriendsV                     -0.1205452871
ConfriendsN                      0.0065361208
ConffV                           1.0000000000
ConffN                           0.8241820547
ConfbV                           0.0729283492
ConfbN                           0.0683948055
NounFrequency                    0.0367115079
VerbFrequency                    0.1198303241
FrequencyInitialDiphoneWord      0.0058281749
FrequencyInitialDiphoneSyllable -0.0110108598
CorrectLexdec                    0.0072904730
                                      ConffN
RTlexdec                        -0.005679088
RTnaming                         0.011407182
Familiarity                      0.029457215
WrittenFrequency                 0.050281007
WrittenSpokenFrequencyRatio     -0.003403134
FamilySize                       0.059199476
DerivationalEntropy              0.040363830
InflectionalEntropy              0.010758578
NumberSimplexSynsets             0.022778356
NumberComplexSynsets             0.057010830
LengthInLetters                  0.010765359
Ncount                           0.069850580
MeanBigramFrequency              0.113704972
FrequencyInitialDiphone          0.002149380
ConspelV                         0.166554266
ConspelN                         0.342139148
ConphonV                         0.060092579
ConphonN                         0.095385208
ConfriendsV                     -0.103906902
ConfriendsN                      0.020956643
ConffV                           0.824182055
ConffN                           1.000000000
ConfbV                           0.114815595
ConfbN                           0.093364309
NounFrequency                    0.010288796
VerbFrequency                    0.082163492
FrequencyInitialDiphoneWord      0.003618182
FrequencyInitialDiphoneSyllable -0.008732692
CorrectLexdec                   -0.007205900
                                      ConfbV
RTlexdec                        -0.022515482
RTnaming                         0.019159417
Familiarity                      0.056873430
WrittenFrequency                 0.119757242
WrittenSpokenFrequencyRatio      0.093614167
FamilySize                       0.050020822
DerivationalEntropy              0.007743200
InflectionalEntropy              0.002863801
NumberSimplexSynsets             0.021580797
NumberComplexSynsets             0.052172715
LengthInLetters                 -0.040037290
Ncount                          -0.036034465
MeanBigramFrequency              0.002954953
FrequencyInitialDiphone          0.033415704
ConspelV                         0.049465271
ConspelN                         0.142218950
ConphonV                         0.741851531
ConphonN                         0.569971012
ConfriendsV                      0.010940674
ConfriendsN                      0.083786596
ConffV                           0.072928349
ConffN                           0.114815595
ConfbV                           1.000000000
ConfbN                           0.842446966
NounFrequency                    0.021685118
VerbFrequency                    0.050519739
FrequencyInitialDiphoneWord     -0.019915368
FrequencyInitialDiphoneSyllable -0.027020380
CorrectLexdec                    0.005393638
                                       ConfbN
RTlexdec                        -0.0185394993
RTnaming                         0.0170117309
Familiarity                      0.0519965301
WrittenFrequency                 0.1040910628
WrittenSpokenFrequencyRatio      0.0702393218
FamilySize                       0.0384571715
DerivationalEntropy              0.0114622898
InflectionalEntropy             -0.0075830633
NumberSimplexSynsets             0.0002057108
NumberComplexSynsets             0.0491751942
LengthInLetters                 -0.0699854865
Ncount                          -0.0421326334
MeanBigramFrequency             -0.0195507270
FrequencyInitialDiphone          0.0191463198
ConspelV                         0.0272406932
ConspelN                         0.1453149511
ConphonV                         0.6099471634
ConphonN                         0.6683233736
ConfriendsV                     -0.0107211109
ConfriendsN                      0.0892208353
ConffV                           0.0683948055
ConffN                           0.0933643088
ConfbV                           0.8424469664
ConfbN                           1.0000000000
NounFrequency                    0.0252276045
VerbFrequency                    0.0329567506
FrequencyInitialDiphoneWord     -0.0188858839
FrequencyInitialDiphoneSyllable -0.0244143624
CorrectLexdec                    0.0039391579
                                NounFrequency
RTlexdec                         -0.167189500
RTnaming                         -0.043148572
Familiarity                       0.381190698
WrittenFrequency                  0.469551521
WrittenSpokenFrequencyRatio       0.012482293
FamilySize                        0.417794301
DerivationalEntropy               0.172254519
InflectionalEntropy              -0.114401007
NumberSimplexSynsets              0.238085561
NumberComplexSynsets              0.349469930
LengthInLetters                  -0.035331865
Ncount                            0.035870248
MeanBigramFrequency               0.043361959
FrequencyInitialDiphone           0.098596497
ConspelV                         -0.016960053
ConspelN                          0.119245162
ConphonV                         -0.008769463
ConphonN                          0.085193295
ConfriendsV                      -0.022448538
ConfriendsN                       0.120108606
ConffV                            0.036711508
ConffN                            0.010288796
ConfbV                            0.021685118
ConfbN                            0.025227604
NounFrequency                     1.000000000
VerbFrequency                    -0.003117231
FrequencyInitialDiphoneWord       0.047626002
FrequencyInitialDiphoneSyllable   0.034300335
CorrectLexdec                     0.128263251
                                VerbFrequency
RTlexdec                         -0.076388309
RTnaming                         -0.024593780
Familiarity                       0.238176996
WrittenFrequency                  0.278792355
WrittenSpokenFrequencyRatio      -0.096554415
FamilySize                        0.114132925
DerivationalEntropy              -0.019725738
InflectionalEntropy               0.094002603
NumberSimplexSynsets              0.188796142
NumberComplexSynsets              0.092248597
LengthInLetters                  -0.083729951
Ncount                            0.053361797
MeanBigramFrequency              -0.045835069
FrequencyInitialDiphone           0.055718748
ConspelV                          0.062919672
ConspelN                          0.125337679
ConphonV                          0.064268066
ConphonN                          0.103368885
ConfriendsV                       0.003938420
ConfriendsN                       0.118902818
ConffV                            0.119830324
ConffN                            0.082163492
ConfbV                            0.050519739
ConfbN                            0.032956751
NounFrequency                    -0.003117231
VerbFrequency                     1.000000000
FrequencyInitialDiphoneWord       0.069596145
FrequencyInitialDiphoneSyllable   0.055821617
CorrectLexdec                     0.050165423
                                FrequencyInitialDiphoneWord
RTlexdec                                       -0.042640861
RTnaming                                        0.020488545
Familiarity                                     0.091063334
WrittenFrequency                                0.108278953
WrittenSpokenFrequencyRatio                     0.002627022
FamilySize                                      0.096705342
DerivationalEntropy                             0.029479534
InflectionalEntropy                             0.052469468
NumberSimplexSynsets                            0.127008196
NumberComplexSynsets                            0.058217178
LengthInLetters                                 0.155454553
Ncount                                          0.007890710
MeanBigramFrequency                             0.214165942
FrequencyInitialDiphone                         0.131098129
ConspelV                                        0.118614576
ConspelN                                        0.118320106
ConphonV                                        0.029920110
ConphonN                                        0.053630763
ConfriendsV                                     0.128110751
ConfriendsN                                     0.115161203
ConffV                                          0.005828175
ConffN                                          0.003618182
ConfbV                                         -0.019915368
ConfbN                                         -0.018885884
NounFrequency                                   0.047626002
VerbFrequency                                   0.069596145
FrequencyInitialDiphoneWord                     1.000000000
FrequencyInitialDiphoneSyllable                 0.978742189
CorrectLexdec                                   0.062039751
                                FrequencyInitialDiphoneSyllable
RTlexdec                                           -0.035503708
RTnaming                                            0.026897756
Familiarity                                         0.073541144
WrittenFrequency                                    0.091116609
WrittenSpokenFrequencyRatio                         0.012872673
FamilySize                                          0.086426850
DerivationalEntropy                                 0.027755605
InflectionalEntropy                                 0.050939450
NumberSimplexSynsets                                0.116058580
NumberComplexSynsets                                0.047009152
LengthInLetters                                     0.150391668
Ncount                                              0.021719522
MeanBigramFrequency                                 0.201570329
FrequencyInitialDiphone                             0.118897649
ConspelV                                            0.122764768
ConspelN                                            0.118433514
ConphonV                                            0.033639918
ConphonN                                            0.054297378
ConfriendsV                                         0.137877114
ConfriendsN                                         0.121747881
ConffV                                             -0.011010860
ConffN                                             -0.008732692
ConfbV                                             -0.027020380
ConfbN                                             -0.024414362
NounFrequency                                       0.034300335
VerbFrequency                                       0.055821617
FrequencyInitialDiphoneWord                         0.978742189
FrequencyInitialDiphoneSyllable                     1.000000000
CorrectLexdec                                       0.057000795
                                CorrectLexdec
RTlexdec                         -0.253188184
RTnaming                          0.151348043
Familiarity                       0.526854585
WrittenFrequency                  0.457971849
WrittenSpokenFrequencyRatio       0.008564774
FamilySize                        0.360613035
DerivationalEntropy               0.188753214
InflectionalEntropy               0.182065382
NumberSimplexSynsets              0.350077421
NumberComplexSynsets              0.329011088
LengthInLetters                   0.046317578
Ncount                            0.016048288
MeanBigramFrequency               0.063566285
FrequencyInitialDiphone           0.048680060
ConspelV                          0.049342737
ConspelN                          0.104328581
ConphonV                          0.020750103
ConphonN                          0.068788492
ConfriendsV                       0.045758455
ConfriendsN                       0.124420710
ConffV                            0.007290473
ConffN                           -0.007205900
ConfbV                            0.005393638
ConfbN                            0.003939158
NounFrequency                     0.128263251
VerbFrequency                     0.050165423
FrequencyInitialDiphoneWord       0.062039751
FrequencyInitialDiphoneSyllable   0.057000795
CorrectLexdec                     1.000000000
corrplot(corr, method = 'ellipse', type = 'upper')

2.2.2 More advanced

Let’s first compute the correlations between all numeric variables, then plot them together with their p values.
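To make clear what "correlations with p values" means before running it on the full data, here is a minimal base R sketch. It assumes the built-in `mtcars` data and three arbitrarily chosen columns, used purely for illustration, and it uses `cor()` and `cor.test()` as stand-ins, not the `Hmisc` function used in this tutorial. The object names `num`, `r` and `p` are hypothetical.

```r
# Illustration only: built-in data, not the `english` data set
num <- as.matrix(mtcars[, c("mpg", "hp", "wt")])

# Pearson correlation matrix
r <- cor(num, method = "pearson")

# Matching matrix of p values; the diagonal is left as NA
p <- matrix(NA_real_, nrow = ncol(num), ncol = ncol(num),
            dimnames = dimnames(r))
for (i in seq_len(ncol(num))) {
  for (j in seq_len(ncol(num))) {
    if (i != j) {
      p[i, j] <- cor.test(num[, i], num[, j], method = "pearson")$p.value
    }
  }
}

print(round(r, 2))
print(signif(p, 2))
```

Each cell of `p` tests whether the corresponding cell of `r` differs from zero. A function such as `rcorr` returns both matrices in one call, which is why it is convenient when there are many variables.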

## Correlations with p values, to be plotted with "corrplot"
## Based on the function `rcorr` from the `Hmisc` package
## `rcorr` needs a matrix, so convert the data frame first
corr <- 
  english %>% 
  select(where(is.numeric)) %>% 
  as.matrix() %>% 
  rcorr(type = "pearson")
print(corr)
                                RTlexdec
RTlexdec                            1.00
RTnaming                            0.76
Familiarity                        -0.44
WrittenFrequency                   -0.43
WrittenSpokenFrequencyRatio         0.04
FamilySize                         -0.35
DerivationalEntropy                -0.16
InflectionalEntropy                -0.09
NumberSimplexSynsets               -0.31
NumberComplexSynsets               -0.33
LengthInLetters                     0.05
Ncount                             -0.07
MeanBigramFrequency                 0.00
FrequencyInitialDiphone            -0.07
ConspelV                           -0.03
ConspelN                           -0.11
ConphonV                           -0.02
ConphonN                           -0.08
ConfriendsV                        -0.03
ConfriendsN                        -0.12
ConffV                             -0.02
ConffN                             -0.01
ConfbV                             -0.02
ConfbN                             -0.02
NounFrequency                      -0.17
VerbFrequency                      -0.08
FrequencyInitialDiphoneWord        -0.04
FrequencyInitialDiphoneSyllable    -0.04
CorrectLexdec                      -0.25
                                RTnaming
RTlexdec                            0.76
RTnaming                            1.00
Familiarity                        -0.09
WrittenFrequency                   -0.10
WrittenSpokenFrequencyRatio         0.04
FamilySize                         -0.09
DerivationalEntropy                -0.05
InflectionalEntropy                -0.02
NumberSimplexSynsets               -0.07
NumberComplexSynsets               -0.08
LengthInLetters                     0.09
Ncount                             -0.09
MeanBigramFrequency                 0.05
FrequencyInitialDiphone            -0.06
ConspelV                           -0.03
ConspelN                           -0.03
ConphonV                            0.00
ConphonN                           -0.01
ConfriendsV                        -0.03
ConfriendsN                        -0.04
ConffV                              0.01
ConffN                              0.01
ConfbV                              0.02
ConfbN                              0.02
NounFrequency                      -0.04
VerbFrequency                      -0.02
FrequencyInitialDiphoneWord         0.02
FrequencyInitialDiphoneSyllable     0.03
CorrectLexdec                       0.15
                                Familiarity
RTlexdec                              -0.44
RTnaming                              -0.09
Familiarity                            1.00
WrittenFrequency                       0.79
WrittenSpokenFrequencyRatio           -0.19
FamilySize                             0.59
DerivationalEntropy                    0.22
InflectionalEntropy                    0.11
NumberSimplexSynsets                   0.51
NumberComplexSynsets                   0.51
LengthInLetters                       -0.08
Ncount                                 0.10
MeanBigramFrequency                    0.03
FrequencyInitialDiphone                0.13
ConspelV                               0.07
ConspelN                               0.21
ConphonV                               0.05
ConphonN                               0.16
ConfriendsV                            0.05
ConfriendsN                            0.21
ConffV                                 0.06
ConffN                                 0.03
ConfbV                                 0.06
ConfbN                                 0.05
NounFrequency                          0.38
VerbFrequency                          0.24
FrequencyInitialDiphoneWord            0.09
FrequencyInitialDiphoneSyllable        0.07
CorrectLexdec                          0.53
                                WrittenFrequency
RTlexdec                                   -0.43
RTnaming                                   -0.10
Familiarity                                 0.79
WrittenFrequency                            1.00
WrittenSpokenFrequencyRatio                 0.07
FamilySize                                  0.66
DerivationalEntropy                         0.26
InflectionalEntropy                        -0.04
NumberSimplexSynsets                        0.56
NumberComplexSynsets                        0.59
LengthInLetters                            -0.07
Ncount                                      0.10
MeanBigramFrequency                         0.08
FrequencyInitialDiphone                     0.17
ConspelV                                    0.06
ConspelN                                    0.28
ConphonV                                    0.08
ConphonN                                    0.22
ConfriendsV                                 0.02
ConfriendsN                                 0.26
ConffV                                      0.08
ConffN                                      0.05
ConfbV                                      0.12
ConfbN                                      0.10
NounFrequency                               0.47
VerbFrequency                               0.28
FrequencyInitialDiphoneWord                 0.11
FrequencyInitialDiphoneSyllable             0.09
CorrectLexdec                               0.46
                                WrittenSpokenFrequencyRatio
RTlexdec                                               0.04
RTnaming                                               0.04
Familiarity                                           -0.19
WrittenFrequency                                       0.07
WrittenSpokenFrequencyRatio                            1.00
FamilySize                                            -0.11
DerivationalEntropy                                   -0.01
InflectionalEntropy                                   -0.12
NumberSimplexSynsets                                  -0.09
NumberComplexSynsets                                  -0.10
LengthInLetters                                        0.20
Ncount                                                -0.19
MeanBigramFrequency                                    0.19
FrequencyInitialDiphone                                0.02
ConspelV                                              -0.16
ConspelN                                              -0.06
ConphonV                                              -0.03
ConphonN                                              -0.03
ConfriendsV                                           -0.14
ConfriendsN                                           -0.06
ConffV                                                -0.03
ConffN                                                 0.00
ConfbV                                                 0.09
ConfbN                                                 0.07
NounFrequency                                          0.01
VerbFrequency                                         -0.10
FrequencyInitialDiphoneWord                            0.00
FrequencyInitialDiphoneSyllable                        0.01
CorrectLexdec                                          0.01
                                FamilySize
RTlexdec                             -0.35
RTnaming                             -0.09
Familiarity                           0.59
WrittenFrequency                      0.66
WrittenSpokenFrequencyRatio          -0.11
FamilySize                            1.00
DerivationalEntropy                   0.69
InflectionalEntropy                   0.10
NumberSimplexSynsets                  0.59
NumberComplexSynsets                  0.65
LengthInLetters                      -0.12
Ncount                                0.17
MeanBigramFrequency                   0.00
FrequencyInitialDiphone               0.13
ConspelV                              0.11
ConspelN                              0.25
ConphonV                              0.05
ConphonN                              0.16
ConfriendsV                           0.08
ConfriendsN                           0.24
ConffV                                0.08
ConffN                                0.06
ConfbV                                0.05
ConfbN                                0.04
NounFrequency                         0.42
VerbFrequency                         0.11
FrequencyInitialDiphoneWord           0.10
FrequencyInitialDiphoneSyllable       0.09
CorrectLexdec                         0.36
                                DerivationalEntropy
RTlexdec                                      -0.16
RTnaming                                      -0.05
Familiarity                                    0.22
WrittenFrequency                               0.26
WrittenSpokenFrequencyRatio                   -0.01
FamilySize                                     0.69
DerivationalEntropy                            1.00
InflectionalEntropy                           -0.05
NumberSimplexSynsets                           0.22
NumberComplexSynsets                           0.33
LengthInLetters                               -0.10
Ncount                                         0.12
MeanBigramFrequency                           -0.02
FrequencyInitialDiphone                        0.08
ConspelV                                       0.05
ConspelN                                       0.14
ConphonV                                       0.00
ConphonN                                       0.07
ConfriendsV                                    0.03
ConfriendsN                                    0.13
ConffV                                         0.04
ConffN                                         0.04
ConfbV                                         0.01
ConfbN                                         0.01
NounFrequency                                  0.17
VerbFrequency                                 -0.02
FrequencyInitialDiphoneWord                    0.03
FrequencyInitialDiphoneSyllable                0.03
CorrectLexdec                                  0.19
                                InflectionalEntropy
RTlexdec                                      -0.09
RTnaming                                      -0.02
Familiarity                                    0.11
WrittenFrequency                              -0.04
WrittenSpokenFrequencyRatio                   -0.12
FamilySize                                     0.10
DerivationalEntropy                           -0.05
InflectionalEntropy                            1.00
NumberSimplexSynsets                           0.40
NumberComplexSynsets                           0.01
LengthInLetters                                0.05
Ncount                                         0.00
MeanBigramFrequency                            0.02
FrequencyInitialDiphone                       -0.03
ConspelV                                       0.14
ConspelN                                       0.05
ConphonV                                       0.08
ConphonN                                       0.03
ConfriendsV                                    0.13
ConfriendsN                                    0.07
ConffV                                         0.01
ConffN                                         0.01
ConfbV                                         0.00
ConfbN                                        -0.01
NounFrequency                                 -0.11
VerbFrequency                                  0.09
FrequencyInitialDiphoneWord                    0.05
FrequencyInitialDiphoneSyllable                0.05
CorrectLexdec                                  0.18
                                NumberSimplexSynsets
RTlexdec                                       -0.31
RTnaming                                       -0.07
Familiarity                                     0.51
WrittenFrequency                                0.56
WrittenSpokenFrequencyRatio                    -0.09
FamilySize                                      0.59
DerivationalEntropy                             0.22
InflectionalEntropy                             0.40
NumberSimplexSynsets                            1.00
NumberComplexSynsets                            0.52
LengthInLetters                                -0.01
Ncount                                          0.11
MeanBigramFrequency                             0.05
FrequencyInitialDiphone                         0.06
ConspelV                                        0.17
ConspelN                                        0.23
ConphonV                                        0.07
ConphonN                                        0.14
ConfriendsV                                     0.15
ConfriendsN                                     0.25
ConffV                                          0.04
ConffN                                          0.02
ConfbV                                          0.02
ConfbN                                          0.00
NounFrequency                                   0.24
VerbFrequency                                   0.19
FrequencyInitialDiphoneWord                     0.13
FrequencyInitialDiphoneSyllable                 0.12
CorrectLexdec                                   0.35
                                NumberComplexSynsets
RTlexdec                                       -0.33
RTnaming                                       -0.08
Familiarity                                     0.51
WrittenFrequency                                0.59
WrittenSpokenFrequencyRatio                    -0.10
FamilySize                                      0.65
DerivationalEntropy                             0.33
InflectionalEntropy                             0.01
NumberSimplexSynsets                            0.52
NumberComplexSynsets                            1.00
LengthInLetters                                -0.12
Ncount                                          0.14
MeanBigramFrequency                            -0.02
FrequencyInitialDiphone                         0.10
ConspelV                                        0.07
ConspelN                                        0.19
ConphonV                                        0.05
ConphonN                                        0.14
ConfriendsV                                     0.04
ConfriendsN                                     0.18
ConffV                                          0.08
ConffN                                          0.06
ConfbV                                          0.05
ConfbN                                          0.05
NounFrequency                                   0.35
VerbFrequency                                   0.09
FrequencyInitialDiphoneWord                     0.06
FrequencyInitialDiphoneSyllable                 0.05
CorrectLexdec                                   0.33
                                LengthInLetters
RTlexdec                                   0.05
RTnaming                                   0.09
Familiarity                               -0.08
WrittenFrequency                          -0.07
WrittenSpokenFrequencyRatio                0.20
FamilySize                                -0.12
DerivationalEntropy                       -0.10
InflectionalEntropy                        0.05
NumberSimplexSynsets                      -0.01
NumberComplexSynsets                      -0.12
LengthInLetters                            1.00
Ncount                                    -0.63
MeanBigramFrequency                        0.79
FrequencyInitialDiphone                   -0.06
ConspelV                                  -0.23
ConspelN                                  -0.17
ConphonV                                  -0.20
ConphonN                                  -0.21
ConfriendsV                               -0.19
ConfriendsN                               -0.16
ConffV                                    -0.02
ConffN                                     0.01
ConfbV                                    -0.04
ConfbN                                    -0.07
NounFrequency                             -0.04
VerbFrequency                             -0.08
FrequencyInitialDiphoneWord                0.16
FrequencyInitialDiphoneSyllable            0.15
CorrectLexdec                              0.05
                                Ncount
RTlexdec                         -0.07
RTnaming                         -0.09
Familiarity                       0.10
WrittenFrequency                  0.10
WrittenSpokenFrequencyRatio      -0.19
FamilySize                        0.17
DerivationalEntropy               0.12
InflectionalEntropy               0.00
NumberSimplexSynsets              0.11
NumberComplexSynsets              0.14
LengthInLetters                  -0.63
Ncount                            1.00
MeanBigramFrequency              -0.39
FrequencyInitialDiphone           0.14
ConspelV                          0.47
ConspelN                          0.35
ConphonV                          0.21
ConphonN                          0.19
ConfriendsV                       0.44
ConfriendsN                       0.34
ConffV                            0.08
ConffN                            0.07
ConfbV                           -0.04
ConfbN                           -0.04
NounFrequency                     0.04
VerbFrequency                     0.05
FrequencyInitialDiphoneWord       0.01
FrequencyInitialDiphoneSyllable   0.02
CorrectLexdec                     0.02
                                MeanBigramFrequency
RTlexdec                                       0.00
RTnaming                                       0.05
Familiarity                                    0.03
WrittenFrequency                               0.08
WrittenSpokenFrequencyRatio                    0.19
FamilySize                                     0.00
DerivationalEntropy                           -0.02
InflectionalEntropy                            0.02
NumberSimplexSynsets                           0.05
NumberComplexSynsets                          -0.02
LengthInLetters                                0.79
Ncount                                        -0.39
MeanBigramFrequency                            1.00
FrequencyInitialDiphone                        0.32
ConspelV                                      -0.09
ConspelN                                       0.06
ConphonV                                      -0.12
ConphonN                                      -0.06
ConfriendsV                                   -0.08
ConfriendsN                                    0.05
ConffV                                         0.07
ConffN                                         0.11
ConfbV                                         0.00
ConfbN                                        -0.02
NounFrequency                                  0.04
VerbFrequency                                 -0.05
FrequencyInitialDiphoneWord                    0.21
FrequencyInitialDiphoneSyllable                0.20
CorrectLexdec                                  0.06
                                FrequencyInitialDiphone
RTlexdec                                          -0.07
RTnaming                                          -0.06
Familiarity                                        0.13
WrittenFrequency                                   0.17
WrittenSpokenFrequencyRatio                        0.02
FamilySize                                         0.13
DerivationalEntropy                                0.08
InflectionalEntropy                               -0.03
NumberSimplexSynsets                               0.06
NumberComplexSynsets                               0.10
LengthInLetters                                   -0.06
Ncount                                             0.14
MeanBigramFrequency                                0.32
FrequencyInitialDiphone                            1.00
ConspelV                                          -0.06
ConspelN                                           0.03
ConphonV                                          -0.01
ConphonN                                           0.01
ConfriendsV                                       -0.05
ConfriendsN                                        0.04
ConffV                                             0.00
ConffN                                             0.00
ConfbV                                             0.03
ConfbN                                             0.02
NounFrequency                                      0.10
VerbFrequency                                      0.06
FrequencyInitialDiphoneWord                        0.13
FrequencyInitialDiphoneSyllable                    0.12
CorrectLexdec                                      0.05
                                ConspelV
RTlexdec                           -0.03
RTnaming                           -0.03
Familiarity                         0.07
WrittenFrequency                    0.06
WrittenSpokenFrequencyRatio        -0.16
FamilySize                          0.11
DerivationalEntropy                 0.05
InflectionalEntropy                 0.14
NumberSimplexSynsets                0.17
NumberComplexSynsets                0.07
LengthInLetters                    -0.23
Ncount                              0.47
MeanBigramFrequency                -0.09
FrequencyInitialDiphone            -0.06
ConspelV                            1.00
ConspelN                            0.64
ConphonV                            0.54
ConphonN                            0.42
ConfriendsV                         0.92
ConfriendsN                         0.62
ConffV                              0.24
ConffN                              0.17
ConfbV                              0.05
ConfbN                              0.03
NounFrequency                      -0.02
VerbFrequency                       0.06
FrequencyInitialDiphoneWord         0.12
FrequencyInitialDiphoneSyllable     0.12
CorrectLexdec                       0.05
                                ConspelN
RTlexdec                           -0.11
RTnaming                           -0.03
Familiarity                         0.21
WrittenFrequency                    0.28
WrittenSpokenFrequencyRatio        -0.06
FamilySize                          0.25
DerivationalEntropy                 0.14
InflectionalEntropy                 0.05
NumberSimplexSynsets                0.23
NumberComplexSynsets                0.19
LengthInLetters                    -0.17
Ncount                              0.35
MeanBigramFrequency                 0.06
FrequencyInitialDiphone             0.03
ConspelV                            0.64
ConspelN                            1.00
ConphonV                            0.38
ConphonN                            0.65
ConfriendsV                         0.56
ConfriendsN                         0.88
ConffV                              0.28
ConffN                              0.34
ConfbV                              0.14
ConfbN                              0.15
NounFrequency                       0.12
VerbFrequency                       0.13
FrequencyInitialDiphoneWord         0.12
FrequencyInitialDiphoneSyllable     0.12
CorrectLexdec                       0.10
                                ConphonV
RTlexdec                           -0.02
RTnaming                            0.00
Familiarity                         0.05
WrittenFrequency                    0.08
WrittenSpokenFrequencyRatio        -0.03
FamilySize                          0.05
DerivationalEntropy                 0.00
InflectionalEntropy                 0.08
NumberSimplexSynsets                0.07
NumberComplexSynsets                0.05
LengthInLetters                    -0.20
Ncount                              0.21
MeanBigramFrequency                -0.12
FrequencyInitialDiphone            -0.01
ConspelV                            0.54
ConspelN                            0.38
ConphonV                            1.00
ConphonN                            0.67
ConfriendsV                         0.53
ConfriendsN                         0.38
ConffV                              0.04
ConffN                              0.06
ConfbV                              0.74
ConfbN                              0.61
NounFrequency                      -0.01
VerbFrequency                       0.06
FrequencyInitialDiphoneWord         0.03
FrequencyInitialDiphoneSyllable     0.03
CorrectLexdec                       0.02
                                ConphonN
RTlexdec                           -0.08
RTnaming                           -0.01
Familiarity                         0.16
WrittenFrequency                    0.22
WrittenSpokenFrequencyRatio        -0.03
FamilySize                          0.16
DerivationalEntropy                 0.07
InflectionalEntropy                 0.03
NumberSimplexSynsets                0.14
NumberComplexSynsets                0.14
LengthInLetters                    -0.21
Ncount                              0.19
MeanBigramFrequency                -0.06
FrequencyInitialDiphone             0.01
ConspelV                            0.42
ConspelN                            0.65
ConphonV                            0.67
ConphonN                            1.00
ConfriendsV                         0.39
ConfriendsN                         0.65
ConffV                              0.08
ConffN                              0.10
ConfbV                              0.57
ConfbN                              0.67
NounFrequency                       0.09
VerbFrequency                       0.10
FrequencyInitialDiphoneWord         0.05
FrequencyInitialDiphoneSyllable     0.05
CorrectLexdec                       0.07
                                ConfriendsV
RTlexdec                              -0.03
RTnaming                              -0.03
Familiarity                            0.05
WrittenFrequency                       0.02
WrittenSpokenFrequencyRatio           -0.14
FamilySize                             0.08
DerivationalEntropy                    0.03
InflectionalEntropy                    0.13
NumberSimplexSynsets                   0.15
NumberComplexSynsets                   0.04
LengthInLetters                       -0.19
Ncount                                 0.44
MeanBigramFrequency                   -0.08
FrequencyInitialDiphone               -0.05
ConspelV                               0.92
ConspelN                               0.56
ConphonV                               0.53
ConphonN                               0.39
ConfriendsV                            1.00
ConfriendsN                            0.65
ConffV                                -0.12
ConffN                                -0.10
ConfbV                                 0.01
ConfbN                                -0.01
NounFrequency                         -0.02
VerbFrequency                          0.00
FrequencyInitialDiphoneWord            0.13
FrequencyInitialDiphoneSyllable        0.14
CorrectLexdec                          0.05
                                ConfriendsN
RTlexdec                              -0.12
RTnaming                              -0.04
Familiarity                            0.21
WrittenFrequency                       0.26
WrittenSpokenFrequencyRatio           -0.06
FamilySize                             0.24
DerivationalEntropy                    0.13
InflectionalEntropy                    0.07
NumberSimplexSynsets                   0.25
NumberComplexSynsets                   0.18
LengthInLetters                       -0.16
Ncount                                 0.34
MeanBigramFrequency                    0.05
FrequencyInitialDiphone                0.04
ConspelV                               0.62
ConspelN                               0.88
ConphonV                               0.38
ConphonN                               0.65
ConfriendsV                            0.65
ConfriendsN                            1.00
ConffV                                 0.01
ConffN                                 0.02
ConfbV                                 0.08
ConfbN                                 0.09
NounFrequency                          0.12
VerbFrequency                          0.12
FrequencyInitialDiphoneWord            0.12
FrequencyInitialDiphoneSyllable        0.12
CorrectLexdec                          0.12
                                ConffV
RTlexdec                         -0.02
RTnaming                          0.01
Familiarity                       0.06
WrittenFrequency                  0.08
WrittenSpokenFrequencyRatio      -0.03
FamilySize                        0.08
DerivationalEntropy               0.04
InflectionalEntropy               0.01
NumberSimplexSynsets              0.04
NumberComplexSynsets              0.08
LengthInLetters                  -0.02
Ncount                            0.08
MeanBigramFrequency               0.07
FrequencyInitialDiphone           0.00
ConspelV                          0.24
ConspelN                          0.28
ConphonV                          0.04
ConphonN                          0.08
ConfriendsV                      -0.12
ConfriendsN                       0.01
ConffV                            1.00
ConffN                            0.82
ConfbV                            0.07
ConfbN                            0.07
NounFrequency                     0.04
VerbFrequency                     0.12
FrequencyInitialDiphoneWord       0.01
FrequencyInitialDiphoneSyllable  -0.01
CorrectLexdec                     0.01
                                ConffN
RTlexdec                         -0.01
RTnaming                          0.01
Familiarity                       0.03
WrittenFrequency                  0.05
WrittenSpokenFrequencyRatio       0.00
FamilySize                        0.06
DerivationalEntropy               0.04
InflectionalEntropy               0.01
NumberSimplexSynsets              0.02
NumberComplexSynsets              0.06
LengthInLetters                   0.01
Ncount                            0.07
MeanBigramFrequency               0.11
FrequencyInitialDiphone           0.00
ConspelV                          0.17
ConspelN                          0.34
ConphonV                          0.06
ConphonN                          0.10
ConfriendsV                      -0.10
ConfriendsN                       0.02
ConffV                            0.82
ConffN                            1.00
ConfbV                            0.11
ConfbN                            0.09
NounFrequency                     0.01
VerbFrequency                     0.08
FrequencyInitialDiphoneWord       0.00
FrequencyInitialDiphoneSyllable  -0.01
CorrectLexdec                    -0.01
                                ConfbV
RTlexdec                         -0.02
RTnaming                          0.02
Familiarity                       0.06
WrittenFrequency                  0.12
WrittenSpokenFrequencyRatio       0.09
FamilySize                        0.05
DerivationalEntropy               0.01
InflectionalEntropy               0.00
NumberSimplexSynsets              0.02
NumberComplexSynsets              0.05
LengthInLetters                  -0.04
Ncount                           -0.04
MeanBigramFrequency               0.00
FrequencyInitialDiphone           0.03
ConspelV                          0.05
ConspelN                          0.14
ConphonV                          0.74
ConphonN                          0.57
ConfriendsV                       0.01
ConfriendsN                       0.08
ConffV                            0.07
ConffN                            0.11
ConfbV                            1.00
ConfbN                            0.84
NounFrequency                     0.02
VerbFrequency                     0.05
FrequencyInitialDiphoneWord      -0.02
FrequencyInitialDiphoneSyllable  -0.03
CorrectLexdec                     0.01
                                ConfbN
RTlexdec                         -0.02
RTnaming                          0.02
Familiarity                       0.05
WrittenFrequency                  0.10
WrittenSpokenFrequencyRatio       0.07
FamilySize                        0.04
DerivationalEntropy               0.01
InflectionalEntropy              -0.01
NumberSimplexSynsets              0.00
NumberComplexSynsets              0.05
LengthInLetters                  -0.07
Ncount                           -0.04
MeanBigramFrequency              -0.02
FrequencyInitialDiphone           0.02
ConspelV                          0.03
ConspelN                          0.15
ConphonV                          0.61
ConphonN                          0.67
ConfriendsV                      -0.01
ConfriendsN                       0.09
ConffV                            0.07
ConffN                            0.09
ConfbV                            0.84
ConfbN                            1.00
NounFrequency                     0.03
VerbFrequency                     0.03
FrequencyInitialDiphoneWord      -0.02
FrequencyInitialDiphoneSyllable  -0.02
CorrectLexdec                     0.00
                                NounFrequency
RTlexdec                                -0.17
RTnaming                                -0.04
Familiarity                              0.38
WrittenFrequency                         0.47
WrittenSpokenFrequencyRatio              0.01
FamilySize                               0.42
DerivationalEntropy                      0.17
InflectionalEntropy                     -0.11
NumberSimplexSynsets                     0.24
NumberComplexSynsets                     0.35
LengthInLetters                         -0.04
Ncount                                   0.04
MeanBigramFrequency                      0.04
FrequencyInitialDiphone                  0.10
ConspelV                                -0.02
ConspelN                                 0.12
ConphonV                                -0.01
ConphonN                                 0.09
ConfriendsV                             -0.02
ConfriendsN                              0.12
ConffV                                   0.04
ConffN                                   0.01
ConfbV                                   0.02
ConfbN                                   0.03
NounFrequency                            1.00
VerbFrequency                            0.00
FrequencyInitialDiphoneWord              0.05
FrequencyInitialDiphoneSyllable          0.03
CorrectLexdec                            0.13
                                VerbFrequency
RTlexdec                                -0.08
RTnaming                                -0.02
Familiarity                              0.24
WrittenFrequency                         0.28
WrittenSpokenFrequencyRatio             -0.10
FamilySize                               0.11
DerivationalEntropy                     -0.02
InflectionalEntropy                      0.09
NumberSimplexSynsets                     0.19
NumberComplexSynsets                     0.09
LengthInLetters                         -0.08
Ncount                                   0.05
MeanBigramFrequency                     -0.05
FrequencyInitialDiphone                  0.06
ConspelV                                 0.06
ConspelN                                 0.13
ConphonV                                 0.06
ConphonN                                 0.10
ConfriendsV                              0.00
ConfriendsN                              0.12
ConffV                                   0.12
ConffN                                   0.08
ConfbV                                   0.05
ConfbN                                   0.03
NounFrequency                            0.00
VerbFrequency                            1.00
FrequencyInitialDiphoneWord              0.07
FrequencyInitialDiphoneSyllable          0.06
CorrectLexdec                            0.05
                                FrequencyInitialDiphoneWord
RTlexdec                                              -0.04
RTnaming                                               0.02
Familiarity                                            0.09
WrittenFrequency                                       0.11
WrittenSpokenFrequencyRatio                            0.00
FamilySize                                             0.10
DerivationalEntropy                                    0.03
InflectionalEntropy                                    0.05
NumberSimplexSynsets                                   0.13
NumberComplexSynsets                                   0.06
LengthInLetters                                        0.16
Ncount                                                 0.01
MeanBigramFrequency                                    0.21
FrequencyInitialDiphone                                0.13
ConspelV                                               0.12
ConspelN                                               0.12
ConphonV                                               0.03
ConphonN                                               0.05
ConfriendsV                                            0.13
ConfriendsN                                            0.12
ConffV                                                 0.01
ConffN                                                 0.00
ConfbV                                                -0.02
ConfbN                                                -0.02
NounFrequency                                          0.05
VerbFrequency                                          0.07
FrequencyInitialDiphoneWord                            1.00
FrequencyInitialDiphoneSyllable                        0.98
CorrectLexdec                                          0.06
                                FrequencyInitialDiphoneSyllable
RTlexdec                                                  -0.04
RTnaming                                                   0.03
Familiarity                                                0.07
WrittenFrequency                                           0.09
WrittenSpokenFrequencyRatio                                0.01
FamilySize                                                 0.09
DerivationalEntropy                                        0.03
InflectionalEntropy                                        0.05
NumberSimplexSynsets                                       0.12
NumberComplexSynsets                                       0.05
LengthInLetters                                            0.15
Ncount                                                     0.02
MeanBigramFrequency                                        0.20
FrequencyInitialDiphone                                    0.12
ConspelV                                                   0.12
ConspelN                                                   0.12
ConphonV                                                   0.03
ConphonN                                                   0.05
ConfriendsV                                                0.14
ConfriendsN                                                0.12
ConffV                                                    -0.01
ConffN                                                    -0.01
ConfbV                                                    -0.03
ConfbN                                                    -0.02
NounFrequency                                              0.03
VerbFrequency                                              0.06
FrequencyInitialDiphoneWord                                0.98
FrequencyInitialDiphoneSyllable                            1.00
CorrectLexdec                                              0.06
                                CorrectLexdec
RTlexdec                                -0.25
RTnaming                                 0.15
Familiarity                              0.53
WrittenFrequency                         0.46
WrittenSpokenFrequencyRatio              0.01
FamilySize                               0.36
DerivationalEntropy                      0.19
InflectionalEntropy                      0.18
NumberSimplexSynsets                     0.35
NumberComplexSynsets                     0.33
LengthInLetters                          0.05
Ncount                                   0.02
MeanBigramFrequency                      0.06
FrequencyInitialDiphone                  0.05
ConspelV                                 0.05
ConspelN                                 0.10
ConphonV                                 0.02
ConphonN                                 0.07
ConfriendsV                              0.05
ConfriendsN                              0.12
ConffV                                   0.01
ConffN                                  -0.01
ConfbV                                   0.01
ConfbN                                   0.00
NounFrequency                            0.13
VerbFrequency                            0.05
FrequencyInitialDiphoneWord              0.06
FrequencyInitialDiphoneSyllable          0.06
CorrectLexdec                            1.00

n= 4568 


P
                                RTlexdec
RTlexdec                                
RTnaming                        0.0000  
Familiarity                     0.0000  
WrittenFrequency                0.0000  
WrittenSpokenFrequencyRatio     0.0071  
FamilySize                      0.0000  
DerivationalEntropy             0.0000  
InflectionalEntropy             0.0000  
NumberSimplexSynsets            0.0000  
NumberComplexSynsets            0.0000  
LengthInLetters                 0.0008  
Ncount                          0.0000  
MeanBigramFrequency             0.8588  
FrequencyInitialDiphone         0.0000  
ConspelV                        0.0263  
ConspelN                        0.0000  
ConphonV                        0.1417  
ConphonN                        0.0000  
ConfriendsV                     0.0822  
ConfriendsN                     0.0000  
ConffV                          0.2650  
ConffN                          0.7012  
ConfbV                          0.1281  
ConfbN                          0.2103  
NounFrequency                   0.0000  
VerbFrequency                   0.0000  
FrequencyInitialDiphoneWord     0.0039  
FrequencyInitialDiphoneSyllable 0.0164  
CorrectLexdec                   0.0000  
                                RTnaming
RTlexdec                        0.0000  
RTnaming                                
Familiarity                     0.0000  
WrittenFrequency                0.0000  
WrittenSpokenFrequencyRatio     0.0134  
FamilySize                      0.0000  
DerivationalEntropy             0.0008  
InflectionalEntropy             0.1351  
NumberSimplexSynsets            0.0000  
NumberComplexSynsets            0.0000  
LengthInLetters                 0.0000  
Ncount                          0.0000  
MeanBigramFrequency             0.0011  
FrequencyInitialDiphone         0.0001  
ConspelV                        0.0890  
ConspelN                        0.0187  
ConphonV                        0.9367  
ConphonN                        0.3317  
ConfriendsV                     0.0608  
ConfriendsN                     0.0029  
ConffV                          0.5923  
ConffN                          0.4408  
ConfbV                          0.1954  
ConfbN                          0.2503  
NounFrequency                   0.0035  
VerbFrequency                   0.0965  
FrequencyInitialDiphoneWord     0.1662  
FrequencyInitialDiphoneSyllable 0.0691  
CorrectLexdec                   0.0000  
                                Familiarity
RTlexdec                        0.0000     
RTnaming                        0.0000     
Familiarity                                
WrittenFrequency                0.0000     
WrittenSpokenFrequencyRatio     0.0000     
FamilySize                      0.0000     
DerivationalEntropy             0.0000     
InflectionalEntropy             0.0000     
NumberSimplexSynsets            0.0000     
NumberComplexSynsets            0.0000     
LengthInLetters                 0.0000     
Ncount                          0.0000     
MeanBigramFrequency             0.0453     
FrequencyInitialDiphone         0.0000     
ConspelV                        0.0000     
ConspelN                        0.0000     
ConphonV                        0.0003     
ConphonN                        0.0000     
ConfriendsV                     0.0019     
ConfriendsN                     0.0000     
ConffV                          0.0001     
ConffN                          0.0465     
ConfbV                          0.0001     
ConfbN                          0.0004     
NounFrequency                   0.0000     
VerbFrequency                   0.0000     
FrequencyInitialDiphoneWord     0.0000     
FrequencyInitialDiphoneSyllable 0.0000     
CorrectLexdec                   0.0000     
                                WrittenFrequency
RTlexdec                        0.0000          
RTnaming                        0.0000          
Familiarity                     0.0000          
WrittenFrequency                                
WrittenSpokenFrequencyRatio     0.0000          
FamilySize                      0.0000          
DerivationalEntropy             0.0000          
InflectionalEntropy             0.0062          
NumberSimplexSynsets            0.0000          
NumberComplexSynsets            0.0000          
LengthInLetters                 0.0000          
Ncount                          0.0000          
MeanBigramFrequency             0.0000          
FrequencyInitialDiphone         0.0000          
ConspelV                        0.0000          
ConspelN                        0.0000          
ConphonV                        0.0000          
ConphonN                        0.0000          
ConfriendsV                     0.1469          
ConfriendsN                     0.0000          
ConffV                          0.0000          
ConffN                          0.0007          
ConfbV                          0.0000          
ConfbN                          0.0000          
NounFrequency                   0.0000          
VerbFrequency                   0.0000          
FrequencyInitialDiphoneWord     0.0000          
FrequencyInitialDiphoneSyllable 0.0000          
CorrectLexdec                   0.0000          
                                WrittenSpokenFrequencyRatio
RTlexdec                        0.0071                     
RTnaming                        0.0134                     
Familiarity                     0.0000                     
WrittenFrequency                0.0000                     
WrittenSpokenFrequencyRatio                                
FamilySize                      0.0000                     
DerivationalEntropy             0.4578                     
InflectionalEntropy             0.0000                     
NumberSimplexSynsets            0.0000                     
NumberComplexSynsets            0.0000                     
LengthInLetters                 0.0000                     
Ncount                          0.0000                     
MeanBigramFrequency             0.0000                     
FrequencyInitialDiphone         0.1599                     
ConspelV                        0.0000                     
ConspelN                        0.0001                     
ConphonV                        0.0187                     
ConphonN                        0.0806                     
ConfriendsV                     0.0000                     
ConfriendsN                     0.0002                     
ConffV                          0.0708                     
ConffN                          0.8181                     
ConfbV                          0.0000                     
ConfbN                          0.0000                     
NounFrequency                   0.3990                     
VerbFrequency                   0.0000                     
FrequencyInitialDiphoneWord     0.8591                     
FrequencyInitialDiphoneSyllable 0.3844                     
CorrectLexdec                   0.5628                     
                                FamilySize
RTlexdec                        0.0000    
RTnaming                        0.0000    
Familiarity                     0.0000    
WrittenFrequency                0.0000    
WrittenSpokenFrequencyRatio     0.0000    
FamilySize                                
DerivationalEntropy             0.0000    
InflectionalEntropy             0.0000    
NumberSimplexSynsets            0.0000    
NumberComplexSynsets            0.0000    
LengthInLetters                 0.0000    
Ncount                          0.0000    
MeanBigramFrequency             0.9431    
FrequencyInitialDiphone         0.0000    
ConspelV                        0.0000    
ConspelN                        0.0000    
ConphonV                        0.0007    
ConphonN                        0.0000    
ConfriendsV                     0.0000    
ConfriendsN                     0.0000    
ConffV                          0.0000    
ConffN                          0.0000    
ConfbV                          0.0007    
ConfbN                          0.0093    
NounFrequency                   0.0000    
VerbFrequency                   0.0000    
FrequencyInitialDiphoneWord     0.0000    
FrequencyInitialDiphoneSyllable 0.0000    
CorrectLexdec                   0.0000    
                                DerivationalEntropy
RTlexdec                        0.0000             
RTnaming                        0.0008             
Familiarity                     0.0000             
WrittenFrequency                0.0000             
WrittenSpokenFrequencyRatio     0.4578             
FamilySize                      0.0000             
DerivationalEntropy                                
InflectionalEntropy             0.0006             
NumberSimplexSynsets            0.0000             
NumberComplexSynsets            0.0000             
LengthInLetters                 0.0000             
Ncount                          0.0000             
MeanBigramFrequency             0.1591             
FrequencyInitialDiphone         0.0000             
ConspelV                        0.0018             
ConspelN                        0.0000             
ConphonV                        0.8523             
ConphonN                        0.0000             
ConfriendsV                     0.0506             
ConfriendsN                     0.0000             
ConffV                          0.0055             
ConffN                          0.0064             
ConfbV                          0.6008             
ConfbN                          0.4386             
NounFrequency                   0.0000             
VerbFrequency                   0.1825             
FrequencyInitialDiphoneWord     0.0463             
FrequencyInitialDiphoneSyllable 0.0607             
CorrectLexdec                   0.0000             
                                InflectionalEntropy
RTlexdec                        0.0000             
RTnaming                        0.1351             
Familiarity                     0.0000             
WrittenFrequency                0.0062             
WrittenSpokenFrequencyRatio     0.0000             
FamilySize                      0.0000             
DerivationalEntropy             0.0006             
InflectionalEntropy                                
NumberSimplexSynsets            0.0000             
NumberComplexSynsets            0.7057             
LengthInLetters                 0.0004             
Ncount                          0.8260             
MeanBigramFrequency             0.0939             
FrequencyInitialDiphone         0.0198             
ConspelV                        0.0000             
ConspelN                        0.0015             
ConphonV                        0.0000             
ConphonN                        0.0338             
ConfriendsV                     0.0000             
ConfriendsN                     0.0000             
ConffV                          0.3309             
ConffN                          0.4672             
ConfbV                          0.8466             
ConfbN                          0.6084             
NounFrequency                   0.0000             
VerbFrequency                   0.0000             
FrequencyInitialDiphoneWord     0.0004             
FrequencyInitialDiphoneSyllable 0.0006             
CorrectLexdec                   0.0000             
                                NumberSimplexSynsets
RTlexdec                        0.0000              
RTnaming                        0.0000              
Familiarity                     0.0000              
WrittenFrequency                0.0000              
WrittenSpokenFrequencyRatio     0.0000              
FamilySize                      0.0000              
DerivationalEntropy             0.0000              
InflectionalEntropy             0.0000              
NumberSimplexSynsets                                
NumberComplexSynsets            0.0000              
LengthInLetters                 0.6672              
Ncount                          0.0000              
MeanBigramFrequency             0.0003              
FrequencyInitialDiphone         0.0001              
ConspelV                        0.0000              
ConspelN                        0.0000              
ConphonV                        0.0000              
ConphonN                        0.0000              
ConfriendsV                     0.0000              
ConfriendsN                     0.0000              
ConffV                          0.0142              
ConffN                          0.1237              
ConfbV                          0.1447              
ConfbN                          0.9889              
NounFrequency                   0.0000              
VerbFrequency                   0.0000              
FrequencyInitialDiphoneWord     0.0000              
FrequencyInitialDiphoneSyllable 0.0000              
CorrectLexdec                   0.0000              
                                NumberComplexSynsets
RTlexdec                        0.0000              
RTnaming                        0.0000              
Familiarity                     0.0000              
WrittenFrequency                0.0000              
WrittenSpokenFrequencyRatio     0.0000              
FamilySize                      0.0000              
DerivationalEntropy             0.0000              
InflectionalEntropy             0.7057              
NumberSimplexSynsets            0.0000              
NumberComplexSynsets                                
LengthInLetters                 0.0000              
Ncount                          0.0000              
MeanBigramFrequency             0.1107              
FrequencyInitialDiphone         0.0000              
ConspelV                        0.0000              
ConspelN                        0.0000              
ConphonV                        0.0012              
ConphonN                        0.0000              
ConfriendsV                     0.0107              
ConfriendsN                     0.0000              
ConffV                          0.0000              
ConffN                          0.0001              
ConfbV                          0.0004              
ConfbN                          0.0009              
NounFrequency                   0.0000              
VerbFrequency                   0.0000              
FrequencyInitialDiphoneWord     0.0000              
FrequencyInitialDiphoneSyllable 0.0015              
CorrectLexdec                   0.0000              
                                LengthInLetters
RTlexdec                        0.0008         
RTnaming                        0.0000         
Familiarity                     0.0000         
WrittenFrequency                0.0000         
WrittenSpokenFrequencyRatio     0.0000         
FamilySize                      0.0000         
DerivationalEntropy             0.0000         
InflectionalEntropy             0.0004         
NumberSimplexSynsets            0.6672         
NumberComplexSynsets            0.0000         
LengthInLetters                                
Ncount                          0.0000         
MeanBigramFrequency             0.0000         
FrequencyInitialDiphone         0.0000         
ConspelV                        0.0000         
ConspelN                        0.0000         
ConphonV                        0.0000         
ConphonN                        0.0000         
ConfriendsV                     0.0000         
ConfriendsN                     0.0000         
ConffV                          0.1935         
ConffN                          0.4670         
ConfbV                          0.0068         
ConfbN                          0.0000         
NounFrequency                   0.0169         
VerbFrequency                   0.0000         
FrequencyInitialDiphoneWord     0.0000         
FrequencyInitialDiphoneSyllable 0.0000         
CorrectLexdec                   0.0017         
                                Ncount
RTlexdec                        0.0000
RTnaming                        0.0000
Familiarity                     0.0000
WrittenFrequency                0.0000
WrittenSpokenFrequencyRatio     0.0000
FamilySize                      0.0000
DerivationalEntropy             0.0000
InflectionalEntropy             0.8260
NumberSimplexSynsets            0.0000
NumberComplexSynsets            0.0000
LengthInLetters                 0.0000
Ncount                                
MeanBigramFrequency             0.0000
FrequencyInitialDiphone         0.0000
ConspelV                        0.0000
ConspelN                        0.0000
ConphonV                        0.0000
ConphonN                        0.0000
ConfriendsV                     0.0000
ConfriendsN                     0.0000
ConffV                          0.0000
ConffN                          0.0000
ConfbV                          0.0149
ConfbN                          0.0044
NounFrequency                   0.0153
VerbFrequency                   0.0003
FrequencyInitialDiphoneWord     0.5939
FrequencyInitialDiphoneSyllable 0.1422
CorrectLexdec                   0.2782
                                MeanBigramFrequency
RTlexdec                        0.8588             
RTnaming                        0.0011             
Familiarity                     0.0453             
WrittenFrequency                0.0000             
WrittenSpokenFrequencyRatio     0.0000             
FamilySize                      0.9431             
DerivationalEntropy             0.1591             
InflectionalEntropy             0.0939             
NumberSimplexSynsets            0.0003             
NumberComplexSynsets            0.1107             
LengthInLetters                 0.0000             
Ncount                          0.0000             
MeanBigramFrequency                                
FrequencyInitialDiphone         0.0000             
ConspelV                        0.0000             
ConspelN                        0.0000             
ConphonV                        0.0000             
ConphonN                        0.0000             
ConfriendsV                     0.0000             
ConfriendsN                     0.0009             
ConffV                          0.0000             
ConffN                          0.0000             
ConfbV                          0.8417             
ConfbN                          0.1865             
NounFrequency                   0.0034             
VerbFrequency                   0.0019             
FrequencyInitialDiphoneWord     0.0000             
FrequencyInitialDiphoneSyllable 0.0000             
CorrectLexdec                   0.0000             
                                FrequencyInitialDiphone
RTlexdec                        0.0000                 
RTnaming                        0.0001                 
Familiarity                     0.0000                 
WrittenFrequency                0.0000                 
WrittenSpokenFrequencyRatio     0.1599                 
FamilySize                      0.0000                 
DerivationalEntropy             0.0000                 
InflectionalEntropy             0.0198                 
NumberSimplexSynsets            0.0001                 
NumberComplexSynsets            0.0000                 
LengthInLetters                 0.0000                 
Ncount                          0.0000                 
MeanBigramFrequency             0.0000                 
FrequencyInitialDiphone                                
ConspelV                        0.0002                 
ConspelN                        0.0364                 
ConphonV                        0.3361                 
ConphonN                        0.4821                 
ConfriendsV                     0.0008                 
ConfriendsN                     0.0144                 
ConffV                          0.9908                 
ConffN                          0.8845                 
ConfbV                          0.0239                 
ConfbN                          0.1957                 
NounFrequency                   0.0000                 
VerbFrequency                   0.0002                 
FrequencyInitialDiphoneWord     0.0000                 
FrequencyInitialDiphoneSyllable 0.0000                 
CorrectLexdec                   0.0010                 
                                ConspelV
RTlexdec                        0.0263  
RTnaming                        0.0890  
Familiarity                     0.0000  
WrittenFrequency                0.0000  
WrittenSpokenFrequencyRatio     0.0000  
FamilySize                      0.0000  
DerivationalEntropy             0.0018  
InflectionalEntropy             0.0000  
NumberSimplexSynsets            0.0000  
NumberComplexSynsets            0.0000  
LengthInLetters                 0.0000  
Ncount                          0.0000  
MeanBigramFrequency             0.0000  
FrequencyInitialDiphone         0.0002  
ConspelV                                
ConspelN                        0.0000  
ConphonV                        0.0000  
ConphonN                        0.0000  
ConfriendsV                     0.0000  
ConfriendsN                     0.0000  
ConffV                          0.0000  
ConffN                          0.0000  
ConfbV                          0.0008  
ConfbN                          0.0656  
NounFrequency                   0.2518  
VerbFrequency                   0.0000  
FrequencyInitialDiphoneWord     0.0000  
FrequencyInitialDiphoneSyllable 0.0000  
CorrectLexdec                   0.0008  
                                ConspelN
RTlexdec                        0.0000  
RTnaming                        0.0187  
Familiarity                     0.0000  
WrittenFrequency                0.0000  
WrittenSpokenFrequencyRatio     0.0001  
FamilySize                      0.0000  
DerivationalEntropy             0.0000  
InflectionalEntropy             0.0015  
NumberSimplexSynsets            0.0000  
NumberComplexSynsets            0.0000  
LengthInLetters                 0.0000  
Ncount                          0.0000  
MeanBigramFrequency             0.0000  
FrequencyInitialDiphone         0.0364  
ConspelV                        0.0000  
ConspelN                                
ConphonV                        0.0000  
ConphonN                        0.0000  
ConfriendsV                     0.0000  
ConfriendsN                     0.0000  
ConffV                          0.0000  
ConffN                          0.0000  
ConfbV                          0.0000  
ConfbN                          0.0000  
NounFrequency                   0.0000  
VerbFrequency                   0.0000  
FrequencyInitialDiphoneWord     0.0000  
FrequencyInitialDiphoneSyllable 0.0000  
CorrectLexdec                   0.0000  
                                ConphonV
RTlexdec                        0.1417  
RTnaming                        0.9367  
Familiarity                     0.0003  
WrittenFrequency                0.0000  
WrittenSpokenFrequencyRatio     0.0187  
FamilySize                      0.0007  
DerivationalEntropy             0.8523  
InflectionalEntropy             0.0000  
NumberSimplexSynsets            0.0000  
NumberComplexSynsets            0.0012  
LengthInLetters                 0.0000  
Ncount                          0.0000  
MeanBigramFrequency             0.0000  
FrequencyInitialDiphone         0.3361  
ConspelV                        0.0000  
ConspelN                        0.0000  
ConphonV                                
ConphonN                        0.0000  
ConfriendsV                     0.0000  
ConfriendsN                     0.0000  
ConffV                          0.0074  
ConffN                          0.0000  
ConfbV                          0.0000  
ConfbN                          0.0000  
NounFrequency                   0.5535  
VerbFrequency                   0.0000  
FrequencyInitialDiphoneWord     0.0432  
FrequencyInitialDiphoneSyllable 0.0230  
CorrectLexdec                   0.1609  
                                ConphonN
RTlexdec                        0.0000  
RTnaming                        0.3317  
Familiarity                     0.0000  
WrittenFrequency                0.0000  
WrittenSpokenFrequencyRatio     0.0806  
FamilySize                      0.0000  
DerivationalEntropy             0.0000  
InflectionalEntropy             0.0338  
NumberSimplexSynsets            0.0000  
NumberComplexSynsets            0.0000  
LengthInLetters                 0.0000  
Ncount                          0.0000  
MeanBigramFrequency             0.0000  
FrequencyInitialDiphone         0.4821  
ConspelV                        0.0000  
ConspelN                        0.0000  
ConphonV                        0.0000  
ConphonN                                
ConfriendsV                     0.0000  
ConfriendsN                     0.0000  
ConffV                          0.0000  
ConffN                          0.0000  
ConfbV                          0.0000  
ConfbN                          0.0000  
NounFrequency                   0.0000  
VerbFrequency                   0.0000  
FrequencyInitialDiphoneWord     0.0003  
FrequencyInitialDiphoneSyllable 0.0002  
CorrectLexdec                   0.0000  
                                ConfriendsV
RTlexdec                        0.0822     
RTnaming                        0.0608     
Familiarity                     0.0019     
WrittenFrequency                0.1469     
WrittenSpokenFrequencyRatio     0.0000     
FamilySize                      0.0000     
DerivationalEntropy             0.0506     
InflectionalEntropy             0.0000     
NumberSimplexSynsets            0.0000     
NumberComplexSynsets            0.0107     
LengthInLetters                 0.0000     
Ncount                          0.0000     
MeanBigramFrequency             0.0000     
FrequencyInitialDiphone         0.0008     
ConspelV                        0.0000     
ConspelN                        0.0000     
ConphonV                        0.0000     
ConphonN                        0.0000     
ConfriendsV                                
ConfriendsN                     0.0000     
ConffV                          0.0000     
ConffN                          0.0000     
ConfbV                          0.4597     
ConfbN                          0.4688     
NounFrequency                   0.1293     
VerbFrequency                   0.7902     
FrequencyInitialDiphoneWord     0.0000     
FrequencyInitialDiphoneSyllable 0.0000     
CorrectLexdec                   0.0020     
                                ConfriendsN
RTlexdec                        0.0000     
RTnaming                        0.0029     
Familiarity                     0.0000     
WrittenFrequency                0.0000     
WrittenSpokenFrequencyRatio     0.0002     
FamilySize                      0.0000     
DerivationalEntropy             0.0000     
InflectionalEntropy             0.0000     
NumberSimplexSynsets            0.0000     
NumberComplexSynsets            0.0000     
LengthInLetters                 0.0000     
Ncount                          0.0000     
MeanBigramFrequency             0.0009     
FrequencyInitialDiphone         0.0144     
ConspelV                        0.0000     
ConspelN                        0.0000     
ConphonV                        0.0000     
ConphonN                        0.0000     
ConfriendsV                     0.0000     
ConfriendsN                                
ConffV                          0.6587     
ConffN                          0.1567     
ConfbV                          0.0000     
ConfbN                          0.0000     
NounFrequency                   0.0000     
VerbFrequency                   0.0000     
FrequencyInitialDiphoneWord     0.0000     
FrequencyInitialDiphoneSyllable 0.0000     
CorrectLexdec                   0.0000     
                                ConffV
RTlexdec                        0.2650
RTnaming                        0.5923
Familiarity                     0.0001
WrittenFrequency                0.0000
WrittenSpokenFrequencyRatio     0.0708
FamilySize                      0.0000
DerivationalEntropy             0.0055
InflectionalEntropy             0.3309
NumberSimplexSynsets            0.0142
NumberComplexSynsets            0.0000
LengthInLetters                 0.1935
Ncount                          0.0000
MeanBigramFrequency             0.0000
FrequencyInitialDiphone         0.9908
ConspelV                        0.0000
ConspelN                        0.0000
ConphonV                        0.0074
ConphonN                        0.0000
ConfriendsV                     0.0000
ConfriendsN                     0.6587
ConffV                                
ConffN                          0.0000
ConfbV                          0.0000
ConfbN                          0.0000
NounFrequency                   0.0131
VerbFrequency                   0.0000
FrequencyInitialDiphoneWord     0.6937
FrequencyInitialDiphoneSyllable 0.4569
CorrectLexdec                   0.6223
                                ConffN
RTlexdec                        0.7012
RTnaming                        0.4408
Familiarity                     0.0465
WrittenFrequency                0.0007
WrittenSpokenFrequencyRatio     0.8181
FamilySize                      0.0000
DerivationalEntropy             0.0064
InflectionalEntropy             0.4672
NumberSimplexSynsets            0.1237
NumberComplexSynsets            0.0001
LengthInLetters                 0.4670
Ncount                          0.0000
MeanBigramFrequency             0.0000
FrequencyInitialDiphone         0.8845
ConspelV                        0.0000
ConspelN                        0.0000
ConphonV                        0.0000
ConphonN                        0.0000
ConfriendsV                     0.0000
ConfriendsN                     0.1567
ConffV                          0.0000
ConffN                                
ConfbV                          0.0000
ConfbN                          0.0000
NounFrequency                   0.4869
VerbFrequency                   0.0000
FrequencyInitialDiphoneWord     0.8069
FrequencyInitialDiphoneSyllable 0.5551
CorrectLexdec                   0.6263
                                ConfbV
RTlexdec                        0.1281
RTnaming                        0.1954
Familiarity                     0.0001
WrittenFrequency                0.0000
WrittenSpokenFrequencyRatio     0.0000
FamilySize                      0.0007
DerivationalEntropy             0.6008
InflectionalEntropy             0.8466
NumberSimplexSynsets            0.1447
NumberComplexSynsets            0.0004
LengthInLetters                 0.0068
Ncount                          0.0149
MeanBigramFrequency             0.8417
FrequencyInitialDiphone         0.0239
ConspelV                        0.0008
ConspelN                        0.0000
ConphonV                        0.0000
ConphonN                        0.0000
ConfriendsV                     0.4597
ConfriendsN                     0.0000
ConffV                          0.0000
ConffN                          0.0000
ConfbV                                
ConfbN                          0.0000
NounFrequency                   0.1428
VerbFrequency                   0.0006
FrequencyInitialDiphoneWord     0.1784
FrequencyInitialDiphoneSyllable 0.0678
CorrectLexdec                   0.7155
                                ConfbN
RTlexdec                        0.2103
RTnaming                        0.2503
Familiarity                     0.0004
WrittenFrequency                0.0000
WrittenSpokenFrequencyRatio     0.0000
FamilySize                      0.0093
DerivationalEntropy             0.4386
InflectionalEntropy             0.6084
NumberSimplexSynsets            0.9889
NumberComplexSynsets            0.0009
LengthInLetters                 0.0000
Ncount                          0.0044
MeanBigramFrequency             0.1865
FrequencyInitialDiphone         0.1957
ConspelV                        0.0656
ConspelN                        0.0000
ConphonV                        0.0000
ConphonN                        0.0000
ConfriendsV                     0.4688
ConfriendsN                     0.0000
ConffV                          0.0000
ConffN                          0.0000
ConfbV                          0.0000
ConfbN                                
NounFrequency                   0.0882
VerbFrequency                   0.0259
FrequencyInitialDiphoneWord     0.2019
FrequencyInitialDiphoneSyllable 0.0990
CorrectLexdec                   0.7901
                                NounFrequency
RTlexdec                        0.0000       
RTnaming                        0.0035       
Familiarity                     0.0000       
WrittenFrequency                0.0000       
WrittenSpokenFrequencyRatio     0.3990       
FamilySize                      0.0000       
DerivationalEntropy             0.0000       
InflectionalEntropy             0.0000       
NumberSimplexSynsets            0.0000       
NumberComplexSynsets            0.0000       
LengthInLetters                 0.0169       
Ncount                          0.0153       
MeanBigramFrequency             0.0034       
FrequencyInitialDiphone         0.0000       
ConspelV                        0.2518       
ConspelN                        0.0000       
ConphonV                        0.5535       
ConphonN                        0.0000       
ConfriendsV                     0.1293       
ConfriendsN                     0.0000       
ConffV                          0.0131       
ConffN                          0.4869       
ConfbV                          0.1428       
ConfbN                          0.0882       
NounFrequency                                
VerbFrequency                   0.8332       
FrequencyInitialDiphoneWord     0.0013       
FrequencyInitialDiphoneSyllable 0.0204       
CorrectLexdec                   0.0000       
                                VerbFrequency
RTlexdec                        0.0000       
RTnaming                        0.0965       
Familiarity                     0.0000       
WrittenFrequency                0.0000       
WrittenSpokenFrequencyRatio     0.0000       
FamilySize                      0.0000       
DerivationalEntropy             0.1825       
InflectionalEntropy             0.0000       
NumberSimplexSynsets            0.0000       
NumberComplexSynsets            0.0000       
LengthInLetters                 0.0000       
Ncount                          0.0003       
MeanBigramFrequency             0.0019       
FrequencyInitialDiphone         0.0002       
ConspelV                        0.0000       
ConspelN                        0.0000       
ConphonV                        0.0000       
ConphonN                        0.0000       
ConfriendsV                     0.7902       
ConfriendsN                     0.0000       
ConffV                          0.0000       
ConffN                          0.0000       
ConfbV                          0.0006       
ConfbN                          0.0259       
NounFrequency                   0.8332       
VerbFrequency                                
FrequencyInitialDiphoneWord     0.0000       
FrequencyInitialDiphoneSyllable 0.0002       
CorrectLexdec                   0.0007       
                                FrequencyInitialDiphoneWord
RTlexdec                        0.0039                     
RTnaming                        0.1662                     
Familiarity                     0.0000                     
WrittenFrequency                0.0000                     
WrittenSpokenFrequencyRatio     0.8591                     
FamilySize                      0.0000                     
DerivationalEntropy             0.0463                     
InflectionalEntropy             0.0004                     
NumberSimplexSynsets            0.0000                     
NumberComplexSynsets            0.0000                     
LengthInLetters                 0.0000                     
Ncount                          0.5939                     
MeanBigramFrequency             0.0000                     
FrequencyInitialDiphone         0.0000                     
ConspelV                        0.0000                     
ConspelN                        0.0000                     
ConphonV                        0.0432                     
ConphonN                        0.0003                     
ConfriendsV                     0.0000                     
ConfriendsN                     0.0000                     
ConffV                          0.6937                     
ConffN                          0.8069                     
ConfbV                          0.1784                     
ConfbN                          0.2019                     
NounFrequency                   0.0013                     
VerbFrequency                   0.0000                     
FrequencyInitialDiphoneWord                                
FrequencyInitialDiphoneSyllable 0.0000                     
CorrectLexdec                   0.0000                     
                                FrequencyInitialDiphoneSyllable
RTlexdec                        0.0164                         
RTnaming                        0.0691                         
Familiarity                     0.0000                         
WrittenFrequency                0.0000                         
WrittenSpokenFrequencyRatio     0.3844                         
FamilySize                      0.0000                         
DerivationalEntropy             0.0607                         
InflectionalEntropy             0.0006                         
NumberSimplexSynsets            0.0000                         
NumberComplexSynsets            0.0015                         
LengthInLetters                 0.0000                         
Ncount                          0.1422                         
MeanBigramFrequency             0.0000                         
FrequencyInitialDiphone         0.0000                         
ConspelV                        0.0000                         
ConspelN                        0.0000                         
ConphonV                        0.0230                         
ConphonN                        0.0002                         
ConfriendsV                     0.0000                         
ConfriendsN                     0.0000                         
ConffV                          0.4569                         
ConffN                          0.5551                         
ConfbV                          0.0678                         
ConfbN                          0.0990                         
NounFrequency                   0.0204                         
VerbFrequency                   0.0002                         
FrequencyInitialDiphoneWord     0.0000                         
FrequencyInitialDiphoneSyllable                                
CorrectLexdec                   0.0001                         
                                CorrectLexdec
RTlexdec                        0.0000       
RTnaming                        0.0000       
Familiarity                     0.0000       
WrittenFrequency                0.0000       
WrittenSpokenFrequencyRatio     0.5628       
FamilySize                      0.0000       
DerivationalEntropy             0.0000       
InflectionalEntropy             0.0000       
NumberSimplexSynsets            0.0000       
NumberComplexSynsets            0.0000       
LengthInLetters                 0.0017       
Ncount                          0.2782       
MeanBigramFrequency             0.0000       
FrequencyInitialDiphone         0.0010       
ConspelV                        0.0008       
ConspelN                        0.0000       
ConphonV                        0.1609       
ConphonN                        0.0000       
ConfriendsV                     0.0020       
ConfriendsN                     0.0000       
ConffV                          0.6223       
ConffN                          0.6263       
ConfbV                          0.7155       
ConfbN                          0.7901       
NounFrequency                   0.0000       
VerbFrequency                   0.0007       
FrequencyInitialDiphoneWord     0.0000       
FrequencyInitialDiphoneSyllable 0.0001       
CorrectLexdec                                
# use corrplot to obtain a nice correlation plot!
corrplot(corr$r, p.mat = corr$P,
         addCoef.col = "black", diag = FALSE, type = "upper", tl.srt = 55)

english %>% 
  group_by(AgeSubject) %>% 
  summarise(mean = mean(RTlexdec),
            sd = sd(RTlexdec))

3 Linear Models

Up to now, we have looked at descriptive statistics and evaluated summaries and correlations in the data (with p values).

We are now interested in looking at group differences.

3.1 Introduction

A linear model is a regression analysis of the data: we model an outcome (or dependent variable) as a function of a predictor (or independent variable). The formula of a linear model is outcome ~ predictor, which can be read as “outcome as a function of the predictor”. We can add “1” to specify an intercept, but this is added to the model by default.
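To illustrate the point about the intercept, here is a minimal sketch on simulated data (the data frame `toy` and its columns `group` and `rt` are made up for illustration; they are not part of the english dataset). It fits the same model with and without the explicit “1” and checks that the coefficients agree.

```r
# Simulated data (hypothetical; not the english dataset)
set.seed(1)
toy <- data.frame(
  group = factor(rep(c("young", "old"), each = 50), levels = c("young", "old"))
)
toy$rt <- 6.4 + 0.2 * (toy$group == "old") + rnorm(100, sd = 0.1)

# The intercept is added by default ...
m_implicit <- lm(rt ~ group, data = toy)
# ... so writing it out explicitly gives the same model
m_explicit <- lm(rt ~ 1 + group, data = toy)

all.equal(coef(m_implicit), coef(m_explicit))
```

Both calls should return the same two coefficients: an intercept and one coefficient for the non-reference level of `group`.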

3.1.1 Model estimation

english2 <- english %>% 
  mutate(AgeSubject = factor(AgeSubject, levels = c("young", "old")))
mdl.lm <- english2 %>% 
  lm(RTlexdec ~ AgeSubject, data = .)
#lm(RTlexdec ~ AgeSubject, data = english)
mdl.lm #also print(mdl.lm)

Call:
lm(formula = RTlexdec ~ AgeSubject, data = .)

Coefficients:
  (Intercept)  AgeSubjectold  
       6.4392         0.2217  
summary(mdl.lm)

Call:
lm(formula = RTlexdec ~ AgeSubject, data = .)

Residuals:
     Min       1Q   Median 
-0.25776 -0.08339 -0.01669 
      3Q      Max 
 0.06921  0.52685 

Coefficients:
              Estimate
(Intercept)   6.439237
AgeSubjectold 0.221721
              Std. Error t value
(Intercept)     0.002324 2771.03
AgeSubjectold   0.003286   67.47
              Pr(>|t|)    
(Intercept)     <2e-16 ***
AgeSubjectold   <2e-16 ***
---
Signif. codes:  
  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
  0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1111 on 4566 degrees of freedom
Multiple R-squared:  0.4992,    Adjusted R-squared:  0.4991 
F-statistic:  4552 on 1 and 4566 DF,  p-value: < 2.2e-16

3.1.2 Tidying the output

# from library(broom)
tidy(mdl.lm) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
mycoefE <- tidy(mdl.lm) %>% pull(estimate)

Obtaining mean values from our model

#young (reference level, i.e., the intercept)
mycoefE[1]
[1] 6.439237
#old (intercept + coefficient for AgeSubjectold)
mycoefE[1] + mycoefE[2]
[1] 6.660958
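Why does adding coefficients give group means? With R's default treatment coding, the intercept is the mean of the reference level (the first factor level) and the second coefficient is the difference between the two levels. A small sketch on simulated data (the names `toy`, `group` and `rt` are hypothetical) compares the means rebuilt from the coefficients with the means computed directly.

```r
# Simulated two-group data (hypothetical; not the english dataset)
set.seed(2)
toy <- data.frame(
  group = factor(rep(c("young", "old"), each = 40), levels = c("young", "old"))
)
toy$rt <- ifelse(toy$group == "old", 6.66, 6.44) + rnorm(80, sd = 0.1)

fit <- lm(rt ~ group, data = toy)
b <- coef(fit)

# Means rebuilt from the coefficients
model_means <- c(young = b[[1]], old = b[[1]] + b[[2]])

# Means computed directly from the data (same order as the factor levels)
observed_means <- tapply(toy$rt, toy$group, mean)

all.equal(unname(model_means), as.vector(observed_means))
```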

3.1.3 Nice table of our model summary

We can also obtain a nice table of our model summary, using the package knitr or xtable.

3.1.3.1 Directly from model summary

kable(summary(mdl.lm)$coef, digits = 3)
              Estimate  Std. Error   t value  Pr(>|t|)
(Intercept)      6.439       0.002  2771.027         0
AgeSubjectold    0.222       0.003    67.468         0

3.1.3.2 From the tidy output

mdl.lmT <- tidy(mdl.lm)
kable(mdl.lmT, digits = 3)
term           estimate  std.error  statistic  p.value
(Intercept)       6.439      0.002   2771.027        0
AgeSubjectold     0.222      0.003     67.468        0

3.1.4 Dissecting the model

Let us dissect the model. If you use “str”, you will be able to see what is available under our linear model. We can then access specific information from the model.

3.1.4.1 “str” and “coef”

str(mdl.lm)
List of 13
 $ coefficients : Named num [1:2] 6.439 0.222
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "AgeSubjectold"
 $ residuals    : Named num [1:4568] 0.1045 -0.0416 -0.1343 -0.015 0.0114 ...
  ..- attr(*, "names")= chr [1:4568] "1" "2" "3" "4" ...
 $ effects      : Named num [1:4568] -442.7013 7.4927 -0.1352 -0.0159 0.0105 ...
  ..- attr(*, "names")= chr [1:4568] "(Intercept)" "AgeSubjectold" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:4568] 6.44 6.44 6.44 6.44 6.44 ...
  ..- attr(*, "names")= chr [1:4568] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:4568, 1:2] -67.587 0.0148 0.0148 0.0148 0.0148 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:4568] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "AgeSubjectold"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  .. ..- attr(*, "contrasts")=List of 1
  .. .. ..$ AgeSubject: chr "contr.treatment"
  ..$ qraux: num [1:2] 1.01 1.01
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 4566
 $ contrasts    :List of 1
  ..$ AgeSubject: chr "contr.treatment"
 $ xlevels      :List of 1
  ..$ AgeSubject: chr [1:2] "young" "old"
 $ call         : language lm(formula = RTlexdec ~      AgeSubject, data = .)
 $ terms        :Classes 'terms', 'formula'  language RTlexdec ~ AgeSubject
  .. ..- attr(*, "variables")= language list(RTlexdec, AgeSubject)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "RTlexdec" "AgeSubject"
  .. .. .. ..$ : chr "AgeSubject"
  .. ..- attr(*, "term.labels")= chr "AgeSubject"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x000001e11015d748> 
  .. ..- attr(*, "predvars")= language list(RTlexdec, AgeSubject)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
  .. .. ..- attr(*, "names")= chr [1:2] "RTlexdec" "AgeSubject"
 $ model        :'data.frame':  4568 obs. of  2 variables:
  ..$ RTlexdec  : num [1:4568] 6.54 6.4 6.3 6.42 6.45 ...
  ..$ AgeSubject: Factor w/ 2 levels "young","old": 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language RTlexdec ~ AgeSubject
  .. .. ..- attr(*, "variables")= language list(RTlexdec, AgeSubject)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "RTlexdec" "AgeSubject"
  .. .. .. .. ..$ : chr "AgeSubject"
  .. .. ..- attr(*, "term.labels")= chr "AgeSubject"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x000001e11015d748> 
  .. .. ..- attr(*, "predvars")= language list(RTlexdec, AgeSubject)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
  .. .. .. ..- attr(*, "names")= chr [1:2] "RTlexdec" "AgeSubject"
 - attr(*, "class")= chr "lm"
coef(mdl.lm)
  (Intercept) AgeSubjectold 
    6.4392366     0.2217215 
## same as 
## mdl.lm$coefficients

3.1.4.2 “coef” and “coefficients”

What if I want to obtain the “Intercept”? Or the coefficient for AgeSubject? What if I want the full row for AgeSubject?

coef(mdl.lm)[1] # same as mdl.lm$coefficients[1]
(Intercept) 
   6.439237 
coef(mdl.lm)[2] # same as mdl.lm$coefficients[2]
AgeSubjectold 
    0.2217215 
summary(mdl.lm)$coefficients[2, ] # full row
   Estimate  Std. Error 
 0.22172146  0.00328631 
    t value    Pr(>|t|) 
67.46820211  0.00000000 
summary(mdl.lm)$coefficients[2, 4] #for p value
[1] 0
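The p value is printed as 0 here, while the model summary shows <2e-16. A p value cannot be exactly zero; it is simply too small to be represented as a floating-point number. The sketch below recomputes the two-sided p value from the t value and degrees of freedom reported above, then formats it the way summaries do (`base::format.pval` is called with its namespace because Hmisc masks it in this session).

```r
# Two-sided p value for t = 67.468 on 4566 degrees of freedom
# (values copied from the model summary above)
p <- 2 * pt(-67.468, df = 4566)
p

# Smallest difference from 1 that a double can represent
.Machine$double.eps

# Values below that threshold are reported as "< eps" instead of 0
base::format.pval(p)
```

When reporting such a result, it is therefore safer to write p < 2.2e-16 (or p < .001) than p = 0.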

3.1.4.3 Residuals

What about residuals (the difference between each observed value and the value fitted by the model) and fitted values? Plotting them allows us to evaluate how closely the residuals follow a normal distribution and whether their spread is similar across the fitted values.

hist(residuals(mdl.lm))

qqnorm(residuals(mdl.lm)); qqline(residuals(mdl.lm))

plot(fitted(mdl.lm), residuals(mdl.lm), cex = 4)
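As a quick check of the definition of a residual, the following sketch (simulated data; `x`, `y` and `fit` are hypothetical names) recomputes the residuals by hand as observed minus fitted values and compares them with what `residuals()` returns.

```r
# Simulated data with a two-level factor (hypothetical example)
set.seed(3)
x <- factor(rep(c("a", "b"), each = 30))
y <- ifelse(x == "b", 1, 0) + rnorm(60)

fit <- lm(y ~ x)

# Residual = observed value - fitted value
manual_resid <- y - fitted(fit)
all.equal(unname(residuals(fit)), unname(manual_resid))

# For a least-squares fit with an intercept, residuals average to
# zero (up to floating-point error)
mean(residuals(fit))
```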

3.1.4.4 Goodness of fit?

AIC(mdl.lm) # Akaike's Information Criterion, lower values are better
[1] -7110.962
BIC(mdl.lm) # Bayesian Information Criterion, lower values are better
[1] -7091.682
logLik(mdl.lm)  # log likelihood
'log Lik.' 3558.481 (df=3)

Or use the following from broom

glance(mdl.lm)
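The three values printed above are linked by simple formulas: AIC = -2 × logLik + 2k and BIC = -2 × logLik + k × log(n), where k is the number of estimated parameters (3 here: intercept, slope and residual variance, as shown by df=3) and n is the number of observations (4568). A short check using the numbers reported above:

```r
# Values copied from the output above
log_lik <- 3558.481  # logLik(mdl.lm)
k <- 3               # df reported by logLik()
n <- 4568            # number of observations

aic <- -2 * log_lik + 2 * k
bic <- -2 * log_lik + k * log(n)

round(c(AIC = aic, BIC = bic), 3)
```

Up to rounding of the log likelihood, these match the values returned by `AIC()` and `BIC()`. BIC penalises each extra parameter by log(n) instead of 2, so it favours simpler models more strongly in large samples.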

3.1.4.5 Significance testing

Are the above informative? Not directly. If we want to test the overall significance of the model, we fit a null model (i.e., intercept only) and compare the two models.

mdl.lm.Null <- english %>% 
  lm(RTlexdec ~ 1, data = .)
mdl.comp <- anova(mdl.lm.Null, mdl.lm)
mdl.comp
Analysis of Variance Table

Model 1: RTlexdec ~ 1
Model 2: RTlexdec ~ AgeSubject
  Res.Df     RSS Df Sum of Sq
1   4567 112.456             
2   4566  56.314  1    56.141
     F    Pr(>F)    
1                   
2 4552 < 2.2e-16 ***
---
Signif. codes:  
  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
  0.05 ‘.’ 0.1 ‘ ’ 1

The results show that adding the variable “AgeSubject” improves the model fit. We can write this as follows: Model comparison showed that the addition of AgeSubject improved the model fit when compared with an intercept-only model (F(1, 4566) = 4552, p < 2.2e-16).
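The F value in the comparison table can be rebuilt from the residual sums of squares (RSS) of the two models: the drop in RSS per added parameter, divided by the residual variance of the larger model. A sketch using the rounded numbers from the table above:

```r
# Values copied from the anova() table above
rss_null <- 112.456  # RSS of the intercept-only model
rss_full <- 56.314   # RSS of the model with AgeSubject
df_diff  <- 1        # parameters added
df_resid <- 4566     # residual degrees of freedom of the larger model

f_value <- ((rss_null - rss_full) / df_diff) / (rss_full / df_resid)
f_value

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_diff, df_resid, lower.tail = FALSE)
p_value
```

Because the inputs are rounded, the result agrees with the reported F of 4552 only approximately. The p value is too small to be represented and should be reported as p < 2.2e-16.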

3.2 Plotting fitted values

3.2.1 Trend line

Let’s plot our fitted values but only for the trend line

english %>% 
  ggplot(aes(x = AgeSubject, y = RTlexdec))+
  geom_boxplot()+
  theme_bw() + theme(text = element_text(size = 15))+
  geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") + 
  labs(x = "Age", y = "RTLexDec", title = "Boxplot and predicted trend line", subtitle = "with ggplot2") 
`geom_smooth()` using formula 'y ~ x'

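This plots the raw data together with the trend line predicted by our model. The line connects the two predicted group means, which, for a model with a single categorical predictor, are exactly the same as the group means in our original data.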

3.2.2 Predicted means and the trend line

We can also plot the predicted means and linear trend

english %>% 
  ggplot(aes(x = AgeSubject, y = predict(mdl.lm)))+
  geom_boxplot(color = "blue") +
  theme_bw() + theme(text = element_text(size = 15)) +
  geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") + 
    labs(x = "Age", y = "RTLexDec", title = "Predicted means and trend line", subtitle = "with ggplot2") 
`geom_smooth()` using formula 'y ~ x'

3.2.3 Raw data, predicted means and the trend line

We can also plot the actual data, the predicted means and linear trend

english %>% 
  ggplot(aes(x = AgeSubject, y = RTlexdec))+
  geom_boxplot() +
  geom_boxplot(aes(x = AgeSubject, y = predict(mdl.lm)), color = "blue") +
  theme_bw() + theme(text = element_text(size = 15)) +
  geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
    labs(x = "Age", y = "RTLexDec", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with ggplot2")
`geom_smooth()` using formula 'y ~ x'

3.2.4 Add significance levels and trend line on a plot?

We can use the p value generated from our linear model to add significance levels on a plot. We use the code from above and add the significance level. We also add a trend line.

english %>% 
  ggplot(aes(x = AgeSubject, y = RTlexdec))+
  geom_boxplot() +
  geom_boxplot(aes(x = AgeSubject, y = predict(mdl.lm)), color = "blue") +
  theme_bw() + theme(text = element_text(size = 15)) +
  geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
    labs(x = "Age", y = "RTLexDec", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with significance testing") +
    geom_signif(comparisons = list(c("old", "young")), 
              map_signif_level = TRUE, test = function(a, b) {
                list(p.value = summary(mdl.lm)$coefficients[2, 4])})
`geom_smooth()` using formula 'y ~ x'

3.3 What about pairwise comparison?

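When we have three or more levels in our predictor, we can use pairwise comparisons, with corrections for multiple comparisons, to evaluate differences between each pair of levels. Here our predictor has only two levels, so there is a single contrast.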

summary(mdl.lm)

Call:
lm(formula = RTlexdec ~ AgeSubject, data = .)

Residuals:
     Min       1Q   Median 
-0.25776 -0.08339 -0.01669 
      3Q      Max 
 0.06921  0.52685 

Coefficients:
              Estimate
(Intercept)   6.439237
AgeSubjectold 0.221721
              Std. Error t value
(Intercept)     0.002324 2771.03
AgeSubjectold   0.003286   67.47
              Pr(>|t|)    
(Intercept)     <2e-16 ***
AgeSubjectold   <2e-16 ***
---
Signif. codes:  
  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
  0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1111 on 4566 degrees of freedom
Multiple R-squared:  0.4992,    Adjusted R-squared:  0.4991 
F-statistic:  4552 on 1 and 4566 DF,  p-value: < 2.2e-16
mdl.lm %>% emmeans(pairwise ~ AgeSubject, adjust = "fdr") -> mdl.emmeans
mdl.emmeans
$emmeans
 AgeSubject emmean       SE   df
 young       6.439 0.002324 4566
 old         6.661 0.002324 4566
 lower.CL upper.CL
    6.435    6.444
    6.656    6.666

Confidence level used: 0.95 

$contrasts
 contrast    estimate      SE
 young - old   -0.222 0.00329
   df t.ratio p.value
 4566 -67.468  <.0001

How to interpret the output? Discuss with your neighbour and share with the group.

Hint… Look at the emmeans values for each level of our factor “AgeSubject” and the contrasts.
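With only two levels there is a single contrast, so the correction has nothing to adjust. To see what a correction does with three levels, here is a minimal sketch using base R's `pairwise.t.test()` on the built-in PlantGrowth data. Both are stand-ins chosen for illustration; this is neither the english data nor the emmeans workflow used above.

```r
# Built-in dataset: plant weight for three groups (ctrl, trt1, trt2)
data(PlantGrowth)

# All pairwise t tests, p values adjusted with the FDR method
pw <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                      p.adjust.method = "fdr")
pw$p.value

# The same adjustment applied by hand to the unadjusted p values
raw <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                       p.adjust.method = "none")$p.value
adjusted <- p.adjust(raw[!is.na(raw)], method = "fdr")
adjusted
```

Three levels give three pairwise contrasts, and each adjusted p value is at least as large as its unadjusted counterpart.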

3.4 Multiple predictors?

Linear models require a numeric outcome, but the predictor can be either numeric or a factor. We can have more than one predictor. The only issue is that this complicates the interpretation of the results.

english %>% 
  lm(RTlexdec ~ AgeSubject * WordCategory, data = .) %>% 
  summary()

Call:
lm(formula = RTlexdec ~ AgeSubject * WordCategory, data = .)

Residuals:
     Min       1Q   Median 
-0.25079 -0.08273 -0.01516 
      3Q      Max 
 0.06940  0.52285 

Coefficients:
                               Estimate
(Intercept)                    6.664955
AgeSubjectyoung               -0.220395
WordCategoryV                 -0.010972
AgeSubjectyoung:WordCategoryV -0.003642
                              Std. Error
(Intercept)                     0.002911
AgeSubjectyoung                 0.004116
WordCategoryV                   0.004822
AgeSubjectyoung:WordCategoryV   0.006820
                               t value
(Intercept)                   2289.950
AgeSubjectyoung                -53.545
WordCategoryV                   -2.275
AgeSubjectyoung:WordCategoryV   -0.534
                              Pr(>|t|)
(Intercept)                     <2e-16
AgeSubjectyoung                 <2e-16
WordCategoryV                   0.0229
AgeSubjectyoung:WordCategoryV   0.5934
                                 
(Intercept)                   ***
AgeSubjectyoung               ***
WordCategoryV                 *  
AgeSubjectyoung:WordCategoryV    
---
Signif. codes:  
  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
  0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1109 on 4564 degrees of freedom
Multiple R-squared:  0.5008,    Adjusted R-squared:  0.5005 
F-statistic:  1526 on 3 and 4564 DF,  p-value: < 2.2e-16
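One way to make these coefficients easier to read is to rebuild the four cell means from them. With treatment coding, each coefficient is an offset from the reference cell (here old subjects and nouns, since the coefficients are named AgeSubjectyoung and WordCategoryV). A sketch using the estimates printed above; the short names in `b` are made up for readability:

```r
# Estimates copied from the summary above
b <- c(intercept  = 6.664955,   # old subjects, nouns (reference cell)
       young      = -0.220395,  # AgeSubjectyoung
       verb       = -0.010972,  # WordCategoryV
       young_verb = -0.003642)  # AgeSubjectyoung:WordCategoryV

cell_means <- c(
  old_N   = b[["intercept"]],
  young_N = b[["intercept"]] + b[["young"]],
  old_V   = b[["intercept"]] + b[["verb"]],
  young_V = sum(b)  # all four terms apply to young subjects, verbs
)
round(cell_means, 3)
```

The interaction term only enters the last cell: it measures how much the verb effect for young subjects departs from the verb effect for old subjects.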

And with an Anova

english %>% 
  lm(RTlexdec ~ AgeSubject * WordCategory, data = .) %>% 
  anova()
Analysis of Variance Table

Response: RTlexdec
                          Df
AgeSubject                 1
WordCategory               1
AgeSubject:WordCategory    1
Residuals               4564
                        Sum Sq
AgeSubject              56.141
WordCategory             0.173
AgeSubject:WordCategory  0.004
Residuals               56.138
                        Mean Sq
AgeSubject               56.141
WordCategory              0.173
AgeSubject:WordCategory   0.004
Residuals                 0.012
                          F value
AgeSubject              4564.2810
WordCategory              14.0756
AgeSubject:WordCategory    0.2851
Residuals                        
                           Pr(>F)
AgeSubject              < 2.2e-16
WordCategory            0.0001778
AgeSubject:WordCategory 0.5933724
Residuals                        
                           
AgeSubject              ***
WordCategory            ***
AgeSubject:WordCategory    
Residuals                  
---
Signif. codes:  
  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
  0.05 ‘.’ 0.1 ‘ ’ 1

The results above show significant main effects of AgeSubject and WordCategory, but no significant interaction between the two (p = 0.59).

4 Generalised Linear Models

Here we will look at an example when the outcome is binary. This simulated data is structured as follows. We asked one participant to listen to 165 sentences, and to judge whether these are “grammatical” or “ungrammatical”. There were 105 sentences that were “grammatical” and 60 “ungrammatical”. This fictitious example can apply in any other situation. Let’s think Geography: 165 lands: 105 “flat” and 60 “non-flat”, etc. This applies to any case where you need to “categorise” the outcome into two groups.

4.1 Load and summaries

Let’s load in the data and do some basic summaries

grammatical <- read_csv("grammatical.csv")
Rows: 165 Columns: 2
-- Column specification ---------
Delimiter: ","
chr (2): grammaticality, resp...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
grammatical
str(grammatical)
spec_tbl_df [165 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ grammaticality: chr [1:165] "grammatical" "grammatical" "grammatical" "grammatical" ...
 $ response      : chr [1:165] "yes" "yes" "yes" "yes" ...
 - attr(*, "spec")=
  .. cols(
  ..   grammaticality = col_character(),
  ..   response = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
head(grammatical)

4.2 GLM - Categorical predictors

Let’s run a first GLM (Generalised Linear Model). Because the outcome is binary, we fit the GLM with the “binomial” family, which assumes the outcome follows a binomial distribution and models it on the logodds scale. In general, results from a Logistic Regression are close to what we get from SDT (see above).

Before fitting the model, we set the reference levels for both response and grammaticality. The model takes the first level of each factor as its reference: “no” for response and “ungrammatical” for grammaticality. The coefficients therefore describe the logodds of a “yes” response, and how these change from the “ungrammatical” to the “grammatical” category.

4.2.1 Model estimation and results

The results below show the logodds for our model.

grammatical <- grammatical %>% 
  mutate(response = factor(response, levels = c("no", "yes")),
         grammaticality = factor(grammaticality, levels = c("ungrammatical", "grammatical")))

grammatical %>% 
  group_by(grammaticality, response) %>% 
  table()
               response
grammaticality   no yes
  ungrammatical  50  10
  grammatical     5 100
mdl.glm <- grammatical %>% 
  glm(response ~ grammaticality, data = ., family = binomial)
summary(mdl.glm)

Call:
glm(formula = response ~ grammaticality, family = binomial, data = .)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4676  -0.6039   0.3124   0.3124   1.8930  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -1.6094     0.3464  -4.646 3.38e-06 ***
grammaticalitygrammatical   4.6052     0.5744   8.017 1.08e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 210.050  on 164  degrees of freedom
Residual deviance:  94.271  on 163  degrees of freedom
AIC: 98.271

Number of Fisher Scoring iterations: 5
tidy(mdl.glm) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm) %>% pull(estimate)

The results show that moving from “ungrammatical” to “grammatical” (a one-unit change in the predictor) increases the logodds of a “yes” response by 4.6052. The intercept shows that for “ungrammatical” sentences the logodds of a “yes” response are -1.6094. The logodds of a “yes” response to a “grammatical” sentence are therefore -1.6094 + 4.6052 = 2.9958.
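The coefficients can also be recovered by hand from the table of counts, which may make the interpretation easier to follow. The sketch below is only an illustration: it rebuilds a data frame from the four cell counts reported above (the names counts, sim, fit and by_hand are made up for this example) rather than reading grammatical.csv.

```r
# Rebuild the 165 judgements from the cell counts in the table above
counts <- data.frame(
  grammaticality = c("ungrammatical", "ungrammatical", "grammatical", "grammatical"),
  response       = c("no", "yes", "no", "yes"),
  n              = c(50, 10, 5, 100)
)
sim <- counts[rep(seq_len(nrow(counts)), counts$n), c("grammaticality", "response")]
sim$grammaticality <- factor(sim$grammaticality,
                             levels = c("ungrammatical", "grammatical"))
sim$response <- factor(sim$response, levels = c("no", "yes"))

fit <- glm(response ~ grammaticality, data = sim, family = binomial)

# Logodds of a "yes" response computed directly from the counts
logodds_ungram <- log(10 / 50)   # ungrammatical: 10 "yes" vs 50 "no"
logodds_gram   <- log(100 / 5)   # grammatical: 100 "yes" vs 5 "no"
by_hand <- c(logodds_ungram, logodds_gram - logodds_ungram)

# Side-by-side comparison of the model coefficients and the hand calculation
round(cbind(glm = unname(coef(fit)), by_hand = by_hand), 4)
```

With a single categorical predictor the model is saturated, so the intercept is the logodds in the reference cell and the slope is the difference in logodds between the two cells.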

4.2.2 Logodds to Odds ratios

Logodds can be exponentiated to talk about the odds of an event. For our model above, the odds of an “ungrammatical” sentence receiving a “yes” response are a mere 0.2 (10 “yes” to 50 “no”); the odds of a “grammatical” sentence receiving a “yes” response are 20 (100 “yes” to 5 “no”), i.e., a “yes” is 20 times more likely than a “no”.

exp(mycoef2[1])
[1] 0.2
exp(mycoef2[1] + mycoef2[2])
[1] 20
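The two odds above can be combined into a single odds ratio, which is what the exponentiated slope of the model represents. A minimal sketch using only the cell counts from the table above (variable names are illustrative):

```r
# Odds of a "yes" response in each condition, from the observed counts
odds_ungram <- 10 / 50    # ungrammatical sentences
odds_gram   <- 100 / 5    # grammatical sentences

# Odds ratio: how many times larger the odds of "yes" are for grammatical sentences
odds_ratio <- odds_gram / odds_ungram
odds_ratio

# Its log is the slope reported by the GLM (grammaticalitygrammatical)
log(odds_ratio)
```

In other words, exp() of the intercept gives the odds in the reference condition, and exp() of the slope gives the multiplicative change in odds between conditions.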

4.2.3 LogOdds to proportions

If you want to talk about proportions rather than odds, we can transform our logodds into probabilities with plogis(). This shows that the predicted proportion of “yes” responses is 17% for “ungrammatical” sentences and 95% for “grammatical” sentences, which matches the observed proportions (10/60 and 100/105).

plogis(mycoef2[1])
[1] 0.1666667
plogis(mycoef2[1] + mycoef2[2])
[1] 0.952381
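For readers who have not met plogis() before, it is the inverse of the logit function. The sketch below spells the transformation out by hand and compares it with the observed proportions; inv_logit is a helper defined here purely for illustration, and the coefficients are recomputed from the counts rather than taken from mycoef2.

```r
# Logodds recomputed from the cell counts (intercept and slope of the model above)
b0 <- log(10 / 50)
b1 <- log(100 / 5) - b0

# Inverse logit written out explicitly
inv_logit <- function(x) exp(x) / (1 + exp(x))

p_ungram <- inv_logit(b0)        # predicted P("yes") for ungrammatical sentences
p_gram   <- inv_logit(b0 + b1)   # predicted P("yes") for grammatical sentences

# plogis() gives the same answer, and both match the observed proportions
rbind(by_hand  = c(p_ungram, p_gram),
      plogis   = c(plogis(b0), plogis(b0 + b1)),
      observed = c(10 / 60, 100 / 105))
```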

4.2.4 Plotting

grammatical <- grammatical %>% 
  mutate(prob = predict(mdl.glm, type = "response"))
grammatical %>% 
  ggplot(aes(x = as.numeric(grammaticality), y = prob)) +
  geom_point() +
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = T) + theme_bw(base_size = 20)+
    labs(y = "Probability", x = "")+
    coord_cartesian(ylim = c(0,1))+
    scale_x_discrete(limits = c("Ungrammatical", "Grammatical"))
`geom_smooth()` using formula 'y ~ x'

4.3 GLM - Numeric predictors

In this example, we will run a GLM using a similar technique to that used in Al-Tamimi (2017) and Baumann & Winter (2018). We use the package languageR and the dataset english.

In the model above, we used the equation lm(RTlexdec ~ AgeSubject). We were interested in examining the impact of the age of the subject on reaction time in a lexical decision task. In this section, we are interested in understanding how well reaction time allows us to differentiate the participants based on their age. We use AgeSubject as our outcome and RTlexdec as our predictor, using the equation glm(AgeSubject ~ RTlexdec). We can usually use RTlexdec as is, but due to a possible quasi separation, and the fact that we may want to compare coefficients across multiple predictors (e.g., multiple acoustic metrics), we will z-score our predictor. We run two models below, with and without z-scoring.

For the glm model, we need to specify family = "binomial".

4.3.1 Without z-scoring of predictor

4.3.1.1 Model estimation

mdl.glm2 <- english2 %>% 
  glm(AgeSubject ~ RTlexdec, data = ., family = "binomial")

tidy(mdl.glm2) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm2) %>% pull(estimate)

4.3.1.2 LogOdds to proportions

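We can again transform our logodds into probabilities. Note that with a numeric predictor the intercept corresponds to RTlexdec = 0, which lies far outside the range of reaction times in the data; this is why the two values below are so close to 0 and are not meaningful on their own.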

plogis(mycoef2[1])
[1] 1.368844e-56
plogis(mycoef2[1] + mycoef2[2])
[1] 4.678715e-48

4.3.1.3 Plotting

english2 <- english2 %>% 
  mutate(prob = predict(mdl.glm2, type = "response"))
english2 %>% 
  ggplot(aes(x = as.numeric(AgeSubject), y = prob)) +
  geom_point() +
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = T) + theme_bw(base_size = 20)+
    labs(y = "Probability", x = "")+
    coord_cartesian(ylim = c(0,1))+
    scale_x_discrete(limits = c("Young", "Old"))
`geom_smooth()` using formula 'y ~ x'

The plot above shows how the two groups differ using a GLM. The results point to an overall increase in the predicted probability when moving from the “Young” to the “Old” group. Let’s use z-scoring next.

4.3.2 With z-scoring of predictor

4.3.2.1 Model estimation

## z-score the predictor; as.numeric() drops the matrix attributes that scale() returns
english2 <- english2 %>% 
  mutate(RTlexdec_z = as.numeric(scale(RTlexdec, center = TRUE, scale = TRUE)))



mdl.glm3 <- english2 %>% 
  glm(AgeSubject ~ RTlexdec_z, data = ., family = "binomial")

tidy(mdl.glm3) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm3) %>% pull(estimate)

4.3.2.2 LogOdds to proportions

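We can again transform our logodds into probabilities. With a z-scored predictor the intercept corresponds to the mean of RTlexdec, so the first value is the predicted probability (of the non-reference level of AgeSubject) at the mean reaction time, and the second is the predicted probability one standard deviation above the mean.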

plogis(mycoef2[1])
[1] 0.5192147
plogis(mycoef2[1] + mycoef2[2])
[1] 0.959313

4.3.2.3 Plotting

4.3.2.3.1 Normal
english2 <- english2 %>% 
  mutate(prob = predict(mdl.glm3, type = "response"))
english2 %>% 
  ggplot(aes(x = as.numeric(AgeSubject), y = prob)) +
  geom_point() +
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = T) + theme_bw(base_size = 20)+
    labs(y = "Probability", x = "")+
    coord_cartesian(ylim = c(0,1))+
    scale_x_discrete(limits = c("Young", "Old"))
`geom_smooth()` using formula 'y ~ x'

We obtain exactly the same plot, but the model estimates are different. Let’s use another type of prediction.
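Why does z-scoring change the estimates but not the plot? Z-scoring is a linear rescaling of the predictor, so the fitted probabilities stay the same while the slope is re-expressed per standard deviation. The sketch below illustrates this on simulated data; the variables rt and age and the generating values are assumptions made for this example and are not the english2 data.

```r
set.seed(1)

# Simulated stand-in for a log reaction time predictor and a two-level outcome
rt  <- rnorm(200, mean = 6.5, sd = 0.15)
age <- factor(ifelse(rbinom(200, 1, plogis(8 * (rt - 6.5))) == 1, "old", "young"),
              levels = c("young", "old"))
d <- data.frame(rt = rt, age = age, rt_z = as.numeric(scale(rt)))

m_raw <- glm(age ~ rt,   data = d, family = binomial)
m_z   <- glm(age ~ rt_z, data = d, family = binomial)

# Same fitted probabilities from both models
all.equal(unname(fitted(m_raw)), unname(fitted(m_z)), tolerance = 1e-6)

# The z-scored slope is the raw slope multiplied by the SD of the predictor
c(slope_z = unname(coef(m_z)[2]),
  raw_slope_times_sd = unname(coef(m_raw)[2]) * sd(d$rt))
```

The practical gain is in interpretation: the z-scored intercept refers to the mean of the predictor instead of a value of 0 that may never occur in the data.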

4.3.2.3.2 z-scores
z_vals <- seq(-3, 3, 0.01)

dfPredNew <- data.frame(RTlexdec_z = z_vals)

## store the predicted probabilities for each value of RTlexdec_z
pp <- cbind(dfPredNew, prob = predict(mdl.glm3, newdata = dfPredNew, type = "response"))

pp %>% 
  ggplot(aes(x = RTlexdec_z, y = prob)) +
  geom_point() +
  theme_bw(base_size = 20)+
    labs(y = "Probability", x = "")+
    coord_cartesian(ylim = c(0,1))+
  scale_x_continuous(breaks = c(-3, -2, -1, 0, 1, 2, 3))

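This time the predicted probability is plotted as a function of the z-scored reaction time, which traces the logistic curve of the model rather than a comparison of the two groups.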

4.4 Accuracy and Signal Detection Theory

4.4.1 Rationale

We are generally interested in performance, i.e., whether we have “accurately” categorised the outcome or not, and at the same time we want to evaluate the biases in our responses. When deciding on categories, we are usually biased in our selection.

Let’s ask the question: How many of you have a Mac laptop and how many a Windows laptop? For those with a Mac, what was the main reason for choosing it? Are you biased in any way by your decision?

To correct for these biases, we use some variants from Signal Detection Theory to obtain the true estimates without being influenced by the biases.

4.4.2 Running stats

Let’s do some stats on this

                             Response: Yes   Response: No   Total
Grammatical (Yes Actual)     TP = 100        FN = 5         105
Ungrammatical (No Actual)    FP = 10         TN = 50        60
Total                        110             55             165

grammatical <- grammatical %>% 
  mutate(response = factor(response, levels = c("yes", "no")),
         grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))

## TP = True Positive (Hit); FP = False Positive; FN = False Negative; TN = True Negative


TP <- nrow(grammatical %>% 
             filter(grammaticality == "grammatical" &
                      response == "yes"))
FN <- nrow(grammatical %>% 
             filter(grammaticality == "grammatical" &
                      response == "no"))
FP <- nrow(grammatical %>% 
             filter(grammaticality == "ungrammatical" &
                      response == "yes"))
TN <- nrow(grammatical %>% 
             filter(grammaticality == "ungrammatical" &
                      response == "no"))
TP
[1] 100
FN
[1] 5
FP
[1] 10
TN
[1] 50
Total <- nrow(grammatical)
Total
[1] 165
(TP+TN)/Total # accuracy
[1] 0.9090909
(FP+FN)/Total # error, also 1-accuracy
[1] 0.09090909
# When stimulus = yes, how many times response = yes?
TP/(TP+FN) # also True Positive Rate or Sensitivity
[1] 0.952381
# When stimulus = no, how many times response = yes?
FP/(FP+TN) # False Positive Rate, also 1 - Specificity
[1] 0.1666667
# When stimulus = no, how many times response = no?
TN/(FP+TN) # True Negative Rate or Specificity
[1] 0.8333333
# When subject responds "yes" how many times is (s)he correct?
TP/(TP+FP) # precision
[1] 0.9090909
# getting dprime (or the sensitivity index);
# beta (bias criterion: around 1 = no bias, towards 0 = bias to respond "yes", above 1 = bias to respond "no");
# aprime (non-parametric estimate of discriminability: near 1 = good discrimination, near 0.5 = chance);
# bppd (b prime prime d, -1 to 1; 0 = no bias, negative = tendency to respond "yes", positive = tendency to respond "no");
# c (index of bias, expressed in SD units)
#(see also https://www.r-bloggers.com/compute-signal-detection-theory-indices-with-r/amp/) 
psycho::dprime(TP, FP, FN, TN, 
               n_targets = TP+FN, 
               n_distractors = FP+TN,
               adjust=F)
$dprime
[1] 2.635813

$beta
[1] 0.3970026

$aprime
[1] 0.9419643

$bppd
[1] -0.5076923

$c
[1] -0.3504848

The most important of the above is d-prime. It models the difference between the rate of “True Positive” responses and the rate of “False Positive” responses in standard units (z-scores). The formula can be written as:

d' (d prime) = Z(True Positive Rate) - Z(False Positive Rate)
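The formula can be checked by hand with qnorm(), which converts a rate into a z-score. The sketch below uses the four counts from the table above; the formulas for c and beta are the standard SDT definitions and are included as an assumption about what is computed without any adjustment for extreme values.

```r
# Counts from the table above
TP <- 100; FN <- 5; FP <- 10; TN <- 50

hit_rate <- TP / (TP + FN)   # True Positive Rate
fa_rate  <- FP / (FP + TN)   # False Positive Rate

# z-scores of the two rates
z_hit <- qnorm(hit_rate)
z_fa  <- qnorm(fa_rate)

dprime      <- z_hit - z_fa                  # sensitivity index
criterion_c <- -(z_hit + z_fa) / 2           # bias, in SD units
beta        <- exp((z_fa^2 - z_hit^2) / 2)   # likelihood-ratio bias

round(c(dprime = dprime, beta = beta, c = criterion_c), 4)
```

Note that a rate of exactly 0 or 1 gives an infinite z-score, which is the situation that adjustments for extreme values are designed to handle.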

4.4.3 GLM as a classification tool

The code below demonstrates the links between our GLM and what we obtained above from SDT. The table of predictions shows that our GLM yields predictions that are identical to our initial data setup: compare the table here with the table above. Once we have created our table of outcomes, we can compute percent correct, the specificity, the sensitivity, the Kappa score, etc. These functions return the value together with the SD that is related to variations in responses.

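## predict(mdl.glm.C, type = "response") > 0.5 is equivalent to predict(mdl.glm.C) > 0,
## i.e., a predicted probability above 0.5 corresponds to logodds above 0.
## With response levels c("yes", "no"), the model predicts the probability of "no".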
grammatical <- grammatical %>% 
  mutate(response = factor(response, levels = c("yes", "no")),
         grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))



mdl.glm.C <- grammatical %>% 
  glm(response ~ grammaticality, data = .,family = binomial)

tbl.glm <- table(grammatical$response, predict(mdl.glm.C, type = "response")>0.5)
colnames(tbl.glm) <- c("grammatical", "ungrammatical")
tbl.glm
     
      grammatical ungrammatical
  yes         100            10
  no            5            50
PresenceAbsence::pcc(tbl.glm)
PresenceAbsence::specificity(tbl.glm)
PresenceAbsence::sensitivity(tbl.glm)
###etc..

If you look at the results from SDT above, these results are the same as the following

Accuracy: (TP+TN)/Total (0.9090909)

True Positive Rate (or Sensitivity) TP/(TP+FN) (0.952381)

True Negative Rate (or Specificity) TN/(FP+TN) (0.8333333)
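Since these measures are all simple ratios of the four cell counts, they can be wrapped in a small helper. The function confusion_metrics below is a hypothetical convenience written for this illustration; it is not part of PresenceAbsence or psycho.

```r
# Hypothetical helper: classification measures from the four cell counts
confusion_metrics <- function(TP, FP, FN, TN) {
  total <- TP + FP + FN + TN
  c(accuracy            = (TP + TN) / total,
    sensitivity         = TP / (TP + FN),   # True Positive Rate
    specificity         = TN / (TN + FP),   # True Negative Rate
    false_positive_rate = FP / (FP + TN),   # 1 - specificity
    precision           = TP / (TP + FP))
}

m <- confusion_metrics(TP = 100, FP = 10, FN = 5, TN = 50)
round(m, 4)
```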

4.4.4 GLM and d prime

The values obtained here match those obtained from SDT. For d prime, the difference stems from the use of the logit variant of the binomial family. By using a probit variant, one obtains the same values (see here for more details). A probit model expresses differences in the outcome as z-scores, so its coefficient is a change in standard units. Here it models the change from “ungrammatical” to “grammatical” in the z-scored rate of the response, which has the same conceptual underpinnings as d-prime from Signal Detection Theory.

## d prime
psycho::dprime(TP, FP, FN, TN, 
               n_targets = TP+FN, 
               n_distractors = FP+TN,
               adjust=F)$dprime
[1] 2.635813
## GLM with probit
coef(glm(response ~ grammaticality, data = grammatical, family = binomial(probit)))[2]
grammaticalityungrammatical 
                   2.635813 

4.5 GLM: Other distributions

If your data does not fit a binomial distribution, you need a different specification. For count data, use the glm function with family = poisson. A multinomial outcome (i.e., three or more response categories) is not covered by a glm family and needs a dedicated function (e.g., multinom from the nnet package).
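As an illustration of the count-data case, the sketch below fits a Poisson GLM to simulated counts. The data, the variable names and the generating values (an intercept of 0.3 and a slope of 0.5 on the log scale) are assumptions made for this example only.

```r
set.seed(42)

# Simulated count outcome whose log mean depends linearly on x
n <- 500
x <- rnorm(n)
counts <- rpois(n, lambda = exp(0.3 + 0.5 * x))

fit_pois <- glm(counts ~ x, family = poisson)

# Coefficients are on the log scale ...
round(coef(fit_pois), 2)

# ... so exponentiating gives multiplicative effects on the expected count
round(exp(coef(fit_pois)), 2)
```

The logic mirrors the binomial case: the coefficients live on the scale of the link function (log instead of logit) and are back-transformed for interpretation.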

6 Linear Mixed-effects Models. Why random effects matter

Let’s generate a new dataframe that we will use later on for our mixed models

## Courtesy of Bodo Winter
set.seed(666)
#we create 6 subjects
subjects <- paste0('S', 1:6)
#here we add repetitions within speakers
subjects <- rep(subjects, each = 20)
items <- paste0('Item', 1:20)
#below repeats
items <- rep(items, 6)
#below generates 20 random values (scaled draws from an exponential distribution) that stand in for log frequencies
logFreq <- round(rexp(20)*5, 2)
#below we are repeating the logFreq 6 times to fit with the number of speakers and items
logFreq <- rep(logFreq, 6)
xdf <- data.frame(subjects, items, logFreq)
#below removes the individual variables we had created because they are already in the dataframe
rm(subjects, items, logFreq)

xdf$Intercept <- 300
submeans <- rep(rnorm(6, sd = 40), 20)
#sorting groups identical values together, so that each block of 20 rows (one subject) gets the same mean
submeans <- sort(submeans)
xdf$submeans <- submeans
#we create the same thing for items... we allow the items mean to vary between words...
itsmeans <- rep(rnorm(20, sd = 20), 6)
xdf$itsmeans <- itsmeans
xdf$error <- rnorm(120, sd = 20)
#here we create an effect column:
#for each unit of logFreq, duration decreases by 5
xdf$effect <- -5 * xdf$logFreq

xdf$dur <- xdf$Intercept + xdf$submeans + xdf$itsmeans + xdf$error + xdf$effect
#below is to subset the data and get only a few columns.. the -c(4:8) removes the columns 4 to 8..
xreal <- xdf[,-c(4:8)]
head(xreal)
rm(xdf, submeans, itsmeans)

6.1 Plots

Let’s start by doing a correlation test and plotting the data. Our results show that there is a negative correlation between duration and LogFrequency, and the plot shows this decrease.

corrMixed <- as.matrix(xreal[-c(1:2)]) %>% 
  rcorr(type="pearson")
print(corrMixed)
        logFreq   dur
logFreq    1.00 -0.54
dur       -0.54  1.00

n= 120 


P
        logFreq dur
logFreq          0 
dur      0         
corrplot(corrMixed$r, method = "circle", type = "upper", tl.srt = 45,
         addCoef.col = "black", diag = FALSE,
         p.mat = corrMixed$p, sig.level = 0.05)




ggplot.xreal <- xreal %>% 
  ggplot(aes(x = logFreq, y = dur)) +
  geom_point()+ theme_bw(base_size = 20) +
  labs(y = "Duration", x = "Frequency (Log)") +
  geom_smooth(method = lm, se=F)
ggplot.xreal
`geom_smooth()` using formula 'y ~ x'

6.2 Linear model

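Let’s run a simple linear model on the data. As we can see below, there are some issues with the “simple” linear model. We had set our SD for subjects to be 40, but the model cannot separate this from the other sources of variation: the residual standard error is 48.29, and the residuals range from about -94 to 124 (see histogram of residuals). The QQ Plot is not “normal”.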

mdl.lm.xreal <- xreal %>% 
  lm(dur ~ logFreq, data = .)
summary(mdl.lm.xreal)

Call:
lm(formula = dur ~ logFreq, data = .)

Residuals:
    Min      1Q  Median      3Q 
-94.322 -35.465  -4.364  33.020 
    Max 
123.955 

Coefficients:
            Estimate Std. Error
(Intercept) 337.9730     6.2494
logFreq      -5.4601     0.7846
            t value Pr(>|t|)    
(Intercept)  54.081  < 2e-16 ***
logFreq      -6.959 2.06e-10 ***
---
Signif. codes:  
  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
  0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 48.29 on 118 degrees of freedom
Multiple R-squared:  0.291, Adjusted R-squared:  0.285 
F-statistic: 48.43 on 1 and 118 DF,  p-value: 2.057e-10
hist(residuals(mdl.lm.xreal))

qqnorm(residuals(mdl.lm.xreal)); qqline(residuals(mdl.lm.xreal))

plot(fitted(mdl.lm.xreal), residuals(mdl.lm.xreal), cex = 4)

6.3 Linear Mixed Model

Our Linear Mixed-effects Model takes into account the random effects we added as well as our model specifications. We use Maximum Likelihood estimation (REML = FALSE), as this is what we will use for model comparison. The Linear Mixed Model reflects our model specifications: the SD of our subjects is picked up correctly (38.36, against the 40 we specified). The fixed-effect estimates are the same as in our linear model above, but their standard errors are larger. The coefficient for the “Intercept” is 337.973 and the coefficient for LogFrequency is -5.460. This indicates that for each unit of increase in LogFrequency, duration decreases by 5.460 (ms).

mdl.lmer.xreal <- xreal %>% 
  lmer(dur ~ logFreq  +(1|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal)
Linear mixed model fit by
  maximum likelihood . t-tests
  use Satterthwaite's method [
lmerModLmerTest]
Formula: 
dur ~ logFreq + (1 | subjects) + (1 | items)
   Data: .

     AIC      BIC   logLik 
  1105.8   1119.8   -547.9 
deviance df.resid 
  1095.8      115 

Scaled residuals: 
     Min       1Q   Median 
-2.06735 -0.60675  0.07184 
      3Q      Max 
 0.61122  2.39854 

Random effects:
 Groups   Name        Variance
 items    (Intercept)  589.8  
 subjects (Intercept) 1471.7  
 Residual              284.0  
 Std.Dev.
 24.29   
 38.36   
 16.85   
Number of obs: 120, groups:  
items, 20; subjects, 6

Fixed effects:
            Estimate Std. Error
(Intercept)  337.973     17.587
logFreq       -5.460      1.004
                 df t value
(Intercept)   9.126  19.218
logFreq      19.215  -5.436
            Pr(>|t|)    
(Intercept) 1.08e-08 ***
logFreq     2.92e-05 ***
---
Signif. codes:  
  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
  0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
        (Intr)
logFreq -0.322
hist(residuals(mdl.lmer.xreal))

qqnorm(residuals(mdl.lmer.xreal)); qqline(residuals(mdl.lmer.xreal))

plot(fitted(mdl.lmer.xreal), residuals(mdl.lmer.xreal), cex = 4)

6.4 Our second Mixed model

This second model adds a by-subject random slope. Random slopes allow the effect of the predictor to vary across the levels of a random effect (here, across subjects). An intercept-only model assumes the same (averaged) slope for all our participants.

mdl.lmer.xreal.2 <- xreal %>% 
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
boundary (singular) fit: see ?isSingular
summary(mdl.lmer.xreal.2)
Linear mixed model fit by
  maximum likelihood . t-tests
  use Satterthwaite's method [
lmerModLmerTest]
Formula: 
dur ~ logFreq + (logFreq | subjects) + (1 | items)
   Data: .

     AIC      BIC   logLik 
  1109.5   1129.0   -547.7 
deviance df.resid 
  1095.5      113 

Scaled residuals: 
    Min      1Q  Median      3Q 
-2.1087 -0.6067  0.0623  0.5828 
    Max 
 2.4564 

Random effects:
 Groups   Name        Variance 
 items    (Intercept) 5.897e+02
 subjects (Intercept) 1.400e+03
          logFreq     2.902e-02
 Residual             2.829e+02
 Std.Dev. Corr
 24.2838      
 37.4229      
  0.1704  1.00
 16.8196      
Number of obs: 120, groups:  
items, 20; subjects, 6

Fixed effects:
            Estimate Std. Error
(Intercept)  337.973     17.245
logFreq       -5.460      1.007
                 df t value
(Intercept)   9.093  19.598
logFreq      19.361  -5.424
            Pr(>|t|)    
(Intercept) 9.50e-09 ***
logFreq     2.92e-05 ***
---
Signif. codes:  
  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
  0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
        (Intr)
logFreq -0.267
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular
hist(residuals(mdl.lmer.xreal.2))

qqnorm(residuals(mdl.lmer.xreal.2)); qqline(residuals(mdl.lmer.xreal.2))

plot(fitted(mdl.lmer.xreal.2), residuals(mdl.lmer.xreal.2), cex = 4)
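The “boundary (singular) fit” message above, together with a random-effect correlation of 1.00, suggests that the random-effects structure is more complex than the data can support. The sketch below shows one way this could be inspected, using lme4’s isSingular() and VarCorr(). It runs on freshly simulated data with a similar design (6 subjects, 20 items, no true random slope); the data frame d and the generating values are assumptions for this example, so the numbers will not match xreal.

```r
library(lme4)
set.seed(1)

# 6 subjects x 20 items, one frequency value per item
d <- expand.grid(item = factor(1:20), subject = factor(1:6))
item_freq <- round(rexp(20) * 5, 2)
d$logFreq <- item_freq[as.integer(d$item)]

# Random intercepts for subjects and items, but NO random slope in the truth
subj_int <- rnorm(6, sd = 40)
item_int <- rnorm(20, sd = 20)
d$dur <- 300 + subj_int[as.integer(d$subject)] + item_int[as.integer(d$item)] -
  5 * d$logFreq + rnorm(nrow(d), sd = 20)

# Model with a by-subject random slope that the data do not require
m_slope <- lmer(dur ~ logFreq + (logFreq | subject) + (1 | item),
                data = d, REML = FALSE)

# TRUE means some variance is estimated at 0 or a correlation at +/-1
isSingular(m_slope)

# Inspect the estimated SDs and correlations of the random effects
VarCorr(m_slope)
```

A singular fit is not an error, but it is a hint that a simpler random-effects structure is worth comparing against.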

6.5 Model comparison

But where are our p values? The lme4 developers decided not to include p values due to various issues with estimating df (the p values in the summaries above come from lmerTest, which is loaded and uses Satterthwaite’s method). What we can do instead is to compare models. We need to create a null model to allow for significance testing. As expected, our predictor significantly improves the model fit (Chisq(1) = 17.892, p < 0.001).

mdl.lmer.xreal.Null <- xreal %>% 
  lmer(dur ~ 1 + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
boundary (singular) fit: see ?isSingular
anova(mdl.lmer.xreal.Null, mdl.lmer.xreal.2)
Data: .
Models:
mdl.lmer.xreal.Null: dur ~ 1 + (logFreq | subjects) + (1 | items)
mdl.lmer.xreal.2: dur ~ logFreq + (logFreq | subjects) + (1 | items)
                    npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
mdl.lmer.xreal.Null    6 1125.4 1142.1 -556.68   1113.4                         
mdl.lmer.xreal.2       7 1109.5 1129.0 -547.73   1095.5 17.892  1  2.339e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Also, do we really need random slopes? From the result below, we don’t seem to need random slopes at all, given that adding random slopes does not improve the model fit (Chisq(2) = 0.3788, p = 0.8274). I always recommend testing this. Most of the time I keep random slopes.

anova(mdl.lmer.xreal, mdl.lmer.xreal.2)
Data: .
Models:
mdl.lmer.xreal: dur ~ logFreq + (1 | subjects) + (1 | items)
mdl.lmer.xreal.2: dur ~ logFreq + (logFreq | subjects) + (1 | items)
                 npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
mdl.lmer.xreal      5 1105.8 1119.8 -547.92   1095.8                     
mdl.lmer.xreal.2    7 1109.5 1129.0 -547.73   1095.5 0.3788  2     0.8274
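The p values in these model comparisons come from a likelihood ratio test: the Chisq statistic is the difference in deviance between the two models, and Df is the difference in the number of parameters. As a rough check, the p values can be recomputed from the printed (rounded) values with pchisq(); the variable names below are illustrative.

```r
# Comparison 1: null model vs model with logFreq (1 extra parameter)
chisq_predictor <- 17.892
p_predictor <- pchisq(chisq_predictor, df = 1, lower.tail = FALSE)

# Comparison 2: random intercepts vs random slopes (2 extra parameters)
chisq_slopes <- 0.3788
p_slopes <- pchisq(chisq_slopes, df = 2, lower.tail = FALSE)

signif(c(predictor = p_predictor, random_slopes = p_slopes), 4)
```

Because the statistic is built from the log-likelihoods, the models being compared need to be fitted with Maximum Likelihood (REML = FALSE) when they differ in their fixed effects.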

But if you are really (really!!!) obsessed by p values, then you can also use lmerTest. BUT use it only after comparing models to evaluate the contribution of predictors.

mdl.lmer.xreal.lmerTest <- xreal %>% 
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = TRUE)
boundary (singular) fit: see ?isSingular
summary(mdl.lmer.xreal.lmerTest)
Linear mixed model fit by
  REML. t-tests use
  Satterthwaite's method [
lmerModLmerTest]
Formula: 
dur ~ logFreq + (logFreq | subjects) + (1 | items)
   Data: .

REML criterion at convergence: 
1086.1

Scaled residuals: 
     Min       1Q   Median 
-2.09691 -0.60118  0.06418 
      3Q      Max 
 0.58483  2.46245 

Random effects:
 Groups   Name        Variance 
 items    (Intercept)  629.5679
 subjects (Intercept) 1651.2357
          logFreq        0.0342
 Residual              282.8593
 Std.Dev. Corr
 25.0912      
 40.6354      
  0.1849  1.00
 16.8184      
Number of obs: 120, groups:  
items, 20; subjects, 6

Fixed effects:
            Estimate Std. Error
(Intercept)  337.973     18.526
logFreq       -5.460      1.038
                 df t value
(Intercept)   7.396   18.24
logFreq      18.136   -5.26
            Pr(>|t|)    
(Intercept) 2.03e-07 ***
logFreq     5.18e-05 ***
---
Signif. codes:  
  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
  0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
        (Intr)
logFreq -0.250
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular
detach("package:lmerTest", unload = TRUE)

6.6 Our final Mixed model

Our final model uses REML (or Restricted Maximum Likelihood Estimate of Variance Component) to estimate the model. Because lmerTest was detached above, the summary no longer reports df or p values.

mdl.lmer.xreal.Full <- xreal %>% 
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = TRUE)
boundary (singular) fit: see ?isSingular
summary(mdl.lmer.xreal.Full)
Linear mixed model fit by REML [
lmerMod]
Formula: 
dur ~ logFreq + (logFreq | subjects) + (1 | items)
   Data: .

REML criterion at convergence: 
1086.1

Scaled residuals: 
     Min       1Q   Median 
-2.09691 -0.60118  0.06418 
      3Q      Max 
 0.58483  2.46245 

Random effects:
 Groups   Name        Variance 
 items    (Intercept)  629.5679
 subjects (Intercept) 1651.2357
          logFreq        0.0342
 Residual              282.8593
 Std.Dev. Corr
 25.0912      
 40.6354      
  0.1849  1.00
 16.8184      
Number of obs: 120, groups:  
items, 20; subjects, 6

Fixed effects:
            Estimate Std. Error
(Intercept)  337.973     18.526
logFreq       -5.460      1.038
            t value
(Intercept)   18.24
logFreq       -5.26

Correlation of Fixed Effects:
        (Intr)
logFreq -0.250
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular
anova(mdl.lmer.xreal.Full)
Analysis of Variance Table
        npar Sum Sq Mean Sq
logFreq    1 7826.9  7826.9
        F value
logFreq  27.671
hist(residuals(mdl.lmer.xreal.Full))

qqnorm(residuals(mdl.lmer.xreal.Full)); qqline(residuals(mdl.lmer.xreal.Full))

plot(fitted(mdl.lmer.xreal.Full), residuals(mdl.lmer.xreal.Full), cex = 4)

6.7 Dissecting the model

coef(mdl.lmer.xreal.Full)
$items
       (Intercept)   logFreq
Item1     352.3567 -5.460115
Item10    331.7618 -5.460115
Item11    324.7269 -5.460115
Item12    350.2318 -5.460115
Item13    353.1174 -5.460115
Item14    311.8355 -5.460115
Item15    354.0591 -5.460115
Item16    353.9389 -5.460115
Item17    288.7843 -5.460115
Item18    362.4702 -5.460115
Item19    338.1424 -5.460115
Item2     325.1855 -5.460115
Item20    359.7414 -5.460115
Item3     370.1804 -5.460115
Item4     302.4265 -5.460115
Item5     350.0499 -5.460115
Item6     338.9482 -5.460115
Item7     362.8402 -5.460115
Item8     295.5943 -5.460115
Item9     333.0693 -5.460115

$subjects
   (Intercept)   logFreq
S1    314.4694 -5.567073
S2    303.9037 -5.615155
S3    314.2920 -5.567881
S4    318.4282 -5.549058
S5    373.3006 -5.299350
S6    403.4443 -5.162175

attr(,"class")
[1] "coef.mer"
fixef(mdl.lmer.xreal.Full)
(Intercept)     logFreq 
 337.973044   -5.460115 
fixef(mdl.lmer.xreal.Full)[1]
(Intercept) 
    337.973 
fixef(mdl.lmer.xreal.Full)[2]
  logFreq 
-5.460115 
coef(mdl.lmer.xreal.Full)$`subjects`[1]
coef(mdl.lmer.xreal.Full)$`subjects`[2]

coef(mdl.lmer.xreal.Full)$`items`[1]
coef(mdl.lmer.xreal.Full)$`items`[2]
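The relation between these functions is that coef() returns the fixed effects plus each group’s random-effect adjustment, i.e., fixef() + ranef(). The sketch below checks this on the sleepstudy data that ships with lme4, used here as a stand-in because it is available without any extra files; it is not the xreal model.

```r
library(lme4)

# Random intercepts and slopes by subject on the built-in sleepstudy data
m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

# Per-subject coefficients as returned by coef()
by_subject <- as.matrix(coef(m)$Subject)

# Rebuild them: add the fixed effects to each subject's random-effect adjustment
rebuilt <- sweep(as.matrix(ranef(m)$Subject), 2, fixef(m), "+")

all.equal(by_subject, rebuilt, check.attributes = FALSE)
head(by_subject, 3)
```

This is why the logFreq column is constant across items in the output above (items have no random slope) but varies across subjects.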

6.8 Using predictions from our model

In general, I use the prediction from my final model in any plots. To generate this, we can use the following

xreal <- xreal %>% 
  mutate(Pred_Dur = predict(mdl.lmer.xreal.Full))

xreal %>% 
  ggplot(aes(x = logFreq, y = Pred_Dur)) +
  geom_point() + theme_bw(base_size = 20) +
  labs(y = "Duration", x = "Frequency (Log)", title = "Predicted") +
  geom_smooth(method = lm, se = F) + coord_cartesian(ylim = c(200,450))
`geom_smooth()` using formula 'y ~ x'

## original plot
xreal %>% 
  ggplot(aes(x = logFreq , y = dur)) +
  geom_point() + theme_bw(base_size = 20)+
  labs(y = "Duration", x = "Frequency (Log)", title = "Original")+
  geom_smooth(method = lm, se = F) + coord_cartesian(ylim = c(200,450))
`geom_smooth()` using formula 'y ~ x'
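By default, predict() on an lmer model includes the random effects, so each prediction is specific to its subject (and item). If population-level predictions are wanted instead, lme4’s predict() accepts re.form = NA. A minimal sketch on the built-in sleepstudy data (a stand-in, not the xreal model):

```r
library(lme4)

m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Predictions that include each subject's random intercept (the default)
pred_subject <- predict(m)

# Population-level predictions: fixed effects only
pred_population <- predict(m, re.form = NA)

# The population-level predictions are just intercept + slope * predictor
by_hand <- fixef(m)[1] + fixef(m)[2] * sleepstudy$Days
all.equal(unname(pred_population), unname(by_hand))
```

Which of the two to plot depends on the question: subject-level predictions show the fit to the data, while population-level predictions show the average trend.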

6.9 GLMM and CLMM

The code above used Linear Mixed-effects Modelling, where the outcome was numeric. In some cases (as we have seen above), we may have:

  1. Binary outcome (binomial)
  2. Count data (poisson),
  3. Multi-category outcome (multinomial)
  4. Rating data (cumulative function)

The code below gives you an idea of how to specify these models


## Binomial family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=binomial)

## Poisson family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=poisson)

## Multinomial family
## a bit complicated as there is a need to use Bayesian approaches, see e.g., 
## glmmADMB
## mixcat
## MCMCglmm
## see https://gist.github.com/casallas/8263818

## Rating data, use following
## ordinal::clmm(outcome~predictor(s)+(1|subject)+(1|items)..., data=data)


## Remember to test for random effects and whether slopes are needed.
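To make the binomial template concrete, the sketch below fits a mixed-effects logistic regression to the cbpp data that ships with lme4 (disease incidence per herd and period). The dataset is only a runnable stand-in for an outcome ~ predictor + (1 | group) structure and has nothing to do with the data used in this document.

```r
library(lme4)

# Binomial GLMM: incidence out of herd size, random intercept by herd
gm <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
            data = cbpp, family = binomial)

# Fixed effects are logodds, as in the GLM section
round(fixef(gm), 3)

# Back-transform the intercept to a probability (period 1, average herd)
plogis(fixef(gm)[1])
```

Interpretation follows the earlier GLM sections: the coefficients are logodds, with the grouping structure handled by the random effects.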

7 Principal Component Analyses (PCA)

7.1 Read dataset

dfPharV2 <- read_csv("dfPharV2.csv")
Rows: 402 Columns: 24
-- Column specification ---------
Delimiter: ","
chr  (1): context
dbl (23): CPP, Energy, H1A1c,...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
dfPharV2
dfPharV2 <- dfPharV2 %>% 
  mutate(context = factor(context, levels = c("Non-Guttural", "Guttural")))

7.2 Model specification

We use the package FactoMineR to run our PCA. We use all acoustic measures as active variables, with context as our supplementary qualitative variable (quali.sup = 1).

pcaDat1 <- PCA(dfPharV2,
               quali.sup = 1, graph = TRUE,
               scale.unit = TRUE, ncp = 5) 
Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider increasing max.overlaps

7.3 Results

7.3.1 Summary of results

Based on the summary of results, we observe that the first 2 dimensions account for 63% of the variance in the data and the first 4 for 83%; these 4 each contribute individually more than 5% of the variance.

summary(pcaDat1)

Call:
PCA(X = dfPharV2, scale.unit = TRUE, ncp = 5, quali.sup = 1,  
     graph = TRUE) 


Eigenvalues
         Variance   % of var.   Cumulative % of var.
Dim.1       8.660      37.652                 37.652
Dim.2       5.868      25.512                 63.164
Dim.3       2.625      11.412                 74.577
Dim.4       2.017       8.769                 83.346
Dim.5       1.143       4.972                 88.317
Dim.6       0.749       3.259                 91.576
Dim.7       0.549       2.388                 93.964
Dim.8       0.419       1.820                 95.784
Dim.9       0.284       1.235                 97.019
Dim.10      0.220       0.957                 97.976
Dim.11      0.141       0.612                 98.588
Dim.12      0.110       0.479                 99.067
Dim.13      0.087       0.377                 99.444
Dim.14      0.057       0.246                 99.690
Dim.15      0.032       0.138                 99.828
Dim.16      0.015       0.066                 99.894
Dim.17      0.012       0.054                 99.948
Dim.18      0.008       0.036                 99.984
Dim.19      0.002       0.010                 99.994
Dim.20      0.001       0.006                100.000
Dim.21      0.000       0.000                100.000
Dim.22      0.000       0.000                100.000
Dim.23      0.000       0.000                100.000
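The percentages in the eigenvalue table are each eigenvalue divided by the sum of all eigenvalues. With standardised variables that sum equals the number of variables (23 here, so 8.660 / 23 gives the 37.652% of Dim.1). The sketch below reproduces this arithmetic with base R’s prcomp() on the built-in USArrests data; both the function and the dataset are swapped in for illustration and are not the FactoMineR analysis above.

```r
# PCA on standardised variables with base R (illustrative stand-in dataset)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eig <- pca$sdev^2

eig_table <- data.frame(
  dimension      = paste0("Dim.", seq_along(eig)),
  variance       = eig,
  pct_var        = 100 * eig / sum(eig),
  cumulative_pct = cumsum(100 * eig / sum(eig))
)
eig_table

# With standardised variables the eigenvalues sum to the number of variables
sum(eig)
```

A table like this makes it easy to apply whichever retention rule is preferred, such as a minimum percentage of variance per dimension.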

Individuals (the 10 first)
       Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2  
1  |  4.230 | -3.646  0.382  0.743 | -0.047  0.000  0.000 |  1.467  0.204  0.120 |
2  |  5.604 | -4.717  0.639  0.709 |  0.420  0.007  0.006 |  1.767  0.296  0.099 |
3  |  6.376 | -5.102  0.748  0.640 |  0.843  0.030  0.017 |  1.056  0.106  0.027 |
4  |  5.258 | -4.034  0.467  0.589 |  1.004  0.043  0.036 |  1.171  0.130  0.050 |
5  |  3.640 | -3.132  0.282  0.740 |  0.687  0.020  0.036 |  1.400  0.186  0.148 |
6  |  5.729 | -4.601  0.608  0.645 |  0.274  0.003  0.002 |  0.740  0.052  0.017 |
7  |  6.871 | -4.756  0.650  0.479 | -2.241  0.213  0.106 |  2.535  0.609  0.136 |
8  |  5.728 | -3.279  0.309  0.328 | -1.258  0.067  0.048 |  2.704  0.693  0.223 |
9  |  6.642 | -5.052  0.733  0.578 | -1.599  0.108  0.058 |  1.378  0.180  0.043 |
10 |  6.294 | -3.773  0.409  0.359 | -3.037  0.391  0.233 |  1.921  0.350  0.093 |

Variables (the 10 first)
                Dim.1    ctr
CPP          |  0.524  3.166
Energy       | -0.571  3.768
H1A1c        |  0.671  5.198
H1A2c        | -0.114  0.150
H1A3c        | -0.237  0.647
H1H2c        |  0.866  8.653
H2H4c        | -0.528  3.221
H2KH5Kc      | -0.116  0.157
H42Kc        | -0.380  1.668
HNR05        |  0.911  9.579
               cos2    Dim.2
CPP           0.274 |  0.236
Energy        0.326 | -0.127
H1A1c         0.450 |  0.045
H1A2c         0.013 | -0.874
H1A3c         0.056 |  0.591
H1H2c         0.749 | -0.159
H2H4c         0.279 |  0.116
H2KH5Kc       0.014 | -0.709
H42Kc         0.144 |  0.809
HNR05         0.830 |  0.044
                ctr   cos2  
CPP           0.945  0.055 |
Energy        0.274  0.016 |
H1A1c         0.034  0.002 |
H1A2c        13.012  0.764 |
H1A3c         5.946  0.349 |
H1H2c         0.432  0.025 |
H2H4c         0.229  0.013 |
H2KH5Kc       8.569  0.503 |
H42Kc        11.158  0.655 |
HNR05         0.033  0.002 |
              Dim.3    ctr
CPP           0.312  3.699
Energy        0.559 11.905
H1A1c         0.610 14.177
H1A2c         0.149  0.844
H1A3c         0.000  0.000
H1H2c         0.080  0.246
H2H4c         0.127  0.612
H2KH5Kc      -0.589 13.223
H42Kc         0.209  1.660
HNR05         0.297  3.359
               cos2  
CPP           0.097 |
Energy        0.312 |
H1A1c         0.372 |
H1A2c         0.022 |
H1A3c         0.000 |
H1H2c         0.006 |
H2H4c         0.016 |
H2KH5Kc       0.347 |
H42Kc         0.044 |
HNR05         0.088 |

Supplementary categories
                 Dist    Dim.1
Non-Guttural |  0.713 |  0.116
Guttural     |  0.880 | -0.143
               cos2 v.test  
Non-Guttural  0.026  0.875 |
Guttural      0.026 -0.875 |
              Dim.2   cos2
Non-Guttural  0.561  0.618
Guttural     -0.691  0.618
             v.test    Dim.3
Non-Guttural  5.147 | -0.207
Guttural     -5.147 |  0.256
               cos2 v.test  
Non-Guttural  0.085 -2.846 |
Guttural      0.085  2.846 |

7.3.2 Contribution of predictors and groups

Below, we look at the contributions of the main 5 dimensions.

dimdesc(pcaDat1, axes = 1:5, proba = 0.05)
$Dim.1
$quanti
        correlation
F0Bark    0.9471236
HNR25     0.9346079
soe       0.9284685
HNR35     0.9248084
HNR05     0.9108131
HNR15     0.8954107
H1H2c     0.8656253
H1A1c     0.6709032
CPP       0.5236515
H1A2c    -0.1137865
H2KH5Kc  -0.1164455
H1A3c    -0.2366755
H42Kc    -0.3800759
Z1mnZ0   -0.4268127
A1mnA2   -0.4963663
H2H4c    -0.5281247
Energy   -0.5712695
A1mnA3   -0.6262636
SHR      -0.6393517
              p.value
F0Bark  1.429483e-199
HNR25   1.131178e-181
soe     3.755961e-174
HNR35   5.569479e-170
HNR05   8.807770e-156
HNR15   1.220387e-142
H1H2c   3.099776e-122
H1A1c    6.724616e-54
CPP      1.098679e-29
H1A2c    2.250563e-02
H2KH5Kc  1.952233e-02
H1A3c    1.595378e-06
H42Kc    2.903273e-15
Z1mnZ0   3.150556e-19
A1mnA2   2.148508e-26
H2H4c    2.967330e-30
Energy   3.377633e-36
A1mnA3   3.576749e-45
SHR      1.394910e-47

attr(,"class")
[1] "condes" "list"  

$Dim.2
$quanti
        correlation
A2mnA3    0.9488972
Z2mnZ1    0.8611635
H42Kc     0.8091541
H1A3c     0.5906500
A1mnA3    0.4113725
CPP       0.2355070
HNR15     0.2195716
Z4mnZ3    0.1585383
H2H4c     0.1158268
Energy   -0.1267753
Z1mnZ0   -0.1396388
SHR      -0.1585845
H1H2c    -0.1591314
HNR35    -0.1742803
H2KH5Kc  -0.7090710
A1mnA2   -0.7561523
H1A2c    -0.8738035
Z3mnZ2   -0.9727026
              p.value
A2mnA3  1.862448e-202
Z2mnZ1  1.326843e-119
H42Kc    2.107251e-94
H1A3c    3.640784e-39
A1mnA3   7.545605e-18
CPP      1.801379e-06
HNR15    8.870306e-06
Z4mnZ3   1.427779e-03
H2H4c    2.018384e-02
Energy   1.095302e-02
Z1mnZ0   5.035009e-03
SHR      1.423139e-03
H1H2c    1.369225e-03
HNR35    4.475517e-04
H2KH5Kc  1.143697e-62
A1mnA2   1.141274e-75
H1A2c   2.589277e-127
Z3mnZ2  7.067910e-256

$quali
                R2      p.value
context 0.06606739 1.736324e-07

$category
                       Estimate
context=Non-Guttural  0.6260556
context=Guttural     -0.6260556
                          p.value
context=Non-Guttural 1.736324e-07
context=Guttural     1.736324e-07

attr(,"class")
[1] "condes" "list"  

$Dim.3
$quanti
        correlation
Z1mnZ0    0.8535706
H1A1c     0.6100189
Energy    0.5590167
CPP       0.3115902
HNR05     0.2969497
H42Kc     0.2087229
F0Bark    0.1702103
Z4mnZ3    0.1495364
H1A2c     0.1488647
H2H4c     0.1267249
A2mnA3   -0.1051933
HNR35    -0.2151263
HNR15    -0.2372089
A1mnA2   -0.2407206
HNR25    -0.2492222
Z2mnZ1   -0.3697651
A1mnA3   -0.4089538
H2KH5Kc  -0.5891288
              p.value
Z1mnZ0  2.491707e-115
H1A1c    2.445703e-42
Energy   2.020458e-34
CPP      1.684891e-10
HNR05    1.256987e-09
H42Kc    2.460412e-05
F0Bark   6.100220e-04
Z4mnZ3   2.649038e-03
H1A2c    2.770444e-03
H2H4c    1.098501e-02
A2mnA3   3.499609e-02
HNR35    1.355901e-05
HNR15    1.509029e-06
A1mnA2   1.042740e-06
HNR25    4.161621e-07
Z2mnZ1   1.802391e-14
A1mnA3   1.222863e-17
H2KH5Kc  6.329691e-39

$quali
                R2     p.value
context 0.02019995 0.004300124

$category
                       Estimate
context=Guttural      0.2315316
context=Non-Guttural -0.2315316
                         p.value
context=Guttural     0.004300124
context=Non-Guttural 0.004300124

attr(,"class")
[1] "condes" "list"  

$Dim.4
$quanti
       correlation       p.value
Z4mnZ3   0.8760843 8.580651e-129
H1A3c    0.7256396  6.212261e-67
A1mnA3   0.4631838  9.034769e-23
H1A2c    0.2978370  1.116376e-09
SHR      0.2447883  6.747376e-07
A2mnA3   0.1991310  5.808302e-05
A1mnA2   0.1779271  3.371233e-04
H1H2c    0.1679119  7.244124e-04
HNR35    0.1435091  3.934983e-03
HNR05    0.1197102  1.633473e-02
H1A1c    0.1181063  1.783878e-02
Z1mnZ0   0.1123235  2.430980e-02
Z3mnZ2   0.1060653  3.350479e-02
F0Bark   0.1056122  3.427281e-02
CPP     -0.1843072  2.026259e-04
Energy  -0.2139211  1.518911e-05
Z2mnZ1  -0.2771545  1.597946e-08

$quali
              R2      p.value
context 0.056498 1.434538e-06

$category
                       Estimate
context=Non-Guttural  0.3394259
context=Guttural     -0.3394259
                          p.value
context=Non-Guttural 1.434538e-06
context=Guttural     1.434538e-06

attr(,"class")
[1] "condes" "list"  

$Dim.5
$quanti
       correlation      p.value
H2H4c   0.67917208 1.098695e-55
CPP     0.58268792 6.373444e-38
HNR15   0.24837172 4.569075e-07
HNR25   0.18018109 2.821810e-04
HNR05   0.16392456 9.710453e-04
HNR35   0.15076346 2.439664e-03
SHR     0.09962685 4.590553e-02
F0Bark -0.13227856 7.916814e-03
H42Kc  -0.15107672 2.388688e-03
H1H2c  -0.33610840 4.506374e-12

attr(,"class")
[1] "condes" "list"  

$call
$call$num.var
[1] 1

$call$proba
[1] 0.05

$call$weights
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [15] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [29] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [43] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [57] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [85] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [99] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[113] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[127] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[155] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[169] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[183] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[197] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[225] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[239] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[253] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[267] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[281] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[295] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[309] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[323] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[337] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[351] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[365] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[379] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[393] 1 1 1 1 1 1 1 1 1 1

$call$X
NANA

7.3.3 Contribution of variables

We look next at the contribution of the top 10 predictors to each of the first 5 dimensions.
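To make the link between the correlations, cos2 and contribution (ctr) columns more concrete, here is a minimal sketch in base R. It uses prcomp on the built-in USArrests data as a stand-in (an assumption for illustration; it is neither the tutorial's dataset nor the FactoMineR call used above), and it assumes the usual definition of a contribution as 100 × cos2 / eigenvalue.

```r
# Standardised PCA on a built-in dataset (stand-in for pcaDat1)
pca <- prcomp(USArrests, scale. = TRUE)
eig <- pca$sdev^2                      # eigenvalue (variance) of each dimension

# Variable coordinates = correlation between each variable and each dimension
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2 = squared coordinate (quality of representation on that dimension)
cos2 <- coord^2

# Contribution (%) of each variable to each dimension
contrib <- 100 * sweep(cos2, 2, eig, `/`)

round(coord[, 1:2], 3)
round(contrib[, 1:2], 2)
colSums(contrib)                       # contributions sum to 100 per dimension
```

The values printed in the summary above appear consistent with this definition: for HNR05 on Dim.1, cos2 = 0.830 and ctr = 9.579, which implies an eigenvalue of about 8.67, i.e., roughly 37.7% of the total variance across 23 standardised variables.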

7.3.3.1 Dimension 1

fviz_contrib(pcaDat1, choice = "var", axes = 1, top = 10)

7.3.3.2 Dimension 2

fviz_contrib(pcaDat1, choice = "var", axes = 2, top = 10)

7.3.3.3 Dimension 3

fviz_contrib(pcaDat1, choice = "var", axes = 3, top = 10)

7.3.3.4 Dimension 4

fviz_contrib(pcaDat1, choice = "var", axes = 4, top = 10)

7.3.3.5 Dimension 5

fviz_contrib(pcaDat1, choice = "var", axes = 5, top = 10)

7.4 Plots

7.4.1 PCA Individuals

fviz_pca_ind(pcaDat1, col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
             )
Warning: ggrepel: 397 unlabeled data points (too many overlaps). Consider increasing max.overlaps

7.4.2 PCA Biplot 1:2

fviz_pca_biplot(pcaDat1, repel = TRUE, habillage = dfPharV2$context, addEllipses = TRUE, title = "PCA - Biplot")
Warning: ggrepel: 401 unlabeled data points (too many overlaps). Consider increasing max.overlaps
Warning: ggrepel: 6 unlabeled data points (too many overlaps). Consider increasing max.overlaps

7.4.3 PCA Biplot 3:4

fviz_pca_biplot(pcaDat1, axes = c(3, 4), repel = TRUE, habillage = dfPharV2$context, addEllipses = TRUE, title = "PCA - Biplot")
Warning: ggrepel: 397 unlabeled data points (too many overlaps). Consider increasing max.overlaps
Warning: ggrepel: 17 unlabeled data points (too many overlaps). Consider increasing max.overlaps

7.5 Clustering

fviz_pca_ind(pcaDat1,
             label = "none", # hide individual labels
             habillage = dfPharV2$context, # color by groups
             addEllipses = TRUE # Concentration ellipses
             )

7.6 3-D By Groups

coord <- pcaDat1$quali.sup$coord[1:2,0]
coord
            
Non-Guttural
Guttural    
#
with(pcaDat1, {
  s3d <- scatterplot3d(pcaDat1$quali.sup$coord[,1], pcaDat1$quali.sup$coord[,2], pcaDat1$quali.sup$coord[,3],        # x y and z axis
                       color=c("blue", "red"), pch=19,        # filled blue and red circles
                       type="h",                    # vertical lines to the x-y plane
                       main="PCA 3-D Scatterplot",
                       xlab="Dim1(37.7%)",
                       ylab="",
                       zlab="Dim3(11.4%)",
                       #xlim = c(-1.5, 1.5), ylim = c(-1.5, 1.5), zlim = c(-0.8, 0.8)
)
  s3d.coords <- s3d$xyz.convert(pcaDat1$quali.sup$coord[,1], pcaDat1$quali.sup$coord[,2], pcaDat1$quali.sup$coord[,3]) # convert 3D coords to 2D projection
  text(s3d.coords$x, s3d.coords$y,             # x and y coordinates
       labels=row.names(coord), col = c("blue", "red"),              # text to plot
       cex=1, pos=4)           # keep default text size and place to the right of points
})
dims <- par("usr")
x <- dims[1]+ 0.8*diff(dims[1:2])
y <- dims[3]+ 0.08*diff(dims[3:4])
text(x, y, "Dim2(25.5%)", srt = 25,col="black")

8 Decision Trees

Decision trees are a statistical tool that uses combinations of predictors to identify patterns in the data and to classify observations; the classification accuracy of the resulting model can then be evaluated.

The decision trees used here are conditional inference trees, which look at each predictor and split the data into multiple nodes (branches) through recursive partitioning in a tree-structured regression model. Each branch ends in a leaf (terminal node) that shows the difference between the levels of the outcome.

Decision trees via ctree do the following:

  1. Test the global null hypothesis of independence between the predictors and the outcome.
  2. Select the predictor with the strongest association with the outcome, measured by multiplicity-adjusted p-values (Bonferroni correction).
  3. Implement a binary split in the selected input variable.
  4. Recursively repeat steps 1, 2, and 3.
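The steps above can be sketched for a single node in base R. This is only an illustration under stated assumptions: the data are simulated (hypothetical variables x1 to x3), a Wilcoxon rank-sum test stands in for the permutation test that ctree actually uses, and the cut-point search uses a simple chi-squared statistic. It is not the party implementation.

```r
set.seed(1)
n <- 200
# Simulated data (hypothetical; not the tutorial's dataset)
outcome <- factor(rep(c("A", "B"), each = n / 2))
dat <- data.frame(
  x1 = rnorm(n, mean = ifelse(outcome == "A", 0, 1.5)),  # informative predictor
  x2 = rnorm(n),                                          # noise
  x3 = rnorm(n)                                           # noise
)

# Steps 1-2: one association test per predictor, then Bonferroni adjustment.
# (Stand-in test: Wilcoxon rank-sum instead of ctree's permutation test.)
p_raw <- sapply(dat, function(x) wilcox.test(x ~ outcome)$p.value)
p_adj <- p.adjust(p_raw, method = "bonferroni")

# The node is split only if the smallest adjusted p-value is below alpha
keep_splitting <- min(p_adj) < 0.05
best <- names(which.min(p_adj))

# Step 3: binary split on the selected predictor; choose the cut-point
# that gives the strongest association between "left/right" and the outcome
x <- dat[[best]]
cuts <- sort(unique(x))
cuts <- cuts[-length(cuts)]            # drop the maximum so both sides are non-empty
stat <- sapply(cuts, function(cp) {
  suppressWarnings(chisq.test(table(x <= cp, outcome))$statistic)
})
split_at <- cuts[which.max(stat)]

best
split_at
# Step 4 would repeat the same procedure within each of the two child nodes
```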

Let’s see this in an example using the same dataset. To understand what the decision tree is doing, we will dissect it, by creating one tree with one predictor and move to the next.

8.1 GLM as a classification tool

8.1.1 Model specification

We run a GLM with context as our outcome, and Z2-Z1 as our predictor. We want to evaluate whether the two classes can be separated when using the acoustic metric Z2-Z1. Context has two levels, and this will be considered as a binomial distribution.

mdl.glm.Z2mnZ1 <- dfPharV2 %>% 
  glm(context ~ Z2mnZ1, data = ., family = binomial)
summary(mdl.glm.Z2mnZ1)

Call:
glm(formula = context ~ Z2mnZ1, family = binomial, data = .)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2879  -1.1358  -0.8703   1.1538   1.4998  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.50112    0.23036   2.175 0.029605 *  
Z2mnZ1      -0.12281    0.03621  -3.391 0.000696 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 552.89  on 401  degrees of freedom
Residual deviance: 541.01  on 400  degrees of freedom
AIC: 545.01

Number of Fisher Scoring iterations: 4
tidy(mdl.glm.Z2mnZ1) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm.Z2mnZ1) %>% pull(estimate)

8.1.2 Plogis

The result above shows that a one-unit (1 Bark) increase in Z2-Z1 yields a statistically significant decrease of 0.123 in the log odds of the guttural class; the intercept is the log odds of the guttural class when Z2-Z1 equals 0. We can evaluate this further from a classification point of view, using plogis to convert log odds into probabilities.

# predicted probability of guttural when Z2-Z1 = 0
plogis(mycoef2[1])
[1] 0.622722
# predicted probability of guttural when Z2-Z1 = 1
plogis(mycoef2[1] + mycoef2[2])
[1] 0.5934644

This shows that the predicted probability of the guttural class drops from 62% to 59% when Z2-Z1 increases from 0 to 1 Bark. Let’s continue with this model further.
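The same plogis conversion can be applied over a range of Z2-Z1 values. The sketch below hard-codes the two coefficients from the summary output above; the grid of Z2-Z1 values is a hypothetical choice made only for illustration.

```r
# Coefficients as reported in the glm summary above
b0 <- 0.50112    # intercept
b1 <- -0.12281   # slope for Z2mnZ1

# Hypothetical grid of Z2-Z1 values (in Bark), for illustration only
z <- c(0, 1, 4, 7, 10)

# Predicted probability of the guttural class at each value
p_guttural <- plogis(b0 + b1 * z)
round(data.frame(Z2mnZ1 = z, p_guttural), 3)

# Z2-Z1 value at which the predicted probability crosses 0.5
boundary <- -b0 / b1
boundary
```

Under this model, a 0.5 probability threshold corresponds to a single cut-off on Z2-Z1 of roughly 4.08 Bark: smaller values are predicted as guttural, larger values as non-guttural.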

8.1.3 Model predictions

As above, we obtain predictions from the model. Because predict with type = "response" returns predicted probabilities rather than class labels, we need to assign a threshold. The threshold can be thought of as assigning any prediction with a probability of 50% or lower to one group (non-guttural), and any higher to the other (guttural).

pred.glm.Z2mnZ1 <- predict(mdl.glm.Z2mnZ1, type = "response")>0.5

tbl.glm.Z2mnZ1 <- table(pred.glm.Z2mnZ1, dfPharV2$context)
rownames(tbl.glm.Z2mnZ1) <- c("Non-Guttural", "Guttural")
tbl.glm.Z2mnZ1
               
pred.glm.Z2mnZ1 Non-Guttural Guttural
   Non-Guttural          167       75
   Guttural               55      105
# from PresenceAbsence
PresenceAbsence::pcc(tbl.glm.Z2mnZ1)
PresenceAbsence::specificity(tbl.glm.Z2mnZ1)
PresenceAbsence::sensitivity(tbl.glm.Z2mnZ1)

roc.glm.Z2mnZ1 <- pROC::roc(dfPharV2$context, as.numeric(pred.glm.Z2mnZ1))
Setting levels: control = Non-Guttural, case = Guttural
Setting direction: controls < cases
roc.glm.Z2mnZ1

Call:
roc.default(response = dfPharV2$context, predictor = as.numeric(pred.glm.Z2mnZ1))

Data: as.numeric(pred.glm.Z2mnZ1) in 222 controls (dfPharV2$context Non-Guttural) < 180 cases (dfPharV2$context Guttural).
Area under the curve: 0.6678
pROC::plot.roc(roc.glm.Z2mnZ1, legacy.axes = TRUE)

The model above was able to separate the two classes with an accuracy of 67.7%. It has a fairly low specificity (0.58, the proportion of gutturals correctly identified) but a higher sensitivity (0.75, the proportion of non-gutturals correctly identified). Looking at the confusion matrix, we observe that both groups were identified relatively accurately, but with relatively large errors (or confusions). The AUC is 0.67, which is not very high.
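For reference, these figures can be recomputed by hand from the confusion matrix printed above. The sketch below uses base R only and copies the four counts from that table; the variable names are hypothetical. It also shows that, when the ROC curve is built from hard class labels (a single threshold), the AUC reduces to the mean of the two per-class rates.

```r
# Confusion matrix from above: rows = predicted, columns = observed
cm <- matrix(c(167, 55, 75, 105), nrow = 2,
             dimnames = list(predicted = c("Non-Guttural", "Guttural"),
                             observed  = c("Non-Guttural", "Guttural")))

# Overall accuracy: correct predictions over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Proportion of observed non-gutturals predicted as non-guttural
rate_nonguttural <- cm["Non-Guttural", "Non-Guttural"] / sum(cm[, "Non-Guttural"])

# Proportion of observed gutturals predicted as guttural
rate_guttural <- cm["Guttural", "Guttural"] / sum(cm[, "Guttural"])

# With a single threshold, the AUC is the mean of the two rates
auc_hard <- (rate_nonguttural + rate_guttural) / 2

round(c(accuracy = accuracy,
        rate_nonguttural = rate_nonguttural,
        rate_guttural = rate_guttural,
        auc = auc_hard), 4)
```

Passing the predicted probabilities (rather than the thresholded TRUE/FALSE values) to the ROC function would evaluate all possible thresholds and generally gives a different AUC.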

Let’s continue with GLM to evaluate it further. We start by running a correlation test to evaluate issues with GLM.

8.2 Individual trees

8.2.1 Tree 1

## from the package party
set.seed(123456)
tree1 <- dfPharV2 %>% 
  ctree(
    context ~ Z2mnZ1, 
    data = .)
print(tree1)

     Conditional inference tree with 4 terminal nodes

Response:  context 
Input:  Z2mnZ1 
Number of observations:  402 

1) Z2mnZ1 <= 9.551456; criterion = 0.999, statistic = 11.678
  2) Z2mnZ1 <= 6.779068; criterion = 1, statistic = 12.368
    3) Z2mnZ1 <= 4.004879; criterion = 1, statistic = 56.773
      4)*  weights = 157 
    3) Z2mnZ1 > 4.004879
      5)*  weights = 106 
  2) Z2mnZ1 > 6.779068
    6)*  weights = 64 
1) Z2mnZ1 > 9.551456
  7)*  weights = 75 
plot(tree1, main = "Conditional Inference Tree ")

How to interpret this figure? Let’s look at mean values and a plot for this variable. This is the difference between F2 and F1 on the Bark scale. Because gutturals are produced within the pharynx (regardless of where), the prediction is that a high F1 and a low F2 will be the acoustic correlates of this constriction location. The closeness between these formants yields a lower Z2-Z1. Hence, the prediction is as follows: the smaller the difference, the more pharyngeal-like constriction these consonants have (all else being equal!). Let’s compute the mean/median and plot the difference between the two contexts.

dfPharV2 %>% 
  group_by(context) %>% 
  summarise(mean = mean(Z2mnZ1),
            median = median(Z2mnZ1), 
            count = n())

dfPharV2 %>% 
  ggplot(aes(x = context, y = Z2mnZ1)) + 
  geom_boxplot()  

The table above reports the mean and median of Z2-Z1 for both levels of context, and the plot shows the difference between the two. We have a total of 180 cases in the guttural and 222 in the non-guttural. When considering the conditional inference tree output, various splits were obtained. The first assigns any value higher than 9.55 to the non-guttural class (around 98% of 75 cases). Then, for values lower than or equal to 9.55, a second split was obtained with a threshold of 6.78: higher values are assigned to the guttural class (around 98% of 64 cases), and lower values were split again with a threshold of 4 Bark. In this third split, values lower than or equal to 4 Bark are assigned to the guttural class (around 70% of 157 cases) and values higher than 4 Bark to the non-guttural class (around 90% of 106 cases).

Dissecting the tree like this allows interpretation of the output. In this example, this is quite a complex case and ctree allowed us to fine-tune the different patterns seen with this single predictor. Now let’s look at the full dataset to make sense of how the combination of predictors contributes to the difference.

8.3 Model 1

8.3.1 Model specification

set.seed(123456)
fit <- dfPharV2  %>% 
  ctree(
    context ~ ., 
    data = .)
print(fit)

     Conditional inference tree with 8 terminal nodes

Response:  context 
Inputs:  CPP, Energy, H1A1c, H1A2c, H1A3c, H1H2c, H2H4c, H2KH5Kc, H42Kc, HNR05, HNR15, HNR25, HNR35, SHR, soe, Z1mnZ0, Z2mnZ1, Z3mnZ2, Z4mnZ3, F0Bark, A1mnA2, A1mnA3, A2mnA3 
Number of observations:  402 

1) A2mnA3 <= -13.78; criterion = 1, statistic = 42.329
  2) Z4mnZ3 <= 1.592125; criterion = 1, statistic = 40.991
    3) H2H4c <= -8.396333; criterion = 0.993, statistic = 13.141
      4)*  weights = 8 
    3) H2H4c > -8.396333
      5)*  weights = 100 
  2) Z4mnZ3 > 1.592125
    6) Energy <= 2.8295; criterion = 0.999, statistic = 16.923
      7)*  weights = 25 
    6) Energy > 2.8295
      8)*  weights = 10 
1) A2mnA3 > -13.78
  9) H1H2c <= 10.27167; criterion = 0.953, statistic = 9.458
    10) SHR <= 0.1566667; criterion = 1, statistic = 18.337
      11)*  weights = 99 
    10) SHR > 0.1566667
      12) H1H2c <= 0.7411667; criterion = 0.972, statistic = 10.449
        13)*  weights = 103 
      12) H1H2c > 0.7411667
        14)*  weights = 30 
  9) H1H2c > 10.27167
    15)*  weights = 27 
plot(fit, main = "Conditional Inference Tree")

How to interpret this complex decision tree?

Let’s obtain the mean value for each predictor grouped by context. Discuss some of the patterns.

dfPharV2 %>% 
  group_by(context) %>% 
  summarize_all(list(mean = mean))

We started with context as our outcome, and all 23 acoustic measures as predictors. A total of 8 terminal nodes were identified with multiple binary splits in their leaves, allowing separation of the two categories. Looking specifically at the output, we observe a few things.

The first node was based on A2*-A3*, detecting a difference between non-gutturals and gutturals. For the first binary split, a threshold of -13.78 was used (mean non-guttural = -7.86; mean guttural = -14.58). Then, for values lower than or equal to this threshold, a second split was performed using Z4-Z3 (mean non-guttural = 1.67; mean guttural = 1.43) with a threshold of 1.59; values smaller than or equal to 1.59 went through another binary split using H2*-H4*, etc.

Once done, the ctree provides multiple binary splits into guttural or non-guttural.

Any possible issues/interesting patterns you can identify? Look at the interactions between predictors.

8.3.2 Predictions from the full model

Let’s obtain some predictions from the model and evaluate how successful it is in dealing with the data.

set.seed(123456)
pred.ctree <- predict(fit)
tbl.ctree <- table(pred.ctree, dfPharV2$context)
tbl.ctree
              
pred.ctree     Non-Guttural Guttural
  Non-Guttural          194       41
  Guttural               28      139
PresenceAbsence::pcc(tbl.ctree)
PresenceAbsence::specificity(tbl.ctree)
PresenceAbsence::sensitivity(tbl.ctree)

roc.ctree <- pROC::roc(dfPharV2$context, as.numeric(pred.ctree))
Setting levels: control = Non-Guttural, case = Guttural
Setting direction: controls < cases
roc.ctree

Call:
roc.default(response = dfPharV2$context, predictor = as.numeric(pred.ctree))

Data: as.numeric(pred.ctree) in 222 controls (dfPharV2$context Non-Guttural) < 180 cases (dfPharV2$context Guttural).
Area under the curve: 0.823
pROC::plot.roc(roc.ctree, legacy.axes = TRUE)

This full model has a classification accuracy of 82.8%. This is not bad! It has a moderate specificity of 0.77 (at detecting the gutturals) but a high sensitivity of 0.87 (at detecting the non-gutturals). The ROC curve shows the relationship between the two, with an AUC of 0.823.

8.4 Random selection

One important issue is that the trees we grew above are biased. They are based on the full dataset, which means they are very likely to overfit the data. We did not add any random selection and we only grew one tree each time. If you think about it, is it possible that we obtained such results simply by chance?

What if we add some randomness in the process of creating a conditional inference tree?

We change a small option in ctree to allow for random selection of variables, to mimic what Random Forests will do. We use controls to specify mtry = 5, which is the rounded square root of number of predictors.

8.4.1 Model 2

set.seed(123456)
fit1 <- dfPharV2  %>% 
  ctree(
    context ~ ., 
    data = .,
    controls = ctree_control(mtry = 5))
plot(fit1, main = "Conditional Inference Tree")

pred.ctree1 <- predict(fit1)
tbl.ctree1 <- table(pred.ctree1, dfPharV2$context)
tbl.ctree1
              
pred.ctree1    Non-Guttural Guttural
  Non-Guttural          214       82
  Guttural                8       98
PresenceAbsence::pcc(tbl.ctree1)
PresenceAbsence::specificity(tbl.ctree1)
PresenceAbsence::sensitivity(tbl.ctree1)

roc.ctree1 <- pROC::roc(dfPharV2$context, as.numeric(pred.ctree1))
Setting levels: control = Non-Guttural, case = Guttural
Setting direction: controls < cases
roc.ctree1

Call:
roc.default(response = dfPharV2$context, predictor = as.numeric(pred.ctree1))

Data: as.numeric(pred.ctree1) in 222 controls (dfPharV2$context Non-Guttural) < 180 cases (dfPharV2$context Guttural).
Area under the curve: 0.7542
pROC::plot.roc(roc.ctree1, legacy.axes = TRUE)

Can you compare results between you and discuss what is going on?

When adding one random selection process to our ctree, we allow it to obtain more robust predictions. We could even go further and grow multiple small trees, each with a portion of the datapoints (e.g., 100 rows, 200 rows). When doing these multiple random selections, you are growing multiple trees that are decorrelated from each other. These become independent trees, and one can combine their results to come up with clear predictions.

This is how Random Forests work. You would start from a dataset, then grow multiple trees, vary number of observations used (nrow), and number of predictors used (mtry), adjust branches, and depth of nodes and at the end, combine the results in a forest. You can also run permutation tests to evaluate contributions of each predictor to the outcome. This is the beauty of Random Forests. They do all of these steps automatically at once for you!
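The idea can be sketched in base R. This is a toy illustration under stated assumptions, not what party or ranger do internally: the data are simulated, each "tree" is a hand-written one-split stump (helper functions fit_stump and predict_stump are hypothetical), and the values of ntree, mtry and fraction are chosen only for the example. Each stump is grown on a random subsample with a random subset of predictors, and each row is predicted only by the stumps that did not see it (its out-of-bag predictions).

```r
set.seed(42)
n <- 300
# Simulated data (hypothetical): two informative predictors, two noise predictors
y <- factor(rep(c("A", "B"), each = n / 2))
X <- data.frame(
  x1 = rnorm(n, mean = ifelse(y == "A", 0, 1.2)),
  x2 = rnorm(n, mean = ifelse(y == "A", 0, 0.8)),
  x3 = rnorm(n),
  x4 = rnorm(n)
)

# Fit a one-split tree (stump) using the best of the candidate predictors
fit_stump <- function(X, y, vars) {
  best <- list(acc = -Inf)
  for (v in vars) {
    for (cp in quantile(X[[v]], probs = seq(0.1, 0.9, by = 0.1))) {
      left  <- X[[v]] <= cp
      lab_l <- names(which.max(table(y[left])))    # majority class, left side
      lab_r <- names(which.max(table(y[!left])))   # majority class, right side
      acc   <- mean(ifelse(left, lab_l, lab_r) == y)
      if (acc > best$acc) {
        best <- list(var = v, cut = cp, left = lab_l, right = lab_r, acc = acc)
      }
    }
  }
  best
}
predict_stump <- function(s, X) ifelse(X[[s$var]] <= s$cut, s$left, s$right)

ntree <- 100; mtry <- 2; fraction <- 0.632
votes <- matrix(NA_character_, nrow = n, ncol = ntree)
for (b in seq_len(ntree)) {
  in_bag <- sample(n, size = floor(fraction * n))   # subsample without replacement
  vars   <- sample(names(X), mtry)                  # random subset of predictors
  stump  <- fit_stump(X[in_bag, ], y[in_bag], vars)
  oob    <- setdiff(seq_len(n), in_bag)             # rows this stump never saw
  votes[oob, b] <- predict_stump(stump, X[oob, ])
}

# Majority vote per row, using only the stumps for which the row was out-of-bag
oob_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
oob_acc  <- mean(oob_pred == y)
oob_acc
```

Because every prediction comes from trees that were not trained on that row, the out-of-bag accuracy gives a less optimistic estimate than predicting the same data a single tree was grown on.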

9 Random Forests

As their name indicates, a Random Forest is a forest of trees implemented through bagging ensemble algorithms. Each tree has multiple branches (nodes) and provides predictions based on recursive partitioning of the data. Then, using the predictions from the multiple grown trees, Random Forests create averaged predictions and come up with prediction accuracy, etc.

There are multiple packages that one can use to grow Random Forests:

  1. randomForest: The original implementation of Random Forests.
  2. party and partykit: using conditional inference trees as base learners
  3. ranger: a reimplementation of Random Forests; faster and more flexible than original implementation

The first implementation of Random Forests is widely used in research. One of the issues in this first implementation is that it favoured specific types of predictors (e.g., categorical predictors, predictors with multiple cut-offs, etc). Random Forests grown via Conditional Inference Trees as implemented in party guard against this bias, but they are computationally demanding. Random Forests grown via permutation tests as implemented in ranger speed up the computations and can mimic the unbiased selection process.

9.1 Declare parallel computing

We start by declaring parallel computing on your devices. This is essential to run these complex computations. The code below detects and uses all the cores available on your machine (8 in the output shown here). The computations in this tutorial are not too complex, but if you increase their complexity, you will need parallel computing.


set.seed(123456)

#Declare parallel computing 
ncores <- availableCores()
cat(paste0("Number of cores available for model calculations set to ", ncores, "."))
Number of cores available for model calculations set to 8.
registerDoFuture()
makeClusterPSOCK(ncores)
Socket cluster with 8 nodes where 8 nodes are on host ‘localhost’ (R version 4.1.2 (2021-11-01), platform x86_64-w64-mingw32)
plan(multisession)
ncores
system 
     8 
# below we register our random number generator. This will mostly be used within the tidymodels below. This allows replication of the results
# below to suppress any warnings from doFuture
options(doFuture.rng.onMisuse = "ignore")

9.2 Party

Random Forests grown via conditional inference trees are different from the original implementation. They offer an unbiased selection process that guards against overfitting of the data. There are various points we need to consider in growing the forest, including the number of trees and the number of predictors to use each time. Let us run our first Random Forest via conditional inference trees. To make sure the code runs as fast as it can, we use a very low number of trees: only 100. It is well known that the more trees you grow, the more confidence you have in the results, as model estimation will be more stable. In this example, I would easily go with 500 trees.

9.2.1 Model specification

To grow the forest, we use the function cforest. We use all of the dataset for the moment. We need to specify a few options within controls:

  1. ntree = 100 = number of trees to grow. Default = 500.
  2. mtry = round(sqrt(23)): number of predictors to use each time. Default is 5, but specifying it is advised to account for the structure of the data

By default, cforest_unbiased has two additional important options that are used for an unbiased selection process. WARNING: you should not change these unless you know what you are doing. Also, by default, each tree is grown on a random subsample of the data (the training or in-bag sample, roughly 2/3 of the data); the remaining observations (the testing or out-of-bag sample, roughly 1/3) are left out of that tree.

  1. replace = FALSE = Use subsampling with or without replacement. Default is FALSE, i.e., use subsets of the data without replacing these.
  2. fraction = 0.632 = Use 63.2% of the data to grow each tree.
set.seed(123456)
mdl.cforest <- dfPharV2 %>% 
  cforest(context ~ ., data = ., 
          controls = cforest_unbiased(ntree = 100, 
                                      mtry = round(sqrt(23))))

9.2.2 Predictions

To obtain predictions from the model, we use the predict function and add OOB = TRUE. This uses the out-of-bag sample (i.e., roughly 1/3 of the data for each tree).

set.seed(123456)
pred.cforest <- predict(mdl.cforest, OOB = TRUE)
tbl.cforest <- table(pred.cforest, dfPharV2$context)
tbl.cforest
              
pred.cforest   Non-Guttural Guttural
  Non-Guttural          203       40
  Guttural               19      140
PresenceAbsence::pcc(tbl.cforest)
PresenceAbsence::specificity(tbl.cforest)
PresenceAbsence::sensitivity(tbl.cforest)

roc.cforest <- pROC::roc(dfPharV2$context, as.numeric(pred.cforest))
Setting levels: control = Non-Guttural, case = Guttural
Setting direction: controls < cases
roc.cforest

Call:
roc.default(response = dfPharV2$context, predictor = as.numeric(pred.cforest))

Data: as.numeric(pred.cforest) in 222 controls (dfPharV2$context Non-Guttural) < 180 cases (dfPharV2$context Guttural).
Area under the curve: 0.8461
pROC::plot.roc(roc.cforest, legacy.axes = TRUE)

Compared with the 82.8% classification accuracy we obtained using ctree on our full dataset above (model 1), the confusion matrix here gives 85.3% (343 of 402 cases), an increase of 2.5 percentage points. Compared with the 77.6% (312 of 402 cases) from model 2, the ctree with random selection of predictors, we have an increase of 7.7 percentage points in classification accuracy!

We can test whether there is a statistically significant difference between our ctree and cforest models. Using the ROC curves, roc.test conducts DeLong's Z test on the correlated ROC curves. The results below show that the cforest model does not differ significantly from the full ctree (model 1; p = 0.31), but that it is a statistically significant improvement over the ctree with random selection of predictors (model 2; p < 0.001). An improvement is expected because we are growing 100 different trees, with random selection of both predictors and samples, and averaging their predictions.

pROC::roc.test(roc.ctree, roc.cforest)

    DeLong's test for two
    correlated ROC curves

data:  roc.ctree and roc.cforest
Z = -1.0148, p-value =
0.3102
alternative hypothesis: true difference in AUC is not equal to 0
95 percent confidence interval:
 -0.06756458  0.02146848
sample estimates:
AUC of roc1 AUC of roc2 
  0.8230480   0.8460961 
pROC::roc.test(roc.ctree1, roc.cforest)

    DeLong's test for two
    correlated ROC curves

data:  roc.ctree1 and roc.cforest
Z = -3.7128, p-value =
0.000205
alternative hypothesis: true difference in AUC is not equal to 0
95 percent confidence interval:
 -0.14040087 -0.04338292
sample estimates:
AUC of roc1 AUC of roc2 
  0.7542042   0.8460961 
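
The p values in the two outputs above follow directly from the Z statistics. The sketch below is only a reading aid in base R (the vector names are ours); it converts each Z into a two-sided p value under the standard normal distribution.

```r
# Z statistics copied from the two roc.test outputs above
z <- c(ctree_vs_cforest = -1.0148, ctree1_vs_cforest = -3.7128)

# Two-sided p value from the standard normal distribution
p <- 2 * pnorm(-abs(z))
signif(p, 3)
```

The first comparison stays well above the 0.05 threshold; the second falls far below it.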

9.2.3 Variable Importance Scores

One important feature of ctree was that it showed which predictor was used first in splitting the data, followed by the other predictors. We use a similar functionality with cforest to obtain variable importance scores that pinpoint strong and weak predictors.

There are two ways to obtain this:

  1. Simple permutation tests (conditional = FALSE)
  2. Conditional permutation tests (conditional = TRUE)

The former is generally comparable across packages and provides a standard permutation test; the latter runs the permutation test on a grid defined by the correlation matrix and so corrects for possible collinearity. This is similar to a regression analysis, but looks at both main effects and interactions.

You could use the standard varimp function as implemented in party, which uses mean decrease in accuracy. Here we obtain variable importance scores via AUC-based permutation tests, which take into account both accuracy and errors in the model, using varImpAUC from the varImp package.
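
To make the idea of a simple (non-conditional) permutation test concrete, here is a minimal sketch in base R. Everything in it is an assumption made for illustration: the toy data, the logistic regression standing in for the forest, and the use of in-sample accuracy rather than OOB AUC. It is not how party or varImp compute their scores.

```r
set.seed(1)

# Toy data (hypothetical): x1 carries the signal, x2 is pure noise
n   <- 200
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- factor(ifelse(x1 + rnorm(n, sd = 0.5) > 0, "Guttural", "Non-Guttural"))
toy <- data.frame(y, x1, x2)

# A simple classifier standing in for the forest
fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Proportion of correct classifications on a data frame
accuracy <- function(d) {
  prob <- predict(fit, newdata = d, type = "response")  # P(second level)
  pred <- ifelse(prob > 0.5, levels(d$y)[2], levels(d$y)[1])
  mean(pred == d$y)
}

# Importance = mean drop in accuracy after shuffling one predictor
perm_importance <- function(var, n_perm = 50) {
  baseline <- accuracy(toy)
  drops <- replicate(n_perm, {
    shuffled <- toy
    shuffled[[var]] <- sample(shuffled[[var]])
    baseline - accuracy(shuffled)
  })
  mean(drops)
}

imp <- sapply(c("x1", "x2"), perm_importance)
imp
```

Shuffling the informative predictor breaks its link with the outcome and accuracy drops sharply; shuffling the noise predictor changes almost nothing, so its score stays near 0.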

DANGER ZONE: the conditional permutation test requires a lot of RAM. Unless you have access to a cluster and/or a lot of RAM, do not attempt to run it. We run the non-conditional version here for demonstration.

9.2.3.1 Non-conditional permutation tests

set.seed(123456)
VarImp.cforest <- varImp::varImpAUC(mdl.cforest, conditional = FALSE)
Warning: closing unused connection 11 (<-DESKTOP-A9ARQR4:11846)
Warning: closing unused connection 10 (<-DESKTOP-A9ARQR4:11846)
Warning: closing unused connection 9 (<-DESKTOP-A9ARQR4:11846)
Warning: closing unused connection 8 (<-DESKTOP-A9ARQR4:11846)
Warning: closing unused connection 7 (<-DESKTOP-A9ARQR4:11846)
Warning: closing unused connection 6 (<-DESKTOP-A9ARQR4:11846)
Warning: closing unused connection 5 (<-DESKTOP-A9ARQR4:11846)
Warning: closing unused connection 4 (<-DESKTOP-A9ARQR4:11846)
lattice::barchart(sort(VarImp.cforest))

The variable importance scores from the non-conditional permutation tests show that A2*-A3* (i.e., energy in mid-high frequencies around F2 and F3) is the most important variable for explaining the difference between gutturals and non-gutturals, followed by Z4-Z3 (pharyngeal constriction), H1*-A3* (energy in mid-high frequency component), Z2-Z1 (degree of compactness), Z3-Z2 (spectral divergence), H1*-A2* (energy in mid frequency component) and Z1-Z0 (degree of openness). All other predictors contribute to the contrast, but to varying degrees (from H1*-H2* to H1*-A1*). The last 5 predictors are the least important; CPP has a mean decrease in accuracy of 0 and can even be ignored.

9.2.3.2 Conditional permutation tests

set.seed(123456)
VarImp.cforest <- varImp::varImpAUC(mdl.cforest, conditional = TRUE)
lattice::barchart(sort(VarImp.cforest))

9.2.4 Conclusion

The party package is powerful at growing Random Forests via conditional inference trees, but becomes computationally prohibitive when increasing the number of trees and when using conditional permutation tests for variable importance scores. We look next at the package ranger, chosen for its computational speed and flexibility.

9.3 Ranger

The ranger package offers a reimplementation of the original Random Forests algorithm, written in C++, that allows for parallel computing. It offers more flexibility in terms of model specification.

9.3.1 Model specification

In the model specification below, there are already a few options we are familiar with; the additional ones are described here:

  1. num.trees = Number of trees to grow. We use the default value (500)
  2. mtry = Number of predictors to try at each split. Default = floor(sqrt(Variables)). For compatibility with party, we use round(sqrt(23))
  3. replace = FALSE = Sample without replacement. Default is replace = TRUE, i.e., with replacement.
  4. sample.fraction = 0.632 = Use 63.2% of the data to grow each tree. When sampling with replacement, the default is the full dataset, i.e., sample.fraction = 1
  5. importance = "permutation" = Compute variable importance scores via permutation tests
  6. scale.permutation.importance = FALSE = whether to scale variable importance scores by their standard error. FALSE is the default in ranger; scaling is likely to introduce biases in variable importance estimation.
  7. splitrule = "extratrees" = rule used for splitting trees.
  8. num.threads = number of threads used for parallel computing. Here we pass ncores; this can be a single thread or all threads on your computer (or cluster).

We use options 2-7 to make sure we have an unbiased selection process with ranger. You can try running the model below with the defaults on your own to see how the classification rate increases further, with the caveat that the selection process is then biased.
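
Two of the numbers in the list above can be checked with simple arithmetic. The sketch below is base R only. The link between 0.632 and bootstrap sampling is a well-known property of sampling with replacement; whether it motivated the setting used here is our assumption.

```r
p <- 23                      # number of predictors in dfPharV2
c(floor = floor(sqrt(p)),    # ranger's default mtry
  round = round(sqrt(p)))    # value used here

# Expected share of distinct observations in a bootstrap sample
1 - exp(-1)

# Simulation with the sample size of our data (402)
set.seed(1)
n <- 402
unique_share <- replicate(2000, length(unique(sample(n, replace = TRUE))) / n)
mean(unique_share)
```

With 23 predictors, floor gives 4 and round gives 5, which is the Mtry value shown in the model output. Sampling 63.2% of the data without replacement gives each tree roughly as many distinct observations as a bootstrap sample of the full dataset would.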

set.seed(123456)
mdl.ranger <- dfPharV2 %>% 
  ranger(context ~ ., data = ., num.trees = 500, mtry = round(sqrt(23)),
         replace = FALSE, sample.fraction = 0.632, 
         importance = "permutation", scale.permutation.importance = FALSE,
         splitrule = "extratrees", num.threads = ncores)
mdl.ranger
Ranger result

Call:
 ranger(context ~ ., data = ., num.trees = 500, mtry = round(sqrt(23)),      replace = FALSE, sample.fraction = 0.632, importance = "permutation",      scale.permutation.importance = FALSE, splitrule = "extratrees",      num.threads = ncores) 

Type:                             Classification 
Number of trees:                  500 
Sample size:                      402 
Number of independent variables:  23 
Mtry:                             5 
Target node size:                 1 
Variable importance mode:         permutation 
Splitrule:                        extratrees 
Number of random splits:          1 
OOB prediction error:             7.21 % 

Results of our Random Forest show an OOB (Out-Of-Bag) error rate of 7.21%, i.e., an accuracy of 92.79%.

9.3.2 Going further

Unfortunately, predict() in ranger has no option comparable to OOB = TRUE in party, so we cannot obtain predictions from the OOB sample in the same way. We need to hard-code this. We split the data into a training and a testing set: training is on 2/3 of the data; testing is on the remaining 1/3.

9.3.2.1 Create a training and a testing set

set.seed(123456)
train.idx <- sample(nrow(dfPharV2), 2/3 * nrow(dfPharV2))
gutt.train <- dfPharV2[train.idx, ]
gutt.test <- dfPharV2[-train.idx, ]

9.3.2.2 Model specification

We use the same model specification as above, except that we use the training set and save the forest (with write.forest = TRUE).

set.seed(123456)
mdl.ranger2 <- gutt.train %>% 
  ranger(context ~ ., data = ., num.trees = 500, mtry = round(sqrt(23)),
         replace = FALSE, sample.fraction = 0.632, 
         importance = "permutation", scale.permutation.importance = FALSE,
         splitrule = "extratrees", num.threads = ncores, write.forest = TRUE)
mdl.ranger2
Ranger result

Call:
 ranger(context ~ ., data = ., num.trees = 500, mtry = round(sqrt(23)),      replace = FALSE, sample.fraction = 0.632, importance = "permutation",      scale.permutation.importance = FALSE, splitrule = "extratrees",      num.threads = ncores, write.forest = TRUE) 

Type:                             Classification 
Number of trees:                  500 
Sample size:                      268 
Number of independent variables:  23 
Mtry:                             5 
Target node size:                 1 
Variable importance mode:         permutation 
Splitrule:                        extratrees 
Number of random splits:          1 
OOB prediction error:             10.45 % 

With the training set, we have an OOB error rate of 10.45%, i.e., an accuracy rate of 89.55%.

9.3.2.3 Predictions

For the predictions, we use the testing set as a validation set. Because these are unseen data, not used in training, the results can be considered a truer reflection of the model's performance.

set.seed(123456)
pred.ranger2 <- predict(mdl.ranger2, data = gutt.test)
tbl.ranger2 <- table(pred.ranger2$predictions, gutt.test$context)
tbl.ranger2
              
               Non-Guttural
  Non-Guttural           68
  Guttural                6
              
               Guttural
  Non-Guttural        5
  Guttural           55
PresenceAbsence::pcc(tbl.ranger2)
PresenceAbsence::specificity(tbl.ranger2)
PresenceAbsence::sensitivity(tbl.ranger2)

roc.ranger <- pROC::roc(gutt.test$context, as.numeric(pred.ranger2$predictions))
Setting levels: control = Non-Guttural, case = Guttural
Setting direction: controls < cases
roc.ranger

Call:
roc.default(response = gutt.test$context, predictor = as.numeric(pred.ranger2$predictions))

Data: as.numeric(pred.ranger2$predictions) in 74 controls (gutt.test$context Non-Guttural) < 60 cases (gutt.test$context Guttural).
Area under the curve: 0.9178
pROC::plot.roc(roc.ranger, legacy.axes = TRUE)

The classification rate based on the testing set is 91.8% ((68 + 55) out of the 134 tokens in the confusion matrix above). This is somewhat higher than, but broadly comparable to, the rate we obtained with cforest. The changes in the settings allow for similarities in the predictions obtained from party and ranger.

9.3.2.4 Variable Importance Scores

9.3.2.4.1 Default

For the variable importance scores, we obtain them from either the training set or the full model above.

set.seed(123456)
lattice::barchart(sort(mdl.ranger2$variable.importance), main = "Variable Importance scores - training set")

lattice::barchart(sort(mdl.ranger$variable.importance), main = "Variable Importance scores - full set")

There are similarities between cforest and ranger, with minor differences. With ranger, Z2-Z1 is the best predictor at explaining the differences between gutturals and non-gutturals, followed by Z3-Z2 and then A2*-A3* (the reverse of cforest!). The order of the remaining predictors is slightly different between the two models. This is expected, as the cforest model only used 100 trees whereas the ranger model used 500 trees.

A clear difference between the packages party and ranger is that the former allows for conditional permutation tests for variable importance scores; this is absent from ranger. However, there is a debate in the literature on whether correlated data are harmful within Random Forests. The way Random Forests work, i.e., the randomness in the selection of data points, predictors, splitting rules, etc., allows the trees to be decorrelated from each other. Hence, the conditional permutation tests may not be required. What they do offer is to condition variable importance scores on each other (based on correlation tests) to mimic what a multiple regression analysis does (but without suffering from suppression!). Strong predictors will show a major contribution, while weak ones will be squashed, giving them extremely low (or even negative) scores. Within ranger, it is possible to evaluate this by estimating p values associated with each variable importance score. We use the altmann method. See the documentation for more details.

DANGER ZONE: This requires heavy computations. Use all cores on your machine or run it on the cluster. Recommendations are to use a minimum of 100 permutations, i.e., num.permutations = 100. Here, we only use 20 to show the output.

9.3.2.4.2 With p values
set.seed(123456)
VarImp.pval <- importance_pvalues(mdl.ranger2, method = "altmann",
                                  num.permutations = 20, 
                                  formula = context ~ ., data = gutt.train,
                                  num.threads = ncores)
VarImp.pval
         importance     pvalue
CPP     0.004484848 0.09523810
Energy  0.015979798 0.04761905
H1A1c   0.008363636 0.04761905
H1A2c   0.025292929 0.04761905
H1A3c   0.028080808 0.04761905
H1H2c   0.013313131 0.04761905
H2H4c   0.010747475 0.04761905
H2KH5Kc 0.011595960 0.04761905
H42Kc   0.015939394 0.04761905
HNR05   0.006303030 0.04761905
HNR15   0.012121212 0.04761905
HNR25   0.009353535 0.04761905
HNR35   0.010242424 0.04761905
SHR     0.013737374 0.09523810
soe     0.010767677 0.04761905
Z1mnZ0  0.030060606 0.04761905
Z2mnZ1  0.073070707 0.04761905
Z3mnZ2  0.040181818 0.04761905
Z4mnZ3  0.047171717 0.04761905
F0Bark  0.010646465 0.04761905
A1mnA2  0.014646465 0.04761905
A1mnA3  0.021313131 0.04761905
A2mnA3  0.037555556 0.04761905

Of course, the output above shows variable p values. The lowest is 0.048, shared by most predictors; CPP and SHR are at 0.095. With only 20 permutations, 0.048 (1/21) is the smallest p value the test can return. Recall that CPP received the lowest variable importance score within ranger and cforest. If you increase the permutations to 100 or 200, you will get more confidence in your results and can report the p values.
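
The two distinct p values in the table can be reproduced with a line of arithmetic. The sketch below assumes the usual form of a permutation p value, (k + 1) / (B + 1). The helper `perm_p` is hypothetical and not part of ranger, but the formula matches the values printed above.

```r
# Assumed form of the permutation p value: (k + 1) / (B + 1), where
# k = number of permuted importances >= the observed one, B = permutations
perm_p <- function(k, B) (k + 1) / (B + 1)

perm_p(0, 20)    # smallest possible value with 20 permutations
perm_p(1, 20)    # one permuted importance reached the observed score
perm_p(0, 100)   # smallest possible value with 100 permutations
```

This is why more permutations are needed before reporting p values: with 20 permutations, even a predictor that beats every permuted score cannot go below roughly 0.048.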

In the next part, we look at the tidymodels and introduce their philosophy.

9.4 Random forests with Tidymodels

The tidymodels are a bundle of packages used to streamline and simplify the use of machine learning. The tidymodels are not restricted to Random Forests, and you can even use them to run simple linear models, logistic regressions, PCA, Random Forests, Deep Learning, etc.

The tidymodels’ philosophy is to separate data processing on the training and testing sets and to use a workflow. Below is a full example of how one can run Random Forests via ranger using the tidymodels.

9.4.1 Training and testing sets

We start by creating a training and a testing set using the function initial_split. Using strata = "context" takes the structure of the data into account: the split preserves the proportions of each group in both sets.
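
The idea behind a stratified split can be sketched in base R as follows. This is an illustration only, not how rsample implements initial_split. The toy data frame and all object names are hypothetical; the group sizes (222 and 180) are taken from the outputs earlier in this document.

```r
set.seed(123456)

# Toy data with the same group sizes as dfPharV2
toy <- data.frame(
  context = factor(rep(c("Non-Guttural", "Guttural"), times = c(222, 180))),
  x       = rnorm(402)
)

# Sample 2/3 of the rows within each level of context
idx_by_class <- split(seq_len(nrow(toy)), toy$context)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(2/3 * length(i)))
}))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Group proportions are preserved in both sets
prop.table(table(train$context))
prop.table(table(test$context))
```

Sampling within each group guarantees that neither set ends up with too few tokens of one class by chance.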

set.seed(123456)
train_test_split <-
  initial_split(
    data = dfPharV2,
    strata = "context",
    prop = 0.667) 
train_test_split
<Analysis/Assess/Total>
<268/134/402>
train_tbl <- train_test_split %>% training() 
test_tbl  <- train_test_split %>% testing()

9.4.2 Set for cross-validation

We can (if we want to) create a 10-fold cross-validation on the training set. This allows us to fine-tune the training by obtaining the forest with the highest accuracy. This is a clear difference with ranger: while it is not impossible to hard-code that, tidymodels simplifies it for us!!

set.seed(123456)
train_cv <- vfold_cv(train_tbl, v = 10, strata = "context")
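
For readers who want to see what a 10-fold split amounts to, here is a minimal base-R sketch. It is not the rsample implementation and, unlike the call above, it does not stratify by context. All object names are ours.

```r
set.seed(123456)
n <- 268   # size of the training set
k <- 10    # number of folds

# Randomly assign each row to one of k folds of (almost) equal size
fold_id <- sample(rep(seq_len(k), length.out = n))
table(fold_id)

# For each fold: hold it out for assessment, keep the rest for analysis
splits <- lapply(seq_len(k), function(f) {
  list(analysis = which(fold_id != f), assessment = which(fold_id == f))
})
sapply(splits, function(s) length(s$assessment))
```

Each row is used for assessment exactly once and for model fitting in the other nine folds.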

9.4.3 Model Specification

Within the model specification, we need to specify multiple options:

  1. A recipe: any data processing one wants to apply to the data.
  2. An engine: We need to specify the engine to use. Here we want to run a Random Forest.
  3. A tuning: Here we can tune our engine
  4. A workflow: here we specify the various steps of the workflow

9.4.3.1 Recipe

When defining the recipe, you need to think of the type of “transformations” you will apply to your data.

  1. Z-scoring is the first thing that comes to mind. When you z-score the data, you are allowing all strong and weak predictors to be considered equally by the model. This is important as some of our predictors have very large differences related to the levels of context and have different measurement scales. We could have applied it above, but the centring and scaling values must be estimated on the training set only and then applied to the testing set (otherwise, our model suffers from data leakage)
  2. If you have any missing data, you can use central imputations to fill them in (random forests do not like missing data, though they can work with them).
  3. You can apply PCA on all your predictors to remove collinearity before running random forests. This is a great option to consider, but adds more complexity to your model.
  4. Finally, if you have categorical predictors, you can transform them into dummy variables using step_dummy(): 0s and 1s for binary; or use one-hot encoding with step_dummy(predictor, one_hot = TRUE)

See the tidymodels documentation for what you can apply!!
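
The point about data leakage in item 1 can be illustrated with a short base-R sketch. The data are simulated and the variable names hypothetical. The logic mirrors what preparing a recipe on the training set and baking it on new data is meant to achieve.

```r
set.seed(1)

# Hypothetical predictor measured in a training and a testing set
train_x <- rnorm(100, mean = 50, sd = 10)
test_x  <- rnorm(50,  mean = 50, sd = 10)

# Estimate centre and scale on the training set only
mu    <- mean(train_x)
sigma <- sd(train_x)

# Apply the same values to both sets
train_z <- (train_x - mu) / sigma
test_z  <- (test_x  - mu) / sigma

round(c(train_mean = mean(train_z), train_sd = sd(train_z),
        test_mean  = mean(test_z),  test_sd  = sd(test_z)), 3)
```

The training set ends up with mean 0 and standard deviation 1 by construction. The testing set will be close to these values but not identical, which is expected: no information from the testing set was used to compute the transformation.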

set.seed(123456)
recipe <-  
  train_tbl %>%
  recipe(context ~ .) %>%
  step_center(all_predictors(), -all_outcomes()) %>%
  step_scale(all_predictors(), -all_outcomes()) %>% 
  prep()

trainData_baked <- bake(recipe, new_data = train_tbl) # apply the prepared recipe to the training data
trainData_baked

Once we have prepared the recipe, we can bake it to see the changes applied to it.

9.4.3.2 Predictors remaining

box_fun_plot = function(data, x, y) {
  ggplot(data = data, aes(x = .data[[x]],
                          y = .data[[y]],
                          fill = .data[[x]])) +
    geom_boxplot() +
    labs(title = y,
         x = x,
         y = y) +
    theme_bw() +
    # theme() must come after theme_bw(): a complete theme resets earlier theme() settings
    theme(
      legend.position = "none"
    )
}

# Create vector of predictors
expl <- names(trainData_baked)[-(dim(trainData_baked)[2])]#step_corr

# Loop vector with map
expl_plots_box <- map(expl, ~box_fun_plot(data = trainData_baked, x = "context", y = .x) )
plot_grid(plotlist = expl_plots_box)

9.4.3.3 Setting the engine

We set the engine here as a rand_forest. We specify a classification mode. Then, we set an engine with engine specific parameters.

set.seed(123456)
engine_tidym <- rand_forest(
    mode = "classification",
    engine = "ranger",
    mtry = tune(),   # mark mtry for tuning
    trees = tune(),  # mark the number of trees for tuning
    min_n = 1
  ) %>% 
  set_engine("ranger", importance = "permutation", sample.fraction = 0.632,
             replace = FALSE, write.forest = T, splitrule = "extratrees",
             scale.permutation.importance = FALSE) # we add engine specific settings

9.4.3.4 Settings for tuning

If we want to tune the model, we use the grid defined below. It is important to use an mtry that hovers around round(sqrt(Variables)). If you use all available variables, then your forest is biased as it is able to see all predictors. For the number of trees, low numbers are not great, as you can easily underfit the data and not produce meaningful results. Large numbers are fine, and Random Forests do not overfit (in theory).

The full dataset has around 2000 observations and 23 predictors (well, even more, but let’s ignore that for the moment). I tuned mtry to be between 4 and 6, and trees to be between 1000 and 5000, with 30 candidate combinations. In total, with a 10-fold cross-validation, I grew 30 random forests on each fold, for a total of 300 Random Forests on the training set!!! This of course will take a loooooong time to compute on your computer if using one thread, so use parallel computing or a cluster. On the cluster with 20 cores, each with 11GB of RAM (220GB of RAM in total), it took around 260.442 seconds to run! Of course, with less RAM and fewer cores, the code will still run but will take longer.
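
The count of 300 forests follows from the size of the grid and the number of folds. Below is a minimal base-R sketch of a random grid over the ranges used in this demonstration. It only mimics the idea, and `toy_grid` is a hypothetical stand-in for the object created with the dials functions below.

```r
set.seed(123456)
n_candidates <- 30
n_folds      <- 10

# Hypothetical random grid: 30 draws of (mtry, trees) within the tuned ranges
toy_grid <- data.frame(
  mtry  = sample(4:6, n_candidates, replace = TRUE),
  trees = sample(1000:2000, n_candidates, replace = TRUE)
)
head(toy_grid)

# One forest per candidate per fold
nrow(toy_grid) * n_folds
```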

set.seed(123456)
gridy_tidym <- grid_random(
  mtry() %>% range_set(c(4, 6)),
  trees() %>% range_set(c(1000, 2000)),
  size = 30
  )

9.4.3.5 Workflow

Now we define the workflow adding the recipe and the model.

set.seed(123456)
wkfl_tidym <- workflow() %>% 
  add_recipe(recipe) %>% 
  add_model(engine_tidym)

9.4.3.6 Tuning and running model

Here we run the model starting with the workflow, the cross-validation sample, the tuning parameters and asking for specific metrics.

The model below will do the following:

  1. Use a 10-fold cross-validation on the training set
  2. Tune the hyper-parameters to reach the model with the best predictions
  3. Within each fold, grow 30 random forests; we have a total of 300 Random Forests, and we use an ROC-AUC based search for the best performing model

Of course, you could use a larger size to grow more forests, but this will take longer to run!

The model will run for about 2-3 minutes on an 8-core machine with 32GB of RAM. For demonstration purposes, the tuning of the number of trees is restricted to between 1000 and 2000 trees. This can of course be increased to 5000 trees (or more) depending on the size of the dataset.

set.seed(123456)
system.time(grid_tidym <- 
  tune_grid(wkfl_tidym, 
            resamples = train_cv,
            grid = gridy_tidym,
            metrics = metric_set(accuracy, roc_auc, sens, spec,f_meas, precision, recall),
            control = control_grid(save_pred = TRUE, parallel_over = NULL))
)
   user  system elapsed 
   2.29    0.36  217.36 
print(grid_tidym)
# Tuning results
# 10-fold cross-validation using stratification 

9.4.3.7 Finalise model

We obtain the best performing model from cross-validation, then finalise the workflow with its settings, fit it on the training set, and obtain its predictions on the testing set.

set.seed(123456)
collect_metrics(grid_tidym)
grid_tidym_best <- select_best(grid_tidym, metric = "roc_auc")
grid_tidym_best
wkfl_tidym_best <- finalize_workflow(wkfl_tidym, grid_tidym_best)
wkfl_tidym_final <- last_fit(wkfl_tidym_best, split = train_test_split)

9.4.4 Results

For the results, we can obtain various metrics on the training and testing sets.

9.4.4.1 Cross-validation on training set

9.4.4.1.1 Accuracy
percent(show_best(grid_tidym, metric = "accuracy", n = 1)$mean)
[1] "92%"
9.4.4.1.2 ROC-AUC
# Cross-validated training performance
show_best(grid_tidym, metric = "roc_auc", n = 1)$mean
[1] 0.97
9.4.4.1.3 Sensitivity
show_best(grid_tidym, metric = "sens", n = 1)$mean
[1] 0.9519048
9.4.4.1.4 Specificity
show_best(grid_tidym, metric = "spec", n = 1)$mean
[1] 0.8833333
9.4.4.1.5 F-measure
# Cross-validated training performance
show_best(grid_tidym, metric = "f_meas", n = 1)$mean
[1] 0.930738
9.4.4.1.6 Precision
# Cross-validated training performance
show_best(grid_tidym, metric = "precision", n = 1)$mean
[1] 0.9120031
9.4.4.1.7 Recall
# Cross-validated training performance
show_best(grid_tidym, metric = "recall", n = 1)$mean
[1] 0.9519048

9.4.4.2 Predictions testing set

9.4.4.2.1 Overall
wkfl_tidym_final$.metrics
[[1]]
9.4.4.2.2 Accuracy
#accuracy
percent(wkfl_tidym_final$.metrics[[1]]$.estimate[[1]])
[1] "90%"
9.4.4.2.3 ROC-AUC
#roc-auc
wkfl_tidym_final$.metrics[[1]]$.estimate[[2]]
[1] 0.9490991

9.4.4.3 Confusion Matrix testing set

wkfl_tidym_final$.predictions[[1]] %>%
  conf_mat(context, .pred_class) %>%
  pluck(1) %>%
  as_tibble() %>%
  group_by(Truth) %>% # group by Truth to compute percentages
  mutate(prop =percent(prop.table(n))) %>% # calculate percentages row-wise
  ggplot(aes(Prediction, Truth, alpha = prop)) +
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = prop), colour = "white", alpha = 1, size = 8)

9.4.4.4 Variable Importance

9.4.4.4.1 Best 10
vip(extract_fit_parsnip(wkfl_tidym_final$.workflow[[1]]))

9.4.4.4.2 All predictors
vip(extract_fit_parsnip(wkfl_tidym_final$.workflow[[1]]), num_features = 23)

9.4.4.5 Gains curves

This is an interesting feature that shows how much is gained when looking at various portions of the data. We see a gradual increase in the values. When 50% of the data were tested, around 83% of the results within the non-guttural class were already identified. The more data were tested, the more confident we are in the results, and when 84.96% of the data were tested, 100% of the cases were found.

wkfl_tidym_final$.predictions[[1]] %>%
  gain_curve(context, `.pred_Non-Guttural`) %>%
  autoplot() 
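
To see where the points of a gains curve come from, here is a minimal base-R sketch on simulated data. The class sizes mirror the testing set, but the scores are hypothetical, so the percentages will differ from those in the plot above. This is not the yardstick implementation.

```r
set.seed(1)

# Hypothetical test set: 74 target tokens (coded 1) and 60 others (coded 0)
truth <- rep(c(1, 0), times = c(74, 60))
n     <- length(truth)

# Hypothetical predicted probabilities of the target class
score <- ifelse(truth == 1, rbeta(n, 5, 2), rbeta(n, 2, 5))

# Rank tokens from most to least confident, then accumulate the hits
ranked     <- order(score, decreasing = TRUE)
pct_tested <- 100 * seq_len(n) / n
pct_found  <- 100 * cumsum(truth[ranked]) / sum(truth)

# Share of the target class found after testing half of the data
pct_found[which.min(abs(pct_tested - 50))]
```

The curve is the plot of pct_found against pct_tested: the faster it rises, the better the model ranks the target class ahead of the other class.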

9.4.4.6 ROC Curves

wkfl_tidym_final$.predictions[[1]] %>%
  roc_curve(context, `.pred_Non-Guttural`) %>%
  autoplot() 

10 session info

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats4    grid      stats    
[4] graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
 [1] ordinal_2019.12-10   
 [2] psycho_0.6.1         
 [3] cowplot_1.1.1        
 [4] scatterplot3d_0.3-41 
 [5] RColorBrewer_1.1-2   
 [6] factoextra_1.0.7     
 [7] FactoMineR_2.4       
 [8] languageR_1.5.0      
 [9] PresenceAbsence_1.1.9
[10] ggsignif_0.6.3       
[11] emmeans_1.7.0        
[12] vip_0.3.2            
[13] varImp_0.4           
[14] measures_0.3         
[15] pROC_1.18.0          
[16] yardstick_0.0.8      
[17] workflowsets_0.1.0   
[18] workflows_0.2.4      
[19] tune_0.1.6           
[20] rsample_0.1.0        
[21] recipes_0.1.17       
[22] parsnip_0.1.7        
[23] modeldata_0.1.1      
[24] infer_1.0.0          
[25] dials_0.0.10         
[26] scales_1.1.1         
[27] tidymodels_0.1.4     
[28] doFuture_0.12.0      
[29] future_1.23.0        
[30] foreach_1.5.1        
[31] ranger_0.13.1        
[32] party_1.3-9          
[33] strucchange_1.5-2    
[34] sandwich_3.0-1       
[35] zoo_1.8-9            
[36] modeltools_0.2-23    
[37] mvtnorm_1.1-3        
[38] lme4_1.1-27.1        
[39] Matrix_1.3-4         
[40] corrplot_0.90        
[41] Hmisc_4.6-0          
[42] Formula_1.2-4        
[43] survival_3.2-13      
[44] lattice_0.20-45      
[45] knitr_1.36           
[46] broom_0.7.10         
[47] forcats_0.5.1        
[48] stringr_1.4.0        
[49] dplyr_1.0.7          
[50] purrr_0.3.4          
[51] readr_2.0.2          
[52] tidyr_1.1.4          
[53] tibble_3.1.5         
[54] ggplot2_3.3.5        
[55] tidyverse_1.3.1      

loaded via a namespace (and not attached):
  [1] utf8_1.2.2         
  [2] tidyselect_1.1.1   
  [3] htmlwidgets_1.5.4  
  [4] munsell_0.5.0      
  [5] codetools_0.2-18   
  [6] DT_0.19            
  [7] withr_2.4.2        
  [8] colorspace_2.0-2   
  [9] highr_0.9          
 [10] rstudioapi_0.13    
 [11] leaps_3.1          
 [12] listenv_0.8.0      
 [13] labeling_0.4.2     
 [14] bit64_4.0.5        
 [15] DiceDesign_1.9     
 [16] farver_2.1.0       
 [17] coda_0.19-4        
 [18] parallelly_1.28.1  
 [19] vctrs_0.3.8        
 [20] generics_0.1.1     
 [21] TH.data_1.1-0      
 [22] ipred_0.9-12       
 [23] xfun_0.27          
 [24] R6_2.5.1           
 [25] lhs_1.1.3          
 [26] assertthat_0.2.1   
 [27] vroom_1.5.5        
 [28] multcomp_1.4-17    
 [29] nnet_7.3-16        
 [30] gtable_0.3.0       
 [31] globals_0.14.0     
 [32] timeDate_3043.102  
 [33] rlang_0.4.12       
 [34] splines_4.1.2      
 [35] rstatix_0.7.0      
 [36] checkmate_2.0.0    
 [37] abind_1.4-5        
 [38] yaml_2.2.1         
 [39] modelr_0.1.8       
 [40] backports_1.3.0    
 [41] tools_4.1.2        
 [42] lava_1.6.10        
 [43] ellipsis_0.3.2     
 [44] Rcpp_1.0.7         
 [45] plyr_1.8.6         
 [46] base64enc_0.1-3    
 [47] ggpubr_0.4.0       
 [48] rpart_4.1-15       
 [49] haven_2.4.3        
 [50] ggrepel_0.9.1      
 [51] cluster_2.1.2      
 [52] fs_1.5.0           
 [53] furrr_0.2.3        
 [54] magrittr_2.0.1     
 [55] data.table_1.14.2  
 [56] openxlsx_4.2.4     
 [57] reprex_2.0.1       
 [58] GPfit_1.0-8        
 [59] matrixStats_0.61.0 
 [60] hms_1.1.1          
 [61] evaluate_0.14      
 [62] xtable_1.8-4       
 [63] rio_0.5.27         
 [64] jpeg_0.1-9         
 [65] readxl_1.3.1       
 [66] gridExtra_2.3      
 [67] compiler_4.1.2     
 [68] crayon_1.4.2       
 [69] minqa_1.2.4        
 [70] htmltools_0.5.2    
 [71] mgcv_1.8-38        
 [72] tzdb_0.2.0         
 [73] libcoin_1.0-9      
 [74] lubridate_1.8.0    
 [75] DBI_1.1.1          
 [76] dbplyr_2.1.1       
 [77] MASS_7.3-54        
 [78] boot_1.3-28        
 [79] car_3.0-11         
 [80] cli_3.1.0          
 [81] parallel_4.1.2     
 [82] gower_0.2.2        
 [83] pkgconfig_2.0.3    
 [84] flashClust_1.01-2  
 [85] numDeriv_2016.8-1.1
 [86] coin_1.4-2         
 [87] foreign_0.8-81     
 [88] xml2_1.3.2         
 [89] hardhat_0.1.6      
 [90] estimability_1.3   
 [91] prodlim_2019.11.13 
 [92] rvest_1.0.2        
 [93] digest_0.6.28      
 [94] rmarkdown_2.11     
 [95] cellranger_1.1.0   
 [96] htmlTable_2.3.0    
 [97] curl_4.3.2         
 [98] nloptr_1.2.2.2     
 [99] lifecycle_1.0.1    
[100] nlme_3.1-153       
[101] jsonlite_1.7.2     
[102] carData_3.0-4      
[103] fansi_0.5.0        
[104] pillar_1.6.4       
[105] fastmap_1.1.0      
[106] httr_1.4.2         
[107] glue_1.4.2         
[108] zip_2.2.0          
[109] png_0.1-7          
[110] iterators_1.0.13   
[111] bit_4.0.4          
[112] class_7.3-19       
[113] stringi_1.7.5      
[114] latticeExtra_0.6-29
[115] ucminf_1.1-4       
[116] future.apply_1.8.1 
---
title: "Session 5: Advanced statistical analyses"
author:
  name: Jalal Al-Tamimi
  affiliation: Université Paris Cité
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  html_notebook:
    highlight: pygments
    number_sections: yes
    toc: yes
    toc_depth: 6
    toc_float:
      collapsed: yes
      fig_crop: no
---

# Loading packages 

```{r warning=FALSE, message=FALSE, error=FALSE}
## Use the code below to check if you have all required packages installed. If some are not installed already, the code below will install these. If you have all packages installed, then you could load them with the second code.
requiredPackages = c('tidyverse', 'broom', 'knitr', 'Hmisc', 'corrplot', 'lme4', 'lmerTest', 'party', 'ranger','doFuture',  'tidymodels', 'pROC', 'varImp', 'lattice', 'vip', 'emmeans', 'ggsignif', 'PresenceAbsence', 'languageR', 'FactoMineR', 'factoextra', 'RColorBrewer', 'scatterplot3d', 'cowplot', 'psycho', 'ordinal')
for(p in requiredPackages){
  if(!require(p,character.only = TRUE)) install.packages(p)
  library(p,character.only = TRUE)
}

```


# Correlation tests {.tabset .tabset-fade .tabset-pills}

## Basic correlations

Let us start with a basic correlation test. We want to evaluate if two numeric variables are correlated with each other.

We use the function `cor` to obtain the Pearson correlation and `cor.test` to run a basic correlation test on our data with significance testing.




```{r}
cor(english$RTlexdec, english$RTnaming, method = "pearson")
cor.test(english$RTlexdec, english$RTnaming)
```

What are these results telling us? There is a positive correlation between `RTlexdec` and `RTnaming`. The correlation coefficient (r) is 0.76 (limits between -1 and 1). This correlation is statistically significant with a t value of 78.699, degrees of freedom of 4566 and a p-value < 2.2e-16. 

What are the degrees of freedom? These relate to the total number of observations minus the number of comparisons. Here we have 4568 observations in the dataset, and two comparisons, hence 4568 - 2 = 4566.

For the p value, there is a threshold we usually use: p = 0.05. The p value is the probability of obtaining a result at least as extreme as the one observed if there were in fact no correlation; when it is at 5% or lower, we treat the result as statistically significant. In our case, the p value is lower than 2.2e-16. How to interpret this number? It tells us to add 15 0s before the 22, i.e., 0.00000000000000022. This probability is very (very!!) low. So we conclude that there is a statistically significant correlation between the two variables.
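
The value 2.2e-16 is not specific to these data: it is the smallest difference R can represent around 1 (the machine precision), and R prints `< 2.2e-16` whenever a p value falls below it. A small base-R sketch to see both forms of the number:

```r
# The bound printed by R is the machine precision for doubles
.Machine$double.eps

# The same number written out in full
sprintf("%.17f", 2.2e-16)
```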


The formula to calculate the t value is below. 

![](t-score.jpg)


x̄ = sample mean
μ0 = population mean
s = sample standard deviation
n = sample size
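
The legend above describes the general one-sample form of the t statistic. For a Pearson correlation, the t value reported by `cor.test` is usually written in terms of the correlation itself, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below is a check in base R using only the numbers reported in the output above; the object names are ours.

```r
t_value <- 78.699   # reported by cor.test
df      <- 4566     # n - 2

# Correlation implied by t and df
r <- t_value / sqrt(t_value^2 + df)
r

# And back again: t = r * sqrt(df) / sqrt(1 - r^2)
r * sqrt(df) / sqrt(1 - r^2)
```

The implied correlation rounds to 0.76, in line with the coefficient returned by `cor`.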

The p value is influenced by various factors: number of observations, strength of the difference, mean values, etc. You should always be careful when interpreting p values, taking everything else into account.


## Using the package `corrplot`

Above, we did a correlation test on two predictors. 
What if we want to obtain a nice plot of all numeric predictors and add significance levels? 

### Correlation plots

```{r fig.height=6}
corr <- 
  english %>% 
  select(where(is.numeric)) %>% 
  cor()
print(corr)
corrplot(corr, method = 'ellipse', type = 'upper')

```



### More advanced

Let's first compute the correlations between all numeric variables and plot these with the p values

```{r fig.height=15}
## correlation using "corrplot"
## based on the function `rcorr` from the `Hmisc` package
## Need to change dataframe into a matrix
corr <- 
  english %>% 
  select(where(is.numeric)) %>% 
  as.matrix() %>% 
  rcorr(type = "pearson")
print(corr)
# use corrplot to obtain a nice correlation plot!
corrplot(corr$r, p.mat = corr$P,
         addCoef.col = "black", diag = FALSE, type = "upper", tl.srt = 55)
```


```{r}
english %>% 
  group_by(AgeSubject) %>% 
  summarise(mean = mean(RTlexdec),
            sd = sd(RTlexdec))
```

# Linear Models {.tabset .tabset-fade .tabset-pills}

Up to now, we have looked at descriptive statistics, and evaluated summaries and correlations in the data (with p values).

We are now interested in looking at group differences. 


## Introduction

A linear model fits a regression line to the data. We have an outcome (or dependent variable) and a predictor (or independent variable). The formula of a linear model is `outcome ~ predictor`, which can be read as "outcome as a function of the predictor". We can add "1" to specify an intercept explicitly, but it is added to the model by default.
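
The formula notation can be sketched with the built-in `cars` data (used here only for illustration; the object names are made up). The first two models are the same model written in two ways, and the third shows how the intercept can be removed with `0`.

```{r}
# intercept added by default
ex_implicit <- lm(dist ~ speed, data = cars)
# intercept written explicitly with "1": same model
ex_explicit <- lm(dist ~ 1 + speed, data = cars)
coef(ex_implicit)
coef(ex_explicit)

# intercept removed with "0": only the slope is estimated
ex_no_int <- lm(dist ~ 0 + speed, data = cars)
coef(ex_no_int)
```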

### Model estimation

```{r}
english2 <- english %>% 
  mutate(AgeSubject = factor(AgeSubject, levels = c("young", "old")))
mdl.lm <- english2 %>% 
  lm(RTlexdec ~ AgeSubject, data = .)
#lm(RTlexdec ~ AgeSubject, data = english)
mdl.lm #also print(mdl.lm)
summary(mdl.lm)
```

### Tidying the output

```{r}
# from library(broom)
tidy(mdl.lm) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
mycoefE <- tidy(mdl.lm) %>% pull(estimate)

```

Obtaining mean values from our model

```{r}
# young (the reference level, i.e., the intercept)
mycoefE[1]
# old (intercept + coefficient for AgeSubjectold)
mycoefE[1] + mycoefE[2]
```
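
To check that the coefficients of a model with one two-level factor really are group means, here is a minimal sketch using the built-in `sleep` data (an assumption made for illustration; it is not the `english` data). The intercept equals the mean of the reference group, and the intercept plus the second coefficient equals the mean of the other group.

```{r}
# one numeric outcome (extra) and one two-level factor (group)
ex_mdl_sleep <- lm(extra ~ group, data = sleep)
ex_coef_sleep <- coef(ex_mdl_sleep)

# means computed directly from the data
ex_means_sleep <- tapply(sleep$extra, sleep$group, mean)
ex_means_sleep

# mean of the reference level (group 1) = intercept
ex_coef_sleep[1]
# mean of the second level (group 2) = intercept + coefficient
ex_coef_sleep[1] + ex_coef_sleep[2]
```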

### Nice table of our model summary

We can also obtain a nice table of our model summary. We can use the package `knitr` or `xtable`

#### Directly from model summary

```{r}
kable(summary(mdl.lm)$coef, digits = 3)

```

#### From the `tidy` output

```{r}
mdl.lmT <- tidy(mdl.lm)
kable(mdl.lmT, digits = 3)
```


### Dissecting the model

Let us dissect the model. If you use "str", you will be able to see everything that is stored in our linear model object. We can then access specific parts of the model.

#### "str" and "coef"

```{r}
str(mdl.lm)
```



```{r}
coef(mdl.lm)
## same as 
## mdl.lm$coefficients
```

#### "coef" and "coefficients"

What if I want to obtain the "Intercept"? Or the coefficient for `AgeSubject`? What if I want the full row for `AgeSubject`?

```{r}
coef(mdl.lm)[1] # same as mdl.lm$coefficients[1]
coef(mdl.lm)[2] # same as mdl.lm$coefficients[2]
```


```{r}
summary(mdl.lm)$coefficients[2, ] # full row
summary(mdl.lm)$coefficients[2, 4] #for p value
```


#### Residuals

What about residuals (the difference between each observed value and the value fitted by the model) and fitted values? Plotting them allows us to evaluate how closely our residuals follow a normal distribution.

```{r warning=FALSE, message=FALSE, error=FALSE}
hist(residuals(mdl.lm))
qqnorm(residuals(mdl.lm)); qqline(residuals(mdl.lm))
plot(fitted(mdl.lm), residuals(mdl.lm), cex = 4)
```
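
The definition of a residual can be verified directly. The sketch below uses the built-in `cars` data (for illustration only; the object names are made up) and shows that the residuals are the observed values minus the fitted values, and that they average to zero in a model with an intercept.

```{r}
ex_mdl_cars <- lm(dist ~ speed, data = cars)

# residual = observed value - fitted value
ex_res_manual <- cars$dist - unname(fitted(ex_mdl_cars))
head(ex_res_manual)
head(unname(residuals(ex_mdl_cars)))

# with an intercept in the model, the residuals average to (numerically) zero
mean(residuals(ex_mdl_cars))
```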

#### Goodness of fit?

```{r warning=FALSE, message=FALSE, error=FALSE}
AIC(mdl.lm)	# Akaike's Information Criterion, lower values are better
BIC(mdl.lm)	# Bayesian Information Criterion, lower values are better
logLik(mdl.lm)	# log likelihood
```


Or use the following from `broom`

```{r}
glance(mdl.lm)
```

#### Significance testing

Are the above informative? Not directly. If we want to test the overall significance of the model, we run a null model (aka intercept only) and compare the two models.

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lm.Null <- english %>% 
  lm(RTlexdec ~ 1, data = .)
mdl.comp <- anova(mdl.lm.Null, mdl.lm)
mdl.comp
```

The results show that adding the variable "AgeSubject" improves the model fit. We can write this as follows: Model comparison showed that the addition of AgeSubject improved the model fit when compared with an intercept-only model ($F$(`r mdl.comp[2,3]`, `r mdl.comp[2,1]`) = `r round(mdl.comp[2,5], 2)`, *p* < 2.2e-16).

## Plotting fitted values

### Trend line

Let's plot our fitted values but only for the trend line

```{r warning=FALSE, message=FALSE, error=FALSE}
english %>% 
  ggplot(aes(x = AgeSubject, y = RTlexdec))+
  geom_boxplot()+
  theme_bw() + theme(text = element_text(size = 15))+
  geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") + 
  labs(x = "Age", y = "RTLexDec", title = "Boxplot and predicted trend line", subtitle = "with ggplot2") 
```

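This allows us to plot the fitted values from our model with the predicted linear trend. Because the model has a single two-level predictor, the fitted values are the two group means, so the trend line simply connects the mean of each group in our original data.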

### Predicted means and the trend line

We can also plot the predicted means and linear trend

```{r warning=FALSE, message=FALSE, error=FALSE}
english %>% 
  ggplot(aes(x = AgeSubject, y = predict(mdl.lm)))+
  geom_boxplot(color = "blue") +
  theme_bw() + theme(text = element_text(size = 15)) +
  geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") + 
    labs(x = "Age", y = "RTLexDec", title = "Predicted means and trend line", subtitle = "with ggplot2") 

```


### Raw data, predicted means and the trend line

We can also plot the actual data, the predicted means and linear trend

```{r warning=FALSE, message=FALSE, error=FALSE}
english %>% 
  ggplot(aes(x = AgeSubject, y = RTlexdec))+
  geom_boxplot() +
  geom_boxplot(aes(x = AgeSubject, y = predict(mdl.lm)), color = "blue") +
  theme_bw() + theme(text = element_text(size = 15)) +
  geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
    labs(x = "Age", y = "RTLexDec", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with ggplot2")
```

### Add significance levels and trend line on a plot?

We can use the p values generated from our linear model to add significance levels on a plot. We use the code from above and add the significance level. We also add a trend line.


```{r warning=FALSE, message=FALSE, error=FALSE}
english %>% 
  ggplot(aes(x = AgeSubject, y = RTlexdec))+
  geom_boxplot() +
  geom_boxplot(aes(x = AgeSubject, y = predict(mdl.lm)), color = "blue") +
  theme_bw() + theme(text = element_text(size = 15)) +
  geom_smooth(aes(x = as.numeric(AgeSubject), y = predict(mdl.lm)), method = "lm", color = "blue") +
    labs(x = "Age", y = "RTLexDec", title = "Boxplot raw data, predicted means (in blue) and trend line", subtitle = "with significance testing") +
    geom_signif(comparisons = list(c("old", "young")), 
              map_signif_level = TRUE, test = function(a, b) {
                list(p.value = summary(mdl.lm)$coefficients[2, 4])})


```





## What about pairwise comparison?

When we have three or more levels in our predictor, we can use pairwise comparisons, with corrections, to evaluate differences between each pair of levels.

```{r}
summary(mdl.lm)
```


```{r}
mdl.lm %>% emmeans(pairwise ~ AgeSubject, adjust = "fdr") -> mdl.emmeans
mdl.emmeans
```

How to interpret the output? Discuss with your neighbour and share with the group.

Hint... Look at the emmeans values for each level of our factor "AgeSubject" and the contrasts. 


## Multiple predictors?

Linear models require a numeric outcome, but the predictor can be either numeric or a factor. We can have more than one predictor. The only issue is that this complicates the interpretation of results. In the model below, `AgeSubject * WordCategory` fits both main effects and their interaction.

```{r warning=FALSE, message=FALSE, error=FALSE}
english %>% 
  lm(RTlexdec ~ AgeSubject * WordCategory, data = .) %>% 
  summary()
```


And with an Anova


```{r warning=FALSE, message=FALSE, error=FALSE}
english %>% 
  lm(RTlexdec ~ AgeSubject * WordCategory, data = .) %>% 
  anova()
```


The results above tell us that all the terms in the model have a significant effect on reaction time.




# Generalised Linear Models {.tabset .tabset-fade .tabset-pills}

Here we will look at an example when the outcome is binary. This simulated data is structured as follows. We asked one participant to listen to 165 sentences, and to judge whether these are "grammatical" or "ungrammatical". There were 105 sentences that were "grammatical" and 60 "ungrammatical". This fictitious example can apply in any other situation. Let's think Geography: 165 lands: 105 "flat" and 60 "non-flat", etc. This applies to any case where you need to "categorise" the outcome into two groups. 

## Load and summaries

Let's load in the data and do some basic summaries

```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- read_csv("grammatical.csv")
grammatical
str(grammatical)
head(grammatical)
```

## GLM - Categorical predictors

Let's run a first GLM (Generalised Linear Model). Here the GLM uses the family "binomial", as it assumes the outcome has a binomial distribution. In general, results from a Logistic Regression are close to what we get from SDT (see below).

Before running the model, we set the reference level for both response and grammaticality. With the levels used below, the reference is a "no" response to the "ungrammatical" category: the intercept gives the log odds of a "yes" response for "ungrammatical" sentences, and the coefficient gives the change in these log odds when moving to the "grammatical" category.

### Model estimation and results

The results below show the logodds for our model. 

```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- grammatical %>% 
  mutate(response = factor(response, levels = c("no", "yes")),
         grammaticality = factor(grammaticality, levels = c("ungrammatical", "grammatical")))

grammatical %>% 
  select(grammaticality, response) %>% 
  table()

mdl.glm <- grammatical %>% 
  glm(response ~ grammaticality, data = ., family = binomial)
summary(mdl.glm)

tidy(mdl.glm) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm) %>% pull(estimate)
```


The results show that for a one unit increase in the predictor (i.e., from "ungrammatical" to "grammatical"), the log odds of a "yes" response increase by `r mycoef2[2]` (the intercept shows that when the sentence is "ungrammatical", the log odds of a "yes" response are `r mycoef2[1]`). The actual log odds of a "yes" response to a "grammatical" sentence are `r mycoef2[1]+mycoef2[2]`.

### Logodds to Odd ratios

Log odds can be converted into the odds of an event by exponentiating them. For our model above, the odds of an "ungrammatical" sentence receiving a "yes" response are a mere 0.2; the odds of a "grammatical" sentence receiving a "yes" response are 20, i.e., a "yes" is 20 times more likely than a "no".


```{r warning=FALSE, message=FALSE, error=FALSE}
exp(mycoef2[1])
exp(mycoef2[1] + mycoef2[2])

```

### LogOdds to proportions

If you want to talk about the percentage "accuracy" of our model, then we can transform our log odds into proportions. This shows that "ungrammatical" sentences receive a "yes" response about 17% of the time, whereas "grammatical" sentences receive a "yes" response about 95% of the time.

```{r warning=FALSE, message=FALSE, error=FALSE}
plogis(mycoef2[1])
plogis(mycoef2[1] + mycoef2[2])
```
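
The conversions between counts, odds, log odds and proportions can be sketched by hand. The example below assumes the counts used when the data were simulated (100 "yes" and 5 "no" responses for grammatical sentences; 10 "yes" and 50 "no" for ungrammatical ones) and does not use the fitted model; the object names are made up for this illustration.

```{r}
# counts of "yes" and "no" responses per category (from the design of the simulated data)
ex_yes <- c(ungrammatical = 10, grammatical = 100)
ex_no  <- c(ungrammatical = 50, grammatical = 5)

ex_odds    <- ex_yes / ex_no      # odds of a "yes" response
ex_logodds <- log(ex_odds)        # the scale on which the GLM reports its estimates
ex_prop    <- plogis(ex_logodds)  # back to proportions, same as yes / (yes + no)

ex_odds
ex_logodds
ex_prop

# the slope of the model is the difference between the two log odds,
# i.e., the log of the odds ratio
diff(ex_logodds)
exp(diff(ex_logodds))
```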

### Plotting

```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- grammatical %>% 
  mutate(prob = predict(mdl.glm, type = "response"))
grammatical %>% 
  ggplot(aes(x = as.numeric(grammaticality), y = prob)) +
  geom_point() +
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = T) + theme_bw(base_size = 20)+
    labs(y = "Probability", x = "")+
    coord_cartesian(ylim = c(0,1))+
    scale_x_discrete(limits = c("Ungrammatical", "Grammatical"))
```


## GLM - Numeric predictors {.tabset .tabset-fade .tabset-pills}

In this example, we will run a GLM model using a similar technique to that used in `Al-Tamimi (2017)` and `Baumann & Winter (2018)`. We use the package `languageR` and the dataset `english`.


In the model above, we used the formula `lm(RTlexdec ~ AgeSubject)`. We were interested in examining the impact of the age of the subject on reaction time in a lexical decision task. In this section, we are interested in understanding how well reaction time allows us to differentiate the participants based on their age. We use `AgeSubject` as our outcome and `RTlexdec` as our predictor, using the formula `glm(AgeSubject ~ RTlexdec)`. We can usually use `RTlexdec` as is, but due to a possible quasi-separation and the fact that we may want to compare coefficients across multiple predictors measured on different scales, we will z-score our predictor. We run two models below, with and without z-scoring.

For the glm model, we need to specify `family = "binomial"`.

### Without z-scoring of predictor

#### Model estimation


```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.glm2 <- english2 %>% 
  glm(AgeSubject ~ RTlexdec, data = ., family = "binomial")

tidy(mdl.glm2) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm2) %>% pull(estimate)
```

#### LogOdds to proportions

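If you want to talk about the percentage "accuracy" of our model, then we can transform our log odds into proportions. With a numeric predictor, the first value below is the predicted probability of the "old" group when `RTlexdec` is 0, and the second is the predicted probability when `RTlexdec` is 1.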

```{r warning=FALSE, message=FALSE, error=FALSE}
plogis(mycoef2[1])
plogis(mycoef2[1] + mycoef2[2])
```

#### Plotting

```{r warning=FALSE, message=FALSE, error=FALSE}
english2 <- english2 %>% 
  mutate(prob = predict(mdl.glm2, type = "response"))
english2 %>% 
  ggplot(aes(x = as.numeric(AgeSubject), y = prob)) +
  geom_point() +
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = T) + theme_bw(base_size = 20)+
    labs(y = "Probability", x = "")+
    coord_cartesian(ylim = c(0,1))+
    scale_x_discrete(limits = c("Young", "Old"))
```
 
 
The plot above shows how the two groups differ using a GLM. The results point to an overall increase in the predicted probability of belonging to the "old" group when moving from the "Young" to the "Old" participants.
Let's use z-scoring next.


### With z-scoring of predictor

#### Model estimation


```{r warning=FALSE, message=FALSE, error=FALSE}
## scale() returns a one-column matrix; as.numeric() turns it into a plain numeric column
english2 <- english2 %>% 
  mutate(RTlexdec_z = as.numeric(scale(RTlexdec, center = TRUE, scale = TRUE)))



mdl.glm3 <- english2 %>% 
  glm(AgeSubject ~ RTlexdec_z, data = ., family = "binomial")

tidy(mdl.glm3) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm3) %>% pull(estimate)
```
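
What z-scoring does can be shown on a handful of numbers. The sketch below uses hypothetical reaction-time values (not taken from the `english` data) and checks that `scale()` is the same as subtracting the mean and dividing by the standard deviation.

```{r}
# hypothetical values, for illustration only
ex_x <- c(480, 520, 610, 700, 790)

# z-scoring by hand: centre on the mean, then divide by the standard deviation
ex_z_manual <- (ex_x - mean(ex_x)) / sd(ex_x)
# the same with scale()
ex_z_scale <- as.numeric(scale(ex_x))

ex_z_manual
ex_z_scale

# a z-scored variable has a mean of 0 and a standard deviation of 1
mean(ex_z_scale)
sd(ex_z_scale)
```

After z-scoring, a coefficient is read as the change in log odds for an increase of one standard deviation in the predictor, which is what makes coefficients comparable across predictors.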

#### LogOdds to proportions

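If you want to talk about the percentage "accuracy" of our model, then we can transform our log odds into proportions. With the z-scored predictor, the first value below is the predicted probability of the "old" group at the mean of `RTlexdec` (z = 0), and the second is the predicted probability one standard deviation above the mean (z = 1).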

```{r warning=FALSE, message=FALSE, error=FALSE}
plogis(mycoef2[1])
plogis(mycoef2[1] + mycoef2[2])
```

#### Plotting

##### Normal

```{r warning=FALSE, message=FALSE, error=FALSE}
english2 <- english2 %>% 
  mutate(prob = predict(mdl.glm3, type = "response"))
english2 %>% 
  ggplot(aes(x = as.numeric(AgeSubject), y = prob)) +
  geom_point() +
  geom_smooth(method = "glm", 
    method.args = list(family = "binomial"), 
    se = T) + theme_bw(base_size = 20)+
    labs(y = "Probability", x = "")+
    coord_cartesian(ylim = c(0,1))+
    scale_x_discrete(limits = c("Young", "Old"))
```
 
 
We obtain the exact same plots, but the model estimations are different. Let's use another type of prediction.


##### z-scores

```{r warning=FALSE, message=FALSE, error=FALSE}
z_vals <- seq(-3, 3, 0.01)

dfPredNew <- data.frame(RTlexdec_z = z_vals)

## store the predicted probabilities for each value of RTlexdec_z
pp <- cbind(dfPredNew, prob = predict(mdl.glm3, newdata = dfPredNew, type = "response"))

pp %>% 
  ggplot(aes(x = RTlexdec_z, y = prob)) +
  geom_point() +
  theme_bw(base_size = 20)+
    labs(y = "Probability", x = "")+
    coord_cartesian(ylim = c(0,1))+
  scale_x_continuous(breaks = c(-3, -2, -1, 0, 1, 2, 3))
```
 
 
This plot shows the predicted probability of belonging to the "old" group as a smooth function of the z-scored reaction time.






## Accuracy and Signal Detection Theory

### Rationale

We are generally interested in performance, i.e., whether we have "accurately" categorised the outcome or not, and at the same time we want to evaluate our biases in responses. When deciding on categories, we are usually biased in our selection. 

Let's ask the question: How many of you have a Mac laptop and how many a Windows laptop? For those with a Mac, what was the main reason for choosing it? Are you biased in any way by your decision? 

To correct for these biases, we use some variants from Signal Detection Theory to obtain the true estimates without being influenced by the biases. 

### Running stats

Let's do some stats on this 

|  | Yes | No | Total |
|----------------------------|--------------------|------------------|------------------|
| Grammatical (Yes Actual) | TP = 100 | FN = 5 | (Yes Actual) 105 |
| Ungrammatical (No Actual)  | FP = 10 | TN = 50 | (No Actual) 60 |
| Total | (Yes Response) 110 | (No Response) 55 | 165 |

```{r warning=FALSE, message=FALSE, error=FALSE}
grammatical <- grammatical %>% 
  mutate(response = factor(response, levels = c("yes", "no")),
         grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))

## TP = True Positive (Hit); FP = False Positive; FN = False Negative; TN = True Negative


TP <- nrow(grammatical %>% 
             filter(grammaticality == "grammatical" &
                      response == "yes"))
FN <- nrow(grammatical %>% 
             filter(grammaticality == "grammatical" &
                      response == "no"))
FP <- nrow(grammatical %>% 
             filter(grammaticality == "ungrammatical" &
                      response == "yes"))
TN <- nrow(grammatical %>% 
             filter(grammaticality == "ungrammatical" &
                      response == "no"))
TP
FN
FP
TN

Total <- nrow(grammatical)
Total
(TP+TN)/Total # accuracy
(FP+FN)/Total # error, also 1-accuracy

# When stimulus = yes, how many times response = yes?
TP/(TP+FN) # also True Positive Rate or Sensitivity

# When stimulus = no, how many times response = yes?
FP/(FP+TN) # False Positive Rate

# When stimulus = no, how many times response = no?
TN/(FP+TN) # True Negative Rate or Specificity

# When subject responds "yes" how many times is (s)he correct?
TP/(TP+FP) # precision

# getting dprime (or the sensitivity index); beta (bias criterion, 0-1, lower=increase in "yes"); Aprime (estimate of discriminability, 0-1, 1=good discrimination; 0 at chance); bppd (b prime prime d, -1 to 1; 0 = no bias, negative = tendency to respond "yes", positive = tendency to respond "no"); c (index of bias, equals to SD)
#(see also https://www.r-bloggers.com/compute-signal-detection-theory-indices-with-r/amp/) 
psycho::dprime(TP, FP, FN, TN, 
               n_targets = TP+FN, 
               n_distractors = FP+TN,
               adjust=F)

```

The most important measure from the above is d-prime. It models the difference between the rate of "True Positive" responses and the rate of "False Positive" responses in standard units (or z-scores). The formula can be written as:

`d' (d prime) = Z(True Positive Rate) - Z(False Positive Rate)`
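
The formula can be worked through by hand with `qnorm`, which converts a proportion into a z-score. This is a minimal sketch that assumes the counts from the table above (TP = 100, FN = 5, FP = 10, TN = 50) and no correction for extreme rates; the object names are made up for this illustration.

```{r}
# rates computed from the counts in the table above
ex_tpr <- 100 / (100 + 5)  # True Positive Rate
ex_fpr <- 10 / (10 + 50)   # False Positive Rate

# d prime = Z(True Positive Rate) - Z(False Positive Rate)
ex_dprime <- qnorm(ex_tpr) - qnorm(ex_fpr)
ex_dprime
```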

### GLM as a classification tool

The code below demonstrates the links between our GLM model and what we had obtained above from SDT. The predictions' table shows that our GLM was successful at obtaining predictions that are identical to our initial data setup. Compare the table here with the table above. Once we have created our table of outcomes, we can compute percent correct, the specificity, the sensitivity, the Kappa score, etc. This yields the actual value with the SD that is related to variations in responses. 

```{r}
## predict(mdl.glm)>0.5 is identical to 
## predict(glm(response~grammaticality,data=grammatical,family = binomial),type="response")
grammatical <- grammatical %>% 
  mutate(response = factor(response, levels = c("yes", "no")),
         grammaticality = factor(grammaticality, levels = c("grammatical", "ungrammatical")))



mdl.glm.C <- grammatical %>% 
  glm(response ~ grammaticality, data = .,family = binomial)

tbl.glm <- table(grammatical$response, predict(mdl.glm.C, type = "response")>0.5)
colnames(tbl.glm) <- c("grammatical", "ungrammatical")
tbl.glm
PresenceAbsence::pcc(tbl.glm)
PresenceAbsence::specificity(tbl.glm)
PresenceAbsence::sensitivity(tbl.glm)
###etc..
```

If you look at the results from SDT above, these results are the same as the following:

Accuracy: (TP+TN)/Total (`r (TP+TN)/Total`) 

True Positive Rate (or Sensitivity) TP/(TP+FN) (`r TP/(TP+FN)`)

True Negative Rate (or Specificity) TN/(FP+TN) (`r TN/(FP+TN)`) 

### GLM and d prime

The values obtained here match those obtained from SDT. For d prime, any difference stems from the use of the logit variant of the Binomial family. By using a probit variant, one obtains the same values ([see here](https://stats.idre.ucla.edu/r/dae/probit-regression/) for more details). A probit variant models the z-score differences in the outcome and is evaluated as a change in 1 standard unit. This is modelling the change from "ungrammatical" "no" responses into "grammatical" "yes" responses in z-scores. These are the same conceptual underpinnings as those of d-prime in Signal Detection Theory.

```{r}
## d prime
psycho::dprime(TP, FP, FN, TN, 
               n_targets = TP+FN, 
               n_distractors = FP+TN,
               adjust=F)$dprime

## GLM with probit
coef(glm(response ~ grammaticality, data = grammatical, family = binomial(probit)))[2]

```
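
The equivalence between the probit slope and d prime can also be checked without the csv file. The sketch below rebuilds a data frame from the counts in the SDT table (an assumption made for illustration; `ex_gram` is not the `grammatical` data frame itself) and compares the slope of a probit model with d prime computed by hand.

```{r}
# rebuild 165 rows from the counts: 10 yes / 50 no (ungrammatical), 100 yes / 5 no (grammatical)
ex_gram <- data.frame(
  grammaticality = factor(rep(c("ungrammatical", "grammatical"), times = c(60, 105)),
                          levels = c("ungrammatical", "grammatical")),
  response = factor(c(rep(c("yes", "no"), times = c(10, 50)),
                      rep(c("yes", "no"), times = c(100, 5))),
                    levels = c("no", "yes"))
)

# probit model: the slope is the difference in z-scores of a "yes" response
ex_probit <- glm(response ~ grammaticality, data = ex_gram,
                 family = binomial(link = "probit"))
ex_slope <- unname(coef(ex_probit)[2])

# d prime by hand: Z(True Positive Rate) - Z(False Positive Rate)
ex_dprime_hand <- qnorm(100 / 105) - qnorm(10 / 60)

ex_slope
ex_dprime_hand
```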





## GLM: Other distributions

If your data does not fit a binomial distribution, and is multinomial (i.e., three or more response categories) or Poisson (count data), then you need a different specification: a multinomial model can be fitted with `nnet::multinom()`, and a Poisson model with the `glm` function and `family = "poisson"`.

```{r warning=FALSE, message=FALSE, error=FALSE, echo=FALSE}
## For a multinomial (3 or more response categories), see below and use the following specification
## https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/
## mdl.multi <- nnet::multinom(outcome~predictor, data=data)

## For a poisson (count data), see below and use the following specification
## https://stats.idre.ucla.edu/r/dae/poisson-regression/

## mdl.poisson <- glm(outcome~predictor, data = data, family = "poisson")


```
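
As a minimal sketch of the Poisson case, the example below simulates count data (the data, the true coefficients 0.5 and 1.2, and all object names are assumptions made for illustration) and fits the model with `family = "poisson"`. The estimates are on the log scale; exponentiating them gives multiplicative changes in the expected count.

```{r}
set.seed(1)
# simulated predictor and counts; the true intercept is 0.5 and the true slope is 1.2 (log scale)
ex_counts <- data.frame(predictor = runif(500))
ex_counts$outcome <- rpois(500, lambda = exp(0.5 + 1.2 * ex_counts$predictor))

ex_poisson <- glm(outcome ~ predictor, data = ex_counts, family = "poisson")

# estimates on the log scale; they should be close to the true values
coef(ex_poisson)
# multiplicative change in the expected count
exp(coef(ex_poisson))
```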


# Cumulative Logit Link Models

These models are well suited to rating data. Ratings are inherently ordered (1, 2, ... n), and we expect to observe an increase (or decrease) in overall ratings from 1 to n. To demonstrate this, we will use an example with the package "ordinal". The data come from a rating experiment where six participants rated the percept of nasality in the production of particular consonants in Arabic. The productions came from nine producing subjects. The ratings were from 1 to 5. This example can apply to any study, e.g., rating the grammaticality of sentences, rating how positive the sentiments are in an article, interview responses, etc.

## Importing and pre-processing

We start by importing the data and processing it. We change the reference level in the predictor.

```{r warning=FALSE, message=FALSE, error=FALSE}
rating <- read_csv("rating.csv")
rating
rating <- rating %>% 
  mutate(Response = factor(Response),
         Context = factor(Context)) %>% 
  mutate(Context = relevel(Context, "isolation"))
rating
```

## Our first model

We run our first clm model as a simple model, i.e., with no random effects.

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.clm <- rating %>% 
  clm(Response ~ Context, data = .)
summary(mdl.clm)
```


## Testing significance 

We can evaluate whether "Context" improves the model fit, by comparing a null model with our model. Of course "Context" is improving the model fit.

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.clm.Null <- rating %>% 
  clm(Response ~ 1, data = .)
anova(mdl.clm, mdl.clm.Null)

```

## Interpreting a cumulative model

As a way to interpret the model, we can look at the coefficients and make sense of the results. A CLM is a logistic model with a cumulative effect. The "Coefficients" are the estimates for each level of the fixed effect; the "Threshold coefficients" are those of the response. For the former, a negative coefficient indicates a negative association with the response, and a positive coefficient a positive association. The p values indicate the significance of each level. For the "Threshold coefficients", we can see the cumulative effects of ratings 1|2, 2|3, 3|4 and 4|5, which indicate an overall increase in the ratings from 1 to 5. 
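
To see how thresholds and coefficients combine into rating probabilities, here is a minimal sketch with hypothetical numbers (the thresholds and the coefficient below are made up and are not the estimates of `mdl.clm`). It assumes the usual cumulative logit parameterisation, where the probability of a rating up to j is `plogis(threshold_j - coefficient)`.

```{r}
# hypothetical thresholds for 1|2, 2|3, 3|4 and 4|5, and one hypothetical coefficient
ex_theta <- c(-1.5, -0.5, 0.5, 1.5)
ex_beta  <- 1

# probability of each rating (1 to 5) for a given value of the linear predictor
ex_probs <- function(eta) {
  cum <- c(plogis(ex_theta - eta), 1)  # cumulative probabilities P(rating <= j)
  diff(c(0, cum))                      # probabilities P(rating = j)
}

ex_probs(0)        # reference level of the predictor
ex_probs(ex_beta)  # level with a positive coefficient: higher ratings become more likely
```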

## Plotting 

### No confidence intervals

We use a modified version of a plotting function that allows us to visualise the effects. For this, we use the base R plotting functions. The version below is without confidence intervals.


```{r warning=FALSE, message=FALSE, error=FALSE}
par(oma=c(1, 0, 0, 3),mgp=c(2, 1, 0))
xlimNas = c(min(mdl.clm$beta), max(mdl.clm$beta))
ylimNas = c(0,1)
plot(0,0,xlim=xlimNas, ylim=ylimNas, type="n", ylab=expression(Probability), xlab="", xaxt = "n",main="Predicted curves - Nasalisation",cex=2,cex.lab=1.5,cex.main=1.5,cex.axis=1.5)
axis(side = 1, at = c(0,mdl.clm$beta),labels = levels(rating$Context), las=2,cex=2,cex.lab=1.5,cex.axis=1.5)
xsNas = seq(xlimNas[1], xlimNas[2], length.out=100)
lines(xsNas, plogis(mdl.clm$Theta[1] - xsNas), col='black')
lines(xsNas, plogis(mdl.clm$Theta[2] - xsNas)-plogis(mdl.clm$Theta[1] - xsNas), col='red')
lines(xsNas, plogis(mdl.clm$Theta[3] - xsNas)-plogis(mdl.clm$Theta[2] - xsNas), col='green')
lines(xsNas, plogis(mdl.clm$Theta[4] - xsNas)-plogis(mdl.clm$Theta[3] - xsNas), col='orange')
lines(xsNas, 1-(plogis(mdl.clm$Theta[4] - xsNas)), col='blue')
abline(v=c(0,mdl.clm$beta),lty=3)
abline(h=0, lty="dashed")
abline(h=0.2, lty="dashed")
abline(h=0.4, lty="dashed")
abline(h=0.6, lty="dashed")
abline(h=0.8, lty="dashed")
abline(h=1, lty="dashed")

legend(par('usr')[2], par('usr')[4], bty='n', xpd=NA,lty=1, col=c("black", "red", "green", "orange", "blue"), 
       legend=c("Oral", "2", "3", "4", "Nasal"),cex=0.75)

```


### With confidence intervals

Here is an attempt to add approximate 95% confidence intervals (each threshold plus or minus 1.96 standard errors) to these plots. This is an experimental attempt and any feedback is welcome!


```{r warning=FALSE, message=FALSE, error=FALSE}
par(oma=c(1, 0, 0, 3),mgp=c(2, 1, 0))
xlimNas = c(min(mdl.clm$beta), max(mdl.clm$beta))
ylimNas = c(0,1)
plot(0,0,xlim=xlimNas, ylim=ylimNas, type="n", ylab=expression(Probability), xlab="", xaxt = "n",main="Predicted curves - Nasalisation",cex=2,cex.lab=1.5,cex.main=1.5,cex.axis=1.5)
axis(side = 1, at = c(0,mdl.clm$beta),labels = levels(rating$Context), las=2,cex=2,cex.lab=1.5,cex.axis=1.5)
xsNas = seq(xlimNas[1], xlimNas[2], length.out=100)

# thresholds and half-width of their approximate 95% CI (1.96 * standard error);
# the first four rows of the coefficient table are the four thresholds
thetaNas <- as.numeric(mdl.clm$Theta)
ciNas <- 1.96 * summary(mdl.clm)$coefficients[1:4, 2]
colsNas <- c("black", "red", "green", "orange", "blue")

# probability of each rating (columns 1 to 5) along xsNas for a given set of thresholds
probNas <- function(theta) {
  cumNas <- cbind(0, sapply(theta, function(th) plogis(th - xsNas)), 1)
  cumNas[, 2:6] - cumNas[, 1:5]
}
probMid <- probNas(thetaNas)          # predicted curves
probUp  <- probNas(thetaNas + ciNas)  # thresholds shifted up (+CI)
probLow <- probNas(thetaNas - ciNas)  # thresholds shifted down (-CI)

# fill area around CI using c(x, rev(x)), c(y2, rev(y1))
for (k in 1:5) {
  polygon(c(xsNas, rev(xsNas)), c(probUp[, k], rev(probLow[, k])), col = "gray90")
}

# predicted curves on top of the shaded areas
for (k in 1:5) {
  lines(xsNas, probMid[, k], col = colsNas[k])
}
abline(v=c(0,mdl.clm$beta),lty=3)
abline(h = seq(0, 1, 0.2), lty = "dashed")

legend(par('usr')[2], par('usr')[4], bty='n', xpd=NA,lty=1, col=colsNas, 
       legend=c("Oral", "2", "3", "4", "Nasal"),cex=0.75)

```


# Linear Mixed-effects Models. Why random effects matter {.tabset .tabset-fade .tabset-pills}

Let's generate a new dataframe that we will use later on for our mixed models

```{r warning=FALSE, message=FALSE, error=FALSE}
## Courtesy of Bodo Winter
set.seed(666)
#we create 6 subjects
subjects <- paste0('S', 1:6)
#here we add repetitions within speakers
subjects <- rep(subjects, each = 20)
items <- paste0('Item', 1:20)
#below repeats
items <- rep(items, 6)
#below is to generate random numbers that are log values
logFreq <- round(rexp(20)*5, 2)
#below we are repeating the logFreq 6 times to fit with the number of speakers and items
logFreq <- rep(logFreq, 6)
xdf <- data.frame(subjects, items, logFreq)
#below removes the individual variables we had created because they are already in the dataframe
rm(subjects, items, logFreq)

xdf$Intercept <- 300
submeans <- rep(rnorm(6, sd = 40), 20)
#sorting groups the repeated values together, so that the 20 rows of each subject share the same mean
submeans <- sort(submeans)
xdf$submeans <- submeans
#we create the same thing for items... we allow the items mean to vary between words...
itsmeans <- rep(rnorm(20, sd = 20), 6)
xdf$itsmeans <- itsmeans
xdf$error <- rnorm(120, sd = 20)
#here we create an effect column:
#duration decreases by 5 for each unit increase in logFreq
xdf$effect <- -5 * xdf$logFreq

xdf$dur <- xdf$Intercept + xdf$submeans + xdf$itsmeans + xdf$error + xdf$effect
#below is to subset the data and get only a few columns.. the -c(4:8) removes the columns 4 to 8..
xreal <- xdf[,-c(4:8)]
head(xreal)
rm(xdf, submeans, itsmeans)
```

## Plots
Let's start by doing a correlation test and plotting the data. Our results show that there is a negative correlation between duration and LogFrequency, and the plot shows this decrease. 

```{r warning=FALSE, message=FALSE, error=FALSE}
corrMixed <- as.matrix(xreal[-c(1:2)]) %>% 
  rcorr(type="pearson")
print(corrMixed)
corrplot(corrMixed$r, method = "circle", type = "upper", tl.srt = 45,
         addCoef.col = "black", diag = FALSE,
         p.mat = corrMixed$p, sig.level = 0.05)



ggplot.xreal <- xreal %>% 
  ggplot(aes(x = logFreq, y = dur)) +
  geom_point()+ theme_bw(base_size = 20) +
  labs(y = "Duration", x = "Frequency (Log)") +
  geom_smooth(method = lm, se=F)
ggplot.xreal
```


## Linear model

Let's run a simple linear model on the data. As we can see below, there are some issues with the "simple" linear model: we had set our SD for subjects to be 40, but this was picked up as 120 (see histogram of residuals). The QQ Plot is not "normal". 

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lm.xreal <- xreal %>% 
  lm(dur ~ logFreq, data = .)
summary(mdl.lm.xreal)
hist(residuals(mdl.lm.xreal))
qqnorm(residuals(mdl.lm.xreal)); qqline(residuals(mdl.lm.xreal))
plot(fitted(mdl.lm.xreal), residuals(mdl.lm.xreal), cex = 4)
```
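To see why the simple linear model struggles with this kind of data, the sketch below simulates a dataset with the same structure and compares the residual SD of `lm` with the SDs that went into the simulation. This is only an illustration under stated assumptions: it uses 30 subjects rather than 6 (to make the estimate more stable), and all object names (`demo`, `mdl_demo`, etc.) are hypothetical and not used elsewhere in this document.

```r
# Simulate data with by-subject and by-item variation (assumed SDs: 40, 20, 20)
set.seed(42)
n_subj <- 30
n_item <- 20
demo <- expand.grid(item = seq_len(n_item), subject = seq_len(n_subj))

logFreq_item <- round(rexp(n_item) * 5, 2)   # one logFreq value per item
subj_shift   <- rnorm(n_subj, sd = 40)       # by-subject deviation
item_shift   <- rnorm(n_item, sd = 20)       # by-item deviation

demo$logFreq <- logFreq_item[demo$item]
demo$dur <- 300 + subj_shift[demo$subject] + item_shift[demo$item] +
  rnorm(nrow(demo), sd = 20) - 5 * demo$logFreq

# A simple linear model ignores the grouping structure
mdl_demo <- lm(dur ~ logFreq, data = demo)

# Its residual SD lumps all three sources of variation together
resid_sd    <- sigma(mdl_demo)
expected_sd <- sqrt(40^2 + 20^2 + 20^2)
c(residual_sd = resid_sd, expected_total_sd = expected_sd)
```

The residual SD of the linear model should be far larger than the error SD of 20 used in the simulation, because the subject and item variation has nowhere else to go.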

## Linear Mixed Model

Our Linear Mixed-effects Model takes into account the random effects we added as well as our model specifications. We use Maximum Likelihood estimation (`REML = FALSE`) because this is what we need for model comparison. The Linear Mixed Model reflects our model specifications: the SD of our subjects is picked up correctly. The model results are "almost" the same as in our linear model above. The coefficient for the "Intercept" is 337.973 and the coefficient for LogFrequency is -5.460. This indicates that for each unit of increase in LogFrequency, duration decreases by 5.460 (ms).

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal <- xreal %>% 
  lmer(dur ~ logFreq  +(1|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal)
hist(residuals(mdl.lmer.xreal))
qqnorm(residuals(mdl.lmer.xreal)); qqline(residuals(mdl.lmer.xreal))
plot(fitted(mdl.lmer.xreal), residuals(mdl.lmer.xreal), cex = 4)
```

## Our second Mixed model

This second model adds a by-subject random slope for `logFreq`. Random slopes allow the effect of a predictor to vary across the levels of a random effect, here across subjects. An intercept-only model instead assumes that the effect of `logFreq` is the same for all participants and only lets their average durations differ.

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal.2 <- xreal %>% 
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
summary(mdl.lmer.xreal.2)
hist(residuals(mdl.lmer.xreal.2))
qqnorm(residuals(mdl.lmer.xreal.2)); qqline(residuals(mdl.lmer.xreal.2))
plot(fitted(mdl.lmer.xreal.2), residuals(mdl.lmer.xreal.2), cex = 4)
```
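The intuition behind a by-subject random slope can be sketched by fitting a separate regression per subject and looking at how much the slopes differ. This is not what `lmer` does internally (a mixed model estimates the slopes jointly and shrinks them towards the overall mean); it is only a rough illustration. The data are simulated under the assumption that subjects truly differ in their slopes, and all names (`demo`, `true_slopes`, `per_subject`) are hypothetical.

```r
# Simulate 6 subjects whose logFreq slopes vary around -5 (assumed SD = 2)
set.seed(7)
n_subj <- 6
n_obs  <- 20
true_slopes <- rnorm(n_subj, mean = -5, sd = 2)

demo <- data.frame(
  subject = rep(paste0("S", seq_len(n_subj)), each = n_obs),
  logFreq = rep(round(rexp(n_obs) * 5, 2), times = n_subj)
)
demo$dur <- 300 + rep(true_slopes, each = n_obs) * demo$logFreq +
  rnorm(nrow(demo), sd = 20)

# One regression per subject: rows are coefficients, columns are subjects
per_subject <- sapply(split(demo, demo$subject),
                      function(d) coef(lm(dur ~ logFreq, data = d)))

# The spread of these slopes is what a random slope is meant to capture
slopes <- per_subject["logFreq", ]
round(slopes, 2)
sd(slopes)
```

If the slopes barely differ between subjects, a random slope is unlikely to improve the model; if they differ a lot, an intercept-only model will misrepresent individual subjects.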

## Model comparison

But where are our p values? The lme4 developers decided not to include p values due to various issues with estimating df. What we can do instead is to compare models. We need to create a null model to allow for significance testing. As expected our predictor is significantly contributing to the difference. 

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal.Null <- xreal %>% 
  lmer(dur ~ 1 + (logFreq|subjects) + (1|items), data = ., REML = FALSE)
anova(mdl.lmer.xreal.Null, mdl.lmer.xreal.2)
```
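The model comparison that `anova()` reports for two nested models fitted with Maximum Likelihood is a likelihood ratio test. As a minimal sketch of the arithmetic, the block below computes the test by hand for two ordinary `lm` models on simulated data; it assumes nothing about the `lmer` objects above, and the names `m_null`, `m_full`, `lr_stat` are hypothetical.

```r
# Simulated data: duration decreases with logFreq (assumed slope = -5)
set.seed(11)
demo <- data.frame(logFreq = round(rexp(120) * 5, 2))
demo$dur <- 300 - 5 * demo$logFreq + rnorm(120, sd = 40)

m_null <- lm(dur ~ 1, data = demo)        # no predictor
m_full <- lm(dur ~ logFreq, data = demo)  # with predictor

# Likelihood ratio statistic: twice the difference in log-likelihoods
lr_stat <- as.numeric(2 * (logLik(m_full) - logLik(m_null)))

# Degrees of freedom: difference in number of estimated parameters
lr_df <- attr(logLik(m_full), "df") - attr(logLik(m_null), "df")

# p value from the chi-squared distribution
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(chisq = lr_stat, df = lr_df, p = p_value)
```

The same logic explains why the models being compared should be fitted with `REML = FALSE` when they differ in their fixed effects: the comparison relies on log-likelihoods that are comparable across the two models.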

Also, do we really need random slopes? From the result below, we don't seem to need random slopes at all, given that adding random slopes does not improve the model fit. I always recommend testing this. Most of the time I keep random slopes.

```{r warning=FALSE, message=FALSE, error=FALSE}
anova(mdl.lmer.xreal, mdl.lmer.xreal.2)
```

But if you are really (really!!!) obsessed by p values, then you can also use lmerTest. BUT use it only after comparing models to evaluate the contribution of predictors.

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal.lmerTest <- xreal %>% 
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = TRUE)
summary(mdl.lmer.xreal.lmerTest)
detach("package:lmerTest", unload = TRUE)
```


## Our final Mixed model

Our final model uses REML (or Restricted Maximum Likelihood Estimate of Variance Component) to estimate the model. 

```{r warning=FALSE, message=FALSE, error=FALSE}
mdl.lmer.xreal.Full <- xreal %>% 
  lmer(dur ~ logFreq + (logFreq|subjects) + (1|items), data = ., REML = TRUE)
summary(mdl.lmer.xreal.Full)
anova(mdl.lmer.xreal.Full)
hist(residuals(mdl.lmer.xreal.Full))
qqnorm(residuals(mdl.lmer.xreal.Full)); qqline(residuals(mdl.lmer.xreal.Full))
plot(fitted(mdl.lmer.xreal.Full), residuals(mdl.lmer.xreal.Full), cex = 4)
```


## Dissecting the model

```{r warning=FALSE, message=FALSE, error=FALSE}
coef(mdl.lmer.xreal.Full)
fixef(mdl.lmer.xreal.Full)
fixef(mdl.lmer.xreal.Full)[1]
fixef(mdl.lmer.xreal.Full)[2]

coef(mdl.lmer.xreal.Full)$`subjects`[1]
coef(mdl.lmer.xreal.Full)$`subjects`[2]

coef(mdl.lmer.xreal.Full)$`items`[1]
coef(mdl.lmer.xreal.Full)$`items`[2]

```
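To make sense of what `coef()` returns, it can help to rebuild it from its parts: for each level of a grouping factor, the coefficients are the fixed effects plus that level's random-effect deviations. The sketch below checks this on the `sleepstudy` data that ships with `lme4` rather than on `xreal`; the object names (`m_demo`, `by_hand`, `from_coef`) are hypothetical.

```r
library(lme4)

# Random intercepts and random slopes for Days by Subject
m_demo <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

fe <- fixef(m_demo)           # population-level intercept and slope
re <- ranef(m_demo)$Subject   # by-subject deviations from these

# Add the fixed effects column-wise to the by-subject deviations
by_hand   <- sweep(as.matrix(re), 2, fe, "+")
from_coef <- as.matrix(coef(m_demo)$Subject)

# Both routes should give the same by-subject coefficients
all.equal(by_hand, from_coef, check.attributes = FALSE)
```

For a grouping factor that only has a random intercept (such as `items` in the model above), the slope column returned by `coef()` is simply the fixed-effect slope repeated for every level.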

## Using predictions from our model
In general, I use the predictions from my final model in any plots. To generate these, we can use the following:

```{r warning=FALSE, message=FALSE, error=FALSE}
xreal <- xreal %>% 
  mutate(Pred_Dur = predict(mdl.lmer.xreal.Full))

xreal %>% 
  ggplot(aes(x = logFreq, y = Pred_Dur)) +
  geom_point() + theme_bw(base_size = 20) +
  labs(y = "Duration", x = "Frequency (Log)", title = "Predicted") +
  geom_smooth(method = lm, se = F) + coord_cartesian(ylim = c(200,450))

## original plot
xreal %>% 
  ggplot(aes(x = logFreq , y = dur)) +
  geom_point() + theme_bw(base_size = 20)+
  labs(y = "Duration", x = "Frequency (Log)", title = "Original")+
  geom_smooth(method = lm, se = F) + coord_cartesian(ylim = c(200,450))

```

## GLMM and CLMM

The code above used Linear Mixed-effects Modelling, where the outcome is numeric. In some cases (as we have seen above), we may have: 

1. Binary outcome (binomial)
2. Count data (poisson)
3. Multi-category outcome (multinomial)
4. Rating data (cumulative function)

The code below gives you an idea of how to specify these models

```{r warning=FALSE, message=FALSE, error=FALSE}

## Binomial family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=binomial)

## Poisson family
## lme4::glmer(outcome~predictor(s)+(1|subject)+(1|items)..., data=data, family=poisson)

## Multinomial family
## a bit complicated as there is a need to use Bayesian approaches, see e.g., 
## glmmADMB
## mixcat
## MCMCglmm
## see https://gist.github.com/casallas/8263818

## Rating data, use following
## ordinal::clmm(outcome~predictor(s)+(1|subject)+(1|items)..., data=data)


## Remember to test for random effects and whether slopes are needed.

```


# Principal Component Analyses (PCA)


## Read dataset

```{r}
dfPharV2 <- read_csv("dfPharV2.csv")
dfPharV2
dfPharV2 <- dfPharV2 %>% 
  mutate(context = factor(context, levels = c("Non-Guttural", "Guttural")))
```


## Model specification

We use the package `FactoMineR` to run our PCA. We use all acoustic measures as predictors and `context` as our qualitative (supplementary) variable.

```{r}
pcaDat1 <- PCA(dfPharV2,
               quali.sup = 1, graph = TRUE,
               scale.unit = TRUE, ncp = 5) 
```



## Results

### Summary of results

Based on the summary of results, we observe that the first 6 dimensions account for 64% of the variance in the data; each contributes individually more than 5% of the variance.


```{r}
summary(pcaDat1)
```
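The percentages of variance in such a summary come directly from the eigenvalues of the PCA. As a minimal sketch, the block below reproduces the computation with base R's `prcomp` on the built-in `USArrests` data; this is a stand-in for illustration, not the `FactoMineR` fit on `dfPharV2`, and the names `pca_demo`, `eig`, `prop_var`, `cum_var` are hypothetical.

```r
# PCA on standardised variables (comparable to scale.unit = TRUE)
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eig <- pca_demo$sdev^2

# Proportion and cumulative proportion of variance per dimension
prop_var <- eig / sum(eig)
cum_var  <- cumsum(prop_var)

round(rbind(eigenvalue = eig,
            proportion = prop_var,
            cumulative = cum_var), 3)
```

With standardised variables the eigenvalues sum to the number of variables, so a dimension with an eigenvalue above 1 explains more variance than any single original variable.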

### Contribution of predictors and groups

Below, we look at the contributions of the main 5 dimensions.

```{r}
dimdesc(pcaDat1, axes = 1:5, proba = 0.05)
```


### Contribution of variables

We look next at the contribution of the top 10 predictors to each of the first 5 dimensions.

#### Dimension 1

```{r}
fviz_contrib(pcaDat1, choice = "var", axes = 1, top = 10)
```

#### Dimension 2

```{r}
fviz_contrib(pcaDat1, choice = "var", axes = 2, top = 10)
```

#### Dimension 3

```{r}
fviz_contrib(pcaDat1, choice = "var", axes = 3, top = 10)
```


#### Dimension 4

```{r}
fviz_contrib(pcaDat1, choice = "var", axes = 4, top = 10)
```


#### Dimension 5

```{r}
fviz_contrib(pcaDat1, choice = "var", axes = 5, top = 10)
```



## Plots

### PCA Individuals

```{r}
fviz_pca_ind(pcaDat1, col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
             )
```

### PCA Biplot 1:2

```{r}
fviz_pca_biplot(pcaDat1, repel = TRUE, habillage = dfPharV2$context, addEllipses = TRUE, title = "MSA - Biplot")
```

### PCA Biplot 3:4

```{r}
fviz_pca_biplot(pcaDat1, axes = c(3, 4), repel = TRUE, habillage = dfPharV2$context, addEllipses = TRUE, title = "MSA - Biplot")
```


## Clustering

```{r}
fviz_pca_ind(pcaDat1,
             label = "none", # hide individual labels
             habillage = dfPharV2$context, # color by groups
             addEllipses = TRUE # Concentration ellipses
             )
```


## 3-D By Groups

```{r}
coord <- pcaDat1$quali.sup$coord[1:2,0]
coord
#
with(pcaDat1, {
  s3d <- scatterplot3d(pcaDat1$quali.sup$coord[,1], pcaDat1$quali.sup$coord[,2], pcaDat1$quali.sup$coord[,3],        # x y and z axis
                       color=c("blue", "red"), pch=19,        # filled blue and red circles
                       type="h",                    # vertical lines to the x-y plane
                       main="PCA 3-D Scatterplot",
                       xlab="Dim1(37.7%)",
                       ylab="",
                       zlab="Dim3(11.4%)",
                       #xlim = c(-1.5, 1.5), ylim = c(-1.5, 1.5), zlim = c(-0.8, 0.8)
)
  s3d.coords <- s3d$xyz.convert(pcaDat1$quali.sup$coord[,1], pcaDat1$quali.sup$coord[,2], pcaDat1$quali.sup$coord[,3]) # convert 3D coords to 2D projection
  text(s3d.coords$x, s3d.coords$y,             # x and y coordinates
       labels=row.names(coord), col = c("blue", "red"),              # text to plot
       cex=1, pos=4)           # default text size (cex = 1), placed to the right of the points
})
dims <- par("usr")
x <- dims[1]+ 0.8*diff(dims[1:2])
y <- dims[3]+ 0.08*diff(dims[3:4])
text(x, y, "Dim2(25.5%)", srt = 25,col="black")
```




# Decision Trees {.tabset .tabset-fade .tabset-pills}

Decision trees are a statistical tool that uses the combination of predictors to identify patterns in the data and provides classification accuracy for the model. 

The decision tree used is based on `conditional inference trees`, which look at each predictor and split the data into multiple nodes (branches) through recursive partitioning in a `tree-structured regression model`. Each node is also split into leaves (difference between levels of outcome).

Decision trees via `ctree` do the following: 

1. Test the global null hypothesis of independence between predictors and outcome. 
2. Select the predictor with the strongest association with the outcome, measured by multiplicity-adjusted p-values with Bonferroni correction.
3. Implement a binary split in the selected input variable. 
4. Recursively repeat steps 1), 2), and 3).
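Steps 1 and 2 above can be sketched in a few lines of base R. This is a deliberate simplification: `ctree` relies on a permutation-test framework, whereas the sketch below swaps in a Wilcoxon rank-sum test per predictor to get the p-values. It uses two species from the built-in `iris` data as a stand-in binary outcome; the names `raw_p`, `adj_p`, `selected` are hypothetical.

```r
# Binary outcome: versicolor vs virginica
dat <- droplevels(subset(iris, Species != "setosa"))
predictors <- names(dat)[1:4]
is_first <- dat$Species == levels(dat$Species)[1]

# One association test per predictor (assumption: Wilcoxon instead of
# the permutation tests used by ctree)
raw_p <- sapply(predictors, function(v) {
  wilcox.test(dat[[v]][is_first], dat[[v]][!is_first], exact = FALSE)$p.value
})

# Bonferroni correction for testing several predictors at once
adj_p <- p.adjust(raw_p, method = "bonferroni")

# Step 1: global null rejected if any adjusted p-value is below alpha
global_reject <- min(adj_p) < 0.05

# Step 2: the predictor with the smallest adjusted p-value is selected
selected <- names(which.min(adj_p))
list(adjusted_p = adj_p, global_reject = global_reject, selected = selected)
```

If the global null cannot be rejected, the recursion stops and the node becomes a terminal node; otherwise the selected predictor is split in two and the procedure is repeated within each half.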

Let's see this in an example using the same dataset. To understand what the decision tree is doing, we will dissect it, by creating one tree with one predictor and move to the next.


## GLM as a classification tool

### Model specification

We run a GLM with `context` as our outcome, and `Z2-Z1` as our predictor. We want to evaluate whether the two classes can be separated when using the acoustic metric `Z2-Z1`. Context has two levels, and this will be considered as a binomial distribution. 


```{r}
mdl.glm.Z2mnZ1 <- dfPharV2 %>% 
  glm(context ~ Z2mnZ1, data = ., family = binomial)
summary(mdl.glm.Z2mnZ1)
tidy(mdl.glm.Z2mnZ1) %>% 
  select(term, estimate) %>% 
  mutate(estimate = round(estimate, 3))
# to only get the coefficients
mycoef2 <- tidy(mdl.glm.Z2mnZ1) %>% pull(estimate)
```

### Plogis

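The result above shows a statistically significant negative coefficient for `Z2-Z1`: each one-unit (1 Bark) increase in `Z2-Z1` decreases the log odds of `guttural` (the second level of `context`; `non-guttural` is the reference level). The intercept is the log odds of `guttural` when `Z2-Z1` is 0. We can evaluate this further from a classification point of view, using `plogis`, which converts log odds into probabilities.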


```{r}
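# predicted probability of guttural when Z2mnZ1 = 0 (intercept only)
plogis(mycoef2[1])
# predicted probability of guttural when Z2mnZ1 = 1 (intercept + slope)
plogis(mycoef2[1] + mycoef2[2])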
```


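This shows that, based on `Z2-Z1` alone, the model assigns the `guttural` class a predicted probability of 59%. Note that this is a predicted probability for a given value of `Z2-Z1`, not yet a classification accuracy. Let's continue with this model further.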
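As a quick check on what `plogis` is doing, the sketch below fits a logistic regression on the built-in `mtcars` data (a stand-in, not `dfPharV2`) and confirms that applying `plogis` to the linear predictor gives the same probabilities as `predict(..., type = "response")`. The names `m_demo`, `p_at_0`, `p_at_1` are hypothetical.

```r
# Logistic regression: transmission type as a function of weight
m_demo <- glm(am ~ wt, data = mtcars, family = binomial)
b <- coef(m_demo)

# plogis is the inverse logit: log odds -> probability
p_at_0 <- plogis(b[[1]])            # predictor = 0: intercept only
p_at_1 <- plogis(b[[1]] + b[[2]])   # predictor = 1: intercept + slope

# The same two probabilities obtained through predict()
by_predict <- predict(m_demo, newdata = data.frame(wt = c(0, 1)),
                      type = "response")

all.equal(unname(by_predict), c(p_at_0, p_at_1))
all.equal(plogis(0.3), 1 / (1 + exp(-0.3)))
```

With a numeric predictor, the intercept-only value is the predicted probability at a predictor value of 0, which may lie outside the range of the observed data.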

### Model predictions

As above, we obtain predictions from the model. Because the model returns predicted probabilities rather than class labels, we need to set a threshold on the output of the predict function. With a threshold of 50%, any prediction lower than 50% is assigned to one group, and any higher to the other. 

```{r}
pred.glm.Z2mnZ1 <- predict(mdl.glm.Z2mnZ1, type = "response")>0.5

tbl.glm.Z2mnZ1 <- table(pred.glm.Z2mnZ1, dfPharV2$context)
rownames(tbl.glm.Z2mnZ1) <- c("Non-Guttural", "Guttural")
tbl.glm.Z2mnZ1
# from PresenceAbsence
PresenceAbsence::pcc(tbl.glm.Z2mnZ1)
PresenceAbsence::specificity(tbl.glm.Z2mnZ1)
PresenceAbsence::sensitivity(tbl.glm.Z2mnZ1)

roc.glm.Z2mnZ1 <- pROC::roc(dfPharV2$context, as.numeric(pred.glm.Z2mnZ1))
roc.glm.Z2mnZ1
pROC::plot.roc(roc.glm.Z2mnZ1, legacy.axes = TRUE)
```


The model above was able to explain the difference between the two classes with an accuracy of 67.7%. It has a slightly low specificity (0.58) to detect `gutturals`, but a slightly high sensitivity (0.75) to reject the `non-gutturals`. Looking at the confusion matrix, we observe that both groups were relatively accurately identified, but we have relatively large errors (or confusions). The AUC is 0.67, which is not too high. 
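The three summary measures used here are simple ratios of the cells of the confusion matrix. As a minimal sketch with made-up labels (not the `dfPharV2` predictions), the block below computes them by hand, treating class `1` as the "positive" class; which class counts as positive is an assumption you need to state whenever you report sensitivity and specificity.

```r
# Toy observed and predicted labels (hypothetical, 1 = positive class)
observed  <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

# Confusion matrix: predictions in rows, observations in columns
tab <- table(predicted = factor(predicted, levels = c(1, 0)),
             observed  = factor(observed,  levels = c(1, 0)))
tab

tp <- tab["1", "1"]   # true positives
fn <- tab["0", "1"]   # false negatives
fp <- tab["1", "0"]   # false positives
tn <- tab["0", "0"]   # true negatives

accuracy    <- (tp + tn) / sum(tab)   # proportion correctly classified
sensitivity <- tp / (tp + fn)         # positives correctly detected
specificity <- tn / (tn + fp)         # negatives correctly rejected
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

In this toy example the accuracy is 0.7, the sensitivity 0.75 and the specificity about 0.67.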

Let's continue with GLM to evaluate it further. We start by running a correlation test to evaluate issues with GLM.


## Individual trees

### Tree 1

```{r}
## from the package party
set.seed(123456)
tree1 <- dfPharV2 %>% 
  ctree(
    context ~ Z2mnZ1, 
    data = .)
print(tree1)
plot(tree1, main = "Conditional Inference Tree ")
```


How to interpret this figure? Let's look at mean values and a plot for this variable. This is the difference between `F2` and `F1` using the Bark scale. Because gutturals are produced within the pharynx (regardless of where), the prediction is that a high `F1` and a low `F2` will be the acoustic correlates related to this constriction location. The closeness between these formants yields a lower `Z2-Z1`. Hence, the prediction is as follows: the smaller the difference, the more pharyngeal-like constriction these consonants have (all else being equal!). Let's compute the mean/median and plot the difference between the two contexts.

```{r}
dfPharV2 %>% 
  group_by(context) %>% 
  summarise(mean = mean(Z2mnZ1),
            median = median(Z2mnZ1), 
            count = n())

dfPharV2 %>% 
  ggplot(aes(x = context, y = Z2mnZ1)) + 
  geom_boxplot()  
```


The table above reports the mean and median of `Z2-Z1` for both levels of context and the plot shows the difference between the two. We have a total of 180 cases in the `guttural`, and 222 in the `non-guttural`. 
When considering the conditional inference tree output, various splits were obtained. 
The first is any value higher than 9.55 being assigned to the `non-guttural` class (around 98% of 75 cases).
Then, with anything lower than 9.55, a second split was obtained with a threshold of 6.78: higher values are assigned to `guttural` (around 98% of 64 cases), and lower values were split again with a threshold of 4 Bark. In this third split, values lower than or equal to 4 Bark are assigned to the `guttural` (around 70% of 157 cases) and values higher than 4 Bark are assigned to the `non-guttural` (around 90% of 106 cases).

Dissecting the tree like this allows interpretation of the output. In this example, this is quite a complex case and `ctree` allowed us to fine-tune the different patterns seen. 
Now let's look at the full dataset to make sense of how the combination of predictors contributes to the difference. 

## Model 1

### Model specification

```{r}
set.seed(123456)
fit <- dfPharV2  %>% 
  ctree(
    context ~ ., 
    data = .)
print(fit)
plot(fit, main = "Conditional Inference Tree")
```


How to interpret this complex decision tree? 

Let's obtain the mean value for each predictor grouped by context. Discuss some of the patterns. 

```{r}
dfPharV2 %>% 
  group_by(context) %>% 
  summarize_all(list(mean = mean))
```



We started with `context` as our outcome, and all 23 acoustic measures as predictors. A total of 8 terminal nodes were identified with multiple binary splits in their leaves, allowing separation of the two categories. Looking specifically at the output, we observe a few things.

The first node was based on `A2*-A3*`, detecting a difference between non-gutturals and gutturals. For the first binary split, a threshold of -13.78 Bark was used (mean non guttural = -7.86; mean guttural = -14.58), then for values lower than or equal to this threshold, a second split was performed using `Z4-Z3` (mean non guttural = 1.67; mean guttural = 1.43) with any value smaller than or equal to 1.59, then another binary split using `H2*-H4*`, etc.

Once done, the `ctree` provides multiple binary splits into guttural or non-guttural. 

Any possible issues/interesting patterns you can identify? Look at the interactions between predictors. 


### Predictions from the full model

Let's obtain some predictions from the model and evaluate how successful it is in dealing with the data. 

```{r}
set.seed(123456)
pred.ctree <- predict(fit)
tbl.ctree <- table(pred.ctree, dfPharV2$context)
tbl.ctree
PresenceAbsence::pcc(tbl.ctree)
PresenceAbsence::specificity(tbl.ctree)
PresenceAbsence::sensitivity(tbl.ctree)

roc.ctree <- pROC::roc(dfPharV2$context, as.numeric(pred.ctree))
roc.ctree
pROC::plot.roc(roc.ctree, legacy.axes = TRUE)
```

This full model has a classification accuracy of 82.8%. This is not bad!! It has a relatively moderate specificity at 0.77 (at detecting the gutturals) but a high sensitivity at 0.87 (at detecting the non-gutturals). The ROC curve shows the relationship between the two with an AUC of 0.823.
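One detail worth knowing about the AUC values reported in this section: because the ROC curve is built from hard class predictions (0/1) rather than predicted probabilities, the curve has a single cut-point and the AUC reduces to the average of sensitivity and specificity. The sketch below shows this on made-up labels using the rank-based (Mann-Whitney) formula for the AUC; all object names are hypothetical and no package is needed.

```r
# Toy observed classes and hard (0/1) predictions
observed  <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

n_pos <- sum(observed == 1)
n_neg <- sum(observed == 0)

# Rank-based AUC: probability that a random positive case gets a higher
# score than a random negative case (ties count as one half)
r <- rank(predicted)
auc_rank <- (sum(r[observed == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

# With hard predictions, this equals the mean of sensitivity and specificity
sens <- mean(predicted[observed == 1] == 1)
spec <- mean(predicted[observed == 0] == 0)
auc_from_rates <- (sens + spec) / 2

c(auc_rank = auc_rank, auc_from_rates = auc_from_rates)
```

If you want an AUC that reflects how well the model ranks cases across all thresholds, pass predicted probabilities to the ROC function instead of class labels.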


## Random selection

One important issue is that the trees we grew above are biased. They are based on the full dataset, which means they are very likely to overfit the data. We did not add any random selection and we only grew one tree each time. If you think about it, is it possible that we obtained such results simply by chance? 

What if we add some randomness in the process of creating a conditional inference tree?


We change a small option in `ctree` to allow for random selection of variables, to mimic what Random Forests will do. We use `controls` to specify `mtry = 5`, which is the rounded square root of number of predictors. 


### Model 2

```{r}
set.seed(123456)
fit1 <- dfPharV2  %>% 
  ctree(
    context ~ ., 
    data = .,
    controls = ctree_control(mtry = 5))
plot(fit1, main = "Conditional Inference Tree")
pred.ctree1 <- predict(fit1)
tbl.ctree1 <- table(pred.ctree1, dfPharV2$context)
tbl.ctree1
PresenceAbsence::pcc(tbl.ctree1)
PresenceAbsence::specificity(tbl.ctree1)
PresenceAbsence::sensitivity(tbl.ctree1)

roc.ctree1 <- pROC::roc(dfPharV2$context, as.numeric(pred.ctree1))
roc.ctree1
pROC::plot.roc(roc.ctree1, legacy.axes = TRUE)
```


Can you compare the results among yourselves and discuss what is going on? 

When adding one random selection process to our `ctree`, we allow it to obtain more robust predictions. We could even go further and grow multiple small trees with a portion of datapoints (e.g., 100 rows, 200 rows). When doing these multiple random selections, you are growing multiple trees that are decorrelated from each other. These become independent trees and one can combine the results of these trees to come up with clear predictions. 

This is how Random Forests work. You would start from a dataset, then grow multiple trees, vary number of observations used (nrow), and number of predictors used (mtry), adjust branches, and depth of nodes and at the end, combine the results in a forest. You can also run permutation tests to evaluate contributions of each predictor to the outcome. This is the beauty of Random Forests. They do all of these steps automatically at once for you! 
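The recipe above can be sketched by hand in base R. The block below is a toy illustration, not how `party` or `ranger` are implemented: each "tree" is a one-split decision stump, grown on a 63.2% subsample with 2 randomly chosen predictors, and the out-of-bag (OOB) votes of all stumps are combined by majority vote. It uses two species of the built-in `iris` data as a stand-in; the helper names `fit_stump` and `predict_stump` are hypothetical.

```r
set.seed(1)
dat <- droplevels(subset(iris, Species != "setosa"))
y <- dat$Species
X <- dat[, 1:4]
n <- nrow(dat)

# Find the single best split (variable, cut-point, direction) by accuracy
fit_stump <- function(X, y) {
  best <- list(acc = -Inf)
  for (v in names(X)) {
    for (cut in unique(X[[v]])) {
      for (hi in levels(y)) {
        lo <- setdiff(levels(y), hi)
        acc <- mean(ifelse(X[[v]] > cut, hi, lo) == y)
        if (acc > best$acc) {
          best <- list(var = v, cut = cut, hi = hi, lo = lo, acc = acc)
        }
      }
    }
  }
  best
}
predict_stump <- function(s, X) ifelse(X[[s$var]] > s$cut, s$hi, s$lo)

ntree <- 200
mtry  <- 2
votes <- matrix(0, n, nlevels(y), dimnames = list(NULL, levels(y)))

for (b in seq_len(ntree)) {
  inbag <- sample(n, size = round(0.632 * n))   # rows used to grow the stump
  vars  <- sample(names(X), mtry)               # random subset of predictors
  s     <- fit_stump(X[inbag, vars, drop = FALSE], y[inbag])
  oob   <- setdiff(seq_len(n), inbag)           # rows not seen by this stump
  p     <- predict_stump(s, X[oob, , drop = FALSE])
  idx   <- cbind(oob, match(p, levels(y)))
  votes[idx] <- votes[idx] + 1                  # record one vote per OOB row
}

# Majority vote over the stumps for which each row was out-of-bag
oob_pred <- levels(y)[max.col(votes, ties.method = "first")]
oob_acc  <- mean(oob_pred == y)
oob_acc
```

Because every row is only ever predicted by stumps that did not see it during training, the OOB accuracy is a more honest estimate than the accuracy of a single tree evaluated on its own training data.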


# Random Forests {.tabset .tabset-fade .tabset-pills}

As their name indicates, Random Forests are forests of trees implemented through bagging ensemble algorithms. Each tree has multiple branches (nodes), and will provide predictions based on recursive partitioning of the data. Then using the predictions from the multiple grown trees, Random Forests will create `averaged` predictions and come up with prediction accuracy, etc. 

There are multiple packages that one can use to grow Random Forests:

1. `randomForest`: The original implementation of Random Forests.
2. `party` and `partykit`: using conditional inference trees as base learners
3. `ranger`: a reimplementation of Random Forests; faster and more flexible than original implementation

The first implementation of Random Forests is widely used in research. One of the issues in this first implementation is that it favoured specific types of predictors (e.g., categorical predictors, predictors with multiple cut-offs, etc). Random Forests grown via Conditional Inference Trees as implemented in `party` guard against this bias, but they are computationally demanding. Random Forests grown via permutation tests as implemented in `ranger` speed up the computations and can mimic the unbiased selection process. 

## Declare parallel computing

We start by declaring parallel computing on your device. This is essential to run these complex computations. The code below detects the cores available on your machine and uses them; the computations here are not too complex, but if you increase the complexity of your computations, you will need parallel computing. 

```{r}

set.seed(123456)

#Declare parallel computing 
ncores <- availableCores()
cat(paste0("Number of cores available for model calculations set to ", ncores, "."))
registerDoFuture()
# plan(multisession) starts its own background R sessions, so no separate
# makeClusterPSOCK() call is needed (an unassigned cluster would leave idle workers behind)
plan(multisession, workers = ncores)
ncores

# set.seed() above fixes the random number generator so that results can be replicated;
# this will mostly be used within the tidymodels below
# the option below suppresses RNG-related warnings from doFuture
options(doFuture.rng.onMisuse = "ignore")
```



## Party

Random Forests grown via conditional inference trees are different from the original implementation. They offer an unbiased selection process that guards against overfitting of the data. There are various points we need to consider in growing the forest, including the number of trees and the number of predictors to use each time. Let us run our first Random Forest via conditional inference trees. To make sure the code runs as fast as it can, we use a very low number of trees: only 100. It is well known that the more trees you grow, the more confidence you have in the results, as model estimation will be more stable. In this example, I would easily go with 500 trees.

### Model specification

To grow the forest, we use the function `cforest`. We use all of the dataset for the moment. We need to specify a few options within controls: 

1. `ntree = 100` = number of trees to grow. Default = 500.
2. `mtry = round(sqrt(23))`: number of predictors to use each time. Default is 5, but specifying it is advised to account for the structure of the data

By default, `cforest_unbiased` has two additional important options that are used for an unbiased selection process. **WARNING**: you should not change these unless you know what you are doing. Also, by default, each tree is grown on a subsample of the data that acts as a training set, and the remaining observations (the out-of-bag sample) act as a testing set. The training part is roughly 2/3 of the data; the testing part is roughly 1/3.

1. `replace = FALSE` = Use subsampling with or without replacement. Default is `FALSE`, i.e., use subsets of the data without replacement.  
2. `fraction = 0.632` = Use 63.2% of the data to grow each tree. 
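The value 0.632 is not arbitrary: it is very close to `1 - exp(-1)`, the expected proportion of distinct observations in a bootstrap sample drawn with replacement. Presumably this is why it is the conventional fraction for subsampling without replacement. The sketch below checks this by simulation and shows the resulting in-bag and out-of-bag sizes for a dataset of 402 rows (the number of cases reported above); the object names are hypothetical.

```r
set.seed(2024)
n <- 402

# Proportion of distinct rows in 500 bootstrap samples (with replacement)
boot_unique <- replicate(500, length(unique(sample(n, n, replace = TRUE))) / n)
c(simulated = mean(boot_unique), theoretical = 1 - exp(-1))

# Subsampling without replacement at the same fraction
inbag <- sample(n, size = round(0.632 * n))   # rows used to grow one tree
oob   <- setdiff(seq_len(n), inbag)           # rows left out for that tree
c(inbag = length(inbag), oob = length(oob))
```

Each tree draws its own subsample, so over a whole forest every observation ends up out-of-bag for roughly a third of the trees.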


```{r}
set.seed(123456)
mdl.cforest <- dfPharV2 %>% 
  cforest(context ~ ., data = ., 
          controls = cforest_unbiased(ntree = 100, 
                                      mtry = round(sqrt(23))))
```


### Predictions

To obtain predictions from the model, we use the `predict` function and add `OOB = TRUE`. This uses the out-of-bag sample (i.e., roughly 1/3 of the data). 

```{r}
set.seed(123456)
pred.cforest <- predict(mdl.cforest, OOB = TRUE)
tbl.cforest <- table(pred.cforest, dfPharV2$context)
tbl.cforest
PresenceAbsence::pcc(tbl.cforest)
PresenceAbsence::specificity(tbl.cforest)
PresenceAbsence::sensitivity(tbl.cforest)

roc.cforest <- pROC::roc(dfPharV2$context, as.numeric(pred.cforest))
roc.cforest
pROC::plot.roc(roc.cforest, legacy.axes = TRUE)
```

Compared with the 82.8% classification accuracy we obtained with `ctree` on our full dataset above (model 1), here we obtain 85.5%, an increase of 2.7 percentage points. Compared with the 67.4% from model 2 from `ctree` with random selection of predictors, we have an increase of 18.1 percentage points in classification accuracy!

We could test whether there is a statistically significant difference between our `ctree` and `cforest` models. Using the ROC curves, `roc.test` conducts a non-parametric Z test of significance on the correlated ROC curves. The results show a statistically significant improvement using the `cforest` model. This is expected because we are growing 100 different trees, with random selection of both predictors and samples, and providing an `averaged` prediction. 

```{r}
pROC::roc.test(roc.ctree, roc.cforest)
pROC::roc.test(roc.ctree1, roc.cforest)
```


### Variable Importance Scores

One important feature in `ctree` was to show which predictor was used first in splitting the data, which was then followed by the other predictors. We use a similar functionality with `cforest` to obtain variable importance scores to pinpoint `strong` and `weak` predictors. 

There are two ways to obtain this:

1. Simple permutation tests (conditional = FALSE)
2. Conditional permutation tests (conditional = TRUE)

The former is generally comparable across packages and provides a normal permutation test; the latter runs a permutation test on a grid defined by the correlation matrix and corrects for possible collinearity. This is similar to a regression analysis, but looks at both main effects and interactions. 

You could use the normal `varimp` as implemented in `party`. This uses mean decrease in accuracy scores. We will instead obtain variable importance scores via AUC-based permutation tests, as these use both accuracy and errors in the model, using `varImpAUC` from the `varImp` package.

**DANGER ZONE**: conditional permutation tests require a lot of RAM. Unless you have access to a cluster and/or a lot of RAM, do not attempt to run them. We will run the non-conditional version here for demonstration.
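The idea behind a simple (non-conditional) permutation test can be sketched by hand: shuffle one predictor, so that its link with the outcome is broken, and measure how much the accuracy of the model drops. The block below does this with a logistic regression on simulated data rather than with a forest, purely to keep the example small and fast; `x1` is built to be informative and `noise` is not, and all names are hypothetical.

```r
set.seed(99)
n <- 500
demo <- data.frame(x1 = rnorm(n), noise = rnorm(n))
demo$y <- rbinom(n, 1, plogis(3 * demo$x1))   # only x1 drives the outcome

m_demo <- glm(y ~ x1 + noise, data = demo, family = binomial)

# Classification accuracy at a 0.5 threshold
accuracy <- function(d) {
  mean((predict(m_demo, newdata = d, type = "response") > 0.5) == d$y)
}
baseline <- accuracy(demo)

# Importance = mean decrease in accuracy after shuffling one predictor
perm_importance <- sapply(c("x1", "noise"), function(v) {
  drops <- replicate(50, {
    shuffled <- demo
    shuffled[[v]] <- sample(shuffled[[v]])
    baseline - accuracy(shuffled)
  })
  mean(drops)
})
round(perm_importance, 3)
```

A predictor whose shuffling leaves the accuracy unchanged, like `noise` here, gets an importance close to 0. The conditional version additionally shuffles within groups defined by correlated predictors, which is what makes it so much more demanding.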

#### Non-conditional permutation tests

```{r}
set.seed(123456)
VarImp.cforest <- varImp::varImpAUC(mdl.cforest, conditional = FALSE)
lattice::barchart(sort(VarImp.cforest))
```

The Variable Importance Scores via non-conditional permutation tests showed that `A2*-A3*` (i.e., energy in mid-high frequencies around F2 and F3) is the most important variable at explaining the difference between gutturals and non-gutturals, followed by `Z4-Z3` (pharyngeal constriction), `H1*-A3*` (energy in mid-high frequency component), `Z2-Z1` (degree of compactness), `Z3-Z2` (spectral divergence), `H1*-A2` (energy in mid frequency component) and `Z1-Z0` (degree of openness). All other predictors contribute to the contrast but to varying degrees (from `H1*-H2*` to `H1*-A1*`). The last 5 predictors are the least important, and the CPP has a 0 mean decrease in accuracy and can even be ignored. 


#### Conditional permutation tests

```{r}
set.seed(123456)
VarImp.cforest <- varImp::varImpAUC(mdl.cforest, conditional = TRUE)
lattice::barchart(sort(VarImp.cforest))
```




### Conclusion

The `party` package is powerful at growing Random Forests via conditional Inference trees, but is computationally prohibitive when increasing number of trees and using conditional permutation tests of variable importance scores. We look next at the package `ranger` due to its speed in computation and flexibility. 

## Ranger

The `ranger` package proposes a reimplementation of the original Random Forests algorithms, written in C++ and allows for parallel computing. It offers more flexibility in terms of model specification. 

### Model specification

In the model specification below, there are already a few options we are familiar with, with additional ones described below:

1. `num.trees` = Number of trees to grow. We use the default value (500)
2. `mtry` = Number of predictors to use. Default = `floor(sqrt(Variables))`. For compatibility with `party`, we use `round(sqrt(23))`
3. `replace = FALSE` = Use subsampling with or without replacement. Default `replace = TRUE`, i.e., is **with** replacement. 
4. `sample.fraction = 0.632` = Use 63.2% of the data to grow each tree. Default is full dataset, i.e., `sample.fraction = 1`
5. `importance = "permutation"` = Compute variable importance scores via permutation tests
6. `scale.permutation.importance = FALSE` = whether to scale the permutation importance scores by their standard error. Default is `FALSE`; we state it explicitly because scaling is likely to introduce biases in variable importance estimation. 
7. `splitrule = "extratrees"` = rule used for splitting trees. 
8. `num.threads` = allow for parallel computing. Here we use the number of cores detected above (`ncores`), but you can specify any number of threads available on your computer (or cluster).


We use options 2-7 to make sure we have an unbiased selection process with `ranger`. You can try on your own running the model below by using the defaults to see how the rate of classification increases more, but with the caveat that it has a biased selection process. 

```{r}
set.seed(123456)
mdl.ranger <- dfPharV2 %>% 
  ranger(context ~ ., data = ., num.trees = 500, mtry = round(sqrt(23)),
         replace = FALSE, sample.fraction = 0.632, 
         importance = "permutation", scale.permutation.importance = FALSE,
         splitrule = "extratrees", num.threads = ncores)
mdl.ranger
```


Results of our Random Forest show an OOB (Out-Of-Bag) error rate of 8.2%, i.e., an accuracy of 91.8%.

### Going further

Unfortunately, when growing a forest with `ranger`, the `predict` function has no option comparable to `OOB = TRUE` in `party` (the fitted object does store OOB predictions in `$predictions`, but `predict` cannot produce them). To evaluate the model on unseen data, we need to hard-code this. We split the data into a training and a testing set. The training will be on 2/3 of the data; the testing is on the remaining 1/3. 

#### Create a training and a testing set

```{r}
set.seed(123456)
train.idx <- sample(nrow(dfPharV2), 2/3 * nrow(dfPharV2))
gutt.train <- dfPharV2[train.idx, ]
gutt.test <- dfPharV2[-train.idx, ]
```


#### Model specification

We use the same model specification as above, except that we fit it on the training set and save the forest (with `write.forest = TRUE`).

```{r}
set.seed(123456)
mdl.ranger2 <- gutt.train %>% 
  ranger(context ~ ., data = ., num.trees = 500, mtry = round(sqrt(23)),
         replace = FALSE, sample.fraction = 0.632, 
         importance = "permutation", scale.permutation.importance = FALSE,
         splitrule = "extratrees", num.threads = ncores, write.forest = TRUE)
mdl.ranger2
```

With the training set, we have an OOB error rate of 9.3%; i.e., an accuracy rate of 90.7%.

#### Predictions
 
For the predictions, we use the testing set as a validation set. These are unseen data that were not used in training, so the results are to be considered a true reflection of the model's performance.


```{r}
set.seed(123456)
pred.ranger2 <- predict(mdl.ranger2, data = gutt.test)
tbl.ranger2 <- table(pred.ranger2$predictions, gutt.test$context)
tbl.ranger2
PresenceAbsence::pcc(tbl.ranger2)
PresenceAbsence::specificity(tbl.ranger2)
PresenceAbsence::sensitivity(tbl.ranger2)

roc.ranger <- pROC::roc(gutt.test$context, as.numeric(pred.ranger2$predictions))
roc.ranger
pROC::plot.roc(roc.ranger, legacy.axes = TRUE)
```


The classification rate based on the testing set is 86.6%. This is comparable to the one we obtained with `cforest`. The changes in the settings allow for similarities in the predictions obtained from both `party` and `ranger`. 
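To make the three summary measures concrete, here is a minimal sketch that computes them by hand from a 2 x 2 table. The counts are made up for illustration and are not the tutorial's results. Treating "Guttural" as the positive class is an assumption, as is the label "Guttural" itself.

```r
# Hypothetical confusion table (made-up counts, NOT the tutorial's results)
# rows = predicted class, columns = observed class
conf <- matrix(c(40, 10,
                  5, 45),
               nrow = 2, byrow = TRUE,
               dimnames = list(predicted = c("Guttural", "Non-Guttural"),
                               observed  = c("Guttural", "Non-Guttural")))

# Proportion correctly classified: correct cells over all cases
accuracy <- sum(diag(conf)) / sum(conf)

# Sensitivity: observed positives that were predicted as positive
sensitivity <- conf["Guttural", "Guttural"] / sum(conf[, "Guttural"])

# Specificity: observed negatives that were predicted as negative
specificity <- conf["Non-Guttural", "Non-Guttural"] / sum(conf[, "Non-Guttural"])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

Which class counts as "positive" depends on how the table is laid out, so it is worth checking the row and column order before reading sensitivity and specificity.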

#### Variable Importance Scores

##### Default

For the variable importance scores, we obtain them from either the training set or the full model above.


```{r}
set.seed(123456)
lattice::barchart(sort(mdl.ranger2$variable.importance), main = "Variable Importance scores - training set")
lattice::barchart(sort(mdl.ranger$variable.importance), main = "Variable Importance scores - full set")
```


There are similarities between `cforest` and `ranger`, with minor differences. With `ranger`, `Z2-Z1` is the best predictor at explaining the differences between gutturals and non-gutturals, followed by `Z3-Z2` and then `A2*-A3*` (the reverse with `cforest`!). The order of the additional predictors is slightly different between the two models. This is expected as the `cforest` model only used 100 trees, whereas the `ranger` model used 500 trees.


A clear difference between the packages `party` and `ranger` is that the former allows for conditional permutation tests for variable importance scores; this is absent from `ranger`. However, there is a debate in the literature on whether correlated data are harmful within Random Forests. The way Random Forests work, i.e., the randomness in the selection of data points, predictors, splitting rules, etc., allows the trees to be decorrelated from each other. Hence, the conditional permutation tests may not be required. What they do offer is to condition variable importance scores on each other (based on correlation tests) to mimic what a multiple regression analysis does (but without suffering from suppression!). Strong predictors will show a major contribution, while weak ones will be squashed, giving them extremely low (or even negative) scores. Within `ranger`, it is possible to evaluate this by estimating p values associated with each variable importance score. We use the `altmann` method. See the documentation for more details. 

**DANGER ZONE**: This requires heavy computations. Use all cores on your machine or run it on the cluster. Recommendations are to use at least 100 permutations, i.e., `num.permutations = 100`. Here, we only use 20 to show the output.

##### With p values

```{r}
set.seed(123456)
VarImp.pval <- importance_pvalues(mdl.ranger2, method = "altmann",
                                  num.permutations = 20, 
                                  formula = context ~ ., data = gutt.train,
                                  num.threads = ncores)
VarImp.pval
```


The output above shows a p value for each predictor. Almost all predictors share the lowest value, 0.048; only CPP is higher, at 0.14. Recall that CPP received the lowest variable importance score within `ranger` and `cforest`. If you increase the number of permutations to 100 or 200, you will get more confidence in your results and can report the p values.
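The value 0.048 is not a coincidence. A minimal sketch, assuming the usual permutation p value formula `(r + 1) / (n + 1)`, where `r` is the number of permuted importance scores at least as large as the observed one and `n` is the number of permutations, shows the smallest p value that each setting can produce. The variable names are illustrative only.

```r
# Number of permutations to compare
num_permutations <- c(20, 100, 200)

# Smallest attainable p value: no permuted score reaches the observed one (r = 0)
min_p <- (0 + 1) / (num_permutations + 1)

data.frame(num_permutations, min_p = round(min_p, 4))

# With 20 permutations, r = 2 would give a p value of about 0.14
p_r2 <- (2 + 1) / (20 + 1)
round(p_r2, 2)
```

Under this assumption, 20 permutations can never yield a p value below roughly 0.048, which is why more permutations are needed before the p values become informative.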


In the next part, we look at the `tidymodels` and introduce their philosophy. 

## Random forests with Tidymodels

The `tidymodels` are a bundle of packages used to streamline and simplify the use of machine learning. The `tidymodels` are not restricted to Random Forests: you can also use them to run simple linear models, logistic regressions, PCA, Deep Learning, etc.

The `tidymodels`' philosophy is to separate data processing on the training and testing sets, and to use a workflow. Below is a full example of how one can run Random Forests via `ranger` using the `tidymodels`.

### Training and testing sets

We start by creating a training and a testing set using the function `initial_split`. Using `strata = "context"` makes the split take the structure of the data into account, so that the proportions of each group are preserved in both sets. 

```{r}
set.seed(123456)
train_test_split <-
  initial_split(
    data = dfPharV2,
    strata = "context",
    prop = 0.667) 
train_test_split
train_tbl <- train_test_split %>% training() 
test_tbl  <- train_test_split %>% testing()
```


### Set for cross-validation

We can, if we want to, create a 10-fold cross-validation set on the training set. This allows us to fine-tune the training by obtaining the forest with the highest accuracy. This is a clear difference from `ranger`. While it is not impossible to hard-code that, `tidymodels` simplify it for us!!

```{r}
set.seed(123456)
train_cv <- vfold_cv(train_tbl, v = 10, strata = "context")
```
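For intuition, the following base R sketch assigns rows to 10 folds by hand. It is unstratified and uses a made-up sample size, so it only shows the general mechanism, not what `vfold_cv` does internally.

```r
set.seed(123456)

n <- 95   # hypothetical number of training rows
v <- 10   # number of folds

# Give each row a fold label, then shuffle the labels
fold_id <- sample(rep(seq_len(v), length.out = n))
table(fold_id)

# For fold 1: rows held out for assessment, and rows used to fit the model
assessment_rows <- which(fold_id == 1)
analysis_rows   <- which(fold_id != 1)

c(analysis = length(analysis_rows), assessment = length(assessment_rows))
```

Each row is held out exactly once across the 10 folds, so every observation contributes to the cross-validated estimate.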

### Model Specification

Within the model specification, we need to specify multiple options:

1. A `recipe`: This defines any data processing one wants to apply to the data.
2. An `engine`: We need to specify the `engine` to use. Here we want to run a Random Forest.
3. A `tuning`: Here we can tune our engine.
4. A `workflow`: Here we specify the various steps of the workflow.


#### Recipe

When defining the recipe, you need to think of the type of "transformations" you will apply to your data. 

1. Z-scoring is the first thing that comes to mind. When you z-score the data, you are allowing all strong and weak predictors to be considered equally by the model. This is important as some of our predictors have very large differences related to the levels of context and have different measurement scales. We could have applied it above, but we need to make sure to estimate the means and standard deviations on the training set only and then apply them to both the training and testing sets (otherwise, our model suffers from data leakage).
2. If you have any missing data, you can use central imputations (e.g., the mean or median) to fill in missing data (random forests do not like missing data, though they can work with them). 
3. You can apply PCA on all your predictors to remove collinearity before running random forests. This is a great option to consider, but adds more complexity to your model. 
4. Finally, if you have categorical predictors, you can transform them into dummy variables using `step_dummy()` (0s and 1s for binary predictors), or use one-hot encoding with `step_dummy(predictor, one_hot = TRUE)`.

See the documentation of `tidymodels` for what you can apply!!
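The point about data leakage in z-scoring can be sketched in base R: the means and standard deviations are estimated on the training data only and then reused for the testing data. The matrices below are simulated and purely illustrative.

```r
set.seed(123456)

# Simulated predictors (hypothetical data): 20 training and 10 testing rows
x_train <- matrix(rnorm(40, mean = 100, sd = 15), ncol = 2)
x_test  <- matrix(rnorm(20, mean = 100, sd = 15), ncol = 2)

# Estimate the scaling parameters on the training set only
train_mean <- colMeans(x_train)
train_sd   <- apply(x_train, 2, sd)

# Apply the SAME parameters to both sets
z_train <- scale(x_train, center = train_mean, scale = train_sd)
z_test  <- scale(x_test,  center = train_mean, scale = train_sd)

round(colMeans(z_train), 10)  # zero by construction
colMeans(z_test)              # close to zero, but not exactly zero
```

The testing set is never used to compute the scaling parameters, which is the same principle a prepped recipe follows when it is baked on new data.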


```{r}
set.seed(123456)
recipe <-  
  train_tbl %>%
  recipe(context ~ .) %>%
  step_center(all_predictors(), -all_outcomes()) %>%
  step_scale(all_predictors(), -all_outcomes()) %>% 
  prep()

trainData_baked <- bake(recipe, new_data = train_tbl) # apply the prepped recipe to the training data
trainData_baked

```

Once we have prepared the `recipe`, we can `bake` it to see the changes applied to the data.

#### Predictors remaining


```{r, fig.width=20, fig.height=15}
box_fun_plot = function(data, x, y) {
  ggplot(data = data, aes(x = .data[[x]],
                          y = .data[[y]],
                          fill = .data[[x]])) +
    geom_boxplot() +
    labs(title = y,
         x = x,
         y = y) +
    # theme_bw() must come first: a complete theme added last would reset legend.position
    theme_bw() +
    theme(
      legend.position = "none"
    )
}

# Create vector of predictors (all columns except the outcome)
expl <- setdiff(names(trainData_baked), "context")

# Loop vector with map
expl_plots_box <- map(expl, ~box_fun_plot(data = trainData_baked, x = "context", y = .x) )
plot_grid(plotlist = expl_plots_box)
```




#### Setting the engine

We define the model here as a `rand_forest` in classification mode. Then, we set the engine with engine-specific parameters. 

```{r}
set.seed(123456)
engine_tidym <- rand_forest(
    mode = "classification",
    engine = "ranger",
    mtry = tune(),   # placeholder: value chosen during tuning
    trees = tune(),  # placeholder: value chosen during tuning
    min_n = 1
  ) %>% 
  set_engine("ranger", importance = "permutation", sample.fraction = 0.632,
             replace = FALSE, write.forest = TRUE, splitrule = "extratrees",
             scale.permutation.importance = FALSE) # we add engine specific settings
```

#### Settings for tuning

If we want to tune the model, we use the grid below. It is important to use an `mtry` that hovers around `round(sqrt(Variables))`. If you use all available variables, then your forest is biased as it is able to see all predictors. For the number of trees, low numbers are not great, as you can easily underfit the data and not produce meaningful results. Large numbers are fine, and Random Forests do not overfit (in theory). 

The full dataset has around 2000 observations and 23 predictors (well, even more, but let's ignore that for the moment). I tuned `mtry` to be between 4 and 6, and `trees` to be between 1000 and 5000, with 30 candidate combinations. In total, with a 10-fold cross-validation, I grew 30 random forests on each fold for a total of 300 Random Forests on the training set!!! This of course will take a loooooong time to compute on your computer if using one thread. So use parallel computing or a cluster. Running on the cluster with 20 cores, each with 11GB of RAM (220GB of RAM in total), took around 260.442 seconds! Of course, with less RAM and fewer cores, the code will still run but will take longer. 


```{r}
set.seed(123456)
gridy_tidym <- grid_random(
  mtry() %>% range_set(c(4, 6)),
  trees() %>% range_set(c(1000, 2000)),
  size = 30
  )
```
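To see what a random grid of this kind looks like and how the number of fitted forests adds up, here is a base R sketch. It only imitates the idea of a random grid and is not how `grid_random` is implemented; dropping duplicate combinations is an assumption of this sketch.

```r
set.seed(123456)

n_candidates <- 30  # size of the random grid
n_folds      <- 10  # folds in the cross-validation

# Draw random candidate values within the tuning ranges
toy_grid <- data.frame(
  mtry  = sample(4:6, n_candidates, replace = TRUE),
  trees = sample(1000:2000, n_candidates, replace = TRUE)
)

# Drop duplicated combinations, if any (assumption of this sketch)
toy_grid <- unique(toy_grid)
head(toy_grid)

# One forest per candidate per fold
n_fits <- nrow(toy_grid) * n_folds
n_fits
```

With 30 distinct candidates and 10 folds, this gives the 300 forests mentioned in the text.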

#### Workflow

Now we define the workflow adding the `recipe` and the `model`.

```{r}
set.seed(123456)
wkfl_tidym <- workflow() %>% 
  add_recipe(recipe) %>% 
  add_model(engine_tidym)
```

#### Tuning and running model

Here we run the model starting with the workflow, the cross-validation sample, the tuning parameters and asking for specific metrics. 

The model below will do the following:

1. Use a 10-fold cross-validation on the training set
2. Tune the hyper-parameters to reach the model with the best predictions
3. Within each fold, we grow 30 random forests; we have a total of 300 Random Forests, and we use a ROC-AUC based search for the best performing model

Of course, you could use a larger grid size to grow more forests, but this will take longer to run!

The model will run for about 2-3 minutes on an 8-core machine with 32GB of RAM. For demonstration purposes, the tuning of the number of trees is restricted to between 1000 and 2000 trees. This can of course be increased to 5000 trees (or more) depending on the size of the dataset.
 

```{r}
set.seed(123456)
system.time(grid_tidym <- 
  tune_grid(wkfl_tidym, 
            resamples = train_cv,
            grid = gridy_tidym,
            metrics = metric_set(accuracy, roc_auc, sens, spec,f_meas, precision, recall),
            control = control_grid(save_pred = TRUE, parallel_over = NULL))
)
print(grid_tidym)
```



#### Finalise model

We obtain the best performing model from cross-validation, then finalise the workflow: we refit it on the training set, predict on the testing set, and obtain the results of the best performing model.

```{r}
set.seed(123456)
collect_metrics(grid_tidym)
grid_tidym_best <- select_best(grid_tidym, metric = "roc_auc")
grid_tidym_best
wkfl_tidym_best <- finalize_workflow(wkfl_tidym, grid_tidym_best)
wkfl_tidym_final <- last_fit(wkfl_tidym_best, split = train_test_split)
```
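Conceptually, selecting the best model amounts to picking the candidate with the highest mean ROC-AUC across folds. The base R sketch below shows this on a made-up summary table; the numbers and the column name `mean_roc_auc` are hypothetical and do not come from the tuning above.

```r
# Made-up cross-validation summary (NOT the tutorial's results)
toy_metrics <- data.frame(
  mtry         = c(4, 5, 6),
  trees        = c(1200, 1500, 1800),
  mean_roc_auc = c(0.91, 0.94, 0.93)
)

# Keep the hyper-parameters of the candidate with the highest mean ROC-AUC
best_row <- which.max(toy_metrics$mean_roc_auc)
toy_best <- toy_metrics[best_row, c("mtry", "trees")]
toy_best
```

These selected values are what gets plugged back into the workflow before the final fit.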

### Results

For the results, we can obtain various metrics on the training and testing sets. 

#### Cross-validation on training set

##### Accuracy

```{r}
percent(show_best(grid_tidym, metric = "accuracy", n = 1)$mean)
```

##### ROC-AUC

```{r}
# Cross-validated training performance
show_best(grid_tidym, metric = "roc_auc", n = 1)$mean
```


##### Sensitivity

```{r}
show_best(grid_tidym, metric = "sens", n = 1)$mean
```

##### Specificity

```{r}
show_best(grid_tidym, metric = "spec", n = 1)$mean
```



##### F-measure

```{r}
# Cross-validated training performance
show_best(grid_tidym, metric = "f_meas", n = 1)$mean
```


##### Precision

```{r}
# Cross-validated training performance
show_best(grid_tidym, metric = "precision", n = 1)$mean
```


##### Recall

```{r}
# Cross-validated training performance
show_best(grid_tidym, metric = "recall", n = 1)$mean
```


#### Predictions testing set

##### Overall

```{r}
wkfl_tidym_final$.metrics
```

##### Accuracy

```{r}
#accuracy
percent(wkfl_tidym_final$.metrics[[1]]$.estimate[[1]])
```

##### ROC-AUC


```{r}
#roc-auc
wkfl_tidym_final$.metrics[[1]]$.estimate[[2]]
```

#### Confusion Matrix testing set

```{r warning=FALSE, message=FALSE, error=FALSE, fig.height = 6, fig.width = 8}
wkfl_tidym_final$.predictions[[1]] %>%
  conf_mat(context, .pred_class) %>%
  pluck(1) %>%
  as_tibble() %>%
  group_by(Truth) %>% # group by Truth to compute percentages
  mutate(prop =percent(prop.table(n))) %>% # calculate percentages row-wise
  ggplot(aes(Prediction, Truth, alpha = prop)) +
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = prop), colour = "white", alpha = 1, size = 8)
```


#### Variable Importance

##### Best 10

```{r}
vip(pull_workflow_fit(wkfl_tidym_final$.workflow[[1]]))
```


##### All predictors


```{r}
vip(pull_workflow_fit(wkfl_tidym_final$.workflow[[1]]), num_features = 23)
```


#### Gains curves

This is an interesting feature that shows how much is gained when looking at various portions of the data. We see a gradual increase in the values. When 50% of the data were tested, around 83% of the cases within the non-guttural class were already identified. The more data were tested, the more confidence there is in the results, and when 84.96% of the data were tested, 100% of the cases were found. 

```{r}
wkfl_tidym_final$.predictions[[1]] %>%
  gain_curve(context, `.pred_Non-Guttural`) %>%
  autoplot() 
```       
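The logic of a gains curve can be reproduced by hand. In this base R sketch, cases are sorted by predicted probability and we track the percentage of target cases found as more of the data are tested. The vectors `truth` and `score` are made up for illustration.

```r
# Made-up example: 1 = target class, 0 = other class
truth <- c(1, 1, 0, 1, 0, 1, 0, 0, 0, 0)
# Made-up predicted probabilities for the target class
score <- c(0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10)

# Test the cases from the highest to the lowest predicted probability
ord <- order(score, decreasing = TRUE)

pct_tested <- seq_along(ord) / length(ord) * 100
pct_found  <- cumsum(truth[ord]) / sum(truth) * 100

data.frame(pct_tested, pct_found)
```

In this toy example, all target cases are found once 60% of the data have been tested; a model with no predictive value would need close to 100% of the data on average.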

#### ROC Curves

```{r}
wkfl_tidym_final$.predictions[[1]] %>%
  roc_curve(context, `.pred_Non-Guttural`) %>%
  autoplot() 
``` 





# session info

```{r warning=FALSE, message=FALSE, error=FALSE}
sessionInfo()
```

