1 Introduction

This analysis accompanies the article: Al-Tamimi, J., & Khattab, G. (2018). Acoustic correlates of the voicing contrast in Lebanese Arabic singleton and geminate plosives. Invited manuscript for the special issue of the Journal of Phonetics, "Marking 50 Years of Research on Voice Onset Time and the Voicing Contrast in the World's Languages" (eds. T. Cho, G. Docherty & D. Whalen). DOI: https://doi.org/10.1016/j.wocn.2018.09.010

This notebook presents the analyses and results of the Random Forests classification. The aim was to evaluate the robustness of the results obtained for each of the 19 acoustic correlates used (six durational, three voicing, and ten non-temporal, including f0, F1, and harmonic differences). These correlates were used to evaluate their relationship with the four-way "contrast" between voicing and gemination, i.e., Voiced Singleton, Voiceless Singleton, Voiced Geminate and Voiceless Geminate. The previous notebook looked specifically at how each acoustic correlate was associated with the predictors, on a correlate-by-correlate basis (see that notebook). This section extends the analysis by examining the combination of these acoustic correlates and their joint predictive power in discriminating between the four categories above.
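As a preview of the modelling approach, the sketch below fits a minimal conditional-inference Random Forest with the party package (loaded later in this notebook). It uses R's built-in iris data rather than the project data, so the formula and settings are purely illustrative:

```r
library(party)   # conditional inference forests (cforest)

set.seed(42)
# Small illustrative forest; cforest_unbiased() avoids biased variable
# selection when predictors differ in scale or number of categories.
fit <- cforest(Species ~ ., data = iris,
               controls = cforest_unbiased(ntree = 100, mtry = 2))

# Out-of-bag classification accuracy
mean(predict(fit, OOB = TRUE) == iris$Species)
```

The same pattern, with the z-scored acoustic correlates as predictors and the four-way category as outcome, underlies the classification reported later.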

2 Loading required packages

We start by loading the required packages; any that are missing are installed first.

# Install any missing packages, then load them all
requiredPackages = c('dplyr','DMwR','party','pROC','psycho','ggplot2','reshape','foreach','doSNOW','parallel')
for(p in requiredPackages){
  if(!require(p,character.only = TRUE)) install.packages(p)
  library(p,character.only = TRUE)
}

3 Preprocessing the data

3.1 Reading in the data

We start by reading in the data and checking its structure. The data contain some missing values; for the Random Forests to work appropriately, these are replaced with centrally imputed values in the next step.

ResultsFullOriginal <- read.csv("ResultsFullOriginalData.csv")
str(ResultsFullOriginal)
'data.frame':   1793 obs. of  36 variables:
 $ x                       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ file_name               : Factor w/ 68 levels "01_F_W_main_1.wav",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ speaker                 : Factor w/ 20 levels "sp1","sp10","sp11",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sex                     : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ Number                  : int  8 38 68 98 371 395 419 449 743 767 ...
 $ wordTarget              : Factor w/ 102 levels "2aabbeh","2aabeD",..: 83 9 28 52 47 38 34 90 6 27 ...
 $ word                    : Factor w/ 331 levels "2AAbaD","2aabbeh",..: 269 31 89 160 151 126 112 281 23 88 ...
 $ singGem                 : Factor w/ 2 levels "geminate","singleton": 2 2 2 2 2 2 2 2 2 2 ...
 $ syllType                : Factor w/ 2 levels "iambic","trochaic": 2 2 2 2 2 2 2 2 2 2 ...
 $ vowelLength             : Factor w/ 2 levels "longV1","shortV1": 2 2 2 2 2 2 2 2 2 2 ...
 $ syllStruct              : Factor w/ 5 levels "V1C2V2","V1CC2V2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ c.v                     : Factor w/ 2 levels "C2","CC2": 1 1 1 1 1 1 1 1 1 1 ...
 $ phoneme                 : Factor w/ 13 levels "b","bb","d","D",..: 1 3 12 8 1 10 12 8 1 10 ...
 $ modeArticulation4       : Factor w/ 1 level "stop": 1 1 1 1 1 1 1 1 1 1 ...
 $ placeArticulation       : Factor w/ 4 levels "alveolar","bilabial",..: 2 1 3 4 2 1 3 4 2 1 ...
 $ voiced.unvoiced         : Factor w/ 2 levels "voiced","voiceless": 1 1 2 2 1 2 2 2 1 2 ...
 $ placeVoicing            : Factor w/ 6 levels "alveolar voiced",..: 3 1 5 6 3 2 5 6 3 2 ...
 $ durationC2              : num  60.2 87 106.9 109.7 77 ...
 $ durationVoicedPercC2    : num  100 57.5 43.9 32.7 67.5 ...
 $ durationVOTAll          : num  19.2 11.3 21.4 22 16.4 ...
 $ VOTDec                  : num  -60.2 -87 21 22 -77 ...
 $ durationBurst           : num  6 8 7 11 9 11 7 12 9 7 ...
 $ durationVOT             : num  13 3 14 11 7 18 14 15 14 14 ...
 $ durationVoicedPercVOTAll: num  100 0 0 0 41.2 ...
 $ intensityOnsetVOTAll    : num  52.6 56.9 52.6 55.3 53.4 ...
 $ intensityOffsetVOTAll   : num  64.5 63.4 59.7 57.7 64.2 ...
 $ durationV2              : num  89 155.2 86.3 113.3 76.4 ...
 $ f0OnsetV2               : num  247 241 245 239 244 ...
 $ intensityOnsetV2        : num  67.3 66.6 61.1 60.9 64.5 ...
 $ f1OnsetV2               : num  699 542 659 444 679 ...
 $ h1mnh2OnsetNormV2       : num  -2.08 5.31 11.84 17.52 13.02 ...
 $ durationV1              : num  90.5 48.4 52 63.2 61.5 ...
 $ f0OffsetV1              : num  239 239 241 238 205 ...
 $ intensityOffsetV1       : num  63.8 70.4 68.7 66.4 65.9 ...
 $ f1OffsetV1              : num  573 644 942 661 637 ...
 $ h1mnh2OffsetNorm        : num  9.123 NA 0.981 3.576 0.861 ...
summary(ResultsFullOriginal)
       x                    file_name       speaker     sex    
 Min.   :   1   01_F_W_main_1.wav:  55   sp1    :  97   F:940  
 1st Qu.: 449   06_F_W_main_1.wav:  55   sp10   :  97   M:853  
 Median : 897   02_F_W_main_1.wav:  53   sp6    :  97          
 Mean   : 897   10_F_W_main_1.wav:  52   sp17   :  96          
 3rd Qu.:1345   08_F_W_main_1.wav:  51   sp9    :  96          
 Max.   :1793   01_M_W_main_1.wav:  50   sp20   :  95          
                (Other)          :1477   (Other):1215          
     Number        wordTarget        word           singGem    
 Min.   :    8   2adar  :  20   3atab  :  20   geminate :1007  
 1st Qu.: 9584   3aadal :  20   fatal  :  20   singleton: 786  
 Median :19555   3atab  :  20   saba2  :  20                   
 Mean   :19497   3attaal:  20   2atal  :  19                   
 3rd Qu.:29525   fatal  :  20   3addad :  19                   
 Max.   :39279   Haba2  :  20   badal  :  19                   
                 (Other):1673   (Other):1676                   
     syllType     vowelLength      syllStruct   c.v      
 iambic  : 356   longV1 : 594   V1C2V2  :433   C2 : 786  
 trochaic:1437   shortV1:1199   V1CC2V2 :410   CC2:1007  
                                V1CC2VV2:356             
                                VV1C2V2 :353             
                                VV1CC2V2:241             
                                                         
                                                         
    phoneme    modeArticulation4      placeArticulation
 dd     :285   stop:1793         alveolar      :625    
 d      :242                     bilabial      :426    
 bb     :235                     pharyngealised:449    
 b      :191                     velar         :293    
 kk     :160                                           
 k      :133                                           
 (Other):547                                           
  voiced.unvoiced                   placeVoicing   durationC2    
 voiced   :1059   alveolar voiced         :420   Min.   : 27.09  
 voiceless: 734   alveolar voiceless      :205   1st Qu.: 78.43  
                  bilabial                :426   Median :139.81  
                  pharyngealised voiced   :213   Mean   :131.23  
                  pharyngealised voiceless:236   3rd Qu.:175.43  
                  velar                   :293   Max.   :315.77  
                                                                 
 durationVoicedPercC2 durationVOTAll       VOTDec       
 Min.   :  0.00       Min.   : 0.712   Min.   :-299.69  
 1st Qu.: 23.08       1st Qu.:10.478   1st Qu.:-142.10  
 Median : 69.12       Median :15.796   Median : -62.73  
 Mean   : 61.60       Mean   :18.600   Mean   : -61.12  
 3rd Qu.:100.00       3rd Qu.:23.386   3rd Qu.:  19.00  
 Max.   :100.00       Max.   :89.065   Max.   :  80.00  
                                                        
 durationBurst     durationVOT    durationVoicedPercVOTAll
 Min.   : 0.500   Min.   : 0.10   Min.   :  0.00          
 1st Qu.: 4.000   1st Qu.: 5.00   1st Qu.:  0.00          
 Median : 6.000   Median : 9.00   Median :  0.00          
 Mean   : 7.236   Mean   :11.94   Mean   : 41.02          
 3rd Qu.: 9.000   3rd Qu.:15.00   3rd Qu.:100.00          
 Max.   :38.000   Max.   :97.00   Max.   :100.00          
 NA's   :62       NA's   :62                              
 intensityOnsetVOTAll intensityOffsetVOTAll   durationV2    
 Min.   :41.11        Min.   :42.47         Min.   : 37.52  
 1st Qu.:54.91        1st Qu.:62.46         1st Qu.:113.30  
 Median :58.95        Median :65.72         Median :143.69  
 Mean   :59.42        Mean   :65.37         Mean   :156.24  
 3rd Qu.:64.08        3rd Qu.:68.80         3rd Qu.:193.76  
 Max.   :77.19        Max.   :82.02         Max.   :381.26  
                                                            
   f0OnsetV2      intensityOnsetV2   f1OnsetV2     
 Min.   : 80.93   Min.   :45.35    Min.   : 209.9  
 1st Qu.:116.99   1st Qu.:64.50    1st Qu.: 431.2  
 Median :162.49   Median :68.27    Median : 485.1  
 Mean   :162.64   Mean   :67.80    Mean   : 499.6  
 3rd Qu.:199.27   3rd Qu.:71.83    3rd Qu.: 555.4  
 Max.   :308.96   Max.   :84.14    Max.   :1082.2  
                                                   
 h1mnh2OnsetNormV2   durationV1       f0OffsetV1    
 Min.   :-23.341   Min.   :  4.31   Min.   : 82.23  
 1st Qu.: -7.210   1st Qu.: 57.45   1st Qu.:119.48  
 Median : -4.097   Median : 78.51   Median :172.18  
 Mean   : -2.665   Mean   : 94.63   Mean   :171.86  
 3rd Qu.:  0.814   3rd Qu.:131.74   3rd Qu.:215.80  
 Max.   : 22.130   Max.   :264.37   Max.   :367.78  
 NA's   :10                                         
 intensityOffsetV1   f1OffsetV1     h1mnh2OffsetNorm 
 Min.   :48.25     Min.   : 246.6   Min.   :-13.019  
 1st Qu.:65.71     1st Qu.: 470.5   1st Qu.: -6.394  
 Median :70.19     Median : 534.2   Median : -3.136  
 Mean   :69.53     Mean   : 553.6   Mean   : -2.269  
 3rd Qu.:73.82     3rd Qu.: 622.4   3rd Qu.:  1.192  
 Max.   :84.98     Max.   :1081.3   Max.   : 19.413  
                                    NA's   :226      

3.2 Central imputations

As the summary above shows, the data contain some missing values. For the Random Forests to work appropriately, we replace them with centrally imputed ones.

# Central imputation using the package "DMwR".
ResultsFullOriginalCentral <- ResultsFullOriginal   # create copy
ResultsFullOriginalCentral[, 18:36] <- centralImputation(ResultsFullOriginalCentral[, 18:36])
str(ResultsFullOriginalCentral)
'data.frame':   1793 obs. of  36 variables:
 $ x                       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ file_name               : Factor w/ 68 levels "01_F_W_main_1.wav",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ speaker                 : Factor w/ 20 levels "sp1","sp10","sp11",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sex                     : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ Number                  : int  8 38 68 98 371 395 419 449 743 767 ...
 $ wordTarget              : Factor w/ 102 levels "2aabbeh","2aabeD",..: 83 9 28 52 47 38 34 90 6 27 ...
 $ word                    : Factor w/ 331 levels "2AAbaD","2aabbeh",..: 269 31 89 160 151 126 112 281 23 88 ...
 $ singGem                 : Factor w/ 2 levels "geminate","singleton": 2 2 2 2 2 2 2 2 2 2 ...
 $ syllType                : Factor w/ 2 levels "iambic","trochaic": 2 2 2 2 2 2 2 2 2 2 ...
 $ vowelLength             : Factor w/ 2 levels "longV1","shortV1": 2 2 2 2 2 2 2 2 2 2 ...
 $ syllStruct              : Factor w/ 5 levels "V1C2V2","V1CC2V2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ c.v                     : Factor w/ 2 levels "C2","CC2": 1 1 1 1 1 1 1 1 1 1 ...
 $ phoneme                 : Factor w/ 13 levels "b","bb","d","D",..: 1 3 12 8 1 10 12 8 1 10 ...
 $ modeArticulation4       : Factor w/ 1 level "stop": 1 1 1 1 1 1 1 1 1 1 ...
 $ placeArticulation       : Factor w/ 4 levels "alveolar","bilabial",..: 2 1 3 4 2 1 3 4 2 1 ...
 $ voiced.unvoiced         : Factor w/ 2 levels "voiced","voiceless": 1 1 2 2 1 2 2 2 1 2 ...
 $ placeVoicing            : Factor w/ 6 levels "alveolar voiced",..: 3 1 5 6 3 2 5 6 3 2 ...
 $ durationC2              : num  60.2 87 106.9 109.7 77 ...
 $ durationVoicedPercC2    : num  100 57.5 43.9 32.7 67.5 ...
 $ durationVOTAll          : num  19.2 11.3 21.4 22 16.4 ...
 $ VOTDec                  : num  -60.2 -87 21 22 -77 ...
 $ durationBurst           : num  6 8 7 11 9 11 7 12 9 7 ...
 $ durationVOT             : num  13 3 14 11 7 18 14 15 14 14 ...
 $ durationVoicedPercVOTAll: num  100 0 0 0 41.2 ...
 $ intensityOnsetVOTAll    : num  52.6 56.9 52.6 55.3 53.4 ...
 $ intensityOffsetVOTAll   : num  64.5 63.4 59.7 57.7 64.2 ...
 $ durationV2              : num  89 155.2 86.3 113.3 76.4 ...
 $ f0OnsetV2               : num  247 241 245 239 244 ...
 $ intensityOnsetV2        : num  67.3 66.6 61.1 60.9 64.5 ...
 $ f1OnsetV2               : num  699 542 659 444 679 ...
 $ h1mnh2OnsetNormV2       : num  -2.08 5.31 11.84 17.52 13.02 ...
 $ durationV1              : num  90.5 48.4 52 63.2 61.5 ...
 $ f0OffsetV1              : num  239 239 241 238 205 ...
 $ intensityOffsetV1       : num  63.8 70.4 68.7 66.4 65.9 ...
 $ f1OffsetV1              : num  573 644 942 661 637 ...
 $ h1mnh2OffsetNorm        : num  9.123 -3.136 0.981 3.576 0.861 ...
summary(ResultsFullOriginalCentral)
       x                    file_name       speaker     sex    
 Min.   :   1   01_F_W_main_1.wav:  55   sp1    :  97   F:940  
 1st Qu.: 449   06_F_W_main_1.wav:  55   sp10   :  97   M:853  
 Median : 897   02_F_W_main_1.wav:  53   sp6    :  97          
 Mean   : 897   10_F_W_main_1.wav:  52   sp17   :  96          
 3rd Qu.:1345   08_F_W_main_1.wav:  51   sp9    :  96          
 Max.   :1793   01_M_W_main_1.wav:  50   sp20   :  95          
                (Other)          :1477   (Other):1215          
     Number        wordTarget        word           singGem    
 Min.   :    8   2adar  :  20   3atab  :  20   geminate :1007  
 1st Qu.: 9584   3aadal :  20   fatal  :  20   singleton: 786  
 Median :19555   3atab  :  20   saba2  :  20                   
 Mean   :19497   3attaal:  20   2atal  :  19                   
 3rd Qu.:29525   fatal  :  20   3addad :  19                   
 Max.   :39279   Haba2  :  20   badal  :  19                   
                 (Other):1673   (Other):1676                   
     syllType     vowelLength      syllStruct   c.v      
 iambic  : 356   longV1 : 594   V1C2V2  :433   C2 : 786  
 trochaic:1437   shortV1:1199   V1CC2V2 :410   CC2:1007  
                                V1CC2VV2:356             
                                VV1C2V2 :353             
                                VV1CC2V2:241             
                                                         
                                                         
    phoneme    modeArticulation4      placeArticulation
 dd     :285   stop:1793         alveolar      :625    
 d      :242                     bilabial      :426    
 bb     :235                     pharyngealised:449    
 b      :191                     velar         :293    
 kk     :160                                           
 k      :133                                           
 (Other):547                                           
  voiced.unvoiced                   placeVoicing   durationC2    
 voiced   :1059   alveolar voiced         :420   Min.   : 27.09  
 voiceless: 734   alveolar voiceless      :205   1st Qu.: 78.43  
                  bilabial                :426   Median :139.81  
                  pharyngealised voiced   :213   Mean   :131.23  
                  pharyngealised voiceless:236   3rd Qu.:175.43  
                  velar                   :293   Max.   :315.77  
                                                                 
 durationVoicedPercC2 durationVOTAll       VOTDec       
 Min.   :  0.00       Min.   : 0.712   Min.   :-299.69  
 1st Qu.: 23.08       1st Qu.:10.478   1st Qu.:-142.10  
 Median : 69.12       Median :15.796   Median : -62.73  
 Mean   : 61.60       Mean   :18.600   Mean   : -61.12  
 3rd Qu.:100.00       3rd Qu.:23.386   3rd Qu.:  19.00  
 Max.   :100.00       Max.   :89.065   Max.   :  80.00  
                                                        
 durationBurst     durationVOT    durationVoicedPercVOTAll
 Min.   : 0.500   Min.   : 0.10   Min.   :  0.00          
 1st Qu.: 4.000   1st Qu.: 5.00   1st Qu.:  0.00          
 Median : 6.000   Median : 9.00   Median :  0.00          
 Mean   : 7.193   Mean   :11.84   Mean   : 41.02          
 3rd Qu.: 9.000   3rd Qu.:15.00   3rd Qu.:100.00          
 Max.   :38.000   Max.   :97.00   Max.   :100.00          
                                                          
 intensityOnsetVOTAll intensityOffsetVOTAll   durationV2    
 Min.   :41.11        Min.   :42.47         Min.   : 37.52  
 1st Qu.:54.91        1st Qu.:62.46         1st Qu.:113.30  
 Median :58.95        Median :65.72         Median :143.69  
 Mean   :59.42        Mean   :65.37         Mean   :156.24  
 3rd Qu.:64.08        3rd Qu.:68.80         3rd Qu.:193.76  
 Max.   :77.19        Max.   :82.02         Max.   :381.26  
                                                            
   f0OnsetV2      intensityOnsetV2   f1OnsetV2     
 Min.   : 80.93   Min.   :45.35    Min.   : 209.9  
 1st Qu.:116.99   1st Qu.:64.50    1st Qu.: 431.2  
 Median :162.49   Median :68.27    Median : 485.1  
 Mean   :162.64   Mean   :67.80    Mean   : 499.6  
 3rd Qu.:199.27   3rd Qu.:71.83    3rd Qu.: 555.4  
 Max.   :308.96   Max.   :84.14    Max.   :1082.2  
                                                   
 h1mnh2OnsetNormV2   durationV1       f0OffsetV1    
 Min.   :-23.341   Min.   :  4.31   Min.   : 82.23  
 1st Qu.: -7.177   1st Qu.: 57.45   1st Qu.:119.48  
 Median : -4.097   Median : 78.51   Median :172.18  
 Mean   : -2.673   Mean   : 94.63   Mean   :171.86  
 3rd Qu.:  0.769   3rd Qu.:131.74   3rd Qu.:215.80  
 Max.   : 22.130   Max.   :264.37   Max.   :367.78  
                                                    
 intensityOffsetV1   f1OffsetV1     h1mnh2OffsetNorm 
 Min.   :48.25     Min.   : 246.6   Min.   :-13.019  
 1st Qu.:65.71     1st Qu.: 470.5   1st Qu.: -5.900  
 Median :70.19     Median : 534.2   Median : -3.136  
 Mean   :69.53     Mean   : 553.6   Mean   : -2.378  
 3rd Qu.:73.82     3rd Qu.: 622.4   3rd Qu.:  0.399  
 Max.   :84.98     Max.   :1081.3   Max.   : 19.413  
                                                     
write.csv(ResultsFullOriginalCentral,"ResultsFullOriginalCentral.csv")
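For reference, central imputation replaces each missing numeric value with the column median (and each missing factor value with the mode). A base-R sketch of the numeric case on a toy vector (not the project data):

```r
# Toy numeric vector with missing values
x <- c(2, NA, 5, 7, NA)

# Replace NAs with the median of the observed values,
# which is what centralImputation() does for numeric columns
x[is.na(x)] <- median(x, na.rm = TRUE)
x  # c(2, 5, 5, 7, 5)
```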

3.3 Z-Scoring the data

The acoustic correlates are measured on different scales (e.g., durations in milliseconds, intensity in dB), so the magnitudes of their values differ. To normalise the data and put all predictors on the same scale, we z-score them.
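As a quick sanity check, z-scoring with R's scale() gives each column mean 0 and standard deviation 1 (toy values, not the project data):

```r
v <- c(60, 87, 107, 110, 77)   # e.g. durations in ms
z <- as.numeric(scale(v))      # (v - mean(v)) / sd(v)
round(mean(z), 10)             # 0
round(sd(z), 10)               # 1
```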

# from the package "dplyr"
all_z <- select(ResultsFullOriginalCentral, durationC2:h1mnh2OffsetNorm)
all_z <- apply(all_z, 2, scale)   # z-score each column: (x - mean(x)) / sd(x)
colnames(all_z) <- paste(colnames(all_z), 'z', sep = '_')
ResultsFullOriginalCentral <- cbind(ResultsFullOriginalCentral, all_z)
summary(ResultsFullOriginalCentral)
       x                    file_name       speaker     sex    
 Min.   :   1   01_F_W_main_1.wav:  55   sp1    :  97   F:940  
 1st Qu.: 449   06_F_W_main_1.wav:  55   sp10   :  97   M:853  
 Median : 897   02_F_W_main_1.wav:  53   sp6    :  97          
 Mean   : 897   10_F_W_main_1.wav:  52   sp17   :  96          
 3rd Qu.:1345   08_F_W_main_1.wav:  51   sp9    :  96          
 Max.   :1793   01_M_W_main_1.wav:  50   sp20   :  95          
                (Other)          :1477   (Other):1215          
     Number        wordTarget        word           singGem    
 Min.   :    8   2adar  :  20   3atab  :  20   geminate :1007  
 1st Qu.: 9584   3aadal :  20   fatal  :  20   singleton: 786  
 Median :19555   3atab  :  20   saba2  :  20                   
 Mean   :19497   3attaal:  20   2atal  :  19                   
 3rd Qu.:29525   fatal  :  20   3addad :  19                   
 Max.   :39279   Haba2  :  20   badal  :  19                   
                 (Other):1673   (Other):1676                   
     syllType     vowelLength      syllStruct   c.v      
 iambic  : 356   longV1 : 594   V1C2V2  :433   C2 : 786  
 trochaic:1437   shortV1:1199   V1CC2V2 :410   CC2:1007  
                                V1CC2VV2:356             
                                VV1C2V2 :353             
                                VV1CC2V2:241             
                                                         
                                                         
    phoneme    modeArticulation4      placeArticulation
 dd     :285   stop:1793         alveolar      :625    
 d      :242                     bilabial      :426    
 bb     :235                     pharyngealised:449    
 b      :191                     velar         :293    
 kk     :160                                           
 k      :133                                           
 (Other):547                                           
  voiced.unvoiced                   placeVoicing   durationC2    
 voiced   :1059   alveolar voiced         :420   Min.   : 27.09  
 voiceless: 734   alveolar voiceless      :205   1st Qu.: 78.43  
                  bilabial                :426   Median :139.81  
                  pharyngealised voiced   :213   Mean   :131.23  
                  pharyngealised voiceless:236   3rd Qu.:175.43  
                  velar                   :293   Max.   :315.77  
                                                                 
 durationVoicedPercC2 durationVOTAll       VOTDec       
 Min.   :  0.00       Min.   : 0.712   Min.   :-299.69  
 1st Qu.: 23.08       1st Qu.:10.478   1st Qu.:-142.10  
 Median : 69.12       Median :15.796   Median : -62.73  
 Mean   : 61.60       Mean   :18.600   Mean   : -61.12  
 3rd Qu.:100.00       3rd Qu.:23.386   3rd Qu.:  19.00  
 Max.   :100.00       Max.   :89.065   Max.   :  80.00  
                                                        
 durationBurst     durationVOT    durationVoicedPercVOTAll
 Min.   : 0.500   Min.   : 0.10   Min.   :  0.00          
 1st Qu.: 4.000   1st Qu.: 5.00   1st Qu.:  0.00          
 Median : 6.000   Median : 9.00   Median :  0.00          
 Mean   : 7.193   Mean   :11.84   Mean   : 41.02          
 3rd Qu.: 9.000   3rd Qu.:15.00   3rd Qu.:100.00          
 Max.   :38.000   Max.   :97.00   Max.   :100.00          
                                                          
 intensityOnsetVOTAll intensityOffsetVOTAll   durationV2    
 Min.   :41.11        Min.   :42.47         Min.   : 37.52  
 1st Qu.:54.91        1st Qu.:62.46         1st Qu.:113.30  
 Median :58.95        Median :65.72         Median :143.69  
 Mean   :59.42        Mean   :65.37         Mean   :156.24  
 3rd Qu.:64.08        3rd Qu.:68.80         3rd Qu.:193.76  
 Max.   :77.19        Max.   :82.02         Max.   :381.26  
                                                            
   f0OnsetV2      intensityOnsetV2   f1OnsetV2     
 Min.   : 80.93   Min.   :45.35    Min.   : 209.9  
 1st Qu.:116.99   1st Qu.:64.50    1st Qu.: 431.2  
 Median :162.49   Median :68.27    Median : 485.1  
 Mean   :162.64   Mean   :67.80    Mean   : 499.6  
 3rd Qu.:199.27   3rd Qu.:71.83    3rd Qu.: 555.4  
 Max.   :308.96   Max.   :84.14    Max.   :1082.2  
                                                   
 h1mnh2OnsetNormV2   durationV1       f0OffsetV1    
 Min.   :-23.341   Min.   :  4.31   Min.   : 82.23  
 1st Qu.: -7.177   1st Qu.: 57.45   1st Qu.:119.48  
 Median : -4.097   Median : 78.51   Median :172.18  
 Mean   : -2.673   Mean   : 94.63   Mean   :171.86  
 3rd Qu.:  0.769   3rd Qu.:131.74   3rd Qu.:215.80  
 Max.   : 22.130   Max.   :264.37   Max.   :367.78  
                                                    
 intensityOffsetV1   f1OffsetV1     h1mnh2OffsetNorm 
 Min.   :48.25     Min.   : 246.6   Min.   :-13.019  
 1st Qu.:65.71     1st Qu.: 470.5   1st Qu.: -5.900  
 Median :70.19     Median : 534.2   Median : -3.136  
 Mean   :69.53     Mean   : 553.6   Mean   : -2.378  
 3rd Qu.:73.82     3rd Qu.: 622.4   3rd Qu.:  0.399  
 Max.   :84.98     Max.   :1081.3   Max.   : 19.413  
                                                     
  durationC2_z     durationVoicedPercC2_z durationVOTAll_z 
 Min.   :-1.8939   Min.   :-1.6510        Min.   :-1.5024  
 1st Qu.:-0.9603   1st Qu.:-1.0325        1st Qu.:-0.6822  
 Median : 0.1561   Median : 0.2014        Median :-0.2355  
 Mean   : 0.0000   Mean   : 0.0000        Mean   : 0.0000  
 3rd Qu.: 0.8039   3rd Qu.: 1.0291        3rd Qu.: 0.4020  
 Max.   : 3.3564   Max.   : 1.0291        Max.   : 5.9185  
                                                           
    VOTDec_z        durationBurst_z   durationVOT_z    
 Min.   :-2.88576   Min.   :-1.5989   Min.   :-1.1337  
 1st Qu.:-0.97961   1st Qu.:-0.7628   1st Qu.:-0.6605  
 Median :-0.01956   Median :-0.2850   Median :-0.2741  
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.96913   3rd Qu.: 0.4316   3rd Qu.: 0.3053  
 Max.   : 1.70700   Max.   : 7.3590   Max.   : 8.2247  
                                                       
 durationVoicedPercVOTAll_z intensityOnsetVOTAll_z
 Min.   :-0.867             Min.   :-2.94874      
 1st Qu.:-0.867             1st Qu.:-0.72620      
 Median :-0.867             Median :-0.07552      
 Mean   : 0.000             Mean   : 0.00000      
 3rd Qu.: 1.247             3rd Qu.: 0.74937      
 Max.   : 1.247             Max.   : 2.86081      
                                                  
 intensityOffsetVOTAll_z  durationV2_z      f0OnsetV2_z       
 Min.   :-4.21154        Min.   :-2.0785   Min.   :-1.693791  
 1st Qu.:-0.53597        1st Qu.:-0.7517   1st Qu.:-0.946329  
 Median : 0.06384        Median :-0.2198   Median :-0.003211  
 Mean   : 0.00000        Mean   : 0.0000   Mean   : 0.000000  
 3rd Qu.: 0.63035        3rd Qu.: 0.6571   3rd Qu.: 0.759217  
 Max.   : 3.06307        Max.   : 3.9397   Max.   : 3.032946  
                                                              
 intensityOnsetV2_z  f1OnsetV2_z      h1mnh2OnsetNormV2_z
 Min.   :-3.83386   Min.   :-2.9352   Min.   :-3.2122    
 1st Qu.:-0.56327   1st Qu.:-0.6926   1st Qu.:-0.7001    
 Median : 0.08125   Median :-0.1464   Median :-0.2214    
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000    
 3rd Qu.: 0.68836   3rd Qu.: 0.5662   3rd Qu.: 0.5349    
 Max.   : 2.79068   Max.   : 5.9039   Max.   : 3.8548    
                                                         
  durationV1_z      f0OffsetV1_z       intensityOffsetV1_z
 Min.   :-1.8270   Min.   :-1.642154   Min.   :-3.5036    
 1st Qu.:-0.7520   1st Qu.:-0.959636   1st Qu.:-0.6293    
 Median :-0.3262   Median : 0.005841   Median : 0.1083    
 Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000    
 3rd Qu.: 0.7504   3rd Qu.: 0.805067   3rd Qu.: 0.7061    
 Max.   : 3.4333   Max.   : 3.589498   Max.   : 2.5419    
                                                          
  f1OffsetV1_z     h1mnh2OffsetNorm_z
 Min.   :-2.5378   Min.   :-2.0611   
 1st Qu.:-0.6871   1st Qu.:-0.6822   
 Median :-0.1604   Median :-0.1468   
 Mean   : 0.0000   Mean   : 0.0000   
 3rd Qu.: 0.5685   3rd Qu.: 0.5379   
 Max.   : 4.3619   Max.   : 4.2208   
                                     
str(ResultsFullOriginalCentral)
'data.frame':   1793 obs. of  55 variables:
 $ x                         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ file_name                 : Factor w/ 68 levels "01_F_W_main_1.wav",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ speaker                   : Factor w/ 20 levels "sp1","sp10","sp11",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sex                       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ Number                    : int  8 38 68 98 371 395 419 449 743 767 ...
 $ wordTarget                : Factor w/ 102 levels "2aabbeh","2aabeD",..: 83 9 28 52 47 38 34 90 6 27 ...
 $ word                      : Factor w/ 331 levels "2AAbaD","2aabbeh",..: 269 31 89 160 151 126 112 281 23 88 ...
 $ singGem                   : Factor w/ 2 levels "geminate","singleton": 2 2 2 2 2 2 2 2 2 2 ...
 $ syllType                  : Factor w/ 2 levels "iambic","trochaic": 2 2 2 2 2 2 2 2 2 2 ...
 $ vowelLength               : Factor w/ 2 levels "longV1","shortV1": 2 2 2 2 2 2 2 2 2 2 ...
 $ syllStruct                : Factor w/ 5 levels "V1C2V2","V1CC2V2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ c.v                       : Factor w/ 2 levels "C2","CC2": 1 1 1 1 1 1 1 1 1 1 ...
 $ phoneme                   : Factor w/ 13 levels "b","bb","d","D",..: 1 3 12 8 1 10 12 8 1 10 ...
 $ modeArticulation4         : Factor w/ 1 level "stop": 1 1 1 1 1 1 1 1 1 1 ...
 $ placeArticulation         : Factor w/ 4 levels "alveolar","bilabial",..: 2 1 3 4 2 1 3 4 2 1 ...
 $ voiced.unvoiced           : Factor w/ 2 levels "voiced","voiceless": 1 1 2 2 1 2 2 2 1 2 ...
 $ placeVoicing              : Factor w/ 6 levels "alveolar voiced",..: 3 1 5 6 3 2 5 6 3 2 ...
 $ durationC2                : num  60.2 87 106.9 109.7 77 ...
 $ durationVoicedPercC2      : num  100 57.5 43.9 32.7 67.5 ...
 $ durationVOTAll            : num  19.2 11.3 21.4 22 16.4 ...
 $ VOTDec                    : num  -60.2 -87 21 22 -77 ...
 $ durationBurst             : num  6 8 7 11 9 11 7 12 9 7 ...
 $ durationVOT               : num  13 3 14 11 7 18 14 15 14 14 ...
 $ durationVoicedPercVOTAll  : num  100 0 0 0 41.2 ...
 $ intensityOnsetVOTAll      : num  52.6 56.9 52.6 55.3 53.4 ...
 $ intensityOffsetVOTAll     : num  64.5 63.4 59.7 57.7 64.2 ...
 $ durationV2                : num  89 155.2 86.3 113.3 76.4 ...
 $ f0OnsetV2                 : num  247 241 245 239 244 ...
 $ intensityOnsetV2          : num  67.3 66.6 61.1 60.9 64.5 ...
 $ f1OnsetV2                 : num  699 542 659 444 679 ...
 $ h1mnh2OnsetNormV2         : num  -2.08 5.31 11.84 17.52 13.02 ...
 $ durationV1                : num  90.5 48.4 52 63.2 61.5 ...
 $ f0OffsetV1                : num  239 239 241 238 205 ...
 $ intensityOffsetV1         : num  63.8 70.4 68.7 66.4 65.9 ...
 $ f1OffsetV1                : num  573 644 942 661 637 ...
 $ h1mnh2OffsetNorm          : num  9.123 -3.136 0.981 3.576 0.861 ...
 $ durationC2_z              : num  -1.292 -0.804 -0.442 -0.391 -0.986 ...
 $ durationVoicedPercC2_z    : num  1.029 -0.111 -0.474 -0.774 0.159 ...
 $ durationVOTAll_z          : num  0.0523 -0.6124 0.239 0.2864 -0.1851 ...
 $ VOTDec_z                  : num  0.0112 -0.3133 0.9933 1.0054 -0.1921 ...
 $ durationBurst_z           : num  -0.285 0.1927 -0.0462 0.9093 0.4316 ...
 $ durationVOT_z             : num  0.112 -0.854 0.209 -0.081 -0.467 ...
 $ durationVoicedPercVOTAll_z: num  1.24656 -0.86697 -0.86697 -0.86697 0.00331 ...
 $ intensityOnsetVOTAll_z    : num  -1.107 -0.402 -1.095 -0.669 -0.974 ...
 $ intensityOffsetVOTAll_z   : num  -0.165 -0.363 -1.046 -1.402 -0.219 ...
 $ durationV2_z              : num  -1.1771 -0.0176 -1.2242 -0.7517 -1.3971 ...
 $ f0OnsetV2_z               : num  1.75 1.63 1.71 1.59 1.69 ...
 $ intensityOnsetV2_z        : num  -0.0892 -0.2073 -1.1401 -1.1719 -0.5585 ...
 $ f1OnsetV2_z               : num  2.024 0.429 1.617 -0.559 1.819 ...
 $ h1mnh2OnsetNormV2_z       : num  0.0915 1.2402 2.2558 3.1377 2.4385 ...
 $ durationV1_z              : num  -0.0843 -0.9348 -0.8631 -0.6355 -0.6697 ...
 $ f0OffsetV1_z              : num  1.223 1.233 1.258 1.22 0.614 ...
 $ intensityOffsetV1_z       : num  -0.947 0.139 -0.145 -0.516 -0.606 ...
 $ f1OffsetV1_z              : num  0.16 0.75 3.21 0.886 0.693 ...
 $ h1mnh2OffsetNorm_z        : num  2.228 -0.147 0.651 1.153 0.627 ...

3.4 Creating a new outcome

In the following analysis, we are interested in the four-way contrast observed between voicing and gemination, i.e., Voiceless Singleton, Voiced Singleton, Voiceless Geminate and Voiced Geminate.
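On toy factors, the crossing of the two binary predictors into one four-level outcome looks as follows (paste() is used here, as in the chunk below; interaction() would be an idiomatic alternative):

```r
g <- factor(c("singleton", "singleton", "geminate", "geminate"))
v <- factor(c("voiced", "voiceless", "voiced", "voiceless"))

# Crossing the two binary factors yields four levels (sorted alphabetically)
levels(factor(paste(g, v)))
# "geminate voiced" "geminate voiceless" "singleton voiced" "singleton voiceless"
```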

# Combine the two binary factors into a single four-level outcome
ResultsFullOriginalCentral$SingGemVoicing <- factor(paste(ResultsFullOriginalCentral$singGem,
                                                          ResultsFullOriginalCentral$voiced.unvoiced))
str(ResultsFullOriginalCentral)
'data.frame':   1793 obs. of  56 variables:
 $ x                         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ file_name                 : Factor w/ 68 levels "01_F_W_main_1.wav",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ speaker                   : Factor w/ 20 levels "sp1","sp10","sp11",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sex                       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ Number                    : int  8 38 68 98 371 395 419 449 743 767 ...
 $ wordTarget                : Factor w/ 102 levels "2aabbeh","2aabeD",..: 83 9 28 52 47 38 34 90 6 27 ...
 $ word                      : Factor w/ 331 levels "2AAbaD","2aabbeh",..: 269 31 89 160 151 126 112 281 23 88 ...
 $ singGem                   : Factor w/ 2 levels "geminate","singleton": 2 2 2 2 2 2 2 2 2 2 ...
 $ syllType                  : Factor w/ 2 levels "iambic","trochaic": 2 2 2 2 2 2 2 2 2 2 ...
 $ vowelLength               : Factor w/ 2 levels "longV1","shortV1": 2 2 2 2 2 2 2 2 2 2 ...
 $ syllStruct                : Factor w/ 5 levels "V1C2V2","V1CC2V2",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ c.v                       : Factor w/ 2 levels "C2","CC2": 1 1 1 1 1 1 1 1 1 1 ...
 $ phoneme                   : Factor w/ 13 levels "b","bb","d","D",..: 1 3 12 8 1 10 12 8 1 10 ...
 $ modeArticulation4         : Factor w/ 1 level "stop": 1 1 1 1 1 1 1 1 1 1 ...
 $ placeArticulation         : Factor w/ 4 levels "alveolar","bilabial",..: 2 1 3 4 2 1 3 4 2 1 ...
 $ voiced.unvoiced           : Factor w/ 2 levels "voiced","voiceless": 1 1 2 2 1 2 2 2 1 2 ...
 $ placeVoicing              : Factor w/ 6 levels "alveolar voiced",..: 3 1 5 6 3 2 5 6 3 2 ...
 $ durationC2                : num  60.2 87 106.9 109.7 77 ...
 $ durationVoicedPercC2      : num  100 57.5 43.9 32.7 67.5 ...
 $ durationVOTAll            : num  19.2 11.3 21.4 22 16.4 ...
 $ VOTDec                    : num  -60.2 -87 21 22 -77 ...
 $ durationBurst             : num  6 8 7 11 9 11 7 12 9 7 ...
 $ durationVOT               : num  13 3 14 11 7 18 14 15 14 14 ...
 $ durationVoicedPercVOTAll  : num  100 0 0 0 41.2 ...
 $ intensityOnsetVOTAll      : num  52.6 56.9 52.6 55.3 53.4 ...
 $ intensityOffsetVOTAll     : num  64.5 63.4 59.7 57.7 64.2 ...
 $ durationV2                : num  89 155.2 86.3 113.3 76.4 ...
 $ f0OnsetV2                 : num  247 241 245 239 244 ...
 $ intensityOnsetV2          : num  67.3 66.6 61.1 60.9 64.5 ...
 $ f1OnsetV2                 : num  699 542 659 444 679 ...
 $ h1mnh2OnsetNormV2         : num  -2.08 5.31 11.84 17.52 13.02 ...
 $ durationV1                : num  90.5 48.4 52 63.2 61.5 ...
 $ f0OffsetV1                : num  239 239 241 238 205 ...
 $ intensityOffsetV1         : num  63.8 70.4 68.7 66.4 65.9 ...
 $ f1OffsetV1                : num  573 644 942 661 637 ...
 $ h1mnh2OffsetNorm          : num  9.123 -3.136 0.981 3.576 0.861 ...
 $ durationC2_z              : num  -1.292 -0.804 -0.442 -0.391 -0.986 ...
 $ durationVoicedPercC2_z    : num  1.029 -0.111 -0.474 -0.774 0.159 ...
 $ durationVOTAll_z          : num  0.0523 -0.6124 0.239 0.2864 -0.1851 ...
 $ VOTDec_z                  : num  0.0112 -0.3133 0.9933 1.0054 -0.1921 ...
 $ durationBurst_z           : num  -0.285 0.1927 -0.0462 0.9093 0.4316 ...
 $ durationVOT_z             : num  0.112 -0.854 0.209 -0.081 -0.467 ...
 $ durationVoicedPercVOTAll_z: num  1.24656 -0.86697 -0.86697 -0.86697 0.00331 ...
 $ intensityOnsetVOTAll_z    : num  -1.107 -0.402 -1.095 -0.669 -0.974 ...
 $ intensityOffsetVOTAll_z   : num  -0.165 -0.363 -1.046 -1.402 -0.219 ...
 $ durationV2_z              : num  -1.1771 -0.0176 -1.2242 -0.7517 -1.3971 ...
 $ f0OnsetV2_z               : num  1.75 1.63 1.71 1.59 1.69 ...
 $ intensityOnsetV2_z        : num  -0.0892 -0.2073 -1.1401 -1.1719 -0.5585 ...
 $ f1OnsetV2_z               : num  2.024 0.429 1.617 -0.559 1.819 ...
 $ h1mnh2OnsetNormV2_z       : num  0.0915 1.2402 2.2558 3.1377 2.4385 ...
 $ durationV1_z              : num  -0.0843 -0.9348 -0.8631 -0.6355 -0.6697 ...
 $ f0OffsetV1_z              : num  1.223 1.233 1.258 1.22 0.614 ...
 $ intensityOffsetV1_z       : num  -0.947 0.139 -0.145 -0.516 -0.606 ...
 $ f1OffsetV1_z              : num  0.16 0.75 3.21 0.886 0.693 ...
 $ h1mnh2OffsetNorm_z        : num  2.228 -0.147 0.651 1.153 0.627 ...
 $ SingGemVoicing            : Factor w/ 4 levels "geminate voiced",..: 3 3 4 4 3 4 4 4 3 4 ...

After all this pre-processing, we are now ready to start our Random Forest analysis.

4 Tuning our Random Forests

4.1 Creating training and testing sets

We create a training set and a testing set (66% and 33% of the data, respectively). The aim is to train three random forests and, using the testing set, to evaluate the predictive power of each.

# creating training and testing sets
# we set the seeds for reproducibility
set.seed(1)
train.idx <- sample(nrow(ResultsFullOriginalCentral), 2/3 * nrow(ResultsFullOriginalCentral))
gemination.train <- ResultsFullOriginalCentral[train.idx, ]
gemination.test <- ResultsFullOriginalCentral[-train.idx, ]
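As a minimal, self-contained sketch of the same split mechanics (a toy data frame `d`, not the gemination data):

```r
# toy illustration of the 2/3 vs 1/3 split used above (hypothetical data `d`)
set.seed(1)
d <- data.frame(x = 1:9)
train.idx <- sample(nrow(d), 2/3 * nrow(d))
d.train <- d[train.idx, , drop = FALSE]
d.test  <- d[-train.idx, , drop = FALSE]
nrow(d.train)  # 6 rows in training
nrow(d.test)   # 3 rows in testing
```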

4.2 Estimating number of trees required

We use a procedure we developed to estimate the optimal number of trees required to grow a forest with the highest predictive accuracy. This implements the procedure developed in “Oshiro, T. M., Perez, P. S., & Baranauskas, J. A. (2012). How many trees in a random forest? In International Workshop on Machine Learning and Data Mining in Pattern Recognition (pp. 154–168)” (see my GitHub repo for the script).

The code below runs the analysis using parallel computing. We detect the number of available cores and register them as a cluster. We train three Random Forests (see paper for more details):

  1. Model A: A forest with the full 19 predictors (AUCTestWithDur)

  2. Model B: A forest with 18 predictors (Model A minus closure duration; AUCTestNoDur)

  3. Model C: A forest with 17 predictors (Model B minus VOT; AUCTestNoDurNoVOTDec)
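Note that `mtry`, the number of predictors sampled at each split, is set in the calls below to the rounded square root of the predictor count; for 19, 18 and 17 predictors this works out to the same value:

```r
# rounded square root of the number of predictors, used as mtry below
round(sqrt(19))  # Model A
round(sqrt(18))  # Model B
round(sqrt(17))  # Model C
# all three evaluate to 4
```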

## below to prepare using parallel computing
NumberOfCluster <- detectCores()
cl <- makeCluster(NumberOfCluster)
registerDoSNOW(cl)
## below computes the ROC (Receiver Operating Characteristic) curves for this dataframe 
## for the 15 random forests, grown in increments of 100 trees (100 to 1500),
## and saves these in a list that will be used later to compare ROC curves
treeNumb <- 15
system.time(AUCTestWithDur <-
              foreach (i=1:treeNumb,.packages=c("party","pROC")) %dopar% {
                set.seed(1)
                cforest.mdl=cforest(SingGemVoicing~durationC2_z+durationVoicedPercC2_z+durationVOTAll_z+
                                      VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                      intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                      f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                      h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                      f1OffsetV1_z+h1mnh2OffsetNorm_z, data = gemination.train,
                                    controls = cforest_unbiased(mtry = round(sqrt(19)), ntree = i*100))
                cforest.cond = predict(cforest.mdl,OOB=TRUE)
                roc <- roc(gemination.train$SingGemVoicing,as.numeric(cforest.cond))
              })
   user  system elapsed 
   0.03    0.03   54.53 
system.time(AUCTestNoDur <-
              foreach (i=1:treeNumb,.packages=c("party","pROC")) %dopar% {
                set.seed(1)
                cforest.mdl=cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                      VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                      intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                      f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                      h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                      f1OffsetV1_z+h1mnh2OffsetNorm_z, data = gemination.train,
                                    controls = cforest_unbiased(mtry = round(sqrt(18)), ntree = i*100))
                cforest.cond = predict(cforest.mdl,OOB=TRUE)
                roc <- roc(gemination.train$SingGemVoicing,as.numeric(cforest.cond))
              })
   user  system elapsed 
   0.03    0.01   52.97 
system.time(AUCTestNoDurNoVOTDec <-
              foreach (i=1:treeNumb,.packages=c("party","pROC")) %dopar% {
                set.seed(1)
                cforest.mdl=cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                      durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                      intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                      f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                      h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                      f1OffsetV1_z+h1mnh2OffsetNorm_z, data = gemination.train,
                                    controls = cforest_unbiased(mtry = round(sqrt(17)), ntree = i*100))
                cforest.cond = predict(cforest.mdl,OOB=TRUE)
                roc <- roc(gemination.train$SingGemVoicing,as.numeric(cforest.cond))
              })
   user  system elapsed 
   0.08    0.02   62.23 
# stop the cluster; if this is not done, R keeps all cores registered as above and the system becomes very slow
stopCluster(cl)
#

4.2.1 Significance testing

4.2.1.1 Model A, 19 predictors

# check significant ROC comparison
# with duration
roc.test(AUCTestWithDur[[1]], AUCTestWithDur[[2]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[1]] and AUCTestWithDur[[2]]
Z = -0.044298, p-value = 0.9647
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.9059677   0.9061450 
roc.test(AUCTestWithDur[[2]], AUCTestWithDur[[3]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[2]] and AUCTestWithDur[[3]]
Z = -0.75022, p-value = 0.4531
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
   0.906145    0.908296 
roc.test(AUCTestWithDur[[3]], AUCTestWithDur[[4]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[3]] and AUCTestWithDur[[4]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
   0.908296    0.908296 
roc.test(AUCTestWithDur[[4]], AUCTestWithDur[[5]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[4]] and AUCTestWithDur[[5]]
Z = -0.2101, p-value = 0.8336
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.9082960   0.9087579 
roc.test(AUCTestWithDur[[5]], AUCTestWithDur[[6]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[5]] and AUCTestWithDur[[6]]
Z = 2.2154, p-value = 0.02673
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.9087579   0.9013018 
roc.test(AUCTestWithDur[[6]], AUCTestWithDur[[7]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[6]] and AUCTestWithDur[[7]]
Z = -1, p-value = 0.3173
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.9013018   0.9030702 
roc.test(AUCTestWithDur[[7]], AUCTestWithDur[[8]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[7]] and AUCTestWithDur[[8]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.9030702   0.9030702 
roc.test(AUCTestWithDur[[8]], AUCTestWithDur[[9]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[8]] and AUCTestWithDur[[9]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.9030702   0.9030702 
roc.test(AUCTestWithDur[[9]], AUCTestWithDur[[10]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[9]] and AUCTestWithDur[[10]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.9030702   0.9030702 
roc.test(AUCTestWithDur[[10]], AUCTestWithDur[[11]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[10]] and AUCTestWithDur[[11]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.9030702   0.9030702 
roc.test(AUCTestWithDur[[11]], AUCTestWithDur[[12]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[11]] and AUCTestWithDur[[12]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.9030702   0.9030702 
roc.test(AUCTestWithDur[[12]], AUCTestWithDur[[13]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[12]] and AUCTestWithDur[[13]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.9030702   0.9030702 
roc.test(AUCTestWithDur[[13]], AUCTestWithDur[[14]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[13]] and AUCTestWithDur[[14]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.9030702   0.9030702 
roc.test(AUCTestWithDur[[14]], AUCTestWithDur[[15]])

    DeLong's test for two correlated ROC curves

data:  AUCTestWithDur[[14]] and AUCTestWithDur[[15]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.9030702   0.9030702 
# 500 trees is the best model

4.2.1.2 Model B, 18 predictors (full minus Closure duration)

roc.test(AUCTestNoDur[[1]], AUCTestNoDur[[2]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[1]] and AUCTestNoDur[[2]]
Z = -1.2523, p-value = 0.2105
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8925345   0.8984509 
roc.test(AUCTestNoDur[[2]], AUCTestNoDur[[3]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[2]] and AUCTestNoDur[[3]]
Z = 0.49666, p-value = 0.6194
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8984509   0.8974477 
roc.test(AUCTestNoDur[[3]], AUCTestNoDur[[4]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[3]] and AUCTestNoDur[[4]]
Z = 1.2903, p-value = 0.1969
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8974477   0.8941443 
roc.test(AUCTestNoDur[[4]], AUCTestNoDur[[5]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[4]] and AUCTestNoDur[[5]]
Z = -1.9652, p-value = 0.04939
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8941443   0.8992721 
roc.test(AUCTestNoDur[[5]], AUCTestNoDur[[6]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[5]] and AUCTestNoDur[[6]]
Z = 1.3361, p-value = 0.1815
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8992721   0.8977324 
roc.test(AUCTestNoDur[[6]], AUCTestNoDur[[7]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[6]] and AUCTestNoDur[[7]]
Z = 0.29669, p-value = 0.7667
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8977324   0.8971258 
roc.test(AUCTestNoDur[[7]], AUCTestNoDur[[8]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[7]] and AUCTestNoDur[[8]]
Z = -0.40221, p-value = 0.6875
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8971258   0.8979517 
roc.test(AUCTestNoDur[[8]], AUCTestNoDur[[9]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[8]] and AUCTestNoDur[[9]]
Z = 0.98902, p-value = 0.3227
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8979517   0.8977324 
roc.test(AUCTestNoDur[[9]], AUCTestNoDur[[10]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[9]] and AUCTestNoDur[[10]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8977324   0.8977324 
roc.test(AUCTestNoDur[[10]], AUCTestNoDur[[11]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[10]] and AUCTestNoDur[[11]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8977324   0.8977324 
roc.test(AUCTestNoDur[[11]], AUCTestNoDur[[12]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[11]] and AUCTestNoDur[[12]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8977324   0.8977324 
roc.test(AUCTestNoDur[[12]], AUCTestNoDur[[13]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[12]] and AUCTestNoDur[[13]]
Z = -0.99967, p-value = 0.3175
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8977324   0.8988429 
roc.test(AUCTestNoDur[[13]], AUCTestNoDur[[14]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[13]] and AUCTestNoDur[[14]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8988429   0.8988429 
roc.test(AUCTestNoDur[[14]], AUCTestNoDur[[15]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDur[[14]] and AUCTestNoDur[[15]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.8988429   0.8988429 
## 500 trees is the best model.

4.2.1.3 Model C, 17 predictors (Full minus Closure duration and VOT)

roc.test(AUCTestNoDurNoVOTDec[[1]], AUCTestNoDurNoVOTDec[[2]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[1]] and AUCTestNoDurNoVOTDec[[2]]
Z = -0.86872, p-value = 0.385
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.7544700   0.7623414 
roc.test(AUCTestNoDurNoVOTDec[[2]], AUCTestNoDurNoVOTDec[[3]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[2]] and AUCTestNoDurNoVOTDec[[3]]
Z = 0.74265, p-value = 0.4577
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.7623414   0.7581980 
roc.test(AUCTestNoDurNoVOTDec[[3]], AUCTestNoDurNoVOTDec[[4]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[3]] and AUCTestNoDurNoVOTDec[[4]]
Z = -0.54725, p-value = 0.5842
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.7581980   0.7612542 
roc.test(AUCTestNoDurNoVOTDec[[4]], AUCTestNoDurNoVOTDec[[5]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[4]] and AUCTestNoDurNoVOTDec[[5]]
Z = 0.26264, p-value = 0.7928
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.7612542   0.7599571 
roc.test(AUCTestNoDurNoVOTDec[[5]], AUCTestNoDurNoVOTDec[[6]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[5]] and AUCTestNoDurNoVOTDec[[6]]
Z = -0.72598, p-value = 0.4679
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.7599571   0.7637318 
roc.test(AUCTestNoDurNoVOTDec[[6]], AUCTestNoDurNoVOTDec[[7]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[6]] and AUCTestNoDurNoVOTDec[[7]]
Z = 0.66791, p-value = 0.5042
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.7637318   0.7618841 
roc.test(AUCTestNoDurNoVOTDec[[7]], AUCTestNoDurNoVOTDec[[8]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[7]] and AUCTestNoDurNoVOTDec[[8]]
Z = 0.10062, p-value = 0.9199
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.7618841   0.7615995 
roc.test(AUCTestNoDurNoVOTDec[[8]], AUCTestNoDurNoVOTDec[[9]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[8]] and AUCTestNoDurNoVOTDec[[9]]
Z = 0.10783, p-value = 0.9141
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.7615995   0.7613195 
roc.test(AUCTestNoDurNoVOTDec[[9]], AUCTestNoDurNoVOTDec[[10]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[9]] and AUCTestNoDurNoVOTDec[[10]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.7613195   0.7613195 
roc.test(AUCTestNoDurNoVOTDec[[10]], AUCTestNoDurNoVOTDec[[11]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[10]] and AUCTestNoDurNoVOTDec[[11]]
Z = 1.406, p-value = 0.1597
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.7613195   0.7596865 
roc.test(AUCTestNoDurNoVOTDec[[11]], AUCTestNoDurNoVOTDec[[12]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[11]] and AUCTestNoDurNoVOTDec[[12]]
Z = -0.57594, p-value = 0.5647
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
  0.7596865   0.7605030 
roc.test(AUCTestNoDurNoVOTDec[[12]], AUCTestNoDurNoVOTDec[[13]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[12]] and AUCTestNoDurNoVOTDec[[13]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
   0.760503    0.760503 
roc.test(AUCTestNoDurNoVOTDec[[13]], AUCTestNoDurNoVOTDec[[14]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[13]] and AUCTestNoDurNoVOTDec[[14]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
   0.760503    0.760503 
roc.test(AUCTestNoDurNoVOTDec[[14]], AUCTestNoDurNoVOTDec[[15]])

    DeLong's test for two correlated ROC curves

data:  AUCTestNoDurNoVOTDec[[14]] and AUCTestNoDurNoVOTDec[[15]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2 
   0.760503    0.760503 
# 600 trees is the best model. 

4.2.2 Models and number of trees

After checking the models above, the following numbers of trees are needed to grow a random forest with the highest predictive accuracy.

  1. Model A: (19 predictors) 500 trees

  2. Model B: (18 predictors) 500 trees

  3. Model C: (17 predictors) 600 trees

4.3 Checking the correlations in the data-frame

Random forests are known to be biased towards correlated predictors and towards predictors with many categories and/or cut-points. We start by checking the correlation levels in our data.

# using "psycho"
# we use parts of the data-frame that contains the z-scored variables of interest
cor <- gemination.train[c(37:55)] %>% 
  correlation()
summary(cor)
plot(cor)

As can be seen from the correlation plot, a substantial number of predictors are correlated with each other, some minimally and others relatively strongly. We will use this information when tuning the computation of variable importance.
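A self-contained sketch (simulated data, not the gemination predictors) of how strongly correlated pairs can be flagged in base R:

```r
# simulate three predictors, two of them strongly correlated
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.5)
x3 <- rnorm(100)
m  <- cor(data.frame(x1, x2, x3))
# indices of the upper-triangle cells with |r| above 0.5
high <- which(abs(m) > 0.5 & upper.tri(m), arr.ind = TRUE)
data.frame(var1 = rownames(m)[high[, 1]], var2 = colnames(m)[high[, 2]],
           r = round(m[high], 2))
```

Only the (x1, x2) pair should be flagged here; the same idea scales to the 19 z-scored predictors.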

4.4 Changing order of outcome

levels(gemination.train$SingGemVoicing)
[1] "geminate voiced"     "geminate voiceless" 
[3] "singleton voiced"    "singleton voiceless"
gemination.train$SingGemVoicing <- factor(gemination.train$SingGemVoicing,
                                          levels = c("singleton voiceless","singleton voiced","geminate voiceless","geminate voiced"))
levels(gemination.train$SingGemVoicing)
[1] "singleton voiceless" "singleton voiced"   
[3] "geminate voiceless"  "geminate voiced"    
levels(gemination.test$SingGemVoicing)
[1] "geminate voiced"     "geminate voiceless" 
[3] "singleton voiced"    "singleton voiceless"
gemination.test$SingGemVoicing <- factor(gemination.test$SingGemVoicing,
                                          levels = c("singleton voiceless","singleton voiced","geminate voiceless","geminate voiced"))
levels(gemination.test$SingGemVoicing)
[1] "singleton voiceless" "singleton voiced"   
[3] "geminate voiceless"  "geminate voiced"    

5 Running the three Random Forests via Conditional Inference Trees

5.1 Model A

5.1.1 Training and predicting

set.seed(1)
cforestGemVoicingTrain <- cforest(SingGemVoicing~durationC2_z+durationVoicedPercC2_z+durationVOTAll_z+
                                         VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                         intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                         f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                         h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                         f1OffsetV1_z+h1mnh2OffsetNorm_z, 
                                       data = gemination.train, 
                                       control=cforest_unbiased(mtry=round(sqrt(19)),ntree=500))
# we use predict to obtain predictions on the testing set from the model trained above
cforestGemVoicingTest <- predict(cforestGemVoicingTrain, newdata = gemination.test, OOB=TRUE, type = "response")
# Below is the predictions table showing the confusions
# Table predictions (rows) by original data (columns) 
round(prop.table(table(cforestGemVoicingTest,gemination.test$SingGemVoicing),2)*100,1)
                     
cforestGemVoicingTest singleton voiceless singleton voiced
  singleton voiceless                90.5              8.1
  singleton voiced                    7.6             91.9
  geminate voiceless                  0.0              0.0
  geminate voiced                     1.9              0.0
                     
cforestGemVoicingTest geminate voiceless geminate voiced
  singleton voiceless                1.5             0.0
  singleton voiced                   0.0             0.0
  geminate voiceless                92.5             6.1
  geminate voiced                    6.0            93.9
# for percent correct
sum(gemination.test$SingGemVoicing==cforestGemVoicingTest)/nrow(gemination.test)
[1] 0.9247492
# for correlations between our predictions and original data-frame
cor(as.numeric(cforestGemVoicingTest), as.numeric(gemination.test$SingGemVoicing))
[1] 0.9552215
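The per-class percentages on the diagonal of the proportion table above can also be extracted directly; a toy illustration with hypothetical labels:

```r
# toy predicted and observed labels (not the gemination data)
pred <- factor(c("a", "a", "b", "b", "b"), levels = c("a", "b"))
obs  <- factor(c("a", "b", "b", "b", "a"), levels = c("a", "b"))
# column-wise proportions, then the diagonal = percent correct per class
diag(prop.table(table(pred, obs), 2)) * 100
# class a: 50% correct, class b: 66.7% correct
```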

5.1.2 Variable Importance

We are interested in the importance of our predictors and how they are informative (or not) with respect to our model.

We run two versions: one without conditional permutations, i.e., without any correction for the correlations in our data, and one with the appropriate corrections.

5.1.2.1 No conditional permutations

5.1.2.1.1 Model specification
set.seed(1)
system.time(varImpSingGemVoicingTrainNT <- varimp(cforest(SingGemVoicing~durationC2_z+durationVoicedPercC2_z+durationVOTAll_z+
                                                           VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                                       intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                                       f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                                       h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                                       f1OffsetV1_z+h1mnh2OffsetNorm_z,  
                                                     data = gemination.train, 
                                                     control=cforest_unbiased(mtry=round(sqrt(19)),ntree=500)), conditional = F))
   user  system elapsed 
  18.19    0.00   18.24 
sort(varImpSingGemVoicingTrainNT)
        h1mnh2OffsetNorm_z        h1mnh2OnsetNormV2_z 
             -8.200456e-05              -4.555809e-06 
   intensityOffsetVOTAll_z               f1OffsetV1_z 
              6.879271e-04               7.152620e-04 
        intensityOnsetV2_z                f1OnsetV2_z 
              7.699317e-04               1.011390e-03 
               f0OnsetV2_z        intensityOffsetV1_z 
              1.421412e-03               1.662870e-03 
              f0OffsetV1_z     intensityOnsetVOTAll_z 
              1.922551e-03               5.594533e-03 
           durationBurst_z               durationV2_z 
              7.216401e-03               7.845103e-03 
             durationVOT_z               durationV1_z 
              1.505239e-02               1.594077e-02 
          durationVOTAll_z durationVoicedPercVOTAll_z 
              2.502961e-02               4.869248e-02 
    durationVoicedPercC2_z                   VOTDec_z 
              1.141959e-01               1.426834e-01 
              durationC2_z 
              2.539180e-01 
5.1.2.1.2 Plotting

We start by renaming the predictors with readable labels

# We change names
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationC2_z'] <- 'Closure Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'VOTDec_z'] <- 'VOT'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'
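The repeated `names()` assignments above can equivalently be written as a single lookup-vector replacement; a self-contained sketch with three of the labels and toy importance values:

```r
# toy importance vector (hypothetical values) and a label lookup vector
varimp.toy <- c(durationC2_z = 0.25, VOTDec_z = 0.14, durationV2_z = 0.008)
labels <- c(durationC2_z = "Closure Dur", VOTDec_z = "VOT",
            durationV2_z = "V2 Dur")  # remaining pairs as in the code above
# replace all names in one step
names(varimp.toy) <- unname(labels[names(varimp.toy)])
names(varimp.toy)  # "Closure Dur" "VOT" "V2 Dur"
```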

After changing the names, we convert the result to a data frame and add the predictor names as a column

varImpSingGemVoicingTrainNT_DF <- varImpSingGemVoicingTrainNT
varImpSingGemVoicingTrainNT_DF <- as.data.frame(varImpSingGemVoicingTrainNT_DF)
varImpSingGemVoicingTrainNT_DF$var <- row.names(varImpSingGemVoicingTrainNT_DF)

Then we use ggplot2 to create our Variable Importance figure

ggVarImpSGDurNT_A <- ggplot(varImpSingGemVoicingTrainNT_DF, aes(reorder(var,varImpSingGemVoicingTrainNT_DF),varImpSingGemVoicingTrainNT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") + 
  coord_flip(ylim = c(0,0.25)) + labs(y = "", x= "", subtitle = "A: Non Conditional 19 predictors") +
  scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurNT_A

This plot can already be used to evaluate the importance of the predictors. However, these estimates are biased because of the high level of correlation among the predictors.

5.1.2.2 Conditional permutations

Conditional permutations are used to correct for the correlations among our predictors (conditional = T). This is heavy on resources, although we have already tuned our model to limit the computational cost, and with the update to R 3.5.0 the computations became less resource intensive. One way to reduce the load under earlier versions of R is to tune the threshold: instead of 0.2, one could increase it to 0.5 or 0.8. The threshold increases or decreases the number of predictors entering the conditional permutation grid and is related to the correlations in the data: with a threshold of 0.2, any predictor correlated at or above 0.2 is added to the grid, and the same applies to the other thresholds. Here we kept the threshold at 0.2 and were able to run the computations on a local machine with 4 cores (8 logical) and 32GB of RAM.
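To illustrate how the threshold works, the following sketch counts how many predictors would enter the conditional permutation grid for `durationC2_z` at the three thresholds mentioned above. It uses a small synthetic data frame for illustration only; the real grid is built internally by varimp() from the correlations in gemination.train.

```r
set.seed(1)
# Synthetic z-scored predictors (illustration only, not the study data)
x <- rnorm(100)
preds <- data.frame(durationC2_z = x,
                    durationVoicedPercC2_z = -0.7 * x + rnorm(100, sd = 0.5),
                    f0OnsetV2_z = rnorm(100))
corMat <- abs(cor(preds))
diag(corMat) <- 0
# Number of predictors correlated with durationC2_z at or above each threshold:
# a higher threshold admits fewer predictors, so the permutation grid is smaller
sapply(c(0.2, 0.5, 0.8), function(th) sum(corMat["durationC2_z", ] >= th))
```

A higher threshold therefore reduces the computational load, at the cost of a weaker correction for correlated predictors.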

5.1.2.2.1 Model specification
set.seed(1)
system.time(varImpSingGemVoicingTrainT <- varimp(cforest(SingGemVoicing~durationC2_z+durationVoicedPercC2_z+durationVOTAll_z+
                                                          VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                                      intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                                      f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                                      h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                                      f1OffsetV1_z+h1mnh2OffsetNorm_z, 
                                                    data = gemination.train, 
                                                    control=cforest_unbiased(mtry=round(sqrt(19)),ntree=500)), 
                                            conditional = T, threshold = 0.2))
   user  system elapsed 
 725.25    0.21  726.94 
sort(varImpSingGemVoicingTrainT)
              f0OffsetV1_z               durationV2_z 
             -4.555809e-06              -4.555809e-06 
       h1mnh2OnsetNormV2_z        intensityOffsetV1_z 
              0.000000e+00               0.000000e+00 
        intensityOnsetV2_z         h1mnh2OffsetNorm_z 
              9.111617e-06               9.111617e-06 
              f1OffsetV1_z    intensityOffsetVOTAll_z 
              1.366743e-05               1.366743e-05 
               f0OnsetV2_z                f1OnsetV2_z 
              1.366743e-05               2.277904e-05 
durationVoicedPercVOTAll_z            durationBurst_z 
              5.466970e-05               6.833713e-05 
    intensityOnsetVOTAll_z              durationVOT_z 
              7.289294e-05               8.200456e-05 
                  VOTDec_z           durationVOTAll_z 
              1.047836e-04               1.184510e-04 
              durationV1_z     durationVoicedPercC2_z 
              2.505695e-04               7.744875e-04 
              durationC2_z 
              1.337585e-02 
5.1.2.2.2 Plotting

We start by changing the names of the predictors

# We change names
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationC2_z'] <- 'Closure Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'VOTDec_z'] <- 'VOT'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'

After changing the names, we convert the results to a data frame and add the predictor names as a column

varImpSingGemVoicingTrainT_DF <- varImpSingGemVoicingTrainT
varImpSingGemVoicingTrainT_DF <- as.data.frame(varImpSingGemVoicingTrainT_DF)
varImpSingGemVoicingTrainT_DF$var <- row.names(varImpSingGemVoicingTrainT_DF)

Then we use ggplot2 to create our Variable Importance figure

ggVarImpSGDurT_A <- ggplot(varImpSingGemVoicingTrainT_DF, aes(reorder(var,varImpSingGemVoicingTrainT_DF),varImpSingGemVoicingTrainT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") + 
  coord_flip(ylim = c(0,0.015)) +  labs(y = "", x= "", subtitle = "A: Conditional 19 predictors")+
  scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurT_A

As can be seen from the results, closure duration is the most important predictor, followed by voicing in the closure, the duration of V1 and voicing in the release. Most of the remaining predictors contribute to the contrast to a lesser extent, with the last six not contributing at all.

5.2 Model B

5.2.1 Training and predicting

set.seed(1)
cforestGemVoicingTrainNoDur <- cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                         VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                  intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                  f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                  h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                  f1OffsetV1_z+h1mnh2OffsetNorm_z, 
                           data = gemination.train, 
                           control=cforest_unbiased(mtry=round(sqrt(18)),ntree=500))
cforestGemVoicingTestNoDur <- predict(cforestGemVoicingTrainNoDur, newdata = gemination.test, OOB=TRUE, type = "response")
## Table predictions (rows) by original data (columns) 
round(prop.table(table(cforestGemVoicingTestNoDur,gemination.test$SingGemVoicing),2)*100,1)
                          
cforestGemVoicingTestNoDur singleton voiceless singleton voiced
       singleton voiceless                57.1              6.8
       singleton voiced                    8.6             91.9
       geminate voiceless                 32.4              0.7
       geminate voiced                     1.9              0.7
                          
cforestGemVoicingTestNoDur geminate voiceless geminate voiced
       singleton voiceless               15.0             3.3
       singleton voiced                   0.0             1.4
       geminate voiceless                78.9             5.2
       geminate voiced                    6.0            90.1
# percent correct
sum(gemination.test$SingGemVoicing==cforestGemVoicingTestNoDur)/nrow(gemination.test)
[1] 0.8227425
#correlations
cor(as.numeric(cforestGemVoicingTestNoDur), as.numeric(gemination.test$SingGemVoicing))
[1] 0.7574884

The classification accuracy of Model B is around 10 percentage points lower than that of Model A, a clear indication that closure duration plays an important role in discriminating the four-way contrast.
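As a rough check on this difference, a two-proportion test can be sketched. The counts below are inferred, not taken from the notebook: assuming a test set of 598 tokens (consistent with the reported proportions, 553/598 ≈ 92.5% for Model A and 492/598 ≈ 82.3% for Model B). This is an unpaired approximation, since the two models' per-token predictions are not compared directly here.

```r
# Inferred correct counts on an assumed 598-token test set (see above)
prop.test(c(553, 492), n = c(598, 598))
```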

5.2.2 Variable Importance

As before, we ran two versions of variable importance: non-conditional and conditional

5.2.2.1 No conditional permutations

5.2.2.1.1 Model specification
set.seed(1)
system.time(varImpSingGemVoicingTrainNoDurNT <- varimp(cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                                                VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                                            intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                                            f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                                            h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                                            f1OffsetV1_z+h1mnh2OffsetNorm_z, 
                                                     data = gemination.train, 
                                                     control=cforest_unbiased(mtry=round(sqrt(18)),ntree=500)), conditional = F))
   user  system elapsed 
  20.59    0.02   20.64 
sort(varImpSingGemVoicingTrainNoDurNT)
        h1mnh2OffsetNorm_z        h1mnh2OnsetNormV2_z 
              6.833713e-05               4.145786e-04 
               f0OnsetV2_z               f1OffsetV1_z 
              9.430524e-04               1.097950e-03 
   intensityOffsetVOTAll_z               f0OffsetV1_z 
              1.102506e-03               1.389522e-03 
        intensityOnsetV2_z                f1OnsetV2_z 
              2.059226e-03               2.118451e-03 
       intensityOffsetV1_z            durationBurst_z 
              3.334852e-03               6.232346e-03 
    intensityOnsetVOTAll_z              durationVOT_z 
              7.981777e-03               1.848747e-02 
              durationV2_z               durationV1_z 
              2.284282e-02               2.370843e-02 
          durationVOTAll_z durationVoicedPercVOTAll_z 
              2.516629e-02               5.084282e-02 
    durationVoicedPercC2_z                   VOTDec_z 
              1.260091e-01               2.236492e-01 
5.2.2.1.2 Plotting

We start by changing the names of the predictors

# We change names
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'VOTDec_z'] <- 'VOT'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'

After changing the names, we convert the results to a data frame and add the predictor names as a column

varImpSingGemVoicingTrainNoDurNT_DF <- varImpSingGemVoicingTrainNoDurNT
varImpSingGemVoicingTrainNoDurNT_DF <- as.data.frame(varImpSingGemVoicingTrainNoDurNT_DF)
varImpSingGemVoicingTrainNoDurNT_DF$var <- row.names(varImpSingGemVoicingTrainNoDurNT_DF)

Then we use ggplot2 to create our Variable Importance figure

ggVarImpSGDurNT_B <- ggplot(varImpSingGemVoicingTrainNoDurNT_DF, aes(reorder(var,varImpSingGemVoicingTrainNoDurNT_DF),varImpSingGemVoicingTrainNoDurNT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") + 
                  coord_flip(ylim = c(0,0.25)) + labs(subtitle = "B: Non Conditional 18 predictors") + 
                  scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurNT_B

5.2.2.2 Conditional permutations

As above, we use conditional permutations, keeping the threshold at 0.2.

5.2.2.2.1 Model specification
set.seed(1)
system.time(varImpSingGemVoicingTrainNoDurT <- varimp(cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                                               VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                                           intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                                           f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                                           h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                                           f1OffsetV1_z+h1mnh2OffsetNorm_z,  
                                                    data = gemination.train, 
                                                    control=cforest_unbiased(mtry=round(sqrt(18)),ntree=500)), 
                                                  conditional = T, threshold = 0.2))
   user  system elapsed 
 873.86    0.09  874.93 
sort(varImpSingGemVoicingTrainNoDurT)
   intensityOffsetVOTAll_z                f1OnsetV2_z 
             -1.366743e-05              -9.111617e-06 
              f1OffsetV1_z                f0OnsetV2_z 
             -4.555809e-06               9.111617e-06 
        intensityOnsetV2_z               f0OffsetV1_z 
              9.111617e-06               1.366743e-05 
             durationVOT_z         h1mnh2OffsetNorm_z 
              1.822323e-05               2.277904e-05 
              durationV2_z        h1mnh2OnsetNormV2_z 
              2.733485e-05               3.189066e-05 
durationVoicedPercVOTAll_z           durationVOTAll_z 
              3.644647e-05               4.555809e-05 
           durationBurst_z        intensityOffsetV1_z 
              5.922551e-05               7.289294e-05 
    intensityOnsetVOTAll_z     durationVoicedPercC2_z 
              1.093394e-04               1.776765e-04 
              durationV1_z                   VOTDec_z 
              2.551253e-04               7.667426e-03 
5.2.2.2.2 Plotting

We start by changing the names of the predictors

# We change names
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'VOTDec_z'] <- 'VOT'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'

After changing the names, we convert the results to a data frame and add the predictor names as a column

varImpSingGemVoicingTrainNoDurT_DF <- varImpSingGemVoicingTrainNoDurT
varImpSingGemVoicingTrainNoDurT_DF <- as.data.frame(varImpSingGemVoicingTrainNoDurT_DF)
varImpSingGemVoicingTrainNoDurT_DF$var <- row.names(varImpSingGemVoicingTrainNoDurT_DF)

Then we use ggplot2 to create our Variable Importance figure

ggVarImpSGDurT_B <- ggplot(varImpSingGemVoicingTrainNoDurT_DF, aes(reorder(var,varImpSingGemVoicingTrainNoDurT_DF),varImpSingGemVoicingTrainNoDurT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") + 
  coord_flip(ylim = c(0,0.01)) + labs(y = "", x= "", subtitle = "B: Conditional 18 predictors") +
  scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurT_B

As can be seen from the results, VOT is the most important predictor, followed by the duration of V1, voicing in the closure and the intensity at the onset of the release. Most of the remaining predictors contribute to the contrast to a lesser degree, with the last three not contributing at all.

5.3 Model C

5.3.1 Training and predicting

set.seed(1)
cforestGemVoicingTrainNoDurNoVOT <- cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                         durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                         intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                         f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                         h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                         f1OffsetV1_z+h1mnh2OffsetNorm_z, 
                                       data = gemination.train, 
                                       control=cforest_unbiased(mtry=round(sqrt(17)),ntree=600))
cforestGemVoicingTestNoDurNoVOT <- predict(cforestGemVoicingTrainNoDurNoVOT, newdata = gemination.test, OOB=TRUE, type = "response")
## Table predictions (rows) by original data (columns) 
round(prop.table(table(cforestGemVoicingTestNoDurNoVOT,gemination.test$SingGemVoicing),2)*100,1)
                               
cforestGemVoicingTestNoDurNoVOT singleton voiceless
            singleton voiceless                51.4
            singleton voiced                    6.7
            geminate voiceless                 32.4
            geminate voiced                     9.5
                               
cforestGemVoicingTestNoDurNoVOT singleton voiced geminate voiceless
            singleton voiceless              4.7               16.5
            singleton voiced                67.6                0.0
            geminate voiceless               1.4               78.2
            geminate voiced                 26.4                5.3
                               
cforestGemVoicingTestNoDurNoVOT geminate voiced
            singleton voiceless             4.7
            singleton voiced               23.1
            geminate voiceless              4.2
            geminate voiced                67.9
# percent correct
sum(gemination.test$SingGemVoicing==cforestGemVoicingTestNoDurNoVOT)/nrow(gemination.test)
[1] 0.6722408
# correlation
cor(as.numeric(cforestGemVoicingTestNoDurNoVOT), as.numeric(gemination.test$SingGemVoicing))
[1] 0.4525515

As can be seen from the above, the percent correct is much lower than in the two other models, as is the correlation. This indicates that these secondary correlates do not on their own provide a clear discrimination of the data.

5.3.2 Variable Importance

As before, we ran two versions of variable importance: non-conditional and conditional

5.3.2.1 No conditional permutations

5.3.2.1.1 Model specification
set.seed(1)
system.time(varImpSingGemVoicingTrainNoDurNoVOTNT <- varimp(cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                                                durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                                                intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                                                f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                                                h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                                                f1OffsetV1_z+h1mnh2OffsetNorm_z, 
                                                              data = gemination.train, 
                                                              control=cforest_unbiased(mtry=round(sqrt(17)),ntree=600)), conditional = F))
   user  system elapsed 
  25.41    0.00   25.45 
sort(varImpSingGemVoicingTrainNoDurNoVOTNT)
       h1mnh2OnsetNormV2_z         h1mnh2OffsetNorm_z 
              0.0001746393               0.0003302961 
               f0OnsetV2_z               f1OffsetV1_z 
              0.0015261959               0.0021753986 
   intensityOffsetVOTAll_z               f0OffsetV1_z 
              0.0027372817               0.0033902809 
        intensityOnsetV2_z        intensityOffsetV1_z 
              0.0036142749               0.0049544419 
               f1OnsetV2_z            durationBurst_z 
              0.0062604404               0.0078170084 
    intensityOnsetVOTAll_z              durationVOT_z 
              0.0167615793               0.0186560364 
          durationVOTAll_z               durationV1_z 
              0.0326689446               0.0430523918 
              durationV2_z durationVoicedPercVOTAll_z 
              0.0486446469               0.0643811693 
    durationVoicedPercC2_z 
              0.1278739560 
5.3.2.1.2 Plotting

We start by changing the names of the predictors

# We change names
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'

After changing the names, we convert the results to a data frame and add the predictor names as a column

varImpSingGemVoicingTrainNoDurNoVOTNT_DF <- varImpSingGemVoicingTrainNoDurNoVOTNT
varImpSingGemVoicingTrainNoDurNoVOTNT_DF <- as.data.frame(varImpSingGemVoicingTrainNoDurNoVOTNT_DF)
varImpSingGemVoicingTrainNoDurNoVOTNT_DF$var <- row.names(varImpSingGemVoicingTrainNoDurNoVOTNT_DF)

Then we use ggplot2 to create our Variable Importance figure

ggVarImpSGDurNT_C <- ggplot(varImpSingGemVoicingTrainNoDurNoVOTNT_DF, aes(reorder(var,varImpSingGemVoicingTrainNoDurNoVOTNT_DF),varImpSingGemVoicingTrainNoDurNoVOTNT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") + 
  coord_flip(ylim = c(0,0.25)) + labs(subtitle = "C: Non Conditional 17 predictors") + 
  scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurNT_C

5.3.2.2 Conditional permutations

As above, we use conditional permutations, keeping the threshold at 0.2.

5.3.2.2.1 Model specification
set.seed(1)
system.time(varImpSingGemVoicingTrainNoDurNoVOTT <- varimp(cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                                               durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                                               intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                                               f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                                               h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                                               f1OffsetV1_z+h1mnh2OffsetNorm_z,  
                                                             data = gemination.train, 
                                                             control=cforest_unbiased(mtry=round(sqrt(17)),ntree=600)), 
                                                     conditional = T, threshold = 0.2))
   user  system elapsed 
1202.19    0.19 1203.72 
sort(varImpSingGemVoicingTrainNoDurNoVOTT)
   intensityOffsetVOTAll_z            durationBurst_z 
             -1.518603e-05              -7.593014e-06 
        h1mnh2OffsetNorm_z               f0OffsetV1_z 
             -7.593014e-06              -3.796507e-06 
               f0OnsetV2_z              durationVOT_z 
              3.796507e-06               7.593014e-06 
              f1OffsetV1_z         intensityOnsetV2_z 
              7.593014e-06               1.518603e-05 
       h1mnh2OnsetNormV2_z        intensityOffsetV1_z 
              1.518603e-05               1.518603e-05 
durationVoicedPercVOTAll_z                f1OnsetV2_z 
              2.277904e-05               2.277904e-05 
          durationVOTAll_z     durationVoicedPercC2_z 
              4.176158e-05               6.074412e-05 
    intensityOnsetVOTAll_z               durationV2_z 
              1.366743e-04               2.467730e-04 
              durationV1_z 
              3.834472e-04 
5.3.2.2.2 Plotting

We start by changing the names of the predictors

# We change names
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'

After changing names, we append to a data-frame

varImpSingGemVoicingTrainNoDurNoVOTT_DF <- varImpSingGemVoicingTrainNoDurNoVOTT
varImpSingGemVoicingTrainNoDurNoVOTT_DF <- as.data.frame(varImpSingGemVoicingTrainNoDurNoVOTT_DF)
varImpSingGemVoicingTrainNoDurNoVOTT_DF$var <- row.names(varImpSingGemVoicingTrainNoDurNoVOTT_DF)

Then we use ggplot2 to create our Variable Importance figure

ggVarImpSGDurT_C <- ggplot(varImpSingGemVoicingTrainNoDurNoVOTT_DF, aes(reorder(var,varImpSingGemVoicingTrainNoDurNoVOTT_DF),varImpSingGemVoicingTrainNoDurNoVOTT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") + 
  coord_flip(ylim = c(0,0.001)) + labs(y = "", x= "", subtitle = "C: Conditional 17 predictors") +
  scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurT_C

As can be seen from the results, the durations of the surrounding vowels, followed by intensity at the onset of the release, voicing in the closure, and duration of the release, are the most important predictors. Their scores are much lower than in the other two models, indicating that these correlates are secondary. Most of the remaining predictors contribute to the contrast to a lesser degree, with the last four not contributing to it.

6 Conclusion

The results of the Random Forests are a clear indication that closure duration is the main correlate of the four-way voicing by gemination contrast. Model A had an accuracy of 92.5%, an extremely high classification rate. Model B, which excludes closure duration, still achieved a relatively high classification accuracy of 82.3%. The 10.2% difference between the two models is a clear indication of the role of closure duration in both voicing and gemination. Model C evaluated the role of all secondary predictors, excluding both closure duration and VOT. Its lower classification rate of 67.2% is still well above chance (25%), so the model is statistically significant, but its predictive accuracy is markedly lower. This clearly indicates that these correlates act as secondary ones, playing an enhancing role relative to the primary correlate, closure duration.

---
title: "Voicing and Gemination - Random Forests"
author: 
  name: "Jalal Al-Tamimi"
  affiliation: "Newcastle University"
date: "17 July 2018"
output: 
  html_notebook:
    number_sections: true
    toc: true
    toc_depth: 6
    toc_float:
      collapsed: true
---

# Introduction
This analysis accompanies the article "Al-Tamimi, J. and Khattab, G., (2018). *Acoustic correlates of the voicing contrast in Lebanese Arabic singleton and geminate plosives*. Invited manuscript for the special issue of Journal of Phonetics, "Marking 50 Years of Research on Voice Onset Time and the Voicing Contrast in the World’s Languages" (eds., T. Cho, G. Docherty & D. Whalen). DOI: https://doi.org/10.1016/j.wocn.2018.09.010

This notebook presents the analyses and results of the Random Forests classification. The aim was to evaluate the robustness of the results obtained for each of the 19 acoustic correlates used (six durational, three voicing, and ten non-temporal, including f0, F1 and harmonic differences).
These acoustic correlates were used to evaluate the relationship between them and the four-way "contrast" observed between voicing and gemination, i.e., Voiced Singleton, Voiceless Singleton, Voiced Geminate and Voiceless Geminate. The previous notebook examined how each *individual* acoustic correlate was associated with the predictors ([see notebook](https://jalalal-tamimi.github.io/R-Voicing-Gemination-VOT/Voicing%20and%20Gemination%20-%20Mixed%20Effects%20Modelling.nb.html)). This section extends the analysis by looking at the *combination* of these acoustic correlates and their predictive power in discriminating between the four categories above. 

# Loading required packages

We start by loading the required packages.

```{r warning=FALSE, message=FALSE, error=FALSE}
requiredPackages = c('dplyr','DMwR','party','pROC','psycho','ggplot2','reshape','foreach','doSNOW','parallel')
for(p in requiredPackages){
  if(!require(p,character.only = TRUE)) install.packages(p)
  library(p,character.only = TRUE)
}
```


# Preprocessing the data

## Reading in the data

We start by reading in the data and checking its structure. 

```{r}
ResultsFullOriginal <- read.csv("ResultsFullOriginalData.csv")

str(ResultsFullOriginal)
summary(ResultsFullOriginal)

```

## Central imputations

As can be seen from the summary above, the data contains some missing values. For the Random Forests to work appropriately, we replace them with centrally imputed values. 

```{r}
# Central imputation using the package "DMwR".

ResultsFullOriginalCentral <- ResultsFullOriginal	# create copy
ResultsFullOriginalCentral[, 18:36] <- centralImputation(ResultsFullOriginalCentral[, 18:36])
str(ResultsFullOriginalCentral)
summary(ResultsFullOriginalCentral)
write.csv(ResultsFullOriginalCentral,"ResultsFullOriginalCentral.csv")
```
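To illustrate what central imputation does (numeric missing values are replaced by the column median, nominal ones by the most frequent level), here is a minimal sketch on a toy data frame; the `toy` object is hypothetical and for illustration only.

```{r}
# toy data frame with one missing value per column
toy <- data.frame(dur = c(10, NA, 30),
                  voice = factor(c("voiced", "voiced", NA)))
# numeric NA is replaced by the median of 10 and 30, i.e. 20;
# the factor NA by the most frequent level, "voiced"
centralImputation(toy)
```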

## Z-Scoring the data

Given the differences in scales across the acoustic correlates (e.g., durations in milliseconds, intensity in dB), the magnitudes of the values differ. As a way to "normalise" the data and put all predictors on the same scale, we z-score them.

```{r}
# from the package "dplyr"
all_z <- select(ResultsFullOriginalCentral, durationC2:h1mnh2OffsetNorm)
all_z <- apply(all_z, 2, scale)
colnames(all_z) <- paste(colnames(all_z), 'z', sep = '_')
ResultsFullOriginalCentral <- cbind(ResultsFullOriginalCentral, all_z)

summary(ResultsFullOriginalCentral)
str(ResultsFullOriginalCentral)
```
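As a quick sanity check of the z-scoring (a sketch on made-up values, not part of the data), `scale()` subtracts the column mean and divides by the standard deviation, so the result has mean 0 and standard deviation 1 regardless of the original unit.

```{r}
x <- c(100, 150, 200)           # e.g., durations in ms
z <- as.numeric(scale(x))       # (x - mean(x)) / sd(x)
round(c(mean = mean(z), sd = sd(z)), 10)  # mean 0, sd 1
```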

## Creating a new outcome

In the following analysis, we are interested in the four-way contrast observed between voicing and gemination, i.e., Voiceless Singleton, Voiced Singleton, Voiceless Geminate and Voiced Geminate. 

```{r}
ResultsFullOriginalCentral["SingGemVoicing"] <- NA
ResultsFullOriginalCentral["SingGemVoicing"] <- paste(ResultsFullOriginalCentral$singGem,
                                                            ResultsFullOriginalCentral$voiced.unvoiced)
ResultsFullOriginalCentral$SingGemVoicing <- as.factor(ResultsFullOriginalCentral$SingGemVoicing)

str(ResultsFullOriginalCentral)

```

After all this pre-processing, we are now ready to start our Random Forest analysis.

# Tuning our Random Forests

## Creating training and testing sets

We create training and testing sets (two thirds and one third of the data, respectively). The aim is to train three random forests and, using the testing set, evaluate the predictive power of each.  

```{r}
# creating training and testing sets
# we set the seed for reproducibility
set.seed(1)
train.idx <- sample(nrow(ResultsFullOriginalCentral), 2/3 * nrow(ResultsFullOriginalCentral))
gemination.train <- ResultsFullOriginalCentral[train.idx, ]
gemination.test <- ResultsFullOriginalCentral[-train.idx, ]
```
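Since the split is random rather than stratified, it is worth checking that the four categories are reasonably balanced across the two sets (a quick sketch, assuming the objects created above are in memory):

```{r}
# proportion of each category in the training and testing sets
round(prop.table(table(gemination.train$SingGemVoicing)), 2)
round(prop.table(table(gemination.test$SingGemVoicing)), 2)
```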

## Estimating number of trees required

We use a procedure we developed to obtain the optimal number of trees needed to grow a forest with the highest predictive accuracy. It implements the approach of "Oshiro, T. M., Perez, P. S., & Baranauskas, J. A. (2012). How many trees in a random forest? In International Workshop on Machine Learning and Data Mining in Pattern Recognition (pp. 154–168)" (see my GitHub repo for the script).

The code below runs the analysis using parallel computing. We detect the number of available cores, create a cluster, and register it. We are training three Random Forests (see paper for more details):

1. Model A: A forest with the full 19 predictors (AUCTestWithDur) 

2. Model B: A forest with 18 predictors (Model A minus the closure duration (AUCTestNoDur))

3. Model C: A forest with 17 predictors (Model B minus the VOT (AUCTestNoDurNoVOTDec))

```{r}
## below to prepare using parallel computing
NumberOfCluster <- detectCores()
cl <- makeCluster(NumberOfCluster)
registerDoSNOW(cl)

## below computes the ROC (Receiver Operating Characteristic) curves for this dataframe 
## for 15 random forests, with ntree increasing in steps of 100 (100 to 1500 trees)
## and saves these in a list that will be used later to compare ROC curves
treeNumb <- 15

system.time(AUCTestWithDur <-
              foreach (i=1:treeNumb,.packages=c("party","pROC")) %dopar% {
                set.seed(1)
                cforest.mdl=cforest(SingGemVoicing~durationC2_z+durationVoicedPercC2_z+durationVOTAll_z+
                                      VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                      intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                      f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                      h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                      f1OffsetV1_z+h1mnh2OffsetNorm_z, data = gemination.train,
                                    controls = cforest_unbiased(mtry = round(sqrt(19)), ntree = i*100))
                cforest.cond = predict(cforest.mdl,OOB=TRUE)
                roc <- roc(gemination.train$SingGemVoicing,as.numeric(cforest.cond))
              })


system.time(AUCTestNoDur <-
              foreach (i=1:treeNumb,.packages=c("party","pROC")) %dopar% {
                set.seed(1)
                cforest.mdl=cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                      VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                      intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                      f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                      h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                      f1OffsetV1_z+h1mnh2OffsetNorm_z, data = gemination.train,
                                    controls = cforest_unbiased(mtry = round(sqrt(18)), ntree = i*100))
                cforest.cond = predict(cforest.mdl,OOB=TRUE)
                roc <- roc(gemination.train$SingGemVoicing,as.numeric(cforest.cond))
              })

system.time(AUCTestNoDurNoVOTDec <-
              foreach (i=1:treeNumb,.packages=c("party","pROC")) %dopar% {
                set.seed(1)
                cforest.mdl=cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                      durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                      intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                      f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                      h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                      f1OffsetV1_z+h1mnh2OffsetNorm_z, data = gemination.train,
                                    controls = cforest_unbiased(mtry = round(sqrt(17)), ntree = i*100))
                cforest.cond = predict(cforest.mdl,OOB=TRUE)
                roc <- roc(gemination.train$SingGemVoicing,as.numeric(cforest.cond))
              })


# stop the cluster; otherwise R keeps all the cores registered and the system slows down
stopCluster(cl)
#

```

### Significance testing

#### Model A, 19 predictors

```{r}
# check significant ROC comparison
# with duration

roc.test(AUCTestWithDur[[1]], AUCTestWithDur[[2]])
roc.test(AUCTestWithDur[[2]], AUCTestWithDur[[3]])
roc.test(AUCTestWithDur[[3]], AUCTestWithDur[[4]])
roc.test(AUCTestWithDur[[4]], AUCTestWithDur[[5]])
roc.test(AUCTestWithDur[[5]], AUCTestWithDur[[6]])
roc.test(AUCTestWithDur[[6]], AUCTestWithDur[[7]])
roc.test(AUCTestWithDur[[7]], AUCTestWithDur[[8]])
roc.test(AUCTestWithDur[[8]], AUCTestWithDur[[9]])
roc.test(AUCTestWithDur[[9]], AUCTestWithDur[[10]])
roc.test(AUCTestWithDur[[10]], AUCTestWithDur[[11]])
roc.test(AUCTestWithDur[[11]], AUCTestWithDur[[12]])
roc.test(AUCTestWithDur[[12]], AUCTestWithDur[[13]])
roc.test(AUCTestWithDur[[13]], AUCTestWithDur[[14]])
roc.test(AUCTestWithDur[[14]], AUCTestWithDur[[15]])

# 500 trees is the best model

```

#### Model B, 18 predictors (full minus Closure duration)

```{r}

roc.test(AUCTestNoDur[[1]], AUCTestNoDur[[2]])
roc.test(AUCTestNoDur[[2]], AUCTestNoDur[[3]])
roc.test(AUCTestNoDur[[3]], AUCTestNoDur[[4]])
roc.test(AUCTestNoDur[[4]], AUCTestNoDur[[5]])
roc.test(AUCTestNoDur[[5]], AUCTestNoDur[[6]])
roc.test(AUCTestNoDur[[6]], AUCTestNoDur[[7]])
roc.test(AUCTestNoDur[[7]], AUCTestNoDur[[8]])
roc.test(AUCTestNoDur[[8]], AUCTestNoDur[[9]])
roc.test(AUCTestNoDur[[9]], AUCTestNoDur[[10]])
roc.test(AUCTestNoDur[[10]], AUCTestNoDur[[11]])
roc.test(AUCTestNoDur[[11]], AUCTestNoDur[[12]])
roc.test(AUCTestNoDur[[12]], AUCTestNoDur[[13]])
roc.test(AUCTestNoDur[[13]], AUCTestNoDur[[14]])
roc.test(AUCTestNoDur[[14]], AUCTestNoDur[[15]])


## 500 trees is the best model.

```

#### Model C, 17 predictors (Full minus Closure duration and VOT)

```{r}
roc.test(AUCTestNoDurNoVOTDec[[1]], AUCTestNoDurNoVOTDec[[2]])
roc.test(AUCTestNoDurNoVOTDec[[2]], AUCTestNoDurNoVOTDec[[3]])
roc.test(AUCTestNoDurNoVOTDec[[3]], AUCTestNoDurNoVOTDec[[4]])
roc.test(AUCTestNoDurNoVOTDec[[4]], AUCTestNoDurNoVOTDec[[5]])
roc.test(AUCTestNoDurNoVOTDec[[5]], AUCTestNoDurNoVOTDec[[6]])
roc.test(AUCTestNoDurNoVOTDec[[6]], AUCTestNoDurNoVOTDec[[7]])
roc.test(AUCTestNoDurNoVOTDec[[7]], AUCTestNoDurNoVOTDec[[8]])
roc.test(AUCTestNoDurNoVOTDec[[8]], AUCTestNoDurNoVOTDec[[9]])
roc.test(AUCTestNoDurNoVOTDec[[9]], AUCTestNoDurNoVOTDec[[10]])
roc.test(AUCTestNoDurNoVOTDec[[10]], AUCTestNoDurNoVOTDec[[11]])
roc.test(AUCTestNoDurNoVOTDec[[11]], AUCTestNoDurNoVOTDec[[12]])
roc.test(AUCTestNoDurNoVOTDec[[12]], AUCTestNoDurNoVOTDec[[13]])
roc.test(AUCTestNoDurNoVOTDec[[13]], AUCTestNoDurNoVOTDec[[14]])
roc.test(AUCTestNoDurNoVOTDec[[14]], AUCTestNoDurNoVOTDec[[15]])

# 600 trees is the best model. 
```

### Models and number of trees

After checking the models above, we have the following numbers of trees needed to grow a random forest with the highest predictive accuracy.

1. Model A: (19 predictors) 500 trees

2. Model B: (18 predictors) 500 trees

3. Model C: (17 predictors) 600 trees
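As an optional check (a sketch, assuming the `AUCTestWithDur` list from the tuning chunk is still in memory), the AUC of each of the 15 forests can be extracted with `pROC::auc()` to see where performance plateaus; the same can be done for the other two lists.

```{r}
# AUC per forest, indexed by its number of trees (100, 200, ..., 1500)
aucByTrees <- sapply(AUCTestWithDur, function(r) as.numeric(auc(r)))
names(aucByTrees) <- seq(100, 1500, by = 100)
aucByTrees
```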

## Checking the correlations in the data-frame

Random forests are known to be biased towards correlated predictors and predictors with multiple categories and/or cut-points. We start by checking the correlation levels in our data.

```{r fig.width=9, fig.height=9}
# using "psycho"
# we use parts of the data-frame that contains the z-scored variables of interest
cor <- gemination.train[c(37:55)] %>% 
  correlation()
summary(cor)
plot(cor)
```

As can be seen from the correlation plot, a substantial number of predictors are correlated with each other, some minimally and others relatively strongly. We will use this information when tuning the computations of variable importance.

## Changing order of outcome

```{r}
levels(gemination.train$SingGemVoicing)
gemination.train$SingGemVoicing <- factor(gemination.train$SingGemVoicing,
                                          levels = c("singleton voiceless","singleton voiced","geminate voiceless","geminate voiced"))
levels(gemination.train$SingGemVoicing)
levels(gemination.test$SingGemVoicing)
gemination.test$SingGemVoicing <- factor(gemination.test$SingGemVoicing,
                                          levels = c("singleton voiceless","singleton voiced","geminate voiceless","geminate voiced"))
levels(gemination.test$SingGemVoicing)

```

# Running the three Random Forests via Conditional Inference Trees

## Model A

### Training and predicting

```{r}
set.seed(1)
cforestGemVoicingTrain <- cforest(SingGemVoicing~durationC2_z+durationVoicedPercC2_z+durationVOTAll_z+
                                         VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                         intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                         f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                         h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                         f1OffsetV1_z+h1mnh2OffsetNorm_z, 
                                       data = gemination.train, 
                                       control=cforest_unbiased(mtry=round(sqrt(19)),ntree=500))

# we use predict to obtain predictions from the training set on the testing set
cforestGemVoicingTest <- predict(cforestGemVoicingTrain, newdata = gemination.test, OOB=TRUE, type = "response")

# Below is the predictions table showing the confusions

# Table predictions (rows) by original data (columns) 
round(prop.table(table(cforestGemVoicingTest,gemination.test$SingGemVoicing),2)*100,1)

# for percent correct
sum(gemination.test$SingGemVoicing==cforestGemVoicingTest)/nrow(gemination.test)

# for correlations between our predictions and original data-frame
cor(as.numeric(cforestGemVoicingTest), as.numeric(gemination.test$SingGemVoicing))

```

### Variable Importance

We are interested in the importance of our predictors and in how informative they are (or not) with respect to our model.

We run two versions: one without conditional permutations, i.e., without corrections for the correlations in our data, and one with such corrections.

#### No conditional permutations

##### Model specification

```{r}
set.seed(1)
system.time(varImpSingGemVoicingTrainNT <- varimp(cforest(SingGemVoicing~durationC2_z+durationVoicedPercC2_z+durationVOTAll_z+
                                                           VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                                       intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                                       f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                                       h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                                       f1OffsetV1_z+h1mnh2OffsetNorm_z,  
                                                     data = gemination.train, 
                                                     control=cforest_unbiased(mtry=round(sqrt(19)),ntree=500)), conditional = F))

sort(varImpSingGemVoicingTrainNT)

```

##### Plotting

We start by changing the names of the predictors.

```{r}
# We change names
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationC2_z'] <- 'Closure Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'VOTDec_z'] <- 'VOT'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'

```

After changing names, we append to a data-frame

```{r}

varImpSingGemVoicingTrainNT_DF <- varImpSingGemVoicingTrainNT
varImpSingGemVoicingTrainNT_DF <- as.data.frame(varImpSingGemVoicingTrainNT_DF)
varImpSingGemVoicingTrainNT_DF$var <- row.names(varImpSingGemVoicingTrainNT_DF)
```

Then we use ggplot2 to create our Variable Importance figure

```{r fig.width=6, fig.height=6}
ggVarImpSGDurNT_A <- ggplot(varImpSingGemVoicingTrainNT_DF, aes(reorder(var,varImpSingGemVoicingTrainNT_DF),varImpSingGemVoicingTrainNT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") + 
  coord_flip(ylim = c(0,0.25)) + labs(y = "", x= "", subtitle = "A: Non Conditional 19 predictors") +
  scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())

ggVarImpSGDurNT_A
```

This plot can already be used to evaluate the importance of the predictors. However, it is biased because of the high level of correlation among them. 

#### Conditional permutations

The conditional permutations correct for the correlations among our predictors (conditional = T). This is heavy on resources, and we have already tuned our model to keep the computations manageable. With the update to R 3.5.0, the computations are less resource intensive. One way to reduce the load with earlier versions of R is to increase the threshold, e.g., from 0.2 to 0.5 or 0.8. The threshold controls how many predictors enter the conditional permutation grid and is related to the correlations in the data: with 0.2, any correlation at or above 0.2 is added to the grid, and the same logic applies to the other thresholds. Here we kept the threshold at 0.2 and were able to run the computations on a local machine with 4 physical cores (8 logical) and 32GB of RAM.
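To get a feel for how the threshold trims the conditioning grid, one can count, for each predictor, how many other predictors exceed a given absolute correlation. This is a rough illustration using raw correlations on the z-scored columns (assuming `gemination.train` and the column indices used above); `party` itself bases the grid on association-test statistics rather than raw correlations.

```{r}
corMat <- abs(cor(gemination.train[, 37:55]))
diag(corMat) <- 0   # ignore self-correlations
# number of conditioning partners per predictor at two thresholds
colSums(corMat >= 0.2)
colSums(corMat >= 0.8)
```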

##### Model specification

```{r}
set.seed(1)
system.time(varImpSingGemVoicingTrainT <- varimp(cforest(SingGemVoicing~durationC2_z+durationVoicedPercC2_z+durationVOTAll_z+
                                                          VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                                      intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                                      f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                                      h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                                      f1OffsetV1_z+h1mnh2OffsetNorm_z, 
                                                    data = gemination.train, 
                                                    control=cforest_unbiased(mtry=round(sqrt(19)),ntree=500)), 
                                            conditional = T, threshold = 0.2))
sort(varImpSingGemVoicingTrainT)

```

##### Plotting

We start by changing the names of the predictors.

```{r}
# We change names
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationC2_z'] <- 'Closure Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'VOTDec_z'] <- 'VOT'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'

```

After changing names, we append to a data-frame

```{r}

varImpSingGemVoicingTrainT_DF <- varImpSingGemVoicingTrainT
varImpSingGemVoicingTrainT_DF <- as.data.frame(varImpSingGemVoicingTrainT_DF)
varImpSingGemVoicingTrainT_DF$var <- row.names(varImpSingGemVoicingTrainT_DF)

```

Then we use ggplot2 to create our Variable Importance figure

```{r fig.width=6, fig.height=6}
ggVarImpSGDurT_A <- ggplot(varImpSingGemVoicingTrainT_DF, aes(reorder(var,varImpSingGemVoicingTrainT_DF),varImpSingGemVoicingTrainT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") + 
  coord_flip(ylim = c(0,0.015)) +  labs(y = "", x= "", subtitle = "A: Conditional 19 predictors")+
  scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurT_A
```


As can be seen from the results, closure duration is the most important predictor, followed by voicing in the closure, duration of V1, and voicing in the release. Most of the remaining predictors contribute to the contrast to a lesser extent, with the last six not contributing to it. 

## Model B

### Training and predicting

```{r}
set.seed(1)
cforestGemVoicingTrainNoDur <- cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                         VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                  intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                  f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                  h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                  f1OffsetV1_z+h1mnh2OffsetNorm_z, 
                           data = gemination.train, 
                           control=cforest_unbiased(mtry=round(sqrt(18)),ntree=500))


cforestGemVoicingTestNoDur <- predict(cforestGemVoicingTrainNoDur, newdata = gemination.test, OOB=TRUE, type = "response")

## Table predictions (rows) by original data (columns) 
round(prop.table(table(cforestGemVoicingTestNoDur,gemination.test$SingGemVoicing),2)*100,1)

# percent correct
sum(gemination.test$SingGemVoicing==cforestGemVoicingTestNoDur)/nrow(gemination.test)
#correlations
cor(as.numeric(cforestGemVoicingTestNoDur), as.numeric(gemination.test$SingGemVoicing))

```

The percent classification accuracy of Model B is around 10% lower than that of Model A. This is a clear indication that closure duration plays an important role in discriminating the four-way contrast. 
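The difference can be computed directly (a sketch, assuming the prediction objects from the Model A and Model B chunks above are still in memory):

```{r}
# percent-correct for Models A and B on the testing set, and their difference
accA <- sum(gemination.test$SingGemVoicing == cforestGemVoicingTest) / nrow(gemination.test)
accB <- sum(gemination.test$SingGemVoicing == cforestGemVoicingTestNoDur) / nrow(gemination.test)
round((accA - accB) * 100, 1)
```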

### Variable Importance

As before, we run two versions of variable importance: non-conditional and conditional.

#### No conditional permutations

##### Model specification

```{r}
set.seed(1)
system.time(varImpSingGemVoicingTrainNoDurNT <- varimp(cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                                                VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                                            intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                                            f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                                            h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                                            f1OffsetV1_z+h1mnh2OffsetNorm_z, 
                                                     data = gemination.train, 
                                                     control=cforest_unbiased(mtry=round(sqrt(18)),ntree=500)), conditional = F))

sort(varImpSingGemVoicingTrainNoDurNT)

```

##### Plotting

We start by changing the names of the predictors.

```{r}
# Map the raw predictor names onto the display labels used in the figures
predLabels <- c(durationV2_z = 'V2 Dur', durationV1_z = 'V1 Dur',
                VOTDec_z = 'VOT', durationVOTAll_z = 'Rel Dur',
                durationBurst_z = 'Burst Dur', durationVOT_z = 'Asp Dur',
                durationVoicedPercC2_z = 'CD Voicing',
                durationVoicedPercVOTAll_z = 'Rel Voicing',
                intensityOffsetV1_z = 'V1 Off Int', intensityOnsetV2_z = 'V2 On Int',
                intensityOffsetVOTAll_z = 'Rel Of Int', intensityOnsetVOTAll_z = 'Rel On Int',
                f0OnsetV2_z = 'V2 On f0', f0OffsetV1_z = 'V1 Of f0',
                f1OnsetV2_z = 'V2 On F1', f1OffsetV1_z = 'V1 Of F1',
                h1mnh2OffsetNorm_z = 'V1 Of H1*-H2*', h1mnh2OnsetNormV2_z = 'V2 On H1*-H2*')
names(varImpSingGemVoicingTrainNoDurNT) <- predLabels[names(varImpSingGemVoicingTrainNoDurNT)]


```

After renaming, we convert the named vector to a data frame and store the predictor names in a column.

```{r}

varImpSingGemVoicingTrainNoDurNT_DF <- varImpSingGemVoicingTrainNoDurNT
varImpSingGemVoicingTrainNoDurNT_DF <- as.data.frame(varImpSingGemVoicingTrainNoDurNT_DF)
varImpSingGemVoicingTrainNoDurNT_DF$var <- row.names(varImpSingGemVoicingTrainNoDurNT_DF)

```

Then we use ggplot2 to create our Variable Importance figure.

```{r fig.width=6, fig.height=6}
ggVarImpSGDurNT_B <- ggplot(varImpSingGemVoicingTrainNoDurNT_DF, aes(reorder(var,varImpSingGemVoicingTrainNoDurNT_DF),varImpSingGemVoicingTrainNoDurNT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") + 
                  coord_flip(ylim = c(0,0.25)) + labs(subtitle = "B: Non Conditional 18 predictors") + 
                  scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurNT_B
```


#### Conditional permutations

As above, we use conditional permutations and adjust the threshold.

##### Model specification

```{r}
set.seed(1)
system.time(varImpSingGemVoicingTrainNoDurT <- varimp(cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                                               VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                                           intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                                           f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                                           h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                                           f1OffsetV1_z+h1mnh2OffsetNorm_z,  
                                                    data = gemination.train, 
                                                    control=cforest_unbiased(mtry=round(sqrt(18)),ntree=500)), 
                                                  conditional = T, threshold = 0.2))
sort(varImpSingGemVoicingTrainNoDurT)


```

##### Plotting

We start by renaming the predictors with the display labels used in the figures.

```{r}
# Map the raw predictor names onto the display labels used in the figures
predLabels <- c(durationV2_z = 'V2 Dur', durationV1_z = 'V1 Dur',
                VOTDec_z = 'VOT', durationVOTAll_z = 'Rel Dur',
                durationBurst_z = 'Burst Dur', durationVOT_z = 'Asp Dur',
                durationVoicedPercC2_z = 'CD Voicing',
                durationVoicedPercVOTAll_z = 'Rel Voicing',
                intensityOffsetV1_z = 'V1 Off Int', intensityOnsetV2_z = 'V2 On Int',
                intensityOffsetVOTAll_z = 'Rel Of Int', intensityOnsetVOTAll_z = 'Rel On Int',
                f0OnsetV2_z = 'V2 On f0', f0OffsetV1_z = 'V1 Of f0',
                f1OnsetV2_z = 'V2 On F1', f1OffsetV1_z = 'V1 Of F1',
                h1mnh2OffsetNorm_z = 'V1 Of H1*-H2*', h1mnh2OnsetNormV2_z = 'V2 On H1*-H2*')
names(varImpSingGemVoicingTrainNoDurT) <- predLabels[names(varImpSingGemVoicingTrainNoDurT)]


```

After renaming, we convert the named vector to a data frame and store the predictor names in a column.

```{r}

varImpSingGemVoicingTrainNoDurT_DF <- varImpSingGemVoicingTrainNoDurT
varImpSingGemVoicingTrainNoDurT_DF <- as.data.frame(varImpSingGemVoicingTrainNoDurT_DF)
varImpSingGemVoicingTrainNoDurT_DF$var <- row.names(varImpSingGemVoicingTrainNoDurT_DF)


```

Then we use ggplot2 to create our Variable Importance figure.

```{r fig.width=6, fig.height=6}
ggVarImpSGDurT_B <- ggplot(varImpSingGemVoicingTrainNoDurT_DF, aes(reorder(var,varImpSingGemVoicingTrainNoDurT_DF),varImpSingGemVoicingTrainNoDurT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") + 
  coord_flip(ylim = c(0,0.01)) + labs(y = "", x= "", subtitle = "B: Conditional 18 predictors") +
  scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurT_B
```


As can be seen from the results, VOT is the most important predictor, followed by the duration of V1, voicing during the closure, and the intensity at the onset of the release. Most of the remaining predictors contribute to the contrast to a lesser degree, with the last three not contributing to it at all.

## Model C

### Training and predicting

```{r}
set.seed(1)
cforestGemVoicingTrainNoDurNoVOT <- cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                         durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                         intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                         f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                         h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                         f1OffsetV1_z+h1mnh2OffsetNorm_z, 
                                       data = gemination.train, 
                                       control=cforest_unbiased(mtry=round(sqrt(17)),ntree=600))


cforestGemVoicingTestNoDurNoVOT <- predict(cforestGemVoicingTrainNoDurNoVOT, newdata = gemination.test, OOB=TRUE, type = "response")

## Table predictions (rows) by original data (columns) 
round(prop.table(table(cforestGemVoicingTestNoDurNoVOT,gemination.test$SingGemVoicing),2)*100,1)

# percent correct
sum(gemination.test$SingGemVoicing==cforestGemVoicingTestNoDurNoVOT)/nrow(gemination.test)
# correlation
cor(as.numeric(cforestGemVoicingTestNoDurNoVOT), as.numeric(gemination.test$SingGemVoicing))


```

As can be seen from the above, the percent correct is much lower than in the two other models, as is the level of correlation. This indicates that these secondary correlates do not, on their own, provide a clear discrimination between the four categories. 
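Overall percent correct can mask category-specific confusions; per-category recall and precision can be read off the diagonal of the same kind of confusion table computed above. A minimal self-contained sketch with a toy 4 x 4 confusion matrix and shorthand category labels (the counts are illustrative only; in the notebook, the table would be built from the predictions and `gemination.test$SingGemVoicing`):

```r
# Toy confusion matrix: predictions (rows) by true categories (columns).
# Counts and the VdS/VlS/VdG/VlG shorthand labels are illustrative only.
tab <- matrix(c(30,  5,  3,  2,
                 4, 28,  2,  6,
                 3,  4, 27,  5,
                 3,  3,  8, 27),
              nrow = 4, byrow = TRUE,
              dimnames = list(pred = c("VdS", "VlS", "VdG", "VlG"),
                              true = c("VdS", "VlS", "VdG", "VlG")))

recall    <- diag(tab) / colSums(tab)  # per-category recall (column-wise)
precision <- diag(tab) / rowSums(tab)  # per-category precision (row-wise)
round(recall, 3)
round(precision, 3)
```

This makes it easy to see whether a drop in overall accuracy is spread evenly or driven by confusions within one or two categories (e.g., between the two geminate categories).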

### Variable Importance

As before, we ran two versions of the variable importance: non-conditional and conditional. Non-conditional permutation importance can inflate the scores of correlated predictors, whereas conditional permutation importance permutes each predictor within groups defined by its correlated covariates and therefore gives a more reliable ranking.

#### No conditional permutations

##### Model specification

```{r}
set.seed(1)
system.time(varImpSingGemVoicingTrainNoDurNoVOTNT <- varimp(cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                                                durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                                                intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                                                f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                                                h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                                                f1OffsetV1_z+h1mnh2OffsetNorm_z, 
                                                              data = gemination.train, 
                                                              control=cforest_unbiased(mtry=round(sqrt(17)),ntree=600)), conditional = F))


sort(varImpSingGemVoicingTrainNoDurNoVOTNT)

```

##### Plotting

We start by renaming the predictors with the display labels used in the figures.

```{r}
# Map the raw predictor names onto the display labels used in the figures
# (no VOT entry here, as VOT is excluded from Model C)
predLabels <- c(durationV2_z = 'V2 Dur', durationV1_z = 'V1 Dur',
                durationVOTAll_z = 'Rel Dur', durationBurst_z = 'Burst Dur',
                durationVOT_z = 'Asp Dur', durationVoicedPercC2_z = 'CD Voicing',
                durationVoicedPercVOTAll_z = 'Rel Voicing',
                intensityOffsetV1_z = 'V1 Off Int', intensityOnsetV2_z = 'V2 On Int',
                intensityOffsetVOTAll_z = 'Rel Of Int', intensityOnsetVOTAll_z = 'Rel On Int',
                f0OnsetV2_z = 'V2 On f0', f0OffsetV1_z = 'V1 Of f0',
                f1OnsetV2_z = 'V2 On F1', f1OffsetV1_z = 'V1 Of F1',
                h1mnh2OffsetNorm_z = 'V1 Of H1*-H2*', h1mnh2OnsetNormV2_z = 'V2 On H1*-H2*')
names(varImpSingGemVoicingTrainNoDurNoVOTNT) <- predLabels[names(varImpSingGemVoicingTrainNoDurNoVOTNT)]


```

After renaming, we convert the named vector to a data frame and store the predictor names in a column.

```{r}

varImpSingGemVoicingTrainNoDurNoVOTNT_DF <- varImpSingGemVoicingTrainNoDurNoVOTNT
varImpSingGemVoicingTrainNoDurNoVOTNT_DF <- as.data.frame(varImpSingGemVoicingTrainNoDurNoVOTNT_DF)
varImpSingGemVoicingTrainNoDurNoVOTNT_DF$var <- row.names(varImpSingGemVoicingTrainNoDurNoVOTNT_DF)


```

Then we use ggplot2 to create our Variable Importance figure.

```{r fig.width=6, fig.height=6}
ggVarImpSGDurNT_C <- ggplot(varImpSingGemVoicingTrainNoDurNoVOTNT_DF, aes(reorder(var,varImpSingGemVoicingTrainNoDurNoVOTNT_DF),varImpSingGemVoicingTrainNoDurNoVOTNT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") + 
  coord_flip(ylim = c(0,0.25)) + labs(subtitle = "C: Non Conditional 17 predictors") + 
  scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurNT_C
```


#### Conditional permutations

As above, we use conditional permutations and adjust the threshold.

##### Model specification

```{r}
set.seed(1)
system.time(varImpSingGemVoicingTrainNoDurNoVOTT <- varimp(cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
                                                               durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
                                                               intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
                                                               f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
                                                               h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
                                                               f1OffsetV1_z+h1mnh2OffsetNorm_z,  
                                                             data = gemination.train, 
                                                             control=cforest_unbiased(mtry=round(sqrt(17)),ntree=600)), 
                                                     conditional = T, threshold = 0.2))

sort(varImpSingGemVoicingTrainNoDurNoVOTT)


```

##### Plotting

We start by renaming the predictors with the display labels used in the figures.

```{r}
# Map the raw predictor names onto the display labels used in the figures
# (no VOT entry here, as VOT is excluded from Model C)
predLabels <- c(durationV2_z = 'V2 Dur', durationV1_z = 'V1 Dur',
                durationVOTAll_z = 'Rel Dur', durationBurst_z = 'Burst Dur',
                durationVOT_z = 'Asp Dur', durationVoicedPercC2_z = 'CD Voicing',
                durationVoicedPercVOTAll_z = 'Rel Voicing',
                intensityOffsetV1_z = 'V1 Off Int', intensityOnsetV2_z = 'V2 On Int',
                intensityOffsetVOTAll_z = 'Rel Of Int', intensityOnsetVOTAll_z = 'Rel On Int',
                f0OnsetV2_z = 'V2 On f0', f0OffsetV1_z = 'V1 Of f0',
                f1OnsetV2_z = 'V2 On F1', f1OffsetV1_z = 'V1 Of F1',
                h1mnh2OffsetNorm_z = 'V1 Of H1*-H2*', h1mnh2OnsetNormV2_z = 'V2 On H1*-H2*')
names(varImpSingGemVoicingTrainNoDurNoVOTT) <- predLabels[names(varImpSingGemVoicingTrainNoDurNoVOTT)]


```

After renaming, we convert the named vector to a data frame and store the predictor names in a column.

```{r}

varImpSingGemVoicingTrainNoDurNoVOTT_DF <- varImpSingGemVoicingTrainNoDurNoVOTT
varImpSingGemVoicingTrainNoDurNoVOTT_DF <- as.data.frame(varImpSingGemVoicingTrainNoDurNoVOTT_DF)
varImpSingGemVoicingTrainNoDurNoVOTT_DF$var <- row.names(varImpSingGemVoicingTrainNoDurNoVOTT_DF)



```

Then we use ggplot2 to create our Variable Importance figure.

```{r fig.width=6, fig.height=6}
ggVarImpSGDurT_C <- ggplot(varImpSingGemVoicingTrainNoDurNoVOTT_DF, aes(reorder(var,varImpSingGemVoicingTrainNoDurNoVOTT_DF),varImpSingGemVoicingTrainNoDurNoVOTT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") + 
  coord_flip(ylim = c(0,0.001)) + labs(y = "", x= "", subtitle = "C: Conditional 17 predictors") +
  scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurT_C
```


As can be seen from the results, the durations of the surrounding vowels, followed by the intensity at the onset of the release, voicing during the closure, and the duration of the release, are the most important predictors. Their scores are much lower than in the other two models, indicating that these are secondary correlates. Most of the remaining predictors contribute to the contrast to a lesser degree, with the last four not contributing to it at all.

# Conclusion

The results of the Random Forests are a clear indication that closure duration is the main correlate of the four-way voicing-by-gemination contrast. Model A reached an accuracy of 92.5%, an extremely high classification rate. Model B, which excludes closure duration, still yields a relatively high classification accuracy of 82.3%; the 10.2% drop between the two models is a clear indication of the role of closure duration in both voicing and gemination. Model C aimed at evaluating the role of all *secondary* predictors, excluding both closure duration and VOT. Its classification rate of 67.2% is relatively low but still indicative of a successful model: it is well above the 25% chance level expected for a four-way contrast, and hence statistically significant, though the predictive accuracy is not high enough to stand on its own. This clearly indicates that these correlates act as secondary ones, playing an enhancing role relative to the primary correlate, closure duration. 
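The above-chance claim can be checked with an exact binomial test against the 25% chance level of a four-way contrast. A minimal sketch, assuming a hypothetical test-set size of 400 tokens (in the notebook, one would substitute `nrow(gemination.test)` and the observed number of correct classifications):

```r
# Exact binomial test of classification accuracy against 4-way chance (25%).
# The test-set size below is hypothetical and used for illustration only.
n       <- 400                  # hypothetical number of test tokens
correct <- round(0.672 * n)     # Model C accuracy reported above
bt      <- binom.test(correct, n, p = 0.25, alternative = "greater")
bt$p.value                      # effectively zero: accuracy well above chance
```

Even Model C's comparatively modest 67.2% is thus far beyond what random guessing among the four categories would produce.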