This analysis accompanies the article “Al-Tamimi, J. and Khattab, G., (2018). Acoustic correlates of the voicing contrast in Lebanese Arabic singleton and geminate plosives. Invited manuscript for the special issue of Journal of Phonetics,”Marking 50 Years of Research on Voice Onset Time and the Voicing Contrast in the World’s Languages" (eds., T. Cho, G. Docherty & D. Whalen). DOI: https://doi.org/10.1016/j.wocn.2018.09.010
This notebook presents the analyses and results of the Random Forests classification. The aim was to evaluate the robustness of the results obtained for each of the 19 acoustic correlates used, (6 durational, three voicing, and 10 non-temporal including f0, F1 and harmonic differences). These acoustic correlates were used to evaluate the relationship between them and the four-way “contrast” observed between voicing and gemination, i.e., Voiced Singleton, Voiceless Singleton, Voiced Geminate and Voiceless Geminate. The previous notebook looked specifically at how each individual acoustic correlate was associated with the predictors. This was done on an individual basis see notebook. This section extends the analysis by looking at the combination of these acoustic correlates and their predictive power in discriminating between the four categories above.
We start by loading the required packages.
requiredPackages = c('dplyr','DMwR','party','pROC','psycho','ggplot2','reshape','foreach','doSNOW','parallel')
for(p in requiredPackages){
if(!require(p,character.only = TRUE)) install.packages(p)
library(p,character.only = TRUE)
}
We start reading in the data and checking the structure. The data contains some missing data. For the Random Forests to work appropriately, we replace the missing values by centrally imputed values.
ResultsFullOriginal <- read.csv("ResultsFullOriginalData.csv")
str(ResultsFullOriginal)
'data.frame': 1793 obs. of 36 variables:
$ x : int 1 2 3 4 5 6 7 8 9 10 ...
$ file_name : Factor w/ 68 levels "01_F_W_main_1.wav",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speaker : Factor w/ 20 levels "sp1","sp10","sp11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ Number : int 8 38 68 98 371 395 419 449 743 767 ...
$ wordTarget : Factor w/ 102 levels "2aabbeh","2aabeD",..: 83 9 28 52 47 38 34 90 6 27 ...
$ word : Factor w/ 331 levels "2AAbaD","2aabbeh",..: 269 31 89 160 151 126 112 281 23 88 ...
$ singGem : Factor w/ 2 levels "geminate","singleton": 2 2 2 2 2 2 2 2 2 2 ...
$ syllType : Factor w/ 2 levels "iambic","trochaic": 2 2 2 2 2 2 2 2 2 2 ...
$ vowelLength : Factor w/ 2 levels "longV1","shortV1": 2 2 2 2 2 2 2 2 2 2 ...
$ syllStruct : Factor w/ 5 levels "V1C2V2","V1CC2V2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ c.v : Factor w/ 2 levels "C2","CC2": 1 1 1 1 1 1 1 1 1 1 ...
$ phoneme : Factor w/ 13 levels "b","bb","d","D",..: 1 3 12 8 1 10 12 8 1 10 ...
$ modeArticulation4 : Factor w/ 1 level "stop": 1 1 1 1 1 1 1 1 1 1 ...
$ placeArticulation : Factor w/ 4 levels "alveolar","bilabial",..: 2 1 3 4 2 1 3 4 2 1 ...
$ voiced.unvoiced : Factor w/ 2 levels "voiced","voiceless": 1 1 2 2 1 2 2 2 1 2 ...
$ placeVoicing : Factor w/ 6 levels "alveolar voiced",..: 3 1 5 6 3 2 5 6 3 2 ...
$ durationC2 : num 60.2 87 106.9 109.7 77 ...
$ durationVoicedPercC2 : num 100 57.5 43.9 32.7 67.5 ...
$ durationVOTAll : num 19.2 11.3 21.4 22 16.4 ...
$ VOTDec : num -60.2 -87 21 22 -77 ...
$ durationBurst : num 6 8 7 11 9 11 7 12 9 7 ...
$ durationVOT : num 13 3 14 11 7 18 14 15 14 14 ...
$ durationVoicedPercVOTAll: num 100 0 0 0 41.2 ...
$ intensityOnsetVOTAll : num 52.6 56.9 52.6 55.3 53.4 ...
$ intensityOffsetVOTAll : num 64.5 63.4 59.7 57.7 64.2 ...
$ durationV2 : num 89 155.2 86.3 113.3 76.4 ...
$ f0OnsetV2 : num 247 241 245 239 244 ...
$ intensityOnsetV2 : num 67.3 66.6 61.1 60.9 64.5 ...
$ f1OnsetV2 : num 699 542 659 444 679 ...
$ h1mnh2OnsetNormV2 : num -2.08 5.31 11.84 17.52 13.02 ...
$ durationV1 : num 90.5 48.4 52 63.2 61.5 ...
$ f0OffsetV1 : num 239 239 241 238 205 ...
$ intensityOffsetV1 : num 63.8 70.4 68.7 66.4 65.9 ...
$ f1OffsetV1 : num 573 644 942 661 637 ...
$ h1mnh2OffsetNorm : num 9.123 NA 0.981 3.576 0.861 ...
summary(ResultsFullOriginal)
x file_name speaker sex
Min. : 1 01_F_W_main_1.wav: 55 sp1 : 97 F:940
1st Qu.: 449 06_F_W_main_1.wav: 55 sp10 : 97 M:853
Median : 897 02_F_W_main_1.wav: 53 sp6 : 97
Mean : 897 10_F_W_main_1.wav: 52 sp17 : 96
3rd Qu.:1345 08_F_W_main_1.wav: 51 sp9 : 96
Max. :1793 01_M_W_main_1.wav: 50 sp20 : 95
(Other) :1477 (Other):1215
Number wordTarget word singGem
Min. : 8 2adar : 20 3atab : 20 geminate :1007
1st Qu.: 9584 3aadal : 20 fatal : 20 singleton: 786
Median :19555 3atab : 20 saba2 : 20
Mean :19497 3attaal: 20 2atal : 19
3rd Qu.:29525 fatal : 20 3addad : 19
Max. :39279 Haba2 : 20 badal : 19
(Other):1673 (Other):1676
syllType vowelLength syllStruct c.v
iambic : 356 longV1 : 594 V1C2V2 :433 C2 : 786
trochaic:1437 shortV1:1199 V1CC2V2 :410 CC2:1007
V1CC2VV2:356
VV1C2V2 :353
VV1CC2V2:241
phoneme modeArticulation4 placeArticulation
dd :285 stop:1793 alveolar :625
d :242 bilabial :426
bb :235 pharyngealised:449
b :191 velar :293
kk :160
k :133
(Other):547
voiced.unvoiced placeVoicing durationC2
voiced :1059 alveolar voiced :420 Min. : 27.09
voiceless: 734 alveolar voiceless :205 1st Qu.: 78.43
bilabial :426 Median :139.81
pharyngealised voiced :213 Mean :131.23
pharyngealised voiceless:236 3rd Qu.:175.43
velar :293 Max. :315.77
durationVoicedPercC2 durationVOTAll VOTDec
Min. : 0.00 Min. : 0.712 Min. :-299.69
1st Qu.: 23.08 1st Qu.:10.478 1st Qu.:-142.10
Median : 69.12 Median :15.796 Median : -62.73
Mean : 61.60 Mean :18.600 Mean : -61.12
3rd Qu.:100.00 3rd Qu.:23.386 3rd Qu.: 19.00
Max. :100.00 Max. :89.065 Max. : 80.00
durationBurst durationVOT durationVoicedPercVOTAll
Min. : 0.500 Min. : 0.10 Min. : 0.00
1st Qu.: 4.000 1st Qu.: 5.00 1st Qu.: 0.00
Median : 6.000 Median : 9.00 Median : 0.00
Mean : 7.236 Mean :11.94 Mean : 41.02
3rd Qu.: 9.000 3rd Qu.:15.00 3rd Qu.:100.00
Max. :38.000 Max. :97.00 Max. :100.00
NA's :62 NA's :62
intensityOnsetVOTAll intensityOffsetVOTAll durationV2
Min. :41.11 Min. :42.47 Min. : 37.52
1st Qu.:54.91 1st Qu.:62.46 1st Qu.:113.30
Median :58.95 Median :65.72 Median :143.69
Mean :59.42 Mean :65.37 Mean :156.24
3rd Qu.:64.08 3rd Qu.:68.80 3rd Qu.:193.76
Max. :77.19 Max. :82.02 Max. :381.26
f0OnsetV2 intensityOnsetV2 f1OnsetV2
Min. : 80.93 Min. :45.35 Min. : 209.9
1st Qu.:116.99 1st Qu.:64.50 1st Qu.: 431.2
Median :162.49 Median :68.27 Median : 485.1
Mean :162.64 Mean :67.80 Mean : 499.6
3rd Qu.:199.27 3rd Qu.:71.83 3rd Qu.: 555.4
Max. :308.96 Max. :84.14 Max. :1082.2
h1mnh2OnsetNormV2 durationV1 f0OffsetV1
Min. :-23.341 Min. : 4.31 Min. : 82.23
1st Qu.: -7.210 1st Qu.: 57.45 1st Qu.:119.48
Median : -4.097 Median : 78.51 Median :172.18
Mean : -2.665 Mean : 94.63 Mean :171.86
3rd Qu.: 0.814 3rd Qu.:131.74 3rd Qu.:215.80
Max. : 22.130 Max. :264.37 Max. :367.78
NA's :10
intensityOffsetV1 f1OffsetV1 h1mnh2OffsetNorm
Min. :48.25 Min. : 246.6 Min. :-13.019
1st Qu.:65.71 1st Qu.: 470.5 1st Qu.: -6.394
Median :70.19 Median : 534.2 Median : -3.136
Mean :69.53 Mean : 553.6 Mean : -2.269
3rd Qu.:73.82 3rd Qu.: 622.4 3rd Qu.: 1.192
Max. :84.98 Max. :1081.3 Max. : 19.413
NA's :226
As can be seen from the summary above, the data contains some missing data. For the Random Forests to work appropriately, we replace the missing values by centrally imputed ones
# Central imputation using the package "DMwR".
ResultsFullOriginalCentral <- ResultsFullOriginal # create copy
ResultsFullOriginalCentral[, 18:36] <- centralImputation(ResultsFullOriginalCentral[, 18:36])
str(ResultsFullOriginalCentral)
'data.frame': 1793 obs. of 36 variables:
$ x : int 1 2 3 4 5 6 7 8 9 10 ...
$ file_name : Factor w/ 68 levels "01_F_W_main_1.wav",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speaker : Factor w/ 20 levels "sp1","sp10","sp11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ Number : int 8 38 68 98 371 395 419 449 743 767 ...
$ wordTarget : Factor w/ 102 levels "2aabbeh","2aabeD",..: 83 9 28 52 47 38 34 90 6 27 ...
$ word : Factor w/ 331 levels "2AAbaD","2aabbeh",..: 269 31 89 160 151 126 112 281 23 88 ...
$ singGem : Factor w/ 2 levels "geminate","singleton": 2 2 2 2 2 2 2 2 2 2 ...
$ syllType : Factor w/ 2 levels "iambic","trochaic": 2 2 2 2 2 2 2 2 2 2 ...
$ vowelLength : Factor w/ 2 levels "longV1","shortV1": 2 2 2 2 2 2 2 2 2 2 ...
$ syllStruct : Factor w/ 5 levels "V1C2V2","V1CC2V2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ c.v : Factor w/ 2 levels "C2","CC2": 1 1 1 1 1 1 1 1 1 1 ...
$ phoneme : Factor w/ 13 levels "b","bb","d","D",..: 1 3 12 8 1 10 12 8 1 10 ...
$ modeArticulation4 : Factor w/ 1 level "stop": 1 1 1 1 1 1 1 1 1 1 ...
$ placeArticulation : Factor w/ 4 levels "alveolar","bilabial",..: 2 1 3 4 2 1 3 4 2 1 ...
$ voiced.unvoiced : Factor w/ 2 levels "voiced","voiceless": 1 1 2 2 1 2 2 2 1 2 ...
$ placeVoicing : Factor w/ 6 levels "alveolar voiced",..: 3 1 5 6 3 2 5 6 3 2 ...
$ durationC2 : num 60.2 87 106.9 109.7 77 ...
$ durationVoicedPercC2 : num 100 57.5 43.9 32.7 67.5 ...
$ durationVOTAll : num 19.2 11.3 21.4 22 16.4 ...
$ VOTDec : num -60.2 -87 21 22 -77 ...
$ durationBurst : num 6 8 7 11 9 11 7 12 9 7 ...
$ durationVOT : num 13 3 14 11 7 18 14 15 14 14 ...
$ durationVoicedPercVOTAll: num 100 0 0 0 41.2 ...
$ intensityOnsetVOTAll : num 52.6 56.9 52.6 55.3 53.4 ...
$ intensityOffsetVOTAll : num 64.5 63.4 59.7 57.7 64.2 ...
$ durationV2 : num 89 155.2 86.3 113.3 76.4 ...
$ f0OnsetV2 : num 247 241 245 239 244 ...
$ intensityOnsetV2 : num 67.3 66.6 61.1 60.9 64.5 ...
$ f1OnsetV2 : num 699 542 659 444 679 ...
$ h1mnh2OnsetNormV2 : num -2.08 5.31 11.84 17.52 13.02 ...
$ durationV1 : num 90.5 48.4 52 63.2 61.5 ...
$ f0OffsetV1 : num 239 239 241 238 205 ...
$ intensityOffsetV1 : num 63.8 70.4 68.7 66.4 65.9 ...
$ f1OffsetV1 : num 573 644 942 661 637 ...
$ h1mnh2OffsetNorm : num 9.123 -3.136 0.981 3.576 0.861 ...
summary(ResultsFullOriginalCentral)
x file_name speaker sex
Min. : 1 01_F_W_main_1.wav: 55 sp1 : 97 F:940
1st Qu.: 449 06_F_W_main_1.wav: 55 sp10 : 97 M:853
Median : 897 02_F_W_main_1.wav: 53 sp6 : 97
Mean : 897 10_F_W_main_1.wav: 52 sp17 : 96
3rd Qu.:1345 08_F_W_main_1.wav: 51 sp9 : 96
Max. :1793 01_M_W_main_1.wav: 50 sp20 : 95
(Other) :1477 (Other):1215
Number wordTarget word singGem
Min. : 8 2adar : 20 3atab : 20 geminate :1007
1st Qu.: 9584 3aadal : 20 fatal : 20 singleton: 786
Median :19555 3atab : 20 saba2 : 20
Mean :19497 3attaal: 20 2atal : 19
3rd Qu.:29525 fatal : 20 3addad : 19
Max. :39279 Haba2 : 20 badal : 19
(Other):1673 (Other):1676
syllType vowelLength syllStruct c.v
iambic : 356 longV1 : 594 V1C2V2 :433 C2 : 786
trochaic:1437 shortV1:1199 V1CC2V2 :410 CC2:1007
V1CC2VV2:356
VV1C2V2 :353
VV1CC2V2:241
phoneme modeArticulation4 placeArticulation
dd :285 stop:1793 alveolar :625
d :242 bilabial :426
bb :235 pharyngealised:449
b :191 velar :293
kk :160
k :133
(Other):547
voiced.unvoiced placeVoicing durationC2
voiced :1059 alveolar voiced :420 Min. : 27.09
voiceless: 734 alveolar voiceless :205 1st Qu.: 78.43
bilabial :426 Median :139.81
pharyngealised voiced :213 Mean :131.23
pharyngealised voiceless:236 3rd Qu.:175.43
velar :293 Max. :315.77
durationVoicedPercC2 durationVOTAll VOTDec
Min. : 0.00 Min. : 0.712 Min. :-299.69
1st Qu.: 23.08 1st Qu.:10.478 1st Qu.:-142.10
Median : 69.12 Median :15.796 Median : -62.73
Mean : 61.60 Mean :18.600 Mean : -61.12
3rd Qu.:100.00 3rd Qu.:23.386 3rd Qu.: 19.00
Max. :100.00 Max. :89.065 Max. : 80.00
durationBurst durationVOT durationVoicedPercVOTAll
Min. : 0.500 Min. : 0.10 Min. : 0.00
1st Qu.: 4.000 1st Qu.: 5.00 1st Qu.: 0.00
Median : 6.000 Median : 9.00 Median : 0.00
Mean : 7.193 Mean :11.84 Mean : 41.02
3rd Qu.: 9.000 3rd Qu.:15.00 3rd Qu.:100.00
Max. :38.000 Max. :97.00 Max. :100.00
intensityOnsetVOTAll intensityOffsetVOTAll durationV2
Min. :41.11 Min. :42.47 Min. : 37.52
1st Qu.:54.91 1st Qu.:62.46 1st Qu.:113.30
Median :58.95 Median :65.72 Median :143.69
Mean :59.42 Mean :65.37 Mean :156.24
3rd Qu.:64.08 3rd Qu.:68.80 3rd Qu.:193.76
Max. :77.19 Max. :82.02 Max. :381.26
f0OnsetV2 intensityOnsetV2 f1OnsetV2
Min. : 80.93 Min. :45.35 Min. : 209.9
1st Qu.:116.99 1st Qu.:64.50 1st Qu.: 431.2
Median :162.49 Median :68.27 Median : 485.1
Mean :162.64 Mean :67.80 Mean : 499.6
3rd Qu.:199.27 3rd Qu.:71.83 3rd Qu.: 555.4
Max. :308.96 Max. :84.14 Max. :1082.2
h1mnh2OnsetNormV2 durationV1 f0OffsetV1
Min. :-23.341 Min. : 4.31 Min. : 82.23
1st Qu.: -7.177 1st Qu.: 57.45 1st Qu.:119.48
Median : -4.097 Median : 78.51 Median :172.18
Mean : -2.673 Mean : 94.63 Mean :171.86
3rd Qu.: 0.769 3rd Qu.:131.74 3rd Qu.:215.80
Max. : 22.130 Max. :264.37 Max. :367.78
intensityOffsetV1 f1OffsetV1 h1mnh2OffsetNorm
Min. :48.25 Min. : 246.6 Min. :-13.019
1st Qu.:65.71 1st Qu.: 470.5 1st Qu.: -5.900
Median :70.19 Median : 534.2 Median : -3.136
Mean :69.53 Mean : 553.6 Mean : -2.378
3rd Qu.:73.82 3rd Qu.: 622.4 3rd Qu.: 0.399
Max. :84.98 Max. :1081.3 Max. : 19.413
write.csv(ResultsFullOriginalCentral,"ResultsFullOriginalCentral.csv")
Given the differences in scales of each of the acoustic correlates, e.g., durations in milliseconds, intensity in dB, the magnitudes of the values are different. As a way to “normalise” for the data and to put all predictors on the same level, we z-score the data.
# from the package "dplyr"
all_z <- select(ResultsFullOriginalCentral, durationC2:h1mnh2OffsetNorm)
all_z <- cbind(all_z, select(ResultsFullOriginalCentral))
all_z <- apply(all_z, 2, scale)
colnames(all_z) <- paste(colnames(all_z), 'z', sep = '_')
ResultsFullOriginalCentral <- cbind(ResultsFullOriginalCentral, all_z)
summary(ResultsFullOriginalCentral)
x file_name speaker sex
Min. : 1 01_F_W_main_1.wav: 55 sp1 : 97 F:940
1st Qu.: 449 06_F_W_main_1.wav: 55 sp10 : 97 M:853
Median : 897 02_F_W_main_1.wav: 53 sp6 : 97
Mean : 897 10_F_W_main_1.wav: 52 sp17 : 96
3rd Qu.:1345 08_F_W_main_1.wav: 51 sp9 : 96
Max. :1793 01_M_W_main_1.wav: 50 sp20 : 95
(Other) :1477 (Other):1215
Number wordTarget word singGem
Min. : 8 2adar : 20 3atab : 20 geminate :1007
1st Qu.: 9584 3aadal : 20 fatal : 20 singleton: 786
Median :19555 3atab : 20 saba2 : 20
Mean :19497 3attaal: 20 2atal : 19
3rd Qu.:29525 fatal : 20 3addad : 19
Max. :39279 Haba2 : 20 badal : 19
(Other):1673 (Other):1676
syllType vowelLength syllStruct c.v
iambic : 356 longV1 : 594 V1C2V2 :433 C2 : 786
trochaic:1437 shortV1:1199 V1CC2V2 :410 CC2:1007
V1CC2VV2:356
VV1C2V2 :353
VV1CC2V2:241
phoneme modeArticulation4 placeArticulation
dd :285 stop:1793 alveolar :625
d :242 bilabial :426
bb :235 pharyngealised:449
b :191 velar :293
kk :160
k :133
(Other):547
voiced.unvoiced placeVoicing durationC2
voiced :1059 alveolar voiced :420 Min. : 27.09
voiceless: 734 alveolar voiceless :205 1st Qu.: 78.43
bilabial :426 Median :139.81
pharyngealised voiced :213 Mean :131.23
pharyngealised voiceless:236 3rd Qu.:175.43
velar :293 Max. :315.77
durationVoicedPercC2 durationVOTAll VOTDec
Min. : 0.00 Min. : 0.712 Min. :-299.69
1st Qu.: 23.08 1st Qu.:10.478 1st Qu.:-142.10
Median : 69.12 Median :15.796 Median : -62.73
Mean : 61.60 Mean :18.600 Mean : -61.12
3rd Qu.:100.00 3rd Qu.:23.386 3rd Qu.: 19.00
Max. :100.00 Max. :89.065 Max. : 80.00
durationBurst durationVOT durationVoicedPercVOTAll
Min. : 0.500 Min. : 0.10 Min. : 0.00
1st Qu.: 4.000 1st Qu.: 5.00 1st Qu.: 0.00
Median : 6.000 Median : 9.00 Median : 0.00
Mean : 7.193 Mean :11.84 Mean : 41.02
3rd Qu.: 9.000 3rd Qu.:15.00 3rd Qu.:100.00
Max. :38.000 Max. :97.00 Max. :100.00
intensityOnsetVOTAll intensityOffsetVOTAll durationV2
Min. :41.11 Min. :42.47 Min. : 37.52
1st Qu.:54.91 1st Qu.:62.46 1st Qu.:113.30
Median :58.95 Median :65.72 Median :143.69
Mean :59.42 Mean :65.37 Mean :156.24
3rd Qu.:64.08 3rd Qu.:68.80 3rd Qu.:193.76
Max. :77.19 Max. :82.02 Max. :381.26
f0OnsetV2 intensityOnsetV2 f1OnsetV2
Min. : 80.93 Min. :45.35 Min. : 209.9
1st Qu.:116.99 1st Qu.:64.50 1st Qu.: 431.2
Median :162.49 Median :68.27 Median : 485.1
Mean :162.64 Mean :67.80 Mean : 499.6
3rd Qu.:199.27 3rd Qu.:71.83 3rd Qu.: 555.4
Max. :308.96 Max. :84.14 Max. :1082.2
h1mnh2OnsetNormV2 durationV1 f0OffsetV1
Min. :-23.341 Min. : 4.31 Min. : 82.23
1st Qu.: -7.177 1st Qu.: 57.45 1st Qu.:119.48
Median : -4.097 Median : 78.51 Median :172.18
Mean : -2.673 Mean : 94.63 Mean :171.86
3rd Qu.: 0.769 3rd Qu.:131.74 3rd Qu.:215.80
Max. : 22.130 Max. :264.37 Max. :367.78
intensityOffsetV1 f1OffsetV1 h1mnh2OffsetNorm
Min. :48.25 Min. : 246.6 Min. :-13.019
1st Qu.:65.71 1st Qu.: 470.5 1st Qu.: -5.900
Median :70.19 Median : 534.2 Median : -3.136
Mean :69.53 Mean : 553.6 Mean : -2.378
3rd Qu.:73.82 3rd Qu.: 622.4 3rd Qu.: 0.399
Max. :84.98 Max. :1081.3 Max. : 19.413
durationC2_z durationVoicedPercC2_z durationVOTAll_z
Min. :-1.8939 Min. :-1.6510 Min. :-1.5024
1st Qu.:-0.9603 1st Qu.:-1.0325 1st Qu.:-0.6822
Median : 0.1561 Median : 0.2014 Median :-0.2355
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.8039 3rd Qu.: 1.0291 3rd Qu.: 0.4020
Max. : 3.3564 Max. : 1.0291 Max. : 5.9185
VOTDec_z durationBurst_z durationVOT_z
Min. :-2.88576 Min. :-1.5989 Min. :-1.1337
1st Qu.:-0.97961 1st Qu.:-0.7628 1st Qu.:-0.6605
Median :-0.01956 Median :-0.2850 Median :-0.2741
Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.96913 3rd Qu.: 0.4316 3rd Qu.: 0.3053
Max. : 1.70700 Max. : 7.3590 Max. : 8.2247
durationVoicedPercVOTAll_z intensityOnsetVOTAll_z
Min. :-0.867 Min. :-2.94874
1st Qu.:-0.867 1st Qu.:-0.72620
Median :-0.867 Median :-0.07552
Mean : 0.000 Mean : 0.00000
3rd Qu.: 1.247 3rd Qu.: 0.74937
Max. : 1.247 Max. : 2.86081
intensityOffsetVOTAll_z durationV2_z f0OnsetV2_z
Min. :-4.21154 Min. :-2.0785 Min. :-1.693791
1st Qu.:-0.53597 1st Qu.:-0.7517 1st Qu.:-0.946329
Median : 0.06384 Median :-0.2198 Median :-0.003211
Mean : 0.00000 Mean : 0.0000 Mean : 0.000000
3rd Qu.: 0.63035 3rd Qu.: 0.6571 3rd Qu.: 0.759217
Max. : 3.06307 Max. : 3.9397 Max. : 3.032946
intensityOnsetV2_z f1OnsetV2_z h1mnh2OnsetNormV2_z
Min. :-3.83386 Min. :-2.9352 Min. :-3.2122
1st Qu.:-0.56327 1st Qu.:-0.6926 1st Qu.:-0.7001
Median : 0.08125 Median :-0.1464 Median :-0.2214
Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.68836 3rd Qu.: 0.5662 3rd Qu.: 0.5349
Max. : 2.79068 Max. : 5.9039 Max. : 3.8548
durationV1_z f0OffsetV1_z intensityOffsetV1_z
Min. :-1.8270 Min. :-1.642154 Min. :-3.5036
1st Qu.:-0.7520 1st Qu.:-0.959636 1st Qu.:-0.6293
Median :-0.3262 Median : 0.005841 Median : 0.1083
Mean : 0.0000 Mean : 0.000000 Mean : 0.0000
3rd Qu.: 0.7504 3rd Qu.: 0.805067 3rd Qu.: 0.7061
Max. : 3.4333 Max. : 3.589498 Max. : 2.5419
f1OffsetV1_z h1mnh2OffsetNorm_z
Min. :-2.5378 Min. :-2.0611
1st Qu.:-0.6871 1st Qu.:-0.6822
Median :-0.1604 Median :-0.1468
Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.5685 3rd Qu.: 0.5379
Max. : 4.3619 Max. : 4.2208
str(ResultsFullOriginalCentral)
'data.frame': 1793 obs. of 55 variables:
$ x : int 1 2 3 4 5 6 7 8 9 10 ...
$ file_name : Factor w/ 68 levels "01_F_W_main_1.wav",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speaker : Factor w/ 20 levels "sp1","sp10","sp11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ Number : int 8 38 68 98 371 395 419 449 743 767 ...
$ wordTarget : Factor w/ 102 levels "2aabbeh","2aabeD",..: 83 9 28 52 47 38 34 90 6 27 ...
$ word : Factor w/ 331 levels "2AAbaD","2aabbeh",..: 269 31 89 160 151 126 112 281 23 88 ...
$ singGem : Factor w/ 2 levels "geminate","singleton": 2 2 2 2 2 2 2 2 2 2 ...
$ syllType : Factor w/ 2 levels "iambic","trochaic": 2 2 2 2 2 2 2 2 2 2 ...
$ vowelLength : Factor w/ 2 levels "longV1","shortV1": 2 2 2 2 2 2 2 2 2 2 ...
$ syllStruct : Factor w/ 5 levels "V1C2V2","V1CC2V2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ c.v : Factor w/ 2 levels "C2","CC2": 1 1 1 1 1 1 1 1 1 1 ...
$ phoneme : Factor w/ 13 levels "b","bb","d","D",..: 1 3 12 8 1 10 12 8 1 10 ...
$ modeArticulation4 : Factor w/ 1 level "stop": 1 1 1 1 1 1 1 1 1 1 ...
$ placeArticulation : Factor w/ 4 levels "alveolar","bilabial",..: 2 1 3 4 2 1 3 4 2 1 ...
$ voiced.unvoiced : Factor w/ 2 levels "voiced","voiceless": 1 1 2 2 1 2 2 2 1 2 ...
$ placeVoicing : Factor w/ 6 levels "alveolar voiced",..: 3 1 5 6 3 2 5 6 3 2 ...
$ durationC2 : num 60.2 87 106.9 109.7 77 ...
$ durationVoicedPercC2 : num 100 57.5 43.9 32.7 67.5 ...
$ durationVOTAll : num 19.2 11.3 21.4 22 16.4 ...
$ VOTDec : num -60.2 -87 21 22 -77 ...
$ durationBurst : num 6 8 7 11 9 11 7 12 9 7 ...
$ durationVOT : num 13 3 14 11 7 18 14 15 14 14 ...
$ durationVoicedPercVOTAll : num 100 0 0 0 41.2 ...
$ intensityOnsetVOTAll : num 52.6 56.9 52.6 55.3 53.4 ...
$ intensityOffsetVOTAll : num 64.5 63.4 59.7 57.7 64.2 ...
$ durationV2 : num 89 155.2 86.3 113.3 76.4 ...
$ f0OnsetV2 : num 247 241 245 239 244 ...
$ intensityOnsetV2 : num 67.3 66.6 61.1 60.9 64.5 ...
$ f1OnsetV2 : num 699 542 659 444 679 ...
$ h1mnh2OnsetNormV2 : num -2.08 5.31 11.84 17.52 13.02 ...
$ durationV1 : num 90.5 48.4 52 63.2 61.5 ...
$ f0OffsetV1 : num 239 239 241 238 205 ...
$ intensityOffsetV1 : num 63.8 70.4 68.7 66.4 65.9 ...
$ f1OffsetV1 : num 573 644 942 661 637 ...
$ h1mnh2OffsetNorm : num 9.123 -3.136 0.981 3.576 0.861 ...
$ durationC2_z : num -1.292 -0.804 -0.442 -0.391 -0.986 ...
$ durationVoicedPercC2_z : num 1.029 -0.111 -0.474 -0.774 0.159 ...
$ durationVOTAll_z : num 0.0523 -0.6124 0.239 0.2864 -0.1851 ...
$ VOTDec_z : num 0.0112 -0.3133 0.9933 1.0054 -0.1921 ...
$ durationBurst_z : num -0.285 0.1927 -0.0462 0.9093 0.4316 ...
$ durationVOT_z : num 0.112 -0.854 0.209 -0.081 -0.467 ...
$ durationVoicedPercVOTAll_z: num 1.24656 -0.86697 -0.86697 -0.86697 0.00331 ...
$ intensityOnsetVOTAll_z : num -1.107 -0.402 -1.095 -0.669 -0.974 ...
$ intensityOffsetVOTAll_z : num -0.165 -0.363 -1.046 -1.402 -0.219 ...
$ durationV2_z : num -1.1771 -0.0176 -1.2242 -0.7517 -1.3971 ...
$ f0OnsetV2_z : num 1.75 1.63 1.71 1.59 1.69 ...
$ intensityOnsetV2_z : num -0.0892 -0.2073 -1.1401 -1.1719 -0.5585 ...
$ f1OnsetV2_z : num 2.024 0.429 1.617 -0.559 1.819 ...
$ h1mnh2OnsetNormV2_z : num 0.0915 1.2402 2.2558 3.1377 2.4385 ...
$ durationV1_z : num -0.0843 -0.9348 -0.8631 -0.6355 -0.6697 ...
$ f0OffsetV1_z : num 1.223 1.233 1.258 1.22 0.614 ...
$ intensityOffsetV1_z : num -0.947 0.139 -0.145 -0.516 -0.606 ...
$ f1OffsetV1_z : num 0.16 0.75 3.21 0.886 0.693 ...
$ h1mnh2OffsetNorm_z : num 2.228 -0.147 0.651 1.153 0.627 ...
In the following analysis, we are interested in the four-way contrast observed between voicing and gemination, i.e., Voiceless Singleton, Voiced Singleton, Voiceless Geminate and Voiced Geminate.
ResultsFullOriginalCentral["SingGemVoicing"] <- NA
ResultsFullOriginalCentral["SingGemVoicing"] <- paste(ResultsFullOriginalCentral$singGem,
ResultsFullOriginalCentral$voiced.unvoiced)
ResultsFullOriginalCentral$SingGemVoicing <- as.factor(ResultsFullOriginalCentral$SingGemVoicing)
str(ResultsFullOriginalCentral)
'data.frame': 1793 obs. of 56 variables:
$ x : int 1 2 3 4 5 6 7 8 9 10 ...
$ file_name : Factor w/ 68 levels "01_F_W_main_1.wav",..: 1 1 1 1 1 1 1 1 1 1 ...
$ speaker : Factor w/ 20 levels "sp1","sp10","sp11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
$ Number : int 8 38 68 98 371 395 419 449 743 767 ...
$ wordTarget : Factor w/ 102 levels "2aabbeh","2aabeD",..: 83 9 28 52 47 38 34 90 6 27 ...
$ word : Factor w/ 331 levels "2AAbaD","2aabbeh",..: 269 31 89 160 151 126 112 281 23 88 ...
$ singGem : Factor w/ 2 levels "geminate","singleton": 2 2 2 2 2 2 2 2 2 2 ...
$ syllType : Factor w/ 2 levels "iambic","trochaic": 2 2 2 2 2 2 2 2 2 2 ...
$ vowelLength : Factor w/ 2 levels "longV1","shortV1": 2 2 2 2 2 2 2 2 2 2 ...
$ syllStruct : Factor w/ 5 levels "V1C2V2","V1CC2V2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ c.v : Factor w/ 2 levels "C2","CC2": 1 1 1 1 1 1 1 1 1 1 ...
$ phoneme : Factor w/ 13 levels "b","bb","d","D",..: 1 3 12 8 1 10 12 8 1 10 ...
$ modeArticulation4 : Factor w/ 1 level "stop": 1 1 1 1 1 1 1 1 1 1 ...
$ placeArticulation : Factor w/ 4 levels "alveolar","bilabial",..: 2 1 3 4 2 1 3 4 2 1 ...
$ voiced.unvoiced : Factor w/ 2 levels "voiced","voiceless": 1 1 2 2 1 2 2 2 1 2 ...
$ placeVoicing : Factor w/ 6 levels "alveolar voiced",..: 3 1 5 6 3 2 5 6 3 2 ...
$ durationC2 : num 60.2 87 106.9 109.7 77 ...
$ durationVoicedPercC2 : num 100 57.5 43.9 32.7 67.5 ...
$ durationVOTAll : num 19.2 11.3 21.4 22 16.4 ...
$ VOTDec : num -60.2 -87 21 22 -77 ...
$ durationBurst : num 6 8 7 11 9 11 7 12 9 7 ...
$ durationVOT : num 13 3 14 11 7 18 14 15 14 14 ...
$ durationVoicedPercVOTAll : num 100 0 0 0 41.2 ...
$ intensityOnsetVOTAll : num 52.6 56.9 52.6 55.3 53.4 ...
$ intensityOffsetVOTAll : num 64.5 63.4 59.7 57.7 64.2 ...
$ durationV2 : num 89 155.2 86.3 113.3 76.4 ...
$ f0OnsetV2 : num 247 241 245 239 244 ...
$ intensityOnsetV2 : num 67.3 66.6 61.1 60.9 64.5 ...
$ f1OnsetV2 : num 699 542 659 444 679 ...
$ h1mnh2OnsetNormV2 : num -2.08 5.31 11.84 17.52 13.02 ...
$ durationV1 : num 90.5 48.4 52 63.2 61.5 ...
$ f0OffsetV1 : num 239 239 241 238 205 ...
$ intensityOffsetV1 : num 63.8 70.4 68.7 66.4 65.9 ...
$ f1OffsetV1 : num 573 644 942 661 637 ...
$ h1mnh2OffsetNorm : num 9.123 -3.136 0.981 3.576 0.861 ...
$ durationC2_z : num -1.292 -0.804 -0.442 -0.391 -0.986 ...
$ durationVoicedPercC2_z : num 1.029 -0.111 -0.474 -0.774 0.159 ...
$ durationVOTAll_z : num 0.0523 -0.6124 0.239 0.2864 -0.1851 ...
$ VOTDec_z : num 0.0112 -0.3133 0.9933 1.0054 -0.1921 ...
$ durationBurst_z : num -0.285 0.1927 -0.0462 0.9093 0.4316 ...
$ durationVOT_z : num 0.112 -0.854 0.209 -0.081 -0.467 ...
$ durationVoicedPercVOTAll_z: num 1.24656 -0.86697 -0.86697 -0.86697 0.00331 ...
$ intensityOnsetVOTAll_z : num -1.107 -0.402 -1.095 -0.669 -0.974 ...
$ intensityOffsetVOTAll_z : num -0.165 -0.363 -1.046 -1.402 -0.219 ...
$ durationV2_z : num -1.1771 -0.0176 -1.2242 -0.7517 -1.3971 ...
$ f0OnsetV2_z : num 1.75 1.63 1.71 1.59 1.69 ...
$ intensityOnsetV2_z : num -0.0892 -0.2073 -1.1401 -1.1719 -0.5585 ...
$ f1OnsetV2_z : num 2.024 0.429 1.617 -0.559 1.819 ...
$ h1mnh2OnsetNormV2_z : num 0.0915 1.2402 2.2558 3.1377 2.4385 ...
$ durationV1_z : num -0.0843 -0.9348 -0.8631 -0.6355 -0.6697 ...
$ f0OffsetV1_z : num 1.223 1.233 1.258 1.22 0.614 ...
$ intensityOffsetV1_z : num -0.947 0.139 -0.145 -0.516 -0.606 ...
$ f1OffsetV1_z : num 0.16 0.75 3.21 0.886 0.693 ...
$ h1mnh2OffsetNorm_z : num 2.228 -0.147 0.651 1.153 0.627 ...
$ SingGemVoicing : Factor w/ 4 levels "geminate voiced",..: 3 3 4 4 3 4 4 4 3 4 ...
After all this pre-processing, we are now ready to start our Random Forest analysis.
We create a training and a testing sets (66% and 33% of the data respectively). The aim is to train three random forests and using the testing set, we are evaluating the prediction power of each.
# creating training and testing sets
# we set the seeds for reproducibility
set.seed(1)
train.idx <- sample(nrow(ResultsFullOriginalCentral), 2/3 * nrow(ResultsFullOriginalCentral))
gemination.train <- ResultsFullOriginalCentral[train.idx, ]
gemination.test <- ResultsFullOriginalCentral[-train.idx, ]
We use a procedure we developed to obtain the optimal number of trees required to grow a forest with the highest predictive accuracy. This implements the procedure developed by “Oshiro, T. M., Perez, P. S., & Baranauskas, J. A. (2012). How many trees in a random forest? In International Workshop on Machine Learning and Data Mining in Pattern Recognition (pp. 154–168)” (see my github repo for the script).
The code below runs the analysis using parallel computing. We define the number of clusters needed and use these. We are training three Random Forests (see paper for more details):
Model A: A forest with the full 19 predictors (AUCTestWithDur)
Model B: A forest with 18 predictors (Model A minus the closure duration (AUCTestNoDur))
Model C: A forest with 17 predictors (Model B minus the VOT (AUCTestNoDurNoVOT))
## below to prepare using parallel computing
NumberOfCluster <- detectCores()
cl <- makeCluster(NumberOfCluster)
registerDoSNOW(cl)
## below computes the ROC (Receiver Operating Curve) curves for this dataframe
## for the 15 random forests with a 100 trees iteration
## and saves these in a list that will be used later to compare ROC curves
treeNumb <- 15
system.time(AUCTestWithDur <-
foreach (i=1:treeNumb,.packages=c("party","pROC")) %dopar% {
set.seed(1)
cforest.mdl=cforest(SingGemVoicing~durationC2_z+durationVoicedPercC2_z+durationVOTAll_z+
VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
f1OffsetV1_z+h1mnh2OffsetNorm_z, data = gemination.train,
controls = cforest_unbiased(mtry = round(sqrt(19)), ntree = i*100))
cforest.cond = predict(cforest.mdl,OOB=TRUE)
roc <- roc(gemination.train$SingGemVoicing,as.numeric(cforest.cond))
})
user system elapsed
0.03 0.03 54.53
system.time(AUCTestNoDur <-
foreach (i=1:treeNumb,.packages=c("party","pROC")) %dopar% {
set.seed(1)
cforest.mdl=cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
f1OffsetV1_z+h1mnh2OffsetNorm_z, data = gemination.train,
controls = cforest_unbiased(mtry = round(sqrt(18)), ntree = i*100))
cforest.cond = predict(cforest.mdl,OOB=TRUE)
roc <- roc(gemination.train$SingGemVoicing,as.numeric(cforest.cond))
})
user system elapsed
0.03 0.01 52.97
system.time(AUCTestNoDurNoVOTDec <-
foreach (i=1:treeNumb,.packages=c("party","pROC")) %dopar% {
set.seed(1)
cforest.mdl=cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
f1OffsetV1_z+h1mnh2OffsetNorm_z, data = gemination.train,
controls = cforest_unbiased(mtry = round(sqrt(17)), ntree = i*100))
cforest.cond = predict(cforest.mdl,OOB=TRUE)
roc <- roc(gemination.train$SingGemVoicing,as.numeric(cforest.cond))
})
user system elapsed
0.08 0.02 62.23
# stop cluster. If not done R will still use all cores as defined above and the system will become too slow!!
stopCluster(cl)
#
# check significant ROC comparison
# with duration
roc.test(AUCTestWithDur[[1]], AUCTestWithDur[[2]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[1]] and AUCTestWithDur[[2]]
Z = -0.044298, p-value = 0.9647
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9059677 0.9061450
roc.test(AUCTestWithDur[[2]], AUCTestWithDur[[3]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[2]] and AUCTestWithDur[[3]]
Z = -0.75022, p-value = 0.4531
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.906145 0.908296
roc.test(AUCTestWithDur[[3]], AUCTestWithDur[[4]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[3]] and AUCTestWithDur[[4]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.908296 0.908296
roc.test(AUCTestWithDur[[4]], AUCTestWithDur[[5]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[4]] and AUCTestWithDur[[5]]
Z = -0.2101, p-value = 0.8336
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9082960 0.9087579
roc.test(AUCTestWithDur[[5]], AUCTestWithDur[[6]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[5]] and AUCTestWithDur[[6]]
Z = 2.2154, p-value = 0.02673
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9087579 0.9013018
roc.test(AUCTestWithDur[[6]], AUCTestWithDur[[7]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[6]] and AUCTestWithDur[[7]]
Z = -1, p-value = 0.3173
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9013018 0.9030702
roc.test(AUCTestWithDur[[7]], AUCTestWithDur[[8]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[7]] and AUCTestWithDur[[8]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9030702 0.9030702
roc.test(AUCTestWithDur[[8]], AUCTestWithDur[[9]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[8]] and AUCTestWithDur[[9]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9030702 0.9030702
roc.test(AUCTestWithDur[[9]], AUCTestWithDur[[10]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[9]] and AUCTestWithDur[[10]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9030702 0.9030702
roc.test(AUCTestWithDur[[10]], AUCTestWithDur[[11]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[10]] and AUCTestWithDur[[11]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9030702 0.9030702
roc.test(AUCTestWithDur[[11]], AUCTestWithDur[[12]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[11]] and AUCTestWithDur[[12]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9030702 0.9030702
roc.test(AUCTestWithDur[[12]], AUCTestWithDur[[13]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[12]] and AUCTestWithDur[[13]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9030702 0.9030702
roc.test(AUCTestWithDur[[13]], AUCTestWithDur[[14]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[13]] and AUCTestWithDur[[14]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9030702 0.9030702
roc.test(AUCTestWithDur[[14]], AUCTestWithDur[[15]])
DeLong's test for two correlated ROC curves
data: AUCTestWithDur[[14]] and AUCTestWithDur[[15]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9030702 0.9030702
# 500 trees is the best model
roc.test(AUCTestNoDur[[1]], AUCTestNoDur[[2]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[1]] and AUCTestNoDur[[2]]
Z = -1.2523, p-value = 0.2105
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8925345 0.8984509
roc.test(AUCTestNoDur[[2]], AUCTestNoDur[[3]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[2]] and AUCTestNoDur[[3]]
Z = 0.49666, p-value = 0.6194
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8984509 0.8974477
roc.test(AUCTestNoDur[[3]], AUCTestNoDur[[4]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[3]] and AUCTestNoDur[[4]]
Z = 1.2903, p-value = 0.1969
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8974477 0.8941443
roc.test(AUCTestNoDur[[4]], AUCTestNoDur[[5]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[4]] and AUCTestNoDur[[5]]
Z = -1.9652, p-value = 0.04939
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8941443 0.8992721
roc.test(AUCTestNoDur[[5]], AUCTestNoDur[[6]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[5]] and AUCTestNoDur[[6]]
Z = 1.3361, p-value = 0.1815
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8992721 0.8977324
roc.test(AUCTestNoDur[[6]], AUCTestNoDur[[7]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[6]] and AUCTestNoDur[[7]]
Z = 0.29669, p-value = 0.7667
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8977324 0.8971258
roc.test(AUCTestNoDur[[7]], AUCTestNoDur[[8]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[7]] and AUCTestNoDur[[8]]
Z = -0.40221, p-value = 0.6875
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8971258 0.8979517
roc.test(AUCTestNoDur[[8]], AUCTestNoDur[[9]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[8]] and AUCTestNoDur[[9]]
Z = 0.98902, p-value = 0.3227
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8979517 0.8977324
roc.test(AUCTestNoDur[[9]], AUCTestNoDur[[10]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[9]] and AUCTestNoDur[[10]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8977324 0.8977324
roc.test(AUCTestNoDur[[10]], AUCTestNoDur[[11]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[10]] and AUCTestNoDur[[11]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8977324 0.8977324
roc.test(AUCTestNoDur[[11]], AUCTestNoDur[[12]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[11]] and AUCTestNoDur[[12]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8977324 0.8977324
roc.test(AUCTestNoDur[[12]], AUCTestNoDur[[13]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[12]] and AUCTestNoDur[[13]]
Z = -0.99967, p-value = 0.3175
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8977324 0.8988429
roc.test(AUCTestNoDur[[13]], AUCTestNoDur[[14]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[13]] and AUCTestNoDur[[14]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8988429 0.8988429
roc.test(AUCTestNoDur[[14]], AUCTestNoDur[[15]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDur[[14]] and AUCTestNoDur[[15]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.8988429 0.8988429
## 500 trees is the best model.
roc.test(AUCTestNoDurNoVOTDec[[1]], AUCTestNoDurNoVOTDec[[2]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[1]] and AUCTestNoDurNoVOTDec[[2]]
Z = -0.86872, p-value = 0.385
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7544700 0.7623414
roc.test(AUCTestNoDurNoVOTDec[[2]], AUCTestNoDurNoVOTDec[[3]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[2]] and AUCTestNoDurNoVOTDec[[3]]
Z = 0.74265, p-value = 0.4577
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7623414 0.7581980
roc.test(AUCTestNoDurNoVOTDec[[3]], AUCTestNoDurNoVOTDec[[4]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[3]] and AUCTestNoDurNoVOTDec[[4]]
Z = -0.54725, p-value = 0.5842
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7581980 0.7612542
roc.test(AUCTestNoDurNoVOTDec[[4]], AUCTestNoDurNoVOTDec[[5]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[4]] and AUCTestNoDurNoVOTDec[[5]]
Z = 0.26264, p-value = 0.7928
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7612542 0.7599571
roc.test(AUCTestNoDurNoVOTDec[[5]], AUCTestNoDurNoVOTDec[[6]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[5]] and AUCTestNoDurNoVOTDec[[6]]
Z = -0.72598, p-value = 0.4679
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7599571 0.7637318
roc.test(AUCTestNoDurNoVOTDec[[6]], AUCTestNoDurNoVOTDec[[7]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[6]] and AUCTestNoDurNoVOTDec[[7]]
Z = 0.66791, p-value = 0.5042
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7637318 0.7618841
roc.test(AUCTestNoDurNoVOTDec[[7]], AUCTestNoDurNoVOTDec[[8]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[7]] and AUCTestNoDurNoVOTDec[[8]]
Z = 0.10062, p-value = 0.9199
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7618841 0.7615995
roc.test(AUCTestNoDurNoVOTDec[[8]], AUCTestNoDurNoVOTDec[[9]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[8]] and AUCTestNoDurNoVOTDec[[9]]
Z = 0.10783, p-value = 0.9141
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7615995 0.7613195
roc.test(AUCTestNoDurNoVOTDec[[9]], AUCTestNoDurNoVOTDec[[10]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[9]] and AUCTestNoDurNoVOTDec[[10]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7613195 0.7613195
roc.test(AUCTestNoDurNoVOTDec[[10]], AUCTestNoDurNoVOTDec[[11]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[10]] and AUCTestNoDurNoVOTDec[[11]]
Z = 1.406, p-value = 0.1597
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7613195 0.7596865
roc.test(AUCTestNoDurNoVOTDec[[11]], AUCTestNoDurNoVOTDec[[12]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[11]] and AUCTestNoDurNoVOTDec[[12]]
Z = -0.57594, p-value = 0.5647
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.7596865 0.7605030
roc.test(AUCTestNoDurNoVOTDec[[12]], AUCTestNoDurNoVOTDec[[13]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[12]] and AUCTestNoDurNoVOTDec[[13]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.760503 0.760503
roc.test(AUCTestNoDurNoVOTDec[[13]], AUCTestNoDurNoVOTDec[[14]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[13]] and AUCTestNoDurNoVOTDec[[14]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.760503 0.760503
roc.test(AUCTestNoDurNoVOTDec[[14]], AUCTestNoDurNoVOTDec[[15]])
DeLong's test for two correlated ROC curves
data: AUCTestNoDurNoVOTDec[[14]] and AUCTestNoDurNoVOTDec[[15]]
Z = 0, p-value = 1
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.760503 0.760503
# 600 trees is the best model.
After checking the models above, we have the following number trees that are needed to grow a random forest with the highest predictive accuracy.
Model A: (19 predictors) 500 trees
Model B: (18 predictors) 500 trees
Model C: (17 predictors) 600 trees
Random forests are known to be biased towards correlated data and data with multiple categories and/or cut-points. We start by checking the correlation levels in our data.
# using "psycho"
# we use parts of the data-frame that contains the z-scored variables of interest
cor <- gemination.train[c(37:55)] %>%
correlation()
summary(cor)
plot(cor)
As can be seen from the correlation plot, an important number of predictors are correlated with each other. Some with minimal correlations, while other with relatively strong correlations. We will use this information in tuning the computations of variable importance.
levels(gemination.train$SingGemVoicing)
[1] "geminate voiced" "geminate voiceless"
[3] "singleton voiced" "singleton voiceless"
gemination.train$SingGemVoicing <- factor(gemination.train$SingGemVoicing,
levels = c("singleton voiceless","singleton voiced","geminate voiceless","geminate voiced"))
levels(gemination.train$SingGemVoicing)
[1] "singleton voiceless" "singleton voiced"
[3] "geminate voiceless" "geminate voiced"
levels(gemination.test$SingGemVoicing)
[1] "geminate voiced" "geminate voiceless"
[3] "singleton voiced" "singleton voiceless"
gemination.test$SingGemVoicing <- factor(gemination.test$SingGemVoicing,
levels = c("singleton voiceless","singleton voiced","geminate voiceless","geminate voiced"))
levels(gemination.test$SingGemVoicing)
[1] "singleton voiceless" "singleton voiced"
[3] "geminate voiceless" "geminate voiced"
set.seed(1)
cforestGemVoicingTrain <- cforest(SingGemVoicing~durationC2_z+durationVoicedPercC2_z+durationVOTAll_z+
VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
f1OffsetV1_z+h1mnh2OffsetNorm_z,
data = gemination.train,
control=cforest_unbiased(mtry=round(sqrt(19)),ntree=500))
# we use predict to obtain predictions from the training set on the testing set
cforestGemVoicingTest <- predict(cforestGemVoicingTrain, newdata = gemination.test, OOB=TRUE, type = "response")
# Below is the predictions table showing the confusions
# Table predictions (rows) by original data (columns)
round(prop.table(table(cforestGemVoicingTest,gemination.test$SingGemVoicing),2)*100,1)
cforestGemVoicingTest singleton voiceless singleton voiced
singleton voiceless 90.5 8.1
singleton voiced 7.6 91.9
geminate voiceless 0.0 0.0
geminate voiced 1.9 0.0
cforestGemVoicingTest geminate voiceless geminate voiced
singleton voiceless 1.5 0.0
singleton voiced 0.0 0.0
geminate voiceless 92.5 6.1
geminate voiced 6.0 93.9
# for percent correct
sum(gemination.test$SingGemVoicing==cforestGemVoicingTest)/nrow(gemination.test)
[1] 0.9247492
# for correlations between our predictions and original data-frame
cor(as.numeric(cforestGemVoicingTest), as.numeric(gemination.test$SingGemVoicing))
[1] 0.9552215
We are interested in the importance of our predictors and how they are informative (or not) with respect to our model.
We run two versions. One with no conditional permutations, i.e., with no appropriate corrections for the correlations in our data and one with appropriate corrections
set.seed(1)
system.time(varImpSingGemVoicingTrainNT <- varimp(cforest(SingGemVoicing~durationC2_z+durationVoicedPercC2_z+durationVOTAll_z+
VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
f1OffsetV1_z+h1mnh2OffsetNorm_z,
data = gemination.train,
control=cforest_unbiased(mtry=round(sqrt(19)),ntree=500)), conditional = F))
user system elapsed
18.19 0.00 18.24
sort(varImpSingGemVoicingTrainNT)
h1mnh2OffsetNorm_z h1mnh2OnsetNormV2_z
-8.200456e-05 -4.555809e-06
intensityOffsetVOTAll_z f1OffsetV1_z
6.879271e-04 7.152620e-04
intensityOnsetV2_z f1OnsetV2_z
7.699317e-04 1.011390e-03
f0OnsetV2_z intensityOffsetV1_z
1.421412e-03 1.662870e-03
f0OffsetV1_z intensityOnsetVOTAll_z
1.922551e-03 5.594533e-03
durationBurst_z durationV2_z
7.216401e-03 7.845103e-03
durationVOT_z durationV1_z
1.505239e-02 1.594077e-02
durationVOTAll_z durationVoicedPercVOTAll_z
2.502961e-02 4.869248e-02
durationVoicedPercC2_z VOTDec_z
1.141959e-01 1.426834e-01
durationC2_z
2.539180e-01
We start by changing names of predictors
# We change names
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationC2_z'] <- 'Closure Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'VOTDec_z'] <- 'VOT'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainNT)[names(varImpSingGemVoicingTrainNT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'
After changing names, we append to a data-frame
varImpSingGemVoicingTrainNT_DF <- varImpSingGemVoicingTrainNT
varImpSingGemVoicingTrainNT_DF <- as.data.frame(varImpSingGemVoicingTrainNT_DF)
varImpSingGemVoicingTrainNT_DF$var <- row.names(varImpSingGemVoicingTrainNT_DF)
Then we use ggplot2 to create our Variable Importance figure
ggVarImpSGDurNT_A <- ggplot(varImpSingGemVoicingTrainNT_DF, aes(reorder(var,varImpSingGemVoicingTrainNT_DF),varImpSingGemVoicingTrainNT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") +
coord_flip(ylim = c(0,0.25)) + labs(y = "", x= "", subtitle = "A: Non Conditional 19 predictors") +
scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurNT_A
This plot can already be used to evaluate the importance of predictors. However, it is biased because of the high level of correlations.
The conditional permutations are used to correct for the correlations in our predictors (conditional = T). This is heavy on resources, and we have already tuned our model to make sure it is not resource intensive. With the update to R 3.5.0, the computations are less resource intensive. One way to reduce resources with previous versions of R is to tune the threshold. Instead of 0.2, one could increase it to 0.5 or 0.8. The threshold is used to increase/decrease number of predictors in the conditional permutations grid and is related to the correlations in the data. With 0.2, any correlation equal or above 0.2 are added to the grid; the same applies to the other thresholds. Here we kept the threshold to 0.2 and we were able to run the computations on a local machine with 4-cores (8 logical) and 32GB RAMs.
set.seed(1)
system.time(varImpSingGemVoicingTrainT <- varimp(cforest(SingGemVoicing~durationC2_z+durationVoicedPercC2_z+durationVOTAll_z+
VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
f1OffsetV1_z+h1mnh2OffsetNorm_z,
data = gemination.train,
control=cforest_unbiased(mtry=round(sqrt(19)),ntree=500)),
conditional = T, threshold = 0.2))
user system elapsed
725.25 0.21 726.94
sort(varImpSingGemVoicingTrainT)
f0OffsetV1_z durationV2_z
-4.555809e-06 -4.555809e-06
h1mnh2OnsetNormV2_z intensityOffsetV1_z
0.000000e+00 0.000000e+00
intensityOnsetV2_z h1mnh2OffsetNorm_z
9.111617e-06 9.111617e-06
f1OffsetV1_z intensityOffsetVOTAll_z
1.366743e-05 1.366743e-05
f0OnsetV2_z f1OnsetV2_z
1.366743e-05 2.277904e-05
durationVoicedPercVOTAll_z durationBurst_z
5.466970e-05 6.833713e-05
intensityOnsetVOTAll_z durationVOT_z
7.289294e-05 8.200456e-05
VOTDec_z durationVOTAll_z
1.047836e-04 1.184510e-04
durationV1_z durationVoicedPercC2_z
2.505695e-04 7.744875e-04
durationC2_z
1.337585e-02
We start by changing names of predictors
# We change names
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationC2_z'] <- 'Closure Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'VOTDec_z'] <- 'VOT'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainT)[names(varImpSingGemVoicingTrainT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'
After changing names, we append to a data-frame
varImpSingGemVoicingTrainT_DF <- varImpSingGemVoicingTrainT
varImpSingGemVoicingTrainT_DF <- as.data.frame(varImpSingGemVoicingTrainT_DF)
varImpSingGemVoicingTrainT_DF$var <- row.names(varImpSingGemVoicingTrainT_DF)
Then we use ggplot2 to create our Variable Importance figure
ggVarImpSGDurT_A <- ggplot(varImpSingGemVoicingTrainT_DF, aes(reorder(var,varImpSingGemVoicingTrainT_DF),varImpSingGemVoicingTrainT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") +
coord_flip(ylim = c(0,0.015)) + labs(y = "", x= "", subtitle = "A: Conditional 19 predictors")+
scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurT_A
As can be seen from the results, the closure duration is the most important predictor, followed by voicing in the closure, duration of V1 and voicing in the release. Most of the remaining predictors are contributing to the contrast to a lesser extent with the last six not contributing to it.
set.seed(1)
cforestGemVoicingTrainNoDur <- cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
f1OffsetV1_z+h1mnh2OffsetNorm_z,
data = gemination.train,
control=cforest_unbiased(mtry=round(sqrt(18)),ntree=500))
cforestGemVoicingTestNoDur <- predict(cforestGemVoicingTrainNoDur, newdata = gemination.test, OOB=TRUE, type = "response")
## Table predictions (rows) by original data (columns)
round(prop.table(table(cforestGemVoicingTestNoDur,gemination.test$SingGemVoicing),2)*100,1)
cforestGemVoicingTestNoDur singleton voiceless singleton voiced
singleton voiceless 57.1 6.8
singleton voiced 8.6 91.9
geminate voiceless 32.4 0.7
geminate voiced 1.9 0.7
cforestGemVoicingTestNoDur geminate voiceless geminate voiced
singleton voiceless 15.0 3.3
singleton voiced 0.0 1.4
geminate voiceless 78.9 5.2
geminate voiced 6.0 90.1
# percent correct
sum(gemination.test$SingGemVoicing==cforestGemVoicingTestNoDur)/nrow(gemination.test)
[1] 0.8227425
#correlations
cor(as.numeric(cforestGemVoicingTestNoDur), as.numeric(gemination.test$SingGemVoicing))
[1] 0.7574884
The percent classification accuracy of Model B is around 10% lower than that of Model A. This is a clear indication that the closure duration has an important role in discriminating the four-way contrast.
As before we ran two versions of variable importance: non conditional and conditional
set.seed(1)
system.time(varImpSingGemVoicingTrainNoDurNT <- varimp(cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
f1OffsetV1_z+h1mnh2OffsetNorm_z,
data = gemination.train,
control=cforest_unbiased(mtry=round(sqrt(18)),ntree=500)), conditional = F))
user system elapsed
20.59 0.02 20.64
sort(varImpSingGemVoicingTrainNoDurNT)
h1mnh2OffsetNorm_z h1mnh2OnsetNormV2_z
6.833713e-05 4.145786e-04
f0OnsetV2_z f1OffsetV1_z
9.430524e-04 1.097950e-03
intensityOffsetVOTAll_z f0OffsetV1_z
1.102506e-03 1.389522e-03
intensityOnsetV2_z f1OnsetV2_z
2.059226e-03 2.118451e-03
intensityOffsetV1_z durationBurst_z
3.334852e-03 6.232346e-03
intensityOnsetVOTAll_z durationVOT_z
7.981777e-03 1.848747e-02
durationV2_z durationV1_z
2.284282e-02 2.370843e-02
durationVOTAll_z durationVoicedPercVOTAll_z
2.516629e-02 5.084282e-02
durationVoicedPercC2_z VOTDec_z
1.260091e-01 2.236492e-01
We start by changing names of predictors
# We change names
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'VOTDec_z'] <- 'VOT'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainNoDurNT)[names(varImpSingGemVoicingTrainNoDurNT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'
After changing names, we append to a data-frame
varImpSingGemVoicingTrainNoDurNT_DF <- varImpSingGemVoicingTrainNoDurNT
varImpSingGemVoicingTrainNoDurNT_DF <- as.data.frame(varImpSingGemVoicingTrainNoDurNT_DF)
varImpSingGemVoicingTrainNoDurNT_DF$var <- row.names(varImpSingGemVoicingTrainNoDurNT_DF)
Then we use ggplot2 to create our Variable Importance figure
ggVarImpSGDurNT_B <- ggplot(varImpSingGemVoicingTrainNoDurNT_DF, aes(reorder(var,varImpSingGemVoicingTrainNoDurNT_DF),varImpSingGemVoicingTrainNoDurNT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") +
coord_flip(ylim = c(0,0.25)) + labs(subtitle = "B: Non Conditional 18 predictors") +
scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurNT_B
As above, we use the conditional permutations and adjust the threshold.
set.seed(1)
system.time(varImpSingGemVoicingTrainNoDurT <- varimp(cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
VOTDec_z+durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
f1OffsetV1_z+h1mnh2OffsetNorm_z,
data = gemination.train,
control=cforest_unbiased(mtry=round(sqrt(18)),ntree=500)),
conditional = T, threshold = 0.2))
user system elapsed
873.86 0.09 874.93
sort(varImpSingGemVoicingTrainNoDurT)
intensityOffsetVOTAll_z f1OnsetV2_z
-1.366743e-05 -9.111617e-06
f1OffsetV1_z f0OnsetV2_z
-4.555809e-06 9.111617e-06
intensityOnsetV2_z f0OffsetV1_z
9.111617e-06 1.366743e-05
durationVOT_z h1mnh2OffsetNorm_z
1.822323e-05 2.277904e-05
durationV2_z h1mnh2OnsetNormV2_z
2.733485e-05 3.189066e-05
durationVoicedPercVOTAll_z durationVOTAll_z
3.644647e-05 4.555809e-05
durationBurst_z intensityOffsetV1_z
5.922551e-05 7.289294e-05
intensityOnsetVOTAll_z durationVoicedPercC2_z
1.093394e-04 1.776765e-04
durationV1_z VOTDec_z
2.551253e-04 7.667426e-03
We start by changing names of predictors
# We change names
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'VOTDec_z'] <- 'VOT'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainNoDurT)[names(varImpSingGemVoicingTrainNoDurT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'
After changing names, we append to a data-frame
varImpSingGemVoicingTrainNoDurT_DF <- varImpSingGemVoicingTrainNoDurT
varImpSingGemVoicingTrainNoDurT_DF <- as.data.frame(varImpSingGemVoicingTrainNoDurT_DF)
varImpSingGemVoicingTrainNoDurT_DF$var <- row.names(varImpSingGemVoicingTrainNoDurT_DF)
Then we use ggplot2 to create our Variable Importance figure
ggVarImpSGDurT_B <- ggplot(varImpSingGemVoicingTrainNoDurT_DF, aes(reorder(var,varImpSingGemVoicingTrainNoDurT_DF),varImpSingGemVoicingTrainNoDurT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") +
coord_flip(ylim = c(0,0.01)) + labs(y = "", x= "", subtitle = "B: Conditional 18 predictors") +
scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurT_B
As can be seen from the results, the VOT is the most important predictor, followed by the duration of V1, the voicing in the closure, the intensity at the onset of the release. Most of the remaining predictors are contributing to the contrast to a lesser degree with the last three not contributing to it.
set.seed(1)
cforestGemVoicingTrainNoDurNoVOT <- cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
f1OffsetV1_z+h1mnh2OffsetNorm_z,
data = gemination.train,
control=cforest_unbiased(mtry=round(sqrt(17)),ntree=600))
cforestGemVoicingTestNoDurNoVOT <- predict(cforestGemVoicingTrainNoDurNoVOT, newdata = gemination.test, OOB=TRUE, type = "response")
## Table predictions (rows) by original data (columns)
round(prop.table(table(cforestGemVoicingTestNoDurNoVOT,gemination.test$SingGemVoicing),2)*100,1)
cforestGemVoicingTestNoDurNoVOT singleton voiceless
singleton voiceless 51.4
singleton voiced 6.7
geminate voiceless 32.4
geminate voiced 9.5
cforestGemVoicingTestNoDurNoVOT singleton voiced geminate voiceless
singleton voiceless 4.7 16.5
singleton voiced 67.6 0.0
geminate voiceless 1.4 78.2
geminate voiced 26.4 5.3
cforestGemVoicingTestNoDurNoVOT geminate voiced
singleton voiceless 4.7
singleton voiced 23.1
geminate voiceless 4.2
geminate voiced 67.9
# percent correct
sum(gemination.test$SingGemVoicing==cforestGemVoicingTestNoDurNoVOT)/nrow(gemination.test)
[1] 0.6722408
# correlation
cor(as.numeric(cforestGemVoicingTestNoDurNoVOT), as.numeric(gemination.test$SingGemVoicing))
[1] 0.4525515
As can be seen from the above, the percent correct is much lower that the two other models and the levels of correlations as well. This indicates that these secondary correlates are not providing a clear discrimination of the data.
As before we ran two versions of variable importance: non conditional and conditional
set.seed(1)
system.time(varImpSingGemVoicingTrainNoDurNoVOTNT <- varimp(cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
f1OffsetV1_z+h1mnh2OffsetNorm_z,
data = gemination.train,
control=cforest_unbiased(mtry=round(sqrt(17)),ntree=600)), conditional = F))
user system elapsed
25.41 0.00 25.45
sort(varImpSingGemVoicingTrainNoDurNoVOTNT)
h1mnh2OnsetNormV2_z h1mnh2OffsetNorm_z
0.0001746393 0.0003302961
f0OnsetV2_z f1OffsetV1_z
0.0015261959 0.0021753986
intensityOffsetVOTAll_z f0OffsetV1_z
0.0027372817 0.0033902809
intensityOnsetV2_z intensityOffsetV1_z
0.0036142749 0.0049544419
f1OnsetV2_z durationBurst_z
0.0062604404 0.0078170084
intensityOnsetVOTAll_z durationVOT_z
0.0167615793 0.0186560364
durationVOTAll_z durationV1_z
0.0326689446 0.0430523918
durationV2_z durationVoicedPercVOTAll_z
0.0486446469 0.0643811693
durationVoicedPercC2_z
0.1278739560
We start by changing names of predictors
# We change names
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainNoDurNoVOTNT)[names(varImpSingGemVoicingTrainNoDurNoVOTNT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'
After changing names, we append to a data-frame
varImpSingGemVoicingTrainNoDurNoVOTNT_DF <- varImpSingGemVoicingTrainNoDurNoVOTNT
varImpSingGemVoicingTrainNoDurNoVOTNT_DF <- as.data.frame(varImpSingGemVoicingTrainNoDurNoVOTNT_DF)
varImpSingGemVoicingTrainNoDurNoVOTNT_DF$var <- row.names(varImpSingGemVoicingTrainNoDurNoVOTNT_DF)
Then we use ggplot2 to create our Variable Importance figure
ggVarImpSGDurNT_C <- ggplot(varImpSingGemVoicingTrainNoDurNoVOTNT_DF, aes(reorder(var,varImpSingGemVoicingTrainNoDurNoVOTNT_DF),varImpSingGemVoicingTrainNoDurNoVOTNT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") +
coord_flip(ylim = c(0,0.25)) + labs(subtitle = "C: Non Conditional 17 predictors") +
scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurNT_C
As above, we use the conditional permutations and adjust the threshold.
set.seed(1)
system.time(varImpSingGemVoicingTrainNoDurNoVOTT <- varimp(cforest(SingGemVoicing~durationVoicedPercC2_z+durationVOTAll_z+
durationBurst_z+durationVOT_z+durationVoicedPercVOTAll_z+
intensityOnsetVOTAll_z+intensityOffsetVOTAll_z+durationV2_z+
f0OnsetV2_z+intensityOnsetV2_z+f1OnsetV2_z+
h1mnh2OnsetNormV2_z+durationV1_z+f0OffsetV1_z+intensityOffsetV1_z+
f1OffsetV1_z+h1mnh2OffsetNorm_z,
data = gemination.train,
control=cforest_unbiased(mtry=round(sqrt(17)),ntree=600)),
conditional = T, threshold = 0.2))
user system elapsed
1202.19 0.19 1203.72
sort(varImpSingGemVoicingTrainNoDurNoVOTT)
intensityOffsetVOTAll_z durationBurst_z
-1.518603e-05 -7.593014e-06
h1mnh2OffsetNorm_z f0OffsetV1_z
-7.593014e-06 -3.796507e-06
f0OnsetV2_z durationVOT_z
3.796507e-06 7.593014e-06
f1OffsetV1_z intensityOnsetV2_z
7.593014e-06 1.518603e-05
h1mnh2OnsetNormV2_z intensityOffsetV1_z
1.518603e-05 1.518603e-05
durationVoicedPercVOTAll_z f1OnsetV2_z
2.277904e-05 2.277904e-05
durationVOTAll_z durationVoicedPercC2_z
4.176158e-05 6.074412e-05
intensityOnsetVOTAll_z durationV2_z
1.366743e-04 2.467730e-04
durationV1_z
3.834472e-04
We start by changing names of predictors
# We change names
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationV2_z'] <- 'V2 Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationV1_z'] <- 'V1 Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationVOTAll_z'] <- 'Rel Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationBurst_z'] <- 'Burst Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationVOT_z'] <- 'Asp Dur'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationVoicedPercC2_z'] <- 'CD Voicing'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'durationVoicedPercVOTAll_z'] <- 'Rel Voicing'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'intensityOffsetV1_z'] <- 'V1 Off Int'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'intensityOnsetV2_z'] <- 'V2 On Int'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'intensityOffsetVOTAll_z'] <- 'Rel Of Int'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'intensityOnsetVOTAll_z'] <- 'Rel On Int'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'f0OnsetV2_z'] <- 'V2 On f0'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'f0OffsetV1_z'] <- 'V1 Of f0'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'f1OnsetV2_z'] <- 'V2 On F1'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'f1OffsetV1_z'] <- 'V1 Of F1'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'h1mnh2OffsetNorm_z'] <- 'V1 Of H1*-H2*'
names(varImpSingGemVoicingTrainNoDurNoVOTT)[names(varImpSingGemVoicingTrainNoDurNoVOTT) == 'h1mnh2OnsetNormV2_z'] <- 'V2 On H1*-H2*'
After changing names, we append to a data-frame
varImpSingGemVoicingTrainNoDurNoVOTT_DF <- varImpSingGemVoicingTrainNoDurNoVOTT
varImpSingGemVoicingTrainNoDurNoVOTT_DF <- as.data.frame(varImpSingGemVoicingTrainNoDurNoVOTT_DF)
varImpSingGemVoicingTrainNoDurNoVOTT_DF$var <- row.names(varImpSingGemVoicingTrainNoDurNoVOTT_DF)
Then we use ggplot2 to create our Variable Importance figure
ggVarImpSGDurT_C <- ggplot(varImpSingGemVoicingTrainNoDurNoVOTT_DF, aes(reorder(var,varImpSingGemVoicingTrainNoDurNoVOTT_DF),varImpSingGemVoicingTrainNoDurNoVOTT_DF)) + geom_bar(stat="identity",position="dodge",fill = "gray70") +
coord_flip(ylim = c(0,0.001)) + labs(y = "", x= "", subtitle = "C: Conditional 17 predictors") +
scale_y_continuous()+theme_set(theme_bw()) + theme(text=element_text(size=20)) + theme(legend.position="none", axis.title.x=element_blank(),axis.title.y=element_blank())
ggVarImpSGDurT_C
As can be seen from the results, the the durations of surrounding vowels followed by intensity at the onset of the release, voicing in the closure and duration of the release are the most important predictors. Their scores are much lower than in the other two models indicating these are secondary. Most of the remaining predictors are contributing to the contrast to a lesser degree with the last four not contributing to it.
The results of the Random Forests are a clear indication that the closure duration is the main correlate to the four-way voicing by gemination contrast. Model A had an accuracy of 92.5% which is an extremely high classification rate. The second model that excludes the closure duration is relatively high in classification accuracy yielding an 82.3%. A difference of 10.2% is observed between the two models. This is a clear indication of the role of closure duration in both voicing and gemination. Model C aimed at evaluating the role of all secondary predictors without using the closure duration or the VOT. A relatively low classification rate at 67.2% is indicative of a relatively successful model. It is statistically significant (rate at 25% and above), but the predictive accuracy is not high enough. This is clearly indicating that these correlates act as secondary one and have an enhancement role to the primary correlate; closure duration.