10.1 What is predictive modelling

Most research in phonetics and linguistics relies on traditional statistical techniques to evaluate group differences, within either a frequentist or a Bayesian framework. The aim in these frameworks is to assess whether variation in an outcome can be explained by differences in one or more predictors. For example, we may want to examine whether the differences observed in F2 frequency at the vowel midpoint can be explained by place of articulation, age or gender. For this we use a linear regression (with or without mixed effects), with the following formula: lm(F2 ~ place + age + gender).
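As a minimal sketch of this explanatory setup, the formula can be fitted to a small simulated data set. Everything in the data frame below is an assumption made for illustration: the name dat, the three place categories, the age range, and the size of the gender effect on F2 are invented and do not come from any real study.

```r
# Simulated data for illustration only; no value here comes from a real study.
set.seed(1)
n <- 200
dat <- data.frame(
  place  = factor(sample(c("bilabial", "alveolar", "velar"), n, replace = TRUE)),
  age    = round(runif(n, 30, 50)),
  gender = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Assumed effect: F2 is on average 200 Hz higher for female speakers.
dat$F2 <- 1500 + 200 * (dat$gender == "female") + rnorm(n, sd = 150)

# Explanatory model: F2 is the outcome; place, age and gender are predictors.
fit_lm <- lm(F2 ~ place + age + gender, data = dat)

# Coefficient table: estimate, standard error, t value and p value per term.
summary(fit_lm)$coefficients
```

Here the question is how much each predictor shifts F2. The coefficient named gendermale estimates the F2 difference for male speakers relative to the reference level, female.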

This approach explains the differences associated with the predictors, i.e., we evaluate their influence on the outcome. But let’s ask the reverse question: can the observed differences in F2 frequency at the vowel midpoint predict group membership, such as gender? Put differently: can we predict a speaker’s gender from F2 frequencies? And would such predictions hold for data we have not yet collected? This is where predictive modelling comes to the rescue.
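To make the reversed question concrete, here is a hedged sketch in which gender becomes the outcome and F2 the predictor, using a logistic regression fitted with glm(). The data are simulated under an assumed 200 Hz gender difference, and the 0.5 classification threshold is an illustrative choice rather than a recommendation.

```r
# Simulated data for illustration only; the 200 Hz difference is an assumption.
set.seed(1)
n <- 200
gender <- factor(sample(c("female", "male"), n, replace = TRUE))
F2 <- 1500 + 200 * (gender == "female") + rnorm(n, sd = 150)
dat <- data.frame(gender, F2)

# Predictive direction: gender is now the outcome and F2 the predictor.
# With a factor outcome, glm() models the probability of the second level ("male").
fit_glm <- glm(gender ~ F2, data = dat, family = binomial)

# Predicted probability of "male" for every observation.
prob_male <- predict(fit_glm, type = "response")

# Turn probabilities into class labels with an assumed 0.5 threshold.
pred <- factor(ifelse(prob_male > 0.5, "male", "female"),
               levels = levels(dat$gender))

# Proportion of speakers whose gender is predicted correctly.
accuracy <- mean(pred == dat$gender)
accuracy
```

The accuracy computed here is measured on the same data used to fit the model, so it is likely to be optimistic compared with performance on unseen data.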

Predictive modelling is a set of statistical techniques, using either traditional or machine learning approaches, that evaluate how well the predictors separate groups. Predictive models can generate predictions either for the current data or for data obtained in the future. If you build a predictive model based on pre-existing data, you can use the model’s predictions to:

Validate the results on the current set
Predict results on new ranges (e.g., if you have results for the age group 30-50, in theory, you could predict the patterns for age 20 or 70)
Predict results on new data, either on the testing set or on any new data you obtain in the future
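The three uses listed above can be sketched with a training/testing split. The simulated data, the 80/20 split, the helper classify(), and the two hypothetical speakers aged 20 and 70 are all assumptions made for illustration. Predictions outside the observed age range are extrapolations and should be read with caution.

```r
# Simulated data for illustration only; all effect sizes are assumptions.
set.seed(1)
n <- 300
dat <- data.frame(age = runif(n, 30, 50))
dat$gender <- factor(sample(c("female", "male"), n, replace = TRUE))
dat$F2 <- 1700 - 3 * dat$age - 200 * (dat$gender == "male") + rnorm(n, sd = 120)

# Assumed 80/20 split into a training set and a testing set.
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit the predictive model on the training set only.
fit <- glm(gender ~ F2 + age, data = train, family = binomial)

# Hypothetical helper: convert predicted probabilities into class labels.
classify <- function(model, newdata) {
  p <- predict(model, newdata = newdata, type = "response")
  factor(ifelse(p > 0.5, "male", "female"), levels = c("female", "male"))
}

# 1. Validate the results on the current (training) set.
acc_train <- mean(classify(fit, train) == train$gender)

# 2. Predict on new ranges: ages 20 and 70 lie outside the observed 30-50 range.
new_speakers <- data.frame(F2 = c(1450, 1650), age = c(20, 70))
p_new <- predict(fit, newdata = new_speakers, type = "response")

# 3. Predict on new data: the held-out testing set.
acc_test <- mean(classify(fit, test) == test$gender)

c(train = acc_train, test = acc_test)
p_new
```

Comparing the training and testing accuracies gives a first indication of how well the model generalises. A large drop from training to testing would suggest overfitting.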

To understand this further, we will start with a simple Generalised Linear Model predicting responses in a perception experiment on grammaticality judgement, before moving on to Decision Trees and Random Forests.