Statistics for Linguists using R - Quantitative and qualitative approaches
2025-08-29
Introduction
This book covers the basics of statistics and R programming for linguists. It is based on courses I have taught in workshops at Newcastle University, Newcastle, UK, and mainly in my undergraduate and postgraduate teaching at the Université Paris Cité, Paris, France.
It is designed to be accessible to those with little or no prior experience in statistics or programming. It applies these concepts to real-world examples from linguistics, including phonetics, syntax, and semantics, and includes exercises and solutions to help readers practise and reinforce their understanding of the material. The book is divided into four main parts.
The first part, covered in chapter 1, introduces the basics of the R programming language and the tidyverse. It covers data manipulation, data cleaning, and an introduction to descriptive statistics. We end the part with data visualisation.
The second part focuses on statistical analyses using R, covering inferential statistics and hypothesis testing. This part is composed of two chapters and aims to consolidate the reader’s understanding of fundamental statistical concepts and how they can be applied using R. In chapter 2, we cover the types of outcomes and the statistical models best suited to each, as well as the types of errors in statistical designs, including the classical Type I and Type II errors and Type M and Type S errors. We also cover statistical power and effect sizes. This part then covers correlations and three main distribution families: Gaussian (via linear models), Binomial (via GLM) and Cumulative (for ordinal or Likert-scale data). We introduce the concepts of Signal Detection Theory (SDT) in a way that allows the reader to understand the notions that link SDT and GLM models.
This part continues with chapter 3, which introduces random effects structures via mixed-effects modelling. We cover the notions of random effects, random intercepts and random slopes, as well as modelling strategies. We look at LMMs, GLMMs, CLMMs and GAMMs (for additive models).
The third part covers qualitative research methods using R. It is composed of three chapters: where to find textual corpora and how to create a corpus (chapter 4), how to manipulate a corpus, including part-of-speech tagging (chapter 5), and how to conduct statistical analyses of textual data, including Poisson regressions, network analyses and word clouds (chapter 6).
The last part of the book is composed of two chapters. Chapter 7 introduces the reader to various techniques used for dimensionality reduction, including correlation thresholds, Principal Component Analysis (PCA), cluster analysis and Multidimensional Scaling (MDS).
Chapter 8 covers the basics of machine learning, exploring issues related to regression analyses and introducing solutions in terms of decision trees and Random Forests. We look at two different implementations of Random Forests: via Conditional Inference Trees, and via the basic implementation using the ranger package, for faster computation. We end the chapter by introducing the tidymodels philosophy.
Each chapter is dedicated to a specific topic and can normally be covered in one or two sessions.
It is hoped that this book allows students to specialise in statistical analyses applied to linguistic data and to use R for their own research. It is designed as a practical guide that can be used in the classroom or for self-study, and it should help students develop the skills they need to conduct their own research and to understand the research of others.