This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
plot(cars)
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
R is one of the most widely used statistical software packages in data science (see The R Project for Statistical Computing). R gives users full control over their analyses and makes it easy to be open about how the data were analysed; it encourages transparency and reproducible research.
If you are a Windows user, download the latest version here: R version 3.5.1. If you are a Mac user, download the latest version here: R version 3.5.1. Linux versions are available here.
Using up-to-date versions of R is important, as this allows you to use the latest developments of the software. You can have a look at what is new in this latest release here. One of the major changes in versions after 3.5.0 is the speed of computations. In many cases these are invisible, but with heavy computations, e.g., (Generalised) Linear Mixed Effects using lme4, Cumulative Logit Mixed Effects using ordinal, Conditional Variable Importance in Random Forests using party, etc., you will see a difference in how fast the computations are (i.e., rather than a script running for 4-5 hours, it will finish in around 1 hour or less!). By default, R uses 1 core on your machine. You can use parallel computing with specialised packages such as foreach, doSNOW, parallel, etc., or see below.
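As a minimal sketch of the idea (using the base parallel package, which ships with R; the square-root workload is just a placeholder for a heavy computation):

```r
# Spread independent computations over multiple cores with the base
# 'parallel' package; sqrt() here stands in for an expensive function
library(parallel)

n_cores <- max(1L, detectCores() - 1L, na.rm = TRUE)  # leave one core free
cl <- makeCluster(n_cores)
res <- parLapply(cl, 1:8, function(i) sqrt(i))
stopCluster(cl)

unlist(res)
```

On a machine with several cores, each worker handles a share of the inputs; for genuinely heavy models the speed-up can be substantial.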
You can download the latest version from above and update. Or you can use the package installr and upgrade to the latest available version. If the package is not installed, use install.packages("installr"), then load it with library(installr) and run updateR() in the console (what? what's a console?). We'll come to this later on!
If you are after more speed, then use the Microsoft R Open distribution. This is an enhanced version of R that allows for multicore processing and multithreading by default. This means you will see an increase in the speed with which various analyses are computed. You can download it here. It grew out of Revolution R Open, which was bought by Microsoft. It is an open-source distribution (and I use it all the time). It is compatible with any operating system (Windows, Mac, Linux), but it was shown to have a major impact on Windows in particular… This distribution usually takes a couple of weeks to be updated after an R release, and the packages are fixed in time, i.e., any updates to packages will not occur until the next official release. This is not an issue, as package updates are not too frequent.
RStudio is one of the most widely used free and open-source integrated development environments for R. It gives the user access to various pieces of information at the same time, e.g., the Source, the Console, the Environment and the Files panes. When you open RStudio, and if you have installed R appropriately, RStudio will "talk" to R by sending it messages to execute commands.
You can set up the layout to suit your needs. I always find the following layout best for my needs:
- Source pane: the file where you write your code
- Console pane: where the actual code is run
- Environment pane: shows all variables/datasets, the history of executed code, etc.
- Files/Viewer pane: shows the files in the current folder, plots, installed packages, help files, etc.

If you click on Tools, then Global Options, then Pane Layout, you can change the order and add/remove options from the two panes below.
I use Sublime Text to run Python and Praat and to write in \(\LaTeX\). I use R Markdown in R to publish my code and write notebooks. I am in the process of writing my first article with R Markdown for fully reproducible research.
There are many development environments that can be used to "talk" to R: Tinn-R, Visual Studio, etc.
GUIs (Graphical User Interfaces) for R are available; I list a few below. However, after trying some of these, I found it much easier to code directly in R. I don't remember all the code by heart! I work around that by saving my code into a script and reusing it later in other scripts. Some of the GUIs are meant to make R look like Excel or SPSS, while others are more specialised. Here is a list of some of these GUIs…
- R Commander: runs from within R directly (using library(Rcmdr))
- Rattle: library(rattle) then rattle() to run
- Deducer: library(Deducer) after installation

You can always start by using any of the above to familiarise yourself with the code, and then move to using R fully via code. My recommendation is to start coding from the outset and search for help on how to write the specific code you are after.
Well, almost. There is one thing we need to consider: telling R where our working directory is. By default, R saves this to your Documents folder (or somewhere else). For this workshop, this is generally OK, though when working on your own data, things get more complicated.
There are two schools of thought here:
1. Create R scripts that run the analyses and save the output(s) directly to your working directory; do not save the .RData image at the end.
2. Create a project: a self-contained folder where all your scripts, figures, etc. will be automatically saved; save the .RData at the end.
I subscribe to the second, as some of the computations I run take ages to finish.
Click the menu Session -> Set Working Directory -> Choose Directory
or use setwd("path/to/directory")
(choose the location where you want to save the results)
You can also use getwd() to check the current working directory.
Look at the top right-hand corner where you can see Projects (none). You can create a new project in a new directory or based on an existing folder.
The best option is to use the menu above (under Tools) and click Install Packages, or type install.packages("package.name") in the console. Make sure Install dependencies is always ticked (when using the first option).
Use the following to load a package: library(package.name). Once the package is loaded, you can use any of its functions directly in your code. Sometimes you may need to specify that a particular function comes from a particular package; in this case, use package.name::function. We will most probably not use this today, but it is something you need to know about; otherwise undesirable results (or even errors!) may occur.
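A small illustration of the :: notation (my example; median() is just a convenient built-in to demonstrate with):

```r
# median() lives in the stats package; the :: operator calls it explicitly,
# which protects you if another loaded package defines its own median()
median(c(1, 5, 9))         # 5
stats::median(c(1, 5, 9))  # 5, but unambiguous about where it comes from
```

This matters most when two loaded packages define functions with the same name: the package loaded last "masks" the other, and :: lets you pick the one you mean.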
Under the Files pane (bottom right), click on the Packages menu and you will have access to all installed packages. Click on a package and you can see the associated help files. You can also type the following to find help: ?package.name or ??function.name, e.g.:
?stats
??MASS
Or try clicking on the function name to find details of what to specify, e.g., scroll over lmer (assuming lme4 is installed). Do a Ctrl/Cmd + left mouse click on a function to display its definition.
lme4::lmer
function (formula, data = NULL, REML = TRUE, control = lmerControl(),
start = NULL, verbose = 0L, subset, weights, na.action, offset,
contrasts = NULL, devFunOnly = FALSE, ...)
{
mc <- mcout <- match.call()
missCtrl <- missing(control)
if (!missCtrl && !inherits(control, "lmerControl")) {
if (!is.list(control))
stop("'control' is not a list; use lmerControl()")
warning("passing control as list is deprecated: please use lmerControl() instead",
immediate. = TRUE)
control <- do.call(lmerControl, control)
}
if (!is.null(list(...)[["family"]])) {
warning("calling lmer with 'family' is deprecated; please use glmer() instead")
mc[[1]] <- quote(lme4::glmer)
if (missCtrl)
mc$control <- glmerControl()
return(eval(mc, parent.frame(1L)))
}
mc$control <- control
mc[[1]] <- quote(lme4::lFormula)
lmod <- eval(mc, parent.frame(1L))
mcout$formula <- lmod$formula
lmod$formula <- NULL
devfun <- do.call(mkLmerDevfun, c(lmod, list(start = start,
verbose = verbose, control = control)))
if (devFunOnly)
return(devfun)
if (identical(control$optimizer, "none"))
stop("deprecated use of optimizer=='none'; use NULL instead")
opt <- if (length(control$optimizer) == 0) {
s <- getStart(start, environment(devfun)$lower, environment(devfun)$pp)
list(par = s, fval = devfun(s), conv = 1000, message = "no optimization")
}
else {
optimizeLmer(devfun, optimizer = control$optimizer, restart_edge = control$restart_edge,
boundary.tol = control$boundary.tol, control = control$optCtrl,
verbose = verbose, start = start, calc.derivs = control$calc.derivs,
use.last.params = control$use.last.params)
}
cc <- checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,
lbound = environment(devfun)$lower)
mkMerMod(environment(devfun), opt, lmod$reTrms, fr = lmod$fr,
mc = mcout, lme4conv = cc)
}
<bytecode: 0x00000000382f84e0>
<environment: namespace:lme4>
Install a package and search for help.
R can be used as a calculator. Try some of the following:
1 + 2
[1] 3
1+2*3
[1] 7
Well, wait a second! Were you all expecting the result to be 7? How many expected the result to be 9? Check the following:
(1+2)*3
[1] 9
1+(2*3)
[1] 7
So parentheses are important! Always use them to tell R (and any other software) the order of operations, which is:
1. Parentheses
2. Exponents
3. Multiplication and Division (from left to right)
4. Addition and Subtraction (from left to right)
There are many built-in functions in R to do some complicated mathematical calculations.
Run some of the following.
sqrt(3)
[1] 1.732051
3^2
[1] 9
log(3)
[1] 1.098612
We can also create variables (aka temporary placeholders).
x <- 2
y <- 5
b <- x*y
x
[1] 2
y
[1] 5
b
[1] 10
b+log(y)*x^2
[1] 16.43775
When you create a variable and assign to it a number (or characters), you can use it later on.
We can also create sequences of numbers
seq(1,10,2)
[1] 1 3 5 7 9
z <- 1:10
And we can do the following.. Can you explain what we have done here?
z2 <- z+1
z*z2
[1] 2 6 12 20 30 42 56 72 90 110
Up to you… Write some more complex maths here just for fun!
# Add below
Objects are related to the variables we created above, but they can also be dataframes and other things we create in R. All of these are stored in memory and shown in the Environment pane. You can check the type of an object in the list (look at "Type") or by using class(). Let's look at the variables we created so far, and create another one as well…
class(b)
[1] "numeric"
class(x)
[1] "numeric"
class(y)
[1] "numeric"
class(z)
[1] "integer"
class(z2)
[1] "numeric"
a <- "test"
class(a)
[1] "character"
When we do calculations in R, we need to make sure we use numeric/integer variables only. Try some of the below:
x+y
[1] 7
two <- "2"
x + two
Error in x + two : non-numeric argument to binary operator
Can you explain the error? We have tried to add a number to a (character) string, which is clearly impossible. To do the maths, we need to change the class using one of the following functions: as.character, as.integer, as.numeric, as.factor, e.g.:
two <- as.numeric(two)
x + two
[1] 4
We can create a vector of values to work with. We use the function c() to combine them:
numbers <- c(1,4,5,12,55,13,45,38,77,836,543)
class(numbers)
[1] "numeric"
mean(numbers)
[1] 148.0909
sd(numbers)
[1] 276.6375
median(numbers)
[1] 38
min(numbers)
[1] 1
max(numbers)
[1] 836
range(numbers)
[1] 1 836
Sometimes we may want to refer to a specific position in the vector of numbers we just created. Use the following:
numbers[2]
[1] 4
numbers[3:5]
[1] 5 12 55
numbers[-4]
[1] 1 4 5 55 13 45 38 77 836 543
numbers+numbers[6]
[1] 14 17 18 25 68 26 58 51 90 849 556
Can you explain what we have done in the last operation?
A dataframe is the most important object we will be using over and over again. It is an object that contains information in both rows and columns. In this exercise, we will create a 9-by-4 dataframe. The code below creates four variables and combines them to make a dataframe. As you can see, variables can also be characters. To create the dataframe, we use the functions as.data.frame and cbind.
word <- c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on")
freq <- c(500, 600, 7, 200, 30, 450, 130, 33, 300) # note this is completely made up!!
functionword <- c("y", "y", "n", "y", "n", "y", "n", "n", "y")
length <- c(1, 3, 4, 3, 4, 2, 6, 4, 2)
df <- as.data.frame(cbind(word,freq,functionword,length))
Environment
If you have created various variables you no longer need, you can use rm to remove them:
rm(word,freq,functionword,length)
BUT wait, did I remove these from my dataframe? Well, no. We have removed objects from the R environment, not from the actual dataframe. Let's check:
df
The code below allows you to save the dataframe and read it again. The extension .csv stands for comma-separated values. This is the best format to use, as it is simply a text file with no additional formatting.
write.csv(df,"df.csv")
dfNew <- read.csv("df.csv")
df
dfNew
The newly created object contains 5 columns rather than the 4 we initially created. This is normal: by default, R adds a column of row names reflecting the order of the rows before the dataframe was saved. You can simply delete the column or keep it as is (but be careful, as this means you need to adjust any column references that we will use later on).
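To avoid the extra column altogether, write.csv() takes a row.names argument (a standalone sketch of mine, with its own toy dataframe):

```r
# row.names = FALSE stops write.csv() from writing the row-name column,
# so the file round-trips with the same number of columns
d <- data.frame(a = 1:3, b = c("x", "y", "z"))
tmp <- tempfile(fileext = ".csv")
write.csv(d, tmp, row.names = FALSE)
ncol(read.csv(tmp))  # 2, the same as the original
```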
R allows us to read data in many formats. If you have a .txt, .sav, .xls, .xlsx, etc. file, then there are packages specifically for that (e.g., the package xlsx to read/save .xlsx files, or the package haven from the tidyverse to read/save .sav files).
You can use the built-in plugin in RStudio to import your dataset. See Import Dataset within the Environment pane.
In general, any specific formatting is kept, but sometimes variable labels associated with numbers (as in .sav files) will be lost. Hence, it is always preferable to do minimal formatting on the data. Start with a .csv file, import it into R and do the magic!
The first thing we will do is check the structure of our created dataset. We will use the originally created one (i.e., df), not the imported one (i.e., dfNew).
str(df)
'data.frame': 9 obs. of 4 variables:
$ word : Factor w/ 9 levels "a","coffee","it",..: 1 8 5 6 4 3 2 9 7
$ freq : Factor w/ 9 levels "130","200","30",..: 7 8 9 2 3 6 1 5 4
$ functionword: Factor w/ 2 levels "n","y": 2 2 1 2 1 2 1 1 2
$ length : Factor w/ 5 levels "1","2","3","4",..: 1 3 4 3 4 2 5 4 2
The function str gives us the following information:
1. How many observations (i.e., rows) and variables (i.e., columns) there are
2. The name of each variable (look at $ and what comes after it)
3. The class of each variable (with the number of levels for factors)

As we can see, the four created variables were added to the dataframe as factors. We need to change the class of the numeric variables, freq and length. Let's do that:
df$freq <- as.numeric(df$freq)
df$length <- as.numeric(df$length)
str(df)
'data.frame': 9 obs. of 4 variables:
$ word : Factor w/ 9 levels "a","coffee","it",..: 1 8 5 6 4 3 2 9 7
$ freq : num 7 8 9 2 3 6 1 5 4
$ functionword: Factor w/ 2 levels "n","y": 2 2 1 2 1 2 1 1 2
$ length : num 1 3 4 3 4 2 5 4 2
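A caveat worth flagging here (my note, not part of the original workshop code): as.numeric() on a factor returns the underlying level codes, which is why freq now reads 7 8 9 2 … rather than the original 500 600 7 200 …. To recover the actual values, convert via character first:

```r
# as.numeric() on a factor returns level indices, not the printed values
f <- factor(c(500, 600, 7))
as.numeric(f)                # 2 3 1 (the levels sort as 7, 500, 600)
as.numeric(as.character(f))  # 500 600 7, the original values
```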
As you can see from the above, we can refer to a particular variable in the dataframe by its name preceded by $. There are additional ways to do this. Let's see what we can do. Can you tell what each of the commands below does? Chat to your neighbour…
df[1]
df[,1]
[1] a the lamp not jump it coffee walk on
Levels: a coffee it jump lamp not on the walk
df[1,]
df[1,1]
[1] a
Levels: a coffee it jump lamp not on the walk
Here are the answers:
1. Refers to the full column 1
2. Refers to the first variable
3. Refers to the first row
4. Refers to the first observation in the first column
Practice a bit and use other specifications to obtain specific observations, columns or rows…
# write here
We can use the function summary to do some basic summaries:
summary(df)
word freq functionword length
a :1 Min. :1 n:4 Min. :1.000
coffee :1 1st Qu.:3 y:5 1st Qu.:2.000
it :1 Median :5 Median :3.000
jump :1 Mean :5 Mean :3.111
lamp :1 3rd Qu.:7 3rd Qu.:4.000
not :1 Max. :9 Max. :5.000
(Other):3
We can create a table with the function table
table(df$functionword,df$freq)
1 2 3 4 5 6 7 8 9
n 1 0 1 0 1 0 0 0 1
y 0 1 0 1 0 1 1 1 0
We sometimes need to create new variables and/or delete existing ones. Do you know how to do that?
Let’s look at the structure again:
str(df)
'data.frame': 9 obs. of 4 variables:
$ word : Factor w/ 9 levels "a","coffee","it",..: 1 8 5 6 4 3 2 9 7
$ freq : num 7 8 9 2 3 6 1 5 4
$ functionword: Factor w/ 2 levels "n","y": 2 2 1 2 1 2 1 1 2
$ length : num 1 3 4 3 4 2 5 4 2
We said earlier that we can refer to a specific variable using $ plus the name of the variable. Let's use this again, with a variable name that is not in the list above:
df$newVariable
NULL
What does NULL mean? The variable does not exist! Let's do something else:
df$newVariable <- NA
Ah no error messages! Let’s check the structure
str(df)
'data.frame': 9 obs. of 5 variables:
$ word : Factor w/ 9 levels "a","coffee","it",..: 1 8 5 6 4 3 2 9 7
$ freq : num 7 8 9 2 3 6 1 5 4
$ functionword: Factor w/ 2 levels "n","y": 2 2 1 2 1 2 1 1 2
$ length : num 1 3 4 3 4 2 5 4 2
$ newVariable : logi NA NA NA NA NA NA ...
So we now have five variables; the last one is named "newVariable" and filled with NA. NA is used in R to refer to missing data, or as a placeholder. We can replace these values with calculations or anything else. Let's do that:
df$newVariable <- log(df$freq)
str(df)
'data.frame': 9 obs. of 5 variables:
$ word : Factor w/ 9 levels "a","coffee","it",..: 1 8 5 6 4 3 2 9 7
$ freq : num 7 8 9 2 3 6 1 5 4
$ functionword: Factor w/ 2 levels "n","y": 2 2 1 2 1 2 1 1 2
$ length : num 1 3 4 3 4 2 5 4 2
$ newVariable : num 1.946 2.079 2.197 0.693 1.099 ...
We replaced the NAs with the log of the frequencies. Let's check that this is correct for one observation. Can you dissect the code below? What did I use to ask R to compute the log of the frequency (freq)? Remember rows and columns:
log(df[1,2])
[1] 1.94591
df[1,5]
[1] 1.94591
So they are the same values.
Now we need to change the name of the variable to reflect the computations. “newVariable” is meaningless as a name, but “logFreq” is informative.
colnames(df)[5] <- "logFreq"
str(df)
'data.frame': 9 obs. of 5 variables:
$ word : Factor w/ 9 levels "a","coffee","it",..: 1 8 5 6 4 3 2 9 7
$ freq : num 7 8 9 2 3 6 1 5 4
$ functionword: Factor w/ 2 levels "n","y": 2 2 1 2 1 2 1 1 2
$ length : num 1 3 4 3 4 2 5 4 2
$ logFreq : num 1.946 2.079 2.197 0.693 1.099 ...
As can be seen from the above, the command colnames(df)[5] <- "logFreq" allows us to change the column name in position 5 of the dataframe. If we were to change all of the column names, we could use colnames(df) <- c("col1","col2",...). As an exercise, let's do that now. Change the names of all columns:
## change column names here
Let us now create a new compound variable that we will later delete. This new variable will be the product of two numeric variables. The result is of course meaningless, but it will serve for this exercise.
df$madeUpVariable <- df$freq*df$length
str(df)
'data.frame': 9 obs. of 6 variables:
$ word : Factor w/ 9 levels "a","coffee","it",..: 1 8 5 6 4 3 2 9 7
$ freq : num 7 8 9 2 3 6 1 5 4
$ functionword : Factor w/ 2 levels "n","y": 2 2 1 2 1 2 1 1 2
$ length : num 1 3 4 3 4 2 5 4 2
$ logFreq : num 1.946 2.079 2.197 0.693 1.099 ...
$ madeUpVariable: num 7 24 36 6 12 12 5 20 8
Let us now delete this variable, given that we are not interested in it. Do you know how to do that? Think about how we referred to a variable before: we used df[colNumber]. What if we use df[-colNumber]; what would the result be?
df[-6]
This shows all columns minus the one we are not interested in. If we overwrite the variable df, assigning to it the dataframe we just created above (with the minus sign), then the column we are not interested in will be deleted.
df <- df[-6]
str(df)
'data.frame': 9 obs. of 5 variables:
$ word : Factor w/ 9 levels "a","coffee","it",..: 1 8 5 6 4 3 2 9 7
$ freq : num 7 8 9 2 3 6 1 5 4
$ functionword: Factor w/ 2 levels "n","y": 2 2 1 2 1 2 1 1 2
$ length : num 1 3 4 3 4 2 5 4 2
$ logFreq : num 1.946 2.079 2.197 0.693 1.099 ...
Let's say we want to change the names of our observations. For instance, the variable "functionword" has the levels "y" and "n". Let us change these to "yes" and "no". We first need to change the factor variable into a character vector, then change the observations, and finally transform it back to a factor:
df$functionword <- as.character(df$functionword)
df$functionword[df$functionword == "y"] <- "yes"
df$functionword[df$functionword == "n"] <- "no"
df$functionword <- as.factor(df$functionword)
str(df)
'data.frame': 9 obs. of 5 variables:
$ word : Factor w/ 9 levels "a","coffee","it",..: 1 8 5 6 4 3 2 9 7
$ freq : num 7 8 9 2 3 6 1 5 4
$ functionword: Factor w/ 2 levels "no","yes": 2 2 1 2 1 2 1 1 2
$ length : num 1 3 4 3 4 2 5 4 2
$ logFreq : num 1.946 2.079 2.197 0.693 1.099 ...
We can also check the levels of a factor and change the reference level. This is useful when doing any type of statistics or when plotting the data. We use levels and relevel (with its ref argument):
levels(df$functionword)
[1] "no" "yes"
df$functionword <-relevel(df$functionword, ref = "yes")
levels(df$functionword)
[1] "yes" "no"
We can also use the following code to change the order of the levels of a multilevel factor
levels(df$word)
[1] "a" "coffee" "it" "jump" "lamp" "not" "on"
[8] "the" "walk"
df$word <- factor(df$word, levels = c("a","coffee","jump","lamp","not","it","on","walk","the"))
levels(df$word)
[1] "a" "coffee" "jump" "lamp" "not" "it" "on"
[8] "walk" "the"
We may sometimes need to subset the dataframe and use only parts of it. We can use the function subset, or indexing with which:
df_Yes1 <- df[which(df$functionword == 'yes'),]
#or
df_Yes2 <- subset(df, functionword=="yes")
str(df_Yes1)
'data.frame': 5 obs. of 5 variables:
$ word : Factor w/ 9 levels "a","coffee","jump",..: 1 9 5 6 7
$ freq : num 7 8 2 6 4
$ functionword: Factor w/ 2 levels "yes","no": 1 1 1 1 1
$ length : num 1 3 3 2 2
$ logFreq : num 1.946 2.079 0.693 1.792 1.386
str(df_Yes2)
'data.frame': 5 obs. of 5 variables:
$ word : Factor w/ 9 levels "a","coffee","jump",..: 1 9 5 6 7
$ freq : num 7 8 2 6 4
$ functionword: Factor w/ 2 levels "yes","no": 1 1 1 1 1
$ length : num 1 3 3 2 2
$ logFreq : num 1.946 2.079 0.693 1.792 1.386
When we subset the data, the levels of a factor are kept as they are.
levels(df_Yes1$functionword)
[1] "yes" "no"
levels(df_Yes2$functionword)
[1] "yes" "no"
But only one level of our factor actually occurs in the subset…
df_Yes1$functionword
[1] yes yes yes yes yes
Levels: yes no
df_Yes2$functionword
[1] yes yes yes yes yes
Levels: yes no
By default, R keeps the levels of the factor as they are, unless we change them as follows:
df_Yes1$functionword <- factor(df_Yes1$functionword)
df_Yes2$functionword <- factor(df_Yes2$functionword)
df_Yes1$functionword
[1] yes yes yes yes yes
Levels: yes
df_Yes2$functionword
[1] yes yes yes yes yes
Levels: yes
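An equivalent, arguably more idiomatic, way to achieve the same thing (my aside, not part of the original workshop) is droplevels():

```r
# droplevels() removes factor levels that no longer occur in the data,
# matching the effect of re-running factor() on the subset
f <- factor(c("yes", "yes", "no"), levels = c("yes", "no"))
f_sub <- f[f == "yes"]
levels(f_sub)              # "yes" "no" -- the unused level is kept
levels(droplevels(f_sub))  # "yes"
```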
In most cases, we want to visualise our data. We will use the power of base R plotting. We will use one of the datasets built into R. You can check all available datasets using the following:
data()
# or below for all datasets available in all installed packages
data(package = .packages(all.available = TRUE))
datasets have been moved from package 'base' to package 'datasets'
datasets have been moved from package 'stats' to package 'datasets'
We will use the iris dataset from the built-in datasets package:
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We have a dataframe with 150 observations and 5 variables; 4 numeric and 1 factor with 3 levels.
We summarise the data to see the trends:
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
So we have a balanced dataframe (50 observations under each level of the factor Species), with no missing values (aka NA).
This is useful to get a sense of the shape of a distribution. It’s not too informative here but will be useful for bigger datasets
hist(iris$Sepal.Length)
This allows us to visualise any relationships between two numeric variables
plot(iris$Sepal.Length, iris$Petal.Length)
We can plot a boxplot. This allows us to visualise the median and quartiles, in addition to the range and any outliers… All in the same plot!
plot(iris$Species, iris$Sepal.Length)
#or
boxplot(iris$Sepal.Length ~ iris$Species) # ~ to be read as "as a function of"
We can add labels to our plots
boxplot(iris$Sepal.Length ~ iris$Species, xlab = "Species", ylab="Length", main="A boxplot of the Iris dataset")
We can also add a trend line to our plot
boxplot(iris$Sepal.Length ~ iris$Species, xlab = "Species", ylab="Length", main="A boxplot of the Iris dataset with a linear trend")
lines(iris$Species, fitted(lm(Sepal.Length~Species, data=iris)),col="blue")
We can also use the earlier plot of the two numeric variables and add a trend line:
plot(iris$Sepal.Length, iris$Petal.Length, xlab = "Sepal Length", ylab="Petal Length", main="A plot of the linear association between Sepal and Petal Length")
abline(lm(Petal.Length~Sepal.Length, data=iris),col="blue")
We can of course add any other trend lines here…
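For instance (a sketch of mine, not from the original workshop), a non-linear smoother such as lowess() can be overlaid in the same way as abline():

```r
# Overlay a locally weighted (lowess) smoother on the scatterplot
plot(iris$Sepal.Length, iris$Petal.Length,
     xlab = "Sepal Length", ylab = "Petal Length",
     main = "Sepal vs Petal Length with a lowess smoother")
lines(lowess(iris$Sepal.Length, iris$Petal.Length), col = "red")
```

Unlike the straight abline() fit, lowess() follows local curvature in the data.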
ggplot2
We can use the code below to plot the same data with ggplot2:
library(ggplot2)
ggplot(data = iris, aes(x=Species, y = Sepal.Length))+
geom_boxplot() +
labs(x="Species",y="Length",title="Boxplot and trend line",subtitle="with ggplot2") +
theme_bw() + theme(text=element_text(size=15))+
geom_smooth(aes(x = as.numeric(Species), y = Sepal.Length))
Do a couple of plots based on the datasets available
## here
We are all after running some statistics to make sense of our results. We will use the dataset iris to do some basic statistics.
We use the function cor to obtain the Pearson correlation, and cor.test to run a basic correlation test with significance testing:
cor(iris$Sepal.Length,iris$Sepal.Width,method = "pearson")
[1] -0.1175698
cor(iris$Sepal.Length,iris$Petal.Length,method = "pearson")
[1] 0.8717538
cor(iris$Sepal.Length,iris$Petal.Width,method = "pearson")
[1] 0.8179411
cor.test(iris$Sepal.Length,iris$Petal.Width)
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Petal.Width
t = 17.296, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7568971 0.8648361
sample estimates:
cor
0.8179411
We can run a basic t-test on our data, using the function t.test:
t.test(iris$Sepal.Length,iris$Petal.Width)
Welch Two Sample t-test
data: iris$Sepal.Length and iris$Petal.Width
t = 50.536, df = 295.98, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.46315 4.82485
sample estimates:
mean of x mean of y
5.843333 1.199333
We can use the function aov to run an Analysis of Variance:
summary(aov(Sepal.Length~Species, data=iris))
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.21 31.606 119.3 <2e-16 ***
Residuals 147 38.96 0.265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can use the function lm to run a linear model:
summary(lm(Sepal.Length~Species, data=iris))
Call:
lm(formula = Sepal.Length ~ Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
But wait… how is the linear model comparable to the analysis of variance we ran above? The analysis of variance above can be derived from this linear model: use anova on your linear model.
Here are the results of the initial Analysis of variance:
summary(aov(Sepal.Length~Species, data=iris))
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.21 31.606 119.3 <2e-16 ***
Residuals 147 38.96 0.265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And here are the results of the linear model with the anova function:
anova(lm(Sepal.Length~Species, data=iris))
Analysis of Variance Table
Response: Sepal.Length
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.212 31.606 119.26 < 2.2e-16 ***
Residuals 147 38.956 0.265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
They are exactly the same… The model underlying an analysis of variance is a linear model.
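To see the connection concretely (a short sketch of my own): the intercept of the linear model is the mean of the reference level, and the other coefficients are differences from it:

```r
# The lm() intercept is the mean Sepal.Length of the reference level
# (setosa); the other coefficients are differences from that mean
fit <- lm(Sepal.Length ~ Species, data = iris)
coef(fit)[["(Intercept)"]]                         # 5.006
mean(iris$Sepal.Length[iris$Species == "setosa"])  # 5.006
coef(fit)[["Speciesversicolor"]]                   # 0.93
```

So the ANOVA F-test and the linear model are two views of the same group-means model.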
This is the end of this first session. We have looked at the various R distributions and the GUIs for R, installing and using packages, then R as a calculator, with basic and more advanced calculations. We then looked at the various object types and created a dataframe from scratch. We manipulated the dataframe by creating a new variable, renaming a column, deleting one, and changing the levels of a variable. We then created some basic plots and ran some basic statistics.
This whole workshop relied on base R. Many researchers prefer to use only base R, as it is stable and the code rarely changes (well, it does change!). Others prefer using the many R packages to speed up analyses or create lovely plots. I usually use a combination of base R plots and plots created with ggplot2 or lattice.