2.7 Matrices and dataframes
2.7.1 Matrix
2.7.2 Dataframes
A dataframe is the object we will use most often. It stores information in rows (observations) and columns (variables).
2.7.2.1 Creating a dataframe from scratch
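In this exercise, we will create a dataframe with 9 rows and 4 columns. The code below creates four variables and combines them into a dataframe. As you can see, variables can also be characters.
To create the dataframe, we use the functions as.data.frame and cbind. Note that cbind builds a matrix, and a matrix can hold only one type of data, so mixing character and numeric vectors stores every column as character. We change the class of the numeric variables in a later step.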
word <- c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on")
freq <- c(500, 600, 7, 200, 30, 450, 130, 33, 300) ## note this is completely made up!!
functionword <- c("y", "y", "n", "y", "n", "y", "n", "n", "y")
length <- c(1, 3, 4, 3, 4, 2, 6, 4, 2)
df <- as.data.frame(cbind(word,freq,functionword,length))
2.7.2.2 Deleting variables from the Environment
If you have created variables you no longer need, you can use rm to remove them.
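The chunk for this step is not shown here, so the following is only a sketch of what it might look like. It rebuilds the four vectors and df so that it runs on its own, then removes the vectors with rm and checks what is left.

```r
# Rebuild the example objects so the snippet is self-contained
word <- c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on")
freq <- c(500, 600, 7, 200, 30, 450, 130, 33, 300)
functionword <- c("y", "y", "n", "y", "n", "y", "n", "n", "y")
length <- c(1, 3, 4, 3, 4, 2, 6, 4, 2)
df <- as.data.frame(cbind(word, freq, functionword, length))

# Remove the four standalone vectors from the environment
rm(word, freq, functionword, length)

# The standalone vector is gone ...
exists("freq")

# ... but the dataframe keeps its own copy of the data
df
```

rm only deletes the named objects; df is a separate object and is not affected.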
But wait, did this remove them from the dataframe? No. We removed objects from the R environment, not from the dataframe itself. Let’s check:
## word freq functionword length
## 1 a 500 y 1
## 2 the 600 y 3
## 3 lamp 7 n 4
## 4 not 200 y 3
## 5 jump 30 n 4
## 6 it 450 y 2
## 7 coffee 130 n 6
## 8 walk 33 n 4
## 9 on 300 y 2
2.7.2.4 Reading and Saving in .csv
The code below allows you to save the dataframe and read it back in. The extension .csv stands for “comma-separated values”. It is a convenient format because it is simply a text file with no additional formatting.
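The saving and reading chunk is not shown here. A minimal sketch, assuming write.csv and read.csv were used with their default settings, could look like this. The file name df.csv and the use of a temporary folder are assumptions made for illustration; dfNew is the name used later in this section.

```r
# Rebuild df so the snippet is self-contained
word <- c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on")
freq <- c(500, 600, 7, 200, 30, 450, 130, 33, 300)
functionword <- c("y", "y", "n", "y", "n", "y", "n", "n", "y")
length <- c(1, 3, 4, 3, 4, 2, 6, 4, 2)
df <- as.data.frame(cbind(word, freq, functionword, length))

# Hypothetical file location: a temporary folder, so nothing is left behind
csv_path <- file.path(tempdir(), "df.csv")

# Save the dataframe (row names are written by default)
write.csv(df, csv_path)

# Read it back into a new object
dfNew <- read.csv(csv_path)

df
dfNew
```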
## word freq functionword length
## 1 a 500 y 1
## 2 the 600 y 3
## 3 lamp 7 n 4
## 4 not 200 y 3
## 5 jump 30 n 4
## 6 it 450 y 2
## 7 coffee 130 n 6
## 8 walk 33 n 4
## 9 on 300 y 2
## X word freq functionword length
## 1 1 a 500 y 1
## 2 2 the 600 y 3
## 3 3 lamp 7 n 4
## 4 4 not 200 y 3
## 5 5 jump 30 n 4
## 6 6 it 450 y 2
## 7 7 coffee 130 n 6
## 8 8 walk 33 n 4
## 9 9 on 300 y 2
The newly created object contains 5 columns rather than the 4 we initially created. This is expected. By default, write.csv saves the row names as a first, unnamed column, and read.csv reads that column back under the name X. You can delete this column or keep it as is (but be careful: keeping it shifts the column positions, so any later reference to a column by number needs adjusting). Saving with row.names = FALSE avoids the extra column altogether.
2.7.2.5 Reading and saving other formats
R can read data in many formats. If you have a .txt, .sav, .xls, .xlsx, etc., file, there are packages dedicated to it (e.g., the package xlsx to read/save .xlsx files, or the package haven, which is part of the Tidyverse, to read/save .sav files).
You can also use the import tool built into RStudio. See Import Dataset within the Environment pane.
In general, any specific formatting is kept, but sometimes the labels associated with numeric codes (as in .sav files) will be lost. Hence, it is preferable to do minimal formatting on the data. Start with a .csv file, import it into R and do the magic!
2.7.2.6 Checking the structure
The first thing we will do is to check the structure of our created dataset. We will use the originally created one (i.e., df) and not the imported one (i.e., dfNew).
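A minimal sketch of the structure check, rebuilding df so it runs on its own. It assumes R 4.0 or later, where as.data.frame keeps text as character rather than converting it to factors. The sapply call is an addition for illustration and lists only the class of each column.

```r
# Rebuild df so the snippet is self-contained
word <- c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on")
freq <- c(500, 600, 7, 200, 30, 450, 130, 33, 300)
functionword <- c("y", "y", "n", "y", "n", "y", "n", "n", "y")
length <- c(1, 3, 4, 3, 4, 2, 6, 4, 2)
df <- as.data.frame(cbind(word, freq, functionword, length))

# Compact overview: dimensions, variable names, classes, first values
str(df)

# Just the class of each column
sapply(df, class)
```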
## 'data.frame': 9 obs. of 4 variables:
## $ word : chr "a" "the" "lamp" "not" ...
## $ freq : chr "500" "600" "7" "200" ...
## $ functionword: chr "y" "y" "n" "y" ...
## $ length : chr "1" "3" "4" "3" ...
The function str gives us the following information:
- How many observations (i.e., rows) and variables (i.e., columns) we have
- The name of each variable (look at $ and what comes after it)
- The class of each variable (with the number of levels for factors), followed by its first few values
2.7.2.7 Changing the class of a variable
As we can see, the four created variables were all added to the dataframe as character (chr), including the two that hold numbers. We need to change the class of the numeric variables: freq and length. Let’s do that:
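A minimal sketch of the conversion, assuming as.numeric is applied to each of the two columns (the original chunk is not shown here). The snippet rebuilds df so it runs on its own.

```r
# Rebuild df so the snippet is self-contained (all columns start as character)
word <- c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on")
freq <- c(500, 600, 7, 200, 30, 450, 130, 33, 300)
functionword <- c("y", "y", "n", "y", "n", "y", "n", "n", "y")
length <- c(1, 3, 4, 3, 4, 2, 6, 4, 2)
df <- as.data.frame(cbind(word, freq, functionword, length))

# Convert the two columns that hold numbers stored as text
df$freq <- as.numeric(df$freq)
df$length <- as.numeric(df$length)

# freq and length should now be listed as num
str(df)
```

If a column were a factor rather than character, as.numeric would return the internal level codes, so as.numeric(as.character(x)) is the safer pattern in that case.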
## 'data.frame': 9 obs. of 4 variables:
## $ word : chr "a" "the" "lamp" "not" ...
## $ freq : num 500 600 7 200 30 450 130 33 300
## $ functionword: chr "y" "y" "n" "y" ...
## $ length : num 1 3 4 3 4 2 6 4 2
2.7.2.8 Referring to particular variables, observations
As you can see from the above, we can refer to a particular variable in the dataframe by adding $ and its name. There are additional options to do that. Let’s see what we can do. Can you tell what each of the calls below does? Chat to your neighbour.
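The four calls are not shown here. Judging from the four outputs that follow, they were probably of the form below; this is a reconstruction, not necessarily the exact code used. The dataframe is built directly with data.frame as a stand-in for df.

```r
# Stand-in for the tutorial's df, built directly for a self-contained example
df <- data.frame(
  word = c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on"),
  freq = c(500, 600, 7, 200, 30, 450, 130, 33, 300),
  functionword = c("y", "y", "n", "y", "n", "y", "n", "n", "y"),
  length = c(1, 3, 4, 3, 4, 2, 6, 4, 2)
)

# Four ways of indexing with square brackets; compare what each one returns
df[1]
df[, 1]
df[1, ]
df[1, 1]
```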
## word
## 1 a
## 2 the
## 3 lamp
## 4 not
## 5 jump
## 6 it
## 7 coffee
## 8 walk
## 9 on
## [1] "a" "the" "lamp" "not" "jump" "it" "coffee" "walk"
## [9] "on"
## word freq functionword length
## 1 a 500 y 1
## [1] "a"
Here are the answers:
- Returns column 1 as a dataframe with a single column
- Returns the first variable as a vector
- Returns the first row (all columns)
- Returns the first observation in the first column
Practice a bit and use other specifications to obtain specific observations, columns or rows…
2.7.3 Descriptive statistics
2.7.3.1 Basic summaries, tables
We can use the function summary to do some basic summaries
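A minimal sketch, with df built directly with data.frame as a stand-in so the snippet runs on its own. For character columns summary reports only length and class; for numeric columns it reports the minimum, quartiles, median, mean and maximum.

```r
# Stand-in for the tutorial's df, built directly for a self-contained example
df <- data.frame(
  word = c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on"),
  freq = c(500, 600, 7, 200, 30, 450, 130, 33, 300),
  functionword = c("y", "y", "n", "y", "n", "y", "n", "n", "y"),
  length = c(1, 3, 4, 3, 4, 2, 6, 4, 2)
)

# Summary of every column in the dataframe
summary(df)

# Summary of a single numeric variable
freq_summary <- summary(df$freq)
freq_summary
```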
## word freq functionword length
## Length:9 Min. : 7 Length:9 Min. :1.000
## Class :character 1st Qu.: 33 Class :character 1st Qu.:2.000
## Mode :character Median :200 Mode :character Median :3.000
## Mean :250 Mean :3.222
## 3rd Qu.:450 3rd Qu.:4.000
## Max. :600 Max. :6.000
We can create a table with the function table
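The call is not shown here. The printed table has the values of functionword as rows and the values of freq as columns, so it was presumably a two-way table along these lines (a reconstruction; tab is a name introduced for illustration).

```r
# Stand-in for the tutorial's df, built directly for a self-contained example
df <- data.frame(
  word = c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on"),
  freq = c(500, 600, 7, 200, 30, 450, 130, 33, 300),
  functionword = c("y", "y", "n", "y", "n", "y", "n", "n", "y"),
  length = c(1, 3, 4, 3, 4, 2, 6, 4, 2)
)

# Cross-tabulate: rows = functionword, columns = freq
tab <- table(df$functionword, df$freq)
tab

# A one-way table simply counts each value
table(df$functionword)
```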
##
## 7 30 33 130 200 300 450 500 600
## n 1 1 1 1 0 0 0 0 0
## y 0 0 0 0 1 1 1 1 1
2.7.3.2 Basic manipulations
2.7.3.2.1 Creating variables
We sometimes need to create and/or delete variables. Do you know how to do that?
Let’s look at the structure again:
## 'data.frame': 9 obs. of 4 variables:
## $ word : chr "a" "the" "lamp" "not" ...
## $ freq : num 500 600 7 200 30 450 130 33 300
## $ functionword: chr "y" "y" "n" "y" ...
## $ length : num 1 3 4 3 4 2 6 4 2
We said earlier that we can refer to a specific variable by using $ followed by the name of the variable. Let’s use this again, this time with a variable name that is not in the list of variables above
## NULL
What does NULL mean? The variable does not exist!
Let’s do something else
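The two chunks for this step are not shown here. A sketch of what they plausibly contained: first asking for a column that does not exist, then assigning NA to that same name, which creates the column. df is built directly with data.frame as a stand-in.

```r
# Stand-in for the tutorial's df, built directly for a self-contained example
df <- data.frame(
  word = c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on"),
  freq = c(500, 600, 7, 200, 30, 450, 130, 33, 300),
  functionword = c("y", "y", "n", "y", "n", "y", "n", "n", "y"),
  length = c(1, 3, 4, 3, 4, 2, 6, 4, 2)
)

# Asking for a column that does not exist returns NULL
df$newVariable

# Assigning to the same name creates the column; NA is recycled to all rows
df$newVariable <- NA

str(df)
```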
Ah, no error messages! Let’s check the structure
## 'data.frame': 9 obs. of 5 variables:
## $ word : chr "a" "the" "lamp" "not" ...
## $ freq : num 500 600 7 200 30 450 130 33 300
## $ functionword: chr "y" "y" "n" "y" ...
## $ length : num 1 3 4 3 4 2 6 4 2
## $ newVariable : logi NA NA NA NA NA NA ...
So we now have five variables; the last one is named “newVariable” and is filled with “NA”. “NA” is used in R to mark missing data, and it can also serve as a placeholder. We can replace these values with the result of a calculation, or with anything else. Let’s do that:
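A minimal sketch of the replacement, using log (the natural logarithm) of freq, which matches the values printed in the structure. df is built directly with data.frame as a stand-in.

```r
# Stand-in for the tutorial's df, built directly for a self-contained example
df <- data.frame(
  word = c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on"),
  freq = c(500, 600, 7, 200, 30, 450, 130, 33, 300),
  functionword = c("y", "y", "n", "y", "n", "y", "n", "n", "y"),
  length = c(1, 3, 4, 3, 4, 2, 6, 4, 2)
)
df$newVariable <- NA

# Overwrite the placeholder with the natural log of the frequencies
df$newVariable <- log(df$freq)

str(df)
```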
## 'data.frame': 9 obs. of 5 variables:
## $ word : chr "a" "the" "lamp" "not" ...
## $ freq : num 500 600 7 200 30 450 130 33 300
## $ functionword: chr "y" "y" "n" "y" ...
## $ length : num 1 3 4 3 4 2 6 4 2
## $ newVariable : num 6.21 6.4 1.95 5.3 3.4 ...
We replaced “NA” with the log of the frequencies. Let’s check that this is correct for just one observation. Can you dissect the code below? What did I use to ask R to compute the log of the frequency (freq)? Remember rows and columns
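The check itself is not shown here. Given the hint about rows and columns, it probably compared the log of one cell of freq with the matching cell of the new variable, along these lines (a reconstruction).

```r
# Stand-in for the tutorial's df, built directly for a self-contained example
df <- data.frame(
  word = c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on"),
  freq = c(500, 600, 7, 200, 30, 450, 130, 33, 300),
  functionword = c("y", "y", "n", "y", "n", "y", "n", "n", "y"),
  length = c(1, 3, 4, 3, 4, 2, 6, 4, 2)
)
df$newVariable <- log(df$freq)

# Row 1, column 2 is the frequency of the first word; take its log
log(df[1, 2])

# Row 1, column 5 is the value stored in the new variable
df[1, 5]

# Compare the two values
all.equal(log(df[1, 2]), df[1, 5])
```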
## [1] 6.214608
## [1] 6.214608
So they are the same values.
2.7.3.2.2 Changing column names
Now we need to change the name of the variable to reflect the computations. “newVariable” is meaningless as a name, but “logFreq” is informative.
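A minimal sketch of the renaming with colnames, with df built directly with data.frame as a stand-in. The second form, which looks the column up by its current name instead of its position, is an alternative added for illustration.

```r
# Stand-in for the tutorial's df, built directly for a self-contained example
df <- data.frame(
  word = c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on"),
  freq = c(500, 600, 7, 200, 30, 450, 130, 33, 300),
  functionword = c("y", "y", "n", "y", "n", "y", "n", "n", "y"),
  length = c(1, 3, 4, 3, 4, 2, 6, 4, 2)
)
df$newVariable <- log(df$freq)

# Rename the column in position 5
colnames(df)[5] <- "logFreq"

# Alternative: find the column by its current name rather than its position
# colnames(df)[colnames(df) == "newVariable"] <- "logFreq"

str(df)
```

Renaming by name keeps working if columns are later added or reordered.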
## 'data.frame': 9 obs. of 5 variables:
## $ word : chr "a" "the" "lamp" "not" ...
## $ freq : num 500 600 7 200 30 450 130 33 300
## $ functionword: chr "y" "y" "n" "y" ...
## $ length : num 1 3 4 3 4 2 6 4 2
## $ logFreq : num 6.21 6.4 1.95 5.3 3.4 ...
As can be seen from the above, using the command colnames(df)[5] <- "logFreq" allows us to change the column name in position 5 of the dataframe. If we were to change all of the column names, we could use colnames(df) <- c("col1","col2",...).
2.7.3.2.3 Deleting variables
Let us now create a new compound variable that we will later delete. This new compound variable will be the product of two numeric variables. The result is meaningless of course, but will be used for this exercise.
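A minimal sketch: the values printed in the structure are freq multiplied by length, row by row. df is built directly with data.frame as a stand-in.

```r
# Stand-in for the tutorial's df, built directly for a self-contained example
df <- data.frame(
  word = c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on"),
  freq = c(500, 600, 7, 200, 30, 450, 130, 33, 300),
  functionword = c("y", "y", "n", "y", "n", "y", "n", "n", "y"),
  length = c(1, 3, 4, 3, 4, 2, 6, 4, 2)
)
df$logFreq <- log(df$freq)

# Element-wise product of two numeric columns
df$madeUpVariable <- df$freq * df$length

str(df)
```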
## 'data.frame': 9 obs. of 6 variables:
## $ word : chr "a" "the" "lamp" "not" ...
## $ freq : num 500 600 7 200 30 450 130 33 300
## $ functionword : chr "y" "y" "n" "y" ...
## $ length : num 1 3 4 3 4 2 6 4 2
## $ logFreq : num 6.21 6.4 1.95 5.3 3.4 ...
## $ madeUpVariable: num 500 1800 28 600 120 900 780 132 600
Let us now delete this variable, given that we are not interested in it. Do you know how to do that? Think about how we referred to a variable before: we used df[colNumber]. What if we use df[-colNumber]? What would be the result?
## word freq functionword length logFreq
## 1 a 500 y 1 6.214608
## 2 the 600 y 3 6.396930
## 3 lamp 7 n 4 1.945910
## 4 not 200 y 3 5.298317
## 5 jump 30 n 4 3.401197
## 6 it 450 y 2 6.109248
## 7 coffee 130 n 6 4.867534
## 8 walk 33 n 4 3.496508
## 9 on 300 y 2 5.703782
This shows all columns minus the one we are not interested in. Nothing has been deleted yet: the result was only printed. If we overwrite df by assigning to it the dataframe we just displayed (with the minus sign), then the column we are not interested in will be deleted.
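A minimal sketch of both steps, assuming the compound variable sits in column 6 as in the structure shown earlier. df is built directly with data.frame as a stand-in.

```r
# Stand-in for the tutorial's df, built directly for a self-contained example
df <- data.frame(
  word = c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on"),
  freq = c(500, 600, 7, 200, 30, 450, 130, 33, 300),
  functionword = c("y", "y", "n", "y", "n", "y", "n", "n", "y"),
  length = c(1, 3, 4, 3, 4, 2, 6, 4, 2)
)
df$logFreq <- log(df$freq)
df$madeUpVariable <- df$freq * df$length

# Printing with a negative index only displays the result; df is unchanged
df[-6]
ncol(df)

# Assigning the result back to df is what actually drops the column
df <- df[-6]
ncol(df)
```

An equivalent way to delete a column by name is df$madeUpVariable <- NULL, which avoids counting column positions.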
## 'data.frame': 9 obs. of 5 variables:
## $ word : chr "a" "the" "lamp" "not" ...
## $ freq : num 500 600 7 200 30 450 130 33 300
## $ functionword: chr "y" "y" "n" "y" ...
## $ length : num 1 3 4 3 4 2 6 4 2
## $ logFreq : num 6.21 6.4 1.95 5.3 3.4 ...
2.7.3.2.4 Changing names of observations
Let’s say that we want to change the names of our observations. For instance, the variable “functionword” has the values “y” and “n”. Let us change them to “yes” and “no”. We first make sure the variable is character (this step matters if the variable is already a factor; here it is character, so it changes nothing), then we change the observations. Finally, we convert the variable to a factor
df$functionword <- as.character(df$functionword)
df$functionword[df$functionword == "y"] <- "yes"
df$functionword[df$functionword == "n"] <- "no"
df$functionword <- as.factor(df$functionword)
str(df)
## 'data.frame': 9 obs. of 5 variables:
## $ word : chr "a" "the" "lamp" "not" ...
## $ freq : num 500 600 7 200 30 450 130 33 300
## $ functionword: Factor w/ 2 levels "no","yes": 2 2 1 2 1 2 1 1 2
## $ length : num 1 3 4 3 4 2 6 4 2
## $ logFreq : num 6.21 6.4 1.95 5.3 3.4 ...
2.7.3.2.5 Checking levels of factors
We can also check the levels of a factor and change the reference level. This is useful when doing any type of statistics or when plotting the data. We use levels to list the levels, and relevel with its argument ref to set the reference level
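A minimal sketch of the check and the change of reference level. The two outputs that follow ("no" "yes", then "yes" "no") are consistent with "yes" being chosen as the new reference. The small factor here is a stand-in for df$functionword.

```r
# Stand-in for df with functionword already recoded to "yes"/"no" as a factor
df <- data.frame(
  word = c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on"),
  functionword = factor(c("yes", "yes", "no", "yes", "no", "yes", "no", "no", "yes"))
)

# Levels are alphabetical by default, so "no" comes first
levels(df$functionword)

# Make "yes" the reference (first) level
df$functionword <- relevel(df$functionword, ref = "yes")

levels(df$functionword)
```

relevel only moves the chosen level to the front; the data values themselves do not change.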
## [1] "no" "yes"
## [1] "yes" "no"
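We can also change the order of the levels of a multilevel factor. The variable word is still character at this point, so asking for its levels returns NULL. The call to factor below converts it to a factor and sets the order of the levels explicitly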
## NULL
df$word <- factor(df$word, levels = c("a","coffee","jump","lamp","not","it","on","walk","the"))
levels(df$word)
## [1] "a" "coffee" "jump" "lamp" "not" "it" "on" "walk"
## [9] "the"
2.7.3.2.6 Subsetting the dataframe
We may sometimes need to subset the dataframe and use parts of it. We use the function subset, or row indexing with which.
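## keep the rows where functionword is "yes", using row indexing
df_Yes1 <- df[which(df$functionword == 'yes'),]
## or do the same with subset
df_Yes2 <- subset(df, functionword=="yes")
str(df_Yes1)
str(df_Yes2)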
## 'data.frame': 5 obs. of 5 variables:
## $ word : Factor w/ 9 levels "a","coffee","jump",..: 1 9 5 6 7
## $ freq : num 500 600 200 450 300
## $ functionword: Factor w/ 2 levels "yes","no": 1 1 1 1 1
## $ length : num 1 3 3 2 2
## $ logFreq : num 6.21 6.4 5.3 6.11 5.7
## 'data.frame': 5 obs. of 5 variables:
## $ word : Factor w/ 9 levels "a","coffee","jump",..: 1 9 5 6 7
## $ freq : num 500 600 200 450 300
## $ functionword: Factor w/ 2 levels "yes","no": 1 1 1 1 1
## $ length : num 1 3 3 2 2
## $ logFreq : num 6.21 6.4 5.3 6.11 5.7
When we subset the data, the levels of a factor are kept as they are.
## [1] "yes" "no"
## [1] "yes" "no"
But only one level of the factor actually occurs in the subset:
## [1] yes yes yes yes yes
## Levels: yes no
## [1] yes yes yes yes yes
## Levels: yes no
By default, R keeps all the levels of a factor when we subset, even those that no longer occur. To drop the unused levels, we recreate the factor:
## recreating the factor keeps only the levels that occur in the data
df_Yes1$functionword <- factor(df_Yes1$functionword)
df_Yes2$functionword <- factor(df_Yes2$functionword)
df_Yes1$functionword
df_Yes2$functionword
## [1] yes yes yes yes yes
## Levels: yes
## [1] yes yes yes yes yes
## Levels: yes