2.7 Matrices and dataframes

2.7.1 Matrix

2.7.1.1 General

x <- 1:4
x <- as.matrix(x)
x
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
## [4,]    4
dim(x)
## [1] 4 1
dim(x) <- c(2,2)
dim(x)
## [1] 2 2
x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
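
As a small aside, the same 2 x 2 matrix can be sketched directly with the base R function matrix. The object names m_col and m_row are made up for this illustration. The example also shows that R fills a matrix column by column unless told otherwise.

```r
# Build the 2 x 2 matrix in one step; values fill column by column by default
m_col <- matrix(1:4, nrow = 2, ncol = 2)

# byrow = TRUE fills the matrix row by row instead
m_row <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)

m_col
m_row

# t() transposes a matrix, so the two layouts are transposes of each other
identical(m_row, t(m_col))
```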

2.7.1.2 Referring to specific location

x[1,]
## [1] 1 3
x[,1]
## [1] 1 2
x[1,2] 
## [1] 3
x[,] ## = x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
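
One detail worth knowing: selecting a single row or column returns a plain vector, as the output above suggests. A minimal sketch, using the made-up names row1_vec and row1_mat, shows how the argument drop = FALSE keeps the result as a matrix.

```r
x <- matrix(1:4, nrow = 2)

row1_vec <- x[1, ]                # a plain vector: the dimensions are dropped
row1_mat <- x[1, , drop = FALSE]  # still a matrix with 1 row and 2 columns

dim(row1_vec)  # NULL, because a vector has no dimensions
dim(row1_mat)
```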

2.7.2 Dataframes

A dataframe is the object we will use most often. It stores information in both rows (observations) and columns (variables).

2.7.2.1 Creating a dataframe from scratch

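In this exercise, we will create a dataframe with 9 rows and 4 columns. The code below creates four variables and combines them to make a dataframe. As you can see, variables can also be characters. To create the dataframe, we use the functions as.data.frame and cbind. Note that cbind first builds a matrix, and a matrix can hold only one type of data, so the numeric variables are stored as characters for now.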

word <- c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on")
freq <- c(500, 600, 7, 200, 30, 450, 130, 33, 300)  ## note this is completely made up!!
functionword <- c("y", "y", "n", "y", "n", "y", "n", "n", "y")
length <- c(1, 3, 4, 3, 4, 2, 6, 4, 2)
df <- as.data.frame(cbind(word,freq,functionword,length))
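
For comparison only, here is a minimal sketch of an alternative: passing the vectors straight to data.frame keeps each variable's own type. The name df_typed is made up for this illustration and is not used in the rest of this section, which continues with df. The class shown for the text columns assumes R 4.0 or later (older versions turn them into factors by default).

```r
word <- c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on")
freq <- c(500, 600, 7, 200, 30, 450, 130, 33, 300)  # made-up frequencies
functionword <- c("y", "y", "n", "y", "n", "y", "n", "n", "y")
length <- c(1, 3, 4, 3, 4, 2, 6, 4, 2)

# cbind() on mixed vectors gives a character matrix
is.character(cbind(word, freq))

# data.frame() keeps numeric vectors numeric
df_typed <- data.frame(word, freq, functionword, length)
sapply(df_typed, class)
```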

2.7.2.2 Deleting variables from the Environment

If you have created various variables you no longer need, you can use rm to remove them:

rm(word,freq,functionword,length)

BUT wait, did I remove these from my dataframe? Well, no. We have removed objects from the R environment and not from the dataframe itself. Let’s check:

df
##     word freq functionword length
## 1      a  500            y      1
## 2    the  600            y      3
## 3   lamp    7            n      4
## 4    not  200            y      3
## 5   jump   30            n      4
## 6     it  450            y      2
## 7 coffee  130            n      6
## 8   walk   33            n      4
## 9     on  300            y      2

2.7.2.3 Saving and reading the dataframe

2.7.2.4 Reading and Saving in .csv

The code below allows you to save the dataframe and read it again. The extension .csv stands for “comma-separated values”. This is a convenient format to use as it is simply a text file with no additional formatting. Note that the folder outputs must already exist in your working directory.

write.csv(df, "outputs/df.csv")
dfNew <- read.csv("outputs/df.csv")
df
##     word freq functionword length
## 1      a  500            y      1
## 2    the  600            y      3
## 3   lamp    7            n      4
## 4    not  200            y      3
## 5   jump   30            n      4
## 6     it  450            y      2
## 7 coffee  130            n      6
## 8   walk   33            n      4
## 9     on  300            y      2
dfNew
##   X   word freq functionword length
## 1 1      a  500            y      1
## 2 2    the  600            y      3
## 3 3   lamp    7            n      4
## 4 4    not  200            y      3
## 5 5   jump   30            n      4
## 6 6     it  450            y      2
## 7 7 coffee  130            n      6
## 8 8   walk   33            n      4
## 9 9     on  300            y      2

The newly created object contains 5 columns rather than the 4 we initially created. This is normal. By default, write.csv saves the row names (here, the row numbers) as an additional first column, which read.csv then reads back as a column named X. You can simply delete this column, keep it as is (but be careful, as any references to columns by position that we use later on would need adjusting), or avoid it by using the argument row.names = FALSE when saving.
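
A minimal sketch of saving without the extra column is shown below. It writes to a temporary file so that it runs anywhere. The names df_demo, csv_path and df_back are made up for this illustration.

```r
# A small stand-in dataframe (made-up values)
df_demo <- data.frame(word = c("a", "the", "lamp"), freq = c(500, 600, 7))

# tempfile() gives a path in R's temporary folder
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE stops write.csv from saving the row numbers
write.csv(df_demo, csv_path, row.names = FALSE)

df_back <- read.csv(csv_path)
names(df_back)  # only the original columns, no X column
```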

2.7.2.5 Reading and saving other formats

R allows us to read data in many formats. If you have a .txt, .sav, .xls, .xlsx, etc., file, there are packages designed for that (e.g., the package xlsx to read/save .xlsx files, or the package haven from the Tidyverse to read/save .sav files).

You can also use the import tool built into RStudio. See Import Dataset within the Environment pane.

In general, any specific formatting is kept, but sometimes the labels associated with numeric codes (as in .sav files) will be lost. Hence, it is always preferable to do minimal formatting on the data. Start with a .csv file, import it to R and do the magic!
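
For plain text files, base R is enough. The sketch below assumes a tab-delimited .txt file. It uses a temporary file and the made-up names df_demo, txt_path and df_txt. Other formats such as .xlsx or .sav need the additional packages mentioned above.

```r
# A small stand-in dataframe (made-up values)
df_demo <- data.frame(word = c("a", "the", "lamp"), freq = c(500, 600, 7))

txt_path <- tempfile(fileext = ".txt")

# Save as tab-delimited text, without row names or quotes around text
write.table(df_demo, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)

# read.delim() reads tab-delimited files and expects a header row by default
df_txt <- read.delim(txt_path)
str(df_txt)
```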

2.7.2.6 Checking the structure

The first thing we will do is to check the structure of our created dataset. We will use the originally created one (i.e., df) and not the imported one (i.e., dfNew).

str(df)
## 'data.frame':    9 obs. of  4 variables:
##  $ word        : chr  "a" "the" "lamp" "not" ...
##  $ freq        : chr  "500" "600" "7" "200" ...
##  $ functionword: chr  "y" "y" "n" "y" ...
##  $ length      : chr  "1" "3" "4" "3" ...

The function str gives us the following information:

  1. How many observations (i.e., rows) and variables (i.e., columns)
  2. The name of each variable (look at $ and what comes after it)
  3. The class of each variable (e.g., chr for character, num for numeric), followed by its first few values; for factors, str also shows the number of levels
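
The same pieces of information can also be requested one at a time. This is a minimal sketch with a made-up two-row dataframe named df_demo; dim, names, sapply and class are all base R functions.

```r
df_demo <- data.frame(word = c("a", "the"), freq = c(500, 600),
                      stringsAsFactors = FALSE)

dim(df_demo)    # number of rows, then number of columns
names(df_demo)  # the variable names

# class() applied to every column in turn
col_classes <- sapply(df_demo, class)
col_classes
```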

2.7.2.7 Changing the class of a variable

As we can see, the four created variables were added to the dataframe as characters (chr). We need to change the class of the numeric variables: freq and length. Let’s do that:

df$freq <- as.numeric(df$freq)
df$length <- as.numeric(df$length)
str(df)
## 'data.frame':    9 obs. of  4 variables:
##  $ word        : chr  "a" "the" "lamp" "not" ...
##  $ freq        : num  500 600 7 200 30 450 130 33 300
##  $ functionword: chr  "y" "y" "n" "y" ...
##  $ length      : num  1 3 4 3 4 2 6 4 2
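
One caveat, relevant mainly in versions of R before 4.0 where text columns became factors by default: calling as.numeric directly on a factor returns the internal level codes, not the numbers you see. The sketch below uses a made-up factor f to show the safe route through as.character.

```r
# Numbers stored as a factor (made-up example)
f <- factor(c("500", "600", "7"))

as.numeric(f)                # the level codes 1 2 3, not the values
as.numeric(as.character(f))  # the actual values 500 600 7
```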

2.7.2.8 Referring to particular variables, observations

As you can see from the above, we can refer to a particular variable in the dataframe by using $ followed by its name. There are additional options to do that. Let’s see what we can do. Can you tell what each of the below does? Chat to your neighbour.

df[1]
##     word
## 1      a
## 2    the
## 3   lamp
## 4    not
## 5   jump
## 6     it
## 7 coffee
## 8   walk
## 9     on
df[,1]
## [1] "a"      "the"    "lamp"   "not"    "jump"   "it"     "coffee" "walk"  
## [9] "on"
df[1,]
##   word freq functionword length
## 1    a  500            y      1
df[1,1]
## [1] "a"

Here are the answers:

  1. Returns the first column, kept as a one-column dataframe
  2. Returns the first variable as a vector
  3. Returns the first row (all columns)
  4. Returns the first observation in the first column

Practice a bit and use other specifications to obtain specific observations, columns or rows…
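
As a starting point for that practice, here is a small sketch that selects by name rather than by position. The dataframe df_demo is a made-up, shortened copy of the data used above.

```r
df_demo <- data.frame(word = c("a", "the", "lamp"),
                      freq = c(500, 600, 7),
                      length = c(1, 3, 4),
                      stringsAsFactors = FALSE)

df_demo[["freq"]]  # the variable as a vector (same as df_demo$freq)
df_demo["freq"]    # the variable kept as a one-column dataframe

# Rows 2 to 3, and only the columns word and freq
df_demo[2:3, c("word", "freq")]

# The frequency of one particular word
lamp_freq <- df_demo[df_demo$word == "lamp", "freq"]
lamp_freq
```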

2.7.3 Descriptive statistics

2.7.3.1 Basic summaries, tables

We can use the function summary to do some basic summaries:

summary(df)
##      word                freq     functionword           length     
##  Length:9           Min.   :  7   Length:9           Min.   :1.000  
##  Class :character   1st Qu.: 33   Class :character   1st Qu.:2.000  
##  Mode  :character   Median :200   Mode  :character   Median :3.000  
##                     Mean   :250                      Mean   :3.222  
##                     3rd Qu.:450                      3rd Qu.:4.000  
##                     Max.   :600                      Max.   :6.000

We can create a table with the function table:

table(df$functionword, df$freq)
##    
##     7 30 33 130 200 300 450 500 600
##   n 1  1  1   1   0   0   0   0   0
##   y 0  0  0   0   1   1   1   1   1
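
A cross-table of a category against a numeric variable is rarely what we want. As a hedged sketch of two common alternatives, the code below counts the words per category and computes the mean frequency per category with the base R function tapply. The vectors repeat the made-up values used above; counts and mean_freq are names chosen for this illustration.

```r
functionword <- c("y", "y", "n", "y", "n", "y", "n", "n", "y")
freq <- c(500, 600, 7, 200, 30, 450, 130, 33, 300)

# One-way table: how many words in each category?
counts <- table(functionword)
counts

# Mean frequency within each category
mean_freq <- tapply(freq, functionword, mean)
mean_freq
```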

2.7.3.2 Basic manipulations

2.7.3.2.1 Creating variables

We sometimes need to create and/or delete variables. Do you know how to do that?

Let’s look at the structure again:

str(df)
## 'data.frame':    9 obs. of  4 variables:
##  $ word        : chr  "a" "the" "lamp" "not" ...
##  $ freq        : num  500 600 7 200 30 450 130 33 300
##  $ functionword: chr  "y" "y" "n" "y" ...
##  $ length      : num  1 3 4 3 4 2 6 4 2

We said earlier that we can refer to a specific variable by using $ + the name of the variable. Let’s use this again with a new variable name that is not in the list of variables above:

df$newVariable
## NULL

What does NULL mean? The variable does not exist! Let’s do something else

df$newVariable <- NA

Ah no error messages! Let’s check the structure

str(df)
## 'data.frame':    9 obs. of  5 variables:
##  $ word        : chr  "a" "the" "lamp" "not" ...
##  $ freq        : num  500 600 7 200 30 450 130 33 300
##  $ functionword: chr  "y" "y" "n" "y" ...
##  $ length      : num  1 3 4 3 4 2 6 4 2
##  $ newVariable : logi  NA NA NA NA NA NA ...

So we now have five variables and the last one is named “newVariable” and assigned “NA”. “NA” is used in R to mark missing data; here it serves as a placeholder. We can replace these with any calculations, or anything else. Let’s do that:

df$newVariable <- log(df$freq)
str(df)
## 'data.frame':    9 obs. of  5 variables:
##  $ word        : chr  "a" "the" "lamp" "not" ...
##  $ freq        : num  500 600 7 200 30 450 130 33 300
##  $ functionword: chr  "y" "y" "n" "y" ...
##  $ length      : num  1 3 4 3 4 2 6 4 2
##  $ newVariable : num  6.21 6.4 1.95 5.3 3.4 ...

We replaced “NA” with the log of the frequencies (log computes the natural logarithm by default). Let’s check that this is correct for one observation. Can you dissect the code below? What did I use to ask R to compute the log of the frequency (freq)? Remember rows and columns.

log(df[1,2])
## [1] 6.214608
df[1,5]
## [1] 6.214608

So they are the same values.
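
To check the whole column rather than a single observation, one option is the base R function all.equal, which compares numbers while allowing for tiny rounding differences. This is only a sketch, using a made-up dataframe named demo.

```r
demo <- data.frame(freq = c(500, 600, 7))
demo$logFreq <- log(demo$freq)  # natural logarithm

# Compare the stored column with a fresh computation on column 1
all.equal(demo$logFreq, log(demo[, 1]))

# exp() undoes the natural log, so we get the frequencies back
all.equal(exp(demo$logFreq), demo$freq)
```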

2.7.3.2.2 Changing column names

Now we need to change the name of the variable to reflect the computations. “newVariable” is meaningless as a name, but “logFreq” is informative.

colnames(df)[5] <- "logFreq"
str(df)
## 'data.frame':    9 obs. of  5 variables:
##  $ word        : chr  "a" "the" "lamp" "not" ...
##  $ freq        : num  500 600 7 200 30 450 130 33 300
##  $ functionword: chr  "y" "y" "n" "y" ...
##  $ length      : num  1 3 4 3 4 2 6 4 2
##  $ logFreq     : num  6.21 6.4 1.95 5.3 3.4 ...

As can be seen from the above, using the command colnames(df)[5] <- "logFreq" allows us to change the column name in position 5 of the dataframe. If we were to change all of the column names, we could use colnames(df) <- c("col1","col2",...).
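
Referring to a column by position can go wrong if the order of the columns changes. A minimal sketch of renaming by the old name instead is given below; the dataframe demo is made up for this illustration.

```r
demo <- data.frame(word = c("a", "the"), newVariable = c(6.21, 6.40))

# Find the column called "newVariable", wherever it is, and rename it
names(demo)[names(demo) == "newVariable"] <- "logFreq"
names(demo)
```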

2.7.3.2.3 Deleting variables

Let us now create a new compound variable that we later delete. This new compound variable will be the product of two numeric variables. The result is meaningless of course, but will be used for this exercise.

df$madeUpVariable <- df$freq*df$length
str(df)
## 'data.frame':    9 obs. of  6 variables:
##  $ word          : chr  "a" "the" "lamp" "not" ...
##  $ freq          : num  500 600 7 200 30 450 130 33 300
##  $ functionword  : chr  "y" "y" "n" "y" ...
##  $ length        : num  1 3 4 3 4 2 6 4 2
##  $ logFreq       : num  6.21 6.4 1.95 5.3 3.4 ...
##  $ madeUpVariable: num  500 1800 28 600 120 900 780 132 600

Let us now delete this variable given that we are not interested in it. Do you know how to do that? Think about how we referred to a variable before: we used df[colNumber]. What if we use df[-colNumber]? What would be the result?

df[-6]
##     word freq functionword length  logFreq
## 1      a  500            y      1 6.214608
## 2    the  600            y      3 6.396930
## 3   lamp    7            n      4 1.945910
## 4    not  200            y      3 5.298317
## 5   jump   30            n      4 3.401197
## 6     it  450            y      2 6.109248
## 7 coffee  130            n      6 4.867534
## 8   walk   33            n      4 3.496508
## 9     on  300            y      2 5.703782

This shows all columns minus the one we are not interested in. If we rewrite the variable df and assign to it the newly created dataframe we just used above (with the minus sign), then the column we are not interested in will be deleted.

df <- df[-6]
str(df)
## 'data.frame':    9 obs. of  5 variables:
##  $ word        : chr  "a" "the" "lamp" "not" ...
##  $ freq        : num  500 600 7 200 30 450 130 33 300
##  $ functionword: chr  "y" "y" "n" "y" ...
##  $ length      : num  1 3 4 3 4 2 6 4 2
##  $ logFreq     : num  6.21 6.4 1.95 5.3 3.4 ...
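
A variable can also be deleted by name, which avoids counting columns. The sketch below shows two base R ways of doing so on a made-up dataframe named demo; demo2 is likewise a name chosen for this illustration.

```r
demo <- data.frame(freq = c(500, 600), length = c(1, 3))
demo$madeUpVariable <- demo$freq * demo$length

# Assigning NULL to a variable removes it from the dataframe
demo$madeUpVariable <- NULL
names(demo)

# Or keep every column whose name is not in a given list
demo2 <- demo[, !(names(demo) %in% c("length")), drop = FALSE]
names(demo2)
```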

2.7.3.2.4 Changing names of observations

Let’s say that we want to change the names of our observations. For instance, the variable “functionword” has the values “y” and “n”. Let us change these to “yes” and “no”. If the variable were a factor, we would first need to change it into a character; here it is already a character, so as.character leaves it unchanged. We then change the observations and finally transform the variable into a factor.

df$functionword <- as.character(df$functionword)
df$functionword[df$functionword == "y"] <- "yes"
df$functionword[df$functionword == "n"] <- "no"
df$functionword <- as.factor(df$functionword)
str(df)
## 'data.frame':    9 obs. of  5 variables:
##  $ word        : chr  "a" "the" "lamp" "not" ...
##  $ freq        : num  500 600 7 200 30 450 130 33 300
##  $ functionword: Factor w/ 2 levels "no","yes": 2 2 1 2 1 2 1 1 2
##  $ length      : num  1 3 4 3 4 2 6 4 2
##  $ logFreq     : num  6.21 6.4 1.95 5.3 3.4 ...
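
The same recoding can be sketched in a single step with the levels and labels arguments of factor. Note that in this version the order of the levels follows the order we give (yes first), which differs from the alphabetical order shown above. The name fw is made up for this illustration.

```r
functionword <- c("y", "y", "n", "y", "n", "y", "n", "n", "y")

# levels = the existing values, labels = the new names, in matching order
fw <- factor(functionword, levels = c("y", "n"), labels = c("yes", "no"))

levels(fw)
table(fw)
```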

2.7.3.2.5 Checking levels of factors

We can also check the levels of a factor and change the reference level. This is useful when doing any type of statistics or when plotting the data. We use the functions levels and relevel, the latter with the argument ref.

levels(df$functionword)
## [1] "no"  "yes"
df$functionword <- relevel(df$functionword, ref = "yes")
levels(df$functionword)
## [1] "yes" "no"

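We can also use the following code to change the order of the levels of a multilevel factor. Since word is still a character variable, levels returns NULL at first; factor then converts it and sets the levels in the order we specify.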

levels(df$word)
## NULL
df$word <- factor(df$word, levels = c("a","coffee","jump","lamp","not","it","on","walk","the"))
levels(df$word)
## [1] "a"      "coffee" "jump"   "lamp"   "not"    "it"     "on"     "walk"  
## [9] "the"
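
Typing the levels by hand is error-prone with many levels. As a hedged sketch, the order can instead be derived from another variable, here from the least to the most frequent word, using the base R function order. The name word_by_freq is made up for this illustration.

```r
word <- c("a", "the", "lamp", "not", "jump", "it", "coffee", "walk", "on")
freq <- c(500, 600, 7, 200, 30, 450, 130, 33, 300)

# order(freq) gives the positions of the words from lowest to highest freq
word_by_freq <- factor(word, levels = word[order(freq)])
levels(word_by_freq)
```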

2.7.3.2.6 Subsetting the dataframe

We may sometimes need to subset the dataframe and use parts of it. We can use the function subset, or select rows with which.

df_Yes1 <- df[which(df$functionword == 'yes'),]
## or
df_Yes2 <- subset(df, functionword=="yes")
str(df_Yes1)
## 'data.frame':    5 obs. of  5 variables:
##  $ word        : Factor w/ 9 levels "a","coffee","jump",..: 1 9 5 6 7
##  $ freq        : num  500 600 200 450 300
##  $ functionword: Factor w/ 2 levels "yes","no": 1 1 1 1 1
##  $ length      : num  1 3 3 2 2
##  $ logFreq     : num  6.21 6.4 5.3 6.11 5.7
str(df_Yes2)
## 'data.frame':    5 obs. of  5 variables:
##  $ word        : Factor w/ 9 levels "a","coffee","jump",..: 1 9 5 6 7
##  $ freq        : num  500 600 200 450 300
##  $ functionword: Factor w/ 2 levels "yes","no": 1 1 1 1 1
##  $ length      : num  1 3 3 2 2
##  $ logFreq     : num  6.21 6.4 5.3 6.11 5.7
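
Conditions can also be combined. The sketch below, on a made-up shortened dataframe named demo, keeps the words that are both frequent and short, once with square brackets and once with subset, whose select argument also picks columns.

```r
demo <- data.frame(word = c("a", "the", "lamp", "not"),
                   freq = c(500, 600, 7, 200),
                   length = c(1, 3, 4, 3),
                   stringsAsFactors = FALSE)

# & means "and": both conditions must hold for a row to be kept
frequent_short <- demo[demo$freq > 100 & demo$length <= 3, ]
frequent_short

# The same rows with subset(), keeping only two of the columns
same_with_subset <- subset(demo, freq > 100 & length <= 3,
                           select = c(word, freq))
same_with_subset
```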

When we subset the data, the levels of a factor are kept as they are.

levels(df_Yes1$functionword)
## [1] "yes" "no"
levels(df_Yes2$functionword)
## [1] "yes" "no"

But only one level of our factor actually occurs in the data:

df_Yes1$functionword
## [1] yes yes yes yes yes
## Levels: yes no
df_Yes2$functionword
## [1] yes yes yes yes yes
## Levels: yes no

By default, R keeps unused levels of a factor unless we drop them, for example by calling factor again on the variable:

df_Yes1$functionword <- factor(df_Yes1$functionword)
df_Yes2$functionword <- factor(df_Yes2$functionword)
df_Yes1$functionword
## [1] yes yes yes yes yes
## Levels: yes
df_Yes2$functionword
## [1] yes yes yes yes yes
## Levels: yes
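
Base R also has a function made for this purpose, droplevels, which removes unused levels from a factor (it can be applied to a whole dataframe as well). A minimal sketch with the made-up names fw and only_yes:

```r
fw <- factor(c("yes", "no", "yes"), levels = c("yes", "no"))

# Subsetting keeps the unused level "no"
only_yes <- fw[fw == "yes"]
levels(only_yes)

# droplevels() removes the levels that no longer occur
levels(droplevels(only_yes))
```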