Lesson 2: Subsetting data vs Manipulating data with dplyr

The gene expression dataset

Blackmore et al. (2017), The effect of upper-respiratory infection on transcriptomic changes in the CNS. The goal of the study was to determine the effect of an upper-respiratory infection on changes in RNA transcription occurring in the cerebellum and spinal cord post infection. Gender matched eight week old C57BL/6 mice were inoculated with saline or with Influenza A by intranasal route and transcriptomic changes in the cerebellum and spinal cord tissues were evaluated by RNA-seq at days 0 (non-infected), 4 and 8.

Getting set up

  • Remember to setwd to bioc-intro folder

  • Download CSV file

  • Creating code chunk with Command+Option+I

download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv",
              destfile = "rnaseq.csv")
rna <- read.csv("rnaseq.csv") # comma separated values
#View(rna)
  • First column is hugo symbol IDs.
  • The 3rd column is the GEO sample specific accession number. Its a unique identifier.

Data Frames

  • We previously used the str command. This allowed us to identify the structure of an R object.
str(rna) #structure of R object 

ThiS data frame contains ~32,000 rows and 19 columns that contain both character and integer vectors.

dim(rna)
## [1] 32428    19
#names(rna)
nrow(rna)
## [1] 32428
#rownames(rna)
rna <- data.frame(rna)

Subsetting

Core idea: “From this object, give me these parts”

# object[row,columns]
rna[1,1] # the first element in the first column 

# Extract the 2 column only
rna[, 2]

# Extract the first 5 rows and output to head_rna then view it 
rna5 <- rna[1:5, ]
rna[, -1] # contains the whole data frame, except the first column

What happens if we want a specific gene of interest?

First, we can access a single column:

rna$gene # this allows us to select a column 

Now we ask a question about that column:

rna$gene == "Asl" # this does not return data, it returns TRUE/FALSE

We can use this logical vector to keep the rows we want:

rna[rna$gene == "Asl", ] # keep only the rows which are TRUE
  • rows where the condition are TRUE are kept
  • the blank after the comma means keep all columns

Exercise #1

Modify the code above so that:

  • Only the first 6 columns are saved
  • Output to a new object called rna_asl
  • View the object
rna_asl <- rna[rna$gene == "Asl", 1:6]

Sometimes we want rows that meet more than one condition. To combine row conditions we use: &

rna$gene == "Asl" & rna$tissue == "Cerebellum"

Exercise #2:

Modify the code so that

  • gene = Asl
  • sex = female
  • output to rna_asl_sex
rna_asl_sex <- rna[rna$gene == "Asl" & rna$sex == "Female", ]

What if we care about more than one gene? Using the == does not work for matching multiple values. You can match multiple values with %in%

#%in% 

#membership testing; each value is tested against the guest list

Exercise #3

Your favorite genes are Kinesin Family Members (Kif5a, Kif21b, and Kif14). You only want rows for these three kif genes AND where infection = Noninfected. Save the result as rna_kif

rna_kif <- rna[
  rna$gene %in% c("Kif5a", "Kif21b", "Kif14") &
    rna$infection == "NonInfected", ]

Factors

Factors represent categorical data. They are stored as integers. They can be both ordered and unordered.

Once created, factors contain a pre-defined set of values, known as levels, by default R sorts levels by ABC order.

sex <- factor(c("male", "female", "female", "male", "female")) 

# this combines values into a vector
#R is able to interpret this as five observations 
#sex <- factor("male", "female", "female", "male", "female") 

# R is seeing here is one observation, plus some arguments tha it does not understand 
levels(sex)
## [1] "female" "male"
# level 1 = female
# level 2 = male

Sometimes order matters and we might want to reorder to improve visualization (low, med, high). To do so we can use:

sex <- factor(sex, levels = c("male", "female"))

Remember that these factors are represented by integers. So once data is stored as a factor, you can use the plot() function:

plot(sex)

To rename, we can change its levels:

levels(sex) <- c("M", "F")
sex
## [1] M F F M F
## Levels: M F

Exporting and saving tabular data

To export a data.frame to a text-based spreadsheet, we can use the write.table set of functions (write.csv, write.delim, …).

#write.csv(rna, file = "my_rna.csv")

Manipulating data with dplyr

clear your Global Environment before starting

  • There is a R package called dplyr which provides powerful tools for data manipulation tasks. It is built to work directly with data frames, with many manipulation tasks optimised.

In order to access the dplyr-specific tools, we must load the tidyverse library. The tidyverse package is an “umbrella-package” that installs several useful packages for data analysis which work well together, such as tidyr, dplyr, ggplot2, and tibble.

library("tidyverse")

Loading data with tidyverse

Instead of read.csv(), we will read in our data using the read_csv() function from the tidyverse package readr.

rna <- read_csv("data/rnaseq.csv")

# what do we notice when we view the data

rna
  • Notice that the class of the data is now referred to as a “tibble”.

The data structure of a tibble is very similar to a data frame, but here are the differences:

  • It displays the data type of each column under its name. Note that is a data type defined to hold numeric values with decimal points.

  • It is designed to only prints the first few rows of data and only as many columns as fit on one screen.

Now we are now going to learn some of the most common dplyr functions.

The first one is select(). The first argument will always be the dataframe.

select(rna, gene, sample, tissue, expression)

To select all columns except certain ones, put a “-” in front of the variable to exclude it.

select(rna, -tissue, -organism)

To choose rows based on a specific criteria, use filter():

filter(rna, sex == "Male")

Similar to subsetting, if we wanted to filter based on more than one criteria we can make use of the & symbol.

Go ahead and filter the data for only Male and NonInfected below:

filter(rna, sex == "Male" & infection == "NonInfected")
  • Now let’s imagine we are interested in the human homologs of the mouse genes analysed in this dataset.

  • This information can be found in the last column of the rna tibble, named hsapiens_homolog_associated_gene_name.

  • To visualize it easily, create a new table containing just the 2 columns gene and hsapiens_homolog_associated_gene_name below. Output to genes_new

genes_new <- select(rna, gene, hsapiens_homolog_associated_gene_name)
#genes_new

Once we get to panel 3, you’ll notice that this data frame has some missing values - NA - in the human homolog column.

What does NA mean? It means unknown, not zero, it means the data is missing.

When we use filter() in dplyr, every row has to evaluate to TRUE or FALSE. If R sees an NA and we dont tell it what to do with it, things get weird.

Lets consider the following line:

#filter(genes_new, hsapiens_homolog_associated_gene_name)
  • Keep the rows, where the column exists. Thats what we think.

  • However, what R actually sees is, evaluate this character vector as TRUE or FALSE.

  • It contains strings and NAs.

  • Filter needs a logical condition.

The solution is to make the logic explicit:

filter(genes_new, is.na(hsapiens_homolog_associated_gene_name))

Here, we are being very clear. We are asking R a yes/no question for every row. Is the human homolog missing? If the answer is TRUE, keep the row. If its FALSE drop it.

Final Exercise:

Above, is showing how to keep rows WITHOUT a human homolog.

  • Output this to genes_no_homolog.
  • In a separate code chunk, keep rows WITH a human homolog and output to genes_with_homolog.
  • Output number of rows for each R object.
# Keep rows WITHOUT a human homolog
genes_no_homolog <- filter(genes_new, is.na(hsapiens_homolog_associated_gene_name))
# Keep rows WITH a human homolog
genes_with_homolog <- filter(genes_new, !is.na(hsapiens_homolog_associated_gene_name))
nrow(genes_no_homolog)
nrow(genes_with_homolog)