Lesson 2: Subsetting data vs Manipulating data with dplyr
The gene expression dataset
Blackmore et al. (2017), The effect of upper-respiratory infection on transcriptomic changes in the CNS. The goal of the study was to determine the effect of an upper-respiratory infection on changes in RNA transcription occurring in the cerebellum and spinal cord post infection. Gender matched eight week old C57BL/6 mice were inoculated with saline or with Influenza A by intranasal route and transcriptomic changes in the cerebellum and spinal cord tissues were evaluated by RNA-seq at days 0 (non-infected), 4 and 8.
Getting set up
Remember to setwd to bioc-intro folder
Download CSV file
Creating code chunk with Command+Option+I
download.file(url = "https://github.com/carpentries-incubator/bioc-intro/raw/main/episodes/data/rnaseq.csv",
destfile = "rnaseq.csv")- First column is hugo symbol IDs.
- The 3rd column is the GEO sample specific accession number. Its a unique identifier.
Data Frames
- We previously used the
strcommand. This allowed us to identify the structure of an R object.
ThiS data frame contains ~32,000 rows and 19 columns that contain both character and integer vectors.
## [1] 32428 19
## [1] 32428
Subsetting
Core idea: “From this object, give me these parts”
# object[row,columns]
rna[1,1] # the first element in the first column
# Extract the 2 column only
rna[, 2]
# Extract the first 5 rows and output to head_rna then view it
rna5 <- rna[1:5, ]What happens if we want a specific gene of interest?
First, we can access a single column:
Now we ask a question about that column:
We can use this logical vector to keep the rows we want:
- rows where the condition are TRUE are kept
- the blank after the comma means keep all columns
Exercise #1
Modify the code above so that:
- Only the first 6 columns are saved
- Output to a new object called
rna_asl - View the object
Sometimes we want rows that meet more than one condition. To combine
row conditions we use: &
Exercise #2:
Modify the code so that
- gene = Asl
- sex = female
- output to
rna_asl_sex
What if we care about more than one gene? Using the ==
does not work for matching multiple values. You can match multiple
values with %in%
Factors
Factors represent categorical data. They are stored as integers. They can be both ordered and unordered.
Once created, factors contain a pre-defined set of values, known as levels, by default R sorts levels by ABC order.
sex <- factor(c("male", "female", "female", "male", "female"))
# this combines values into a vector
#R is able to interpret this as five observations #sex <- factor("male", "female", "female", "male", "female")
# R is seeing here is one observation, plus some arguments tha it does not understand ## [1] "female" "male"
Sometimes order matters and we might want to reorder to improve visualization (low, med, high). To do so we can use:
Remember that these factors are represented by integers. So once data
is stored as a factor, you can use the plot() function:
To rename, we can change its levels:
## [1] M F F M F
## Levels: M F
Exporting and saving tabular data
To export a data.frame to a text-based spreadsheet, we can use the write.table set of functions (write.csv, write.delim, …).
Manipulating data with dplyr
clear your Global Environment before starting
- There is a R package called
dplyrwhich provides powerful tools for data manipulation tasks. It is built to work directly with data frames, with many manipulation tasks optimised.
In order to access the dplyr-specific tools, we must load the
tidyverse library. The tidyverse package is an
“umbrella-package” that installs several useful packages for data
analysis which work well together, such as tidyr,
dplyr, ggplot2, and tibble.
Loading data with tidyverse
Instead of read.csv(), we will read in our data using
the read_csv() function from the tidyverse package
readr.
- Notice that the class of the data is now referred to as a “tibble”.
The data structure of a tibble is very similar to a data frame, but here are the differences:
It displays the data type of each column under its name. Note that
is a data type defined to hold numeric values with decimal points. It is designed to only prints the first few rows of data and only as many columns as fit on one screen.
Now we are now going to learn some of the most common
dplyr functions.
The first one is select(). The first argument will
always be the dataframe.
To select all columns except certain ones, put a “-” in front of the variable to exclude it.
To choose rows based on a specific criteria, use
filter():
Similar to subsetting, if we wanted to filter based on more than one
criteria we can make use of the & symbol.
Go ahead and filter the data for only Male and NonInfected below:
Now let’s imagine we are interested in the human homologs of the mouse genes analysed in this dataset.
This information can be found in the last column of the rna tibble, named
hsapiens_homolog_associated_gene_name.To visualize it easily, create a new table containing just the 2 columns
geneandhsapiens_homolog_associated_gene_namebelow. Output togenes_new
Once we get to panel 3, you’ll notice that this data frame has some missing values - NA - in the human homolog column.
What does NA mean? It means unknown, not zero, it means the data is missing.
When we use filter() in dplyr, every row has to evaluate
to TRUE or FALSE. If R sees an NA and we dont tell it what to do with
it, things get weird.
Lets consider the following line:
Keep the rows, where the column exists. Thats what we think.
However, what R actually sees is, evaluate this character vector as TRUE or FALSE.
It contains strings and NAs.
Filter needs a logical condition.
The solution is to make the logic explicit:
Here, we are being very clear. We are asking R a yes/no question for every row. Is the human homolog missing? If the answer is TRUE, keep the row. If its FALSE drop it.
Final Exercise:
Above, is showing how to keep rows WITHOUT a human homolog.
- Output this to
genes_no_homolog. - In a separate code chunk, keep rows WITH a human homolog and output
to
genes_with_homolog. - Output number of rows for each R object.
# Keep rows WITHOUT a human homolog
genes_no_homolog <- filter(genes_new, is.na(hsapiens_homolog_associated_gene_name))