Introduction to R/RStudio

Learning Objectives

  • Be able to describe the difference between R and RStudio
  • Describe the purpose and use of each panel in RStudio IDE
  • Be able to use common R syntax
  • Describe and use most common data types and data structures in R
  • Demonstrate how to load a library and how to find functions specific to a package

What is R?

Go ahead and request a 1-hour R/RStudio session on the VACC.

“R” is the name of both a programming language and the software that reads and interprets scripts written in that language. It is specialized for statistical computing and graphics.

The R environment combines:

  • effective handling of big data
  • collection of integrated tools
  • graphical facilities
  • simple and effective programming language

Why use R?

R is a powerful environment. It has a wide range of statistics and general data analysis and visualization capabilities.

  • Data handling, wrangling, and storage
  • Wide array of statistical methods and graphical techniques available
  • Easy to install on any platform and use (and it’s free!)
  • Open source with a large and growing community of peers
  • R produces high-quality graphics that are reproducible

Example of R used in the media

What is RStudio?

Here, we will be using R via RStudio. First-time users often confuse the two. At its simplest, R is like a car’s engine while RStudio is like a car’s dashboard, as illustrated in the figure below.

More precisely, R is a programming language that runs computations, while RStudio is a freely available open-source integrated development environment (IDE) that provides an interface to R and adds many convenient features and tools. Just as having a speedometer, rearview mirrors, and a navigation system makes driving much easier, RStudio’s interface makes using R much easier.

RStudio provides an environment with many features to make using R easier and is a great alternative to working on R in the terminal.

  • Graphical user interface, not just a command prompt
  • Great learning tool
  • Free for academic use
  • Platform agnostic
  • Open source

Creating a new project directory in RStudio

Let’s create a new project directory for our “Introduction to R” lesson today.

  1. Open RStudio
  2. Go to the File menu and select New Project.
  3. In the New Project window, choose New Directory. Then, choose New Project. Name your new directory Intro-to-R and then “Create the project as subdirectory of:” the root of your VACC home account (~).
  4. Click on Create Project.
  5. After your project is completed, if the project does not automatically open in RStudio, then go to the File menu, select Open Project, and choose Intro-to-R.Rproj.
  6. When RStudio opens, you will see three panels in the window.
  7. Go to the File menu and select New File, and select R Script.
  8. Go to the File menu and select Save As..., type Intro-to-R.R and select Save.

The RStudio interface should now look like the screenshot below.

RStudio interface

What is a project in RStudio?

It is simply a directory that contains everything related to your analyses for a specific project. RStudio projects are useful when you are working on context-specific analyses and you wish to keep them separate. When creating a project in RStudio you associate it with a working directory of your choice (either an existing one, or a new one). A .Rproj file is created within that directory, and it keeps track of your command history and the variables in the environment. The .Rproj file can be used to reopen the project in its current state at a later date.

When a project is (re)opened within RStudio the following actions are taken:

  • A new R session (process) is started
  • RStudio automatically tries to remember everything you had in your R session when you close it. It does this by saving all objects in your environment (variables, data frames, etc.) into a file called .RData.
  • The next time you open RStudio in that directory, it reloads those objects automatically. While this can sound convenient, it often causes problems—especially when working with large datasets. You have the option of turning off this behavior.
  • The .Rhistory file in the project’s main directory is loaded into the RStudio History pane (and used for Console Up/Down arrow command history).
  • The current working directory is set to the project directory.
  • Previously edited source documents are restored into editor tabs
  • Other RStudio settings (e.g. active tabs, splitter positions, etc.) are restored to where they were the last time the project was closed.

Information adapted from RStudio Support Site

Organizing your working directory & setting up

RStudio Interface

The RStudio interface has four main panels:

  1. Console: where you can type commands and see output. The console is all you would see if you ran R in the command line without RStudio.
  2. Script editor: where you can type out commands and save to file. You can also submit the commands to run in the console.
  3. Environment/History: environment shows all active objects and history keeps track of all commands run in console
  4. Files/Plots/Packages/Help: a handy browser for your current files. This is also where your plots will appear and where you can view package information, help pages, and much more.

The working directory

  • The working directory is an important concept to understand. It is the place where R looks for files and saves files. When you write code for your project, it should refer to files relative to the root of your working directory and should only need files within this structure.
  • How do I find or change my working directory? Use getwd() to report it and setwd() to change it.
  • Let’s check to see where our current working directory is located by typing into the console:
getwd() # return an absolute filepath
# this is also our first example of a function 

Your working directory should be the Intro-to-R folder constructed when you created the project. The working directory is where RStudio will automatically look for any files you bring in and where it will automatically save any files you create, unless otherwise specified.
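As a rough sketch of how the working directory interacts with file paths: relative paths are resolved against whatever getwd() reports. The file name raw_counts.csv below is hypothetical and is used only for illustration; nothing is read from or written to disk.

```r
# Report the current working directory as an absolute path
wd <- getwd()
wd

# Build a path relative to the working directory
# ("raw_counts.csv" is a hypothetical file name, used only for illustration)
rel_path <- file.path("data", "raw_counts.csv")
rel_path

# The same location written out as an absolute path
abs_path <- file.path(wd, rel_path)
abs_path

# List what is currently in the working directory
list.files()
```

Writing paths relative to the project folder, as in rel_path, is what lets the same script run unchanged on another computer.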

You can visualize your working directory by selecting the Files tab from the Files/Plots/Packages/Help window.

If you wanted to choose a different directory to be your working directory, you could navigate to a different folder in the Files tab, then click on the More dropdown menu (shown as a cog icon) and select Set As Working Directory.

Structuring your working directory

To organize your working directory for a particular analysis, you should separate the original data (raw data) from intermediate datasets. For instance, you may want to create a data/ directory within your working directory that stores the raw data, and have a results/ directory for intermediate datasets and a figures/ directory for the plots you will generate.

Let’s create these three directories within your working directory by clicking on New Folder within the Files tab.
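The same folders can also be created from the console instead of clicking. This is only a sketch, assuming the working directory is the Intro-to-R project folder and that you have permission to write there:

```r
# Folder names suggested in this lesson
folders <- c("data", "results", "figures")

# Create each folder inside the current working directory;
# showWarnings = FALSE keeps R quiet if a folder already exists
for (folder in folders) {
  dir.create(folder, showWarnings = FALSE)
}

# Confirm that all three folders now exist (one TRUE/FALSE per folder)
dir.exists(folders)
```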

When finished, your working directory should look like:

Setting up

This is more of a housekeeping task. In the future, we may be writing long lines of code in our script editor and want to make sure that the lines “wrap” and you don’t have to scroll back and forth to look at your long line of code.

Click on Code -> Soft Wrap Long Lines (make sure this option is checked).

Interacting with R

Now that we have our interface and directory structure set up, let’s start interacting with R! There are two main ways of interacting with R in RStudio: using the console or using the script editor (plain text files that contain your code).

Console window

The console window (in RStudio, the bottom left panel) is the place where R is waiting for you to tell it what to do, and where it will show the results of a command. You can type commands directly into the console, but they will be forgotten when you close the session.

Let’s test it out:

  • How am I running the line without physically hitting Run? Does anyone know?
3 + 5 

Script editor

Best practice is to enter the commands in the script editor, and save the script. You are encouraged to comment liberally to describe the commands you are running using #. This way, you have a complete record of what you did, you can easily show others how you did it and you can do it again later on if needed.

The RStudio script editor allows you to ‘send’ the current line or the currently highlighted text to the R console by clicking on the Run button in the upper-right corner of the script editor.

Now let’s try entering commands to the script editor and using the comments character # to add descriptions and highlighting the text to run:

# simple math 
3 + 5 
12/7

Alternatively, you can run the current line or selection by pressing the Ctrl and Return/Enter keys at the same time.

You should see the command run in the console and output the result.

What happens if we do that same command without the comment symbol #? Re-run the command after removing the # sign in the front:

simple math
3 + 5 

Now R is trying to run that sentence as a command, and it doesn’t work. We get an error in the console: “Error: unexpected symbol in "simple math"”. This means that the R interpreter did not know what to do with that command.


Naming variables

Objects can be given any name such as x, current_temperature, or subject_id. You want your object names to be explicit and not too long. They cannot start with a number (2x is not valid, but x2 is). R is case sensitive (e.g., weight_kg is different from Weight_kg). There are some names that cannot be used because they are the names of fundamental functions in R (e.g., if, else, for, see the Reserved Words in R manual page, for a complete list).

It’s also best to avoid dots (.) within an object name as in my.dataset.

# naming 
# 2x
# weight_kg is not Weight_kg
# my.dataset 
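The naming rules above can be checked directly in the console. This is a small sketch rather than part of the original script; it uses the assignment operator <-, which the lesson introduces in the next section, and the values 55 and 70 are made up for illustration.

```r
# Names are case sensitive: these are two different objects
weight_kg <- 55
Weight_kg <- 70
weight_kg == Weight_kg   # FALSE

# A name cannot start with a number: R cannot even parse "2x <- 1".
# tryCatch() turns the parse error into a message we can look at.
result <- tryCatch(
  parse(text = "2x <- 1"),
  error = function(e) "not a valid name"
)
result

# Dots are allowed in names, but underscores are the safer style
my.dataset <- 1
my_dataset <- 1
```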

The R syntax

Now that we know how to talk with R via the script editor or the console, we want to use R for something more than adding numbers. To do this, we need to know more about the R syntax.

The main “parts of speech” in R (syntax) include:

  • comments (#) and how they are used to document functions and their content
  • variables and functions
  • the assignment operator <-

We will go through each of these “parts of speech” in more detail, starting with the assignment operator.

To do useful and interesting things in R, we need to assign values to variables using the assignment operator, <-.

  • Typing the object name (weight_kg) will give you a value on the console
  • Now weight_kg has been memorized by R
# assign values to objects (let's make it useful)
weight_kg <- 50
weight_kg # this object is "memorized" by R; global environment 

The assignment operator (<-) assigns values on the right to variables on the left.

When assigning a value to a variable, R does not print anything to the console. You can force R to print the value by wrapping the assignment in parentheses or by typing the variable name.

In RStudio, typing Alt + - (push Alt at the same time as the - key; on a Mac, push Option and the - key) will write <- in a single keystroke.
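A short sketch of the three behaviours just described, reusing the weight_kg example:

```r
weight_kg <- 50      # assignment alone prints nothing
(weight_kg <- 50)    # wrapping it in parentheses assigns AND prints
weight_kg            # typing the name prints the stored value
```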

Variables

A variable is a symbolic name for (or reference to) information. Variables in computer programming are analogous to “buckets”, where information can be maintained and referenced. On the outside of the bucket is a name. When referring to the bucket, we use the name of the bucket, not the data stored in the bucket.

Let’s create another variable.

weight_lb <- 2.2 * weight_kg # convert kilograms to pounds (1 kg is about 2.2 lb)

weight_kg <- 60 # reassign weight_kg
weight_lb # why didn’t this object change?
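To make the answer concrete, here is a minimal sketch. It assumes the approximate conversion factor 2.2 and uses made-up weights; the point is that R stores the result of a calculation, not the formula that produced it.

```r
weight_kg <- 50
weight_lb <- 2.2 * weight_kg   # computed once, from the value 50

weight_kg <- 100               # changing weight_kg later...
weight_lb                      # ...does not touch weight_lb

weight_lb <- 2.2 * weight_kg   # re-run the assignment to update it
weight_lb
```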

Functions and their arguments

  • Don’t be a hero, read what’s in front of you – while writing the word Functions on the screen

Functions are “canned scripts” that automate more complicated sets of commands, including operations, assignments, etc. Many functions are predefined, or can be made available by importing R packages (more on that later). A function usually takes one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function round().

  • Do round(3.1234)
  • I want to modify the default behavior of this function by adding an argument which will allow for additional digits, how can I get additional information to do this?
  • Contrast to getwd() – no argument required
# Functions 
round(3.1234)
?round
round(3.1234, 3)
round(x = 3.1234, digits = 3)

getwd()
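Besides ?round, args() gives a quick look at the arguments a function accepts. The sketch below also shows that once arguments are named, their order no longer matters; the variable names by_position and by_name are ours, chosen for illustration.

```r
# See the arguments a function accepts and their defaults
args(round)

# Positional matching: the second value is taken as digits
by_position <- round(3.1234, 3)

# Named matching: the order of the arguments no longer matters
by_name <- round(digits = 3, x = 3.1234)

identical(by_position, by_name)   # TRUE
```

Naming arguments is a little more typing, but it makes the script easier to read later on.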

Vectors and data types

  • Don’t be a hero, read what’s in front of you – while writing the words #Vectors and data types on the screen

A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. A vector is composed of a series of values, such as numbers or characters. We can assign a series of values to a vector using the c() function. For example, we can create a vector of animal weights and assign it to a new object weight_g:

# Vectors(objects) and data types 
weight_g <- c(50, 60, 65, 82) # use of combine function to assign a series of values to a vector 
weight_g # this is a numerical vector 
  • Don’t run it and get the error “object ‘weight_g’ not found”
  • Ask students what the error means and how to fix it
  • Continue with the lesson
# let's create a character vector 
molecules <- c("dna", "rna", "protein")
molecules

The quotes around “dna”, “rna”, etc. are essential here. Without the quotes R will assume there are objects called dna, rna and protein. As these objects don’t exist in R’s memory, there will be an error message.

An important feature of a vector is that all of its elements are the same type of data. The function class() indicates the class (the type of element) of an object:

class(weight_g) # class() reports the type of the elements in the object 
class(molecules)

str(weight_g) #provides overview of the structure of an object 

You can use the c() function to add other elements to your vector:

# addition of elements 
weight_g <- c(weight_g, 90) # added value to the end of the vector 
weight_g <- c(30, weight_g) # add to the beginning of the vector 
# Quick Exercise 
num_char <- c(1, 2, 3, "a")
class(num_char)
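One way to reason about the quick exercise: because every element of a vector must have the same type, R quietly converts ("coerces") mixed inputs to a type that can represent all of them. A sketch, where num_logical and char_logical are extra examples added for illustration:

```r
num_char <- c(1, 2, 3, "a")
class(num_char)        # "character": the numbers became text
num_char

num_logical <- c(1, 2, TRUE)
class(num_logical)     # "numeric": TRUE was converted to 1

char_logical <- c("a", TRUE)
class(char_logical)    # "character": TRUE became the text "TRUE"
```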

Subsetting Vectors

  • R vectors don’t have an “insert” command. To add something in the middle, we rebuild the vector by combining the parts before and after the insertion point, for example: molecules <- c(molecules[1:2], "peptide", molecules[3])

  • The combine function c() takes individual values or vectors and combines them into a single vector; it’s like the glue.

# Subsetting Vectors (by position)
molecules <- c("dna", "rna", "peptides", "proteins")

Or I could use brackets:

#subsetting - addition of glycerol
molecules <- c(molecules[1:3], "glycerol", molecules[4])
#brackets in R are how we say "I only want this part" 

molecules[2] # the element at position 2 of the object molecules 
molecules[3,2] # error: "incorrect number of dimensions"; [row, column] indexing needs a 2-dimensional object such as a data frame 
molecules[c(3,2)] # molecules has one dimension, so c(3,2) is read as a vector of positions: give me the 3rd and 2nd elements 
more_molecules <- molecules[c(1, 2, 3, 2, 1, 4)]
more_molecules

This is powerful because it means:

  • You can reorder elements
  • You can duplicate elements
  • You can select any combination of positions
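Each of the three points can be tried out in one line. This sketch redefines molecules so that it runs on its own; the names reordered, duplicated, and selected are ours.

```r
molecules <- c("dna", "rna", "peptides", "glycerol", "proteins")

reordered  <- molecules[c(5, 4, 3, 2, 1)]   # reorder: reverse the vector
duplicated <- molecules[c(1, 1, 2)]         # duplicate: "dna" appears twice
selected   <- molecules[c(1, 5)]            # select: first and last only

reordered
duplicated
selected
```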

Finally, it is also possible to get all the elements of a vector except some specified elements using negative indices:

# remove an element
molecules[-1] ## all but the first one

Brackets in R are how we say “I only want part of this object.”

  • Positive numbers → keep those positions
  • Negative numbers → remove those positions
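One detail worth seeing in action: subsetting with negative indices returns a shortened copy and leaves the original vector alone until the result is assigned back. A sketch, redefining molecules so it runs on its own:

```r
molecules <- c("dna", "rna", "peptides", "glycerol", "proteins")

molecules[-1]          # everything except the first element
molecules[-c(1, 3)]    # everything except the first and third elements

# The original vector is unchanged by the lines above
length(molecules)      # 5

# To actually shorten the vector, assign the result back to the name
molecules <- molecules[-1]
length(molecules)      # 4
```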

Conditional subsetting

So far, we’ve selected values from a vector using positions (like the 1st or 3rd element). Conditional subsetting lets us select values based on rules rather than positions, using a logical vector. TRUE will select the element with the same index, while FALSE will not:

# Conditional subsetting; selection of values based on RULES 

weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] # Use of logical vectors 

weight_g[weight_g > 50] # select values above 50 

weight_g[weight_g < 30 | weight_g > 50] # one of the conditions is true

weight_g[weight_g >= 30 & weight_g == 21] # == numerical equality; both conditions are true. This can never be true 

This expression uses conditional subsetting to filter a vector based on multiple rules.

  • weight_g >= 30 checks which values are 30 or greater
  • weight_g == 21 checks which values are exactly 21
  • | (OR) means at least one condition must be TRUE
  • & (AND) means both conditions must be TRUE at the same time
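It can help to look at the logical vector on its own before using it inside the brackets. A sketch using the same weights; the name is_heavy is ours.

```r
weight_g <- c(21, 34, 39, 54, 55)

# Step 1: the comparison produces one TRUE/FALSE per element
is_heavy <- weight_g > 50
is_heavy                                    # FALSE FALSE FALSE TRUE TRUE

# Step 2: the logical vector picks out the TRUE positions
weight_g[is_heavy]                          # 54 55

# OR: keep values where at least one rule holds
weight_g[weight_g < 30 | weight_g > 50]     # 21 54 55

# AND: no value is both >= 30 and equal to 21, so nothing is kept
weight_g[weight_g >= 30 & weight_g == 21]   # numeric(0)
```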

Data Types

Variables can contain values of specific types within R. The six data types that R uses include:

  • "numeric" for any numerical value, including whole numbers and decimals. This is the most common data type for performing mathematical operations.
  • "character" for text values, denoted by using quotes ("") around the value. For instance, while 5 is a numeric value, if you were to put quotation marks around it, it would turn into a character value, and you could no longer use it for mathematical operations. Single or double quotes both work, as long as the same type is used at the beginning and end of the character value.
  • "integer" for whole numbers (e.g., 2L, where the L indicates to R that it’s an integer). It behaves similarly to the numeric data type for most tasks or functions; however, it takes up less storage space than numeric data, so tools will often output integers if the data is known to consist of whole numbers. If you wanted to create your own, you could do so by providing the whole number, followed by an upper-case L.
  • "logical" for TRUE and FALSE (the Boolean data type). The logical data type can be specified using four values, TRUE in all capital letters, FALSE in all capital letters, a single capital T or a single capital F.
  • "complex" to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that’s all we’re going to say about them
  • "raw" that we won’t discuss further

The table below provides examples of each of the commonly used data types:

Data Type Examples
Numeric: 1, 1.5, 20, pi
Character: “anytext”, “5”, “TRUE”
Logical: TRUE, FALSE, T, F

The type of data will determine what you can do with it. For example, if you want to perform mathematical operations, then your data type cannot be character (logical values are silently converted to 1 and 0). Whereas if you want to search for a word or pattern in your data, then your data should be of the character data type. The task or function being performed on the data will determine what type of data can be used.
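The class() function from earlier can be used to check each of these types. The sketch below also shows one way to turn a number stored as text back into a numeric value; five_text and five_number are names made up for illustration.

```r
class(1.5)       # "numeric"
class(2L)        # "integer"
class("5")       # "character"
class(TRUE)      # "logical"
class(1+4i)      # "complex"

# Text that looks like a number must be converted before doing maths
five_text <- "5"
five_number <- as.numeric(five_text)
five_number + 1  # 6
```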

Missing Data

  • Lets transition, R was designed to analyze ….

  • There are a few ways to deal with NAs, the ones we will discuss are na.rm, is.na(), and na.omit()

# Dealing with Missing Data - NA's 

## Option 1: na.rm - Ignore NAs during a calculation
heights <- c(2, 4, 4, NA, 6)
mean(heights)

mean(heights, na.rm = TRUE)
## Option 2: is.na() - Identify missing values

is.na(heights) #Asks a yes/no for each value
# So at this point, nothing has been removed
#R is just identifying which values are missing.

!is.na(heights) # the ! means NOT; flips logical 

heights[!is.na(heights)]

#The square brackets [] are used for subsetting
#R keeps only the elements where the condition is TRUE

is.na() finds the missing values, ! flips the logic, and the brackets keep only the values that are not missing.
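Because is.na() returns a logical vector, it can also be used to count or locate the missing values before deciding what to do with them. A sketch using the same heights vector:

```r
heights <- c(2, 4, 4, NA, 6)

sum(is.na(heights))     # how many values are missing: 1
which(is.na(heights))   # where they are: position 4
mean(is.na(heights))    # proportion of values missing: 0.2
```

This works because TRUE counts as 1 and FALSE as 0 when logical values are used in arithmetic.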

## Option 3: na.omit() - returns a copy with the NAs (or rows containing NAs) removed
na.omit(heights)

# Returns a new object with NAs removed
# Works on vectors, data frames, and matrices
# For data frames: removes entire rows with any NA
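To illustrate the data frame case mentioned in the comments, here is a sketch with a small, made-up data frame (the column names id and height are hypothetical). Note that the original object is left untouched; na.omit() only returns a cleaned copy.

```r
# A small, made-up data frame for illustration
measurements <- data.frame(
  id = c(1, 2, 3),
  height = c(63, NA, 70)
)

# Keep only the rows that have no missing values
complete_rows <- na.omit(measurements)

nrow(measurements)    # 3: the original is untouched
nrow(complete_rows)   # 2: the row containing the NA was dropped
```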

Class Exercise

  1. Using this vector of heights in inches, create a new vector with the NAs removed.
heights <- c(63, 69, 60, 65, NA, 68, 61, NA, 70)
heights_no_na <- na.omit(heights)
  2. Use the function median() to calculate the median of the heights vector.
median(heights_no_na)
# or 

median(heights, na.rm = TRUE)

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.