Introduction to R/RStudio
Learning Objectives
- Be able to describe the difference between R and RStudio
- Describe the purpose and use of each panel in RStudio IDE
- Be able to use common R syntax
- Describe and use most common data types and data structures in R
- Demonstrate how to load a library and how to find functions specific to a package
What is R?
Go ahead and request a 1-hour R/RStudio session on the VACC.
“R” refers to both a programming language and the software that reads and interprets scripts written in that language. It is specialized for statistical computing and graphics.
The R environment combines:
- effective handling of big data
- collection of integrated tools
- graphical facilities
- simple and effective programming language
Why use R?
R is a powerful environment. It has a wide range of statistics and general data analysis and visualization capabilities.
- Data handling, wrangling, and storage
- Wide array of statistical methods and graphical techniques available
- Easy to install on any platform and use (and it’s free!)
- Open source with a large and growing community of peers
- R produces high-quality graphics that are reproducible
Example of R used in the media
- “At the BBC data team, we have developed an R package and an R cookbook to make the process of creating publication-ready graphics in our in-house style…” - BBC Visual and Data Journalism cookbook for R graphics
What is RStudio?
Here, we will be using R via RStudio. First-time users often confuse the two. At its simplest, R is like a car’s engine while RStudio is like a car’s dashboard, as illustrated in the figure below.
More precisely, R is a programming language that runs computations, while RStudio is a freely available open-source integrated development environment (IDE) that provides an interface with many convenient features and tools. Just as having access to a speedometer, rearview mirrors, and a navigation system makes driving much easier, using RStudio’s interface makes using R much easier.
RStudio provides an environment with many features to make using R easier and is a great alternative to working on R in the terminal.
- Graphical user interface, not just a command prompt
- Great learning tool
- Free for academic use
- Platform agnostic
- Open source
Creating a new project directory in RStudio
Let’s create a new project directory for our “Introduction to R” lesson today.
- Open RStudio
- Go to the File menu and select New Project.
- In the New Project window, choose New Directory. Then, choose New Project. Name your new directory Intro-to-R and then “Create the project as subdirectory of:” the root of your VACC home account (~).
- Click on Create Project.
- After your project is completed, if the project does not automatically open in RStudio, then go to the File menu, select Open Project, and choose Intro-to-R.Rproj.
- When RStudio opens, you will see three panels in the window.
- Go to the File menu and select New File, and select R Script.
- Go to the File menu and select Save As..., type Intro-to-R.R and select Save
The RStudio interface should now look like the screenshot below.
What is a project in RStudio?
It is simply a directory that contains everything related to your
analyses for a specific project. RStudio projects are useful when you
are working on context-specific analyses and you wish to keep them
separate. When creating a project in RStudio you associate it with a
working directory of your choice (either an existing one, or a new one).
A .Rproj file is created within that directory, and it
keeps track of your command history and variables in the environment.
The .Rproj file can be used to reopen the project in its
current state at a later date.
When a project is (re)opened within RStudio the following actions are taken:
- A new R session (process) is started
- RStudio automatically tries to remember everything you had in your R session when you close it. It does this by saving all objects in your environment (variables, data frames, etc.) into a file called .RData.
- The next time you open RStudio in that directory, it reloads those objects automatically. While this can sound convenient, it often causes problems—especially when working with large datasets. You have the option of turning off this behavior.
- The .Rhistory file in the project’s main directory is loaded into the RStudio History pane (and used for Console Up/Down arrow command history).
- The current working directory is set to the project directory.
- Previously edited source documents are restored into editor tabs
- Other RStudio settings (e.g. active tabs, splitter positions, etc.) are restored to where they were the last time the project was closed.
Information adapted from RStudio Support Site
Organizing your working directory & setting up
RStudio Interface
The RStudio interface has four main panels:
- Console: where you can type commands and see output. The console is all you would see if you ran R in the command line without RStudio.
- Script editor: where you can type out commands and save to file. You can also submit the commands to run in the console.
- Environment/History: environment shows all active objects and history keeps track of all commands run in console
- Files/Plots/Packages/Help: a handy browser for your current files. This is also where your plots will appear and where you can view package information, help pages, and much more.
The working directory
- The working directory is an important concept to understand. It is the place where R looks for files and saves them. When you write code for your project, it should refer to files relative to the root of your working directory and should only need files within this structure.
- How will I get my working directory? Use getwd()/setwd()
- Let’s check to see where our current working directory is located by typing into the console:
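A minimal sketch of that check, assuming the Intro-to-R project is open. The path that is printed will differ between accounts and machines, and the variable name current_dir is only for illustration.

```r
# Print the current working directory
current_dir <- getwd()
current_dir

# setwd() changes the working directory; here we simply re-set it
# to the same place, so nothing actually moves
setwd(current_dir)
```

Inside an RStudio project you rarely need setwd(), because opening the project already sets the working directory to the project folder.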
Your working directory should be the Intro-to-R folder
constructed when you created the project. The working directory is where
RStudio will automatically look for any files you bring in and where it
will automatically save any files you create, unless otherwise
specified.
You can visualize your working directory by selecting the
Files tab from the
Files/Plots/Packages/Help window.
If you wanted to choose a different directory to be your working
directory, you could navigate to a different folder in the
Files tab, then, click on the More dropdown
menu which appears as a Cog and select
Set As Working Directory.
Structuring your working directory
To organize your working directory for a particular analysis, you
should separate the original data (raw data) from intermediate datasets.
For instance, you may want to create a data/ directory
within your working directory that stores the raw data, and have a
results/ directory for intermediate datasets and a
figures/ directory for the plots you will generate.
Let’s create these three directories within your working directory by
clicking on New Folder within the Files
tab.
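The same three folders can also be created from the console. This is an optional sketch using base R's dir.create(); it assumes the working directory is already the Intro-to-R project folder.

```r
# Create the three folders inside the current working directory.
# showWarnings = FALSE keeps R quiet if a folder already exists.
for (folder in c("data", "results", "figures")) {
  dir.create(folder, showWarnings = FALSE)
}

# The new folders should now appear in the listing
list.files()
```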
When finished, your working directory should look like:
Setting up
This is more of a housekeeping task. In the future, we may be writing long lines of code in our script editor, and we want the lines to “wrap” so that we don’t have to scroll back and forth to read them.
Click on Code -> Soft Wrap Long Lines (make sure this option is checked)
Interacting with R
Now that we have our interface and directory structure set up, let’s start interacting with R! There are two main ways of interacting with R in RStudio: using the console or using the script editor (plain text files that contain your code).
Console window
The console window (in RStudio, the bottom left panel) is the place where R is waiting for you to tell it what to do, and where it will show the results of a command. You can type commands directly into the console, but they will be forgotten when you close the session.
Let’s test it out:
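The exact commands typed in class are not shown here, so the lines below are only an illustrative guess: simple arithmetic typed at the console prompt, with the result printed right away.

```r
3 + 5     # addition; the console prints [1] 8
12 / 7    # division
```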
- How am I running the line without physically hitting Run? Does anyone know?
Script editor
Best practice is to enter the commands in the script
editor, and save the script. You are encouraged to comment
liberally to describe the commands you are running using #.
This way, you have a complete record of what you did, you can easily
show others how you did it and you can do it again later on if
needed.
The RStudio script editor allows you to ‘send’ the current
line or the currently highlighted text to the R console by clicking on
the Run button in the upper-right-hand corner of the script
editor.
Now let’s try entering commands to the script editor
and using the comments character # to add descriptions and
highlighting the text to run:
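For instance, a commented command in the script editor might look like the sketch below. The wording of the comment is an assumption, chosen to match the “I am” error message discussed in this section.

```r
# I am adding 3 and 5. R is fun!
3 + 5
```

R ignores everything after the # on a line, so only 3 + 5 is evaluated.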
Alternatively, you can run by simply pressing the Ctrl
and Return/Enter keys at the same time as a shortcut.
You should see the command run in the console and output the result.
What happens if we do that same command without the comment symbol
#? Re-run the command after removing the # sign in the
front:
Now R is trying to run that sentence as a command, and it doesn’t work. We get an error in the console: “Error: unexpected symbol in "I am"”. This means that the R interpreter did not know what to do with that command.
Naming variables
Objects can be given any name such as x,
current_temperature, or subject_id. You want
your object names to be explicit and not too long. They cannot start
with a number (2x is not valid, but x2 is). R
is case sensitive (e.g., weight_kg is different from
Weight_kg). There are some names that cannot be used
because they are the names of fundamental functions in R (e.g.,
if, else, for; see the Reserved
Words in R manual page for a complete list).
It’s also best to avoid dots (.) within an object name, as in
my.dataset.
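A short sketch of these naming rules. The values are arbitrary, and the helper variable is_valid is hypothetical; it uses parse() and tryCatch() only so that the invalid name can be checked without stopping the script.

```r
x2 <- 10           # valid: the name starts with a letter
weight_kg <- 55
Weight_kg <- 60    # a different object, because R is case sensitive
weight_kg == Weight_kg   # FALSE

# "2x <- 10" is not valid R code because the name starts with a number.
# parse() checks the syntax; tryCatch() turns the error into FALSE.
is_valid <- tryCatch({
  parse(text = "2x <- 10")
  TRUE
}, error = function(e) FALSE)
is_valid   # FALSE
```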
The R syntax
Now that we know how to talk with R via the script editor or the console, we want to use R for something more than adding numbers. To do this, we need to know more about the R syntax.
The main “parts of speech” in R (syntax) include:
- the comments # and how they are used to document a function and its content
- variables and functions
- the assignment operator <-
We will go through each of these “parts of speech” in more detail, starting with the assignment operator.
To do useful and interesting things in R, we need to assign
values to variables using the assignment operator,
<-.
- Typing the object name (weight_kg) will give you a value on the console
- Now weight_kg has been memorized by R
# assign values to objects (let's make it useful)
weight_kg <- 50
weight_kg # this object is "memorized" by R; it lives in the global environment
The assignment operator (<-) assigns values
on the right to variables on the left.
When assigning a value to a variable, R does not print anything to the console. You can force R to print the value by wrapping the assignment in parentheses or by typing the variable name.
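The two ways of printing a value can be sketched as follows, reusing weight_kg from the example above:

```r
weight_kg <- 50      # assigns silently; nothing is printed
(weight_kg <- 50)    # parentheses: assign and print in one step
weight_kg            # typing the name also prints the value
```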
In RStudio, typing Alt + - (push Alt at
the same time as the - key; on a Mac, press option
and the - key) will write <- in a
single keystroke.
Variables
A variable is a symbolic name for (or reference to) information. Variables in computer programming are analogous to “buckets”, where information can be maintained and referenced. On the outside of the bucket is a name. When referring to the bucket, we use the name of the bucket, not the data stored in the bucket.
Let’s create another variable.
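The variable created in class is not shown here, so the following is a hypothetical example: weight_lb stores the weight converted to pounds (using an approximate factor of 2.2) and keeps its value even when weight_kg changes later.

```r
weight_kg <- 50

# Hypothetical second variable: the same weight expressed in pounds
weight_lb <- 2.2 * weight_kg
weight_lb

# Changing weight_kg afterwards does not update weight_lb;
# the "bucket" keeps whatever was stored in it
weight_kg <- 100
weight_lb
```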
Functions and their arguments
- Don’t be a hero, read what’s in front of you – while writing the word Functions on the screen
Functions are “canned scripts” that automate more complicated sets of
commands including operations, assignments, etc. Many functions are
predefined, or can be made available by importing R packages (more on
that later). A function usually gets one or more inputs called
arguments. Functions often (but not always) return a value. A typical
example would be the function round().
- Do round(3.1234)
- I want to modify the default behavior of this function by adding an argument which will allow for additional digits, how can I get additional information to do this?
- Contrast to getwd() – no argument required
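The prompts above can be sketched as follows. args() and the ? help operator are standard ways to discover a function's arguments; the help call is commented out so the script runs without opening a help page.

```r
round(3.1234)               # default: rounds to 0 decimal places
args(round)                 # lists the arguments round() accepts
# ?round                    # opens the help page (uncomment in RStudio)

round(3.1234, digits = 2)   # keep two decimal places
round(3.1234, 2)            # same call, argument matched by position

getwd()                     # a function that needs no argument at all
```

Naming the argument (digits = 2) is usually clearer to a reader than relying on position.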
Vectors and data types
- Don’t be a hero, read what’s in front of you – while writing the words #Vectors and data types on the screen
A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. A vector is composed of a series of values, such as numbers or characters. We can assign a series of values to a vector using the c() function. For example, we can create a vector of animal weights and assign it to a new object weight_g:
# Vectors (objects) and data types
weight_g <- c(50, 60, 65, 82) # use of the combine function to assign a series of values to a vector
weight_g # this is a numeric vector
- Don’t run it and get the error “object ‘weight_g’ not found”
- Ask students what the error means and how to fix it
- Continue with the lesson
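A vector can also hold text. The sketch below creates the character vector that the following paragraphs refer to; the exact values are inferred from the surrounding text rather than taken from the original code.

```r
# A character vector: each value must be wrapped in quotes
molecules <- c("dna", "rna", "protein")
molecules
```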
The quotes around “dna”, “rna”, etc. are essential here. Without the quotes R will assume there are objects called dna, rna and protein. As these objects don’t exist in R’s memory, there will be an error message.
An important feature of a vector is that all of the elements are the
same type of data. The function class() indicates the class
(the type of element) of an object:
class(weight_g) # class() reports the type of element the object holds
class(molecules)
str(weight_g) # provides an overview of the structure of an object
You can use the c() function to add other elements to
your vector:
Subsetting Vectors
molecules <- c(molecules[1:2], "peptide", molecules[3])
R vectors don’t have an “insert” command. To add something in the middle, we rebuild the vector by combining the parts before and after the insertion point. The combine function takes individual values or vectors and combines them into a single vector; it’s like the glue.
Or I could use brackets:
# subsetting - addition of glycerol
molecules <- c(molecules[1:3], "glycerol", molecules[4])
# brackets in R are how we say "I only want this part"
molecules[2] # specifying position 2 of the object molecules
molecules[3,2] # error (incorrect number of dimensions): [row, column] indexing assumes a 2-dimensional object
# such as a data frame - give me the value in row 3, column 2
molecules[c(3,2)] # molecules is a vector with one dimension, so c(3,2) is interpreted as a vector of positions - give me the 3rd and 2nd elements
This is powerful because it means:
- You can reorder elements
- You can duplicate elements
- You can select any combination of positions
Finally, it is also possible to get all the elements of a vector except some specified elements using negative indices:
Brackets in R are how we say “I only want part of this object.”
- Positive numbers → keep those positions
- Negative numbers → remove those positions
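The points above can be sketched as follows, assuming molecules holds the five values built up so far:

```r
molecules <- c("dna", "rna", "peptide", "glycerol", "protein")

molecules[c(5, 1)]      # reorder: "protein" "dna"
molecules[c(1, 1, 2)]   # duplicate: "dna" "dna" "rna"
molecules[-1]           # everything except the 1st element
molecules[-c(1, 3)]     # everything except the 1st and 3rd elements
```

None of these calls change molecules itself; to keep a result, assign it to a name.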
Conditional subsetting
So far, we’ve selected values from a vector using positions (like the 1st or 3rd element). Conditional subsetting lets us select values based on rules rather than positions, using a logical vector. TRUE will select the element with the same index, while FALSE will not:
# Conditional subsetting; selection of values based on RULES
weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, TRUE, TRUE, FALSE)] # use of a logical vector
weight_g[weight_g > 50] # select values above 50
weight_g[weight_g < 30 | weight_g > 50] # at least one of the conditions is true
weight_g[weight_g >= 30 & weight_g == 21] # == tests equality; both conditions would have to be true, which can never happen here
The last two expressions use conditional subsetting to filter a vector based on multiple rules.
- weight_g >= 30 checks which values are 30 or greater
- weight_g == 21 checks which values are exactly 21
- | means at least one condition must be TRUE (OR)
- & means both conditions must be TRUE at the same time (AND)
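It can help to look at the intermediate logical vector before using it inside the brackets. A small sketch, where is_heavy and none are hypothetical names used only for illustration:

```r
weight_g <- c(21, 34, 39, 54, 55)

# The comparison itself returns one TRUE/FALSE per element
is_heavy <- weight_g > 50
is_heavy             # FALSE FALSE FALSE TRUE TRUE

# Using that logical vector inside [] keeps the TRUE positions
weight_g[is_heavy]   # 54 55

# A rule that no element satisfies returns an empty vector, not an error
none <- weight_g[weight_g >= 30 & weight_g == 21]
length(none)         # 0
```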
Data Types
Variables can contain values of specific types within R. The six data types that R uses include:
- "numeric" for any numerical value, including whole numbers and decimals. This is the most common data type for performing mathematical operations.
- "character" for text values, denoted by using quotes ("") around the value. For instance, while 5 is a numeric value, if you were to put quotation marks around it, it would turn into a character value, and you could no longer use it for mathematical operations. Single or double quotes both work, as long as the same type is used at the beginning and end of the character value.
- "integer" for whole numbers (e.g., 2L; the L indicates to R that it’s an integer). It behaves similarly to the numeric data type for most tasks or functions; however, it takes up less storage space than numeric data, so tools will often output integers if the data is known to consist of whole numbers. If you wanted to create your own, you could do so by providing the whole number, followed by an upper-case L.
- "logical" for TRUE and FALSE (the Boolean data type). The logical data type can be specified using four values: TRUE in all capital letters, FALSE in all capital letters, a single capital T, or a single capital F.
- "complex" to represent complex numbers with real and imaginary parts (e.g., 1+4i), and that’s all we’re going to say about them.
- "raw", which we won’t discuss further.
The table below provides examples of each of the commonly used data types:
| Data Type | Examples |
|---|---|
| Numeric: | 1, 1.5, 20, pi |
| Character: | “anytext”, “5”, “TRUE” |
| Logical: | TRUE, FALSE, T, F |
The type of data will determine what you can do with it. For example, if you want to perform mathematical operations, then your data should be numeric or integer rather than character. If instead you want to search for a word or pattern in your data, then your data should be of the character data type. The task or function being performed on the data will determine what type of data can be used.
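A quick way to see these types in practice is to ask class() about a few example values. The sketch below also shows the difference between the number 5 and the text "5"; the variable names five_text and five_number are only for illustration.

```r
class(1.5)               # "numeric"
class("anytext")         # "character"
class(2L)                # "integer"
class(TRUE)              # "logical"
class(1+4i)              # "complex"
class(charToRaw("A"))    # "raw"

# "5" is text, so it has to be converted before doing maths with it
five_text <- "5"
five_number <- as.numeric(five_text)
five_number + 1          # 6
```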
Missing Data
Let’s transition: R was designed to analyze ….
There are a few ways to deal with NAs; the ones we will discuss are
na.rm, is.na(), and na.omit()
# Dealing with Missing Data - NAs
## Option 1: na.rm - ignore NAs during a calculation
heights <- c(2, 4, 4, NA, 6)
mean(heights) # returns NA because one value is missing
mean(heights, na.rm = TRUE) # drops the NA before averaging
## Option 2: is.na() - identify missing values
is.na(heights) # asks a yes/no question for each value
# So at this point, nothing has been removed
# R is just identifying which values are missing.
!is.na(heights) # the ! means NOT; flips the logical values
heights[!is.na(heights)]
# The square brackets [] are used for subsetting
# R keeps only the elements where the condition is TRUE
is.na() finds the missing values, ! flips
the logic, and the brackets keep only the values that
are not missing.
## Option 3: na.omit() - returns a copy with the NAs removed
na.omit(heights)
# Returns a new object with NAs removed; heights itself is unchanged
# Works on vectors, data frames, and matrices
# For data frames: removes entire rows with any NA
Class Exercise
- Using this vector of heights in inches, create a new vector with the NAs removed.
- Use the function median() to calculate the median of the heights vector.
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
- The materials used in this lesson are adapted from work that is Copyright © Data Carpentry (http://datacarpentry.org/). All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).