Trimming and Filtering (L10)

Learning Objectives

Learn to clean FASTQ reads using Trimmomatic if required

Introduction

During the last class, we took a high-level look at the quality of each of our samples using FastQC. We visualized per-base quality graphs showing the distribution of read quality at each base across all reads in a sample and extracted information about which samples fail which quality checks. Some of our samples failed quite a few quality metrics used by FastQC. This does not mean, though, that our samples should be thrown out! It is very common to have some quality metrics fail, and this may or may not be a problem for your downstream application.

Now, we want to make sure that as many reads as possible map or align accurately to the genome. To ensure accuracy, only a small number of mismatches between the read sequence and the genome sequence are allowed, and any read with more than a few mismatches will be marked as being unaligned.

Therefore, to make sure that all the reads in the dataset have a chance to map/align to the genome, unwanted information can be trimmed off from every read, one read at a time. The types of unwanted information can include one or more of the following:

leftover adapter sequences
known contaminants (strings of As/Ts, other sequences)
poor quality bases

Where does this "unwanted information" come from?

During the sequencing process, adapter sequences are added to the ends of the reads. These sequences are used to attach the RNA molecules to the sequencing platform but they do not map to regions of the genome. These adapter sequences can interfere with downstream analysis and lead to inaccurate results.
RNA-seq data may contain low-quality reads that are caused by sequencing errors, PCR bias, or other factors such as poor sample quality. By removing these low-quality reads researcher can ensure that the data is of high quality.

Steps of trimming data

Step 1: Quality Control

The first step in trimming RNA-seq data is to assess the quality of the raw reads. This can be done using software such as FastQC, which generates a report that includes information on read length, GC content, base quality, and sequencing adapter contamination. If the data is of poor quality, it may need to be re-sequenced or excluded from further analysis.

Step 2: Adapter Trimming The next step is to remove adapter sequences from the reads. These sequences are added during library preparation and can interfere with downstream analysis. Many tools (some listed below) can be used to trim adapters from the reads by searching for the adapter sequence and removing it.

Step 2b: Quality Trimming During adapter trimming, we could also remove low-quality reads by setting a minimum quality threshold and removing any bases that fall below this threshold. Trimming tools often have built-in quality trimming options that can be used for this purpose.

Step 2c: Short read filtering

Reads that are too short after trimming can also be removed. This step is important to ensure that only high-quality reads are used for downstream analysis. Trimming tools also have this built-in option as well.

Tools for Trimming

There are a number of tools that can be used for read trimming, some include:

They have a varying range of clipping and trimming features, but for the most part they all work similarly.

Trimming is not always required

There are some aligners that are available which will "soft-clip" low-quality bases or adapter sequences during alignment. If you are working with short reads (<50 bp), trimming can actually prevent the aligner from discarding poor-quality reads.

We will compare some aligners next week

In the below class exercise we will be using Trimmomatic.

Trimmomatic options

module load gcc/13.3.0-xp3epyt

Search for the module with:

module avail

Use the following to check that the program was loaded

module list

Trimmomatic has a variety of options to trim your reads. If we run the following command, we can see some of our options.

trimmomatic

Usage: 
       PE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] [-validatePairs] [-basein <inputBase> | <inputFile1> <inputFile2>] [-baseout <outputBase> | <outputFile1P> <outputFile1U> <outputFile2P> <outputFile2U>] <trimmer1>...
   or: 
       SE [-version] [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-summary <statsSummaryFile>] [-quiet] <inputFile> <outputFile> <trimmer1>...

This output shows us that we must first specify whether we have paired end (PE) or single end (SE) reads. Next, we specify what flag we would like to run. For example, you can specify threads to indicate the number of processors on your computer that you want Trimmomatic to use. In most cases using multiple threads (processors) can help to run the trimming faster. These flags are not necessary, but they can give you more control over the command. The flags are followed by positional arguments, meaning the order in which you specify them is important. In paired end mode, Trimmomatic expects the two input files, and then the names of the output files. These files are described below. While, in single end mode, Trimmomatic will expect 1 file as input, after which you can enter the optional settings and lastly the name of the output file.

option	meaning
inputFile1	Input reads to be trimmed. Typically the file name will contain an _1 or _R1 in the name.
inputFile2	Input reads to be trimmed. Typically the file name will contain an _2 or _R2 in the name.
outputFile1P	Output file that contains surviving pairs from the _1 file.
outputFile1U	Output file that contains orphaned reads from the _1 file.
outputFile2P	Output file that contains surviving pairs from the _2 file.
outputFile2U	Output file that contains orphaned reads from the _2 file.

In addition, trimmomatic expects to see is the trimming parameters:

step	meaning
ILLUMINACLIP	Perform adapter removal.
SLIDINGWINDOW	Perform sliding window trimming, cutting once the average quality within the window falls below a threshold.
LEADING	Cut bases off the start of a read, if below a threshold quality.
TRAILING	Cut bases off the end of a read, if below a threshold quality.
CROP	Cut the read to a specified length.
HEADCROP	Cut the specified number of bases from the start of the read.
MINLEN	Drop an entire read if it is below a specified length.
TOPHRED33	Convert quality scores to Phred-33.
TOPHRED64	Convert quality scores to Phred-64.

We will use only a few of these options and trimming steps in our analysis. It is important to understand the steps you are using to clean your data. For more information about the Trimmomatic arguments and options, see the Trimmomatic manual.

However, a complete command for Trimmomatic will look something like the command below. This command is an example and will not work, as we do not have the files it refers to:

trimmomatic PE SRR_1056_1.fastq SRR_1056_2.fastq  \
        SRR_1056_1.trimmed.fastq SRR_1056_1un.trimmed.fastq \
        SRR_1056_2.trimmed.fastq SRR_1056_2un.trimmed.fastq \
        SLIDINGWINDOW:4:20 ILLUMINACLIP:SRR_adapters.fa

In this example, we have told Trimmomatic:

code	meaning
PE	that it will be taking a paired end file as input
SRR_1056_1.fastq	the first input file name
SRR_1056_2.fastq	the second input file name
SRR_1056_1.trimmed.fastq	the output file for surviving pairs from the _1 file
SRR_1056_1un.trimmed.fastq	the output file for orphaned reads from the _1 file
SRR_1056_2.trimmed.fastq	the output file for surviving pairs from the _2 file
SRR_1056_2un.trimmed.fastq	the output file for orphaned reads from the _2 file
SLIDINGWINDOW:4:20	to use a sliding window of size 4 that will remove bases if their phred score is below 20
ILLUMINACLIP:SRR_adapters.fa	to clip the Illumina adapters from the input file using the adapter sequences listed in SRR_adapters.fa

Note

Some of the commands we ran in this lesson are long! When typing a long command into your terminal, you can use the \ character to separate code chunks onto separate lines. This can make your code more readable.

Running Trimmomatic

Class Exercise

This Exercise will take ~20 mins. Please work with your neighbor if you have questions. I will begin answering questions at the 5 minute mark.

Part A: Make a copy of the exercise materials. The folder can be found in this location:

/gpfs1/cl/mmg3320/course_materials/trimmomatic_exercise

Part B: Modify the trim.sh script using either Jupyter Notebook or Nano to run trimmomatic on the provided FASTQ sample. Ensure that you include the arguments listed below, using the exact syntax required by trimmomatic.

argument to add	meaning
PE	that it will be taking a paired end file as input
inputFile1	SRR2589044_1.fastq.gz
inputFile2	SRR2589044_2.fastq.gz
outputFile1P	Output file that contains surviving pairs from the _1 file.
outputFile1U	Output file that contains orphaned reads from the _1 file.
outputFile2P	Output file that contains surviving pairs from the _2 file.
outputFile2U	Output file that contains orphaned reads from the _2 file.
SLIDINGWINDOW:4:20	to use a sliding window of size 4 that will remove bases if their phred score is below 20
filter short reads	drop an entire read if it is below 25
adapter to remove	NexteraPE-PE.fa:2:40:15

Note: We are adding additional parameters the NexteraPE-PE.Fa adapter

NexteraPE-PE.fa : This is a FASTA file that contains adapter sequences specific to Nextera paired-end libraries.
2:40:15
- 2 : Seed mismatch threshold (maximum number of mismatches allowed in the adapter sequence match).
- 40 : Palindrome mode threshold (minimum match length for identifying adapter sequences when the paired-end reads overlap).
- 15 : Simple adapter trimming threshold (minimum length of a match required to trigger adapter removal).

Once you have a working script, run it with the following command:

sh trim.sh

You will know that your script is correct and working properly if you see the following output:

TrimmomaticPE: Started with arguments:
SRR2589044_1.fastq.gz SRR2589044_2.fastq.gz SRR2589044_1.trim.fastq.gz SRR2589044_1un.trim.fastq.gz SRR2589044_2.trim.fastq.gz SRR2589044_2un.trim.fastq.gz SLIDINGWINDOW:4:20 MINLEN:25 ILLUMINACLIP:trimmomatic_adapters/NexteraPE-PE.fa:2:40:15
Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33

Input Read Pairs: 1107090 Both Surviving: 885220 (79.96%) Forward Only Surviving: 216472 (19.55%) Reverse Only Surviving: 2850 (0.26%) Dropped: 2548 (0.23%)
TrimmomaticPE: Completed successfully

Part C: Check that the file outputs are the correct size

total 453M
-rw-r--r-- 1 pdrodrig pi-jdragon 124M Feb 11 14:30 SRR2589044_1.fastq.gz
-rw-r--r-- 1 pdrodrig pi-jdragon  94M Feb 11 18:32 SRR2589044_1.trim.fastq.gz
-rw-r--r-- 1 pdrodrig pi-jdragon  18M Feb 11 18:32 SRR2589044_1un.trim.fastq.gz
-rw-r--r-- 1 pdrodrig pi-jdragon 128M Feb 11 14:30 SRR2589044_2.fastq.gz
-rw-r--r-- 1 pdrodrig pi-jdragon  91M Feb 11 18:32 SRR2589044_2.trim.fastq.gz
-rw-r--r-- 1 pdrodrig pi-jdragon 271K Feb 11 18:32 SRR2589044_2un.trim.fastq.gz

Part D: Run FASTQC on all the .gz files and visualize the HTML files to see whether your per base sequence quality is higher after trimming. In addition, please note if the adapter content/contamination has been removed.