RSeQC (L16)
Learning Objectives
- Use of RSeQC (a package of scripts) to evaluate the quality of RNA-Seq data
Recap from last week
HISAT2 produces multiple output files:
.bam
: unsorted bam filesorted.bam
: sorted bam file.bam.bai
: index for sorted bam file.log
: log file with containing detailed information about the run. This file is most useful for troubleshooting and debugging.
Planning and Organization
For each experiment you work on and analyze data for, it is considered best practice to get organized by creating a planned storage space (directory structure).
The outputs from HISAT2 were many and looked like this:
Irrel_kd_1.subset.bam Irrel_kd_2.subset_sorted.bam.bai Mov10_oe_1.subset.bam Mov10_oe_2.subset_sorted.bam.bai
Irrel_kd_1.subset.log Irrel_kd_2.subset.txt Mov10_oe_1.subset.log Mov10_oe_2.subset.txt
Irrel_kd_1.subset_sorted.bam Irrel_kd_3.subset.bam Mov10_oe_1.subset_sorted.bam Mov10_oe_3.subset.bam
Irrel_kd_1.subset_sorted.bam.bai Irrel_kd_3.subset.log Mov10_oe_1.subset_sorted.bam.bai Mov10_oe_3.subset.log
Irrel_kd_1.subset.txt Irrel_kd_3.subset_sorted.bam Mov10_oe_1.subset.txt Mov10_oe_3.subset_sorted.bam
Irrel_kd_2.subset.bam Irrel_kd_3.subset_sorted.bam.bai Mov10_oe_2.subset.bam Mov10_oe_3.subset_sorted.bam.bai
Irrel_kd_2.subset.log Irrel_kd_3.subset.txt Mov10_oe_2.subset.log Mov10_oe_3.subset.txt
Irrel_kd_2.subset_sorted.bam Mov10_oe_2.subset_sorted.bam
These data outputs have been organized into the following folders:
RSeQC_exercise
├── bams
├── logs
├── RSeQC_bed_files
├── scripts
└── sorted_bams
logs
: used to keep track of the commands run and the specific parameters used, but also to have a record of any standard output that is generated while running the command.RSeQC_bed_files
: contain BED12 files created from GTF file
RSeQC
Today we will be using RSeQC. RSeQC is a package of scripts designed to evaluate the quality of RNA-seq data. We will run the following: bam_stat.py
, infer_experiment.py
, read_distribution.py
.
The majority of RSeQC scripts generate output files which can be plotted and summarized in the MultiQC report.
Getting Started
Last week we installed RSeQC
using conda. To load RSeQC, we will need to perform:
conda activate rseqc_env
To test that everything is in working order we can run:
infer_experiment.py --help
Expected Output:
Usage: infer_experiment.py [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILE, --input-file=INPUT_FILE
Input alignment file in SAM or BAM format
-r REFGENE_BED, --refgene=REFGENE_BED
Reference gene model in bed fomat.
-s SAMPLE_SIZE, --sample-size=SAMPLE_SIZE
Number of reads sampled from SAM/BAM file.
default=200000
-q MAP_QUAL, --mapq=MAP_QUAL
Minimum mapping quality (phred scaled) for an
alignment to be considered as "uniquely mapped".
default=30
(rseqc_env) [pdrodrig@vacc-login4 RSeQC_exercise]$
Download dataset for today's lesson:
/gpfs1/cl/mmg3320/course_materials/RSeQC_exercise
Strand-Specificity
Purpose: This script predicts the “strandedness” of the protocol (i.e. unstranded, sense or antisense) that was used to prepare the sample for sequencing by assessing the orientation in which aligned reads overlay gene features in the reference genome. This information is not always available in public datasets.
Why use it?:
- Crucial for differential expression analysis, as strand-specific libraries require correct orientation for accurate read assignment.
- Avoids misinterpretation of gene expression levels caused by incorrect strand assignment
Basic Usage:
infer_experiment.py -r ref.bed -i input.bam
- -i input.bam -> specifies the sorted BAM file with aligned RNA-Seq reads
- -r ref.bed -> specifies the BED file containing gene annotation (in BED12 format)
Expected Output:
.infer_experiment.txt
: File containing fraction of reads mapping to given strandedness configurations.
Class Exercise #1: infer_experiment.py
- Create a directory within RSeQC_exercise called
IKD1_RSeQC
- Move
Irrel_kd_1.subset_sorted.bam
andIrrel_kd_1.subset_sorted.bam.bai
intoIKD1_RSeQC
- Within the
RSeQC_exercise/IKD1_RSeQC
folder runinfer_experiment.py
onIrrel_kd_1.subset_sorted.bam
. - Use the hg38 bed12 file
- Be sure to redirect the final output to
Irrel_kd_1.subset_infer.txt
. You will need this file for our final step.
BAM Statistics
Purpose: Provides a summary of the alignment results from a BAM file
Why use it?:
- Quickly checks overall alignment quality
- Reports metrics such as total reads, mapped reads, properly paired reads (if appropriate), and uniquely mapped reads
- Helps identify issues like low mapping rates
Basic Usage:
bam_stat.py -i input.bam
- -i input.bam -> specifies the sorted BAM file with aligned RNA-Seq reads
Expected Output:
- Summary statistics of the BAM file, including:
- Total number of reads
- Mapped reads
- Properly paired reads
- Uniquely mapped reads
- Reads mapped to multiple loci
Interpretation:
- High mapping rates (~95%) are expected for well-prepared RNA-Seq libraries.
- A low mapping rate (<80%) could indicate issues with genome annotation or contamination.
Class Exercise #2: bam_stat.py
- Within the
RSeQC_exercise/IKD1_RSeQC
folder runbam_stat.py
onIrrel_kd_1.subset_sorted.bam
. - Be sure to redirect the final output to
Irrel_kd_1.subset_bamstat.txt
. You will need this file for our final step.
Splice Junction Detection
Purpose: Used to assess the sequencing depth and completeness of splice junction detection in RNA-Seq data. It helps to determine whether an RNA-Seq experiment has sufficient coverage to detect splice junctions reliably.
Why use it?:
- Aids in determining if additional sequencing would improve splice junction discovery
- Ensures that the experiment is capturing enough splice junctions for meaningful downstream analysis
Basic Usage:
junction_saturation.py -i input.bam -r reference.bed -o output_prefix
- -i input.bam -> specifies the sorted BAM file with aligned RNA-Seq reads
- -r reference BED file with gene annotation
- -o output_prefix -> prefix of output files
Expected Output:
- PDF file of Junction Saturation Plot; helps you assess whether sequencing depth is sufficient or if more reads are needed to discover new junctions
- The .r (R script) used to generate the PDF file
Interpretation:
- X-axis -> Read depth (number of reads supporting a junction).
- Y-axis -> Number of detected splice junctions.
- Curve shape:
- If the curve plateaus -> You have sufficient sequencing depth; additional reads won’t improve junction detection.
- If the curve keeps increasing -> More sequencing might still reveal new junctions.
Class Exercise #3: junction_saturation.py
- Within the
RSeQC_exercise/IKD1_RSeQC
folder runjunction_saturation.py
onIrrel_kd_1.subset_sorted.bam
. - Use
IKD1-output
as the output_prefix - Do not generate a .txt file for this step
Splice Junction Annotation
Purpose: Identifies and annotates splice junctions by comparing detected junctions to known junctions in a reference genome annotation (GTF/GFF)
Why use it?:
- Validates detected splice sites against known annotations
- Detects novel splice junctions, which may indicate alternative splicing or sequencing errors
- Identifies unexpected splicing patterns
Basic Usage:
junction_annotation.py -i input.bam -o output -r reference.bed
- -i input.bam -> specifies the sorted BAM file with aligned RNA-Seq reads
- -r reference.bed -> specifies the BED file containing gene annotation (in BED12 format)
- -o output_prefix -> prefix of output files
Expected Output:
- A summary of detected splice junctions, categorized into:
- Known (found in the reference)
- Novel (newly identified)
- Unannotated (potential mapping issues)
Interpretation:
- A high fraction of novel junctions may suggest alternative splicing events.
- Too many unannotated junctions may indicate alignment problems.
Class Exercise #4: junction_annotation.py
- Within the
RSeQC_exercise/IKD1_RSeQC
folder runjunction_annotation.py
onIrrel_kd_1.subset_sorted.bam
. - Use
IKD1-output
as the output_prefix - Be sure to redirect the final output to
Irrel_kd_1.subset_janno.txt
. You will need this file for our final step.
Read Distribution Across Genomic Features (Class Exercise #5)
Purpose: Analyzes how reads are distributed across different genomic features (e.g., exons, introns, UTRs, intergenic regions).
Why use it?:
- Ensures the expected proportion of reads falls within exons for RNA-Seq experiments
Basic Usage:
read_distribution.py -i input.bam -r ref.bed
- -i input.bam -> specifies the sorted BAM file with aligned RNA-Seq reads
- -r reference.bed -> specifies the BED file containing gene annotation (in BED12 format)
Expected Output:
- The fraction of reads mapping to different genomic regions, including:
- Exons
- Introns
- Intergenic regions
- 5' and 3' UTRs
Interpretation:
- For RNA-Seq, most reads (>70%) should align to exons.
- A high proportion of intronic reads (>30%) may indicate rRNA contamination or unprocessed RNA.
Class Exercise #5: read_distribution.py
- Within the
RSeQC_exercise/IKD1_RSeQC
folder runread_distribution.py
onIrrel_kd_1.subset_sorted.bam
. - Be sure to redirect the final output to
Irrel_kd_1.subset_read.txt
You will need this file for our final step.
Final Step: Run MultiQC
-
Navigate to top
RSeQC_exercise
folder. Make sure you are not in any of the subdirectories folders. -
Load
py-multiqc/1.15-fmpaaj7
-
Run multiqc. Be sure to name the output file
RSeQC_stats-multiqc
-
Take a look at the generated HTML file.