Skip to content

HTSeq & deepTools (L16)

Learning Objectives

  • Understand the purpose of read counting in RNA-Seq analysis
  • Describe how HTseq assigns reads to genomic features
  • Distinguish between union, intersection-strict, and intersection-nonempty counting modes
  • Interpret HTSeq output files

Counting reads as a measure of gene expression

Once we have our reads aligned to the genome, the next step is to count how many reads have mapped to each gene. Many tools can take BAM files as input and output the number of reads (counts) associated with each feature of interest (genes, exons, transcripts, etc.). Two commonly used counting tools are featureCounts and htseq-count.

shells
  • These tools report raw counts, meaning they only count reads that map uniquely to a single location in the genome. They are most effective for counting at the gene level. In this approach, the total read count for a gene (meta-feature) is the sum of reads assigned to each of its exons (features).

  • Some other tools account for multiple transcripts per gene, assigning fractional counts instead of whole numbers. For example, if one read aligns to two transcripts, it may be counted as 0.5 for each transcript rather than a whole number.

  • There are also tools that count multi-mapping reads, but this approach can lead to over-counting, which affects normalization and ultimately compromises the accuracy of differential gene expression analysis.

Input for counting = multiple BAM files + 1 GTF file

Simply speaking, the genomic coordinates of where the read is mapped (BAM) are cross-referenced with the genomic coordinates of whichever feature you are interested in counting expression of (GTF), it can be exons, genes or transcripts.

shells

Output of counting = A count matrix, with genes as rows and samples are columns

These are the "raw" counts and will be used in statistical programs downstream for differential gene expression. Below is a representative counts matrix.

shells

We will begin by demonstrating the usage of HTSeq. HTSeq-count is a command-line tool used in RNA-Seq data analysis to count the number of sequencing reads that align to specific genomic features, such as genes. It is part of the HTSeq Python package and is commonly used in differential gene expression analysis pipelines.

Basic Workflow

Input Files:

  • SAM/BAM file: Contains aligned RNA-Seq reads (output from tools like HISAT2 or STAR).
  • GTF/GFF file: Contains gene annotations specifying genomic coordinates.

Counting Reads:

  • HTSeq-count assigns each read to a genomic feature (usually a gene) based on predefined rules.
  • It outputs a table listing each gene and the number of reads assigned to it.

Output:

  • A tab-delimited text file with two columns:
    • Gene ID
    • Read count

Counting reads in features with htseq-count

Given a file with aligned sequencing reads and a list of genomic features, a common task is to count how many reads map to each feature.

Here, a feature refers to a specific interval (i.e., a range of positions) on a chromosome or a union of such intervals.

Since our example data comes from an RNA-Seq experiment, we aim to count how many reads fall within the exonic regions of each gene. To do this, we first need information about exon positions, which can be obtained from GTF files—a common format provided by Ensembl (available here).

Special care must be taken when handling reads that align to or overlap with multiple features. The htseq-count script offers three different modes to handle such cases.

  1. Union (Recommended for most cases):

    • A read is assigned to a feature if any part of the read overlaps with it.
    • If a read overlaps multiple features, it is not counted at all (to avoid ambiguity).
  2. Intersection-strict:

    • A read is assigned to a feature only if every position of the read overlaps with that feature.
    • If a read overlaps multiple features but not completely within one, it is not counted.
  3. Intersection-nonempty:

    • A read is assigned to a feature only if it overlaps with at least one feature at every position.
    • Unlike intersection-strict, it ignores positions where no features are present.

The figure below illustrates the effect of these three modes:

shells

Running htseq-count

Begin by loading the htseq-count module with module load

module load gcc/13.3.0-xp3epyt 
py-htseq/2.0.3-mb7ap7s

Run the following to check the module is loaded:

htseq-count --help
usage: htseq-count [-h] [--version] [-f {sam,bam,auto}] [-r {pos,name}] [--max-reads-in-buffer MAX_BUFFER_SIZE] [-s {yes,no,reverse}] [-a MINAQUAL]
                   [-t FEATURE_TYPE] [-i IDATTR] [--additional-attr ADDITIONAL_ATTRIBUTES] [--add-chromosome-info]
                   [-m {union,intersection-strict,intersection-nonempty}] [--nonunique {none,all,fraction,random}]
                   [--secondary-alignments {score,ignore}] [--supplementary-alignments {score,ignore}] [-o SAMOUTS] [-p {SAM,BAM,sam,bam}]
                   [-d OUTPUT_DELIMITER] [-c OUTPUT_FILENAME] [--counts-output-sparse] [--append-output] [-n NPROCESSES] [--feature-query FEATURE_QUERY]
                   [-q] [--with-header]
                   samfilenames [samfilenames ...] featuresfilename
This script takes one or more alignment files in SAM/BAM format and a feature file in GFF format and calculates for each feature the number of reads
mapping to it. See http://htseq.readthedocs.io/en/master/count.html for details.

Basic Command Syntax

htseq-count -f bam -s no -i gene_id sample.bam genes.gtf > gene_counts.txt
  • -f bam → Specifies input format (can be sam or bam).
  • -s no → Defines strandedness (yes, no, or reverse).
  • -i gene_id → used to identify the GTF feature attribute (gene_name or gene_id)
  • aligned_reads.bam → Input file with mapped reads.
  • genes.gtf → Reference annotation file.
  • > gene_counts.txt → Redirects output to a file.

Options

-f <format>, --format=<format>

    Format of the input data. Possible values are sam (for text SAM files) and bam (for binary BAM files). Default is sam.
-s <yes/no/reverse>, --stranded=<yes/no/reverse>

    whether the data is from a strand-specific assay (default: `yes`)

    For stranded=no, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse, these rules are reversed.
-i <id attribute>, --idattr=<id attribute>

    GFF attribute to be used as feature ID. Several GFF lines with the same feature ID will be considered as parts of the same feature. The feature ID is used to identity the counts in the output table. The default, suitable for RNA-Seq analysis using an Ensembl GTF file, is `gene_id`.

Key Considerations

Stranded vs. Unstranded Data:

  • Many RNA-Seq protocols preserve strand information. Setting -s yes or -s reverse ensures proper assignment.

Feature Type:

  • HTSeq-count assigns reads based on feature-type (default is exon in GTF). If needed, use -t to specify other features.

Overlap Mode:

  • Some reads may overlap multiple features; the -m flag determines how they are assigned (union, intersection-strict, intersection-nonempty).

Limitations & Alternatives

  • HTSeq-count is a gene-level summarization tool; it does not handle transcript-level quantification.
  • It requires sorted BAM/SAM files.
  • Alternative tools like featureCounts (..which is faster) or Salmon/RSEM (for transcript-level quantification) may be preferred in some cases.

Class Exercise Part A

Folder Content htseq_demo

├── bams
│   ├── KO_hg19_rep2_sorted.bam
│   ├── KO_hg19_rep2_sorted.bam.bai
│   ├── KO_hg19_rep3_sorted.bam
│   ├── KO_hg19_rep3_sorted.bam.bai
│   ├── WT_hg19_rep1_sorted.bam
│   ├── WT_hg19_rep1_sorted.bam.bai
│   ├── WT_hg19_rep2_sorted.bam
│   ├── WT_hg19_rep2_sorted.bam.bai
│   ├── WT_hg19_rep3_sorted.bam
│   └── WT_hg19_rep3_sorted.bam.bai
├── chr1-hg19_genes.gtf
├── logs
│   ├── KO_hg19_rep1.log
│   ├── KO_hg19_rep1.txt
│   ├── KO_hg19_rep2.log
│   ├── KO_hg19_rep2.txt
│   ├── KO_hg19_rep3.log
│   ├── KO_hg19_rep3.txt
│   ├── WT_hg19_rep1.log
│   ├── WT_hg19_rep1.txt
│   ├── WT_hg19_rep2.log
│   ├── WT_hg19_rep2.txt
│   ├── WT_hg19_rep3.log
│   └── WT_hg19_rep3.txt
└── refseq.hg19.bed12

Class Exercise: Running RSeQC

  1. Copy Folder: Make a copy of the following folder into your home directory:

    /gpfs1/cl/mmg3320/course_materials/htseq_demo
    
  2. Determine Strandedness: HtSeq-count requires setting the -s parameter based on the RNA-Seq library preparation protocol. You will need to run RSeQC to determine strandedness of the demo data. There are three options to select from:

    • -s yes, reads are mapped to the same strand as the sense strand (sense)
    • -s no, reads can map to either strand (unstranded/undetermined)
    • -s reverse, reads are mapped to the opposite strand (anti-sense)

    • The rseqc-loop.sh script is provided below. Make a copy and modify the path for variables BAM_DIR and BED_FILE.

    #!/bin/bash
    #SBATCH --partition=general
    #SBATCH --nodes=1
    #SBATCH --ntasks=8
    #SBATCH --mem=20G
    #SBATCH --time=24:00:00
    #SBATCH --job-name=rseqc-loop
    #SBATCH --output=run-%x_%j.out
    
    module load apptainer/1.3.4
    
    CONTAINER_PATH="/gpfs1/cl/mmg3320/course_materials/containers/rseqc.sif"
    MULTIQC_PATH="/gpfs1/cl/mmg3320/course_materials/containers/multiqc-1.20.sif"
    
    BAM_DIR="/path/to/bams"
    BED_FILE="/path/to/bed12"
    OUTPUT_DIR="rseqc_data"
    
    mkdir -p "$OUTPUT_DIR"
    
    for BAM_FILE in "$BAM_DIR"/*.bam; do
        NAME=$(basename "$BAM_FILE" .bam)
    
        echo "Processing: $NAME"
    
        apptainer exec --cleanenv "$CONTAINER_PATH" \
            infer_experiment.py \
            -r "$BED_FILE" \
            -i "$BAM_FILE" \
            > "$OUTPUT_DIR/${NAME}.infer_experiment.txt"
    done
    
    # Run MultiQC
    apptainer exec --cleanenv "$MULTIQC_PATH" \
        multiqc "$OUTPUT_DIR"
    
  3. Submit the finalized script and then check your outputs are the same size. This script will take ~5 minutes to run. If correct, your output will look similar to below.

    -rw-r--r-- 1 pdrodrig pi-jdragon 165 Mar  2 10:37 KO_hg19_rep2_sorted.infer_experiment.txt
    -rw-r--r-- 1 pdrodrig pi-jdragon 165 Mar  2 10:37 KO_hg19_rep3_sorted.infer_experiment.txt
    -rw-r--r-- 1 pdrodrig pi-jdragon 165 Mar  2 10:37 WT_hg19_rep1_sorted.infer_experiment.txt
    -rw-r--r-- 1 pdrodrig pi-jdragon 165 Mar  2 10:37 WT_hg19_rep2_sorted.infer_experiment.txt
    -rw-r--r-- 1 pdrodrig pi-jdragon 165 Mar  2 10:37 WT_hg19_rep3_sorted.infer_experiment.txt
    
    4. Interpret the multiQC output to determine strandedness of the BAM files.

Class Exercise Part B

Class Exercise: Running HTSeq-count

  1. Modify htseq-count.sh script found below: You will need to designate the following parameters:

    • the correct path to the GTF file
    • the -f option to either bam or sam
    • the -s option to either yes, no, or reverse
    • the -i to gene_id
  2. Submit the htseq-count.sh script after modifying it. This script will take a few minutes to run.

  3. Look inside of the htseq-count_XXXXXX.out file after the job is completed. It should appear identical as below:

    Processing: KO_hg19_rep2_sorted
    Processing: KO_hg19_rep3_sorted
    Processing: WT_hg19_rep1_sorted
    Processing: WT_hg19_rep2_sorted
    Processing: WT_hg19_rep3_sorted
    

The incomplete htseq-count.sh script is provided:

#!/bin/bash
#SBATCH --partition=general 
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10G
#SBATCH --time=30:00:00
#SBATCH --job-name=htseq-count    
#SBATCH --output=%x_%j.out  # %x=job-name, %j=jobid

# Load HTSeq module
module load gcc/13.3.0-xp3epyt
module load py-htseq/2.0.3-mb7ap7s

# Specify PATH to GTF file
GTF="/path/to/gtf"

# Iterate through all BAM files in the current directory
for BAM_FILE in *.bam; do

# Extract the filename without the .bam extension
NAME=$(basename "$BAM_FILE" .bam)
echo "Processing: $NAME"

# Run HTSeq-count
htseq-count \
  -f \
  -s \
  -i \
  -m union \
  "$BAM_FILE" "$GTF" \
  > "${NAME}.gene_id.count.txt" \
  2> "${NAME}.gene_id.summary"

done
  • > → Redirects the gene counts output to results/counts/sample_counts.txt.
  • 2> → Redirects the summary of assigned/unassigned reads to results/counts/sample_counts.summary.

Class Exercise Part C

Class Exercise: Interpreting the final MultiQC

  1. While the htseq-count.sh is running, read the section below titled htseq-count output.
  2. Once the htseq-count.sh script is finished, navigate into the bams folder
  3. Now, generate one final multiQC output
  4. Submit the final multiQC to Brightspace for Homework-mini-3

htseq-count output

When running HTSeq, two output files are generated for each sample.

File 1: Summary File (*.summary):

This file contains processing statistics and reports how many reads were assigned or not assigned to features. If your job fails, please look inside this file as any error messages will be written here.

Once the job is complete, you can view the beginning of the summary file using:

head KO_hg19_rep2_sorted.gene_id.summary 

Example output:

76767 GFF lines processed.
100000 alignment records processed.
199252 alignment records processed.
What this means?
  • GFF lines processed: Number of annotation entries read from the GTF/GFF file.
  • Alignment records processed: Number of reads (or read pairs) examined from the BAM file.
    • For paired-end data, each mate is counted separately.
  • This file helps diagnose:
    • Whether the annotation file loaded correctly
    • Whether the BAM file was read successfully
    • Whether the expected number of alignments were processed

File 2: Counts File (*.count.txt):

This is the main output file. It contains the raw read counts per gene or feature.

You can view the top of the file using:

head KO_hg19_rep2_sorted.gene_id.count.txt
AADACL3 0
AADACL4 0
ABCA4   0
ABCB10  0
ABCD3   1
ABL2    0
ACADM   0
ACAP3   0
ACBD3   0
ACBD6   1

Interpretation:

  • First column: Gene ID (from the gene_id attribute in the GTF file)
  • Second column: Number of reads assigned to that gene

These are raw integers counts used as input for differential expression tools such as DESeq2.

At the bottom of the counts files are special summary rows. View them with:

tail KO_hg19_rep2_sorted.gene_id.count.txt
__no_feature    32813
__ambiguous     4372
__too_low_aQual 0
__not_aligned   15209
__alignment_not_unique  3667

What these Mean:

  • __no_feature: Reads aligned to the genome but did not overlap any annotated feature being counted (i.e. intronic, intergenic regions, annotation mismatch, wrong strandedness)
  • __ambiguous: Reads overlapped more than one gene and could not be uniquely assigned.
  • __too_low_aQual: Reads with alignment quality below the specified threshold (controlled by -a).
  • __not_aligned: Reads that did not align to the reference genome
  • __alignment_not_unique: Reads that aligned to multiple genomic locations.

Resource Recommendations for htseq-count

An array will submit independent jobs and process each BAM file independently.

#!/bin/bash
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4  
#SBATCH --mem=4G  # 2-6GB, 4GB is typically enough but specify 6GB for 100M+ reads
#SBATCH --time=2:00:00 # adjust time up to 30hrs 
#SBATCH --job-name=htseq-array
#SBATCH --output=htseq-%A_%a.out  # %A = job ID, %a = array task ID

Creating bigWig files with deepTools

We will now take our BAM files and convert them into bigWig files. The bigWig format is an indexed binary format useful for dense, continuous data that can be displayed in a genome browser as a graph/track.

To create bigWig files we will use deepTools, a suite of Python tools developed for the efficient analysis of high-throughput sequencing data, such as ChIP-seq, RNA-seq or MNase-seq. deepTools has a wide variety of commands that go beyond what we will cover today.

shells

Setting up

Checking/Creating index file for the BAM file: Often, when working with BAM files you will find that many tools require an index (an associated .bai file). You can think of an index similar to that which is located at the back of a textbook - when you are interested in a particular subject, you look for the keyword in the index and identify the pages that contain the relevant information. Similarily, indexing the BAM file aims to achieve fast retrieval of alignments overlapping a specified region without going through the whole alignment file. Essentially, a bai file along with the bam ensures that downstream applications are able to use the information with the bam file much more speedily.

As a reminder, we used SAMtools, specifically the samtools index command, to index the BAM files.

bamCoverage from deepTools

This command takes a BAM file as input and evaluates which areas of the genome have reads associated with them, i.e. how much of the genome is "covered" with reads. The coverage is calculated as the number of reads per bin, where bins are short consecutive sections of the genome (bins) that can be defined by the user. The output of this command is a bigWig file.

These are some parameters of bamCoverage that are worth considering:

  • normalizeUsing: Possible choices: RPKM, CPM, BPM, RPGC. By default, no normalization is applied. More on this below.
  • binSize: size of bins in bases (default is 50)
  • --effectiveGenomeSize: the portion of the genome that is mappable. It is useful to consider this when computing your scaling factor.
  • smoothLength: defines a window, larger than the binSize, to average the number of reads over. This helps produce a more continuous plot.
  • centerReads: reads are centered with respect to the fragment length as specified by extendReads. This option is useful to get a sharper signal around enriched regions.

Selecting Normalization method: The methods for bigWig creation (bamCoverage and bamCompare) allows for normalization, which is great if we want to compare different samples to each other and they vary in terms of sequencing depth. DeepTools offers different methods of normalization as listed below, each is perfomed per bin. The default is no normalization.

  • Reads Per Kilobase per Million mapped reads (RPKM)
    • number of reads per bin / (number of mapped reads (in millions) * bin length (kb))
  • Counts per million (CPM); this is similar to CPM in RNA-seq
    • number of reads per bin / number of mapped reads (in millions)
  • Bins Per Million mapped reads (BPM); same as TPM in RNA-seq
    • number of reads per bin / sum of all reads per bin (in millions)
  • Reads per genomic content (RPGC)
    • number of reads per bin / scaling factor for 1x average coverage
    • scaling factor is determined from the sequencing depth: total number of mapped reads * fragment length) / effective genome size
    • this option requires an effectiveGenomeSize

We will be using the bare minimum of parameters as shown in the code below. We decrease the bin size to increase the resolution of the track (this also means larger file size). If you are interested, feel free to test out some of the other parameters to create different bigWig files. You can load them into a genome viewer like IGV and observe the differences.

Class Exercise: Running deeptools/bamCoverage

Let's create a bigWig file for KO_hg19_rep2_sorted.bam and WT_hg19_rep2_sorted.bam:

module load deeptools/3.5.5 

bamCoverage -b KO_hg19_rep2_sorted.bam -o KO_hg19_rep2_sorted.bw 
Note: Normally, this command can take up to 10 minutes to complete.

Visualize with IGV:

  • Start IGV, You may have this previously installed on your laptop. If not no worries, use the IGV Web App.
  • Load the Human genome (hg19) into IGV using the dropdown menu at the top left of your screen.
  • Load the .bw file using the “Load from File…“ or "Tracks" option.
  • Type MOV10 into the search bar.
shells

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.