Single-cell RNA sequencing (scRNA-seq) is a powerful genomic technology that enables the measurement of gene expression at the resolution of individual cells. Unlike bulk RNA-seq, which provides an average signal across thousands or millions of cells, scRNA-seq reveals cellular heterogeneity, allowing researchers to identify distinct cell types, states, and dynamic biological processes within complex tissues.
Biological systems are inherently heterogeneous. In bulk RNA-seq experiments, differences between cells are masked by averaging, which can obscure:
Therefore, bulk RNA-seq only provide the average measurement for all the cells. It does not provide insights into the stochastic nature of gene expression. scRNA-seq overcomes this limitation by profiling each cell independently, making it especially valuable in fields such as developmental biology, immunology, and cancer genomics.
Single-Cell Isolation Cells are separated using methods such as microfluidics (e.g., droplet-based systems), fluorescence-activated cell sorting (FACS), or microwell platforms.
Barcoding and Library Preparation Each cell’s RNA is tagged with a unique barcode, allowing reads to be traced back to their cell of origin. Unique Molecular Identifiers (UMIs) are often added to quantify transcripts more accurately.
Sequencing Libraries are sequenced using next-generation sequencing platforms.The main difference between bulk and single-cell RNA-seq (scRNA-seq) is that each sequencing library represents a single cell, instead of a population of cells.
Data Processing Sequencing reads are aligned to a reference genome, and gene expression is quantified to produce a gene-by-cell count matrix.
scRNA-seq data differ substantially from bulk RNA-seq and require specialized analysis approaches:
These characteristics necessitate careful preprocessing and normalization.
A standard scRNA-seq analysis workflow includes:
Quality Control (QC) Remove low-quality cells (e.g., high mitochondrial gene content) Filter out lowly expressed genes
Normalization Adjust for differences in sequencing depth across cells
Dimensionality Reduction Principal Component Analysis (PCA) Nonlinear methods such as UMAP or t-SNE
Clustering Identify groups of cells with similar expression profiles
Cell Type Annotation Assign biological meaning to clusters using marker genes
Differential Expression Analysis Identify genes that distinguish clusters or conditions
Several R packages are widely used for scRNA-seq analysis:
Seurat – comprehensive toolkit for preprocessing,
clustering, and visualization SingleCellExperiment –
standardized data container for single-cell data; provides data
structure
This class implements a data structure that stores all aspects of our single-cell data - gene-by-cell expression data, cell-wise metadata, and gene-wise annotation - and lets us manipulate them in an organized manner.
Each column = one cell Each row = one gene Values = gene expression counts
SingleCellExperiment objects.SingleCellExperiment
objects.#BiocManager::install(c("AUCell", "batchelor", "BiocStyle",
#"CuratedAtlasQueryR", "DropletUtils", "duckdb",
#"EnsDb.Mmusculus.v79", "MouseGastrulationData",
#"scDblFinder", "Seurat", "lgeistlinger/SeuratData",
#"SingleR", "TENxBrainData", "zellkonverter"),
# Ncpus = 4)
## took less than 5 minutes to install
load these packages
library(SingleCellExperiment)
library(MouseGastrulationData)
library(tidyverse)
library(dplyr)
library(SingleR)
library(bluster)
library(scater)
library(scran)
library(pheatmap)
library(GSEABase)
The MouseGastrulationData is a data package that
contains real scRNA-seq datasets we can use for analysis.
SingleCellExperiment is the container, and
MouseGastrulationData is the data we put inside of the
container.
WTChimerData datasetThis dataset comes from mouse embryo experiments where:
The data can be accessed using the WTChimeraData()
function.
sce <- WTChimeraData(samples = 5)
# You may be prompted to type "yes" in the console
#Here we are simply loading sample 5 from the dataset and storing it into `sce`
sce is now an SingleCellExperiment object, which is
structured to store + gene expression data + cell metadata
sce
## class: SingleCellExperiment
## dim: 29453 2411
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2411): cell_9769 cell_9770 ... cell_12178 cell_12179
## colData names(11): cell barcode ... doub.density sizeFactor
## reducedDimNames(2): pca.corrected.E7.5 pca.corrected.E8.5
## mainExpName: NULL
## altExpNames(0):
names(assays(sce))
## [1] "counts"
#list the type of expression data stored
# counts = raw gene expression counts
counts(sce)[1:3, 1:3] # extracting first 3 genes x first 3 cells
## 3 x 3 sparse Matrix of class "dgCMatrix"
## cell_9769 cell_9770 cell_9771
## ENSMUSG00000051951 . . .
## ENSMUSG00000089699 . . .
## ENSMUSG00000102343 . . .
colData(sce)[1:3, 1:4] # displays metadata for cells
## DataFrame with 3 rows and 4 columns
## cell barcode sample stage
## <character> <character> <integer> <character>
## cell_9769 cell_9769 AAACCTGAGACTGTAA 5 E8.5
## cell_9770 cell_9770 AAACCTGAGATGCCTT 5 E8.5
## cell_9771 cell_9771 AAACCTGAGCAGCCTC 5 E8.5
rowData(sce)[1:3, 1:2] # displays metadata for genes
## DataFrame with 3 rows and 2 columns
## ENSEMBL SYMBOL
## <character> <character>
## ENSMUSG00000051951 ENSMUSG00000051951 Xkr4
## ENSMUSG00000089699 ENSMUSG00000089699 Gm1992
## ENSMUSG00000102343 ENSMUSG00000102343 Gm37381
sce$my_sum <- colSums(counts(sce)) # calculating total counts per cell; "for each cell how many total transcripts were detected"
colData(sce)[1:3,]
## DataFrame with 3 rows and 12 columns
## cell barcode sample stage tomato
## <character> <character> <integer> <character> <logical>
## cell_9769 cell_9769 AAACCTGAGACTGTAA 5 E8.5 TRUE
## cell_9770 cell_9770 AAACCTGAGATGCCTT 5 E8.5 TRUE
## cell_9771 cell_9771 AAACCTGAGCAGCCTC 5 E8.5 TRUE
## pool stage.mapped celltype.mapped closest.cell doub.density
## <integer> <character> <character> <character> <numeric>
## cell_9769 3 E8.25 Mesenchyme cell_24159 0.02985045
## cell_9770 3 E8.5 Endothelium cell_96660 0.00172753
## cell_9771 3 E8.5 Allantois cell_134982 0.01338013
## sizeFactor my_sum
## <numeric> <numeric>
## cell_9769 1.41243 27577
## cell_9770 1.22757 29309
## cell_9771 1.15439 28795
Summary so far:
A SingleCellExperiment object is structured in a way to
organize + assays() - gene expression values +
colData() - cell-level metadata + rowData() -
gene-level metadata
Everything that happens later, builds on this structure.
dim(counts(sce)) # [genes, cells]
## [1] 29453 2411
Above is showing the dimensions for one sample: genes × cells_from_sample5
sce_all <- WTChimeraData()
dim(counts(sce_all)) # [genes, cells]
## [1] 29453 30703
#this takes a few minutes
This object contains all cells across all samples Rows = genes Columns = cells
All samples: genes × (cells_sample1 + cells_sample2 + … + cells_sampleN)
So far, we’ve explored the expression matrix (genes × cells). Now, we shift our focus to the metadata, which describes each individual cell.
In a SingleCellExperiment object, metadata is stored in
colData()
colData(sce_all)
## DataFrame with 30703 rows and 11 columns
## cell barcode sample stage tomato
## <character> <character> <integer> <character> <logical>
## cell_1 cell_1 AAACCTGAGACCTTTG 1 E7.5 TRUE
## cell_2 cell_2 AAACCTGAGCGCCTCA 1 E7.5 TRUE
## cell_3 cell_3 AAACCTGAGGATCGCA 1 E7.5 TRUE
## cell_4 cell_4 AAACCTGCACGTAAGG 1 E7.5 TRUE
## cell_5 cell_5 AAACCTGCAGACAGGT 1 E7.5 TRUE
## ... ... ... ... ... ...
## cell_30699 cell_30699 TTTGTCACAGCTCGCA 10 E8.5 FALSE
## cell_30700 cell_30700 TTTGTCAGTCTAGTCA 10 E8.5 FALSE
## cell_30701 cell_30701 TTTGTCATCATCGGAT 10 E8.5 FALSE
## cell_30702 cell_30702 TTTGTCATCATTATCC 10 E8.5 FALSE
## cell_30703 cell_30703 TTTGTCATCCCATTTA 10 E8.5 FALSE
## pool stage.mapped celltype.mapped closest.cell
## <integer> <character> <character> <character>
## cell_1 1 E7.5 Haematoendothelial p.. cell_22032
## cell_2 1 E8.0 Mesenchyme cell_116830
## cell_3 1 E7.25 Primitive Streak cell_84552
## cell_4 1 E7.5 Caudal epiblast cell_714
## cell_5 1 E7.5 Caudal epiblast cell_4785
## ... ... ... ... ...
## cell_30699 5 E8.5 Erythroid3 cell_38810
## cell_30700 5 E8.5 Surface ectoderm cell_38588
## cell_30701 5 E8.25 Forebrain/Midbrain/H.. cell_66082
## cell_30702 5 E8.5 Erythroid3 cell_138114
## cell_30703 5 E8.0 Doublet cell_92644
## doub.density sizeFactor
## <numeric> <numeric>
## cell_1 0.0504199 1.16288
## cell_2 0.0974964 1.65059
## cell_3 0.0413143 1.64731
## cell_4 0.0433117 1.16565
## cell_5 0.0627621 1.48883
## ... ... ...
## cell_30699 0.00146287 0.389311
## cell_30700 0.00374155 0.588784
## cell_30701 0.05651258 0.624455
## cell_30702 0.00108837 0.550807
## cell_30703 0.82369305 1.184919
Each row = one cell Columns describe the cell
Some key columns include: + sample + stage = developmental stage + celltype.mapped = predicted cell identity + total_counts = total reads per cell
Let’s ask a basic question:
How many samples are in this dataset?
unique(colData(sce_all)$sample)
## [1] 1 2 3 4 5 6 7 8 9 10
This tells us there are 10 samples, but what do these actually represent?
To interpret sample, we need to connect it to a
biological variable like stage:
as.data.frame(colData(sce_all)) |>
distinct(sample, stage) |>
arrange(sample)
## sample stage
## cell_1 1 E7.5
## cell_2883 2 E7.5
## cell_5480 3 E7.5
## cell_7948 4 E7.5
## cell_9769 5 E8.5
## cell_12180 6 E8.5
## cell_13227 7 E8.5
## cell_16234 8 E8.5
## cell_19331 9 E8.5
## cell_23875 10 E8.5
Why as.data.frame? colData() returns a
special Bioconductor object (DataFrame) which is different
than base R data.frame
as.data.frame() makes the data compatible with tidyverse
tools
counts(sce_all)[, "cell_1"][1:10] #showing the first 10 genes for cell_1
## ENSMUSG00000051951 ENSMUSG00000089699 ENSMUSG00000102343 ENSMUSG00000025900
## 0 0 0 0
## ENSMUSG00000025902 ENSMUSG00000104328 ENSMUSG00000033845 ENSMUSG00000025903
## 0 0 13 3
## ENSMUSG00000104217 ENSMUSG00000033813
## 0 3
colData(sce_all)["cell_1", ] #metadata for cell_1
## DataFrame with 1 row and 11 columns
## cell barcode sample stage tomato pool
## <character> <character> <integer> <character> <logical> <integer>
## cell_1 cell_1 AAACCTGAGACCTTTG 1 E7.5 TRUE 1
## stage.mapped celltype.mapped closest.cell doub.density sizeFactor
## <character> <character> <character> <numeric> <numeric>
## cell_1 E7.5 Haematoendothelial p.. cell_22032 0.0504199 1.16288
Now that we understand how samples relate to biological stages, we can ask a more meaningful question:
How many cells are in each developmental stage?
In this exercise, create a bar plot to visualize the number of cells
at each developmental stage using the sce_all dataset.
Use ggplot() to create a bar plot where
apply the theme_classic()
Add informative labels:
# Visualize the number of cell counts per sample
library(ggplot2)
library(dplyr)
as.data.frame(colData(sce_all)) |>
ggplot(aes(x = stage, fill = stage)) +
geom_bar(color = "black") +
theme_classic() +
labs(
title = "Number of Cells per Developmental Stage",
x = "Developmental Stage",
y = "Number of Cells"
)
It’s still useful to understand how cells are distributed across samples:
#mutate(sample = factor(sample)) |>
as.data.frame(colData(sce_all)) |>
mutate(sample = factor(sample)) |>
ggplot(aes(x = sample, fill = stage)) +
geom_bar(color = "black") +
theme_classic() +
labs(
title = "Number of Cells per Sample",
x = "Sample",
y = "Number of Cells"
)
This plot helps us see:
So far, we’ve counted cells.
Next, we examine how much data each cell contains.
How many sequencing reads are captured per cell?
library(scater)
sce_all$total_counts <- colSums(counts(sce_all))
as.data.frame(colData(sce_all)) |>
ggplot(aes(x = total_counts)) +
geom_histogram(bins = 50, fill = "steelblue", color = "black") +
theme_classic() +
labs(
title = "Distribution of Total Counts per Cell",
x = "Total Counts per Cell",
y = "Number of Cells"
)
Each bar = number of cells in a given count range
This helps identify: low-quality cells (low counts)
At this point, we can see that scRNA-seq data are:
This makes it difficult to:
To solve this, we apply dimensionality reduction techniques, such as:
One useful feature of SingleCellExperiment is that it
can store these reduced representations directly inside the object.
reducedDims(sce)
## List of length 2
## names(2): pca.corrected.E7.5 pca.corrected.E8.5
# this object contains two different dimensionality reductions; one for stage E7.5 X E8.5
Now we can visualize the dataset:
plotReducedDim(sce,
"pca.corrected.E8.5",
colour_by = "stage.mapped")
#ggsave("pca_corrected_E8_5.png")
Single-cell datasets contain technical artifacts that must be addressed before downstream analysis.
Common issues include:
Therefore, QC is a critical early step in any scRNA-seq workflow.
library(MouseGastrulationData)
library(DropletUtils)
library(ggplot2)
library(EnsDb.Mmusculus.v79)
library(scuttle)
library(scater)
library(scran)
library(scDblFinder)
Load raw (unfiltered) data
sce <- WTChimeraData(samples = 5, type = "raw")
# the “raw” data that contains all the droplets for which we have sequenced reads
This dataset contains all droplets captured during sequencing, however, not all droplets contain real cells.
sce
## $`5`
## class: SingleCellExperiment
## dim: 29453 522554
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(522554): AAACCTGAGAAACCAT AAACCTGAGAAACCGC ...
## TTTGTCATCTTTACGT TTTGTCATCTTTCCTC
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
From the experiment, we expect to have only a few thousand cells, while we can see that we have data for more than 500,000 droplets. It is likely that most of these droplets are empty and are capturing only ambient or background RNA.
[droplet]
Testing for empty droplets, before filtering, we isolate the relevant data:
sce <- sce[[1]]
#This extracts the count matrix for analysis
Next, we will tests each droplet to determine if it contains a real cell + Uses a statistical model to distinguish real cells vs ambient RNA
# emptyDrops performs Monte Carlo simulations to compute p-values,
# so we need to set the seed to obtain reproducible results.
set.seed(100)
# this may take a few minutes
e.out <- emptyDrops(counts(sce))
# This shows how many droplets pass the threshold
summary(e.out$FDR <= 0.001)
## Mode FALSE TRUE NA's
## logical 5919 3396 513239
The result confirms our expectation: only 3396 droplets contain a cell, while the large majority of droplets are empty.
We need to filter to keep real cells and remove empty droplets.
sce <- sce[,which(e.out$FDR <= 0.001)]
sce
## class: SingleCellExperiment
## dim: 29453 3396
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(3396): AAACCTGAGACTGTAA AAACCTGAGATGCCTT ... TTTGTCACATTCTCAT
## TTTGTCATCTGAGTGT
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Note: Whenever your code involves the generation of random numbers,
it’s a good practice to set the random seed in R with
set.seed().
Setting the seed to a specific value (in the above example to 100) will cause the pseudo-random number generator to return the same pseudo-random numbers in the same order.
This allows us to write code with reproducible results, despite technically involving the generation of (pseudo-)random numbers.
While we have removed empty droplets, this does not necessarily imply that all the cell-containing droplets should be kept for downstream analysis. Some droplets could contain low-quality samples, due to cell damage or failure in library preparation, pooly sequenced, or partially lysed.
What makes a low-quality cell?
One common signal is high mitochondrial gene expression. High proportions of mitochondrial genes are indicative of poor-quality cells because of loss of cytoplasmic RNA from perforated cells.
For single-nucleus RNA-seq experiments, high proportions are also useful as they can mark cells where the cytoplasm has not been successfully stripped.
Step 1: First, we need to identify mitochondrial genes. We use the available EnsDb mouse package available in Bioconductor, but a more updated version of Ensembl can be used through the AnnotationHub or biomaRt packages.
chr.loc <- mapIds(EnsDb.Mmusculus.v79,
keys = rownames(sce),
keytype = "GENEID",
column = "SEQNAME")
## Warning: Unable to map 2918 of 29453 requested IDs.
is.mito <- which(chr.loc == "MT")
What is this doing? - Look up each gene - Find which chromosome it comes from - Keep genes located on the mitochondrial chromosome (“MT”)
Step 2: Calculate QC metrics for each cell:
df <- perCellQCMetrics(sce, subsets = list(Mito = is.mito))
Step 3: Add QC metrics to metadata
colData(sce)
## DataFrame with 3396 rows and 0 columns
colData(sce)
## DataFrame with 3396 rows and 0 columns
#cbind = column bind
Now that we have computed the metrics, we have to decide on thresholds to define high- and low-quality samples.
Step 4: Explore QC metrics. We could check how many cells are above/below a certain fixed threshold.
# How many cells have very low counts?
# How many cells have >10% mitochondrial reads?
# Are there cells detecting very few genes?
Step 5: Flag low quality cells
reasons <- perCellQCFilters(df, sub.fields = "subsets_Mito_percent")
reasons
## DataFrame with 3396 rows and 4 columns
## low_lib_size low_n_features high_subsets_Mito_percent discard
## <outlier.filter> <outlier.filter> <outlier.filter> <logical>
## 1 FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE
## 5 TRUE TRUE FALSE TRUE
## ... ... ... ... ...
## 3392 FALSE FALSE FALSE FALSE
## 3393 TRUE TRUE FALSE TRUE
## 3394 TRUE TRUE TRUE TRUE
## 3395 TRUE TRUE TRUE TRUE
## 3396 TRUE TRUE TRUE TRUE
# identifies outliers based on QC metrics; flags poor quality cells
Step 6: Label cells to keep/remove
sce$discard <- reasons$discard
# TRUE = discard (low quality)
# FALSE = keep
Step 7: Visualize QC filtering
How to interpret:
Summary: At this stage, we are not asking biological questions yet — we are making sure our data is reliable enough to trust.
In this part of the tutorial, you will learn how to:
We will now be working with a processed single-cell RNA-seq dataset from a wild-type chimeric mouse embryo.
sce <- WTChimeraData(samples = 5, type = "processed")
sce
## class: SingleCellExperiment
## dim: 29453 2411
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2411): cell_9769 cell_9770 ... cell_12178 cell_12179
## colData names(11): cell barcode ... doub.density sizeFactor
## reducedDimNames(2): pca.corrected.E7.5 pca.corrected.E8.5
## mainExpName: NULL
## altExpNames(0):
What’s happening here?
sce is a SingleCellExperiment object, which stores all
of our data Rows = genes, columns = cells
It may also include metadata (e.g., cell type, experimental condition)
Tip: Always print your object (sce) to understand what data is available before starting analysis.
To make the analysis faster, we randomly select 1,000 cells.
set.seed(123)
ind <- sample(ncol(sce), 1000)
sce <- sce[,ind]
What’s happening here?
ncol(sce) = total number of cellssample() randomly selects 1,000 cellssce[, ind] keeps all genes but only those selected
cellsWhy this matters: Working with fewer cells speeds things up while still preserving biological patterns.
Before clustering, we need to normalize the data and reduce its complexity.
sce <- logNormCounts(sce)
# Run PCA
sce <- runPCA(sce)
What’s happening here?
logNormCounts() adjusts for differences in sequencing
depth and makes values comparable across cells
runPCA() reduces thousands of genes into a smaller
number of dimensions (principal components)
Clustering is an unsupervised learning procedure that is used to define groups of cells with similar expression profiles. Its primary purpose is to summarize complex scRNA-seq data into a digestible format for human interpretation. In short, this helps us identify potential cell types of interest.
colLabels(sce) <- clusterCells(sce,
use.dimred = "PCA",
BLUSPARAM = NNGraphParam(cluster.fun = "louvain"))
# table of clusters for sce
table(colLabels(sce))
##
## 1 2 3 4 5 6 7 8 9 10 11
## 193 161 59 134 63 61 108 49 91 42 39
Here, we use the clusterCells() function from the
scran package to perform graph-based clustering using the
Louvain algorithm for community detection. This function returns a
vector containing cluster assignments for each cell in our
SingleCellExperiment object.
You can see we ended up with 11 clusters of varying cell sizes. The key idea here is cells in the same cluster likely represent the same cell type or state.
Now we visualize the clusters in 2D space.
sce <- runUMAP(sce, dimred = "PCA")
# plotReducedDim
plotReducedDim(sce, "UMAP", color_by = "label")
What if we wanted to get fewer, larger clusters?
sce$clust2 <- clusterCells(sce, use.dimred = "PCA",
BLUSPARAM = NNGraphParam(cluster.fun = "louvain",
k = 30))
# k = number of nearest neighbors used to build the graph where small k cells only connect locally and large k cells connect more broadly (bigger communities)
# plot for clust2
plotReducedDim(sce, "UMAP", color_by = "clust2")
** Quick Exercise** : Run clustering with different K values (k = 10 then k = 50). How does the number of clusters change - provide the numbers below.
k; # of clusters k = 10; # = k = 30; # = k = 50; # =
How do you choose k? Lots of questions are asked:
Marker genes allow us to assign biological meaning to each cluster based on their functional annotation. We have a priori knowledge of the marker genes associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity.
The most straightforward approach to marker gene detection involves testing for differential expression between clusters. If a gene is strongly DE between clusters, it is likely to have driven the separation of cells in the clustering algorithm.
#rename genes using symbols
rownames(sce) <- rowData(sce)$SYMBOL
# find genes that distinguish each cluster
markers <- scoreMarkers(sce)
# output a ranked list of genes for each cluster
markers
## List of length 11
## names(11): 1 2 3 4 5 6 7 8 9 10 11
Below shows the marker genes for cluster 1.
markers[[1]]
## DataFrame with 29453 rows and 19 columns
## self.average other.average self.detected other.detected
## <numeric> <numeric> <numeric> <numeric>
## Xkr4 0.00777523 0.00204744 0.0103627 0.0015873
## Gm1992 0.00000000 0.00000000 0.0000000 0.0000000
## Gm37381 0.00000000 0.00000000 0.0000000 0.0000000
## Rp1 0.00000000 0.00000000 0.0000000 0.0000000
## Sox17 0.01448428 0.28440343 0.0103627 0.1378294
## ... ... ... ... ...
## AC149090.1 0.39760189 0.339949277 0.32642487 0.2773499
## DHRSX 0.37536303 0.514962467 0.32642487 0.4087374
## Vmn2r122 0.00000000 0.000000000 0.00000000 0.0000000
## CAAA01147332.1 0.00852569 0.000802687 0.00518135 0.0010989
## tomato-td 0.61466767 0.641897197 0.48186528 0.4972542
## mean.logFC.cohen min.logFC.cohen median.logFC.cohen
## <numeric> <numeric> <numeric>
## Xkr4 0.119882 -0.10006 0.1443198
## Gm1992 0.000000 0.00000 0.0000000
## Gm37381 0.000000 0.00000 0.0000000
## Rp1 0.000000 0.00000 0.0000000
## Sox17 -0.346525 -2.07437 -0.0713339
## ... ... ... ...
## AC149090.1 0.1070458 -0.08935467 0.0555201
## DHRSX -0.2135530 -0.49469635 -0.1947910
## Vmn2r122 0.0000000 0.00000000 0.0000000
## CAAA01147332.1 0.0921178 0.00500178 0.1017973
## tomato-td -0.0304896 -0.27695578 -0.0829054
## max.logFC.cohen rank.logFC.cohen mean.AUC min.AUC median.AUC
## <numeric> <integer> <numeric> <numeric> <numeric>
## Xkr4 0.144320 3253 0.504379 0.4971626 0.505181
## Gm1992 0.000000 7838 0.500000 0.5000000 0.500000
## Gm37381 0.000000 7838 0.500000 0.5000000 0.500000
## Rp1 0.000000 7838 0.500000 0.5000000 0.500000
## Sox17 0.143843 5357 0.435959 0.0958549 0.493880
## ... ... ... ... ... ...
## AC149090.1 0.4394296 1678 0.524363 0.482359 0.509786
## DHRSX 0.0270598 7335 0.451362 0.383050 0.450894
## Vmn2r122 0.0000000 7838 0.500000 0.500000 0.500000
## CAAA01147332.1 0.1017973 4305 0.502044 0.497125 0.502591
## tomato-td 0.2898496 4071 0.487954 0.422322 0.476355
## max.AUC rank.AUC mean.logFC.detected min.logFC.detected
## <numeric> <integer> <numeric> <numeric>
## Xkr4 0.505181 5688 7.43734e-01 -2.75044e-01
## Gm1992 0.500000 7636 -1.60171e-17 -3.20343e-16
## Gm37381 0.500000 7636 -1.60171e-17 -3.20343e-16
## Rp1 0.500000 7636 -1.60171e-17 -3.20343e-16
## Sox17 0.505181 6757 -1.32542e+00 -4.92640e+00
## ... ... ... ... ...
## AC149090.1 0.588965 1993 2.58156e-01 -2.88274e-02
## DHRSX 0.511020 4598 -2.98961e-01 -5.66395e-01
## Vmn2r122 0.500000 7636 -1.60171e-17 -3.20343e-16
## CAAA01147332.1 0.502591 6151 3.89887e-01 -4.42710e-01
## tomato-td 0.568178 2881 -2.57339e-02 -3.30235e-01
## median.logFC.detected max.logFC.detected rank.logFC.detected
## <numeric> <numeric> <integer>
## Xkr4 0.697532 1.41597e+00 1119
## Gm1992 0.000000 3.20343e-16 6173
## Gm37381 0.000000 3.20343e-16 6173
## Rp1 0.000000 3.20343e-16 6173
## Sox17 -0.703710 9.58290e-01 2200
## ... ... ... ...
## AC149090.1 0.0583464 9.40386e-01 2627
## DHRSX -0.2934548 2.40918e-02 5723
## Vmn2r122 0.0000000 3.20343e-16 6173
## CAAA01147332.1 0.3905253 8.75149e-01 1964
## tomato-td -0.0101262 2.99327e-01 4995
# each row = one gene
# each column = a statistic comparing cluster 1 vs all other cells
# self.average = avg expression in cluster 1
# other.average = avg expression elsewhere
# looking for high self.average compare to other.average
AUC tells you how well this gene separates the cluster. Where 0.5 = no separation and closer to 1 is a strong marker.
Key takeaway: Most genes are not good markers. We need to filter them out.
We can then inspect the top marker genes for the first cluster using
the plotExpression function from the scater
package.
# select the top marker genes for cluster 1
c1_markers <- markers[[1]]
ord <- order(c1_markers$mean.AUC,
decreasing = TRUE)
top.markers <- head(rownames(c1_markers[ord,]))
# plot their expression across clusters
plotExpression(sce,
features = top.markers,
x = "label",
color_by = "label")
How to interpret:
Clearly, not every marker gene distinguishes cluster 1 from every other cluster. However, with a combination of multiple marker genes it’s possible to clearly identify gene patterns that are unique to cluster 1.
Class Exercise
Identify cell markers for cluster 2 and plot these. Word of caution: be careful when renaming R objects!
# select the top marker genes for cluster 2
# plot expression of top cluster 2 markers across clusters
Clustering tells us groups exist, but not what they are. Annotation assigns biological meaning to clusters.
Next, we can directly compare our expression profiles to published reference datasets using SingleR.
We compare our data to a reference dataset with known cell types.
ref <- EmbryoAtlasData(samples = 29)
ref
## class: SingleCellExperiment
## dim: 29452 7569
## metadata(0):
## assays(1): counts
## rownames(29452): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000096730 ENSMUSG00000095742
## rowData names(2): ENSEMBL SYMBOL
## colnames(7569): cell_95727 cell_95728 ... cell_103294 cell_103295
## colData names(17): cell barcode ... colour sizeFactor
## reducedDimNames(2): pca.corrected umap
## mainExpName: NULL
## altExpNames(0):
# EmbryoAtlas is a reference database with cells that are already labeled by experts
Contains 29,452 genes and 7,569 cells. GeneIDs are ENSEMBL and SYMBOL
We are comparing our unknown cells to this labeled dataset!
# subsetting and cleaning the reference
set.seed(123)
ind <- sample(ncol(ref), 1000)
ref <- ref[,ind]
You can see we have an assortment of different cell types in the reference (with varying frequency):
tab <- sort(table(ref$celltype), decreasing = TRUE)
tab
##
## Forebrain/Midbrain/Hindbrain Erythroid3
## 131 75
## Paraxial mesoderm NMP
## 69 51
## ExE mesoderm Surface ectoderm
## 49 47
## Allantois Mesenchyme
## 46 45
## Spinal cord Pharyngeal mesoderm
## 45 41
## ExE endoderm Neural crest
## 38 35
## Gut Haematoendothelial progenitors
## 30 27
## Intermediate mesoderm Cardiomyocytes
## 27 26
## Somitic mesoderm Endothelium
## 25 23
## Erythroid2 Def. endoderm
## 11 3
## Erythroid1 Blood progenitors 1
## 2 1
## Blood progenitors 2 Caudal Mesoderm
## 1 1
## PGC
## 1
Which is the most and least abundant labeled cell type in this reference?
#the normalized log counts
ref <- logNormCounts(ref)
# remove cells of the reference dataset for which the cell type annotation is missing
nna <- !is.na(ref$celltype)
ref <- ref[,nna]
#Also remove cell types of very low abundance (here less than 10 cells) to remove noise prior to subsequent annotation tasks
abu.ct <- names(tab)[tab >= 10]
ind <- ref$celltype %in% abu.ct
ref <- ref[,ind]
Restrict to genes shared between query and reference dataset. Both datasets must use the same genes for comparison!
rownames(ref) <- rowData(ref)$SYMBOL
shared_genes <- intersect(rownames(sce), rownames(ref))
sce <- sce[shared_genes,]
ref <- ref[shared_genes,]
Convert to matrix to run SingleR
sce.mat <- as.matrix(assay(sce, "logcounts"))
ref.mat <- as.matrix(assay(ref, "logcounts"))
Running SingleR with the query and reference datasets:
res <- SingleR(test = sce.mat,
ref = ref.mat,
labels = ref$celltype)
res
## DataFrame with 1000 rows and 4 columns
## scores labels delta.next
## <matrix> <character> <numeric>
## cell_11995 0.348586:0.335451:0.314515:... Forebrain/Midbrain/H.. 0.1285110
## cell_10294 0.273570:0.260013:0.298932:... Erythroid3 0.1381951
## cell_9963 0.328538:0.291288:0.475611:... Endothelium 0.2193295
## cell_11610 0.281161:0.269245:0.299961:... Erythroid3 0.0359215
## cell_10910 0.422454:0.346897:0.355947:... ExE mesoderm 0.0984285
## ... ... ... ...
## cell_11597 0.323805:0.292967:0.300485:... NMP 0.1663369
## cell_9807 0.464466:0.374189:0.381698:... Mesenchyme 0.0833019
## cell_10095 0.341721:0.288215:0.485324:... Endothelium 0.0889931
## cell_11706 0.267487:0.240215:0.286012:... Erythroid2 0.0350557
## cell_11860 0.345786:0.343437:0.313994:... Forebrain/Midbrain/H.. 0.0117001
## pruned.labels
## <character>
## cell_11995 Forebrain/Midbrain/H..
## cell_10294 Erythroid3
## cell_9963 Endothelium
## cell_11610 Erythroid3
## cell_10910 ExE mesoderm
## ... ...
## cell_11597 NMP
## cell_9807 Mesenchyme
## cell_10095 Endothelium
## cell_11706 Erythroid2
## cell_11860 Forebrain/Midbrain/H..
delta.next is the difference between the best match
and the second best match. Here a large delta (0.2+) is equivalent to
more confidence. A small delta is ambiguous (less than 0.05)
pruned.labels cleaned versions of labels &
removes low confidence assignments
Overall, SingleR assigns a label AND tells you how
confident it is.
plotScoreHeatmap(res)
How to interpret:
We can also compare the cell type assignments with the unsupervised clustering results to determine the identity of each cluster.
tab <- table(anno = res$pruned.labels,
cluster = colLabels(sce))
pheatmap(log2(tab + 10),
color = colorRampPalette(c("white", "blue"))(101))
In this section, we move from analyzing a single sample to analyzing multiple samples together. This allows us to compare across experiments and identify shared or sample-specific patterns.
sce <- WTChimeraData(samples = 5:10, type = "processed")
# this will take some time to load
sce
## class: SingleCellExperiment
## dim: 29453 20935
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(20935): cell_9769 cell_9770 ... cell_30702 cell_30703
## colData names(11): cell barcode ... doub.density sizeFactor
## reducedDimNames(2): pca.corrected.E7.5 pca.corrected.E8.5
## mainExpName: NULL
## altExpNames(0):
We are loading samples 5 through 10 (multiple experiments)
type = “processed” means: + Quality control has already been done + Data is ready for downstream analysis
# shows metadata for each cell
colData(sce)
## DataFrame with 20935 rows and 11 columns
## cell barcode sample stage tomato
## <character> <character> <integer> <character> <logical>
## cell_9769 cell_9769 AAACCTGAGACTGTAA 5 E8.5 TRUE
## cell_9770 cell_9770 AAACCTGAGATGCCTT 5 E8.5 TRUE
## cell_9771 cell_9771 AAACCTGAGCAGCCTC 5 E8.5 TRUE
## cell_9772 cell_9772 AAACCTGCATACTCTT 5 E8.5 TRUE
## cell_9773 cell_9773 AAACGGGTCAACACCA 5 E8.5 TRUE
## ... ... ... ... ... ...
## cell_30699 cell_30699 TTTGTCACAGCTCGCA 10 E8.5 FALSE
## cell_30700 cell_30700 TTTGTCAGTCTAGTCA 10 E8.5 FALSE
## cell_30701 cell_30701 TTTGTCATCATCGGAT 10 E8.5 FALSE
## cell_30702 cell_30702 TTTGTCATCATTATCC 10 E8.5 FALSE
## cell_30703 cell_30703 TTTGTCATCCCATTTA 10 E8.5 FALSE
## pool stage.mapped celltype.mapped closest.cell
## <integer> <character> <character> <character>
## cell_9769 3 E8.25 Mesenchyme cell_24159
## cell_9770 3 E8.5 Endothelium cell_96660
## cell_9771 3 E8.5 Allantois cell_134982
## cell_9772 3 E8.5 Erythroid3 cell_133892
## cell_9773 3 E8.25 Erythroid1 cell_76296
## ... ... ... ... ...
## cell_30699 5 E8.5 Erythroid3 cell_38810
## cell_30700 5 E8.5 Surface ectoderm cell_38588
## cell_30701 5 E8.25 Forebrain/Midbrain/H.. cell_66082
## cell_30702 5 E8.5 Erythroid3 cell_138114
## cell_30703 5 E8.0 Doublet cell_92644
## doub.density sizeFactor
## <numeric> <numeric>
## cell_9769 0.02985045 1.41243
## cell_9770 0.00172753 1.22757
## cell_9771 0.01338013 1.15439
## cell_9772 0.00218402 1.28676
## cell_9773 0.00211723 1.78719
## ... ... ...
## cell_30699 0.00146287 0.389311
## cell_30700 0.00374155 0.588784
## cell_30701 0.05651258 0.624455
## cell_30702 0.00108837 0.550807
## cell_30703 0.82369305 1.184919
Remove unwanted cells
drop <- sce$celltype.mapped %in% c("stripped", "Doublet")
sce <- sce[,!drop]
# Subsample cells (per sample)
set.seed(29482)
# We randomly select 50% of cells from each sample
idx <- unlist(tapply(colnames(sce), sce$sample, function(x) {
perc <- round(0.50 * length(x))
sample(x, perc)
}))
sce <- sce[,idx]
Why this matters:
# Normalize counts
sce <- logNormCounts(sce)
# Find highly variable genes (HVGs)
dec <- modelGeneVar(sce, block = sce$sample)
chosen.hvgs <- dec$bio > 0
# PCA (dimensionality reduction)
sce <- runPCA(sce, subset_row = chosen.hvgs, ntop = 1000)
# t-SNE (visualization)
sce <- runTSNE(sce, dimred = "PCA")
sce$sample <- as.factor(sce$sample)
plotTSNE(sce, colour_by = "sample")
# Each point = one cell
It seems like samples 5 and 6 are clearly separated off the other samples in gene expression space. Why might this be the case?
Create a t-SNE plot colored by cell type and interpret whether cells group by biological identity.
color_vec <- c("#5A5156", "#E4E1E3", "#F6222E", "#FE00FA", "#16FF32", "#3283FE",
"#FEAF16", "#B00068", "#1CFFCE", "#90AD1C", "#2ED9FF", "#DEA0FD",
"#AA0DFE", "#F8A19F", "#325A9B", "#C4451C", "#1C8356", "#85660D",
"#B10DA1", "#3B00FB", "#1CBE4F", "#FA0087", "#333333", "#F7E1A0",
"#C075A6", "#782AB6", "#AAF400", "#BDCDFF", "#822E1C", "#B5EFB5",
"#7ED7D1", "#1C7F93", "#D85FF7", "#683B79", "#66B0FF", "#FBE426")
plotTSNE(sce, colour_by = "celltype.mapped") +
scale_color_manual(values = color_vec) +
theme(legend.position = "bottom")
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
plotTSNE(sce, colour_by = "sample") +
scale_color_manual(values = color_vec) +
theme(legend.position = "bottom")
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
sessionInfo()
## R version 4.5.3 (2026-03-11)
## Platform: x86_64-apple-darwin20
## Running under: macOS Tahoe 26.3.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scDblFinder_1.24.10 EnsDb.Mmusculus.v79_2.99.0
## [3] ensembldb_2.34.0 AnnotationFilter_1.34.0
## [5] GenomicFeatures_1.62.0 DropletUtils_1.30.0
## [7] GSEABase_1.72.0 graph_1.88.1
## [9] annotate_1.88.0 XML_3.99-0.23
## [11] AnnotationDbi_1.72.0 pheatmap_1.0.13
## [13] scran_1.38.1 scater_1.38.1
## [15] scuttle_1.20.0 bluster_1.20.0
## [17] SingleR_2.12.0 lubridate_1.9.5
## [19] forcats_1.0.1 stringr_1.6.0
## [21] dplyr_1.2.1 purrr_1.2.1
## [23] readr_2.2.0 tidyr_1.3.2
## [25] tibble_3.3.1 ggplot2_4.0.2
## [27] tidyverse_2.0.0 MouseGastrulationData_1.24.0
## [29] SpatialExperiment_1.20.0 SingleCellExperiment_1.32.0
## [31] SummarizedExperiment_1.40.0 Biobase_2.70.0
## [33] GenomicRanges_1.62.1 Seqinfo_1.0.0
## [35] IRanges_2.44.0 S4Vectors_0.48.1
## [37] BiocGenerics_0.56.0 generics_0.1.4
## [39] MatrixGenerics_1.22.0 matrixStats_1.5.0
##
## loaded via a namespace (and not attached):
## [1] BiocIO_1.20.0 bitops_1.0-9
## [3] filelock_1.0.3 R.oo_1.27.1
## [5] lifecycle_1.0.5 httr2_1.2.2
## [7] edgeR_4.8.2 MASS_7.3-65
## [9] lattice_0.22-9 magrittr_2.0.5
## [11] limma_3.66.0 sass_0.4.10
## [13] rmarkdown_2.31 jquerylib_0.1.4
## [15] yaml_2.3.12 metapod_1.18.0
## [17] otel_0.2.0 cowplot_1.2.0
## [19] DBI_1.3.0 RColorBrewer_1.1-3
## [21] abind_1.4-8 Rtsne_0.17
## [23] R.utils_2.13.0 BumpyMatrix_1.18.0
## [25] RCurl_1.98-1.18 rappdirs_0.3.4
## [27] ggrepel_0.9.8 irlba_2.3.7
## [29] RSpectra_0.16-2 dqrng_0.4.1
## [31] DelayedMatrixStats_1.32.0 codetools_0.2-20
## [33] DelayedArray_0.36.1 tidyselect_1.2.1
## [35] UCSC.utils_1.6.1 farver_2.1.2
## [37] ScaledMatrix_1.18.0 viridis_0.6.5
## [39] BiocFileCache_3.0.0 GenomicAlignments_1.46.0
## [41] jsonlite_2.0.0 BiocNeighbors_2.4.0
## [43] tools_4.5.3 Rcpp_1.1.1
## [45] glue_1.8.0 gridExtra_2.3
## [47] SparseArray_1.10.10 xfun_0.57
## [49] GenomeInfoDb_1.46.2 HDF5Array_1.38.0
## [51] withr_3.0.2 BiocManager_1.30.27
## [53] fastmap_1.2.0 rhdf5filters_1.22.0
## [55] digest_0.6.39 rsvd_1.0.5
## [57] timechange_0.4.0 R6_2.6.1
## [59] RSQLite_2.4.6 cigarillo_1.0.0
## [61] R.methodsS3_1.8.2 h5mread_1.2.1
## [63] data.table_1.18.2.1 FNN_1.1.4.1
## [65] rtracklayer_1.70.1 httr_1.4.8
## [67] S4Arrays_1.10.1 uwot_0.2.4
## [69] pkgconfig_2.0.3 gtable_0.3.6
## [71] blob_1.3.0 S7_0.2.1
## [73] XVector_0.50.0 htmltools_0.5.9
## [75] ProtGenerics_1.42.0 scales_1.4.0
## [77] png_0.1-9 knitr_1.51
## [79] rstudioapi_0.18.0 tzdb_0.5.0
## [81] rjson_0.2.23 curl_7.0.0
## [83] cachem_1.1.0 rhdf5_2.54.1
## [85] BiocVersion_3.22.0 parallel_4.5.3
## [87] vipor_0.4.7 restfulr_0.0.16
## [89] pillar_1.11.1 grid_4.5.3
## [91] vctrs_0.7.2 BiocSingular_1.26.1
## [93] dbplyr_2.5.2 beachmat_2.26.0
## [95] xtable_1.8-8 cluster_2.1.8.2
## [97] beeswarm_0.4.0 evaluate_1.0.5
## [99] magick_2.9.1 cli_3.6.6
## [101] locfit_1.5-9.12 compiler_4.5.3
## [103] Rsamtools_2.26.0 rlang_1.2.0
## [105] crayon_1.5.3 labeling_0.4.3
## [107] ggbeeswarm_0.7.3 stringi_1.8.7
## [109] viridisLite_0.4.3 BiocParallel_1.44.0
## [111] Biostrings_2.78.0 lazyeval_0.2.3
## [113] Matrix_1.7-5 ExperimentHub_3.0.0
## [115] hms_1.1.4 sparseMatrixStats_1.22.0
## [117] bit64_4.6.0-1 Rhdf5lib_1.32.0
## [119] KEGGREST_1.50.0 statmod_1.5.1
## [121] AnnotationHub_4.0.0 igraph_2.2.3
## [123] memoise_2.0.1 bslib_0.10.0
## [125] xgboost_3.2.1.1 bit_4.6.0