Background

Single-cell RNA sequencing (scRNA-seq) is a powerful genomic technology that enables the measurement of gene expression at the resolution of individual cells. Unlike bulk RNA-seq, which provides an average signal across thousands or millions of cells, scRNA-seq reveals cellular heterogeneity, allowing researchers to identify distinct cell types, states, and dynamic biological processes within complex tissues.

Why Single-Cell Resolution Matters

Biological systems are inherently heterogeneous. In bulk RNA-seq experiments, differences between cells are masked by averaging, which can obscure:

Rare cell populations
Transitional cell states (e.g., differentiation trajectories)
Cell-type-specific responses to stimuli or disease

Therefore, bulk RNA-seq only provide the average measurement for all the cells. It does not provide insights into the stochastic nature of gene expression. scRNA-seq overcomes this limitation by profiling each cell independently, making it especially valuable in fields such as developmental biology, immunology, and cancer genomics.

A typical scRNA-seq experiment consists of several key steps:

Single-Cell Isolation Cells are separated using methods such as microfluidics (e.g., droplet-based systems), fluorescence-activated cell sorting (FACS), or microwell platforms.
Barcoding and Library Preparation Each cell’s RNA is tagged with a unique barcode, allowing reads to be traced back to their cell of origin. Unique Molecular Identifiers (UMIs) are often added to quantify transcripts more accurately.
Sequencing Libraries are sequenced using next-generation sequencing platforms.The main difference between bulk and single-cell RNA-seq (scRNA-seq) is that each sequencing library represents a single cell, instead of a population of cells.
Data Processing Sequencing reads are aligned to a reference genome, and gene expression is quantified to produce a gene-by-cell count matrix.

Key Characteristics of scRNA-seq Data

scRNA-seq data differ substantially from bulk RNA-seq and require specialized analysis approaches:

High sparsity (dropouts): Many genes appear as zero due to low capture efficiency
High dimensionality: Thousands of genes measured across thousands of cells
Technical noise: Variation introduced during capture and amplification
Batch effects: Differences across experiments or runs

These characteristics necessitate careful preprocessing and normalization.

Common Analytical Steps

A standard scRNA-seq analysis workflow includes:

Quality Control (QC) Remove low-quality cells (e.g., high mitochondrial gene content) Filter out lowly expressed genes
Normalization Adjust for differences in sequencing depth across cells
Dimensionality Reduction Principal Component Analysis (PCA) Nonlinear methods such as UMAP or t-SNE
Clustering Identify groups of cells with similar expression profiles
Cell Type Annotation Assign biological meaning to clusters using marker genes
Differential Expression Analysis Identify genes that distinguish clusters or conditions

Common Tools in R

Several R packages are widely used for scRNA-seq analysis:

Seurat – comprehensive toolkit for preprocessing, clustering, and visualization SingleCellExperiment – standardized data container for single-cell data; provides data structure

This class implements a data structure that stores all aspects of our single-cell data - gene-by-cell expression data, cell-wise metadata, and gene-wise annotation - and lets us manipulate them in an organized manner.

Each column = one cell Each row = one gene Values = gene expression counts

Learning Objectives for Part I

Load data generated with common single-cell technologies as SingleCellExperiment objects.
Inspect and manipulate SingleCellExperiment objects.

Package Installation

#BiocManager::install(c("AUCell", "batchelor", "BiocStyle", 
                       #"CuratedAtlasQueryR", "DropletUtils", "duckdb",
                       #"EnsDb.Mmusculus.v79", "MouseGastrulationData", 
                       #"scDblFinder", "Seurat", "lgeistlinger/SeuratData",
                       #"SingleR", "TENxBrainData", "zellkonverter"),
                      # Ncpus = 4)

## took less than 5 minutes to install

Introduction to SingleCellExperiment

load these packages

library(SingleCellExperiment)
library(MouseGastrulationData)
library(tidyverse)
library(dplyr)
library(SingleR)
library(bluster)
library(scater)
library(scran)
library(pheatmap)
library(GSEABase)

The MouseGastrulationData is a data package that contains real scRNA-seq datasets we can use for analysis.

SingleCellExperiment is the container, and MouseGastrulationData is the data we put inside of the container.

The `WTChimerData` dataset

This dataset comes from mouse embryo experiments where:

Fluorescent stem cells were injected into embryos
Cells were later separated (host vs injected)
RNA was sequenced using 10x Genomics

The data can be accessed using the WTChimeraData() function.

sce <- WTChimeraData(samples = 5)

# You may be prompted to type "yes" in the console 
#Here we are simply loading sample 5 from the dataset and storing it into `sce`

sce is now an SingleCellExperiment object, which is structured to store + gene expression data + cell metadata

sce

## class: SingleCellExperiment 
## dim: 29453 2411 
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2411): cell_9769 cell_9770 ... cell_12178 cell_12179
## colData names(11): cell barcode ... doub.density sizeFactor
## reducedDimNames(2): pca.corrected.E7.5 pca.corrected.E8.5
## mainExpName: NULL
## altExpNames(0):

names(assays(sce))

## [1] "counts"

#list the type of expression data stored 
# counts = raw gene expression counts

counts(sce)[1:3, 1:3] # extracting first 3 genes x first 3 cells

## 3 x 3 sparse Matrix of class "dgCMatrix"
##                    cell_9769 cell_9770 cell_9771
## ENSMUSG00000051951         .         .         .
## ENSMUSG00000089699         .         .         .
## ENSMUSG00000102343         .         .         .

colData(sce)[1:3, 1:4] # displays metadata for cells

## DataFrame with 3 rows and 4 columns
##                  cell          barcode    sample       stage
##           <character>      <character> <integer> <character>
## cell_9769   cell_9769 AAACCTGAGACTGTAA         5        E8.5
## cell_9770   cell_9770 AAACCTGAGATGCCTT         5        E8.5
## cell_9771   cell_9771 AAACCTGAGCAGCCTC         5        E8.5

rowData(sce)[1:3, 1:2] # displays metadata for genes

## DataFrame with 3 rows and 2 columns
##                               ENSEMBL      SYMBOL
##                           <character> <character>
## ENSMUSG00000051951 ENSMUSG00000051951        Xkr4
## ENSMUSG00000089699 ENSMUSG00000089699      Gm1992
## ENSMUSG00000102343 ENSMUSG00000102343     Gm37381

sce$my_sum <- colSums(counts(sce)) # calculating total counts per cell; "for each cell how many total transcripts were detected"

colData(sce)[1:3,]

## DataFrame with 3 rows and 12 columns
##                  cell          barcode    sample       stage    tomato
##           <character>      <character> <integer> <character> <logical>
## cell_9769   cell_9769 AAACCTGAGACTGTAA         5        E8.5      TRUE
## cell_9770   cell_9770 AAACCTGAGATGCCTT         5        E8.5      TRUE
## cell_9771   cell_9771 AAACCTGAGCAGCCTC         5        E8.5      TRUE
##                pool stage.mapped celltype.mapped closest.cell doub.density
##           <integer>  <character>     <character>  <character>    <numeric>
## cell_9769         3        E8.25      Mesenchyme   cell_24159   0.02985045
## cell_9770         3         E8.5     Endothelium   cell_96660   0.00172753
## cell_9771         3         E8.5       Allantois  cell_134982   0.01338013
##           sizeFactor    my_sum
##            <numeric> <numeric>
## cell_9769    1.41243     27577
## cell_9770    1.22757     29309
## cell_9771    1.15439     28795

Summary so far:

A SingleCellExperiment object is structured in a way to organize + assays() - gene expression values + colData() - cell-level metadata + rowData() - gene-level metadata

Everything that happens later, builds on this structure.

Exploring the dataset structure

dim(counts(sce)) # [genes, cells]

## [1] 29453  2411

Above is showing the dimensions for one sample: genes × cells_from_sample5

sce_all <- WTChimeraData()

dim(counts(sce_all)) # [genes, cells]

## [1] 29453 30703

#this takes a few minutes

This object contains all cells across all samples Rows = genes Columns = cells

All samples: genes × (cells_sample1 + cells_sample2 + … + cells_sampleN)

Understanding metadata

So far, we’ve explored the expression matrix (genes × cells). Now, we shift our focus to the metadata, which describes each individual cell.

In a SingleCellExperiment object, metadata is stored in colData()

colData(sce_all)

## DataFrame with 30703 rows and 11 columns
##                   cell          barcode    sample       stage    tomato
##            <character>      <character> <integer> <character> <logical>
## cell_1          cell_1 AAACCTGAGACCTTTG         1        E7.5      TRUE
## cell_2          cell_2 AAACCTGAGCGCCTCA         1        E7.5      TRUE
## cell_3          cell_3 AAACCTGAGGATCGCA         1        E7.5      TRUE
## cell_4          cell_4 AAACCTGCACGTAAGG         1        E7.5      TRUE
## cell_5          cell_5 AAACCTGCAGACAGGT         1        E7.5      TRUE
## ...                ...              ...       ...         ...       ...
## cell_30699  cell_30699 TTTGTCACAGCTCGCA        10        E8.5     FALSE
## cell_30700  cell_30700 TTTGTCAGTCTAGTCA        10        E8.5     FALSE
## cell_30701  cell_30701 TTTGTCATCATCGGAT        10        E8.5     FALSE
## cell_30702  cell_30702 TTTGTCATCATTATCC        10        E8.5     FALSE
## cell_30703  cell_30703 TTTGTCATCCCATTTA        10        E8.5     FALSE
##                 pool stage.mapped        celltype.mapped closest.cell
##            <integer>  <character>            <character>  <character>
## cell_1             1         E7.5 Haematoendothelial p..   cell_22032
## cell_2             1         E8.0             Mesenchyme  cell_116830
## cell_3             1        E7.25       Primitive Streak   cell_84552
## cell_4             1         E7.5        Caudal epiblast     cell_714
## cell_5             1         E7.5        Caudal epiblast    cell_4785
## ...              ...          ...                    ...          ...
## cell_30699         5         E8.5             Erythroid3   cell_38810
## cell_30700         5         E8.5       Surface ectoderm   cell_38588
## cell_30701         5        E8.25 Forebrain/Midbrain/H..   cell_66082
## cell_30702         5         E8.5             Erythroid3  cell_138114
## cell_30703         5         E8.0                Doublet   cell_92644
##            doub.density sizeFactor
##               <numeric>  <numeric>
## cell_1        0.0504199    1.16288
## cell_2        0.0974964    1.65059
## cell_3        0.0413143    1.64731
## cell_4        0.0433117    1.16565
## cell_5        0.0627621    1.48883
## ...                 ...        ...
## cell_30699   0.00146287   0.389311
## cell_30700   0.00374155   0.588784
## cell_30701   0.05651258   0.624455
## cell_30702   0.00108837   0.550807
## cell_30703   0.82369305   1.184919

Each row = one cell Columns describe the cell

Some key columns include: + sample + stage = developmental stage + celltype.mapped = predicted cell identity + total_counts = total reads per cell

Let’s ask a basic question:

How many samples are in this dataset?

unique(colData(sce_all)$sample)

##  [1]  1  2  3  4  5  6  7  8  9 10

This tells us there are 10 samples, but what do these actually represent?

To interpret sample, we need to connect it to a biological variable like stage:

as.data.frame(colData(sce_all)) |> 
  distinct(sample, stage) |> 
  arrange(sample)

##            sample stage
## cell_1          1  E7.5
## cell_2883       2  E7.5
## cell_5480       3  E7.5
## cell_7948       4  E7.5
## cell_9769       5  E8.5
## cell_12180      6  E8.5
## cell_13227      7  E8.5
## cell_16234      8  E8.5
## cell_19331      9  E8.5
## cell_23875     10  E8.5

Why as.data.frame? colData() returns a special Bioconductor object (DataFrame) which is different than base R data.frame

as.data.frame() makes the data compatible with tidyverse tools

counts(sce_all)[, "cell_1"][1:10] #showing the first 10 genes for cell_1

## ENSMUSG00000051951 ENSMUSG00000089699 ENSMUSG00000102343 ENSMUSG00000025900 
##                  0                  0                  0                  0 
## ENSMUSG00000025902 ENSMUSG00000104328 ENSMUSG00000033845 ENSMUSG00000025903 
##                  0                  0                 13                  3 
## ENSMUSG00000104217 ENSMUSG00000033813 
##                  0                  3

colData(sce_all)["cell_1", ] #metadata for cell_1

## DataFrame with 1 row and 11 columns
##               cell          barcode    sample       stage    tomato      pool
##        <character>      <character> <integer> <character> <logical> <integer>
## cell_1      cell_1 AAACCTGAGACCTTTG         1        E7.5      TRUE         1
##        stage.mapped        celltype.mapped closest.cell doub.density sizeFactor
##         <character>            <character>  <character>    <numeric>  <numeric>
## cell_1         E7.5 Haematoendothelial p..   cell_22032    0.0504199    1.16288

Visualizing cells across biological conditions

Now that we understand how samples relate to biological stages, we can ask a more meaningful question:

How many cells are in each developmental stage?

Class Exercise

In this exercise, create a bar plot to visualize the number of cells at each developmental stage using the sce_all dataset.

Use ggplot() to create a bar plot where
- The x-axis represents developmental stage (stage)
- Bars are filled by stage
apply the theme_classic()
Add informative labels:
- Title: Number of Cells per Developmental Stage
- X-axis: Developmental Stage
- Y-axis: Number of Cells

# Visualize the number of cell counts per sample
library(ggplot2)
library(dplyr)

as.data.frame(colData(sce_all)) |> 
  ggplot(aes(x = stage, fill = stage)) +
  geom_bar(color = "black") +
  theme_classic() +
  labs(
    title = "Number of Cells per Developmental Stage",
    x = "Developmental Stage",
    y = "Number of Cells"
  )

It’s still useful to understand how cells are distributed across samples:

#mutate(sample = factor(sample)) |> 

as.data.frame(colData(sce_all)) |> 
  mutate(sample = factor(sample)) |> 
  ggplot(aes(x = sample, fill = stage)) +
  geom_bar(color = "black") +
  theme_classic() +
  labs(
    title = "Number of Cells per Sample",
    x = "Sample",
    y = "Number of Cells"
  )

This plot helps us see:

Which samples belong to each stage
Whether cell counts are balanced across replicates

So far, we’ve counted cells.

Next, we examine how much data each cell contains.

How many sequencing reads are captured per cell?

library(scater)

sce_all$total_counts <- colSums(counts(sce_all))

as.data.frame(colData(sce_all)) |> 
  ggplot(aes(x = total_counts)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "black") +
  theme_classic() +
  labs(
    title = "Distribution of Total Counts per Cell",
    x = "Total Counts per Cell",
    y = "Number of Cells"
  )

Each bar = number of cells in a given count range

This helps identify: low-quality cells (low counts)

Why we need dimensionality reduction

At this point, we can see that scRNA-seq data are:

high-dimensional (thousands of genes)
large-scale (thousands of cells)

This makes it difficult to:

visualize the data
identify patterns
group similar cells

Solution is reducing dimensionality

To solve this, we apply dimensionality reduction techniques, such as:

PCA (Principal Component Analysis)
t-SNE
UMAP

One useful feature of SingleCellExperiment is that it can store these reduced representations directly inside the object.

reducedDims(sce)

## List of length 2
## names(2): pca.corrected.E7.5 pca.corrected.E8.5

# this object contains two different dimensionality reductions; one for stage E7.5 X E8.5

Now we can visualize the dataset:

plotReducedDim(sce, 
               "pca.corrected.E8.5",
               colour_by = "stage.mapped")

#ggsave("pca_corrected_E8_5.png")

Each point = one cell
position = similarity in gene expression
color = developmental stage

Exploratory data analysis and quality control

Single-cell datasets contain technical artifacts that must be addressed before downstream analysis.

Common issues include:

Empty droplets (no real cell, only ambient RNA)
Doublets (two cells captured together)
Low-quality cells

Therefore, QC is a critical early step in any scRNA-seq workflow.

library(MouseGastrulationData)
library(DropletUtils)
library(ggplot2)
library(EnsDb.Mmusculus.v79)
library(scuttle)
library(scater)
library(scran)
library(scDblFinder)

Load raw (unfiltered) data

sce <- WTChimeraData(samples = 5, type = "raw")
# the “raw” data that contains all the droplets for which we have sequenced reads

This dataset contains all droplets captured during sequencing, however, not all droplets contain real cells.

sce

## $`5`
## class: SingleCellExperiment 
## dim: 29453 522554 
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(522554): AAACCTGAGAAACCAT AAACCTGAGAAACCGC ...
##   TTTGTCATCTTTACGT TTTGTCATCTTTCCTC
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

From the experiment, we expect to have only a few thousand cells, while we can see that we have data for more than 500,000 droplets. It is likely that most of these droplets are empty and are capturing only ambient or background RNA.

[droplet]

Testing for empty droplets, before filtering, we isolate the relevant data:

sce <- sce[[1]]
#This extracts the count matrix for analysis

Next, we will tests each droplet to determine if it contains a real cell + Uses a statistical model to distinguish real cells vs ambient RNA

# emptyDrops performs Monte Carlo simulations to compute p-values,
# so we need to set the seed to obtain reproducible results.
set.seed(100)

# this may take a few minutes
e.out <- emptyDrops(counts(sce))

# This shows how many droplets pass the threshold
summary(e.out$FDR <= 0.001)

##    Mode   FALSE    TRUE    NA's 
## logical    5919    3396  513239

The result confirms our expectation: only 3396 droplets contain a cell, while the large majority of droplets are empty.

We need to filter to keep real cells and remove empty droplets.

sce <- sce[,which(e.out$FDR <= 0.001)]
sce

## class: SingleCellExperiment 
## dim: 29453 3396 
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(3396): AAACCTGAGACTGTAA AAACCTGAGATGCCTT ... TTTGTCACATTCTCAT
##   TTTGTCATCTGAGTGT
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Note: Whenever your code involves the generation of random numbers, it’s a good practice to set the random seed in R with set.seed().

Setting the seed to a specific value (in the above example to 100) will cause the pseudo-random number generator to return the same pseudo-random numbers in the same order.

This allows us to write code with reproducible results, despite technically involving the generation of (pseudo-)random numbers.

Quality Control

While we have removed empty droplets, this does not necessarily imply that all the cell-containing droplets should be kept for downstream analysis. Some droplets could contain low-quality samples, due to cell damage or failure in library preparation, pooly sequenced, or partially lysed.

What makes a low-quality cell?

One common signal is high mitochondrial gene expression. High proportions of mitochondrial genes are indicative of poor-quality cells because of loss of cytoplasmic RNA from perforated cells.

For single-nucleus RNA-seq experiments, high proportions are also useful as they can mark cells where the cytoplasm has not been successfully stripped.

Step 1: First, we need to identify mitochondrial genes. We use the available EnsDb mouse package available in Bioconductor, but a more updated version of Ensembl can be used through the AnnotationHub or biomaRt packages.

chr.loc <- mapIds(EnsDb.Mmusculus.v79,
                  keys    = rownames(sce),
                  keytype = "GENEID", 
                  column  = "SEQNAME")

## Warning: Unable to map 2918 of 29453 requested IDs.

is.mito <- which(chr.loc == "MT")

What is this doing? - Look up each gene - Find which chromosome it comes from - Keep genes located on the mitochondrial chromosome (“MT”)

Step 2: Calculate QC metrics for each cell:

df <- perCellQCMetrics(sce, subsets = list(Mito = is.mito))

Step 3: Add QC metrics to metadata

colData(sce)

## DataFrame with 3396 rows and 0 columns

colData(sce)

## DataFrame with 3396 rows and 0 columns

#cbind = column bind

Now that we have computed the metrics, we have to decide on thresholds to define high- and low-quality samples.

Step 4: Explore QC metrics. We could check how many cells are above/below a certain fixed threshold.

# How many cells have very low counts?

# How many cells have >10% mitochondrial reads?

# Are there cells detecting very few genes?

Step 5: Flag low quality cells

reasons <- perCellQCFilters(df, sub.fields = "subsets_Mito_percent")

reasons

## DataFrame with 3396 rows and 4 columns
##          low_lib_size   low_n_features high_subsets_Mito_percent   discard
##      <outlier.filter> <outlier.filter>          <outlier.filter> <logical>
## 1               FALSE            FALSE                     FALSE     FALSE
## 2               FALSE            FALSE                     FALSE     FALSE
## 3               FALSE            FALSE                     FALSE     FALSE
## 4               FALSE            FALSE                     FALSE     FALSE
## 5                TRUE             TRUE                     FALSE      TRUE
## ...               ...              ...                       ...       ...
## 3392            FALSE            FALSE                     FALSE     FALSE
## 3393             TRUE             TRUE                     FALSE      TRUE
## 3394             TRUE             TRUE                      TRUE      TRUE
## 3395             TRUE             TRUE                      TRUE      TRUE
## 3396             TRUE             TRUE                      TRUE      TRUE

# identifies outliers based on QC metrics; flags poor quality cells

Step 6: Label cells to keep/remove

sce$discard <- reasons$discard

# TRUE = discard (low quality)
# FALSE = keep

Step 7: Visualize QC filtering

How to interpret:

Each point = one cell
y-axis = total counts
color = whether the cell will be removed

Summary: At this stage, we are not asking biological questions yet — we are making sure our data is reliable enough to trust.

Learning Objectives for Part II

In this part of the tutorial, you will learn how to:

Group cells into clusters based on similar gene expression patterns
Find marker genes that distinguish each cluster
Assign likely cell types using reference datasets

Data

We will now be working with a processed single-cell RNA-seq dataset from a wild-type chimeric mouse embryo.

sce <- WTChimeraData(samples = 5, type = "processed")

sce

## class: SingleCellExperiment 
## dim: 29453 2411 
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(2411): cell_9769 cell_9770 ... cell_12178 cell_12179
## colData names(11): cell barcode ... doub.density sizeFactor
## reducedDimNames(2): pca.corrected.E7.5 pca.corrected.E8.5
## mainExpName: NULL
## altExpNames(0):

What’s happening here?

sce is a SingleCellExperiment object, which stores all of our data Rows = genes, columns = cells

It may also include metadata (e.g., cell type, experimental condition)

Tip: Always print your object (sce) to understand what data is available before starting analysis.

Subsetting cells

To make the analysis faster, we randomly select 1,000 cells.

set.seed(123)

ind <- sample(ncol(sce), 1000)

sce <- sce[,ind]

What’s happening here?

ncol(sce) = total number of cells
sample() randomly selects 1,000 cells
sce[, ind] keeps all genes but only those selected cells

Why this matters: Working with fewer cells speeds things up while still preserving biological patterns.

Preprocessing

Before clustering, we need to normalize the data and reduce its complexity.

sce <- logNormCounts(sce)

# Run PCA 
sce <- runPCA(sce)

What’s happening here?

logNormCounts() adjusts for differences in sequencing depth and makes values comparable across cells

runPCA() reduces thousands of genes into a smaller number of dimensions (principal components)

Clustering

Clustering is an unsupervised learning procedure that is used to define groups of cells with similar expression profiles. Its primary purpose is to summarize complex scRNA-seq data into a digestible format for human interpretation. In short, this helps us identify potential cell types of interest.

colLabels(sce) <- clusterCells(sce, 
                               use.dimred = "PCA",
                               BLUSPARAM = NNGraphParam(cluster.fun = "louvain"))

# table of clusters for sce
table(colLabels(sce))

## 
##   1   2   3   4   5   6   7   8   9  10  11 
## 193 161  59 134  63  61 108  49  91  42  39

Here, we use the clusterCells() function from the scran package to perform graph-based clustering using the Louvain algorithm for community detection. This function returns a vector containing cluster assignments for each cell in our SingleCellExperiment object.

You can see we ended up with 11 clusters of varying cell sizes. The key idea here is cells in the same cluster likely represent the same cell type or state.

Visualization (UMAP)

Now we visualize the clusters in 2D space.

sce <- runUMAP(sce, dimred = "PCA")

# plotReducedDim
plotReducedDim(sce, "UMAP", color_by = "label")

Changing cluster resolution

What if we wanted to get fewer, larger clusters?

sce$clust2 <- clusterCells(sce, use.dimred = "PCA",
                           BLUSPARAM = NNGraphParam(cluster.fun = "louvain",
                                                    k = 30))

# k = number of nearest neighbors used to build the graph where small k cells only connect locally and large k cells connect more broadly (bigger communities)

# plot for clust2
plotReducedDim(sce, "UMAP", color_by = "clust2")

** Quick Exercise** : Run clustering with different K values (k = 10 then k = 50). How does the number of clusters change - provide the numbers below.

k; # of clusters k = 10; # = k = 30; # = k = 50; # =

How do you choose k? Lots of questions are asked:

Do clusters look stable?
Do they match known biology?
Are clusters too fragmented or too merged?

Marker gene detection

Marker genes allow us to assign biological meaning to each cluster based on their functional annotation. We have a priori knowledge of the marker genes associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity.

The most straightforward approach to marker gene detection involves testing for differential expression between clusters. If a gene is strongly DE between clusters, it is likely to have driven the separation of cells in the clustering algorithm.

#rename genes using symbols 
rownames(sce) <- rowData(sce)$SYMBOL

# find genes that distinguish each cluster 
markers <- scoreMarkers(sce)

# output a ranked list of genes for each cluster 
markers

## List of length 11
## names(11): 1 2 3 4 5 6 7 8 9 10 11

Below shows the marker genes for cluster 1.

markers[[1]]

## DataFrame with 29453 rows and 19 columns
##                self.average other.average self.detected other.detected
##                   <numeric>     <numeric>     <numeric>      <numeric>
## Xkr4             0.00777523    0.00204744     0.0103627      0.0015873
## Gm1992           0.00000000    0.00000000     0.0000000      0.0000000
## Gm37381          0.00000000    0.00000000     0.0000000      0.0000000
## Rp1              0.00000000    0.00000000     0.0000000      0.0000000
## Sox17            0.01448428    0.28440343     0.0103627      0.1378294
## ...                     ...           ...           ...            ...
## AC149090.1       0.39760189   0.339949277    0.32642487      0.2773499
## DHRSX            0.37536303   0.514962467    0.32642487      0.4087374
## Vmn2r122         0.00000000   0.000000000    0.00000000      0.0000000
## CAAA01147332.1   0.00852569   0.000802687    0.00518135      0.0010989
## tomato-td        0.61466767   0.641897197    0.48186528      0.4972542
##                mean.logFC.cohen min.logFC.cohen median.logFC.cohen
##                       <numeric>       <numeric>          <numeric>
## Xkr4                   0.119882        -0.10006          0.1443198
## Gm1992                 0.000000         0.00000          0.0000000
## Gm37381                0.000000         0.00000          0.0000000
## Rp1                    0.000000         0.00000          0.0000000
## Sox17                 -0.346525        -2.07437         -0.0713339
## ...                         ...             ...                ...
## AC149090.1            0.1070458     -0.08935467          0.0555201
## DHRSX                -0.2135530     -0.49469635         -0.1947910
## Vmn2r122              0.0000000      0.00000000          0.0000000
## CAAA01147332.1        0.0921178      0.00500178          0.1017973
## tomato-td            -0.0304896     -0.27695578         -0.0829054
##                max.logFC.cohen rank.logFC.cohen  mean.AUC   min.AUC median.AUC
##                      <numeric>        <integer> <numeric> <numeric>  <numeric>
## Xkr4                  0.144320             3253  0.504379 0.4971626   0.505181
## Gm1992                0.000000             7838  0.500000 0.5000000   0.500000
## Gm37381               0.000000             7838  0.500000 0.5000000   0.500000
## Rp1                   0.000000             7838  0.500000 0.5000000   0.500000
## Sox17                 0.143843             5357  0.435959 0.0958549   0.493880
## ...                        ...              ...       ...       ...        ...
## AC149090.1           0.4394296             1678  0.524363  0.482359   0.509786
## DHRSX                0.0270598             7335  0.451362  0.383050   0.450894
## Vmn2r122             0.0000000             7838  0.500000  0.500000   0.500000
## CAAA01147332.1       0.1017973             4305  0.502044  0.497125   0.502591
## tomato-td            0.2898496             4071  0.487954  0.422322   0.476355
##                  max.AUC  rank.AUC mean.logFC.detected min.logFC.detected
##                <numeric> <integer>           <numeric>          <numeric>
## Xkr4            0.505181      5688         7.43734e-01       -2.75044e-01
## Gm1992          0.500000      7636        -1.60171e-17       -3.20343e-16
## Gm37381         0.500000      7636        -1.60171e-17       -3.20343e-16
## Rp1             0.500000      7636        -1.60171e-17       -3.20343e-16
## Sox17           0.505181      6757        -1.32542e+00       -4.92640e+00
## ...                  ...       ...                 ...                ...
## AC149090.1      0.588965      1993         2.58156e-01       -2.88274e-02
## DHRSX           0.511020      4598        -2.98961e-01       -5.66395e-01
## Vmn2r122        0.500000      7636        -1.60171e-17       -3.20343e-16
## CAAA01147332.1  0.502591      6151         3.89887e-01       -4.42710e-01
## tomato-td       0.568178      2881        -2.57339e-02       -3.30235e-01
##                median.logFC.detected max.logFC.detected rank.logFC.detected
##                            <numeric>          <numeric>           <integer>
## Xkr4                        0.697532        1.41597e+00                1119
## Gm1992                      0.000000        3.20343e-16                6173
## Gm37381                     0.000000        3.20343e-16                6173
## Rp1                         0.000000        3.20343e-16                6173
## Sox17                      -0.703710        9.58290e-01                2200
## ...                              ...                ...                 ...
## AC149090.1                 0.0583464        9.40386e-01                2627
## DHRSX                     -0.2934548        2.40918e-02                5723
## Vmn2r122                   0.0000000        3.20343e-16                6173
## CAAA01147332.1             0.3905253        8.75149e-01                1964
## tomato-td                 -0.0101262        2.99327e-01                4995

# each row = one gene 
# each column = a statistic comparing cluster 1 vs all other cells 

# self.average = avg expression in cluster 1
# other.average = avg expression elsewhere 

# looking for high self.average compare to other.average

AUC tells you how well this gene separates the cluster. Where 0.5 = no separation and closer to 1 is a strong marker.

Key takeaway: Most genes are not good markers. We need to filter them out.

Visualize Marker Expression

We can then inspect the top marker genes for the first cluster using the plotExpression function from the scater package.

# select the top marker genes for cluster 1 
c1_markers <- markers[[1]]

ord <- order(c1_markers$mean.AUC,
             decreasing = TRUE)

top.markers <- head(rownames(c1_markers[ord,]))

# plot their expression across clusters 
plotExpression(sce, 
               features = top.markers, 
               x        = "label", 
               color_by = "label")

How to interpret:

A good marker = high expression in one cluster, low in others
No single gene is perfect → we use combinations of markers

Clearly, not every marker gene distinguishes cluster 1 from every other cluster. However, with a combination of multiple marker genes it’s possible to clearly identify gene patterns that are unique to cluster 1.

Class Exercise

Identify cell markers for cluster 2 and plot these. Word of caution: be careful when renaming R objects!

# select the top marker genes for cluster 2 


# plot expression of top cluster 2 markers across clusters

Cell type annotation

Clustering tells us groups exist, but not what they are. Annotation assigns biological meaning to clusters.

Next, we can directly compare our expression profiles to published reference datasets using SingleR.

Reference-Based Annotation (SingleR)

We compare our data to a reference dataset with known cell types.

ref <- EmbryoAtlasData(samples = 29)
ref

## class: SingleCellExperiment 
## dim: 29452 7569 
## metadata(0):
## assays(1): counts
## rownames(29452): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000096730 ENSMUSG00000095742
## rowData names(2): ENSEMBL SYMBOL
## colnames(7569): cell_95727 cell_95728 ... cell_103294 cell_103295
## colData names(17): cell barcode ... colour sizeFactor
## reducedDimNames(2): pca.corrected umap
## mainExpName: NULL
## altExpNames(0):

# EmbryoAtlas is a reference database with cells that are already labeled by experts

Contains 29,452 genes and 7,569 cells. GeneIDs are ENSEMBL and SYMBOL

We are comparing our unknown cells to this labeled dataset!

# subsetting and cleaning the reference 
set.seed(123)

ind <- sample(ncol(ref), 1000)

ref <- ref[,ind]

You can see we have an assortment of different cell types in the reference (with varying frequency):

tab <- sort(table(ref$celltype), decreasing = TRUE)

tab

## 
##   Forebrain/Midbrain/Hindbrain                     Erythroid3 
##                            131                             75 
##              Paraxial mesoderm                            NMP 
##                             69                             51 
##                   ExE mesoderm               Surface ectoderm 
##                             49                             47 
##                      Allantois                     Mesenchyme 
##                             46                             45 
##                    Spinal cord            Pharyngeal mesoderm 
##                             45                             41 
##                   ExE endoderm                   Neural crest 
##                             38                             35 
##                            Gut Haematoendothelial progenitors 
##                             30                             27 
##          Intermediate mesoderm                 Cardiomyocytes 
##                             27                             26 
##               Somitic mesoderm                    Endothelium 
##                             25                             23 
##                     Erythroid2                  Def. endoderm 
##                             11                              3 
##                     Erythroid1            Blood progenitors 1 
##                              2                              1 
##            Blood progenitors 2                Caudal Mesoderm 
##                              1                              1 
##                            PGC 
##                              1

Which is the most and least abundant labeled cell type in this reference?

#the normalized log counts
ref <- logNormCounts(ref)

# remove cells of the reference dataset for which the cell type annotation is missing
nna <- !is.na(ref$celltype)
ref <- ref[,nna]

#Also remove cell types of very low abundance (here less than 10 cells) to remove noise prior to subsequent annotation tasks

abu.ct <- names(tab)[tab >= 10]

ind <- ref$celltype %in% abu.ct

ref <- ref[,ind]

Restrict to genes shared between query and reference dataset. Both datasets must use the same genes for comparison!

rownames(ref) <- rowData(ref)$SYMBOL

shared_genes <- intersect(rownames(sce), rownames(ref))

sce <- sce[shared_genes,]

ref <- ref[shared_genes,]

Convert to matrix to run SingleR

sce.mat <- as.matrix(assay(sce, "logcounts"))

ref.mat <- as.matrix(assay(ref, "logcounts"))

Running SingleR with the query and reference datasets:

res <- SingleR(test = sce.mat, 
               ref = ref.mat,
               labels = ref$celltype)
res

## DataFrame with 1000 rows and 4 columns
##                                    scores                 labels delta.next
##                                  <matrix>            <character>  <numeric>
## cell_11995 0.348586:0.335451:0.314515:... Forebrain/Midbrain/H..  0.1285110
## cell_10294 0.273570:0.260013:0.298932:...             Erythroid3  0.1381951
## cell_9963  0.328538:0.291288:0.475611:...            Endothelium  0.2193295
## cell_11610 0.281161:0.269245:0.299961:...             Erythroid3  0.0359215
## cell_10910 0.422454:0.346897:0.355947:...           ExE mesoderm  0.0984285
## ...                                   ...                    ...        ...
## cell_11597 0.323805:0.292967:0.300485:...                    NMP  0.1663369
## cell_9807  0.464466:0.374189:0.381698:...             Mesenchyme  0.0833019
## cell_10095 0.341721:0.288215:0.485324:...            Endothelium  0.0889931
## cell_11706 0.267487:0.240215:0.286012:...             Erythroid2  0.0350557
## cell_11860 0.345786:0.343437:0.313994:... Forebrain/Midbrain/H..  0.0117001
##                     pruned.labels
##                       <character>
## cell_11995 Forebrain/Midbrain/H..
## cell_10294             Erythroid3
## cell_9963             Endothelium
## cell_11610             Erythroid3
## cell_10910           ExE mesoderm
## ...                           ...
## cell_11597                    NMP
## cell_9807              Mesenchyme
## cell_10095            Endothelium
## cell_11706             Erythroid2
## cell_11860 Forebrain/Midbrain/H..

delta.next is the difference between the best match and the second best match. Here a large delta (0.2+) is equivalent to more confidence. A small delta is ambiguous (less than 0.05)
pruned.labels cleaned versions of labels & removes low confidence assignments

Overall, SingleR assigns a label AND tells you how confident it is.

Evaluating Annotation Confidence

plotScoreHeatmap(res)

How to interpret:

Each row = cell
Each column = cell type
Color = similarity score
- Dark/high = strong match
- Light/low = weak match

Comparing Clusters and Cell Types

We can also compare the cell type assignments with the unsupervised clustering results to determine the identity of each cluster.

tab <- table(anno = res$pruned.labels, 
             cluster = colLabels(sce))

pheatmap(log2(tab + 10), 
         color = colorRampPalette(c("white", "blue"))(101))

Multi-sample analyses

In this section, we move from analyzing a single sample to analyzing multiple samples together. This allows us to compare across experiments and identify shared or sample-specific patterns.

sce <- WTChimeraData(samples = 5:10, type = "processed")
# this will take some time to load 

sce

## class: SingleCellExperiment 
## dim: 29453 20935 
## metadata(0):
## assays(1): counts
## rownames(29453): ENSMUSG00000051951 ENSMUSG00000089699 ...
##   ENSMUSG00000095742 tomato-td
## rowData names(2): ENSEMBL SYMBOL
## colnames(20935): cell_9769 cell_9770 ... cell_30702 cell_30703
## colData names(11): cell barcode ... doub.density sizeFactor
## reducedDimNames(2): pca.corrected.E7.5 pca.corrected.E8.5
## mainExpName: NULL
## altExpNames(0):

What’s happening?

We are loading samples 5 through 10 (multiple experiments)

type = “processed” means: + Quality control has already been done + Data is ready for downstream analysis

# shows metadata for each cell
colData(sce)

## DataFrame with 20935 rows and 11 columns
##                   cell          barcode    sample       stage    tomato
##            <character>      <character> <integer> <character> <logical>
## cell_9769    cell_9769 AAACCTGAGACTGTAA         5        E8.5      TRUE
## cell_9770    cell_9770 AAACCTGAGATGCCTT         5        E8.5      TRUE
## cell_9771    cell_9771 AAACCTGAGCAGCCTC         5        E8.5      TRUE
## cell_9772    cell_9772 AAACCTGCATACTCTT         5        E8.5      TRUE
## cell_9773    cell_9773 AAACGGGTCAACACCA         5        E8.5      TRUE
## ...                ...              ...       ...         ...       ...
## cell_30699  cell_30699 TTTGTCACAGCTCGCA        10        E8.5     FALSE
## cell_30700  cell_30700 TTTGTCAGTCTAGTCA        10        E8.5     FALSE
## cell_30701  cell_30701 TTTGTCATCATCGGAT        10        E8.5     FALSE
## cell_30702  cell_30702 TTTGTCATCATTATCC        10        E8.5     FALSE
## cell_30703  cell_30703 TTTGTCATCCCATTTA        10        E8.5     FALSE
##                 pool stage.mapped        celltype.mapped closest.cell
##            <integer>  <character>            <character>  <character>
## cell_9769          3        E8.25             Mesenchyme   cell_24159
## cell_9770          3         E8.5            Endothelium   cell_96660
## cell_9771          3         E8.5              Allantois  cell_134982
## cell_9772          3         E8.5             Erythroid3  cell_133892
## cell_9773          3        E8.25             Erythroid1   cell_76296
## ...              ...          ...                    ...          ...
## cell_30699         5         E8.5             Erythroid3   cell_38810
## cell_30700         5         E8.5       Surface ectoderm   cell_38588
## cell_30701         5        E8.25 Forebrain/Midbrain/H..   cell_66082
## cell_30702         5         E8.5             Erythroid3  cell_138114
## cell_30703         5         E8.0                Doublet   cell_92644
##            doub.density sizeFactor
##               <numeric>  <numeric>
## cell_9769    0.02985045    1.41243
## cell_9770    0.00172753    1.22757
## cell_9771    0.01338013    1.15439
## cell_9772    0.00218402    1.28676
## cell_9773    0.00211723    1.78719
## ...                 ...        ...
## cell_30699   0.00146287   0.389311
## cell_30700   0.00374155   0.588784
## cell_30701   0.05651258   0.624455
## cell_30702   0.00108837   0.550807
## cell_30703   0.82369305   1.184919

Remove unwanted cells

drop <- sce$celltype.mapped %in% c("stripped", "Doublet")

sce <- sce[,!drop]

Removing Doublets: two cells captured together (technical artifact)
Stripped: low-quality or ambiguous cells

# Subsample cells (per sample)
set.seed(29482)

# We randomly select 50% of cells from each sample
idx <- unlist(tapply(colnames(sce), sce$sample, function(x) {
    perc <- round(0.50 * length(x))
    sample(x, perc)
}))

sce <- sce[,idx]

Why this matters:

Prevents one sample from dominating the analysis
Speeds up computation

# Normalize counts
sce <- logNormCounts(sce)

# Find highly variable genes (HVGs)
dec <- modelGeneVar(sce, block = sce$sample)
chosen.hvgs <- dec$bio > 0

# PCA (dimensionality reduction)
sce <- runPCA(sce, subset_row = chosen.hvgs, ntop = 1000)

# t-SNE (visualization)
sce <- runTSNE(sce, dimred = "PCA")
sce$sample <- as.factor(sce$sample)

plotTSNE(sce, colour_by = "sample")

# Each point = one cell

It seems like samples 5 and 6 are clearly separated off the other samples in gene expression space. Why might this be the case?

Visualizing Cells with Custom Colors

Create a t-SNE plot colored by cell type and interpret whether cells group by biological identity.

color_vec <- c("#5A5156", "#E4E1E3", "#F6222E", "#FE00FA", "#16FF32", "#3283FE", 
               "#FEAF16", "#B00068", "#1CFFCE", "#90AD1C", "#2ED9FF", "#DEA0FD", 
               "#AA0DFE", "#F8A19F", "#325A9B", "#C4451C", "#1C8356", "#85660D", 
               "#B10DA1", "#3B00FB", "#1CBE4F", "#FA0087", "#333333", "#F7E1A0", 
               "#C075A6", "#782AB6", "#AAF400", "#BDCDFF", "#822E1C", "#B5EFB5", 
               "#7ED7D1", "#1C7F93", "#D85FF7", "#683B79", "#66B0FF", "#FBE426")

plotTSNE(sce, colour_by = "celltype.mapped") +
    scale_color_manual(values = color_vec) +
    theme(legend.position = "bottom")

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

plotTSNE(sce, colour_by = "sample") +
    scale_color_manual(values = color_vec) +
    theme(legend.position = "bottom")

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

sessionInfo()

## R version 4.5.3 (2026-03-11)
## Platform: x86_64-apple-darwin20
## Running under: macOS Tahoe 26.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scDblFinder_1.24.10          EnsDb.Mmusculus.v79_2.99.0  
##  [3] ensembldb_2.34.0             AnnotationFilter_1.34.0     
##  [5] GenomicFeatures_1.62.0       DropletUtils_1.30.0         
##  [7] GSEABase_1.72.0              graph_1.88.1                
##  [9] annotate_1.88.0              XML_3.99-0.23               
## [11] AnnotationDbi_1.72.0         pheatmap_1.0.13             
## [13] scran_1.38.1                 scater_1.38.1               
## [15] scuttle_1.20.0               bluster_1.20.0              
## [17] SingleR_2.12.0               lubridate_1.9.5             
## [19] forcats_1.0.1                stringr_1.6.0               
## [21] dplyr_1.2.1                  purrr_1.2.1                 
## [23] readr_2.2.0                  tidyr_1.3.2                 
## [25] tibble_3.3.1                 ggplot2_4.0.2               
## [27] tidyverse_2.0.0              MouseGastrulationData_1.24.0
## [29] SpatialExperiment_1.20.0     SingleCellExperiment_1.32.0 
## [31] SummarizedExperiment_1.40.0  Biobase_2.70.0              
## [33] GenomicRanges_1.62.1         Seqinfo_1.0.0               
## [35] IRanges_2.44.0               S4Vectors_0.48.1            
## [37] BiocGenerics_0.56.0          generics_0.1.4              
## [39] MatrixGenerics_1.22.0        matrixStats_1.5.0           
## 
## loaded via a namespace (and not attached):
##   [1] BiocIO_1.20.0             bitops_1.0-9             
##   [3] filelock_1.0.3            R.oo_1.27.1              
##   [5] lifecycle_1.0.5           httr2_1.2.2              
##   [7] edgeR_4.8.2               MASS_7.3-65              
##   [9] lattice_0.22-9            magrittr_2.0.5           
##  [11] limma_3.66.0              sass_0.4.10              
##  [13] rmarkdown_2.31            jquerylib_0.1.4          
##  [15] yaml_2.3.12               metapod_1.18.0           
##  [17] otel_0.2.0                cowplot_1.2.0            
##  [19] DBI_1.3.0                 RColorBrewer_1.1-3       
##  [21] abind_1.4-8               Rtsne_0.17               
##  [23] R.utils_2.13.0            BumpyMatrix_1.18.0       
##  [25] RCurl_1.98-1.18           rappdirs_0.3.4           
##  [27] ggrepel_0.9.8             irlba_2.3.7              
##  [29] RSpectra_0.16-2           dqrng_0.4.1              
##  [31] DelayedMatrixStats_1.32.0 codetools_0.2-20         
##  [33] DelayedArray_0.36.1       tidyselect_1.2.1         
##  [35] UCSC.utils_1.6.1          farver_2.1.2             
##  [37] ScaledMatrix_1.18.0       viridis_0.6.5            
##  [39] BiocFileCache_3.0.0       GenomicAlignments_1.46.0 
##  [41] jsonlite_2.0.0            BiocNeighbors_2.4.0      
##  [43] tools_4.5.3               Rcpp_1.1.1               
##  [45] glue_1.8.0                gridExtra_2.3            
##  [47] SparseArray_1.10.10       xfun_0.57                
##  [49] GenomeInfoDb_1.46.2       HDF5Array_1.38.0         
##  [51] withr_3.0.2               BiocManager_1.30.27      
##  [53] fastmap_1.2.0             rhdf5filters_1.22.0      
##  [55] digest_0.6.39             rsvd_1.0.5               
##  [57] timechange_0.4.0          R6_2.6.1                 
##  [59] RSQLite_2.4.6             cigarillo_1.0.0          
##  [61] R.methodsS3_1.8.2         h5mread_1.2.1            
##  [63] data.table_1.18.2.1       FNN_1.1.4.1              
##  [65] rtracklayer_1.70.1        httr_1.4.8               
##  [67] S4Arrays_1.10.1           uwot_0.2.4               
##  [69] pkgconfig_2.0.3           gtable_0.3.6             
##  [71] blob_1.3.0                S7_0.2.1                 
##  [73] XVector_0.50.0            htmltools_0.5.9          
##  [75] ProtGenerics_1.42.0       scales_1.4.0             
##  [77] png_0.1-9                 knitr_1.51               
##  [79] rstudioapi_0.18.0         tzdb_0.5.0               
##  [81] rjson_0.2.23              curl_7.0.0               
##  [83] cachem_1.1.0              rhdf5_2.54.1             
##  [85] BiocVersion_3.22.0        parallel_4.5.3           
##  [87] vipor_0.4.7               restfulr_0.0.16          
##  [89] pillar_1.11.1             grid_4.5.3               
##  [91] vctrs_0.7.2               BiocSingular_1.26.1      
##  [93] dbplyr_2.5.2              beachmat_2.26.0          
##  [95] xtable_1.8-8              cluster_2.1.8.2          
##  [97] beeswarm_0.4.0            evaluate_1.0.5           
##  [99] magick_2.9.1              cli_3.6.6                
## [101] locfit_1.5-9.12           compiler_4.5.3           
## [103] Rsamtools_2.26.0          rlang_1.2.0              
## [105] crayon_1.5.3              labeling_0.4.3           
## [107] ggbeeswarm_0.7.3          stringi_1.8.7            
## [109] viridisLite_0.4.3         BiocParallel_1.44.0      
## [111] Biostrings_2.78.0         lazyeval_0.2.3           
## [113] Matrix_1.7-5              ExperimentHub_3.0.0      
## [115] hms_1.1.4                 sparseMatrixStats_1.22.0 
## [117] bit64_4.6.0-1             Rhdf5lib_1.32.0          
## [119] KEGGREST_1.50.0           statmod_1.5.1            
## [121] AnnotationHub_4.0.0       igraph_2.2.3             
## [123] memoise_2.0.1             bslib_0.10.0             
## [125] xgboost_3.2.1.1           bit_4.6.0

bioc-scrnaseq

2026-04-02

Background

Why Single-Cell Resolution Matters

A typical scRNA-seq experiment consists of several key steps:

Key Characteristics of scRNA-seq Data

Common Analytical Steps

Common Tools in R

Learning Objectives for Part I

Package Installation

Introduction to SingleCellExperiment

The `WTChimerData` dataset

Exploring the dataset structure

Understanding metadata

Visualizing cells across biological conditions

Class Exercise

Why we need dimensionality reduction

Solution is reducing dimensionality

Exploratory data analysis and quality control

Quality Control

Learning Objectives for Part II

Data

Subsetting cells

Preprocessing

Clustering

Visualization (UMAP)

Changing cluster resolution

Marker gene detection

Visualize Marker Expression

Cell type annotation

Reference-Based Annotation (SingleR)

Evaluating Annotation Confidence

Comparing Clusters and Cell Types

Multi-sample analyses

Visualizing Cells with Custom Colors

bioc-scrnaseq

2026-04-02

Background

Why Single-Cell Resolution Matters

A typical scRNA-seq experiment consists of several key steps:

Key Characteristics of scRNA-seq Data

Common Analytical Steps

Common Tools in R

Learning Objectives for Part I

Package Installation

Introduction to SingleCellExperiment

The WTChimerData dataset

Exploring the dataset structure

Understanding metadata

Visualizing cells across biological conditions

Class Exercise

Why we need dimensionality reduction

Solution is reducing dimensionality

Exploratory data analysis and quality control

Quality Control

Learning Objectives for Part II

Data

Subsetting cells

Preprocessing

Clustering

Visualization (UMAP)

Changing cluster resolution

Marker gene detection

Visualize Marker Expression

Cell type annotation

Reference-Based Annotation (SingleR)

Evaluating Annotation Confidence

Comparing Clusters and Cell Types

Multi-sample analyses

Visualizing Cells with Custom Colors

The `WTChimerData` dataset