scRNA-seq analysis workflow
1. Raw data processing
cellranger : 10x Genomics’s Chromium system to
process single cell RNA-seq raw data. Easy to use but need large
computing resources and Linux environment.
Input : Fastq files, cellranger genome index
cellranger count/aggr
# cellranger count for transcriptome data processing
cellranger count --id=SampleA_GEX \
--transcriptome=/path/to/reference/transcriptomes/Mouse_GEX_2020/refdata-gex-mm10-2020-A \
--fastqs=/path/to/raw_data/SampleA \
--sample=SampleA \
--expect-cells=10000
# cellranger aggr to aggregate multiple cellranger count output to one file
cellranger aggr --id=Aggregate_GEX \
--csv=/path/to/aggregate_info/Aggregate_GEX.csv
Aggregate_GEX.csv example
Output
Aggregated Feature-Barcode Matrices
matrix.mtx
genes.tsv
barcodes.tsv
2. Matrices processing and filtering
Sample data
Data import from the cellranger output (above).
Create Seurat Object
Doublet removal (by scrublet)
scrublet runs in python env.
Input file is the count matrix from seurat object.
import numpy as np
import pandas as pd
import scanpy as sc
import scrublet as scr
def run_scrublet(input_file, output_file, expected_doublet_rate=0.1):
df = pd.read_csv(input_file, header=0, index_col=0)
adata = sc.AnnData(df)
# Set the expected_doublet_rate parameter
sc.external.pp.scrublet(adata, expected_doublet_rate=expected_doublet_rate)
# Save the observation (results) to CSV
adata.obs.to_csv(output_file)
input_file = 'input.count.csv'
output_file = 'output.scr.csv'
# Adjust the expected_doublet_rate parameter
expected_doublet_rate = 0.1
run_scrublet(input_file, output_file, expected_doublet_rate)
Add the scrublet output information to the seurat object
## import scrublet result
df= read.csv('path/output.scr.csv', row.names = 1)
# If the previous process changes the name of cell ID,
# Use the following code to correct it.
rownames(df) = gsub(pattern = '_', replacement = '-' ,rownames(df))
## add doublet info to the srt obj
obj.srt[['doublet_score']] = df[rownames(obj.srt@meta.data),]$doublet_score
obj.srt[['predicted_doublet']] =df[rownames(obj.srt@meta.data),]$predicted_doublet
obj.srt %>% saveRDS("path/saved.obj.rds")
# You might want to check the number of doublets in the data
obj.srt@meta.data %>% select(predicted_doublet) %>% table()
3. Data Analysis
Perform default analysis
This process includes the following.
FindVariableFeatures(“vst” method)
NormalizeData
ScaleData
RunPCA
FindNeighbors
FindClusters
RunUMAP
## perform default analysis
perform_default_analysis <- function(obj.srt, n_features = 2000, n_pcs = 20,
dims_for_neighbors = 1:20,
resolutions = c(0.1,0.5,1),
umap_dims = 1:20) {
# Step 1: Find variable features
obj.srt <- FindVariableFeatures(obj.srt,
selection.method = 'vst',
nfeatures = n_features)
# Step 2: Scale and normalize data
all_genes <- rownames(obj.srt)
obj.srt <- NormalizeData(obj.srt)
obj.srt <- ScaleData(obj.srt, features = all_genes)
# Step 3: Run PCA
obj.srt <- RunPCA(obj.srt,
features = VariableFeatures(object = obj.srt), npcs = n_pcs)
# Step 4: Find neighbors
obj.srt <- FindNeighbors(obj.srt, dims = dims_for_neighbors)
# Step 5: Find clusters
obj.srt <- FindClusters(obj.srt, resolution = resolutions)
# Step 6: Run UMAP
obj.srt <- RunUMAP(obj.srt, dims = umap_dims)
# Return the Seurat object with analysis results
return(obj.srt)
}
# apply
obj.srt <- perform_default_analysis(obj.srt)
Integration (Seurat built-in)
Reference URL: https://satijalab.org/seurat/articles/integration_introduction.html
Correction for Batch Effects:
Seurat offers robust methods to align different datasets, helping to correct for batch effects that often occur due to variations in experimental conditions, sample preparation, or sequencing technologies.
Enhanced Data Visualization:
By harmonizing datasets, Seurat enables unified visualization and
analysis, which facilitates the identification of trends and patterns
across diverse datasets.
# Define what groups to be integrated
groups_to_be_integrated = "Timepoint"
# Split object and make them a list
obj.list <- SplitObject(obj.srt, split.by = groups_to_be_integrated)
# Normalize and identify variable features for each dataset independently
# selection method and number of features can be modified
intgr.list <- lapply(X = obj.list, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
# select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = intgr.list)
# Perform integration
obj.anchors <- FindIntegrationAnchors(object.list = intgr.list,
anchor.features = features)
# this command creates an 'integrated' data assay
obj.intgr <- IntegrateData(anchorset = obj.anchors)
# Perform an integrated analysis
# specify that we will perform downstream analysis on the corrected data note that the
# original unmodified data still resides in the 'RNA' assay
# To use integrated data, change the default assay to "integrated"
DefaultAssay(obj.intgr) <- "integrated"
# Run the standard workflow for visualization and clustering
obj.intgr <- ScaleData(obj.intgr, verbose = FALSE)
obj.intgr <- RunPCA(obj.intgr, npcs = 30, verbose = FALSE)
obj.intgr <- RunUMAP(obj.intgr, reduction = "pca", dims = 1:30)
obj.intgr <- FindNeighbors(obj.intgr, reduction = "pca", dims = 1:30)
obj.intgr <- FindClusters(obj.intgr, resolution = c(0.1,0.2,0.5,0.8))
obj.intgr %>% saveRDS("path/saved.obj.rds")
# Visualization
DimPlot(obj.intgr, reduction = "umap", group.by = groups_to_be_integrated)
Other integration methods can be considered
Cell cycle scoring
Cell type annotation by scType
URL: https://github.com/IanevskiAleksandr/sc-type
# load gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
# DB file
db_ <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";
tissue <- "Immune system" # e.g. Immune system,Pancreas,Liver,Kidney,Brain,Lung
# prepare gene sets
gs_list <- gene_sets_prepare(db_, tissue)
# extract scaled scRNA-seq matrix
scRNAseqData_scaled <- as.matrix(obj.srt[["RNA"]]@scale.data)
# run ScType
es.max <- sctype_score(scRNAseqData = scRNAseqData_scaled,
scaled = TRUE,
gs = gs_list$gs_positive,
gs2 = gs_list$gs_negative)
# Merge by cluster
cL_results <- do.call("rbind",
lapply(unique(obj.srt@meta.data$seurat_clusters), function(cl) {
cl_data <- obj.srt@meta.data[obj.srt@meta.data$seurat_clusters == cl, ]
es_sum <- rowSums(es.max[, rownames(cl_data)])
es_sorted <- sort(es_sum, decreasing = TRUE)
top_es <- head(data.frame(cluster = cl,
type = names(es_sorted),
scores = es_sorted,
ncells = nrow(cl_data)), 10)
return(top_es)
})
)
# Calculate top scores by cluster
sctype_scores <- cL_results %>%
group_by(cluster) %>%
top_n(n = 1, wt = scores)
# Set low-confidence clusters to "Unknown"
threshold <- sctype_scores$ncells / 4
sctype_scores$type[sctype_scores$scores < threshold] <- "Unknown"
# overlay the identified cell types on UMAP plot
obj.srt@meta.data$sctype_classification = ""
# Assign cell type classification to meta data
unique_clusters <- unique(sctype_scores$cluster)
for (j in unique_clusters) {
# Extract the type for the current cluster
cl_type <- sctype_scores[sctype_scores$cluster == j, ]
type_label <- as.character(cl_type$type[1])
# Assign type label to all cells in the cluster
obj.srt@meta.data$sctype_classification[obj.srt@meta.data$seurat_clusters == j] <- type_label
}
# Draw plot
DimPlot(obj.srt, reduction = "umap", label = TRUE, repel = TRUE,
group.by = 'sctype_classification')
AUC (Area Under the Curve)
To assess the activation of gene sets (custom) in assigned cell types (by scType), I calculate AUC scores by determining the level of activation of specific gene sets in each cell.
# import data from single-cell experiment
exprMatrix <- obj.srt@assays$RNA@data
# Convert to sparse:
exprMatrix <- as(exprMatrix, "dgCMatrix")
Build gene-expression rankings for each cell
Determine the cells with the given gene signatures or active gene sets
Threshold check
# check the length inside
n <- length(cells_assignment)
aucThr <- matrix(nrow = n, ncol = 1) # 필요한 경우 열의 수를 조정하세요.
# This is an example rownames
rownames(aucThr) <- c("B", "Fibroblast", "Myeloid", "T_NK", "Tumor")[1:n]
for (i in 1:n) {
aucThr[i, ] <- cells_assignment[[i]]$aucThr$selected
}
# Data frame
aucThr %>% data.frame() %>% DT::datatable()
Add AUC output to metadata
# Add the calculated AUC scores as metadata to the Seurat object
obj.srt <- AddMetaData(obj.srt, data.frame(t(cells_AUC@assays@data$AUC)))
# This is an additional part
Category = c("B" ,"Fibroblast","Myeloid","T_NK","Tumor")
for (cat in Category) {
rows <- cells_assignment[[cat]]$assignment
auc_column_name <- paste0("AUC_designated_", cat)
obj.srt@meta.data[, auc_column_name] <- "FALSE"
obj.srt@meta.data[rows, auc_column_name] <- "TRUE"
}
Visualize AUC score in UMAP
Note
Once obtaining AUC output, I check the information that can be transferred to seurat object metadata. There should be various visualization methods for the transferred data. I usually use UMAP visualization to check the general patterns of them.
Marker gene analysis
Find markers
Create downloadable link of markers
Visualize the marker gene expression
Dotplot
DEG analysis between two conditions
Find DEGs
# Generating DEGs bewteen condition 1 and condition 2
DEG.two_groups= function(obj.srt,cluster=0, cond1="ADU", cond2="CTL"){
obj.srt@meta.data=obj.srt@meta.data %>%
mutate(compare = ifelse(RNA_snn_res.0.1 == cluster & sample == cond1,
paste0(cluster,"_", cond1),
ifelse(RNA_snn_res.0.1 == cluster & sample == cond2,
paste0(cluster,"_", cond2), "others")))
g2=paste0(cluster,"_", cond1)
g1=paste0(cluster,"_", cnod2)
Idents(obj.srt) = 'compare'
logfc=log2(1)
mks =FindMarkers(obj.srt, ident.1 = g2, ident.2 = g1,
logfc.threshold = logfc, assay = "count")
pval=0.05
fc=1.2
mks = mks %>%
mutate(DE=ifelse(avg_log2FC >= log2(fc) & p_val_adj < pval, 'UP',
ifelse(avg_log2FC <= -log2(fc) & p_val_adj < pval, 'DN',
'no_sig')))
mks$DE = factor(mks$DE, levels = c('UP','DN','no_sig'))
mks$gene = rownames(mks)
mks =mks %>% mutate(labels= ifelse(DE == 'UP', gene, ifelse(DE=='DN',gene,'')))
mks =mks %>% arrange(desc(avg_log2FC))
return(mks)
}
Visualize the DEG output
# Volcano plot
input.data = deg.cluster0
# Draw plot
input.data %>%
ggplot(aes(avg_log2FC, -log10(p_val_adj), color=DE)) +
geom_point(size=1, alpha=0.5) +
scale_color_manual(values = c('red','blue','grey')) +
theme_classic() +
geom_vline(xintercept = c(-log2(fc),log2(fc)), color='grey') +
geom_hline(yintercept = -log10(0.05),color='grey') +
guides(colour = guide_legend(override.aes = list(size=5))) +
ggtitle(paste0("Plot Title here")) +
geom_text(aes(label = labels), size = 2.5,
show.legend = FALSE, hjust = 0, nudge_x = 0.01) +
ggeasy::easy_center_title()
Gene set enrichment analysis of DEGs
This analysis uses hallmark pathways for the analaysis.
library(clusterProfiler)
hallmark <- msigdbr::msigdbr(species = "Mus musculus", category = "H") %>%
dplyr::select(gs_name, gene_symbol)
perform_GSEA <- function(res, ref, pvalueCutoff = 1) {
ranking <- function(res) {
df <- res$avg_log2FC
names(df) <- rownames(res)
df <- sort(df, decreasing = TRUE)
return(df)
}
ranked.res <- ranking(res)
x <- clusterProfiler::GSEA(geneList = ranked.res,
TERM2GENE = ref,
pvalueCutoff = pvalueCutoff,
pAdjustMethod = "BH",
verbose = TRUE,
seed = TRUE)
result <- x@result %>% arrange(desc(NES))
result <- result[, c('NES', 'pvalue', 'p.adjust', 'core_enrichment', 'ID')]
return(result)
}
# Application
gsea.res = perform_GSEA(res = deg.cluster3, ref = hallmark)
GSEA output visualization
NES plot
# GESA Plot
gsea_nes_plot =function(gsea.res, title){
gsea.res %>% ggplot(aes(reorder(ID, NES), NES)) +
geom_col(aes(fill=p.adjust)) +
coord_flip() +
labs(x="Pathway", y="Normalized Enrichment Score",
title= "GSEA") +
theme_classic() +
scale_fill_gradient(low = 'red', high = '#E5E7E9') +
theme(axis.text.x= element_text(size=5, face = 'bold'),
axis.text.y= element_text(size=6, face = 'bold'),
axis.title =element_text(size=10)) +ggtitle(title)
}
# Application
gsea_nes_plot(gsea.res[gsea.res$p.adjust <= 0.05,], title = "Cluster3")
Imputation
MAGIC (Cell, 2018)
https://github.com/KrishnaswamyLab/MAGIC
Markov Affinity-based Graph Imputation of Cells (MAGIC) is an algorithm
for denoising high-dimensional data most commonly applied to single-cell
RNA sequencing data. MAGIC learns the manifold data, using the resultant
graph to smooth the features and restore the structure of the
data.
# Import the necessary libraries.
import pandas as pd
import magic
import csv
# Load the CSV file into a DataFrame.
data_path = '~/path/data.csv'
x = pd.read_csv(data_path)
# Remove the 'Unnamed: 0'
del x['Unnamed: 0']
# Initialize the MAGIC algorithm. the number of nearest neighbors (knn) is set to 5, and the diffusion time (t) is set to 3.
magic_operator = magic.MAGIC(knn=5, t=3)
# Apply the MAGIC algorithm to the data.
x_magic = magic_operator.fit_transform(x)
# Save the data to a CSV file.
output_path = '~/path/data.magic.csv'
x_magic.to_csv(output_path)