scRNA-seq analysis workflow

Back to Main Page

1. Raw data processing

cellranger : 10x Genomics’s Chromium system to process single cell RNA-seq raw data. Easy to use but need large computing resources and Linux environment.
Input : Fastq files, cellranger genome index

cellranger count/aggr

# cellranger count for transcriptome data processing 
cellranger count --id=SampleA_GEX \
--transcriptome=/path/to/reference/transcriptomes/Mouse_GEX_2020/refdata-gex-mm10-2020-A \
--fastqs=/path/to/raw_data/SampleA \
--sample=SampleA \
--expect-cells=10000

# cellranger aggr to aggregate multiple cellranger count output to one file 
cellranger aggr --id=Aggregate_GEX \
--csv=/path/to/aggregate_info/Aggregate_GEX.csv

Aggregate_GEX.csv example

library_id,molecule_h5
SampleA,/path/to/SampleA/outs/molecule_info.h5
SampleB,/path/to/SampleB/outs/molecule_info.h5
SampleC,/path/to/SampleC/outs/molecule_info.h5

Output

Aggregated Feature-Barcode Matrices

matrix.mtx
genes.tsv
barcodes.tsv

2. Matrices processing and filtering

Load packages

library(dplyr)
library(ggplot2)
library(reshape)
library(dplyr)

Sample data

Data import from the cellranger output (above).
Create Seurat Object

library(Seurat)
obj.raw = Read10X('raw_data/Path/filtered_feature_bc_matrix/')
obj.srt = CreateSeuratObject(counts = obj.raw, project = 'project_name')

Doublet removal (by scrublet)

scrublet runs in python env.
Input file is the count matrix from seurat object.

import numpy as np
import pandas as pd
import scanpy as sc
import scrublet as scr

def run_scrublet(input_file, output_file, expected_doublet_rate=0.1):
    df = pd.read_csv(input_file, header=0, index_col=0)
    adata = sc.AnnData(df)
    
    # Set the expected_doublet_rate parameter
    sc.external.pp.scrublet(adata, expected_doublet_rate=expected_doublet_rate)
    
    # Save the observation (results) to CSV
    adata.obs.to_csv(output_file)

input_file = 'input.count.csv'
output_file = 'output.scr.csv'

# Adjust the expected_doublet_rate parameter
expected_doublet_rate = 0.1
run_scrublet(input_file, output_file, expected_doublet_rate)


Add the scrublet output information to the seurat object

## import scrublet result
df= read.csv('path/output.scr.csv', row.names = 1)

# If the previous process changes the name of cell ID, 
# Use the following code to correct it. 
rownames(df) = gsub(pattern = '_', replacement = '-' ,rownames(df))

## add doublet info to the srt obj

obj.srt[['doublet_score']] = df[rownames(obj.srt@meta.data),]$doublet_score
obj.srt[['predicted_doublet']] =df[rownames(obj.srt@meta.data),]$predicted_doublet

obj.srt %>% saveRDS("path/saved.obj.rds")

# You might want to check the number of doublets in the data
obj.srt@meta.data %>% select(predicted_doublet) %>% table()

3. Data Analysis

Perform default analysis

This process includes the following.

FindVariableFeatures(“vst” method)
NormalizeData
ScaleData
RunPCA
FindNeighbors
FindClusters
RunUMAP

## perform default analysis
perform_default_analysis <- function(obj.srt, n_features = 2000, n_pcs = 20, 
                                     dims_for_neighbors = 1:20, 
                                     resolutions = c(0.1,0.5,1), 
                                     umap_dims = 1:20) {
  # Step 1: Find variable features
  obj.srt <- FindVariableFeatures(obj.srt, 
                                  selection.method = 'vst', 
                                  nfeatures = n_features)
  
  # Step 2: Scale and normalize data
  all_genes <- rownames(obj.srt)
  obj.srt <- NormalizeData(obj.srt)
  obj.srt <- ScaleData(obj.srt, features = all_genes)
  
  # Step 3: Run PCA
  obj.srt <- RunPCA(obj.srt, 
                    features = VariableFeatures(object = obj.srt), npcs = n_pcs)
  
  # Step 4: Find neighbors
  obj.srt <- FindNeighbors(obj.srt, dims = dims_for_neighbors)
  
  # Step 5: Find clusters
  obj.srt <- FindClusters(obj.srt, resolution = resolutions)
  
  # Step 6: Run UMAP
  obj.srt <- RunUMAP(obj.srt, dims = umap_dims)
  
  # Return the Seurat object with analysis results
  return(obj.srt)
}

# apply
obj.srt <- perform_default_analysis(obj.srt)

Integration (Seurat built-in)

Reference URL: https://satijalab.org/seurat/articles/integration_introduction.html

Correction for Batch Effects:

Seurat offers robust methods to align different datasets, helping to correct for batch effects that often occur due to variations in experimental conditions, sample preparation, or sequencing technologies.

Enhanced Data Visualization:
By harmonizing datasets, Seurat enables unified visualization and analysis, which facilitates the identification of trends and patterns across diverse datasets.

# Define what groups to be integrated 
groups_to_be_integrated = "Timepoint"

# Split object and make them a list 
obj.list <- SplitObject(obj.srt, split.by = groups_to_be_integrated)

# Normalize and identify variable features for each dataset independently
# selection method and number of features can be modified 

intgr.list <- lapply(X = obj.list, FUN = function(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

# select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = intgr.list)

# Perform integration
obj.anchors <- FindIntegrationAnchors(object.list = intgr.list, 
                                      anchor.features = features)

# this command creates an 'integrated' data assay
obj.intgr <- IntegrateData(anchorset = obj.anchors)

# Perform an integrated analysis

# specify that we will perform downstream analysis on the corrected data note that the
# original unmodified data still resides in the 'RNA' assay
# To use integrated data, change the default assay to "integrated"
DefaultAssay(obj.intgr) <- "integrated"

# Run the standard workflow for visualization and clustering
obj.intgr <- ScaleData(obj.intgr, verbose = FALSE)
obj.intgr <- RunPCA(obj.intgr, npcs = 30, verbose = FALSE)
obj.intgr <- RunUMAP(obj.intgr, reduction = "pca", dims = 1:30)
obj.intgr <- FindNeighbors(obj.intgr, reduction = "pca", dims = 1:30)
obj.intgr <- FindClusters(obj.intgr, resolution = c(0.1,0.2,0.5,0.8))

obj.intgr %>% saveRDS("path/saved.obj.rds")

# Visualization
DimPlot(obj.intgr, reduction = "umap", group.by = groups_to_be_integrated) 

Other integration methods can be considered

Harmony
https://github.com/immunogenomics/harmony

LIGER
https://github.com/welch-lab/liger

Cell cycle scoring

s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes
obj.srt <- CellCycleScoring(obj.srt, 
                            s.features = s.genes, g2m.features = g2m.genes)

Cell type annotation by scType

URL: https://github.com/IanevskiAleksandr/sc-type

# load gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")

# DB file
db_ <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";
tissue <- "Immune system" # e.g. Immune system,Pancreas,Liver,Kidney,Brain,Lung

# prepare gene sets
gs_list <- gene_sets_prepare(db_, tissue)

# extract scaled scRNA-seq matrix
scRNAseqData_scaled <- as.matrix(obj.srt[["RNA"]]@scale.data)

# run ScType
es.max <- sctype_score(scRNAseqData = scRNAseqData_scaled, 
                       scaled = TRUE, 
                       gs = gs_list$gs_positive, 
                       gs2 = gs_list$gs_negative)

# Merge by cluster
cL_results <- do.call("rbind", 
  lapply(unique(obj.srt@meta.data$seurat_clusters), function(cl) {
    cl_data <- obj.srt@meta.data[obj.srt@meta.data$seurat_clusters == cl, ]
    es_sum <- rowSums(es.max[, rownames(cl_data)])
    es_sorted <- sort(es_sum, decreasing = TRUE)
    top_es <- head(data.frame(cluster = cl, 
                              type = names(es_sorted), 
                              scores = es_sorted, 
                              ncells = nrow(cl_data)), 10)
    return(top_es)
  })
)

# Calculate top scores by cluster
sctype_scores <- cL_results %>% 
  group_by(cluster) %>% 
  top_n(n = 1, wt = scores)

# Set low-confidence clusters to "Unknown"
threshold <- sctype_scores$ncells / 4
sctype_scores$type[sctype_scores$scores < threshold] <- "Unknown"
# overlay the identified cell types on UMAP plot
obj.srt@meta.data$sctype_classification = ""

# Assign cell type classification to meta data
unique_clusters <- unique(sctype_scores$cluster)

for (j in unique_clusters) {
  # Extract the type for the current cluster
  cl_type <- sctype_scores[sctype_scores$cluster == j, ]
  type_label <- as.character(cl_type$type[1])
  
  # Assign type label to all cells in the cluster
  obj.srt@meta.data$sctype_classification[obj.srt@meta.data$seurat_clusters == j] <- type_label
}

# Draw plot 
DimPlot(obj.srt, reduction = "umap", label = TRUE, repel = TRUE, 
        group.by = 'sctype_classification')        

AUC (Area Under the Curve)

To assess the activation of gene sets (custom) in assigned cell types (by scType), I calculate AUC scores by determining the level of activation of specific gene sets in each cell.

library(Seurat)
library(AUCell)
library(GSEABase)
# import data from single-cell experiment
exprMatrix <- obj.srt@assays$RNA@data
# Convert to sparse:
exprMatrix <- as(exprMatrix, "dgCMatrix")
# Create genesets to test
# getting them from msigDB or make custom ones 
gmt 

Build gene-expression rankings for each cell

cells_rankings <- AUCell_buildRankings(exprMatrix, plotStats=TRUE)

Calculate enrichment for the gene signatures (AUC)

cells_AUC <- AUCell_calcAUC(gmt, cells_rankings)

Determine the cells with the given gene signatures or active gene sets

cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist=TRUE, assign=TRUE) 

Threshold check

# check the length inside 
n <- length(cells_assignment)

aucThr <- matrix(nrow = n, ncol = 1) # 필요한 경우 열의 수를 조정하세요.

# This is an example rownames
rownames(aucThr) <- c("B", "Fibroblast", "Myeloid", "T_NK", "Tumor")[1:n]

for (i in 1:n) {
  aucThr[i, ] <- cells_assignment[[i]]$aucThr$selected
}

# Data frame 
aucThr %>% data.frame() %>% DT::datatable()

Add AUC output to metadata

# Add the calculated AUC scores as metadata to the Seurat object
obj.srt <- AddMetaData(obj.srt, data.frame(t(cells_AUC@assays@data$AUC)))

# This is an additional part
Category = c("B" ,"Fibroblast","Myeloid","T_NK","Tumor")

for (cat in Category) {
  rows <- cells_assignment[[cat]]$assignment
  auc_column_name <- paste0("AUC_designated_", cat)
  obj.srt@meta.data[, auc_column_name] <- "FALSE"
  obj.srt@meta.data[rows, auc_column_name] <- "TRUE"
}

Visualize AUC score in UMAP

Note

Once obtaining AUC output, I check the information that can be transferred to seurat object metadata. There should be various visualization methods for the transferred data. I usually use UMAP visualization to check the general patterns of them.

DimPlot(obj.srt, group.by = "AUC_designated_my_category")
FeaturePlot(obj.srt, features = "AUC_score_for_my_geneset")

Marker gene analysis

Find markers

# Find markers based on the selected cluster 
Idents(obj.srt) = 'RNA_snn_res.0.1'
fc=1.2
all.markers = FindAllMarkers(obj.srt, logfc.threshold = log2(fc), 
                             only.pos = T, test.use = 'wilcox',min.pct = 0.5)

Visualize the marker gene expression

Dotplot

gs = genes_selected # Selected genes from the marker list

obj.srt$rev.res.0.1 = factor(obj.srt$RNA_snn_res.0.1, 
                             levels = rev(levels(obj.srt$RNA_snn_res.0.1)))

DotPlot(obj.srt, group.by = "rev.res.0.1", 
        features = gs, cols = c("grey","red")) +RotatedAxis()

Heatmap

gs = genes_selected # Selected genes from the marker list
DoHeatmap(obj.srt, features = gs, group.by = "RNA_snn_res.0.1")

DEG analysis between two conditions

Find DEGs

# Generating DEGs bewteen condition 1 and condition 2
DEG.two_groups= function(obj.srt,cluster=0, cond1="ADU", cond2="CTL"){
  obj.srt@meta.data=obj.srt@meta.data %>% 
    mutate(compare = ifelse(RNA_snn_res.0.1 == cluster & sample == cond1, 
                            paste0(cluster,"_", cond1), 
                            ifelse(RNA_snn_res.0.1 == cluster & sample == cond2, 
                                   paste0(cluster,"_", cond2), "others")))
  g2=paste0(cluster,"_", cond1)
  g1=paste0(cluster,"_", cnod2)
  Idents(obj.srt) = 'compare'
  logfc=log2(1)
  mks =FindMarkers(obj.srt, ident.1 = g2, ident.2 = g1, 
                   logfc.threshold = logfc, assay = "count")
  pval=0.05
  fc=1.2
  mks = mks %>% 
    mutate(DE=ifelse(avg_log2FC >= log2(fc) & p_val_adj < pval, 'UP',
                                 ifelse(avg_log2FC <= -log2(fc) & p_val_adj < pval, 'DN',
                                        'no_sig')))
  mks$DE = factor(mks$DE, levels = c('UP','DN','no_sig'))
  
  mks$gene = rownames(mks)
  mks =mks %>% mutate(labels= ifelse(DE == 'UP', gene, ifelse(DE=='DN',gene,'')))
  mks =mks %>% arrange(desc(avg_log2FC))
  return(mks)
}
# Application  
deg.cluster0 = DEG.two_groups(obj.srt = obj.srt, cluster = 0)
deg.cluster1 = DEG.two_groups(obj.srt = obj.srt, cluster = 1)
deg.cluster2 = DEG.two_groups(obj.srt = obj.srt, cluster = 2)

Visualize the DEG output

# Volcano plot   
input.data = deg.cluster0
# Draw plot 
input.data %>% 
  ggplot(aes(avg_log2FC, -log10(p_val_adj), color=DE)) + 
  geom_point(size=1, alpha=0.5) + 
  scale_color_manual(values = c('red','blue','grey')) +
  theme_classic() +
  geom_vline(xintercept = c(-log2(fc),log2(fc)), color='grey') +
  geom_hline(yintercept = -log10(0.05),color='grey') +
  guides(colour = guide_legend(override.aes = list(size=5))) +
  ggtitle(paste0("Plot Title here")) +
  geom_text(aes(label = labels), size = 2.5, 
            show.legend = FALSE, hjust = 0, nudge_x = 0.01) +
  ggeasy::easy_center_title() 

Gene set enrichment analysis of DEGs

This analysis uses hallmark pathways for the analaysis.

library(clusterProfiler)
hallmark <- msigdbr::msigdbr(species = "Mus musculus", category = "H") %>% 
  dplyr::select(gs_name, gene_symbol)

perform_GSEA <- function(res, ref, pvalueCutoff = 1) {
  ranking <- function(res) {
    df <- res$avg_log2FC
    names(df) <- rownames(res)
    df <- sort(df, decreasing = TRUE)
    return(df)
  }
  
  ranked.res <- ranking(res)
  
  x <- clusterProfiler::GSEA(geneList = ranked.res,
                             TERM2GENE = ref,
                             pvalueCutoff = pvalueCutoff,
                             pAdjustMethod = "BH",
                             verbose = TRUE,
                             seed = TRUE)
  
  result <- x@result %>% arrange(desc(NES))
  result <- result[, c('NES', 'pvalue', 'p.adjust', 'core_enrichment', 'ID')]
  return(result)
}

# Application 
gsea.res = perform_GSEA(res = deg.cluster3, ref = hallmark)

GSEA output visualization

NES plot

# GESA Plot 
gsea_nes_plot =function(gsea.res, title){
  gsea.res %>% ggplot(aes(reorder(ID, NES), NES)) +
    geom_col(aes(fill=p.adjust)) +
    coord_flip() +
    labs(x="Pathway", y="Normalized Enrichment Score",
         title= "GSEA") + 
    theme_classic() +
    scale_fill_gradient(low = 'red', high = '#E5E7E9') +
    theme(axis.text.x= element_text(size=5, face = 'bold'),
          axis.text.y= element_text(size=6, face = 'bold'), 
          axis.title =element_text(size=10)) +ggtitle(title)
}

# Application 
gsea_nes_plot(gsea.res[gsea.res$p.adjust <= 0.05,], title = "Cluster3")

Enrichment plot for the individual pathway

id1 = "GOBP_T_CELL_ACTIVATION" # Selected pathway" 
enrichplot::gseaplot2(gsea.res, geneSetID = id1, title = id1)

Imputation

MAGIC (Cell, 2018)
https://github.com/KrishnaswamyLab/MAGIC
Markov Affinity-based Graph Imputation of Cells (MAGIC) is an algorithm for denoising high-dimensional data most commonly applied to single-cell RNA sequencing data. MAGIC learns the manifold data, using the resultant graph to smooth the features and restore the structure of the data.

# Import the necessary libraries.
import pandas as pd
import magic
import csv

# Load the CSV file into a DataFrame.
data_path = '~/path/data.csv'
x = pd.read_csv(data_path)

# Remove the 'Unnamed: 0'
del x['Unnamed: 0']

# Initialize the MAGIC algorithm. the number of nearest neighbors (knn) is set to 5, and the diffusion time (t) is set to 3.
magic_operator = magic.MAGIC(knn=5, t=3)

# Apply the MAGIC algorithm to the data.
x_magic = magic_operator.fit_transform(x)

# Save the data to a CSV file.
output_path = '~/path/data.magic.csv'
x_magic.to_csv(output_path)

Further analysis



More example analyses will be updated.