Genomic data processing

Applicable to ChIP-seq, ATAC-seq, Cut&Run

Peak Calling
Normalization
Peak analysis in R

Peak Calling by MACS2

Peak Calling of individual bam file

Input files:

sample.bam
control.bam

Script

nano run_peak_calling.sh

#!/bin/bash

# Check if required input BAM files are provided
if [ -z "$1" ] || [ -z "$2" ]; then
  echo "Usage: $0 <input_bam> <control_bam>"
  exit 1
fi

# BAM file (sample)
input_bam=$1

# Control input BAM file (background)
common_input=$2

# Output directory
output_dir="macs2_out"

# Ensure the output directory exists
mkdir -p "$output_dir"

# Extract sample name from the input BAM file
sample_name=$(basename "$input_bam")

# Run MACS2 peak calling
macs2 callpeak \
    -t "$input_bam" \
    -c "$common_input" \
    -f BAMPE \
    -g mm \
    -n "$sample_name" \
    -B \
    -q 0.05 \
    --outdir "$output_dir"

# Check exit status
if [ $? -ne 0 ]; then
  echo "Error: MACS2 peak calling failed."
  exit 1
fi

echo "MACS2 peak calling completed successfully. Results are in

Run (Example)

chmod +x run_peak_calling.sh

bash run_peak_calling.sh /path/to/sample.bam /path/to/control.bam
./run_peak_calling.sh DMSO_H3K27me3.merged.bam /path/to/control.bam
./run_peak_calling.sh Treat1_H3K27me3.merged.bam /path/to/control.bam
./run_peak_calling.sh Treat2_H3K27me3.merged.bam /path/to/control.bam

Peak Calling putting a folder as input

Script

nano macs2_callpeaks.sh

#!/bin/bash

# Check if the input folder is provided
if [ -z "$1" ]; then
  echo "Usage: $0 <input_folder>"
  exit 1
fi

# Input folder containing BAM files
input_folder=$1

# Control input BAM files (glob pattern inside the folder)
common_input="${input_folder}/*_H3K27.bam"

# Output directory
output_dir="macs2_out"

# Ensure the output directory exists
mkdir -p "$output_dir"

# Sample peak calling
for sample_bam in "${input_folder}"/*.merged.bam; do
    sample=$(basename "$sample_bam" .merged.bam)
    macs2 callpeak \
        -t "$sample_bam" \
        -c "$common_input" \
        -f BAMPE \
        -g mm \
        -n "$sample" \
        -B \
        -q 0.05 \
        --outdir "$output_dir"
done

echo "MACS2 peak calling completed. Results are in '$output_dir'."

Run

chmod +x macs2_callpeaks.sh
./macs2_callpeaks.sh /path/to/input_folder

Broad Peak option

Broad Peak for Extended Regions:
Focuses on large, spread-out areas suited for markers such as H3K27me3 or H3K36me3.

Script

–broad-cutoff 0.1 (FDR 10%)

#!/bin/bash

# BAM file
input_bam=$1

# Control input  
common_input="../input.bam" # My input 

# Output directory 
output_dir="macs2_out"

# sample name 
sample_name=$(basename $input_bam .merged.bam)

# MACS2 run for broad peaks
macs2 callpeak \
    -t $input_bam \
    -c $common_input \
    -f BAMPE \
    -g mm \
    -n $sample_name \
    -B \
    --broad \
    -q 0.05 \
    --broad-cutoff 0.1 \
    --outdir $output_dir

Run

./run_broadpeak_calling.sh input.bam

Peak filtering

Narrowpeak column information

Check signalValue column

Filtering and Sorting

Filtering by signalValue if necessary.
Sorting bed:
- First order by chromosome names using -k1,1 and then by start positions numerically with -k2,2n.
- This sorts entries by chromosome and then by position within each chromosome.

Normalization of peaks

Bed Coverage calculation from reference peak (unionpeak) information

The adv/disadv of processing bam files using a standardized peak set (union peaks) as a reference

Standardizes peak across different samples
Reduced Bias
Efficiency in the downstream analysis
Risk of overgeneralization (e.g. overlooking sample/condition-specific peaks like peak length)

bedtools coverage

#!/bin/bash

# Assign BAM file
bamFile=$1

# Assign reference Peak File
unionPeakFile=$2

# Run bedtools coverage
bedtools coverage -a "$unionPeakFile" -b "$bamFile" > "${bamFile%.bam}.cov.bed"

echo "Coverage calculation for $bamFile completed."


./bamcoverage_single.sh path/to/your.bam path/to/unionpeak.bed

Peak analysis in R

Read files in R

library(data.table)

# Getting files with .cov.bed 
cov_files <- list.files(pattern = "\\.cov\\.bed$")

# Import and save 
cov_data_list <- lapply(cov_files, function(file) {
  fread(file)
})

# assign names 
names(cov_data_list) <- cov_files

Coverage bed file example

##     chrom start   end num_reads coverageFraction
##    <char> <num> <num>     <num>            <num>
## 1:   chr1   100   200        25             1.00
## 2:   chr1   500   600        30             0.95
## 3:   chr2   200   300         5             0.20
## 4:   chr2   800   900        20             1.00
## 5:   chr3   150   250        15             0.85

Modify coverage bed files based on the purpose in the downstream analysis.