Data Processing in Linux

List

Conda environment (for macs2 callpeak)
MultiQC
StarFusion (with STAR)
Cut&Run fastq processing (bowtie2)

Conda environment for macs2 callpeak

In HPC, setting up a conda environment for specific software or pipelines improves reproducibility. For example, I use Python 3.6 in a conda environment for MACS2 callpeak, allowing easy updates later.

# Create the environment 
conda create --name testenv python=3.6 
conda activate testenv

# Install macs2 (if don't have it already)
conda install -c bioconda macs2

macs2 # to check if macs2 was installed correctly

# Single run
macs2 callpeak -t path/my_sample.bam -c -f BAMPE -g mm -n test -B -q 0.05 --outdir ~/Downloads

# Multiple sample run

cd /path/to/your/bam/files  

mkdir macs2

for i in *.bam;do macs2 callpeak -t ${i} -c -f BAMPE -g mm -n ${i} -B -q 0.05 --outdir macs2;done

# Deactivate the environment 
conda deactivate

Parameters
The -t option specifies the target file,
The -c option should specify the control file, which is missing here and should be provided,
The -f option sets the file format,
The -g option specifies the genome,
The -n option sets the name for the output files,
The -B option generates additional BED files for broad peaks,
The -q option sets the p-value cutoff for peak detection,
The –outdir option specifies the output directory.

MultiQC

MultiQC is a tool used in bioinformatics to aggregate and visualize the results from various analysis tools. It generates a summary report that collates data from tools like FastQC, Samtools, and Picard, allowing users to easily review and compare the quality and outcomes of their experimental data. Written in Python, MultiQC is executed via the command line, offering ease of use and customization to meet diverse analytical needs.

# Perform multiqc in conda environment 
# python 3.9 
conda create -n multiqc python=3.9
conda activate multiqc
conda install fastqc multiqc

# Create fastqc result directory folder 
mkdir fastqc_out

# Perform fastqc first
# This takes long time to run 
fastqc *.fastq.gz -o fastqc_out

# Perform multiqc
multiqc fastqc_out

Output : multiqc_report.html
multiqc_data folder includes the multiple files to create report.html file.

StarFusion

Background Information
One of my collaborators induced T cell lymphoma in mice using the ITK-SYK fused gene model. Because ITK-SYK originates from humans, it wasn’t detected with the usual mouse genome reference in STAR. To confirm ITK-SYK gene expression in the dataset, we applied the StarFusion method using the human genome as the reference.

Two FASTQ files (R1.fastq and R2.fastq)
StarFusion requires the use of STAR to align initial RNA-seq data.

CTAT genome index :
(Cancer Transcriptome Analysis Toolkit)
https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/__genome_libs_StarFv1.10/

RNA-seq data alignment

STAR --genomeDir /path/to/star_genome_index \
     --readFilesIn R1.fastq R2.fastq \
     --outFileNamePrefix star_output \
     --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate

Run StarFusion

StarFusion --genome_lib_dir /path/to/ctat_genome_lib_build_dir \
           --bam star_outputAligned.sortedByCoord.out.bam \
           --output_dir star_fusion_out # Output directory

Outputs

star-fusion.fusion_predictions.tsv
star-fusion.fusion_predictions.abridged.tsv
star-fusion.fusion_evidence_reads_1.fq.gz
star-fusion.fusion_evidence_reads_2.fq.gz

Fastq to Bam (Cut&Run)

Featuring bowtie2

#!/bin/bash

###### these codes will generate one final bam file from two fastqs (paired) ###
###### assign one input name for the all downstream process
###### in this case, trimming process remaining 36bp was used so 'crop' indicate the remaining sequence number as an indicator
### two variables 
### input, crop
### set path

##### general workflow #####
### Trimming of fastqs remaining {crop}bp 
### There are other trimming methods
### Alignment of samples : bowtie2 for cut and run
### output log from bowtie2 would be used to calculate mappable reads from fastq files.
### calculate only aligned reads in output log
### sam to bam conversion : samtools
### filter out MAPQ <30 : samtools
### sorting bam : samtools
### indexing bam : samtools
### remove intermediate files (sam)
### remove duplicates by picard markduplicates
### sort rmd bams : samtools
### extra steps to count the reads through the flow

### input folders : naive_Tle3 naive_IgG effector_Tle3 Tem_Tle3 Tcm_Tle3 


input_folder=Tem_Tle3
input=Tem_Tle3_2
crop=36
path=~/Desktop/Sung_work/raw_tmp/TLE3_cut_run/${input_folder}
cd ~/Desktop/Sung_work/raw_tmp/TLE3_cut_run


cd ~/Desktop/Sung_work/raw_tmp/TLE3_cut_run

### Trimming of fastqs remaining {crop}bp 
### Trimmomatic program location : ~/Desktop/software/Trimmomatic-0.39/trimmomatic-0.39.jar
### output : trimmed fastqs (R1 and R2)
## run
java -jar ~/Desktop/software/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 10 \
$path/${input}.R1.fastq.gz $path/${input}.R2.fastq.gz \
$path/${input}.${crop}.R1.paired.fastq.gz $path/${input}.${crop}.R1.unpaired.fastq.gz \
$path/${input}.${crop}.R2.paired.fastq.gz $path/${input}.${crop}.R2.unpaired.fastq.gz \
CROP:${crop}

### Alignment of samples : bowtie2 for ATAC-seq
### bowtie2
### bowtie2 index location : ~/Desktop/ref/mm10/mm10.index
### indicate location for the output log from bowtie2 : ~/Desktop/Sung_work/log/
### indicate the location for the output sam : ~/Desktop/Sung_work/sam/
### output: bam

mkdir $path/${input}_sam

bowtie2 --mm \
-p 10 \
--no-unal \
--non-deterministic \
--no-discordant \
-x ~/Desktop/ref/mm10/mm10.index \
-1 $path/${input}.${crop}.R1.paired.fastq.gz \
-2 $path/${input}.${crop}.R2.paired.fastq.gz \
-S $path/${input}_sam/${input}.${crop}.sam 2> $path/${input}_sam/${input}.${crop}.bowtie2.log


#############
### sam to bam conversion
### filter out MAPQ <30 
### sorting bam
### indexing bam
### remove intermediate files (sam)
### path again for the location of the sam

samtools view -@ 10 -S -b -@ 10 $path/${input}_sam/${input}.${crop}.sam > $path/${input}_sam/${input}.${crop}.bam
samtools view -@ 10 -q 30 -b $path/${input}_sam/${input}.${crop}.bam > $path/${input}_sam/${input}.${crop}.q30.bam

samtools sort -@ 10 $path/${input}_sam/${input}.${crop}.q30.bam -o $path/${input}_sam/${input}.${crop}.q30.sorted.bam
samtools index -@ 10 $path/${input}_sam/${input}.${crop}.q30.sorted.bam 

rm $path/${input}_sam/${input}.${crop}.sam

### remove duplicates 
### picard tool 
### output: bam with removal of duplicates

### picard markduplicates
java -jar ~/Desktop/software/picard.jar MarkDuplicates \
I= $path/${input}_sam/${input}.${crop}.q30.sorted.bam  \
O= $path/${input}_sam/${input}.${crop}.q30.sorted.rmd.bam \
M= $path/${input}_sam/${input}.${crop}.q30.sorted.rmd.metrics.txt \
ASSUME_SORTED=TRUE  REMOVE_DUPLICATES=true CREATE_INDEX=true QUIET=true 

### sort rmd bams
### output : sorted and duplicates-removed bam
samtools sort -@ 10 $path/${input}_sam/${input}.${crop}.q30.sorted.rmd.bam -o $path/${input}_sam/${input}.${crop}.q30.sorted.sorted.rmd.bam

################################################################
####### extra steps to count the reads through the flow ########

echo 'bam reads'
samtools view -@ 10 -c $path/${input}_sam/${input}.${crop}.bam
echo 'q30 sorted rmd bam sorted reads'
samtools view -@ 10 -c $path/${input}_sam/${input}.${crop}.q30.sorted.sorted.rmd.bam