Data Processing in Linux
List
- Conda environment (for macs2 callpeak)
- MultiQC
- StarFusion (with STAR)
- Cut&Run fastq processing (bowtie2)
Conda environment for macs2 callpeak
In HPC, setting up a conda environment for specific software or pipelines improves reproducibility. For example, I use Python 3.6 in a conda environment for MACS2 callpeak, allowing easy updates later.
# Create the environment
conda create --name testenv python=3.6
conda activate testenv
# Install macs2 (if don't have it already)
conda install -c bioconda macs2
macs2 # to check if macs2 was installed correctly
# Single run
macs2 callpeak -t path/my_sample.bam -c -f BAMPE -g mm -n test -B -q 0.05 --outdir ~/Downloads
# Multiple sample run
cd /path/to/your/bam/files
mkdir macs2
for i in *.bam;do macs2 callpeak -t ${i} -c -f BAMPE -g mm -n ${i} -B -q 0.05 --outdir macs2;done
# Deactivate the environment
conda deactivate
Parameters
The -t option specifies the target file,
The -c option should specify the control file, which is missing here and
should be provided,
The -f option sets the file format,
The -g option specifies the genome,
The -n option sets the name for the output files,
The -B option generates additional BED files for broad peaks,
The -q option sets the p-value cutoff for peak detection,
The –outdir option specifies the output directory.
MultiQC
MultiQC is a tool used in bioinformatics to aggregate and visualize the results from various analysis tools. It generates a summary report that collates data from tools like FastQC, Samtools, and Picard, allowing users to easily review and compare the quality and outcomes of their experimental data. Written in Python, MultiQC is executed via the command line, offering ease of use and customization to meet diverse analytical needs.
# Perform multiqc in conda environment
# python 3.9
conda create -n multiqc python=3.9
conda activate multiqc
conda install fastqc multiqc
# Create fastqc result directory folder
mkdir fastqc_out
# Perform fastqc first
# This takes long time to run
fastqc *.fastq.gz -o fastqc_out
# Perform multiqc
multiqc fastqc_out
Output : multiqc_report.html
multiqc_data folder includes the multiple files to create report.html
file.
StarFusion
Background Information
One of my collaborators induced T cell lymphoma in mice using the
ITK-SYK fused gene model. Because ITK-SYK originates from humans, it
wasn’t detected with the usual mouse genome reference in STAR. To
confirm ITK-SYK gene expression in the dataset, we applied the
StarFusion method using the human genome as the reference.
Two FASTQ files (R1.fastq and R2.fastq)
StarFusion requires the use of STAR to align initial RNA-seq data.
CTAT genome index :
(Cancer Transcriptome Analysis Toolkit)
https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/__genome_libs_StarFv1.10/
RNA-seq data alignment
Run StarFusion
Outputs
star-fusion.fusion_predictions.tsv
star-fusion.fusion_predictions.abridged.tsv
star-fusion.fusion_evidence_reads_1.fq.gz
star-fusion.fusion_evidence_reads_2.fq.gz
Fastq to Bam (Cut&Run)
Featuring bowtie2
#!/bin/bash
###### these codes will generate one final bam file from two fastqs (paired) ###
###### assign one input name for the all downstream process
###### in this case, trimming process remaining 36bp was used so 'crop' indicate the remaining sequence number as an indicator
### two variables
### input, crop
### set path
##### general workflow #####
### Trimming of fastqs remaining {crop}bp
### There are other trimming methods
### Alignment of samples : bowtie2 for cut and run
### output log from bowtie2 would be used to calculate mappable reads from fastq files.
### calculate only aligned reads in output log
### sam to bam conversion : samtools
### filter out MAPQ <30 : samtools
### sorting bam : samtools
### indexing bam : samtools
### remove intermediate files (sam)
### remove duplicates by picard markduplicates
### sort rmd bams : samtools
### extra steps to count the reads through the flow
### input folders : naive_Tle3 naive_IgG effector_Tle3 Tem_Tle3 Tcm_Tle3
input_folder=Tem_Tle3
input=Tem_Tle3_2
crop=36
path=~/Desktop/Sung_work/raw_tmp/TLE3_cut_run/${input_folder}
cd ~/Desktop/Sung_work/raw_tmp/TLE3_cut_run
cd ~/Desktop/Sung_work/raw_tmp/TLE3_cut_run
### Trimming of fastqs remaining {crop}bp
### Trimmomatic program location : ~/Desktop/software/Trimmomatic-0.39/trimmomatic-0.39.jar
### output : trimmed fastqs (R1 and R2)
## run
java -jar ~/Desktop/software/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 10 \
$path/${input}.R1.fastq.gz $path/${input}.R2.fastq.gz \
$path/${input}.${crop}.R1.paired.fastq.gz $path/${input}.${crop}.R1.unpaired.fastq.gz \
$path/${input}.${crop}.R2.paired.fastq.gz $path/${input}.${crop}.R2.unpaired.fastq.gz \
CROP:${crop}
### Alignment of samples : bowtie2 for ATAC-seq
### bowtie2
### bowtie2 index location : ~/Desktop/ref/mm10/mm10.index
### indicate location for the output log from bowtie2 : ~/Desktop/Sung_work/log/
### indicate the location for the output sam : ~/Desktop/Sung_work/sam/
### output: bam
mkdir $path/${input}_sam
bowtie2 --mm \
-p 10 \
--no-unal \
--non-deterministic \
--no-discordant \
-x ~/Desktop/ref/mm10/mm10.index \
-1 $path/${input}.${crop}.R1.paired.fastq.gz \
-2 $path/${input}.${crop}.R2.paired.fastq.gz \
-S $path/${input}_sam/${input}.${crop}.sam 2> $path/${input}_sam/${input}.${crop}.bowtie2.log
#############
### sam to bam conversion
### filter out MAPQ <30
### sorting bam
### indexing bam
### remove intermediate files (sam)
### path again for the location of the sam
samtools view -@ 10 -S -b -@ 10 $path/${input}_sam/${input}.${crop}.sam > $path/${input}_sam/${input}.${crop}.bam
samtools view -@ 10 -q 30 -b $path/${input}_sam/${input}.${crop}.bam > $path/${input}_sam/${input}.${crop}.q30.bam
samtools sort -@ 10 $path/${input}_sam/${input}.${crop}.q30.bam -o $path/${input}_sam/${input}.${crop}.q30.sorted.bam
samtools index -@ 10 $path/${input}_sam/${input}.${crop}.q30.sorted.bam
rm $path/${input}_sam/${input}.${crop}.sam
### remove duplicates
### picard tool
### output: bam with removal of duplicates
### picard markduplicates
java -jar ~/Desktop/software/picard.jar MarkDuplicates \
I= $path/${input}_sam/${input}.${crop}.q30.sorted.bam \
O= $path/${input}_sam/${input}.${crop}.q30.sorted.rmd.bam \
M= $path/${input}_sam/${input}.${crop}.q30.sorted.rmd.metrics.txt \
ASSUME_SORTED=TRUE REMOVE_DUPLICATES=true CREATE_INDEX=true QUIET=true
### sort rmd bams
### output : sorted and duplicates-removed bam
samtools sort -@ 10 $path/${input}_sam/${input}.${crop}.q30.sorted.rmd.bam -o $path/${input}_sam/${input}.${crop}.q30.sorted.sorted.rmd.bam
################################################################
####### extra steps to count the reads through the flow ########
echo 'bam reads'
samtools view -@ 10 -c $path/${input}_sam/${input}.${crop}.bam
echo 'q30 sorted rmd bam sorted reads'
samtools view -@ 10 -c $path/${input}_sam/${input}.${crop}.q30.sorted.sorted.rmd.bam