QC
.fastq
.fastqc
Trimming adapter
.fastq
.fastq
(trimmed)QC after trimming
Align
.fastq
(trimmed)bowtie2-build reference.fa reference_index
) reference
genome (fasta) and pre-build index by the aligner (explore bowtie
website or nf-core/references)..sam
convert .sam
to .bam
file
.sam
.bam
sort the reads in the bam file by coordinates
.bam
sorted.bam
index the bam file
sorted.bam
(it is .bam
)bam.bai
–> it is
.bam
) –> this can use for visualization in IGenome
browsers like IGV or UCSC
Genome Browsergenerate bigwig files
bam.bai
)--normalizeUsing RPKM
RPKM: reads per kilobase per
million reads. The formula is: RPKM (per bin) = number of reads per bin
/(number of mapped reads (in millions) * bin length (kp))--extendReads 200
We only sequenced 50bp of the DNA
after antibody pull-down, but the real DNA is about 200 bp (the size of
the DNA after sonication/fragmentation)peak calling
–extend-size
: Extends reads to the specified fragment
size (e.g., 200 bp). This mimics the actual fragment length in the
absence of paired-end information.-no-model
(about model): Model building is used because
the fragment size is not directly available from single-end reads. MACS
shifts the reads by half the estimated fragment size to align them to
the putative binding sites. BUT for paired-end data, the fragment size
is directly available from the paired-end read alignment. So, use
-no-model
if you already know the approximate fragment size
from your library preparation.-g hs
: the mappable genome size or effective genome
size which is defined as the genome size which can be sequenced. The
default hs ~2.9e9 is recommended for human genome.OPTION 1: With Model Building (Default Behavior): Let MACS build the model unless there are specific reasons to disable it.
macs2 callpeak -t treatment.bam -c control.bam -f BED -g hs -n sample –outdir results
OPTION 2: Without Model Building (–no-model): Use when: You know the fragment size (e.g., from library preparation or empirical testing). The library is not suitable for accurate model building (e.g., very low read depth or noisy data). Specify –extend-size to set the fragment size.
macs2 callpeak -t treatment.bam -c control.bam -f BED -g hs -n sample --outdir results --no-model --extend-size 200
--no-model
and --extend-size
are not
needed.
For paired-end data, model building is typically unnecessary since the
fragment size is directly calculated.
macs3
.narrowPeak
or .broadPeak
: These files
contain the genomic coordinates of the called peaks, along with other
relevant information such as peak summit, p-value, and fold enrichment.
This file is like a map showing the general territory where your protein
of interest is hanging out. It outlines the entire region of enrichment,
including the summit and the surrounding areas where the protein might
be bound with slightly lower affinity..xls
or .csv
: These files provide a more
human-readable summary of the peak information, including peak location,
length, and statistical significance..bedGraph
: This file contains a continuous signal track
of the ChIP-seq enrichment across the genome._summits.bed
: This file contains the genomic
coordinates of the peak summits, which represent the most likely binding
sites of the protein of interest. This file is like a pinpoint on the
map, marking the exact spot where your protein is most likely to be
camped out. It represents the point of highest enrichment within the
peak, indicating the most probable binding site.So, the .narrowPeak
or .broadPeak
file
gives you the broader picture of the protein-DNA interaction, while the
_summit.bed
file zooms in on the most critical binding
location.
fastq
files from GEOThe input file for upstream analysis of transcriptomics is the fastq
files. we will use some fastq file from GSE66083
including
- YAP_rep1 = SRR1810900
- TAZ_rep1 = SRR1810907
- TEAD4_rep1 = SRR1810918
- IgG = SRR1810912
By using nf-core fetchngs
pipeline, the default
(FTP
) will be used.
To download those mentioned fastq files, we have to create a file
named ids.csv
where contains all SRR
numbers
that we want to download.
# Go to directory where `ids.csv` file was saved
cd data/fetchngs
# Oberving the `ids.csv` file
cat ids.csv
This is ids.csv
file
SRR1810907
SRR1810900
SRR1810918
SRR1810912
Next, I define the needed information in nf-core/fetchngs
launch pipeline. After launch the workflow, I got
nf-params.json
file. I download the file to run it
locally.
Here is my nf-params.json
. Please note that we will run
all process on High Performance Computing (HPC), thus the file paths are
full path of HPC.
# Go to directory where the `nf-params.json file was saved
cd data/HPC/data/fetchngs
# Observing the `nf-params.json` file
cat nf-params.json
This is nf-params.json
file
{
"input": "/mnt/sisplockers/krittiyabhornk/nf/ChIP_Chatomic/data/ids.csv",
"outdir": "/mnt/sisplockers/krittiyabhornk/nf/ChIP_Chatomic/data",
"email": "kkrittiyabhorn@gmail.com"
}
4 fastq files. To start nf-core/fetchngs
process, I run
the following command.
nextflow run nf-core/fetchngs -r 1.12.0 -name Zanconato -params-file nf-params.json -profile docker
The process will start and show the picture below. Moreover, it will
real-time update the status of each process.
After pipeline finished successfully, we will find these 4 fastq
files.
As I mentioned above that aligning on reference genome. We will use
reference genome that provided by nf-core which we will use human
GRCh38
.
The upstream analysis was performed on HPC using a nf-core pipeline
called chipseq
.
From nf-core/chipseq
pipeline, we will mainly use the
blue pipeline where BWA
will be used for aligning. I will
use Bowties2
for aligning, thus my pipeline will be the
yellow line
To analyze a set of .fastq
files, we have to create a
file named samplesheet.csv
where includes all information
about our .fastq
file.
In my case, my data is single-end sequencing read. Thus, I have only one fastq file per each replicate of each sample.
# Go to directory that save `samplesheet.csv` file
cd data/chipseq
# Observing the `samplesheet.csv` file
cat samplesheet.csv
sample,fastq_1,fastq_2,replicate,antibody,control,control_replicate
YAP,/mnt/sisplockers/krittiyabhornk/nf/ChIP_Chatomic/data/fastq/SRX883578_SRR1810907.fastq.gz,,1,YAP,IgG,1
TAZ,/mnt/sisplockers/krittiyabhornk/nf/ChIP_Chatomic/data/fastq/SRX883576_SRR1810900.fastq.gz,,1,TAZ,IgG,1
TEAD4,/mnt/sisplockers/krittiyabhornk/nf/ChIP_Chatomic/data/fastq/SRX883582_SRR1810918.fastq.gz,,1,TEAD4,IgG,1
IgG,/mnt/sisplockers/krittiyabhornk/nf/ChIP_Chatomic/data/fastq/SRX883580_SRR1810912.fastq.gz,,1,,,
Next, I define the needed information in nf-core/chipseq
launch pipeline. After launch the workflow, I got
nf-params.json
file. I download the file to run the
analysis it locally.
Please note that
- we will run all process on High Performance Computing (HPC), thus the
file paths are full path of HPC.
- Since we use nf-core provided reference genome, the black-listed
regions will be automatically excluded by the pipeline (They have
provided black-listed regions.)
Here is my nf-params.json
.
# Go to directory where the `nf-params.json file was saved
cd data/chipseq
# Observing the `nf-params.json` file
cat nf-params.json
{
"input": "\/mnt\/sisplockers\/krittiyabhornk\/nf\/ChIP_Chatomic\/samplesheet.csv",
"outdir": "\/mnt\/sisplockers\/krittiyabhornk\/nf\/ChIP_Chatomic\/output_new",
"genome": "GRCh38",
"aligner": "bowtie2",
"narrow_peak": true,
"macs_gsize": 2701495711
}
NOTE: macs_gsize
for macs3
software is
defined according to read length = 50 bp (This is known from library
preparation.). However, normally we can used default values for each
version human genome as showed by deeptools. You can used
macs3 callpeak -g hs
if you are not using nf-core (hs:
2,913,022,398 for GRCh38).
Now, All information is ready to perfrom nf-core/chipseq
to analyze from 4 fastq files to gene count. To start
nf-core/chipseq
process, I run the following command.
nextflow run nf-core/chipseq -r 2.1.0 -name Chatomic4 -profile docker -resume -params-file nf-params.json
The process will start and show the picture below. Moreover, it will real-time update the status of each process.
After the analysis complete, the command will show as picture
below.
Please note that we can close our terminal by using
screen
command (see
my trick).
We can observe QC from both pre and post-alignment and many more
visualization created by the tools provided by
nf-core/chipseq
pipeline from MultiQC.html
file.
We can visualize peak (narrowPeak.bed
and
_summit.bed
) and the normalized fragment the mapped to the
reference genome (.bigWig
) using Integrative Genomics Viewer(IGV) which
currently have web app. You can see your data without IGV software
installation.