Overview of ChIP-seq analysis (both up & down stream)

Overview of all ChIP-seq analysis. Credit pic: Harvard Chan Bioinformatics Core
Overview of all ChIP-seq analysis. Credit pic: Harvard Chan Bioinformatics Core

Overview of upstream analysis

Overview of ChIP-seq upstream analysis. Credit pic: Harvard Chan Bioinformatics Core
Overview of ChIP-seq upstream analysis. Credit pic: Harvard Chan Bioinformatics Core

Each step of upstream analysis

  1. QC

    • input: .fastq
    • process: FastQC
    • output: .fastqc
  2. Trimming adapter

    • input: .fastq
    • process: cutadater
    • output: .fastq (trimmed)
  3. QC after trimming

  4. Align

    • input: .fastq (trimmed)
    • other inputs: Bowtie reference genome index (built with bowtie2-build reference.fa reference_index) reference genome (fasta) and pre-build index by the aligner (explore bowtie website or nf-core/references).
    • process: bowtie2 (other option: bwa)
    • output: .sam
  5. convert .sam to .bam file

    • input: .sam
    • process: samtools view
    • output: .bam
  6. sort the reads in the bam file by coordinates

    • input: .bam
    • process: samtools sort
    • output: sorted.bam
  7. index the bam file

    • input: sorted.bam (it is .bam)
    • process: samtools index
    • output: sorted.index.bam (bam.bai–> it is .bam) –> this can use for visualization in IGenome browsers like IGV or UCSC Genome Browser
  8. generate bigwig files

    • input: sorted.index.bam (bam.bai)
    • process: bamCoverage (deeptools’s sub-command)
    • note:
      • --normalizeUsing RPKM RPKM: reads per kilobase per million reads. The formula is: RPKM (per bin) = number of reads per bin /(number of mapped reads (in millions) * bin length (kp))
      • --extendReads 200 We only sequenced 50bp of the DNA after antibody pull-down, but the real DNA is about 200 bp (the size of the DNA after sonication/fragmentation)
    • output: .bw –> this can use for visualization in IGV
  9. peak calling

    • process: MACS3
    • note:
      • –extend-size: Extends reads to the specified fragment size (e.g., 200 bp). This mimics the actual fragment length in the absence of paired-end information.
      • -no-model (about model): Model building is used because the fragment size is not directly available from single-end reads. MACS shifts the reads by half the estimated fragment size to align them to the putative binding sites. BUT for paired-end data, the fragment size is directly available from the paired-end read alignment. So, use -no-model if you already know the approximate fragment size from your library preparation.
      • -g hs: the mappable genome size or effective genome size which is defined as the genome size which can be sequenced. The default hs ~2.9e9 is recommended for human genome.

Best Practices for Peak Calling

Single-End Data

OPTION 1: With Model Building (Default Behavior): Let MACS build the model unless there are specific reasons to disable it.

macs2 callpeak -t treatment.bam -c control.bam -f BED -g hs -n sample –outdir results

OPTION 2: Without Model Building (–no-model): Use when: You know the fragment size (e.g., from library preparation or empirical testing). The library is not suitable for accurate model building (e.g., very low read depth or noisy data). Specify –extend-size to set the fragment size.

macs2 callpeak -t treatment.bam -c control.bam -f BED -g hs -n sample --outdir results --no-model --extend-size 200

Paired-End Data

--no-model and --extend-size are not needed.
For paired-end data, model building is typically unnecessary since the fragment size is directly calculated.

Output files of macs3

  1. .narrowPeak or .broadPeak: These files contain the genomic coordinates of the called peaks, along with other relevant information such as peak summit, p-value, and fold enrichment. This file is like a map showing the general territory where your protein of interest is hanging out. It outlines the entire region of enrichment, including the summit and the surrounding areas where the protein might be bound with slightly lower affinity.
  • Narrow peaks = TF
  • Broad peaks = histone modifiers
  1. .xls or .csv: These files provide a more human-readable summary of the peak information, including peak location, length, and statistical significance.
  2. .bedGraph: This file contains a continuous signal track of the ChIP-seq enrichment across the genome.
  3. _summits.bed: This file contains the genomic coordinates of the peak summits, which represent the most likely binding sites of the protein of interest. This file is like a pinpoint on the map, marking the exact spot where your protein is most likely to be camped out. It represents the point of highest enrichment within the peak, indicating the most probable binding site.

So, the .narrowPeak or .broadPeak file gives you the broader picture of the protein-DNA interaction, while the _summit.bed file zooms in on the most critical binding location.

Perform upstream analysis

1. Downloading fastq files from GEO

The input file for upstream analysis of transcriptomics is the fastq files. we will use some fastq file from GSE66083 including
- YAP_rep1 = SRR1810900
- TAZ_rep1 = SRR1810907
- TEAD4_rep1 = SRR1810918
- IgG = SRR1810912

By using nf-core fetchngs pipeline, the default (FTP) will be used.

Overview of fetchngs from nf-core (nf-core/fetchngs) Credit pic
Overview of fetchngs from nf-core (nf-core/fetchngs) Credit pic

To download those mentioned fastq files, we have to create a file named ids.csv where contains all SRR numbers that we want to download.

# Go to directory where `ids.csv` file was saved
cd data/fetchngs

# Oberving the `ids.csv` file
cat ids.csv

This is ids.csv file

    SRR1810907
    SRR1810900
    SRR1810918
    SRR1810912

Next, I define the needed information in nf-core/fetchngs launch pipeline. After launch the workflow, I got nf-params.json file. I download the file to run it locally.

Here is my nf-params.json. Please note that we will run all process on High Performance Computing (HPC), thus the file paths are full path of HPC.

# Go to directory where the `nf-params.json file was saved
cd data/HPC/data/fetchngs

# Observing the `nf-params.json` file
cat nf-params.json

This is nf-params.json file

    {
        "input": "/mnt/sisplockers/krittiyabhornk/nf/ChIP_Chatomic/data/ids.csv",
        "outdir": "/mnt/sisplockers/krittiyabhornk/nf/ChIP_Chatomic/data",
        "email": "kkrittiyabhorn@gmail.com"
    }

4 fastq files. To start nf-core/fetchngs process, I run the following command.

nextflow run nf-core/fetchngs -r 1.12.0 -name Zanconato -params-file nf-params.json -profile docker

The process will start and show the picture below. Moreover, it will real-time update the status of each process.

After pipeline finished successfully, we will find these 4 fastq files.

2. Reference genome and genome annotation

As I mentioned above that aligning on reference genome. We will use reference genome that provided by nf-core which we will use human GRCh38.

3. Performing upstream analysis

The upstream analysis was performed on HPC using a nf-core pipeline called chipseq.

From nf-core/chipseq pipeline, we will mainly use the blue pipeline where BWA will be used for aligning. I will use Bowties2 for aligning, thus my pipeline will be the yellow line

To analyze a set of .fastq files, we have to create a file named samplesheet.csv where includes all information about our .fastq file.

In my case, my data is single-end sequencing read. Thus, I have only one fastq file per each replicate of each sample.

# Go to directory that save `samplesheet.csv` file
cd data/chipseq

# Observing the `samplesheet.csv` file
cat samplesheet.csv
    sample,fastq_1,fastq_2,replicate,antibody,control,control_replicate
    YAP,/mnt/sisplockers/krittiyabhornk/nf/ChIP_Chatomic/data/fastq/SRX883578_SRR1810907.fastq.gz,,1,YAP,IgG,1
    TAZ,/mnt/sisplockers/krittiyabhornk/nf/ChIP_Chatomic/data/fastq/SRX883576_SRR1810900.fastq.gz,,1,TAZ,IgG,1
    TEAD4,/mnt/sisplockers/krittiyabhornk/nf/ChIP_Chatomic/data/fastq/SRX883582_SRR1810918.fastq.gz,,1,TEAD4,IgG,1
    IgG,/mnt/sisplockers/krittiyabhornk/nf/ChIP_Chatomic/data/fastq/SRX883580_SRR1810912.fastq.gz,,1,,,

Next, I define the needed information in nf-core/chipseq launch pipeline. After launch the workflow, I got nf-params.json file. I download the file to run the analysis it locally.

Please note that
- we will run all process on High Performance Computing (HPC), thus the file paths are full path of HPC.
- Since we use nf-core provided reference genome, the black-listed regions will be automatically excluded by the pipeline (They have provided black-listed regions.)

Here is my nf-params.json.

# Go to directory where the `nf-params.json file was saved
cd data/chipseq

# Observing the `nf-params.json` file
cat nf-params.json
   {
        "input": "\/mnt\/sisplockers\/krittiyabhornk\/nf\/ChIP_Chatomic\/samplesheet.csv",
        "outdir": "\/mnt\/sisplockers\/krittiyabhornk\/nf\/ChIP_Chatomic\/output_new",
        "genome": "GRCh38",
        "aligner": "bowtie2",
        "narrow_peak": true,
        "macs_gsize": 2701495711
   }

NOTE: macs_gsize for macs3 software is defined according to read length = 50 bp (This is known from library preparation.). However, normally we can used default values for each version human genome as showed by deeptools. You can used macs3 callpeak -g hs if you are not using nf-core (hs: 2,913,022,398 for GRCh38).

Now, All information is ready to perfrom nf-core/chipseq to analyze from 4 fastq files to gene count. To start nf-core/chipseq process, I run the following command.

nextflow run nf-core/chipseq -r 2.1.0 -name Chatomic4 -profile docker -resume -params-file nf-params.json

The process will start and show the picture below. Moreover, it will real-time update the status of each process.

After the analysis complete, the command will show as picture below.

Please note that we can close our terminal by using screen command (see my trick).

We can observe QC from both pre and post-alignment and many more visualization created by the tools provided by nf-core/chipseq pipeline from MultiQC.html file.

MultiQc file
MultiQc file

Visualize the peak

We can visualize peak (narrowPeak.bed and _summit.bed) and the normalized fragment the mapped to the reference genome (.bigWig) using Integrative Genomics Viewer(IGV) which currently have web app. You can see your data without IGV software installation.

Visualize the peak using IGV-web app
Visualize the peak using IGV-web app