Overview of upstream transcriptomics analysis

The upstream analysis will include the analysis from downloading fastq files to gene count. Each step of the analysis will be explained in detail in each section. The nf-core pipelines were used for all processes in this upstream analysis according to the reproducibility concern. Please see the detail of nf-core setting in my note for computer workbench.

Overview of all transcriptomics process. Credit pic
Overview of all transcriptomics process. Credit pic

Transcriptome analysis starts with fastq files as input files and pass through 5 different processes including

1. Pre-alignment quality control (QC): To test the quanlity of fastq file, the tools such as FastQC

Input = .fastq files
Output = the same .fastq files without any editing

2. Trimming: To cut the adapter out from each read.

Input = .fastq files
Output = .fastq files which adapter was cut.

What is the adapter? and Why we need to cut them? Credit pic
What is the adapter? and Why we need to cut them? Credit pic


3. Aligning: To map the sequencing read.

It is different versions of alignment depending the organism you are working with. Mainly, it is divided into 2 categories which are organism with and without reference genome.

2 Main types of the alignments. Credit pic
2 Main types of the alignments. Credit pic


3.1 An organism with a reference genome: It is possible to infer which transcripts are expressed by mapping the reads to the reference genome (genome mapping) or transcriptome (transcriptome mapping).

To understand the different usage between Genome and Transcriptome mapping, you need to firstly back to central dogma as picture showed below. RNA-seq reads can be aligned directly to reference transcriptome. However, transcriptome mapping is not suitable for discovery of novel isoforms and splice junctions. Therefore, it depend on research question.

The difference of genome and transcriptome Credit pic: P. Kueanjinda
The difference of genome and transcriptome Credit pic: P. Kueanjinda


3.2 An organism without a reference genome: The reads need to be assembled first into longer contigs (de novo assembly).

In my case, I want to quantify gene expression of Human without concerning about novel isoforms and slice junction. Therefore, I will use at least transcriptome mapping.
I choose STAR as my tool for alignment process. Mapping will be done on genome/transciptome and we also want to know whether which location(gene) it map to. Therefore, the additional files needs for the alignment process are reference genome and gene annotation, accordingly.

How STAR, an alignment tool works Credit pic
How STAR, an alignment tool works Credit pic


If we manually perform this alignment process using STAR, we need 2 main steps including
- Generating a genome index: using --runMode genomeGenerate - Reference genome alignment: The genome index from previous step will be used as one of the inputs for this step.

Therefore, input and output for STAR tool are - Input = sample files (.fastq files), reference genome (fasta file), gene annotation (GTF)
- Output = SAM or BAM files which adapter was cut.

4. Counting (Quantification)

The mapped reads will next be quantified the gene expression. I will use RSEM for this process. If we manually perform this quantification process using RSEM, we need 2 main steps including
- Reference preparation: using rsem-prepare-reference - Expression calculation: The RSEM reference from previous step will be used as one of the inputs for this step using rsem-calculate-expression

Therefore, input and output for RSEM tool are
- Input = mapped read from sample (BAM files), reference genome (fasta file), gene annotation (GTF)
- Output = Gene-level results (rsem.genes.results), Isoform-level results (rsem.isoforms.results)

5. Post-alignment quality control

Here RSeQC will be used and we will observe the QC result from all post-aligned files using MultiQC which will provide as a .html file.

Perform upstream analysis

1. Downloading fastq files from GEO

The input file for upstream analysis of transcriptomics is the fastq files. For this example, I will use the data from GSE197576. The data from the GSE includes all data under the study, which we don’t want all of those. Here, I want to compare normoxic and hypoxic condition. Therefore, I includ only control cell sgCTRL from hypoxic (GSM5920765, GSM5920766) and normoxic (GSM5920759, GSM5920760) condition.

By using nf-core fetchngs pipeline, the default (FTP) will be used.

Overview of RNA-seq analysis from nf-core (nf-core/rnaseq) Credit pic
Overview of RNA-seq analysis from nf-core (nf-core/rnaseq) Credit pic

To download those mentioned fastq files, we have to create a file named ids.csv where contains all GSM numbers that we want to download.

# Go to directory where `ids.csv` file was saved
cd data/fetchngs

# Oberving the `ids.csv` file
cat ids.csv

This is ids.csv file

    GSM5920759
    GSM5920760
    GSM5920765
    GSM5920766

Next, I define the needed information in nf-core/fetchngs launch pipeline. After launch the workflow, I got nf-params.json file. I download the file to run it locally.
Here is my nf-params.json. Please note that we will run all process on High Performance Computing (HPC), thus the file paths are full path of HPC.

# Go to directory where the `nf-params.json file was saved
cd data/HPC/data/fetchngs

# Observing the `nf-params.json` file
cat nf-params.json

This is nf-params.json file

{
    "input": "/mnt/sisplockers/krittiyabhornk/nf/hypox/data/ids.csv",
    "outdir": "/mnt/sisplockers/krittiyabhornk/nf/hypox/data",
    "email": "kkrittiyabhorn@gmail.com"
}

4 fastq files. To start nf-core/fetchngs process, I run the following command.

nextflow run nf-core/fetchngs -r 1.12.0 -name Hypox -params-file nf-params.json -profile docker

The process will start and show the picture below. Moreover, it will real-time update the status of each process.


After pipeline finished successfully, we will find these 4 fastq files as shown below.

2. Downloading reference genome and genome annotation

As I mentioned above that aligning on reference genome/transcriptome need reference genome and gene annotation.
Here, we will download the latest version of those two using the command below.

# Find the latest version 
latest_release=$(curl -s 'http://rest.ensembl.org/info/software?content-type=application/json' | grep -o '"release":[0-9]*' | cut -d: -f2)

# Download the Human (Homo_sapiens.GRCh38) reference genome 
wget -L ftp://ftp.ensembl.org/pub/release-${latest_release}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

# Download the Human (Homo_sapiens.GRCh38)gene annotation
wget -L ftp://ftp.ensembl.org/pub/release-${latest_release}/gtf/homo_sapiens/Homo_sapiens.GRCh38.${latest_release}.gtf.gz

3. Performing upstream analysis

The upstream analysis was performed on HPC using a nf-core pipeline called rnaseq.

From nf-core/rnaseq pipeline, we will mainly use the dark blue pipeline where Trim Galora, STAR, and RSEM will be used for trimming, aligning, and quantification, respectively. I used those tools as mentioned above according to the reccomendation in STAT115 course which thought by Prof. Xiaole Shirley Liu, Harvard university

To analyze a set of .fastq files, we have to create a file named samplesheet.csv where includes all information about our .fastq file.
In my case, my data is single-end sequencing read. Thus, I have only one fastq file per each replicate of each sample.

# Go to directory that save `samplesheet.csv` file
cd data/rnaseq

# Observing the `samplesheet.csv` file
cat samplesheet.csv

Next, I define the needed information in nf-core/rnaseq launch pipeline. After launch the workflow, I got nf-params.json file. I download the file to run the analysis it locally.
Here is my nf-params.json. Please note that we will run all process on High Performance Computing (HPC), thus the file paths are full path of HPC.

# Go to directory where the `nf-params.json file was saved
cd data/rnaseq

# Observing the `nf-params.json` file
cat nf-params.json

Now, All information is ready to perfrom nf-core/rnaseq to analyze from 4 fastq files to gene count. To start nf-core/rnaseq process, I run the following command.

nextflow run nf-core/rnaseq -r 3.14.0 -name HYPOX -resume -params-file nf-params.json -profile docker

The process will start and show the picture below. Moreover, it will real-time update the status of each process.

We can observe QC from both pre and post-alignment and many more visualization created by the tools provided by nf-core/rnaseq pipeline from MultiQC.html file.

MultiQc file
MultiQc file

Trick for nf-core

As nf-core is a great tool that runs the analysis processes through Nextflow, we can extract the details of each process, such as the command used, input, and output. Follow the steps below:

  1. Find and open the Nextflow report (execution_report.html) in the outdir/pipeline_info directory. You will see a step-by-step overview of the pipeline running information.

  2. Find the step you are interested in to see the running details. In this case, I am interested in the reference preparation of RSEM. I will search for the process, tag, and hash in the .html file. I found the process, and the hash is 44/26c6d520d5a549af6f46a1bb0246a1bb0246fb.

  3. Go to the outdir/work directory and find the directory 44 and the sub-directory 26c6d520d5a549af6f46a1bb0246a1bb0246fb.

  4. See all files in this process by using the following command:

ls -al #see all files including hidden files
  1. The command line used in this process is in .command.sh. You can use cat .command.sh to see it as an example on the right side of the picture below.