FastQC
The upstream analysis will include the analysis from downloading fastq files to gene count. Each step of the analysis will be explained in detail in each section. The nf-core pipelines were used for all processes in this upstream analysis according to the reproducibility concern. Please see the detail of nf-core setting in my note for computer workbench.
Transcriptome analysis starts with fastq files as input files and
pass through 5 different processes including
FastQC
Input = .fastq
files
Output = the same .fastq
files without any editing
Input = .fastq
files
Output = .fastq
files which adapter was cut.
It is different versions of alignment depending the organism you are
working with. Mainly, it is divided into 2 categories which are organism
with and without reference genome.
To understand the different usage between Genome and Transcriptome
mapping, you need to firstly back to central dogma as picture showed
below. RNA-seq reads can be aligned directly to reference transcriptome.
However, transcriptome mapping is not suitable for discovery of novel
isoforms and splice junctions. Therefore, it depend on research
question.
In my case, I want to quantify gene expression of Human without
concerning about novel isoforms and slice junction. Therefore, I will
use at least transcriptome mapping.
I choose STAR
as my tool for alignment process. Mapping
will be done on genome/transciptome and we also want to know whether
which location(gene) it map to. Therefore, the additional files needs
for the alignment process are reference genome and gene annotation,
accordingly.
If we manually perform this alignment process using
STAR
, we need 2 main steps including
- Generating a genome index: using --runMode genomeGenerate
- Reference genome alignment: The genome index from previous step will
be used as one of the inputs for this step.
Therefore, input and output for STAR tool are - Input = sample files
(.fastq
files), reference genome (fasta
file),
gene annotation (GTF
)
- Output = SAM
or BAM
files which adapter was
cut.
The mapped reads will next be quantified the gene expression. I will
use RSEM
for this process. If we manually perform this
quantification process using RSEM
, we need 2 main steps
including
- Reference preparation: using rsem-prepare-reference
-
Expression calculation: The RSEM reference from previous step will be
used as one of the inputs for this step using
rsem-calculate-expression
Therefore, input and output for RSEM tool are
- Input = mapped read from sample (BAM
files), reference
genome (fasta
file), gene annotation
(GTF
)
- Output = Gene-level results (rsem.genes.results
),
Isoform-level results (rsem.isoforms.results
)
Here RSeQC
will be used and we will observe the QC
result from all post-aligned files using MultiQC
which will
provide as a .html
file.
fastq
files from GEOThe input file for upstream analysis of transcriptomics is the fastq
files. For this example, I will use the data from GSE197576.
The data from the GSE includes all data under the study, which we don’t
want all of those. Here, I want to compare normoxic and hypoxic
condition. Therefore, I includ only control cell sgCTRL
from hypoxic (GSM5920765,
GSM5920766)
and normoxic (GSM5920759,
GSM5920760)
condition.
By using nf-core fetchngs
pipeline, the default
(FTP
) will be used.
To download those mentioned fastq files, we have to create a file
named ids.csv
where contains all GSM
numbers
that we want to download.
# Go to directory where `ids.csv` file was saved
cd data/fetchngs
# Oberving the `ids.csv` file
cat ids.csv
This is ids.csv
file
GSM5920759
GSM5920760
GSM5920765
GSM5920766
Next, I define the needed information in nf-core/fetchngs
launch pipeline. After launch the workflow, I got
nf-params.json
file. I download the file to run it
locally.
Here is my nf-params.json
. Please note that we will run all
process on High Performance Computing (HPC), thus the file paths are
full path of HPC.
# Go to directory where the `nf-params.json file was saved
cd data/HPC/data/fetchngs
# Observing the `nf-params.json` file
cat nf-params.json
This is nf-params.json
file
{
"input": "/mnt/sisplockers/krittiyabhornk/nf/hypox/data/ids.csv",
"outdir": "/mnt/sisplockers/krittiyabhornk/nf/hypox/data",
"email": "kkrittiyabhorn@gmail.com"
}
4 fastq files. To start nf-core/fetchngs
process, I run
the following command.
nextflow run nf-core/fetchngs -r 1.12.0 -name Hypox -params-file nf-params.json -profile docker
The process will start and show the picture below. Moreover, it will real-time update the status of each process.
After pipeline finished successfully, we will find these 4 fastq files as shown below.
As I mentioned above that aligning on reference genome/transcriptome
need reference genome and gene annotation.
Here, we will download the latest version of those two using the command
below.
# Find the latest version
latest_release=$(curl -s 'http://rest.ensembl.org/info/software?content-type=application/json' | grep -o '"release":[0-9]*' | cut -d: -f2)
# Download the Human (Homo_sapiens.GRCh38) reference genome
wget -L ftp://ftp.ensembl.org/pub/release-${latest_release}/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
# Download the Human (Homo_sapiens.GRCh38)gene annotation
wget -L ftp://ftp.ensembl.org/pub/release-${latest_release}/gtf/homo_sapiens/Homo_sapiens.GRCh38.${latest_release}.gtf.gz
The upstream analysis was performed on HPC using a nf-core pipeline
called rnaseq
.
From nf-core/rnaseq
pipeline, we will mainly use the
dark blue pipeline where Trim Galora
, STAR
,
and RSEM
will be used for trimming, aligning, and
quantification, respectively. I used those tools as mentioned above
according to the reccomendation in STAT115 course which thought by
Prof. Xiaole Shirley Liu, Harvard university
To analyze a set of .fastq
files, we have to create a
file named samplesheet.csv
where includes all information
about our .fastq
file.
In my case, my data is single-end sequencing read. Thus, I have only one
fastq file per each replicate of each sample.
# Go to directory that save `samplesheet.csv` file
cd data/rnaseq
# Observing the `samplesheet.csv` file
cat samplesheet.csv
Next, I define the needed information in nf-core/rnaseq
launch pipeline. After launch the workflow, I got
nf-params.json
file. I download the file to run the
analysis it locally.
Here is my nf-params.json
. Please note that we will run all
process on High Performance Computing (HPC), thus the file paths are
full path of HPC.
# Go to directory where the `nf-params.json file was saved
cd data/rnaseq
# Observing the `nf-params.json` file
cat nf-params.json
Now, All information is ready to perfrom nf-core/rnaseq
to analyze from 4 fastq files to gene count. To start
nf-core/rnaseq
process, I run the following command.
nextflow run nf-core/rnaseq -r 3.14.0 -name HYPOX -resume -params-file nf-params.json -profile docker
The process will start and show the picture below. Moreover, it will real-time update the status of each process.
We can observe QC from both pre and post-alignment and many more
visualization created by the tools provided by
nf-core/rnaseq
pipeline from MultiQC.html
file.
nf-core
As nf-core is a great tool that runs the analysis processes through Nextflow, we can extract the details of each process, such as the command used, input, and output. Follow the steps below:
Find and open the Nextflow report
(execution_report.html
) in the
outdir/pipeline_info
directory. You will see a step-by-step
overview of the pipeline running information.
Find the step you are interested in to see the running details.
In this case, I am interested in the reference preparation of RSEM. I
will search for the process, tag, and hash in the .html
file. I found the process, and the hash is
44/26c6d520d5a549af6f46a1bb0246a1bb0246fb
.
Go to the outdir/work
directory and find the
directory 44
and the sub-directory
26c6d520d5a549af6f46a1bb0246a1bb0246fb
.
See all files in this process by using the following command:
ls -al #see all files including hidden files
.command.sh
. You can use cat .command.sh
to
see it as an example on the right side of the picture below.