Transcriptomics: From nf-core Count Matrix to Differential Expression Analysis (DESeq2) and Functional Analysis

Overview of this analysis
1. Project organization
2. Get the count matrix and manage the data
3. Differential gene expression analysis by DESeq2
4. Observe the DESeq2 results by visualization
- 4.1 Volcano plot
5. Over-Representation Analysis (ORA)

Overview of this analysis

For the downstream analysis of transcriptomics, we will run all code in RStudio. The gene counts will serve as the input file, and DESeq2 will be used for differential expression analysis. After identifying differentially expressed genes, we will utilize various tools and visualizations, as shown in the picture below.

Overview of Downstream transcriptomics analysis. Credit pic

1. Project organization

Let’s start with project organization. I organized the folders in this project to ensure reproducibility.

I created a set of routine folders, which include data, results, and scripts.

Furthermore, I run all R scripts using the R server in Docker and have also set up the R project and Git version control. For more details, please see My Computer Workbench.

2. Get the count matrix and manage the data

2.1 Obtain the data from the upstream analysis

Here, we obtain the input file from the upstream process by navigating to the outdir/star_rsem directory, where you will find a file named rsem.merged.gene_counts.tsv.

Next, we will load the readr and dplyr libraries to read the .tsv file and manage the data within it.

library(dplyr)
library(readr)
library(here)

# Define location of the file
raw_counts_file <- here("data/HPC/result/STAR_RSEM", "rsem.merged.gene_counts.tsv")

# Import the file
raw_counts <- read_tsv(raw_counts_file)

# Observe head of the data
head(raw_counts)

## # A tibble: 6 × 6
##   gene_id         `transcript_id(s)`         sgCTRL_Hypox_Rep1 sgCTRL_Hypox_Rep2
##   <chr>           <chr>                                  <dbl>             <dbl>
## 1 ENSG00000000003 ENST00000373020,ENST00000…              757               734 
## 2 ENSG00000000005 ENST00000373031,ENST00000…                0                 0 
## 3 ENSG00000000419 ENST00000371582,ENST00000…             2333.             2390.
## 4 ENSG00000000457 ENST00000367770,ENST00000…              324.              377.
## 5 ENSG00000000460 ENST00000286031,ENST00000…              417.              451.
## 6 ENSG00000000938 ENST00000374003,ENST00000…                4                 3 
## # ℹ 2 more variables: sgCTRL_Normox_Rep1 <dbl>, sgCTRL_Normox_Rep2 <dbl>

2.2 Manage the data

To simplify the data format, we will exclude the transcript_id(s) column and rename the gene_id column to ENSEMBL.

# Exclude column 2 which is `transcript_id(s)
raw_counts <- raw_counts[, -2]

# Rename the `gene_id` to `SYMBOL`
colnames(raw_counts) <- c("ENSEMBL", 
                          "sgCTRL_Hypox_Rep1", "sgCTRL_Hypox_Rep2",
                          "sgCTRL_Normox_Rep1", "sgCTRL_Normox_Rep2")

2.3 Change the data type to matrix

Currently, the gene counts in raw_counts are in character format. We will convert raw_counts to a matrix, as the gene counts should be treated as numeric values. Additionally, we need to convert all count values in raw_counts_mat to integers, in accordance with DESeq2’s input requirements.

# Change only count values to matrix
raw_counts_mat <- raw_counts[, -1] %>% 
  as.matrix() 

# Convert all values in `raw_counts_mat` to integer according to DESeq2 input request.
raw_counts_mat <- apply(raw_counts_mat, 2, as.integer)

# Add rownames
rownames(raw_counts_mat) <- raw_counts$ENSEMBL

head(raw_counts_mat)

##                 sgCTRL_Hypox_Rep1 sgCTRL_Hypox_Rep2 sgCTRL_Normox_Rep1
## ENSG00000000003               757               734                577
## ENSG00000000005                 0                 0                  0
## ENSG00000000419              2333              2390               2173
## ENSG00000000457               323               377                203
## ENSG00000000460               417               450                507
## ENSG00000000938                 4                 3                  3
##                 sgCTRL_Normox_Rep2
## ENSG00000000003                677
## ENSG00000000005                  0
## ENSG00000000419               2671
## ENSG00000000457                298
## ENSG00000000460                690
## ENSG00000000938                 14

2.4 Subset the data: only the protein-coding genes

In most cases of RNA-seq analysis, we subset only protein-coding genes for several reasons, including primary biological function, higher expression levels compared to non-coding genes, accurate gene annotation, reduction of noise, simplified data interpretation, and targeted research goals (specific pathways or cellular processes).

library(org.Hs.eg.db)

# list all keytypes from `org.Hs.eg.db` to see what is gene type was named. 
columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"

We will retrieve the GENETYPE of raw_counts (the same data as raw_counts_mat) using SYMBOL. Here, we will create a map that matches SYMBOL with our data (raw_counts and raw_counts_mat).

map <- AnnotationDbi::select(org.Hs.eg.db, 
                            keys = raw_counts$ENSEMBL , # column that contain key (we want to convert this col.)
                            keytype = "ENSEMBL", # type of key in the column,
                            columns = "GENETYPE") # What we want to retrieve

head(map)

##           ENSEMBL       GENETYPE
## 1 ENSG00000000003 protein-coding
## 2 ENSG00000000005 protein-coding
## 3 ENSG00000000419 protein-coding
## 4 ENSG00000000457 protein-coding
## 5 ENSG00000000460 protein-coding
## 6 ENSG00000000938 protein-coding

Observe the types of gene.

unique(map$GENETYPE)

##  [1] "protein-coding" "pseudo"         "ncRNA"          NA              
##  [5] "unknown"        "snoRNA"         "snRNA"          "rRNA"          
##  [9] "other"          "scRNA"

We will select only protein-coding genes by defining them in map_CodingGene. Then, we will use map_CodingGene to subset our data.

map_CodingGene <- map %>% 
  dplyr::filter(GENETYPE == "protein-coding") 

head(map_CodingGene)

##           ENSEMBL       GENETYPE
## 1 ENSG00000000003 protein-coding
## 2 ENSG00000000005 protein-coding
## 3 ENSG00000000419 protein-coding
## 4 ENSG00000000457 protein-coding
## 5 ENSG00000000460 protein-coding
## 6 ENSG00000000938 protein-coding

Show number of protein-codinggenes.

nrow(map_CodingGene)

## [1] 19519

From raw_count_mat, we will subset only the common genes, which are the intersecting genes between raw_count_mat and map_CodingGene (protein-coding genes).

# Define the `common_genes` = genes that are in both `raw_count_mat` and `map_CodingGene`
common_genes <- intersect(rownames(raw_counts_mat), map_CodingGene$ENSEMBL)

# subset the common genes and re-order them by `coding_genes`
raw_counts_mat <- raw_counts_mat[common_genes, ]

head(raw_counts_mat)

##                 sgCTRL_Hypox_Rep1 sgCTRL_Hypox_Rep2 sgCTRL_Normox_Rep1
## ENSG00000000003               757               734                577
## ENSG00000000005                 0                 0                  0
## ENSG00000000419              2333              2390               2173
## ENSG00000000457               323               377                203
## ENSG00000000460               417               450                507
## ENSG00000000938                 4                 3                  3
##                 sgCTRL_Normox_Rep2
## ENSG00000000003                677
## ENSG00000000005                  0
## ENSG00000000419               2671
## ENSG00000000457                298
## ENSG00000000460                690
## ENSG00000000938                 14

3. Differential gene expression analysis by `DESeq2`

3.1 Prepare the files for `DESeq2` analysis: Prepare `ColData`

The DESeq2 analysis needs 2 input files which are countData (which we have as raw_counts_mat) and colData (which we will prepare).

# Define condition of each column
coldata <- data.frame(condition = c("hypoxia", "hypoxia", "normoxia", "normoxia"))
rownames(coldata) <- colnames(raw_counts_mat)

# Recheck whether the name is match with the defined condition
coldata

##                    condition
## sgCTRL_Hypox_Rep1    hypoxia
## sgCTRL_Hypox_Rep2    hypoxia
## sgCTRL_Normox_Rep1  normoxia
## sgCTRL_Normox_Rep2  normoxia

3.2 Differential gene expression analysis by `DESeq2`

This starts by creating a DESeq2 object (dds) using raw_counts_mat and coldata as inputs. This object will use the condition as the main factor.

library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = raw_counts_mat,
                              colData = coldata,
                              design = ~ condition)

dds <- DESeq(dds)
dds

## class: DESeqDataSet 
## dim: 19519 4 
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(19519): ENSG00000000003 ENSG00000000005 ... ENSG00000293542
##   ENSG00000293543
## rowData names(22): baseMean baseVar ... deviance maxCooks
## colnames(4): sgCTRL_Hypox_Rep1 sgCTRL_Hypox_Rep2 sgCTRL_Normox_Rep1
##   sgCTRL_Normox_Rep2
## colData names(2): condition sizeFactor

To obtain differential gene expression, the two conditions (normoxia and hypoxia) will be compared.

res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))

res

## log2 fold change (MLE): condition hypoxia vs normoxia 
## Wald test p-value: condition hypoxia vs normoxia 
## DataFrame with 19519 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE       stat      pvalue
##                  <numeric>      <numeric> <numeric>  <numeric>   <numeric>
## ENSG00000000003    681.296     0.00331546 0.1141074  0.0290556 9.76820e-01
## ENSG00000000005      0.000             NA        NA         NA          NA
## ENSG00000000419   2391.024    -0.28016274 0.0660624 -4.2408819 2.22643e-05
## ENSG00000000457    294.175     0.25005363 0.1748603  1.4300194 1.52711e-01
## ENSG00000000460    520.260    -0.70213180 0.1297147 -5.4128924 6.20147e-08
## ...                    ...            ...       ...        ...         ...
## ENSG00000292373   5.053874     -0.8190937   1.24552  -0.657632    0.510775
## ENSG00000293137 225.412652     -0.0273231   0.19486  -0.140219    0.888487
## ENSG00000293435   0.614115     -2.8461645   3.99107  -0.713133    0.475763
## ENSG00000293542   0.000000             NA        NA         NA          NA
## ENSG00000293543   4.600170     -0.3293342   1.31914  -0.249659    0.802851
##                        padj
##                   <numeric>
## ENSG00000000003 9.83702e-01
## ENSG00000000005          NA
## ENSG00000000419 8.16885e-05
## ENSG00000000457 2.46176e-01
## ENSG00000000460 3.18345e-07
## ...                     ...
## ENSG00000292373    0.630954
## ENSG00000293137    0.926124
## ENSG00000293435          NA
## ENSG00000293542          NA
## ENSG00000293543    0.864271

3.3 Observe the analysed data: P-value as a histogram

Firstly, we will examine the distribution of the p-value using a histogram.

library(ggplot2)

res %>% as.data.frame() %>%
  arrange(padj) %>%
  ggplot(aes(x=pvalue)) +
  geom_histogram(color = "white", bins = 50)

According to Tommy recommendation, genes with small counts can distort the p-values, so we will compare the counts between those with low and high p-values using the box plot below.

res %>% as.data.frame() %>%
        tibble::rownames_to_column(var = "gene_id") %>%
        filter(!is.na(pvalue)) %>%
        mutate(pvalue_bin = if_else(pvalue > 0.75, "high", "low")) %>%
        ggplot(aes(x= pvalue_bin, y = log2(baseMean))) +
        geom_boxplot()

As the box plot shows, genes with high p-values have lower gene expression. Therefore, we will remove the genes with low counts.

# Counts across 4 samples lower than 10 will be removed (This number can be changes.)
dds <- dds[rowSums(counts(dds)) > 10,]

# Recalculate differential gene expression
dds <- DESeq(dds)

res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))

# Re-observe the distribution of p-value
res %>% as.data.frame() %>%
  arrange(padj) %>%
  ggplot(aes(x=pvalue)) +
  geom_histogram(color = "white", bins = 50)

Now, the distribution of p-value looks better.

4. Observe the DESeq2 results by visualization

4.1 Volcano plot

The volcano plot will label the most differentially expressed genes as either upregulated or downregulated.

4.1.1 Convert `ENSEMBL` to readable format (`SYMBOL`)

library(org.Hs.eg.db)

res_forMap <-   res %>% 
  as.data.frame() %>%
  tibble::rownames_to_column(var = "ENSEMBL")

map <- AnnotationDbi::select(org.Hs.eg.db, 
                            keys = res_forMap$ENSEMBL , # column that contain key (we want to convert this col.)
                            keytype = "ENSEMBL", # type of key in the column,
                            columns= "SYMBOL")# What we want to retrieve

head(map)

##           ENSEMBL SYMBOL
## 1 ENSG00000000003 TSPAN6
## 2 ENSG00000000419   DPM1
## 3 ENSG00000000457  SCYL3
## 4 ENSG00000000460  FIRRM
## 5 ENSG00000000938    FGR
## 6 ENSG00000001036  FUCA2

Adding SYMBOL to res_map.

res_map <- left_join(res_forMap, map)

head(res_map)

##           ENSEMBL    baseMean log2FoldChange      lfcSE       stat       pvalue
## 1 ENSG00000000003  681.296045    0.003393615 0.11160564  0.0304072 9.757423e-01
## 2 ENSG00000000419 2391.023676   -0.280150021 0.06539050 -4.2842618 1.833469e-05
## 3 ENSG00000000457  294.175101    0.249896622 0.17052780  1.4654304 1.428035e-01
## 4 ENSG00000000460  520.260176   -0.702192509 0.12671529 -5.5414978 2.998954e-08
## 5 ENSG00000000938    5.898808   -1.445522599 1.19412125 -1.2105325 2.260746e-01
## 6 ENSG00000001036 1970.721295   -0.250948895 0.07029303 -3.5700393 3.569277e-04
##           padj SYMBOL
## 1 9.829441e-01 TSPAN6
## 2 6.357477e-05   DPM1
## 3 2.211256e-01  SCYL3
## 4 1.482511e-07  FIRRM
## 5 3.241821e-01    FGR
## 6 1.008703e-03  FUCA2

4.1.2 Define the labeled genes (high change value & significant)

We will first define genes that we will label as genes_to_label, which includes only genes with a log2 fold change greater than 2.5, an adjusted p-value lower than 0.001, and names that do not start with LOC(indicating genes for which a published symbol is not available).

genes_to_label <- res_map %>%
  filter(!stringr::str_detect(SYMBOL, "LOC"),
         abs(log2FoldChange) >= 2.5,
         padj <= 0.001)

head(genes_to_label)

##           ENSEMBL   baseMean log2FoldChange      lfcSE      stat        pvalue
## 1 ENSG00000006606   47.08133       3.537408 0.53497078  6.612338  3.782959e-11
## 2 ENSG00000006659   20.89578       5.232378 1.15501146  4.530152  5.894124e-06
## 3 ENSG00000013588 2360.00398       2.931581 0.07820453 37.486076 1.553028e-307
## 4 ENSG00000019186  274.23162       3.098137 0.20695901 14.969810  1.156517e-50
## 5 ENSG00000023445  127.07710       3.772602 0.34105723 11.061494  1.928580e-28
## 6 ENSG00000024422  719.66484       6.362696 0.27362463 23.253375 1.314978e-119
##            padj  SYMBOL
## 1  2.427837e-10   CCL26
## 2  2.192853e-05 LGALS14
## 3 1.171392e-304  GPRC5A
## 4  4.918116e-49 CYP24A1
## 5  3.646237e-27   BIRC3
## 6 1.983680e-117    EHD2

4.1.3 Define the coloring genes (Significant genes)

Moreover, we want to label the significantly different genes by color. Therefore, the significant (sig) and not significant (not_sig) genes will be defined by the code below for labeling purposes.

res_label <- res_map %>%
  mutate(sig = case_when(
    !stringr::str_detect(SYMBOL, "LOC") &
      abs(log2FoldChange) >= 2.5 &
      padj <= 0.001 ~ "sig",
    TRUE ~ "not_sig"
  ))

head(res_label)

##           ENSEMBL    baseMean log2FoldChange      lfcSE       stat       pvalue
## 1 ENSG00000000003  681.296045    0.003393615 0.11160564  0.0304072 9.757423e-01
## 2 ENSG00000000419 2391.023676   -0.280150021 0.06539050 -4.2842618 1.833469e-05
## 3 ENSG00000000457  294.175101    0.249896622 0.17052780  1.4654304 1.428035e-01
## 4 ENSG00000000460  520.260176   -0.702192509 0.12671529 -5.5414978 2.998954e-08
## 5 ENSG00000000938    5.898808   -1.445522599 1.19412125 -1.2105325 2.260746e-01
## 6 ENSG00000001036 1970.721295   -0.250948895 0.07029303 -3.5700393 3.569277e-04
##           padj SYMBOL     sig
## 1 9.829441e-01 TSPAN6 not_sig
## 2 6.357477e-05   DPM1 not_sig
## 3 2.211256e-01  SCYL3 not_sig
## 4 1.482511e-07  FIRRM not_sig
## 5 3.241821e-01    FGR not_sig
## 6 1.008703e-03  FUCA2 not_sig

4.1.4 Plot a volcano

Now, we will create the volcano plot using the ggplot2 library and use ggrepel to prevent label overlap. Additionally, we will add horizontal and vertical lines to mark the highly significant levels based on p-values and fold change, respectively.

library(ggplot2)
library(ggrepel)

ggplot(res_label, aes(x = log2FoldChange, y = -log10(pvalue))) +
  geom_point(aes(color = sig)) +
  scale_color_manual(values = c("blue", "red")) +
  ggrepel::geom_label_repel(data = genes_to_label, aes(label = SYMBOL))+
  geom_hline(yintercept = 100, linetype = 2, color = "red") +
  geom_vline(xintercept = c(-2.5, 2.5), linetype = 2, color = "red")+
  theme_bw(base_size = 14)+
  labs(title = "Hypoxia vs Normoxia")

### 4.2 Principal Component Analysis (PCA) In genomics, PCA is often used to visualize and explore variations in gene expression data across different samples or conditions. Here, we expect to see the same samples, but different replicates clustering together.

4.2.1 Normalize the data

First, we need to normalize the data. This normalized data (normalized_counts) will be used as input for both PCA and the heatmap (in the next section).

# Calculate the normalized data for PCA plot
vsd <- vst(dds, blind=FALSE)

normalized_counts <- assay(vsd) %>% as.matrix()

# Observe the normalized data
head(normalized_counts)

##                 sgCTRL_Hypox_Rep1 sgCTRL_Hypox_Rep2 sgCTRL_Normox_Rep1
## ENSG00000000003          12.23842          12.22585           12.24529
## ENSG00000000419          12.73578          12.74558           12.86559
## ENSG00000000457          12.00239          12.03673           11.96531
## ENSG00000000460          12.06331          12.08121           12.20255
## ENSG00000000938          11.60129          11.59438           11.60139
## ENSG00000001036          12.63943          12.64492           12.72490
##                 sgCTRL_Normox_Rep2
## ENSG00000000003           12.21803
## ENSG00000000419           12.84417
## ENSG00000000457           11.99568
## ENSG00000000460           12.22429
## ENSG00000000938           11.64762
## ENSG00000001036           12.74701

4.2.2 Calculate the priciple component

# Calculate the principal components
pca_prcomp <- prcomp(t(normalized_counts), 
                     center = TRUE, 
                     scale. = FALSE)

# This is the PC values  
pca_prcomp$x

##                          PC1        PC2        PC3          PC4
## sgCTRL_Hypox_Rep1  -9.526003  1.6700232 -0.7426898 6.517009e-14
## sgCTRL_Hypox_Rep2  -9.594833 -1.6612802  0.7438744 6.385440e-14
## sgCTRL_Normox_Rep1  9.532438 -0.9430173 -1.3195604 6.548234e-14
## sgCTRL_Normox_Rep2  9.588398  0.9342743  1.3183758 6.598845e-14

We will next use the data from pca_prcomp$x and prepare it for PCA plotting.

# Create a DataFrame for visualization
PC1_and_PC2 <- data.frame(PC1 = pca_prcomp$x[, 1], 
                          PC2 = pca_prcomp$x[, 2], 
                          type = rownames(pca_prcomp$x))

PC1_and_PC2

##                          PC1        PC2               type
## sgCTRL_Hypox_Rep1  -9.526003  1.6700232  sgCTRL_Hypox_Rep1
## sgCTRL_Hypox_Rep2  -9.594833 -1.6612802  sgCTRL_Hypox_Rep2
## sgCTRL_Normox_Rep1  9.532438 -0.9430173 sgCTRL_Normox_Rep1
## sgCTRL_Normox_Rep2  9.588398  0.9342743 sgCTRL_Normox_Rep2

4.2.3 Generate the PCA plot

ggplot(PC1_and_PC2, aes(x = PC1, y = PC2, col = type)) + 
  geom_point() + 
  geom_text(aes(label = type), hjust = 0, vjust = 0) +
  coord_fixed()+
  theme_classic()

### 4.3 Heatmap Heatmaps are useful in genomics and other scientific fields where large datasets need to be analyzed. For heatmap plotting, only significantly different genes will be displayed, including both upregulated and downregulated genes.

4.3.1 Define significant genes

The significant genes are defined by an adjusted p-value of less than 0.01 and a log2 fold change greater than 2.

significant_genes <- res %>%
  as.data.frame() %>%
  filter(padj <= 0.01, 
         abs(log2FoldChange) >= 2) %>%
  rownames()

head(significant_genes, 10)

##  [1] "ENSG00000003096" "ENSG00000006606" "ENSG00000006659" "ENSG00000008517"
##  [5] "ENSG00000010379" "ENSG00000012822" "ENSG00000013588" "ENSG00000019186"
##  [9] "ENSG00000019549" "ENSG00000023445"

As you saw, significant_genes is simply a list of genes. We will now use this list to filter our normalized data (normalized_counts).

significant_mat <- normalized_counts[significant_genes, ]

4.3.2 Create coloumn annotation

We have four samples, with two replicates each for normoxic and hypoxic conditions. We will label the hypoxic and normoxic samples by coloring the heads of the heatmap columns.

library(ComplexHeatmap)

col_anno <- HeatmapAnnotation(
  df = coldata,
  col = list(condition = c("hypoxia" = "red", 
                           "normoxia" = "blue"))
)

4.3.3 Scale the heatmap input data

Before we plot the heatmap, we need to scale it by following steps.

We transform rows (genes) into columns, so that each column represents a gene.
The scaling will be performed independently for each column (this is the default behavior in R) by subtracting the mean of each row from its values and then dividing by the standard deviation of that row. After scaling, each row will have a mean of 0 and a standard deviation of 1.
We then transpose it back, so that the genes return to being in rows.

In conclusion, scaling is performed independently for each gene.

Please note that the scaled values will typically range from negative to positive, depending on how far the original values deviate from their mean. Therefore, the range is not fixed.

# Scaling the data
scaled_sig_mat <- t(scale(t(significant_mat)))

# Observe min and max values of the scaled data
range_scale <- c(min(scaled_sig_mat), max(scaled_sig_mat))
range_scale

## [1] -1.240994  1.500000

4.3.4 Heatmap plotting

Heatmap(
  scaled_sig_mat,
  top_annotation = col_anno,
  show_row_names = FALSE,
  name = "scaled normalized\nexpression"
)

5. Over-Representation Analysis (ORA)

Over-Representation Analysis (ORA) is a method that compares a list of genes of interest (e.g., DEGs) to a background set (all genes in the experiment) to identify pathways that are over-represented among the genes of interest.

In this case, we will use significant_genes, which have already been defined in the previous section as the genes of interest.

5.1 Observe input datas

5.1.1 Genes of interest: `significant_genes`

head(significant_genes)

## [1] "ENSG00000003096" "ENSG00000006606" "ENSG00000006659" "ENSG00000008517"
## [5] "ENSG00000010379" "ENSG00000012822"

5.1.2 Background genes: the genes that are detected in the RNA-seq experiment.

background_genes <- res %>% 
  as.data.frame() %>% 
  filter(baseMean != 0) %>%
  tibble::rownames_to_column(var = "ENSEMBL") %>%
  pull(ENSEMBL)

5.2 Conversion of ENSEMBL to ENTREZID

According to further analysis, gene identification is mostly required in ENTREZID format (except enrichGO, where ENSEMBL is accepted). Therefore, we need a conversion.

5.2.1 Genes of interest

library(org.Hs.eg.db)
library(clusterProfiler)

significant_genes_map <- clusterProfiler::bitr(geneID = significant_genes,
                      fromType = "ENSEMBL", 
                      toType = "ENTREZID",
                      OrgDb = "org.Hs.eg.db")

head(significant_genes_map)

##           ENSEMBL ENTREZID
## 1 ENSG00000003096    90293
## 2 ENSG00000006606    10344
## 3 ENSG00000006659    56891
## 4 ENSG00000008517     9235
## 5 ENSG00000010379     6540
## 6 ENSG00000012822    57658

5.2.2 Background genes

background_genes_map <- bitr(geneID  = background_genes, 
                            fromType = "ENSEMBL", 
                            toType   = "ENTREZID",
                            OrgDb    = "org.Hs.eg.db")

head(background_genes_map)

##           ENSEMBL ENTREZID
## 1 ENSG00000000003     7105
## 2 ENSG00000000419     8813
## 3 ENSG00000000457    57147
## 4 ENSG00000000460    55732
## 5 ENSG00000000938     2268
## 6 ENSG00000001036     2519

5.3 ORA analysis and visualization

We will use two different gene sets, including Gene Ontology (GO) and the Molecular Signatures Database (MSigDB), for the ORA analysis.

5.3.1 Gene Ontology (GO) enrichment analysis

For GO gene set, there are 3 main classification which are
1. MF: Molecular Function
2. CC: Cellular Component
3. BP: Biological Process

Here, we will use BP for gene ontology and Benjamini & Hochberg correction to adjust p-values for multiple testing.

ego <- enrichGO(gene          = significant_genes_map$ENTREZID,
                universe      = background_genes_map$ENTREZID , 
                OrgDb         = org.Hs.eg.db,   # Specifies the organism: “org.Hs.eg.db” for Homo sapiens
                ont           = "BP", # Specified the ontology: BP for Biological Process
                pAdjustMethod = "BH", # The method used to adjjust p-values: BH for Benjamini & Hochberg correction
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

head(ego)

##                    ID                           Description GeneRatio   BgRatio
## GO:0001525 GO:0001525                          angiogenesis    45/334 421/12958
## GO:0048514 GO:0048514            blood vessel morphogenesis    49/334 496/12958
## GO:0006935 GO:0006935                            chemotaxis    31/334 288/12958
## GO:0042330 GO:0042330                                 taxis    31/334 288/12958
## GO:0045765 GO:0045765            regulation of angiogenesis    26/334 215/12958
## GO:1901342 GO:1901342 regulation of vasculature development    26/334 219/12958
##            RichFactor FoldEnrichment    zScore       pvalue     p.adjust
## GO:0001525 0.10688836       4.146884 10.677068 3.058372e-16 5.508524e-13
## GO:0048514 0.09879032       3.832710 10.463476 3.154036e-16 5.508524e-13
## GO:0006935 0.10763889       4.176002  8.865754 1.438644e-11 1.256296e-08
## GO:0042330 0.10763889       4.176002  8.865754 1.438644e-11 1.256296e-08
## GO:0045765 0.12093023       4.691659  8.878342 5.138188e-11 3.589538e-08
## GO:1901342 0.11872146       4.605966  8.753929 7.787389e-11 4.533558e-08
##                  qvalue
## GO:0001525 4.438891e-13
## GO:0048514 4.438891e-13
## GO:0006935 1.012351e-08
## GO:0042330 1.012351e-08
## GO:0045765 2.892529e-08
## GO:1901342 3.653242e-08
##                                                                                                                                                                                                                                                                                               geneID
## GO:0001525                   SH2D2A/EDN1/MMP2/TGFB2/ITGB8/CAV1/SERPINE1/CCL2/SASH1/VEGFA/SEMA5A/FGF1/PDGFRB/F3/TNFAIP3/CCN2/NPPB/MMP19/C3/EDN2/CHI3L1/BTG1/LOXL2/ANXA1/KLF4/KLK3/CCN1/ACKR3/ADAM12/ADM/ITGA5/ANGPTL4/COL4A3/CXCL8/S1PR1/APLN/COL8A2/NDNF/CD34/ADGRB1/SEMA4A/COL27A1/AQP1/NOX5/PIK3R6
## GO:0048514 SH2D2A/PRDM1/EDN1/MMP2/TGFB2/ITGB8/CAV1/SERPINE1/CCL2/SASH1/VEGFA/SEMA5A/LOX/FGF1/PDGFRB/F3/TNFAIP3/CCN2/NPPB/MMP19/C3/EDN2/CHI3L1/BTG1/LOXL2/ANXA1/KLF4/KLK3/CCN1/ACKR3/ADAM12/ADM/XDH/ITGA5/ANGPTL4/COL4A3/CXCL8/S1PR1/APLN/COL8A2/NDNF/CD34/ADGRB1/NOG/SEMA4A/COL27A1/AQP1/NOX5/PIK3R6
## GO:0006935                                                                                                    CCL26/EDN1/MMP2/TGFB2/HSD3B7/SERPINE1/CCL2/TRPV4/NEDD9/VEGFA/SEMA5A/LOX/FGF1/PDGFRB/PADI2/F3/EDN2/LSP1/ANXA1/SEMA7A/CCN1/ACKR3/CCL28/NRG1/SLAMF8/ITGB2/CXCL8/S1PR1/SEMA4A/LPAR1/PLXNA4
## GO:0042330                                                                                                    CCL26/EDN1/MMP2/TGFB2/HSD3B7/SERPINE1/CCL2/TRPV4/NEDD9/VEGFA/SEMA5A/LOX/FGF1/PDGFRB/PADI2/F3/EDN2/LSP1/ANXA1/SEMA7A/CCN1/ACKR3/CCL28/NRG1/SLAMF8/ITGB2/CXCL8/S1PR1/SEMA4A/LPAR1/PLXNA4
## GO:0045765                                                                                                                                TGFB2/ITGB8/SERPINE1/SASH1/VEGFA/SEMA5A/FGF1/F3/TNFAIP3/NPPB/C3/CHI3L1/BTG1/KLF4/KLK3/ADAM12/ADM/ITGA5/ANGPTL4/COL4A3/CXCL8/CD34/ADGRB1/SEMA4A/AQP1/PIK3R6
## GO:1901342                                                                                                                                TGFB2/ITGB8/SERPINE1/SASH1/VEGFA/SEMA5A/FGF1/F3/TNFAIP3/NPPB/C3/CHI3L1/BTG1/KLF4/KLK3/ADAM12/ADM/ITGA5/ANGPTL4/COL4A3/CXCL8/CD34/ADGRB1/SEMA4A/AQP1/PIK3R6
##            Count
## GO:0001525    45
## GO:0048514    49
## GO:0006935    31
## GO:0042330    31
## GO:0045765    26
## GO:1901342    26

The GO enrichment analysis has now been completed and stored in ego. Let’s observe the results in the dot plot below.

dotplot(ego, showCategory=20)

#### 5.3.2 MSigDB database for enrichment analysis. We next explore enrichment analysis using the MSigDB database. MSigDB stores data from various species and gene categories. Let’s first retrieve gene sets from Homo sapiens and observe them.

# Load the msigdbr package
library(msigdbr)

# Retrieve gene sets for Homo sapiens
m_df <- msigdbr(species = "Homo sapiens")

# Them_df object now contains information about the Hallmark gene sets.
head(m_df)

## # A tibble: 6 × 15
##   gs_cat gs_subcat      gs_name        gene_symbol entrez_gene ensembl_gene   
##   <chr>  <chr>          <chr>          <chr>             <int> <chr>          
## 1 C3     MIR:MIR_Legacy AAACCAC_MIR140 ABCC4             10257 ENSG00000125257
## 2 C3     MIR:MIR_Legacy AAACCAC_MIR140 ABRAXAS2          23172 ENSG00000165660
## 3 C3     MIR:MIR_Legacy AAACCAC_MIR140 ACTN4                81 ENSG00000130402
## 4 C3     MIR:MIR_Legacy AAACCAC_MIR140 ACTN4                81 ENSG00000282844
## 5 C3     MIR:MIR_Legacy AAACCAC_MIR140 ACVR1                90 ENSG00000115170
## 6 C3     MIR:MIR_Legacy AAACCAC_MIR140 ADAM9              8754 ENSG00000168615
## # ℹ 9 more variables: human_gene_symbol <chr>, human_entrez_gene <int>,
## #   human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>,
## #   gs_exact_source <chr>, gs_url <chr>, gs_description <chr>

Now we have the gene sets of Homo sapiens, however, MsigDB contains 8 key categories:
1. H (Hallmark gene sets) : These are a set of 50 gene sets representing well-defined biological states or processes.
2. C1 (Positional gene sets) : Gene sets based on chromosomal location.
3. C2 (Curated gene sets) : Manually curated gene sets from various sources, including pathway databases.
4. C3 (Motif gene sets) : Gene sets related to transcription factor binding motifs.
5. C4 (Computational gene sets) : Gene sets generated through computational methods.
6. C5 (GO gene sets) : Gene sets based on Gene Ontology terms.
7. C6 (Oncogenic signatures) : Gene sets associated with cancer-related processes.
8. C7 (Immunologic signatures) : Gene sets related to immune system processes.

Here, we will set the species parameter to “Homo sapiens” and the category parameter to “H” to retrieve Hallmark gene sets.

m_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>% 
  dplyr::select(gs_name, entrez_gene)

We will investigate overview of the gene set from the Hallmark gene sets of Homo sapiens.

# Count the number of genes in each gene set
table(m_t2g$gs_name)

## 
##                      HALLMARK_ADIPOGENESIS 
##                                        210 
##               HALLMARK_ALLOGRAFT_REJECTION 
##                                        335 
##                 HALLMARK_ANDROGEN_RESPONSE 
##                                        102 
##                      HALLMARK_ANGIOGENESIS 
##                                         36 
##                   HALLMARK_APICAL_JUNCTION 
##                                        231 
##                    HALLMARK_APICAL_SURFACE 
##                                         46 
##                         HALLMARK_APOPTOSIS 
##                                        183 
##              HALLMARK_BILE_ACID_METABOLISM 
##                                        114 
##           HALLMARK_CHOLESTEROL_HOMEOSTASIS 
##                                         77 
##                       HALLMARK_COAGULATION 
##                                        162 
##                        HALLMARK_COMPLEMENT 
##                                        237 
##                        HALLMARK_DNA_REPAIR 
##                                        170 
##                       HALLMARK_E2F_TARGETS 
##                                        218 
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 
##                                        204 
##           HALLMARK_ESTROGEN_RESPONSE_EARLY 
##                                        216 
##            HALLMARK_ESTROGEN_RESPONSE_LATE 
##                                        218 
##             HALLMARK_FATTY_ACID_METABOLISM 
##                                        165 
##                    HALLMARK_G2M_CHECKPOINT 
##                                        204 
##                        HALLMARK_GLYCOLYSIS 
##                                        215 
##                HALLMARK_HEDGEHOG_SIGNALING 
##                                         36 
##                   HALLMARK_HEME_METABOLISM 
##                                        214 
##                           HALLMARK_HYPOXIA 
##                                        215 
##               HALLMARK_IL2_STAT5_SIGNALING 
##                                        216 
##           HALLMARK_IL6_JAK_STAT3_SIGNALING 
##                                        103 
##             HALLMARK_INFLAMMATORY_RESPONSE 
##                                        222 
##         HALLMARK_INTERFERON_ALPHA_RESPONSE 
##                                        140 
##         HALLMARK_INTERFERON_GAMMA_RESPONSE 
##                                        286 
##                 HALLMARK_KRAS_SIGNALING_DN 
##                                        220 
##                 HALLMARK_KRAS_SIGNALING_UP 
##                                        220 
##                   HALLMARK_MITOTIC_SPINDLE 
##                                        215 
##                  HALLMARK_MTORC1_SIGNALING 
##                                        211 
##                    HALLMARK_MYC_TARGETS_V1 
##                                        236 
##                    HALLMARK_MYC_TARGETS_V2 
##                                         60 
##                        HALLMARK_MYOGENESIS 
##                                        212 
##                   HALLMARK_NOTCH_SIGNALING 
##                                         34 
##         HALLMARK_OXIDATIVE_PHOSPHORYLATION 
##                                        220 
##                       HALLMARK_P53_PATHWAY 
##                                        215 
##               HALLMARK_PANCREAS_BETA_CELLS 
##                                         44 
##                        HALLMARK_PEROXISOME 
##                                        110 
##           HALLMARK_PI3K_AKT_MTOR_SIGNALING 
##                                        118 
##                 HALLMARK_PROTEIN_SECRETION 
##                                         98 
##   HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY 
##                                         58 
##                   HALLMARK_SPERMATOGENESIS 
##                                        144 
##                HALLMARK_TGF_BETA_SIGNALING 
##                                         59 
##           HALLMARK_TNFA_SIGNALING_VIA_NFKB 
##                                        228 
##         HALLMARK_UNFOLDED_PROTEIN_RESPONSE 
##                                        115 
##                    HALLMARK_UV_RESPONSE_DN 
##                                        152 
##                    HALLMARK_UV_RESPONSE_UP 
##                                        191 
##        HALLMARK_WNT_BETA_CATENIN_SIGNALING 
##                                         50 
##             HALLMARK_XENOBIOTIC_METABOLISM 
##                                        224

Next, let’s perform an enrichment analysis (ORA) using these sets.

em <- enricher(significant_genes_map$ENTREZID,
               TERM2GENE = m_t2g,
               universe  = background_genes_map$ENTREZID,
)

The enrichment analysis has been completed and stored in em. Let’s take a look at the analyzed data.

head(em)

##                                                                                    ID
## HALLMARK_HYPOXIA                                                     HALLMARK_HYPOXIA
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
## HALLMARK_TNFA_SIGNALING_VIA_NFKB                     HALLMARK_TNFA_SIGNALING_VIA_NFKB
## HALLMARK_KRAS_SIGNALING_DN                                 HALLMARK_KRAS_SIGNALING_DN
## HALLMARK_KRAS_SIGNALING_UP                                 HALLMARK_KRAS_SIGNALING_UP
## HALLMARK_INFLAMMATORY_RESPONSE                         HALLMARK_INFLAMMATORY_RESPONSE
##                                                                           Description
## HALLMARK_HYPOXIA                                                     HALLMARK_HYPOXIA
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
## HALLMARK_TNFA_SIGNALING_VIA_NFKB                     HALLMARK_TNFA_SIGNALING_VIA_NFKB
## HALLMARK_KRAS_SIGNALING_DN                                 HALLMARK_KRAS_SIGNALING_DN
## HALLMARK_KRAS_SIGNALING_UP                                 HALLMARK_KRAS_SIGNALING_UP
## HALLMARK_INFLAMMATORY_RESPONSE                         HALLMARK_INFLAMMATORY_RESPONSE
##                                            GeneRatio  BgRatio RichFactor
## HALLMARK_HYPOXIA                              29/137 181/3729  0.1602210
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION    25/137 157/3729  0.1592357
## HALLMARK_TNFA_SIGNALING_VIA_NFKB              19/137 180/3729  0.1055556
## HALLMARK_KRAS_SIGNALING_DN                    14/137 123/3729  0.1138211
## HALLMARK_KRAS_SIGNALING_UP                    16/137 156/3729  0.1025641
## HALLMARK_INFLAMMATORY_RESPONSE                15/137 145/3729  0.1034483
##                                            FoldEnrichment   zScore       pvalue
## HALLMARK_HYPOXIA                                 4.361052 9.052171 4.196736e-12
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION       4.334232 8.335277 1.838472e-10
## HALLMARK_TNFA_SIGNALING_VIA_NFKB                 2.873114 5.030112 2.253037e-05
## HALLMARK_KRAS_SIGNALING_DN                       3.098095 4.620563 1.305093e-04
## HALLMARK_KRAS_SIGNALING_UP                       2.791690 4.464150 1.491772e-04
## HALLMARK_INFLAMMATORY_RESPONSE                   2.815756 4.354995 2.210038e-04
##                                                p.adjust       qvalue
## HALLMARK_HYPOXIA                           1.804597e-10 1.369461e-10
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 3.952714e-09 2.999612e-09
## HALLMARK_TNFA_SIGNALING_VIA_NFKB           3.229353e-04 2.450671e-04
## HALLMARK_KRAS_SIGNALING_DN                 1.282924e-03 9.735774e-04
## HALLMARK_KRAS_SIGNALING_UP                 1.282924e-03 9.735774e-04
## HALLMARK_INFLAMMATORY_RESPONSE             1.583861e-03 1.201951e-03
##                                                                                                                                                                                       geneID
## HALLMARK_HYPOXIA                           10397/665/857/5054/7422/4015/8614/54206/2152/7128/1490/3623/1907/1289/694/8553/901/3491/8497/57007/3486/3484/133/6781/51129/54541/5209/3669/55076
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION                    9235/6591/50509/4313/7078/5054/1277/7422/4015/5159/7128/1490/3624/1289/4148/4017/1009/3491/3486/8038/6876/3678/3576/1296/50863
## HALLMARK_TNFA_SIGNALING_VIA_NFKB                                                                  330/1906/5054/6347/7422/604/2152/7128/6446/3624/694/8553/9314/3491/57007/1437/19/5209/3914
## HALLMARK_KRAS_SIGNALING_DN                                                                                       1906/7042/778/80303/6446/1907/56154/7068/7032/3699/3851/3866/3860/105375355
## HALLMARK_KRAS_SIGNALING_UP                                                                                  330/639/28951/5243/3587/7139/7128/3624/2162/9314/3486/3561/80201/3689/1437/51129
## HALLMARK_INFLAMMATORY_RESPONSE                                                                                     1906/10148/3696/5054/6347/3587/2152/3624/23533/133/3678/19/3576/1902/4049
##                                            Count
## HALLMARK_HYPOXIA                              29
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION    25
## HALLMARK_TNFA_SIGNALING_VIA_NFKB              19
## HALLMARK_KRAS_SIGNALING_DN                    14
## HALLMARK_KRAS_SIGNALING_UP                    16
## HALLMARK_INFLAMMATORY_RESPONSE                15

We will visualize the data using a dotplot.

dotplot(em, showCategory = 20)

## 6. Gene Set Enrichment Analysis (GSEA) Gene Set Enrichment Analysis (GSEA) is an alternative approach that considers the entire list of genes ranked by their association with a phenotype or condition. Rather than merely identifying over-represented pathways, GSEA evaluates whether a predefined gene set (e.g., a pathway) demonstrates a significant association with the phenotype throughout the entire ranked gene list.

6.1 Accessing Gene Sets:

We need to define the gene sets for the GSEA analysis. In this case, we will use MSigDB, which we retrieved from the Hallmark gene sets of Homo sapiens in the previous step.

table(m_t2g$gs_name)

## 
##                      HALLMARK_ADIPOGENESIS 
##                                        210 
##               HALLMARK_ALLOGRAFT_REJECTION 
##                                        335 
##                 HALLMARK_ANDROGEN_RESPONSE 
##                                        102 
##                      HALLMARK_ANGIOGENESIS 
##                                         36 
##                   HALLMARK_APICAL_JUNCTION 
##                                        231 
##                    HALLMARK_APICAL_SURFACE 
##                                         46 
##                         HALLMARK_APOPTOSIS 
##                                        183 
##              HALLMARK_BILE_ACID_METABOLISM 
##                                        114 
##           HALLMARK_CHOLESTEROL_HOMEOSTASIS 
##                                         77 
##                       HALLMARK_COAGULATION 
##                                        162 
##                        HALLMARK_COMPLEMENT 
##                                        237 
##                        HALLMARK_DNA_REPAIR 
##                                        170 
##                       HALLMARK_E2F_TARGETS 
##                                        218 
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 
##                                        204 
##           HALLMARK_ESTROGEN_RESPONSE_EARLY 
##                                        216 
##            HALLMARK_ESTROGEN_RESPONSE_LATE 
##                                        218 
##             HALLMARK_FATTY_ACID_METABOLISM 
##                                        165 
##                    HALLMARK_G2M_CHECKPOINT 
##                                        204 
##                        HALLMARK_GLYCOLYSIS 
##                                        215 
##                HALLMARK_HEDGEHOG_SIGNALING 
##                                         36 
##                   HALLMARK_HEME_METABOLISM 
##                                        214 
##                           HALLMARK_HYPOXIA 
##                                        215 
##               HALLMARK_IL2_STAT5_SIGNALING 
##                                        216 
##           HALLMARK_IL6_JAK_STAT3_SIGNALING 
##                                        103 
##             HALLMARK_INFLAMMATORY_RESPONSE 
##                                        222 
##         HALLMARK_INTERFERON_ALPHA_RESPONSE 
##                                        140 
##         HALLMARK_INTERFERON_GAMMA_RESPONSE 
##                                        286 
##                 HALLMARK_KRAS_SIGNALING_DN 
##                                        220 
##                 HALLMARK_KRAS_SIGNALING_UP 
##                                        220 
##                   HALLMARK_MITOTIC_SPINDLE 
##                                        215 
##                  HALLMARK_MTORC1_SIGNALING 
##                                        211 
##                    HALLMARK_MYC_TARGETS_V1 
##                                        236 
##                    HALLMARK_MYC_TARGETS_V2 
##                                         60 
##                        HALLMARK_MYOGENESIS 
##                                        212 
##                   HALLMARK_NOTCH_SIGNALING 
##                                         34 
##         HALLMARK_OXIDATIVE_PHOSPHORYLATION 
##                                        220 
##                       HALLMARK_P53_PATHWAY 
##                                        215 
##               HALLMARK_PANCREAS_BETA_CELLS 
##                                         44 
##                        HALLMARK_PEROXISOME 
##                                        110 
##           HALLMARK_PI3K_AKT_MTOR_SIGNALING 
##                                        118 
##                 HALLMARK_PROTEIN_SECRETION 
##                                         98 
##   HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY 
##                                         58 
##                   HALLMARK_SPERMATOGENESIS 
##                                        144 
##                HALLMARK_TGF_BETA_SIGNALING 
##                                         59 
##           HALLMARK_TNFA_SIGNALING_VIA_NFKB 
##                                        228 
##         HALLMARK_UNFOLDED_PROTEIN_RESPONSE 
##                                        115 
##                    HALLMARK_UV_RESPONSE_DN 
##                                        152 
##                    HALLMARK_UV_RESPONSE_UP 
##                                        191 
##        HALLMARK_WNT_BETA_CATENIN_SIGNALING 
##                                         50 
##             HALLMARK_XENOBIOTIC_METABOLISM 
##                                        224

6.2 Input data for GSEA

Input for GSEA is different from ORA. Instead of selecting only genes of interest, GSEA utilizes all differential expression data, including fold change and p-value from all genes.

We will use the results from the previous step (res) as input. Therefore, we need to filter the genes and change the data type to a data frame using the following code, then use res_df for further processing.

res_df <- res %>% 
  as.data.frame() %>% 
  filter(baseMean != 0) %>%
  tibble::rownames_to_column(var = "ENSEMBL")

GSEA requires the data to be pre-ranked based on a metric that combines fold change and significance, such as the signed fold change multiplied by the negative logarithm of the p-value.

Additionally, we include the ENTREZID in the data frame.

res_df <- res_df %>% 
  mutate(signed_rank_stats = sign(log2FoldChange) * -log10(pvalue)) %>%
  left_join(background_genes_map, by = c("ENSEMBL" = "ENSEMBL")) %>%
  arrange(desc(signed_rank_stats))

One issue that may arise is the presence of infinite values in the data, which can cause errors during GSEA.
To address this, we replace infinite values with large numbers so that they are ranked high:

res_df <- res_df %>%
  mutate(negative_log10pvalue = -log10(pvalue)) %>%
  mutate(negative_log10pvalue = ifelse(is.infinite(negative_log10pvalue), 1000, negative_log10pvalue)) %>%
  mutate(signed_rank_stats = sign(log2FoldChange) * negative_log10pvalue)

This is the final step of input preparation for GSEA. We will prepare gene_list that contains signed_rank_stats and gene identification (ENTREZID)

# Assign `signed_rank_stats` as `gene_list` 
gene_list <- res_df$signed_rank_stats

# Name the `gene_list` by ENTREZID
names(gene_list) <- res_df$ENTREZID

6.3 GSEA Analysis

We perform fast GSEA using our pre-ranked gene list (gene_list) and a reference gene set, represented by m_t2g. The analyzed data is stored in em2.

em2 <- GSEA(gene_list, TERM2GENE = m_t2g)

head(em2)

##                                                                                    ID
## HALLMARK_E2F_TARGETS                                             HALLMARK_E2F_TARGETS
## HALLMARK_G2M_CHECKPOINT                                       HALLMARK_G2M_CHECKPOINT
## HALLMARK_HYPOXIA                                                     HALLMARK_HYPOXIA
## HALLMARK_MYC_TARGETS_V2                                       HALLMARK_MYC_TARGETS_V2
## HALLMARK_TNFA_SIGNALING_VIA_NFKB                     HALLMARK_TNFA_SIGNALING_VIA_NFKB
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
##                                                                           Description
## HALLMARK_E2F_TARGETS                                             HALLMARK_E2F_TARGETS
## HALLMARK_G2M_CHECKPOINT                                       HALLMARK_G2M_CHECKPOINT
## HALLMARK_HYPOXIA                                                     HALLMARK_HYPOXIA
## HALLMARK_MYC_TARGETS_V2                                       HALLMARK_MYC_TARGETS_V2
## HALLMARK_TNFA_SIGNALING_VIA_NFKB                     HALLMARK_TNFA_SIGNALING_VIA_NFKB
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
##                                            setSize enrichmentScore       NES
## HALLMARK_E2F_TARGETS                           200      -0.7906095 -2.211227
## HALLMARK_G2M_CHECKPOINT                        199      -0.7678108 -2.148832
## HALLMARK_HYPOXIA                               181       0.9360678  2.000318
## HALLMARK_MYC_TARGETS_V2                         58      -0.8337025 -1.964085
## HALLMARK_TNFA_SIGNALING_VIA_NFKB               180       0.8704395  1.860062
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION     157       0.8591668  1.826105
##                                                  pvalue     p.adjust
## HALLMARK_E2F_TARGETS                       1.000000e-10 1.666667e-09
## HALLMARK_G2M_CHECKPOINT                    1.000000e-10 1.666667e-09
## HALLMARK_HYPOXIA                           1.000000e-10 1.666667e-09
## HALLMARK_MYC_TARGETS_V2                    6.255554e-07 7.819443e-06
## HALLMARK_TNFA_SIGNALING_VIA_NFKB           9.855909e-07 9.855909e-06
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 5.688140e-06 4.740116e-05
##                                                  qvalue rank
## HALLMARK_E2F_TARGETS                       8.771930e-10 1775
## HALLMARK_G2M_CHECKPOINT                    8.771930e-10 1717
## HALLMARK_HYPOXIA                           8.771930e-10  615
## HALLMARK_MYC_TARGETS_V2                    4.115496e-06 1247
## HALLMARK_TNFA_SIGNALING_VIA_NFKB           5.187321e-06 1129
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 2.494798e-05  790
##                                                              leading_edge
## HALLMARK_E2F_TARGETS                       tags=58%, list=12%, signal=52%
## HALLMARK_G2M_CHECKPOINT                    tags=50%, list=12%, signal=45%
## HALLMARK_HYPOXIA                            tags=45%, list=4%, signal=44%
## HALLMARK_MYC_TARGETS_V2                     tags=78%, list=9%, signal=71%
## HALLMARK_TNFA_SIGNALING_VIA_NFKB            tags=38%, list=8%, signal=35%
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION  tags=34%, list=5%, signal=32%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   core_enrichment
## HALLMARK_E2F_TARGETS                       3978/253714/1633/10248/6839/4605/9972/10549/57405/9133/3148/10733/10535/55635/9125/3184/2146/3930/9833/10714/23594/10212/79019/29089/9319/10615/84296/84844/5982/4999/24137/8726/23649/3161/672/4288/27338/5889/3015/10797/9787/3835/5558/1163/11340/57122/983/9126/1965/29028/7514/8914/9212/9700/51155/3838/5424/3609/4085/701/4678/4001/55646/1111/5631/7153/7112/9918/6790/29127/54962/6628/332/1854/204/10051/3070/5591/11051/8243/79075/203068/5983/6427/2547/4436/991/1503/23165/5347/993/7374/9238/10492/6749/4171/1164/10527/4174/79077/1019/4830/5902/1786/6426/3159/4172/4175/10528/4176/5425/4173/5901/5111/1434/5036/9221
## HALLMARK_G2M_CHECKPOINT                                                                                                                       6839/10403/4605/8899/9133/3980/10733/3184/2146/6713/1869/3930/1810/23594/10212/3832/1874/10592/8454/24137/23649/5933/8438/6599/9585/3161/4288/27338/7027/3015/10036/3835/22809/890/5558/1163/699/7290/983/5001/262/55872/7514/990/22974/9212/9184/9700/8318/51155/3838/3609/10926/7150/4085/51776/4678/4001/1111/10419/7153/7112/6790/29127/7272/3151/10432/332/9156/10051/7371/10146/8243/4953/7090/6427/3837/51659/991/3192/5347/993/6421/10772/10492/4171/1164/4174/6632/1019/6426/3159/4172/4175/8140/3312/85236/1736/4691/9221
## HALLMARK_HYPOXIA                                                                                                                                                                                                                            6515/2355/5230/10397/665/8614/54206/8553/3939/3486/54541/133/6513/857/3491/7045/5209/8497/7052/8974/4627/205/6385/7422/1316/5054/3725/55818/3099/7167/694/901/5155/2026/302/6533/5163/2113/2632/55076/1907/1490/5033/55139/23645/284119/6095/2023/112464/10370/3098/8660/2152/1649/5211/10957/3309/5214/8870/7852/30001/4783/5236/23764/2597/226/1026/26118/5329/4601/26136/467/51129/5066/3669/54800/23327/4015/7436/7128/2131/26355
## HALLMARK_MYC_TARGETS_V2                                                                                                                                                                                                                                                                                                                                                                                                    23223/79050/705/1844/64216/29078/80324/23481/6573/8886/56915/92856/26354/9136/51388/27340/2193/51491/23082/7965/27346/83743/5347/7374/10244/9238/79711/51154/4174/79077/1019/23160/10196/6723/10514/6652/5245/10528/4173/6949/3336/5036/4869/9221/3329
## HALLMARK_TNFA_SIGNALING_VIA_NFKB                                                                                                                                                                                                                                                                                          6515/2355/8553/3914/3491/5209/6385/7422/1316/5054/3725/694/10318/22822/9314/23645/4616/8660/2152/10957/56937/10769/4082/1827/8870/6446/4783/23764/7050/329/3383/9120/23135/1026/19/2114/5329/960/10725/467/330/8061/8744/4084/7128/10938/604/4791/25976/2354/7538/1847/6303/23258/4088/3572/3976/5606/10135/687/3280/9516/1437/3624/4792/1846/1051/1906
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION                                                                                                                                                                                                                                                                                                                                                            4017/3486/3678/7078/3491/7045/23705/7052/6385/7422/5054/3725/2026/1307/1490/1293/3693/4616/7057/56937/3909/4035/3956/4487/50863/4323/3918/2335/1294/5329/3688/960/3685/5270/1277/4907/8572/5768/4015/7474/2014/22943/7128/4313/822/7168/800/30008/7431/10398/6876/2316/6303

6.4 Visualization

6.4.1 Dot plot for overview of the GSEA.

dotplot(em2, showCategory=20)

6.4.2 GSEA plot for each defined gene set.

Visualization for the GSEA of HALLMARK_P53_PATHWAY

p1 <- gseaplot(em2, geneSetID = "HALLMARK_P53_PATHWAY", 
              by = "runningScore", title = "HALLMARK_P53_PATHWAY")

p1

Visualization for the GSEA of HALLMARK_HYPOXIA

p2 <- gseaplot(em2, geneSetID = "HALLMARK_HYPOXIA", 
               by = "runningScore", title = "HALLMARK_HYPOXIA")

p2