Transcriptomics: From GEO Count Matrix to Differential Expression Analysis (DESeq2) and Functional Analysis

Overview of this analysis
1. Project organization
2. Get the count matrix and manage the data
3. Differential gene expression analysis by DESeq2
4. Observe the DESeq2 results by visualization
- 4.1 Volcano plot
5. Over-Representation Analysis (ORA)

Overview of this analysis

For the downstream analysis of transcriptomics, we will run all code in RStudio. The gene counts will serve as the input file, and DESeq2 will be used for differential expression analysis. After identifying differentially expressed genes, we will utilize various tools and visualizations.

1. Project organization

Let’s start with project organization. I organized the folders in this project to ensure reproducibility.

I created a set of routine folders, which include data, results, and scripts.

Furthermore, I run all R scripts using the R server in Docker and have also set up the R project and Git version control. For more details, please see My Computer Workbench.

2. Get the count matrix and manage the data

2.1 Download the data from GEO

We will use the count matrix from a published data GSE197576.

library(dplyr)
library(readr)
library(here)

# Define location of the file
raw_counts_file <- here("data/GEO", "GSE197576_raw_gene_counts_matrix.tsv.gz")

# Import the file
raw_counts <- read_tsv(raw_counts_file)

# Observe head of the data
head(raw_counts)

## # A tibble: 6 × 13
##   gene        `01_SW_sgCTRL_Norm` `02_SW_sgCTRL_Norm` `03_SW_sgITPR3_1_Norm`
##   <chr>                     <dbl>               <dbl>                  <dbl>
## 1 DDX11L1                       0                   0                      0
## 2 WASH7P                       18                  11                     28
## 3 MIR6859-1                     5                   1                      6
## 4 MIR1302-2HG                   0                   0                      0
## 5 MIR1302-2                     0                   0                      0
## 6 FAM138A                       0                   0                      0
## # ℹ 9 more variables: `04_SW_sgITPR3_1_Norm` <dbl>,
## #   `07_SW_sgRELB_3_Norm` <dbl>, `08_SW_sgRELB_3_Norm` <dbl>,
## #   `11_SW_sgCTRL_Hyp` <dbl>, `12_SW_sgCTRL_Hyp` <dbl>,
## #   `13_SW_sgITPR3_1_Hyp` <dbl>, `14_SW_sgITPR3_1_Hyp` <dbl>,
## #   `17_SW_sgRELB_3_Hyp` <dbl>, `18_SW_sgRELB_3_Hyp` <dbl>

2.2 Subset the data: Include only control of normoxic and hypoxic data

Next, let’s narrow down our data to the specific samples needed for comparison.

We will compare hypoxic and normoxic conditions, so we will include only sgCTRL samples from both conditions. To accomplish this, we will use a defined logical vector to subset the columns.

# Define the data we want by this logical vector
columns_to_select <- colnames(raw_counts) %>%
  stringr::str_detect("sgCTRL|gene")

columns_to_select

##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [13] FALSE

# Use this logical vector to subset the data frame
raw_counts <- raw_counts[, columns_to_select]

head(raw_counts)

## # A tibble: 6 × 5
##   gene        `01_SW_sgCTRL_Norm` `02_SW_sgCTRL_Norm` `11_SW_sgCTRL_Hyp`
##   <chr>                     <dbl>               <dbl>              <dbl>
## 1 DDX11L1                       0                   0                  0
## 2 WASH7P                       18                  11                 23
## 3 MIR6859-1                     5                   1                  8
## 4 MIR1302-2HG                   0                   0                  1
## 5 MIR1302-2                     0                   0                  0
## 6 FAM138A                       0                   0                  0
## # ℹ 1 more variable: `12_SW_sgCTRL_Hyp` <dbl>

2.3 Change the data type to matrix

Currently, the gene counts in raw_counts are in character format. We will convert the raw_counts to a matrix, as the gene counts should be treated as numeric values.

# Change only count values to matrix
raw_counts_mat <- raw_counts[, -1] %>% 
  as.matrix() 

# Add rownames
rownames(raw_counts_mat) <- raw_counts$gene 

head(raw_counts_mat)

##             01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 11_SW_sgCTRL_Hyp
## DDX11L1                     0                 0                0
## WASH7P                     18                11               23
## MIR6859-1                   5                 1                8
## MIR1302-2HG                 0                 0                1
## MIR1302-2                   0                 0                0
## FAM138A                     0                 0                0
##             12_SW_sgCTRL_Hyp
## DDX11L1                    0
## WASH7P                    45
## MIR6859-1                 12
## MIR1302-2HG                0
## MIR1302-2                  0
## FAM138A                    0

2.4 Subset the data: only the protein-coding genes

In most cases of RNA-seq analysis, we subset only protein-coding genes for several reasons, including primary biological function, higher expression levels compared to non-coding genes, accurate gene annotation, reduction of noise, simplified data interpretation, and targeted research goals (specific pathways or cellular processes).

library(org.Hs.eg.db)

# list all keytypes from `org.Hs.eg.db` to see what is gene type was named. 
columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"

We will retrieve the GENETYPE of raw_counts (the same data as raw_counts_mat) using SYMBOL. Here, we will create a map that matches SYMBOL with our data (raw_counts and raw_counts_mat).

map <- AnnotationDbi::select(org.Hs.eg.db, 
                            keys = raw_counts$gene , # column that contain key (we want to convert this col.)
                            keytype = "SYMBOL", # type of key in the column,
                            columns = "GENETYPE") # What we want to retrieve

head(map)

##        SYMBOL GENETYPE
## 1     DDX11L1   pseudo
## 2      WASH7P   pseudo
## 3   MIR6859-1    ncRNA
## 4 MIR1302-2HG    ncRNA
## 5   MIR1302-2    ncRNA
## 6     FAM138A    ncRNA

Observe the types of gene.

unique(map$GENETYPE)

##  [1] "pseudo"         "ncRNA"          "protein-coding" NA              
##  [5] "snoRNA"         "snRNA"          "tRNA"           "other"         
##  [9] "rRNA"           "unknown"        "scRNA"

We will select only protein-coding genes by defining them in map_CodingGene. Then, we will use map_CodingGene to subset our data.

map_CodingGene <- map %>% 
  dplyr::filter(GENETYPE == "protein-coding") 

head(map_CodingGene)

##         SYMBOL       GENETYPE
## 1        OR4F5 protein-coding
## 2 LOC112268260 protein-coding
## 3       OR4F29 protein-coding
## 4 LOC105378947 protein-coding
## 5       OR4F16 protein-coding
## 6       SAMD11 protein-coding

Show number of protein-codinggenes.

nrow(map_CodingGene)

## [1] 19377

From raw_count_mat, we will subset only the common genes, which are the intersecting genes between raw_count_mat and map_CodingGene (protein-coding genes).

# Define the `common_genes` = genes that are in both `raw_count_mat` and `map_CodingGene`
common_genes <- intersect(rownames(raw_counts_mat), map_CodingGene$SYMBOL)

# subset the common genes and re-order them by `coding_genes`
raw_counts_mat <- raw_counts_mat[common_genes, ]

head(raw_counts_mat)

##              01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 11_SW_sgCTRL_Hyp
## OR4F5                        0                 0                0
## LOC112268260                 0                 0                0
## OR4F29                       0                 0                0
## LOC105378947                 0                 0                0
## OR4F16                       1                 0                0
## SAMD11                       2                 1                2
##              12_SW_sgCTRL_Hyp
## OR4F5                       0
## LOC112268260                0
## OR4F29                      0
## LOC105378947                0
## OR4F16                      3
## SAMD11                      5

3. Differential gene expression analysis by `DESeq2`

3.1 Prepare the files for `DESeq2` analysis: Prepare `ColData`

The DESeq2 analysis needs 2 input files which are countData (which we have as raw_counts_mat) and colData (which we will prepare).

# Define condition of each column
coldata <- data.frame(condition = c("normoxia", "normoxia", "hypoxia", "hypoxia"))
rownames(coldata) <- colnames(raw_counts_mat)

# Recheck whether the name is match with the defined condition
coldata

##                   condition
## 01_SW_sgCTRL_Norm  normoxia
## 02_SW_sgCTRL_Norm  normoxia
## 11_SW_sgCTRL_Hyp    hypoxia
## 12_SW_sgCTRL_Hyp    hypoxia

3.2 Differential gene expression analysis by `DESeq2`

This starts by creating a DESeq2 object (dds) using raw_counts_mat and coldata as inputs. This object will use the condition as the main factor.

library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = raw_counts_mat,
                              colData = coldata,
                              design = ~ condition)

dds <- DESeq(dds)
dds

## class: DESeqDataSet 
## dim: 19377 4 
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(19377): OR4F5 LOC112268260 ... ND6 CYTB
## rowData names(22): baseMean baseVar ... deviance maxCooks
## colnames(4): 01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 11_SW_sgCTRL_Hyp
##   12_SW_sgCTRL_Hyp
## colData names(2): condition sizeFactor

To obtain differential gene expression, the two conditions (normoxia and hypoxia) will be compared.

res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))

res

## log2 fold change (MLE): condition hypoxia vs normoxia 
## Wald test p-value: condition hypoxia vs normoxia 
## DataFrame with 19377 rows and 6 columns
##               baseMean log2FoldChange     lfcSE      stat      pvalue
##              <numeric>      <numeric> <numeric> <numeric>   <numeric>
## OR4F5         0.000000             NA        NA        NA          NA
## LOC112268260  0.000000             NA        NA        NA          NA
## OR4F29        0.000000             NA        NA        NA          NA
## LOC105378947  0.000000             NA        NA        NA          NA
## OR4F16        0.931111        1.37997   2.94378  0.468772    0.639232
## ...                ...            ...       ...       ...         ...
## ND4L           20344.0     -0.2356148 0.1794782  -1.31278 1.89258e-01
## ND4           119899.1      0.0678208 0.0433314   1.56517 1.17544e-01
## ND5            44322.2      0.5803516 0.0943872   6.14863 7.81559e-10
## ND6            11418.1      0.9891536 0.1098203   9.00702 2.11733e-19
## CYTB           55810.9      0.0738320 0.0512175   1.44154 1.49433e-01
##                     padj
##                <numeric>
## OR4F5                 NA
## LOC112268260          NA
## OR4F29                NA
## LOC105378947          NA
## OR4F16                NA
## ...                  ...
## ND4L         2.87405e-01
## ND4          1.92725e-01
## ND5          4.49348e-09
## ND6          2.44271e-18
## CYTB         2.36020e-01

3.3 Observe the analysed data: P-value as a histogram

Firstly, we will examine the distribution of the p-value using a histogram.

library(ggplot2)

res %>% as.data.frame() %>%
  arrange(padj) %>%
  ggplot(aes(x=pvalue)) +
  geom_histogram(color = "white", bins = 50)

According to Tommy recommendation, genes with small counts can distort the p-values, so we will compare the counts between those with low and high p-values using the box plot below.

res %>% as.data.frame() %>%
        tibble::rownames_to_column(var = "gene_id") %>%
        filter(!is.na(pvalue)) %>%
        mutate(pvalue_bin = if_else(pvalue > 0.75, "high", "low")) %>%
        ggplot(aes(x= pvalue_bin, y = log2(baseMean))) +
        geom_boxplot()

As the box plot shows, genes with high p-values have lower gene expression. Therefore, we will remove the genes with low counts.

# Counts across 4 samples lower than 10 will be removed (This number can be changes.)
dds <- dds[rowSums(counts(dds)) > 10,]

# Recalculate differential gene expression
dds <- DESeq(dds)

res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))

# Re-observe the distribution of p-value
res %>% as.data.frame() %>%
  arrange(padj) %>%
  ggplot(aes(x=pvalue)) +
  geom_histogram(color = "white", bins = 50)

Now, the distribution of p-value looks better.

4. Observe the DESeq2 results by visualization

4.1 Volcano plot

The volcano plot will label the most differentially expressed genes as either upregulated or downregulated.

4.1.1 Define the labeled genes (high change value & significant)

We will first define genes that we will label as genes_to_label, which includes only genes with a log2 fold change greater than 2.5, an adjusted p-value lower than 0.001, and names that do not start with LOC(indicating genes for which a published symbol is not available).

genes_to_label <- res %>%
  as.data.frame() %>%
  tibble::rownames_to_column(var = "gene") %>%
  filter(!stringr::str_detect(gene, "LOC"),
         abs(log2FoldChange) >= 2.5,
         padj <= 0.001)

head(genes_to_label)

##     gene   baseMean log2FoldChange      lfcSE      stat       pvalue
## 1 ERRFI1 4332.40999       2.786041 0.06805398 40.938694 0.000000e+00
## 2   NPPB   25.37084       4.487259 0.85034921  5.276961 1.313439e-07
## 3   CTRC   13.75187       4.639940 1.19560510  3.880830 1.041005e-04
## 4  PADI2   47.15242       2.865481 0.45345365  6.319236 2.628596e-10
## 5   GJB4   88.93894       2.810783 0.32840632  8.558858 1.139914e-17
## 6   EDN2  368.07113       3.596625 0.18740536 19.191685 4.343258e-82
##           padj
## 1 0.000000e+00
## 2 5.814759e-07
## 3 3.126940e-04
## 4 1.503784e-09
## 5 1.142190e-16
## 6 3.771674e-80

4.1.2 Define the coloring genes (Significant genes)

Moreover, we want to label the significantly different genes by color. Therefore, the significant (sig) and not significant (not_sig) genes will be defined by the code below for labeling purposes.

res_label <- res %>%
  as.data.frame() %>%
  tibble::rownames_to_column(var = "gene") %>%
  mutate(sig = case_when(
    !stringr::str_detect(gene, "LOC") &
      abs(log2FoldChange) >= 2.5 &
      padj <= 0.001 ~ "sig",
    TRUE ~ "not_sig"
  ))

head(res_label)

##      gene   baseMean log2FoldChange      lfcSE        stat       pvalue
## 1   NOC2L 4231.31471    -0.87086996 0.05503194 -15.8248098 2.098321e-56
## 2  KLHL17  588.20882    -0.11328977 0.11189537  -1.0124617 3.113173e-01
## 3 PLEKHN1  229.41180     0.04712997 0.17347083   0.2716882 7.858618e-01
## 4   PERM1   10.50349     2.47576605 0.90550077   2.7341402 6.254341e-03
## 5    HES4  209.38352     0.31409405 0.18035415   1.7415405 8.158888e-02
## 6   ISG15  581.41884    -0.69162460 0.12429286  -5.5644758 2.629417e-08
##           padj     sig
## 1 9.939147e-55 not_sig
## 2 4.176487e-01 not_sig
## 3 8.447251e-01 not_sig
## 4 1.368560e-02 not_sig
## 5 1.362866e-01 not_sig
## 6 1.254250e-07 not_sig

4.1.3 Plot a volcano

Now, we will create the volcano plot using the ggplot2 library and use ggrepel to prevent label overlap. Additionally, we will add horizontal and vertical lines to mark the highly significant levels based on p-values and fold change, respectively.

library(ggplot2)
library(ggrepel)

ggplot(res_label, aes(x = log2FoldChange, y = -log10(pvalue))) +
  geom_point(aes(color = sig)) +
  scale_color_manual(values = c("blue", "red")) +
  ggrepel::geom_label_repel(data = genes_to_label, aes(label = gene))+
  geom_hline(yintercept = 100, linetype = 2, color = "red") +
  geom_vline(xintercept = c(-2.5, 2.5), linetype = 2, color = "red")+
  theme_bw(base_size = 14)+
  labs(title = "Hypoxia vs Normoxia")

### 4.2 Principal Component Analysis (PCA) In genomics, PCA is often used to visualize and explore variations in gene expression data across different samples or conditions. Here, we expect to see the same samples, but different replicates clustering together.

4.2.1 Normalize the data

First, we need to normalize the data. This normalized data (normalized_counts) will be used as input for both PCA and the heatmap (in the next section).

# Calculate the normalized data for PCA plot
vsd <- vst(dds, blind=FALSE)

normalized_counts <- assay(vsd) %>% as.matrix()

# Observe the normalized data
head(normalized_counts)

##         01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 11_SW_sgCTRL_Hyp 12_SW_sgCTRL_Hyp
## NOC2L            12.98927          12.95152         12.41364         12.40060
## KLHL17           11.39941          11.37610         11.34033         11.36613
## PLEKHN1          11.01355          11.04002         11.04961         11.02156
## PERM1            10.52707          10.51164         10.61583         10.61238
## HES4             10.97748          10.97282         11.02024         11.04890
## ISG15            11.46388          11.46339         11.20769         11.29751

4.2.2 Calculate the priciple component

# Calculate the principal components
pca_prcomp <- prcomp(t(normalized_counts), 
                     center = TRUE, 
                     scale. = FALSE)

# This is the PC values  
pca_prcomp$x

##                         PC1       PC2        PC3          PC4
## 01_SW_sgCTRL_Norm -12.22530  1.088312 -1.6224222 6.218420e-14
## 02_SW_sgCTRL_Norm -12.16220 -1.098779  1.6235839 6.251640e-14
## 11_SW_sgCTRL_Hyp   12.23783 -2.107295 -0.8407274 6.549535e-14
## 12_SW_sgCTRL_Hyp   12.14966  2.117762  0.8395656 6.350693e-14

We will next use the data from pca_prcomp$x and prepare it for PCA plotting.

# Create a DataFrame for visualization
PC1_and_PC2 <- data.frame(PC1 = pca_prcomp$x[, 1], 
                          PC2 = pca_prcomp$x[, 2], 
                          type = rownames(pca_prcomp$x))

PC1_and_PC2

##                         PC1       PC2              type
## 01_SW_sgCTRL_Norm -12.22530  1.088312 01_SW_sgCTRL_Norm
## 02_SW_sgCTRL_Norm -12.16220 -1.098779 02_SW_sgCTRL_Norm
## 11_SW_sgCTRL_Hyp   12.23783 -2.107295  11_SW_sgCTRL_Hyp
## 12_SW_sgCTRL_Hyp   12.14966  2.117762  12_SW_sgCTRL_Hyp

4.2.3 Generate the PCA plot

ggplot(PC1_and_PC2, aes(x = PC1, y = PC2, col = type)) + 
  geom_point() + 
  geom_text(aes(label = type), hjust = 0, vjust = 0) +
  coord_fixed()+
  theme_classic()

### 4.3 Heatmap Heatmaps are useful in genomics and other scientific fields where large datasets need to be analyzed. For heatmap plotting, only significantly different genes will be displayed, including both upregulated and downregulated genes.

4.3.1 Define significant genes

The significant genes are defined by an adjusted p-value of less than 0.01 and a log2 fold change greater than 2.

significant_genes <- res %>%
  as.data.frame() %>%
  filter(padj <= 0.01, 
         abs(log2FoldChange) >= 2) %>%
  rownames()

head(significant_genes, 10)

##  [1] "HES2"   "ERRFI1" "NPPB"   "CTRC"   "PADI2"  "CDA"    "CSMD2"  "GJB4"  
##  [9] "COL8A2" "POU3F1"

As you saw, significant_genes is simply a list of genes. We will now use this list to filter our normalized data (normalized_counts).

significant_mat <- normalized_counts[significant_genes, ]

4.3.2 Create coloumn annotation

We have four samples, with two replicates each for normoxic and hypoxic conditions. We will label the hypoxic and normoxic samples by coloring the heads of the heatmap columns.

library(ComplexHeatmap)

col_anno <- HeatmapAnnotation(
  df = coldata,
  col = list(condition = c("hypoxia" = "red", 
                           "normoxia" = "blue"))
)

4.3.3 Scale the heatmap input data

Before we plot the heatmap, we need to scale it by following steps.

We transform rows (genes) into columns, so that each column represents a gene.
The scaling will be performed independently for each column (this is the default behavior in R) by subtracting the mean of each row from its values and then dividing by the standard deviation of that row. After scaling, each row will have a mean of 0 and a standard deviation of 1.
We then transpose it back, so that the genes return to being in rows.

In conclusion, scaling is performed independently for each gene.

Please note that the scaled values will typically range from negative to positive, depending on how far the original values deviate from their mean. Therefore, the range is not fixed.

# Scaling the data
scaled_sig_mat <- t(scale(t(significant_mat)))

# Observe min and max values of the scaled data
range_scale <- c(min(scaled_sig_mat), max(scaled_sig_mat))
range_scale

## [1] -1.247098  1.363132

4.3.4 Heatmap plotting

Heatmap(
  scaled_sig_mat,
  top_annotation = col_anno,
  show_row_names = FALSE,
  name = "scaled normalized\nexpression"
)

5. Over-Representation Analysis (ORA)

Over-Representation Analysis (ORA) is a method that compares a list of genes of interest (e.g., DEGs) to a background set (all genes in the experiment) to identify pathways that are over-represented among the genes of interest.

In this case, we will use significant_genes, which have already been defined in the previous section as the genes of interest.

5.1 Observe input datas

5.1.1 Genes of interest: `significant_genes`

head(significant_genes)

## [1] "HES2"   "ERRFI1" "NPPB"   "CTRC"   "PADI2"  "CDA"

5.1.2 Background genes: the genes that are detected in the RNA-seq experiment.

background_genes <- res %>% 
  as.data.frame() %>% 
  filter(baseMean != 0) %>%
  tibble::rownames_to_column(var = "gene") %>%
  pull(gene)

5.2 Conversion of Gene Symbols to EntrezID

5.2.1 Genes of interest

library(org.Hs.eg.db)
library(clusterProfiler)

significant_genes_map <- clusterProfiler::bitr(geneID = significant_genes,
                      fromType = "SYMBOL", 
                      toType = "ENTREZID",
                      OrgDb = "org.Hs.eg.db")

head(significant_genes_map)

##   SYMBOL ENTREZID
## 1   HES2    54626
## 2 ERRFI1    54206
## 3   NPPB     4879
## 4   CTRC    11330
## 5  PADI2    11240
## 6    CDA      978

5.2.2 Background genes

background_genes_map <- bitr(geneID  = background_genes, 
                            fromType = "SYMBOL", 
                            toType   = "ENTREZID",
                            OrgDb    = "org.Hs.eg.db")

head(background_genes_map)

##    SYMBOL ENTREZID
## 1   NOC2L    26155
## 2  KLHL17   339451
## 3 PLEKHN1    84069
## 4   PERM1    84808
## 5    HES4    57801
## 6   ISG15     9636

5.3 ORA analysis and visualization

We will use two different gene sets, including Gene Ontology (GO) and the Molecular Signatures Database (MSigDB), for the ORA analysis.

5.3.1 Gene Ontology (GO) enrichment analysis

For GO gene set, there are 3 main classification which are
1. MF: Molecular Function
2. CC: Cellular Component
3. BP: Biological Process

Here, we will use BP for gene ontology and Benjamini & Hochberg correction to adjust p-values for multiple testing.

ego <- enrichGO(gene          = significant_genes_map$ENTREZID,
                universe      = background_genes_map$ENTREZID, 
                OrgDb         = org.Hs.eg.db,   # Specifies the organism: “org.Hs.eg.db” for Homo sapiens
                ont           = "BP", # Specified the ontology: BP for Biological Process
                pAdjustMethod = "BH", # The method used to adjjust p-values: BH for Benjamini & Hochberg correction
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

head(ego)

##                    ID                                  Description GeneRatio
## GO:0001525 GO:0001525                                 angiogenesis    46/336
## GO:0048514 GO:0048514                   blood vessel morphogenesis    50/336
## GO:0045765 GO:0045765                   regulation of angiogenesis    26/336
## GO:1901342 GO:1901342        regulation of vasculature development    26/336
## GO:0007186 GO:0007186 G protein-coupled receptor signaling pathway    39/336
## GO:0045766 GO:0045766          positive regulation of angiogenesis    19/336
##              BgRatio RichFactor FoldEnrichment    zScore       pvalue
## GO:0001525 414/12357 0.11111111       4.086310 10.678770 2.269804e-16
## GO:0048514 488/12357 0.10245902       3.768113 10.430977 2.841906e-16
## GO:0045765 209/12357 0.12440191       4.575103  8.714602 8.592322e-11
## GO:1901342 213/12357 0.12206573       4.489185  8.587588 1.315569e-10
## GO:0007186 454/12357 0.08590308       3.159239  7.836800 1.640067e-10
## GO:0045766 122/12357 0.15573770       5.727532  8.773052 6.961533e-10
##                p.adjust       qvalue
## GO:0001525 4.926444e-13 4.019053e-13
## GO:0048514 4.926444e-13 4.019053e-13
## GO:0045765 9.929860e-08 8.100901e-08
## GO:1901342 1.137223e-07 9.277602e-08
## GO:0007186 1.137223e-07 9.277602e-08
## GO:0045766 4.022606e-07 3.281691e-07
##                                                                                                                                                                                                                                                                                                    geneID
## GO:0001525                   NPPB/COL8A2/EDN2/CCN1/F3/S1PR1/SEMA4A/SH2D2A/CHI3L1/CD34/TGFB2/COL4A3/ACKR3/CXCL8/EREG/NDNF/SEMA5A/FGF1/PDGFRB/EDN1/VEGFA/CCN2/TNFAIP3/SASH1/ITGB8/AQP1/SERPINE1/CAV1/LOXL2/ADGRB1/ANXA1/KLF4/COL27A1/ADAM12/ADM/ITGA5/MMP19/BTG1/NOX5/MMP2/PIK3R6/CCL2/C3/ANGPTL4/KLK3/APLN
## GO:0048514 NPPB/COL8A2/EDN2/CCN1/F3/S1PR1/SEMA4A/SH2D2A/CHI3L1/CD34/TGFB2/XDH/COL4A3/ACKR3/CXCL8/EREG/NDNF/SEMA5A/LOX/FGF1/PDGFRB/EDN1/VEGFA/PRDM1/CCN2/TNFAIP3/SASH1/ITGB8/AQP1/SERPINE1/CAV1/LOXL2/ADGRB1/ANXA1/KLF4/COL27A1/ADAM12/ADM/ITGA5/MMP19/BTG1/NOX5/MMP2/PIK3R6/CCL2/NOG/C3/ANGPTL4/KLK3/APLN
## GO:0045765                                                                                                                                     NPPB/F3/SEMA4A/CHI3L1/CD34/TGFB2/COL4A3/CXCL8/SEMA5A/FGF1/VEGFA/TNFAIP3/SASH1/ITGB8/AQP1/SERPINE1/ADGRB1/KLF4/ADAM12/ADM/ITGA5/BTG1/PIK3R6/C3/ANGPTL4/KLK3
## GO:1901342                                                                                                                                     NPPB/F3/SEMA4A/CHI3L1/CD34/TGFB2/COL4A3/CXCL8/SEMA5A/FGF1/VEGFA/TNFAIP3/SASH1/ITGB8/AQP1/SERPINE1/ADGRB1/KLF4/ADAM12/ADM/ITGA5/BTG1/PIK3R6/C3/ANGPTL4/KLK3
## GO:0007186                                                            NPPB/PADI2/S1PR1/KISS1/GPR17/GPR39/ACKR3/CACNA1D/MGLL/CXCL8/EREG/ARRDC3/PDGFRB/EDN1/CCL26/CAV1/ADGRB1/ANXA1/ABCA1/LPAR1/AKR1C3/ADM/CHRM4/DRD2/GPRC5A/GPRC5D/GNG2/PLCB2/PIK3R6/PIK3R5/CCL2/GAST/GNGT2/KISS1R/S1PR4/C3/TFF2/GRPR/APLN
## GO:0045766                                                                                                                                                                                  F3/CHI3L1/CD34/CXCL8/SEMA5A/FGF1/VEGFA/SASH1/ITGB8/AQP1/SERPINE1/KLF4/ADAM12/ADM/ITGA5/BTG1/PIK3R6/C3/ANGPTL4
##            Count
## GO:0001525    46
## GO:0048514    50
## GO:0045765    26
## GO:1901342    26
## GO:0007186    39
## GO:0045766    19

The GO enrichment analysis has now been completed and stored in ego. Let’s observe the results in the dot plot below.

dotplot(ego, showCategory=20)

#### 5.3.2 MSigDB database fot the enrichment analysis. We next explore enrichment analysis using the MSigDB database. MSigDB stores data from various species and gene categories. Let’s first retrieve gene sets from Homo sapiens and observe them.

# Load the msigdbr package
library(msigdbr)

# Retrieve gene sets for Homo sapiens
m_df <- msigdbr(species = "Homo sapiens")

# Them_df object now contains information about the Hallmark gene sets.
head(m_df)

## # A tibble: 6 × 15
##   gs_cat gs_subcat      gs_name        gene_symbol entrez_gene ensembl_gene   
##   <chr>  <chr>          <chr>          <chr>             <int> <chr>          
## 1 C3     MIR:MIR_Legacy AAACCAC_MIR140 ABCC4             10257 ENSG00000125257
## 2 C3     MIR:MIR_Legacy AAACCAC_MIR140 ABRAXAS2          23172 ENSG00000165660
## 3 C3     MIR:MIR_Legacy AAACCAC_MIR140 ACTN4                81 ENSG00000130402
## 4 C3     MIR:MIR_Legacy AAACCAC_MIR140 ACTN4                81 ENSG00000282844
## 5 C3     MIR:MIR_Legacy AAACCAC_MIR140 ACVR1                90 ENSG00000115170
## 6 C3     MIR:MIR_Legacy AAACCAC_MIR140 ADAM9              8754 ENSG00000168615
## # ℹ 9 more variables: human_gene_symbol <chr>, human_entrez_gene <int>,
## #   human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>,
## #   gs_exact_source <chr>, gs_url <chr>, gs_description <chr>

Now we have the gene sets of Homo sapiens, however, MsigDB contains 8 key categories:
1. H (Hallmark gene sets) : These are a set of 50 gene sets representing well-defined biological states or processes.
2. C1 (Positional gene sets) : Gene sets based on chromosomal location.
3. C2 (Curated gene sets) : Manually curated gene sets from various sources, including pathway databases.
4. C3 (Motif gene sets) : Gene sets related to transcription factor binding motifs.
5. C4 (Computational gene sets) : Gene sets generated through computational methods.
6. C5 (GO gene sets) : Gene sets based on Gene Ontology terms.
7. C6 (Oncogenic signatures) : Gene sets associated with cancer-related processes.
8. C7 (Immunologic signatures) : Gene sets related to immune system processes.

Here, we will set the species parameter to “Homo sapiens” and the category parameter to “H” to retrieve Hallmark gene sets.

m_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>% 
  dplyr::select(gs_name, entrez_gene)

We will investigate overview of the gene set from the Hallmark gene sets of Homo sapiens.

# Count the number of genes in each gene set
table(m_t2g$gs_name)

## 
##                      HALLMARK_ADIPOGENESIS 
##                                        210 
##               HALLMARK_ALLOGRAFT_REJECTION 
##                                        335 
##                 HALLMARK_ANDROGEN_RESPONSE 
##                                        102 
##                      HALLMARK_ANGIOGENESIS 
##                                         36 
##                   HALLMARK_APICAL_JUNCTION 
##                                        231 
##                    HALLMARK_APICAL_SURFACE 
##                                         46 
##                         HALLMARK_APOPTOSIS 
##                                        183 
##              HALLMARK_BILE_ACID_METABOLISM 
##                                        114 
##           HALLMARK_CHOLESTEROL_HOMEOSTASIS 
##                                         77 
##                       HALLMARK_COAGULATION 
##                                        162 
##                        HALLMARK_COMPLEMENT 
##                                        237 
##                        HALLMARK_DNA_REPAIR 
##                                        170 
##                       HALLMARK_E2F_TARGETS 
##                                        218 
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 
##                                        204 
##           HALLMARK_ESTROGEN_RESPONSE_EARLY 
##                                        216 
##            HALLMARK_ESTROGEN_RESPONSE_LATE 
##                                        218 
##             HALLMARK_FATTY_ACID_METABOLISM 
##                                        165 
##                    HALLMARK_G2M_CHECKPOINT 
##                                        204 
##                        HALLMARK_GLYCOLYSIS 
##                                        215 
##                HALLMARK_HEDGEHOG_SIGNALING 
##                                         36 
##                   HALLMARK_HEME_METABOLISM 
##                                        214 
##                           HALLMARK_HYPOXIA 
##                                        215 
##               HALLMARK_IL2_STAT5_SIGNALING 
##                                        216 
##           HALLMARK_IL6_JAK_STAT3_SIGNALING 
##                                        103 
##             HALLMARK_INFLAMMATORY_RESPONSE 
##                                        222 
##         HALLMARK_INTERFERON_ALPHA_RESPONSE 
##                                        140 
##         HALLMARK_INTERFERON_GAMMA_RESPONSE 
##                                        286 
##                 HALLMARK_KRAS_SIGNALING_DN 
##                                        220 
##                 HALLMARK_KRAS_SIGNALING_UP 
##                                        220 
##                   HALLMARK_MITOTIC_SPINDLE 
##                                        215 
##                  HALLMARK_MTORC1_SIGNALING 
##                                        211 
##                    HALLMARK_MYC_TARGETS_V1 
##                                        236 
##                    HALLMARK_MYC_TARGETS_V2 
##                                         60 
##                        HALLMARK_MYOGENESIS 
##                                        212 
##                   HALLMARK_NOTCH_SIGNALING 
##                                         34 
##         HALLMARK_OXIDATIVE_PHOSPHORYLATION 
##                                        220 
##                       HALLMARK_P53_PATHWAY 
##                                        215 
##               HALLMARK_PANCREAS_BETA_CELLS 
##                                         44 
##                        HALLMARK_PEROXISOME 
##                                        110 
##           HALLMARK_PI3K_AKT_MTOR_SIGNALING 
##                                        118 
##                 HALLMARK_PROTEIN_SECRETION 
##                                         98 
##   HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY 
##                                         58 
##                   HALLMARK_SPERMATOGENESIS 
##                                        144 
##                HALLMARK_TGF_BETA_SIGNALING 
##                                         59 
##           HALLMARK_TNFA_SIGNALING_VIA_NFKB 
##                                        228 
##         HALLMARK_UNFOLDED_PROTEIN_RESPONSE 
##                                        115 
##                    HALLMARK_UV_RESPONSE_DN 
##                                        152 
##                    HALLMARK_UV_RESPONSE_UP 
##                                        191 
##        HALLMARK_WNT_BETA_CATENIN_SIGNALING 
##                                         50 
##             HALLMARK_XENOBIOTIC_METABOLISM 
##                                        224

Next, let’s perform an enrichment analysis (ORA) using these sets.

em <- enricher(significant_genes_map$ENTREZID,
               TERM2GENE = m_t2g,
               universe  = background_genes_map$ENTREZID
)

The enrichment analysis has been completed and stored in em. Let’s take a look at the analyzed data.

head(em)

##                                                                                    ID
## HALLMARK_HYPOXIA                                                     HALLMARK_HYPOXIA
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
## HALLMARK_KRAS_SIGNALING_UP                                 HALLMARK_KRAS_SIGNALING_UP
## HALLMARK_TNFA_SIGNALING_VIA_NFKB                     HALLMARK_TNFA_SIGNALING_VIA_NFKB
## HALLMARK_KRAS_SIGNALING_DN                                 HALLMARK_KRAS_SIGNALING_DN
## HALLMARK_INFLAMMATORY_RESPONSE                         HALLMARK_INFLAMMATORY_RESPONSE
##                                                                           Description
## HALLMARK_HYPOXIA                                                     HALLMARK_HYPOXIA
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
## HALLMARK_KRAS_SIGNALING_UP                                 HALLMARK_KRAS_SIGNALING_UP
## HALLMARK_TNFA_SIGNALING_VIA_NFKB                     HALLMARK_TNFA_SIGNALING_VIA_NFKB
## HALLMARK_KRAS_SIGNALING_DN                                 HALLMARK_KRAS_SIGNALING_DN
## HALLMARK_INFLAMMATORY_RESPONSE                         HALLMARK_INFLAMMATORY_RESPONSE
##                                            GeneRatio  BgRatio RichFactor
## HALLMARK_HYPOXIA                              29/139 175/3586  0.1657143
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION    24/139 158/3586  0.1518987
## HALLMARK_KRAS_SIGNALING_UP                    18/139 151/3586  0.1192053
## HALLMARK_TNFA_SIGNALING_VIA_NFKB              19/139 174/3586  0.1091954
## HALLMARK_KRAS_SIGNALING_DN                    15/139 116/3586  0.1293103
## HALLMARK_INFLAMMATORY_RESPONSE                15/139 142/3586  0.1056338
##                                            FoldEnrichment   zScore       pvalue
## HALLMARK_HYPOXIA                                 4.275190 8.919613 6.665908e-12
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION       3.918769 7.534235 3.602349e-09
## HALLMARK_KRAS_SIGNALING_UP                       3.075325 5.231697 1.449774e-05
## HALLMARK_TNFA_SIGNALING_VIA_NFKB                 2.817084 4.933743 2.937100e-05
## HALLMARK_KRAS_SIGNALING_DN                       3.336021 5.135377 3.010333e-05
## HALLMARK_INFLAMMATORY_RESPONSE                   2.725200 4.211952 3.135613e-04
##                                                p.adjust       qvalue
## HALLMARK_HYPOXIA                           2.799681e-10 2.105023e-10
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 7.564932e-08 5.687919e-08
## HALLMARK_KRAS_SIGNALING_UP                 2.029683e-04 1.526077e-04
## HALLMARK_TNFA_SIGNALING_VIA_NFKB           2.528680e-04 1.901263e-04
## HALLMARK_KRAS_SIGNALING_DN                 2.528680e-04 1.901263e-04
## HALLMARK_INFLAMMATORY_RESPONSE             2.032666e-03 1.528320e-03
##                                                                                                                                                                                       geneID
## HALLMARK_HYPOXIA                           54206/1907/3491/2152/8497/3623/57007/8553/55076/901/4015/8614/7422/1490/7128/3484/3486/5054/857/6781/665/10397/1289/5209/54541/133/694/3669/51129
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION                         1296/3491/4148/3576/4015/5159/7422/1490/7128/3624/3486/5054/4017/6591/1289/8038/6876/50863/3678/4313/1009/1277/50509/7078
## HALLMARK_KRAS_SIGNALING_UP                                                                        7139/28951/2069/1437/2162/639/7128/3624/3486/5243/9314/80201/330/3587/5175/51129/3689/3561
## HALLMARK_TNFA_SIGNALING_VIA_NFKB                                                                  3491/2152/3914/57007/8553/604/1437/1906/7422/6446/7128/3624/5054/19/9314/5209/330/694/6347
## HALLMARK_KRAS_SIGNALING_DN                                                                                  1907/7042/80303/7068/3699/1906/6446/105375355/56154/3851/3860/3866/3780/7032/778
## HALLMARK_INFLAMMATORY_RESPONSE                                                                                     2152/3576/2069/1906/3696/3624/5054/19/1902/133/3587/3678/23533/6347/10148
##                                            Count
## HALLMARK_HYPOXIA                              29
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION    24
## HALLMARK_KRAS_SIGNALING_UP                    18
## HALLMARK_TNFA_SIGNALING_VIA_NFKB              19
## HALLMARK_KRAS_SIGNALING_DN                    15
## HALLMARK_INFLAMMATORY_RESPONSE                15

We will visualize the data using a dotplot.

dotplot(em, showCategory = 20)

## 6. Gene Set Enrichment Analysis (GSEA) Gene Set Enrichment Analysis (GSEA) is an alternative approach that considers the entire list of genes ranked by their association with a phenotype or condition. Rather than merely identifying over-represented pathways, GSEA evaluates whether a predefined gene set (e.g., a pathway) demonstrates a significant association with the phenotype throughout the entire ranked gene list.

6.1 Accessing Gene Sets:

We need to define the gene sets for the GSEA analysis. In this case, we will use MSigDB, which we retrieved from the Hallmark gene sets of Homo sapiens in the previous step.

table(m_t2g$gs_name)

## 
##                      HALLMARK_ADIPOGENESIS 
##                                        210 
##               HALLMARK_ALLOGRAFT_REJECTION 
##                                        335 
##                 HALLMARK_ANDROGEN_RESPONSE 
##                                        102 
##                      HALLMARK_ANGIOGENESIS 
##                                         36 
##                   HALLMARK_APICAL_JUNCTION 
##                                        231 
##                    HALLMARK_APICAL_SURFACE 
##                                         46 
##                         HALLMARK_APOPTOSIS 
##                                        183 
##              HALLMARK_BILE_ACID_METABOLISM 
##                                        114 
##           HALLMARK_CHOLESTEROL_HOMEOSTASIS 
##                                         77 
##                       HALLMARK_COAGULATION 
##                                        162 
##                        HALLMARK_COMPLEMENT 
##                                        237 
##                        HALLMARK_DNA_REPAIR 
##                                        170 
##                       HALLMARK_E2F_TARGETS 
##                                        218 
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 
##                                        204 
##           HALLMARK_ESTROGEN_RESPONSE_EARLY 
##                                        216 
##            HALLMARK_ESTROGEN_RESPONSE_LATE 
##                                        218 
##             HALLMARK_FATTY_ACID_METABOLISM 
##                                        165 
##                    HALLMARK_G2M_CHECKPOINT 
##                                        204 
##                        HALLMARK_GLYCOLYSIS 
##                                        215 
##                HALLMARK_HEDGEHOG_SIGNALING 
##                                         36 
##                   HALLMARK_HEME_METABOLISM 
##                                        214 
##                           HALLMARK_HYPOXIA 
##                                        215 
##               HALLMARK_IL2_STAT5_SIGNALING 
##                                        216 
##           HALLMARK_IL6_JAK_STAT3_SIGNALING 
##                                        103 
##             HALLMARK_INFLAMMATORY_RESPONSE 
##                                        222 
##         HALLMARK_INTERFERON_ALPHA_RESPONSE 
##                                        140 
##         HALLMARK_INTERFERON_GAMMA_RESPONSE 
##                                        286 
##                 HALLMARK_KRAS_SIGNALING_DN 
##                                        220 
##                 HALLMARK_KRAS_SIGNALING_UP 
##                                        220 
##                   HALLMARK_MITOTIC_SPINDLE 
##                                        215 
##                  HALLMARK_MTORC1_SIGNALING 
##                                        211 
##                    HALLMARK_MYC_TARGETS_V1 
##                                        236 
##                    HALLMARK_MYC_TARGETS_V2 
##                                         60 
##                        HALLMARK_MYOGENESIS 
##                                        212 
##                   HALLMARK_NOTCH_SIGNALING 
##                                         34 
##         HALLMARK_OXIDATIVE_PHOSPHORYLATION 
##                                        220 
##                       HALLMARK_P53_PATHWAY 
##                                        215 
##               HALLMARK_PANCREAS_BETA_CELLS 
##                                         44 
##                        HALLMARK_PEROXISOME 
##                                        110 
##           HALLMARK_PI3K_AKT_MTOR_SIGNALING 
##                                        118 
##                 HALLMARK_PROTEIN_SECRETION 
##                                         98 
##   HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY 
##                                         58 
##                   HALLMARK_SPERMATOGENESIS 
##                                        144 
##                HALLMARK_TGF_BETA_SIGNALING 
##                                         59 
##           HALLMARK_TNFA_SIGNALING_VIA_NFKB 
##                                        228 
##         HALLMARK_UNFOLDED_PROTEIN_RESPONSE 
##                                        115 
##                    HALLMARK_UV_RESPONSE_DN 
##                                        152 
##                    HALLMARK_UV_RESPONSE_UP 
##                                        191 
##        HALLMARK_WNT_BETA_CATENIN_SIGNALING 
##                                         50 
##             HALLMARK_XENOBIOTIC_METABOLISM 
##                                        224

6.2 Input data for GSEA

Input for GSEA is different from ORA. Instead of selecting only genes of interest, GSEA utilizes all differential expression data, including fold change and p-value from all genes.

We will use the results from the previous step (res) as input. Therefore, we need to filter the genes and change the data type to a data frame using the following code, then use res_df for further processing.

res_df <- res %>% 
  as.data.frame() %>% 
  filter(baseMean != 0) %>%
  tibble::rownames_to_column(var = "gene")

GSEA requires the data to be pre-ranked based on a metric that combines fold change and significance, such as the signed fold change multiplied by the negative logarithm of the p-value.

Additionally, we include the ENTREZID in the data frame.

res_df <- res_df %>% 
  mutate(signed_rank_stats = sign(log2FoldChange) * -log10(pvalue)) %>%
  left_join(background_genes_map, by = c("gene" = "SYMBOL")) %>%
  arrange(desc(signed_rank_stats))

One issue that may arise is the presence of infinite values in the data, which can cause errors during GSEA.
To address this, we replace infinite values with large numbers so that they are ranked high:

res_df <- res_df %>%
  mutate(negative_log10pvalue = -log10(pvalue)) %>%
  mutate(negative_log10pvalue = ifelse(is.infinite(negative_log10pvalue), 1000, negative_log10pvalue)) %>%
  mutate(signed_rank_stats = sign(log2FoldChange) * negative_log10pvalue)

This is the final step of input preparation for GSEA. We will prepare gene_list that contains signed_rank_stats and gene identification (ENTREZID)

# Assign `signed_rank_stats` as `gene_list` 
gene_list <- res_df$signed_rank_stats

# Name the `gene_list` by ENTREZID
names(gene_list) <- res_df$ENTREZID

6.3 GSEA Analysis

We perform fast GSEA using our pre-ranked gene list (gene_list) and a reference gene set, represented by m_t2g. The analyzed data is stored in em2.

em2 <- GSEA(gene_list, TERM2GENE = m_t2g)

head(em2)

##                                                                    ID
## HALLMARK_E2F_TARGETS                             HALLMARK_E2F_TARGETS
## HALLMARK_G2M_CHECKPOINT                       HALLMARK_G2M_CHECKPOINT
## HALLMARK_OXIDATIVE_PHOSPHORYLATION HALLMARK_OXIDATIVE_PHOSPHORYLATION
## HALLMARK_HYPOXIA                                     HALLMARK_HYPOXIA
## HALLMARK_MYC_TARGETS_V1                       HALLMARK_MYC_TARGETS_V1
## HALLMARK_TNFA_SIGNALING_VIA_NFKB     HALLMARK_TNFA_SIGNALING_VIA_NFKB
##                                                           Description setSize
## HALLMARK_E2F_TARGETS                             HALLMARK_E2F_TARGETS     197
## HALLMARK_G2M_CHECKPOINT                       HALLMARK_G2M_CHECKPOINT     195
## HALLMARK_OXIDATIVE_PHOSPHORYLATION HALLMARK_OXIDATIVE_PHOSPHORYLATION     190
## HALLMARK_HYPOXIA                                     HALLMARK_HYPOXIA     175
## HALLMARK_MYC_TARGETS_V1                       HALLMARK_MYC_TARGETS_V1     191
## HALLMARK_TNFA_SIGNALING_VIA_NFKB     HALLMARK_TNFA_SIGNALING_VIA_NFKB     174
##                                    enrichmentScore       NES       pvalue
## HALLMARK_E2F_TARGETS                    -0.7867385 -2.208564 1.000000e-10
## HALLMARK_G2M_CHECKPOINT                 -0.7627360 -2.146758 1.000000e-10
## HALLMARK_OXIDATIVE_PHOSPHORYLATION      -0.7363473 -2.060504 1.000000e-10
## HALLMARK_HYPOXIA                         0.9373476  2.038974 1.000000e-10
## HALLMARK_MYC_TARGETS_V1                 -0.6939122 -1.945010 9.891268e-09
## HALLMARK_TNFA_SIGNALING_VIA_NFKB         0.8766807  1.909870 1.880431e-07
##                                        p.adjust       qvalue rank
## HALLMARK_E2F_TARGETS               1.250000e-09 6.842105e-10 1758
## HALLMARK_G2M_CHECKPOINT            1.250000e-09 6.842105e-10 1571
## HALLMARK_OXIDATIVE_PHOSPHORYLATION 1.250000e-09 6.842105e-10 1861
## HALLMARK_HYPOXIA                   1.250000e-09 6.842105e-10  626
## HALLMARK_MYC_TARGETS_V1            9.891268e-08 5.414168e-08 1093
## HALLMARK_TNFA_SIGNALING_VIA_NFKB   1.567026e-06 8.577404e-07 1035
##                                                      leading_edge
## HALLMARK_E2F_TARGETS               tags=59%, list=13%, signal=52%
## HALLMARK_G2M_CHECKPOINT            tags=50%, list=12%, signal=45%
## HALLMARK_OXIDATIVE_PHOSPHORYLATION tags=49%, list=14%, signal=43%
## HALLMARK_HYPOXIA                    tags=48%, list=5%, signal=46%
## HALLMARK_MYC_TARGETS_V1             tags=65%, list=8%, signal=61%
## HALLMARK_TNFA_SIGNALING_VIA_NFKB    tags=39%, list=8%, signal=37%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      core_enrichment
## HALLMARK_E2F_TARGETS                                           7398/27338/3978/9972/10635/55635/3148/25842/79019/4605/1633/9133/10248/10549/9125/9833/6839/3184/10535/10733/3930/10212/253714/51155/10615/10714/5558/2146/4288/29089/9319/5982/84844/23594/8726/24137/3161/4999/3015/10797/9787/84296/672/9126/23649/5889/983/3835/7514/1965/57122/3838/29028/9212/8914/3609/9700/11340/5424/4085/7153/204/701/4001/7112/6790/9918/4830/55646/6628/6427/4678/10051/29127/5631/3070/1111/332/1854/5591/11051/8243/2547/79075/54962/5983/991/4436/23165/10492/4171/6749/1503/5347/7374/9238/993/10527/1019/1164/4174/3159/5902/6426/1786/79077/5111/4175/4172/4176/10528/5901/4173/5425/1434/5036/9221
## HALLMARK_G2M_CHECKPOINT                                                                                                                                                     4605/9133/10403/3980/10592/6839/3184/6713/10733/1810/3832/3930/1874/3619/10212/51155/5558/1869/8454/2146/4288/8438/6599/23594/24137/3161/3015/22809/5933/9585/7027/3151/890/699/23649/10036/262/983/22974/3835/7514/3838/990/5001/7150/7290/55872/9212/10926/9184/3609/9700/10419/51776/4085/7153/8318/4001/7112/6790/10432/6427/4678/7272/10051/29127/10146/1111/332/7371/8243/4953/9156/7090/6421/3837/3192/991/10492/4171/51659/10772/5347/993/1019/1164/6632/4174/3159/6426/8140/4175/4172/3312/85236/4691/1736/9221
## HALLMARK_OXIDATIVE_PHOSPHORYLATION                                                                                                                                                                           4726/10975/4702/4720/7384/4694/1743/4717/23530/509/4835/291/6832/65003/9551/10939/1340/4723/4729/1351/513/4552/374291/5018/2746/3420/4712/7417/50/5250/2108/10440/9927/7386/1355/1374/3032/8604/64981/10989/34/29796/3028/4706/1892/5447/8802/56993/10884/7388/29088/51318/4697/26517/9377/3313/92609/1738/4704/2395/533/1349/64960/1737/54704/6390/23203/4191/517/515/5435/2271/7416/518/6392/56945/3419/54205/3417/9131/38/516/11331/498/10935/10128/1537/1431/5160/2806/506/292/3945
## HALLMARK_HYPOXIA                                                                                                                                                                                                                                   54206/8553/8614/3486/5054/665/10397/54541/133/6515/5230/2355/857/3491/6513/5209/8497/7045/8974/6385/1316/7052/3725/205/7422/55818/4627/694/3099/901/5155/5163/7167/2026/2632/2113/55076/1907/1490/302/5033/10370/55139/6095/23645/8660/1649/2152/284119/10957/3098/2023/5214/23764/4783/5211/5236/30001/3309/7852/1026/4601/112464/26118/467/5329/6533/2597/51129/26136/3669/54800/5066/23327/4015/7128/2131/7436/2997/26355/6781/3623/23036/3516
## HALLMARK_MYC_TARGETS_V1            6637/5250/4999/3015/8662/10921/8664/65005/7027/3066/3608/6193/3178/890/6128/5093/4706/2058/7514/1965/3838/2739/7298/55651/9045/8886/1982/9184/10054/5984/328/9377/6428/4085/5478/6633/8318/26354/7965/1964/1478/6627/51690/6194/4830/23435/9136/6427/3376/9188/10146/1854/10856/6634/10971/2547/4953/7416/2079/56910/10291/51491/3837/1207/5496/3192/23196/8669/991/6732/1665/10155/23450/10492/10728/4171/1503/6187/3336/3181/11331/1019/3615/6418/10935/8761/2107/6632/7555/3735/1537/4686/4174/5902/26156/6426/6059/3183/7284/6723/5111/4175/10576/4176/790/10528/6741/5901/2806/5245/4173/10574/6950/26135/7203/708/5425/10575/3326/22948/5036/4869/9221/3329
## HALLMARK_TNFA_SIGNALING_VIA_NFKB                                                                                                                                                                                                                                                                                                              3914/8553/5054/6515/2355/3491/5209/6385/1316/3725/7422/694/9314/10318/22822/4616/23645/8660/2152/10957/10769/4082/56937/1827/6446/23764/4783/329/7050/3383/23135/9120/10725/19/1026/467/5329/2114/8744/330/960/8061/4084/7128/604/2354/4791/7538/10938/6303/1847/25976/3976/23258/4088/3572/5606/1906/3280/4792/9516/687/3624/1437/4973/1846/1839/3371

6.4 Visualization

6.4.1 Dot plot for overview of the GSEA.

dotplot(em2, showCategory=20)

6.4.2 GSEA plot for each defined gene set.

Visualization for the GSEA of HALLMARK_P53_PATHWAY

p1 <- gseaplot(em2, geneSetID = "HALLMARK_P53_PATHWAY", 
              by = "runningScore", title = "HALLMARK_P53_PATHWAY")

p1

Visualization for the GSEA of HALLMARK_HYPOXIA

p2 <- gseaplot(em2, geneSetID = "HALLMARK_HYPOXIA", 
               by = "runningScore", title = "HALLMARK_HYPOXIA")

p2