DESeq2
) and Functional
AnalysisDESeq2
For the downstream analysis of transcriptomics, we will run all code in RStudio. The gene counts will serve as the input file, and DESeq2 will be used for differential expression analysis. After identifying differentially expressed genes, we will utilize various tools and visualizations.
Let’s start with project organization. I organized the folders in
this project to ensure reproducibility.
I created a set of routine folders, which include data, results, and
scripts.
Furthermore, I run all R scripts using the R server in Docker and
have also set up the R project and Git version control. For more
details, please see My Computer
Workbench.
We will use the count matrix from a published data GSE197576.
library(dplyr)
library(readr)
library(here)
# Define location of the file
raw_counts_file <- here("data/GEO", "GSE197576_raw_gene_counts_matrix.tsv.gz")
# Import the file
raw_counts <- read_tsv(raw_counts_file)
# Observe head of the data
head(raw_counts)
## # A tibble: 6 × 13
## gene `01_SW_sgCTRL_Norm` `02_SW_sgCTRL_Norm` `03_SW_sgITPR3_1_Norm`
## <chr> <dbl> <dbl> <dbl>
## 1 DDX11L1 0 0 0
## 2 WASH7P 18 11 28
## 3 MIR6859-1 5 1 6
## 4 MIR1302-2HG 0 0 0
## 5 MIR1302-2 0 0 0
## 6 FAM138A 0 0 0
## # ℹ 9 more variables: `04_SW_sgITPR3_1_Norm` <dbl>,
## # `07_SW_sgRELB_3_Norm` <dbl>, `08_SW_sgRELB_3_Norm` <dbl>,
## # `11_SW_sgCTRL_Hyp` <dbl>, `12_SW_sgCTRL_Hyp` <dbl>,
## # `13_SW_sgITPR3_1_Hyp` <dbl>, `14_SW_sgITPR3_1_Hyp` <dbl>,
## # `17_SW_sgRELB_3_Hyp` <dbl>, `18_SW_sgRELB_3_Hyp` <dbl>
Next, let’s narrow down our data to the specific samples needed for
comparison.
We will compare hypoxic and normoxic conditions, so we will include
only sgCTRL
samples from both conditions. To accomplish
this, we will use a defined logical vector to subset the columns.
# Define the data we want by this logical vector
columns_to_select <- colnames(raw_counts) %>%
stringr::str_detect("sgCTRL|gene")
columns_to_select
## [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [13] FALSE
# Use this logical vector to subset the data frame
raw_counts <- raw_counts[, columns_to_select]
head(raw_counts)
## # A tibble: 6 × 5
## gene `01_SW_sgCTRL_Norm` `02_SW_sgCTRL_Norm` `11_SW_sgCTRL_Hyp`
## <chr> <dbl> <dbl> <dbl>
## 1 DDX11L1 0 0 0
## 2 WASH7P 18 11 23
## 3 MIR6859-1 5 1 8
## 4 MIR1302-2HG 0 0 1
## 5 MIR1302-2 0 0 0
## 6 FAM138A 0 0 0
## # ℹ 1 more variable: `12_SW_sgCTRL_Hyp` <dbl>
Currently, the gene counts in raw_counts
are in
character format. We will convert the raw_counts
to a
matrix, as the gene counts should be treated as numeric values.
# Change only count values to matrix
raw_counts_mat <- raw_counts[, -1] %>%
as.matrix()
# Add rownames
rownames(raw_counts_mat) <- raw_counts$gene
head(raw_counts_mat)
## 01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 11_SW_sgCTRL_Hyp
## DDX11L1 0 0 0
## WASH7P 18 11 23
## MIR6859-1 5 1 8
## MIR1302-2HG 0 0 1
## MIR1302-2 0 0 0
## FAM138A 0 0 0
## 12_SW_sgCTRL_Hyp
## DDX11L1 0
## WASH7P 45
## MIR6859-1 12
## MIR1302-2HG 0
## MIR1302-2 0
## FAM138A 0
In most cases of RNA-seq analysis, we subset only protein-coding
genes for several reasons, including primary biological function, higher
expression levels compared to non-coding genes, accurate gene
annotation, reduction of noise, simplified data interpretation, and
targeted research goals (specific pathways or cellular processes).
library(org.Hs.eg.db)
# list all keytypes from `org.Hs.eg.db` to see what is gene type was named.
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
We will retrieve the GENETYPE
of raw_counts
(the same data as raw_counts_mat
) using
SYMBOL
. Here, we will create a map
that
matches SYMBOL
with our data (raw_counts
and
raw_counts_mat
).
map <- AnnotationDbi::select(org.Hs.eg.db,
keys = raw_counts$gene , # column that contain key (we want to convert this col.)
keytype = "SYMBOL", # type of key in the column,
columns = "GENETYPE") # What we want to retrieve
head(map)
## SYMBOL GENETYPE
## 1 DDX11L1 pseudo
## 2 WASH7P pseudo
## 3 MIR6859-1 ncRNA
## 4 MIR1302-2HG ncRNA
## 5 MIR1302-2 ncRNA
## 6 FAM138A ncRNA
Observe the types of gene.
unique(map$GENETYPE)
## [1] "pseudo" "ncRNA" "protein-coding" NA
## [5] "snoRNA" "snRNA" "tRNA" "other"
## [9] "rRNA" "unknown" "scRNA"
We will select only protein-coding
genes by defining
them in map_CodingGene
. Then, we will use
map_CodingGene
to subset our data.
map_CodingGene <- map %>%
dplyr::filter(GENETYPE == "protein-coding")
head(map_CodingGene)
## SYMBOL GENETYPE
## 1 OR4F5 protein-coding
## 2 LOC112268260 protein-coding
## 3 OR4F29 protein-coding
## 4 LOC105378947 protein-coding
## 5 OR4F16 protein-coding
## 6 SAMD11 protein-coding
Show number of protein-coding
genes.
nrow(map_CodingGene)
## [1] 19377
From raw_count_mat
, we will subset only the common
genes, which are the intersecting genes between
raw_count_mat
and map_CodingGene
(protein-coding genes).
# Define the `common_genes` = genes that are in both `raw_count_mat` and `map_CodingGene`
common_genes <- intersect(rownames(raw_counts_mat), map_CodingGene$SYMBOL)
# subset the common genes and re-order them by `coding_genes`
raw_counts_mat <- raw_counts_mat[common_genes, ]
head(raw_counts_mat)
## 01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 11_SW_sgCTRL_Hyp
## OR4F5 0 0 0
## LOC112268260 0 0 0
## OR4F29 0 0 0
## LOC105378947 0 0 0
## OR4F16 1 0 0
## SAMD11 2 1 2
## 12_SW_sgCTRL_Hyp
## OR4F5 0
## LOC112268260 0
## OR4F29 0
## LOC105378947 0
## OR4F16 3
## SAMD11 5
DESeq2
DESeq2
analysis: Prepare
ColData
The DESeq2
analysis needs 2 input files which are
countData
(which we have as raw_counts_mat
)
and colData (which we will prepare).
# Define condition of each column
coldata <- data.frame(condition = c("normoxia", "normoxia", "hypoxia", "hypoxia"))
rownames(coldata) <- colnames(raw_counts_mat)
# Recheck whether the name is match with the defined condition
coldata
## condition
## 01_SW_sgCTRL_Norm normoxia
## 02_SW_sgCTRL_Norm normoxia
## 11_SW_sgCTRL_Hyp hypoxia
## 12_SW_sgCTRL_Hyp hypoxia
DESeq2
This starts by creating a DESeq2 object (dds
) using
raw_counts_mat
and coldata
as inputs. This
object will use the condition as the main factor.
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = raw_counts_mat,
colData = coldata,
design = ~ condition)
dds <- DESeq(dds)
dds
## class: DESeqDataSet
## dim: 19377 4
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(19377): OR4F5 LOC112268260 ... ND6 CYTB
## rowData names(22): baseMean baseVar ... deviance maxCooks
## colnames(4): 01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 11_SW_sgCTRL_Hyp
## 12_SW_sgCTRL_Hyp
## colData names(2): condition sizeFactor
To obtain differential gene expression, the two conditions (normoxia and hypoxia) will be compared.
res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))
res
## log2 fold change (MLE): condition hypoxia vs normoxia
## Wald test p-value: condition hypoxia vs normoxia
## DataFrame with 19377 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## OR4F5 0.000000 NA NA NA NA
## LOC112268260 0.000000 NA NA NA NA
## OR4F29 0.000000 NA NA NA NA
## LOC105378947 0.000000 NA NA NA NA
## OR4F16 0.931111 1.37997 2.94378 0.468772 0.639232
## ... ... ... ... ... ...
## ND4L 20344.0 -0.2356148 0.1794782 -1.31278 1.89258e-01
## ND4 119899.1 0.0678208 0.0433314 1.56517 1.17544e-01
## ND5 44322.2 0.5803516 0.0943872 6.14863 7.81559e-10
## ND6 11418.1 0.9891536 0.1098203 9.00702 2.11733e-19
## CYTB 55810.9 0.0738320 0.0512175 1.44154 1.49433e-01
## padj
## <numeric>
## OR4F5 NA
## LOC112268260 NA
## OR4F29 NA
## LOC105378947 NA
## OR4F16 NA
## ... ...
## ND4L 2.87405e-01
## ND4 1.92725e-01
## ND5 4.49348e-09
## ND6 2.44271e-18
## CYTB 2.36020e-01
Firstly, we will examine the distribution of the p-value using a histogram.
library(ggplot2)
res %>% as.data.frame() %>%
arrange(padj) %>%
ggplot(aes(x=pvalue)) +
geom_histogram(color = "white", bins = 50)
According to Tommy
recommendation, genes with small counts can distort the p-values, so
we will compare the counts between those with low and high p-values
using the box plot below.
res %>% as.data.frame() %>%
tibble::rownames_to_column(var = "gene_id") %>%
filter(!is.na(pvalue)) %>%
mutate(pvalue_bin = if_else(pvalue > 0.75, "high", "low")) %>%
ggplot(aes(x= pvalue_bin, y = log2(baseMean))) +
geom_boxplot()
As the box plot shows, genes with high p-values have lower gene
expression. Therefore, we will remove the genes with low counts.
# Counts across 4 samples lower than 10 will be removed (This number can be changes.)
dds <- dds[rowSums(counts(dds)) > 10,]
# Recalculate differential gene expression
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))
# Re-observe the distribution of p-value
res %>% as.data.frame() %>%
arrange(padj) %>%
ggplot(aes(x=pvalue)) +
geom_histogram(color = "white", bins = 50)
Now, the distribution of p-value looks better.
The volcano plot will label the most differentially expressed genes
as either upregulated or downregulated.
We will first define genes that we will label as
genes_to_label
, which includes only genes with a log2 fold
change greater than 2.5, an adjusted p-value lower than 0.001, and names
that do not start with LOC
(indicating genes for which a
published symbol is not available).
genes_to_label <- res %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "gene") %>%
filter(!stringr::str_detect(gene, "LOC"),
abs(log2FoldChange) >= 2.5,
padj <= 0.001)
head(genes_to_label)
## gene baseMean log2FoldChange lfcSE stat pvalue
## 1 ERRFI1 4332.40999 2.786041 0.06805398 40.938694 0.000000e+00
## 2 NPPB 25.37084 4.487259 0.85034921 5.276961 1.313439e-07
## 3 CTRC 13.75187 4.639940 1.19560510 3.880830 1.041005e-04
## 4 PADI2 47.15242 2.865481 0.45345365 6.319236 2.628596e-10
## 5 GJB4 88.93894 2.810783 0.32840632 8.558858 1.139914e-17
## 6 EDN2 368.07113 3.596625 0.18740536 19.191685 4.343258e-82
## padj
## 1 0.000000e+00
## 2 5.814759e-07
## 3 3.126940e-04
## 4 1.503784e-09
## 5 1.142190e-16
## 6 3.771674e-80
Moreover, we want to label the significantly different genes by
color. Therefore, the significant (sig
) and not significant
(not_sig
) genes will be defined by the code below for
labeling purposes.
res_label <- res %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "gene") %>%
mutate(sig = case_when(
!stringr::str_detect(gene, "LOC") &
abs(log2FoldChange) >= 2.5 &
padj <= 0.001 ~ "sig",
TRUE ~ "not_sig"
))
head(res_label)
## gene baseMean log2FoldChange lfcSE stat pvalue
## 1 NOC2L 4231.31471 -0.87086996 0.05503194 -15.8248098 2.098321e-56
## 2 KLHL17 588.20882 -0.11328977 0.11189537 -1.0124617 3.113173e-01
## 3 PLEKHN1 229.41180 0.04712997 0.17347083 0.2716882 7.858618e-01
## 4 PERM1 10.50349 2.47576605 0.90550077 2.7341402 6.254341e-03
## 5 HES4 209.38352 0.31409405 0.18035415 1.7415405 8.158888e-02
## 6 ISG15 581.41884 -0.69162460 0.12429286 -5.5644758 2.629417e-08
## padj sig
## 1 9.939147e-55 not_sig
## 2 4.176487e-01 not_sig
## 3 8.447251e-01 not_sig
## 4 1.368560e-02 not_sig
## 5 1.362866e-01 not_sig
## 6 1.254250e-07 not_sig
Now, we will create the volcano plot using the ggplot2
library and use ggrepel
to prevent label overlap.
Additionally, we will add horizontal and vertical lines to mark the
highly significant levels based on p-values and fold change,
respectively.
library(ggplot2)
library(ggrepel)
ggplot(res_label, aes(x = log2FoldChange, y = -log10(pvalue))) +
geom_point(aes(color = sig)) +
scale_color_manual(values = c("blue", "red")) +
ggrepel::geom_label_repel(data = genes_to_label, aes(label = gene))+
geom_hline(yintercept = 100, linetype = 2, color = "red") +
geom_vline(xintercept = c(-2.5, 2.5), linetype = 2, color = "red")+
theme_bw(base_size = 14)+
labs(title = "Hypoxia vs Normoxia")
### 4.2 Principal Component Analysis (PCA) In genomics, PCA is often
used to visualize and explore variations in gene expression data across
different samples or conditions. Here, we expect to see the same
samples, but different replicates clustering together.
First, we need to normalize the data. This normalized data
(normalized_counts
) will be used as input for both PCA and
the heatmap (in the next section).
# Calculate the normalized data for PCA plot
vsd <- vst(dds, blind=FALSE)
normalized_counts <- assay(vsd) %>% as.matrix()
# Observe the normalized data
head(normalized_counts)
## 01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 11_SW_sgCTRL_Hyp 12_SW_sgCTRL_Hyp
## NOC2L 12.98927 12.95152 12.41364 12.40060
## KLHL17 11.39941 11.37610 11.34033 11.36613
## PLEKHN1 11.01355 11.04002 11.04961 11.02156
## PERM1 10.52707 10.51164 10.61583 10.61238
## HES4 10.97748 10.97282 11.02024 11.04890
## ISG15 11.46388 11.46339 11.20769 11.29751
# Calculate the principal components
pca_prcomp <- prcomp(t(normalized_counts),
center = TRUE,
scale. = FALSE)
# This is the PC values
pca_prcomp$x
## PC1 PC2 PC3 PC4
## 01_SW_sgCTRL_Norm -12.22530 1.088312 -1.6224222 6.218420e-14
## 02_SW_sgCTRL_Norm -12.16220 -1.098779 1.6235839 6.251640e-14
## 11_SW_sgCTRL_Hyp 12.23783 -2.107295 -0.8407274 6.549535e-14
## 12_SW_sgCTRL_Hyp 12.14966 2.117762 0.8395656 6.350693e-14
We will next use the data from pca_prcomp$x
and prepare
it for PCA plotting.
# Create a DataFrame for visualization
PC1_and_PC2 <- data.frame(PC1 = pca_prcomp$x[, 1],
PC2 = pca_prcomp$x[, 2],
type = rownames(pca_prcomp$x))
PC1_and_PC2
## PC1 PC2 type
## 01_SW_sgCTRL_Norm -12.22530 1.088312 01_SW_sgCTRL_Norm
## 02_SW_sgCTRL_Norm -12.16220 -1.098779 02_SW_sgCTRL_Norm
## 11_SW_sgCTRL_Hyp 12.23783 -2.107295 11_SW_sgCTRL_Hyp
## 12_SW_sgCTRL_Hyp 12.14966 2.117762 12_SW_sgCTRL_Hyp
ggplot(PC1_and_PC2, aes(x = PC1, y = PC2, col = type)) +
geom_point() +
geom_text(aes(label = type), hjust = 0, vjust = 0) +
coord_fixed()+
theme_classic()
### 4.3 Heatmap Heatmaps are useful in genomics and other scientific
fields where large datasets need to be analyzed. For heatmap plotting,
only significantly different genes will be displayed, including both
upregulated and downregulated genes.
The significant genes are defined by an adjusted p-value of less than 0.01 and a log2 fold change greater than 2.
significant_genes <- res %>%
as.data.frame() %>%
filter(padj <= 0.01,
abs(log2FoldChange) >= 2) %>%
rownames()
head(significant_genes, 10)
## [1] "HES2" "ERRFI1" "NPPB" "CTRC" "PADI2" "CDA" "CSMD2" "GJB4"
## [9] "COL8A2" "POU3F1"
As you saw, significant_genes
is simply a list of genes.
We will now use this list to filter our normalized data
(normalized_counts
).
significant_mat <- normalized_counts[significant_genes, ]
We have four samples, with two replicates each for
normoxic
and hypoxic
conditions. We will label
the hypoxic
and normoxic
samples by coloring
the heads of the heatmap columns.
library(ComplexHeatmap)
col_anno <- HeatmapAnnotation(
df = coldata,
col = list(condition = c("hypoxia" = "red",
"normoxia" = "blue"))
)
Before we plot the heatmap, we need to scale it by following steps.
We transform rows (genes) into columns, so that each column
represents a gene.
The scaling will be performed independently for each column (this
is the default behavior in R) by subtracting the mean of each row from
its values and then dividing by the standard deviation of that row.
After scaling, each row will have a mean of 0 and a standard deviation
of 1.
We then transpose it back, so that the genes return to being in
rows.
In conclusion, scaling is performed independently for each
gene.
Please note that the scaled values will typically range from negative
to positive, depending on how far the original values deviate from their
mean. Therefore, the range is not fixed.
# Scaling the data
scaled_sig_mat <- t(scale(t(significant_mat)))
# Observe min and max values of the scaled data
range_scale <- c(min(scaled_sig_mat), max(scaled_sig_mat))
range_scale
## [1] -1.247098 1.363132
Heatmap(
scaled_sig_mat,
top_annotation = col_anno,
show_row_names = FALSE,
name = "scaled normalized\nexpression"
)
Over-Representation Analysis (ORA) is a method that compares a list
of genes of interest (e.g., DEGs) to a background set (all genes in the
experiment) to identify pathways that are over-represented among the
genes of interest.
In this case, we will use significant_genes, which have already been defined in the previous section as the genes of interest.
significant_genes
head(significant_genes)
## [1] "HES2" "ERRFI1" "NPPB" "CTRC" "PADI2" "CDA"
background_genes <- res %>%
as.data.frame() %>%
filter(baseMean != 0) %>%
tibble::rownames_to_column(var = "gene") %>%
pull(gene)
library(org.Hs.eg.db)
library(clusterProfiler)
significant_genes_map <- clusterProfiler::bitr(geneID = significant_genes,
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = "org.Hs.eg.db")
head(significant_genes_map)
## SYMBOL ENTREZID
## 1 HES2 54626
## 2 ERRFI1 54206
## 3 NPPB 4879
## 4 CTRC 11330
## 5 PADI2 11240
## 6 CDA 978
background_genes_map <- bitr(geneID = background_genes,
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = "org.Hs.eg.db")
head(background_genes_map)
## SYMBOL ENTREZID
## 1 NOC2L 26155
## 2 KLHL17 339451
## 3 PLEKHN1 84069
## 4 PERM1 84808
## 5 HES4 57801
## 6 ISG15 9636
We will use two different gene sets, including Gene Ontology (GO) and the Molecular Signatures Database (MSigDB), for the ORA analysis.
For GO gene set, there are 3 main classification which are
1. MF: Molecular Function
2. CC: Cellular Component
3. BP: Biological Process
Here, we will use BP for gene ontology and Benjamini & Hochberg correction to adjust p-values for multiple testing.
ego <- enrichGO(gene = significant_genes_map$ENTREZID,
universe = background_genes_map$ENTREZID,
OrgDb = org.Hs.eg.db, # Specifies the organism: “org.Hs.eg.db” for Homo sapiens
ont = "BP", # Specified the ontology: BP for Biological Process
pAdjustMethod = "BH", # The method used to adjjust p-values: BH for Benjamini & Hochberg correction
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
head(ego)
## ID Description GeneRatio
## GO:0001525 GO:0001525 angiogenesis 46/336
## GO:0048514 GO:0048514 blood vessel morphogenesis 50/336
## GO:0045765 GO:0045765 regulation of angiogenesis 26/336
## GO:1901342 GO:1901342 regulation of vasculature development 26/336
## GO:0007186 GO:0007186 G protein-coupled receptor signaling pathway 39/336
## GO:0045766 GO:0045766 positive regulation of angiogenesis 19/336
## BgRatio RichFactor FoldEnrichment zScore pvalue
## GO:0001525 414/12357 0.11111111 4.086310 10.678770 2.269804e-16
## GO:0048514 488/12357 0.10245902 3.768113 10.430977 2.841906e-16
## GO:0045765 209/12357 0.12440191 4.575103 8.714602 8.592322e-11
## GO:1901342 213/12357 0.12206573 4.489185 8.587588 1.315569e-10
## GO:0007186 454/12357 0.08590308 3.159239 7.836800 1.640067e-10
## GO:0045766 122/12357 0.15573770 5.727532 8.773052 6.961533e-10
## p.adjust qvalue
## GO:0001525 4.926444e-13 4.019053e-13
## GO:0048514 4.926444e-13 4.019053e-13
## GO:0045765 9.929860e-08 8.100901e-08
## GO:1901342 1.137223e-07 9.277602e-08
## GO:0007186 1.137223e-07 9.277602e-08
## GO:0045766 4.022606e-07 3.281691e-07
## geneID
## GO:0001525 NPPB/COL8A2/EDN2/CCN1/F3/S1PR1/SEMA4A/SH2D2A/CHI3L1/CD34/TGFB2/COL4A3/ACKR3/CXCL8/EREG/NDNF/SEMA5A/FGF1/PDGFRB/EDN1/VEGFA/CCN2/TNFAIP3/SASH1/ITGB8/AQP1/SERPINE1/CAV1/LOXL2/ADGRB1/ANXA1/KLF4/COL27A1/ADAM12/ADM/ITGA5/MMP19/BTG1/NOX5/MMP2/PIK3R6/CCL2/C3/ANGPTL4/KLK3/APLN
## GO:0048514 NPPB/COL8A2/EDN2/CCN1/F3/S1PR1/SEMA4A/SH2D2A/CHI3L1/CD34/TGFB2/XDH/COL4A3/ACKR3/CXCL8/EREG/NDNF/SEMA5A/LOX/FGF1/PDGFRB/EDN1/VEGFA/PRDM1/CCN2/TNFAIP3/SASH1/ITGB8/AQP1/SERPINE1/CAV1/LOXL2/ADGRB1/ANXA1/KLF4/COL27A1/ADAM12/ADM/ITGA5/MMP19/BTG1/NOX5/MMP2/PIK3R6/CCL2/NOG/C3/ANGPTL4/KLK3/APLN
## GO:0045765 NPPB/F3/SEMA4A/CHI3L1/CD34/TGFB2/COL4A3/CXCL8/SEMA5A/FGF1/VEGFA/TNFAIP3/SASH1/ITGB8/AQP1/SERPINE1/ADGRB1/KLF4/ADAM12/ADM/ITGA5/BTG1/PIK3R6/C3/ANGPTL4/KLK3
## GO:1901342 NPPB/F3/SEMA4A/CHI3L1/CD34/TGFB2/COL4A3/CXCL8/SEMA5A/FGF1/VEGFA/TNFAIP3/SASH1/ITGB8/AQP1/SERPINE1/ADGRB1/KLF4/ADAM12/ADM/ITGA5/BTG1/PIK3R6/C3/ANGPTL4/KLK3
## GO:0007186 NPPB/PADI2/S1PR1/KISS1/GPR17/GPR39/ACKR3/CACNA1D/MGLL/CXCL8/EREG/ARRDC3/PDGFRB/EDN1/CCL26/CAV1/ADGRB1/ANXA1/ABCA1/LPAR1/AKR1C3/ADM/CHRM4/DRD2/GPRC5A/GPRC5D/GNG2/PLCB2/PIK3R6/PIK3R5/CCL2/GAST/GNGT2/KISS1R/S1PR4/C3/TFF2/GRPR/APLN
## GO:0045766 F3/CHI3L1/CD34/CXCL8/SEMA5A/FGF1/VEGFA/SASH1/ITGB8/AQP1/SERPINE1/KLF4/ADAM12/ADM/ITGA5/BTG1/PIK3R6/C3/ANGPTL4
## Count
## GO:0001525 46
## GO:0048514 50
## GO:0045765 26
## GO:1901342 26
## GO:0007186 39
## GO:0045766 19
The GO enrichment analysis has now been completed and stored in
ego
. Let’s observe the results in the dot plot below.
dotplot(ego, showCategory=20)
#### 5.3.2 MSigDB database fot the enrichment analysis. We next explore
enrichment analysis using the MSigDB database. MSigDB stores data from
various species and gene categories. Let’s first retrieve gene sets from
Homo sapiens and observe them.
# Load the msigdbr package
library(msigdbr)
# Retrieve gene sets for Homo sapiens
m_df <- msigdbr(species = "Homo sapiens")
# Them_df object now contains information about the Hallmark gene sets.
head(m_df)
## # A tibble: 6 × 15
## gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_gene
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 C3 MIR:MIR_Legacy AAACCAC_MIR140 ABCC4 10257 ENSG00000125257
## 2 C3 MIR:MIR_Legacy AAACCAC_MIR140 ABRAXAS2 23172 ENSG00000165660
## 3 C3 MIR:MIR_Legacy AAACCAC_MIR140 ACTN4 81 ENSG00000130402
## 4 C3 MIR:MIR_Legacy AAACCAC_MIR140 ACTN4 81 ENSG00000282844
## 5 C3 MIR:MIR_Legacy AAACCAC_MIR140 ACVR1 90 ENSG00000115170
## 6 C3 MIR:MIR_Legacy AAACCAC_MIR140 ADAM9 8754 ENSG00000168615
## # ℹ 9 more variables: human_gene_symbol <chr>, human_entrez_gene <int>,
## # human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>,
## # gs_exact_source <chr>, gs_url <chr>, gs_description <chr>
Now we have the gene sets of Homo sapiens, however, MsigDB contains 8
key categories:
1. H (Hallmark gene sets) : These are a set of 50 gene sets representing
well-defined biological states or processes.
2. C1 (Positional gene sets) : Gene sets based on chromosomal
location.
3. C2 (Curated gene sets) : Manually curated gene sets from various
sources, including pathway databases.
4. C3 (Motif gene sets) : Gene sets related to transcription factor
binding motifs.
5. C4 (Computational gene sets) : Gene sets generated through
computational methods.
6. C5 (GO gene sets) : Gene sets based on Gene Ontology terms.
7. C6 (Oncogenic signatures) : Gene sets associated with cancer-related
processes.
8. C7 (Immunologic signatures) : Gene sets related to immune system
processes.
Here, we will set the species parameter to “Homo sapiens” and the category parameter to “H” to retrieve Hallmark gene sets.
m_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>%
dplyr::select(gs_name, entrez_gene)
We will investigate overview of the gene set from the Hallmark gene sets of Homo sapiens.
# Count the number of genes in each gene set
table(m_t2g$gs_name)
##
## HALLMARK_ADIPOGENESIS
## 210
## HALLMARK_ALLOGRAFT_REJECTION
## 335
## HALLMARK_ANDROGEN_RESPONSE
## 102
## HALLMARK_ANGIOGENESIS
## 36
## HALLMARK_APICAL_JUNCTION
## 231
## HALLMARK_APICAL_SURFACE
## 46
## HALLMARK_APOPTOSIS
## 183
## HALLMARK_BILE_ACID_METABOLISM
## 114
## HALLMARK_CHOLESTEROL_HOMEOSTASIS
## 77
## HALLMARK_COAGULATION
## 162
## HALLMARK_COMPLEMENT
## 237
## HALLMARK_DNA_REPAIR
## 170
## HALLMARK_E2F_TARGETS
## 218
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
## 204
## HALLMARK_ESTROGEN_RESPONSE_EARLY
## 216
## HALLMARK_ESTROGEN_RESPONSE_LATE
## 218
## HALLMARK_FATTY_ACID_METABOLISM
## 165
## HALLMARK_G2M_CHECKPOINT
## 204
## HALLMARK_GLYCOLYSIS
## 215
## HALLMARK_HEDGEHOG_SIGNALING
## 36
## HALLMARK_HEME_METABOLISM
## 214
## HALLMARK_HYPOXIA
## 215
## HALLMARK_IL2_STAT5_SIGNALING
## 216
## HALLMARK_IL6_JAK_STAT3_SIGNALING
## 103
## HALLMARK_INFLAMMATORY_RESPONSE
## 222
## HALLMARK_INTERFERON_ALPHA_RESPONSE
## 140
## HALLMARK_INTERFERON_GAMMA_RESPONSE
## 286
## HALLMARK_KRAS_SIGNALING_DN
## 220
## HALLMARK_KRAS_SIGNALING_UP
## 220
## HALLMARK_MITOTIC_SPINDLE
## 215
## HALLMARK_MTORC1_SIGNALING
## 211
## HALLMARK_MYC_TARGETS_V1
## 236
## HALLMARK_MYC_TARGETS_V2
## 60
## HALLMARK_MYOGENESIS
## 212
## HALLMARK_NOTCH_SIGNALING
## 34
## HALLMARK_OXIDATIVE_PHOSPHORYLATION
## 220
## HALLMARK_P53_PATHWAY
## 215
## HALLMARK_PANCREAS_BETA_CELLS
## 44
## HALLMARK_PEROXISOME
## 110
## HALLMARK_PI3K_AKT_MTOR_SIGNALING
## 118
## HALLMARK_PROTEIN_SECRETION
## 98
## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY
## 58
## HALLMARK_SPERMATOGENESIS
## 144
## HALLMARK_TGF_BETA_SIGNALING
## 59
## HALLMARK_TNFA_SIGNALING_VIA_NFKB
## 228
## HALLMARK_UNFOLDED_PROTEIN_RESPONSE
## 115
## HALLMARK_UV_RESPONSE_DN
## 152
## HALLMARK_UV_RESPONSE_UP
## 191
## HALLMARK_WNT_BETA_CATENIN_SIGNALING
## 50
## HALLMARK_XENOBIOTIC_METABOLISM
## 224
Next, let’s perform an enrichment analysis (ORA) using these sets.
em <- enricher(significant_genes_map$ENTREZID,
TERM2GENE = m_t2g,
universe = background_genes_map$ENTREZID
)
The enrichment analysis has been completed and stored in em. Let’s take a look at the analyzed data.
head(em)
## ID
## HALLMARK_HYPOXIA HALLMARK_HYPOXIA
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
## HALLMARK_KRAS_SIGNALING_UP HALLMARK_KRAS_SIGNALING_UP
## HALLMARK_TNFA_SIGNALING_VIA_NFKB HALLMARK_TNFA_SIGNALING_VIA_NFKB
## HALLMARK_KRAS_SIGNALING_DN HALLMARK_KRAS_SIGNALING_DN
## HALLMARK_INFLAMMATORY_RESPONSE HALLMARK_INFLAMMATORY_RESPONSE
## Description
## HALLMARK_HYPOXIA HALLMARK_HYPOXIA
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
## HALLMARK_KRAS_SIGNALING_UP HALLMARK_KRAS_SIGNALING_UP
## HALLMARK_TNFA_SIGNALING_VIA_NFKB HALLMARK_TNFA_SIGNALING_VIA_NFKB
## HALLMARK_KRAS_SIGNALING_DN HALLMARK_KRAS_SIGNALING_DN
## HALLMARK_INFLAMMATORY_RESPONSE HALLMARK_INFLAMMATORY_RESPONSE
## GeneRatio BgRatio RichFactor
## HALLMARK_HYPOXIA 29/139 175/3586 0.1657143
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 24/139 158/3586 0.1518987
## HALLMARK_KRAS_SIGNALING_UP 18/139 151/3586 0.1192053
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 19/139 174/3586 0.1091954
## HALLMARK_KRAS_SIGNALING_DN 15/139 116/3586 0.1293103
## HALLMARK_INFLAMMATORY_RESPONSE 15/139 142/3586 0.1056338
## FoldEnrichment zScore pvalue
## HALLMARK_HYPOXIA 4.275190 8.919613 6.665908e-12
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 3.918769 7.534235 3.602349e-09
## HALLMARK_KRAS_SIGNALING_UP 3.075325 5.231697 1.449774e-05
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 2.817084 4.933743 2.937100e-05
## HALLMARK_KRAS_SIGNALING_DN 3.336021 5.135377 3.010333e-05
## HALLMARK_INFLAMMATORY_RESPONSE 2.725200 4.211952 3.135613e-04
## p.adjust qvalue
## HALLMARK_HYPOXIA 2.799681e-10 2.105023e-10
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 7.564932e-08 5.687919e-08
## HALLMARK_KRAS_SIGNALING_UP 2.029683e-04 1.526077e-04
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 2.528680e-04 1.901263e-04
## HALLMARK_KRAS_SIGNALING_DN 2.528680e-04 1.901263e-04
## HALLMARK_INFLAMMATORY_RESPONSE 2.032666e-03 1.528320e-03
## geneID
## HALLMARK_HYPOXIA 54206/1907/3491/2152/8497/3623/57007/8553/55076/901/4015/8614/7422/1490/7128/3484/3486/5054/857/6781/665/10397/1289/5209/54541/133/694/3669/51129
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 1296/3491/4148/3576/4015/5159/7422/1490/7128/3624/3486/5054/4017/6591/1289/8038/6876/50863/3678/4313/1009/1277/50509/7078
## HALLMARK_KRAS_SIGNALING_UP 7139/28951/2069/1437/2162/639/7128/3624/3486/5243/9314/80201/330/3587/5175/51129/3689/3561
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 3491/2152/3914/57007/8553/604/1437/1906/7422/6446/7128/3624/5054/19/9314/5209/330/694/6347
## HALLMARK_KRAS_SIGNALING_DN 1907/7042/80303/7068/3699/1906/6446/105375355/56154/3851/3860/3866/3780/7032/778
## HALLMARK_INFLAMMATORY_RESPONSE 2152/3576/2069/1906/3696/3624/5054/19/1902/133/3587/3678/23533/6347/10148
## Count
## HALLMARK_HYPOXIA 29
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 24
## HALLMARK_KRAS_SIGNALING_UP 18
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 19
## HALLMARK_KRAS_SIGNALING_DN 15
## HALLMARK_INFLAMMATORY_RESPONSE 15
We will visualize the data using a dotplot.
dotplot(em, showCategory = 20)
## 6. Gene Set Enrichment Analysis (GSEA) Gene Set Enrichment Analysis
(GSEA) is an alternative approach that considers the entire list of
genes ranked by their association with a phenotype or condition. Rather
than merely identifying over-represented pathways, GSEA evaluates
whether a predefined gene set (e.g., a pathway) demonstrates a
significant association with the phenotype throughout the entire ranked
gene list.
We need to define the gene sets for the GSEA analysis. In this case,
we will use MSigDB
, which we retrieved from the Hallmark
gene sets of Homo sapiens in the previous step.
table(m_t2g$gs_name)
##
## HALLMARK_ADIPOGENESIS
## 210
## HALLMARK_ALLOGRAFT_REJECTION
## 335
## HALLMARK_ANDROGEN_RESPONSE
## 102
## HALLMARK_ANGIOGENESIS
## 36
## HALLMARK_APICAL_JUNCTION
## 231
## HALLMARK_APICAL_SURFACE
## 46
## HALLMARK_APOPTOSIS
## 183
## HALLMARK_BILE_ACID_METABOLISM
## 114
## HALLMARK_CHOLESTEROL_HOMEOSTASIS
## 77
## HALLMARK_COAGULATION
## 162
## HALLMARK_COMPLEMENT
## 237
## HALLMARK_DNA_REPAIR
## 170
## HALLMARK_E2F_TARGETS
## 218
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
## 204
## HALLMARK_ESTROGEN_RESPONSE_EARLY
## 216
## HALLMARK_ESTROGEN_RESPONSE_LATE
## 218
## HALLMARK_FATTY_ACID_METABOLISM
## 165
## HALLMARK_G2M_CHECKPOINT
## 204
## HALLMARK_GLYCOLYSIS
## 215
## HALLMARK_HEDGEHOG_SIGNALING
## 36
## HALLMARK_HEME_METABOLISM
## 214
## HALLMARK_HYPOXIA
## 215
## HALLMARK_IL2_STAT5_SIGNALING
## 216
## HALLMARK_IL6_JAK_STAT3_SIGNALING
## 103
## HALLMARK_INFLAMMATORY_RESPONSE
## 222
## HALLMARK_INTERFERON_ALPHA_RESPONSE
## 140
## HALLMARK_INTERFERON_GAMMA_RESPONSE
## 286
## HALLMARK_KRAS_SIGNALING_DN
## 220
## HALLMARK_KRAS_SIGNALING_UP
## 220
## HALLMARK_MITOTIC_SPINDLE
## 215
## HALLMARK_MTORC1_SIGNALING
## 211
## HALLMARK_MYC_TARGETS_V1
## 236
## HALLMARK_MYC_TARGETS_V2
## 60
## HALLMARK_MYOGENESIS
## 212
## HALLMARK_NOTCH_SIGNALING
## 34
## HALLMARK_OXIDATIVE_PHOSPHORYLATION
## 220
## HALLMARK_P53_PATHWAY
## 215
## HALLMARK_PANCREAS_BETA_CELLS
## 44
## HALLMARK_PEROXISOME
## 110
## HALLMARK_PI3K_AKT_MTOR_SIGNALING
## 118
## HALLMARK_PROTEIN_SECRETION
## 98
## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY
## 58
## HALLMARK_SPERMATOGENESIS
## 144
## HALLMARK_TGF_BETA_SIGNALING
## 59
## HALLMARK_TNFA_SIGNALING_VIA_NFKB
## 228
## HALLMARK_UNFOLDED_PROTEIN_RESPONSE
## 115
## HALLMARK_UV_RESPONSE_DN
## 152
## HALLMARK_UV_RESPONSE_UP
## 191
## HALLMARK_WNT_BETA_CATENIN_SIGNALING
## 50
## HALLMARK_XENOBIOTIC_METABOLISM
## 224
Input for GSEA is different from ORA. Instead of selecting only genes
of interest, GSEA utilizes all differential expression data, including
fold change and p-value from all genes.
We will use the results from the previous step (res
) as
input. Therefore, we need to filter the genes and change the data type
to a data frame using the following code, then use res_df
for further processing.
res_df <- res %>%
as.data.frame() %>%
filter(baseMean != 0) %>%
tibble::rownames_to_column(var = "gene")
GSEA requires the data to be pre-ranked based on a metric that
combines fold change and significance, such as the signed fold change
multiplied by the negative logarithm of the p-value.
Additionally, we include the ENTREZID in the data frame.
res_df <- res_df %>%
mutate(signed_rank_stats = sign(log2FoldChange) * -log10(pvalue)) %>%
left_join(background_genes_map, by = c("gene" = "SYMBOL")) %>%
arrange(desc(signed_rank_stats))
One issue that may arise is the presence of infinite values in the
data, which can cause errors during GSEA.
To address this, we replace infinite values with large numbers so that
they are ranked high:
res_df <- res_df %>%
mutate(negative_log10pvalue = -log10(pvalue)) %>%
mutate(negative_log10pvalue = ifelse(is.infinite(negative_log10pvalue), 1000, negative_log10pvalue)) %>%
mutate(signed_rank_stats = sign(log2FoldChange) * negative_log10pvalue)
This is the final step of input preparation for GSEA. We will prepare
gene_list
that contains signed_rank_stats
and
gene identification (ENTREZID
)
# Assign `signed_rank_stats` as `gene_list`
gene_list <- res_df$signed_rank_stats
# Name the `gene_list` by ENTREZID
names(gene_list) <- res_df$ENTREZID
We perform fast GSEA using our pre-ranked gene list
(gene_list
) and a reference gene set, represented by
m_t2g
. The analyzed data is stored in em2
.
em2 <- GSEA(gene_list, TERM2GENE = m_t2g)
head(em2)
## ID
## HALLMARK_E2F_TARGETS HALLMARK_E2F_TARGETS
## HALLMARK_G2M_CHECKPOINT HALLMARK_G2M_CHECKPOINT
## HALLMARK_OXIDATIVE_PHOSPHORYLATION HALLMARK_OXIDATIVE_PHOSPHORYLATION
## HALLMARK_HYPOXIA HALLMARK_HYPOXIA
## HALLMARK_MYC_TARGETS_V1 HALLMARK_MYC_TARGETS_V1
## HALLMARK_TNFA_SIGNALING_VIA_NFKB HALLMARK_TNFA_SIGNALING_VIA_NFKB
## Description setSize
## HALLMARK_E2F_TARGETS HALLMARK_E2F_TARGETS 197
## HALLMARK_G2M_CHECKPOINT HALLMARK_G2M_CHECKPOINT 195
## HALLMARK_OXIDATIVE_PHOSPHORYLATION HALLMARK_OXIDATIVE_PHOSPHORYLATION 190
## HALLMARK_HYPOXIA HALLMARK_HYPOXIA 175
## HALLMARK_MYC_TARGETS_V1 HALLMARK_MYC_TARGETS_V1 191
## HALLMARK_TNFA_SIGNALING_VIA_NFKB HALLMARK_TNFA_SIGNALING_VIA_NFKB 174
## enrichmentScore NES pvalue
## HALLMARK_E2F_TARGETS -0.7867385 -2.208564 1.000000e-10
## HALLMARK_G2M_CHECKPOINT -0.7627360 -2.146758 1.000000e-10
## HALLMARK_OXIDATIVE_PHOSPHORYLATION -0.7363473 -2.060504 1.000000e-10
## HALLMARK_HYPOXIA 0.9373476 2.038974 1.000000e-10
## HALLMARK_MYC_TARGETS_V1 -0.6939122 -1.945010 9.891268e-09
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.8766807 1.909870 1.880431e-07
## p.adjust qvalue rank
## HALLMARK_E2F_TARGETS 1.250000e-09 6.842105e-10 1758
## HALLMARK_G2M_CHECKPOINT 1.250000e-09 6.842105e-10 1571
## HALLMARK_OXIDATIVE_PHOSPHORYLATION 1.250000e-09 6.842105e-10 1861
## HALLMARK_HYPOXIA 1.250000e-09 6.842105e-10 626
## HALLMARK_MYC_TARGETS_V1 9.891268e-08 5.414168e-08 1093
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 1.567026e-06 8.577404e-07 1035
## leading_edge
## HALLMARK_E2F_TARGETS tags=59%, list=13%, signal=52%
## HALLMARK_G2M_CHECKPOINT tags=50%, list=12%, signal=45%
## HALLMARK_OXIDATIVE_PHOSPHORYLATION tags=49%, list=14%, signal=43%
## HALLMARK_HYPOXIA tags=48%, list=5%, signal=46%
## HALLMARK_MYC_TARGETS_V1 tags=65%, list=8%, signal=61%
## HALLMARK_TNFA_SIGNALING_VIA_NFKB tags=39%, list=8%, signal=37%
## core_enrichment
## HALLMARK_E2F_TARGETS 7398/27338/3978/9972/10635/55635/3148/25842/79019/4605/1633/9133/10248/10549/9125/9833/6839/3184/10535/10733/3930/10212/253714/51155/10615/10714/5558/2146/4288/29089/9319/5982/84844/23594/8726/24137/3161/4999/3015/10797/9787/84296/672/9126/23649/5889/983/3835/7514/1965/57122/3838/29028/9212/8914/3609/9700/11340/5424/4085/7153/204/701/4001/7112/6790/9918/4830/55646/6628/6427/4678/10051/29127/5631/3070/1111/332/1854/5591/11051/8243/2547/79075/54962/5983/991/4436/23165/10492/4171/6749/1503/5347/7374/9238/993/10527/1019/1164/4174/3159/5902/6426/1786/79077/5111/4175/4172/4176/10528/5901/4173/5425/1434/5036/9221
## HALLMARK_G2M_CHECKPOINT 4605/9133/10403/3980/10592/6839/3184/6713/10733/1810/3832/3930/1874/3619/10212/51155/5558/1869/8454/2146/4288/8438/6599/23594/24137/3161/3015/22809/5933/9585/7027/3151/890/699/23649/10036/262/983/22974/3835/7514/3838/990/5001/7150/7290/55872/9212/10926/9184/3609/9700/10419/51776/4085/7153/8318/4001/7112/6790/10432/6427/4678/7272/10051/29127/10146/1111/332/7371/8243/4953/9156/7090/6421/3837/3192/991/10492/4171/51659/10772/5347/993/1019/1164/6632/4174/3159/6426/8140/4175/4172/3312/85236/4691/1736/9221
## HALLMARK_OXIDATIVE_PHOSPHORYLATION 4726/10975/4702/4720/7384/4694/1743/4717/23530/509/4835/291/6832/65003/9551/10939/1340/4723/4729/1351/513/4552/374291/5018/2746/3420/4712/7417/50/5250/2108/10440/9927/7386/1355/1374/3032/8604/64981/10989/34/29796/3028/4706/1892/5447/8802/56993/10884/7388/29088/51318/4697/26517/9377/3313/92609/1738/4704/2395/533/1349/64960/1737/54704/6390/23203/4191/517/515/5435/2271/7416/518/6392/56945/3419/54205/3417/9131/38/516/11331/498/10935/10128/1537/1431/5160/2806/506/292/3945
## HALLMARK_HYPOXIA 54206/8553/8614/3486/5054/665/10397/54541/133/6515/5230/2355/857/3491/6513/5209/8497/7045/8974/6385/1316/7052/3725/205/7422/55818/4627/694/3099/901/5155/5163/7167/2026/2632/2113/55076/1907/1490/302/5033/10370/55139/6095/23645/8660/1649/2152/284119/10957/3098/2023/5214/23764/4783/5211/5236/30001/3309/7852/1026/4601/112464/26118/467/5329/6533/2597/51129/26136/3669/54800/5066/23327/4015/7128/2131/7436/2997/26355/6781/3623/23036/3516
## HALLMARK_MYC_TARGETS_V1 6637/5250/4999/3015/8662/10921/8664/65005/7027/3066/3608/6193/3178/890/6128/5093/4706/2058/7514/1965/3838/2739/7298/55651/9045/8886/1982/9184/10054/5984/328/9377/6428/4085/5478/6633/8318/26354/7965/1964/1478/6627/51690/6194/4830/23435/9136/6427/3376/9188/10146/1854/10856/6634/10971/2547/4953/7416/2079/56910/10291/51491/3837/1207/5496/3192/23196/8669/991/6732/1665/10155/23450/10492/10728/4171/1503/6187/3336/3181/11331/1019/3615/6418/10935/8761/2107/6632/7555/3735/1537/4686/4174/5902/26156/6426/6059/3183/7284/6723/5111/4175/10576/4176/790/10528/6741/5901/2806/5245/4173/10574/6950/26135/7203/708/5425/10575/3326/22948/5036/4869/9221/3329
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 3914/8553/5054/6515/2355/3491/5209/6385/1316/3725/7422/694/9314/10318/22822/4616/23645/8660/2152/10957/10769/4082/56937/1827/6446/23764/4783/329/7050/3383/23135/9120/10725/19/1026/467/5329/2114/8744/330/960/8061/4084/7128/604/2354/4791/7538/10938/6303/1847/25976/3976/23258/4088/3572/5606/1906/3280/4792/9516/687/3624/1437/4973/1846/1839/3371
dotplot(em2, showCategory=20)
Visualization for the GSEA of HALLMARK_P53_PATHWAY
p1 <- gseaplot(em2, geneSetID = "HALLMARK_P53_PATHWAY",
by = "runningScore", title = "HALLMARK_P53_PATHWAY")
p1
Visualization for the GSEA of HALLMARK_HYPOXIA
p2 <- gseaplot(em2, geneSetID = "HALLMARK_HYPOXIA",
by = "runningScore", title = "HALLMARK_HYPOXIA")
p2