nf-core
Count Matrix to Differential Expression Analysis (DESeq2
)
and Functional AnalysisDESeq2
ENSEMBL
to readable format (SYMBOL
)For the downstream analysis of transcriptomics, we will run all code
in RStudio. The gene counts will serve as the input file, and DESeq2
will be used for differential expression analysis. After identifying
differentially expressed genes, we will utilize various tools and
visualizations, as shown in the picture below.
Let’s start with project organization. I organized the folders in
this project to ensure reproducibility.
I created a set of routine folders, which include data, results, and
scripts.
Furthermore, I run all R scripts using the R server in Docker and have also set up the R project and Git version control. For more details, please see My Computer Workbench.
Here, we obtain the input file from the upstream process by
navigating to the outdir/star_rsem
directory, where you
will find a file named rsem.merged.gene_counts.tsv
.
Next, we will load the readr
and dplyr
libraries to read the .tsv file and manage the data within it.
library(dplyr)
library(readr)
library(here)
# Define location of the file
raw_counts_file <- here("data/HPC/result/STAR_RSEM", "rsem.merged.gene_counts.tsv")
# Import the file
raw_counts <- read_tsv(raw_counts_file)
# Observe head of the data
head(raw_counts)
## # A tibble: 6 × 6
## gene_id `transcript_id(s)` sgCTRL_Hypox_Rep1 sgCTRL_Hypox_Rep2
## <chr> <chr> <dbl> <dbl>
## 1 ENSG00000000003 ENST00000373020,ENST00000… 757 734
## 2 ENSG00000000005 ENST00000373031,ENST00000… 0 0
## 3 ENSG00000000419 ENST00000371582,ENST00000… 2333. 2390.
## 4 ENSG00000000457 ENST00000367770,ENST00000… 324. 377.
## 5 ENSG00000000460 ENST00000286031,ENST00000… 417. 451.
## 6 ENSG00000000938 ENST00000374003,ENST00000… 4 3
## # ℹ 2 more variables: sgCTRL_Normox_Rep1 <dbl>, sgCTRL_Normox_Rep2 <dbl>
To simplify the data format, we will exclude the
transcript_id(s)
column and rename the gene_id
column to ENSEMBL
.
# Exclude column 2 which is `transcript_id(s)
raw_counts <- raw_counts[, -2]
# Rename the `gene_id` to `SYMBOL`
colnames(raw_counts) <- c("ENSEMBL",
"sgCTRL_Hypox_Rep1", "sgCTRL_Hypox_Rep2",
"sgCTRL_Normox_Rep1", "sgCTRL_Normox_Rep2")
Currently, the gene counts in raw_counts
are in
character format. We will convert raw_counts
to a matrix,
as the gene counts should be treated as numeric values. Additionally, we
need to convert all count values in raw_counts_mat
to
integers, in accordance with DESeq2’s input requirements.
# Change only count values to matrix
raw_counts_mat <- raw_counts[, -1] %>%
as.matrix()
# Convert all values in `raw_counts_mat` to integer according to DESeq2 input request.
raw_counts_mat <- apply(raw_counts_mat, 2, as.integer)
# Add rownames
rownames(raw_counts_mat) <- raw_counts$ENSEMBL
head(raw_counts_mat)
## sgCTRL_Hypox_Rep1 sgCTRL_Hypox_Rep2 sgCTRL_Normox_Rep1
## ENSG00000000003 757 734 577
## ENSG00000000005 0 0 0
## ENSG00000000419 2333 2390 2173
## ENSG00000000457 323 377 203
## ENSG00000000460 417 450 507
## ENSG00000000938 4 3 3
## sgCTRL_Normox_Rep2
## ENSG00000000003 677
## ENSG00000000005 0
## ENSG00000000419 2671
## ENSG00000000457 298
## ENSG00000000460 690
## ENSG00000000938 14
In most cases of RNA-seq analysis, we subset only protein-coding
genes for several reasons, including primary biological function, higher
expression levels compared to non-coding genes, accurate gene
annotation, reduction of noise, simplified data interpretation, and
targeted research goals (specific pathways or cellular processes).
library(org.Hs.eg.db)
# list all keytypes from `org.Hs.eg.db` to see what is gene type was named.
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
We will retrieve the GENETYPE
of raw_counts
(the same data as raw_counts_mat
) using
SYMBOL
. Here, we will create a map
that
matches SYMBOL
with our data (raw_counts
and
raw_counts_mat
).
map <- AnnotationDbi::select(org.Hs.eg.db,
keys = raw_counts$ENSEMBL , # column that contain key (we want to convert this col.)
keytype = "ENSEMBL", # type of key in the column,
columns = "GENETYPE") # What we want to retrieve
head(map)
## ENSEMBL GENETYPE
## 1 ENSG00000000003 protein-coding
## 2 ENSG00000000005 protein-coding
## 3 ENSG00000000419 protein-coding
## 4 ENSG00000000457 protein-coding
## 5 ENSG00000000460 protein-coding
## 6 ENSG00000000938 protein-coding
Observe the types of gene.
unique(map$GENETYPE)
## [1] "protein-coding" "pseudo" "ncRNA" NA
## [5] "unknown" "snoRNA" "snRNA" "rRNA"
## [9] "other" "scRNA"
We will select only protein-coding
genes by defining
them in map_CodingGene
. Then, we will use
map_CodingGene
to subset our data.
map_CodingGene <- map %>%
dplyr::filter(GENETYPE == "protein-coding")
head(map_CodingGene)
## ENSEMBL GENETYPE
## 1 ENSG00000000003 protein-coding
## 2 ENSG00000000005 protein-coding
## 3 ENSG00000000419 protein-coding
## 4 ENSG00000000457 protein-coding
## 5 ENSG00000000460 protein-coding
## 6 ENSG00000000938 protein-coding
Show number of protein-coding
genes.
nrow(map_CodingGene)
## [1] 19519
From raw_count_mat
, we will subset only the common
genes, which are the intersecting genes between
raw_count_mat
and map_CodingGene
(protein-coding genes).
# Define the `common_genes` = genes that are in both `raw_count_mat` and `map_CodingGene`
common_genes <- intersect(rownames(raw_counts_mat), map_CodingGene$ENSEMBL)
# subset the common genes and re-order them by `coding_genes`
raw_counts_mat <- raw_counts_mat[common_genes, ]
head(raw_counts_mat)
## sgCTRL_Hypox_Rep1 sgCTRL_Hypox_Rep2 sgCTRL_Normox_Rep1
## ENSG00000000003 757 734 577
## ENSG00000000005 0 0 0
## ENSG00000000419 2333 2390 2173
## ENSG00000000457 323 377 203
## ENSG00000000460 417 450 507
## ENSG00000000938 4 3 3
## sgCTRL_Normox_Rep2
## ENSG00000000003 677
## ENSG00000000005 0
## ENSG00000000419 2671
## ENSG00000000457 298
## ENSG00000000460 690
## ENSG00000000938 14
DESeq2
DESeq2
analysis: Prepare
ColData
The DESeq2
analysis needs 2 input files which are
countData
(which we have as raw_counts_mat
)
and colData (which we will prepare).
# Define condition of each column
coldata <- data.frame(condition = c("hypoxia", "hypoxia", "normoxia", "normoxia"))
rownames(coldata) <- colnames(raw_counts_mat)
# Recheck whether the name is match with the defined condition
coldata
## condition
## sgCTRL_Hypox_Rep1 hypoxia
## sgCTRL_Hypox_Rep2 hypoxia
## sgCTRL_Normox_Rep1 normoxia
## sgCTRL_Normox_Rep2 normoxia
DESeq2
This starts by creating a DESeq2 object (dds
) using
raw_counts_mat
and coldata
as inputs. This
object will use the condition as the main factor.
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = raw_counts_mat,
colData = coldata,
design = ~ condition)
dds <- DESeq(dds)
dds
## class: DESeqDataSet
## dim: 19519 4
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(19519): ENSG00000000003 ENSG00000000005 ... ENSG00000293542
## ENSG00000293543
## rowData names(22): baseMean baseVar ... deviance maxCooks
## colnames(4): sgCTRL_Hypox_Rep1 sgCTRL_Hypox_Rep2 sgCTRL_Normox_Rep1
## sgCTRL_Normox_Rep2
## colData names(2): condition sizeFactor
To obtain differential gene expression, the two conditions (normoxia and hypoxia) will be compared.
res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))
res
## log2 fold change (MLE): condition hypoxia vs normoxia
## Wald test p-value: condition hypoxia vs normoxia
## DataFrame with 19519 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 681.296 0.00331546 0.1141074 0.0290556 9.76820e-01
## ENSG00000000005 0.000 NA NA NA NA
## ENSG00000000419 2391.024 -0.28016274 0.0660624 -4.2408819 2.22643e-05
## ENSG00000000457 294.175 0.25005363 0.1748603 1.4300194 1.52711e-01
## ENSG00000000460 520.260 -0.70213180 0.1297147 -5.4128924 6.20147e-08
## ... ... ... ... ... ...
## ENSG00000292373 5.053874 -0.8190937 1.24552 -0.657632 0.510775
## ENSG00000293137 225.412652 -0.0273231 0.19486 -0.140219 0.888487
## ENSG00000293435 0.614115 -2.8461645 3.99107 -0.713133 0.475763
## ENSG00000293542 0.000000 NA NA NA NA
## ENSG00000293543 4.600170 -0.3293342 1.31914 -0.249659 0.802851
## padj
## <numeric>
## ENSG00000000003 9.83702e-01
## ENSG00000000005 NA
## ENSG00000000419 8.16885e-05
## ENSG00000000457 2.46176e-01
## ENSG00000000460 3.18345e-07
## ... ...
## ENSG00000292373 0.630954
## ENSG00000293137 0.926124
## ENSG00000293435 NA
## ENSG00000293542 NA
## ENSG00000293543 0.864271
Firstly, we will examine the distribution of the p-value using a histogram.
library(ggplot2)
res %>% as.data.frame() %>%
arrange(padj) %>%
ggplot(aes(x=pvalue)) +
geom_histogram(color = "white", bins = 50)
According to Tommy
recommendation, genes with small counts can distort the p-values, so
we will compare the counts between those with low and high p-values
using the box plot below.
res %>% as.data.frame() %>%
tibble::rownames_to_column(var = "gene_id") %>%
filter(!is.na(pvalue)) %>%
mutate(pvalue_bin = if_else(pvalue > 0.75, "high", "low")) %>%
ggplot(aes(x= pvalue_bin, y = log2(baseMean))) +
geom_boxplot()
As the box plot shows, genes with high p-values have lower gene
expression. Therefore, we will remove the genes with low counts.
# Counts across 4 samples lower than 10 will be removed (This number can be changes.)
dds <- dds[rowSums(counts(dds)) > 10,]
# Recalculate differential gene expression
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "hypoxia", "normoxia"))
# Re-observe the distribution of p-value
res %>% as.data.frame() %>%
arrange(padj) %>%
ggplot(aes(x=pvalue)) +
geom_histogram(color = "white", bins = 50)
Now, the distribution of p-value looks better.
The volcano plot will label the most differentially expressed genes
as either upregulated or downregulated.
ENSEMBL
to readable format
(SYMBOL
)library(org.Hs.eg.db)
res_forMap <- res %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "ENSEMBL")
map <- AnnotationDbi::select(org.Hs.eg.db,
keys = res_forMap$ENSEMBL , # column that contain key (we want to convert this col.)
keytype = "ENSEMBL", # type of key in the column,
columns= "SYMBOL")# What we want to retrieve
head(map)
## ENSEMBL SYMBOL
## 1 ENSG00000000003 TSPAN6
## 2 ENSG00000000419 DPM1
## 3 ENSG00000000457 SCYL3
## 4 ENSG00000000460 FIRRM
## 5 ENSG00000000938 FGR
## 6 ENSG00000001036 FUCA2
Adding SYMBOL
to res_map
.
res_map <- left_join(res_forMap, map)
head(res_map)
## ENSEMBL baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSG00000000003 681.296045 0.003393615 0.11160564 0.0304072 9.757423e-01
## 2 ENSG00000000419 2391.023676 -0.280150021 0.06539050 -4.2842618 1.833469e-05
## 3 ENSG00000000457 294.175101 0.249896622 0.17052780 1.4654304 1.428035e-01
## 4 ENSG00000000460 520.260176 -0.702192509 0.12671529 -5.5414978 2.998954e-08
## 5 ENSG00000000938 5.898808 -1.445522599 1.19412125 -1.2105325 2.260746e-01
## 6 ENSG00000001036 1970.721295 -0.250948895 0.07029303 -3.5700393 3.569277e-04
## padj SYMBOL
## 1 9.829441e-01 TSPAN6
## 2 6.357477e-05 DPM1
## 3 2.211256e-01 SCYL3
## 4 1.482511e-07 FIRRM
## 5 3.241821e-01 FGR
## 6 1.008703e-03 FUCA2
We will first define genes that we will label as
genes_to_label
, which includes only genes with a log2 fold
change greater than 2.5, an adjusted p-value lower than 0.001, and names
that do not start with LOC
(indicating genes for which a
published symbol is not available).
genes_to_label <- res_map %>%
filter(!stringr::str_detect(SYMBOL, "LOC"),
abs(log2FoldChange) >= 2.5,
padj <= 0.001)
head(genes_to_label)
## ENSEMBL baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSG00000006606 47.08133 3.537408 0.53497078 6.612338 3.782959e-11
## 2 ENSG00000006659 20.89578 5.232378 1.15501146 4.530152 5.894124e-06
## 3 ENSG00000013588 2360.00398 2.931581 0.07820453 37.486076 1.553028e-307
## 4 ENSG00000019186 274.23162 3.098137 0.20695901 14.969810 1.156517e-50
## 5 ENSG00000023445 127.07710 3.772602 0.34105723 11.061494 1.928580e-28
## 6 ENSG00000024422 719.66484 6.362696 0.27362463 23.253375 1.314978e-119
## padj SYMBOL
## 1 2.427837e-10 CCL26
## 2 2.192853e-05 LGALS14
## 3 1.171392e-304 GPRC5A
## 4 4.918116e-49 CYP24A1
## 5 3.646237e-27 BIRC3
## 6 1.983680e-117 EHD2
Moreover, we want to label the significantly different genes by
color. Therefore, the significant (sig
) and not significant
(not_sig
) genes will be defined by the code below for
labeling purposes.
res_label <- res_map %>%
mutate(sig = case_when(
!stringr::str_detect(SYMBOL, "LOC") &
abs(log2FoldChange) >= 2.5 &
padj <= 0.001 ~ "sig",
TRUE ~ "not_sig"
))
head(res_label)
## ENSEMBL baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSG00000000003 681.296045 0.003393615 0.11160564 0.0304072 9.757423e-01
## 2 ENSG00000000419 2391.023676 -0.280150021 0.06539050 -4.2842618 1.833469e-05
## 3 ENSG00000000457 294.175101 0.249896622 0.17052780 1.4654304 1.428035e-01
## 4 ENSG00000000460 520.260176 -0.702192509 0.12671529 -5.5414978 2.998954e-08
## 5 ENSG00000000938 5.898808 -1.445522599 1.19412125 -1.2105325 2.260746e-01
## 6 ENSG00000001036 1970.721295 -0.250948895 0.07029303 -3.5700393 3.569277e-04
## padj SYMBOL sig
## 1 9.829441e-01 TSPAN6 not_sig
## 2 6.357477e-05 DPM1 not_sig
## 3 2.211256e-01 SCYL3 not_sig
## 4 1.482511e-07 FIRRM not_sig
## 5 3.241821e-01 FGR not_sig
## 6 1.008703e-03 FUCA2 not_sig
Now, we will create the volcano plot using the ggplot2
library and use ggrepel
to prevent label overlap.
Additionally, we will add horizontal and vertical lines to mark the
highly significant levels based on p-values and fold change,
respectively.
library(ggplot2)
library(ggrepel)
ggplot(res_label, aes(x = log2FoldChange, y = -log10(pvalue))) +
geom_point(aes(color = sig)) +
scale_color_manual(values = c("blue", "red")) +
ggrepel::geom_label_repel(data = genes_to_label, aes(label = SYMBOL))+
geom_hline(yintercept = 100, linetype = 2, color = "red") +
geom_vline(xintercept = c(-2.5, 2.5), linetype = 2, color = "red")+
theme_bw(base_size = 14)+
labs(title = "Hypoxia vs Normoxia")
### 4.2 Principal Component Analysis (PCA) In genomics, PCA is often
used to visualize and explore variations in gene expression data across
different samples or conditions. Here, we expect to see the same
samples, but different replicates clustering together.
First, we need to normalize the data. This normalized data
(normalized_counts
) will be used as input for both PCA and
the heatmap (in the next section).
# Calculate the normalized data for PCA plot
vsd <- vst(dds, blind=FALSE)
normalized_counts <- assay(vsd) %>% as.matrix()
# Observe the normalized data
head(normalized_counts)
## sgCTRL_Hypox_Rep1 sgCTRL_Hypox_Rep2 sgCTRL_Normox_Rep1
## ENSG00000000003 12.23842 12.22585 12.24529
## ENSG00000000419 12.73578 12.74558 12.86559
## ENSG00000000457 12.00239 12.03673 11.96531
## ENSG00000000460 12.06331 12.08121 12.20255
## ENSG00000000938 11.60129 11.59438 11.60139
## ENSG00000001036 12.63943 12.64492 12.72490
## sgCTRL_Normox_Rep2
## ENSG00000000003 12.21803
## ENSG00000000419 12.84417
## ENSG00000000457 11.99568
## ENSG00000000460 12.22429
## ENSG00000000938 11.64762
## ENSG00000001036 12.74701
# Calculate the principal components
pca_prcomp <- prcomp(t(normalized_counts),
center = TRUE,
scale. = FALSE)
# This is the PC values
pca_prcomp$x
## PC1 PC2 PC3 PC4
## sgCTRL_Hypox_Rep1 -9.526003 1.6700232 -0.7426898 6.517009e-14
## sgCTRL_Hypox_Rep2 -9.594833 -1.6612802 0.7438744 6.385440e-14
## sgCTRL_Normox_Rep1 9.532438 -0.9430173 -1.3195604 6.548234e-14
## sgCTRL_Normox_Rep2 9.588398 0.9342743 1.3183758 6.598845e-14
We will next use the data from pca_prcomp$x
and prepare
it for PCA plotting.
# Create a DataFrame for visualization
PC1_and_PC2 <- data.frame(PC1 = pca_prcomp$x[, 1],
PC2 = pca_prcomp$x[, 2],
type = rownames(pca_prcomp$x))
PC1_and_PC2
## PC1 PC2 type
## sgCTRL_Hypox_Rep1 -9.526003 1.6700232 sgCTRL_Hypox_Rep1
## sgCTRL_Hypox_Rep2 -9.594833 -1.6612802 sgCTRL_Hypox_Rep2
## sgCTRL_Normox_Rep1 9.532438 -0.9430173 sgCTRL_Normox_Rep1
## sgCTRL_Normox_Rep2 9.588398 0.9342743 sgCTRL_Normox_Rep2
ggplot(PC1_and_PC2, aes(x = PC1, y = PC2, col = type)) +
geom_point() +
geom_text(aes(label = type), hjust = 0, vjust = 0) +
coord_fixed()+
theme_classic()
### 4.3 Heatmap Heatmaps are useful in genomics and other scientific
fields where large datasets need to be analyzed. For heatmap plotting,
only significantly different genes will be displayed, including both
upregulated and downregulated genes.
The significant genes are defined by an adjusted p-value of less than 0.01 and a log2 fold change greater than 2.
significant_genes <- res %>%
as.data.frame() %>%
filter(padj <= 0.01,
abs(log2FoldChange) >= 2) %>%
rownames()
head(significant_genes, 10)
## [1] "ENSG00000003096" "ENSG00000006606" "ENSG00000006659" "ENSG00000008517"
## [5] "ENSG00000010379" "ENSG00000012822" "ENSG00000013588" "ENSG00000019186"
## [9] "ENSG00000019549" "ENSG00000023445"
As you saw, significant_genes
is simply a list of genes.
We will now use this list to filter our normalized data
(normalized_counts
).
significant_mat <- normalized_counts[significant_genes, ]
We have four samples, with two replicates each for
normoxic
and hypoxic
conditions. We will label
the hypoxic
and normoxic
samples by coloring
the heads of the heatmap columns.
library(ComplexHeatmap)
col_anno <- HeatmapAnnotation(
df = coldata,
col = list(condition = c("hypoxia" = "red",
"normoxia" = "blue"))
)
Before we plot the heatmap, we need to scale it by following steps.
We transform rows (genes) into columns, so that each column
represents a gene.
The scaling will be performed independently for each column (this
is the default behavior in R) by subtracting the mean of each row from
its values and then dividing by the standard deviation of that row.
After scaling, each row will have a mean of 0 and a standard deviation
of 1.
We then transpose it back, so that the genes return to being in
rows.
In conclusion, scaling is performed independently for each
gene.
Please note that the scaled values will typically range from negative
to positive, depending on how far the original values deviate from their
mean. Therefore, the range is not fixed.
# Scaling the data
scaled_sig_mat <- t(scale(t(significant_mat)))
# Observe min and max values of the scaled data
range_scale <- c(min(scaled_sig_mat), max(scaled_sig_mat))
range_scale
## [1] -1.240994 1.500000
Heatmap(
scaled_sig_mat,
top_annotation = col_anno,
show_row_names = FALSE,
name = "scaled normalized\nexpression"
)
Over-Representation Analysis (ORA) is a method that compares a list
of genes of interest (e.g., DEGs) to a background set (all genes in the
experiment) to identify pathways that are over-represented among the
genes of interest.
In this case, we will use significant_genes, which have already been defined in the previous section as the genes of interest.
significant_genes
head(significant_genes)
## [1] "ENSG00000003096" "ENSG00000006606" "ENSG00000006659" "ENSG00000008517"
## [5] "ENSG00000010379" "ENSG00000012822"
background_genes <- res %>%
as.data.frame() %>%
filter(baseMean != 0) %>%
tibble::rownames_to_column(var = "ENSEMBL") %>%
pull(ENSEMBL)
According to further analysis, gene identification is mostly required
in ENTREZID format (except enrichGO
, where ENSEMBL is
accepted). Therefore, we need a conversion.
library(org.Hs.eg.db)
library(clusterProfiler)
significant_genes_map <- clusterProfiler::bitr(geneID = significant_genes,
fromType = "ENSEMBL",
toType = "ENTREZID",
OrgDb = "org.Hs.eg.db")
head(significant_genes_map)
## ENSEMBL ENTREZID
## 1 ENSG00000003096 90293
## 2 ENSG00000006606 10344
## 3 ENSG00000006659 56891
## 4 ENSG00000008517 9235
## 5 ENSG00000010379 6540
## 6 ENSG00000012822 57658
background_genes_map <- bitr(geneID = background_genes,
fromType = "ENSEMBL",
toType = "ENTREZID",
OrgDb = "org.Hs.eg.db")
head(background_genes_map)
## ENSEMBL ENTREZID
## 1 ENSG00000000003 7105
## 2 ENSG00000000419 8813
## 3 ENSG00000000457 57147
## 4 ENSG00000000460 55732
## 5 ENSG00000000938 2268
## 6 ENSG00000001036 2519
We will use two different gene sets, including Gene Ontology (GO) and the Molecular Signatures Database (MSigDB), for the ORA analysis.
For GO gene set, there are 3 main classification which are
1. MF: Molecular Function
2. CC: Cellular Component
3. BP: Biological Process
Here, we will use BP for gene ontology and Benjamini & Hochberg correction to adjust p-values for multiple testing.
ego <- enrichGO(gene = significant_genes_map$ENTREZID,
universe = background_genes_map$ENTREZID ,
OrgDb = org.Hs.eg.db, # Specifies the organism: “org.Hs.eg.db” for Homo sapiens
ont = "BP", # Specified the ontology: BP for Biological Process
pAdjustMethod = "BH", # The method used to adjjust p-values: BH for Benjamini & Hochberg correction
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
head(ego)
## ID Description GeneRatio BgRatio
## GO:0001525 GO:0001525 angiogenesis 45/334 421/12958
## GO:0048514 GO:0048514 blood vessel morphogenesis 49/334 496/12958
## GO:0006935 GO:0006935 chemotaxis 31/334 288/12958
## GO:0042330 GO:0042330 taxis 31/334 288/12958
## GO:0045765 GO:0045765 regulation of angiogenesis 26/334 215/12958
## GO:1901342 GO:1901342 regulation of vasculature development 26/334 219/12958
## RichFactor FoldEnrichment zScore pvalue p.adjust
## GO:0001525 0.10688836 4.146884 10.677068 3.058372e-16 5.508524e-13
## GO:0048514 0.09879032 3.832710 10.463476 3.154036e-16 5.508524e-13
## GO:0006935 0.10763889 4.176002 8.865754 1.438644e-11 1.256296e-08
## GO:0042330 0.10763889 4.176002 8.865754 1.438644e-11 1.256296e-08
## GO:0045765 0.12093023 4.691659 8.878342 5.138188e-11 3.589538e-08
## GO:1901342 0.11872146 4.605966 8.753929 7.787389e-11 4.533558e-08
## qvalue
## GO:0001525 4.438891e-13
## GO:0048514 4.438891e-13
## GO:0006935 1.012351e-08
## GO:0042330 1.012351e-08
## GO:0045765 2.892529e-08
## GO:1901342 3.653242e-08
## geneID
## GO:0001525 SH2D2A/EDN1/MMP2/TGFB2/ITGB8/CAV1/SERPINE1/CCL2/SASH1/VEGFA/SEMA5A/FGF1/PDGFRB/F3/TNFAIP3/CCN2/NPPB/MMP19/C3/EDN2/CHI3L1/BTG1/LOXL2/ANXA1/KLF4/KLK3/CCN1/ACKR3/ADAM12/ADM/ITGA5/ANGPTL4/COL4A3/CXCL8/S1PR1/APLN/COL8A2/NDNF/CD34/ADGRB1/SEMA4A/COL27A1/AQP1/NOX5/PIK3R6
## GO:0048514 SH2D2A/PRDM1/EDN1/MMP2/TGFB2/ITGB8/CAV1/SERPINE1/CCL2/SASH1/VEGFA/SEMA5A/LOX/FGF1/PDGFRB/F3/TNFAIP3/CCN2/NPPB/MMP19/C3/EDN2/CHI3L1/BTG1/LOXL2/ANXA1/KLF4/KLK3/CCN1/ACKR3/ADAM12/ADM/XDH/ITGA5/ANGPTL4/COL4A3/CXCL8/S1PR1/APLN/COL8A2/NDNF/CD34/ADGRB1/NOG/SEMA4A/COL27A1/AQP1/NOX5/PIK3R6
## GO:0006935 CCL26/EDN1/MMP2/TGFB2/HSD3B7/SERPINE1/CCL2/TRPV4/NEDD9/VEGFA/SEMA5A/LOX/FGF1/PDGFRB/PADI2/F3/EDN2/LSP1/ANXA1/SEMA7A/CCN1/ACKR3/CCL28/NRG1/SLAMF8/ITGB2/CXCL8/S1PR1/SEMA4A/LPAR1/PLXNA4
## GO:0042330 CCL26/EDN1/MMP2/TGFB2/HSD3B7/SERPINE1/CCL2/TRPV4/NEDD9/VEGFA/SEMA5A/LOX/FGF1/PDGFRB/PADI2/F3/EDN2/LSP1/ANXA1/SEMA7A/CCN1/ACKR3/CCL28/NRG1/SLAMF8/ITGB2/CXCL8/S1PR1/SEMA4A/LPAR1/PLXNA4
## GO:0045765 TGFB2/ITGB8/SERPINE1/SASH1/VEGFA/SEMA5A/FGF1/F3/TNFAIP3/NPPB/C3/CHI3L1/BTG1/KLF4/KLK3/ADAM12/ADM/ITGA5/ANGPTL4/COL4A3/CXCL8/CD34/ADGRB1/SEMA4A/AQP1/PIK3R6
## GO:1901342 TGFB2/ITGB8/SERPINE1/SASH1/VEGFA/SEMA5A/FGF1/F3/TNFAIP3/NPPB/C3/CHI3L1/BTG1/KLF4/KLK3/ADAM12/ADM/ITGA5/ANGPTL4/COL4A3/CXCL8/CD34/ADGRB1/SEMA4A/AQP1/PIK3R6
## Count
## GO:0001525 45
## GO:0048514 49
## GO:0006935 31
## GO:0042330 31
## GO:0045765 26
## GO:1901342 26
The GO enrichment analysis has now been completed and stored in
ego
. Let’s observe the results in the dot plot below.
dotplot(ego, showCategory=20)
#### 5.3.2 MSigDB database for enrichment analysis. We next explore
enrichment analysis using the MSigDB database. MSigDB stores data from
various species and gene categories. Let’s first retrieve gene sets from
Homo sapiens and observe them.
# Load the msigdbr package
library(msigdbr)
# Retrieve gene sets for Homo sapiens
m_df <- msigdbr(species = "Homo sapiens")
# Them_df object now contains information about the Hallmark gene sets.
head(m_df)
## # A tibble: 6 × 15
## gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_gene
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 C3 MIR:MIR_Legacy AAACCAC_MIR140 ABCC4 10257 ENSG00000125257
## 2 C3 MIR:MIR_Legacy AAACCAC_MIR140 ABRAXAS2 23172 ENSG00000165660
## 3 C3 MIR:MIR_Legacy AAACCAC_MIR140 ACTN4 81 ENSG00000130402
## 4 C3 MIR:MIR_Legacy AAACCAC_MIR140 ACTN4 81 ENSG00000282844
## 5 C3 MIR:MIR_Legacy AAACCAC_MIR140 ACVR1 90 ENSG00000115170
## 6 C3 MIR:MIR_Legacy AAACCAC_MIR140 ADAM9 8754 ENSG00000168615
## # ℹ 9 more variables: human_gene_symbol <chr>, human_entrez_gene <int>,
## # human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>,
## # gs_exact_source <chr>, gs_url <chr>, gs_description <chr>
Now we have the gene sets of Homo sapiens, however, MsigDB contains 8
key categories:
1. H (Hallmark gene sets) : These are a set of 50 gene sets representing
well-defined biological states or processes.
2. C1 (Positional gene sets) : Gene sets based on chromosomal
location.
3. C2 (Curated gene sets) : Manually curated gene sets from various
sources, including pathway databases.
4. C3 (Motif gene sets) : Gene sets related to transcription factor
binding motifs.
5. C4 (Computational gene sets) : Gene sets generated through
computational methods.
6. C5 (GO gene sets) : Gene sets based on Gene Ontology terms.
7. C6 (Oncogenic signatures) : Gene sets associated with cancer-related
processes.
8. C7 (Immunologic signatures) : Gene sets related to immune system
processes.
Here, we will set the species parameter to “Homo sapiens” and the category parameter to “H” to retrieve Hallmark gene sets.
m_t2g <- msigdbr(species = "Homo sapiens", category = "H") %>%
dplyr::select(gs_name, entrez_gene)
We will investigate overview of the gene set from the Hallmark gene sets of Homo sapiens.
# Count the number of genes in each gene set
table(m_t2g$gs_name)
##
## HALLMARK_ADIPOGENESIS
## 210
## HALLMARK_ALLOGRAFT_REJECTION
## 335
## HALLMARK_ANDROGEN_RESPONSE
## 102
## HALLMARK_ANGIOGENESIS
## 36
## HALLMARK_APICAL_JUNCTION
## 231
## HALLMARK_APICAL_SURFACE
## 46
## HALLMARK_APOPTOSIS
## 183
## HALLMARK_BILE_ACID_METABOLISM
## 114
## HALLMARK_CHOLESTEROL_HOMEOSTASIS
## 77
## HALLMARK_COAGULATION
## 162
## HALLMARK_COMPLEMENT
## 237
## HALLMARK_DNA_REPAIR
## 170
## HALLMARK_E2F_TARGETS
## 218
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
## 204
## HALLMARK_ESTROGEN_RESPONSE_EARLY
## 216
## HALLMARK_ESTROGEN_RESPONSE_LATE
## 218
## HALLMARK_FATTY_ACID_METABOLISM
## 165
## HALLMARK_G2M_CHECKPOINT
## 204
## HALLMARK_GLYCOLYSIS
## 215
## HALLMARK_HEDGEHOG_SIGNALING
## 36
## HALLMARK_HEME_METABOLISM
## 214
## HALLMARK_HYPOXIA
## 215
## HALLMARK_IL2_STAT5_SIGNALING
## 216
## HALLMARK_IL6_JAK_STAT3_SIGNALING
## 103
## HALLMARK_INFLAMMATORY_RESPONSE
## 222
## HALLMARK_INTERFERON_ALPHA_RESPONSE
## 140
## HALLMARK_INTERFERON_GAMMA_RESPONSE
## 286
## HALLMARK_KRAS_SIGNALING_DN
## 220
## HALLMARK_KRAS_SIGNALING_UP
## 220
## HALLMARK_MITOTIC_SPINDLE
## 215
## HALLMARK_MTORC1_SIGNALING
## 211
## HALLMARK_MYC_TARGETS_V1
## 236
## HALLMARK_MYC_TARGETS_V2
## 60
## HALLMARK_MYOGENESIS
## 212
## HALLMARK_NOTCH_SIGNALING
## 34
## HALLMARK_OXIDATIVE_PHOSPHORYLATION
## 220
## HALLMARK_P53_PATHWAY
## 215
## HALLMARK_PANCREAS_BETA_CELLS
## 44
## HALLMARK_PEROXISOME
## 110
## HALLMARK_PI3K_AKT_MTOR_SIGNALING
## 118
## HALLMARK_PROTEIN_SECRETION
## 98
## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY
## 58
## HALLMARK_SPERMATOGENESIS
## 144
## HALLMARK_TGF_BETA_SIGNALING
## 59
## HALLMARK_TNFA_SIGNALING_VIA_NFKB
## 228
## HALLMARK_UNFOLDED_PROTEIN_RESPONSE
## 115
## HALLMARK_UV_RESPONSE_DN
## 152
## HALLMARK_UV_RESPONSE_UP
## 191
## HALLMARK_WNT_BETA_CATENIN_SIGNALING
## 50
## HALLMARK_XENOBIOTIC_METABOLISM
## 224
Next, let’s perform an enrichment analysis (ORA) using these sets.
em <- enricher(significant_genes_map$ENTREZID,
TERM2GENE = m_t2g,
universe = background_genes_map$ENTREZID,
)
The enrichment analysis has been completed and stored in em. Let’s take a look at the analyzed data.
head(em)
## ID
## HALLMARK_HYPOXIA HALLMARK_HYPOXIA
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
## HALLMARK_TNFA_SIGNALING_VIA_NFKB HALLMARK_TNFA_SIGNALING_VIA_NFKB
## HALLMARK_KRAS_SIGNALING_DN HALLMARK_KRAS_SIGNALING_DN
## HALLMARK_KRAS_SIGNALING_UP HALLMARK_KRAS_SIGNALING_UP
## HALLMARK_INFLAMMATORY_RESPONSE HALLMARK_INFLAMMATORY_RESPONSE
## Description
## HALLMARK_HYPOXIA HALLMARK_HYPOXIA
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
## HALLMARK_TNFA_SIGNALING_VIA_NFKB HALLMARK_TNFA_SIGNALING_VIA_NFKB
## HALLMARK_KRAS_SIGNALING_DN HALLMARK_KRAS_SIGNALING_DN
## HALLMARK_KRAS_SIGNALING_UP HALLMARK_KRAS_SIGNALING_UP
## HALLMARK_INFLAMMATORY_RESPONSE HALLMARK_INFLAMMATORY_RESPONSE
## GeneRatio BgRatio RichFactor
## HALLMARK_HYPOXIA 29/137 181/3729 0.1602210
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 25/137 157/3729 0.1592357
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 19/137 180/3729 0.1055556
## HALLMARK_KRAS_SIGNALING_DN 14/137 123/3729 0.1138211
## HALLMARK_KRAS_SIGNALING_UP 16/137 156/3729 0.1025641
## HALLMARK_INFLAMMATORY_RESPONSE 15/137 145/3729 0.1034483
## FoldEnrichment zScore pvalue
## HALLMARK_HYPOXIA 4.361052 9.052171 4.196736e-12
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 4.334232 8.335277 1.838472e-10
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 2.873114 5.030112 2.253037e-05
## HALLMARK_KRAS_SIGNALING_DN 3.098095 4.620563 1.305093e-04
## HALLMARK_KRAS_SIGNALING_UP 2.791690 4.464150 1.491772e-04
## HALLMARK_INFLAMMATORY_RESPONSE 2.815756 4.354995 2.210038e-04
## p.adjust qvalue
## HALLMARK_HYPOXIA 1.804597e-10 1.369461e-10
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 3.952714e-09 2.999612e-09
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 3.229353e-04 2.450671e-04
## HALLMARK_KRAS_SIGNALING_DN 1.282924e-03 9.735774e-04
## HALLMARK_KRAS_SIGNALING_UP 1.282924e-03 9.735774e-04
## HALLMARK_INFLAMMATORY_RESPONSE 1.583861e-03 1.201951e-03
## geneID
## HALLMARK_HYPOXIA 10397/665/857/5054/7422/4015/8614/54206/2152/7128/1490/3623/1907/1289/694/8553/901/3491/8497/57007/3486/3484/133/6781/51129/54541/5209/3669/55076
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 9235/6591/50509/4313/7078/5054/1277/7422/4015/5159/7128/1490/3624/1289/4148/4017/1009/3491/3486/8038/6876/3678/3576/1296/50863
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 330/1906/5054/6347/7422/604/2152/7128/6446/3624/694/8553/9314/3491/57007/1437/19/5209/3914
## HALLMARK_KRAS_SIGNALING_DN 1906/7042/778/80303/6446/1907/56154/7068/7032/3699/3851/3866/3860/105375355
## HALLMARK_KRAS_SIGNALING_UP 330/639/28951/5243/3587/7139/7128/3624/2162/9314/3486/3561/80201/3689/1437/51129
## HALLMARK_INFLAMMATORY_RESPONSE 1906/10148/3696/5054/6347/3587/2152/3624/23533/133/3678/19/3576/1902/4049
## Count
## HALLMARK_HYPOXIA 29
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 25
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 19
## HALLMARK_KRAS_SIGNALING_DN 14
## HALLMARK_KRAS_SIGNALING_UP 16
## HALLMARK_INFLAMMATORY_RESPONSE 15
We will visualize the data using a dotplot.
dotplot(em, showCategory = 20)
## 6. Gene Set Enrichment Analysis (GSEA) Gene Set Enrichment Analysis
(GSEA) is an alternative approach that considers the entire list of
genes ranked by their association with a phenotype or condition. Rather
than merely identifying over-represented pathways, GSEA evaluates
whether a predefined gene set (e.g., a pathway) demonstrates a
significant association with the phenotype throughout the entire ranked
gene list.
We need to define the gene sets for the GSEA analysis. In this case,
we will use MSigDB
, which we retrieved from the Hallmark
gene sets of Homo sapiens in the previous step.
table(m_t2g$gs_name)
##
## HALLMARK_ADIPOGENESIS
## 210
## HALLMARK_ALLOGRAFT_REJECTION
## 335
## HALLMARK_ANDROGEN_RESPONSE
## 102
## HALLMARK_ANGIOGENESIS
## 36
## HALLMARK_APICAL_JUNCTION
## 231
## HALLMARK_APICAL_SURFACE
## 46
## HALLMARK_APOPTOSIS
## 183
## HALLMARK_BILE_ACID_METABOLISM
## 114
## HALLMARK_CHOLESTEROL_HOMEOSTASIS
## 77
## HALLMARK_COAGULATION
## 162
## HALLMARK_COMPLEMENT
## 237
## HALLMARK_DNA_REPAIR
## 170
## HALLMARK_E2F_TARGETS
## 218
## HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
## 204
## HALLMARK_ESTROGEN_RESPONSE_EARLY
## 216
## HALLMARK_ESTROGEN_RESPONSE_LATE
## 218
## HALLMARK_FATTY_ACID_METABOLISM
## 165
## HALLMARK_G2M_CHECKPOINT
## 204
## HALLMARK_GLYCOLYSIS
## 215
## HALLMARK_HEDGEHOG_SIGNALING
## 36
## HALLMARK_HEME_METABOLISM
## 214
## HALLMARK_HYPOXIA
## 215
## HALLMARK_IL2_STAT5_SIGNALING
## 216
## HALLMARK_IL6_JAK_STAT3_SIGNALING
## 103
## HALLMARK_INFLAMMATORY_RESPONSE
## 222
## HALLMARK_INTERFERON_ALPHA_RESPONSE
## 140
## HALLMARK_INTERFERON_GAMMA_RESPONSE
## 286
## HALLMARK_KRAS_SIGNALING_DN
## 220
## HALLMARK_KRAS_SIGNALING_UP
## 220
## HALLMARK_MITOTIC_SPINDLE
## 215
## HALLMARK_MTORC1_SIGNALING
## 211
## HALLMARK_MYC_TARGETS_V1
## 236
## HALLMARK_MYC_TARGETS_V2
## 60
## HALLMARK_MYOGENESIS
## 212
## HALLMARK_NOTCH_SIGNALING
## 34
## HALLMARK_OXIDATIVE_PHOSPHORYLATION
## 220
## HALLMARK_P53_PATHWAY
## 215
## HALLMARK_PANCREAS_BETA_CELLS
## 44
## HALLMARK_PEROXISOME
## 110
## HALLMARK_PI3K_AKT_MTOR_SIGNALING
## 118
## HALLMARK_PROTEIN_SECRETION
## 98
## HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY
## 58
## HALLMARK_SPERMATOGENESIS
## 144
## HALLMARK_TGF_BETA_SIGNALING
## 59
## HALLMARK_TNFA_SIGNALING_VIA_NFKB
## 228
## HALLMARK_UNFOLDED_PROTEIN_RESPONSE
## 115
## HALLMARK_UV_RESPONSE_DN
## 152
## HALLMARK_UV_RESPONSE_UP
## 191
## HALLMARK_WNT_BETA_CATENIN_SIGNALING
## 50
## HALLMARK_XENOBIOTIC_METABOLISM
## 224
Input for GSEA is different from ORA. Instead of selecting only genes
of interest, GSEA utilizes all differential expression data, including
fold change and p-value from all genes.
We will use the results from the previous step (res
) as
input. Therefore, we need to filter the genes and change the data type
to a data frame using the following code, then use res_df
for further processing.
res_df <- res %>%
as.data.frame() %>%
filter(baseMean != 0) %>%
tibble::rownames_to_column(var = "ENSEMBL")
GSEA requires the data to be pre-ranked based on a metric that
combines fold change and significance, such as the signed fold change
multiplied by the negative logarithm of the p-value.
Additionally, we include the ENTREZID in the data frame.
res_df <- res_df %>%
mutate(signed_rank_stats = sign(log2FoldChange) * -log10(pvalue)) %>%
left_join(background_genes_map, by = c("ENSEMBL" = "ENSEMBL")) %>%
arrange(desc(signed_rank_stats))
One issue that may arise is the presence of infinite values in the
data, which can cause errors during GSEA.
To address this, we replace infinite values with large numbers so that
they are ranked high:
res_df <- res_df %>%
mutate(negative_log10pvalue = -log10(pvalue)) %>%
mutate(negative_log10pvalue = ifelse(is.infinite(negative_log10pvalue), 1000, negative_log10pvalue)) %>%
mutate(signed_rank_stats = sign(log2FoldChange) * negative_log10pvalue)
This is the final step of input preparation for GSEA. We will prepare
gene_list
that contains signed_rank_stats
and
gene identification (ENTREZID
)
# Assign `signed_rank_stats` as `gene_list`
gene_list <- res_df$signed_rank_stats
# Name the `gene_list` by ENTREZID
names(gene_list) <- res_df$ENTREZID
We perform fast GSEA using our pre-ranked gene list
(gene_list
) and a reference gene set, represented by
m_t2g
. The analyzed data is stored in em2
.
em2 <- GSEA(gene_list, TERM2GENE = m_t2g)
head(em2)
## ID
## HALLMARK_E2F_TARGETS HALLMARK_E2F_TARGETS
## HALLMARK_G2M_CHECKPOINT HALLMARK_G2M_CHECKPOINT
## HALLMARK_HYPOXIA HALLMARK_HYPOXIA
## HALLMARK_TNFA_SIGNALING_VIA_NFKB HALLMARK_TNFA_SIGNALING_VIA_NFKB
## HALLMARK_MYC_TARGETS_V2 HALLMARK_MYC_TARGETS_V2
## HALLMARK_GLYCOLYSIS HALLMARK_GLYCOLYSIS
## Description setSize
## HALLMARK_E2F_TARGETS HALLMARK_E2F_TARGETS 200
## HALLMARK_G2M_CHECKPOINT HALLMARK_G2M_CHECKPOINT 199
## HALLMARK_HYPOXIA HALLMARK_HYPOXIA 181
## HALLMARK_TNFA_SIGNALING_VIA_NFKB HALLMARK_TNFA_SIGNALING_VIA_NFKB 180
## HALLMARK_MYC_TARGETS_V2 HALLMARK_MYC_TARGETS_V2 58
## HALLMARK_GLYCOLYSIS HALLMARK_GLYCOLYSIS 184
## enrichmentScore NES pvalue
## HALLMARK_E2F_TARGETS -0.7906095 -2.235532 1.000000e-10
## HALLMARK_G2M_CHECKPOINT -0.7678108 -2.164573 1.000000e-10
## HALLMARK_HYPOXIA 0.9360678 2.008638 1.000000e-10
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.8704395 1.867458 4.045004e-07
## HALLMARK_MYC_TARGETS_V2 -0.8337025 -1.971053 9.312620e-07
## HALLMARK_GLYCOLYSIS 0.8457952 1.810901 1.036451e-05
## p.adjust qvalue rank
## HALLMARK_E2F_TARGETS 1.666667e-09 8.070175e-10 1775
## HALLMARK_G2M_CHECKPOINT 1.666667e-09 8.070175e-10 1717
## HALLMARK_HYPOXIA 1.666667e-09 8.070175e-10 615
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 5.056255e-06 2.448292e-06 1129
## HALLMARK_MYC_TARGETS_V2 9.312620e-06 4.509268e-06 1247
## HALLMARK_GLYCOLYSIS 8.637088e-05 4.182169e-05 739
## leading_edge
## HALLMARK_E2F_TARGETS tags=58%, list=12%, signal=52%
## HALLMARK_G2M_CHECKPOINT tags=50%, list=12%, signal=45%
## HALLMARK_HYPOXIA tags=45%, list=4%, signal=44%
## HALLMARK_TNFA_SIGNALING_VIA_NFKB tags=38%, list=8%, signal=35%
## HALLMARK_MYC_TARGETS_V2 tags=78%, list=9%, signal=71%
## HALLMARK_GLYCOLYSIS tags=26%, list=5%, signal=25%
## core_enrichment
## HALLMARK_E2F_TARGETS 3978/253714/1633/10248/6839/4605/9972/10549/57405/9133/3148/10733/10535/55635/9125/3184/2146/3930/9833/10714/23594/10212/79019/29089/9319/10615/84296/84844/5982/4999/24137/8726/23649/3161/672/4288/27338/5889/3015/10797/9787/3835/5558/1163/11340/57122/983/9126/1965/29028/7514/8914/9212/9700/51155/3838/5424/3609/4085/701/4678/4001/55646/1111/5631/7153/7112/9918/6790/29127/54962/6628/332/1854/204/10051/3070/5591/11051/8243/79075/203068/5983/6427/2547/4436/991/1503/23165/5347/993/7374/9238/10492/6749/4171/1164/10527/4174/79077/1019/4830/5902/1786/6426/3159/4172/4175/10528/4176/5425/4173/5901/5111/1434/5036/9221
## HALLMARK_G2M_CHECKPOINT 6839/10403/4605/8899/9133/3980/10733/3184/2146/6713/1869/3930/1810/23594/10212/3832/1874/10592/8454/24137/23649/5933/8438/6599/9585/3161/4288/27338/7027/3015/10036/3835/22809/890/5558/1163/699/7290/983/5001/262/55872/7514/990/22974/9212/9184/9700/8318/51155/3838/3609/10926/7150/4085/51776/4678/4001/1111/10419/7153/7112/6790/29127/7272/3151/10432/332/9156/10051/7371/10146/8243/4953/7090/6427/3837/51659/991/3192/5347/993/6421/10772/10492/4171/1164/4174/6632/1019/6426/3159/4172/4175/8140/3312/85236/1736/4691/9221
## HALLMARK_HYPOXIA 6515/2355/5230/10397/665/8614/54206/8553/3939/3486/54541/133/6513/857/3491/7045/5209/8497/7052/8974/4627/205/6385/7422/1316/5054/3725/55818/3099/7167/694/901/5155/2026/302/6533/5163/2113/2632/55076/1907/1490/5033/55139/23645/284119/6095/2023/112464/10370/3098/8660/2152/1649/5211/10957/3309/5214/8870/7852/30001/4783/5236/23764/2597/226/1026/26118/5329/4601/26136/467/51129/5066/3669/54800/23327/4015/7436/7128/2131/26355
## HALLMARK_TNFA_SIGNALING_VIA_NFKB 6515/2355/8553/3914/3491/5209/6385/7422/1316/5054/3725/694/10318/22822/9314/23645/4616/8660/2152/10957/56937/10769/4082/1827/8870/6446/4783/23764/7050/329/3383/9120/23135/1026/19/2114/5329/960/10725/467/330/8061/8744/4084/7128/10938/604/4791/25976/2354/7538/1847/6303/23258/4088/3572/3976/5606/10135/687/3280/9516/1437/3624/4792/1846/1051/1906
## HALLMARK_MYC_TARGETS_V2 23223/79050/705/1844/64216/29078/80324/23481/6573/8886/56915/92856/26354/9136/51388/27340/2193/51491/23082/7965/27346/83743/5347/7374/10244/9238/79711/51154/4174/79077/1019/23160/10196/6723/10514/6652/5245/10528/4173/6949/3336/5036/4869/9221/3329
## HALLMARK_GLYCOLYSIS 5230/8614/3939/3486/54541/112399/7045/8497/8974/205/7422/3099/7167/2026/7039/6676/5033/55139/2023/10370/8660/3309/5214/8870/7852/30001/1999/5223/226/6662/960/9123/4601/10616/51129/5066/4907/3669/5768/7436/2131/26355/2997/6781/23036/2673/375790
dotplot(em2, showCategory=20)
Visualization for the GSEA of HALLMARK_P53_PATHWAY
p1 <- gseaplot(em2, geneSetID = "HALLMARK_P53_PATHWAY",
by = "runningScore", title = "HALLMARK_P53_PATHWAY")
p1
Visualization for the GSEA of HALLMARK_HYPOXIA
p2 <- gseaplot(em2, geneSetID = "HALLMARK_HYPOXIA",
by = "runningScore", title = "HALLMARK_HYPOXIA")
p2