1. Project organization

Let’s start with project organization. I started by organizing the folders in this project for reproducibility.

I created a set of routine folders composed of data, result, and script.

# move to the project folder
cd Hypox

# create the folders
mkdir data result script

I run all R scripts using the R server in Docker and also create the R project and Git version control. Please see more details in My Computer Workbench.

2. Get the count matrix

This project will start with the count matrix from the published data. I will start with the count matrix from the GEO dataset GSE197576 by using wget to download the data

# move to data folder before downloading
cd data
mkdir GEO
cd GEO

# download the count matrix
wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE197nnn/GSE197576/suppl/GSE197576%5Fraw%5Fgene%5Fcounts%5Fmatrix%2Etsv%2Egz

2.1 Define input and output location

I use the here library to be reproducible on every computer I work with. Please see the details and why I use the here library on My Computer Workbench.. I define the output with setwd().

library(here)

# Show where here is
here()

# Define output directory
setwd(here("result", "GEO_Count_ToTPM"))

# Show th output directory
getwd()

2.2 Observing the count matrix file in R

We will use the readr and dplyr packagess to import data and manage it.

library(dplyr)
library(readr)
library(here)

# Define location of the file
raw_counts_file <- here("data/GEO", "GSE197576_raw_gene_counts_matrix.tsv.gz")

# Import the file
raw_counts <- read_tsv(raw_counts_file)

# observing head of the data
head(raw_counts)
## # A tibble: 6 × 13
##   gene        `01_SW_sgCTRL_Norm` `02_SW_sgCTRL_Norm` `03_SW_sgITPR3_1_Norm`
##   <chr>                     <dbl>               <dbl>                  <dbl>
## 1 DDX11L1                       0                   0                      0
## 2 WASH7P                       18                  11                     28
## 3 MIR6859-1                     5                   1                      6
## 4 MIR1302-2HG                   0                   0                      0
## 5 MIR1302-2                     0                   0                      0
## 6 FAM138A                       0                   0                      0
## # ℹ 9 more variables: `04_SW_sgITPR3_1_Norm` <dbl>,
## #   `07_SW_sgRELB_3_Norm` <dbl>, `08_SW_sgRELB_3_Norm` <dbl>,
## #   `11_SW_sgCTRL_Hyp` <dbl>, `12_SW_sgCTRL_Hyp` <dbl>,
## #   `13_SW_sgITPR3_1_Hyp` <dbl>, `14_SW_sgITPR3_1_Hyp` <dbl>,
## #   `17_SW_sgRELB_3_Hyp` <dbl>, `18_SW_sgRELB_3_Hyp` <dbl>

Observing the transcriptomics condition.

colnames(raw_counts)
##  [1] "gene"                 "01_SW_sgCTRL_Norm"    "02_SW_sgCTRL_Norm"   
##  [4] "03_SW_sgITPR3_1_Norm" "04_SW_sgITPR3_1_Norm" "07_SW_sgRELB_3_Norm" 
##  [7] "08_SW_sgRELB_3_Norm"  "11_SW_sgCTRL_Hyp"     "12_SW_sgCTRL_Hyp"    
## [10] "13_SW_sgITPR3_1_Hyp"  "14_SW_sgITPR3_1_Hyp"  "17_SW_sgRELB_3_Hyp"  
## [13] "18_SW_sgRELB_3_Hyp"

Next, the condition that we want is just the baseline condition of the cell. We then subset the data to select the sgCTRL conditions. (The others are sgRELB and sgITPR3, which the cells were perturbed; we don’t want those conditions.)

# define the condition
columns_to_select <- colnames(raw_counts) %>%
  stringr::str_detect("sgCTRL|gene")

# show the columns that match the condition
columns_to_select 
##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [13] FALSE

Select the columns that match the conditions.

counts_sub <- raw_counts[, columns_to_select]

head(counts_sub)
## # A tibble: 6 × 5
##   gene        `01_SW_sgCTRL_Norm` `02_SW_sgCTRL_Norm` `11_SW_sgCTRL_Hyp`
##   <chr>                     <dbl>               <dbl>              <dbl>
## 1 DDX11L1                       0                   0                  0
## 2 WASH7P                       18                  11                 23
## 3 MIR6859-1                     5                   1                  8
## 4 MIR1302-2HG                   0                   0                  1
## 5 MIR1302-2                     0                   0                  0
## 6 FAM138A                       0                   0                  0
## # ℹ 1 more variable: `12_SW_sgCTRL_Hyp` <dbl>

2.3 Manage the data to matrix format

To change all numeric data to numeric and work with it in a matrix, we will delete the gene column first, then put it back later.

# Delete the `gene` column and change all gene count columns to matrix.
raw_counts_mat <- counts_sub[, -1] %>% 
                    as.matrix()

# Put the `gene` column back.
rownames(raw_counts_mat) <- raw_counts$gene
head(raw_counts_mat)
##             01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 11_SW_sgCTRL_Hyp
## DDX11L1                     0                 0                0
## WASH7P                     18                11               23
## MIR6859-1                   5                 1                8
## MIR1302-2HG                 0                 0                1
## MIR1302-2                   0                 0                0
## FAM138A                     0                 0                0
##             12_SW_sgCTRL_Hyp
## DDX11L1                    0
## WASH7P                    45
## MIR6859-1                 12
## MIR1302-2HG                0
## MIR1302-2                  0
## FAM138A                    0
# Write `raw_counts_mat` as .csv file
write.csv(raw_counts_mat, file = here("result/GEO_Count_ToTPM", "raw_counts_mat.csv"), row.names = TRUE)

3. Normalize gene counts to TPM

We will use TxDb.Hsapiens.UCSC.hg19.knownGene package to get the exon length.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene # shorter the name
exons <- exonsBy(txdb, by = "gene") # select only the exons
exons
## GRangesList object of length 23459:
## $`1`
## GRanges object with 15 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id   exon_name
##           <Rle>         <IRanges>  <Rle> | <integer> <character>
##    [1]    chr19 58858172-58858395      - |    250809        <NA>
##    [2]    chr19 58858719-58859006      - |    250810        <NA>
##    [3]    chr19 58859832-58860494      - |    250811        <NA>
##    [4]    chr19 58860934-58862017      - |    250812        <NA>
##    [5]    chr19 58861736-58862017      - |    250813        <NA>
##    ...      ...               ...    ... .       ...         ...
##   [11]    chr19 58868951-58869015      - |    250821        <NA>
##   [12]    chr19 58869318-58869652      - |    250822        <NA>
##   [13]    chr19 58869855-58869951      - |    250823        <NA>
##   [14]    chr19 58870563-58870689      - |    250824        <NA>
##   [15]    chr19 58874043-58874214      - |    250825        <NA>
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## $`10`
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames            ranges strand |   exon_id   exon_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]     chr8 18248755-18248855      + |    113603        <NA>
##   [2]     chr8 18257508-18258723      + |    113604        <NA>
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
## 
## ...
## <23457 more elements>

3.1 Calculate the exon length that is needed for TPM calculation

# Show the length of all exons of each gene
width(exons) 
## IntegerList of length 23459
## [["1"]] 224 288 663 1084 282 297 273 270 36 96 65 335 97 127 172
## [["10"]] 101 1216
## [["100"]] 326 103 130 65 102 72 128 116 144 123 62 161
## [["1000"]] 1394 165 140 234 234 143 254 186 138 173 145 156 147 227 106 112 519
## [["10000"]] 218 68 5616 50 103 88 215 129 123 69 66 132 145 112 126 158 41
## [["100008586"]] 92 121 126 127
## [["100009676"]] 2784
## [["10001"]] 704 28 116 478 801 109 130 83 92 160 52
## [["10002"]] 308 127 104 222 176 201 46 106 918 705
## [["10003"]] 191 112 187 102 126 187 94 1193 99 ... 64 68 92 91 265 82 93 1054
## ...
## <23449 more elements>
# Sum the total lengths for each gene
head(sum(width(exons)))
##         1        10       100      1000     10000 100008586 
##      4309      1317      1532      4473      7459       466
# Create a tibble which includes `ENTREZID` and `exon_length`
exon_len <- sum(width(exons)) %>% 
  tibble::enframe(name = "ENTREZID", value = "exon_length")

head(exon_len)
## # A tibble: 6 × 2
##   ENTREZID  exon_length
##   <chr>           <int>
## 1 1                4309
## 2 10               1317
## 3 100              1532
## 4 1000             4473
## 5 10000            7459
## 6 100008586         466

3.2 Map the ENTREZID to the official gene symbol

We will use the org.Hs.eg.db Bioconductor package to match the gene symbol.

library(org.Hs.eg.db)

map <- AnnotationDbi::select(org.Hs.eg.db, 
                            keys = exon_len$ENTREZID, 
                            columns= "SYMBOL",
                            keytype = "ENTREZID")

head(map)
##    ENTREZID  SYMBOL
## 1         1    A1BG
## 2        10    NAT2
## 3       100     ADA
## 4      1000    CDH2
## 5     10000    AKT3
## 6 100008586 GAGE12F

Join the exon length (exon_len) and map together to match the gene symbol with ENTREZID.

library(dplyr)

map <- left_join(exon_len, map)
head(map)
## # A tibble: 6 × 3
##   ENTREZID  exon_length SYMBOL 
##   <chr>           <int> <chr>  
## 1 1                4309 A1BG   
## 2 10               1317 NAT2   
## 3 100              1532 ADA    
## 4 1000             4473 CDH2   
## 5 10000            7459 AKT3   
## 6 100008586         466 GAGE12F

We will subset only the common genes (common_genes), which are the genes found in both our data raw_counts_mat and the gene symbol annotation map$SYMBOL.

common_genes <- intersect(rownames(raw_counts_mat), map$SYMBOL)

head(common_genes)
## [1] "DDX11L1"   "WASH7P"    "OR4F5"     "LOC729737" "OR4F29"    "OR4F16"

Here is a map that contains ENTREZID, length, and gene symbol information that intersects and reorders according to common_genes.

# select only the common genes and re-order them by common_genes
map<- map %>%
  dplyr::slice(match(common_genes, SYMBOL))

head(map)
## # A tibble: 6 × 3
##   ENTREZID  exon_length SYMBOL   
##   <chr>           <int> <chr>    
## 1 100287102        2838 DDX11L1  
## 2 653635           8050 WASH7P   
## 3 79501             918 OR4F5    
## 4 729737           5474 LOC729737
## 5 729759           1878 OR4F29   
## 6 81399             939 OR4F16

Here is raw_count_mat, which contains gene symbol and gene count information that intersect and re-order according to common_genes.

# subset the common genes and re-order them by common_genes
raw_counts_mat<- raw_counts_mat[common_genes, ]

head(raw_counts_mat)
##           01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 11_SW_sgCTRL_Hyp 12_SW_sgCTRL_Hyp
## DDX11L1                   0                 0                0                0
## WASH7P                   18                11               23               45
## OR4F5                     0                 0                0                0
## LOC729737                 3                 3               16               16
## OR4F29                    0                 0                0                0
## OR4F16                    1                 0                0                3

3.3 Normalizing raw counts to transcripts per Million (TPM)

3.3.1 Create count2tpm function for TPM calculation

count2tpm <- function(count_matrix, exon_length) {
  # Calculate reads per base pair per gene
  reads_per_bp_gene <- count_matrix / exon_length
  
  # Calculate the total reads per base pair for each sample
  reads_per_bp_sample <- colSums(reads_per_bp_gene)
  
  # Normalize to the library size and calculate TPM
  tpm_matrix <- t(t(reads_per_bp_gene) / reads_per_bp_sample) * 1000000
  return(tpm_matrix)
}

3.3.2 Calculating TPM by using count2tpm.

tpm <- count2tpm(raw_counts_mat, map$exon_length)

head(tpm)
##           01_SW_sgCTRL_Norm 02_SW_sgCTRL_Norm 11_SW_sgCTRL_Hyp 12_SW_sgCTRL_Hyp
## DDX11L1          0.00000000         0.0000000        0.0000000        0.0000000
## WASH7P           0.31420614         0.2460997        0.4026318        0.7799782
## OR4F5            0.00000000         0.0000000        0.0000000        0.0000000
## LOC729737        0.07701131         0.0987031        0.4118995        0.4078317
## OR4F29           0.00000000         0.0000000        0.0000000        0.0000000
## OR4F16           0.14964853         0.0000000        0.0000000        0.4457809

4. Analyzing the genes of interest (GOI).

I’m mostly engaging with the cancer stem cell (CSC) field; thus, I will explore a few genes in CSC, hoping that they would significantly increase in hypoxic conditions (columns 3 and 4) compared with normoxic conditions (columns 1 and 2), as theory states.

GOI <- "NOTCH1"
t.test(tpm[GOI, c(1,2)], tpm[GOI, c(3,4)])
## 
##  Welch Two Sample t-test
## 
## data:  tpm[GOI, c(1, 2)] and tpm[GOI, c(3, 4)]
## t = 15.569, df = 1.9299, p-value = 0.004753
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   6.064292 10.927203
## sample estimates:
## mean of x mean of y 
##  36.13129  27.63554

Oh! I found NOTCH1 significantly different between the two conditions, but normoxic (x) is greater than hypoxic (y). This is not what I expect to see. However, I can explore the genes that I am interestd in now!

I would save the TPM (tpm) which I calculated, into a .csv file for further analysis. The .csv file is also readable.

# Write `tpm` as `.csv` file and save into `result` folder.
write.csv(tpm, file = here("result/GEO_Count_ToTPM", "TPM_GeneSymbol.csv"), row.names = TRUE)