본문 바로가기

Contact English

【생물정보학】 R에서 유용한 주요 함수 모음

 

R에서 유용한 주요 함수 모음

 

추천글 : 【RStudio】 R 스튜디오 목차, 【생물정보학】 생물정보학 분석 목차  


1. 개요 [본문]

2. 기초 함수 [본문]

3. 파일 입출력 [본문]

4. 벡터 및 행렬 연산 [본문]

5. gsub 이용 [본문]

6. 통계 분석 [본문]

7. 생물정보학 [본문]

8. 이미지 생성 [본문]


a. GitHub (executable .R file)

b. 파이썬 유용 함수 모음 

c. 빅데이터분석기사 R 실기 필수 암기 및 예제


 

1. 개요 [목차]

⑴ 아래에서 정의한 함수는 다음과 같이 호출할 수 있음

 

source("https://github.com/JB243/nate9389/blob/main/RStudio/A_Collection_of_Useful_Functions_in_RStudio.R?raw=true")

 

① 다만, 위 코드로 인해 다음과 같은 문제점이 발생할 수 있음

문제 1. Error in sum(dim(scRNAseqData)): argument "b" is missing, with no default

문제 2. "do.call" r argument "b" is missing, with no default

⑵ .R 파일은 다음과 같이 사용할 수도 있음

 

Rscript /numbat/inst/bin/pileup_and_phase.R \
    --label possorted_genome_bam \
    --samples possorted_genome_bam \
    --bams possorted_genome_bam.bam \
    --barcodes possorted_genome_bam_barcodes.tsv \
    --outdir possorted_genome_bam \
    --gmap /Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz \
    --snpvcf /data/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf \
    --paneldir /data/1000G_hg38 \
    --ncores 1

 

⑶ 아래에서 정의한 함수는 다음과 같이 사용할 수 있음

 

SUM <- function(a, b){
    return (a + b)
}

A <- 1
B <- 2
print(SUM(A, B))
# 3

 

 

2. 기초 함수 [목차]

⑴ 자료형 타입 확인

 

# x is a variable
typeof(x)

 

⑵ 자연수 앞에 0을 채워서 총 다섯 자리가 되도록 하는 함수

 

n = 55
sprintf("%05d", n)

 

⑶ 엔터를 칠 때까지 대기하는 함수 (레퍼런스)

 

readline(prompt="Press [enter] to continue")

 

⑷ 다중 연산 지원

 

# Reference : https://cran.r-project.org/web/packages/future/vignettes/future-3-topologies.html

plan("multisession", workers = 10)

 

%in% 연산자

 

x <- c('월','화','수','목','금','토','일')
x %in% c('토','일')
# FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

 

 

3. 파일 입출력 [목차]

⑴ 특정 폴더 'destination.folder'가 존재하는지 확인

 

 ## Check whether destination folder exists
if (!file.exists(destination.folder)) {
    stop(.wrap("The destination folder could not be found. Please change",
               "the path specified in", sQuote(destination.folder)))
}

 

⑵ 특정 폴더 'destination.folder'에 쓰기 권한이 존재하는지 확인

 

## Check for write permissions in the destination folder
if (file.access(destination.folder, 2) == -1) {
    stop(.wrap("You do not have write permission in the destination",
               "folder."))
}

 

⑶ 폴더 생성 : destination.folder 아래 Filtered_bams 폴더 생성

 

## Create folder
destination.folder <- file.path(destination.folder, "Filtered_bams")
tryCatch({
    if (!file.exists(file.path(destination.folder))) {
        dir.create(file.path(destination.folder),
                   recursive = TRUE)
    } else {
        stop(.wrap("The folder",
                   sQuote(file.path(destination.folder, "Filtered_bams")),
                   "already exists. Please remove it, or (in case you",
                   "still need it), rename it to prevent files from being",
                   "overwritten."))
    }
}, warning = function(e) {
    stop(.wrap("You do not have write permissions in the destination",
               "folder. Stopping execution of the remaining part of the",
               "script..."))
})

 

⑷ .gz 파일을 읽어들여서 압축을 푸는 코드 

 

tmp = readMM('matrix.mtx.gz')
writeMM(tmp, 'matrix.mtx')

 

 

4. 벡터 및 행렬 연산 [목차]

⑴ 주어진 벡터를 a, b, c 순으로 정렬하기 

 

my_vec <- c('c', 'b', 'a')

print( my_vec[order(my_vec)] )
# "a" "b" "c"

 

주어진 문자열(given_str)에서 특정 문자열(partial_str)을 포함하는지 여부 : grepl (레퍼런스

 

grepl(partial_str, given_str, fixed = TRUE)

 

주어진 문자열 벡터에서 특정 문자열 str을 포함하는 index 출력 

 

has_string_arr <- function(str, arr){
    flag <- 0
    ar <- array()
    for(i in 1:length(arr)){
        if( grepl(str, arr[i], fixed=TRUE) ){
            flag <- flag + 1
            ar[flag] <- i
        }
    }
    return(ar)
}

 

세 벡터의 교집합 

 

tri_intersect <- function(A, B, C){
    a <- intersect(A, B)
    b <- intersect(a, C)
    return(b)
}

 

⑸ 주어진 벡터에서 NA 부분을 0으로 바꾸는 함수

 

trim_na <- function(vector){
  for(i in 1: length(vector)){
    if(is.na(vector[i])){
      vector[i] = 0
    }
  }
  return (vector)
}

 

⑹ 주어진 벡터에서 NA 부분을 제외하는 함수 

 

ignore_na <- function(vector){
  flag = 0
  ar = array()
  for(i in 1: length(vector)){
    if(! is.na(vector[i])){
      flag = flag + 1
      ar[flag] = vector[i]
    }
  }
  return (ar)
}

 

⑺ 주어진 벡터에서 특정 원소를 다른 원소로 치환하는 함수

 

replace_in_vector <- function(v, from_element, to_element){
  ar <- array(dim = length(v))

  for(i in 1 : length(v)){
    if(v[i] == from_element){
      ar[i] = to_element
    } else{
      ar[i] = v[i]
    }
  }

  return(ar)
}

 

⑻ 행렬 혹은 데이터 프레임의 왼쪽 모퉁이를 반환하는 함수

 

corner <- function(x, num = 10){
  return(x[1:min(  num, dim(x)[1]  ), 
           1:min(  num, dim(x)[2]  )])
}

 

임의의 두 행렬이 주어져 있을 때 cbind를 출력하는 함수

 

my.cbind <- function(data1, data2){
  a <- intersect(rownames(data1), rownames(data2))
  return (cbind(data1[a, ], data2[a, ]))
}

 

임의의 두 행렬이 주어져 있을 때 rbind를 출력하는 함수 

 

my.rbind <- function(data1, data2){
  a <- intersect(colnames(data1), colnames(data2))
  return (rbind(data1[, a], data2[, a]))
}

 

⑾ 주어진 행렬 혹은 데이터프레임에서 A행과 B행을 바꾸어 출력해주는 함수

 

switch_A_B_row <- function(mat0, A, B){
  mat <- mat0
  mat[A, ] = mat0[B, ]
  mat[B, ] = mat0[A, ] 
  rownames(mat)[A] = rownames(mat0)[B]
  rownames(mat)[B] = rownames(mat0)[A]
  return(mat)
}

 

주어진 행렬 혹은 데이터프레임에서 A열과 B열을 바꾸어 출력해주는 함수 

 

switch_A_B_col <- function(mat0, A, B){
  mat <- mat0
  mat[, A] = mat0[, B]
  mat[, B] = mat0[, A] 
  colnames(mat)[A] = colnames(mat0)[B]
  colnames(mat)[B] = colnames(mat0)[A]
  return(mat)
}

 

 

5. gsub 이용 [목차]

⑴ 주어진 문자열에서 하이픈(-)을 제거하는 함수 

 

eliminate_hyphene <- function(str){
    return(gsub("-", "", str))
}

 

⑵ 주어진 문자열에서 "("과 같은 특수기호를 제거하는 함수 (레퍼런스)

 

eliminate_symbol <- function(str){
    return(gsub("(", "", str, fixed = TRUE))
}

 

⑶ 주어진 문자열에서 x와 x 뒤의 모든 문자들을 제거하는 함수 (레퍼런스

 

eliminate_backward <- function(str){
    return(gsub('x.*$', '', str))
}

 

⑷ "ABCD.123"과 같이 주어진 문자열에서 .과 . 뒤에 오는 모든 숫자들을 제거하는 함수 

 

eliminate_backward2 <- function(str){
    return(gsub('*.[0-9]*$', '', str))
}

 

⑸ 주어진 문자열에서 x와 x 앞의 모든 문자들을 제거하는 함수

 

eliminate_forward <- function(str){
    return(gsub('.*x', '', str))
}

 

⑹ 주어진 string형 변수(’given_str’)에서 start부터 end까지의 문자를 “x”로 채운 뒤 반환하는 함수

 

str_substitute <- function(given_str, start, end){
    library(stringr)
    library(stringi)
  
    result = paste0(
        str_sub(given_str, 1, start-1),    
        stri_dup("x", (end-start+1) ),
        str_sub(given_str, end+1, str_length(given_str))
    )

    return(result)
}

 

⑺ 주어진 string (’given_str’) 안에서 특정 패턴 뒤에 오는 숫자들 수집하는 코드

 

number_after_pattern_in_str <- function(given_str, pattern){
    # reference : https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html

    library(stringr)

    phone1 <- paste0(pattern, '([1-9]{1})')
    phone2 <- paste0(pattern, '([1-9][0-9]{1})')
    phone3 <- paste0(pattern, '([1-9][0-9]{2})')
    phone4 <- paste0(pattern, '([1-9][0-9]{3})')

    arr <- array()
    flag = 0

    for(i in 1 : dim(str_locate_all(given_str, phone4)[[1]])[1] ){
        if (dim(str_locate_all(given_str, phone4)[[1]])[1] > 0){
            flag <- flag + 1
            arr[flag] = str_sub(
                given_str, 
                str_locate_all(given_str, phone4)[[1]][1, 1], #start
                str_locate_all(given_str, phone4)[[1]][1, 2]  #end
            )
            given_str = str_substitute(
                given_str, 
                str_locate_all(given_str, phone4)[[1]][1, 1], #start
                str_locate_all(given_str, phone4)[[1]][1, 2]  #end
            )
        }
    }

    for(i in 1 : dim(str_locate_all(given_str, phone3)[[1]])[1] ){
        if (dim(str_locate_all(given_str, phone3)[[1]])[1] > 0){
            flag <- flag + 1
            arr[flag] = str_sub(
                given_str, 
                str_locate_all(given_str, phone3)[[1]][1, 1], #start
                str_locate_all(given_str, phone3)[[1]][1, 2]  #end
            )
            given_str = str_substitute(
                given_str, 
                str_locate_all(given_str, phone3)[[1]][1, 1], #start
                str_locate_all(given_str, phone3)[[1]][1, 2]  #end
            )
        }
    }

    for(i in 1 : dim(str_locate_all(given_str, phone2)[[1]])[1] ){
        if (dim(str_locate_all(given_str, phone2)[[1]])[1] > 0){
            flag <- flag + 1
            arr[flag] = str_sub(
                given_str, 
                str_locate_all(given_str, phone2)[[1]][1, 1], #start
                str_locate_all(given_str, phone2)[[1]][1, 2]  #end
            )
            given_str = str_substitute(
                given_str, 
                str_locate_all(given_str, phone2)[[1]][1, 1], #start
                str_locate_all(given_str, phone2)[[1]][1, 2]  #end
            )
        }
    }

    for(i in 1 : dim(str_locate_all(given_str, phone1)[[1]])[1] ){
        if (dim(str_locate_all(given_str, phone1)[[1]])[1] > 0){
            flag <- flag + 1
            arr[flag] = str_sub(
                given_str, 
                str_locate_all(given_str, phone1)[[1]][1, 1], #start
                str_locate_all(given_str, phone1)[[1]][1, 2]  #end
            )
            given_str = str_substitute(
                given_str, 
                str_locate_all(given_str, phone1)[[1]][1, 1], #start
                str_locate_all(given_str, phone1)[[1]][1, 2]  #end
            )
        }
    }	

    arr <- str_sub(arr, str_length(pattern)+1, str_length(arr))
    arr <- as.numeric(arr)
    return(arr)
}

 

 

6. 통계 분석 [목차]

⑴ 10% ~ 90%까지의 percentile 분위수를 알려주는 코드

 

# reference: https://www.statology.org/percentiles-in-r/

quantile(data, probs = seq(.1, .9, by = .1))

 

n!에 log10을 취한 값

 

log10_factorial <- function(n){
  if(n == 0){
    return(0)
  }

  out <- 0
  for(i in 1 : n){
    out <- out + log(i) / log(10)
  }
  return(out)
}

 

n개 중 k개를 뽑는 경우의 수인 이항계수 nCk를 구하는 함수 : 단순히 factorial을 이용하면 Inf가 뜰 수 있으므로 코드를 개선함

 

my.combination <- function(n, k){
  # return nCk = n! / ((n-k)! k!)
  
  if (n == k || n == 0 || k == 0){
    return(1)
  }

  A = log10_factorial(n)
  B = log10_factorial(n-k)
  C = log10_factorial(k)
  
  log10_nCk = A - B - C
  return(10^(log10_nCk))
}

 

두 집단의 t test 

 

t.test(v1, v2, paired = FALSE)
# maybe you can activate 'paired' in a special condition

 

두 집단의 ANOVA test 

 

one_way_2_factor_anova <- function(v1, v2){
  dat <- matrix(0, nrow = ( length(v1) + length(v2) ), ncol = 2 )
  for(i in 1 : length(v1) ){
    dat[i, 1] <- v1[i]
    dat[i, 2] <- 'v1'
  }
  for(i in 1 : length(v2) ){
    dat[i + length(v1), 1] <- v2[i]
    dat[i + length(v1), 2] <- 'v2'
  }
  dat <- as.data.frame(dat)
  
  colnames(dat) <- c('val', 'factor')

  anova_IS <- aov(val ~ factor, data = dat)
  print(summary(anova_IS))

  anova_residuals <- anova_IS$residuals
  print(summary(anova_residuals))
}

 

세 집단의 ANOVA test

 

one_way_3_factor_anova <- function(v1, v2, v3){
  dat <- matrix(0, nrow = ( length(v1) + length(v2) + length(v3) ), ncol = 2 )
  for(i in 1 : length(v1) ){
    dat[i, 1] <- v1[i]
    dat[i, 2] <- 'v1'
  }
  for(i in 1 : length(v2) ){
    dat[i + length(v1), 1] <- v2[i]
    dat[i + length(v1), 2] <- 'v2'
  }
  for(i in 1 : length(v3) ){
    dat[i + length(v1) + length(v2), 1] <- v3[i]
    dat[i + length(v1) + length(v2), 2] <- 'v3'
  }
  dat <- as.data.frame(dat)
  
  colnames(dat) <- c('val', 'factor')

  anova_IS <- aov(val ~ factor, data = dat)
  print(summary(anova_IS))

  anova_residuals <- anova_IS$residuals
  print(summary(anova_residuals))
}

 

네 집단의 ANOVA test

 

one_way_4_factor_anova <- function(v1, v2, v3, v4){
  dat <- matrix(0, nrow = ( length(v1) + length(v2) + length(v3) + length(v4) ), ncol = 2 )
  for(i in 1 : length(v1) ){
    dat[i, 1] <- v1[i]
    dat[i, 2] <- 'v1'
  }
  for(i in 1 : length(v2) ){
    dat[i + length(v1), 1] <- v2[i]
    dat[i + length(v1), 2] <- 'v2'
  }
  for(i in 1 : length(v3) ){
    dat[i + length(v1) + length(v2), 1] <- v3[i]
    dat[i + length(v1) + length(v2), 2] <- 'v3'
  }
  for(i in 1 : length(v4) ){
    dat[i + length(v1) + length(v2) + length(v3), 1] <- v4[i]
    dat[i + length(v1) + length(v2) + length(v3), 2] <- 'v4'
  }
  dat <- as.data.frame(dat)
  
  colnames(dat) <- c('val', 'factor')

  anova_IS <- aov(val ~ factor, data = dat)
  print(summary(anova_IS))

  anova_residuals <- anova_IS$residuals
  print(summary(anova_residuals))
}

 

다섯 집단의 ANOVA test

 

one_way_5_factor_anova <- function(v1, v2, v3, v4, v5){
  dat <- matrix(0, nrow = ( length(v1) + length(v2) + length(v3) + length(v4) + length(v5) ), ncol = 2 )
  for(i in 1 : length(v1) ){
    dat[i, 1] <- v1[i]
    dat[i, 2] <- 'v1'
  }
  for(i in 1 : length(v2) ){
    dat[i + length(v1), 1] <- v2[i]
    dat[i + length(v1), 2] <- 'v2'
  }
  for(i in 1 : length(v3) ){
    dat[i + length(v1) + length(v2), 1] <- v3[i]
    dat[i + length(v1) + length(v2), 2] <- 'v3'
  }
  for(i in 1 : length(v4) ){
    dat[i + length(v1) + length(v2) + length(v3), 1] <- v4[i]
    dat[i + length(v1) + length(v2) + length(v3), 2] <- 'v4'
  }
  for(i in 1 : length(v5) ){
    dat[i + length(v1) + length(v2) + length(v3) + length(v4), 1] <- v5[i]
    dat[i + length(v1) + length(v2) + length(v3) + length(v4), 2] <- 'v5'
  }
  dat <- as.data.frame(dat)
  
  colnames(dat) <- c('val', 'factor')

  anova_IS <- aov(val ~ factor, data = dat)
  print(summary(anova_IS))

  anova_residuals <- anova_IS$residuals
  print(summary(anova_residuals))
}

 

여섯 집단의 ANOVA test

 

one_way_6_factor_anova <- function(v1, v2, v3, v4, v5, v6){
  dat <- matrix(0, nrow = ( length(v1) + length(v2) + length(v3) + length(v4) + length(v5) + length(v6) ), ncol = 2 )
  for(i in 1 : length(v1) ){
    dat[i, 1] <- v1[i]
    dat[i, 2] <- 'v1'
  }
  for(i in 1 : length(v2) ){
    dat[i + length(v1), 1] <- v2[i]
    dat[i + length(v1), 2] <- 'v2'
  }
  for(i in 1 : length(v3) ){
    dat[i + length(v1) + length(v2), 1] <- v3[i]
    dat[i + length(v1) + length(v2), 2] <- 'v3'
  }
  for(i in 1 : length(v4) ){
    dat[i + length(v1) + length(v2) + length(v3), 1] <- v4[i]
    dat[i + length(v1) + length(v2) + length(v3), 2] <- 'v4'
  }
  for(i in 1 : length(v5) ){
    dat[i + length(v1) + length(v2) + length(v3) + length(v4), 1] <- v5[i]
    dat[i + length(v1) + length(v2) + length(v3) + length(v4), 2] <- 'v5'
  }
  for(i in 1 : length(v6) ){
    dat[i + length(v1) + length(v2) + length(v3) + length(v4) + length(v5), 1] <- v6[i]
    dat[i + length(v1) + length(v2) + length(v3) + length(v4) + length(v5), 2] <- 'v6'
  }
  dat <- as.data.frame(dat)
  
  colnames(dat) <- c('val', 'factor')

  anova_IS <- aov(val ~ factor, data = dat)
  print(summary(anova_IS))

  anova_residuals <- anova_IS$residuals
  print(summary(anova_residuals))
}

 

두 벡터의 FC, p value 조사

 

comparison_of_two_vectors <- function(v1, v2, paired = FALSE){
  p.val = t.test(v1, v2, paired = paired)
  print(p.val)
  
  log2FC = log( mean(v1 + 0.000000000001)/mean(v2 + 0.000000000001) ) / log(2)
  print(log2FC)
}

 

 Fisher's exact test를 이용하여 두 집합의 통계적 동일을 검정하는 방법 ver 1

 

total.gene <- 32285
ST <- 31
scRNAseq <- 14
cross <- 5

a <- cross
b <- scRNAseq - a
c <- ST - a
d <- total.gene - a - b - c
A <- a + b
B <- c + d
C <- a + c
D <- b + d

group<-c("A","A","B","B")
cancer<-c("1.Yes","2.No","1.Yes","2.No")
count<-c(a,b,c,d)
dat<-data.frame(group,cancer,count)
tab<-xtabs(count~group+cancer,data=dat)
tab

chisq.test(tab)$observed
chisq.test(tab)$expected
fisher.test(tab)
-log(fisher.test(tab)$p.value, 10)

if(cross > ST * scRNAseq / total.gene){
  print("Enrichment")
} else if(cross < ST * scRNAseq / total.gene){
  print("Depletion")
}

 

Fisher's exact test를 이용하여 두 집합의 통계적 동일을 검정하는 방법 ver 2

 

my.Fisher.exact.test <- function(total, A, B, cross){
  a1 <- log10_factorial(A)
  a2 <- log10_factorial(total - A)
  a3 <- log10_factorial(B)
  a4 <- log10_factorial(total - B)

  b1 <- log10_factorial(cross)
  b2 <- log10_factorial(A - cross)
  b3 <- log10_factorial(B - cross)
  b4 <- log10_factorial(total - cross - (A - cross) - (B - cross))
  b5 <- log10_factorial(total)

  out = a1 + a2 + a3 + a4 - b1 - b2 - b3 - b4 - b5
  return(10^out)
}

 

MIA assay (enrichment) 

 

my.MIA.assay.enrichment <- function(total, A, B, cross){
  out <- 0
  for(i in cross:min(A, B)){
    out = out + my.Fisher.exact.test(total, A, B, i)
  }
  return(out)
}

 

⒁ MIA assay (depletion)

 

my.MIA.assay.depletion <- function(total, A, B, cross){
  out <- 0
  for(i in 0:cross-1){
    out = out + my.Fisher.exact.test(total, A, B, i)
  }
  return(out)
}

 

 

7. 생물정보학 [목차]

⑴ 특정 단어(keyword)로 시작하는 유전자들 탐색 

 

gene_starting_with <- function(keyword){
  human = read.csv("https://blog.kakaocdn.net/dn/29YTj/btrS5iG9QOH/Di6RQKxHOPDii7EjkdHN30/human_genes_36601.tsv?attach=1&knm=tfile.tsv", sep = '\t', header = F)
  mouse = read.csv("https://blog.kakaocdn.net/dn/wkjwJ/btrS1QSgrpD/VS8ELANCQyeZAA3vL8JQP0/mouse_genes_32285.tsv?attach=1&knm=tfile.tsv", sep = '\t', header = F)
  ar = array()
  flag = 0

  if(keyword == toupper(keyword)){ # human genes
    for(i in 1:dim(human)[1]){
      if(grepl(keyword, human[i, 2], fixed = TRUE)){
        flag = flag + 1
        ar[flag] = human[i, 2]
      }
    }   
  }   
  else{ # mouse genes
    for(i in 1:dim(mouse)[1]){
      if(grepl(keyword, mouse[i, 2], fixed = TRUE)){
        flag = flag + 1
        ar[flag] = mouse[i, 2]
      }
    }   
  }
  
  return(ar)
}

 

⑵ ensembl_gene_id와 gene_symbol 간의 변환 등

① convert from ensembl.gene to gene.symbol 

 

library(EnsDb.Hsapiens.v79)
ensembl.genes <- c("ENSG00000150676", "ENSG00000099308", "ENSG00000142676", "ENSG00000180776", "ENSG00000108848", "ENSG00000277370", "ENSG00000103811", "ENSG00000101473")
geneIDs1 <- ensembldb::select(EnsDb.Hsapiens.v79, keys= ensembl.genes, keytype = "GENEID", columns = c("SYMBOL","GENEID"))

  

② convert from ensembl.gene to gene.symbol (표를 이용한 방식) 

 

human_genes_36601.tsv
0.89MB
mouse_genes_32285.tsv
0.85MB

 

ensembl_to_gene <- function(ensembl_list){
  ar = array(dim = length(ensembl_list))

  human = read.csv("https://blog.kakaocdn.net/dn/29YTj/btrS5iG9QOH/Di6RQKxHOPDii7EjkdHN30/human_genes_36601.tsv?attach=1&knm=tfile.tsv", sep = '\t', header = F)
  mouse = read.csv("https://blog.kakaocdn.net/dn/wkjwJ/btrS1QSgrpD/VS8ELANCQyeZAA3vL8JQP0/mouse_genes_32285.tsv?attach=1&knm=tfile.tsv", sep = '\t', header = F)

  for(i in 1:length(ensembl_list)){
    if(grepl('ENSG', ensembl_list[i], fixed = TRUE)){ # human gene
      index = match(ensembl_list[i], human[, 1])
      ar[i] = human[index, 2]
    } 
    else if(grepl('ENSMUSG', ensembl_list[i], fixed = TRUE)){ # mouse gene
      index = match(ensembl_list[i], mouse[, 1])
      ar[i] = mouse[index, 2]
    } 
  }
  return(ar)
}

 

③ convert from gene.symbol to ensembl.gene

 

library(EnsDb.Hsapiens.v79)
geneSymbols <-  c('DDX26B','CCDC83',  'MAST3', 'RPL11', 'ZDHHC20',  'LUC7L3',  'SNORD49A',  'CTSH', 'ACOT8')
geneIDs2 <- ensembldb::select(EnsDb.Hsapiens.v79, keys= geneSymbols, keytype = "SYMBOL", columns = c("SYMBOL","GENEID"))

 

④ convert from gene.symbol to ensembl.gene (표를 이용한 방식)

 

gene_to_ensembl <- function(gene_list){
  ar = array(dim = length(gene_list))

  human = read.csv("https://blog.kakaocdn.net/dn/29YTj/btrS5iG9QOH/Di6RQKxHOPDii7EjkdHN30/human_genes_36601.tsv?attach=1&knm=tfile.tsv", sep = '\t', header = F)
  mouse = read.csv("https://blog.kakaocdn.net/dn/wkjwJ/btrS1QSgrpD/VS8ELANCQyeZAA3vL8JQP0/mouse_genes_32285.tsv?attach=1&knm=tfile.tsv", sep = '\t', header = F)

  for(i in 1:length(gene_list)){
    if(gene_list[i] == toupper(gene_list[i])){ # human gene
      index = match(gene_list[i], human[, 2])
      ar[i] = human[index, 1]
    } 
    else{ # mouse gene
      index = match(gene_list[i], mouse[, 2])
      ar[i] = mouse[index, 1]
    } 
  }

  # return(ignore_na(ar)) 
  ## if possible, ignore_na should be used
  
  return(ar)
}

 

⑤ human ensembl transcript to gene name (ref)

 

library(biomaRt)
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")  
ensembl_transcript_to_gene <- function(transcript_ids){
  # reference : https://support.bioconductor.org/p/106253/#106256

  res <- getBM(attributes = c('ensembl_transcript_id_version', 
                              'ensembl_gene_id', 
                              'external_transcript_name',
                              'external_gene_name'),
               filters = 'ensembl_transcript_id_version', 
               values = transcript_ids,
               mart = mart)

  if(dim(res)[1] == 0){
    return("")
  }	

  return(res[, 'external_gene_name'])
}

 

⑥ mouse gene to MGI symbol (표를 이용한 방식)

 

mouse_gene_to_MGI <- function(mouse_gene_list){
  ar = array(dim = length(mouse_gene_list))

	dat <- read.csv("https://blog.kakaocdn.net/dn/cVeqsA/btrS1JMnxyX/HtVhPmqtxdgt7LQlGkeql0/HOM_MouseHumanSequence.csv?attach=1&knm=tfile.csv")

  for(i in 1:length(mouse_gene_list)){
	  index = match(mouse_gene_list[i], dat[,'Symbol'])
    ar[i] = dat[index, 'Mouse.MGI.ID']
  }

  return(ar)
}

 

⑦ MGI symbol to mouse gene (표를 이용한 방식)

 

MGI_to_mouse_gene <- function(MGI_list){
  ar = array(dim = length(MGI_list))

	dat <- read.csv("https://blog.kakaocdn.net/dn/cVeqsA/btrS1JMnxyX/HtVhPmqtxdgt7LQlGkeql0/HOM_MouseHumanSequence.csv?attach=1&knm=tfile.csv")

  for(i in 1:length(MGI_list)){
	  index = match(MGI_list[i], dat[,'Mouse.MGI.ID'])
    ar[i] = dat[index, 'Symbol']
  }

  return(ar)
}

 

⑧ human gene to HGNC symbol (표를 이용한 방식)

 

human_gene_to_HGNC <- function(human_gene_list){
  ar = array(dim = length(human_gene_list))

	dat <- read.csv("https://blog.kakaocdn.net/dn/cVeqsA/btrS1JMnxyX/HtVhPmqtxdgt7LQlGkeql0/HOM_MouseHumanSequence.csv?attach=1&knm=tfile.csv")

  for(i in 1:length(human_gene_list)){
	  index = match(human_gene_list[i], dat[,'Symbol'])
    ar[i] = dat[index, 'HGNC.ID']
  }

  return(ar)
}

 

⑨ HGNC symbol to human gene (표를 이용한 방식)

 

HGNC_to_human_gene <- function(HGNC_list){
  ar = array(dim = length(HGNC_list))

	dat <- read.csv("https://blog.kakaocdn.net/dn/cVeqsA/btrS1JMnxyX/HtVhPmqtxdgt7LQlGkeql0/HOM_MouseHumanSequence.csv?attach=1&knm=tfile.csv")

  for(i in 1:length(HGNC_list)){
	  index = match(HGNC_list[i], dat[,'HGNC.ID'])
    ar[i] = dat[index, 'Symbol']
  }

  return(ar)
}

 

사람, 마우스 이외의 동물에 대한 변환표 

⑶ Affymetrix probe ID와 ensembl_gene_id, gene_name

① convert Affymetrix probe ID to ensembl_gene_id, gene_name

 

#Convert Affymetrix probe ID to ensembl_gene_id, gene_name
## https://www.biostars.org/p/328065/#328328
## https://www.biostars.org/p/332461/#332474
BiocManager::install("biomaRt", force=TRUE)
library(biomaRt)

dat<-c('1007_s_at', '1053_at', '117_at', '121_at', '1255_g_at', '1294_at')
require("biomaRt")
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
annotLookup <- getBM(
  mart = mart,
  attributes = c(
    "affy_hg_u133_plus_2",
    "ensembl_gene_id",
    "gene_biotype",
    "external_gene_name"),
  filter = "affy_hg_u133_plus_2",
  values = dat,
  uniqueRows=TRUE)

 

⑷ 사람과 마우스의 상동(gene homology between human and mouse)

① 개요 

사람의 유전자는 36601개 

마우스의 유전자는 32285개 

사람과 마우스의 유전체는 거의 유사하다고는 하지만 다른 점도 많음

따라서 사람 유전자를 마우스 유전자로, 혹은 그 역으로 변환하는 방법을 숙지할 필요가 있음 

단순히 마우스 유전자를 대문자로 했을 때 대응 되는 사람 유전자가 아닌 경우도 많음 

방법 1. biomaRt 

 

install.packages("BiocManager")
BiocManager::install("biomaRt")
library(biomaRt)

genes <- c("Xkr4", "Gm1992", "Gm37381")

human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mouse <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
hh <- getLDS(attributes = c("mgi_symbol", "ensembl_gene_id", "entrezgene_id"), filters = "mgi_symbol", values = genes, mart = mouse, attributesL =
c("hgnc_symbol", "ensembl_gene_id", "entrezgene_id"), martL = human, uniqueRows = T)

 

attribute 1. MGI symbol와 HGNC symbol의 대응표를 통해 대응관계를 결정

 attribute 2. ENSMUSG code와 ENSG code의 대응표를 통해 대응관계를 결정

 attribute 3. NCBI gene ID와 NCBI gene ID의 대응표를 통해 대응관계를 결정

 각 attribute에 대해 왼쪽에 마우스에 대한 정보이고 오른쪽은 사람에 대한 정보

 현재 작동하지 않음 : 현재 다음과 같은 에러메세지가 발견되어 작동되지 않음 

 

#1. Error in getLDS(attributes = c("mgi_symbol", "ensembl_gene_id", "entrezgene_id"), : Query ERROR: caught BioMart::Exception::Database: Could not connect to mysql database ensembl_mart_106: DBI connect ('database =ensembl_mart_106;host=127.0.0.1;port=5316','ensro',...) failed: Can't connect to MySQL server on '127.0.0.1' (111) at /nfs/public/ro/ensweb/live/mart/www_106/biomart-perl/lib/BioMart/Configuration/DBLocation.pm line 98.
#2. Error in textConnection(attrfilt) : invalid 'text' argument
#3. Ensembl site unresponsive, trying useast mirror. Ensembl site unresponsive, trying asia mirror
#4. Error: biomaRt has encountered an unexpected server error. Consider trying one of the Ensembl mirrors (for more details look at ?useEnsembl)

 

방법 2. MGI 홈페이지에서 제공하는 사람-마우스 대응표를 이용하는 방법

 

HOM_MouseHumanSequence.csv
12.78MB

 

human_to_mouse <- function(human_gene){
  hom <- read.csv("https://blog.kakaocdn.net/dn/cVeqsA/btrS1JMnxyX/HtVhPmqtxdgt7LQlGkeql0/HOM_MouseHumanSequence.csv?attach=1&knm=tfile.csv")

  mouse_gene = array()
  flag = 0

  for(i in 1 : length(human_gene)){
    index = match(human_gene[i], hom[hom$Common.Organism.Name == 'human', 'Symbol'])
    key = hom[hom$Common.Organism.Name == 'human', 'DB.Class.Key'][index]
    flag = flag + 1
    mouse_gene[flag] = hom[hom$DB.Class.Key == key 
                           & hom$Common.Organism.Name == 'mouse, laboratory'
                           , 'Symbol'][1] # duplicate mouse genes can be found
  }

  return(mouse_gene)
}

mouse_to_human <- function(mouse_gene){
  hom <- read.csv("https://blog.kakaocdn.net/dn/cVeqsA/btrS1JMnxyX/HtVhPmqtxdgt7LQlGkeql0/HOM_MouseHumanSequence.csv?attach=1&knm=tfile.csv")

  human_gene = array()
  flag = 0

  for(i in 1 : length(mouse_gene)){
    index = match(mouse_gene[i], hom[hom$Common.Organism.Name == 'mouse, laboratory', 'Symbol'])
    key = hom[hom$Common.Organism.Name == 'mouse, laboratory', 'DB.Class.Key'][index]
    flag = flag + 1
    human_gene[flag] = hom[hom$DB.Class.Key ==  key
                           & hom$Common.Organism.Name == 'human'
                           , 'Symbol'][1] # duplicate human genes can be found
  }

  return(human_gene)
}

 

⑸ chromosome position to hgnc_symbol

 

ChromosomePosition_to_hgnc_symbol <- function(chromosome, start, end){
  # reference : https://support.bioconductor.org/p/127035/

  library(biomaRt)
  positions <- data.frame(chromosome = chromosome,
                          start = start,
                          end = end)

  ensembl = useEnsembl(biomart='ensembl', 
                       dataset="hsapiens_gene_ensembl") 

  results <- getBM(attributes = c("hgnc_symbol", "chromosome_name", "start_position", "end_position"), 
                   filters = c("chromosome_name", "start", "end"),
                   values = list(positions[,1], positions[,2], positions[,3]),
                   mart = ensembl)

  print(results)

  postions_combined <- apply(as.matrix(positions), 1, paste, collapse = ":")

  results2 <- getBM(attributes = c("hgnc_symbol", "chromosome_name", "start_position", "end_position"), 
                   filters = c("chromosomal_region"),
                   values = postions_combined,
                   mart = ensembl)

  print(results2)
}

 

 gene name to chromosome position (표를 이용한 방식)

 

human gene annotation.csv
2.95MB
mouse gene annotation.csv
3.17MB

 

gene_to_chromosome_position <- function(gene_list){
  # gene_list : list of genes

  human <- read.csv("https://blog.kakaocdn.net/dn/lTbKq/btrWjlmGho8/eWwWHbfLOlVGKAVeuDMKt1/human%20gene%20annotation.csv?attach=1&knm=tfile.csv")
  mouse = read.csv("https://blog.kakaocdn.net/dn/clSwT7/btrWcrWmS41/mNLCUuBlQxfJFhG1U2JQNk/mouse%20gene%20annotation.csv?attach=1&knm=tfile.csv")

  result = array()
  
  for(i in 1:length(gene_list)){
    if(gene_list[i] == toupper(gene_list[i])){ # human gene
      idx = match(gene_list[i], human[, 1])
      note = paste('Gene.ID: ', gene_list[i],
                   ', chromosome: ', human[idx, 'chromosome'],
                   ', start: ', human[idx, 'start'],
                   ', end: ', human[idx, 'end'])
      result[i] = note
    }
    else{ # mouse gene
      idx = match(gene_list[i], mouse[, 1])
      note = paste('Gene.ID: ', gene_list[i],
                   ', chromosome: ', mouse[idx, 'chromosome'],
                   ', start: ', mouse[idx, 'start'],
                   ', end: ', mouse[idx, 'end'])
      result[i] = note
    }
  }
  
  return(result)
}

 

⑺ gene name to description 

 

gene_to_description <- function(gene_list){
  # gene_list : list of genes

  human <- read.csv("https://blog.kakaocdn.net/dn/lTbKq/btrWjlmGho8/eWwWHbfLOlVGKAVeuDMKt1/human%20gene%20annotation.csv?attach=1&knm=tfile.csv")
  mouse = read.csv("https://blog.kakaocdn.net/dn/clSwT7/btrWcrWmS41/mNLCUuBlQxfJFhG1U2JQNk/mouse%20gene%20annotation.csv?attach=1&knm=tfile.csv")

  result = array()
  
  for(i in 1:length(gene_list)){
    if(gene_list[i] == toupper(gene_list[i])){ #human
      idx = match(gene_list[i], human[, 1])
      result[i] = human[idx, 'Description']
    }
    else{ #mouse 
      idx = match(gene_list[i], mouse[, 1])
      result[i] = mouse[idx, 'Description']
    }    
  }
  
  return(result)
}

 

⑻ gene name to bioType

 

gene_to_bioType <- function(gene_list){
  # gene_list : list of mouse genes

  human <- read.csv("https://blog.kakaocdn.net/dn/lTbKq/btrWjlmGho8/eWwWHbfLOlVGKAVeuDMKt1/human%20gene%20annotation.csv?attach=1&knm=tfile.csv")
  mouse = read.csv("https://blog.kakaocdn.net/dn/clSwT7/btrWcrWmS41/mNLCUuBlQxfJFhG1U2JQNk/mouse%20gene%20annotation.csv?attach=1&knm=tfile.csv")

  result = array()
  
  for(i in 1:length(gene_list)){
    if(gene_list[i] == toupper(gene_list[i])){ # human
      idx = match(gene_list[i], human[, 1])
      result[i] = human[idx, 'bioType']
    }
    else { # mouse
      idx = match(gene_list[i], mouse[, 1])
      result[i] = mouse[idx, 'bioType']
    }
  }
  
  return(result)
}

 

Seurat 객체 object 유전자 이름 변경 : 객체를 새로 만들어야 함 (레퍼런스)

 

# RenameGenesSeurat  ------------------------------------------------------------------------------------
RenameGenesSeurat <- function(obj = ls.Seurat[[i]], newnames = HGNC.updated[[i]]$Suggested.Symbol) { # Replace gene names in different slots of a Seurat object. Run this before integration. Run this before integration. It only changes obj@assays$RNA@counts, @data and @scale.data.
  print("Run this before integration. It only changes obj@assays$RNA@counts, @data and @scale.data.")
  RNA <- obj@assays$RNA

  if (nrow(RNA) == length(newnames)) {
    if (length(RNA@counts)) RNA@counts@Dimnames[[1]]            <- newnames
    if (length(RNA@data)) RNA@data@Dimnames[[1]]                <- newnames
    if (length(RNA@scale.data)) RNA@scale.data@Dimnames[[1]]    <- newnames
  } else {"Unequal gene sets: nrow(RNA) != nrow(newnames)"}
  obj@assays$RNA <- RNA
  return(obj)
}
# RenameGenesSeurat(obj = SeuratObj, newnames = HGNC.updated.genes)

 

⑽ 클러스터 정보 변경

 

update_cluster_in_seurat_obj <- function(seurat_obj, barcode, cluster){

  # dim(seurat_obj)[1] = length(barcode) = length(cluster)
  mat <- matrix(0, nrow = length(barcode), ncol = 2)
  mat[, 1] = barcode
  mat[, 2] = cluster
  mat = as.data.frame(mat)
  rownames(mat) = barcode

  seurat_obj@meta.data$orig.ident = mat[rownames(seurat_obj@meta.data), 2]
  # you may need to modify the above code
  seurat_obj@active.ident = as.factor(seurat_obj@meta.data$orig.ident)
  
  return (seurat_obj)
}

 

Idents를 쓰면 원하는 metadata에 대해 FindAllMarkers를 쓸 수 있음 (레퍼런스 1, 레퍼런스 2)

 

Idents(pbmc) <- pbmc$celltype
markers = FindAllMarkers(object = pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

 

Seurat 객체의 orig.ident의 플롯 내 순서를 바꾸기 

 

# Reference : https://github.com/satijalab/seurat/issues/2471

object$orig.ident <- factor(x = object$orig.ident, levels = c("A", "B", "C"))

 

Seurat 객체 pbmc3k를 h5ad로 저장하기 (레퍼런스)

 

library(Seurat)
library(SeuratData)
library(SeuratDisk)

SaveH5Seurat(pbmc3k, filename = "~/Downloads/pbmc3k.h5Seurat")
Convert("~/Downloads/pbmc3k.h5Seurat", dest = "h5ad")

 

① 생성된 .h5ad 파일은 파이썬 상에서 다음 명령어로 열 수 있음 

 

import scanpy as sc

pbmc = sc.read_h5ad("pbmc3k.h5ad")

 

⒁ tissue_dir 디렉토리 내 matrix.mtx, barcodes.tsv, features.tsv 및 spatial 폴더가 있을 때, Visium 데이터를 R에서 읽는 코드

 

library(Seurat)

b_data <-ReadMtx('./matrix.mtx.gz',
               './barcodes.tsv.gz',
               './features.tsv.gz',
               feature.column=1)
b_data = CreateSeuratObject(b_data, assay='Spatial')
b_image = Read10X_Image(paste('./spatial/',sep=''))
b_image_ = b_image[Cells(b_data)]
DefaultAssay(object = b_image_) <- 'Spatial'
b_data[["slice1"]]=b_image_
# you can change "slice1"

b_data <- SCTransform(b_data, assay = "Spatial", verbose = FALSE, variable.features.n = 1000)
# Some errors might occur when running SpatialFeaturePlot without SCTransform

SpatialFeaturePlot(b_data, rownames(b_data)[1])
# check whether it works properly

 

⒂ .h5ad로 저장된 Visium 데이터를 R에서 읽는 코드

 

library(Seurat)
library(anndata)
library(png)
library(jsonlite)

adata = read_h5ad("adata.h5ad")

hires = adata$uns$spatial$SAMPLE$images$hires
# you must change "SAMPLE"
png_file_path_hires = "~/Downloads/data/spatial/tissue_hires_image.png"
png(filename = png_file_path_hires, width = dim(hires)[1], height = dim(hires)[2])
par(mar = c(0, 0, 0, 0))  # Remove margins
plot.new()
rasterImage(hires, 0, 0, 1, 1)
dev.off()

lowres = adata$uns$spatial$SAMPLE$images$lowres
# you must change "SAMPLE"
png_file_path_lowres = "~/Downloads/data/spatial/tissue_lowres_image.png"
png(filename = png_file_path_lowres, width = dim(lowres)[1], height = dim(lowres)[2])
par(mar = c(0, 0, 0, 0))  # Remove margins
plot.new()
rasterImage(lowres, 0, 0, 1, 1)
dev.off()

scale_factors <- adata$uns$spatial$SAMPLE$scalefactors
# you must change "SAMPLE"
json_file_path <- "~/Downloads/data/spatial/scalefactors_json.json"
write_json(scale_factors, json_file_path)

df = adata$obs[,1:3]
df[,4] = adata$obsm$spatial[,2]
df[,5] = adata$obsm$spatial[,1]
write.csv(df, "~/Downloads/data/spatial/tissue_positions_list.csv")

b_data = CreateSeuratObject(t(as.matrix(adata$X)), assay='Spatial')
b_image = Read10X_Image(paste('~/Downloads/data/spatial/',sep=''))
b_image_ = b_image[Cells(b_data)]
DefaultAssay(object = b_image_) <- 'Spatial'
b_data[["slice1"]]=b_image_
# you can change "slice1"

b_data <- SCTransform(b_data, assay = "Spatial", verbose = FALSE, variable.features.n = 1000)
# Some errors might occur when running SpatialFeaturePlot without SCTransform

SpatialFeaturePlot(b_data, rownames(b_data)[1])
# An error occurs...

 

⒃ highly variable gene을 뽑아주는 코드

 

library(dplyr)
library(Seurat)

pbmc.data <- Read10X(data.dir = "./outs")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc[["percent.mt"]] <- PercentageFeatureSet(object = pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 15)
pbmc <- NormalizeData(object = pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(object = pbmc, selection.method = "vst", nfeatures = 2000)

gene_list = rownames(pbmc)[pbmc@assays$RNA@meta.features$vst.variable]

 

주성분 분석 (PCA) 

 

# reference : https://kkokkilkon.tistory.com/144

#install_github("devtools")
library(devtools)
#install_github("ggbiplot", "vqv")
library(ggbiplot)

PCA <- function(dt, dt_group, scale=T){
  # rownames(dt) : our interest
  # colnames(dt) : the dimensional space of each sample
  
  pca_dt <- prcomp(dt,
                   center = T,
                   scale. = scale)
  
  ggbiplot(pca_dt,
                choices = c(1, 2),
                obs.scale = 1,
                var.scale = 1,
                groups = dt_group,
                circle = TRUE,
                varname.size=0,
                var.axes = F)
}

# Example
dt <- iris[, -5]
dt_group <- iris[, 5]
scale = TRUE
PCA(dt, dt_group, scale)

 

Kaplan-Meier 생존 분석 

 

# 필요한 패키지 설치 및 로드
install.packages("survival")
install.packages("survminer")
library(survival)
library(survminer)

# 수정된 데이터 프레임 생성
# 모든 사망과 censoring 사건을 정확하게 반영
surv_data <- data.frame(
  time = c(6, 12, 21, 27, 32, 39, 43, 43, 43, 89, 89, 89, 89, 89, 89, 261, 263, 270, 270, 311),
  status = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1) # 사망은 1, censoring은 0
)

# Surv 객체 생성
surv_obj <- Surv(time = surv_data$time, event = surv_data$status)

# Kaplan-Meier 생존 곡선 추정
fit <- survfit(surv_obj ~ 1, data = surv_data)

# 생존 곡선 그리기
ggsurvplot(
  fit, 
  data = surv_data, 
  xlab = "Time", 
  ylab = "Survival probability", 
  title = "Kaplan-Meier Survival Curve",
  surv.median.line = "hv", # 중앙값 생존 시간 선 추가
  ggtheme = theme_minimal(), # 테마 설정
  risk.table = TRUE, # 위험 테이블 추가
  palette = "Dark2" # 색상 팔레트 설정
)

 

 

8. 이미지 생성 [목차]

⑴ 이미지를 png로 저장하는 예시

 

png(file = "my plot.png", width = 1500, height = 300)
DimPlot(pbmc, reduction = "umap", label = TRUE, repel = TRUE)
dev.off()

 

x와 y가 주어져 있을 때 scatter plot과 기울기의 신뢰구간을 그리는 함수 

 

scatter_plot <- function(x, y, xlab = "x", ylab = "y", point_size = 2, lab_size = 4, png=TRUE){
  library(ggplot2)

  # the lenth(x) must be same with the length(y)
  mat <- matrix(0, nrow = length(x), ncol = 2)
  mat[, 1] = x
  mat[, 2] = y
  colnames(mat) = c(xlab, ylab)
  mat <- as.data.frame(mat)

  if(png){
    png("./scatter_plot.png",width=2000,height=2000,res=500)
    ggplot(mat, aes(x=x, y=y)) + geom_point(shape=19, size=point_size, color="blue") + theme(plot.background = element_blank(),   panel.grid.major = element_blank(), panel.grid.minor = element_blank(),panel.background = element_blank(), axis.line = element_line(size =1)) +   stat_smooth(method = lm, level=.95, color="grey") + labs(x=xlab, y=ylab, size=lab_size)
    dev.off()
  } else{
    ggplot(mat, aes(x=x, y=y)) + geom_point(shape=19, size=point_size, color="blue") + theme(plot.background = element_blank(),   panel.grid.major = element_blank(), panel.grid.minor = element_blank(),panel.background = element_blank(), axis.line = element_line(size =1)) +   stat_smooth(method = lm, level=.95, color="grey") + labs(x=xlab, y=ylab, size=lab_size)
  }
}

 

⑶ x, y, color가 주어져 있을 때 SpatialFeaturePlot을 그려주는 함수

 

my.plot <- function(x, y, col){
	# assume that length(x) = length(y) = length(col)

  plot(x, y, t="n")
  colfunc <- colorRampPalette(c("#000000", "#EB4600", "#FFF800"))
  
  coll = array(dim = length(col))
  for(i in 1 : length(col)){
    coll[i] <- colfunc(100) [as.integer( col[i] / max(col) * 99 + 1)] 
  }
  
  text(x, y, labels = "●", col = coll, cex = 1)
}

 

⑷ spatial feature plot

 

# tissue_dir : the directory that contains a filtered_feature_bc_matrix.h5
tissue_dir <- './outs/' 

# Tgenes : genes of interest
Tgenes <- c('Slc2a1', 'Slc2a3')

conv_spatial_feature_plot <- function(tissue_dir, Tgenes, quality.control = FALSE){
  library(Seurat)
  library(SeuratData)
  library(ggplot2)
  library(cowplot)
  library(dplyr)

  # reference : https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/seurat/seurat_07_spatial.html
  
  br.sp = Load10X_Spatial(tissue_dir, slice= 'slice1')
  br.sp <- SCTransform(br.sp, assay = "Spatial", verbose = FALSE, variable.features.n = 1000)

  if(quality.control){
    br.sp <- PercentageFeatureSet(br.sp, "^mt-", col.name = "percent_mito")
    br.sp <- PercentageFeatureSet(br.sp, "^Hb.*-", col.name = "percent_hb")
    br.sp <- br.sp[, br.sp$nFeature_Spatial > 500 & br.sp$percent_mito < 25 & br.sp$percent_hb < 20]
  }

  SpatialFeaturePlot(br.sp, features = Tgenes)
}

conv_spatial_feature_plot(tissue_dir, Tgenes)

 

 gene.list, log FC 값 벡터, adjusted p value 값 벡터가 주어졌을 때 enhanced volcano plot

 

my.EnhancedVolcano <- function(gene.name, logFC, adj.P.Val, 
                               pCutoff = 0.05, FCcutoff = 0.3,
                               xlim = c(-0.5, 0.5), ylim = c(-0.5, 0.5)){
  # install.packages("BiocManager")
  # BiocManager::install("EnhancedVolcano")
  library(EnhancedVolcano)


  tested <- matrix(0, nrow = length(gene.name), ncol = 2)
  tested <- as.data.frame(tested)
  for(i in 1:length(gene.name)){
    tested[i, 1] <- logFC[i]
    tested[i, 2] <- adj.P.Val[i]
  }
  rownames(tested) <- gene.name
  colnames(tested) <- c('logFC', 'adj.P.Val')

  EnhancedVolcano(tested, lab = rownames(tested), 
                  x='logFC', y='adj.P.Val', xlim = xlim, ylim = ylim, 
                  pCutoff = pCutoff, FCcutoff = FCcutoff) 
}

 

 gene list로부터 GO (gene ontology)를 구하는 법

 

GO <- function(gene){
  library(EnhancedVolcano)
  library(clusterProfiler)
  library(org.Hs.eg.db)
  library(org.Mm.eg.db)
  library(enrichplot)

  # ont = "ALL", "BP", "CC", "MF"
  # showCategory is not mandatory

  gene <- gsub('.*-', '', gene) # -와 그 앞에 있는 것들을 제거 (예 : "GRCh38-")
  gene <- gsub('\\ .*$', '', gene) # 'KLK2 ENSG00000167751' 같은 것을 해결 
    
  if (gene[1] == toupper(gene[1])){ ## Human gene
      gene.df <- bitr(gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)
      gene.df <- as.vector(gene.df[[2]])
      GO <- enrichGO(gene.df, OrgDb = 'org.Hs.eg.db',keyType = "ENTREZID", ont = "ALL", pvalueCutoff = 0.05, pAdjustMethod = "BH")
      return(GO)
  } else{ ## Mouse gene?
      gene.df <- bitr(gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Mm.eg.db)
      gene.df <- as.vector(gene.df[[2]])
      GO <- enrichGO(gene.df, OrgDb = 'org.Mm.eg.db',keyType = "ENTREZID", ont = "ALL", pvalueCutoff = 0.05, pAdjustMethod = "BH")
      return(GO)
  }
}

 

⑺ gene list로부터 GO (gene ontology) plot을 그리는 법 ( GO 플롯을 해석하는 방법)

 

GO.plot <- function(gene){
  library(EnhancedVolcano)
  library(clusterProfiler)
  library(org.Hs.eg.db)
  library(org.Mm.eg.db)
  library(enrichplot)

  # ont = "ALL", "BP", "CC", "MF"
  # showCategory is not mandatory

  gene <- gsub('.*-', '', gene) # -와 그 앞에 있는 것들을 제거 (예 : "GRCh38-")
  gene <- gsub('\\ .*$', '', gene) # 'KLK2 ENSG00000167751' 같은 것을 해결 
    
  if (gene[1] == toupper(gene[1])){ ## Human gene
      gene.df <- bitr(gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)
      gene.df <- as.vector(gene.df[[2]])
      GO <- enrichGO(gene.df, OrgDb = 'org.Hs.eg.db',keyType = "ENTREZID", ont = "ALL", pvalueCutoff = 0.05, pAdjustMethod = "BH")
      dotplot(GO,split="ONTOLOGY", showCategory = 5)+facet_grid(ONTOLOGY~., scale="free")
  } else{ ## Mouse gene?
      gene.df <- bitr(gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Mm.eg.db)
      gene.df <- as.vector(gene.df[[2]])
      GO <- enrichGO(gene.df, OrgDb = 'org.Mm.eg.db',keyType = "ENTREZID", ont = "ALL", pvalueCutoff = 0.05, pAdjustMethod = "BH")
      dotplot(GO,split="ONTOLOGY", showCategory = 5)+facet_grid(ONTOLOGY~., scale="free")
  }
}

### Example
GO.plot(c("COL1A1", "COL1A2", "COL3A1", "COL6A3"))

 

⑻ gene list로부터 cnetplot을 그리는 법

 

my.cnetplot <- function(gene.list){
  GO <- function(gene){
    library(EnhancedVolcano)
    library(clusterProfiler)
    library(org.Hs.eg.db)
    library(org.Mm.eg.db)
    library(enrichplot)
  
    # ont = "ALL", "BP", "CC", "MF"
    # showCategory is not mandatory
  
    gene <- gsub('.*-', '', gene) # -와 그 앞에 있는 것들을 제거 (예 : "GRCh38-")
    gene <- gsub('\\ .*$', '', gene) # 'KLK2 ENSG00000167751' 같은 것을 해결 
      
    if (gene[1] == toupper(gene[1])){ ## Human gene
        gene.df <- bitr(gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)
        gene.df <- as.vector(gene.df[[2]])
        GO <- enrichGO(gene.df, OrgDb = 'org.Hs.eg.db',keyType = "ENTREZID", ont = "ALL", pvalueCutoff = 0.05, pAdjustMethod = "BH")
        return(GO)
    } else{ ## Mouse gene?
        gene.df <- bitr(gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Mm.eg.db)
        gene.df <- as.vector(gene.df[[2]])
        GO <- enrichGO(gene.df, OrgDb = 'org.Mm.eg.db',keyType = "ENTREZID", ont = "ALL", pvalueCutoff = 0.05, pAdjustMethod = "BH")
        return(GO)
    }
  }
  
  go = GO(gene.list)
  ego_ <- 0
  if (gene.list[1] == toupper(gene.list[1])){ ## Human gene
    ego_ <- setReadable(go, org.Hs.eg.db, keyType='ENTREZID')
  } else { ## Mouse gene?
    ego_ <- setReadable(go, org.Mm.eg.db, keyType='ENTREZID')
  }
  
  cnetplot(ego_, categorySize="pvalue", foldChange= gene.list )
}

my.cnetplot(c("COL1A1", "COL1A2", "COL3A1", "COL6A3"))

 

통계 분석까지 곁들인 바이올린 플롯

 

VlnPlot(object = br.sp, features = c('Col1a1'),
        group.by = 'orig.ident', pt.size = 0.1) + 
    	facet_grid(.~tnbc.merge@active.ident)+
    	fill_palette(palette='npg')+
    	stat_compare_means(method = "anova", label='p')+
    	theme(axis.text.x = element_text(angle = 90, hjust = 1),
    	strip.text.x = element_text(size = rel(0.7)))

 

입력: 2022.05.03 00:13