
【Bioinformatics】 Bioinformatics Appendix

 


 

Recommended post : 【Bioinformatics】 Bioinformatics Analysis Table of Contents


1. gnomAD [Body]

2. ENCODE project [Body]

3. Data growth rate [Body]

4. Sequencing technology throughput [Body]


 

1. gnomAD [Contents]

⑴ Overview

gnomAD data is available for download through Google Cloud Public Datasets, the Registry of Open Data on AWS, and Azure Open Datasets. We recommend using Hail and our Hail utilities for gnomAD to work with the data. In addition to the files listed below, Terra has a demo workspace for working with gnomAD data.

⑵ Google Cloud Public Datasets

Files can be browsed and downloaded using gsutil.

 

$ gsutil ls gs://gcp-public-data--gnomad/release/

 

gnomAD variants are also available as a BigQuery dataset.

⑶ Registry of Open Data on AWS

Files can be browsed and downloaded using the AWS Command Line Interface.

 

$ aws s3 ls s3://gnomad-public-us-east-1/release/

⑷ Azure Open Datasets 

Files can be browsed and downloaded using AzCopy or Azure Storage Explorer.

 

$ azcopy ls https://datasetgnomad.blob.core.windows.net/dataset/
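
The three listing commands above differ only in the tool and the storage address, so they can be kept in one lookup table. The following is a minimal Python sketch under stated assumptions: the names `PROVIDERS`, `build_list_command`, and `run_listing` are hypothetical and introduced only for illustration, while the commands themselves are copied from the examples above. Actually running a listing assumes that the matching CLI (gsutil, aws, or azcopy) is installed and that network access is available.

```python
import subprocess

# Tool and release location for each cloud provider, copied from the commands above.
PROVIDERS = {
    "google": ["gsutil", "ls", "gs://gcp-public-data--gnomad/release/"],
    "aws": ["aws", "s3", "ls", "s3://gnomad-public-us-east-1/release/"],
    "azure": ["azcopy", "ls", "https://datasetgnomad.blob.core.windows.net/dataset/"],
}


def build_list_command(provider):
    """Return the argument list that lists gnomAD files for one provider."""
    try:
        return list(PROVIDERS[provider])
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None


def run_listing(provider):
    """Run the listing command and return its standard output as text.

    Needs network access and the matching CLI on PATH (assumption).
    """
    result = subprocess.run(
        build_list_command(provider), capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the commands here; nothing is executed against the cloud.
    for name in PROVIDERS:
        print(" ".join(build_list_command(name)))
```

Keeping the commands as argument lists rather than one shell string avoids quoting problems when they are passed to `subprocess.run`.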

 

 

2. ENCODE project [Contents]

⑴ History : human genome draft in 2001 → in 2003, NHGRI launched the ENCODE project to identify all functional elements in the human genome

⑵ Phase I : 1% of genome. Ended in 2007

⑶ Phase II : build-out phase. Ended in 2012

① GENCODE version 7 (December 2010)

○ 51,082 genes : 161,375 transcripts

○ 20,687 protein-coding genes : 76,052 transcripts

○ 9,640 lncRNA genes : 15,512 transcripts

⑷ Phase III : production phase. Ended in 2016

⑸ Phase IV : started in 2016-2017

① GENCODE version 29 (May 2018)

○ 58,721 genes : 206,694 transcripts

○ 19,940 protein-coding genes : 83,129 transcripts

○ 16,066 lncRNA genes : 29,566 transcripts

② GENCODE version 36 (May 2020)

○ 60,660 genes : 232,117 transcripts

○ 19,962 protein-coding genes : 85,269 transcripts

○ 17,958 lncRNA genes : 48,734 transcripts
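
The version lists above are easier to compare as ratios. The following is a minimal Python sketch that computes transcripts per gene and each category's share of all genes; the counts are copied from the lists above, and the names `ANNOTATION_COUNTS`, `transcripts_per_gene`, and `summarize` are hypothetical names chosen for illustration.

```python
# (genes, transcripts) per category, copied from the version lists above.
ANNOTATION_COUNTS = {
    "v7 (2010-12)": {
        "all": (51_082, 161_375),
        "protein_coding": (20_687, 76_052),
        "lncRNA": (9_640, 15_512),
    },
    "v29 (2018-05)": {
        "all": (58_721, 206_694),
        "protein_coding": (19_940, 83_129),
        "lncRNA": (16_066, 29_566),
    },
    "v36 (2020-05)": {
        "all": (60_660, 232_117),
        "protein_coding": (19_962, 85_269),
        "lncRNA": (17_958, 48_734),
    },
}


def transcripts_per_gene(genes, transcripts):
    """Average number of annotated transcripts per gene."""
    return transcripts / genes


def summarize(counts):
    """Return (version, category, transcripts per gene, share of all genes) rows."""
    rows = []
    for version, categories in counts.items():
        total_genes = categories["all"][0]
        for category, (genes, transcripts) in categories.items():
            rows.append(
                (
                    version,
                    category,
                    transcripts_per_gene(genes, transcripts),
                    genes / total_genes,
                )
            )
    return rows


if __name__ == "__main__":
    for version, category, ratio, share in summarize(ANNOTATION_COUNTS):
        print(f"{version:14s} {category:15s} {ratio:5.2f} transcripts/gene  {share:6.1%} of genes")
```

Read this way, the protein-coding gene count stays close to 20,000 across versions, while the lncRNA gene and transcript counts grow substantially.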

 

 

3. Data growth rate [Contents]

 

data phase | astronomy | Twitter | YouTube | genomics
acquisition | 25 zetta-bytes/yr | 0.5-15 billion tweets/yr | 500-900 million hrs/yr | 1 zetta-bases/yr
storage | 1 EB/yr | 1-17 PB/yr | 1-2 EB/yr | 2-40 EB/yr
analysis | in situ data reduction; real-time processing; massive volumes | topic and sentiment mining; metadata analysis | limited requirements | heterogeneous data and analysis; variant calling, ~2 trillion central processing unit (CPU) hours
distribution | dedicated lines from antennae to server (600 TB/s) | small units of distribution | major component of modern user's bandwidth (10 MB/s) | many small (10 MB/s) and fewer massive (10 TB/s) data movement

Table 1. Data growth rate (ref)
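
The storage row mixes PB and EB, which makes the domains hard to compare at a glance. The following is a minimal Python sketch that converts the yearly storage ranges to bytes and compares upper bounds. It assumes decimal (SI) prefixes, which the table does not state; the names `STORAGE_PER_YEAR`, `storage_range_bytes`, and `upper_bound_ratio` are hypothetical.

```python
# Assumption: decimal (SI) prefixes, 1 PB = 10**15 bytes and 1 EB = 10**18 bytes.
BYTES_PER_UNIT = {"PB": 10**15, "EB": 10**18}

# Yearly storage (low, high, unit), copied from the storage row of the table.
STORAGE_PER_YEAR = {
    "astronomy": (1, 1, "EB"),
    "Twitter": (1, 17, "PB"),
    "YouTube": (1, 2, "EB"),
    "genomics": (2, 40, "EB"),
}


def storage_range_bytes(domain):
    """Return the (low, high) yearly storage estimate for a domain in bytes."""
    low, high, unit = STORAGE_PER_YEAR[domain]
    factor = BYTES_PER_UNIT[unit]
    return low * factor, high * factor


def upper_bound_ratio(domain, reference):
    """Ratio of the upper storage estimates of two domains."""
    return storage_range_bytes(domain)[1] / storage_range_bytes(reference)[1]


if __name__ == "__main__":
    for name in STORAGE_PER_YEAR:
        low, high = storage_range_bytes(name)
        print(f"{name:10s} {low:.1e} to {high:.1e} bytes/yr")
    print("genomics vs Twitter (upper bounds):", round(upper_bound_ratio("genomics", "Twitter")))
```

The comparison uses only the upper ends of each range, so it should be read as an order-of-magnitude indication rather than a precise figure.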

 

 

4. Sequencing technology throughput [Contents]

 

Platform | Sequencer model | Read length | Reads per run
Illumina | iSeq 100 | 75-300 bp | 4 million
Illumina | MiniSeq | 75-300 bp | 25 million
Illumina | MiSeq | 75-300 bp | 25 million
Illumina | NextSeq 550 | 75-150 bp | 400 million
Illumina | NovaSeq 6000 | 75-300 bp | 10 billion
PacBio | Sequel | 10-60 kb | 1 million
PacBio | Sequel II | 10-100 kb | 7 million
PacBio | Sequel IIe | 10-100 kb | 8 million
Oxford Nanopore | MinION | 10 kb - 1 Mb | 1 million
Oxford Nanopore | GridION | 10 kb - 1 Mb | 5 million
Oxford Nanopore | PromethION 24 | 10 kb - 1 Mb | 15 million
Oxford Nanopore | PromethION 48 | 10 kb - 1 Mb | 30 million

Table 2. Sequencing technology throughput
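
Multiplying read length by reads per run gives a rough bound on the bases produced per run. The following is a minimal Python sketch of that arithmetic using the table's values. It assumes 1 kb = 1,000 bases and 1 Mb = 1,000,000 bases, and it uses an approximate human genome size of 3.1 billion bases, which is not taken from the table. Because real read lengths follow a distribution, the product of the extremes is a loose bound, not an expected yield. The names `SEQUENCERS`, `yield_range`, and `coverage_range` are hypothetical.

```python
# (min read length, max read length, reads per run), copied from the table.
# Lengths are in bases, assuming 1 kb = 1_000 bases and 1 Mb = 1_000_000 bases.
SEQUENCERS = {
    "iSeq 100": (75, 300, 4_000_000),
    "MiniSeq": (75, 300, 25_000_000),
    "MiSeq": (75, 300, 25_000_000),
    "NextSeq 550": (75, 150, 400_000_000),
    "NovaSeq 6000": (75, 300, 10_000_000_000),
    "Sequel": (10_000, 60_000, 1_000_000),
    "Sequel II": (10_000, 100_000, 7_000_000),
    "Sequel IIe": (10_000, 100_000, 8_000_000),
    "MinION": (10_000, 1_000_000, 1_000_000),
    "GridION": (10_000, 1_000_000, 5_000_000),
    "PromethION 24": (10_000, 1_000_000, 15_000_000),
    "PromethION 48": (10_000, 1_000_000, 30_000_000),
}

# Assumption, not from the table: approximate human genome size in bases.
HUMAN_GENOME_BASES = 3_100_000_000


def yield_range(model):
    """Return (low, high) bases per run as read length bound times reads per run."""
    min_len, max_len, reads = SEQUENCERS[model]
    return min_len * reads, max_len * reads


def coverage_range(model, genome_bases=HUMAN_GENOME_BASES):
    """Return (low, high) fold coverage of a genome from a single run."""
    low, high = yield_range(model)
    return low / genome_bases, high / genome_bases


if __name__ == "__main__":
    for model in SEQUENCERS:
        low, high = yield_range(model)
        print(f"{model:14s} {low:.2e} to {high:.2e} bases/run")
```

For the long-read platforms in particular, the upper bound assumes every read reaches the maximum length, so actual yields would be expected to sit well below it.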

 

○ Sanger dideoxy (capillary electrophoresis) : 700-800 bp / read. Very high accuracy

○ pyrosequencing : ~400 bp / read

○ Illumina : ~100 bp / read (more recently 250 bp)

 

Posted: 2022.02.21 12:51

Updated: 2024.10.24 22:06