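Bioinformatics Appendix
Recommended post : 【Bioinformatics】 Bioinformatics Analysis Table of Contents
1. gnomAD [Body]
2. ENCODE project [Body]
3. Data growth rate [Body]
4. Sequencing technology throughput [Body]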
1. gnomAD [Contents]
⑴ Overview
gnomAD data is available for download through Google Cloud Public Datasets, the Registry of Open Data on AWS, and Azure Open Datasets. The gnomAD team recommends using Hail and the gnomAD Hail utilities to work with the data. In addition to the files listed below, Terra has a demo workspace for working with gnomAD data.
⑵ Google Cloud Public Datasets
Files can be browsed and downloaded using gsutil.
$ gsutil ls gs://gcp-public-data--gnomad/release/
gnomAD variants are also available as a BigQuery dataset.
⑶ Registry of Open Data on AWS
Files can be browsed and downloaded using the AWS Command Line Interface.
$ aws s3 ls s3://gnomad-public-us-east-1/release/
⑷ Azure Open Datasets
Files can be browsed and downloaded using AzCopy or Azure Storage Explorer.
$ azcopy ls https://datasetgnomad.blob.core.windows.net/dataset/
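The three listing commands above can be wrapped in a small helper when the same script has to work against more than one cloud. The following is a minimal sketch, not part of any gnomAD tooling: the provider keys (`gcp`, `aws`, `azure`) and the function names `build_list_command` and `list_release` are hypothetical, and the bucket paths are simply the ones quoted above.

```python
import shutil
import subprocess

# Listing commands copied from the sections above. The provider keys
# ("gcp", "aws", "azure") are hypothetical labels chosen for this sketch.
LIST_COMMANDS = {
    "gcp": ["gsutil", "ls", "gs://gcp-public-data--gnomad/release/"],
    "aws": ["aws", "s3", "ls", "s3://gnomad-public-us-east-1/release/"],
    "azure": ["azcopy", "ls",
              "https://datasetgnomad.blob.core.windows.net/dataset/"],
}


def build_list_command(provider):
    """Return the argv list that lists the gnomAD release on one provider."""
    try:
        return list(LIST_COMMANDS[provider])
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None


def list_release(provider):
    """Run the listing command and return its stdout as a list of lines.

    Requires the matching CLI to be installed and network access.
    """
    argv = build_list_command(provider)
    if shutil.which(argv[0]) is None:
        raise FileNotFoundError(f"{argv[0]} is not installed or not on PATH")
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Only print the commands here; call list_release() to actually run one.
    for name in LIST_COMMANDS:
        print(name, "->", " ".join(build_list_command(name)))
```

Keeping the commands as argv lists rather than shell strings avoids quoting problems when they are passed to `subprocess.run`.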
2. ENCODE project [Contents]
⑴ History : 2001 draft of the human genome → in 2003, NHGRI launched the ENCODE project to identify all functional elements of the human genome
⑵ Phase I : 1% of the genome. Ended in 2007
⑶ Phase II : build-out phase. Ended in 2012
① GENCODE version 7 (December 2010)
○ 51,082 genes : 161,375 transcripts
○ 20,687 protein-coding genes : 76,052 transcripts
○ 9,640 lncRNAs : 15,512 transcripts
⑷ Phase III : production phase. Ended in 2016
⑸ Phase IV : started in 2016-2017
① GENCODE version 29 (May 2018)
○ 58,721 genes : 206,694 transcripts
○ 19,940 protein-coding genes : 83,129 transcripts
○ 16,066 lncRNAs : 29,566 transcripts
② GENCODE version 36 (May 2020)
○ 60,660 genes : 232,117 transcripts
○ 19,962 protein-coding genes : 85,269 transcripts
○ 17,958 lncRNAs : 48,734 transcripts
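One way to read the annotation counts above is as transcripts per gene, which shows how much of the growth comes from new isoforms rather than new genes. The sketch below only does arithmetic on the version 7, 29, and 36 figures listed above; the dictionary layout and function names are illustrative choices, not part of any annotation toolkit.

```python
# (genes, transcripts) per category, copied from the version 7, 29 and 36
# counts listed above. The labels are illustrative.
ANNOTATION_COUNTS = {
    "v7": {
        "all genes": (51_082, 161_375),
        "protein-coding": (20_687, 76_052),
        "lncRNA": (9_640, 15_512),
    },
    "v29": {
        "all genes": (58_721, 206_694),
        "protein-coding": (19_940, 83_129),
        "lncRNA": (16_066, 29_566),
    },
    "v36": {
        "all genes": (60_660, 232_117),
        "protein-coding": (19_962, 85_269),
        "lncRNA": (17_958, 48_734),
    },
}


def transcripts_per_gene(genes, transcripts):
    """Average number of annotated transcripts per gene."""
    return transcripts / genes


def summarize(counts):
    """Map version -> category -> transcripts per gene (2 decimals)."""
    return {
        version: {
            category: round(transcripts_per_gene(genes, transcripts), 2)
            for category, (genes, transcripts) in categories.items()
        }
        for version, categories in counts.items()
    }


if __name__ == "__main__":
    for version, ratios in summarize(ANNOTATION_COUNTS).items():
        print(version, ratios)
```

With these figures, the lncRNA ratio rises from roughly 1.6 transcripts per gene in version 7 to roughly 2.7 in version 36, while the protein-coding gene count stays near 20,000.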
3. Data growth rate [Contents]
data phase | astronomy | Twitter | YouTube | genomics |
acquisition | 25 zetta-bytes/yr | 0.5-15 billion tweets/yr | 500-900 million hrs/yr | 1 zetta-bases/yr |
storage | 1 EB/yr | 1-17 PB/yr | 1-2 EB/yr | 2-40 EB/yr |
analysis | in situ data reduction; real-time processing; massive volumes | topic and sentiment mining; metadata analysis | limited requirements | heterogeneous data and analysis; variant calling, ~2 trillion central processing unit (CPU) hours |
distribution | dedicated lines from antennae to server (600 TB/s) | small units of distribution | major component of modern user's bandwidth (10 MB/s) | many small (10 MB/s) and fewer massive (10 TB/s) data movements |
Table 1. Data growth rate (ref)
4. Sequencing technology throughput [Contents]
Platform | Sequencer model | Read length | Reads per run |
Illumina | iSeq 100 | 75-300 bp | 4 million |
Illumina | MiniSeq | 75-300 bp | 25 million |
Illumina | MiSeq | 75-300 bp | 25 million |
Illumina | NextSeq 550 | 75-150 bp | 400 million |
Illumina | NovaSeq 6000 | 75-300 bp | 10 billion |
PacBio | Sequel | 10-60 kb | 1 million |
PacBio | Sequel II | 10-100 kb | 7 million |
PacBio | Sequel IIe | 10-100 kb | 8 million |
Oxford Nanopore | MinION | 10 kb - 1 Mb | 1 million |
Oxford Nanopore | GridION | 10 kb - 1 Mb | 5 million |
Oxford Nanopore | PromethION 24 | 10 kb - 1 Mb | 15 million |
Oxford Nanopore | PromethION 48 | 10 kb - 1 Mb | 30 million |
Table 2. Sequencing technology throughput
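The read length and reads-per-run columns of the throughput table can be multiplied to get a rough range for the bases produced per run. The sketch below is a naive estimate under a loud assumption: every read in a run is taken to have either the minimum or the maximum listed length. Real yields depend on the read length distribution, run mode, and chemistry, so treat the numbers as order-of-magnitude bounds only. The names `THROUGHPUT`, `yield_bounds`, and `bounds_by_model` are illustrative.

```python
# (platform, model, min read length in bases, max read length in bases,
#  reads per run), copied from the throughput table above.
THROUGHPUT = [
    ("Illumina", "iSeq 100", 75, 300, 4_000_000),
    ("Illumina", "MiniSeq", 75, 300, 25_000_000),
    ("Illumina", "MiSeq", 75, 300, 25_000_000),
    ("Illumina", "NextSeq 550", 75, 150, 400_000_000),
    ("Illumina", "NovaSeq 6000", 75, 300, 10_000_000_000),
    ("PacBio", "Sequel", 10_000, 60_000, 1_000_000),
    ("PacBio", "Sequel II", 10_000, 100_000, 7_000_000),
    ("PacBio", "Sequel IIe", 10_000, 100_000, 8_000_000),
    ("Oxford Nanopore", "MinION", 10_000, 1_000_000, 1_000_000),
    ("Oxford Nanopore", "GridION", 10_000, 1_000_000, 5_000_000),
    ("Oxford Nanopore", "PromethION 24", 10_000, 1_000_000, 15_000_000),
    ("Oxford Nanopore", "PromethION 48", 10_000, 1_000_000, 30_000_000),
]


def yield_bounds(min_len, max_len, reads):
    """Naive (lower, upper) bases per run: all reads at min or max length."""
    return min_len * reads, max_len * reads


def bounds_by_model(rows):
    """Map sequencer model -> (lower, upper) bases per run."""
    return {
        model: yield_bounds(min_len, max_len, reads)
        for _platform, model, min_len, max_len, reads in rows
    }


if __name__ == "__main__":
    for model, (low, high) in bounds_by_model(THROUGHPUT).items():
        # Report in gigabases (1 Gb = 1e9 bases).
        print(f"{model}: {low / 1e9:g} - {high / 1e9:g} Gb per run")
```

The upper bounds for the nanopore rows are especially loose, because very few reads approach the 1 Mb maximum.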
○ Sanger dideoxy (capillary electrophoresis) : 700-800 bp reads. Very high accuracy
○ pyrosequencing : ~400 bp / read
○ Illumina : ~100 bp / read (more recently 250 bp)
Posted: 2022.02.21 12:51
Updated: 2024.10.24 22:06