
【Bioinformatics】 Understanding FASTQ, FASTA, GTF, GFF, BAM, SAM, and Loom Files

 


 

Recommended post: 【Bioinformatics】 Bioinformatics Analysis Table of Contents


1. FASTQ file [Body]

2. FASTA file [Body]

3. GTF file [Body]

4. GFF file [Body]

5. BAM file [Body]

6. SAM file [Body]

7. Loom file [Body]


 

1. FASTQ (fast-Q) [Contents]

 

@SEQ_ID

GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

+

!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

 

Stores the sequence information of a sample

Line 1: SEQ_ID, i.e., @ + sequence identifier + optional description

Example 1. @HWUSI-EAS100R:6:73:941:1973#0/1

 

HWUSI-EAS100R   the unique instrument name
6               flowcell lane
73              tile number within the flowcell lane
941             'x'-coordinate of the cluster within the tile
1973            'y'-coordinate of the cluster within the tile
#0              index number for a multiplexed sample (0 for no indexing)
/1              the member of a pair, /1 or /2 (paired-end or mate-pair reads only)

Table 1. SEQ_ID example (ref)

 

Example 2. @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

 

EAS139    the unique instrument name
136       the run id
FC706VJ   the flowcell id
2         flowcell lane
2104      tile number within the flowcell lane
15343     'x'-coordinate of the cluster within the tile
197393    'y'-coordinate of the cluster within the tile
1         the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y         Y if the read is filtered (did not pass), N otherwise
18        0 when none of the control bits are on, otherwise it is an even number
ATCACG    index sequence

Table 2. SEQ_ID example (ref)
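The two header layouts in Tables 1 and 2 can be split into named fields with regular expressions. The following is a minimal sketch, assuming headers follow exactly the two example layouts above; the names `OLD_STYLE`, `NEW_STYLE`, and `parse_seq_id` are made up for this illustration, and real headers may carry variations that these patterns do not cover.

```python
import re

# Layout of Example 1 (assumed): @instrument:lane:tile:x:y#index/pair
OLD_STYLE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)/(?P<pair>[12])$"
)

# Layout of Example 2 (assumed):
# @instrument:run:flowcell:lane:tile:x:y pair:filtered:control_bits:index
NEW_STYLE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<run_id>\d+):(?P<flowcell>[^:]+):(?P<lane>\d+):"
    r"(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+) "
    r"(?P<pair>[12]):(?P<filtered>[YN]):(?P<control_bits>\d+):(?P<index>\S+)$"
)


def parse_seq_id(header):
    """Return a dict of named fields, or None if neither layout matches."""
    for pattern in (NEW_STYLE, OLD_STYLE):
        match = pattern.match(header.strip())
        if match:
            return match.groupdict()
    return None


old = parse_seq_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
new = parse_seq_id("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(old["instrument"], old["lane"], old["pair"])
print(new["flowcell"], new["filtered"], new["index"])
```

All values come back as strings; convert the numeric fields with `int()` where arithmetic is needed.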

 

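Line 2: the raw sequence

Line 3: "+" followed by an optional repeat of the sequence identifier

Line 4: quality scores for the sequence on line 2

① Quality score Q = -10 log10 P (where P is the error probability); for example, Q = 30 corresponds to P = 0.001

② Encoded as ASCII characters, one per base, so the character count equals that of the raw sequence

 Type 1. Phred+33 encoding: the format used by most current data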

 

Table 3. Phred+33 encoding (table image not reproduced)

 

 Type 2. Phred+64 encoding: found in data from older Illumina pipelines

 

Table 4. Phred+64 encoding (table image not reproduced)
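The four-line record structure and the quality encoding can be checked with a few lines of Python. This is a minimal sketch, assuming an uncompressed FASTQ with exactly four lines per record and no blank lines; the function names `read_fastq`, `phred_scores`, and `error_probability` are chosen for this illustration only. It uses the standard Phred relation Q = -10 log10 P and the example record from the top of this section.

```python
import io


def read_fastq(handle):
    """Yield (seq_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:                      # end of file (or a blank line)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()                   # line 3: "+", optionally repeating the ID
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert an ASCII quality string to integer Q scores (offset 33 or 64)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Invert Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


fastq_text = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

for seq_id, sequence, quality in read_fastq(io.StringIO(fastq_text)):
    scores = phred_scores(quality)          # Phred+33 by default
    print(seq_id, len(sequence), scores[0], scores[-1])
    print(error_probability(scores[-1]))
```

For Phred+64 data, pass `offset=64` to `phred_scores`. The first character of the example, `!`, has ASCII code 33 and therefore decodes to Q = 0 under Phred+33.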

 

 

2. FASTA (fast-A) [Contents]

⑴ Stores the sequence information of a reference

⑵ Example: FASTA file for GFP

 

>L29345.1 Aequorea victoria green-fluorescent protein (GFP) mRNA, complete cds
TACACACGAATAAAAGATAACAAAGATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTT
GTTGAATTAGATGGCGATGTTAATGGGCAAAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACAT
ACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCATGGCCAACACTTGTCAC
TACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGCATGACTTTTTCAAG
AGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTACAAAGATGACGGGAACTACAAGACAC
GTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATCGAGTTAAAAGGTATTGATTTTAAAGA
AGATGGAAACATTCTTGGACACAAAATGGAATACAACTATAACTCACATAATGTATACATCATGGCAGAC
AAACCAAAGAATGGAATCAAAGTTAACTTCAAAATTAGACACAACATTAAAGATGGAAGCGTTCAATTAG
CAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTC
CACACAATCTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGATCCTTCTTGAGTTTGTAACA
GCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAAATGTCCAGACTTCCAATTGACACTAAAG
TGTCCGAACAATTACTAAATTCTCAGGGTTCCTGGTTAAATTCAGGCTGAGACTTTATTTATATATTTAT
AGATTCATTAAAATTTTATGAATAATTTATTGATGTTATTAATAGGGGCTATTTTCTTATTAAATAGGCT
ACTGGAGTGTAT
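A FASTA record is a ">" header line followed by sequence lines that are wrapped at a fixed width, so a reader has to join the wrapped lines back together. Here is a minimal sketch, assuming plain uncompressed text; `read_fasta` is a name chosen for this illustration, and only the first two sequence lines of the GFP record are reproduced to keep the example short.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(
    ">L29345.1 Aequorea victoria green-fluorescent protein (GFP) mRNA, complete cds\n"
    "TACACACGAATAAAAGATAACAAAGATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTT\n"
    "GTTGAATTAGATGGCGATGTTAATGGGCAAAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACAT\n"
)

records = list(read_fasta(example))
for header, sequence in records:
    accession = header.split()[0]       # text before the first space
    print(accession, len(sequence))
```

The accession (`L29345.1` here) is conventionally the text up to the first whitespace in the header; the rest is a free-text description.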

 

 

3. GTF (gene transfer format) [Contents]

⑴ Stores the annotation information of a reference

⑵ Example: GTF file contents for the MUC1 gene and one of its transcripts

 

NC_000001.11	BestRefSeq	gene	155185824	155192915	.	-	.	gene_id "MUC1"; transcript_id ""; db_xref "GeneID:4582"; db_xref "HGNC:HGNC:7508"; db_xref "MIM:158340"; description "mucin 1, cell surface associated"; gbkey "Gene"; gene "MUC1"; gene_biotype "protein_coding"; gene_synonym "ADMCKD"; gene_synonym "ADMCKD1"; gene_synonym "ADTKD2"; gene_synonym "CA 15-3"; gene_synonym "Ca15-3"; gene_synonym "CD227"; gene_synonym "EMA"; gene_synonym "H23AG"; gene_synonym "KL-6"; gene_synonym "MAM6"; gene_synonym "MCD"; gene_synonym "MCKD"; gene_synonym "MCKD1"; gene_synonym "MUC-1"; gene_synonym "MUC-1/SEC"; gene_synonym "MUC-1/X"; gene_synonym "MUC1/ZD"; gene_synonym "PEM"; gene_synonym "PEMT"; gene_synonym "PUM"; 
NC_000001.11	BestRefSeq	transcript	155185824	155192915	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gbkey "mRNA"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; 
NC_000001.11	BestRefSeq	exon	155192786	155192915	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "1"; 
NC_000001.11	BestRefSeq	exon	155192183	155192310	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "2"; 
NC_000001.11	BestRefSeq	exon	155188008	155188063	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "3"; 
NC_000001.11	BestRefSeq	exon	155187722	155187858	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "4"; 
NC_000001.11	BestRefSeq	exon	155187455	155187576	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "5"; 
NC_000001.11	BestRefSeq	exon	155187225	155187374	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "6"; 
NC_000001.11	BestRefSeq	exon	155185824	155186209	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "7"; 
NC_000001.11	BestRefSeq	CDS	155192786	155192843	.	-	0	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "1"; 
NC_000001.11	BestRefSeq	CDS	155192183	155192310	.	-	2	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "2"; 
NC_000001.11	BestRefSeq	CDS	155188008	155188063	.	-	0	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "3"; 
NC_000001.11	BestRefSeq	CDS	155187722	155187858	.	-	1	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "4"; 
NC_000001.11	BestRefSeq	CDS	155187455	155187576	.	-	2	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "5"; 
NC_000001.11	BestRefSeq	CDS	155187225	155187374	.	-	0	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "6"; 
NC_000001.11	BestRefSeq	CDS	155186138	155186209	.	-	0	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "7"; 
NC_000001.11	BestRefSeq	start_codon	155192841	155192843	.	-	0	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "1"; 
NC_000001.11	BestRefSeq	stop_codon	155186135	155186137	.	-	0	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "7";

 

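① NC_000001.11: the accession number of the reference. "NC_000001" denotes chromosome 1, and ".11" denotes version 11

② BestRefSeq, RefSeq, Gnomon, HAVANA, etc.: the source of the annotation (type of reference)

③ Each GTF line describes one feature, such as gene, transcript, exon, CDS, domain, group, start_codon, or stop_codon

④ In the first line, 155185824 and 155192915 mean that the MUC1 gene spans bases 155185824 through 155192915 of the reference FASTA sequence

⑤ +, -: (+) means the gene is on the forward (= plus, sense) strand; (-) means the gene is on the reverse (= minus, antisense) strand

⑥ 0, 1, 2: on CDS lines, 0, 1, and 2 mean that the 1st, 2nd, or 3rd base of the feature, respectively, is the first base of a codon in the reading frame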

⑦ A gene has one or more transcripts: gene and transcript lines are linked through gene_id, gene, and similar attributes

⑧ Each transcript has several exon lines: transcript and exon lines are linked through transcript_id

⑨ A CDS (protein-coding sequence) is generally a subset of an exon: in many cases the CDS coincides exactly with the exon

⑩ Some genes have no start_codon or stop_codon (e.g., LOC102724389)
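The column layout and the frame values described above can be verified programmatically. The following is a minimal sketch, assuming tab-separated GTF lines with nine columns and attribute values that contain no semicolons; `parse_gtf_line` and `next_phase` are names invented for this illustration, and the sample line is the first CDS line of the MUC1 example with most attributes dropped for brevity.

```python
def parse_gtf_line(line):
    """Split one GTF line into its nine columns and parse the attribute column."""
    (seqname, source, feature, start, end,
     score, strand, frame, attr_text) = line.rstrip("\n").split("\t")
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        # Keys such as db_xref can repeat, so values are collected in lists.
        attributes.setdefault(key, []).append(value.strip('"'))
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end), "score": score,
        "strand": strand, "frame": frame, "attributes": attributes,
    }


def next_phase(start, end, phase):
    """Frame value expected for the CDS segment that follows in the transcript."""
    length = end - start + 1            # GTF coordinates are 1-based and inclusive
    return (3 - (length - phase) % 3) % 3


line = ('NC_000001.11\tBestRefSeq\tCDS\t155192786\t155192843\t.\t-\t0\t'
        'gene_id "MUC1"; transcript_id "NM_001204291.1"; exon_number "1"; ')
record = parse_gtf_line(line)
print(record["feature"], record["strand"], record["attributes"]["transcript_id"])
print(next_phase(record["start"], record["end"], int(record["frame"])))
```

The first CDS segment is 58 bases long with frame 0, which leaves one base of an incomplete codon, so the next segment needs two more bases to complete it. That matches the frame value 2 on the second CDS line of the example.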

 

 

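4. GFF (general feature format) [Contents]

⑴ Overview

① Stores the annotation information of a reference. The format differs slightly from GTF

② GFF does not carry GTF's gene_id and transcript_id attributes (GFF3 expresses hierarchy through ID and Parent attributes instead), so the gene_id-transcript_id relationship is not preserved as-is

GTF → GFF conversion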

 

import sys

with open(sys.argv[1], 'r') as inFile:
  for line in inFile:
    # skip blank lines and comment lines that start with the '#' character
    if not line.strip() or line.startswith('#'):
      continue

    # split line into columns by tab
    data = line.rstrip('\r\n').split('\t')

    # parse the transcript/gene ID with a series of splits instead of a regex;
    # [1:-1] strips the surrounding double quotes
    transcriptID = data[-1].split('transcript_id')[-1].split(';')[0].strip()[1:-1]
    geneID = data[-1].split('gene_id')[-1].split(';')[0].strip()[1:-1]

    # replace the last column with a GFF-formatted attributes column;
    # the GID attribute keeps the gene ID so that no GTF information is lost
    data[-1] = "ID=" + transcriptID + ";GID=" + geneID

    # print out this new GFF line (print is a function in Python 3)
    print('\t'.join(data))

 

GFF → GTF conversion

 

import sys

with open(sys.argv[1], 'r') as inFile:
  for line in inFile:
    # skip blank lines and comment lines that start with the '#' character
    if not line.strip() or line.startswith('#'):
      continue

    # split line into columns by tab
    data = line.rstrip('\r\n').split('\t')

    # if the feature is a gene, use its own ID
    if data[2] == "gene":
      ID = data[-1].split('ID=')[-1].split(';')[0]

    # if the feature is anything else, use the parent as the ID
    else:
      ID = data[-1].split('Parent=')[-1].split(';')[0]

    # modify the last column; the closing quote and semicolon after
    # transcript_id were missing in the original version
    data[-1] = 'gene_id "' + ID + '"; transcript_id "' + ID + '";'

    # print out this new GTF line (print is a function in Python 3)
    print('\t'.join(data))

 

 

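5. BAM (binary alignment map) [Contents]

⑴ A file that stores the result of mapping a FASTQ file to a reference (e.g., a genome FASTA, optionally with a GTF annotation); it is the compressed binary counterpart of SAM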

 

 

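6. SAM (Sequence Alignment/Map format) [Contents]

⑴ The text form of an alignment file, for example what SAMtools view outputs from a BAM file, optionally after sorting or selecting reads

⑵ Interpretation: the file is a long run of lines like the following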

 

QNAME (Query template NAME)	FLAG	RNAME (Reference sequence NAME)	POS (1-based leftmost mapping Position)	MAPQ (Mapping Quality)	CIGAR (Concise Idiosyncratic Gapped Alignment Report) string	RNEXT	PNEXT	TLEN (observed Template LENgth)	SEQ (segment SEQuence)	QUAL (quality)	The number of reported alignments that contain the read.	The hit index.	Alignment score.	Number of mismatches.	An additional tag, possibly specific to the aligner or analysis pipeline.	Read group identifier.	Tags related to the gene or transcript the read is aligned to 'tdtomato'.	An additional flag used by specific software.	....20	....21	....22	....23	....24	....25	....26	....27	....28	....29	...30
A01192:688:HM3FVDMXY:1:2117:20374:35023	1024	tdtomato	1438	255	30S40M20S	*	0	0	AAGCAGTGGTATCAACGCAGAGTACATGGGATCACCTGTTCCTGTACGGCATGGATGAGCTGTACAAGTGAGCTGCCTTCTGCGGGGCTT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NH:i:1	HI:i:1	AS:i:39	nM:i:0	ts:i:30	RG:Z:CR_23_15103_TS_R_SSV_1:0:1:HM3FVDMXY:1	TX:Z:tdtomato,+1437,30S40M20S	GX:Z:tdtomato	GN:Z:tdtomato	fx:Z:tdtomato	RE:A:E	xf:i:17	CR:Z:CGGGCAGCTAAACCGC	CY:Z:FFFFFFFFFFFFFFFF	CB:Z:CGGGCAGCTAAACCGC-1	UR:Z:AATCAGTTATGC	UY:Z:FFFFFF,:FFFF	UB:Z:AATCAGTTATGC	NA

 

① QNAME (Query template NAME)

② FLAG

③ RNAME (Reference sequence NAME): e.g., chr1, chr2, chrM

④ POS (1-based leftmost mapping Position)

⑤ MAPQ (Mapping Quality)

⑥ CIGAR (Concise Idiosyncratic Gapped Alignment Report) string

 

Op BAM Description
M 0 alignment match (can be a sequence match or mismatch)
I 1 insertion to the reference
D 2 deletion from the reference
N 3 skipped region from the reference
S 4 soft clipping (clipped sequences present in SEQ)
H 5 hard clipping (clipped sequences NOT present in SEQ)
P 6 padding (silent deletion from padded reference)
= 7 sequence match
X 8 sequence mismatch

 

⑦ RNEXT

⑧ PNEXT

⑨ TLEN (observed Template LENgth)

⑩ SEQ (segment SEQuence)

⑪ QUAL (quality)

⑫ NH:i : The number of reported alignments that contain the read.

⑬ HI:i : The hit index.

⑭ AS:i : Alignment score.

⑮ nM:i : Number of mismatches.

⑯ ts:i : An additional tag, possibly specific to the aligner or analysis pipeline.

⑰ RG:Z : Read group identifier.

⑱ TX:Z, GX:Z, GN:Z, fx:Z : Tags describing the gene or transcript to which the read aligns.

○ GN:Z : gene name tag

⑲ xf:i : An additional flag used by specific software.

⑳ CR:Z, CY:Z, UR:Z, UY:Z, UB:Z : Fields related to cell barcodes and unique molecular identifiers (UMIs), which are important in single-cell sequencing technologies.
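The field list above can be applied to the example line with a small parser. This is a minimal sketch, assuming tab-separated SAM text whose optional columns follow the TAG:TYPE:VALUE pattern; the helper names (`parse_sam_line`, `parse_cigar`, `reference_span`, `query_length`) are invented for this illustration, and the sample line keeps only a subset of the optional tags from the example. The trailing `NA` column is skipped because it is not a TAG:TYPE:VALUE field.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Turn '30S40M20S' into [(30, 'S'), (40, 'M'), (20, 'S')]."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar):
    """Read bases present in SEQ: M, I, S, = and X consume the query."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


def parse_sam_line(line):
    """Split the 11 mandatory fields and collect optional TAG:TYPE:VALUE fields."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for name in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[name] = int(record[name])
    tags = {}
    for col in cols[11:]:
        parts = col.split(":", 2)       # the value itself may contain ':'
        if len(parts) != 3:
            continue                    # e.g. a trailing 'NA' column
        tag, type_code, value = parts
        tags[tag] = int(value) if type_code == "i" else value
    record["tags"] = tags
    return record


sam_line = "\t".join([
    "A01192:688:HM3FVDMXY:1:2117:20374:35023", "1024", "tdtomato", "1438", "255",
    "30S40M20S", "*", "0", "0",
    "AAGCAGTGGTATCAACGCAGAGTACATGGGATCACCTGTTCCTGTACGGCATGGATGAGCTG"
    "TACAAGTGAGCTGCCTTCTGCGGGGCTT",
    "F" * 90,
    "NH:i:1", "HI:i:1", "AS:i:39", "nM:i:0",
    "RG:Z:CR_23_15103_TS_R_SSV_1:0:1:HM3FVDMXY:1",
    "GN:Z:tdtomato", "CB:Z:CGGGCAGCTAAACCGC-1", "UB:Z:AATCAGTTATGC", "NA",
])

rec = parse_sam_line(sam_line)
print(rec["RNAME"], rec["POS"], rec["tags"]["GN"], rec["tags"]["CB"])
print(parse_cigar(rec["CIGAR"]))
print(reference_span(rec["CIGAR"]), query_length(rec["CIGAR"]), len(rec["SEQ"]))
print(bool(rec["FLAG"] & 0x400))        # 0x400 = PCR or optical duplicate
```

For the CIGAR string `30S40M20S`, only the 40 matched bases consume the reference, while all 90 bases (30 + 40 + 20) are present in SEQ because soft-clipped bases are kept. FLAG is a bit field, and the value 1024 (0x400) in the example marks the read as a PCR or optical duplicate.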

 

 

7. Loom (.loom) [Contents]

⑴ gene expression data: the main matrix, equivalent to the contents of an .h5 file

⑵ (optional) layers for spliced and unspliced RNA transcripts: present when an RNA velocity-aware tool was used

⑶ (optional) cell metadata, stored as column attributes

⑷ (optional) gene metadata, stored as row attributes

 

Posted: 2023.08.03 17:05

Updated: 2024.02.02 21:56