유기화학 관련 파이썬 함수 모음 (화학정보학, chemoinformatics; structural bioinformatics)
추천글 : 【유기화학】 유기화학 목차, 【Python】 파이썬 유용 함수 모음
1. 명명법 [본문]
2. 드로잉 [본문]
3. 아미노산 [본문]
4. 분광학 [본문]
5. 반응 메커니즘 [본문]
a. 생물정보학
⑴ SMILES(simplified molecular-input line entry system) : 짧은 ASCII 문자열 표현
① 이중결합은 '='로, 삼중결합은 '#'로 나타냄
② 고리형 화합물의 경우, 1, 2, ···과 같은 숫자를 부여하여, 선형 분자의 끝과 끝이 연결돼 고리를 이루는 것을 나타냄
○ 예 : CN1C=NC2=C1C(=O)N(C(=O)N2C)C
③ C는 일반 탄소를 나타내는 반면, c는 방향족 탄소를 나타냄
○ C1CCCCC1 : cyclohexane
○ c1ccccc1 : benzene
④ 대괄호를 써서 보다 복잡한 경우를 나타낼 수 있음
○ [N+]와 같이 전하 표시도 할 수 있음
⑤ @을 써서 입체 배열을 나타낼 수 있음
○ 예 : C[C@H](O)[C@H](O)C
⑥ /, \를 써서 방향이 같은지 다른지를 나타내어 EZ 이성질체를 나타낼 수 있음
○ 예 : CCC/C(=C/C(=O)OCC)/C(=O)OCC
⑵ 유기화합물을 SMILES로 바꾸는 코드
rdkit.Chem.MolToSmiles(Chem.MolFromFASTA(sequence, flavor = 1))
"""
flavor: (optional)
0 Protein, L amino acids (default)
1 Protein, D amino acids
2 RNA, no cap
3 RNA, 5’ cap
4 RNA, 3’ cap
5 RNA, both caps
6 DNA, no cap
7 DNA, 5’ cap
8 DNA, 3’ cap
9 DNA, both caps
"""
⑶ SMILES, IUPAC 상호 변환 코드 : transformer 활용
# reference: https://github.com/Kohulan/Smiles-TO-iUpac-Translator
! pip install STOUT-pypi
! pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git
from STOUT import translate_forward, translate_reverse
# SMILES to IUPAC name translation
SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of "+SMILES+" is: "+IUPAC_name)
# IUPAC name to SMILES translation
IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
SMILES = translate_reverse(IUPAC_name)
print("SMILES of "+IUPAC_name+" is: "+SMILES)
① IUPAC 입력이 완전히 정확하지 않아도 SMILES 코드 변환이 작동할 수 있음
⑷ 임의의 화학식이 주어져 있을 때 IUPAC 명명법을 알아내는 방법
① 단계 1. SMILES-to-drawing 함수를 여러 번 시도하여 주어진 화합물을 나타내는 SMILES 코드 찾아내기
② 단계 2. SMILES-to-IUPAC 함수를 실행시켜 최종 명명법 구하기
⑸ SMILES로부터 분자량을 구하는 코드
from rdkit import Chem
from rdkit.Chem import Descriptors
def calculate_molecular_weight(smiles):
molecule = Chem.MolFromSmiles(smiles)
return Descriptors.ExactMolWt(molecule)
# Example usage
smiles_code = "C1=CC=C(C=C1)O" # SMILES code for phenol
molecular_weight = calculate_molecular_weight(smiles_code)
print(f"Molecular Weight: {molecular_weight}")
⑹ SMILES로부터 방향족성을 판단하는 코드
from rdkit import Chem
def is_aromatic(smiles):
molecule = Chem.MolFromSmiles(smiles)
if molecule is None:
return False
return any(atom.GetIsAromatic() for atom in molecule.GetAtoms())
# Example usage
cyclohexane = "C1CCCCC1" # cyclohexane
benzene = "c1ccccc1" # benzene
imidazole = "C1=CN=CN1" # 1H-imidazole
print(f"The {cyclohexane} is {'aromatic' if is_aromatic(cyclohexane) else 'not aromatic'}")
print(f"The {benzene} is {'aromatic' if is_aromatic(benzene) else 'not aromatic'}")
print(f"The {imidazole} is {'aromatic' if is_aromatic(imidazole) else 'not aromatic'}")
⑺ SMILES로부터 쌍극자 모멘트를 계산하는 코드
# conda install -c psi4 psi4
import psi4
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
def calculate_dipole_moment(smiles):
# Convert SMILES to molecule
mol = Chem.MolFromSmiles(smiles)
# Add Hydrogens
mol = Chem.AddHs(mol)
# Generate 3D coordinates
AllChem.EmbedMolecule(mol, AllChem.ETKDG())
# Extract coordinates
conf = mol.GetConformer()
xyz = ''
for atom in mol.GetAtoms():
pos = conf.GetAtomPosition(atom.GetIdx())
xyz += f"{atom.GetSymbol()} {pos.x} {pos.y} {pos.z}\n"
# Set up Psi4
psi4.set_memory('500 MB')
psi4.set_options({'basis': 'sto-3g'})
# Calculate dipole moment using Psi4
psi4_mol = psi4.geometry(xyz)
psi4.energy('scf')
dipole_moment = psi4.variable('SCF DIPOLE')
return dipole_moment
### Example usage
smiles_code = "CCO" # Example SMILES code for ethanol
dipole_moment = calculate_dipole_moment(smiles_code)
print(f"Dipole Moment: {dipole_moment} Debye")
# Dipole Moment: [ 0.04250251 0.20600936 -0.52850913] Debye
dipole_vector = np.array([0.04250251, 0.20600936, -0.52850913])
magnitude = np.linalg.norm(dipole_vector)
print(f"Magnitude of Dipole Moment: {magnitude} Debye")
# Magnitude of Dipole Moment: 0.5688305725409514 Debye
⑻ SMILES로부터 끓는점(bp), 녹는점(mp), 임계온도를 구하는 함수
# reference: https://thermo.readthedocs.io/thermo.chemical.html
from thermo.chemical import Chemical
N2 = Chemical('Nitrogen')
print(N2.Tm, N2.Tb, N2.Tc) # melting, boiling, and critical points [K]
## 63.15 77.3549950205 126.192
molecule = Chemical('CC(C)C')
print(molecule.Tm, molecule.Tb, molecule.Tc) # melting, boiling, and critical points [K]
## 124.2 261.401014643 407.81
molecule_ = Chemical('2-methylpropane')
print(molecule_.Tm, molecule_.Tb, molecule_.Tc) # melting, boiling, and critical points [K]
## 124.2 261.401014643 407.81
① search-based로 작동하며 모든 화합물이 대상이 되지 않음 : 이를 개선하기 위해 다양한 머신러닝 모델이 도입되고 있음
② 응용 1. 끓는점, 녹는점 예제 만들기
import random
from rdkit import Chem
from rdkit.Chem import Draw
from thermo.chemical import Chemical
from STOUT import translate_forward
from IPython.display import display
from PIL import Image, ImageDraw, ImageFont
def generate_alkanes(n_carbons):
if n_carbons == 1:
return [Chem.MolFromSmiles('C')]
elif n_carbons == 2:
return [Chem.MolFromSmiles('CC')]
smaller_alkanes = generate_alkanes(n_carbons - 1)
new_alkanes = set()
for mol in smaller_alkanes:
for atom in mol.GetAtoms():
if atom.GetDegree() < 4: # 탄소는 최대 4개의 결합을 가질 수 있음
new_mol = Chem.RWMol(mol)
new_idx = new_mol.AddAtom(Chem.Atom(6))
new_mol.AddBond(atom.GetIdx(), new_idx, Chem.BondType.SINGLE)
Chem.SanitizeMol(new_mol)
smiles = Chem.MolToSmiles(new_mol, canonical=True)
new_alkanes.add(smiles)
return [Chem.MolFromSmiles(smiles) for smiles in new_alkanes]
def draw_and_display_properties(isomer1, isomer2):
# 이성질체의 SMILES로부터 thermo의 Chemical 객체 생성
smiles1 = Chem.MolToSmiles(isomer1)
smiles2 = Chem.MolToSmiles(isomer2)
chemical1 = Chemical(smiles1)
chemical2 = Chemical(smiles2)
# IUPAC 이름 변환
iupac_name1 = translate_forward(smiles1)
iupac_name2 = translate_forward(smiles2)
# 끓는점, 녹는점 계산 (°C로 변환)
melting_point1 = chemical1.Tm - 273.15 # K to °C
boiling_point1 = chemical1.Tb - 273.15 # K to °C
melting_point2 = chemical2.Tm - 273.15 # K to °C
boiling_point2 = chemical2.Tb - 273.15 # K to °C
# 구조 이성질체 그리기
img1 = Draw.MolToImage(isomer1, size=(300, 300))
img2 = Draw.MolToImage(isomer2, size=(300, 300))
# PIL로 이미지 결합 및 텍스트 추가
combined_img = Image.new('RGB', (600, 400), (255, 255, 255))
combined_img.paste(img1, (0, 0))
combined_img.paste(img2, (300, 0))
draw = ImageDraw.Draw(combined_img)
font = ImageFont.load_default(size=16)
# 첫 번째 이성질체 텍스트 추가
draw.text((10, 310), f"IUPAC: {iupac_name1}", fill=(0, 0, 0), font=font)
draw.text((10, 330), f"Melting Point: {melting_point1:.2f} °C", fill=(0, 0, 0), font=font)
draw.text((10, 350), f"Boiling Point: {boiling_point1:.2f} °C", fill=(0, 0, 0), font=font)
# 두 번째 이성질체 텍스트 추가
draw.text((310, 310), f"IUPAC: {iupac_name2}", fill=(0, 0, 0), font=font)
draw.text((310, 330), f"Melting Point: {melting_point2:.2f} °C", fill=(0, 0, 0), font=font)
draw.text((310, 350), f"Boiling Point: {boiling_point2:.2f} °C", fill=(0, 0, 0), font=font)
# 비교 결과 추가
#melting_comparison = f"{iupac_name1} has higher melting point" if melting_point1 > melting_point2 else f"{iupac_name2} has higher melting point"
#boiling_comparison = f"{iupac_name1} has higher boiling point" if boiling_point1 > boiling_point2 else f"{iupac_name2} has higher boiling point"
#draw.text((10, 370), melting_comparison, fill=(0, 0, 255), font=font)
#draw.text((310, 370), boiling_comparison, fill=(0, 0, 255), font=font)
# 이미지 출력
display(combined_img)
# n값 설정 (예: n=7, 탄소 7개)
n = 7
alkanes = generate_alkanes(n)
isomer1, isomer2 = random.sample(alkanes, 2)
draw_and_display_properties(isomer1, isomer2)
⑼ PubChem으로부터 pKa 값들을 크롤링하는하는 함수
import requests
import re
import statistics # 중앙값 계산을 위한 모듈
def get_median_pka(cid):
"""
PubChem에서 pKa 데이터를 긁어와서 중앙값(Median)을 반환합니다.
"""
# 1. PubChem API 요청 (Experimental Properties 전체)
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading=Experimental+Properties"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
try:
response = requests.get(url, headers=headers, timeout=10)
if response.status_code != 200:
print(f"API 요청 실패 (CID: {cid}): {response.status_code}")
return None
data = response.json()
# 수집된 pKa 값들을 저장할 리스트
collected_pkas = []
# 2. 재귀 탐색 함수
def find_pka_recursive(node):
if isinstance(node, dict):
# 섹션 제목이 pKa 관련이면 내부 탐색
if 'TOCHeading' in node and node['TOCHeading'] in ['Dissociation Constants', 'pKa']:
extract_info(node)
# 하위 노드 계속 탐색
for key, value in node.items():
find_pka_recursive(value)
elif isinstance(node, list):
for item in node:
find_pka_recursive(item)
# 3. 텍스트에서 숫자 추출 및 필터링 함수
def extract_info(section_node):
if 'Information' in section_node:
for info in section_node['Information']:
if 'Value' in info:
raw_string = ""
if 'StringWithMarkup' in info['Value']:
for markup in info['Value']['StringWithMarkup']:
raw_string += markup.get('String', '') + " "
elif 'Number' in info['Value']:
raw_string = str(info['Value']['Number'])
if raw_string:
# 정규식: "pKa = 3.5 at 25 C" -> 3.5 추출
match = re.search(r"(-?\d+\.?\d*)", raw_string)
if match:
try:
val = float(match.group(1))
# [중요] 필터링 로직 (Sanity Check)
# pKa는 보통 -10 ~ 20 사이입니다.
# 이 범위를 벗어나면 '온도(25)'나 '분자량'일 확률이 높으므로 제외합니다.
if -10 <= val <= 20:
collected_pkas.append(val)
except ValueError:
pass
# 탐색 시작
find_pka_recursive(data)
# 4. 결과 계산 (중앙값)
if not collected_pkas:
return None # 데이터 없음
median_val = statistics.median(collected_pkas)
return median_val
except Exception as e:
print(f"에러 발생: {e}")
return None
# 테스트할 CID 리스트 (아스피린, 이부프로펜, 아세트아미노펜)
test_cids = [2244, 3672, 1983]
print(f"{'CID':<10} | {'Median pKa':<10} | {'Status'}")
print("-" * 35)
for cid in test_cids:
result = get_median_pka(cid)
if result is not None:
print(f"{cid:<10} | {result:<10.2f} | Success")
else:
print(f"{cid:<10} | {'N/A':<10} | Fail")
⑽ SMILES로부터 생리 활성을 예측하는 모델
""" Anaconda env
conda create --name TDC python=3.10
conda activate TDC
pip install PyTDC
pip install numpy pandas tqdm scikit-learn fuzzywuzzy seaborn
pip install jupyter jupyterlab
pip install ipykernel
python -m ipykernel install --user --name TDC --display-name TDC
"""
from tdc.multi_pred import GDA
data = GDA(name = 'DisGeNET')
data.get_data().head(2)
"""
from tdc.multi_pred import GDA
data = GDA(name = 'DisGeNET') #gene-disease association (GDA)
from tdc.multi_pred import DTI
data = DTI(name = 'BindingDB_Kd', print_stats = True) #drug-target interaction (DTI)
data = DTI(name = 'BindingDB_IC50', print_stats = True)
data = DTI(name = 'BindingDB_EC50', print_stats = True)
data = DTI(name = 'BindingDB_Ki', print_stats = True)
from tdc.generation import MolGen
data = MolGen(name = 'MOSES', print_stats = True) #molecule generation
from tdc.multi_pred import DrugRes
data = DrugRes(name = 'GDSC2') #drug response
from tdc.multi_pred import DrugSyn
data = DrugSyn(name = 'OncoPolyPharmacology') #drug synergistic effect
from tdc import utils
utils.retrieve_dataset_names('ADME')
from tdc.single_pred import ADME
data = ADME(name = 'Pgp_Broccatelli')
from tdc.multi_pred import DDI
data = DDI(name = 'DrugBank') #drug-drug interaction (DDI)
from tdc.generation import RetroSyn
data = RetroSyn(name = 'USPTO-50K') #retrosynthesis
from tdc.multi_pred import Catalyst
data = Catalyst(name = 'USPTO_Catalyst')
from tdc.single_pred import Yields
data = Yields(name = 'Buchwald-Hartwig')
from tdc.multi_pred import MTI
data = MTI(name = 'miRTarBase') #miRNA-target interaction (MTI)
from tdc.single_pred import Paratope
data = Paratope(name = 'SAbDab_Liberis') #antibody binding region
from tdc.single_pred import Epitope
data = Epitope(name = 'IEDB_Jespersen')
from tdc.multi_pred import AntibodyAff
data = AntibodyAff(name = 'Protein_SAbDab') #antigen-antibody binding affinity
from tdc.single_pred import Develop
data = Develop(name = 'TAP', label_name = 'CDR Length') #antibody developability
from tdc.multi_pred import PeptideMHC
data = PeptideMHC(name = 'MHC1_IEDB-IMGT_Nielsen') #MHC1-peptide binding affinity
from tdc.multi_pred import PeptideMHC
data = PeptideMHC(name = 'MHC2_IEDB_Jensen') #MHC2-peptide binding affinity
"""
② 기초 모델
③ ChemBERTa : RoBERTa 기반. 2020 NeurIPS 제출. D-MPNN보다는 퍼포먼스가 떨어짐
④ Smile-to-Bert : SMILES로부터 ADMET(흡수, 분포, 대사, 배출, 독성)를 예측. 퍼포먼스는 나쁨
⑤ ADMET-AI : Chemprop-RDKit 사용. ADMET(흡수, 분포, 대사, 배출, 독성)를 예측. 속도, 정확도, 사용 편의성, 라이선스, DrugBank 연동 등에서 상당한 이점
○ molecular_weight, logP, hydrogen_bond_acceptors, hydrogen_bond_donors, Lipinski, QED, stereo_centers, tpsa, AMES, BBB_Martins, Bioavailability_Ma, CYP1A2_Veith, CYP2C19_Veith, CYP2C9_Substrate_CarbonMangels, CYP2C9_Veith, CYP2D6_Substrate_CarbonMangels, CYP2D6_Veith, CYP3A4_Substrate_CarbonMangels, CYP3A4_Veith, Carcinogens_Lagunin, ClinTox, DILI, HIA_Hou, NR-AR-LBD, NR-AR, NR-AhR, NR-Aromatase, NR-ER-LBD, NR-ER, NR-PPAR-gamma, PAMPA_NCATS, Pgp_Broccatelli, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, SR-p53, Skin_Reaction, hERG, Caco2_Wang, Clearance_Hepatocyte_AZ, Clearance_Microsome_AZ, Half_Life_Obach, HydrationFreeEnergy_FreeSolv, LD50_Zhu, Lipophilicity_AstraZeneca, PPBR_AZ, Solubility_AqSolDB, VDss_Lombardo, molecular_weight_drugbank_approved_percentile, logP_drugbank_approved_percentile, hydrogen_bond_acceptors_drugbank_approved_percentile, hydrogen_bond_donors_drugbank_approved_percentile, Lipinski_drugbank_approved_percentile, QED_drugbank_approved_percentile, stereo_centers_drugbank_approved_percentile, tpsa_drugbank_approved_percentile, AMES_drugbank_approved_percentile, BBB_Martins_drugbank_approved_percentile, Bioavailability_Ma_drugbank_approved_percentile, CYP1A2_Veith_drugbank_approved_percentile, CYP2C19_Veith_drugbank_approved_percentile, CYP2C9_Substrate_CarbonMangels_drugbank_approved_percentile, CYP2C9_Veith_drugbank_approved_percentile, CYP2D6_Substrate_CarbonMangels_drugbank_approved_percentile, CYP2D6_Veith_drugbank_approved_percentile, CYP3A4_Substrate_CarbonMangels_drugbank_approved_percentile, CYP3A4_Veith_drugbank_approved_percentile, Carcinogens_Lagunin_drugbank_approved_percentile, ClinTox_drugbank_approved_percentile, DILI_drugbank_approved_percentile, HIA_Hou_drugbank_approved_percentile, NR-AR-LBD_drugbank_approved_percentile, NR-AR_drugbank_approved_percentile, NR-AhR_drugbank_approved_percentile, NR-Aromatase_drugbank_approved_percentile, NR-ER-LBD_drugbank_approved_percentile, NR-ER_drugbank_approved_percentile, NR-PPAR-gamma_drugbank_approved_percentile, PAMPA_NCATS_drugbank_approved_percentile, Pgp_Broccatelli_drugbank_approved_percentile, SR-ARE_drugbank_approved_percentile, SR-ATAD5_drugbank_approved_percentile, SR-HSE_drugbank_approved_percentile, SR-MMP_drugbank_approved_percentile, SR-p53_drugbank_approved_percentile, Skin_Reaction_drugbank_approved_percentile, hERG_drugbank_approved_percentile, Caco2_Wang_drugbank_approved_percentile, Clearance_Hepatocyte_AZ_drugbank_approved_percentile, Clearance_Microsome_AZ_drugbank_approved_percentile, Half_Life_Obach_drugbank_approved_percentile, HydrationFreeEnergy_FreeSolv_drugbank_approved_percentile, LD50_Zhu_drugbank_approved_percentile, Lipophilicity_AstraZeneca_drugbank_approved_percentile, PPBR_AZ_drugbank_approved_percentile, Solubility_AqSolDB_drugbank_approved_percentile, VDss_Lombardo_drugbank_approved_percentile
⑥ GNEprop : small molecule virtual screening
⑾ PubChem으로부터 IUPAC 이름들을 생성하는 함수
import requests
import time
def fetch_iupac_names_from_pubchem(start_cid, count):
iupac_names = []
base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{}/property/IUPACName/JSON"
current_cid = start_cid
while len(iupac_names) < count:
response = requests.get(base_url.format(current_cid))
if response.status_code == 200:
data = response.json()
if 'PropertyTable' in data and 'Properties' in data['PropertyTable']:
for prop in data['PropertyTable']['Properties']:
if 'IUPACName' in prop:
iupac_names.append(prop['IUPACName'])
if len(iupac_names) >= count:
break
else:
print(f"Failed to fetch data for CID {current_cid}")
current_cid += 1
time.sleep(0.1) # To prevent hitting the API rate limit
return iupac_names
# Fetch 100 IUPAC names starting from CID 1
iupac_names = fetch_iupac_names_from_pubchem(start_cid=1, count=100)
○ PubChem으로부터 IUPAC 명명법 크롤링 → IUPAC-to-SMILES → SMILES-to-image
○ 예제 생성 알고리즘에서 추가로 SMILES-to-IUPAC과 원래 IUPAC이 일치하는지 확인함으로써 다음과 같은 부수적인 효과가 발생
○ 효과 1. 부적절한 IUPAC 명명법을 제외
○ 효과 2. 컴퓨터가 이해하기 난해한 명명법을 제거함으로써 명명법 예제의 난이도를 조절
⑿ PubChem으로부터 입체배열을 갖는 SMILES 이름들을 생성하는 함수
import requests
import time
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.EnumerateStereoisomers import EnumerateStereoisomers, StereoEnumerationOptions
def fetch_smiles_from_pubchem(start_cid, count):
smiles_list = []
base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{}/property/CanonicalSMILES/JSON"
current_cid = start_cid
while len(smiles_list) < count:
response = requests.get(base_url.format(current_cid))
if response.status_code == 200:
data = response.json()
if 'PropertyTable' in data and 'Properties' in data['PropertyTable']:
for prop in data['PropertyTable']['Properties']:
if 'CanonicalSMILES' in prop:
smiles_list.append(prop['CanonicalSMILES'])
if len(smiles_list) >= count:
break
else:
print(f"Failed to fetch data for CID {current_cid}")
current_cid += 1
time.sleep(0.1) # To prevent hitting the API rate limit
return smiles_list
def generate_stereoisomers(smiles, max_isomers=5):
mol = Chem.MolFromSmiles(smiles)
if mol is None:
return []
opts = StereoEnumerationOptions(onlyUnassigned=True, maxIsomers=max_isomers)
isomers = list(EnumerateStereoisomers(mol, options=opts))
smiles_isomers = [Chem.MolToSmiles(isomer, isomericSmiles=True) for isomer in isomers]
return smiles_isomers
# Fetch 100 canonical SMILES codes starting from CID 1
smiles_codes = fetch_smiles_from_pubchem(start_cid=1, count=100)
# Generate stereoisomers for each fetched SMILES code
all_stereoisomers = []
for i, smiles in enumerate(smiles_codes, start=1):
stereoisomers = generate_stereoisomers(smiles)
all_stereoisomers.extend(stereoisomers)
print(f"Canonical SMILES {i}: {smiles}")
for j, isomer in enumerate(stereoisomers, start=1):
print(f" Stereoisomer {j}: {isomer}")
# Optionally, limit the number of generated stereoisomers for display
max_display = 100
print("\nGenerated Stereoisomers (limited to first {}):".format(max_display))
for i, isomer in enumerate(all_stereoisomers[:max_display], start=1):
print(f"{i}: {isomer}")
''' Visualization code example
from rdkit import Chem
from rdkit.Chem import Draw
smiles = 'O=C(O)C[C@@]1(Cl)C=CC(=O)O1'
molecule = Chem.MolFromSmiles(smiles)
Draw.MolToImage(molecule)
'''
○ PubChem으로부터 SMILES 명명법 크롤링 → canonical SMILES to stereochemical SMILES → SMILES-to-image
② 응용 2. 위 코드는 R 배향을 특히 선호하는 구조를 도출함
○ 시험에 출제된 이성질체 62개를 조사한 결과, R형이 답인 경우가 40개, S형이 답인 경우가 22개로 시험에서도 R형이 더 선호됨
○ R형과 S형의 이론적 비율은 동일할 것이므로, 오히려 인지 편향성이 있는 게 아닌가 추정됨
2. 드로잉 [목차]
⑴ 유기화합물 분자식 그리기 (예 : peracetic acid)
from rdkit import Chem
from rdkit.Chem import Draw
# Define the SMILES strings for peracetic acid
smiles = 'C=CC(=O)O'
# Convert the SMILES strings to RDKit molecule objects
molecule = Chem.MolFromSmiles(smiles)
# Draw the molecules without saving
Draw.MolToImage(molecule)
# Draw the molecules with saving
Draw.MolToFile(molecule, 'peracetic_acid.png')

⑵ 유기화합물 electron density map 그리기 (ver. 1) (예 : peracetic acid)
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import SimilarityMaps
# Generate a 3D structure
molecule = Chem.MolFromSmiles('CC(=O)OO')
molecule_3d = Chem.AddHs(molecule)
AllChem.EmbedMolecule(molecule_3d, AllChem.ETKDG())
AllChem.MMFFOptimizeMolecule(molecule_3d)
# Calculate Gasteiger charges
AllChem.ComputeGasteigerCharges(molecule_3d)
# Function to get atom charges
def GetAtomCharges(mol):
charges = [float(mol.GetAtomWithIdx(i).GetProp('_GasteigerCharge')) for i in range(mol.GetNumAtoms())]
return charges
# Draw electrostatic potential map
fig = SimilarityMaps.GetSimilarityMapFromWeights(molecule_3d, GetAtomCharges(molecule_3d), colorMap='jet', contourLines=10)

⑶ 유기화합물 electron density map 그리기 (ver. 2) (예 : peracetic acid)
# Refernece: https://rdkit.readthedocs.io/en/latest/Cookbook.html
## STEP 1. make a random forest model
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
import numpy
# generate four molecules
m1 = Chem.MolFromSmiles('c1ccccc1')
m2 = Chem.MolFromSmiles('c1ccccc1CC')
m3 = Chem.MolFromSmiles('c1ccncc1')
m4 = Chem.MolFromSmiles('c1ccncc1CC')
mols = [m1, m2, m3, m4]
# generate fingeprints: Morgan fingerprint with radius 2
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]
# convert the RDKit explicit vectors into numpy arrays
np_fps = []
for fp in fps:
arr = numpy.zeros((1,))
DataStructs.ConvertToNumpyArray(fp, arr)
np_fps.append(arr)
# get a random forest classifiert with 100 trees
rf = RandomForestClassifier(n_estimators=100, random_state=1123)
# train the random forest
# with the first two molecules being actives (class 1) and
# the last two being inactives (class 0)
ys_fit = [1, 1, 0, 0]
rf.fit(np_fps, ys_fit)
# use the random forest to predict a new molecule
m5 = Chem.MolFromSmiles('c1ccccc1O')
fp = numpy.zeros((1,))
DataStructs.ConvertToNumpyArray(AllChem.GetMorganFingerprintAsBitVect(m5, 2), fp)
print(rf.predict((fp,)))
print(rf.predict_proba((fp,)))
## STEP 2. run the random forest model for the input
from rdkit.Chem.Draw import SimilarityMaps
# helper function
def getProba(fp, predictionFunction):
return predictionFunction((fp,))[0][1]
m5 = Chem.MolFromSmiles('CC(=O)OO')
fig, maxweight = SimilarityMaps.GetSimilarityMapForModel(m5, SimilarityMaps.GetMorganFingerprint, lambda x: getProba(x, rf.predict_proba))

⑷ 유기화합물 3차원 분자식 그리기
from rdkit import Chem
from rdkit.Chem import AllChem
import py3Dmol
def draw_3d_molecule(smiles):
# Convert SMILES to RDKit molecule
mol = Chem.MolFromSmiles(smiles)
if mol is None:
print("Invalid SMILES code.")
return
# Generate 3D coordinates for the molecule
mol = Chem.AddHs(mol) # Add hydrogens
AllChem.EmbedMolecule(mol, AllChem.ETKDG()) # Embed molecule in 3D space
# Convert RDKit molecule to 3Dmol.js viewable format
mb = Chem.MolToMolBlock(mol)
# Visualization with Py3Dmol
viewer = py3Dmol.view(width=400, height=300)
viewer.addModel(mb, 'mol')
viewer.setStyle({'stick': {}})
viewer.zoomTo()
return viewer.show()
# Example usage
smiles_code = "CCO" # Ethanol
draw_3d_molecule(smiles_code)
① 검토 1. 메탄올은 입체 장애를 최소화하기 위해 두 메틸기가 staggered 배향인 반면, 에탄올은 분자 내 수소결합에 의해 그렇지 않음

② 검토 2. 다음과 같은 콘쥬게이트 고리 화합물에서도 구조가 잘 구현됨


Figure. 1. cyclooctatetraene에 적용한 예시
③ 이렇게 머신러닝 모델만으로 새로운 지식을 만들 수 있는 게 chemoinformatics의 매력
⑸ 유기화합물 2차원 분자식에 RS 명명법 표시하기
from rdkit import Chem
from rdkit.Chem import AllChem, Draw
from rdkit.Chem.rdMolDescriptors import CalcMolFormula
from PIL import Image, ImageDraw, ImageFont
import matplotlib.font_manager as fm
def get_stereochemistry(mol):
Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
stereo_info = {}
for atom in mol.GetAtoms():
if atom.HasProp('_CIPCode'):
stereo_info[atom.GetIdx()] = atom.GetProp('_CIPCode')
return stereo_info
def draw_2d_molecule_with_stereochemistry(smiles, filename="molecule.png"):
# Convert SMILES to RDKit molecule
mol = Chem.MolFromSmiles(smiles)
if mol is None:
print("Invalid SMILES code.")
return
# Generate 2D coordinates for the molecule
AllChem.Compute2DCoords(mol)
# Get stereochemistry information
stereo_info = get_stereochemistry(mol)
print(stereo_info)
# Draw the molecule
img = Draw.MolToImage(mol, size=(300, 300), kekulize=True, wedgeBonds=True)
# Convert the RDKit image to a PIL image
pil_img = img.convert("RGBA")
# Create a drawing context
draw = ImageDraw.Draw(pil_img)
# Load fonts
font_size = 18 # You can change the font size here
try:
# Use DejaVuSans.ttf for regular text
font_path = fm.findfont(fm.FontProperties(family="DejaVu Sans"))
font = ImageFont.truetype(font_path, font_size)
# Use DejaVuSans-Oblique.ttf for italic text
italic_font_path = fm.findfont(fm.FontProperties(family="DejaVu Sans", style="italic"))
italic_font = ImageFont.truetype(italic_font_path, font_size)
except IOError:
font = ImageFont.load_default()
italic_font = font # Fallback if no italic font is found
# Generate 2D coordinates for the drawing
AllChem.Compute2DCoords(mol)
conf = mol.GetConformer()
# Get 2D coordinates for each atom
coords = conf.GetPositions()
# Add stereo annotations
for atom_idx, stereo in stereo_info.items():
atom = mol.GetAtomWithIdx(atom_idx)
pos = coords[atom_idx]
# Calculate pixel position from molecule coordinates
x = pos[0] * 35 + 135
y = -pos[1] * 35 + 150
# Draw the text annotation with non-italicized brackets and italicized R/S
draw.text((x, y), "(", fill=(0, 0, 0), font=font)
draw.text((x + 10, y), stereo, fill=(0, 0, 0), font=italic_font)
draw.text((x + 30, y), ")", fill=(0, 0, 0), font=font)
# Save the image to a file
pil_img.save(filename)
print(f"Image saved as {filename}")
# Example usage
smiles_code = "C[C@H](O)[C@H](O)C" # Example molecule with stereochemistry
draw_2d_molecule_with_stereochemistry(smiles_code, "molecule_with_stereochemistry.png")

⑹ 유기화합물 3차원 분자식에 RS 명명법 표시하기
from rdkit import Chem
from rdkit.Chem import AllChem, Draw
from rdkit.Chem.rdMolDescriptors import CalcMolFormula
import py3Dmol
def get_stereochemistry(mol):
Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
stereo_info = {}
for atom in mol.GetAtoms():
if atom.HasProp('_CIPCode'):
stereo_info[atom.GetIdx()] = atom.GetProp('_CIPCode')
return stereo_info
def draw_3d_molecule_with_stereochemistry(smiles):
# Convert SMILES to RDKit molecule
mol = Chem.MolFromSmiles(smiles)
if mol is None:
print("Invalid SMILES code.")
return
# Generate 3D coordinates for the molecule
mol = Chem.AddHs(mol) # Add hydrogens
AllChem.EmbedMolecule(mol, AllChem.ETKDG()) # Embed molecule in 3D space
# Get stereochemistry information
stereo_info = get_stereochemistry(mol)
print(stereo_info)
# Convert RDKit molecule to 3Dmol.js viewable format
mb = Chem.MolToMolBlock(mol)
# Visualization with Py3Dmol
viewer = py3Dmol.view(width=400, height=300)
viewer.addModel(mb, 'mol')
viewer.setStyle({'stick': {}})
# Annotate the stereochemistry on the molecule
for atom_idx, stereo in stereo_info.items():
pos = mol.GetConformer().GetAtomPosition(atom_idx)
viewer.addLabel(stereo, {
'position': {'x': pos.x, 'y': pos.y, 'z': pos.z},
'backgroundColor': 'black',
'fontColor': 'white',
'fontSize': 14,
'showBackground': True
})
viewer.zoomTo()
return viewer.show()
# Example usage
smiles_code = "C[C@H](O)[C@H](O)C" # Example molecule with stereochemistry
draw_3d_molecule_with_stereochemistry(smiles_code)

⑺ 탄소수에 따른 알케인의 구조이성질체 모두 그리는 코드
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem.rdchem import Mol
from IPython.display import display
def generate_alkanes(n_carbons):
if n_carbons == 1:
return [Chem.MolFromSmiles('C')]
elif n_carbons == 2:
return [Chem.MolFromSmiles('CC')]
smaller_alkanes = generate_alkanes(n_carbons - 1)
new_alkanes = set()
for mol in smaller_alkanes:
for atom in mol.GetAtoms():
if atom.GetDegree() < 4: # 탄소는 최대 4개의 결합을 가질 수 있음
new_mol = Chem.RWMol(mol)
new_idx = new_mol.AddAtom(Chem.Atom(6))
new_mol.AddBond(atom.GetIdx(), new_idx, Chem.BondType.SINGLE)
Chem.SanitizeMol(new_mol)
smiles = Chem.MolToSmiles(new_mol, canonical=True)
new_alkanes.add(smiles)
return [Chem.MolFromSmiles(smiles) for smiles in new_alkanes]
# C7H16의 모든 구조 이성질체 생성 및 시각화
n_carbons = 7
alkanes = generate_alkanes(n_carbons)
img = Draw.MolsToGridImage(alkanes, molsPerRow=5, subImgSize=(200, 200))
# 이미지를 직접 디스플레이 (Jupyter 노트북에서만 동작)
display(img)

| 화학식 | 구조이성질체의 수 |
| C3H8 | 1 |
| C4H10 | 2 |
| C5H12 | 3 |
| C6H14 | 5 |
| C7H16 | 9 |
| C8H18 | 18 |
| C9H20 | 35 |
| C10H22 | 75 |
| C11H24 | 159 |
| C12H26 | 355 |
| C13H28 | 802 |
| C14H30 | 1,858 |
| C15H32 | 4,347 |
| C16H34 | 10,359 |
| C17H36 | 24,894 |
| C18H39 | 60,523 |
| C19H40 | 148,284 |
| C20H42 | 366,319 |
| C30H62 | 4,111,846,763 |
| C40H82 | 62,481,801,147,341 |
Table. 1. 탄소수에 따른 알케인의 구조이성질체의 수
⑻ 트리 자료구조를 활용하여 모든 알케인 치환기를 그리는 코드 (n_carbons = 6부터 중복이 존재함)
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.rdchem import AtomValenceException
from IPython.display import display
class Node:
def __init__(self, id):
self.id = id
self.children = []
class Tree:
def __init__(self, root):
self.nodes = [root]
def add_node(self, parent_id, node_id, max_nodes):
if len(self.nodes) >= max_nodes or len(self.nodes[parent_id].children) >= 3:
return None
new_node = Node(node_id)
self.nodes.append(new_node)
self.nodes[parent_id].children.append(new_node)
return new_node
def generate_trees(current_tree, current_id, max_nodes):
if len(current_tree.nodes) == max_nodes:
if is_valid_tree(current_tree):
return [current_tree]
else:
return []
trees = []
for node_id in range(len(current_tree.nodes)):
new_tree = Tree(Node(0))
new_tree.nodes = [Node(i) for i in range(len(current_tree.nodes))]
for i in range(len(current_tree.nodes)):
new_tree.nodes[i].children = [new_tree.nodes[n.id] for n in current_tree.nodes[i].children]
new_node = new_tree.add_node(node_id, current_id, max_nodes)
if new_node and is_valid_tree(new_tree):
trees.extend(generate_trees(new_tree, current_id + 1, max_nodes))
return trees
def is_valid_tree(tree):
for node in tree.nodes:
children_counts = [len(child.children) for child in node.children]
if any(children_counts[i] > children_counts[i+1] for i in range(len(children_counts)-1)):
return False
return True
def visualize_organic_structures(trees):
mols = []
for tree in trees:
mol = Chem.RWMol()
atom_index = {}
for node in tree.nodes:
if node.id == 0:
atom = Chem.Atom(6)
atom.SetNumRadicalElectrons(1) # Set radical electron for the root
else:
atom = Chem.Atom(6)
atom_index[node.id] = mol.AddAtom(atom)
for node in tree.nodes:
for child in node.children:
mol.AddBond(atom_index[node.id], atom_index[child.id], Chem.BondType.SINGLE)
mols.append(mol)
return mols
# Input the total number of nodes
n_nodes = 4
# Start with a single root node
initial_tree = Tree(Node(0))
all_trees = generate_trees(initial_tree, 1, n_nodes)
all_mols = visualize_organic_structures(all_trees)
# 시각화
img = Draw.MolsToGridImage(all_mols, molsPerRow=5, subImgSize=(200, 200), useSVG=False)
display(img)
① n_carbons = 3 (총 2개)

② n_carbons = 4 (총 4개)

② n_carbons = 5 (총 8개)

③ n_carbons = 6 (총 17개)

⑴ 염기서열을 아미노산 서열로 바꾸는 함수
from Bio.Seq import Seq
# Assuming 'sequence' is your string of nucleotides:
sequence = "ACTCATTCTCCCCAGACGCCAAGGATGGTGGTCATGGCGCCCCGAACCCTCTTCCTGCTGCTCTCGGGGGCCCTGACCCTGACCGAGACCTGGGCGG" # truncated for brevity
# Create a sequence object
seq_obj = Seq(sequence)
# Translate the sequence
amino_acid_sequence = seq_obj.translate(to_stop=True)
# Print the amino acid sequence
print(amino_acid_sequence)
⑵ 유전자명을 아미노산 서열로 바꾸는 함수
!git clone https://github.com/deepmind/alphafold.git
!cd alphafold/scripts
!sudo apt install aria2
!bash download_uniprot.sh ./
from pathlib import Path
def extract_sequences_from_uniprot(uniprot_fasta_path, gene_list, output_path="selected_genes.fa"):
"""
UniProt FASTA 파일에서 gene_list에 포함된 유전자의 단백질 서열을 추출하여 .fa 파일로 저장
"""
result = []
current_header = ""
current_seq = []
include = False
gene_set = set(g.upper() for g in gene_list)
with open(uniprot_fasta_path, 'r') as f:
for line in f:
if line.startswith('>'):
if include and current_header and current_seq:
result.append((gene_name, ''.join(current_seq)))
# 새로운 서열 시작
current_header = line.strip()
current_seq = []
include = False
# header에서 gene name 추출
if 'GN=' in current_header:
try:
gene_name = current_header.split('GN=')[1].split()[0].upper()
if gene_name in gene_set:
include = True
except Exception:
pass
else:
if include:
current_seq.append(line.strip())
# 마지막 서열 추가
if include and current_header and current_seq:
result.append((gene_name, ''.join(current_seq)))
# 저장
if result:
with open(output_path, 'w') as f:
for gene, seq in result:
f.write(f">{gene}\n{seq}\n")
print(f"✅ {len(result)}개 서열이 '{output_path}'에 저장되었습니다.")
else:
print("❌ 해당 유전자의 서열을 찾을 수 없습니다.")
# 🔧 예시 실행
gene_list = ['THBS4', 'CD36']
uniprot_fasta_path = "uniprot.fasta"
extract_sequences_from_uniprot(uniprot_fasta_path, gene_list, output_path="genes_from_local_uniprot.fa")
# ✅ 1312개 서열이 'genes_from_local_uniprot.fa'에 저장되었습니다.
⑶ AlphaFold2를 이용하여 아미노산 서열로부터 아미노산 구조에 관한 PDB 파일을 생성하는 함수
① AlphaFold2 설치
git clone https://github.com/deepmind/alphafold.git
cd alphafold/scripts
sudo apt install aria2
bash download_all_data.sh ./
docker build -f docker/Dockerfile -t alphafold .
docker run --gpus all -it \
--name alphafold_cpu_session \
-v "$(pwd)":/app/alphafold \
-v "$(pwd)/afdb":/mnt/databases \
-v "$(pwd)/output":/mnt/output \
--entrypoint /bin/bash \
alphafold
#docker start -ai alphafold_cpu_session
#docker exec -it alphafold_cpu_session bash
#docker stop alphafold_cpu_session
#docker rm alphafold_cpu_session
② AlphaFold2 Monomer 실행
#export TF_FORCE_UNIFIED_MEMORY=1 ## for GPU
#export XLA_PYTHON_CLIENT_PREALLOCATE=false ## for GPU
#export XLA_PYTHON_CLIENT_MEM_FRACTION=.85 ## for GPU
export JAX_PLATFORM_NAME=cpu ## for CPU
export JAX_TRACEBACK_FILTERING=off ## for CPU
python3 /app/alphafold/run_alphafold.py \
--fasta_paths=/app/alphafold/data/data.fa \
--output_dir=/mnt/output \
--data_dir=/app/alphafold/scripts \
--model_preset=monomer \
--db_preset=reduced_dbs \
--uniref90_database_path=/app/alphafold/scripts/uniref90/uniref90.fasta \
--mgnify_database_path=/app/alphafold/scripts/mgnify/mgy_clusters_2022_05.fa \
--small_bfd_database_path=/app/alphafold/scripts/small_bfd/bfd-first_non_consensus_sequences.fasta \
--pdb70_database_path=/app/alphafold/scripts/pdb70/pdb70 \
--template_mmcif_dir=/app/alphafold/scripts/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=/app/alphafold/scripts/pdb_mmcif/obsolete.dat \
--max_template_date=2023-12-31 \
--use_gpu_relax=False
③ AlphaFold2 Multimer 실행
#export TF_FORCE_UNIFIED_MEMORY=1 ## for GPU
#export XLA_PYTHON_CLIENT_PREALLOCATE=false ## for GPU
#export XLA_PYTHON_CLIENT_MEM_FRACTION=.85 ## for GPU
export JAX_PLATFORM_NAME=cpu ## for CPU
export JAX_TRACEBACK_FILTERING=off ## for CPU
python3 /app/alphafold/run_alphafold.py \
--fasta_paths=/app/alphafold/data/data.fa \
--output_dir=/mnt/output \
--data_dir=/app/alphafold/scripts \
--model_preset=multimer \
--db_preset=reduced_dbs \
--uniref90_database_path=/app/alphafold/scripts/uniref90/uniref90.fasta \
--mgnify_database_path=/app/alphafold/scripts/mgnify/mgy_clusters_2022_05.fa \
--small_bfd_database_path=/app/alphafold/scripts/small_bfd/bfd-first_non_consensus_sequences.fasta \
--template_mmcif_dir=/app/alphafold/scripts/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=/app/alphafold/scripts/pdb_mmcif/obsolete.dat \
--pdb_seqres_database_path=/app/alphafold/scripts/pdb_seqres/pdb_seqres.txt \
--uniprot_database_path=/app/alphafold/scripts/uniprot/uniprot.fasta \
--max_template_date=1900-01-01 \
--use_gpu_relax=False
⑷ SMILES 임베딩 함수로 아미노산 임베딩 후 2차원 시각화
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.manifold import TSNE
import os
# -------------------------------
# 0) 설정
# -------------------------------
os.environ["TOKENIZERS_PARALLELISM"] = "false"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"[INFO] Using device: {DEVICE}")
# -------------------------------
# 1) 20개 아미노산 그룹 정의
# -------------------------------
aa_groups = {
"Nonpolar": ["Ala","Ile","Leu","Met","Phe","Pro","Trp","Val","Gly"],
"Polar (uncharged)": ["Asn","Cys","Gln","Ser","Thr","Tyr"],
"Charged (+)": ["Arg","His","Lys"],
"Charged (-)": ["Asp","Glu"],
}
aa_1letter = {
"Ala":"A","Arg":"R","Asn":"N","Asp":"D","Cys":"C","Gln":"Q","Glu":"E","Gly":"G",
"His":"H","Ile":"I","Leu":"L","Lys":"K","Met":"M","Phe":"F","Pro":"P","Ser":"S",
"Thr":"T","Trp":"W","Tyr":"Y","Val":"V",
}
amino_acids = sum(aa_groups.values(), [])
AA_SEQ = {
"Ala": "C[C@@H](C(=O)O)N",
"Arg": "C(C[C@@H](C(=O)O)N)CN=C(N)N",
"Asn": "C([C@@H](C(=O)O)N)C(=O)N",
"Asp": "C(C(C(=O)O)N)C(=O)O",
"Cys": "C([C@@H](C(=O)O)N)S",
"Gln": "C(CC(=O)N)[C@@H](C(=O)O)N",
"Glu": "C(CC(=O)O)[C@@H](C(=O)O)N",
"Gly": "C(C(=O)O)N",
"His": "C1=C(NC=N1)C[C@@H](C(=O)O)N",
"Ile": "CC[C@H](C)[C@@H](C(=O)O)N",
"Leu": "CC(C)C[C@@H](C(=O)O)N",
"Lys": "C(CCN)C[C@@H](C(=O)O)N",
"Met": "CSCC[C@@H](C(=O)O)N",
"Phe": "C1=CC=C(C=C1)C[C@@H](C(=O)O)N",
"Pro": "C1C[C@H](NC1)C(=O)O",
"Ser": "C([C@@H](C(=O)O)N)O",
"Thr": "C[C@H]([C@@H](C(=O)O)N)O",
"Trp": "C1=CC=C2C(=C1)C(=CN2)C[C@@H](C(=O)O)N",
"Tyr": "C1=CC(=CC=C1C[C@@H](C(=O)O)N)O",
"Val": "CC(C)[C@@H](C(=O)O)N",
}
def get_sequence(aa3):
s = AA_SEQ.get(aa3, "")
if isinstance(s, str) and len(s.strip()) > 0:
return s.strip()
return aa_1letter[aa3] # 입력 없으면 1-letter로 대체
texts = [get_sequence(a) for a in amino_acids]
# -------------------------------
# 3) 임베딩 모델 로드
# -------------------------------
MODEL_ID_CHEM = "DeepChem/ChemBERTa-100M-MLM"
MODEL_ID_MOL = "ibm-research/MoLFormer-XL-both-10pct"
tok = AutoTokenizer.from_pretrained(MODEL_ID_MOL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID_MOL, trust_remote_code=True).to(DEVICE).eval()
@torch.no_grad()
def embed_smiles_molformer_batch(smiles_list, batch_size=64):
embeddings = []
total = len(smiles_list)
for i in range(0, total, batch_size):
batch = smiles_list[i : i + batch_size]
enc = tok(batch, padding=True, truncation=True, return_tensors="pt")
enc = {k: v.to(DEVICE) for k, v in enc.items()}
out = model(**enc)
if hasattr(out, "pooler_output") and out.pooler_output is not None:
emb = out.pooler_output
else:
last = out.last_hidden_state
mask = enc["attention_mask"].unsqueeze(-1)
emb = (last * mask).sum(1) / mask.sum(1).clamp(min=1)
embeddings.append(emb.cpu().numpy())
return np.vstack(embeddings)
# DataFrame
df = pd.DataFrame({
"AA": amino_acids,
"Seq": texts,
"Group": [g for g, items in aa_groups.items() for _ in items]
})
# -------------------------------
# 4) 임베딩 + t-SNE
# -------------------------------
X = embed_smiles_molformer_batch(df["Seq"].tolist(), batch_size=16)
tsne = TSNE(n_components=2, perplexity=5, random_state=42, init="pca", learning_rate="auto")
X2 = tsne.fit_transform(X)
df["tSNE1"] = X2[:, 0]
df["tSNE2"] = X2[:, 1]
# -------------------------------
# 5) 시각화 (그룹별 색)
# -------------------------------
color_map = {
"Nonpolar": "tab:blue",
"Polar (uncharged)": "tab:green",
"Charged (+)": "tab:red",
"Charged (-)": "tab:purple",
}
plt.figure(figsize=(10, 8))
for grp in df["Group"].unique():
sub = df[df["Group"] == grp]
plt.scatter(
sub["tSNE1"], sub["tSNE2"],
s=140, alpha=0.85, label=grp,
c=color_map[grp], edgecolors="k", linewidths=0.5
)
for _, r in df.iterrows():
plt.text(r["tSNE1"] + 0.5, r["tSNE2"] + 0.5, r["AA"], fontsize=11)
plt.title("20 Amino Acids (Manual sequences) - t-SNE 2D", fontsize=16, fontweight="bold")
plt.xlabel("t-SNE 1", fontsize=14)
plt.ylabel("t-SNE 2", fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, linestyle="--", alpha=0.4)
plt.tight_layout()
plt.show()

⑴ 개요
① 화학식으로부터 MS, IR, NMR 스펙트럼을 예측하거나, MS, IR, NMR 데이터를 통해 화학식을 예측하는 연구가 활발히 진행되고 있음
② 이 연구는 딥러닝 기술의 발전과 함께 현재 크게 진전되고 있음
○ 예 : NMR-TS
⑴ ASKCOS
⑵ AiZynthFinder
⑶ SPARROW
⑷ 최근 Chemical.AI를 비롯한 AI 회사들이 유기화학 반응을 재구성하는 AI를 발표한 바 있음

입력: 2023.11.30 02:40
수정: 2024.06.11 12:26
'▶ 자연과학 > ▷ 화학정보학' 카테고리의 다른 글
| 【화학정보학】 단백질체 분석 파이프라인(Proteomics Pipeline) (1) | 2024.03.31 |
|---|---|
| 【화학정보학】 물질 라이브러리 (0) | 2023.12.11 |
| 【화학정보학】 생물학 실험에 사용되는 형광물질 라이브러리 (7) | 2022.05.04 |
| 【화학정보학】 저분자 약물 데이터베이스 종류 (0) | 2022.04.27 |
| 【화학정보학】 New Cancer Drugs Approved by the FDA in 20 years (0) | 2022.01.08 |
최근댓글