유기화학 관련 파이썬 함수 모음 (화학정보학, chemoinformatics; structural bioinformatics)
추천글 : 【유기화학】 유기화학 목차, 【Python】 파이썬 유용 함수 모음
1. 명명법 [본문]
2. 드로잉 [본문]
3. 아미노산 [본문]
4. 분광학 [본문]
a. 생물정보학
⑴ SMILES(simplified molecular-input line entry system) : 짧은 ASCII 문자열 표현
① 이중결합은 '='로, 삼중결합은 '#'로 나타냄
② 고리형 화합물의 경우, 1, 2, ···과 같은 숫자를 부여하여, 선형 분자의 끝과 끝이 연결돼 고리를 이루는 것을 나타냄
○ 예 : CN1C=NC2=C1C(=O)N(C(=O)N2C)C
③ C는 일반 탄소를 나타내는 반면, c는 방향족 탄소를 나타냄
○ C1CCCCC1 : cyclohexane
○ c1ccccc1 : benzene
⑵ 유기화합물을 SMILES로 바꾸는 코드
rdkit.Chem.MolToSmiles(Chem.MolFromFASTA(sequence, flavor = 1))
"""
flavor: (optional)
0 Protein, L amino acids (default)
1 Protein, D amino acids
2 RNA, no cap
3 RNA, 5’ cap
4 RNA, 3’ cap
5 RNA, both caps
6 DNA, no cap
7 DNA, 5’ cap
8 DNA, 3’ cap
9 DNA, both caps
"""
⑶ SMILES, IUPAC 상호 변환 코드 : transformer 활용
# reference: https://github.com/Kohulan/Smiles-TO-iUpac-Translator
! pip install STOUT-pypi
! pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git
from STOUT import translate_forward, translate_reverse
# SMILES to IUPAC name translation
SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of "+SMILES+" is: "+IUPAC_name)
# IUPAC name to SMILES translation
IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
SMILES = translate_reverse(IUPAC_name)
print("SMILES of "+IUPAC_name+" is: "+SMILES)
⑷ 임의의 화학식이 주어져 있을 때 IUPAC 명명법을 알아내는 방법
① 단계 1. SMILES-to-drawing 함수를 여러 번 시도하여 주어진 화합물을 나타내는 SMILES 코드 찾아내기
② 단계 2. SMILES-to-IUPAC 함수를 실행시켜 최종 명명법 구하기
⑸ SMILES로부터 분자량을 구하는 코드
from rdkit import Chem
from rdkit.Chem import Descriptors
def calculate_molecular_weight(smiles):
molecule = Chem.MolFromSmiles(smiles)
return Descriptors.ExactMolWt(molecule)
# Example usage
smiles_code = "C1=CC=C(C=C1)O" # SMILES code for phenol
molecular_weight = calculate_molecular_weight(smiles_code)
print(f"Molecular Weight: {molecular_weight}")
⑹ SMILES로부터 방향족성을 판단하는 코드
from rdkit import Chem
def is_aromatic(smiles):
molecule = Chem.MolFromSmiles(smiles)
if molecule is None:
return False
return any(atom.GetIsAromatic() for atom in molecule.GetAtoms())
# Example usage
cyclohexane = "C1CCCCC1" # cyclohexane
benzene = "c1ccccc1" # benzene
imidazole = "C1=CN=CN1" # 1H-imidazole
print(f"The {cyclohexane} is {'aromatic' if is_aromatic(cyclohexane) else 'not aromatic'}")
print(f"The {benzene} is {'aromatic' if is_aromatic(benzene) else 'not aromatic'}")
print(f"The {imidazole} is {'aromatic' if is_aromatic(imidazole) else 'not aromatic'}")
⑺ SMILES로부터 쌍극자 모멘트를 계산하는 코드
# conda install -c psi4 psi4
import psi4
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
def calculate_dipole_moment(smiles):
# Convert SMILES to molecule
mol = Chem.MolFromSmiles(smiles)
# Add Hydrogens
mol = Chem.AddHs(mol)
# Generate 3D coordinates
AllChem.EmbedMolecule(mol, AllChem.ETKDG())
# Extract coordinates
conf = mol.GetConformer()
xyz = ''
for atom in mol.GetAtoms():
pos = conf.GetAtomPosition(atom.GetIdx())
xyz += f"{atom.GetSymbol()} {pos.x} {pos.y} {pos.z}\n"
# Set up Psi4
psi4.set_memory('500 MB')
psi4.set_options({'basis': 'sto-3g'})
# Calculate dipole moment using Psi4
psi4_mol = psi4.geometry(xyz)
psi4.energy('scf')
dipole_moment = psi4.variable('SCF DIPOLE')
return dipole_moment
### Example usage
smiles_code = "CCO" # Example SMILES code for ethanol
dipole_moment = calculate_dipole_moment(smiles_code)
print(f"Dipole Moment: {dipole_moment} Debye")
# Dipole Moment: [ 0.04250251 0.20600936 -0.52850913] Debye
dipole_vector = np.array([0.04250251, 0.20600936, -0.52850913])
magnitude = np.linalg.norm(dipole_vector)
print(f"Magnitude of Dipole Moment: {magnitude} Debye")
# Magnitude of Dipole Moment: 0.5688305725409514 Debye
⑻ SMILES로부터 끓는점(bp), 녹는점(mp), 임계온도를 구하는 함수
# reference: https://thermo.readthedocs.io/thermo.chemical.html
from thermo.chemical import Chemical
N2 = Chemical('Nitrogen')
print(N2.Tm, N2.Tb, N2.Tc) # melting, boiling, and critical points [K]
## 63.15 77.3549950205 126.192
molecule = Chemical('CC(C)C')
print(molecule.Tm, molecule.Tb, molecule.Tc) # melting, boiling, and critical points [K]
## 124.2 261.401014643 407.81
molecule_ = Chemical('2-methylpropane')
print(molecule_.Tm, molecule_.Tb, molecule_.Tc) # melting, boiling, and critical points [K]
## 124.2 261.401014643 407.81
① search-based로 작동하며 모든 화합물이 대상이 되지 않음
② 이를 개선하기 위해 다양한 머신러닝 모델이 도입되고 있음
⑼ SMILES로부터 ADMET(흡수, 분포, 대사, 배출, 독성)를 예측하는 모델 : ADMET-AI
2. 드로잉 [목차]
⑴ 유기화합물 분자식 그리기 (예 : peractic acid)
from rdkit import Chem
from rdkit.Chem import Draw
# Define the SMILES strings for peractic acid
smiles = 'C=CC(=O)O'
# Convert the SMILES strings to RDKit molecule objects
molecule = Chem.MolFromSmiles(smiles)
# Draw the molecules without saving
Draw.MolToImage(molecule)
# Draw the molecules with saving
Draw.MolToFile(molecule, 'peractic_acid.png')
⑵ 유기화합물 electron density map 그리기 (ver. 1) (예 : peractic acid)
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import SimilarityMaps
# Generate a 3D structure
molecule = Chem.MolFromSmiles('CC(=O)OO')
molecule_3d = Chem.AddHs(molecule)
AllChem.EmbedMolecule(molecule_3d, AllChem.ETKDG())
AllChem.MMFFOptimizeMolecule(molecule_3d)
# Calculate Gasteiger charges
AllChem.ComputeGasteigerCharges(molecule_3d)
# Function to get atom charges
def GetAtomCharges(mol):
charges = [float(mol.GetAtomWithIdx(i).GetProp('_GasteigerCharge')) for i in range(mol.GetNumAtoms())]
return charges
# Draw electrostatic potential map
fig = SimilarityMaps.GetSimilarityMapFromWeights(molecule_3d, GetAtomCharges(molecule_3d), colorMap='jet', contourLines=10)
⑶ 유기화합물 electron density map 그리기 (ver. 2) (예 : peractic acid)
# Refernece: https://rdkit.readthedocs.io/en/latest/Cookbook.html
## STEP 1. make a random forest model
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
import numpy
# generate four molecules
m1 = Chem.MolFromSmiles('c1ccccc1')
m2 = Chem.MolFromSmiles('c1ccccc1CC')
m3 = Chem.MolFromSmiles('c1ccncc1')
m4 = Chem.MolFromSmiles('c1ccncc1CC')
mols = [m1, m2, m3, m4]
# generate fingeprints: Morgan fingerprint with radius 2
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]
# convert the RDKit explicit vectors into numpy arrays
np_fps = []
for fp in fps:
arr = numpy.zeros((1,))
DataStructs.ConvertToNumpyArray(fp, arr)
np_fps.append(arr)
# get a random forest classifiert with 100 trees
rf = RandomForestClassifier(n_estimators=100, random_state=1123)
# train the random forest
# with the first two molecules being actives (class 1) and
# the last two being inactives (class 0)
ys_fit = [1, 1, 0, 0]
rf.fit(np_fps, ys_fit)
# use the random forest to predict a new molecule
m5 = Chem.MolFromSmiles('c1ccccc1O')
fp = numpy.zeros((1,))
DataStructs.ConvertToNumpyArray(AllChem.GetMorganFingerprintAsBitVect(m5, 2), fp)
print(rf.predict((fp,)))
print(rf.predict_proba((fp,)))
## STEP 2. run the random forest model for the input
from rdkit.Chem.Draw import SimilarityMaps
# helper function
def getProba(fp, predictionFunction):
return predictionFunction((fp,))[0][1]
m5 = Chem.MolFromSmiles('CC(=O)OO')
fig, maxweight = SimilarityMaps.GetSimilarityMapForModel(m5, SimilarityMaps.GetMorganFingerprint, lambda x: getProba(x, rf.predict_proba))
⑷ 유기화합물 3차원 분자식 그리기
from rdkit import Chem
from rdkit.Chem import AllChem
import py3Dmol
def draw_3d_molecule(smiles):
# Convert SMILES to RDKit molecule
mol = Chem.MolFromSmiles(smiles)
if mol is None:
print("Invalid SMILES code.")
return
# Generate 3D coordinates for the molecule
mol = Chem.AddHs(mol) # Add hydrogens
AllChem.EmbedMolecule(mol, AllChem.ETKDG()) # Embed molecule in 3D space
# Convert RDKit molecule to 3Dmol.js viewable format
mb = Chem.MolToMolBlock(mol)
# Visualization with Py3Dmol
viewer = py3Dmol.view(width=400, height=300)
viewer.addModel(mb, 'mol')
viewer.setStyle({'stick': {}})
viewer.zoomTo()
return viewer.show()
# Example usage
smiles_code = "CCO" # Ethanol
draw_3d_molecule(smiles_code)
① 검토 1. 메탄올은 입체 장애를 최소화하기 위해 두 메틸기가 staggered 배향인 반면, 에탄올은 분자 내 수소결합에 의해 그렇지 않음
② 검토 2. 다음과 같은 콘쥬게이트 고리 화합물에서도 구조가 잘 구현됨
Figure. 1. cyclooctatetraene에 적용한 예시
③ 이렇게 머신러닝 모델만으로 새로운 지식을 만들 수 있는 게 chemoinformatics의 매력
⑴ 염기서열을 아미노산 서열로 바꾸는 함수
from Bio.Seq import Seq
# Assuming 'sequence' is your string of nucleotides:
sequence = "ACTCATTCTCCCCAGACGCCAAGGATGGTGGTCATGGCGCCCCGAACCCTCTTCCTGCTGCTCTCGGGGGCCCTGACCCTGACCGAGACCTGGGCGG" # truncated for brevity
# Create a sequence object
seq_obj = Seq(sequence)
# Translate the sequence
amino_acid_sequence = seq_obj.translate(to_stop=True)
# Print the amino acid sequence
print(amino_acid_sequence)
⑵ AlphaFold2를 이용하여 아미노산 서열로부터 아미노산 구조에 관한 PBD 파일을 생성하는 함수 (추후 업데이트)
⑴ SMILES-to-IR (추후 업데이트)
⑵ SMILES-to-NMR (추후 업데이트)
⑶ NMR-to-structure (추후 업데이트)
입력: 2023.11.30 02:40
'▶ 자연과학 > ▷ 화학정보학' 카테고리의 다른 글
【화학정보학】 단백질-단백질 상호작용(PPI) 모델 (1) | 2024.03.31 |
---|---|
【화학정보학】 물질 라이브러리 (0) | 2023.12.11 |
【화학정보학】 생물학 실험에 사용되는 형광물질 라이브러리 (7) | 2022.05.04 |
【화학정보학】 저분자 약물 데이터베이스 종류 (0) | 2022.04.27 |
【화학정보학】 New Cancer Drugs Approved by the FDA in 20 years (0) | 2022.01.08 |
최근댓글