【화학정보학】 유기화학 관련 파이썬 함수 모음

유기화학 관련 파이썬 함수 모음 (화학정보학, chemoinformatics; structural bioinformatics)

추천글 : 【유기화학】 유기화학 목차, 【Python】 파이썬 유용 함수 모음

1. 명명법 [본문]

2. 드로잉 [본문]

3. 아미노산 [본문]

4. 분광학 [본문]

a. 생물정보학

1. 명명법 [목차]

⑴ SMILES(simplified molecular-input line entry system) : 짧은 ASCII 문자열 표현

① 이중결합은 '='로, 삼중결합은 '#'로 나타냄

② 고리형 화합물의 경우, 1, 2, ···과 같은 숫자를 부여하여, 선형 분자의 끝과 끝이 연결돼 고리를 이루는 것을 나타냄

○ 예 : CN1C=NC2=C1C(=O)N(C(=O)N2C)C

③ C는 일반 탄소를 나타내는 반면, c는 방향족 탄소를 나타냄

○ C1CCCCC1 : cyclohexane

○ c1ccccc1 : benzene

④ 대괄호를 써서 보다 복잡한 경우를 나타낼 수 있음

○ [N+]와 같이 전하 표시도 할 수 있음

⑤ @을 써서 입체 배열을 나타낼 수 있음

○ 예 : C[C@H](O)[C@H](O)C

⑥ /, \를 써서 방향이 같은지 다른지를 나타내어 EZ 이성질체를 나타낼 수 있음

○ 예 : CCC/C(=C/C(=O)OCC)/C(=O)OCC

⑵ 유기화합물을 SMILES로 바꾸는 코드

rdkit.Chem.MolToSmiles(Chem.MolFromFASTA(sequence, flavor = 1))

"""
flavor: (optional)
0 Protein, L amino acids (default)
1 Protein, D amino acids
2 RNA, no cap
3 RNA, 5’ cap
4 RNA, 3’ cap
5 RNA, both caps
6 DNA, no cap
7 DNA, 5’ cap
8 DNA, 3’ cap
9 DNA, both caps
"""

⑶ SMILES, IUPAC 상호 변환 코드 : transformer 활용

# reference: https://github.com/Kohulan/Smiles-TO-iUpac-Translator

! pip install STOUT-pypi
! pip install git+https://github.com/Kohulan/Smiles-TO-iUpac-Translator.git

from STOUT import translate_forward, translate_reverse

# SMILES to IUPAC name translation

SMILES = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
IUPAC_name = translate_forward(SMILES)
print("IUPAC name of "+SMILES+" is: "+IUPAC_name)

# IUPAC name to SMILES translation

IUPAC_name = "1,3,7-trimethylpurine-2,6-dione"
SMILES = translate_reverse(IUPAC_name)
print("SMILES of "+IUPAC_name+" is: "+SMILES)

① IUPAC 입력이 완전히 정확하지 않아도 SMILES 코드 변환이 작동할 수 있음

⑷ 임의의 화학식이 주어져 있을 때 IUPAC 명명법을 알아내는 방법

① 단계 1. SMILES-to-drawing 함수를 여러 번 시도하여 주어진 화합물을 나타내는 SMILES 코드 찾아내기

② 단계 2. SMILES-to-IUPAC 함수를 실행시켜 최종 명명법 구하기

⑸ SMILES로부터 분자량을 구하는 코드

from rdkit import Chem
from rdkit.Chem import Descriptors

def calculate_molecular_weight(smiles):
    molecule = Chem.MolFromSmiles(smiles)
    return Descriptors.ExactMolWt(molecule)

# Example usage
smiles_code = "C1=CC=C(C=C1)O"  # SMILES code for phenol
molecular_weight = calculate_molecular_weight(smiles_code)
print(f"Molecular Weight: {molecular_weight}")

⑹ SMILES로부터 방향족성을 판단하는 코드

from rdkit import Chem

def is_aromatic(smiles):
    molecule = Chem.MolFromSmiles(smiles)
    if molecule is None:
        return False
    return any(atom.GetIsAromatic() for atom in molecule.GetAtoms())

# Example usage
cyclohexane = "C1CCCCC1"  # cyclohexane
benzene = "c1ccccc1"  # benzene
imidazole = "C1=CN=CN1" # 1H-imidazole

print(f"The {cyclohexane} is {'aromatic' if is_aromatic(cyclohexane) else 'not aromatic'}")
print(f"The {benzene} is {'aromatic' if is_aromatic(benzene) else 'not aromatic'}")
print(f"The {imidazole} is {'aromatic' if is_aromatic(imidazole) else 'not aromatic'}")

⑺ SMILES로부터 쌍극자 모멘트를 계산하는 코드

# conda install -c psi4 psi4

import psi4
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

def calculate_dipole_moment(smiles):
    # Convert SMILES to molecule
    mol = Chem.MolFromSmiles(smiles)

    # Add Hydrogens
    mol = Chem.AddHs(mol)

    # Generate 3D coordinates
    AllChem.EmbedMolecule(mol, AllChem.ETKDG())

    # Extract coordinates
    conf = mol.GetConformer()
    xyz = ''
    for atom in mol.GetAtoms():
        pos = conf.GetAtomPosition(atom.GetIdx())
        xyz += f"{atom.GetSymbol()} {pos.x} {pos.y} {pos.z}\n"

    # Set up Psi4
    psi4.set_memory('500 MB')
    psi4.set_options({'basis': 'sto-3g'})

    # Calculate dipole moment using Psi4
    psi4_mol = psi4.geometry(xyz)
    psi4.energy('scf')
    dipole_moment = psi4.variable('SCF DIPOLE')

    return dipole_moment

### Example usage
smiles_code = "CCO"  # Example SMILES code for ethanol
dipole_moment = calculate_dipole_moment(smiles_code)
print(f"Dipole Moment: {dipole_moment} Debye")
# Dipole Moment: [ 0.04250251  0.20600936 -0.52850913] Debye

dipole_vector = np.array([0.04250251, 0.20600936, -0.52850913])
magnitude = np.linalg.norm(dipole_vector)
print(f"Magnitude of Dipole Moment: {magnitude} Debye")
# Magnitude of Dipole Moment: 0.5688305725409514 Debye

⑻ SMILES로부터 끓는점(bp), 녹는점(mp), 임계온도를 구하는 함수

# reference: https://thermo.readthedocs.io/thermo.chemical.html

from thermo.chemical import Chemical

N2 = Chemical('Nitrogen')
print(N2.Tm, N2.Tb, N2.Tc) # melting, boiling, and critical points [K]
## 63.15 77.3549950205 126.192

molecule = Chemical('CC(C)C')
print(molecule.Tm, molecule.Tb, molecule.Tc) # melting, boiling, and critical points [K]
## 124.2 261.401014643 407.81

molecule_ = Chemical('2-methylpropane')
print(molecule_.Tm, molecule_.Tb, molecule_.Tc) # melting, boiling, and critical points [K]
## 124.2 261.401014643 407.81

① search-based로 작동하며 모든 화합물이 대상이 되지 않음 : 이를 개선하기 위해 다양한 머신러닝 모델이 도입되고 있음

② 응용 1. 끓는점, 녹는점 예제 만들기

import random
from rdkit import Chem
from rdkit.Chem import Draw
from thermo.chemical import Chemical
from STOUT import translate_forward
from IPython.display import display
from PIL import Image, ImageDraw, ImageFont

def generate_alkanes(n_carbons):
    if n_carbons == 1:
        return [Chem.MolFromSmiles('C')]
    elif n_carbons == 2:
        return [Chem.MolFromSmiles('CC')]
    
    smaller_alkanes = generate_alkanes(n_carbons - 1)
    new_alkanes = set()
    
    for mol in smaller_alkanes:
        for atom in mol.GetAtoms():
            if atom.GetDegree() < 4:  # 탄소는 최대 4개의 결합을 가질 수 있음
                new_mol = Chem.RWMol(mol)
                new_idx = new_mol.AddAtom(Chem.Atom(6))
                new_mol.AddBond(atom.GetIdx(), new_idx, Chem.BondType.SINGLE)
                Chem.SanitizeMol(new_mol)
                smiles = Chem.MolToSmiles(new_mol, canonical=True)
                new_alkanes.add(smiles)
    
    return [Chem.MolFromSmiles(smiles) for smiles in new_alkanes]

def draw_and_display_properties(isomer1, isomer2):
    # 이성질체의 SMILES로부터 thermo의 Chemical 객체 생성
    smiles1 = Chem.MolToSmiles(isomer1)
    smiles2 = Chem.MolToSmiles(isomer2)
    
    chemical1 = Chemical(smiles1)
    chemical2 = Chemical(smiles2)
    
    # IUPAC 이름 변환
    iupac_name1 = translate_forward(smiles1)
    iupac_name2 = translate_forward(smiles2)
    
    # 끓는점, 녹는점 계산 (°C로 변환)
    melting_point1 = chemical1.Tm - 273.15  # K to °C
    boiling_point1 = chemical1.Tb - 273.15  # K to °C
    
    melting_point2 = chemical2.Tm - 273.15  # K to °C
    boiling_point2 = chemical2.Tb - 273.15  # K to °C
    
    # 구조 이성질체 그리기
    img1 = Draw.MolToImage(isomer1, size=(300, 300))
    img2 = Draw.MolToImage(isomer2, size=(300, 300))
    
    # PIL로 이미지 결합 및 텍스트 추가
    combined_img = Image.new('RGB', (600, 400), (255, 255, 255))
    combined_img.paste(img1, (0, 0))
    combined_img.paste(img2, (300, 0))
    
    draw = ImageDraw.Draw(combined_img)
    font = ImageFont.load_default(size=16)
    
    # 첫 번째 이성질체 텍스트 추가
    draw.text((10, 310), f"IUPAC: {iupac_name1}", fill=(0, 0, 0), font=font)
    draw.text((10, 330), f"Melting Point: {melting_point1:.2f} °C", fill=(0, 0, 0), font=font)
    draw.text((10, 350), f"Boiling Point: {boiling_point1:.2f} °C", fill=(0, 0, 0), font=font)
    
    # 두 번째 이성질체 텍스트 추가
    draw.text((310, 310), f"IUPAC: {iupac_name2}", fill=(0, 0, 0), font=font)
    draw.text((310, 330), f"Melting Point: {melting_point2:.2f} °C", fill=(0, 0, 0), font=font)
    draw.text((310, 350), f"Boiling Point: {boiling_point2:.2f} °C", fill=(0, 0, 0), font=font)
    
    # 비교 결과 추가
    #melting_comparison = f"{iupac_name1} has higher melting point" if melting_point1 > melting_point2 else f"{iupac_name2} has higher melting point"
    #boiling_comparison = f"{iupac_name1} has higher boiling point" if boiling_point1 > boiling_point2 else f"{iupac_name2} has higher boiling point"
    
    #draw.text((10, 370), melting_comparison, fill=(0, 0, 255), font=font)
    #draw.text((310, 370), boiling_comparison, fill=(0, 0, 255), font=font)
    
    # 이미지 출력
    display(combined_img)

# n값 설정 (예: n=7, 탄소 7개)
n = 7
alkanes = generate_alkanes(n)
isomer1, isomer2 = random.sample(alkanes, 2)
draw_and_display_properties(isomer1, isomer2)

⑼ SMILES로부터 산도를 판단하는 코드 (추후 업데이트)

⑽ SMILES로부터 ADMET(흡수, 분포, 대사, 배출, 독성)를 예측하는 모델 : ADMET-AI

⑾ PubChem으로부터 IUPAC 이름들을 생성하는 함수

import requests
import time

def fetch_iupac_names_from_pubchem(start_cid, count):
    iupac_names = []
    base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{}/property/IUPACName/JSON"
    
    current_cid = start_cid
    while len(iupac_names) < count:
        response = requests.get(base_url.format(current_cid))
        if response.status_code == 200:
            data = response.json()
            if 'PropertyTable' in data and 'Properties' in data['PropertyTable']:
                for prop in data['PropertyTable']['Properties']:
                    if 'IUPACName' in prop:
                        iupac_names.append(prop['IUPACName'])
                        if len(iupac_names) >= count:
                            break
        else:
            print(f"Failed to fetch data for CID {current_cid}")
        
        current_cid += 1
        time.sleep(0.1)  # To prevent hitting the API rate limit

    return iupac_names

# Fetch 100 IUPAC names starting from CID 1
iupac_names = fetch_iupac_names_from_pubchem(start_cid=1, count=100)

① 응용 1. IUPAC 명명법 예제 만들기

○ PubChem으로부터 IUPAC 명명법 크롤링 → IUPAC-to-SMILES → SMILES-to-image

○ 예제 생성 알고리즘에서 추가로 SMILES-to-IUPAC과 원래 IUPAC이 일치하는지 확인함으로써 다음과 같은 부수적인 효과가 발생

○ 효과 1. 부적절한 IUPAC 명명법을 제외

○ 효과 2. 컴퓨터가 이해하기 난해한 명명법을 제거함으로써 명명법 예제의 난이도를 조절

⑿ PubChem으로부터 입체배열을 갖는 SMILES 이름들을 생성하는 함수

import requests
import time
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.EnumerateStereoisomers import EnumerateStereoisomers, StereoEnumerationOptions

def fetch_smiles_from_pubchem(start_cid, count):
    smiles_list = []
    base_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{}/property/CanonicalSMILES/JSON"
    
    current_cid = start_cid
    while len(smiles_list) < count:
        response = requests.get(base_url.format(current_cid))
        if response.status_code == 200:
            data = response.json()
            if 'PropertyTable' in data and 'Properties' in data['PropertyTable']:
                for prop in data['PropertyTable']['Properties']:
                    if 'CanonicalSMILES' in prop:
                        smiles_list.append(prop['CanonicalSMILES'])
                        if len(smiles_list) >= count:
                            break
        else:
            print(f"Failed to fetch data for CID {current_cid}")
        
        current_cid += 1
        time.sleep(0.1)  # To prevent hitting the API rate limit

    return smiles_list

def generate_stereoisomers(smiles, max_isomers=5):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    
    opts = StereoEnumerationOptions(onlyUnassigned=True, maxIsomers=max_isomers)
    isomers = list(EnumerateStereoisomers(mol, options=opts))
    
    smiles_isomers = [Chem.MolToSmiles(isomer, isomericSmiles=True) for isomer in isomers]
    return smiles_isomers

# Fetch 100 canonical SMILES codes starting from CID 1
smiles_codes = fetch_smiles_from_pubchem(start_cid=1, count=100)

# Generate stereoisomers for each fetched SMILES code
all_stereoisomers = []
for i, smiles in enumerate(smiles_codes, start=1):
    stereoisomers = generate_stereoisomers(smiles)
    all_stereoisomers.extend(stereoisomers)
    print(f"Canonical SMILES {i}: {smiles}")
    for j, isomer in enumerate(stereoisomers, start=1):
        print(f"  Stereoisomer {j}: {isomer}")

# Optionally, limit the number of generated stereoisomers for display
max_display = 100
print("\nGenerated Stereoisomers (limited to first {}):".format(max_display))
for i, isomer in enumerate(all_stereoisomers[:max_display], start=1):
    print(f"{i}: {isomer}")
    

''' Visualization code example
from rdkit import Chem
from rdkit.Chem import Draw

smiles = 'O=C(O)C[C@@]1(Cl)C=CC(=O)O1'
molecule = Chem.MolFromSmiles(smiles)
Draw.MolToImage(molecule)
'''

① 응용 1. RS 판별법 예제 만들기

○ PubChem으로부터 SMILES 명명법 크롤링 → canonical SMILES to stereochemical SMILES → SMILES-to-image

② 응용 2. 위 코드는 R 배향을 특히 선호하는 구조를 도출함

○ 시험에 출제된 이성질체 62개를 조사한 결과, R형이 답인 경우가 40개, S형이 답인 경우가 22개로 시험에서도 R형이 더 선호됨

○ R형과 S형의 이론적 비율은 동일할 것이므로, 오히려 인지 편향성이 있는 게 아닌가 추정됨

2. 드로잉 [목차]

⑴ 유기화합물 분자식 그리기 (예 : peracetic acid)

from rdkit import Chem
from rdkit.Chem import Draw

# Define the SMILES strings for peracetic acid
smiles = 'C=CC(=O)O'

# Convert the SMILES strings to RDKit molecule objects
molecule = Chem.MolFromSmiles(smiles)

# Draw the molecules without saving
Draw.MolToImage(molecule)

# Draw the molecules with saving
Draw.MolToFile(molecule, 'peracetic_acid.png')

⑵ 유기화합물 electron density map 그리기 (ver. 1) (예 : peracetic acid)

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import SimilarityMaps

# Generate a 3D structure
molecule = Chem.MolFromSmiles('CC(=O)OO')
molecule_3d = Chem.AddHs(molecule)
AllChem.EmbedMolecule(molecule_3d, AllChem.ETKDG())
AllChem.MMFFOptimizeMolecule(molecule_3d)

# Calculate Gasteiger charges
AllChem.ComputeGasteigerCharges(molecule_3d)

# Function to get atom charges
def GetAtomCharges(mol):
    charges = [float(mol.GetAtomWithIdx(i).GetProp('_GasteigerCharge')) for i in range(mol.GetNumAtoms())]
    return charges

# Draw electrostatic potential map
fig = SimilarityMaps.GetSimilarityMapFromWeights(molecule_3d, GetAtomCharges(molecule_3d), colorMap='jet', contourLines=10)

⑶ 유기화합물 electron density map 그리기 (ver. 2) (예 : peracetic acid)

# Refernece: https://rdkit.readthedocs.io/en/latest/Cookbook.html


## STEP 1. make a random forest model

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
import numpy

# generate four molecules
m1 = Chem.MolFromSmiles('c1ccccc1')
m2 = Chem.MolFromSmiles('c1ccccc1CC')
m3 = Chem.MolFromSmiles('c1ccncc1')
m4 = Chem.MolFromSmiles('c1ccncc1CC')
mols = [m1, m2, m3, m4]

# generate fingeprints: Morgan fingerprint with radius 2
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]

# convert the RDKit explicit vectors into numpy arrays
np_fps = []
for fp in fps:
  arr = numpy.zeros((1,))
  DataStructs.ConvertToNumpyArray(fp, arr)
  np_fps.append(arr)

# get a random forest classifiert with 100 trees
rf = RandomForestClassifier(n_estimators=100, random_state=1123)

# train the random forest
# with the first two molecules being actives (class 1) and
# the last two being inactives (class 0)
ys_fit = [1, 1, 0, 0]
rf.fit(np_fps, ys_fit)

# use the random forest to predict a new molecule
m5 = Chem.MolFromSmiles('c1ccccc1O')
fp = numpy.zeros((1,))
DataStructs.ConvertToNumpyArray(AllChem.GetMorganFingerprintAsBitVect(m5, 2), fp)

print(rf.predict((fp,)))
print(rf.predict_proba((fp,)))



## STEP 2. run the random forest model for the input

from rdkit.Chem.Draw import SimilarityMaps

# helper function
def getProba(fp, predictionFunction):
  return predictionFunction((fp,))[0][1]

m5 = Chem.MolFromSmiles('CC(=O)OO')
fig, maxweight = SimilarityMaps.GetSimilarityMapForModel(m5, SimilarityMaps.GetMorganFingerprint, lambda x: getProba(x, rf.predict_proba))

⑷ 유기화합물 3차원 분자식 그리기

from rdkit import Chem
from rdkit.Chem import AllChem
import py3Dmol

def draw_3d_molecule(smiles):
    # Convert SMILES to RDKit molecule
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print("Invalid SMILES code.")
        return
    
    # Generate 3D coordinates for the molecule
    mol = Chem.AddHs(mol)  # Add hydrogens
    AllChem.EmbedMolecule(mol, AllChem.ETKDG())  # Embed molecule in 3D space
    
    # Convert RDKit molecule to 3Dmol.js viewable format
    mb = Chem.MolToMolBlock(mol)
    
    # Visualization with Py3Dmol
    viewer = py3Dmol.view(width=400, height=300)
    viewer.addModel(mb, 'mol')
    viewer.setStyle({'stick': {}})
    viewer.zoomTo()
    
    return viewer.show()

# Example usage
smiles_code = "CCO"  # Ethanol
draw_3d_molecule(smiles_code)

① 검토 1. 메탄올은 입체 장애를 최소화하기 위해 두 메틸기가 staggered 배향인 반면, 에탄올은 분자 내 수소결합에 의해 그렇지 않음

② 검토 2. 다음과 같은 콘쥬게이트 고리 화합물에서도 구조가 잘 구현됨

Figure. 1. cyclooctatetraene에 적용한 예시

③ 이렇게 머신러닝 모델만으로 새로운 지식을 만들 수 있는 게 chemoinformatics의 매력

⑸ 유기화합물 2차원 분자식에 RS 명명법 표시하기

from rdkit import Chem
from rdkit.Chem import AllChem, Draw
from rdkit.Chem.rdMolDescriptors import CalcMolFormula
from PIL import Image, ImageDraw, ImageFont
import matplotlib.font_manager as fm

def get_stereochemistry(mol):
    Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
    stereo_info = {}
    for atom in mol.GetAtoms():
        if atom.HasProp('_CIPCode'):
            stereo_info[atom.GetIdx()] = atom.GetProp('_CIPCode')
    return stereo_info

def draw_2d_molecule_with_stereochemistry(smiles, filename="molecule.png"):
    # Convert SMILES to RDKit molecule
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print("Invalid SMILES code.")
        return
    
    # Generate 2D coordinates for the molecule
    AllChem.Compute2DCoords(mol)
    
    # Get stereochemistry information
    stereo_info = get_stereochemistry(mol)
    print(stereo_info)
    
    # Draw the molecule
    img = Draw.MolToImage(mol, size=(300, 300), kekulize=True, wedgeBonds=True)
    
    # Convert the RDKit image to a PIL image
    pil_img = img.convert("RGBA")
    
    # Create a drawing context
    draw = ImageDraw.Draw(pil_img)
    
    # Load fonts
    font_size = 18  # You can change the font size here
    try:
        # Use DejaVuSans.ttf for regular text
        font_path = fm.findfont(fm.FontProperties(family="DejaVu Sans"))
        font = ImageFont.truetype(font_path, font_size)
        # Use DejaVuSans-Oblique.ttf for italic text
        italic_font_path = fm.findfont(fm.FontProperties(family="DejaVu Sans", style="italic"))
        italic_font = ImageFont.truetype(italic_font_path, font_size)
    except IOError:
        font = ImageFont.load_default()
        italic_font = font  # Fallback if no italic font is found
    
    # Generate 2D coordinates for the drawing
    AllChem.Compute2DCoords(mol)
    conf = mol.GetConformer()
    
    # Get 2D coordinates for each atom
    coords = conf.GetPositions()
    
    # Add stereo annotations
    for atom_idx, stereo in stereo_info.items():
        atom = mol.GetAtomWithIdx(atom_idx)
        pos = coords[atom_idx]
        
        # Calculate pixel position from molecule coordinates
        x = pos[0] * 35 + 135
        y = -pos[1] * 35 + 150
        
        # Draw the text annotation with non-italicized brackets and italicized R/S
        draw.text((x, y), "(", fill=(0, 0, 0), font=font)
        draw.text((x + 10, y), stereo, fill=(0, 0, 0), font=italic_font)
        draw.text((x + 30, y), ")", fill=(0, 0, 0), font=font)
    
    # Save the image to a file
    pil_img.save(filename)
    print(f"Image saved as {filename}")

# Example usage
smiles_code = "C[C@H](O)[C@H](O)C"  # Example molecule with stereochemistry
draw_2d_molecule_with_stereochemistry(smiles_code, "molecule_with_stereochemistry.png")

⑹ 유기화합물 3차원 분자식에 RS 명명법 표시하기

from rdkit import Chem
from rdkit.Chem import AllChem, Draw
from rdkit.Chem.rdMolDescriptors import CalcMolFormula
import py3Dmol

def get_stereochemistry(mol):
    Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
    stereo_info = {}
    for atom in mol.GetAtoms():
        if atom.HasProp('_CIPCode'):
            stereo_info[atom.GetIdx()] = atom.GetProp('_CIPCode')
    return stereo_info

def draw_3d_molecule_with_stereochemistry(smiles):
    # Convert SMILES to RDKit molecule
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print("Invalid SMILES code.")
        return
    
    # Generate 3D coordinates for the molecule
    mol = Chem.AddHs(mol)  # Add hydrogens
    AllChem.EmbedMolecule(mol, AllChem.ETKDG())  # Embed molecule in 3D space
    
    # Get stereochemistry information
    stereo_info = get_stereochemistry(mol)
    print(stereo_info)
    
    # Convert RDKit molecule to 3Dmol.js viewable format
    mb = Chem.MolToMolBlock(mol)
    
    # Visualization with Py3Dmol
    viewer = py3Dmol.view(width=400, height=300)
    viewer.addModel(mb, 'mol')
    viewer.setStyle({'stick': {}})
    
    # Annotate the stereochemistry on the molecule
    for atom_idx, stereo in stereo_info.items():
        pos = mol.GetConformer().GetAtomPosition(atom_idx)
        viewer.addLabel(stereo, {
            'position': {'x': pos.x, 'y': pos.y, 'z': pos.z}, 
            'backgroundColor': 'black', 
            'fontColor': 'white', 
            'fontSize': 14, 
            'showBackground': True
        })
    
    viewer.zoomTo()
    
    return viewer.show()

# Example usage
smiles_code = "C[C@H](O)[C@H](O)C"  # Example molecule with stereochemistry
draw_3d_molecule_with_stereochemistry(smiles_code)

⑺ 탄소수에 따른 알케인의 구조이성질체 모두 그리는 코드

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem.rdchem import Mol
from IPython.display import display

def generate_alkanes(n_carbons):
    if n_carbons == 1:
        return [Chem.MolFromSmiles('C')]
    elif n_carbons == 2:
        return [Chem.MolFromSmiles('CC')]
    
    smaller_alkanes = generate_alkanes(n_carbons - 1)
    new_alkanes = set()
    
    for mol in smaller_alkanes:
        for atom in mol.GetAtoms():
            if atom.GetDegree() < 4:  # 탄소는 최대 4개의 결합을 가질 수 있음
                new_mol = Chem.RWMol(mol)
                new_idx = new_mol.AddAtom(Chem.Atom(6))
                new_mol.AddBond(atom.GetIdx(), new_idx, Chem.BondType.SINGLE)
                Chem.SanitizeMol(new_mol)
                smiles = Chem.MolToSmiles(new_mol, canonical=True)
                new_alkanes.add(smiles)
    
    return [Chem.MolFromSmiles(smiles) for smiles in new_alkanes]

# C7H16의 모든 구조 이성질체 생성 및 시각화
n_carbons = 7
alkanes = generate_alkanes(n_carbons)
img = Draw.MolsToGridImage(alkanes, molsPerRow=5, subImgSize=(200, 200))

# 이미지를 직접 디스플레이 (Jupyter 노트북에서만 동작)
display(img)

화학식	구조이성질체의 수
C₃H₈	1
C₄H₁₀	2
C₅H₁₂	3
C₆H₁₄	5
C₇H₁₆	9
C₈H₁₈	18
C₉H₂₀	35
C₁₀H₂₂	75
C₁₁H₂₄	159
C₁₂H₂₆	355
C₁₃H₂₈	802
C₁₄H₃₀	1,858
C₁₅H₃₂	4,347
C₁₆H₃₄	10,359
C₁₇H₃₆	24,894
C₁₈H₃₉	60,523
C₁₉H₄₀	148,284
C₂₀H₄₂	366,319
C₃₀H₆₂	4,111,846,763
C₄₀H₈₂	62,481,801,147,341

Table. 1. 탄소수에 따른 알케인의 구조이성질체의 수

① 알케인의 구조이성질체와 그래프 이론

⑻ 트리 자료구조를 활용하여 모든 알케인 치환기를 그리는 코드 (n_carbons = 6부터 중복이 존재함)

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.rdchem import AtomValenceException
from IPython.display import display

class Node:
    def __init__(self, id):
        self.id = id
        self.children = []

class Tree:
    def __init__(self, root):
        self.nodes = [root]

    def add_node(self, parent_id, node_id, max_nodes):
        if len(self.nodes) >= max_nodes or len(self.nodes[parent_id].children) >= 3:
            return None
        new_node = Node(node_id)
        self.nodes.append(new_node)
        self.nodes[parent_id].children.append(new_node)
        return new_node

def generate_trees(current_tree, current_id, max_nodes):
    if len(current_tree.nodes) == max_nodes:
        if is_valid_tree(current_tree):
            return [current_tree]
        else:
            return []
    
    trees = []
    for node_id in range(len(current_tree.nodes)):
        new_tree = Tree(Node(0))
        new_tree.nodes = [Node(i) for i in range(len(current_tree.nodes))]
        for i in range(len(current_tree.nodes)):
            new_tree.nodes[i].children = [new_tree.nodes[n.id] for n in current_tree.nodes[i].children]

        new_node = new_tree.add_node(node_id, current_id, max_nodes)
        if new_node and is_valid_tree(new_tree):
            trees.extend(generate_trees(new_tree, current_id + 1, max_nodes))
    return trees

def is_valid_tree(tree):
    for node in tree.nodes:
        children_counts = [len(child.children) for child in node.children]
        if any(children_counts[i] > children_counts[i+1] for i in range(len(children_counts)-1)):
            return False
    return True

def visualize_organic_structures(trees):
    mols = []
    for tree in trees:
        mol = Chem.RWMol()
        atom_index = {}
        for node in tree.nodes:
            if node.id == 0:
                atom = Chem.Atom(6)
                atom.SetNumRadicalElectrons(1)  # Set radical electron for the root
            else:
                atom = Chem.Atom(6)
            atom_index[node.id] = mol.AddAtom(atom)
        for node in tree.nodes:
            for child in node.children:
                mol.AddBond(atom_index[node.id], atom_index[child.id], Chem.BondType.SINGLE)
        mols.append(mol)
    return mols

# Input the total number of nodes
n_nodes = 4

# Start with a single root node
initial_tree = Tree(Node(0))
all_trees = generate_trees(initial_tree, 1, n_nodes)
all_mols = visualize_organic_structures(all_trees)

# 시각화
img = Draw.MolsToGridImage(all_mols, molsPerRow=5, subImgSize=(200, 200), useSVG=False)
display(img)

① n_carbons = 3 (총 2개)

② n_carbons = 4 (총 4개)

② n_carbons = 5 (총 8개)

③ n_carbons = 6 (총 17개)

3. 아미노산 [목차]

⑴ 염기서열을 아미노산 서열로 바꾸는 함수

from Bio.Seq import Seq

# Assuming 'sequence' is your string of nucleotides:
sequence = "ACTCATTCTCCCCAGACGCCAAGGATGGTGGTCATGGCGCCCCGAACCCTCTTCCTGCTGCTCTCGGGGGCCCTGACCCTGACCGAGACCTGGGCGG"  # truncated for brevity

# Create a sequence object
seq_obj = Seq(sequence)

# Translate the sequence
amino_acid_sequence = seq_obj.translate(to_stop=True)

# Print the amino acid sequence
print(amino_acid_sequence)

⑵ 유전자명을 아미노산 서열로 바꾸는 함수

!git clone https://github.com/deepmind/alphafold.git
!cd alphafold/scripts
!sudo apt install aria2
!bash download_uniprot.sh ./

from pathlib import Path

def extract_sequences_from_uniprot(uniprot_fasta_path, gene_list, output_path="selected_genes.fa"):
    """
    UniProt FASTA 파일에서 gene_list에 포함된 유전자의 단백질 서열을 추출하여 .fa 파일로 저장
    """
    result = []
    current_header = ""
    current_seq = []
    include = False
    gene_set = set(g.upper() for g in gene_list)

    with open(uniprot_fasta_path, 'r') as f:
        for line in f:
            if line.startswith('>'):
                if include and current_header and current_seq:
                    result.append((gene_name, ''.join(current_seq)))
                # 새로운 서열 시작
                current_header = line.strip()
                current_seq = []
                include = False
                # header에서 gene name 추출
                if 'GN=' in current_header:
                    try:
                        gene_name = current_header.split('GN=')[1].split()[0].upper()
                        if gene_name in gene_set:
                            include = True
                    except Exception:
                        pass
            else:
                if include:
                    current_seq.append(line.strip())

        # 마지막 서열 추가
        if include and current_header and current_seq:
            result.append((gene_name, ''.join(current_seq)))

    # 저장
    if result:
        with open(output_path, 'w') as f:
            for gene, seq in result:
                f.write(f">{gene}\n{seq}\n")
        print(f"✅ {len(result)}개 서열이 '{output_path}'에 저장되었습니다.")
    else:
        print("❌ 해당 유전자의 서열을 찾을 수 없습니다.")
        
# 🔧 예시 실행
gene_list = ['THBS4', 'CD36']
uniprot_fasta_path = "uniprot.fasta"
extract_sequences_from_uniprot(uniprot_fasta_path, gene_list, output_path="genes_from_local_uniprot.fa")

# ✅ 1312개 서열이 'genes_from_local_uniprot.fa'에 저장되었습니다.

⑶ AlphaFold2를 이용하여 아미노산 서열로부터 아미노산 구조에 관한 PDB 파일을 생성하는 함수

① AlphaFold2 설치

git clone https://github.com/deepmind/alphafold.git
cd alphafold/scripts
sudo apt install aria2

bash download_all_data.sh ./ 
docker build -f docker/Dockerfile -t alphafold . 

docker run --gpus all -it \
  --name alphafold_cpu_session \
  -v "$(pwd)":/app/alphafold \
  -v "$(pwd)/afdb":/mnt/databases \
  -v "$(pwd)/output":/mnt/output \
  --entrypoint /bin/bash \
  alphafold

#docker start -ai alphafold_cpu_session
#docker exec -it alphafold_cpu_session bash
#docker stop alphafold_cpu_session
#docker rm alphafold_cpu_session

② AlphaFold2 Monomer 실행

#export TF_FORCE_UNIFIED_MEMORY=1           ## for GPU
#export XLA_PYTHON_CLIENT_PREALLOCATE=false ## for GPU
#export XLA_PYTHON_CLIENT_MEM_FRACTION=.85  ## for GPU

export JAX_PLATFORM_NAME=cpu                ## for CPU
export JAX_TRACEBACK_FILTERING=off          ## for CPU

python3 /app/alphafold/run_alphafold.py \
  --fasta_paths=/app/alphafold/data/data.fa \
  --output_dir=/mnt/output \
  --data_dir=/app/alphafold/scripts \
  --model_preset=monomer \
  --db_preset=reduced_dbs \
  --uniref90_database_path=/app/alphafold/scripts/uniref90/uniref90.fasta \
  --mgnify_database_path=/app/alphafold/scripts/mgnify/mgy_clusters_2022_05.fa \
  --small_bfd_database_path=/app/alphafold/scripts/small_bfd/bfd-first_non_consensus_sequences.fasta \
  --pdb70_database_path=/app/alphafold/scripts/pdb70/pdb70 \
  --template_mmcif_dir=/app/alphafold/scripts/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=/app/alphafold/scripts/pdb_mmcif/obsolete.dat \
  --max_template_date=2023-12-31 \
  --use_gpu_relax=False

③ AlphaFold2 Multimer 실행

#export TF_FORCE_UNIFIED_MEMORY=1           ## for GPU
#export XLA_PYTHON_CLIENT_PREALLOCATE=false ## for GPU
#export XLA_PYTHON_CLIENT_MEM_FRACTION=.85  ## for GPU

export JAX_PLATFORM_NAME=cpu                ## for CPU
export JAX_TRACEBACK_FILTERING=off          ## for CPU

python3 /app/alphafold/run_alphafold.py \
  --fasta_paths=/app/alphafold/data/data.fa \
  --output_dir=/mnt/output \
  --data_dir=/app/alphafold/scripts \
  --model_preset=multimer \
  --db_preset=reduced_dbs \
  --uniref90_database_path=/app/alphafold/scripts/uniref90/uniref90.fasta \
  --mgnify_database_path=/app/alphafold/scripts/mgnify/mgy_clusters_2022_05.fa \
  --small_bfd_database_path=/app/alphafold/scripts/small_bfd/bfd-first_non_consensus_sequences.fasta \
  --template_mmcif_dir=/app/alphafold/scripts/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=/app/alphafold/scripts/pdb_mmcif/obsolete.dat \
  --pdb_seqres_database_path=/app/alphafold/scripts/pdb_seqres/pdb_seqres.txt \
  --uniprot_database_path=/app/alphafold/scripts/uniprot/uniprot.fasta \
  --max_template_date=1900-01-01 \
  --use_gpu_relax=False

④ 생체고분자 라이브러리

4. 분광학 [목차]

⑴ 개요

① 화학식으로부터 MS, IR, NMR 스펙트럼을 예측하거나, MS, IR, NMR 데이터를 통해 화학식을 예측하는 연구가 활발히 진행되고 있음

② 이 연구는 딥러닝 기술의 발전과 함께 현재 크게 진전되고 있음

○ 예 : NMR-TS

입력: 2023.11.30 02:40

수정: 2024.06.11 12:26

저작자표시 (새창열림)

'▶ 자연과학 > ▷ 화학정보학' 카테고리의 다른 글

【화학정보학】 단백질체 분석 파이프라인(Proteomics Pipeline) (1)	2024.03.31
【화학정보학】 물질 라이브러리 (0)	2023.12.11
【화학정보학】 생물학 실험에 사용되는 형광물질 라이브러리 (7)	2022.05.04
【화학정보학】 저분자 약물 데이터베이스 종류 (0)	2022.04.27
【화학정보학】 New Cancer Drugs Approved by the FDA in 20 years (0)	2022.01.08

정빈이의 공부방

최근댓글

【화학정보학】 유기화학 관련 파이썬 함수 모음

'▶ 자연과학 > ▷ 화학정보학' 카테고리의 다른 글

티스토리툴바

【화학정보학】 유기화학 관련 파이썬 함수 모음

'▶ 자연과학 > ▷ 화학정보학' 카테고리의 다른 글

'▶ 자연과학/▷ 화학정보학' 관련 포스팅

티스토리툴바