Tistory Blog Experiments
Recommended post: 【Computer Science】 Computer Science Table of Contents
1. Types of Tistory internal search URLs [Main Text]
2. Adding a Naver Papago link to Tistory [Main Text]
3. Saving each posting as a .csv file and embedding [Main Text]
4. Saving each posting as an .html file and embedding [Main Text]
5. Blog translation project [Main Text]
6. Building a chatbot for the blog with the ChatGPT Store [Main Text]
7. Reviewer AI [Main Text]
8. Word network [Main Text]
9. Redirecting to another website [Main Text]
10. IP blocking [Main Text]
2. Adding a Naver Papago Link to Tistory [Table of Contents]
<a href="https://papago.naver.net/website?locale=ko&source=ko&target=en&url=https%3A%2F%2Fnate9389.tistory.com%2F2178"
   style="float:right; margin-left: 10px; font-size: 10.5pt; text-align: center; line-height: 2; text-decoration:none;">English
</a>
⑴ Replace the nate9389.tistory.com part in the code above with your own Tistory address.
⑵ The number after the last %2F is the trailing number of each posting's URL; for this posting it is 2178.
⑶ The code does not have to be inserted right above <h1>【웹 프로그래밍】 티스토리 블로그 실험들</h1>; it can be placed wherever you like.
⑷ The locale, source, and target parameters specify which language to translate from and to, and can be changed freely (a small sketch follows).
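For reference, the link above can also be assembled programmatically. A minimal Python sketch, where build_papago_link is a hypothetical helper (not part of Tistory or Papago) and 2178 is just the example post number:
## Python
from urllib.parse import quote

def build_papago_link(post_number, source="ko", target="en", blog="nate9389.tistory.com"):
    # URL-encode the posting address and wrap it in the Papago website-translation URL
    post_url = f"https://{blog}/{post_number}"
    return (
        "https://papago.naver.net/website?"
        f"locale={source}&source={source}&target={target}&url={quote(post_url, safe='')}"
    )

print(build_papago_link(2178))               # Korean -> English
print(build_papago_link(2178, target="ja"))  # Korean -> Japanese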
3. Saving Each Posting as a .csv File and Embedding [Table of Contents]
⑴ Step 1. (Python) Back up the postings by crawling
import io
import requests
import pandas as pd
import numpy as np
import time

l = []
for i in range(1, 2291, 1):
    url = 'https://nate9389.tistory.com/' + str(i)
    github_session = requests.Session()
    download = github_session.get(url).content
    l.append(
        str(download, "utf-8")
    )
    if i % 100 == 0:
        print(i)

pd.DataFrame(l).to_csv("~/Downloads/Tistory_Backup.csv", header = False, index = False)
① Using pandas.DataFrame.to_excel does not save the content completely
② For roughly 2,000 postings, a .csv file of about 250 MB is generated
⑵ Step 2. (RStudio) A function that fills the characters from start to end of a given string variable ('given_str') with "x" and returns the result
str_substitute <- function(given_str, start, end){
    library(stringr)
    library(stringi)
    result = paste0(
        str_sub(given_str, 1, start-1),
        stri_dup("x", (end-start+1) ),
        str_sub(given_str, end+1, str_length(given_str))
    )
    return(result)
}
⑶ Step 3. (RStudio) Code that collects the numbers following a specific pattern within a given string ('given_str')
number_after_pattern_in_str <- function(given_str){
    # reference : https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html
    library(stringr)

    # patterns for 1-digit to 4-digit posting numbers after the blog address
    phone1 <- 'nate9389.tistory.com/([1-9]{1})'
    phone2 <- 'nate9389.tistory.com/([1-9][0-9]{1})'
    phone3 <- 'nate9389.tistory.com/([1-9][0-9]{2})'
    phone4 <- 'nate9389.tistory.com/([1-9][0-9]{3})'

    arr <- array()
    flag <- 0

    # longer patterns first, so that a 4-digit number is not partially matched by a shorter pattern
    for(phone in c(phone4, phone3, phone2, phone1)){
        n_match <- dim(str_locate_all(given_str, phone)[[1]])[1]
        for(i in seq_len(n_match)){
            flag <- flag + 1
            loc <- str_locate_all(given_str, phone)[[1]]
            arr[flag] <- str_sub(given_str, loc[1, 1], loc[1, 2])
            # blank out the matched span so the next iteration finds the next occurrence
            given_str <- str_substitute(given_str, loc[1, 1], loc[1, 2])
        }
    }

    arr <- str_sub(arr, 22, str_length(arr))
    # 22 = str_length('nate9389.tistory.com/') + 1
    arr <- as.numeric(arr)
    return(arr)
}
① Here the specific pattern is fixed to 'nate9389.tistory.com/'
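For reference, the same number extraction can be written much more compactly in Python with a single regular expression; this is only an illustrative alternative to the R code above, with numbers_after_pattern being a hypothetical helper:
## Python
import re

def numbers_after_pattern(given_str, pattern=r'nate9389\.tistory\.com/([1-9]\d{0,3})'):
    # collect every 1- to 4-digit number that follows the blog address
    return [int(n) for n in re.findall(pattern, given_str)]

print(numbers_after_pattern("see nate9389.tistory.com/1457 and nate9389.tistory.com/366"))
# [1457, 366]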
⑷ Step 4. (RStudio) Examining blog connectivity (connectivity map)
dat <- read.csv("~/Downloads/Tistory_Backup.csv", header = FALSE)

mat <- matrix(0, nrow = 2290, ncol = 2290)
rownames(mat) <- seq(1:2290)
colnames(mat) <- seq(1:2290)

for(i in 1:2290){
    arr <- number_after_pattern_in_str(dat[i, 1])
    if(! is.na(arr[1])){
        for(j in 1:length(arr)){
            if(! is.na(arr[j]) && arr[j] != i){
                mat[i, arr[j]] = mat[i, arr[j]] + 1
            }
        }
    }
}

ar1 <- array()
flag1 <- 0
ar2 <- array()
flag2 <- 0
threshold = 20

for(i in 1:2290){
    if(sum(mat[i, ]) >= threshold){
        flag1 <- flag1 + 1
        ar1[flag1] <- i
    }
}
for(j in 1:2290){
    if(sum(mat[, j]) >= threshold){
        flag2 <- flag2 + 1
        ar2[flag2] <- j
    }
}

heatmap(mat[ar1, ar2], scale="none", col= hcl.colors(18))
① Cases where a posting cites itself were not counted separately
⑸ Step 5. Visualization: limited to postings with at least 20 outgoing citations and at least 20 incoming citations
⑹ Step 6. Interpretation
① rownames indicate the citing postings, and colnames indicate the cited postings
② When the thresholds on citation and cited counts are loose, the clustering tends to be ordered by the number of citations
○ Reason: the postings split into a small number that are cited heavily and a large number that are cited rarely
○ This can probably be viewed as an example of the power law commonly observed in the social sciences (a quick check is sketched after this list)
③ As the thresholds on citation and cited counts become stricter, the clustering tends to work better
④ Table-of-contents posts (e.g., 1457, 1492, 1483) are the most frequently cited postings
⑤ As expected, the 【유기화학】 27강. 주요 반응 메커니즘 요약 posting (Organic Chemistry, Lecture 27: summary of major reaction mechanisms) turned out to be the posting with the most outgoing citations
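To eyeball the power-law claim in ②, one could plot how many postings receive each incoming-citation count on log-log axes. A minimal Python sketch, assuming the Tistory_Backup.csv file from Step 1 exists and reusing the regex idea from Step 3; this is illustrative only, not the code used to produce the results above:
## Python
import re
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt

dat = pd.read_csv("~/Downloads/Tistory_Backup.csv", header=None)

# count how many times each posting number is cited across all postings
cited = Counter()
for i, page in enumerate(dat[0], start=1):
    for n in re.findall(r'nate9389\.tistory\.com/([1-9]\d{0,3})', str(page)):
        if int(n) != i:                      # skip self-citations, as in Step 4
            cited[int(n)] += 1

# distribution of "incoming citations per posting" vs. "number of postings with that count"
freq = Counter(cited.values())
x, y = zip(*sorted(freq.items()))
plt.loglog(x, y, 'o')
plt.xlabel("incoming citations per posting")
plt.ylabel("number of postings")
plt.title("Roughly linear on log-log axes would suggest a power law")
plt.show()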
4. Saving Each Posting as an .html File and Embedding [Table of Contents]
⑴ Python code that backs up the postings by crawling
## Python
import requests
import os

def fetch_and_save_html(url, html_filename):
    response = requests.get(url)
    with open(html_filename, 'w', encoding='utf-8') as file:
        file.write(response.text)

def save_range_of_html(start, end, directory):
    if not os.path.exists(directory):
        os.makedirs(directory)
    for i in range(start, end + 1):
        url = f"https://nate9389.tistory.com/{i}"
        html_filename = os.path.join(directory, f"page_{i}.html")
        print(f"Processing: {url}")
        fetch_and_save_html(url, html_filename)

def save_specific_html(i, output_filename):
    url = f"https://nate9389.tistory.com/{i}"
    fetch_and_save_html(url, output_filename)

# Save a range of HTML files
save_range_of_html(1, 2392, 'output')

# Save a specific HTML file based on user input
i = int(input("Enter a page number: "))  # Replace with your desired page number
save_specific_html(i, f"specific_page_{i}.html")
⑵ Code that builds a word cloud from the crawled postings (processing Korean raised a Java heap space error, so only English was processed separately)
from bs4 import BeautifulSoup
import os
import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Make sure you have the required libraries installed:
# pip install beautifulsoup4 nltk wordcloud matplotlib

# Download NLTK resources before any tokenization or stopword removal
nltk.download('punkt')
nltk.download('stopwords')

def extract_text_from_html(html_filename):
    with open(html_filename, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')
        return soup.get_text()

def extract_text_from_directory(directory):
    all_text = ""
    for filename in os.listdir(directory):
        if filename.endswith('.html'):
            full_path = os.path.join(directory, filename)
            all_text += extract_text_from_html(full_path) + "\n"
    return all_text

def process_english_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords and non-ASCII words
    tokens = [word for word in tokens if word.isascii() and word not in stopwords.words('english')]
    return ' '.join(tokens)

def create_english_word_cloud(text):
    wordcloud = WordCloud(
        width=800,
        height=800,
        background_color='white'
    ).generate(text)
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()

# Replace 'output' with your directory containing HTML files
directory = 'output'
all_text = extract_text_from_directory(directory)
english_text = process_english_text(all_text)
create_english_word_cloud(english_text)
⑶ 2D visualization of the postings using TfidfVectorizer
from bs4 import BeautifulSoup
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Make sure you have the required libraries installed:
# pip install beautifulsoup4 nltk scikit-learn matplotlib

nltk.download('punkt')
nltk.download('stopwords')

def extract_text_from_html(html_filename):
    with open(html_filename, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')
        return soup.get_text()

def process_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word.isascii() and word not in stopwords.words('english')]
    return ' '.join(tokens)

def cluster_html_files(directory, num_clusters=5):
    file_contents = []
    file_names = []
    for filename in os.listdir(directory):
        if filename.endswith('.html'):
            full_path = os.path.join(directory, filename)
            text = extract_text_from_html(full_path)
            processed_text = process_text(text)
            file_contents.append(processed_text)
            file_names.append(filename)

    # Convert the text documents to a matrix of TF-IDF features
    vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(file_contents)

    # Perform K-Means clustering
    km = KMeans(n_clusters=num_clusters)
    km.fit(tfidf_matrix)
    clusters = km.labels_.tolist()

    # Print out the filenames and their assigned clusters
    for filename, cluster in zip(file_names, clusters):
        print(f"File: {filename}, Cluster: {cluster}")

    return km, tfidf_matrix

directory = 'output'  # Replace with your directory containing HTML files
num_clusters = 5  # Adjust the number of clusters as per your requirement

# Cluster the HTML files
km, tfidf_matrix = cluster_html_files(directory, num_clusters)

from sklearn.manifold import TSNE
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_clusters(tfidf_matrix, cluster_labels):
    # Convert the sparse matrix to a dense array
    dense_tfidf_matrix = tfidf_matrix.toarray()

    # Use t-SNE to reduce dimensionality for visualization
    tsne_model = TSNE(n_components=2, init='random', perplexity=30, n_iter=3000, random_state=32)
    tsne_tfidf = tsne_model.fit_transform(dense_tfidf_matrix)  # Apply t-SNE transformation

    # Convert to DataFrame for convenience
    tsne_df = pd.DataFrame(tsne_tfidf, columns=['x', 'y'])
    tsne_df['cluster'] = cluster_labels

    # Plot the clusters
    plt.figure(figsize=(12, 8))
    sns.scatterplot(x='x', y='y', hue='cluster', data=tsne_df, palette='viridis', legend='full')
    plt.title('t-SNE Clustering of HTML Files')
    plt.xlabel('t-SNE Feature 1')
    plt.ylabel('t-SNE Feature 2')
    plt.show()

plot_clusters(tfidf_matrix, km.labels_)
⑷ 2D visualization of the postings using a sentence embedder (e.g., all-MiniLM-L6-v2)
import io
import requests
import pandas as pd
import numpy as np
import time
from bs4 import BeautifulSoup, Tag, NavigableString
import html2text
import re
import argparse
import matplotlib.patheffects as PathEffects
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from scipy.sparse import csr_matrix
import scanpy as sc
from sklearn.neighbors import NearestNeighbors
import torch
from torch.utils.data import DataLoader, TensorDataset
from xgboost import XGBClassifier

def html_to_markdown(html_content):
    # Create an html2text converter object
    h = html2text.HTML2Text()
    h.ignore_links = False
    # Convert the HTML to Markdown
    markdown_content = h.handle(html_content)
    return markdown_content

l = []
for i in range(2410):
    idx = i + 1
    print(idx)
    url = 'https://nate9389.tistory.com/' + str(idx)
    github_session = requests.Session()
    download = github_session.get(url).content
    html = download

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')

    # Find the desired content by its attributes
    size_pattern = re.compile(r'size\d+')
    parsed_items = soup.find_all('p', {'data-ke-size': size_pattern})

    # Convert each <p> tag to Markdown so that every posting becomes a list of paragraph strings
    markdown_outputs = [html_to_markdown(str(item)) for item in parsed_items]
    l.append(markdown_outputs)

sentences = [' '.join(sublist) for sublist in l]
filtered_sentences = [sentence for sentence in sentences if sentence]
filtered_sentences = [x.replace('*', '') if isinstance(x, str) else x for x in filtered_sentences]
filtered_l = [[word for word in sublist if word] for sublist in l if sublist and any(word.strip() for word in sublist)]

titles = []
for i in range(len(filtered_l)):
    titles.append(filtered_l[i][0])
titles = [x.replace('*', '') if isinstance(x, str) else x for x in titles]

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
db = embedding_function.embed_documents(filtered_sentences)
emb_res = np.asarray(db)

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import random

# Your embedding result as a numpy array
emb_res = np.asarray(db)

# Check the number of samples
n_samples = emb_res.shape[0]
print("Number of samples:", n_samples)

# Set perplexity to a fraction of the number of samples
perplexity_value = max(5, n_samples // 50)  # ensures at least 5, or two percent of the samples

# Apply t-SNE to reduce dimensions to 2D for visualization
tsne = TSNE(n_components=2, random_state=42, perplexity=perplexity_value)
reduced_emb = tsne.fit_transform(emb_res)

# Sample labels
names_list = [re.sub(r'[^a-zA-Z\- ]', '', title) for title in titles]

# Extract main category for coloring
group_labels = [name.split('-')[0] for name in names_list]

# Assign colors to each unique group in a predictable way
unique_groups = sorted(set(group_labels))
colors = plt.cm.get_cmap('viridis', len(unique_groups))
color_map = {group: colors(i) for i, group in enumerate(unique_groups)}

# Visualization
fig, ax = plt.subplots(figsize=(12, 8))

# Plot all points with assigned colors
for i, label in enumerate(group_labels):
    ax.scatter(reduced_emb[i, 0], reduced_emb[i, 1], color=color_map[label], alpha=0.6)
    # Annotate only about 5% of the points
    if random.random() < 0.05:  # Adjust this percentage as needed
        text = ax.annotate(names_list[i], (reduced_emb[i, 0], reduced_emb[i, 1]), fontsize=12, color='black')
        text.set_path_effects([PathEffects.withStroke(linewidth=3, foreground='white')])

ax.set_title('t-SNE Visualization of Tistory Postings')
ax.set_xlabel('t-SNE Dimension 1')
ax.set_ylabel('t-SNE Dimension 2')
plt.show()
5. Blog Translation Project [Table of Contents]
6. Building a Chatbot for the Blog with the ChatGPT Store [Table of Contents]
⑴ Searching for postings that contain a specific keyword (e.g., SNAr)
import os
from bs4 import BeautifulSoup

def find_character_in_html(directory, character):
    results = []
    for filename in os.listdir(directory):
        if filename.endswith(".html"):
            filepath = os.path.join(directory, filename)
            with open(filepath, 'r', encoding='utf-8') as file:
                soup = BeautifulSoup(file, 'html.parser')
                paragraphs = soup.find_all('p')
                text = ' '.join([p.get_text() for p in paragraphs])
                if character in text:
                    results.append(filename)
    return results

# Directory where your HTML files are stored
directory = 'output'

# Character to search for
character = 'SNAr'

# Find and display the files that contain the character
matching_files = find_character_in_html(directory, character)
print("Files containing the character '" + character + "':")
for file in matching_files:
    print(file)
⑵ Finding code postings whose style attribute lacks white-space: pre
import os
from bs4 import BeautifulSoup

def check_style_attribute(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.html'):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                soup = BeautifulSoup(file, 'html.parser')
                pre_tags = soup.find_all('pre')
                for pre_tag in pre_tags:
                    if pre_tag.find('code'):
                        # Check if 'style' attribute exists and if it contains 'white-space: pre;'
                        if 'style' in pre_tag.attrs and 'white-space: pre;' in pre_tag['style']:
                            pass
                            # print(f"The file {filename} contains <pre><code> with 'white-space: pre;'.")
                        else:
                            # Output if the 'style' attribute does not contain 'white-space: pre;'
                            print(f"<pre><code>{filename}</code></pre>")

directory = 'output'  # Replace with your directory containing HTML files
check_style_attribute(directory)
8. Word Network [Table of Contents]
⑴ Idea: the fact that several anatomical terminologies appear together inside each <p> ... </p> of the neuroanatomy postings implies that they are related to one another
⑵ Method: represent the relatedness between terms as a network data structure (a co-occurrence sketch follows the visualization code below)
⑶ Visualization
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
# Create a graph
G = nx.Graph()
# Adding nodes with the brain structures
brain_structures = [
    "Cerebrum", "Cerebellum", "Brainstem",
    "Thalamus", "Hypothalamus", "Cortex",
    "Hippocampus", "Amygdala", "Basal Ganglia"
]

# Adding relationships based on structural or functional connectivity
relationships = [
    ("Cerebrum", "Cortex"),
    ("Cerebrum", "Basal Ganglia"),
    ("Cortex", "Hippocampus"),
    ("Cortex", "Amygdala"),
    ("Thalamus", "Cortex"),
    ("Thalamus", "Basal Ganglia"),
    ("Hypothalamus", "Thalamus"),
    ("Hypothalamus", "Brainstem"),
    ("Brainstem", "Cerebellum"),
    ("Hippocampus", "Amygdala"),
    ("Basal Ganglia", "Thalamus")
]
G.add_nodes_from(brain_structures)
G.add_edges_from(relationships)
# Drawing the graph
plt.figure(figsize=(12, 8))
pos = nx.spring_layout(G, seed=42) # For consistent layout
nx.draw(G, pos, with_labels=True, node_size=3000, node_color="skyblue", font_size=10, font_weight="bold")
plt.title("Simplified Brain Structures and Their Relationships", size=15)
plt.axis('off') # Turn off the axis
plt.show()
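The graph above is hand-coded for illustration. Below is a minimal sketch of the method described in ⑴ and ⑵, building the network automatically from term co-occurrence within <p> tags of the HTML files saved in section 4; the terms list and the 'output' directory are assumptions carried over from the earlier sections:
## Python
import os
from itertools import combinations
from bs4 import BeautifulSoup
import networkx as nx
import matplotlib.pyplot as plt

# Terms to look for; in practice this would be the full anatomical terminology list
terms = ["Cerebrum", "Cerebellum", "Brainstem", "Thalamus", "Hypothalamus",
         "Cortex", "Hippocampus", "Amygdala", "Basal Ganglia"]

G = nx.Graph()
G.add_nodes_from(terms)

# Two terms appearing in the same <p> ... </p> are treated as related
for filename in os.listdir('output'):
    if not filename.endswith('.html'):
        continue
    with open(os.path.join('output', filename), 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'html.parser')
    for p in soup.find_all('p'):
        text = p.get_text().lower()
        present = [t for t in terms if t.lower() in text]
        for a, b in combinations(present, 2):
            # accumulate a co-occurrence weight on each edge
            G.add_edge(a, b, weight=G.get_edge_data(a, b, {'weight': 0})['weight'] + 1)

plt.figure(figsize=(12, 8))
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_size=3000, node_color="skyblue",
        font_size=10, font_weight="bold")
plt.show()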
9. Redirecting to Another Website [Table of Contents]
<!doctype html>
<html lang="ko">
<head>
<script type="text/javascript"
    src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/x-mathjax-config">
    MathJax.Hub.Config({
        jax: ["input/TeX","output/HTML-CSS"],
        extensions: ["tex2jax.js"],
        tex2jax: {
            inlineMath: [ ['$','$'], ['\\(','\\)'] ],
            processEscapes: true
        }
    });
</script>
<script type="text/javascript">
    pathArray = window.location.pathname.split( '/' );
    document.write("<meta http-equiv=\"Refresh\" content=\"1\;url=http:\/\/nate9389.tistory.com\/", pathArray[1], "\">");
</script>
<title>블로그 주소가 변경되었습니다.</title>
</head>
<body id="tt-body-page">
블로그 주소가 <a href="http://nate9389.tistory.com">http://nate9389.tistory.com</a>로 옮겨졌습니다.<br>
</body>
</html>
10. IP Blocking [Table of Contents]
⑴ Code
<script type="application/javascript">
function getIP(json) {
    var arrUserIP = [
        '185.14.192.165',
        '61.248.186.243',
        '39.7.28.67',
        '175.223.10.110',
        '39.7.25.35',
        '175.223.20.199',
        '39.7.46.230',
        '175.223.10.15'
    ];
    for (var i = 0; i < arrUserIP.length; i++) {
        if (arrUserIP[i] == json.ip) {
            alert('당신은 접속이 차단되었습니다.');
            window.location.replace('https://tistory.com');
        }
    }
}
</script>
<script type="application/javascript" src="https://api.ipify.org?format=jsonp&callback=getIP"></script>
⑵ With the above code in place, the mobile version works fine, but loading the PC version over LG U+ Wi-Fi takes far too long; SKT and KT are both fine
Posted: 2023.11.15 14:14
Updated: 2024.05.03 14:58