Devcourse_Data Engineering

[Week3 Web Data Crawling and Analysis] TIL Day 10 - Visualization with the Seaborn Library

🪄하루🪄 2023. 10. 27. 15:53

0) The Seaborn library

A visualization library built on top of matplotlib.

It provides a high-level interface that makes a wide variety of plots easy to draw.

 

1) ๊ทธ๋ž˜ํ”„์˜ ์ข…๋ฅ˜

1. figure-level vs axes-level

seaborn ํ•จ์ˆ˜๋Š” matplotlib ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋กœ ์ปค์Šคํ„ฐ๋งˆ์ด์ง• ํ•˜๊ฑฐ๋‚˜ ์ž์ฒด ์ปค์Šคํ„ฐ๋งˆ์ด์ง•์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค.

์ด๋ฅผ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์œ„ํ•ด์„  figure-level๊ณผ axes-level์— ๋Œ€ํ•œ ์ดํ•ด๊ฐ€ ํ•„์š”ํ•˜๋‹ค.

 

figure-level

ํ•ด๋‹น ๋ ˆ๋ฒจ์˜ ํ•จ์ˆ˜๋Š” seaborn ๊ฐ์ฒด๋ฅผ ํ†ตํ•ด์„œ matplotlib๊ณผ ์ƒํ˜ธ์ž‘์šฉ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— FacetGrid(seaborn์˜ figure)์„ ์ด์šฉํ•ด์„œ ๊ทธ๋ž˜ํ”„ ์ปค์Šคํ„ฐ๋งˆ์ด์ง•์„ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•œ๋‹ค.

  • replot
  • displot
  • catplot
  • lmplot

 

axes-level

ํ•ด๋‹น ๋ ˆ๋ฒจ์˜ ํ•จ์ˆ˜๋Š” matplotlib.pyplot.Axes ๊ฐ์ฒด ์œ„์— ๋ฐ์ดํ„ฐ๋ฅผ ํ‘œ์‹œํ•˜๊ธฐ ๋•Œ๋ฌธ์— matplotlib ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ํ•จ์ˆ˜๋กœ ์ปค์Šคํ„ฐ๋งˆ์ด์ง•์„ ํ•  ์ˆ˜ ์žˆ๋‹ค.

  • ์ด์™ธ์˜ ๋‚˜๋จธ์ง€

 

2. Plots available in the seaborn library

Relational plots

  • scatterplot : a scatter plot made up of individual points
  • lineplot : a line plot that connects the data with a line
  • relplot : a figure-level function that can draw both of the plots above
    • If kind is not specified, it draws a scatterplot (the default is kind="scatter").
    • With kind="line", it draws a lineplot.

 

๊ธฐ๋ณธ์ ์ธ ์‚ฌ์šฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

Visualizing statistical relationships — seaborn 0.13.0 documentation

Visualizing statistical relationships Statistical analysis is a process of understanding how variables in a dataset relate to each other and how those relationships depend on other variables. Visualization can be a core component of this process because, w

seaborn.pydata.org

Distribution plots

  • histplot : histogram
  • kdeplot : kernel density estimate
  • ecdfplot : empirical cumulative distribution function
  • rugplot : shows the distribution of real-valued data as small tick marks along an axis
  • displot : a figure-level function that can draw the four plots above
    • The default is histplot (kind="hist").
    • With kind="kde", it draws a kdeplot.
    • With kind="ecdf", it draws an ecdfplot.
    • rug=True adds a rugplot along the axis of whichever kind is drawn; rug is an extra layer, not a kind of its own.

 

๊ธฐ๋ณธ์ ์ธ ์‚ฌ์šฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

Visualizing distributions of data — seaborn 0.13.0 documentation

Visualizing distributions of data An early step in any effort to analyze or model data should be to understand how the variables are distributed. Techniques for distribution visualization can provide quick answers to many important questions. What range do

seaborn.pydata.org

Categorical plots

  • stripplot : the distribution of y values for each level of a categorical variable
  • swarmplot : like stripplot, but overlapping points are spread apart so that each one is visible
  • boxplot : a box plot (summarizes numeric data by its quartiles)
  • violinplot : shows the distribution of univariate, continuous data
  • boxenplot : similar to boxplot, but shows the shape of the distribution in more detail
  • pointplot : shows a point estimate for each category and connects the points with straight lines
  • barplot : bar chart
  • countplot : shows the number of observations in each category, similar in appearance to a histogram
  • catplot : a figure-level function that can draw the eight plots above
    • Pass the word in front of "plot" as kind to select any of the eight: catplot(kind="word"), for example kind="box".

 

๊ธฐ๋ณธ์ ์ธ ์‚ฌ์šฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

Visualizing categorical data — seaborn 0.13.0 documentation

Visualizing categorical data In the relational plot tutorial we saw how to use different visual representations to show the relationship between multiple variables in a dataset. In the examples, we focused on cases where the main relationship was between t

seaborn.pydata.org
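
A short sketch of the `catplot(kind=...)` pattern, using three of the eight kinds. The data is invented; `kind="count"` takes only one of `x` or `y` because the other axis is the count itself.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical per-language scores, invented for illustration.
df = pd.DataFrame({
    "lang": ["python"] * 3 + ["c"] * 2 + ["java"] * 2,
    "score": [3, 5, 4, 2, 3, 4, 1],
})

box_grid = sns.catplot(data=df, x="lang", y="score", kind="box")  # boxplot
bar_grid = sns.catplot(data=df, x="lang", y="score", kind="bar")  # barplot (mean per category)
count_grid = sns.catplot(data=df, x="lang", kind="count")         # countplot (rows per category)
```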

Regression plots

  • regplot : shows the data together with a fitted regression line
  • residplot : plots the residuals of a regression
  • lmplot : a figure-level function that draws the regplot above
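
The three regression functions can be compared side by side. This is a minimal sketch with invented, roughly linear data; `regplot` and `residplot` are given explicit Axes, while `lmplot` builds its own figure.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical, roughly linear data invented for illustration.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1],
})

# Axes-level: scatter plus fitted line, and the residuals of that fit.
fig, (ax_fit, ax_resid) = plt.subplots(1, 2, figsize=(8, 3))
sns.regplot(data=df, x="x", y="y", ax=ax_fit)
sns.residplot(data=df, x="x", y="y", ax=ax_resid)

# Figure-level: the same regression plot wrapped in a FacetGrid.
grid = sns.lmplot(data=df, x="x", y="y")
```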

Matrix plots

  • heatmap : draws a matrix of values as a color-encoded grid
  • clustermap : draws a heatmap whose rows and columns are reordered by hierarchical clustering
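
A heatmap takes a 2D table of numbers. The sketch below uses an invented 2x2 DataFrame; `clustermap` is left out because it additionally depends on scipy.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical 2x2 matrix, invented for illustration.
matrix = pd.DataFrame(
    [[1, 2], [3, 4]], index=["r1", "r2"], columns=["c1", "c2"]
)

# annot=True writes each cell's value on top of its colored square.
ax = sns.heatmap(matrix, annot=True)
cell_labels = [t.get_text() for t in ax.texts]
```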

 

3. Decorating plots with matplotlib

matplotlib.pyplot can be used to change or add various elements of a plot.

After applying the decorations, call plt.show() to display the plot with them applied.

1. Adding a title

  • plt.title(title)

2. Adding axis labels

  • plt.xlabel(x-axis label)
  • plt.ylabel(y-axis label)

3. Setting the range shown on each axis

  • plt.xlim(start x value, end x value)
  • plt.ylim(start y value, end y value)

4. Figure size

The size is set when the figure is created, so the call that creates the figure must come before the plotting function.

  • plt.figure(figsize=(width, height)) (both values in inches)

 

 

2) Visualization after web scraping (BeautifulSoup, Seaborn, WordCloud)

1. Visualizing as a bar chart

The code below visualizes which kinds of questions are posted most often on the Hashcode site.

QnA | Programmers Community (qna.programmers.co.kr)

Process

1. Locate the tags (the kinds of questions) in the page's HTML.

<ul class="question-tags">
    <li>
        <span>here!</span>
    </li>
    <li>
        <span>here!</span>
    </li>
    ....
</ul>

2. Scrape each page with BeautifulSoup.

3. Pause between requests with a timer to reduce the load caused by scraping.

4. Store the information from step 2 in a dictionary.

5. Convert to a Counter: use Counter to extract the most frequent tags.

6. Visualize the dictionary with seaborn.
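
Before requesting any real page, the selector from step 1 can be checked offline against a small HTML string. This is a sketch: the HTML below is a hand-written stand-in shaped like the structure shown in step 1, not markup copied from the site.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one question's tag list (hypothetical content).
html = """
<ul class="question-tags">
    <li><span>python</span></li>
    <li><span>pandas</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Same traversal as the scraping steps: every <ul class="question-tags">,
# then the stripped text of each <li> inside it.
tags = [
    li.text.strip()
    for ul in soup.find_all("ul", "question-tags")
    for li in ul.find_all("li")
]
print(tags)
```

Passing a string as the second argument of `find_all` filters by CSS class, which is why `"question-tags"` needs no `class_=` keyword here.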

# ์งˆ๋ฌธ์˜ ๋นˆ๋„๋ฅผ ์ฒดํฌํ•˜๋Š” dict๋ฅผ ๋งŒ๋“  ํ›„, ๋นˆ๋„๋ฅผ ์‹œ๊ฐํ™”ํ•ด ๋ณด์ž.
import time
## 4. 2๋ฒˆ์˜ ์ •๋ณด๋ฅผ dictionary ํ˜•ํƒœ๋กœ ์ €์žฅ
frequency={}
import requests
from bs4 import BeautifulSoup
for i in range(10):
    ## 2. ๊ฐ ํŽ˜์ด์ง€๋งˆ๋‹ค BeautifulSoup๋ฅผ ์ด์šฉํ•ด ์›น ์Šคํฌ๋ž˜ํ•‘์„ ์ง„ํ–‰
    res=requests.get("https://hashcode.co.kr?page={}".format(i))
    soup=BeautifulSoup(res.text, "html.parser")

    ## 1. ul->li์˜ text๋งŒ ์ถ”์ถœ
    ul_tags=soup.findAll("ul", "question-tags")
    for ul in ul_tags:
        li_tags=ul.find_all("li")
        for li in li_tags:
            tag=li.text.strip()
            frequency[tag]=frequency.get(tag, 0)+1
            
	## 3. ์Šคํฌ๋ž˜ํ•‘์‹œ ๋ถ€ํ•˜๋ฅผ ์ค„์ด๊ธฐ ์œ„ํ•ด ํƒ€์ด๋จธ๋ฅผ ์ด์šฉ
    time.sleep(0.5)

## 5. Counter๋กœ ๋ณ€ํ™˜ : Counter๋ฅผ ์‚ฌ์šฉํ•ด ๊ฐ€์žฅ ๋นˆ๋„๊ฐ€ ๋†’์€ value๋“ค์„ ์ถ”์ถœ
from collections import Counter
counter=Counter(frequency)
counter.most_common(10)

## 6. dictionary๋ฅผ seaborn์„ ํ†ตํ•ด ์‹œ๊ฐํ™”
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 10))
plt.title("Frequency of question in Hashcode")
plt.xlabel("Tag")
plt.ylabel("Frequency")
x_list=[elem[0] for elem in counter.most_common(10)]
y_list=[elem[1] for elem in counter.most_common(10)]
sns.barplot(x=x_list, y=y_list)
plt.show()
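
The counting and ranking steps (4 and 5) use only the standard library, so they can be tried without any network access. A small sketch with an invented list of tags:

```python
from collections import Counter

# Hypothetical scraped tags, invented for illustration.
tags = ["python", "c", "python", "java", "python", "c"]

# Step 4: tally each tag in a plain dict; get() supplies 0 for unseen tags.
frequency = {}
for tag in tags:
    frequency[tag] = frequency.get(tag, 0) + 1

# Step 5: Counter.most_common(n) returns (tag, count) pairs, highest count first.
counter = Counter(frequency)
top_two = counter.most_common(2)

# Split the pairs into the x and y lists that a bar plot expects.
x_list = [name for name, _ in top_two]
y_list = [count for _, count in top_two]
print(top_two)  # [('python', 3), ('c', 2)]
```

`Counter(tags)` would produce the same counts directly from the list; the dict version mirrors the steps as written.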

Result

The chart shows that questions are most frequent in the order python > c > java > c++ > pandas ...

 

 

2. Visualizing as a WordCloud

What is a WordCloud?

A word cloud is a visual representation of text that reflects the importance or popularity of each word: the more often a word appears, the larger and bolder it is drawn.

The wordcloud library is used to build one.

Word cloud example

 

Word clouds are widely used in natural language processing: words are extracted from sentences, analyzed, and then visualized.

The code below scrapes the question titles from the Hashcode site and visualizes which words appear most often.

 

Process

1. Locate the question titles in the page's HTML.

<ul class="question-tags">
    <li>
        <div>...</div>
        <div>...</div>
        <div class="question">
            <div class="top">
                <h4>The title is here.</h4>
            </div>
        </div>
    </li>
    <li>
        <div>...</div>
        <div>...</div>
        <div class="question">
            <div class="top">
                <h4>The title is here.</h4>
            </div>
        </div>
    </li>
</ul>

2. Scrape each page with BeautifulSoup.

3. Pause between requests with a timer to reduce the load caused by scraping.

4. Store the titles in a list.

5. Extract the words from each title in the list to build a word list for all titles.

6. Convert to a Counter: turn the word list into a Counter.

7. Visualize the resulting dictionary with wordcloud.
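
Steps 5 and 6 can be sketched with the standard library alone. Note the substitution: plain whitespace splitting stands in here for a morphological analyzer, so this only approximates noun extraction and suits space-separated English titles better than Korean text. The titles are invented.

```python
from collections import Counter

# Hypothetical question titles, invented for illustration.
titles = [
    "python list sort question",
    "python pandas error",
    "java list error",
]

# Step 5: break every title into words and collect them in one list.
# (Whitespace splitting is a rough stand-in for real noun extraction.)
words = []
for t in titles:
    words += t.lower().split()

# Step 6: count how often each word occurs.
counter = Counter(words)
print(counter["python"], counter["java"])  # 2 1
```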

# Extract words from the question titles and visualize them

import time
from collections import Counter

import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
from konlpy.tag import Hannanum  # morphological analysis library used to extract nouns
from wordcloud import WordCloud

# user_agent was not defined in the original snippet; this value is only a
# placeholder, so substitute your own browser's User-Agent string.
user_agent = "Mozilla/5.0"

## 4. Store the titles in a list.
title = []
## 2. Scrape each page with BeautifulSoup.
for i in range(6):
    # Headers must be passed as headers=...; a dict given as the second
    # positional argument would be sent as query parameters instead.
    res = requests.get("https://hashcode.co.kr?page={}".format(i),
                       headers={"User-Agent": user_agent})
    soup = BeautifulSoup(res.text, "html.parser")
    ## 1. Find the li elements that hold each question.
    parsed_datas = soup.find_all("li", "question-list-item")
    for data in parsed_datas:
        title.append(data.h4.text.strip())
    ## 3. Pause between requests to reduce the load caused by scraping.
    time.sleep(0.5)


## 5. Extract the words from each title in the list to build a word list.
# Create a Hannanum object (the morphological analyzer), then extract nouns with .nouns().
words = []
hannanum = Hannanum()
for t in title:
    nouns = hannanum.nouns(t)
    words += nouns


## 6. Convert to a Counter: turn the word list into a Counter.
counter = Counter(words)

## 7. Visualize the resulting dictionary with WordCloud.
# A font with Korean glyphs is required; this path points to Malgun Gothic on Windows.
# The raw string keeps the backslashes from being read as escape sequences.
wordcloud = WordCloud(font_path=r"C:\Windows\Fonts\malgun.ttf",
                      background_color="white", width=2000, height=2000)
img = wordcloud.generate_from_frequencies(counter)
plt.imshow(img)
plt.show()

Result

 

The word cloud shows that keywords such as 파이썬 (Python), 질문 (question), 문제 (problem), 오류 (error), and 코드 (code) appear frequently.
