
[ํŒŒ์ด์ฌ ๊ธฐ๋ณธ ๋ฐ์ดํ„ฐ ํƒ€์ž…] csv, ์›น(web), xml, json

🪄하루🪄 2023. 1. 30. 09:30

* ์ด ๊ธ€์€ ๋„ค์ด๋ฒ„ ๋ถ€์ŠคํŠธ ์ฝ”์Šค์˜ ์ธ๊ณต์ง€๋Šฅ(AI) ๊ธฐ์ดˆ ๋‹ค์ง€๊ธฐ ๊ฐ•์˜๋ฅผ ์ˆ˜๊ฐ•ํ•˜๋ฉฐ ์ •๋ฆฌํ•œ ๊ธ€์ž…๋‹ˆ๋‹ค.

๋จธ์‹ ๋Ÿฌ๋‹์—์„œ ์ œ์ผ ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” csv์— ๋Œ€ํ•ด ์ž์„ธํžˆ ์•Œ์•„๋ณด์ž

 

1) csv (Comma-Separated Values)

- A text file whose fields are separated by commas (,)

- A data format for using spreadsheet-style (Excel) data in a program

- The data can be handled as a plain file, but the csv module lets you process it through csv reader/writer objects

import csv
# f is a file object opened beforehand, e.g. open("file.csv", newline="")
reader = csv.reader(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
  • delimiter : the character that separates fields (default : ,)
  • lineterminator : the line-break sequence (default : \r\n); used when writing, ignored by the reader
  • quotechar : the character that wraps a field's text (default : " )
  • quoting : how much of the data is wrapped in quotechar, all fields or only some (default : QUOTE_MINIMAL)
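
To see what these parameters actually do, here is a minimal sketch that writes two rows to an in-memory buffer and reads them back. The sample rows and the use of `io.StringIO` instead of a real file are assumptions made for illustration; they are not from the lecture.

```python
import csv
import io

# Hypothetical sample data: one field contains a comma.
rows = [["name", "comment"], ["Zara", "likes csv, json"]]

# Write with QUOTE_ALL so every field is wrapped in quotechar.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", quotechar='"',
                    quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()

# Read it back; the quoted comma stays inside a single field.
parsed = list(csv.reader(io.StringIO(text), delimiter=",", quotechar='"'))

print(text)
print(parsed)
```

Because the comma in the second row sits inside quotechar, the reader does not treat it as a delimiter.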

In practice pandas is used far more often than this module, so it is probably not necessary to know it in detail.

 

2) web(World Wide Web)

- An information space on the Internet

- Uses the HTTP protocol to send and receive data

- Uses the HTML format to display data

- Works as a cycle of request -> processing -> response -> rendering

 

cf) HTML (HyperText Markup Language)

  • Represents information on the web in a structured way
  • Marks up elements with tags (the tags nest inside each other, forming a tree)
  • Pages are generated according to rules, which makes many kinds of analysis possible
  • Processed with string methods, regex, beautifulsoup, etc.
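
The tree-shaped nesting of tags can be made visible with a short sketch. It uses the standard library's `html.parser` rather than beautifulsoup; the class name `TagOutline` and the sample page are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Records each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# Hypothetical page; it has no void tags such as <br>, which this
# simple depth counter does not handle.
sample_html = "<html><body><h1>Title</h1><p>Hello</p></body></html>"

parser = TagOutline()
parser.feed(sample_html)

# Indent each tag by its depth to show the containment relationship.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```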

cf) Regular expression (regex) : a notation for defining complex string patterns

Look up the regular-expression syntax whenever you need it.

 

Example) Find every string that matches a regular-expression pattern in the page at a given address.

import re
import urllib.request

url = "page URL"  # placeholder: the address of the page to fetch
html = urllib.request.urlopen(url)
html_contents = html.read().decode("encoding")  # placeholder: the page's encoding
# Use re.findall to collect everything that matches the pattern
id_results = re.findall(r"(regular expression)", html_contents)
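
Since the code above only contains placeholders, here is a small sketch with concrete values that runs without a network connection. The HTML snippet and the `/user/...` link pattern are invented for illustration and stand in for a downloaded page.

```python
import re

# Hypothetical HTML snippet standing in for a downloaded page.
html_contents = """
<ul>
  <li><a href="/user/alice01">alice01</a></li>
  <li><a href="/user/bob_22">bob_22</a></li>
</ul>
"""

# With one capture group, findall returns only the captured part of each match.
id_results = re.findall(r'href="/user/([A-Za-z0-9_]+)"', html_contents)
print(id_results)
```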

 

3) xml (eXtensible Markup Language)

<tag>data</tag>

- Expresses the structure and meaning of data with tags (markup), similar to HTML

- Tree structure

- Usually parsed with beautifulsoup

 

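Usage

# Import the module
from bs4 import BeautifulSoup
# Create the object (xml is a string holding the XML document;
# "lxml" is lxml's HTML parser, while "xml" selects its XML parser)
soup = BeautifulSoup(xml, "lxml")
# Search for tags
soup.find_all("tag name")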

 

์˜ˆ์ œ) ํŒŒ์ผ.xml์— ํƒœ๊ทธ์ด๋ฆ„์„ ๊ฐ€์ง„ ๋ชจ๋“  ์š”์†Œ๋ฅผ ์ถ”์ถœํ•˜์—ฌ๋ผ. 

from bs4 import BeautifulSoup

with open("file.xml", "r", encoding="utf-8") as books_file:
    books_xml = books_file.read()  # read the file into a string

soup = BeautifulSoup(books_xml, "lxml")

for book_info in soup.find_all("tag name"):
    print(book_info)             # the whole element, tags included
    print(book_info.get_text())  # only the text inside the element
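
For comparison, the same kind of tag search can be sketched with the standard library's `xml.etree.ElementTree` instead of beautifulsoup, which avoids installing extra packages. The XML document below is hypothetical and only serves as sample input.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document used only for illustration.
books_xml = """
<books>
  <book><title>Python Basics</title><price>10</price></book>
  <book><title>Data Formats</title><price>12</price></book>
</books>
""".strip()

root = ET.fromstring(books_xml)

# iter("title") walks the whole tree and yields every <title> element.
titles = [title.text for title in root.iter("title")]
print(titles)
```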

 

4) json (JavaScript Object Notation)

{"parent_tag": {"child_tag": data}}

- The way JavaScript represents data objects

- Concise, small in size, and easy to convert into code

- Parsed with the json module
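
As a rough sketch of how JSON maps onto Python types, the snippet below parses a nested JSON string. The key names and values are made up for illustration.

```python
import json

# Hypothetical JSON text: an object whose value is an array of objects.
contents = '{"parent_tag": [{"child_tag": 1}, {"child_tag": 2}]}'

json_data = json.loads(contents)   # JSON object -> Python dict
children = json_data["parent_tag"]  # JSON array  -> Python list
values = [child["child_tag"] for child in children]

print(type(json_data).__name__, type(children).__name__, values)
```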

 

Usage

  • Read
# Import the module
import json
with open("file.json", "r", encoding="utf-8") as f:
    contents = f.read()
    # Parse the text -> returns a dict
    json_data = json.loads(contents)
    # Look up the data under a key
    print(json_data["key name"])
  • Write
# Import the module
import json
# Data to save
dict_data = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}
# Write it to a file
with open("file.json", "w", encoding="utf-8") as f:
    json.dump(dict_data, f)
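
Reading and writing can be checked together with a round trip. This is only a sketch: the temporary directory and the file name `sample.json` are assumptions so that the example leaves no file behind.

```python
import json
import os
import tempfile

dict_data = {"Name": "Zara", "Age": 7, "Class": "First"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")  # hypothetical file name
    # json.dump writes straight to the file object.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dict_data, f, ensure_ascii=False, indent=2)
    # json.load reads from the file object without a separate f.read().
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == dict_data)
```

`json.load` and `json.dump` work on file objects, while `json.loads` and `json.dumps` work on strings.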