๋ฐ๋ธŒ์ฝ”์Šค_๋ฐ์ดํ„ฐ์—”์ง€๋‹ˆ์–ด๋ง

[Week6 ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค์™€ SQL ๊ธฐ๋ณธ] TIL 23์ผ์ฐจ - Google Colab์—์„œ ๋ฐ์ดํ„ฐ์›จ์–ดํ•˜์šฐ์Šค ์—ฐ๊ฒฐํ•ด ์‚ฌ์šฉํ•˜๊ธฐ

๐Ÿช„ํ•˜๋ฃจ๐Ÿช„ 2023. 11. 16. 18:25
728x90

์ด๋ฒˆ์‹œ๊ฐ„์—๋Š” ์ฝ”๋žฉ์— ๋ฐ์ดํ„ฐ์›จ์–ดํ•˜์šฐ์Šค(Redshift)์— ์—ฐ๊ฒฐํ•˜๊ณ  ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€๊ณตํ•ด ๋ณด์ž.

 

0) ๋ฐ์ดํ„ฐ ์Šคํ‚ค๋งˆ ์ •๋ณด(ํด๋”)

๐Ÿ“ row data ๐Ÿ“ analytics ๐Ÿ“ adhoc
ETL ๊ณผ์ •์—์„œ ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ row data์˜ ํ…Œ์ด๋ธ”์„ ์กฐํ•ฉํ•ด์„œ ๋งŒ๋“  ์ƒˆ๋กœ์šด ํ…Œ์ด๋ธ”  

 

 

1) ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค(Redshift) Colab์— ์—ฐ๊ฒฐํ•˜๊ธฐ

%load_ext sql
%sql postgresql:/*redshift ์„œ๋ฒ„ ์—ฐ๊ฒฐ ์ฃผ์†Œ*/
%%sql
์›ํ•˜๋Š” sql ๋ฌธ์žฅ

 

์œ„์˜ ์ฝ”๋“œ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์•„๋งˆ์กด์˜ Redshift ๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค์— ์—ฐ๊ฒฐํ•˜๊ณ , SQL ๋ฌธ์žฅ์„ ์ž…๋ ฅํ•˜์—ฌ ์›ํ•˜๋Š” ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

 

2) ํ…Œ์ด๋ธ”์„ (ํŒ๋‹ค์Šค) Dataframe ์ž๋ฃŒํ˜•์œผ๋กœ ๋ณ€ํ™˜

result=%sql SELECT * FROM ์ธ์Šคํ„ด์Šค๋ช….ํ…Œ์ด๋ธ”๋ช…
df = result.DataFrame()

 

๋ฐ์ดํ„ฐ ์›จ์–ดํ•˜์šฐ์Šค์— ์žˆ๋Š” ํ…Œ์ด๋ธ”์„ ์šฐ๋ฆฌ์—๊ฒŒ ์ต์ˆ™ํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„(Dataframe) ํ˜•ํƒœ๋กœ ๋ฐ”๊พธ๋Š” ์ฝ”๋“œ์ด๋‹ค.

๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ํ˜•ํƒœ๋กœ ๋ฐ”๊พผ ๋’ค์— ์ด์ œ ์›ํ•˜๋Š” ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜๋„ ์žˆ๋‹ค.

 

 

3) ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ํ™•์ธ

์ค‘๋ณต๋œ ๋ ˆ์ฝ”๋“œ ์ฒดํฌํ•˜๊ธฐ

count == count(distinct)

์ตœ๊ทผ ๋ฐ์ดํ„ฐ ์กด์žฌ์—ฌ๋ถ€ ํŒŒ์•…

min(timestamp), max(timestamp)

Primary key uniqueness ์ฒดํฌํ•˜๊ธฐ

primarykey ๊ธฐ์ค€์œผ๋กœ group ํ•œ ๋’ค์— count๋ฅผ ์ˆ˜ํ–‰ํ•ด์„œ ๊ฐ’์ด ๋ชจ๋‘ 1์ธ์ง€ ํ™•์ธ(๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌํ•ด์„œ ์ฒซ ๋ฒˆ์งธ ๋ ˆ์ฝ”๋“œ๋งŒ ํ™•์ธ)

groupy by primarykey

count(primarykey), count(1)

order by 2 desc

๋ถ„์„ํ•  ์ปฌ๋Ÿผ์˜ ๊ฐ’์ด ๋น„์–ด์žˆ๋Š” ๋น„์œจ ํ™•์ธํ•˜๊ธฐ

- case ์ด์šฉ

count(CASE WHEN ์ปฌ๋Ÿผ๋ช… is NULL THEN 1 END)

- ๋น„๊ต์—ฐ์‚ฐ ์ด์šฉ

where ์ปฌ๋Ÿผ๋ช… is null

 

cf) CTE๋ž€?

"Common Table Expressions"์˜ ์•ฝ์ž๋กœ, SQL ์ฟผ๋ฆฌ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์ผ์ข…์˜ ์ž„์‹œ ๊ฒฐ๊ณผ ์ง‘ํ•ฉ์„ ์ •์˜ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜๋Š” ๋ฌธ๋ฒ•์œผ๋กœ ์ฟผ๋ฆฌ ์‹คํ–‰ ์‹œ์—๋งŒ ์œ ํšจํ•˜๋‹ค. ์ฃผ๋กœ ๋ณต์žกํ•œ ์ฟผ๋ฆฌ๋ฅผ ๋” ๊ฐ„๋‹จํ•˜๊ณ  ๊ฐ€๋…์„ฑ ์žˆ๊ฒŒ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋œ๋‹ค.

WITH CTEํ…Œ์ด๋ธ”๋ช… (๋ฐ˜ํ™˜๋˜๋Š” ์ปฌ๋Ÿผ1, ๋ฐ˜ํ™˜๋˜๋Š” ์ปฌ๋Ÿผ2, ...) AS (
    -- CTE์˜ ์ฟผ๋ฆฌ ์ •์˜
    SELECT ...
)

-- CTE๋ฅผ ํ™œ์šฉํ•œ ๋ฉ”์ธ ์ฟผ๋ฆฌ
SELECT ~
FROM CTEํ…Œ์ด๋ธ”๋ช… 
WHERE ~

 

 

4) ๋ฐ์ดํ„ฐ ๋ถ„์„ - ์ง‘๊ณ„ํ•จ์ˆ˜ ์‚ฌ์šฉ

์ฑ„๋„๋ณ„ ์„ธ์…˜์ˆ˜๋ฅผ ์ง‘๊ณ„ํ•ด๋ณด์ž.

  • user_session_channel ํ…Œ์ด๋ธ” ์ด์šฉ + COUNT ํ•จ์ˆ˜ ์‚ฌ์šฉ

์›”๋ณ„ ์„ธ์…˜์ˆ˜๋ฅผ ์ง‘๊ณ„ํ•ด ๋ณด์ž.

  • session_timestamp ํ…Œ์ด๋ธ” ์ด์šฉ + LEFT ํ•จ์ˆ˜ ์‚ฌ์šฉ + COUNT ํ•จ์ˆ˜ ์‚ฌ์šฉ

๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋œ ์ฑ„๋„์— ๋Œ€ํ•ด ์•Œ์•„๋ณด์ž.

  • ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋œ ๊ธฐ์ค€์€? (์‚ฌ์šฉ์ž ๊ธฐ์ค€ vs ์„ธ์…˜ ๊ธฐ์ค€)
  • user_session_channel ํ…Œ์ด๋ธ” ์ด์šฉ + COUNT ํ•จ์ˆ˜ ์‚ฌ์šฉ + DISTINCT ํ•จ์ˆ˜ ์‚ฌ์šฉ

๊ฐ€์žฅ ๋งŽ์€ ์„ธ์…˜์„ ๊ฐ€์ง„ ์‚ฌ์šฉ์ž ID์— ๋Œ€ํ•ด ์•Œ์•„๋ณด์ž.

  • user_session_channel ํ…Œ์ด๋ธ” ์ด์šฉ + COUNT ํ•จ์ˆ˜ ์‚ฌ์šฉ + LIMIT ์ง€์ •

์›”๋ณ„ ์œ ๋‹ˆํฌํ•œ ์‚ฌ์šฉ์ž์ˆ˜๋ฅผ ์•Œ์•„๋ณด์ž.

  • user_session_channel, session_timestamp JOIN + TO_CHAR(๊ธฐ์กด ์นผ๋Ÿผ๋ช…, ํ˜•์‹) ํ•จ์ˆ˜ ์‚ฌ์šฉ + COUNT ํ•จ์ˆ˜ ์‚ฌ์šฉ

์›”๋ณ„ ์ฑ„๋„๋ณ„ ์œ ๋‹ˆํฌํ•œ ์‚ฌ์šฉ์ž์ˆ˜๋ฅผ ์•Œ์•„๋ณด์ž.

  • user_session_channel, session_timestamp JOIN + TO_CHAR(๊ธฐ์กด ์ปฌ๋Ÿผ๋ช…, ํ˜•์‹) ํ•จ์ˆ˜ ์‚ฌ์šฉ+ COUNT ํ•จ์ˆ˜ ์‚ฌ์šฉ + DISTINCT ํ•จ์ˆ˜ ์‚ฌ์šฉ

 

cf) ๋งˆ์ผ€ํŒ…์—์„œ ์ค‘์š”ํ•˜๊ฒŒ ์ƒ๊ฐํ•˜๋Š” ์ง€ํ‘œ

DAU(Daily Active User), WAU(Weekly Active User), MAU(Monthly Active User)

: ๋‹จ์œ„ ์‹œ๊ฐ„(Daily, Weekly, Monthly)์— ๋ฐฉ๋ฌธํ•œ unique ํ•œ ์‚ฌ์šฉ์ž์ˆ˜

 

 

5) Timestamp ์ •๋ณด ์ค‘ ์ผ๋ถ€๋ฅผ ์ถ”์ถœํ•˜๊ธฐ

Timestamp๋ฅผ ๊ฐ€๊ณตํ•ด์„œ ํŠน์ • ์ •๋ณด๋ฅผ ๋งŒ๋“ค๊ณ  ์‹ถ์„ ๋•Œ๊ฐ€ ์žˆ๋‹ค. ๊ทธ๋Ÿด ๋• ์•„๋ž˜์˜ 4๊ฐ€์ง€ ๋ฐฉ๋ฒ•์„ ํ™œ์šฉํ•˜๋ฉด ๋œ๋‹ค.

๋ฐ˜ํ™˜ํ•˜๋Š” ๊ฐ’์˜ ํƒ€์ž…์— ๋”ฐ๋ผ ๋‘ ๊ฐ€์ง€๋กœ ๋‚˜๋‰œ๋‹ค.

 

String์„ ๋ฐ˜ํ™˜

TO_CHAR( DATETIME์นผ๋Ÿผ๋ช…, ์ž๋ฆฟ์ˆ˜ )

LEFT( DATETIME์ปฌ๋Ÿผ๋ช… , ์ž๋ฆฟ์ˆ˜ )

SUBSTRING( DATETIME์นผ๋Ÿผ๋ช…, START, ์ž๋ฆฟ์ˆ˜ )

 

Timestamp๋ฅผ ๋ฐ˜ํ™˜

DATE_TRUNC( '์›ํ•˜๋Š” ์†์„ฑ', DATETIME์นผ๋Ÿผ๋ช…)

 

 

6) Ctas ์‚ฌ์šฉํ•˜๊ธฐ

ํ…Œ์ด๋ธ”์„ ์ƒ์„ฑํ•จ๊ณผ ๋™์‹œ์— ๋‚ด์šฉ์„ ์ฑ„์šฐ๋Š” ๊ธฐ์ˆ ๋กœ, ๊ฐ„๋‹จํ•˜๊ฒŒ ์ƒˆ๋กœ์šด ํ…Œ์ด๋ธ”์„ ์ƒ์„ฑํ•˜๋Š” ์ž‘์—….

์ฃผ๋กœ ๋ฐ์ดํ„ฐ ๋ถ„์„์—์„œ ๋งŽ์ด ์‚ฌ์šฉํ•˜๋Š” ์กฐ์ธ์—ฐ์‚ฐ ๋“ฑ์„ ๋ฏธ๋ฆฌ ์ˆ˜ํ–‰ํ•ด์„œ ์ƒˆ ํ…Œ์ด๋ธ”๋กœ ๋งŒ๋“œ๋Š” ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋œ๋‹ค.

ํ•ด๋‹น ์ž‘์—…์€ analytics ํด๋”์— ์ €์žฅํ•˜๋„๋ก ํ•œ๋‹ค(ETL ํ…Œ์ด๋ธ”์„ ์กฐ์ธํ•ด์„œ ๋งŒ๋“  ๋ถ„์„์šฉ ํ…Œ์ด๋ธ”์ด๊ธฐ ๋•Œ๋ฌธ)

%%sql

DROP TABLE IF EXISTS analytics.ํ…Œ์ด๋ธ”๋ช…;
CREATE TABLE analytics.ํ…Œ์ด๋ธ”๋ช… AS
SELECT ~
FROM ํ…Œ์ด๋ธ”A JOIN ํ…Œ์ด๋ธ”B ON ์กฐ์ธ์กฐ๊ฑด;

 

 

7) ์ˆ™์ œ

์›”๋ณ„ ์ฑ„๋„๋ณ„ ๋งค์ถœ์•ก ํ…Œ์ด๋ธ” ๋งŒ๋“ค๊ธฐ

์ฃผ์˜ํ•  ์ )

1. ๋ชจ๋“  ์ฑ„๋„์— ๋Œ€ํ•ด ๊ตฌ๋งค๊ฐ€ ์ผ์–ด๋‚˜์ง€๋Š” ์•Š๋Š”๋‹ค.(์ฑ„๋„๋ณ„์ด๋ฏ€๋กœ, ์ฑ„๋„ ๊ธฐ์ค€ leftjoin์„ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•œ๋‹ค)

2. int๋ผ๋ฆฌ ๋‚˜๋ˆ„๋ฉด int๊ฐ€ ๊ฒฐ๊ณผ๋กœ ๋‚˜์˜จ๋‹ค(float๋กœ ํƒ€์ž…์บ์ŠคํŒ… ํ•  ๊ฒƒ)

month  
channel  
uniqueUsers ์ด ๋ฐฉ๋ฌธ ์‚ฌ์šฉ์ž
paidUsers ๊ตฌ๋งค ์‚ฌ์šฉ์ž
conversionRate ๊ตฌ๋งค์‚ฌ์šฉ์ž/์ด ๋ฐฉ๋ฌธ ์‚ฌ์šฉ์ž
grossRevenue session_transaction์˜ amount์˜ ํ•ฉ
netRevenue session_transaction์˜ amount ํ•ฉ(๋‹จ, refund์‹œ ์ œ์™ธ)

 

์ž‘์„ฑํ•œ ์ฝ”๋“œ

%%sql
DROP TABLE IF EXISTS adhoc.session_summary;
CREATE TABLE adhoc.session_summary AS
SELECT 
  TO_CHAR(T.ts, 'YYYY-MM') AS month, 
  C.channelname as channel,  
  COUNT(DISTINCT U.userid) as uniqueUsers, 
  Count(CASE WHEN TR.amount >=0 THEN 1 END)as paidUsers,
  ROUND(Cast(paidUsers AS Float)*100/NULLIF(uniqueUsers, 0), 2) as conversionRate,
  Sum(CASE WHEN TR.refunded is NOT NULL then Tr.amount END) as grossRevenue,
  Sum(CASE WHEN TR.refunded=False then Tr.amount ELSE 0 END) as netRevenue
FROM raw_data.channel C 
LEFT JOIN raw_data.user_session_channel U on C.channelname=U.channel
JOIN raw_data.session_timestamp T ON T.sessionid=U.sessionid
LEFT JOIN raw_data.session_transaction TR ON  T.sessionid=TR.sessionid
GROUP BY 1, 2
ORDER BY 1, 2;

 

728x90