
[NLP] Natural Language Processing 4. Topic Modeling (LDA, NMF)

sueeee-e 2024. 9. 28. 17:10

 

🎈Topic Modeling

- Identifying the latent "topics" in documents = figuring out and classifying what the documents are about

- Representative methods: LSI, pLSA (also written pLSI), LDA, NMF

- Topic model

: A statistical model for discovering the abstract "topics" that occur in a collection of documents

: A text mining technique used to uncover the hidden semantic structure in a body of text

 

✔️Applications: document classification and summarization, recommender systems, market research and public opinion analysis

 

โš’๏ธ ์ž ์žฌ ์˜๋ฏธ ๋ถ„์„, LSI : ์ตœ์ดˆ์˜ ํ† ํ”ฝ ๋ชจ๋ธ, ๋ฌธํ—Œ-์šฉ์–ดํ–‰๋ ฌ = ๋ฌธํ—Œ-์˜๋ฏธํ–‰๋ ฌ + ์˜๋ฏธ-์šฉ์–ดํ–‰๋ ฌ ๋ถ„ํ•ดํ•˜์—ฌ ์ž ์žฌ๋ณ€์ˆ˜ ์˜๋ฏธ๋ฅผ ๋ฐœ๊ฒฌํ•˜๊ณ ์ž ํ•จ

โš’๏ธ ์ž ์žฌ ๋””๋ฆฌํด๋ ˆ ํ• ๋‹นLDA : ๊ฐ ๋ฌธ์„œ์˜ ํ† ํ”ฝ ๋ถ„ํฌ์™€ ๊ฐ ํ† ํ”ฝ ๋‚ด์˜ ๋‹จ์–ด ๋ถ„ํฌ๋ฅผ ์ถ”์ • / ํ™•๋ฅ ์  ๋ชจ๋ธ   

โš’๏ธ NMF ๋น„์Œ์ˆ˜ ํ–‰๋ ฌ๋ถ„ํ•ด : ๋ชจ๋“  ์š”์†Œ๊ฐ€ 0 ์ด์ƒ์ธ ํ–‰๋ ฌ์„ ๋‘ ๊ฐœ ์ด์ƒ์˜ ๋น„์Œ์ˆ˜ ํ–‰๋ ฌ์˜ ๊ณฑ์œผ๋กœ ๋ถ„ํ•ดํ•˜๋Š” ๊ธฐ๋ฒ• / ์„ ํ˜• ๋Œ€์ˆ˜ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ

- ์žฅ์  : ์ง๊ด€์ ์œผ๋กœ ํ•ด์„, ๋ณต์ ‘ํ•œ ๋ฐ์ดํ„ฐ์˜ ๋‚ด์žฌ๋œ ๊ตฌ์กฐ ๋ฐœ๊ฒฌ์— ํšจ๊ณผ์   |  ๋‹จ์  : ๋ชจ๋“  ์š”์†Œ๊ฐ€ ๋น„์Œ์ˆ˜์ผ ๋•Œ๋งŒ ์ ์šฉ ๊ฐ€๋Šฅ
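To make the NMF description concrete, here is a minimal sketch that factorizes a tiny non-negative matrix and checks that both factors stay non-negative. The matrix values are made up for illustration, and the `init`, `random_state`, and `max_iter` settings are choices made for this sketch, not settings taken from the exercise.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative "document-term" matrix: 4 documents x 5 terms (made-up values).
V = np.array([
    [2.0, 1.0, 0.0, 0.0, 0.0],
    [3.0, 2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0, 0.0],
    [0.0, 0.0, 2.0, 3.0, 1.0],
])

# Factorize V into two non-negative matrices with 2 latent topics.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
W = model.fit_transform(V)  # shape (4, 2): documents x latent topics
H = model.components_       # shape (2, 5): latent topics x terms

# The product W @ H approximates V; report the relative reconstruction error.
V_approx = W @ H
rel_error = np.linalg.norm(V - V_approx) / np.linalg.norm(V)
print("W shape:", W.shape, "H shape:", H.shape)
print("relative reconstruction error:", round(float(rel_error), 3))
```

Because every entry of W and H is non-negative, each document is built up only by adding topics together, which is a large part of why NMF results tend to be easy to read.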

 


👾Using LDA topic modeling

 

๋ฐ์ดํ„ฐ๋Š” ๋‰ด์Šค ์ œ๋ชฉ๊ณผ ๋ ˆ์ด๋ธ”์„ ํฌํ•จํ•˜๊ณ  ์žˆ๊ณ , ์ด ์ œ๋ชฉ๋“ค์„ ๋ถ„์„ํ•˜์—ฌ 4๊ฐœ์˜ ํ† ํ”ฝ์œผ๋กœ ๋‚˜๋ˆ ๋ณด๊ณ  ์‹ค์ œ ๋ ˆ์ด๋ธ”๊ณผ ๋น„๊ตํ•ด๋ณด๋Š” ์‹ค์Šต์ž…๋‹ˆ๋‹ค.

# TF-IDF๋กœ ํ…์ŠคํŠธ ๋ฒกํ„ฐํ™” ์ง„ํ–‰
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfV= TfidfVectorizer()
dtm = tfidfV.fit_transform(df['title'])
df_dtm = pd.DataFrame(dtm.toarray(), columns=tfidfV.get_feature_names_out())

 

Run LDA topic modeling with the number of topics set to 4. The result is described by two matrices, W and H:

 

W: shows how each row of the original data (each document) can be expressed as a combination of the rows of H

-> weight matrix, basis matrix / forms the basis that maps the original data onto the latent features

 

H: expresses the columns of the original data (the terms) in terms of the new, reduced-dimension features

-> coefficient matrix, encoding matrix / holds the coefficients that describe how the latent features reconstruct the original data
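The shapes of W and H are easier to see on a tiny example. The sketch below fits scikit-learn's `LatentDirichletAllocation` on a made-up count matrix (not the news data) and inspects the two matrices. For this class, `fit_transform` returns a per-document topic distribution and `components_` holds unnormalized topic-word weights, so the W/H naming borrowed from matrix factorization is only a loose analogy here. `topic_word` is a name introduced for this sketch.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Made-up count matrix: 4 documents x 6 terms.
X = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 0, 0, 1, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 1],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
W = lda.fit_transform(X)  # shape (4, 2): one topic distribution per document
H = lda.components_       # shape (2, 6): one row of word weights per topic

# Normalizing each row of H gives a word distribution per topic.
topic_word = H / H.sum(axis=1, keepdims=True)

print("W shape:", W.shape, "| row sums:", W.sum(axis=1))
print("H shape:", H.shape)
```

Each row of W sums to 1, so a row can be read as "how much of this document belongs to each topic".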

# Run LDA topic modeling
from sklearn.decomposition import LatentDirichletAllocation
LDA_model = LatentDirichletAllocation(n_components=4)  # split into 4 topics (the keyword is n_components)
W = LDA_model.fit_transform(df_dtm)  # document-topic weights
H = LDA_model.components_            # topic-word weights
# Attach the actual titles and labels to the document-topic DataFrame
df_lda_w = pd.DataFrame(W)
df_lda_w['title'] = df['title']
df_lda_w['label'] = df['label']

# Put the topic-word result into a DataFrame
df_lda_topic = pd.DataFrame(H, columns=tfidfvect.get_feature_names_out())

Results of df_lda_w and df_lda_topic

# Show only the rows labeled "스포츠" (sports), shading each row by topic weight
df_lda_w[df_lda_w["label"] == "스포츠"].head(20).style.background_gradient(axis=1)

 

 

 

 

Looking only at the rows of one label like this, we can see that the sports titles are separated cleanly into the topic at index 3.
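One way to check this kind of label-to-topic correspondence without reading the table by eye is to take each document's highest-weight topic and cross-tabulate it against the labels. The sketch below uses a made-up weight matrix and English placeholder labels, not the post's data. `df_topics` and `dominant_topic` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights (rows: documents, columns: topics 0-3).
W = np.array([
    [0.05, 0.05, 0.10, 0.80],
    [0.10, 0.10, 0.10, 0.70],
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.20, 0.10, 0.60],
])
labels = ["sports", "sports", "world", "world", "world"]  # placeholder labels

df_topics = pd.DataFrame(W)
df_topics["label"] = labels
# The dominant topic is the column index with the largest weight in each row.
df_topics["dominant_topic"] = W.argmax(axis=1)

# Rows: actual labels, columns: dominant topic index, cells: document counts.
table = pd.crosstab(df_topics["label"], df_topics["dominant_topic"])
print(table)
```

With the exercise's data, the same idea would apply to the numeric topic columns of `df_lda_w`. A label whose counts pile up in a single column corresponds to a well-separated topic.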

 

 

 

 


👾Using NMF topic modeling

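# Run NMF topic modeling with 4 topics
from sklearn.decomposition import NMF
nmf_model = NMF(n_components = 4)
W = nmf_model.fit_transform(df_dtm)  # document-topic weights
H = nmf_model.components_            # topic-word weights

# Attach the actual titles and labels to the document-topic DataFrame
df_nmf_w = pd.DataFrame(W)
df_nmf_w['title'] = df['title']
df_nmf_w['label'] = df['label']

# Show only the rows labeled "세계" (world), shading each row by topic weight
df_nmf_w[df_nmf_w["label"] == "세계"].head(20).style.background_gradient(axis=1)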

 

The procedure is the same as the LDA one above.

 

Finally, compare the two models by plotting the highest-weighted words of each topic.

# plot_top_words: a function that draws the top keywords of each topic as bar charts
# (its definition is not shown in this post and is assumed to exist already)
n_top_words = 20

plot_top_words(
    LDA_model, tfidfvect.get_feature_names_out(), n_top_words, 
    "Topics in LDA model (LatentDirichletAllocation)", n_topics=4
)

plot_top_words(
    nmf_model, tfidfvect.get_feature_names_out(), n_top_words, 
    "Topics in NMF model", n_topics=4
)
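The definition of `plot_top_words` does not appear in this post, so the following is not that function. It is a small hypothetical helper, `top_words_per_topic`, that lists the words such a bar chart would display: for each topic (each row of the topic-word matrix) it sorts the weights and keeps the largest ones. The matrix and vocabulary are made up.

```python
import numpy as np

def top_words_per_topic(components, feature_names, n_top_words):
    """Return, for each topic, the n_top_words (word, weight) pairs with the highest weight."""
    feature_names = np.asarray(feature_names)
    result = []
    for topic_weights in components:
        # argsort is ascending, so reverse it and keep the first n_top_words indices.
        top_idx = np.argsort(topic_weights)[::-1][:n_top_words]
        result.append([(str(feature_names[i]), float(topic_weights[i])) for i in top_idx])
    return result

# Made-up topic-word matrix: 2 topics x 4 words.
H = np.array([
    [0.1, 2.0, 0.5, 0.0],
    [1.5, 0.0, 0.2, 0.9],
])
vocab = ["goal", "market", "stock", "team"]

top = top_words_per_topic(H, vocab, n_top_words=2)
for k, pairs in enumerate(top):
    print(f"topic {k}:", pairs)
```

With the fitted models from the exercise, the call would look like `top_words_per_topic(LDA_model.components_, tfidfvect.get_feature_names_out(), n_top_words)`, and likewise with `nmf_model.components_`.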

 

 

It is also worth repeating the exercise with CountVectorizer instead of TF-IDF for the text vectorization,

and I recommend reviewing the word cloud from the earlier post as well.

 


Sources

https://www.eecis.udel.edu/~shatkay/Course/papers/UIntrotoTopicModelsBlei2011-5.pdf

https://inf.run/uh9Xr

 

모두의 한국어 텍스트 분석과 자연어처리 with 파이썬 (a course on Korean text analysis and natural language processing with Python) | 박조은 - Inflearn (www.inflearn.com)