R Language Study Notes (9): Text Processing

Word segmentation with jiebaR

This assumes a variable named article_txt already exists and already holds the text to process.

library(jiebaR)
cutter = worker(bylines = TRUE)  # return tokens separately for each input line
# Segment every article in article_txt
article_words = sapply(article_txt, function(x) segment(x, cutter))
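Since article_txt is only assumed above, here is a minimal sketch of what it might look like and what the segmentation returns. The two sample sentences are made up for illustration, and the sketch passes the whole vector to segment() directly instead of going through sapply.

```r
library(jiebaR)

# Hypothetical sample data standing in for the real articles
article_txt <- c("我愛自然語言處理", "今天天氣很好")

# bylines = TRUE keeps the tokens of each input element separate
cutter <- worker(bylines = TRUE)
article_words <- segment(article_txt, cutter)

# Inspect the structure: one character vector of tokens per article
str(article_words)
```

The exact tokens depend on the dictionary jiebaR ships with, so no expected output is shown here.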

Building a vocabulary with text2vec

a = article_words
library(text2vec)
a.token <- itoken(a)
a.vocab <- create_vocabulary(a.token, ngram = c(1,1))
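# Columns: term, count, share of documents
# (older text2vec API; from text2vec 0.5 on the vocabulary is itself a data frame, so use head(a.vocab))
head(a.vocab$vocab)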
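To make the three vocabulary columns concrete, here is a base-R sketch that computes the same kind of statistics by hand on toy data. The docs list and the column names are assumptions for illustration only; this is not how text2vec builds its vocabulary internally.

```r
# Toy token lists standing in for article_words (hypothetical data)
docs <- list(c("apple", "banana", "apple"), c("banana", "cherry"))

# How many times each term appears across all documents
term_count <- table(unlist(docs))

# How many documents contain each term at least once
doc_count <- table(unlist(lapply(docs, unique)))

vocab <- data.frame(
  term = names(term_count),
  term_count = as.integer(term_count),
  doc_count = as.integer(doc_count[names(term_count)]),
  stringsAsFactors = FALSE
)

# Share of documents that contain the term
vocab$doc_share <- vocab$doc_count / length(docs)
print(vocab)
```

Here "apple" appears twice but only in one of the two documents, so its count is 2 while its document share is 0.5.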

Computing the TCM (how often terms appear together)

TCM stands for term co-occurrence matrix: entry (i, j) records how often term i and term j appear near each other.

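a.token <- itoken(a)  # iterator over the segmented articles
# Count co-occurrences within a window of 5 tokens.
# Older text2vec API: from 0.5 on, use vocab_vectorizer(a.vocab) and pass skip_grams_window = 5L to create_tcm().
a.vectorizer <- vocab_vectorizer(a.vocab, grow_dtm = FALSE, skip_grams_window = 5)
a.tcm <- create_tcm(a.token, a.vectorizer)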
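To see what a window-based co-occurrence count means, here is a small base-R sketch on a toy token sequence. The tokens and the window size of 2 are assumptions for illustration. It uses plain unweighted counts in a dense matrix, whereas text2vec may weight pairs by distance and store the result sparsely, so the numbers will not necessarily match create_tcm.

```r
# Toy token sequence (hypothetical data) and window size
tokens <- c("a", "b", "c", "a")
window <- 2

terms <- sort(unique(tokens))
tcm <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))

n <- length(tokens)
for (i in seq_len(n - 1)) {
  # Pair token i with every token at most `window` positions to its right
  last <- min(i + window, n)
  for (j in (i + 1):last) {
    x <- tokens[i]
    y <- tokens[j]
    tcm[x, y] <- tcm[x, y] + 1
    # Mirror the count so the matrix stays symmetric
    if (x != y) tcm[y, x] <- tcm[y, x] + 1
  }
}
print(tcm)
```

The two "a" tokens are three positions apart, which is outside the window, so the a/a cell stays at 0 while a/b and a/c each reach 2.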

Written by

Machine Learning / Deep Learning / Python / Flutter cakeresume.com/yanwei-liu
