Python Deep Learning Notes (5): Natural Language Processing with NLTK

Yanwei Liu
May 8, 2019


The 100 most common words in the novel Nineteen Eighty-Four

Installing NLTK

pip install nltk

Downloading the NLTK data packages

import nltk
nltk.download()
# opens a GUI where you can download the required data

Loading the built-in book corpora

# nine books in total (if a book cannot be found, follow the on-screen prompt to download it)
from nltk.book import *

Counting word frequencies and plotting them

from bs4 import BeautifulSoup
import urllib.request
import nltk

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
freq = nltk.FreqDist(tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
freq.plot(20, cumulative=False)

Removing stop words

Stop words fall into two broad categories:
1) Function words of a language, such as 'the', 'is', 'at', 'which', and 'on'.
2) Content words such as 'want', which are used so widely that a search engine cannot guarantee genuinely relevant results for queries containing them.
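
To see exactly what NLTK treats as English stop words, you can print the built-in list first (a quick sketch; it assumes the stopwords corpus has already been fetched via nltk.download('stopwords')):

from nltk.corpus import stopwords

english_stops = stopwords.words('english')  # a plain Python list of lowercase words
print(len(english_stops))   # list size varies slightly by NLTK version
print(english_stops[:10])   # e.g. ['i', 'me', 'my', 'myself', 'we', ...]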
# the stopwords corpus must first be downloaded via nltk.download()
from bs4 import BeautifulSoup
import urllib.request
import nltk
from nltk.corpus import stopwords

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
clean_tokens = tokens[:]
sr = stopwords.words('english')
for token in tokens:
    if token in sr:
        clean_tokens.remove(token)
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
freq.plot(20, cumulative=False)

You can try this on the novel Nineteen Eighty-Four to see which words occur most frequently; the plain text is available at http://gutenberg.net.au/ebooks01/0100021.txt
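
As a concrete sketch of that suggestion (assuming the Gutenberg Australia URL above is still reachable, and that the punkt and stopwords data have been downloaded), the same pipeline can be pointed at the novel's plain-text file:

import urllib.request
import nltk
from nltk.corpus import stopwords

# fetch the plain-text edition of Nineteen Eighty-Four
response = urllib.request.urlopen('http://gutenberg.net.au/ebooks01/0100021.txt')
raw = response.read().decode('utf-8', errors='ignore')

# lowercase, keep alphabetic tokens only, then drop English stop words
tokens = [t for t in nltk.word_tokenize(raw.lower()) if t.isalpha()]
stops = set(stopwords.words('english'))
content_words = [t for t in tokens if t not in stops]

freq = nltk.FreqDist(content_words)
print(freq.most_common(20))     # the 20 most frequent content words
freq.plot(20, cumulative=False)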

Splitting English text into sentences (sentence tokenization)

from nltk.tokenize import sent_tokenize

mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))
# ['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']

Splitting English text into words (word tokenization)

from nltk.tokenize import word_tokenize

mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(word_tokenize(mytext))
# ['Hello', 'Mr.', 'Adam', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'going', 'well', '.', 'Today', 'is', 'a', 'good', 'day', ',', 'see', 'you', 'dude', '.']

Tokenizing non-English text

from nltk.tokenize import sent_tokenize

mytext = "Bonjour M. Adam, comment allez-vous? J'espère que tout va bien. Aujourd'hui est un bon jour."
print(sent_tokenize(mytext, "french"))
# ['Bonjour M. Adam, comment allez-vous?', "J'espère que tout va bien.", "Aujourd'hui est un bon jour."]

Showing synonyms and antonyms with WordNet

# the wordnet corpus must first be downloaded via nltk.download()
from nltk.corpus import wordnet

syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())

# use a for loop to collect a large set of synonyms
from nltk.corpus import wordnet

synonyms = []
for syn in wordnet.synsets('Computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)

# use a for loop to collect antonyms
from nltk.corpus import wordnet

antonyms = []
for syn in wordnet.synsets("small"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)

Stemming (stripping word suffixes)

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('working'))
# prints 'work'
Stemming is supported for the following languages:

from nltk.stem import SnowballStemmer
print(SnowballStemmer.languages)
# ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')

Stemming German words:

from nltk.stem import SnowballStemmer
german_stemmer = SnowballStemmer('german')
print(german_stemmer.stem("Guten"))

More precise suffix removal: lemmatization

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))  # 'play'
print(lemmatizer.lemmatize('playing', pos="n"))  # 'playing'
print(lemmatizer.lemmatize('playing', pos="a"))  # 'playing'
print(lemmatizer.lemmatize('playing', pos="r"))  # 'playing'

Finding high-frequency words

# import the necessary libraries
import pandas as pd
import numpy as np
import nltk
import os
import nltk.corpus

text = "In Brazil they drive on the right-hand side of the road. Brazil has a large coastline on the eastern side of South America"

from nltk.tokenize import word_tokenize
token = word_tokenize(text)
token

from nltk.probability import FreqDist
fdist = FreqDist(token)
fdist

# the 10 most frequent words
fdist1 = fdist.most_common(10)
fdist1

Part-of-speech (POS) tagging

text = "vote to choose a particular man or a group (party) to represent them in parliament"
# tokenize the text
tex = word_tokenize(text)
for token in tex:
    print(nltk.pos_tag([token]))
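
Tagging tokens one at a time, as above, gives the tagger no sentence context; passing the whole token list to nltk.pos_tag in a single call usually produces more accurate tags:

tags = nltk.pos_tag(tex)  # tag the full token list at once, so the tagger can use context
print(tags)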

Named entity recognition (NER)

text = "Google's CEO Sundar Pichai introduced the new Pixel at Minnesota Roi Centre Event"

# import the chunk library from nltk
from nltk import ne_chunk

# tokenize and POS-tag before chunking
token = word_tokenize(text)
tags = nltk.pos_tag(token)
chunk = ne_chunk(tags)
chunk

Chunking (information extraction)

text = "We saw the yellow dog"
token = word_tokenize(text)
tags = nltk.pos_tag(token)

reg = "NP: {<DT>?<JJ>*<NN>}"
a = nltk.RegexpParser(reg)
result = a.parse(tags)
print(result)

n-gram

from nltk.book import *
from nltk import bigrams, FreqDist

bigram_list = bigrams(text6)
bigramDist = FreqDist(bigram_list)
bigramDist[("Sir", "Robin")]
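
The same idea generalizes past pairs with nltk.ngrams (a small sketch; text6 is the Monty Python and the Holy Grail script that nltk.book loads):

from nltk import ngrams, FreqDist
from nltk.book import text6

trigram_list = ngrams(text6, 3)        # n=3 yields trigrams
trigramDist = FreqDist(trigram_list)
print(trigramDist.most_common(5))      # the five most frequent trigrams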

Spelling errors (edit distance)

# transforming "rain" into "shine" takes three edits (rain -> sain -> shin -> shine)
from nltk.metrics import edit_distance
edit_distance("rain", "shine")
# 3
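
Edit distance is the usual building block for spelling suggestions: compare a misspelled word against a vocabulary and keep the closest entries. Below is a minimal sketch with a hypothetical suggest() helper; it assumes the words corpus has been fetched via nltk.download('words'), and since a plain linear scan is slow on large vocabularies, it first skips words whose length already rules them out:

from nltk.corpus import words
from nltk.metrics import edit_distance

def suggest(word, max_dist=2):
    # keep dictionary words within max_dist edits; words whose length
    # differs by more than max_dist can never be within range
    return [w for w in words.words()
            if abs(len(w) - len(word)) <= max_dist
            and edit_distance(word, w) <= max_dist]

print(suggest('shinee')[:10])  # candidate corrections, e.g. 'shine'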
