Python Deep Learning Notes (6): A Complete Text Processing Workflow

Yanwei Liu
Jul 13, 2019


  • Regex syntax
  • Regex functions
  • Tokenization
  • Stemming/Lemmatization
  • Combining NLTK and Regex
  • Visualizing Word Frequencies

Searching Text using Regex

  • re.search: finds the first instance matching the pattern, returns a match object or None
  • re.match: finds an instance of the pattern only at the beginning of the string, returns a match object or None
  • re.fullmatch: finds whether the whole string matches the pattern given, returns a match object or None
  • re.findall: finds all non-overlapping matches to the pattern, returns a list of all matches
  • re.finditer: finds all non-overlapping matches to the pattern, returns an iterator object that can tell you the start/stop/contents of the match
  • re.sub(pattern, replacement, string): replaces the pattern with a replacement string, returns modified string
  • re.split: splits a string based on a pattern, returns a list of strings (a minimal sketch of all these functions follows below)
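
A minimal sketch of these functions on a toy string (the sample text and patterns here are made up for illustration):

import re

text = "cat catalog cataclysm"
print(re.search(r"cat", text))      # first match anywhere in the string
print(re.match(r"cat", text))       # succeeds only because the string starts with 'cat'
print(re.fullmatch(r"cat", "cat"))  # the whole string must match
print(re.findall(r"cat\w*", text))  # ['cat', 'catalog', 'cataclysm']
for m in re.finditer(r"cat\w*", text):
    print(m.start(), m.end(), m.group())  # start/stop/contents of each match
print(re.sub(r"cat", "dog", text))  # 'dog dogalog dogaclysm'
print(re.split(r"\s+", text))       # ['cat', 'catalog', 'cataclysm']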

Tokenization

from nltk.tokenize import word_tokenize, sent_tokenize

print(word_tokenize("This is the queen's castle. Yay!"))
# ['This', 'is', 'the', 'queen', "'s", 'castle', '.', 'Yay', '!']
print(sent_tokenize(got)[1:3])  # got holds the raw text of the novel
# ['"The wildlings are \ndead."', '"Do the dead frighten you?"']
import random
from nltk.corpus import stopwords

stop_words = stopwords.words("english")
print(random.sample(stop_words, 8))
print('There are', len(stop_words), 'English stopwords.')
# ['now', 'about', 'to', 'too', 'himself', 'were', 'some', "you'll"]
# There are 179 English stopwords.
import string

punct = list(string.punctuation)
print(punct[0:13])
print('There are', len(punct), 'punctuation marks.')
# ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-']
# There are 32 punctuation marks.

stops = stop_words + punct + ["''", 'r.', '``', "'s", "n't"]
filtered_words = []
for w in got_words:  # got_words = word_tokenize(got)
    if w.lower() not in stops:
        filtered_words.append(w.lower())
print(filtered_words[0:8])
# ['game', 'thrones', 'book', 'one', 'song', 'ice', 'fire', 'george']

Stemming and Lemmatization

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stemmed_words = []
for w in filtered_words:
    stemmed_words.append(ps.stem(w))
print('Original:', filtered_words[7], filtered_words[13], filtered_words[15], filtered_words[26])
# Original: george urged began asked
print('Stemmed:', stemmed_words[7], stemmed_words[13], stemmed_words[15], stemmed_words[26])
# Stemmed: georg urg began ask
from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()
lemm_words = []
for w in filtered_words:
    lemm_words.append(lem.lemmatize(w, 'v'))  # 'v': lemmatize each token as a verb
print('Original:', filtered_words[7], filtered_words[13], filtered_words[15], filtered_words[26])
# Original: george urged began asked
print('Lemmatized:', lemm_words[7], lemm_words[13], lemm_words[15], lemm_words[26])
# Lemmatized: george urge begin ask
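
Stemming is a crude suffix-chopping heuristic, while lemmatization maps words to dictionary forms. A self-contained comparison (the word list here is chosen just for illustration):

from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

ps = PorterStemmer()
lem = WordNetLemmatizer()
for w in ['studies', 'studying', 'cries', 'was']:
    print(w, '->', ps.stem(w), '|', lem.lemmatize(w, 'v'))
# studies -> studi | study
# studying -> studi | study
# cries -> cri | cry
# was -> wa | be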

Combining NLTK and Regex

import re
import nltk
from nltk.tokenize import RegexpTokenizer

print(RegexpTokenizer(r'\w+').tokenize("This is the queen's castle. So exciting!"))
# ['This', 'is', 'the', 'queen', 's', 'castle', 'So', 'exciting']

words_ending_with_ing = [w for w in got_words if re.search("ing$", w)]
print('Tokens:', words_ending_with_ing[3:8])
# Tokens: ['falling', 'being', 'something', 'rushing', 'Something']
words_ending_with_ing2 = [w for w in lemm_words if re.search("ing$", w)]
print('Lemmatized:', words_ending_with_ing2[3:7])
# Lemmatized: ['something', 'something', 'wildling', 'something']

got_text = nltk.Text(lemm_words)
print(got_text)
# <Text: game throne book one song ice fire george...>
got_text.findall(r'<.*><daenerys><.*>')  # findall prints its matches (and returns None)
# hide daenerys brother; usurper daenerys quicken; day daenerys want; hot daenerys flinch; archer daenerys say; princess daenerys magister; hand daenerys find; help daenerys collar; sister daenerys stormborn;...drogo daenerys targaryen

Visualizing Word Frequencies

import matplotlib.pyplot as plt

freqdist = nltk.FreqDist(got_words)
plt.figure(figsize=(16, 5))
plt.title('GOT Base Tokens Frequency Distribution')
freqdist.plot(50)
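
FreqDist is a subclass of Python's Counter, so you can also inspect the counts directly instead of plotting; for example:

# Show the ten most frequent raw tokens with their counts
for word, count in freqdist.most_common(10):
    print(word, count)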
