Python Deep Learning Notes (6): A Complete Text Processing Workflow

Yanwei Liu
Jul 13, 2019


  • Regex syntax
  • Regex functions
  • Tokenization
  • Stemming/Lemmatization
  • Combining NLTK and Regex
  • Visualizing Word Frequencies

Searching Text using Regex

  • re.search: finds the first instance matching the pattern, returns a match object or None
  • re.match: finds an instance of the pattern only at the beginning of the string, returns a match object or None
  • re.fullmatch: finds whether the whole string matches the pattern given, returns a match object or None
  • re.findall: finds all non-overlapping matches to the pattern, returns a list of all matches
  • re.finditer: finds all non-overlapping matches to the pattern, returns an iterator object that can tell you the start/stop/contents of the match
  • re.sub(pattern, replacement, string): replaces the pattern with a replacement string, returns modified string
  • re.split: splits a string based on a pattern, returns a list of strings (a minimal sketch of all these functions follows below)
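
A minimal sketch of these functions on a toy string (the sample text and patterns here are made up for illustration):

import re

text = "cat catalog cataclysm"
print(re.search(r"cat", text))      # first match anywhere in the string
print(re.match(r"cat", text))       # succeeds only because the string starts with 'cat'
print(re.fullmatch(r"cat", "cat"))  # the whole string must match
print(re.findall(r"cat\w*", text))  # ['cat', 'catalog', 'cataclysm']
for m in re.finditer(r"cat\w*", text):
    print(m.start(), m.end(), m.group())  # start/stop/contents of each match
print(re.sub(r"cat", "dog", text))  # 'dog dogalog dogaclysm'
print(re.split(r"\s+", text))       # ['cat', 'catalog', 'cataclysm']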

Tokenization

from nltk.tokenize import word_tokenize, sent_tokenize

print(word_tokenize("This is the queen's castle. Yay!"))
# ['This', 'is', 'the', 'queen', "'s", 'castle', '.', 'Yay', '!']
print(sent_tokenize(got)[1:3])  # got holds the raw text of the novel
# ['"The wildlings are \ndead."', '"Do the dead frighten you?"']
import random
from nltk.corpus import stopwords

stop_words = stopwords.words("english")
print(random.sample(stop_words, 8))
print('There are', len(stop_words), 'English stopwords.')
# ['now', 'about', 'to', 'too', 'himself', 'were', 'some', "you'll"]
# There are 179 English stopwords.
import string

punct = list(string.punctuation)
print(punct[0:13])
print('There are', len(punct), 'punctuation marks.')
# ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-']
# There are 32 punctuation marks.

stops = stop_words + punct + ["''", 'r.', '``', "'s", "n't"]
filtered_words = []
for w in got_words:  # got_words = word_tokenize(got)
    if w.lower() not in stops:
        filtered_words.append(w.lower())
print(filtered_words[0:8])
# ['game', 'thrones', 'book', 'one', 'song', 'ice', 'fire', 'george']

Stemming and Lemmatization

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stemmed_words = []
for w in filtered_words:
    stemmed_words.append(ps.stem(w))
print('Original:', filtered_words[7], filtered_words[13], filtered_words[15], filtered_words[26])
# Original: george urged began asked
print('Stemmed:', stemmed_words[7], stemmed_words[13], stemmed_words[15], stemmed_words[26])
# Stemmed: georg urg began ask
from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()
lemm_words = []
for w in filtered_words:
    lemm_words.append(lem.lemmatize(w, 'v'))  # 'v': lemmatize each token as a verb
print('Original:', filtered_words[7], filtered_words[13], filtered_words[15], filtered_words[26])
# Original: george urged began asked
print('Lemmatized:', lemm_words[7], lemm_words[13], lemm_words[15], lemm_words[26])
# Lemmatized: george urge begin ask
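
Stemming is a crude suffix-chopping heuristic, while lemmatization maps words to dictionary forms. A self-contained comparison (the word list here is chosen just for illustration):

from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

ps = PorterStemmer()
lem = WordNetLemmatizer()
for w in ['studies', 'studying', 'cries', 'was']:
    print(w, '->', ps.stem(w), '|', lem.lemmatize(w, 'v'))
# studies -> studi | study
# studying -> studi | study
# cries -> cri | cry
# was -> wa | be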

Combining NLTK and Regex

import re
import nltk
from nltk.tokenize import RegexpTokenizer

print(RegexpTokenizer(r'\w+').tokenize("This is the queen's castle. So exciting!"))
# ['This', 'is', 'the', 'queen', 's', 'castle', 'So', 'exciting']

words_ending_with_ing = [w for w in got_words if re.search("ing$", w)]
print('Tokens:', words_ending_with_ing[3:8])
# Tokens: ['falling', 'being', 'something', 'rushing', 'Something']
words_ending_with_ing2 = [w for w in lemm_words if re.search("ing$", w)]
print('Lemmatized:', words_ending_with_ing2[3:7])
# Lemmatized: ['something', 'something', 'wildling', 'something']

got_text = nltk.Text(lemm_words)
print(got_text)
# <Text: game throne book one song ice fire george...>
got_text.findall(r'<.*><daenerys><.*>')  # findall prints its matches (and returns None)
# hide daenerys brother; usurper daenerys quicken; day daenerys want; hot daenerys flinch; archer daenerys say; princess daenerys magister; hand daenerys find; help daenerys collar; sister daenerys stormborn;...drogo daenerys targaryen

Visualizing Word Frequencies

import matplotlib.pyplot as plt

freqdist = nltk.FreqDist(got_words)
plt.figure(figsize=(16, 5))
plt.title('GOT Base Tokens Frequency Distribution')
freqdist.plot(50)
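
FreqDist is a subclass of Python's Counter, so you can also inspect the counts directly instead of plotting; for example:

# Show the ten most frequent raw tokens with their counts
for word, count in freqdist.most_common(10):
    print(word, count)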
