Python Deep Learning Notes (6): A Complete Text Processing Pipeline
Jul 13, 2019
- Regex syntax
- Regex functions
- Tokenization
- Stemming/Lemmatization
- Combining NLTK and Regex
- Visualizing Word Frequencies
Searching Text using Regex
- re.search: finds the first instance matching the pattern; returns a match object or None
- re.match: finds an instance of the pattern only at the beginning of the string; returns a match object or None
- re.fullmatch: finds whether the whole string matches the pattern; returns a match object or None
- re.findall: finds all non-overlapping matches of the pattern; returns a list of all matches
- re.finditer: finds all non-overlapping matches of the pattern; returns an iterator of match objects that give the start/end/contents of each match
- re.sub(pattern, replacement, string): replaces the pattern with a replacement string; returns the modified string
- re.split: splits a string wherever the pattern matches; returns a list of strings
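As a minimal sketch of the most common calls (the sample sentence here is made up):

import re

text = "Winter is coming. Winter is here."

print(re.search(r"Winter", text))   # <re.Match object; span=(0, 6), match='Winter'>
print(re.match(r"is", text))        # None, because 'is' is not at the start of the string
print(re.findall(r"Winter", text))  # ['Winter', 'Winter']
print(re.sub(r"Winter", "Summer", text))  # Summer is coming. Summer is here.
print(re.split(r"\. ", text))       # ['Winter is coming', 'Winter is here.']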
Tokenization
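The examples below work on the text of A Game of Thrones, stored in got (the raw string) and got_words (its word tokens). Neither variable is defined in this post, so here is a minimal sketch of how they might be prepared, assuming the book text sits in a local file named got.txt (a hypothetical name) and that the needed NLTK data has been downloaded:

import nltk

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stopword list
nltk.download('wordnet')    # lemmatizer dictionary, used later on

with open('got.txt', encoding='utf-8') as f:  # hypothetical file name
    got = f.read()

got_words = nltk.word_tokenize(got)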
import random
import string
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

print(word_tokenize("This is the queen's castle. Yay!"))
# ['This', 'is', 'the', 'queen', "'s", 'castle', '.', 'Yay', '!']

print(sent_tokenize(got)[1:3])
# ['"The wildlings are \ndead."', '"Do the dead frighten you?"']

stop_words = stopwords.words("english")
print(random.sample(stop_words, 8))
print('There are', len(stop_words), 'English stopwords.')
# ['now', 'about', 'to', 'too', 'himself', 'were', 'some', "you'll"]
# There are 179 English stopwords.

punct = list(string.punctuation)
print(punct[0:13])
print('There are', len(punct), 'punctuation marks.')
# ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-']
# There are 32 punctuation marks.

stops = stop_words + punct + ["''", 'r.', '``', "'s", "n't"]
filtered_words = []
for w in got_words:
    if w.lower() not in stops:
        filtered_words.append(w.lower())
print(filtered_words[0:8])
# ['game', 'thrones', 'book', 'one', 'song', 'ice', 'fire', 'george']
Stemming and Lemmatization
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stemmed_words = []
for w in filtered_words:
    stemmed_words.append(ps.stem(w))

print('Original:', filtered_words[7], filtered_words[13], filtered_words[15], filtered_words[26])
# Original: george urged began asked
print('Stemmed:', stemmed_words[7], stemmed_words[13], stemmed_words[15], stemmed_words[26])
# Stemmed: georg urg began ask

from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()
lemm_words = []
for w in filtered_words:
    lemm_words.append(lem.lemmatize(w, 'v'))

print('Original:', filtered_words[7], filtered_words[13], filtered_words[15], filtered_words[26])
# Original: george urged began asked
print('Lemmatized:', lemm_words[7], lemm_words[13], lemm_words[15], lemm_words[26])
# Lemmatized: george urge begin ask
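The second argument to lemmatize is a part-of-speech tag, 'v' for verb here. Without it the lemmatizer treats every word as a noun and leaves most verb forms untouched, as a small comparison shows:

from nltk.stem.wordnet import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize('asked'))       # asked (the default POS is noun)
print(lem.lemmatize('asked', 'v'))  # ask
print(lem.lemmatize('began', 'v'))  # begin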
Combining NLTK and Regex
import re
import nltk
from nltk.tokenize import RegexpTokenizer

print(RegexpTokenizer(r'\w+').tokenize("This is the queen's castle. So exciting!"))
# ['This', 'is', 'the', 'queen', 's', 'castle', 'So', 'exciting']

words_ending_with_ing = [w for w in got_words if re.search("ing$", w)]
print('Tokens:', words_ending_with_ing[3:8])
# Tokens: ['falling', 'being', 'something', 'rushing', 'Something']

words_ending_with_ing2 = [w for w in lemm_words if re.search("ing$", w)]
print('Lemmatized:', words_ending_with_ing2[3:7])
# Lemmatized: ['something', 'something', 'wildling', 'something']

got_text = nltk.Text(lemm_words)
print(got_text)
# <Text: game throne book one song ice fire george...>

got_text.findall(r'<.*><daenerys><.*>')  # findall prints its matches rather than returning them
# hide daenerys brother; usurper daenerys quicken; day daenerys want; hot
# daenerys flinch; archer daenerys say; princess daenerys magister; hand
# daenerys find; help daenerys collar; sister daenerys stormborn;...drogo
# daenerys targaryen
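findall is only one of the exploration helpers on nltk.Text; concordance, for instance, prints each occurrence of a word in its surrounding context:

got_text.concordance('daenerys', lines=5)  # show five matches with context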
Visualizing Word Frequencies
import matplotlib.pyplot as plt

freqdist = nltk.FreqDist(got_words)
plt.figure(figsize=(16, 5))
plt.title('GOT Base Tokens Frequency Distribution')
freqdist.plot(50)
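Since FreqDist subclasses collections.Counter, the raw counts behind the plot can also be inspected directly:

print(freqdist.most_common(10))  # the ten most frequent tokens with their counts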