A Multithreading Example on Python Web Crawling

Before the article begins, I want to tell you why I wrote this crawler. Every month my school buys books in bulk, and it's not very convenient to check the titles of those books on the library website, because there is so much other information that I get distracted.

I just want to know what's new on the shelf.

So last year, I wrote a crawler to collect all the new book titles from the library website.

Everything worked pretty well. The only drawback of my program was that the performance was bad: it took 190.1 sec to collect all the information (2,186 books).

Today, I rewrote my code to improve its performance, and I am pretty satisfied with the result: the same task (2,186 books) now takes only 83.9 sec.
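If you want to reproduce this kind of end-to-end measurement yourself, a minimal timing sketch looks like this (the `crawl` function here is just a stand-in workload, not the real crawler):

```python
import time

def crawl():
    # stand-in for the real crawl; replace the body with your own work
    return sum(i * i for i in range(100_000))

start = time.perf_counter()          # monotonic clock, good for intervals
result = crawl()
elapsed = time.perf_counter() - start
print(f"finished in {elapsed:.1f} sec")
```

`time.perf_counter()` is preferable to `time.time()` for this because it is monotonic and has higher resolution.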

Below are the two versions of the code I want to share with you:

Original version (190.1 sec)

"list.txt" contains the links from my school's library website. I just want you to see how to use multithreading to improve performance, so please don't ask me for the "list.txt" file.

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}

with open('list.txt') as f:
    for url in f:
        # each line of list.txt is one URL; strip the trailing newline
        resp = requests.get(url.strip(), headers=headers).text
        soup = BeautifulSoup(resp, 'lxml')
        # each results page lists up to 50 titles; slicing avoids the
        # IndexError that a bare try/except would otherwise silence
        for bookName in soup.find_all(class_="briefcitTitle")[:50]:
            print(bookName.text)
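As an aside, the heavy lifting that `soup.find_all(class_="briefcitTitle")` does can also be done with only the standard library. This is just an illustrative sketch: the `sample_html` snippet below is made up to mimic one results page, and the real library pages may differ.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside any tag whose class includes 'briefcitTitle'."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0  # > 0 while inside a matching tag (or its children)

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if self._depth or 'briefcitTitle' in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.titles.append(data.strip())

# made-up stand-in for one library results page
sample_html = (
    '<span class="briefcitTitle"><a href="#">Fluent Python</a></span>'
    '<span class="other">ignore me</span>'
    '<span class="briefcitTitle">Effective Python</span>'
)
parser = TitleParser()
parser.feed(sample_html)
print(parser.titles)
```

BeautifulSoup with lxml is still the more convenient (and faster) choice; this only shows what the `class_` lookup is doing under the hood.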

Multithreading version (83.9 sec)

The difference between the two versions is that here I first load the URLs from list.txt into a list, define a bookInfo function, and then use concurrent.futures.ThreadPoolExecutor to run bookInfo concurrently, passing each URL as its parameter.

import requests
from bs4 import BeautifulSoup
import concurrent.futures

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}

with open('list.txt', 'r') as f:
    info_urls = [line.strip() for line in f]

def bookInfo(info_url):
    resp = requests.get(info_url, headers=headers).text
    soup = BeautifulSoup(resp, 'lxml')
    # same parsing as before: up to 50 titles per page
    for bookName in soup.find_all(class_="briefcitTitle")[:50]:
        print(bookName.text)

# the default worker count suits I/O-bound tasks like HTTP requests
with concurrent.futures.ThreadPoolExecutor() as executor:
    for info_url in info_urls:
        executor.submit(bookInfo, info_url)
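One caveat with `executor.submit` plus `print` is that output from different threads can arrive interleaved and out of order. If you want the titles back as data, `executor.map` returns results in input order. Here is a sketch of that pattern; `fetch_titles` is a stand-in that simulates network latency instead of making a real HTTP request:

```python
import concurrent.futures
import time

def fetch_titles(url):
    # stand-in for requests.get + BeautifulSoup parsing:
    # simulate I/O latency, then return a fake title list
    time.sleep(0.01)
    return [f"{url} - book {i}" for i in range(2)]

urls = ['page1', 'page2', 'page3']  # stand-ins for the lines of list.txt

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # map preserves input order, unlike printing from each worker thread
    all_titles = list(executor.map(fetch_titles, urls))

for titles in all_titles:
    print(titles)
```

Because the work is I/O-bound (threads spend most of their time waiting on the network), this parallelizes well even under Python's GIL.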

Conclusion:

After reading this article, you should be able to apply multithreading in your own Python projects. It pays off most for I/O-bound work like web crawling, where threads spend most of their time waiting on the network.

Written by: Machine Learning / Deep Learning / Python / Flutter (cakeresume.com/yanwei-liu)
