Python Web Scraping + Data Processing Project (1): Scraping Titles from an E-book Website

3 min read · Aug 20, 2019

In this article, we will learn how to use Python's requests, BeautifulSoup, and pandas to build a simple web scraping project.

1. Install the modules

pip install requests
pip install beautifulsoup4
pip install pandas

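2. Start scraping

The site we want to scrape is All IT eBooks - Free IT eBooks Download, which currently has 831 pages. Each page lists e-book titles, and I want to download the titles from all 831 pages. Fetching them by hand, one page at a time, would be tedious, so what can we do?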

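import requests                               # import the two modules
from bs4 import BeautifulSoup

for i in range(1, 832):                       # from page 1 to page 831
    i = str(i)                                # convert the int to a str
    url = "http://www.allitebooks.org/page/" + i  # build the URL for pages 1-831
    #print(url)                               # print the URL
    resp = requests.get(url)                  # request the URL with requests
    soup = BeautifulSoup(resp.text, 'lxml')   # parse the page (the 'lxml' parser needs: pip install lxml)
    titles = soup.find_all('h2')              # find every element with the h2 tag
    for t in titles:                          # print the text of each h2
        print(t.string)

# There are currently 8,306 books in total
# Run time: [Finished in 1632.4s]
# There is actually a faster way, but this is the best I can do with my current skills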
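
The comment above mentions that a faster approach exists. One common option, not necessarily the one the author had in mind, is to download several pages at the same time with a thread pool, since most of the 1632 seconds is spent waiting on the network. The sketch below is a minimal illustration under stated assumptions: the names `parse_titles`, `fetch_titles`, `scrape`, and the `max_workers=8` setting are hypothetical, and it uses Python's built-in `html.parser` instead of `lxml` so that no extra install is needed.

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.allitebooks.org/page/"  # same URL pattern as the article


def parse_titles(html):
    """Return the text of every <h2> element found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser (assumption: swapped in for lxml)
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]


def fetch_titles(page):
    """Download one listing page and return the titles on it."""
    resp = requests.get(BASE_URL + str(page), timeout=10)
    resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    return parse_titles(resp.text)


def scrape(pages, fetch=fetch_titles, max_workers=8):
    """Fetch many pages concurrently and return all titles in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_page = pool.map(fetch, pages)  # map() yields results in input order
        return [title for titles in per_page for title in titles]


# Hypothetical usage (this hits the network, so it is left commented out):
# all_titles = scrape(range(1, 832))
# print(len(all_titles))
```

Keeping `max_workers` small is a deliberate choice here: it shortens the run without sending hundreds of simultaneous requests to the site. Passing `fetch` as a parameter also makes it possible to try `scrape` with a stand-in function before touching the real website.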

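3. Analyze the data

After running the program, we copy the printed book titles and use Excel to create an xlsx file, as shown in the figure below:

Next, we use pandas to analyze the data.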
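
As an alternative to copying the output by hand, the titles could be collected in a Python list and turned into a table directly with pandas. The following is only a sketch: the helper name `titles_to_frame` and the sample titles are made up for illustration, and the column name `BookName` is taken from the analysis code in this article. Writing an .xlsx file with `to_excel` additionally assumes that an Excel engine such as openpyxl is installed.

```python
import pandas as pd


def titles_to_frame(titles):
    """Build a one-column DataFrame, dropping empty or missing titles."""
    cleaned = [t.strip() for t in titles if t and t.strip()]
    return pd.DataFrame({"BookName": cleaned})


# Made-up sample input, including a None like the one t.string can return
sample_titles = ["Learning Python", None, "  Deep Learning with R  ", ""]
df = titles_to_frame(sample_titles)
print(df)

# Saving to Excel needs an engine such as openpyxl (assumption: pip install openpyxl):
# df.to_excel("All IT eBooks.xlsx", index=False)
```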

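import pandas as pd                            # import the module
df = pd.read_excel("All IT eBooks.xlsx")       # read the xlsx file
df.head()                                      # show the first 5 rows
df.describe()                                  # produce a descriptive summary of the data

# Search the data for titles containing Python, Machine Learning, or Deep Learning
str_choice = "Python|Machine Learning|Deep Learning"
# and store the search result in a variable named python
python = df[df['BookName'].str.contains(str_choice, na=False)]
python.describe()                              # produce a descriptive summary of the matches
python.head(50)                                # show the first 50 rows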
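
To make the filter above easier to follow, here is a small self-contained sketch on made-up data. `str.contains` treats the pattern as a regular expression by default, so `|` means "or"; `na=False` makes missing titles count as non-matches instead of producing NaN in the mask. The sample titles and the variable names `mask`, `matches`, and `matches_ci` are illustrative assumptions, and the case-insensitive variant is an optional extra that the original code does not use.

```python
import pandas as pd

# Made-up sample data with the same column name as the article
df = pd.DataFrame({"BookName": [
    "Learning Python",
    "Hands-On Machine Learning",
    "Java in Action",
    None,                # a missing title
    "python tricks",     # lowercase, so the case-sensitive search skips it
]})

pattern = "Python|Machine Learning|Deep Learning"

# Boolean mask: True where the title matches any of the alternatives
mask = df["BookName"].str.contains(pattern, na=False)
matches = df[mask]

# Optional variant: ignore upper/lower case when matching
matches_ci = df[df["BookName"].str.contains(pattern, case=False, na=False)]

print(matches)
print(matches_ci)
```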

Beyond this, we can also use NLTK for natural language processing to discover more interesting things.
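
As a first taste of that kind of text analysis, the sketch below counts which words appear most often in a list of titles. To stay dependency-free it uses only the standard library (`re` and `collections.Counter`) rather than NLTK itself, so it is a simplified stand-in, not NLTK's API. The function name `word_counts`, the tiny stopword set, and the sample titles are all assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would use a fuller one
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def word_counts(titles):
    """Count lowercase words across titles, skipping common stopwords."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9+#]+", title.lower())  # keeps tokens like c++ and c#
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


sample = ["Learning Python", "Python for Data Analysis", "Deep Learning with Python"]
counts = word_counts(sample)
print(counts.most_common(2))
```

Run on the full list of scraped titles, the same function would give a rough picture of which topics dominate the catalog.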

Python Pandas Usage Notes (medium.com)

Python Web Scraping Study Notes (1): Requests, BeautifulSoup, Regular Expressions (medium.com)


Written by Yanwei Liu
