
Python Web Scraping Study Notes (1) - Requests, BeautifulSoup, Regular Expressions, API

Yanwei Liu
41 min read · Dec 20, 2018


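GitHub repo for the book Python 網路爬蟲與資料分析入門實戰 (a hands-on introduction to web scraping and data analysis in Python)

Solving complex image CAPTCHAs and reCAPTCHA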

Prerequisites

GET: fetches ordinary web page content
POST: used to send data when submitting a form

Installing Requests and BeautifulSoup

pip install beautifulsoup4
pip install requests

Importing the modules

import requests
from bs4 import BeautifulSoup

Requests

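Using Requests

url = "https://jwlin.github.io/py-scraping-analysis-book/ch1/connect.html"
resp = requests.get(url)
# Text garbled after fetching? Set the encoding before reading resp.text.
resp.encoding = 'utf-8'  # decode the response as UTF-8
# Or set it to the page's actual encoding, e.g. Big5 or Simplified Chinese GBK:
# resp.encoding = 'big5'
# Show the HTTP status code
print(resp.status_code)
# 200 means the request succeeded
# Codes starting with 2 generally mean success
# Codes starting with 4 or 5 indicate an error
# Still stuck with garbled text?
# Paste it into the site below and pick the matching encoding to convert it:
# https://www.webatic.com/url-convertor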
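The rule of thumb in the comments above (2xx is fine, 4xx or 5xx is an error) can be sketched as a small helper. `describe_status` is a hypothetical name used only for illustration, not part of Requests; it also labels 3xx as a redirect, which the notes above do not mention.

```python
def describe_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse category by its first digit."""
    if 200 <= status_code < 300:
        return "success"
    if 300 <= status_code < 400:
        return "redirect"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "other"


# Sample codes chosen for illustration
for code in (200, 404, 503):
    print(code, describe_status(code))
```

With a real response you would pass in `resp.status_code`. If you only want to stop on failures, `resp.raise_for_status()` raises `requests.HTTPError` for 4xx and 5xx responses.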

params: adding query parameters

import requests
r = requests.get('https://www.google.com/search',
    headers={'Accept': 'application/json'},
    params={'q': 'unicorns'}  # requests google.com/search?q=unicorns
)
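To see roughly what `params` does to the URL, here is a sketch that builds the query string by hand with the standard library's `urllib.parse` rather than Requests. The search term is made up for illustration, and Requests may differ in small details of the encoding.

```python
from urllib.parse import parse_qs, urlencode

base_url = "https://www.google.com/search"
params = {"q": "python unicorns"}  # hypothetical search term containing a space

# urlencode turns the dict into a query string; spaces become "+"
query = urlencode(params)
full_url = f"{base_url}?{query}"
print(full_url)  # https://www.google.com/search?q=python+unicorns

# parse_qs reverses the encoding, confirming nothing was lost
decoded = parse_qs(query)
print(decoded)  # {'q': ['python unicorns']}
```

Passing a dict through `params` saves you from doing this escaping yourself, which matters once values contain spaces or non-ASCII characters.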

Response content: r.content / r.text / r.json()

import requests
r = requests.get('https://official-joke-api.appspot.com/random_joke')
r.content  # raw bytes
r.text  # decoded string
r.json()  # parsed JSON (a dict here)
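The three accessors are layers over the same data: bytes, then decoded text, then parsed JSON. The sketch below mimics that chain with the standard library only, so it runs without a network call. The payload is made up and is not the real API's output.

```python
import json

# Made-up payload standing in for r.content (raw bytes off the wire)
content = '{"id": 1, "setup": "Why did the crawler nap?", "punchline": "Too many requests ☕"}'.encode("utf-8")

# r.text corresponds to the bytes decoded into a str with the response encoding
text = content.decode("utf-8")

# r.json() corresponds to parsing that text into Python objects (a dict here)
data = json.loads(text)

print(type(content).__name__, type(text).__name__, type(data).__name__)
# bytes str dict
print(data["id"])  # 1

# Decoding with the wrong codec is what produces garbled text
garbled = content.decode("latin-1")
print(garbled == text)  # False
```

This is also why setting `resp.encoding` matters: it only changes how the bytes are turned into `resp.text`, not the bytes themselves.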

payload: sending key/value data with the request (GET method)

import requests
payload = {'key1': 'value1', 'key2': 'value2'}
# Add the query parameters to the GET request
html = requests.get("http://httpbin.org/get", params=payload)
print(html.url) # http://httpbin.org/get?key1=value1&key2=value2

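payload: sending key/value data with the request (POST method)

import requests
payload = {'key1': 'value1', 'key2': 'value2'}
# Send the payload as form data in the body of the POST request
html = requests.post("http://httpbin.org/post", data=payload)
print(html.text)  # httpbin echoes the request back as JSON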
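To compare where the payload ends up in each method, the sketch below prepares both requests without sending them, so it needs no network access. It relies on `requests.Request(...).prepare()`, which the notes above do not use, so treat it as an inspection aid rather than part of the original workflow.

```python
import requests

payload = {"key1": "value1", "key2": "value2"}

# prepare() builds the request object without sending anything
get_req = requests.Request("GET", "http://httpbin.org/get", params=payload).prepare()
post_req = requests.Request("POST", "http://httpbin.org/post", data=payload).prepare()

# GET: the payload is appended to the URL as a query string, and there is no body
print(get_req.url)   # http://httpbin.org/get?key1=value1&key2=value2
print(get_req.body)  # None

# POST: the URL stays unchanged and the payload is form-encoded into the body
print(post_req.url)   # http://httpbin.org/post
print(post_req.body)  # key1=value1&key2=value2
print(post_req.headers["Content-Type"])  # application/x-www-form-urlencoded
```

So `params=` always targets the URL, while `data=` targets the request body, which is why form submissions use POST with `data=`.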
