
Python Web Scraping Study Notes (1) - Requests, BeautifulSoup, Regular Expressions, API

Yanwei Liu

41 min read · Dec 20, 2018

GitHub repo for the book "Python 網路爬蟲與資料分析入門實戰" (a hands-on introduction to Python web scraping and data analysis)

jwlin/py-scraping-analysis-book

Example code for the 博碩文化 book "Python 網路爬蟲與資料分析入門實戰". Contribute to jwlin/py-scraping-analysis-book development by creating an account on…

github.com

Solving complex image CAPTCHAs and reCAPTCHA

How do you crack and bypass the CAPTCHA checks commonly found on web pages? Using the 2Captcha API as an example

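2Captcha is a very powerful CAPTCHA recognition service. In everyday life, when you log in to a website (for example, the AWS account login page), you may be asked to type a verification code by hand. Some codes are just a combination of letters and digits, but others are extremely complex, with distorted fonts and colors that often make…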

yanwei-liu.medium.com

Prerequisites

GET: retrieves ordinary web page content
POST: used to send data when submitting a form

Installing Requests and BeautifulSoup

pip install beautifulsoup4
pip install requests

Importing the modules

import requests
from bs4 import BeautifulSoup

Requests

Using Requests

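url = "https://jwlin.github.io/py-scraping-analysis-book/ch1/connect.html"
resp = requests.get(url)

# Garbled text after fetching the page? Set the encoding before reading resp.text.
resp.encoding = 'utf-8'    # decode the body as UTF-8
# resp.encoding = 'big5'   # or use the page's actual encoding, e.g. big5, or gbk for Simplified Chinese

# Check the HTTP status code
resp.status_code
# 200 means the request succeeded
# Codes starting with 2 generally indicate success
# Codes starting with 4 or 5 indicate an error

# What if you need to decode text that is already garbled?
# Paste it into the site below and pick the matching encoding to convert it:
# https://www.webatic.com/url-convertor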
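The two checks above (status code and encoding) can be tried offline. The sketch below is only an illustration using the standard library: `describe_status` is a hypothetical helper, and the Big5 bytes stand in for a page body that a server might send.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Classify an HTTP status code by its leading digit (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


# Made-up page body: Traditional Chinese text encoded as Big5.
raw = "網路爬蟲".encode("big5")

wrong = raw.decode("utf-8", errors="replace")  # wrong codec, so the text comes out garbled
right = raw.decode("big5")                     # matching codec, so the text is readable

print(describe_status(200), HTTPStatus(200).phrase)
print(describe_status(404), HTTPStatus(404).phrase)
print(repr(wrong))
print(right)
```

Setting `resp.encoding` in Requests plays the same role as choosing the codec passed to `decode` here: it tells the library how to turn the raw bytes into text.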

params: adding query parameters

import requests
r = requests.get('https://www.google.com/search',
  headers={'Accept': 'application/json'},
  params={'q': 'unicorns'}  # requests google.com/search?q=unicorns
)
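To see what `headers` and `params` contribute without contacting the server, the request can be built and prepared but not sent. This is a minimal sketch that assumes the `requests.Request(...).prepare()` route is acceptable for inspection; it is not part of the original example.

```python
import requests

# Build the request without sending it, so the final URL can be inspected offline.
req = requests.Request(
    "GET",
    "https://www.google.com/search",
    headers={"Accept": "application/json"},
    params={"q": "unicorns"},
)
prepared = req.prepare()

print(prepared.method)             # GET
print(prepared.url)                # https://www.google.com/search?q=unicorns
print(prepared.headers["Accept"])  # application/json
```

The prepared URL shows that `params` ends up in the query string, while `headers` travels separately from the URL.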

r.content / r.text / r.json()

import requests
r = requests.get('https://official-joke-api.appspot.com/random_joke')
r.content  # returns the raw bytes
r.text     # returns the decoded string
r.json()   # parses the body as JSON (here a dict)
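The relationship between the three accessors can be approximated with the standard library alone. In the sketch below the byte string is a made-up body (not real output from the joke API), and the decode and parse steps only roughly mirror what `r.text` and `r.json()` do.

```python
import json

# Made-up response body, standing in for the raw bytes a server would send.
content = b'{"id": 1, "message": "\xe4\xbd\xa0\xe5\xa5\xbd"}'

text = content.decode("utf-8")  # roughly what r.text gives: bytes decoded to str
data = json.loads(text)         # roughly what r.json() gives: parsed Python objects

print(type(content).__name__)   # bytes
print(type(text).__name__)      # str
print(data["message"])          # 你好
```

Use the raw bytes for binary data such as images, the text for HTML that will be parsed, and the parsed JSON when the endpoint is an API.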

payload: sending data along with the request (GET method)

import requests
payload = {'key1': 'value1', 'key2': 'value2'}
# Add the query parameters to the GET request
html = requests.get("http://httpbin.org/get", params=payload)
print(html.url) # http://httpbin.org/get?key1=value1&key2=value2

payload: sending data along with the request (POST method)

import requests
payload = {'key1': 'value1', 'key2': 'value2'}
# Send the payload as form data in the body of the POST request
html = requests.post("http://httpbin.org/post", data=payload)
print(html.text) # httpbin echoes the request back as JSON
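The difference between the two payload examples is where the data travels: in the URL for GET, in the request body for POST. The sketch below prepares both requests without sending them so that the difference can be inspected offline; using `requests.Request(...).prepare()` this way is an illustrative choice, not something the original notes do.

```python
import requests

payload = {"key1": "value1", "key2": "value2"}

# Prepare (but do not send) one request of each kind.
get_req = requests.Request("GET", "http://httpbin.org/get", params=payload).prepare()
post_req = requests.Request("POST", "http://httpbin.org/post", data=payload).prepare()

print(get_req.url)    # http://httpbin.org/get?key1=value1&key2=value2
print(get_req.body)   # None
print(post_req.url)   # http://httpbin.org/post
print(post_req.body)  # key1=value1&key2=value2
print(post_req.headers["Content-Type"])  # application/x-www-form-urlencoded
```

Because the POST payload is form-encoded in the body, it does not appear in the URL, which is why form submissions and logins normally use POST.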
