R Language Study Notes (6): Web Scraping

Installing and loading packages

# Install jsonlite, rvest, and magrittr
pkgs <- c("jsonlite", "rvest", "magrittr")
install.packages(pkgs)
# Load jsonlite, rvest, and magrittr
library(jsonlite)
library(rvest)
library(magrittr)

Scraping JSON data

library(jsonlite)

aqi_url <- "https://opendata.epa.gov.tw/ws/Data/AQI/?$format=json"
aqi <- fromJSON(aqi_url)
class(aqi)
head(aqi)
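
To see what fromJSON() returns without depending on the live feed, here is a minimal offline sketch. The payload and its field names (SiteName, AQI) are made up for illustration and are not taken from the real API response.

```r
library(jsonlite)

# Hypothetical sample payload: a JSON array of objects (illustrative fields only)
json_txt <- '[
  {"SiteName": "SiteA", "AQI": "42"},
  {"SiteName": "SiteB", "AQI": "57"}
]'

# An array of flat objects is simplified into a data.frame, one row per object
records <- fromJSON(json_txt)
class(records)

# In this sample the values are quoted, so they arrive as character strings;
# convert them before doing any numeric comparison
records$AQI <- as.numeric(records$AQI)
records[records$AQI > 50, ]
```

Checking column types with str() right after parsing is usually worthwhile, since quoted numbers stay as text until converted.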

Scraping XML data

library(xml2)
library(magrittr)
aqi_url <- "https://opendata.epa.gov.tw/ws/Data/AQI/?$format=xml"
aqi <- read_xml(aqi_url)
class(aqi)
site_names <- aqi %>%
  xml_find_all(xpath = "//Data/SiteName") %>%
  xml_text()
class(site_names)
site_names
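
The same XPath approach can pull several fields per record into a data frame. The sketch below runs on a small inline document; the element names (Root, Data, SiteName, AQI) are assumptions chosen to resemble the query above, not the feed's confirmed schema.

```r
library(xml2)
library(magrittr)

# Hypothetical sample document (illustrative structure only)
xml_txt <- '<Root>
  <Data><SiteName>SiteA</SiteName><AQI>42</AQI></Data>
  <Data><SiteName>SiteB</SiteName><AQI>57</AQI></Data>
</Root>'

doc <- read_xml(xml_txt)

# Select each record node first, then look up fields relative to it
records <- xml_find_all(doc, "//Data")

sites <- data.frame(
  site = records %>% xml_find_first("./SiteName") %>% xml_text(),
  aqi  = records %>% xml_find_first("./AQI") %>% xml_text() %>% as.numeric(),
  stringsAsFactors = FALSE
)
sites
```

Using xml_find_first() per record keeps the columns aligned even if a record is missing one of the fields, because it returns one result per input node.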

Scraping HTML data

1. Using the HTML structure

library(rvest)
library(magrittr)

movie_url <- "https://www.imdb.com/title/tt4154756"
movie <- read_html(movie_url)
class(movie)
rating <- movie %>%
  html_nodes(css = "strong span") %>%
  html_text() %>%
  as.numeric()
rating
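2. Using XPath

Use a Chrome extension to obtain an element's XPath and scrape with that instead. When extracting data with XPath, replace the css argument (css = "strong span") with:
xpath = "//strong/span"
# Use as.numeric() for numeric values
# Use as.character() for text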
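
To compare the two selector styles side by side, here is a small offline sketch. The HTML snippet is invented for illustration and is not IMDb's actual markup.

```r
library(rvest)
library(magrittr)

# Hypothetical page fragment (illustrative markup only)
html_txt <- '<html><body>
  <strong><span>8.4</span></strong>
  <p><span>not a rating</span></p>
</body></html>'

page <- read_html(html_txt)

# CSS selector: any <span> nested inside a <strong>
by_css <- page %>%
  html_nodes(css = "strong span") %>%
  html_text() %>%
  as.numeric()

# XPath: a <span> that is a direct child of a <strong>
by_xpath <- page %>%
  html_nodes(xpath = "//strong/span") %>%
  html_text() %>%
  as.numeric()

identical(by_css, by_xpath)
```

The two selectors agree here, but they are not strictly equivalent: the CSS form matches spans at any depth under strong, while this XPath only matches direct children.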
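3. Downloading the whole HTML page to a local file

url <- 'https://www.XXX.com.tw'
# download.file() requires a destination path; "page.html" is an example filename
download.file(url, destfile = "page.html")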

Scraping HTML tables

# install.packages("XML")
# readHTMLTable() comes from the XML package
library(XML)
url="https://xxxxx.xxxxxx.xxx"
tables1 = readHTMLTable(url)
names(tables1)
tables1[[2]]

# readHTMLTable(URL, which = 1, header = FALSE, stringsAsFactors = FALSE)
The which argument selects which table on the page to read, by position.
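
As an alternative sketch, rvest (already loaded earlier in these notes) can also extract tables via html_table(). This is a swapped-in technique, not the readHTMLTable() call above, and it may be handy where readHTMLTable() has trouble fetching https pages directly. The table below is invented sample data.

```r
library(rvest)
library(magrittr)

# Hypothetical table markup (illustrative data only)
html_txt <- '<table>
  <tr><th>site</th><th>aqi</th></tr>
  <tr><td>SiteA</td><td>42</td></tr>
  <tr><td>SiteB</td><td>57</td></tr>
</table>'

# html_table() on a set of <table> nodes returns a list of data frames
tables <- read_html(html_txt) %>%
  html_nodes("table") %>%
  html_table()

length(tables)

# Pick a table by position, much like which = 1 in readHTMLTable()
first_table <- tables[[1]]
first_table
```

For a real page, read_html() would be given the page URL instead of the inline string.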

Written by

Machine Learning / Deep Learning / Python / Flutter cakeresume.com/yanwei-liu
