How to Scrape Multiple Pages in R and Rvest
Blog article by Jeroen Janssens.
Nov 5, 2021 • 7 min read
There’s something exciting about scraping a website to build your own
dataset! For R, there’s the rvest
package to harvest (i.e., scrape) static
HTML.
When the HTML elements you’re interested in are spread across multiple
pages and you know the URLs of the pages up front (or you know how many
pages you need to visit and the URLs are predictable), you can most
likely use a for loop or one of the map functions from the purrr
package. For example, to get the Stack Overflow questions tagged
R from the first three
pages, you could do:
library(purrr)
library(rvest)

(urls <- stringr::str_c("https://stackoverflow.com/questions/",
                        "tagged/r?tab=votes&page=", 1:3))

map(urls,
    ~ read_html(.) %>%
      html_elements("h3 > a.s-link") %>%
      html_text()) %>%
  flatten_chr() %>%
  head(n = 10)
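For comparison, the for-loop version of the same scrape might look roughly like this (a sketch that reuses the urls vector and CSS selector from the snippet above):

titles <- character(0)
for (url in urls) {
  # Scrape the question titles from one page and append them to the result.
  titles <- c(titles,
              read_html(url) %>%
                html_elements("h3 > a.s-link") %>%
                html_text())
}
head(titles, n = 10)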
However, if you don’t necessarily know how many pages you need to visit
or the URLs are not easily generated up front, but there’s a link to
the next page, something like this function has served (or scraped) me
well:
html_more_elements <- function(session, css, more_css) {
  xml2:::xml_nodeset(c(
    # The elements matching `css` on the current page...
    html_elements(session, css),
    # ...followed by the elements on subsequent pages, obtained by recursing
    # into the page behind the `more_css` link (NULL once following fails).
    tryCatch({
      html_more_elements(session_follow_link(session, css = more_css),
                         css, more_css)
    }, error = function(e) NULL)
  ))
}
This R function uses several functions from the rvest package and recursion to select HTML elements across multiple pages. It has three arguments:
- A session object created by rvest::session()
- A CSS selector that identifies the elements you want to select from each page
- A CSS selector that identifies the link to the next page
Note that this function stops only when there are no more links to follow or when the server responds with an error.
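In other words, a call follows this general pattern (a sketch; the URL and both selectors are placeholders you'd replace for your own target site):

session("https://example.com/listing") %>%
  html_more_elements(css = ".item .title",    # elements to collect on each page
                     more_css = "a.next") %>% # link to the next page
  html_text()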
Here’s an example that scrapes the names of all Lego Star Wars sets:
lego_sets <-
  session("https://www.lego.com/en-us/themes/star-wars") %>%
  html_more_elements("li h2 > span", "a[rel=next]") %>%
  html_text()

length(lego_sets)
head(lego_sets, n = 10)
Here’s another example that selects all the titles from Hacker
News and shows the first 10:
session("https://news.ycombinator.com") %>%
html_more_elements(".titleline", ".morelink") %>%
html_text() %>%
head(n = 10)
Note that I’m getting a 503 after a couple of pages. That’s probably because I’m making too many requests in too little time. Adding some delay to the function (with, e.g., Sys.sleep(1)) would solve this. Remember, always Scrape Responsibly™.
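For instance, a delayed version of the function could look like this (a minimal sketch; the function name, the delay argument, and the one-second default are my own additions, not something from rvest):

html_more_elements_slowly <- function(session, css, more_css, delay = 1) {
  xml2:::xml_nodeset(c(
    html_elements(session, css),
    tryCatch({
      Sys.sleep(delay)  # pause before requesting the next page
      html_more_elements_slowly(session_follow_link(session, css = more_css),
                                css, more_css, delay)
    }, error = function(e) NULL)
  ))
}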
— Jeroen