Scraping multiple pages

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • How do I tell Scrapy to follow URLs and scrape their contents?

Objectives
  • Creating a two-step spider that first extracts the next-page URLs, then visits them and scrapes their contents.

Walking over the site we want to scrape

The primary advantage of a spider over a manual scraping tool is that it can follow links. Let’s use the Scraper extension to identify the XPath of the “next page” link.

This is important because whenever we’re scraping a site, we want to work from the page’s underlying HTML source rather than from how the page happens to be rendered in the browser.

The next link

Here we see some useful things: the surrounding list item has a class="next", and there’s a characteristic li/a element whose title attribute is “Next page”. These are all attributes we can target.

What happens if we take some cues from the source and run the Scrapy shell:

$ scrapy shell "https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=-1&gen=0&ps=0"

with

>>> response.xpath("//a[@title='Next page']/@href")

We see:

[<Selector xpath="//a[@title='Next page']/@href" data='?page=2&q=&mem=1&par=-1&gen=0&ps=12&st=1'>, <Selector xpath="//a[@title='Next page']/@href" data='?page=2&q=&mem=1&par=-1&gen=0&ps=12&st=1'>]

Because the two matches are identical, we can safely use extract_first() and take just the first one.

>>> response.xpath("//a[@title='Next page']/@href").extract_first()

returns

'?page=2&q=&mem=1&par=-1&gen=0&ps=12&st=1'
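
For comparison, calling extract() instead returns every match as a list; here, the same relative link simply appears twice:

>>> response.xpath("//a[@title='Next page']/@href").extract()
['?page=2&q=&mem=1&par=-1&gen=0&ps=12&st=1', '?page=2&q=&mem=1&par=-1&gen=0&ps=12&st=1']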

Dealing with relative URLs

Looking at this result and at the source code of the page, we realize that the URLs are all relative to that page: they are missing the scheme and host they would need to be absolute URLs, which is what we’ll need if we want our spider to visit them and scrape more data. We could prefix all those URLs with https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results to make them absolute, but since this is a common occurrence when scraping web pages, Scrapy provides a built-in method to deal with the issue.

To try it out, still in the Scrapy shell, let’s first store the first returned URL into a variable:

>>> testurl = response.xpath("//a[@title='Next page']/@href").extract_first()

Then, we can try passing it on to the urljoin() method:

>>> response.urljoin(testurl)

which returns

'https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?page=2&q=&mem=1&par=-1&gen=0&ps=12&st=1'

We see that Scrapy was able to reconstruct the absolute URL by combining the URL of the current page context (the page in the response object) and the relative link we had stored in testurl.
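
This is standard URL resolution; Python’s own urllib.parse.urljoin behaves the same way, so as a sanity check (still in the shell) we could reproduce the result without Scrapy’s helper:

>>> from urllib.parse import urljoin
>>> urljoin(response.url, testurl)
'https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?page=2&q=&mem=1&par=-1&gen=0&ps=12&st=1'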

Moving our page-scraping to its own function

We’re about to have the spider move around between pages, so let’s first move our page-scraping code into its own function and make sure the data we care about can be passed between functions.

(editing austmps/austmps/spiders/austmpdata.py)

import scrapy
from austmps.items import AustmpsItem # We need this so that Python knows about the item object

class AustmpdataSpider(scrapy.Spider):
    name = 'austmpdata'  # The name of this spider

    # The allowed domain and the URLs where the spider should start crawling:
    allowed_domains = ['www.aph.gov.au']
    start_urls = ['http://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=-1&gen=0&ps=0/']

    def parse(self, response):
        # The main method of the spider. It scrapes the URL(s) specified in the
        # 'start_urls' attribute above. The content of the scraped URL is passed on
        # as the 'response' object.

        # When asked for a new item, ask self.scrape for new items and pass them along
        yield from self.scrape(response)

    def scrape(self, response):
        for resource in response.xpath("//h4[@class='title']/.."):
            # Loop over each item on the page.
            item = AustmpsItem() # Creating a new Item object

            item['name'] = resource.xpath("h4/a/text()").extract_first()
            item['link'] = resource.xpath("h4/a/@href").extract_first()
            item['district'] = resource.xpath("dl/dd/text()").extract_first()
            item['twitter'] = resource.xpath("dl/dd/a[contains(@class, 'twitter')]/@href").extract_first()
            item['party'] = resource.xpath("dl/dt[text()='Party']/following-sibling::dd/text()").extract_first()

            yield item
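
If the yield from syntax above is unfamiliar: it delegates to another generator, passing along everything that generator yields. Here is a minimal standalone sketch (these toy functions are illustrative, not part of our spider):

def scrape():
    # A generator that produces items one at a time
    yield {'name': 'A'}
    yield {'name': 'B'}

def parse():
    # 'yield from' re-yields everything scrape() yields
    yield from scrape()

print(list(parse()))  # prints [{'name': 'A'}, {'name': 'B'}]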

Extracting URLs using the spider

Since we have an XPath query that we know will extract the URLs we are looking for, we can now use the xpath() method on the response and update the spider accordingly.

(editing austmps/austmps/spiders/austmpdata.py)

(...)
    def parse(self, response):
        # The main method of the spider. It scrapes the URL(s) specified in the
        # 'start_urls' attribute above. The content of the scraped URL is passed on
        # as the 'response' object.

        nextpageurl = response.xpath("//a[@title='Next page']/@href").extract_first()
        nextpage = response.urljoin(nextpageurl)
        print(nextpage)

        # When asked for a new item, ask self.scrape for new items and pass them along
        yield from self.scrape(response)
(...)
Let’s run the spider, using -s DEPTH_LIMIT=1 to set Scrapy’s DEPTH_LIMIT setting for this run only and cap the crawl at one link deep while we test:

$ scrapy crawl austmpdata -s DEPTH_LIMIT=1

And this prints out the next-page URL:

(...)
2018-06-26 19:23:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=-1&gen=0&ps=0/> (referer: None)
https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?page=2&q=&mem=1&par=-1&gen=0&ps=12&st=1
(...)
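
If we wanted that depth cap to be permanent rather than per-run, the same setting could live in the project’s settings file instead (assuming the default project layout):

# In austmps/austmps/settings.py
DEPTH_LIMIT = 1  # Never follow links more than one level deep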

Now we need our spider to follow that link, and we need to make sure that the spider stops when the link is no longer present. We can do this through a technique called “recursion”, which means having a function arrange for itself to be called again.
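
Here is a minimal plain-Python illustration of the idea (a toy sketch, not Scrapy code; the page numbers are made up):

def crawl(page):
    print("Scraping page", page)
    next_page = page + 1 if page < 3 else None  # pretend page 3 has no 'Next page' link
    if next_page:
        crawl(next_page)  # the function calls itself until the link isn't present

crawl(1)  # prints pages 1, 2 and 3, then stops

In the spider, we won’t call parse() directly; instead we’ll yield a scrapy.Request whose callback is parse, and Scrapy will schedule that next call for us.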

To our parse function, we therefore add a request whose callback is parse itself.

    def parse(self, response):
        # The main method of the spider. It scrapes the URL(s) specified in the
        # 'start_urls' attribute above. The content of the scraped URL is passed on
        # as the 'response' object.

        # When asked for a new item, ask self.scrape for new items and pass them along
        # yield from self.scrape(response)

        nextpageurl = response.xpath("//a[@title='Next page']/@href")

        if nextpageurl:
            # If we've found a pattern which matches
            path = nextpageurl.extract_first()
            nextpage = response.urljoin(path)
            print("Found url: {}".format(nextpage)) # Write a debug statement
            yield scrapy.Request(nextpage, callback=self.parse) # Ask Scrapy to fetch nextpage and call parse() on the result

And now, with a single invocation of the scraper (note that we’ve dropped the depth limit this time, since we’re testing the recursion):

$ scrapy crawl austmpdata

we get:

(...)
2018-06-26 19:39:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=-1&gen=0&ps=0/> from <GET http://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=-1&gen=0&ps=0/>
2018-06-26 19:39:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=-1&gen=0&ps=0/> (referer: None)
Found url: https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?page=2&q=&mem=1&par=-1&gen=0&ps=12&st=1
(...)
2018-06-26 19:39:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?page=13&q=&mem=1&par=-1&gen=0&ps=12&st=1>
(...)

If we turn our item scraping back on by uncommenting the yield from self.scrape(response) line, we suddenly get all 145 members of parliament.

Here is the full code of austmpdata.py:

import scrapy
from austmps.items import AustmpsItem # We need this so that Python knows about the item object

class AustmpdataSpider(scrapy.Spider):
    name = 'austmpdata'  # The name of this spider

    # The allowed domain and the URLs where the spider should start crawling:
    allowed_domains = ['www.aph.gov.au']
    start_urls = ['http://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=-1&gen=0&ps=0/']

    def parse(self, response):
        # The main method of the spider. It scrapes the URL(s) specified in the
        # 'start_urls' attribute above. The content of the scraped URL is passed on
        # as the 'response' object.

        nextpageurl = response.xpath("//a[@title='Next page']/@href")

        # When asked for a new item, ask self.scrape for new items and pass them along
        yield from self.scrape(response)

        if nextpageurl:
            path = nextpageurl.extract_first()
            nextpage = response.urljoin(path)
            print("Found url: {}".format(nextpage))
            yield scrapy.Request(nextpage, callback=self.parse)

    def scrape(self, response):
        for resource in response.xpath("//h4[@class='title']/.."):
            # Loop over each item on the page.
            item = AustmpsItem() # Creating a new Item object

            item['name'] = resource.xpath("h4/a/text()").extract_first()
            item['link'] = resource.xpath("h4/a/@href").extract_first()
            item['district'] = resource.xpath("dl/dd/text()").extract_first()
            item['twitter'] = resource.xpath("dl/dd/a[contains(@class, 'twitter')]/@href").extract_first()
            item['party'] = resource.xpath("dl/dt[text()='Party']/following-sibling::dd/text()").extract_first()

            yield item


If we run:

$ rm output.csv
$ scrapy crawl austmpdata -o output.csv
$ wc -l output.csv

We get all 145 members of parliament + 1 line for the header:

146 output.csv

Visiting child pages

Now that we’re scraping and following links, what happens if we want to add a member’s Electorate Office phone number to this data sheet?

We will need to tell the scraper to load their profile page (which we have the URL for) and to write a second scraper function to find the data we want from this specific page.

First, use the tools we’ve explored today to find the correct XPath for the Electorate Office phone number.

Tip: "Electorate Office " has a space inside the h3. And we’re going to need to use following-sibling::.

Using scrapy shell

$ scrapy shell "https://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=R36"
>>> response.xpath("//h3[text()='Electorate Office ']/following-sibling::dl/dd[1]/a/text()").extract()

We use dd[1] here (XPath indexing starts at 1, so this takes the first dd) because targeting the phone number more precisely would require far more complex selectors.

Now that we have the XPath worked out, we need to make sure the item class in items.py has a phonenumber field to accept the new value.

import scrapy


class AustmpsItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    district = scrapy.Field()
    link = scrapy.Field()
    twitter = scrapy.Field()
    party = scrapy.Field()
    phonenumber = scrapy.Field()

Next, we need to make another function called get_phonenumber(self, response).

Looking at the Scrapy documentation, we can pass the item object between callbacks using the request’s .meta dictionary.
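
(As an aside: Scrapy 1.7 and newer also offer a cb_kwargs argument on Request, whose entries are passed to the callback as keyword arguments. A sketch of the equivalent, assuming such a version:

request = scrapy.Request(profilepage, callback=self.get_phonenumber, cb_kwargs={'item': item})
yield request

def get_phonenumber(self, response, item):
    # 'item' arrives directly as a keyword argument; no .meta lookup needed
    item['phonenumber'] = response.xpath("//h3[text()='Electorate Office ']/following-sibling::dl/dd[1]/a/text()").extract_first()
    yield item

We’ll stick with .meta below, since it works on all Scrapy versions.)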

(editing austmps/austmps/spiders/austmpdata.py)

import scrapy
from austmps.items import AustmpsItem # We need this so that Python knows about the item object

class AustmpdataSpider(scrapy.Spider):
    name = 'austmpdata'  # The name of this spider

    # The allowed domain and the URLs where the spider should start crawling:
    allowed_domains = ['www.aph.gov.au']
    start_urls = ['http://www.aph.gov.au/Senators_and_Members/Parliamentarian_Search_Results?q=&mem=1&par=-1&gen=0&ps=0/']

    def parse(self, response):
        # The main method of the spider. It scrapes the URL(s) specified in the
        # 'start_urls' attribute above. The content of the scraped URL is passed on
        # as the 'response' object.

        nextpageurl = response.xpath("//a[@title='Next page']/@href")

        # When asked for a new item, ask self.scrape for new items and pass them along
        yield from self.scrape(response)

        if nextpageurl:
            path = nextpageurl.extract_first()
            nextpage = response.urljoin(path)
            print("Found url: {}".format(nextpage))
            yield scrapy.Request(nextpage, callback=self.parse)


    def scrape(self, response):
        for resource in response.xpath("//h4[@class='title']/.."):
            # Loop over each item on the page.
            item = AustmpsItem() # Creating a new Item object

            item['name'] = resource.xpath("h4/a/text()").extract_first()


            # Instead of just storing the relative path of the profile page, let's build the absolute profile-page URL so we can use it later.
            profilepage = response.urljoin(resource.xpath("h4/a/@href").extract_first())
            item['link'] = profilepage

            item['district'] = resource.xpath("dl/dd/text()").extract_first()
            item['twitter'] = resource.xpath("dl/dd/a[contains(@class, 'twitter')]/@href").extract_first()
            item['party'] = resource.xpath("dl/dt[text()='Party']/following-sibling::dd/text()").extract_first()

            # Build a request for the profile page, with get_phonenumber as its callback.
            request = scrapy.Request(profilepage, callback=self.get_phonenumber)
            request.meta['item'] = item  # Stash the item in .meta so the callback can pick it up.
            yield request  # Hand the request to Scrapy; get_phonenumber will yield the finished item.

    def get_phonenumber(self, response):
        # A scraper designed to operate on one of the profile pages
        item = response.meta['item']  # Retrieve the item we stashed in scrape()
        item['phonenumber'] = response.xpath("//h3[text()='Electorate Office ']/following-sibling::dl/dd[1]/a/text()").extract_first()
        yield item  # Yield the completed item, phone number included

and running:

$ rm -f output.csv
$ scrapy crawl austmpdata -o output.csv
$ head -3 output.csv

gets us

district,link,name,party,phonenumber,twitter
"Warringah, New South Wales",https://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=EZ5,Hon Tony Abbott MP,Liberal Party of Australia,(02) 9977 6411,http://twitter.com/TonyAbbottMHR
"Menzies, Victoria",https://www.aph.gov.au/Senators_and_Members/Parliamentarian?MPID=HK5,Hon Kevin Andrews MP,Liberal Party of Australia,(03) 9848 9900,http://twitter.com/kevinandrewsmp

Triumph!

Summative exercise: Write a new web scraper

Now that we’ve built this web scraper, use the list of Members of the House of Commons to extract their name, constituency, party, Twitter handle, and phone number.

Key Points

  • We can have the spider follow links to collect more data in an automated fashion.

  • We can use a callback to get that data scraped for our output file.

  • We can use .meta to exchange data between callbacks.