Manually scrape data using browser extensions

Overview

Teaching: 45 min
Exercises: 20 min

Questions

How can I get started scraping data off the web?

How can I use XPath to more accurately select what data to scrape?

Objectives

Introduce the Chrome Scraper extension.

Practice scraping data that is well structured.

Use XPath queries to refine what needs to be scraped when data is less structured.

Using the Scraper Chrome extension

Now we are finally ready to do some web scraping. Let’s go back to the list of MPPs in the Legislative Assembly of Ontario.

We are interested in downloading this list to a spreadsheet, with columns for names and constituencies. Do do so, we will use the Scraper extension in the Chrome browser (refer to the Setup section for help installing these tools).

Scrape similar

With the extension installed, we can select the first row of the Current MPPs table right click and choose “Scrape similar” from the contextual menu:

Screenshot of the Scraper contextual menu

Alternatively, the “Scrape similar” option can also be accessed from the Scraper extension icon:

Either operation will bring up the Scraper window:

Screenshot of the Scraper main window

We recognize that Scraper has generated XPath queries that corresponds to the data we had selected upon calling it. The Selector (highlighted in red in the above screenshot) has been set to //tbody/tr[td] which selects all the rows of the table, delimiting the data we want to extract.

In fact, we can try out that query using the technique that we learned in the previous section by typing the following in the browser console:

$x("//tbody/tr[td]")

returns something like

<- Array [672]

which we can explore in the console to make sure this is the right data.

Scraper also recognized that there were two columns in that table, and has accordingly created two such columns (highlighted in blue in the screenshot), each with its own XPath selector, *[1] and *[2].

To understand what this means, we have to remember that XPath queries are relative to the current context node. The context node has been set by the Selector query above, so those queries are relative to the array of tr elements that has been selected.

We can replicate their effect by trying out

$x("//tbody/tr[td]/*[1]")

in the console. This should select only the first column of the table. The same goes for the second column.

But in this case, we don’t need to fiddle with the XPath queries too much, as Scraper was able to deduce them for us, and we can use the export functions to either create a Google Spreadsheet with the results, or copy them into the clipboard in Tab Separated Values (TSV) format for pasting into a text document, a spreadsheet or Open Refine.

There is one bit of data cleanup we might want to do, though. If we paste the data copied from Scraper into a text document, we see something like this:

MPP	Riding	Party
Anand, Deepak          	Mississauga—Malton          	Progressive Conservative Party of Ontario
Andrew, Jill          	Toronto—St. Paul's          	New Democratic Party of Ontario
Armstrong, Teresa J.          	London—Fanshawe          	New Democratic Party of Ontario
Arnott, Hon. Ted          	Wellington—Halton Hills          	Progressive Conservative Party of Ontario

This is because there are a lot of unnecessary white spaces in the HTML that’s behind that table, which are being captured by Scraper. We can however tweak the XPath column selectors to take advantage of the normalize-space XPath function:

normalize-space(*[1])
normalize-space(*[2])

Screenshot of the Scraper window showing the Column selectors

We now need to tell Scraper to scrape the data again by using our new selectors, this is done by clicking on the “Scrape” button. The preview will not noticeably change, but if we now copy again the results and paste them in our text editor, we should see

MPP	Riding	Party
Anand, Deepak	Mississauga—Malton	Progressive Conservative Party of Ontario
Andrew, Jill	Toronto—St. Paul's	New Democratic Party of Ontario
Armstrong, Teresa J.	London—Fanshawe	New Democratic Party of Ontario
Arnott, Hon. Ted	Wellington—Halton Hills	Progressive Conservative Party of Ontario

which is a bit cleaner.

Scrape the list of Ontario MPPs

Use Scraper to export the list of current members of the Ontario Legislative Assembly and try exporting the results in your favourite spreadsheet or data analysis software.

Once you have done that, try adding a third column containing the URLs that are underneath the names of the MPPs and that are leading to the detail page for each parliamentarian.

Tips:

To add another column in Scraper, use the little green “+” icon in the columns list.

Look at the source code and try out XPath queries in the console until you find what you are looking for.

The syntax to select the value of an attribute of the type <element attribute="value"> is element/@attribute.

The concat() XPath function can be use to concatenate things.
Solution

Add a third column with the XPath query
*[1]/a/@href
This extracts the URLs, but as luck would have it, those URLs are relative to the list page (i.e. they are missing https://www.ola.org/en/members/current). We can use the concat() XPath function to construct the full URLs:
concat('https://www.ola.org',*[1]/a/@href)

Custom XPath queries

Sometimes, however, we do have to do a bit of work to get Scraper to select the data elements that we are interested in.

Going back to the example of the Canadian Parliament we saw in the introduction, there is a page on the same website that lists the mailing addresses of all parliamentarians. We are interested in scraping those addresses.

If we select the addresses for the first MP and try the “Scrape similar” function…

Screenshot of the Scraper context menu being used on an address block

Scraper produces this:

Screenshot of the Scraper window trying to scrape addresses

which does a nice job separating the address elements, but what if instead we want a table of the addresses of all MPs? Selecting multiple addresses instead does not help. Remember what we said about computers not being smart about structuring information? This is a good example. We humans know what the different blocks of texts on the screen mean, but the computer will need some help from us to make sense of it.

We need to tell Scraper what exactly to scrape, using XPath.

If we look at the HTML source code of that page, we see that individual MPs are all within ul elements:

(...)
<ul>
   <li><h3>Aboultaif, Ziad</h3></li>
   <li>
      <span class="addresstype">Hill Office</span>
      <span>Telephone:</span>
      <span>613-992-0946</span>
      <span>Fax:</span>
      <span>613-992-0973</span>
   </li>
   <li>
         <ul>
            <li><span class="addresstype">Constituency Office(s)</span></li>
            <li>
               <span>8119 - 160 Avenue (Main Office)</span>
               <span>Suite 204A</span>
               <span>Edmonton, Alberta</span>
               <span>T5Z 0G3</span>
               <span>Telephone:</span> <span>780-822-1540</span>
               <span>Fax:</span> <span>780-822-1544</span>
               <span class="spacer"></span>
            </li>
      </ul>
   </li>
</ul>
(...)

So let’s try changing the Selector XPath in Scraper to

//body/div[1]/div/ul

and hit “Scrape”. We get something that is closer to what we want, with one line per MP, but the addresses are still all in one block of unstructured text:

Screenshot of the Scraper window trying to scrape addresses

Looking closer at the HTML source, we see that name and addresses are separated by li elements within those ul elements. So let’s add a few columns based on those elements:

./li[1] -> Name
./li[2] -> Hill Office
./li[3] -> Constituency

This produces the following result:

Screenshot of the Scraper window scraping addresses

The addresses are still one big block of text each, but at least we now have a table for all MPs and the addresses are separated.

Scrape the Canadian MPs’ phone numbers

Keep working on the example above to add a column for the Hill Office phone number and fax number for each MP.
Solution

Add columns with the XPath query
./li[2]/span[3] -> Hill Office Phone
./li[2]/span[5] -> Hill Office Fax

Advanced XPaths

Now we are finally ready to do some web scraping. Let’s go to the list of Senators and Members of the Australian Parliament.

With the Scrapy extension installed, we can select the first Member from the list, do a right click and choose “Scrape similar” from the contextual menu:

Screenshot of the Scraper contextual menu

Alternatively, the “Scrape similar” option can also be accessed from the Scraper extension icon:

Either operation will bring up the Scraper window:

Screenshot of the Scraper main window

We recognize that Scraper has generated XPath queries that corresponds to the data we had selected upon calling it. The Member of Parliament (highlighted in red in the above screenshot) has been set to //h4/a which selects the name and URL of the first Member of Parliament.

In fact, we can try out that query using the technique that we learned in the previous section by typing the following in the browser console:

$x("//h4/a")

returns something like

<- Array [12]

which we can explore in the console to make sure this is the right data.

We can replicate their effect by trying out

$x("//h4/a/*[1]")

in the console. This should select only the first header. The same goes for the second header.

In this case we need to do some extra work to get the constituency information, as Scraper was unable to deduce it automatically for us. If we look at the inspector for Anthony Albanese that is shown below, we can see all the information contained within a div element, including the h4 element containing the name, the Electorate information, positions and the Party.

Using the inspect element to investigate the structure of an entry.

This div element is the context node we want to select for our XPath query, with all the individual elements selected relative to it. Rather than trying to find the appropriate div element, it is easier to refer to this as the parent of the h4 element; we already know this selects each of the MPs on the page. We can update our XPath query in the Scraper extension, using //h4/.. for the selector, ./h4/a for the name, and ./h4/a/@href for the link.

We can replicate the same query as the default selecting the parent `<div>` element
and extracting information relative to that.

While this change makes the current query more complicated, it makes it easier to extract the additional information about the Electorate, Positions, the Party, and information on how to connect with the MP. To extract this information we need to add extra columns to Scrapy, specifying the relative XPath used to select each. The HTML for of the selected div for Anthony Albanese is shown below.

<div class="medium-pull-2 medium-7 large-8  columns">
  <h4 class="title">
    <a href="/Senators_and_Members/Parliamentarian?MPID=R36">Hon Anthony Albanese MP</a>
  </h4>
  <dl class="dl--inline__result text-small">
    <dt>For</dt>
    <dd>Grayndler, New South Wales</dd>
    <dt>Positions</dt>
    <dd>Leader of the Opposition</dd>
    <dt>Party</dt>
    <dd>Australian Labor Party</dd>
    <dt>Connect</dt>
    <dd>
      <a href="mailto:A.Albanese.MP@aph.gov.au" class="social mail" target="_blank"><i class="fa fa-lg margin-right fa-envelope-o"></i></a>
      <a href="http://twitter.com/AlboMP" class="social twitter margin-right" target="_blank"><i class="fa fa-lg fa-twitter"></i></a>
    </dd>
  </dl>
</div>

Here we can see the Electorate (For) is given by the first definition description tag <dd> within the definition list <dl>, the Positions is given by the second definition tag, and the party by the third. We could use the following selections;

Electorate: ./dl/dd[1]
Position: ./dl/dd[2]
Party: ./dl/dd[3]

The problems with using this method is that the structure of each MP is slightly different. Dr Katie Allen MP has no positions, so the Party information is in the Position column. We also have Hon Kevin Andrews MP who has two positions, resulting in no party information. We can solve this by using the information in the definition term tag <dt> that precedes the definition description <dd> tag. For the Electorate information, the <dt> tag contains the text For which we can use to identify it using the relative XPath query, ./dl/dt[text() = 'For']. To get the following <dd> tag, we can use the following-sibling::dd XPath selector giving

./dl/dt[text() = 'For']/following-sibling::dd`

A similar approach can be taken for all selections

Electorate: ./dl/dt[text() = 'For']/following-sibling::dd
Position: ./dl/dt[text() = 'Positions']/following-sibling::dd
Party: ./dl/dt[text() = 'Party']/following-sibling::dd

Finally we can handle multiple Positions by selecting each index within the XPath query,

Position: ./dl/dt[text() = 'Positions'][1]/following-sibling::dd
Position 2: ./dl/dt[text() = 'Positions'][2]/following-sibling::dd

Giving our final query shown below.

The Scraper extension provides a useful utility for the extraction of structured information from Websites. Using the XPath query syntax, we are able to construct powerful queries to extract the information we require. However, the Scraper extension has the fairly significant drawback that it can only scrape a single page of information. Additionally it only samples a single point in time.

Key Points

Data that is relatively well structured (in a table) is relatively easily to scrape.

More often than not, web scraping tools need to be told what to scrape.

XPath can be used to define what information to scrape, and how to structure it.

More advanced data cleaning operations are best done in a subsequent step.

previous episode

Introduction to web scraping

next episode

Manually scrape data using browser extensions

Overview

Using the Scraper Chrome extension

Scrape similar

Scrape the list of Ontario MPPs

Solution

Custom XPath queries

Scrape the Canadian MPs’ phone numbers

Solution

Advanced XPaths

Key Points

previous episode

next episode