APIs are a developer's best friend when accessing remote data, but great APIs don't grow on trees. So what do you do when the data you need isn't accessible through a well-designed API, or there is no API at all? As long as the data is accessible through your Web browser, you can always scrape it yourself! In this post I'll go through how to build a simple Web scraper in 10 minutes using Guzzle and PHP's DOM parser. I'll also give a brief introduction to XPaths.
Web scraping is the art of fetching and parsing a Web document to extract information.
When scraping a Web site, we first need to request the page and receive its HTML from the server. There are a bunch of tools that can be used for this. One of the more well-known is cURL, a library and command-line tool that can transfer data using a wide range of protocols, including HTTP.
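As a quick illustration (this is a sketch using PHP's cURL bindings, not the Guzzle-based approach used later in this post, and the function name is my own):

```php
<?php
// Fetch a page over HTTP using PHP's cURL extension.
function fetchPage(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    $html = curl_exec($ch); // the HTML body, or false on failure
    curl_close($ch);
    return $html;
}
```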
After we have received the HTML from the Web server, it is time to parse it into a DOM tree and extract the information we want. The most naive way to do this is with string operations and regular expressions. That is usually very time consuming for more complex HTML documents, and it tends to break as soon as the markup changes slightly. A more robust solution is to use a DOM parser library with either CSS selectors or XPaths to query the DOM for the elements that contain the information we want to extract. PHP has a decent DOM parser built into one of the language's default extensions.
In the basic scrape class below I use the Guzzle PHP library to request and receive the site's HTML. Then I use PHP's DOM parser to extract the nodes containing the information.
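The original class isn't shown here, but a minimal sketch of it might look like this. The class and method names are my own, and the fetch step assumes Guzzle has been installed via Composer (composer require guzzlehttp/guzzle):

```php
<?php
class Scraper
{
    // Fetch a URL with Guzzle and return the raw HTML.
    public function fetch(string $url): string
    {
        $client = new \GuzzleHttp\Client();
        return (string) $client->get($url)->getBody();
    }

    // Parse an HTML string and run an XPath query against it.
    public function query(string $html, string $xpath): DOMNodeList
    {
        $doc = new DOMDocument();
        @$doc->loadHTML($html); // suppress warnings from sloppy real-world HTML
        return (new DOMXPath($doc))->query($xpath);
    }

    // Convenience wrapper: fetch a page and query it in one call.
    public function scrape(string $url, string $xpath): DOMNodeList
    {
        return $this->query($this->fetch($url), $xpath);
    }
}
```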
Now how do we use this class to scrape a Web page? First we need to understand what an XPath is and how to use it.
If you've ever written CSS you know what a CSS selector is. XPath stands for XML Path Language and is a query language, just like CSS selectors, used for selecting nodes from an XML or HTML DOM tree. Most modern browsers support XPaths in their development console, so press Ctrl+Shift+I if you are on Chrome in Windows, or Cmd+Option+I on Chrome in OS X, and type in this:

$x('//b[@id="test"]')

When you hit enter it should return the matching element (on this page, a <b> tag with the id "test").
So how does it work? The first part, //, tells the parser to search the whole document, starting from the root. b filters the result down to all <b> tags on the page. The brackets [...] right next to the b tell the parser to match on attributes of the node, and @id="test" narrows the match to the node whose id attribute equals "test". Let's look at some more examples of how we can select the same node. If you want, you can look at the DOM and try to figure out how each one works:
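The original examples are not shown here, so below is a sketch of a few XPaths (my own, run through PHP's DOMXPath rather than the browser console) that all select the same node in a small sample page standing in for the original DOM:

```php
<?php
// A few equivalent ways to select the same <b id="test"> node.
// The sample HTML is illustrative, standing in for the original page.
$doc = new DOMDocument();
@$doc->loadHTML('<html><body><p><b id="test">target</b></p></body></html>');
$xp = new DOMXPath($doc);

$queries = [
    '//b[@id="test"]', // any <b> with id="test"
    '//*[@id="test"]', // any element at all with id="test"
    '/html/body/p/b',  // absolute path from the document root
    '//p/b[1]',        // the first <b> child of any <p>
];

foreach ($queries as $q) {
    echo $q, ' -> ', $xp->query($q)->item(0)->textContent, PHP_EOL;
}
```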
Putting it all together
Now let's use the scraper class and our knowledge of XPaths to scrape the root page of this site:
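The original snippet is not shown here, but a self-contained sketch might look like the following. The markup structure (posts as <article> elements with an <h2><a> title link) and the URL are assumptions, not taken from the actual site:

```php
<?php
// Extract post titles and links from a page's HTML.
// Assumes each post is an <article> containing an <h2><a> title link.
function extractPosts(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $xp = new DOMXPath($doc);

    $posts = [];
    foreach ($xp->query('//article//h2/a') as $link) {
        $posts[] = [
            'title' => trim($link->textContent),
            'url'   => $link->getAttribute('href'),
        ];
    }
    return $posts;
}

// Fetch the front page with Guzzle and print each post:
// $client = new \GuzzleHttp\Client();
// $html = (string) $client->get('https://example.com/')->getBody();
// foreach (extractPosts($html) as $post) {
//     echo $post['title'], ' -> ', $post['url'], PHP_EOL;
// }
```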
When you run this it should print out a list of all the posts and the corresponding data available on the home page of this site.
If you've read this far, I hope that you found this introduction to scraping useful. If you have any questions about XPaths or anything else in this article, just post in the comment section below and I'd be happy to help. The source code is also available on GitHub.