Web Scraping 101 - A Post-Mortem

25 Jun 2016

I’ve finally taken some ground on my first serious personal project: a deep dive into Pitchfork album reviews. I’ve followed the site for nearly a decade now and thought it would be interesting to analyze and visualize their 15 years of reviews. Before I could do anything fun with the reviews, I had to scrape them.

“Web scraping” is a fancy way of saying “gathering a bunch of data from a website and storing it locally.” It’s the process of taking information presented on the web in human-readable form and making it geek-readable by storing it in a database or flat files. At larger scales, it’s often tied to “web crawling,” or programmatically navigating and exploring the web.

Web scraping seemed like an extremely daunting task before I actually tried it out for this project. Modern web pages can be pretty complicated, and I assumed JavaScript-based rendering frameworks like React and Angular would make this a painful process. In reality, that proved not to be the case. Gathering data off a web page is very much a “what you see is what you get” (WYSIWYG) affair - as long as the developers of the website haven’t gone out of their way to obfuscate the patterns they follow to render a page, you shouldn’t have a hard time finding the information you need.

I’m sure there are some bright, enterprising devs who would rather get their hands dirty and roll their own extraction code, but I opted to stand on the shoulders of giants and use an existing library. I went with Python’s Beautiful Soup. There are more robust options out there, such as Scrapy, but since I’m dealing with a relatively small and fixed set of pages, I opted for the package with what felt like the smallest learning curve.

Beautiful Soup lets you grab elements from a web page based on characteristics such as the HTML element type or the CSS classes applied. To really figure out how to scrape a page, you’ll have to get familiar with your web browser’s developer tools. I found the easiest way to find elements was by CSS class, but on some sites you may be able to get everything you need from HTML IDs. This ties back to a point I mentioned earlier - developers who programmatically generate their site content may have naming conventions that make sense in the context of their database or site generator, but could make your life difficult.
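
As a concrete (and hypothetical) example, here’s roughly what that looks like in code. The URL and class/id names below (`review-title`, `score`, `artist-name`) are placeholders, not Pitchfork’s actual markup; in practice you’d swap in whatever the developer tools show for the site you’re scraping.

```python
# A minimal sketch of pulling fields off a review page with Beautiful Soup.
# The URL and class/id names are made up for illustration - find the real
# ones by inspecting the page in your browser's developer tools.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://pitchfork.com/reviews/albums/some-album/").text
soup = BeautifulSoup(html, "html.parser")

# Grab elements by CSS class...
title = soup.find("h1", class_="review-title")
score = soup.find("span", class_="score")

# ...or by HTML id, if the site uses them consistently.
artist = soup.find(id="artist-name")

print(title.get_text(strip=True) if title else "title not found")
print(score.get_text(strip=True) if score else "score not found")
```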

After you’ve figured out what you’ll be grabbing, I’d suggest testing it on a few different pages. Make sure the schema that maps web pages to your scraper’s output doesn’t have any corner cases where you’d lose data. For Pitchfork reviews, I had to account for a variety of review formats. Which year am I grabbing for a reissue? What about reviews that span multiple albums with individual scores? Am I getting all of the record labels associated with the release? Does my logic work for new releases just as well as it does for 10+ year-old reviews originally stored & displayed with less info on a different content management system?
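
One cheap way to catch those corner cases is to treat every field as optional and log whatever comes back empty, then eyeball the log while you test. A rough sketch, again with made-up class names rather than Pitchfork’s actual markup:

```python
# Defensive extraction: every field is optional, so odd review formats
# (reissues, multi-album reviews, older CMS pages) don't crash the run.
# All class names here are hypothetical stand-ins.
def extract_review(soup):
    review = {
        "artist": None,
        "album": None,
        "scores": [],   # multi-album reviews can carry several scores
        "labels": [],   # a release can list more than one label
        "year": None,
    }

    artist = soup.find("a", class_="artist-link")
    if artist:
        review["artist"] = artist.get_text(strip=True)

    album = soup.find("h1", class_="review-title")
    if album:
        review["album"] = album.get_text(strip=True)

    review["scores"] = [s.get_text(strip=True)
                        for s in soup.find_all("span", class_="score")]
    review["labels"] = [l.get_text(strip=True)
                        for l in soup.find_all("li", class_="label")]

    year = soup.find("span", class_="year")
    if year:
        review["year"] = year.get_text(strip=True)

    # Flag anything that came back empty so format oddities surface early.
    missing = [key for key, value in review.items() if not value]
    if missing:
        print("missing fields:", missing)

    return review
```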

Once I had determined what to grab, the pseudocode for my script was pretty straightforward:

  1. Grab a page of reviews.
  2. Go to every review listed on that page; grab the relevant data.
  3. Increment the page; go to step 1.
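
In actual Python, that outer loop might look something like this. It’s a sketch rather than the exact script in the repo: it leans on the “?page=[number]” query string trick described in the notes below, on the hypothetical extract_review() helper sketched above, and on a placeholder class name for the review links.

```python
# Sketch of the main loop: fetch an index page, visit each review on it,
# then move on to the next page.
import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://pitchfork.com/reviews/albums/"

def scrape_all_reviews():
    reviews = []
    page = 1
    while True:
        # Step 1: grab a page of reviews.
        resp = requests.get(INDEX_URL, params={"page": page})
        if resp.status_code != 200:
            break  # ran off the end of the index (or hit an error)

        index = BeautifulSoup(resp.text, "html.parser")

        # Step 2: go to every review listed on that page and grab its data.
        for link in index.find_all("a", class_="review-link"):  # placeholder class
            review_page = requests.get("https://pitchfork.com" + link["href"])
            soup = BeautifulSoup(review_page.text, "html.parser")
            reviews.append(extract_review(soup))

        # Step 3: increment the page and repeat.
        page += 1

    return reviews
```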

Of course, there’s a little more to it; if you’d like to dig into the nuts and bolts, I’ve posted a GitHub repository that contains the script I used to gather the data, some Jupyter notebooks that delve into a bit more detail about the info gathered on a particular page, and the 5MB flat file that contains some of the most salient points of the reviews.

Some additional notes:

  • Lots of sites use autoscrolling features to get from one page to the next, but probably have paging mechanisms under the hood. Even though the index of Pitchfork reviews autoscrolls, I was still able to navigate page-by-page using old-fashioned query strings - sticking “?page=[number]” at the end of each call.
  • Although my code “gracefully” handles exiting on response codes other than 200 (OK), I would suggest implementing different logic for 404 vs. 400 vs. 502, etc. (the sketch after this list shows the kind of thing I have in mind). I log any reviews that aren’t successfully retrieved, but I had to go back and grab the handful of reviews that failed along the way. I’ll probably add that logic in the future, but considering this was a historical one-shot I don’t foresee running again, I didn’t want to over-engineer it for the less than 1% of reviews that gave me an issue.
  • As mentioned above, I made sure not to over-engineer this process since it included a lot of firsts for me, but one could easily argue it’s a little under-engineered. To take things a step further, I’d encourage taking a look at NPR’s David Eads’ blog post on useful scraping techniques for a peek at some stuff I’d like to use in the future.
  • Full disclosure: I have not formally asked Conde Nast or the editors of Pitchfork for permission to turn their website into my personal database. Web scraping is sometimes frowned on by website owners, so please scrape responsibly. Consider asking for permission, especially if you use the information for profit.
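
On the response-code point above, here’s the kind of branching I’d reach for next time. It’s a sketch of future work rather than what my script currently does, and the retry count and backoff are arbitrary.

```python
# Treat HTTP status codes differently instead of bailing on anything != 200:
# skip 404s, retry transient 5xx errors with backoff, and log the rest.
import time
import requests

def fetch(url, retries=3):
    for attempt in range(retries):
        resp = requests.get(url)
        if resp.status_code == 200:
            return resp.text
        if resp.status_code == 404:
            # The page genuinely isn't there; log it and move on.
            print(f"not found, skipping: {url}")
            return None
        if resp.status_code >= 500:
            # Server hiccup (e.g. 502): back off and try again.
            time.sleep(2 ** attempt)
            continue
        # Other 4xx codes usually mean the request itself is malformed.
        print(f"unexpected status {resp.status_code} for {url}")
        return None
    print(f"giving up on {url} after {retries} attempts")
    return None
```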