Sabbatical.io: The Screen Scraper

Screen Scraping

In my last post in this series, I explained why focusing on the domain model wasn't the best idea for me or the development of my app. Domain modeling was too heady a concept for me to invest in right now, so I decided instead to start building something that might initially contain crappy code. You can always refactor later, right? So, I focused on an immediate need: planning a trip to Europe. I had the general cities I wanted to visit in mind, but I was totally unsure of what route to take through them. A screen scraping tool would let me collect and analyze the travel data between those cities, so a scraper became the first feature I built for my sabbatical.io application.

The Tools

For my first foray into the world of screen scraping, I decided to use Capybara and Selenium to do my bidding. Capybara is a tool for testing your Ruby web applications by simulating how a real user would interact with them. Its API has a very nice feel to it, making the logic of any function you write with it easy to follow. To demonstrate, let's go to a site, fill in some data, and submit a form. The code might look something like this:

def test_scrape
  visit('/about')                                # go to the page with the search form
  choose('SearchInput_OneWay')                   # pick the one-way radio button
  select(airport, :from => 'SearchInput$Orig')   # choose an origin from the select list
  fill_in('SearchInput$DeptDate', :with => date) # type in the departure date
  click_on('SearchInput_ButtonSubmit')           # submit the search
end

I love how anyone, even someone who knows nothing about coding or Ruby, can look at that logic and know what should happen next. If all goes well, we'll end up on the site's search results page, where we can assert that values exist or, in my case, collect and store data for future use. I chose a radio button, picked an option from a select list, filled in a text field, and clicked a submit button using the simplest possible language I could imagine.
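Collecting data on the results page is more of the same. Something like this would print every fare on the page (the CSS class here is a stand-in for whatever the real site uses):

all('.fare-price').each do |fare|
  puts fare.text    # print the inner text of each matched element
end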

I will note that 'airport' and 'date' in the form-filling example above are variables, and both are strings. Since I was iterating over a list of locations I'd like to visit, I naturally turned these into variables, but it's important to note which types Capybara's different functions expect. For example, form inputs and regular HTML tags have different ways of accessing their textual values.

find('.some-class').text        # inner text of a regular HTML element
find_field('Text Field').value  # current value of a form field

In the example above, the first find call searches the DOM for the given selector and returns the inner text of that HTML element, while the second looks for a form field labeled "Text Field" and returns the value of that field. I initially had a little trouble using the value method when I should have been using the text method. I recommend reading the Capybara documentation in its entirety before you begin coding with it, to avoid as many of these easily avoidable mistakes as you can.

Building the Algorithm

Since I'm a noob to both Ruby and Rails, I didn't really know the best way to test my screen scraping algorithm in isolation. I knew I could build some sort of UI where a user could press a button to start scraping, but I thought monitoring things in my terminal would be a better idea. So, I decided to make a rake task to test out my algorithm as I built it.

In Rails, it's pretty easy to set up any task you'd want to perform as a rake task. Rails creates a /lib/tasks folder by default when you generate a new project. You simply make a file named after the task and define the task inside it. I chose to make a "scrape.rake" file in my /lib/tasks/ folder, and this is what the code looked like:

task :scrape => :environment do
  puts "Scraping..."
  scraper = Scraper.new
  scraper.test
end

The task keyword lets you define a symbol that will be the name of your task, and the :environment symbol is a prerequisite that loads your full Rails environment, models included, so the task can use them. I'm certainly no expert on rake tasks, so I can't explain much more, but I would start by defining your task this way just to get the hang of how things work.
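One plain-rake nicety worth knowing: a desc line above the task makes it show up when you list available tasks with rake -T. Nothing about this is specific to my setup, and the description text is just an example:

desc "Scrape travel sites for trip prices"
task :scrape => :environment do
  # ...same body as above
end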

I started the task with a log of "Scraping..." just so I knew it was working. I always think it's a good idea to build debugging statements into your code when you're building something with many parts that could stop working for numerous reasons. This way, you always know to check for these statements instead of trying to debug something that isn't the issue at hand. If I don't see "Scraping..." when I start the scrape task, I'll know the problem is with rake or my setup rather than my code, and this quick check could potentially save me hours of debugging time.

I then instantiate a new Scraper object from my Scraper class and run the test method to kick everything off. Since I included the :environment symbol, the rake task knows where to find the Scraper model and its methods. Once I save this file, all I have to do is open a terminal, switch to the root directory of my Rails application, and run "rake scrape" to begin debugging my scraping algorithm. Let's see how that works out...

Selenium Web Driver

You'll get an error the first time you try to run "rake scrape". I'm not going to go over every step I went through, but the main error I'm talking about relates to including Capybara in the Scraper model and initializing the object with Selenium as the default web driver.

First, to use the nice Capybara syntax I mentioned previously, you must include its Domain Specific Language (DSL) module. Rails by default knows nothing about Capybara's DSL, so you must include it like this:

include Capybara::DSL

Then, you must change the default web driver in order to access web pages outside of your current app. Capybara is set up to test applications, so it makes sense that statements like "visit('/about')" look for the "about" route of the application you're building. This is the sensible default most of the time, and it makes sense not to have to set any variables for that behavior to work.

However, I'm not using Capybara to test my own application, but rather to visit external websites and use Capybara's DSL for easy interaction with them. To make this work, the first answer I found on the interwebs told me to use the Selenium web driver. You'll have to add the selenium-webdriver gem to your Gemfile, run bundle install again, and then set the default driver somewhere in your code:

def initialize
  Capybara.default_driver = :selenium  # drive a real browser instead of the app under test
  # initialize other variables...
end
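For reference, the Gemfile side of that is just one line, followed by another bundle install:

gem 'selenium-webdriver'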

I ended up putting the default_driver assignment in the initialize method of the Scraper class, because Selenium would be needed in any scraping scenario I could come up with. I also ended up initializing a lot of other variables I needed as I went along, but nothing else fancy happens in initialize.
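Putting those pieces together, the skeleton of my Scraper class looked roughly like this (a sketch; the real file accumulates more instance variables as the algorithm grows):

class Scraper
  include Capybara::DSL

  def initialize
    Capybara.default_driver = :selenium  # headed browser for external sites
    @trips = []                          # results collected by the scraping functions
  end
end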

My next task was to build up a scraping algorithm for a couple of airline ticketing sites that would take the lowest price for a flight given an origin, a destination, and a date. My goal was for each scraping function to end by pushing a result like this:

@trips.push("#{website} - #{origin} to #{dest} on #{date} for: #{price}")

It would then be easy to look at the array and know which website a trip came from, the price, and details like date, origin, and destination. Ideally these values would be saved in a Trip model so I could display the data to a user later. The user would see a list of trips ranked by total price and could decide which route was best. From there, they could book the tickets on whatever site they wanted.
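I hadn't built that model yet, but a hypothetical Trip (every name here is made up for illustration) might look something like this:

class Trip < ActiveRecord::Base
  # hypothetical columns: website, origin, dest, date, price

  # cheapest trips first, for the ranked list a user would see
  def self.ranked_by_price
    order(:price)
  end
end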

Selenium/Capybara Gotchas

I basically went bit by bit to capture the logic I needed to advance my rake task toward finding the cheapest flight listed on the search results pages. The details of those bits of code aren't that interesting, but I'd like to point out a few gotchas that got me in the process.

First off, I'd like to demonstrate why Selenium is a good choice for tasks like screen scraping rather than another default web driver. Web drivers can be separated into two camps: headless and headed.

Headless web drivers make raw HTTP requests to websites, curl-style, and do something with the response. If you need to interact with a website, however, a headless driver can't help you out. Because it doesn't load the assets a real web browser would, JavaScript files don't get parsed and certain interactions are missed. If you're trying to pretend you're a real web browser, you might as well go whole hog and choose a headed web driver.
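In Capybara terms, the two camps look like this. The :rack_test driver is Capybara's built-in default, and :selenium is the one I switched to earlier:

Capybara.default_driver = :rack_test   # headless: fast, but JavaScript never runs
Capybara.default_driver = :selenium    # headed: drives a real browser, JavaScript and all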

Selenium is a headed web driver that actually starts up an instance of a browser, Firefox by default, that you can watch next to your terminal. It's pretty neat to watch your code run in real time, and this method of interaction gives you crucial bonuses, even if you aren't aware of them at first.

For example, I began exploring screen scraping with some good ole jQuery. The "$" object was even handy on each site in my dev tools! I started by trying to input airport names into one of the site's search forms using the .val() method.

$('.form-input-origin').val('LGW');

This worked fine for placing the text inside the form element, but after running that code I got an error that said, "No origin selected...". The site's JavaScript was actually listening for a key-press or something like that to finish rendering an autocomplete drop-down, so the .val() method wasn't cutting it. I needed something that acted like a real user, and thankfully Selenium and Capybara made that easy:

select("LGW", :from => 'form-input-origin')

Somehow, that line just worked. I didn't need to change anything else, and the website was tricked into thinking I was a real user. I was really thrilled that I didn't have to find some way of using the key-press event to get the behavior I wanted.

The next issue I ran into was an error stating that Capybara couldn't find a certain element on the page. While debugging, I noticed it couldn't find even the simplest of elements, so I went to the internet. I don't have the original error message handy, but all I had to do was paste it into Google to find my answer.

Apparently (and this may be fixed by the time you read this), Firefox 35 deprecated some function that broke the Selenium web driver. I'm so glad I found the root of this issue quickly, or I would have really started banging my head against the wall. All I had to do was downgrade Firefox to version 34 to fix it.

The last gotcha I'll mention was related to selecting elements on the page. As I mentioned, some form values had to be selected from an autocomplete drop-down before the search could be completed. So, you have to begin typing a string, click on the right autocomplete suggestion, and then move on to the next part of the algorithm to complete your search. How do you select the correct item in the drop-down, though?

My solution was to pick the first match off that list. Since the suggestions all shared a CSS class, I did something like this:

fill_in('OriginAirport', :with => origin)        # type enough to trigger the autocomplete
find('.matching-class', :match => :first).click  # click the first suggestion that appears

The origin variable was deliberately tested to make sure the selection I wanted appeared first in the autocomplete suggestions, while providing few enough characters that the list didn't disappear entirely. Once that value was filled in, I just had to find the shared class, grab the first match, and click it.

To be Continued...

My screen scraping algorithm was off to a good start. I had set up Capybara with the Selenium web driver as the default, made a rake task to test out my code, and begun easily selecting and filling in the forms I needed to gather the data I was after. In part 2 of my screen scraping journey, I'll continue building up the algorithm.

I hope you gained some knowledge from this post, and if you want to follow along further you can go to my project page on GitHub to look at the code. It's not great, but it at least somewhat works...look for the scraper.rb file in the /app/models/ folder for most of the code.