Web scraping with Nightmare.js

A year ago, web scraping was still a foreign concept to me. How could I possibly automate logging into a site, navigating to a target, and extracting the required data, especially if the site relied on JavaScript for key parts of the interaction or DOM manipulation? Fortunately, Nightmare, a simple, straightforward Node.js wrapper around PhantomJS, takes all of the horror out of this. This post covers how to get started with Nightmare and some tricks to get real mileage out of it.

Full disclosure: not only have I used Nightmare, I have also contributed a few features and patches to the Nightmare project (.pdf(filepath), .zoom(zoomFactor), and .scrollTo(top, left)), so I obviously think it is pretty cool. Also, please don't use this tool for automating spam or harassment. No one likes a jerk.

Video

See the live version! Live the Nightmare! Mwahahahahaha!

Intro to Nightmare & PhantomJS

What sets Nightmare apart from most web scrapers is best explained by understanding two things: what sets PhantomJS apart from other web scrapers, and how Nightmare adds value on top of Phantom. PhantomJS is defined as "a headless WebKit scriptable with a JavaScript API" that has "support for various web standards". Since PhantomJS is a full WebKit browser, it handles JavaScript well, including frameworks such as Angular, and it even renders the DOM, allowing for screenshots and PDFs. This is a bit different from the standard scraper.

As for the value Nightmare adds over PhantomJS (and other wrappers, including Casper and Slimer), I see three advantages. First, for those already familiar with Node.js, the only new element here is making sure Phantom is installed on the local machine. Second, it allows for interop with other Node modules. This means you can save data to local files, send emails with data via nodemailer, or even present a custom API served by express for another website. Third, the API is much simpler than Phantom's, which makes it much easier to pick up and run with. However, there are some sites with complex enough JavaScript or poor enough identifiers that it may become too difficult to attempt with Nightmare. At that point, you may want to take the leap to Phantom or examine other options.

Installation

Before we get started, you'll need these three things installed on your system:

  • Node.js
  • npm (bundled with the Node.js installer)
  • PhantomJS

Now that you're all set with that, start a new Node project (npm init helps here) and then install nightmare like this:

npm install --save nightmare

Windows users beware: my experience using Nightmare on Windows has been rather poor due to some of its dependencies not liking to compile without the stars aligned. It may be faster to just download VMware Player and a copy of Ubuntu. Good luck if you plan to proceed anyway.

Getting Started

The basic Nightmare workflow follows this pattern:

  • Require Nightmare: var Nightmare = require('nightmare');
  • Construct a Nightmare object: new Nightmare()
  • Add a method chain composed of .goto(url), .wait(), .type(selector, text), .click(selector), etc.
  • Call .run(callback)

So, something like this will simply load Google's main page, wait for the page to load, and tell us when it is done:

var Nightmare = require('nightmare');

var google = new Nightmare()
  .goto('http://google.com')
  .wait()
  .run(function(err, nightmare) {
    if (err) return console.log(err);
    console.log('Done!');
  });

Unfortunately, this isn't very useful, as it doesn't do any actual work for us. We will fix that shortly.

User Agents, Viewports, Screenshots, PDFs, & High DPI

Before we get to the real powerful stuff, let's talk about some of the settings and how to extract data in the form of screenshots and PDFs. First, some sites will intentionally serve different versions of the site to different browsers, such as IE or various mobile browsers. In order to trigger the detection of your intended browser, use the .useragent(useragentstring) method before using .goto(url) to visit a site. Here's just part of our example from above, with the user agent of an older version of Chrome added to it:

var google = new Nightmare()
  .useragent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36")
  .goto('http://google.com')
  .wait()

One excellent use case for web scrapers is simply to get a screenshot of a website. Whether it is weather warnings, stock alerts, or something else, being able to automate a screenshot and email it to yourself can be very useful. It can also be extremely useful for debugging more complex scripts, as it is your only window (no pun intended) into what is happening inside of Phantom. This is easy to do with the .screenshot(filepath) method. You have to supply the extension (PNG, GIF, JPEG, or PDF) as part of the filename. So, let's just snap a screenshot of Google's homepage:

var google = new Nightmare()
  .goto('http://google.com')
  .wait()
  .screenshot('google.png')
  .run(function() { console.log('Done!'); });

While PDF was mentioned above as a screenshot format, it ends up creating a big single page. What if you want to turn a page into something readable on your Kindle or tablet? That's where the .pdf(filepath) method comes into play. This formats the page for A4 pages with large enough margins to be printed on US letter. The last example requires minimal changes to switch to PDF:

var google = new Nightmare()
  .goto('http://google.com')
  .wait()
  .pdf('google.pdf')
  .run(function() { console.log('Done!'); });

Now, there's one important thing that has been overlooked: viewport size. Because Phantom is headless, we need to tell it how large the viewport should be for rendering the screenshots and PDFs. This can be done with the conveniently named .viewport(width, height) method. However, like .useragent(useragentstring) it needs to be called before calling .goto(url). It is worth noting that it does not matter if the content extends past the height of the viewport, as the entire page is captured. At first this makes the height seem useless, but it serves a few roles. First, if the page does not fill the height of the viewport, the full viewport is still captured in the screenshot. Second, the height of the viewport may still impact how things are rendered, such as DOM elements that define their dimensions in vh units. So, do set this to a reasonable value. Let's see a quick example of a screenshot based on an 800px x 600px viewport:

var google = new Nightmare()
  .viewport(800, 600)
  .goto('http://google.com')
  .wait()
  .screenshot('google.png')
  .run(function() { console.log('Done!'); });

So, now that we have that power, what if we want higher quality screenshots? We can achieve higher resolution screenshots by tweaking our viewport and adding zoom. The .zoom(zoomFactor) method allows us to zoom in on the page. For example, a zoomFactor of 2 yields a page rendered at 200% zoom. If you pair that with a viewport twice as wide and tall, you can effectively render the page identically, while producing more pixels. Let's see that in action with a 200% zoom on our last example:

var google = new Nightmare()
  .viewport(1600, 1200)
  .zoom(2)
  .goto('http://google.com')
  .wait()
  .screenshot('google.png')
  .run(function() { console.log('Done!'); });

Forms & Buttons

Now, we could keep searching on Google, but let's target a more interesting job. Let's say we want to automatically download free albums on Bandcamp. Aggregating those is a more complex process, but once you master the tools outlined below, you can figure that (and most other automation tasks) out. For now, let's go through the steps to download a free album once you have found one. We will use this album as an example.

This process is very exploratory. When tackling any objective with Nightmare, a combination of inspecting the DOM with your browser's dev tools and taking screenshots to verify your progress is essential. We will start by just visiting the target page and ensuring that our setup is correct.

var Nightmare = require('nightmare');

var bandcamp = new Nightmare()
    .viewport(1000, 1000)
    .useragent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36")
    .goto('http://aerotrak.bandcamp.com/album/at-ease')
    .wait()
    .screenshot('bandcamp1.png')
    .run(function (err, nightmare) {
      if (err) return console.log(err);
      console.log('Done!');
    });

Assuming that the site hasn't changed much, you should see something like this:

bandcamp1.png: First stop.

Nightmare gives us the power to click on or type into any DOM element that we can target with CSS selectors via the .click(selector) method. An inspection of the page reveals that the 'Buy Now' link is actually a button with the class download-link. It turns out this class is also used on the 'Send as Gift' button, so we can pin down the right button by selecting it as a descendant of the h4 element that contains it. Here's what the code and results look like:

var Nightmare = require('nightmare');

var bandcamp = new Nightmare()
    .viewport(1000, 1000)
    .useragent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36")
    .goto('http://aerotrak.bandcamp.com/album/at-ease')
    .wait()
    .screenshot('bandcamp1.png')
    .click('h4.ft button.download-link')
    .screenshot('bandcamp2.png')
    .run(function (err, nightmare) {
      if (err) return console.log(err);
      console.log('Done!');
    });

bandcamp2.png: Payment screen.

Now, we need to fill in the price. We can do that with the .type(selector, text) method. Targeting by selectors works the same as with .click(selector). Fortunately, the price field has an id, so this is trivial. Once again, here are the code and results:

var Nightmare = require('nightmare');

var bandcamp = new Nightmare()
    .viewport(1000, 1000)
    .useragent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36")
    .goto('http://aerotrak.bandcamp.com/album/at-ease')
    .wait()
    .screenshot('bandcamp1.png')
    .click('h4.ft button.download-link')
    .screenshot('bandcamp2.png')
    .type('#userPrice', '0')
    .wait(500)
    .screenshot('bandcamp3.png')
    .run(function (err, nightmare) {
      if (err) return console.log(err);
      console.log('Done!');
    });

bandcamp3.png: After price entry.

One piece that has not been discussed yet is the .wait(500) call. We have used .wait() before, but not .wait(milliseconds). The .wait() method causes Nightmare to wait for a page to load before proceeding to the next command, while .wait(milliseconds) simply pauses for the given amount of time before proceeding. Generally, we use .wait(), but in this case, we are only pausing for some JavaScript to execute and reveal our next DOM target (try the screenshot with and without it to see the difference).

That next target is the "Download Now" button. It isn't directly selectable, but its parent has an id, so we have a shot again. We .click(selector) that, .wait() for the next page to load, and .wait(milliseconds) for some JavaScript to produce our next target. As mentioned before, this kind of thing is discoverable mainly through exploratory work. Let's see the code and results:

var Nightmare = require('nightmare');

var bandcamp = new Nightmare()
    .viewport(1000, 1000)
    .useragent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36")
    .goto('http://aerotrak.bandcamp.com/album/at-ease')
    .wait()
    .screenshot('bandcamp1.png')
    .click('h4.ft button.download-link')
    .screenshot('bandcamp2.png')
    .type('#userPrice', '0')
    .wait(500)
    .screenshot('bandcamp3.png')
    .click('#downloadButtons_download button')
    .wait()
    .wait(1000)
    .screenshot('bandcamp4.png')
    .run(function (err, nightmare) {
      if (err) return console.log(err);
      console.log('Done!');
    });

bandcamp4.png: Final target.

Total Control with Evaluate

Now that we are on the final target, we have a predicament: we are unable to directly download the file. This, however, is where Nightmare shows its most interesting property. The .evaluate(function, callback, params) method gives us the ability to execute a function (with parameters, even) in the context of the web page and pass a return value back to Nightmare's context. This is extremely powerful: you can use it to manipulate just about anything in the DOM. For example, you can regex search the DOM for nodes and navigate to a related node.

We will be doing something a little simpler, though. We can acquire the album name and the final address of the file, and return them to Nightmare. Then, we can use Node's http and file system modules to request the file and write it to disk. We have no need for the params ourselves, but they could definitely be useful elsewhere. We also (ab)use the fact that the page already includes jQuery, making our job a little easier. When the code executes, we should find all of the screenshots and our target file (named after the album) in the same directory as our script. Here is the completed source:

var Nightmare = require('nightmare');
var fs = require('fs');
var http = require('http');

var bandcamp = new Nightmare()
    .viewport(1000, 1000)
    .useragent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36")
    .goto('http://aerotrak.bandcamp.com/album/at-ease')
    .wait()
    .screenshot('bandcamp1.png')
    .click('h4.ft button.download-link')
    .screenshot('bandcamp2.png')
    .type('#userPrice', '0')
    .wait(500)
    .screenshot('bandcamp3.png')
    .click('#downloadButtons_download button')
    .wait()
    .wait(1000)
    .screenshot('bandcamp4.png')
    .evaluate(function () {
      return {
        name: $('.downloadItemTitle').text().trim(),
        href: $('.downloadGo').prop('href').trim()
      };
    },function (value) {
        var filename = './' + value.name + '.zip';
        var file = fs.createWriteStream(filename);
        var request = http.get(value.href, function (response) {
          response.pipe(file);
        });
      }
    )
    .run(function (err, nightmare) {
      if (err) return console.log(err);
      console.log('Done!');
    });

Keep in mind that this technique only works for sites that do not require a login. If you need to automate that, another technique is to execute a synchronous XMLHttpRequest for the file in the context of the web page. This yields the raw data itself. Just pack that into your return object and write it directly to disk when you get back to the Nightmare context. Here's a sample of the request:

// This runs in the page context, inside .evaluate()
var data = [];
var xhr = new XMLHttpRequest();
xhr.open("GET", targetUrl, false); // false = synchronous
xhr.overrideMimeType("text/plain; charset=x-user-defined"); // preserve raw bytes
xhr.send();
data.push(xhr.responseText);
return data;

Wrap Up

So, we have seen how Nightmare can be used to do interactive web scraping, even allowing for DOM manipulation. If you need to do dynamic loops or more complex logic, keep in mind that your Nightmare instance is passed into the callback of .run(callback); you can set up a new method chain and execute it from there. There are plenty of other functions that Nightmare exposes. Be sure to check out the Readme.md in the GitHub repo.

Tags: Nightmare, Node.js, JavaScript, web scraping, PhantomJS, video, tutorial