How to Scrape Web Pages with Node.js and jQuery

Node.js is growing rapidly, largely thanks to the developers who create tools that significantly improve productivity with Node. In this article, we will go through the basic installation of Express, a development framework, and create a basic project with it.


What We're Going to Build Today

Node is similar in design to, and influenced by, systems like Ruby's Event Machine or Python's Twisted. Node takes the event model a bit further - it presents the event loop as a language construct instead of as a library.

In this tutorial, we will scrape the YouTube home page, collect all the regular-sized thumbnails along with each video's link and duration, send those elements to a jQuery Mobile template, and play the videos using the YouTube embed (which does a nice job of detecting device media support: Flash or HTML5 video).

We will also learn how to begin using npm and Express, npm's module installation process, basic Express routing and the usage of two modules of Node: request and jsdom.

For those of you who aren't yet familiar with Node.js and how to install it, please refer to the node.js home page and the npm GitHub project page.

You should also refer to our "Node.js: Step by Step" series.

Note: This tutorial requires and assumes that you understand what Node.js is and that you already have node.js and npm installed.


Step 1: Setting Up Express

So what exactly is Express? According to its developers, it's an...

Insanely fast (and small) server-side JavaScript web development framework built on Node and Connect.

Sounds cool, right? Let's use npm to install express. Open a Terminal window and type the following command:

npm install express -g

By passing -g as a parameter to the install command, we're telling npm to make a global installation of the module.

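Next, create the project itself: from your working directory, run the express command-line generator (express nodetube) to scaffold an application named nodetube. I'm using /home/node-server/nettuts as the working directory for this example, but you can use whatever you feel comfortable with.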

After creating our Express project, we need to instruct npm to install Express's dependencies.

cd nodetube
npm install -d

If it ends with, "ok," then you're good to go. You can now run your project:

node app.js

In your browser, go to http://localhost:3000.


Step 2: Installing Needed Modules

JSDOM

A JavaScript implementation of the W3C DOM.

Go back to your Terminal and, after stopping your current server (Ctrl + C), install jsdom:

npm install jsdom

Request

Simplified HTTP request method.

Type the following into the Terminal:

npm install request

Everything should be set up now. It's time to get into some actual code!


Step 3: Creating a Simple Scraper

app.js

First, let's include all our dependencies. Open your app.js file and update the dependency block at the very top so that it reads:

/**
 * Module dependencies.
 */

var express = require('express')
, jsdom = require('jsdom')
, request = require('request')
, url = require('url')
, app = module.exports = express.createServer();

You will notice that Express has created some code for us. What you see in app.js is the most basic structure for a Node server using Express. In the code block above, we used require to load our recently installed modules, jsdom and request. We're also loading Node's built-in url module, which will help us parse the video URLs we will scrape from YouTube later.
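As a quick illustration of what the url module will do for us later, here is a minimal sketch that parses a link of the kind we expect to scrape. The sample href and video ID are made up for the example; passing true as the second argument asks url.parse to turn the query string into an object as well.

```javascript
const url = require('url');

// Hypothetical href, shaped like the watch links found on YouTube pages
const href = '/watch?v=abc123XYZ_-&feature=example';

// The second argument (true) also parses the query string into an object
const urlObj = url.parse(href, true);

console.log(urlObj.pathname); // the path without the query string
console.log(urlObj.query.v);  // the value of the "v" parameter, i.e. the video ID
```

The video ID is what we will later hand to the YouTube embed, so only the query.v property of the parsed object matters to us.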

Scraping Youtube.com

Within app.js, search for the "Routes" section (around line 40) and add the following code (read through the comments to understand what is going on):

app.get('/nodetube', function (req, res) {
    //Tell the request that we want to fetch youtube.com, send the results to a callback function
    request({uri: 'http://youtube.com'}, function (err, response, body) {
        var self = this;
        self.items = new Array(); //I feel like I want to save my results in an array

        //Just a basic error check: bail out on a transport error or a non-200 status
        if (err || response.statusCode !== 200) {
            console.log('Request error.');
            return res.end('Request error.');
        }

        //Send the body param as the HTML code we will parse in jsdom
        //also tell jsdom to attach jQuery in the scripts, loaded from jquery.com
        jsdom.env({
            html: body,
            scripts: ['http://code.jquery.com/jquery-1.6.min.js']
        }, function (err, window) {
            //Use jQuery just as in a regular HTML page
            var $ = window.jQuery;

            console.log($('title').text());
            res.end($('title').text());
        });
    });
});

In this case, we're fetching the content from the YouTube home page. Once complete, we're printing the text contained in the page's title tag (<title>). Return to the Terminal and run your server again.

node app.js

In your browser, go to: http://localhost:3000/nodetube

You should see, "YouTube - Broadcast Yourself," which is YouTube's title.
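A request can fail in two distinct ways: a transport-level error, in which case err is set and response is typically undefined, or an HTTP response with a non-200 status. The error check in the route has to treat either one as a failure. A minimal sketch of that logic, assuming the (err, response, body) callback signature used above; requestFailed is a hypothetical helper name, not part of the request module.

```javascript
// Hypothetical helper: returns true when the request should be treated as failed.
// `err` and `response` follow the callback signature used in the route above.
function requestFailed(err, response) {
    // A transport-level error (DNS failure, refused connection, ...) usually leaves
    // `response` undefined, so `err` is tested first.
    if (err) {
        return true;
    }
    // No error object, but the server may still have answered with a non-200 status
    return !response || response.statusCode !== 200;
}

console.log(requestFailed(new Error('ECONNREFUSED'), undefined)); // true
console.log(requestFailed(null, { statusCode: 404 }));            // true
console.log(requestFailed(null, { statusCode: 200 }));            // false
```

Inside the route, a failed check should also end the response (for example with res.end) so the browser is not left waiting.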

Now that we have everything set up and running, it is time to get some video URLs. Go to the YouTube homepage and right click on any thumbnail from the "recommended videos" section. If you have Firebug installed, (which is highly recommended) you should see something like the following:

There's a pattern we can identify and which is present in almost all other regular video links:

div.video-entry
span.clip

Let's focus on those elements. Go back to your editor, and in app.js, add the following code to the /nodetube route:

app.get('/nodetube', function (req, res) {
    //Tell the request that we want to fetch youtube.com, send the results to a callback function
    request({
        uri: 'http://youtube.com'
    }, function (err, response, body) {
        var self = this;
        self.items = new Array(); //I feel like I want to save my results in an array

        //Just a basic error check: bail out on a transport error or a non-200 status
        if (err || response.statusCode !== 200) {
            console.log('Request error.');
            return res.end('Request error.');
        }

        //Send the body param as the HTML code we will parse in jsdom
        //also tell jsdom to attach jQuery in the scripts
        jsdom.env({
            html: body,
            scripts: ['http://code.jquery.com/jquery-1.6.min.js']
        }, function (err, window) {
            //Use jQuery just as in any regular HTML page
            var $ = window.jQuery,
                $body = $('body'),
                $videos = $body.find('.video-entry');

            //I know .video-entry elements contain the regular sized thumbnails
            //for each one of the .video-entry elements found
            $videos.each(function (i, item) {

                //I will use regular jQuery selectors
                //first anchor element which is a child of our .video-entry item
                var $a = $(item).children('a'),
                    //video title
                    $title = $(item).find('.video-title .video-long-title').text(),
                    //video duration time
                    $time = $a.find('.video-time').text(),
                    //thumbnail
                    $img = $a.find('span.clip img');

                //and add all that data to my items array
                self.items[i] = {
                    href: $a.attr('href'),
                    title: $title.trim(),
                    time: $time,
                    //some YouTube thumbnails carry a data-thumb attribute; when it is defined,
                    //its URL is the real thumbnail source, otherwise
                    //we use the default served src attribute.
                    thumbnail: $img.attr('data-thumb') ? $img.attr('data-thumb') : $img.attr('src'),
                    urlObj: url.parse($a.attr('href'), true) //parse our URL and the query string as well
                };
            });

            //let's see what we've got
            console.log(self.items);
            res.end('Done');
        });
    });
});
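The thumbnail rule in the route is easier to see in isolation. Below is a minimal sketch of the same fallback as a plain function, with no jQuery involved; pickThumbnail is a hypothetical helper, the attrs object stands in for what $img.attr(...) would return, and the sample URLs are made up.

```javascript
// Hypothetical helper: picks the thumbnail URL from an image's attributes.
// `attrs` is a plain object standing in for the values $img.attr(...) would return.
function pickThumbnail(attrs) {
    // Prefer data-thumb when it is defined and non-empty, otherwise fall back to src
    return attrs['data-thumb'] ? attrs['data-thumb'] : attrs.src;
}

// Made-up sample attributes
const deferred = { 'data-thumb': 'http://example.com/real.jpg', src: 'http://example.com/pixel.gif' };
const plain = { src: 'http://example.com/thumb.jpg' };

console.log(pickThumbnail(deferred)); // http://example.com/real.jpg
console.log(pickThumbnail(plain));    // http://example.com/thumb.jpg
```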

It's time to restart our server one more time and reload the page in our browser (http://localhost:3000/nodetube). In your Terminal, you should see something like the following:

This looks good, but we need a way to display our results in the browser. For this, I will use the Jade template engine:

Jade is a high performance template engine heavily influenced by Haml, but implemented with JavaScript for Node.

In your editor, open views/layout.jade, which is the basic layout structure used when rendering a page with Express. It is nice but we need to modify it a bit.

views/layout.jade

!!! 5
html(lang='en')
  head
    meta(charset='utf-8')
    meta(name='viewport', content='initial-scale=1, maximum-scale=1')
    title= title
    link(rel='stylesheet', href='http://code.jquery.com/mobile/1.0b3/jquery.mobile-1.0b3.min.css')
    script(src='http://code.jquery.com/jquery-1.6.2.min.js')
    script(src='http://code.jquery.com/mobile/1.0b3/jquery.mobile-1.0b3.min.js')
  body!= body

If you compare the code above with the default code in layout.jade, you will notice that a few things have changed - doctype, the viewport meta tag, the style and script tags served from jquery.com. Let's create our list view:

views/list.jade

Before we start, please browse through jQuery Mobile's (JQM from now on) documentation on page layouts and anatomy.

The basic idea is to use a JQM listview, a thumbnail, title and video duration label for each item inside the listview along with a link to a video page for each one of the listed elements.

Note: Be careful with the indentation you use in your Jade documents, as it only accepts spaces or tabs - but not both in the same document.

div(data-role='page')
    header(data-role='header')
        h1= title
    div(data-role='content')
        //just basic check, we will always have items from youtube though
        - if(items.length)
            //create a listview wrapper
            ul(data-role='listview')
                //foreach of the collected elements
                - items.forEach(function(item){
                    //create a li
                    li
                        //and a link using our passed urlObj Object
                        a(href='/watch/' + item['urlObj'].query.v, title=item['title'])
                            //and a thumbnail
                            img(src=item['thumbnail'], alt='Thumbnail')
                            //title and time label
                            h3= item['title']
                            h5= item['time']
                - })
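To make the template logic easier to follow, here is roughly the same loop written as plain JavaScript string building instead of Jade. It only illustrates the structure and is not what Jade generates character for character; escapeHtml and renderList are hypothetical helpers, and the sample item is made up.

```javascript
// Hypothetical helper: escapes characters that are special in HTML
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');
}

// Hypothetical helper: mirrors the structure of views/list.jade
function renderList(items) {
    // Same guard as "- if(items.length)" in the template
    if (!items.length) {
        return '';
    }
    const rows = items.map(function (item) {
        // /watch/ plus the "v" query parameter (URL-encoded here as a precaution)
        const href = '/watch/' + encodeURIComponent(item.urlObj.query.v);
        return '<li><a href="' + href + '" title="' + escapeHtml(item.title) + '">' +
            '<img src="' + escapeHtml(item.thumbnail) + '" alt="Thumbnail">' +
            '<h3>' + escapeHtml(item.title) + '</h3>' +
            '<h5>' + escapeHtml(item.time) + '</h5>' +
            '</a></li>';
    });
    return '<ul data-role="listview">' + rows.join('') + '</ul>';
}

// Made-up item in the shape built by the /nodetube route
const sample = [{
    title: 'Cats & Dogs',
    time: '3:21',
    thumbnail: 'http://example.com/thumb.jpg',
    urlObj: { query: { v: 'abc123' } }
}];

console.log(renderList(sample));
```

Each list item links to /watch/ followed by the video ID, which is the value the video page needs.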

That is all we need to create our listing. Return to app.js and replace the following code:

                        //let's see what we've got
                        console.log(self.items);
                        res.end('Done');

with this:

                        //We have all we came for, now let's render our view
                        res.render('list', {
                            title: 'NodeTube',
                            items: self.items
                        });

Restart your server one more time and reload your browser:

Note: Because we're using jQuery Mobile, I recommend using a WebKit-based browser or an iPhone/Android cellphone (simulator) for better results.


Step 4: Viewing Videos

Let's create a view for our /watch route. Create views/video.jade and add the following code:

div(data-role='page')
    header(data-role='header')
        h1= title
    div(data-role='content')
        //Our video div
        div#video
            //Iframe from youtube which serves the right media object for the device in use
            iframe(width="100%", height=215, src="http://www.youtube.com/embed/" + vid, frameborder="0", allowfullscreen)

Again, go back to your Terminal, restart your server, reload your page, and click on any of the listed items. This time a video page will be displayed and you will be able to play the embedded video!


Bonus: Using Forever to Run Your Server

There are ways we can keep our server running in the background, but there's one that I prefer, called Forever, a node module we can easily install using npm:

npm install forever -g

This will globally install Forever. Let's start our nodeTube application:

forever start app.js

You can also restart your server, use custom log files, and pass environment variables, among other useful things:

# run your application in production mode
NODE_ENV=production forever start app.js
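Inside the application, a variable set this way is available as process.env.NODE_ENV. A minimal sketch of reading it with a fallback; resolveEnv is a hypothetical helper, and the 'development' default is an assumption chosen to mirror the default Express itself uses when NODE_ENV is not set.

```javascript
// Hypothetical helper: resolves the environment name from an env object,
// defaulting to 'development' when NODE_ENV is not set.
function resolveEnv(env) {
    return env.NODE_ENV || 'development';
}

console.log(resolveEnv({ NODE_ENV: 'production' })); // production
console.log(resolveEnv({}));                         // development

// In a real app you would pass the actual environment of the process
console.log('Current environment: ' + resolveEnv(process.env));
```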

Final Thoughts

I hope I've demonstrated how easy it is to begin using Node.js, Express and npm. In addition, you've learned how to install Node modules, add routes to Express, fetch remote pages using the Request module, and plenty of other helpful techniques.

If you have any comments or questions, please let me know in the comments section below!
