Create a simple web spider in node.js

I’ve seen many web crawlers written in other languages such as PHP, Python, and Ruby, but few people seem to notice that node.js is also well suited to this kind of task. So I used the asynchronous network I/O provided by node.js to build a fast, stable web crawler (spider), and it’s very simple. Here I’ll explain the logic of a web crawler so that you can create your own.

A web spider is a long-running program that finds webpages and records them. We can then search the records generated by web spiders to find the information we want. It’s a basic building block of a search engine.

First, we search a base webpage for links and record the webpage in the database. Then we follow the links found and repeat the first step to discover more webpages.
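To make the first step concrete, here is a minimal sketch of pulling links out of a page. It is not the code from this post: `extractLinks` is a hypothetical helper, and a regular expression stands in for a real HTML parser such as cheerio so that the example runs with nothing but node.js itself. The sample HTML and the example.com URLs are made up for illustration.

```javascript
// Hypothetical helper: collect absolute http(s) links from an HTML string.
// A regex stands in for a real parser (such as cheerio) to keep the sketch dependency-free.
function extractLinks(html, baseUrl) {
  const links = new Set(); // a Set drops duplicate links
  const hrefPattern = /<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const url = new URL(match[1], baseUrl); // resolve relative links against the page URL
      if (url.protocol === 'http:' || url.protocol === 'https:') {
        url.hash = ''; // the same page with a different fragment is still one page
        links.add(url.href);
      }
    } catch (err) {
      // Ignore hrefs that cannot be parsed as URLs.
    }
  }
  return [...links];
}

// Made-up sample page: one relative link, one absolute link, one non-web link.
const sampleHtml =
  '<a href="/about">About</a> ' +
  '<a href="https://example.org/x#top">X</a> ' +
  '<a href="mailto:me@example.com">Mail</a>';
console.log(extractLinks(sampleHtml, 'https://example.com/blog/'));
// prints [ 'https://example.com/about', 'https://example.org/x' ]
```

Resolving every link against the page it was found on matters because most pages use relative links, and the queue needs absolute URLs it can fetch later.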

Here are some modules that help us code a web spider:

  1. request (helps fetch webpages)
  2. cheerio (helps parse webpages)
  3. mysql (for indexing pages)
  4. async (helper library)

And here’s the code:

Here’s how it works:

First, it crawls the base page specified in config.json. Then it tries to find all links in the page content and adds them to the queue, after which the first item in the queue is processed. This process repeats again and again…
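The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the original script: the queue and the records live in memory instead of MySQL, the base URL is passed in directly instead of being read from config.json, and `crawl`, `fetchPage`, `extractLinks`, `maxPages`, and the fake site are all hypothetical names introduced for the example.

```javascript
// Sketch of a queue-driven crawl loop. fetchPage and extractLinks are passed in,
// so the loop itself does not depend on request, cheerio, or MySQL.
async function crawl(baseUrl, fetchPage, extractLinks, maxPages = 100) {
  const queue = [baseUrl];      // URLs waiting to be crawled (in memory here, not MySQL)
  const seen = new Set(queue);  // every URL ever queued, to avoid crawling it twice
  const records = [];           // stands in for the table of crawled pages

  while (queue.length > 0 && records.length < maxPages) {
    const url = queue.shift();  // take the first item in the queue
    let html;
    try {
      html = await fetchPage(url);
    } catch (err) {
      continue;                 // skip pages that fail to download
    }
    records.push({ url, length: html.length });
    for (const link of extractLinks(html, url)) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return records;
}

// Usage with a tiny fake site, so the example runs without network access.
const fakeSite = {
  'https://example.com/': 'home -> https://example.com/a https://example.com/b',
  'https://example.com/a': 'a -> https://example.com/',
  'https://example.com/b': 'b -> https://example.com/c',
};
const fakeFetch = async (url) => {
  if (!(url in fakeSite)) throw new Error('404: ' + url);
  return fakeSite[url];
};
const fakeExtract = (html) => html.match(/https:\/\/\S+/g) || [];

crawl('https://example.com/', fakeFetch, fakeExtract)
  .then((records) => console.log(records.map((r) => r.url)));
// prints [ 'https://example.com/', 'https://example.com/a', 'https://example.com/b' ]
```

This sketch handles one page at a time to keep the logic easy to follow. The `seen` set is what stops the loop from bouncing between pages that link to each other, and `maxPages` is only a safety limit for the example.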

Also, we store the queue in a MySQL database for scalability. This script can be spawned across multiple machines.

23 Replies to “Create a simple web spider in node.js”

  1. Do you have a spam problem on this site? I am also a blogger, and I was wondering about your situation. We have created some nice procedures and are looking to swap strategies with other folks; please shoot me an email if interested.

  2. Hi, is there a way to get the text content of a crawled link? I mean, I am interested in getting all the text present in a page. How can I do it?

    1. Did you disable JavaScript in your browser? It needs JavaScript to display correctly. If you didn’t disable JavaScript, then it’s probably a network issue. 🙁

  3. Thanks for the update, great stuff!

    Although after a few tests I noticed the from field wasn’t populating. I think you may have left out defining the from field.

    Also, instead of all these gists, you should throw it up on GitHub 🙂

    1. You’ll need to set up a table for the execution queue and a table for the actual records. I’ll update this post with the SQL queries as soon as possible.
