I’ve seen many web crawlers written in languages like PHP, Python, and Ruby, but few people seem to notice that Node.js is also well suited to this kind of work. Its asynchronous network I/O makes it easy to build a fast, stable web crawler (spider) with very little code. Here I’ll walk through the logic of a web crawler so you can build your own.
A web spider is a long-running program that discovers web pages and records them. We can then search those records to find the information we want. Spiders are the basic building block of a search engine.
First, we fetch a base web page, record it in the database, and extract the links it contains. Then we follow each of those links and repeat the first step to discover more pages.
Here are some npm modules that help with writing a web spider:
- request (fetches web pages)
- cheerio (parses HTML with a jQuery-like API)
- mysql (stores and indexes pages)
- async (control-flow helpers)
And here’s the code:
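As a minimal sketch of that loop (assumptions on my part: Node 18+ with the built-in `fetch`; a simple regex stands in for cheerio, an in-memory array stands in for the MySQL queue, and `extractLinks`/`crawl` are illustrative names, not the post’s actual code):

```javascript
// Extract absolute links from an HTML string, resolved against the page URL.
// A regex is used here for brevity; the post recommends cheerio for real parsing.
function extractLinks(html, baseUrl) {
  const links = [];
  const re = /href="([^"#]+)"/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    try {
      links.push(new URL(m[1], baseUrl).href); // resolve relative links
    } catch (e) {
      // skip malformed URLs
    }
  }
  return links;
}

// Crawl breadth-first from baseUrl, up to `limit` pages.
// `fetchPage` is injectable so the loop can run without a network.
async function crawl(baseUrl, limit, fetchPage) {
  const queue = [baseUrl];       // in the real spider, this lives in MySQL
  const visited = new Set();
  const records = [];
  while (queue.length > 0 && records.length < limit) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);
    const html = await fetchPage(url);
    records.push({ url, html }); // in the real spider: INSERT into the database
    for (const link of extractLinks(html, url)) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return records;
}

// Example usage against the live web (Node 18+):
// crawl('http://example.com/', 50, (u) => fetch(u).then((r) => r.text()));
```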
Here’s how it works:
First, it crawls the base page specified in config.json. It then extracts every link from the page content, adds them to the queue, and processes the next item in the queue. This cycle repeats indefinitely…
Also, we store the queue in a MySQL database for scalability, so the script can be spawned across multiple machines that share a single queue.
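One way to sketch that shared queue (the table name, columns, and claiming pattern below are my own assumptions; the post doesn’t show its actual schema):

```sql
-- Hypothetical queue table; names are illustrative.
CREATE TABLE queue (
  id     INT AUTO_INCREMENT PRIMARY KEY,
  url    VARCHAR(2048) NOT NULL,
  status ENUM('pending', 'crawling', 'done') NOT NULL DEFAULT 'pending',
  UNIQUE KEY uniq_url (url(255))  -- avoid queueing the same URL twice
);

-- Each worker atomically claims one pending URL, so two machines never
-- crawl the same page (LAST_INSERT_ID(expr) is a common MySQL idiom):
UPDATE queue SET id = LAST_INSERT_ID(id), status = 'crawling'
WHERE status = 'pending' ORDER BY id LIMIT 1;
SELECT url FROM queue WHERE id = LAST_INSERT_ID();
```

Because the claim is a single UPDATE, adding more crawler machines is just a matter of pointing them at the same database.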