Web Crawlers — How to build one?

Kartik Rai
9 min read · Oct 1, 2023

What exactly is a web crawler? Well, it is a program or script that automatically navigates the World Wide Web by visiting websites and web pages, downloading their content, and indexing the information it finds. Web crawlers play a fundamental role in the operation of search engines, like Google, Bing, and others, as well as in various other applications.

Key Components

It’s important to know the key components of a web crawler because they reveal how a crawler actually functions. The image below shows the order in which a web crawler completes its “crawling” process.

Crawling Process

Here are the key components and functions of a web crawler:

  1. Seed URLs: A web crawler typically starts with a list of initial URLs, known as seed URLs. These URLs are the entry points from which the crawler begins its journey across the web.
  2. HTTP Requests: The crawler sends HTTP requests to the web servers hosting the websites it aims to crawl. It retrieves the HTML and other resources, such as images, stylesheets, and JavaScript files, associated with a web page.
  3. HTML Parsing: Once the web page content is downloaded, the crawler parses the HTML to extract links, text, and metadata. It also identifies other relevant resources to download, such as linked pages.
  4. URL Extraction: The crawler identifies and extracts links embedded in the HTML. These links are added to a queue of URLs to be visited and crawled. The crawler may filter and prioritize links based on various criteria, such as the relevance to the website’s focus or the page’s freshness.
  5. Page Download: The crawler proceeds to download the content of the discovered URLs, repeating the process of parsing, extracting links, and adding new URLs to the queue.
  6. Robots.txt: The crawler adheres to the “robots.txt” file, a text file hosted on websites that tells crawlers which parts of the site may be crawled and which should be excluded. This file helps prevent over-crawling or crawling sensitive areas of a website (a minimal sketch of such a check appears right after this list).
  7. Crawl Delays: Some websites specify crawl delays or rate limits to prevent crawlers from overwhelming their servers with requests. Crawl delays help maintain the performance and stability of websites.
  8. Indexing: As the crawler processes web pages, it extracts and stores relevant information, such as text content and metadata. This information is used for indexing, which is essential for search engines to deliver accurate and timely search results.
  9. Recursion: Web crawlers follow links recursively, continuously expanding their scope across the web. They can potentially crawl billions of pages in their quest to index the web comprehensively.
  10. Duplicate Detection: To avoid indexing multiple copies of the same content, web crawlers employ techniques to detect and handle duplicates. This is crucial for search engine efficiency and the quality of search results.
  11. Respecting Site Policies: Ethical web crawlers respect website policies, such as “robots.txt” and “nofollow” attributes, and avoid actions that could disrupt or overload web servers.
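
To make items 6 and 7 a little more concrete, here is a minimal sketch, assuming Node.js 18+ (for the built-in fetch), of how a crawler could honour Disallow rules and pause between requests. The function names getDisallowedPaths, isAllowed and sleep are my own, the parser only understands Disallow lines under “User-agent: *”, and a real crawler should use a proper robots.txt parser and also respect Crawl-delay:

// Sketch only: fetch robots.txt and collect Disallow rules for "User-agent: *"
async function getDisallowedPaths(baseUrl) {
  try {
    const resp = await fetch(new URL('/robots.txt', baseUrl));
    if (!resp.ok) return [];
    const lines = (await resp.text()).split('\n');
    const disallowed = [];
    let appliesToAll = false;
    for (const line of lines) {
      const [field, ...rest] = line.split(':');
      const value = rest.join(':').trim();
      if (field.trim().toLowerCase() === 'user-agent') {
        appliesToAll = value === '*';
      } else if (appliesToAll && field.trim().toLowerCase() === 'disallow' && value) {
        disallowed.push(value);
      }
    }
    return disallowed;
  } catch (err) {
    return []; // if robots.txt can't be fetched, this sketch just allows everything
  }
}

// Check a URL's path against the collected Disallow prefixes
function isAllowed(url, disallowedPaths) {
  const { pathname } = new URL(url);
  return !disallowedPaths.some((prefix) => pathname.startsWith(prefix));
}

// Politeness: a simple helper to wait between requests to the same host
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));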

Building a Web Crawler

Whenever you build a program like this, it’s important to decide the scale you are building for. The crawlers Google uses to crawl the World Wide Web are enormous: they have far more functionality, far more resources, and store vastly more data. In this article, we’ll build a web crawler that takes a base URL as input and crawls every page it can reach that shares that base URL. In a nutshell, it is a small-scale prototype that crawls everything reachable from a single base URL.

The basic idea behind this could be written down in three steps:

  1. Take the base URL (aka, Seed URL) and download the HTML of the page that this URL leads to
  2. Extract all the URLs present on that HTML page
  3. Store those found URLs if they are not already crawled and repeat steps 1 and 2 until all URLs are crawled

There are two ways to perform the crawling process: BFS (Breadth-First Search) or DFS (Depth-First Search). You can think of the URLs as the edges of a graph and the web pages they lead to as its nodes, so crawling is simply a traversal of that graph.

Now, in order to preserve “politeness”, we design the system so that it does not send too many requests to the same server within a short span of time, as that could be treated as a Denial-of-Service attack. Since we make many subsequent calls under the same base URL, BFS traversal is generally preferred over DFS, because working breadth-first spreads requests across the frontier of discovered pages instead of repeatedly drilling down a single branch.
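
As an illustration of the breadth-first approach, here is a rough queue-based traversal skeleton. This is not the crawler we build below (that one uses a recursive crawlPage function); fetchAndExtractLinks is an assumed helper that downloads a page and returns the links found on it:

async function bfsCrawl(seedUrl, fetchAndExtractLinks) {
  const queue = [seedUrl];   // the "frontier": URLs waiting to be visited
  const visited = new Set(); // URLs we have already crawled

  while (queue.length > 0) {
    const url = queue.shift(); // FIFO: take the oldest URL first, which makes this breadth-first
    if (visited.has(url)) continue;
    visited.add(url);

    const links = await fetchAndExtractLinks(url); // assumed helper: download + parse the page
    for (const link of links) {
      if (!visited.has(link)) queue.push(link);
    }
  }

  return visited;
}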

Really building our Web Crawler!

Let’s finally start writing some code.

Step-1: Setting up basic requirements/dependencies

We’ll be using NodeJS to build our crawler program. Simple, plain JavaScript! So, first of all, make sure you have NodeJS installed on your system (follow this link if you don’t have it already). Once that is done, create a new folder and give it any name you want (I’ll call mine “WebCrawler”). Move into this directory and initialize it as an npm project so that you can track all your dependencies and scripts through the package.json file, which is created automatically when you run “npm init”.

npm init -y
npm install jsdom
npm install --save-dev nodemon

Once you have installed the above dependencies, create an “index.js” file in this directory, which will serve as the main file that we run via node.
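
Since nodemon is only a development convenience for restarting the process when files change, you can optionally wire it up with a couple of scripts in package.json. The script names “start” and “dev” below are just a common convention; nothing else in this article depends on them:

{
  "scripts": {
    "start": "node index.js",
    "dev": "nodemon index.js"
  }
}

With that in place, npm run dev -- <url> restarts the crawler automatically whenever you edit a file.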

Step-2: Defining functions required for crawling process

Make a file called “crawl.js” or any other name that you prefer. We’ll define 3 functions here:

  1. normalizeURL -> to make the input URL in the format that we require
  2. getURLsFromHTML -> to extract all the URLs from a given HTML page
  3. crawlPage -> to crawl all the pages that the URLs lead to, and repeat the same process until a dead end is reached.

const { JSDOM } = require('jsdom');

// Note: this uses the global fetch API, which is built into Node.js 18+.
async function crawlPage(baseUrl, currentUrl, pages) {
  const baseUrlObj = new URL(baseUrl);
  const currentUrlObj = new URL(currentUrl);

  // Only crawl pages on the same host as the base URL
  if (baseUrlObj.hostname !== currentUrlObj.hostname) {
    return pages;
  }

  // If this page has already been crawled, just increment its counter
  const normalizedCurrentUrl = normalizeURL(currentUrl);
  if (pages[normalizedCurrentUrl] > 0) {
    pages[normalizedCurrentUrl]++;
    return pages;
  }

  pages[normalizedCurrentUrl] = 1;

  console.log(`Actively crawling: ${currentUrl}`);

  try {
    const resp = await fetch(currentUrl);

    if (resp.status > 399) {
      console.log("Error in fetching with status code:", resp.status, "on page:", currentUrl);
      return pages;
    }

    const contentType = resp.headers.get("content-type");
    if (!contentType || !contentType.includes("text/html")) {
      console.log("Non-HTML response, content type:", contentType, "on page:", currentUrl);
      return pages;
    }

    const htmlBody = await resp.text();

    const nextUrls = getURLsFromHTML(htmlBody, baseUrl);

    for (const nextUrl of nextUrls) {
      pages = await crawlPage(baseUrl, nextUrl, pages);
    }
  } catch (err) {
    console.log(`Error fetching from:`, currentUrl, err.message);
  }

  return pages;
}

function getURLsFromHTML(htmlBody, baseURL) {
  const urls = [];
  const dom = new JSDOM(htmlBody);
  const linkElements = dom.window.document.querySelectorAll('a');
  for (const linkElement of linkElements) {
    if (linkElement.href.slice(0, 1) === '/') {
      // relative URL
      try {
        const urlObj = new URL(`${baseURL}${linkElement.href}`);
        urls.push(urlObj.href);
      } catch (err) {
        console.log("error with relative URL:", err.message);
      }
    } else {
      // absolute URL
      try {
        const urlObj = new URL(linkElement.href);
        urls.push(urlObj.href);
      } catch (err) {
        console.log("error with absolute URL:", err.message);
      }
    }
  }
  return urls;
}

// The job of the normalizeURL function is to take an input URL and return the
// same output for URLs that lead to the same page.
// Example: 'http://www.boot.dev', 'http://www.BooT.dev' and 'https://www.boot.dev' might look
// different, but they all lead to the same page, so normalizeURL should return the
// same normalized string (here, 'www.boot.dev') for each of them.
function normalizeURL(urlString) {
  const urlObj = new URL(urlString);
  const hostPath = `${urlObj.hostname}${urlObj.pathname}`;
  // Strip a trailing slash so that 'boot.dev/path/' and 'boot.dev/path' normalize identically
  if (hostPath.length > 0 && hostPath.slice(-1) === '/') {
    return hostPath.slice(0, -1);
  }
  return hostPath;
}

module.exports = {
  normalizeURL,
  getURLsFromHTML,
  crawlPage
};

Once you have created the crawl.js file and defined these functions, we can move forward. It is recommended to go through all three functions (especially crawlPage) thoroughly to get a clear idea of what is happening here. The crawling process returns an object called ‘pages’, which maps every crawled URL to the number of times it was encountered while crawling from the given base URL.
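
For example, with made-up URLs and counts, the returned object might look something like this:

const examplePages = {
  'example.com': 1,        // the seed page, reached once
  'example.com/about': 3,  // reached three times across the whole crawl
  'example.com/blog': 2
};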

Step-3: Generating the Report

Now that we have built the crawling mechanism, we need another function that takes the output of the crawling process (the ‘pages’ object) and generates a report: a list of all the crawled pages and their frequency. So I’ll create a file called “report.js” with the following code:

function printReport(pages) {
  console.log("=============================");
  console.log("REPORT");
  console.log("=============================");
  const sortedPages = sortPages(pages);

  for (const sortedPage of sortedPages) {
    const url = sortedPage[0];
    const hits = sortedPage[1];

    console.log(`Found ${hits} internal links to page: ${url}`);
  }

  console.log("=============================");
  console.log("END REPORT");
  console.log("=============================");
}

// Sort the [url, hits] entries in descending order of hits
function sortPages(pages) {
  const pagesArr = Object.entries(pages);
  pagesArr.sort((a, b) => b[1] - a[1]);
  return pagesArr;
}

module.exports = {
  sortPages,
  printReport
};

Step-4: Bringing all the functions together via a main function

Now that we have built all the required functions, it’s time to bring them together by building a main function. Create a file “index.js”; inside this file we will create our main function. The code for this file is:

const { crawlPage } = require('./crawl');
const { printReport } = require('./report');

async function main() {
  if (process.argv.length < 3) {
    console.log("No website provided");
    process.exit(1);
  }
  if (process.argv.length > 3) {
    console.log("Too many arguments!");
    process.exit(1);
  }

  const baseUrl = process.argv[2];

  console.log("Starting crawl of", baseUrl);
  const pages = await crawlPage(baseUrl, baseUrl, {});

  printReport(pages);
}

main();
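
A quick note on the argument handling above: Node always puts the path to the node binary and the path to the script in the first two slots of process.argv, so our URL shows up at index 2. The exact paths below are only illustrative:

// For: node index.js https://wagslane.dev/
// process.argv looks roughly like:
//   ['/usr/local/bin/node', '/home/you/WebCrawler/index.js', 'https://wagslane.dev/']
// which is why the base URL is read from process.argv[2].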

Step-5: Testing our Web Crawler

In order to test our web crawler, we simply pass a base URL as a command-line argument when running index.js. You can use any URL you want; I’ll be going with “https://wagslane.dev/”.

Run the following command in the terminal from the directory where you created the three files:

node index.js https://wagslane.dev/

As you can see, the crawler logs each URL as it is actively crawling it. Once it has crawled all the reachable pages, it automatically prints a report.

The report lists every page that was visited along with how many times that page’s URL was encountered during the crawl.
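
The exact pages and counts depend on the site at the moment you crawl it, but the format follows directly from printReport. With made-up values, it looks roughly like this:

=============================
REPORT
=============================
Found 12 internal links to page: wagslane.dev
Found 3 internal links to page: wagslane.dev/tags
Found 2 internal links to page: wagslane.dev/about
=============================
END REPORT
=============================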

What’s Next?

You see, this is a very basic use case of a web crawler. Large-scale web crawlers have a lot more functionality and many more processes in between. But now that you have a general idea of how to build a web crawler, you can keep adding layers of functionality on top of it to make it more capable: extracting additional information, searching through page contents, generating a survey of the crawled site, and so on. If you want to explore web crawlers and their applications further, you can read about distributed web crawlers, for example in this paper by Google.

You can check out the entire code for building this web crawler here. Feel free to explore some of my other blogs, like “How to build a URL shortener?” and “How to build a Rate Limiter?”.

Feel free to get in touch with me on LinkedIn, or follow me on Twitter. 🐘
