Web Scraping with Node.js: How to Leverage the Power of JavaScript

Posted on
Jun 29, 2022

Node.js is a JavaScript runtime environment built on Google's V8 JavaScript engine. Above all, it is a platform for building web applications and, like JavaScript itself, it is well suited to web tasks.

There are several web scraping tools for Node.js: Axios, SuperAgent, Cheerio, and Puppeteer with headless browsers.

Advantages of using Node.js for Web Scraping

Our company uses a JavaScript + NodeJS + MongoDB stack in a Linux shell for web scraping. The connecting link is NodeJS, which has a number of undeniable advantages.

Firstly, NodeJS is an efficient runtime thanks to its support for asynchronous I/O operations. HTTP requests and database requests do not hold up the application wherever the main thread of execution does not depend on the results of that I/O.
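A minimal sketch of that idea, with timers standing in for real I/O. The names saveToDb and fetchPage are hypothetical, and the delays are arbitrary; the point is only that a write whose result is not needed yet can overlap with the next request.

```javascript
const { setTimeout: sleep } = require('node:timers/promises');

const events = [];

// Stand-ins for real I/O: a slow database write and a faster HTTP request.
async function saveToDb(item) {
  events.push(`save started: ${item}`);
  await sleep(50);
  events.push(`save finished: ${item}`);
}

async function fetchPage(url) {
  events.push(`fetch started: ${url}`);
  await sleep(10);
  events.push(`fetch finished: ${url}`);
  return `<html>${url}</html>`;
}

async function run() {
  // Start the write but do not await it yet: nothing below depends on its result.
  const pendingSave = saveToDb('page-1');
  // The next request overlaps with the write that is still in flight.
  const html = await fetchPage('page-2');
  // Wait for the write only at the point where its completion finally matters.
  await pendingSave;
  return html;
}

const finished = run();
```

Once finished resolves, the events array shows the fetch starting and completing while the write is still pending.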

Secondly, NodeJS supports streaming data transfer (Stream), which helps to process big files (or data) even with minimal system requirements.
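As a rough illustration, the sketch below counts matching lines in a stream without loading the whole input into memory. The function countMatchingLines and the sample data are made up for the example; in a real scraper the input would more likely be fs.createReadStream() on a large file.

```javascript
const { Readable } = require('node:stream');
const readline = require('node:readline');

// Count lines that satisfy a predicate, reading the input chunk by chunk.
async function countMatchingLines(input, predicate) {
  const lines = readline.createInterface({ input, crlfDelay: Infinity });
  let count = 0;
  for await (const line of lines) {
    if (predicate(line)) count += 1;
  }
  return count;
}

// An in-memory stream keeps the sketch self-contained. Note that the last
// record is deliberately split across two chunks, as happens with real files.
const sample = Readable.from(
  ['id,price\n1,55\n', '2,120\n3,', '80\n'],
  { objectMode: false },
);

// Count products that cost more than 60.
const expensive = countMatchingLines(
  sample,
  (line) => Number(line.split(',')[1]) > 60,
);
```

Only the current line is held in memory at any moment, which is why the same code works for inputs far larger than the available RAM.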

Thirdly, NodeJS contains a lot of built-in modules that help to interact with the operating system and the web. For example, File System and Path handle data input/output on the system disk, URL manipulates route and query parameters in URLs, Process and Child Process manage the operating system processes that serve crawlers, and there are also Util, Debugger, and so on.
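For instance, the URL and Path modules can be combined roughly like this. The helper names buildPageUrl and outputPathFor, the example.com host, and the query parameter names are all placeholders invented for the sketch.

```javascript
const path = require('node:path');
const { URL } = require('node:url');

// Build a paginated catalog URL without manual string concatenation.
function buildPageUrl(base, page) {
  const url = new URL('/catalog', base);
  url.searchParams.set('city', 'London');
  url.searchParams.set('page', String(page));
  return url;
}

// Derive a file name for the downloaded page from the URL itself.
function outputPathFor(url, dir) {
  const page = url.searchParams.get('page');
  return path.join(dir, `${url.hostname}-page-${page}.html`);
}

const pageUrl = buildPageUrl('https://example.com', 3);
const target = outputPathFor(pageUrl, 'data');
```

Here pageUrl.href holds the full address with both query parameters, and target is a platform-appropriate path inside the data directory.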

Fourth, the NodeJS ecosystem contains a huge number of packages from the developer community, which can help to solve almost any problem. For example, for scraping, there are such libraries as Axios, SuperAgent, Cheerio, and Puppeteer.

HTTP requests in NodeJS using Axios

Axios is a promisified HTTP client.

Little is needed from Axios for scraping - the majority of requests are sent with the GET method.

Let's use it:

const UserAgent = require('user-agents');

// Create a preconfigured instance; `axios` below refers to this instance
const axios = require('axios').create({
	headers: {
		'user-agent': new UserAgent().toString(),
		'cookie': process.env.COOKIE,
		'accept': '*/*',
		'accept-encoding': 'gzip, deflate, br',
	},
	timeout: 10000, // milliseconds
	proxy: {
		host: process.env.PROXY_HOST,
		port: Number(process.env.PROXY_PORT), // environment variables are strings
		auth: {
			username: process.env.PROXY_USERNAME,
			password: process.env.PROXY_PASSWORD,
		},
	},
});

Here, an axios instance is created with an example configuration.

The user-agents library generates realistic, up-to-date values for the User-Agent header, so they do not have to be entered manually.

Cookies are received using a separate script, the task of which is to launch a Chromium instance using the Puppeteer library, authenticate on the target website, and pass the cookie value cached by the browser to the application.

In the example, the cookie property is bound to the value of the environment variable process.env.COOKIE. This implies that the actual cookie has been placed in an environment variable, for example with the pm2 process manager.

Alternatively, the cookie value (which is just a string) can be set directly in the configuration above by copying it from the browser's developer tools.

All that remains is to send an HTTP request and read the content from the data property of the response object. This is usually HTML, but it can be any other format the required data comes in.

const yourAsyncFunc = async () => {
	const { data } = await axios.get(targetUrl); // data --> <!DOCTYPE html>... and so on

	// some code
}

SuperAgent for parsing in NodeJS

As an alternative to Axios, there is the lightweight SuperAgent HTTP client. A simple GET request looks like this:

const superagent = require('superagent');

const yourAsyncFunc = async () => {
	const { text } = await superagent.get(targetUrl); // text --> page content

	// some code
}

It has a good reputation for building web applications that use AJAX. A distinctive feature of SuperAgent is the chained way of setting the request configuration:

const superagent = require('superagent');

superagent
  .get('origin-url/route')
  .set('User-Agent', '<some UA>') // header name and value are separate arguments
  .query({ city: 'London' })
  .end((err, res) => {
    // Calling end() sends the request
    if (err) {
      // handle the error
      return;
    }
    const { text } = res; // text --> page content
    // some other code
  });

One can add a proxy service to requests using the superagent-proxy wrapper. Also SuperAgent supports async/await syntax.

Structure Transformation with Cheerio 

Cheerio is an indispensable tool for converting string HTML content into a tree structure, then traversing it and extracting the required data.

Scraping mostly relies on two parts of Cheerio: the loading API and the selectors API.

Loading:

const cheerio = require('cheerio');

// somewhere inside asynchronous func...
const { data } = await fetchPageContent(url);

const $ = cheerio.load(data); // now we've got cheerio node tree in '$'

The next steps are to select nodes by selectors and extract the required data.

Here is an example using cheerio, which filters all nodes matching a combination of selectors into a collection and extracts the links they contain:

const urls = $('.catalog li :nth-child(1) a')
	.filter((i, el) => $(el).attr('href'))
	.map((i, el) => $(el).attr('href'))
	.toArray();

// urls --> ['url1', 'url2', ..., 'urlN']

That reads as follows:

  • find an element with the .catalog class in the markup (at any nesting level);
  • select all li elements (at any nesting level within .catalog);
  • inside each li, select the elements that are the first child of their parent;
  • inside those first-child elements, select all a elements (at any nesting level);
  • keep only the a elements that have a non-empty href attribute;
  • iterate over the resulting collection of a elements and extract the value of each href attribute;
  • write the extracted links to a JS array.

This code contains only 4 short lines but has a high instruction density.

It is worth noting that if cheerio does not return the required content fragments, check that the HTML markup received from the web server really contains what is needed.

Headless Browsers with Puppeteer

Headless browsers are used to simulate user actions on a page programmatically without launching a graphical interface.

Using a headless browser consumes system resources in one way or another and increases the overall running time of the application. In addition, care must be taken to ensure that processes with browser instances do not remain open in the system, as their uncontrolled growth will bring down the entire server.
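One common way to guard against leftover browser processes is to wrap every browser task in a helper that always closes the instance. The sketch below is an illustration, not part of Puppeteer: withBrowser is a hypothetical helper, and a fake launcher stands in for puppeteer.launch() so the example runs without the library installed.

```javascript
// `launch` is any function that resolves to an object with a close() method,
// for example () => puppeteer.launch({ headless: true }).
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    // Runs on success and on error alike, so the process is never left behind.
    await browser.close();
  }
}

// Stand-in browser that only records that close() was called.
const closed = [];
const fakeLaunch = async () => ({
  close: async () => { closed.push('closed'); },
});

// The task fails, yet the browser is still closed before the error surfaces.
const outcome = withBrowser(fakeLaunch, async () => {
  throw new Error('navigation failed');
}).catch((err) => err.message);
```

With this shape, the calling code never touches close() directly, so a forgotten cleanup path cannot accumulate Chromium processes.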

Of the headless browsers, the most used is Chromium, driven by the Puppeteer library. The most common reason for it to appear in scraper code is a pop-up captcha (or a requirement to execute some JS code before the content loads). The browser receives the task from the captcha, waits for a solution, and sends a response with the solution to the web server. Only after that does the calling code receive HTML for parsing.

Sometimes a headless browser is used to receive an authorization cookie and, in very rare cases, to load content by simulating mouse scrolling.

Usage:

// puppeteer-service.js
const puppeteer = require('puppeteer');

module.exports = async ({ username, password }) => {
	let browser; // declared outside try so that finally can close it
	try {
		browser = await puppeteer.launch({
			args: ['--window-size=1920,1080'],
			// false shows the browser window, which helps during development
			headless: process.env.NODE_ENV === 'PROD',
		});

		const page = await browser.newPage();
		await page.setUserAgent('any-user-agent-value');
		await page.goto('any-target-url', { waitUntil: ['domcontentloaded'] });
		await page.waitForSelector('any-selector');
		await page.type('input.email-input', username);
		await page.type('input.password-input', password);
		await page.click('button[value="Log in"]');
		// waitForTimeout is not available in recent Puppeteer versions; there, use
		// await new Promise((resolve) => setTimeout(resolve, 3000)) instead
		await page.waitForTimeout(3000);

		return await page.evaluate(
			() => document.querySelector('.pagination').innerHTML,
		);
	} catch (err) {
		// handling error
	} finally {
		if (browser) await browser.close();
	}
};

// somewhere in code, inside an async function
const cheerio = require('cheerio');
const fetchPagination = require('./puppeteer-service.js');

// the credentials are assumed to come from the environment (hypothetical variable names)
const $ = cheerio.load(await fetchPagination({
	username: process.env.LOGIN_USERNAME,
	password: process.env.LOGIN_PASSWORD,
}));
// some useful code next...

This snippet is a simplified example of the basic methods for manipulating a Chromium page. First, a browser instance and a blank page are created. After the User-Agent header is set, the page requests content at the given URL.

Next, the code makes sure that the required piece of content has loaded (waitForSelector), enters the login and password in the authorization fields (type), presses the 'Log in' button (click), and waits for 3 seconds (waitForTimeout) while the authorized user's content loads. Finally, it returns the HTML markup of the desired fragment with pagination to the caller.

Using Javascript's Async for Speed Increase

The asynchronous I/O features that NodeJS and JavaScript support can be used to speed up the application. There are two conditions: free processor cores, on which asynchronous processes can run separately from the main thread of execution, and independence of the main thread from the result of the asynchronous operation.

The asynchrony of an HTTP request does not shorten the running time when the continuation of the main thread depends directly on the response to that request.

The same applies to reading information from the database. Until the data is received, the main thread of the scraper has little to do.

Writing data to the database, however, can be separated from the main thread. Suppose that, at some point in the code, an object is ready to be written to the database, and that further actions must branch on the result of the write before the next iteration can begin.

Instead of waiting for the result at the place where the write function is called, one can create a new process and assign this work to it. The main thread can then move on immediately.

// in some place of code...

// we've got the object for inserting to database
const data = {
	product: 'Jacket',
	price: 55,
	currency: 'EUR',
};

const { fork } = require('child_process');

// launch a separate child process
const child = fork('path-to-write-data.js');

child.send(JSON.stringify(data));

// do the next code iteration...

Implementation of the code for the child process in a separate file:

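// write-data.js

// `db` is assumed to be the application's database client, initialized elsewhere
process.on('message', async (data) => {
	// the handler must be async in order to use await
	const result = await db.writeDataFunc(JSON.parse(data));

	if (result) {
		// do some important things
	} else {
		// do other important things
	}
});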

In general, the distinguishing feature of the fork method is that it lets the parent and child processes exchange messages with each other. In this example, however, for demonstration purposes, the work is delegated to the child process without notifying the parent, which lets the parent continue its own execution in parallel with the child.
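For completeness, here is a sketch of the two-way exchange that fork makes possible. The child script is written to a temporary file only to keep the example in one piece, and its "database write" is faked with a simple type check; both are assumptions made for the illustration.

```javascript
const { fork } = require('node:child_process');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// In a project this code would live in its own file, like write-data.js.
const childSource = `
process.on('message', (item) => {
  // Pretend to write to the database, then report the outcome to the parent.
  const saved = typeof item.price === 'number';
  process.send({ product: item.product, saved });
});
`;
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'scraper-'));
const childPath = path.join(tmpDir, 'child.js');
fs.writeFileSync(childPath, childSource);

const reply = new Promise((resolve, reject) => {
  const child = fork(childPath);
  child.once('error', reject);
  child.once('message', (message) => {
    child.disconnect(); // close the IPC channel so the child can exit
    resolve(message);
  });
  // Objects are serialized automatically, so JSON.stringify is optional here.
  child.send({ product: 'Jacket', price: 55, currency: 'EUR' });
});
```

The parent stays free to do other work until reply resolves, at which point it can branch on the saved flag.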

Avoid Blocks while Scraping

Most of the target websites from which data is scraped actively resist this process. Of course, the developers of these resources know how it works. This means that setting just the right headers is often not enough to crawl the entire site.

Web servers can limit the distribution of content after reaching a certain number of requests from one IP per unit of time. They can restrict access if they see that the request came from a data center proxy.
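One way to stay under a per-IP request limit is to pace requests on the scraper's side. Below is a minimal sketch of a sliding-window limiter. The names createLimiter and delayBeforeNext are hypothetical, and the limit of 2 requests per second is an invented figure: real limits are rarely published and have to be found by trial.

```javascript
// Allow at most `maxRequests` within any period of `windowMs` milliseconds.
function createLimiter(maxRequests, windowMs) {
  const sent = []; // timestamps of recently sent requests
  return function delayBeforeNext(now = Date.now()) {
    // Forget requests that have already left the window.
    while (sent.length && now - sent[0] >= windowMs) sent.shift();
    if (sent.length < maxRequests) {
      sent.push(now);
      return 0; // free to send immediately
    }
    // Otherwise report how long to wait until the oldest request expires.
    return sent[0] + windowMs - now;
  };
}

const delayBeforeNext = createLimiter(2, 1000);
// Timestamps are passed explicitly here to make the behavior easy to follow.
const waits = [0, 100, 200, 1000].map((t) => delayBeforeNext(t));
```

A caller that receives a non-zero delay would sleep for that long and then ask again before sending the request.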

They can send a captcha to solve if the location of the client's IP seems unreliable to them. Or they may offer to execute some JS code on the page before loading the main content.

The scraper's goal is to satisfy the web server's checks so that the request looks as if it came from a real user's browser and not from a bot. If it looks realistic enough, the server sends the content, because it does not want to accidentally restrict a real user instead of a bot.

When one comes across such a picky web server, the problem is solved by choosing proxy addresses in the right location and gradually increasing their quality, up to residential ones. The downside of this approach is the increased cost of data collection.
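The gradual increase in proxy quality could be organized roughly as follows. This is only a sketch: fetchWithEscalation, the tier names, the proxy hosts, and the fetchVia callback (a function that performs the request through a given proxy) are all hypothetical, and a fake fetcher replaces real network calls.

```javascript
// Try proxy tiers from cheapest to most expensive and stop at the first success.
async function fetchWithEscalation(url, tiers, fetchVia) {
  const failures = [];
  for (const tier of tiers) {
    try {
      const body = await fetchVia(url, tier);
      return { body, tier: tier.name, failures };
    } catch (err) {
      failures.push(`${tier.name}: ${err.message}`);
    }
  }
  throw new Error(`all proxy tiers failed: ${failures.join('; ')}`);
}

const tiers = [
  { name: 'datacenter', host: 'dc.proxy.example' },
  { name: 'residential', host: 'res.proxy.example' },
];

// Stand-in for a real request: this imaginary server rejects datacenter addresses.
const fakeFetch = async (url, tier) => {
  if (tier.name === 'datacenter') throw new Error('403 Forbidden');
  return `<html>${url}</html>`;
};

const result = fetchWithEscalation('https://example.com/catalog', tiers, fakeFetch);
```

Because cheaper tiers are always tried first, the more expensive residential addresses are used only for the requests that actually need them.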

Conclusion and Takeaways

NodeJS and JavaScript are well suited to every part of the scraping process. If a stack is needed, NodeJS, JavaScript, and MongoDB are among the best choices. NodeJS helps not only to solve most scraping tasks but also to make data extraction reliable, and headless browsers make it possible to imitate user behavior.

Roman Milyushkevich

I'm a big believer in automation and anything that has the potential to save a human's time. Every day I help companies extract data and make more informed business decisions to reach their goals.
