
Web Scraping using Node.js and Puppeteer

A web scraper is an application or tool that extracts data from websites for further processing or analysis. You can build a web scraper with Python, Java, Ruby, JavaScript, and many other languages, each of which has its own scraping libraries and examples. In this article, we will cover how you can build your own web scraper using Node.js and Puppeteer, one of the most popular JavaScript libraries for web scraping.

Why Node.js and Puppeteer for Web Scraping?

  • Node.js has many resources on the Internet and is easy to use. Its syntax and structure are straightforward, making it accessible even for those new to web scraping.

  • The JavaScript ecosystem is rich in various libraries and frameworks for web scraping, like Puppeteer, Axios, Request, and Cheerio.

  • Puppeteer can be used to control websites that rely on JavaScript for content rendering.

  • Node.js is fast and scalable, which is required when scraping large volumes of data.

  • Puppeteer runs in headless mode by default, meaning it does not require a graphical user interface, which makes it fast and memory-efficient. It also supports a headful mode for debugging and testing, as the snippet after this list shows.
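
Here is a minimal sketch of the difference, assuming Puppeteer is already installed (installation is covered below) and an ES module context like the rest of this tutorial; the only change between the two modes is the headless launch option:

import puppeteer from 'puppeteer';

// Headless (default): no visible browser window; fast and memory-efficient.
const headlessBrowser = await puppeteer.launch();
await headlessBrowser.close();

// Headful: opens a visible browser window, handy for debugging and testing.
const headfulBrowser = await puppeteer.launch({ headless: false });
await headfulBrowser.close();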

Prerequisites

  • A code editor like VS Code

You can find the complete code of this Puppeteer scraper on this GitHub page.

Set up Puppeteer

To set up Puppeteer, you need Node.js installed. Check out the Node.js website and documentation to install it on your machine; you can verify the installation by running node -v in your terminal.

Create a new directory, open it in your code editor, and run the command below in the terminal to initialize a Node.js project with a package.json file.

npm init

The above command will create your package.json file after you provide the necessary details. Next, install Puppeteer with your package manager of choice:

# with npm
npm i puppeteer

# with Yarn
yarn add puppeteer

# with pnpm
pnpm add puppeteer

After you have installed Puppeteer, create a new file named index.js in your project's root folder. This file will house your primary script.

In the index.js file, import Puppeteer with the statement below. Note that this ES module import syntax requires "type": "module" in your package.json (or an .mjs file extension):

 import puppeteer from 'puppeteer';

You must set the URL of the blog or article you want to scrape. In this example, we'll use https://hackmamba.io/blog to scrape the articles we publish. Next, declare a constant with this URL as its value:

 const url='https://hackmamba.io/blog/';

Identify the right HTML tag

To scrape the blog posts, you need to identify the HTML tags whose content you want to extract. Right-click on the page in a Google Chrome browser, as shown below, and click Inspect to open the developer tools.

Click the element-picker (pointer icon) to select the item you want to inspect.

Now, click on the title of any blog post to see its code in the Elements tab on the right, as shown below.

As you can see above, each blog post sits within a <section> tag, the link to the post is in an <a> tag, and the title is in an <h2> tag. We will use these tags in the code to extract each post's URL and title. The specific tags will differ from one website to another.
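
Before writing the scraper, you can sanity-check these selectors in the DevTools console. This is a minimal sketch, assuming the structure described above; run it in the console on the blog page to preview the data the scraper will collect:

// Run in the browser's DevTools console on https://hackmamba.io/blog/
Array.from(document.querySelectorAll('section'))
  .slice(0, 4)
  .map((section) => ({
    title: section.querySelector('h2')?.innerText,
    url: section.querySelector('a')?.href,
  }));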

Write the scraper script

In the index.js file, launch Puppeteer to scrape the page using the following script:

const main = async () => {
    const browser = await puppeteer.launch(); // launch a new headless browser
    const page = await browser.newPage(); // create a new page (tab)
    await page.goto(url); // navigate to the URL declared above

    const allArticles = await page.evaluate(() => { // runs the callback in the page's context once the page has loaded
        const articles = document.querySelectorAll('section'); // select every 'section' element
        return Array.from(articles).slice(0, 4).map((section) => { // convert the NodeList to an array and keep the first four posts
            const title = section.querySelector('h2').innerText; // extract the post title
            const url = section.querySelector('a').href; // extract the post URL
            return { title, url }; // return the title and URL for this post
        });
    });

    console.log(allArticles);
    await browser.close(); // close the browser so the process can exit
};
main();

In the code block above, you did the following:

  • Launched a new headless browser instance and opened a new browser tab.
  • Navigated to the url declared earlier.
  • Parsed the page data to loop through the section elements and retrieve the title and link for each blog post.
  • Logged an array containing the information for the first four blog posts, then closed the browser.

To run the program, use the command node <filename.extension>. In this case, it is node index.js.

You should see the following in your terminal.
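
The exact titles and URLs depend on the posts published at the time you run the script, but the logged array should look roughly like this:

[
  {
    title: 'The Hackmamba Blog',
    url: 'https://hackmamba.io/blog/2024/06/cultivating-developer-enablement-on-composable-platforms/'
  },
  {
    title: 'Cultivating Developer Enablement on Composable Platforms',
    url: 'https://hackmamba.io/blog/2024/06/cultivating-developer-enablement-on-composable-platforms/'
  },
  ...
]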

You can transform and store this data into a JSON file for subsequent use. Here's how.

Store the scraped data as a JSON object

To store this data in a JSON file, follow these steps:

You will use the fs module, which is built into Node.js, so there is nothing extra to install.

fs is a native Node.js module that allows you to read and write files and perform various other operations on the file system.

Next, import fs into index.js, convert the JavaScript object to a JSON string, and write the data to a pageData.json file. Update the main function to include the file-writing logic:

import fs from 'fs';
// other imports go here

const main = async () => {
    // script to scrape data goes in here

    fs.writeFile(`pageData.json`, JSON.stringify(allArticles), (err) => {
        if (err) {
            console.log(err);
        } else {
            console.log(`Data of Page Scraped`);
        }
    });
};
main();

The conditional statement handles any error from the file-creation operation and logs a success message once the file is written.
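
If you prefer to keep everything in async/await style, Node.js also provides a promise-based file API, fs/promises. This optional alternative is a sketch, not part of the original script:

import { writeFile } from 'fs/promises';
// other imports go here

// Inside main(), after allArticles has been collected:
await writeFile('pageData.json', JSON.stringify(allArticles, null, 2));
console.log('Data of Page Scraped');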

Your JSON file content should look like this:

[
  {
    "title": "The Hackmamba Blog",
    "url": "https://hackmamba.io/blog/2024/06/cultivating-developer-enablement-on-composable-platforms/"
  },
  {
    "title": "Cultivating Developer Enablement on Composable Platforms",
    "url": "https://hackmamba.io/blog/2024/06/cultivating-developer-enablement-on-composable-platforms/"
  },
  {
    "title": "Top 5 places to post developer content",
    "url": "https://hackmamba.io/blog/2024/05/top-5-places-to-post-developer-content/"
  },
  {
    "title": "Can a developer marketing agency help you win the dev community?",
    "url": "https://hackmamba.io/blog/2024/05/can-developer-marketing-agency-win-dev-community/"
  }
]

Here's the complete code:

import puppeteer from 'puppeteer';
import fs from 'fs';

const url = 'https://hackmamba.io/blog/';

const main = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);

    const allArticles = await page.evaluate(() => {
        const articles = document.querySelectorAll('section');
        return Array.from(articles).slice(0, 4).map((section) => {
            const title = section.querySelector('h2').innerText;
            const url = section.querySelector('a').href;
            return { title, url };
        });
    });

    console.log(allArticles);

    fs.writeFile(`pageData.json`, JSON.stringify(allArticles), (err) => {
        if (err) {
            console.log(err);
        } else {
            console.log(`Data of Page Scraped`);
        }
    });

    await browser.close();
};
main();

Wrapping up

You've completed this tutorial! Web scraping is a powerful technique for automatically retrieving public data, so explore applying it across different data scenarios.

Ensure the HTML tags you target are accurate each time, since they differ from page to page. You may have to use other HTML element selectors (like classes and IDs) for specific use cases and to avoid duplicate data, as the sketch below illustrates.
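
For example, here is how the page.evaluate callback might look on a site that marks up its posts with classes; the .post-card, .post-title, and .post-link class names are hypothetical and should be replaced with the selectors of the site you are scraping:

const allArticles = await page.evaluate(() => {
    // Hypothetical class names: adjust them to match the target site's markup.
    return Array.from(document.querySelectorAll('.post-card')).map((card) => ({
        title: card.querySelector('.post-title')?.innerText,
        url: card.querySelector('.post-link')?.href,
    }));
});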

Node.js excels at web scraping thanks to its efficient, non-blocking I/O model and extensive ecosystem. Puppeteer is great for scraping dynamic, JavaScript-heavy websites, offering flexibility, efficiency, and scalability for large-scale data extraction. Whether it's market research, price tracking, or data collection, Node.js delivers the power and reliability to turn web data into actionable insights.

Resources

  • Node.js Documentation
  • Puppeteer Documentation


About the author

I am a DevRel and App Developer who loves creating content and building communities. I want to live a happy life, help others, and become a better Developer Advocate.
