Akshat Virmani
6 min read · Jun 20, 2024
Web Scraping using Node.js and Puppeteer
A web scraper is an application or tool that extracts data from websites for further processing or analysis. You can build a web scraper in Python, Java, Ruby, JavaScript, and many other languages, each with its own libraries and web scraping examples. In this article, we will cover how to build your own web scraper using Node.js and Puppeteer, one of the most popular JavaScript libraries for web scraping.
Why Node.js and Puppeteer for Web Scraping?
- Node.js has many resources on the Internet and is easy to use. Its syntax and structure are straightforward, making it accessible even for those new to web scraping.
- The JavaScript ecosystem is rich in libraries and frameworks for web scraping, like Puppeteer, Axios, Request, and Cheerio.
- Puppeteer can control websites that rely on JavaScript for content rendering.
- Node.js is fast and scalable, which matters when scraping large volumes of data.
- Puppeteer runs in headless mode by default, meaning it does not require a graphical user interface, which makes it fast and memory-efficient. It also has a headful mode for testing purposes (see the snippet after this list).
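By default, Puppeteer launches the browser without a visible window. While debugging your selectors, it can help to watch the browser work; here is a minimal sketch of both modes (headless: false is the standard way to run headful, though accepted option values can vary between Puppeteer versions):
import puppeteer from 'puppeteer';

// Headless (default): no visible window, faster and lighter on memory.
const headlessBrowser = await puppeteer.launch();
await headlessBrowser.close();

// Headful: opens a visible browser window, handy for watching the scraper while testing.
const headfulBrowser = await puppeteer.launch({ headless: false });
await headfulBrowser.close();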
Prerequisites
- Node.js installed on your machine
- A code editor like VS Code
You can find the complete code for this Puppeteer scraper on this GitHub page.
Set up Puppeteer
To set up Puppeteer, you require Node.js. Check out the Node.js website and documentation to install it on your machine.
Create a new directory, open it in your code editor, and run the command below in the terminal to initialize a Node.js project with a package.json file.
npm init
The above command will create your package.json file after you provide the necessary details. Install Puppeteer by running one of the following commands, depending on your package manager:
npm i puppeteer
# with Yarn
yarn add puppeteer
# with pnpm
pnpm add puppeteer
After you have installed Puppeteer, create a new file in your project root folder with the name ‘index.js’. This will house your primary script.
In the index.js file, import the Puppeteer library with:
import puppeteer from 'puppeteer';
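Note that this is ES module syntax. If Node.js reports an error about using an import statement outside a module, add the module type to your package.json (or switch to require instead); the relevant field looks like this:
{
  "type": "module"
}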
You must set the URL of the blog or article you want to scrape. In this example, we'll use https://hackmamba.io/blog to scrape the articles we publish. Next, declare a constant with this URL as its value.
const url='https://hackmamba.io/blog/';
Identify the right HTML tag
To scrape the blog posts, you need to identify the right HTML tag whose content you want to scrape. Right-click on the page in a Google Chrome browser, as shown below, and click the Inspect option to open the developer tools interface.
Click the pointer to select the item to inspect.
Now, click on the title of any blog post to see its code in the Elements tab, as shown below.
As you can see above, each blog post sits within a <section> tag, the link to the post is in an <a> tag, and the title is in an <h2> tag. We will use these tags in the code to extract each blog's URL and title. The exact tags will differ from one website to another.
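Before writing the scraper, you can sanity-check these selectors in the browser's DevTools console on the blog page itself; a quick sketch, assuming the <section>/<h2>/<a> structure described above:
// Run in the DevTools console on https://hackmamba.io/blog/
const sections = document.querySelectorAll('section');
console.log(sections.length); // how many <section> elements the page contains

// Confirm the first section exposes a title and a link
console.log(sections[0].querySelector('h2')?.innerText);
console.log(sections[0].querySelector('a')?.href);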
Write the scraper script
In the index.js file, launch Puppeteer to scrape the page using the following script:
const main = async () => {
  const browser = await puppeteer.launch(); // launch a new headless browser
  const page = await browser.newPage(); // create a new page (tab)
  await page.goto(url); // navigate to the URL declared above
  const allArticles = await page.evaluate(() => {
    // runs this function inside the page's context after navigation
    const articles = document.querySelectorAll('section'); // select every 'section' element
    return Array.from(articles).slice(0, 4).map((section) => {
      // build an array from the section nodes and keep the first 4 articles
      const title = section.querySelector('h2').innerText; // extract the blog title
      const url = section.querySelector('a').href; // extract the blog URL
      return { title, url }; // return the title and URL of each blog
    });
  });
  console.log(allArticles);
  await browser.close(); // close the browser so the process can exit
};

main();
In the code block above, you did the following:
- Launched a new headless browser instance and opened a new browser tab.
- Navigated to the url declared earlier by passing it to page.goto.
- Parsed the page data to loop through the section elements and retrieve the title and link for each blog post.
- Returned an array containing the information for the first four blog posts on the page.
- Closed the browser so the Node.js process can exit cleanly.
To run the program, use the command node <filename.extension>. In this case, it is:
node index.js
You should see an array of objects, each containing a blog title and URL, printed in your terminal.
You can transform and store this data into a JSON file for subsequent use. Here's how.
Store the scraped data as a JSON object
To store this data in a JSON file, follow these steps:
Node.js ships with the fs module: a core module that lets you read and write files and perform various other file-system operations, so there is nothing extra to install.
Next, import the fs module in index.js, convert the JavaScript object data to JSON, and write it to a pageData.json file. Update the main function to include the file-system logic with:
import fs from 'fs';

// other imports go here

const main = async () => {
  // script to scrape data goes in here

  fs.writeFile(`pageData.json`, JSON.stringify(allArticles), (err) => {
    if (err) {
      console.log(err);
    } else {
      console.log(`Data of Page Scraped`);
    }
  });
};

main();
The conditional statement handles any error from the file creation operation and logs a positive message once done.
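If you prefer promises over callbacks, the same write can be done with Node's built-in fs/promises API; a minimal sketch (the extra arguments to JSON.stringify just pretty-print the output and are optional):
import { writeFile } from 'fs/promises';

// inside main, after allArticles has been scraped
await writeFile('pageData.json', JSON.stringify(allArticles, null, 2));
console.log('Data of Page Scraped');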
Your JSON file content should look like this:
[ { "title": "The Hackmamba Blog", "url": "https://hackmamba.io/blog/2024/06/cultivating-developer-enablement-on-composable-platforms/" }, { "title": "Cultivating Developer Enablement on Composable Platforms", "url": "https://hackmamba.io/blog/2024/06/cultivating-developer-enablement-on-composable-platforms/" }, { "title": "Top 5 places to post developer content", "url": "https://hackmamba.io/blog/2024/05/top-5-places-to-post-developer-content/" }, { "title": "Can a developer marketing agency help you win the dev community?", "url": "https://hackmamba.io/blog/2024/05/can-developer-marketing-agency-win-dev-community/" } ]
Here's the complete code:
import puppeteer from 'puppeteer';
import fs from 'fs';

const url = 'https://hackmamba.io/blog/';

const main = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);

  const allArticles = await page.evaluate(() => {
    const articles = document.querySelectorAll('section');
    return Array.from(articles).slice(0, 4).map((section) => {
      const title = section.querySelector('h2').innerText;
      const url = section.querySelector('a').href;
      return { title, url };
    });
  });

  console.log(allArticles);

  fs.writeFile(`pageData.json`, JSON.stringify(allArticles), (err) => {
    if (err) {
      console.log(err);
    } else {
      console.log(`Data of Page Scraped`);
    }
  });

  await browser.close();
};

main();
Wrapping up
You've completed this tutorial! As web scraping is a powerful technique employed to automatically retrieve public data, explore using this in various data scenarios.
Make sure the HTML tags you target are accurate each time, since they differ from page to page. You may have to use other HTML selectors (such as classes and IDs) for specific use cases and to avoid duplicate data, as in the sketch below.
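For example, if the blog cards on a site used a dedicated CSS class, a more targeted and null-safe version of the extraction could look like this; the .blog-card class name is hypothetical and only for illustration:
const allArticles = await page.evaluate(() => {
  // '.blog-card' is a made-up class name; replace it with a real selector from the page you scrape
  const cards = document.querySelectorAll('.blog-card');
  return Array.from(cards).map((card) => ({
    // optional chaining avoids crashes when a card is missing an <h2> or a link
    title: card.querySelector('h2')?.innerText ?? '',
    url: card.querySelector('a')?.href ?? '',
  }));
});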
Node.js is excellent in web scraping due to its efficient, non-blocking I/O model and extensive ecosystem. Puppeteer is great for scraping dynamic JavaScript-heavy websites, offering seamless adjustability, efficiency, and scalability for large data extraction. Whether it’s market research, price tracking, or data collection, Node.js delivers the power and reliability to turn web data into actionable insights.
Resources
- Node.js Documentation
- Puppeteer Documentation
About the author
I am a DevRel and App Developer who loves creating content and building communities. I want to live a happy life, help others, and become a better Developer Advocate.