Extract information from the Internet with Node.js and Puppeteer
Paul Berthelot / January 09, 2023
5 min read
How to get data from a website?
There are a few different ways to get data from websites. Each of these options has its own pros and cons, and the best choice will depend on your specific needs and resources.
- SaaS: A hosted scraping service requires no coding skills, offers a wide range of features out of the box, and is easy to use. However, it can be expensive and may not support every website or data type.
- Agency: Work with an agency or freelance developer who specializes in web scraping. They can build you a customized solution, but it can be expensive and time-consuming.
- API: Use the API of the website you want to scrape, if you have access to it. An API gives you structured and predictable data. However, as web development moves toward server-side rendering, it is becoming less and less common for a site to expose an API you can use.
- Headless browser: A web browser without a GUI that lets you access and interact with websites programmatically.
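To make the API option concrete, here is a minimal sketch that reads article links from the public Hacker News API instead of parsing HTML. The endpoint URLs and the `fetchTopStoryLinks` helper are assumptions for illustration and are not part of the scripts in this post; the global `fetch` requires Node.js 18 or later.

```javascript
// Sketch only: assumes the public Hacker News API hosted on Firebase
// and the global fetch available in Node.js 18+.
const HN_API = 'https://hacker-news.firebaseio.com/v0';

// fetchImpl is injectable so the function can be exercised without network access.
async function fetchTopStoryLinks(limit = 5, fetchImpl = fetch) {
  // The top stories endpoint returns an array of numeric item ids.
  const idsResponse = await fetchImpl(`${HN_API}/topstories.json`);
  const ids = (await idsResponse.json()).slice(0, limit);

  // Fetch each item in parallel to read its metadata.
  const items = await Promise.all(
    ids.map(async (id) => {
      const itemResponse = await fetchImpl(`${HN_API}/item/${id}.json`);
      return itemResponse.json();
    })
  );

  // Text-only posts have no url field, so skip them.
  return items.filter((item) => item && item.url).map((item) => item.url);
}
```

Calling `fetchTopStoryLinks(10).then(console.log)` would print up to ten links. The data arrives already structured, which is the main advantage over scraping when an API exists.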
Basic scraping in Node.js with Puppeteer
scrape.js
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com/news');
  const links = await page.$eval(
    '#hnmain > tbody > tr:nth-child(3) > td > table > tbody',
    (list) => {
      const links = [];
      for (let i = 0; i < list.children.length; i++) {
        const link = list.children[i].querySelector(
          'td:nth-child(3) > span > a'
        )?.href;
        if (link) links.push(link);
      }
      return links;
    }
  );
  console.log(links);
  await page.close();
  await browser.close();
})();
Here I'm using the Puppeteer library to scrape Hacker News (news.ycombinator.com/news) for links to articles.
- The Puppeteer library is imported.
- A new browser instance is launched in headless mode (i.e., not visible to the user) using puppeteer.launch().
- A new page is created in the browser using browser.newPage().
- The page navigates to the news website using page.goto().
- The page runs a callback using page.$eval(). The CSS selector matches the table body that holds the article rows, and the callback iterates through those rows to extract the link from each one.
- The extracted links are logged to the console using console.log().
- The page and the browser are closed using page.close() and browser.close() respectively.
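The row-by-row loop above can also be written with `page.$$eval()`, which runs a callback against every element matching a selector. The following is a minimal sketch rather than the script used in this post: the `.titleline > a` selector is an assumption about the current Hacker News markup, and `extractLinks` and `scrapeLinks` are hypothetical helper names.

```javascript
// Runs inside the page: receives an array of every element matched by the selector.
// Kept as a standalone function so it can be checked without a browser.
function extractLinks(anchors) {
  return anchors.map((anchor) => anchor.href).filter(Boolean);
}

// `page` is a Puppeteer Page that has already navigated to the site.
// The default selector is an assumption and may need adjusting if the markup changes.
async function scrapeLinks(page, selector = '.titleline > a') {
  return page.$$eval(selector, extractLinks);
}
```

In scrape.js, `const links = await scrapeLinks(page);` would take the place of the `page.$eval()` call. A short class-based selector tends to survive layout changes better than a long chain of `nth-child` steps.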
Scrape Google Maps in Node.js
scrape.js
const fs = require('fs');
const puppeteer = require('puppeteer');
const restaurants = [
  'https://www.google.com/maps/search/?api=1&query=To%20Restaurant%2034%20Rue%20Beaurepaire%2075010%20Paris',
  'https://www.google.com/maps/search/?api=1&query=Sphère%2018%20Rue%20la%20Boétie%2075008%20Paris',
  'https://www.google.com/maps/search/?api=1&query=Le%20Reminet%203%20rue%20des%20Grands%20Degrés%2075005%20Paris',
  'https://www.google.com/maps/search/?api=1&query=Pink%20Fizz%2046%20Bd%20de%20Clichy%2075018%20Paris',
];
const restaurantsData = [];
(async () => {
  const browser = await puppeteer.launch({ headless: true });
  for (let i = 0; i < restaurants.length; i++) {
    console.log(`${i + 1}/${restaurants.length}`);
    const page = await browser.newPage();
    await page.goto(restaurants[i]);
    // Accept the consent page, which is only shown on the first visit
    if (i === 0) {
      await page.waitForSelector(
        '#yDmH0d > c-wiz > div > div > div > div.NIoIEf > div.G4njw'
      );
      // Wait for the click to be dispatched before continuing
      await page.click(
        '#yDmH0d > c-wiz > div > div > div > div.NIoIEf > div.G4njw > div.AIC7ge > div.CxJub > div.VtwTSb > form:nth-child(2) > div > div > button'
      );
    }
    // Wait for the results panel to load
    await page.waitForSelector(
      '#QA0Szd > div > div > div.w6VYqd > div.bJzME.tTVLSc > div > div.e07Vkf.kA9KIf > div > div'
    );
    try {
      const { phone, website, menu } = await page.$eval(
        '#QA0Szd > div > div > div.w6VYqd > div.bJzME.tTVLSc > div > div.e07Vkf.kA9KIf > div > div > div:nth-child(11)',
        (list) => {
          const informations = { phone: '', website: '', menu: '' };
          for (let y = 0; y < list.children.length; y++) {
            const element = list.children[y];
            if (element.childNodes.length === 0) continue;
            // If this row holds the phone number, read it
            const imgPhone = element.querySelector(
              'button > div.AeaXub > div.cXHGnc > div > img'
            )?.src;
            if (imgPhone && imgPhone.includes('phone_gm_blue')) {
              informations.phone = element.querySelector(
                'button > div.AeaXub > div.rogA2c > div.Io6YTe.fontBodyMedium'
              )?.textContent;
            }
            // If this row holds the website or menu link, read it
            const imgWebsiteOrMenu = element.querySelector(
              'a > div.AeaXub > div.cXHGnc > div > img'
            )?.src;
            if (
              imgWebsiteOrMenu &&
              imgWebsiteOrMenu.includes('public_gm_blue')
            ) {
              informations.website = element.querySelector('a')?.href;
            }
            if (imgWebsiteOrMenu && imgWebsiteOrMenu.includes('list_gm_blue')) {
              informations.menu = element.querySelector('a')?.href;
            }
          }
          return informations;
        }
      );
      restaurantsData.push({ phone, website, menu });
    } catch (e) {
      // Report the failure and move on to the next restaurant
      console.error(`Could not extract data for ${restaurants[i]}:`, e.message);
    }
    await page.close();
  }
  await browser.close();
  fs.writeFileSync('restaurantsData.json', JSON.stringify(restaurantsData));
})();
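The search URLs in the `restaurants` array follow a fixed pattern, so they can be generated from plain name-and-address strings instead of being encoded by hand. Here is a minimal sketch, assuming a hypothetical `buildMapsSearchUrl` helper that is not part of the script above:

```javascript
// Hypothetical helper: builds a Google Maps search URL in the same
// `?api=1&query=...` form used in the restaurants array.
function buildMapsSearchUrl(query) {
  // encodeURIComponent escapes spaces as %20 and percent-encodes accented characters.
  return `https://www.google.com/maps/search/?api=1&query=${encodeURIComponent(query)}`;
}

const restaurantQueries = [
  'To Restaurant 34 Rue Beaurepaire 75010 Paris',
  'Pink Fizz 46 Bd de Clichy 75018 Paris',
];

const restaurantUrls = restaurantQueries.map(buildMapsSearchUrl);
console.log(restaurantUrls);
```

Encoding the whole query this way also handles names with accents, such as "Sphère", without relying on the browser to fix the URL.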
Here I'm again using the Puppeteer library, this time to scrape information about a list of restaurants from Google Maps. I'm collecting the phone number, website, and menu (if available) of each restaurant.
- The fs (File System) and Puppeteer libraries are imported and an array of URLs for the restaurants is defined.
- A new browser instance is launched in headless mode (i.e., not visible to the user) using puppeteer.launch().
- A loop iterates through the array of restaurant URLs. For each URL:
  - A new page is created in the browser using browser.newPage().
  - The page navigates to the restaurant's page on Google Maps using page.goto().
  - If this is the first iteration of the loop (i.e., the first restaurant), the script waits for the consent page to load and then clicks the "Accept All" button to proceed.
  - The script waits for the results panel to appear and then uses page.$eval() to run a callback on the section of the page that contains the phone number, website, and menu information (if available). The callback identifies each row by its icon and extracts the matching value.
  - The extracted information is added to the restaurantsData array. If the extraction fails, the restaurant is skipped.
  - The page is closed using page.close().
- The browser is closed using browser.close().
- The restaurantsData array is written to a JSON file called restaurantsData.json using the fs.writeFileSync() function.
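The consent step only runs on the first iteration, which assumes the consent page appears exactly once and always on the first URL. A more defensive variant waits briefly for the consent button on any page and moves on if it never shows up. This is a minimal sketch under that assumption: `acceptConsentIfPresent` and the 5-second default timeout are hypothetical choices, and the selector is whichever consent-button selector the script already uses.

```javascript
// Hypothetical helper: clicks the consent button if it shows up within
// timeoutMs, and returns whether a click happened.
async function acceptConsentIfPresent(page, buttonSelector, timeoutMs = 5000) {
  try {
    // waitForSelector rejects if the element does not appear in time.
    await page.waitForSelector(buttonSelector, { timeout: timeoutMs });
  } catch (error) {
    // The button was not found: treat it as "no consent page was shown".
    return false;
  }
  await page.click(buttonSelector);
  return true;
}
```

Inside the loop, `await acceptConsentIfPresent(page, consentButtonSelector);` could replace the `if (i == 0)` block, where `consentButtonSelector` stands for the long button selector from the script. The trade-off is an extra wait of up to `timeoutMs` on every page that has no consent prompt.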