Scraping with Puppeteer and Playwright
Before continuing, please make sure you always have permission from the website owner to scrape their page(s).
Getting Started
We'll open the page in a headless Chrome browser and verify that the h1 tag on the Wikipedia page contains the name Wikipedia.
We'll use Node's assert for this, but of course you could use other libraries, such as Chai.
const puppeteer = require('puppeteer')
const assert = require('assert')

async function main () {
  // Connect to the remote headless Chrome instance
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.headlesstesting.com?token=[YOUR-TOKEN]'
  })
  const page = await browser.newPage()
  await page.goto('https://www.wikipedia.org/')
  // Read the trimmed text of the heading element
  const name = await page.$eval('#www-wikipedia-org > h1 > div > div', el => el.textContent.trim())
  assert.strictEqual(name, 'Wikipedia')
  await browser.close()
}

main()
- page.$eval() accepts 2 parameters: a query selector and a callback that receives the element.
- Here we simply find an element with the correct selector, and fetch the trimmed textContent from that element.
- The assertion verifies that the element's content equals the required value.
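One thing to keep in mind is that page.$eval() throws when no element matches the selector. The sketch below shows one possible way to get null instead. The name textOrNull is hypothetical (it is not part of Puppeteer), and the helper assumes that page is a Puppeteer page created as in the example above.

```javascript
// Hypothetical helper (not part of Puppeteer): returns the trimmed text of the
// first element matching `selector`, or null when nothing matches.
async function textOrNull (page, selector) {
  // page.$() resolves to null instead of throwing when there is no match
  const handle = await page.$(selector)
  if (handle === null) return null
  // The callback runs in the browser, just like the one passed to page.$eval()
  return handle.evaluate(el => el.textContent.trim())
}
```

Inside a script it could be used as: const name = await textOrNull(page, 'h1')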
Scraping lists
Often you'll want to scrape lists, or multiple values from a page.
You can use page.$$eval() to fetch all elements that match a specific query selector.
This behaves very similarly to page.$eval(); the difference is that the callback receives an array of all matching elements instead of a single element.
const puppeteer = require('puppeteer')
const assert = require('assert')

async function main () {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.headlesstesting.com?token=[YOUR-TOKEN]'
  })
  const page = await browser.newPage()
  await page.goto('https://www.wikipedia.org/')
  // Keep only the first line of each block, which holds the language name
  const popularLanguages = await page.$$eval(
    '#www-wikipedia-org > div.central-featured > div.central-featured-lang',
    elements => elements.map(el => el.textContent.trim().split('\n')[0])
  )
  assert(popularLanguages.length > 0)
  await browser.close()
}

main()
- page.$$eval() accepts 2 parameters: a query selector and a callback that receives all matching elements.
- We fetch all elements that match the selector, get the content of each element and clean it up.
- The assertion verifies that at least 1 element was found.
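The clean-up step can also be done in Node.js rather than inside the browser callback, which makes it testable without a browser. The following is a minimal sketch of that idea. The names cleanLabels and scrapeLabels are hypothetical, and page is assumed to be a Puppeteer page.

```javascript
// Hypothetical helper: runs in Node.js, so it can be tested without a browser.
// Keeps the first line of every raw text block and drops empty results.
function cleanLabels (rawTexts) {
  return rawTexts
    .map(text => text.trim().split('\n')[0].trim())
    .filter(label => label.length > 0)
}

// Fetches the raw textContent of all matching elements, then cleans it up.
async function scrapeLabels (page, selector) {
  const rawTexts = await page.$$eval(selector, elements => elements.map(el => el.textContent))
  return cleanLabels(rawTexts)
}
```

Splitting the work this way keeps the browser callback trivial. That matters because the callback is serialized and cannot call functions defined in the Node.js script.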
Form Elements, Images and Attributes
Below is an example of how to work with form elements such as input fields.
We'll also read attributes from elements.
const puppeteer = require('puppeteer')
const expect = require('chai').expect

async function main () {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.headlesstesting.com?token=[YOUR-TOKEN]'
  })
  const page = await browser.newPage()
  await page.goto('https://google.com/', { waitUntil: 'networkidle2' })
  // Type into the search field and read the value back
  await page.type('input.gLFyf.gsfi', 'HeadlessTesting')
  const searchValue = await page.$eval('input.gLFyf.gsfi', el => el.value)
  expect(searchValue).to.equal('HeadlessTesting')
  // Log the logo's image URL and the title attribute of the search field
  console.log(await page.$eval('#hplogo', img => img.getAttribute('src')))
  console.log(await page.$eval('input.gLFyf.gsfi', el => el.getAttribute('title')))
  await browser.close()
}

main()
- Instead of Node's assert, we use Chai's expect.
- We open Google in a headless browser and wait for the page to load.
- Then we enter HeadlessTesting in the input field and verify that it worked.
- The script looks for the main logo and logs its src attribute (the image URL).
- Finally, the script reads the title attribute from the input field and logs it before closing the browser session.
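When several attributes of the same element are needed, they can be read in one call by passing extra arguments to page.$eval(). The sketch below illustrates this. The name readAttributes is hypothetical, and page is assumed to be a Puppeteer page.

```javascript
// Hypothetical helper: reads several attributes from the first element that
// matches `selector` in a single round trip to the browser.
async function readAttributes (page, selector, names) {
  // Arguments after the callback are serialized and passed to it, because the
  // callback cannot see variables from the Node.js scope.
  return page.$eval(
    selector,
    (el, attributeNames) => {
      const result = {}
      for (const name of attributeNames) {
        result[name] = el.getAttribute(name)
      }
      return result
    },
    names
  )
}
```

For the search field above, readAttributes(page, 'input.gLFyf.gsfi', ['title', 'name']) would resolve to an object with a title and a name key. Attributes that are not set come back as null.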
Waiting For...
Sometimes it's necessary to wait until a specific element is present or visible on the page.
Puppeteer offers several waitFor methods, such as waitForSelector.
You can specify the maximum time to wait and whether the element must be visible as well as present.
const puppeteer = require('puppeteer')

async function main () {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.headlesstesting.com?token=[YOUR-TOKEN]'
  })
  const page = await browser.newPage()
  // The generic page.waitFor() is deprecated; the specific methods below replace it
  // wait for selector
  await page.waitForSelector('.foo')
  // wait for 1 second
  await new Promise(resolve => setTimeout(resolve, 1000))
  // wait for predicate
  await page.waitForFunction(() => !!document.querySelector('.foo'))
  await browser.close()
}

main()
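The maximum wait time and the visibility requirement mentioned above are passed as options to page.waitForSelector(). Below is a minimal sketch of a wrapper that reports a timeout as false instead of throwing. The name isVisibleWithin is hypothetical, and the check on the error name is an assumption about how Puppeteer labels its timeout errors.

```javascript
// Hypothetical helper: resolves to true once `selector` is present and visible,
// or false if that does not happen within `timeoutMs` milliseconds.
async function isVisibleWithin (page, selector, timeoutMs) {
  try {
    await page.waitForSelector(selector, { visible: true, timeout: timeoutMs })
    return true
  } catch (error) {
    // Assumption: Puppeteer's timeout errors carry the name 'TimeoutError'
    if (error.name === 'TimeoutError') return false
    throw error // anything else is unexpected, so let it propagate
  }
}
```

A script could then branch on the result, for example: if (await isVisibleWithin(page, '.foo', 5000)) { ... }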
Navigation timeouts/wait
While opening pages, Puppeteer will by default resolve the promise when the load event of the page has fired.
You can change this behavior by passing the waitUntil option to page.goto:
- load: navigation is finished when the load event of the page has fired
- domcontentloaded: navigation is finished when the DOMContentLoaded event has fired
- networkidle0: navigation is finished when there have been no network connections for at least 500 ms
- networkidle2: navigation is finished when there have been no more than 2 network connections for at least 500 ms
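As an illustration of the waitUntil values listed above, the sketch below wraps page.goto() and rejects mistyped values before navigating. The names WAIT_UNTIL_VALUES and openPage are hypothetical, and page is assumed to be a Puppeteer page.

```javascript
// The waitUntil values described in the list above
const WAIT_UNTIL_VALUES = ['load', 'domcontentloaded', 'networkidle0', 'networkidle2']

// Hypothetical helper: navigates to `url` and fails early on a mistyped
// waitUntil value instead of sending it to the browser.
async function openPage (page, url, waitUntil = 'load', timeoutMs = 30000) {
  if (!WAIT_UNTIL_VALUES.includes(waitUntil)) {
    throw new Error(`Unknown waitUntil value: ${waitUntil}`)
  }
  // timeout is the maximum navigation time in milliseconds
  return page.goto(url, { waitUntil, timeout: timeoutMs })
}
```

For example, openPage(page, 'https://www.wikipedia.org/', 'domcontentloaded') returns as soon as the DOM is parsed, which is often enough when only static markup is scraped.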
Cookies and Headers
It might be necessary to read or write cookies in your script, or to retrieve or set HTTP headers.
Fortunately, with Puppeteer this is easy to do.
Setting Cookies and Headers
To set cookies, you can use page.setCookie(...cookies).
If you want to set extra HTTP headers of the page, you can use page.setExtraHTTPHeaders.
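A minimal sketch of both calls is shown below. The helper name prepareSession, the cookie name, the domain example.com and the Accept-Language value are all placeholders chosen for illustration, and page is assumed to be a Puppeteer page.

```javascript
// Hypothetical helper: applies a session cookie and a language header to a page.
// Call it before page.goto() so the first request already carries both.
async function prepareSession (page, sessionId) {
  // Placeholder cookie: the name and domain are made up for this sketch
  await page.setCookie({
    name: 'session_id',
    value: sessionId,
    domain: 'example.com'
  })
  // Extra headers are sent with every request the page makes
  await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US' })
}
```

page.setCookie() accepts one or more cookie objects, so several cookies can be set in a single call.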
Reading Cookies and Headers
To retrieve the headers and cookies from a page, see this example:
const puppeteer = require('puppeteer')

async function main () {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.headlesstesting.com?token=[YOUR-TOKEN]'
  })
  const page = await browser.newPage()
  // page.goto() resolves to the response of the main resource
  const response = await page.goto('https://google.com/', { waitUntil: 'networkidle2' })
  const headers = response.headers()
  console.log(headers)
  // Cookies that apply to the current page URL
  const cookies = await page.cookies()
  console.log(cookies)
  await browser.close()
}

main()
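The values returned by these calls are plain data: page.cookies() resolves to an array of cookie objects, and response.headers() returns an object whose header names are lower-case. The sketch below shows two small lookup helpers built on that. The names cookiesByName and headerValue are hypothetical.

```javascript
// Hypothetical helper: turns the array returned by page.cookies() into a
// plain { name: value } object for easier lookups.
function cookiesByName (cookies) {
  const result = {}
  for (const cookie of cookies) {
    result[cookie.name] = cookie.value
  }
  return result
}

// Hypothetical helper: looks up a header regardless of how the caller cased
// its name, since response.headers() uses lower-case keys.
function headerValue (headers, name) {
  return headers[name.toLowerCase()]
}
```

With the variables from the example above, headerValue(headers, 'Content-Type') and cookiesByName(cookies) give direct access to individual values.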
User-Agent
By default, the Puppeteer User-Agent will contain HeadlessChrome.
You can change the user-agent like this:
const puppeteer = require('puppeteer')

async function main () {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.headlesstesting.com?token=[YOUR-TOKEN]'
  })
  const page = await browser.newPage()
  // Replace the default user-agent for all requests made by this page
  await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36')
  await browser.close()
}

main()
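Instead of hard-coding a full user-agent string, one option is to start from the string the browser reports and only remove the HeadlessChrome marker. The sketch below assumes that approach. The names withoutHeadlessMarker and useRegularUserAgent are hypothetical, and browser and page are assumed to be the Puppeteer objects from the example above.

```javascript
// Hypothetical helper: removes the "Headless" marker from a Chrome user-agent string.
function withoutHeadlessMarker (userAgent) {
  return userAgent.replace('HeadlessChrome', 'Chrome')
}

// Applies the adjusted user-agent to a page, based on what the browser reports.
async function useRegularUserAgent (browser, page) {
  const defaultUserAgent = await browser.userAgent()
  await page.setUserAgent(withoutHeadlessMarker(defaultUserAgent))
}
```

This keeps the version numbers in the user-agent in sync with the browser that is actually running.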