Scraping with Puppeteer and Playwright

Before continuing, please make sure you always have permission from the website owner to scrape their page(s).


Getting Started

We'll open the page in a headless Chrome browser and verify that the h1 tag on the Wikipedia homepage contains the name Wikipedia.
We'll use Node's assert module for this, but of course you could use another library, such as Chai.


const puppeteer = require('puppeteer')
const assert = require('assert')

// await needs an async function in a CommonJS script
async function main () {
  // connect to the remote headless Chrome instance
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.headlesstesting.com?token=[YOUR-TOKEN]'
  })

  const page = await browser.newPage()
  await page.goto('https://www.wikipedia.org/')

  // read the trimmed text of the element inside the h1
  const name = await page.$eval('#www-wikipedia-org > h1 > div > div', el => el.textContent.trim())
  assert.strictEqual(name, 'Wikipedia')

  await browser.close()
}

main()

  • page.$eval() accepts 2 parameters: a query selector and a callback that receives the first matching element.
  • Here we simply find the element that matches the selector and fetch its trimmed textContent.
  • The assertion verifies that the element's content equals the expected value.
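One detail worth knowing is that page.$eval() throws when no element matches the selector. The sketch below shows one possible way to get null back instead. It assumes a reasonably recent Puppeteer version, and textOrNull is a hypothetical helper name, not part of Puppeteer.

```javascript
// Hypothetical helper (not part of Puppeteer): returns the trimmed text of
// the first element matching `selector`, or null when nothing matches.
async function textOrNull (page, selector) {
  // page.$() resolves to null when no element matches
  const handle = await page.$(selector)
  if (handle === null) return null
  // run the callback in the browser with the element as its argument
  return handle.evaluate(el => el.textContent.trim())
}
```

With a connected page, await textOrNull(page, '#does-not-exist') would resolve to null rather than reject.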

Scraping lists

Often you'll want to scrape lists, or multiple values from a page.
You can use page.$$eval() to fetch all elements that match a specific query selector.


This behaves very similarly to page.$eval(); the difference is that the callback receives all matching elements instead of only the first one.


const puppeteer = require('puppeteer')
const assert = require('assert')

// await needs an async function in a CommonJS script
async function main () {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.headlesstesting.com?token=[YOUR-TOKEN]'
  })

  const page = await browser.newPage()
  await page.goto('https://www.wikipedia.org/')

  // keep the first line of each featured language's trimmed text
  const popularLanguages = await page.$$eval('#www-wikipedia-org > div.central-featured > div.central-featured-lang',
    elements => elements.map(el => el.textContent.trim().split('\n')[0]))
  assert(popularLanguages.length > 0)

  await browser.close()
}

main()

  • page.$$eval() accepts 2 parameters: a query selector and a callback that receives an array of all matching elements.
  • We fetch all elements that match the selector, then keep the first line of each element's trimmed text.
  • The assertion verifies that at least 1 element was found.
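The clean-up step can also return structured data instead of plain strings. The following is a minimal sketch, not a definitive implementation: extractLanguages and scrapeLanguages are hypothetical names, and the assumed text layout (a language name on the first line, further details on the following lines) is a guess about the page's markup.

```javascript
// Runs inside the browser: receives every matched element and returns plain,
// serializable data. It must not reference variables from the Node.js scope.
function extractLanguages (elements) {
  return elements.map(el => {
    const lines = el.textContent.trim().split('\n')
      .map(line => line.trim())
      .filter(Boolean)
    // assumption: the first line is the language name, the rest are details
    return { name: lines[0], details: lines.slice(1).join(' ') }
  })
}

// Hypothetical wrapper: `page` is a connected Puppeteer page
async function scrapeLanguages (page) {
  return page.$$eval(
    '#www-wikipedia-org > div.central-featured > div.central-featured-lang',
    extractLanguages
  )
}
```

Keeping the callback as a named function makes it possible to check the clean-up logic with plain objects, without opening a browser.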

Form Elements, Images and Attributes

Below is an example of how to work with form elements such as input fields.
We'll also read attributes from elements.


const puppeteer = require('puppeteer')
const expect = require('chai').expect

// await needs an async function in a CommonJS script
async function main () {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.headlesstesting.com?token=[YOUR-TOKEN]'
  })

  const page = await browser.newPage()
  await page.goto('https://google.com/', { waitUntil: 'networkidle2' })

  // type into the search field and read the value back
  await page.type('input.gLFyf.gsfi', 'HeadlessTesting')
  const searchValue = await page.$eval('input.gLFyf.gsfi', el => el.value)
  expect(searchValue).to.equal('HeadlessTesting')

  // log the logo's image URL and the search field's title attribute
  console.log(await page.$eval('#hplogo', img => img.getAttribute('src')))
  console.log(await page.$eval('input.gLFyf.gsfi', el => el.getAttribute('title')))

  await browser.close()
}

main()

  • Instead of Node's assert, we use Chai's expect.
  • We open Google in a headless browser and wait for the page to load.
  • Then we enter HeadlessTesting in the input field and verify that the value was entered.
  • The script looks for the main logo and returns its source attribute (the image URL).
  • Finally, the script will read the title attribute from the input field and log it before closing the browser session.

Waiting For...

Sometimes it's necessary to wait until a specific element is present or visible on the page.
Puppeteer offers several waitFor methods, such as page.waitForSelector() and page.waitForFunction().
You can specify the maximum time to wait and whether the element must be visible as well as present.


const puppeteer = require('puppeteer')

// await needs an async function in a CommonJS script
async function main () {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.headlesstesting.com?token=[YOUR-TOKEN]'
  })

  const page = await browser.newPage()

  // page.waitFor() is deprecated, so the dedicated methods are used here
  // wait for selector
  await page.waitForSelector('.foo')
  // wait for 1 second
  await new Promise(resolve => setTimeout(resolve, 1000))
  // wait for predicate
  await page.waitForFunction(() => !!document.querySelector('.foo'))

  await browser.close()
}

main()
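The maximum wait time and the visibility requirement mentioned above are passed as options to page.waitForSelector(). Here is a small sketch; waitForVisible is a hypothetical helper, and checking the error's name for 'TimeoutError' is an assumption about how Puppeteer reports timeouts.

```javascript
// Hypothetical helper: resolves to true once `selector` is present AND
// visible, or to false when `timeoutMs` milliseconds elapse first.
async function waitForVisible (page, selector, timeoutMs) {
  try {
    await page.waitForSelector(selector, { visible: true, timeout: timeoutMs })
    return true
  } catch (err) {
    // assumption: Puppeteer's timeout errors carry the name 'TimeoutError'
    if (err.name === 'TimeoutError') return false
    // anything else (for example a closed page) is a real failure
    throw err
  }
}
```

For example, await waitForVisible(page, '.foo', 5000) waits at most 5 seconds before giving up.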

Navigation timeouts/wait

When opening a page, Puppeteer by default resolves the promise once the page's load event has fired.
You can change this behavior by passing the waitUntil option to page.goto:

  • load - when the load event of the page has fired
  • domcontentloaded - navigation is finished when the DOMContentLoaded event has fired
  • networkidle0 - consider navigation to be finished when there are no more than 0 network connections for at least 500 ms
  • networkidle2 - consider navigation to be finished when there are no more than 2 network connections for at least 500 ms
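As a rough illustration, the waitUntil option can be combined with the timeout option of page.goto to cap how long a navigation may take. openPage and WAIT_UNTIL are hypothetical names introduced for this sketch only.

```javascript
// The four waitUntil values listed above
const WAIT_UNTIL = ['load', 'domcontentloaded', 'networkidle0', 'networkidle2']

// Hypothetical helper: navigates `page` to `url` and gives up after
// `timeoutMs` milliseconds.
async function openPage (page, url, waitUntil = 'load', timeoutMs = 30000) {
  if (!WAIT_UNTIL.includes(waitUntil)) {
    throw new Error(`Unknown waitUntil value: ${waitUntil}`)
  }
  // timeout is the maximum navigation time in milliseconds
  return page.goto(url, { waitUntil, timeout: timeoutMs })
}
```

Calling await openPage(page, 'https://www.wikipedia.org/', 'networkidle2', 5000) would wait until the network is nearly idle, but for no longer than 5 seconds.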

Cookies and Headers

It might be necessary to read or write cookies in your script, or to retrieve or set HTTP headers.
Fortunately, this is easy to do with Puppeteer.


Setting Cookies and Headers

To set cookies, you can use page.setCookie(...cookies).

To send extra HTTP headers with every request the page makes, you can use page.setExtraHTTPHeaders(headers).
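A short sketch of both calls follows. The cookie name, value and domain, as well as the header names and values, are placeholder values chosen for illustration, and prepareSession is a hypothetical helper name.

```javascript
// Hypothetical helper: call it before page.goto() so the cookie and the
// headers are already in place for the first request.
async function prepareSession (page) {
  // name, value and domain are placeholder values
  await page.setCookie({ name: 'session', value: 'abc123', domain: 'www.example.com' })
  // header values must be strings
  await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US', 'X-Example': '1' })
}
```

page.setCookie() accepts several cookie objects at once, so more cookies can be passed as additional arguments.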


Reading Cookies and Headers

The following example retrieves the response headers and the cookies of a page:


const puppeteer = require('puppeteer')

// await needs an async function in a CommonJS script
async function main () {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.headlesstesting.com?token=[YOUR-TOKEN]'
  })

  const page = await browser.newPage()
  // page.goto resolves to the response of the main resource
  const response = await page.goto('https://google.com/', { waitUntil: 'networkidle2' })
  const headers = response.headers()
  console.log(headers)

  // cookies for the current page URL
  const cookies = await page.cookies()
  console.log(cookies)

  await browser.close()
}

main()
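Once the headers and cookies are available, picking out a single value is plain JavaScript. The two helpers below are hypothetical and only sketch one way to do it; they rely on response.headers() returning lower-case header names and on page.cookies() returning an array of objects with name and value properties.

```javascript
// Hypothetical helper: returns the value of the cookie called `name`,
// or null when the array from page.cookies() does not contain it.
function findCookie (cookies, name) {
  const cookie = cookies.find(c => c.name === name)
  return cookie ? cookie.value : null
}

// Hypothetical helper: header names from response.headers() are lower-case,
// so the lookup key is lower-cased as well.
function getHeader (headers, name) {
  const value = headers[name.toLowerCase()]
  return value === undefined ? null : value
}
```

For instance, getHeader(response.headers(), 'Content-Type') looks up the content-type entry.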

User-Agent

By default, the Puppeteer User-Agent will contain HeadlessChrome.
You can change the user-agent like this:


const puppeteer = require('puppeteer')

// await needs an async function in a CommonJS script
async function main () {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.headlesstesting.com?token=[YOUR-TOKEN]'
  })

  const page = await browser.newPage()
  // applies to every request this page makes from now on
  await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36')

  await browser.close()
}

main()
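Instead of hard-coding a full user-agent string, one option is to take the browser's own value and remove only the Headless marker. The sketch below assumes a connected browser and page; stripHeadless and useRegularUserAgent are hypothetical names.

```javascript
// Pure helper: drop the 'Headless' marker from a user-agent string
function stripHeadless (userAgent) {
  return userAgent.replace('HeadlessChrome', 'Chrome')
}

// Hypothetical usage with a connected browser and page
async function useRegularUserAgent (browser, page) {
  // browser.userAgent() resolves to the browser's default user agent
  const original = await browser.userAgent()
  await page.setUserAgent(stripHeadless(original))
  // read the value back inside the page to confirm the override
  return page.evaluate(() => navigator.userAgent)
}
```

This keeps the version numbers in the user agent in step with the browser that is actually running.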