In this post, I’ll walk you through how to build a web scraper cron job using SST Ion, AWS Lambda, and Puppeteer. This cron job will let you perform data retrieval and other tasks at regular intervals without the need for a dedicated server.
You will need an AWS account before getting started. You will also need to configure the AWS CLI and your AWS credentials to follow along.
Start by initializing a new directory and setting up the project using Yarn and SST Ion.
% mkdir serverless-cron-puppeteer && cd serverless-cron-puppeteer && yarn init -y
yarn init v1.22.19
warning The yes flag has been set. This will automatically answer yes to all questions, which may have security implications.
success Saved package.json
✨ Done in 0.03s.
Now, initialize SST:
% npx sst@latest init
███████╗███████╗████████╗
██╔════╝██╔════╝╚══██╔══╝
███████╗███████╗ ██║
╚════██║╚════██║ ██║
███████║███████║ ██║
╚══════╝╚══════╝ ╚═╝
> JS project detected. This will...
- use the JS template
- create an sst.config.ts
✓ Template: js
✓ Using: aws
✓ Done 🎉
Add the following code to the generated sst.config.ts to set up the cron job:
/// <reference path="./.sst/platform/config.d.ts" />

export default $config({
  app(input) {
    return {
      name: "serverless-cron-puppeteer",
      removal: "remove",
      home: "aws",
    };
  },
  async run() {
    new sst.aws.Cron("MyCronJob", {
      job: {
        handler: "cron.handler",
        timeout: "60 seconds",
        memory: "2 GB",
        nodejs: {
          install: ["@sparticuz/chromium"],
        },
      },
      schedule: "rate(1 minute)",
    });
  },
});
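The `rate(1 minute)` schedule above is handy while testing, but the `schedule` property also accepts EventBridge `cron()` expressions (six fields: minute, hour, day-of-month, month, day-of-week, year). As a sketch, a production job that scrapes once a day at noon UTC might look like this instead:

```typescript
// Hypothetical variant: run the scraper once a day at 12:00 UTC
// using an EventBridge cron() expression instead of rate().
new sst.aws.Cron("MyCronJob", {
  job: {
    handler: "cron.handler",
    timeout: "60 seconds",
    memory: "2 GB",
    nodejs: {
      install: ["@sparticuz/chromium"],
    },
  },
  schedule: "cron(0 12 * * ? *)",
});
```

Note that EventBridge cron expressions require either day-of-month or day-of-week to be `?`, which is why the expression above differs slightly from Unix cron syntax.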
You'll need puppeteer-core and @sparticuz/chromium. Installing them as dev dependencies works here because SST bundles puppeteer-core into the handler, while the `nodejs.install` setting in the config installs @sparticuz/chromium in the Lambda environment. Refer to the SST docs for more information on running Puppeteer in Lambda.
% yarn add -D puppeteer-core@23.1.1 @sparticuz/chromium@127.0.0
To run the code locally, you'll also need Chromium:
npx @puppeteer/browsers install chromium@latest --path /tmp/localChromium
Once installed, you’ll see the location of the Chromium binary, for example: /tmp/localChromium/chromium/mac_arm-1350406/chrome-mac/Chromium.app/Contents/MacOS/Chromium. Replace the install location in the following step with this path.
Add a cron.ts file to the root directory with the code below. The following example visits the Google Finance page for the USD to MXN exchange rate and prints the exchange rate and the time of the last update to the console. You can modify this code to scrape any other data you need from any website you want.
/// <reference lib="dom" />
import chromium from "@sparticuz/chromium";
import * as puppeteer from "puppeteer-core";

// Path to the local Chromium binary installed in the previous step
const YOUR_LOCAL_CHROMIUM_PATH =
  "/tmp/localChromium/chromium/mac_arm-1350406/chrome-mac/Chromium.app/Contents/MacOS/Chromium";

// Launch the browser once at module scope so warm Lambda invocations reuse it
const browser = await puppeteer.launch({
  args: [
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-gpu",
    "--disable-software-rasterizer",
    "--disable-crash-reporter",
    "--disable-extensions",
    "--disable-cloud-management",
    "--disable-background-networking",
    "--disable-sync",
  ],
  // Use the local binary under `sst dev`, the bundled one in Lambda
  executablePath: process.env.SST_DEV
    ? YOUR_LOCAL_CHROMIUM_PATH
    : await chromium.executablePath(),
  headless: true,
});

export async function handler() {
  try {
    console.log("Running cron job");
    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(60000);

    const url = "https://www.google.com/finance/quote/USD-MXN";
    await page.goto(url, { waitUntil: "networkidle0" });
    console.log("Page loaded");

    // XPath selectors for the rate and its "last updated" timestamp;
    // waitForSelector may resolve to null, hence the optional chaining
    const fx = await page.waitForSelector(
      "xpath//html/body/c-wiz[2]/div/div[4]/div/main/div[2]/div[1]/c-wiz/div/div[1]/div/div[1]/div/div[1]/div/span/div/div"
    );
    const fxContent = await fx?.evaluate((el) => el.textContent);

    const time = await page.waitForSelector(
      "xpath//html/body/c-wiz[2]/div/div[4]/div/main/div[2]/div[1]/c-wiz/div/div[1]/div/div[2]"
    );
    const timeContent = await time?.evaluate((el) => el.textContent);

    console.log(fxContent);
    console.log(timeContent);

    await page.close();
    console.log("Page closed");
    // post to twitter, feed my real-time RAG, etc
  } catch (error) {
    console.log(error);
  }
}
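The scraped values come back as raw strings (e.g. "19.9196" for the rate), so if you want to do anything numeric with them, such as alerting on a threshold or storing a time series, it helps to parse them first. Here's a minimal sketch; `parseRate` is a hypothetical helper, not part of the tutorial code:

```typescript
// Hypothetical helper: convert scraped rate text into a number.
// Google Finance may render thousands separators (e.g. "1,234.56"),
// so strip commas before parsing; returns null on unparseable input.
function parseRate(text: string | null | undefined): number | null {
  if (!text) return null;
  const value = Number(text.replace(/,/g, "").trim());
  return Number.isFinite(value) ? value : null;
}

console.log(parseRate("19.9196")); // 19.9196
console.log(parseRate("1,234.56")); // 1234.56
console.log(parseRate("n/a")); // null
```

From there you could, for example, only trigger a downstream action when the parsed rate crosses a threshold.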
Run the following command to test your cron job locally. It spins up a local dev environment, and you'll start to see the exchange rate and the time of the last update printed in the console:
% npx sst dev
...
| Invoke MyCronJobHandler
| +14ms Running cron job
| +2.741s Page loaded
| +2.775s 19.9196
| +2.775s Sep 10, 6:33:00 AM UTC
| +2.797s Page closed
| Done took +2.993s
Run the following command to deploy your cron job:
% npx sst deploy --stage production
SST 3.0.118 ready!
➜ App: serverless-cron-puppeteer
Stage: production
~ Deploy
✓ Complete
🚀🚀🚀 And that's it! You've successfully built a web scraper and a cron job using SST Ion, AWS Lambda, and Puppeteer. You can now schedule data retrieval and other tasks at regular intervals without the need for a dedicated server.