In this post, I’ll walk you through how to build a web scraper cron job using SST Ion, AWS Lambda, and Puppeteer. This cron job will let you perform data retrieval and other tasks at regular intervals without the need for a dedicated server.
You will need an AWS account before getting started. You will also need to configure the AWS CLI and your AWS credentials to follow along.
Start by initializing a new directory and setting up the project using Yarn and SST Ion.
% mkdir serverless-cron-puppeteer && cd serverless-cron-puppeteer && yarn init -y
yarn init v1.22.19
warning The yes flag has been set. This will automatically answer yes to all questions, which may have security implications.
success Saved package.json
✨ Done in 0.03s.
Now, initialize SST:
% npx sst@latest init
███████╗███████╗████████╗
██╔════╝██╔════╝╚══██╔══╝
███████╗███████╗ ██║
╚════██║╚════██║ ██║
███████║███████║ ██║
╚══════╝╚══════╝ ╚═╝
> JS project detected. This will...
- use the JS template
- create an sst.config.ts
✓ Template: js
✓ Using: aws
✓ Done 🎉
Add the following code to the generated sst.config.ts to set up the cron job:
/// <reference path="./.sst/platform/config.d.ts" />

export default $config({
  app(input) {
    return {
      name: "serverless-cron-puppeteer",
      removal: "remove",
      home: "aws",
    };
  },
  async run() {
    new sst.aws.Cron("MyCronJob", {
      job: {
        handler: "cron.handler",
        timeout: "60 seconds",
        memory: "2 GB",
        nodejs: {
          install: ["@sparticuz/chromium"],
        },
      },
      schedule: "rate(1 minute)",
    });
  },
});
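The `rate(1 minute)` schedule above is handy while testing, but the `schedule` property also accepts EventBridge `cron()` expressions (six fields: minute, hour, day-of-month, month, day-of-week, year). As a sketch, a production job that scrapes once a day at noon UTC might look like this instead:

```typescript
// Hypothetical variant: run the scraper once a day at 12:00 UTC
// using an EventBridge cron() expression instead of rate().
new sst.aws.Cron("MyCronJob", {
  job: {
    handler: "cron.handler",
    timeout: "60 seconds",
    memory: "2 GB",
    nodejs: {
      install: ["@sparticuz/chromium"],
    },
  },
  schedule: "cron(0 12 * * ? *)",
});
```

Note that EventBridge cron expressions require either day-of-month or day-of-week to be `?`, which is why the expression above differs slightly from Unix cron syntax.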
You'll need puppeteer-core and @sparticuz/chromium. Installing them as dev dependencies works here because SST bundles puppeteer-core into the handler, while the `nodejs.install` setting in the config installs @sparticuz/chromium in the Lambda environment. Refer to the SST docs for more information on running Puppeteer in Lambda.
% yarn add -D puppeteer-core@23.1.1 @sparticuz/chromium@127.0.0
To run the code locally, you'll also need Chromium:
npx @puppeteer/browsers install chromium@latest --path /tmp/localChromium
Once installed, you’ll see the location of the Chromium binary, for example: /tmp/localChromium/chromium/mac_arm-1350406/chrome-mac/Chromium.app/Contents/MacOS/Chromium. Replace the install location in the following step with this path.
Add a cron.ts file to the root directory with the code below. The following example visits the Google Finance page for the USD to MXN exchange rate and prints the exchange rate and the time of the last update to the console. You can modify this code to scrape any other data you need from any website you want.
/// <reference lib="dom" />
import chromium from "@sparticuz/chromium";
import * as puppeteer from "puppeteer-core";

// Path to the local Chromium binary installed in the previous step
const YOUR_LOCAL_CHROMIUM_PATH =
  "/tmp/localChromium/chromium/mac_arm-1350406/chrome-mac/Chromium.app/Contents/MacOS/Chromium";

// Launch the browser once at module scope so warm Lambda invocations reuse it
const browser = await puppeteer.launch({
  args: [
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--disable-gpu",
    "--disable-software-rasterizer",
    "--disable-crash-reporter",
    "--disable-extensions",
    "--disable-cloud-management",
    "--disable-background-networking",
    "--disable-sync",
  ],
  // Use the local binary under `sst dev`, the bundled one in Lambda
  executablePath: process.env.SST_DEV
    ? YOUR_LOCAL_CHROMIUM_PATH
    : await chromium.executablePath(),
  headless: true,
});

export async function handler() {
  try {
    console.log("Running cron job");
    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(60000);

    const url = "https://www.google.com/finance/quote/USD-MXN";
    await page.goto(url, { waitUntil: "networkidle0" });
    console.log("Page loaded");

    // XPath selectors for the rate and its "last updated" timestamp;
    // waitForSelector may resolve to null, hence the optional chaining
    const fx = await page.waitForSelector(
      "xpath//html/body/c-wiz[2]/div/div[4]/div/main/div[2]/div[1]/c-wiz/div/div[1]/div/div[1]/div/div[1]/div/span/div/div"
    );
    const fxContent = await fx?.evaluate((el) => el.textContent);

    const time = await page.waitForSelector(
      "xpath//html/body/c-wiz[2]/div/div[4]/div/main/div[2]/div[1]/c-wiz/div/div[1]/div/div[2]"
    );
    const timeContent = await time?.evaluate((el) => el.textContent);

    console.log(fxContent);
    console.log(timeContent);

    await page.close();
    console.log("Page closed");
    // post to twitter, feed my real-time RAG, etc
  } catch (error) {
    console.log(error);
  }
}
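The scraped values come back as raw strings (e.g. "19.9196" for the rate), so if you want to do anything numeric with them, such as alerting on a threshold or storing a time series, it helps to parse them first. Here's a minimal sketch; `parseRate` is a hypothetical helper, not part of the tutorial code:

```typescript
// Hypothetical helper: convert scraped rate text into a number.
// Google Finance may render thousands separators (e.g. "1,234.56"),
// so strip commas before parsing; returns null on unparseable input.
function parseRate(text: string | null | undefined): number | null {
  if (!text) return null;
  const value = Number(text.replace(/,/g, "").trim());
  return Number.isFinite(value) ? value : null;
}

console.log(parseRate("19.9196")); // 19.9196
console.log(parseRate("1,234.56")); // 1234.56
console.log(parseRate("n/a")); // null
```

From there you could, for example, only trigger a downstream action when the parsed rate crosses a threshold.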
Run the following command to test your cron job locally. It spins up a local dev environment, and you'll start to see the exchange rate and the time of the last update printed in the console:
% npx sst dev
...
| Invoke MyCronJobHandler
| +14ms Running cron job
| +2.741s Page loaded
| +2.775s 19.9196
| +2.775s Sep 10, 6:33:00 AM UTC
| +2.797s Page closed
| Done took +2.993s
Run the following command to deploy your cron job:
% npx sst deploy --stage production
SST 3.0.118 ready!
➜ App: serverless-cron-puppeteer
Stage: production
~ Deploy
✓ Complete
🚀🚀🚀 And that's it! You've successfully built a web scraper and a cron job using SST Ion, AWS Lambda, and Puppeteer. You can now schedule data retrieval and other tasks at regular intervals without the need for a dedicated server.