Playwright Scraping Quick Start

Get up and running with Playwright scraping in 5 minutes.

Installation

1. Install Node.js

If you don't have Node.js installed:

macOS (using Homebrew):

brew install node

Ubuntu/Debian:

curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs

Windows: Download from nodejs.org

2. Install Dependencies

cd rss-feedmonitor
npm install
npx playwright install chromium

This will install:

  • Playwright test framework
  • Chromium browser
  • All necessary dependencies

Basic Usage

Test a Single Query

Search Google with human-like behavior:

node scripts/playwright-scraper.js '"macbook repair" Toronto'

Output will show:

  • Number of results found
  • First 5 result titles and URLs
  • Result statistics from Google

Scrape a Specific Website

node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"

Validate Multiple Alerts

Test queries from your markdown files:

# Test 5 random alerts
node scripts/validate-scraping.js docs/google-alerts-broad.md

# Test 3 alerts with a 10-second delay between each
node scripts/validate-scraping.js docs/google-alerts.md --max 3 --delay 10000

# Run in headless mode (no visible browser)
node scripts/validate-scraping.js docs/google-alerts-broad.md --headless

This generates a JSON report with:

  • Success/failure for each query
  • Result counts
  • Google's result statistics
  • Full result details
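
An entry in the report looks roughly like the sketch below (field names and values here are illustrative, not the validator's guaranteed schema; inspect a generated report for the real shape):

{
  "query": "\"macbook repair\" Toronto",
  "success": true,
  "resultCount": 8,
  "stats": "About 1,240,000 results",
  "results": [{ "title": "…", "url": "…" }]
}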

Run Examples

See demonstrations of different scraping scenarios:

# Run all examples
node scripts/example-usage.js

# Run specific example
node scripts/example-usage.js 1  # Google search
node scripts/example-usage.js 2  # Reddit scraping
node scripts/example-usage.js 3  # Multi-step navigation
node scripts/example-usage.js 4  # Mouse patterns

Run Tests

Execute the test suite:

# Run with visible browser (see what's happening)
npm run test:headed

# Run in headless mode (faster)
npm test

What Makes It "Human-like"?

The scraper includes several anti-detection features, each illustrated with a short sketch below:

1. Realistic Mouse Movements

  • Smooth Bézier curves instead of straight lines
  • Occasional overshooting (15% chance)
  • Random speeds and accelerations
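
A minimal sketch of how such a curve can be driven through Playwright's page.mouse.move (the project's real implementation lives in scripts/human-behavior.js; bezierMove is a hypothetical stand-in):

// Illustrative only; the real helper is in scripts/human-behavior.js.
// Moves the mouse along a quadratic Bézier curve with a random control point.
async function bezierMove(page, from, to, steps = 25) {
  const control = {
    x: (from.x + to.x) / 2 + (Math.random() - 0.5) * 100,
    y: (from.y + to.y) / 2 + (Math.random() - 0.5) * 100,
  };
  for (let i = 1; i <= steps; i++) {
    const t = i / steps;
    // Quadratic Bézier: (1-t)^2*P0 + 2*(1-t)*t*P1 + t^2*P2
    const x = (1 - t) ** 2 * from.x + 2 * (1 - t) * t * control.x + t ** 2 * to.x;
    const y = (1 - t) ** 2 * from.y + 2 * (1 - t) * t * control.y + t ** 2 * to.y;
    await page.mouse.move(x, y);
  }
}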

2. Natural Scrolling

  • Random amounts (100-400 pixels)
  • Variable delays (0.5-2 seconds)
  • Occasionally scrolls up instead of down
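
One scroll step might look like this (humanScrollOnce is a hypothetical name; the project's helper is humanScroll in scripts/human-behavior.js):

// Illustrative sketch of a single human-like scroll step.
async function humanScrollOnce(page) {
  const amount = 100 + Math.random() * 300;        // 100-400 px
  const direction = Math.random() < 0.1 ? -1 : 1;  // occasionally scroll up
  await page.mouse.wheel(0, direction * amount);
  await new Promise((r) => setTimeout(r, 500 + Math.random() * 1500)); // 0.5-2 s pause
}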

3. Human-like Typing

  • Variable delay between keystrokes (50-150ms)
  • Occasional typos that get corrected (2% chance)
  • Longer pauses after spaces and punctuation
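
In sketch form (humanTypeInto is a hypothetical stand-in for the project's humanType helper):

// Illustrative sketch of human-like typing with occasional corrected typos.
async function humanTypeInto(page, selector, text) {
  await page.click(selector);
  for (const char of text) {
    if (Math.random() < 0.02) {                    // 2% typo rate
      await page.keyboard.type('x');
      await page.keyboard.press('Backspace');
    }
    await page.keyboard.type(char, { delay: 50 + Math.random() * 100 });
    if (char === ' ' || /[.,!?]/.test(char)) {
      await new Promise((r) => setTimeout(r, 100 + Math.random() * 200)); // pause after spaces/punctuation
    }
  }
}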

4. Randomized Fingerprints

  • Random viewport sizes (1366x768, 1920x1080, etc.)
  • Rotated user agents
  • Realistic browser headers
  • Geolocation set to Toronto
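
Playwright exposes all of these through browser.newContext. A sketch of the idea (getHumanizedContext in scripts/human-behavior.js is the project's real implementation; pick and the example values are illustrative):

// Illustrative randomized fingerprint; see getHumanizedContext for the real one.
const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];

const context = await browser.newContext({
  viewport: pick([{ width: 1366, height: 768 }, { width: 1920, height: 1080 }]),
  userAgent: pick([
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  ]),
  geolocation: { latitude: 43.6532, longitude: -79.3832 }, // Toronto
  permissions: ['geolocation'],
});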

5. Reading Simulation

  • Random mouse movements while "reading"
  • Occasional scrolling
  • Natural pauses
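
A rough sketch of the idea (simulateReading is a hypothetical name, not the project's API):

// Illustrative reading simulation: idle mouse drift, occasional scroll, pauses.
async function simulateReading(page, seconds = 5) {
  const end = Date.now() + seconds * 1000;
  while (Date.now() < end) {
    await page.mouse.move(200 + Math.random() * 600, 200 + Math.random() * 400);
    if (Math.random() < 0.3) await page.mouse.wheel(0, 100 + Math.random() * 200);
    await new Promise((r) => setTimeout(r, 800 + Math.random() * 1200));
  }
}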

Configuration

Edit scripts/scraper-config.js to customize:

export const config = {
  humanBehavior: {
    mouse: {
      overshootChance: 0.15,    // Chance of overshooting target
      overshootDistance: 20,     // Pixels to overshoot
    },
    scroll: {
      minAmount: 100,            // Min scroll distance
      maxAmount: 400,            // Max scroll distance
      minDelay: 500,             // Min delay between scrolls
      maxDelay: 2000,            // Max delay between scrolls
    },
    typing: {
      minDelay: 50,              // Min ms between keys
      maxDelay: 150,             // Max ms between keys
      mistakeChance: 0.02,       // 2% typo rate
    }
  }
};
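
Any script can then read these values through a standard ESM import (assuming it lives alongside the config in scripts/):

import { config } from './scraper-config.js';

const { minDelay, maxDelay } = config.humanBehavior.typing;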

Common Issues & Solutions

"Browser not found" error

Run:

npx playwright install chromium

Rate limiting / CAPTCHA

Increase delays between requests:

node scripts/validate-scraping.js docs/google-alerts.md --delay 15000

Or add delays in your code:

await randomDelay(10000, 15000); // 10-15 second delay
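
If you need a standalone equivalent of that helper, a minimal version is:

// Self-contained stand-in for the randomDelay helper used above.
function randomDelay(min, max) {
  const ms = min + Math.random() * (max - min);
  return new Promise((resolve) => setTimeout(resolve, ms));
}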

Element not found errors

Increase wait times or add explicit waits:

await page.waitForSelector('div.g', { timeout: 30000 });

Tests timeout

Increase timeout in playwright.config.js:

timeout: 120 * 1000,  // 2 minutes

Best Practices

1. Always Add Delays

// Wait between searches
await randomDelay(5000, 10000);

2. Use Headless Mode in Production

const browser = await chromium.launch({ headless: true });

3. Handle Errors Gracefully

try {
  const result = await validateQuery(browser, query);
} catch (error) {
  console.error('Failed:', error.message);
  // Continue or retry
}
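
For transient failures (timeouts, rate limits) a small retry wrapper is often enough. A sketch, assuming the randomDelay helper from above (withRetries is a hypothetical name):

// Retry an async operation a few times, backing off between attempts.
async function withRetries(fn, attempts = 3) {
  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === attempts) throw error;
      console.warn(`Attempt ${i} failed, retrying:`, error.message);
      await randomDelay(5000, 10000);
    }
  }
}

const result = await withRetries(() => validateQuery(browser, query));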

4. Respect Rate Limits

  • Don't exceed 10 requests per minute
  • Add longer delays for production use
  • Consider using proxies for high volume
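
A simple way to hold that ceiling is to enforce a minimum gap between requests; spacing them at least 6 seconds apart caps throughput at 10 per minute (throttled is a hypothetical helper, not part of the project):

// Enforce a minimum gap between requests (a 6 s gap = max 10/minute).
const MIN_GAP_MS = 6000;
let lastRequest = 0;

async function throttled(fn) {
  const wait = lastRequest + MIN_GAP_MS - Date.now();
  if (wait > 0) await new Promise((r) => setTimeout(r, wait));
  lastRequest = Date.now();
  return fn();
}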

5. Check robots.txt

Before scraping any site:

curl https://example.com/robots.txt
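
The same check works from Node, using the built-in fetch available in Node 18+:

// Fetch and print a site's robots.txt before scraping it.
const res = await fetch('https://example.com/robots.txt');
if (res.ok) console.log(await res.text());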

Next Steps

  1. Read Full Documentation: See docs/PLAYWRIGHT_SCRAPING.md
  2. Customize Behaviors: Edit scripts/scraper-config.js
  3. Write Custom Scripts: Use the human-behavior library in your own scripts
  4. Run Tests: Validate your Google Alert queries

Example: Custom Script

import { chromium } from 'playwright';
import { 
  getHumanizedContext, 
  humanClick, 
  humanType, 
  humanScroll 
} from './scripts/human-behavior.js';

const browser = await chromium.launch({ headless: false });
const context = await getHumanizedContext(browser);
const page = await context.newPage();

// Your scraping logic here
await page.goto('https://example.com');
await humanScroll(page, { scrollCount: 3 });
await humanClick(page, 'button.submit');

await browser.close();

Getting Help

  • Full API documentation: docs/PLAYWRIGHT_SCRAPING.md
  • Example code: scripts/example-usage.js
  • Test examples: tests/human-behavior.test.js

Happy scraping! 🚀