Playwright Scraping Quick Start

Get up and running with Playwright scraping in 5 minutes.

Installation

1. Install Node.js

If you don't have Node.js installed:

macOS (using Homebrew):

brew install node

Ubuntu/Debian:

curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs

Windows: Download from nodejs.org

2. Install Dependencies

cd rss-feedmonitor
npm install
npx playwright install chromium

This will install:

  • Playwright test framework
  • Chromium browser
  • All necessary dependencies

Basic Usage

Test a Single Query

Search Google with human-like behavior:

node scripts/playwright-scraper.js '"macbook repair" Toronto'

Output will show:

  • Number of results found
  • First 5 result titles and URLs
  • Result statistics from Google

Scrape a Specific Website

node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"

Validate Multiple Alerts

Test queries from your markdown files:

# Test 5 random alerts
node scripts/validate-scraping.js docs/google-alerts-broad.md

# Test 3 alerts with a 10-second delay between each
node scripts/validate-scraping.js docs/google-alerts.md --max 3 --delay 10000

# Run in headless mode (no visible browser)
node scripts/validate-scraping.js docs/google-alerts-broad.md --headless

This generates a JSON report with:

  • Success/failure for each query
  • Result counts
  • Google's result statistics
  • Full result details
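
An entry in the report looks roughly like the sketch below (field names and values here are illustrative, not the validator's guaranteed schema; inspect a generated report for the real shape):

{
  "query": "\"macbook repair\" Toronto",
  "success": true,
  "resultCount": 8,
  "stats": "About 1,240,000 results",
  "results": [{ "title": "…", "url": "…" }]
}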

Run Examples

See demonstrations of different scraping scenarios:

# Run all examples
node scripts/example-usage.js

# Run specific example
node scripts/example-usage.js 1  # Google search
node scripts/example-usage.js 2  # Reddit scraping
node scripts/example-usage.js 3  # Multi-step navigation
node scripts/example-usage.js 4  # Mouse patterns

Run Tests

Execute the test suite:

# Run with visible browser (see what's happening)
npm run test:headed

# Run in headless mode (faster)
npm test

What Makes It "Human-like"?

The scraper includes several anti-detection features, each illustrated with a short sketch below:

1. Realistic Mouse Movements

  • Smooth Bézier curves instead of straight lines
  • Occasional overshooting (15% chance)
  • Random speeds and accelerations
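
A minimal sketch of how such a curve can be driven through Playwright's page.mouse.move (the project's real implementation lives in scripts/human-behavior.js; bezierMove is a hypothetical stand-in):

// Illustrative only; the real helper is in scripts/human-behavior.js.
// Moves the mouse along a quadratic Bézier curve with a random control point.
async function bezierMove(page, from, to, steps = 25) {
  const control = {
    x: (from.x + to.x) / 2 + (Math.random() - 0.5) * 100,
    y: (from.y + to.y) / 2 + (Math.random() - 0.5) * 100,
  };
  for (let i = 1; i <= steps; i++) {
    const t = i / steps;
    // Quadratic Bézier: (1-t)^2*P0 + 2*(1-t)*t*P1 + t^2*P2
    const x = (1 - t) ** 2 * from.x + 2 * (1 - t) * t * control.x + t ** 2 * to.x;
    const y = (1 - t) ** 2 * from.y + 2 * (1 - t) * t * control.y + t ** 2 * to.y;
    await page.mouse.move(x, y);
  }
}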

2. Natural Scrolling

  • Random amounts (100-400 pixels)
  • Variable delays (0.5-2 seconds)
  • Occasionally scrolls up instead of down
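
One scroll step might look like this (humanScrollOnce is a hypothetical name; the project's helper is humanScroll in scripts/human-behavior.js):

// Illustrative sketch of a single human-like scroll step.
async function humanScrollOnce(page) {
  const amount = 100 + Math.random() * 300;        // 100-400 px
  const direction = Math.random() < 0.1 ? -1 : 1;  // occasionally scroll up
  await page.mouse.wheel(0, direction * amount);
  await new Promise((r) => setTimeout(r, 500 + Math.random() * 1500)); // 0.5-2 s pause
}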

3. Human-like Typing

  • Variable delay between keystrokes (50-150ms)
  • Occasional typos that get corrected (2% chance)
  • Longer pauses after spaces and punctuation
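
In sketch form (humanTypeInto is a hypothetical stand-in for the project's humanType helper):

// Illustrative sketch of human-like typing with occasional corrected typos.
async function humanTypeInto(page, selector, text) {
  await page.click(selector);
  for (const char of text) {
    if (Math.random() < 0.02) {                    // 2% typo rate
      await page.keyboard.type('x');
      await page.keyboard.press('Backspace');
    }
    await page.keyboard.type(char, { delay: 50 + Math.random() * 100 });
    if (char === ' ' || /[.,!?]/.test(char)) {
      await new Promise((r) => setTimeout(r, 100 + Math.random() * 200)); // pause after spaces/punctuation
    }
  }
}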

4. Randomized Fingerprints

  • Random viewport sizes (1366x768, 1920x1080, etc.)
  • Rotated user agents
  • Realistic browser headers
  • Geolocation set to Toronto
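
Playwright exposes all of these through browser.newContext. A sketch of the idea (getHumanizedContext in scripts/human-behavior.js is the project's real implementation; pick and the example values are illustrative):

// Illustrative randomized fingerprint; see getHumanizedContext for the real one.
const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];

const context = await browser.newContext({
  viewport: pick([{ width: 1366, height: 768 }, { width: 1920, height: 1080 }]),
  userAgent: pick([
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  ]),
  geolocation: { latitude: 43.6532, longitude: -79.3832 }, // Toronto
  permissions: ['geolocation'],
});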

5. Reading Simulation

  • Random mouse movements while "reading"
  • Occasional scrolling
  • Natural pauses
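
A rough sketch of the idea (simulateReading is a hypothetical name, not the project's API):

// Illustrative reading simulation: idle mouse drift, occasional scroll, pauses.
async function simulateReading(page, seconds = 5) {
  const end = Date.now() + seconds * 1000;
  while (Date.now() < end) {
    await page.mouse.move(200 + Math.random() * 600, 200 + Math.random() * 400);
    if (Math.random() < 0.3) await page.mouse.wheel(0, 100 + Math.random() * 200);
    await new Promise((r) => setTimeout(r, 800 + Math.random() * 1200));
  }
}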

Configuration

Edit scripts/scraper-config.js to customize:

export const config = {
  humanBehavior: {
    mouse: {
      overshootChance: 0.15,    // Chance of overshooting target
      overshootDistance: 20,     // Pixels to overshoot
    },
    scroll: {
      minAmount: 100,            // Min scroll distance
      maxAmount: 400,            // Max scroll distance
      minDelay: 500,             // Min delay between scrolls
      maxDelay: 2000,            // Max delay between scrolls
    },
    typing: {
      minDelay: 50,              // Min ms between keys
      maxDelay: 150,             // Max ms between keys
      mistakeChance: 0.02,       // 2% typo rate
    }
  }
};
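
Any script can then read these values through a standard ESM import (assuming it lives alongside the config in scripts/):

import { config } from './scraper-config.js';

const { minDelay, maxDelay } = config.humanBehavior.typing;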

Common Issues & Solutions

"Browser not found" error

Run:

npx playwright install chromium

Rate limiting / CAPTCHA

Increase delays between requests:

node scripts/validate-scraping.js docs/google-alerts.md --delay 15000

Or add delays in your code:

await randomDelay(10000, 15000); // 10-15 second delay
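
If you need a standalone equivalent of that helper, a minimal version is:

// Self-contained stand-in for the randomDelay helper used above.
function randomDelay(min, max) {
  const ms = min + Math.random() * (max - min);
  return new Promise((resolve) => setTimeout(resolve, ms));
}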

Element not found errors

Increase wait times or add explicit waits:

await page.waitForSelector('div.g', { timeout: 30000 });

Tests timeout

Increase timeout in playwright.config.js:

timeout: 120 * 1000,  // 2 minutes

Best Practices

1. Always Add Delays

// Wait between searches
await randomDelay(5000, 10000);

2. Use Headless Mode in Production

const browser = await chromium.launch({ headless: true });

3. Handle Errors Gracefully

try {
  const result = await validateQuery(browser, query);
} catch (error) {
  console.error('Failed:', error.message);
  // Continue or retry
}
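
For transient failures (timeouts, rate limits) a small retry wrapper is often enough. A sketch, assuming the randomDelay helper from above (withRetries is a hypothetical name):

// Retry an async operation a few times, backing off between attempts.
async function withRetries(fn, attempts = 3) {
  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === attempts) throw error;
      console.warn(`Attempt ${i} failed, retrying:`, error.message);
      await randomDelay(5000, 10000);
    }
  }
}

const result = await withRetries(() => validateQuery(browser, query));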

4. Respect Rate Limits

  • Don't exceed 10 requests per minute
  • Add longer delays for production use
  • Consider using proxies for high volume
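
A simple way to hold that ceiling is to enforce a minimum gap between requests; spacing them at least 6 seconds apart caps throughput at 10 per minute (throttled is a hypothetical helper, not part of the project):

// Enforce a minimum gap between requests (a 6 s gap = max 10/minute).
const MIN_GAP_MS = 6000;
let lastRequest = 0;

async function throttled(fn) {
  const wait = lastRequest + MIN_GAP_MS - Date.now();
  if (wait > 0) await new Promise((r) => setTimeout(r, wait));
  lastRequest = Date.now();
  return fn();
}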

5. Check robots.txt

Before scraping any site:

curl https://example.com/robots.txt
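
The same check works from Node, using the built-in fetch available in Node 18+:

// Fetch and print a site's robots.txt before scraping it.
const res = await fetch('https://example.com/robots.txt');
if (res.ok) console.log(await res.text());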

Next Steps

  1. Read Full Documentation: See docs/PLAYWRIGHT_SCRAPING.md
  2. Customize Behaviors: Edit scripts/scraper-config.js
  3. Write Custom Scripts: Use the human-behavior library in your own scripts
  4. Run Tests: Validate your Google Alert queries

Example: Custom Script

import { chromium } from 'playwright';
import { 
  getHumanizedContext, 
  humanClick, 
  humanType, 
  humanScroll 
} from './scripts/human-behavior.js';

const browser = await chromium.launch({ headless: false });
const context = await getHumanizedContext(browser);
const page = await context.newPage();

// Your scraping logic here
await page.goto('https://example.com');
await humanScroll(page, { scrollCount: 3 });
await humanClick(page, 'button.submit');

await browser.close();

Getting Help

  • Full API documentation: docs/PLAYWRIGHT_SCRAPING.md
  • Example code: scripts/example-usage.js
  • Test examples: tests/human-behavior.test.js

Happy scraping! 🚀