rss-feedmonitor/docs/PLAYWRIGHT_SCRAPING.md

Playwright Scraping with Human-like Behavior

This directory contains Playwright-based scraping and validation tools with built-in human-like behaviors that reduce the likelihood of bot detection.

Features

🤖 Anti-Detection Behaviors

  • Realistic Mouse Movements: Smooth bezier curve paths with occasional overshooting
  • Natural Scrolling: Random intervals and amounts with occasional direction changes
  • Human Timing: Variable delays between actions mimicking real user behavior
  • Typing Simulation: Realistic keystroke timing with occasional typos and corrections
  • Reading Simulation: Random mouse movements and scrolling to mimic content reading
  • Browser Fingerprinting: Randomized viewports, user agents, and device settings

📦 Components

  1. human-behavior.js - Core library with all human-like behavior utilities
  2. playwright-scraper.js - Main scraper for Google searches and website scraping
  3. validate-scraping.js - Batch validation tool for Google Alert queries
  4. scraper-config.js - Configuration file for fine-tuning behaviors
  5. human-behavior.test.js - Example tests demonstrating usage

Installation

npm install
npx playwright install chromium

Usage

1. Basic Google Search Validation

Test a single Google Alert query:

node scripts/playwright-scraper.js '"macbook repair" Toronto'

2. Scrape a Specific Website

node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"

3. Batch Validate Google Alerts

Validate multiple alerts from your markdown files:

# Test 5 random alerts from the file
node scripts/validate-scraping.js docs/google-alerts-broad.md

# Test a specific number of alerts with a custom delay
node scripts/validate-scraping.js docs/google-alerts.md --max 3 --delay 8000

# Run in headless mode
node scripts/validate-scraping.js docs/google-alerts-broad.md --headless

4. Run Tests

# Run all tests (headed mode)
npm run test:headed

# Run specific test file
npx playwright test tests/human-behavior.test.js --headed

# Run in headless mode
npm test

Human Behavior Library API

Mouse Movement

import { humanMouseMove, randomMouseMovements } from './scripts/human-behavior.js';

// Move mouse to specific coordinates with natural path
await humanMouseMove(page, { x: 500, y: 300 }, {
  overshootChance: 0.15,     // 15% chance to overshoot
  overshootDistance: 20,      // pixels to overshoot
  steps: 25,                  // bezier curve steps
  stepDelay: 10               // ms between steps
});

// Random mouse movements (simulating reading)
await randomMouseMovements(page, 3); // 3 random movements

Scrolling

import { humanScroll, scrollToElement } from './scripts/human-behavior.js';

// Natural scrolling with random patterns
await humanScroll(page, {
  direction: 'down',         // 'down' or 'up'
  scrollCount: 3,            // number of scroll actions
  minScroll: 100,            // min pixels per scroll
  maxScroll: 400,            // max pixels per scroll
  minDelay: 500,             // min delay between scrolls
  maxDelay: 2000,            // max delay between scrolls
  randomDirection: true      // occasionally scroll opposite
});

// Scroll to specific element
await scrollToElement(page, 'h1.title');

Clicking

import { humanClick } from './scripts/human-behavior.js';

// Click with human-like behavior
await humanClick(page, 'button.submit', {
  moveToElement: true,        // move mouse to element first
  doubleClickChance: 0.02     // 2% chance of accidental double-click
});

Typing

import { humanType } from './scripts/human-behavior.js';

// Type with realistic timing and occasional mistakes
await humanType(page, 'input[name="search"]', 'my search query', {
  minDelay: 50,               // min ms between keystrokes
  maxDelay: 150,              // max ms between keystrokes
  mistakes: 0.02              // 2% chance of typo
});

Reading Simulation

import { simulateReading } from './scripts/human-behavior.js';

// Simulate reading behavior (scrolling + mouse movements + pauses)
await simulateReading(page, 5000); // for 5 seconds

Browser Context

import { getHumanizedContext } from './scripts/human-behavior.js';

// Create browser context with randomized fingerprint
const context = await getHumanizedContext(browser, {
  locale: 'en-CA',
  timezone: 'America/Toronto',
  viewport: { width: 1920, height: 1080 } // or null for random
});

const page = await context.newPage();

Delays

import { randomDelay } from './scripts/human-behavior.js';

// Random delay between actions
await randomDelay(500, 1500); // 500-1500ms

Configuration

Edit scripts/scraper-config.js to customize behavior parameters:

export const config = {
  humanBehavior: {
    mouse: {
      overshootChance: 0.15,
      overshootDistance: 20,
      // ... more options
    },
    scroll: {
      minAmount: 100,
      maxAmount: 400,
      // ... more options
    },
    typing: {
      minDelay: 50,
      maxDelay: 150,
      mistakeChance: 0.02,
      // ... more options
    }
  }
};

Example: Complete Scraping Workflow

import { chromium } from 'playwright';
import {
  getHumanizedContext,
  humanClick,
  humanType,
  humanScroll,
  simulateReading,
  randomDelay
} from './scripts/human-behavior.js';

const browser = await chromium.launch({ headless: false });
const context = await getHumanizedContext(browser);
const page = await context.newPage();

try {
  // Navigate to Google
  await page.goto('https://www.google.com');
  await randomDelay(1000, 2000);
  
  // Search with human behavior
  await humanClick(page, 'textarea[name="q"]');
  await humanType(page, 'textarea[name="q"]', 'my search');
  await page.keyboard.press('Enter');
  
  // Wait and scroll
  await page.waitForLoadState('networkidle');
  await randomDelay(1500, 2500);
  await humanScroll(page, { scrollCount: 3 });
  
  // Simulate reading
  await simulateReading(page, 5000);
  
  // Extract results
  const results = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('div.g')).map(el => ({
      title: el.querySelector('h3')?.innerText,
      url: el.querySelector('a')?.href
    }));
  });
  
  console.log(`Found ${results.length} results`);
  
} finally {
  await page.close();
  await context.close();
  await browser.close();
}

Validation Report Format

The validation tool generates JSON reports with the following structure:

{
  "total": 5,
  "successful": 4,
  "failed": 1,
  "successRate": 80,
  "results": [
    {
      "name": "MacBook Repair - Ontario",
      "query": "\"macbook repair\" Toronto",
      "success": true,
      "resultCount": 15,
      "stats": "About 1,234 results (0.45 seconds)",
      "results": [...]
    }
  ]
}
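Because the report is plain JSON, failed queries can be pulled out with a few lines of Node. The helper below is a hypothetical sketch (listFailures is not part of this repo, and validation-report.json is an example path); it assumes the structure shown above.

```javascript
// Hypothetical helper: collect the queries that failed in a report.
// Assumes the JSON structure documented above.
function listFailures(report) {
  return report.results.filter(r => !r.success).map(r => r.query);
}

// Usage (the report path is an example):
//   import { readFileSync } from 'node:fs';
//   const report = JSON.parse(readFileSync('validation-report.json', 'utf8'));
//   console.log(listFailures(report));
```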

Best Practices

1. Rate Limiting

Always add delays between requests to avoid rate limiting:

// Wait 5-10 seconds between searches
await randomDelay(5000, 10000);

2. Randomization

Use randomization to make behavior less predictable:

// Randomize viewport
const context = await getHumanizedContext(browser); // picks random viewport

// Randomize test order
node scripts/validate-scraping.js docs/google-alerts.md --max 5

3. Headless Mode

For production, use headless mode:

const browser = await chromium.launch({
  headless: true,
  args: ['--disable-blink-features=AutomationControlled']
});

4. Error Handling

Always wrap scraping in try-catch blocks:

try {
  const result = await scrapeWebsite(browser, url);
} catch (error) {
  console.error('Scraping failed:', error.message);
  // Implement retry logic or alerting
}

5. Respect robots.txt

Always check and respect website robots.txt files:

curl https://example.com/robots.txt
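For automated checks, a minimal prefix matcher can be sketched as below. isPathAllowed is not part of this repo, and a real parser (for example the robots-parser npm package) also handles Allow rules, wildcards, and per-agent groups; this only covers the simplest Disallow prefixes for `*` user agents.

```javascript
// Naive robots.txt check: collect Disallow prefixes from the '*' group and
// test whether a path starts with any of them. Illustrative only.
function isPathAllowed(robotsTxt, path) {
  let applies = false;        // are we inside a 'User-agent: *' group?
  const disallows = [];
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.split('#')[0].trim();   // strip comments
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key)) applies = value === '*';
    else if (applies && /^disallow$/i.test(key) && value) disallows.push(value);
  }
  return !disallows.some(prefix => path.startsWith(prefix));
}
```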

Troubleshooting

"Element not found" errors

  • Increase wait times in config
  • Use page.waitForSelector() before actions
  • Check if selectors have changed
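The wait-then-act pattern can be wrapped in a small guard. clickWhenReady is a hypothetical helper (not part of human-behavior.js), and page.click can be swapped for the library's humanClick:

```javascript
// Hypothetical guard: wait for a selector to become visible before acting,
// returning false instead of throwing when it never appears.
async function clickWhenReady(page, selector, timeoutMs = 15000) {
  try {
    await page.waitForSelector(selector, { state: 'visible', timeout: timeoutMs });
  } catch {
    return false; // selector never showed up; let the caller decide
  }
  await page.click(selector); // swap for humanClick(page, selector) in real runs
  return true;
}
```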

Rate limiting / CAPTCHA

  • Increase delays between requests
  • Use different IP addresses (proxies)
  • Reduce request frequency
  • Add more randomization to behavior
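When requests do start failing, retrying with exponential backoff plus jitter is a common mitigation. backoffDelay and withRetries below are illustrative sketches, not part of this repo, and the base/cap values are examples:

```javascript
// Exponential backoff with jitter: attempt 0 waits ~2.5-5s, attempt 1 ~5-10s,
// and so on, capped at maxMs.
function backoffDelay(attempt, baseMs = 5000, maxMs = 60000) {
  const cap = Math.min(baseMs * 2 ** attempt, maxMs);
  return cap / 2 + Math.random() * (cap / 2); // 50-100% of the cap
}

// Retry wrapper: rethrows the last error once attempts are exhausted.
async function withRetries(fn, { maxAttempts = 3, delayFn = backoffDelay } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        await new Promise(resolve => setTimeout(resolve, delayFn(attempt)));
      }
    }
  }
  throw lastError;
}
```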

Tests timing out

  • Increase timeout in Playwright config
  • Check network connectivity
  • Verify selectors are correct
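"Increase timeout" translates to a couple of lines in the Playwright Test config; the values here are examples, not project defaults:

```javascript
// playwright.config.js — per-test and per-expect timeouts (example values)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  timeout: 60000,             // each test gets 60s instead of the 30s default
  expect: { timeout: 10000 }  // each expect() polls for up to 10s
});
```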

Advanced Features

Custom Selectors

Override default selectors in config:

const config = {
  targets: {
    google: {
      resultSelector: 'div.g',
      titleSelector: 'h3',
      // ... custom selectors
    }
  }
};

Proxy Support

Add proxy configuration:

const context = await browser.newContext({
  proxy: {
    server: 'http://proxy.example.com:8080',
    username: 'user',
    password: 'pass'
  }
});

Screenshot on Error

Capture screenshots for debugging:

try {
  await humanClick(page, 'button.submit');
} catch (error) {
  await page.screenshot({ path: 'error.png', fullPage: true });
  throw error;
}

⚠️ Important: Always ensure your scraping activities comply with:

  1. Website Terms of Service
  2. robots.txt directives
  3. Local laws and regulations
  4. Rate limiting and server load considerations

Use these tools responsibly and ethically.

Contributing

To add new behaviors or improve existing ones:

  1. Add function to human-behavior.js
  2. Add configuration to scraper-config.js
  3. Add tests to human-behavior.test.js
  4. Update this documentation
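A new behavior would typically mirror the existing signatures: (page, target, options object with defaults). humanHover below is purely illustrative, not an existing function:

```javascript
// Illustrative new behavior: hover over an element, then dwell for a random
// interval, following the (page, selector, options) shape used by humanClick.
async function humanHover(page, selector, { dwellMin = 300, dwellMax = 1200 } = {}) {
  await page.hover(selector);
  const dwell = dwellMin + Math.random() * (dwellMax - dwellMin);
  await new Promise(resolve => setTimeout(resolve, dwell));
  return dwell; // handy for logging and tests
}
```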

License

See main project LICENSE file.