# Playwright Scraping with Human-like Behavior
This directory contains Playwright-based scraping and validation tools with built-in human-like behaviors to avoid bot detection.
## Features

### 🤖 Anti-Detection Behaviors
- Realistic Mouse Movements: Smooth Bézier-curve paths with occasional overshooting (see the sketch after this list)
- Natural Scrolling: Random intervals and amounts with occasional direction changes
- Human Timing: Variable delays between actions mimicking real user behavior
- Typing Simulation: Realistic keystroke timing with occasional typos and corrections
- Reading Simulation: Random mouse movements and scrolling to mimic content reading
- Browser Fingerprinting: Randomized viewports, user agents, and device settings
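
To give a feel for how the mouse-movement behavior works, here is a minimal sketch of a cubic Bézier path generator. This illustrates the general technique only; it is not the actual implementation inside `human-behavior.js`.

```js
// Illustrative sketch -- not the library's actual implementation.
// A cubic Bezier curve between the current and target positions, with
// randomized control points, yields a smooth, slightly curved path.
function bezierPath(from, to, steps = 25) {
  // Random offsets on the control points make every path slightly different
  const ctrl1 = {
    x: from.x + (to.x - from.x) / 3 + (Math.random() - 0.5) * 100,
    y: from.y + (to.y - from.y) / 3 + (Math.random() - 0.5) * 100
  };
  const ctrl2 = {
    x: from.x + (2 * (to.x - from.x)) / 3 + (Math.random() - 0.5) * 100,
    y: from.y + (2 * (to.y - from.y)) / 3 + (Math.random() - 0.5) * 100
  };
  const points = [];
  for (let i = 0; i <= steps; i++) {
    const t = i / steps;
    const u = 1 - t;
    // Cubic Bezier: B(t) = u^3*P0 + 3u^2*t*P1 + 3u*t^2*P2 + t^3*P3
    points.push({
      x: u ** 3 * from.x + 3 * u ** 2 * t * ctrl1.x + 3 * u * t ** 2 * ctrl2.x + t ** 3 * to.x,
      y: u ** 3 * from.y + 3 * u ** 2 * t * ctrl1.y + 3 * u * t ** 2 * ctrl2.y + t ** 3 * to.y
    });
  }
  return points; // e.g. bezierPath({ x: 0, y: 0 }, { x: 500, y: 300 })
}
```

Stepping the mouse along these points with small delays (plus the occasional overshoot and correction) is what makes the movement look human rather than teleported.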
### 📦 Components
- `human-behavior.js` - Core library with all human-like behavior utilities
- `playwright-scraper.js` - Main scraper for Google searches and website scraping
- `validate-scraping.js` - Batch validation tool for Google Alert queries
- `scraper-config.js` - Configuration file for fine-tuning behaviors
- `human-behavior.test.js` - Example tests demonstrating usage
## Installation

```bash
npm install
npx playwright install chromium
```
## Usage

### 1. Basic Google Search Validation

Test a single Google Alert query:

```bash
node scripts/playwright-scraper.js '"macbook repair" Toronto'
```
### 2. Scrape a Specific Website

```bash
node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"
```
### 3. Batch Validate Google Alerts

Validate multiple alerts from your markdown files:

```bash
# Test 5 random alerts from the file
node scripts/validate-scraping.js docs/google-alerts-broad.md

# Test a specific number of alerts with a custom delay
node scripts/validate-scraping.js docs/google-alerts.md --max 3 --delay 8000

# Run in headless mode
node scripts/validate-scraping.js docs/google-alerts-broad.md --headless
```
### 4. Run Tests

```bash
# Run all tests (headed mode)
npm run test:headed

# Run a specific test file
npx playwright test tests/human-behavior.test.js --headed

# Run in headless mode
npm test
```
## Human Behavior Library API

### Mouse Movement

```js
import { humanMouseMove, randomMouseMovements } from './scripts/human-behavior.js';

// Move the mouse to specific coordinates along a natural path
await humanMouseMove(page, { x: 500, y: 300 }, {
  overshootChance: 0.15,  // 15% chance to overshoot
  overshootDistance: 20,  // pixels to overshoot by
  steps: 25,              // Bezier curve steps
  stepDelay: 10           // ms between steps
});

// Random mouse movements (simulating reading)
await randomMouseMovements(page, 3); // 3 random movements
```
### Scrolling

```js
import { humanScroll, scrollToElement } from './scripts/human-behavior.js';

// Natural scrolling with random patterns
await humanScroll(page, {
  direction: 'down',      // 'down' or 'up'
  scrollCount: 3,         // number of scroll actions
  minScroll: 100,         // min pixels per scroll
  maxScroll: 400,         // max pixels per scroll
  minDelay: 500,          // min delay between scrolls (ms)
  maxDelay: 2000,         // max delay between scrolls (ms)
  randomDirection: true   // occasionally scroll the opposite way
});

// Scroll to a specific element
await scrollToElement(page, 'h1.title');
```
### Clicking

```js
import { humanClick } from './scripts/human-behavior.js';

// Click with human-like behavior
await humanClick(page, 'button.submit', {
  moveToElement: true,     // move the mouse to the element first
  doubleClickChance: 0.02  // 2% chance of an accidental double-click
});
```
### Typing

```js
import { humanType } from './scripts/human-behavior.js';

// Type with realistic timing and occasional mistakes
await humanType(page, 'input[name="search"]', 'my search query', {
  minDelay: 50,    // min ms between keystrokes
  maxDelay: 150,   // max ms between keystrokes
  mistakes: 0.02   // 2% chance of a typo
});
```
### Reading Simulation

```js
import { simulateReading } from './scripts/human-behavior.js';

// Simulate reading behavior (scrolling + mouse movements + pauses)
await simulateReading(page, 5000); // for 5 seconds
```
### Browser Context

```js
import { getHumanizedContext } from './scripts/human-behavior.js';

// Create a browser context with a randomized fingerprint
const context = await getHumanizedContext(browser, {
  locale: 'en-CA',
  timezone: 'America/Toronto',
  viewport: { width: 1920, height: 1080 } // or null for a random viewport
});
const page = await context.newPage();
```
### Delays

```js
import { randomDelay } from './scripts/human-behavior.js';

// Random delay between actions
await randomDelay(500, 1500); // 500-1500 ms
```
## Configuration

Edit `scripts/scraper-config.js` to customize behavior parameters:

```js
export const config = {
  humanBehavior: {
    mouse: {
      overshootChance: 0.15,
      overshootDistance: 20,
      // ... more options
    },
    scroll: {
      minAmount: 100,
      maxAmount: 400,
      // ... more options
    },
    typing: {
      minDelay: 50,
      maxDelay: 150,
      mistakeChance: 0.02,
      // ... more options
    }
  }
};
```
## Example: Complete Scraping Workflow

```js
import { chromium } from 'playwright';
import {
  getHumanizedContext,
  humanClick,
  humanType,
  humanScroll,
  simulateReading,
  randomDelay
} from './scripts/human-behavior.js';

const browser = await chromium.launch({ headless: false });
const context = await getHumanizedContext(browser);
const page = await context.newPage();

try {
  // Navigate to Google
  await page.goto('https://www.google.com');
  await randomDelay(1000, 2000);

  // Search with human behavior
  await humanClick(page, 'textarea[name="q"]');
  await humanType(page, 'textarea[name="q"]', 'my search');
  await page.keyboard.press('Enter');

  // Wait for results, then scroll
  await page.waitForLoadState('networkidle');
  await randomDelay(1500, 2500);
  await humanScroll(page, { scrollCount: 3 });

  // Simulate reading
  await simulateReading(page, 5000);

  // Extract results
  const results = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('div.g')).map(el => ({
      title: el.querySelector('h3')?.innerText,
      url: el.querySelector('a')?.href
    }));
  });
  console.log(`Found ${results.length} results`);
} finally {
  await page.close();
  await context.close();
  await browser.close();
}
```
## Validation Report Format

The validation tool generates JSON reports with the following structure:

```json
{
  "total": 5,
  "successful": 4,
  "failed": 1,
  "successRate": 80,
  "results": [
    {
      "name": "MacBook Repair - Ontario",
      "query": "\"macbook repair\" Toronto",
      "success": true,
      "resultCount": 15,
      "stats": "About 1,234 results (0.45 seconds)",
      "results": [...]
    }
  ]
}
```
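
As an example of consuming a report, the sketch below loads a saved report and prints any failed queries. The `validation-report.json` filename is an assumption; point it at whatever path the validation tool actually writes to.

```js
import { readFileSync } from 'node:fs';

// Hypothetical report path -- adjust to wherever the tool writes its output
const report = JSON.parse(readFileSync('validation-report.json', 'utf8'));

console.log(`Success rate: ${report.successRate}% (${report.successful}/${report.total})`);
for (const r of report.results.filter(r => !r.success)) {
  console.log(`FAILED: ${r.name} -- query: ${r.query}`);
}
```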
## Best Practices

### 1. Rate Limiting

Always add delays between requests to avoid rate limiting:

```js
// Wait 5-10 seconds between searches
await randomDelay(5000, 10000);
```
### 2. Randomization

Use randomization to make behavior less predictable:

```js
// Randomize the viewport
const context = await getHumanizedContext(browser); // picks a random viewport
```

```bash
# Randomize which alerts are tested (the validator samples alerts at random)
node scripts/validate-scraping.js docs/google-alerts.md --max 5
```
### 3. Headless Mode

For production, use headless mode:

```js
const browser = await chromium.launch({
  headless: true,
  args: ['--disable-blink-features=AutomationControlled']
});
```
### 4. Error Handling

Always wrap scraping in try/catch blocks:

```js
try {
  const result = await scrapeWebsite(browser, url);
} catch (error) {
  console.error('Scraping failed:', error.message);
  // Implement retry logic or alerting
}
```
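
One way to implement the retry logic mentioned above is exponential backoff with jitter. This is a sketch under the assumption that `scrapeWebsite` throws on failure; `withRetry` is a hypothetical helper, not part of the library.

```js
// Hypothetical retry helper: exponential backoff with random jitter.
// Assumes the wrapped function throws on failure.
async function withRetry(fn, { attempts = 3, baseDelayMs = 5000 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === attempts) throw error; // out of retries
      const backoff = baseDelayMs * 2 ** (attempt - 1);
      const jitter = Math.random() * baseDelayMs; // desynchronizes retries
      console.warn(`Attempt ${attempt} failed (${error.message}); retrying...`);
      await new Promise(resolve => setTimeout(resolve, backoff + jitter));
    }
  }
}

// Usage
const result = await withRetry(() => scrapeWebsite(browser, url));
```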
### 5. Respect robots.txt

Always check and respect a website's `robots.txt` file:

```bash
curl https://example.com/robots.txt
```
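
For automated checks, something like the sketch below can fetch `robots.txt` and do a rough scan for `Disallow` rules. This is a deliberately naive illustration (it ignores user-agent groups, wildcards, and `Allow` overrides); a real implementation should use a proper robots.txt parser.

```js
// Naive robots.txt check -- illustration only, not a compliant parser.
// Requires Node 18+ for the global fetch API.
async function isPathDisallowed(origin, path) {
  const res = await fetch(new URL('/robots.txt', origin));
  if (!res.ok) return false; // no robots.txt -> nothing explicitly disallowed
  const rules = await res.text();
  return rules
    .split('\n')
    .filter(line => line.trim().toLowerCase().startsWith('disallow:'))
    .map(line => line.split(':')[1].trim())
    .some(rule => rule !== '' && path.startsWith(rule));
}

console.log(await isPathDisallowed('https://example.com', '/search'));
```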
## Troubleshooting

### "Element not found" errors

- Increase wait times in the config
- Use `page.waitForSelector()` before acting on an element (see the snippet below)
- Check whether the site's selectors have changed
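
For example, to wait for search results before extracting them (`div.g` is the default Google result selector used elsewhere in this README):

```js
// Wait up to 15 s for results to appear before touching them
await page.waitForSelector('div.g', { timeout: 15000 });
```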
### Rate limiting / CAPTCHA
- Increase delays between requests
- Use different IP addresses (see Proxy Support below)
- Reduce request frequency
- Add more randomization to behavior
### Tests timing out

- Increase the timeout in the Playwright config (see the sketch below)
- Check network connectivity
- Verify selectors are correct
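
For instance, in `playwright.config.js` (the values here are illustrative):

```js
// playwright.config.js -- illustrative timeout values
import { defineConfig } from '@playwright/test';

export default defineConfig({
  timeout: 60_000,            // per-test timeout (ms)
  expect: { timeout: 10_000 } // per-assertion timeout (ms)
});
```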
## Advanced Features

### Custom Selectors

Override the default selectors in the config:

```js
const config = {
  targets: {
    google: {
      resultSelector: 'div.g',
      titleSelector: 'h3',
      // ... custom selectors
    }
  }
};
```
### Proxy Support

Add a proxy configuration:

```js
const context = await browser.newContext({
  proxy: {
    server: 'http://proxy.example.com:8080',
    username: 'user',
    password: 'pass'
  }
});
```
### Screenshot on Error

Capture screenshots for debugging:

```js
try {
  await humanClick(page, 'button.submit');
} catch (error) {
  await page.screenshot({ path: 'error.png', fullPage: true });
  throw error;
}
```
## Legal & Ethical Considerations

**⚠️ Important**: Always ensure your scraping activities comply with:
- Website Terms of Service
- robots.txt directives
- Local laws and regulations
- Rate limiting and server load considerations
Use these tools responsibly and ethically.
## Contributing

To add new behaviors or improve existing ones:

- Add the function to `human-behavior.js`
- Add configuration to `scraper-config.js`
- Add tests to `human-behavior.test.js`
- Update this documentation
## License

See the main project LICENSE file.