# Playwright Scraping with Human-like Behavior

This directory contains Playwright-based scraping and validation tools with built-in human-like behaviors to avoid bot detection.

## Features

### 🤖 Anti-Detection Behaviors

- **Realistic Mouse Movements**: Smooth bezier-curve paths with occasional overshooting
- **Natural Scrolling**: Random intervals and amounts with occasional direction changes
- **Human Timing**: Variable delays between actions, mimicking real user behavior
- **Typing Simulation**: Realistic keystroke timing with occasional typos and corrections
- **Reading Simulation**: Random mouse movements and scrolling to mimic content reading
- **Browser Fingerprinting**: Randomized viewports, user agents, and device settings

### 📦 Components

1. **human-behavior.js** - Core library with all human-like behavior utilities
2. **playwright-scraper.js** - Main scraper for Google searches and website scraping
3. **validate-scraping.js** - Batch validation tool for Google Alert queries
4. **scraper-config.js** - Configuration file for fine-tuning behaviors
5. **human-behavior.test.js** - Example tests demonstrating usage

## Installation

```bash
npm install
npx playwright install chromium
```

## Usage

### 1. Basic Google Search Validation

Test a single Google Alert query:

```bash
node scripts/playwright-scraper.js '"macbook repair" Toronto'
```

### 2. Scrape a Specific Website

```bash
node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"
```

### 3. Batch Validate Google Alerts

Validate multiple alerts from your markdown files:

```bash
# Test 5 random alerts from the file
node scripts/validate-scraping.js docs/google-alerts-broad.md

# Test a specific number of alerts with a custom delay
node scripts/validate-scraping.js docs/google-alerts.md --max 3 --delay 8000

# Run in headless mode
node scripts/validate-scraping.js docs/google-alerts-broad.md --headless
```
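Before the batch validator can run anything, it has to pull queries out of the markdown file and pick a random subset. The sketch below illustrates that parsing-and-sampling step only; it assumes queries appear as inline code spans, which may not match the actual format of the alert files, and the real logic lives in `validate-scraping.js`.

```javascript
// Illustrative sketch: extract candidate queries from alert markdown and
// sample up to `max` of them at random. Not the actual validate-scraping.js
// implementation; the backtick convention is an assumption.

function extractQueries(markdown) {
  // Collect every inline-code span, e.g. `"macbook repair" Toronto`
  return [...markdown.matchAll(/`([^`]+)`/g)].map(m => m[1]);
}

function sampleQueries(queries, max) {
  // Fisher-Yates shuffle a copy, then take the first `max` entries
  const pool = [...queries];
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, max);
}

const doc = '- MacBook: `"macbook repair" Toronto`\n- iPhone: `"iphone screen" Mississauga`';
console.log(sampleQueries(extractQueries(doc), 1));
```

Shuffling before slicing is what makes repeated validation runs exercise different alerts instead of always hitting the first few in the file.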
### 4. Run Tests

```bash
# Run all tests (headed mode)
npm run test:headed

# Run a specific test file
npx playwright test tests/human-behavior.test.js --headed

# Run in headless mode
npm test
```

## Human Behavior Library API

### Mouse Movement

```javascript
import { humanMouseMove, randomMouseMovements } from './scripts/human-behavior.js';

// Move the mouse to specific coordinates along a natural path
await humanMouseMove(page, { x: 500, y: 300 }, {
  overshootChance: 0.15,  // 15% chance to overshoot
  overshootDistance: 20,  // pixels to overshoot
  steps: 25,              // bezier curve steps
  stepDelay: 10           // ms between steps
});

// Random mouse movements (simulating reading)
await randomMouseMovements(page, 3); // 3 random movements
```

### Scrolling

```javascript
import { humanScroll, scrollToElement } from './scripts/human-behavior.js';

// Natural scrolling with random patterns
await humanScroll(page, {
  direction: 'down',      // 'down' or 'up'
  scrollCount: 3,         // number of scroll actions
  minScroll: 100,         // min pixels per scroll
  maxScroll: 400,         // max pixels per scroll
  minDelay: 500,          // min delay between scrolls
  maxDelay: 2000,         // max delay between scrolls
  randomDirection: true   // occasionally scroll the opposite way
});

// Scroll to a specific element
await scrollToElement(page, 'h1.title');
```

### Clicking

```javascript
import { humanClick } from './scripts/human-behavior.js';

// Click with human-like behavior
await humanClick(page, 'button.submit', {
  moveToElement: true,     // move the mouse to the element first
  doubleClickChance: 0.02  // 2% chance of an accidental double-click
});
```

### Typing

```javascript
import { humanType } from './scripts/human-behavior.js';

// Type with realistic timing and occasional mistakes
await humanType(page, 'input[name="search"]', 'my search query', {
  minDelay: 50,    // min ms between keystrokes
  maxDelay: 150,   // max ms between keystrokes
  mistakes: 0.02   // 2% chance of a typo
});
```
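Typing realism mostly comes down to how each keystroke's delay is sampled and how often a typo is injected. The following sketch shows one plausible way the `minDelay`/`maxDelay`/`mistakes` options could combine; it is illustrative only and is not the actual `humanType` internals.

```javascript
// Sketch of per-keystroke timing: sample a uniform delay in [minDelay, maxDelay]
// and occasionally flag a keystroke as a typo (to be typed wrong, Backspaced,
// and retyped). Illustrative only; not the library's real implementation.

function planKeystrokes(text, { minDelay = 50, maxDelay = 150, mistakes = 0.02 } = {}) {
  return [...text].map(char => ({
    char,
    delay: minDelay + Math.random() * (maxDelay - minDelay), // ms before this key
    typo: Math.random() < mistakes                           // simulate a correction?
  }));
}

const plan = planKeystrokes('my search query');
console.log(plan[0]);
```

Driving `page.keyboard.type` one character at a time from such a plan produces inter-key intervals that look far less mechanical than a fixed `delay` option.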
### Reading Simulation

```javascript
import { simulateReading } from './scripts/human-behavior.js';

// Simulate reading behavior (scrolling + mouse movements + pauses)
await simulateReading(page, 5000); // for 5 seconds
```

### Browser Context

```javascript
import { getHumanizedContext } from './scripts/human-behavior.js';

// Create a browser context with a randomized fingerprint
const context = await getHumanizedContext(browser, {
  locale: 'en-CA',
  timezone: 'America/Toronto',
  viewport: { width: 1920, height: 1080 } // or null for random
});
const page = await context.newPage();
```

### Delays

```javascript
import { randomDelay } from './scripts/human-behavior.js';

// Random delay between actions
await randomDelay(500, 1500); // 500-1500ms
```

## Configuration

Edit `scripts/scraper-config.js` to customize behavior parameters:

```javascript
export const config = {
  humanBehavior: {
    mouse: {
      overshootChance: 0.15,
      overshootDistance: 20,
      // ... more options
    },
    scroll: {
      minAmount: 100,
      maxAmount: 400,
      // ... more options
    },
    typing: {
      minDelay: 50,
      maxDelay: 150,
      mistakeChance: 0.02,
      // ... more options
    }
  }
};
```

## Example: Complete Scraping Workflow

```javascript
import { chromium } from 'playwright';
import {
  getHumanizedContext,
  humanClick,
  humanType,
  humanScroll,
  simulateReading,
  randomDelay
} from './scripts/human-behavior.js';

const browser = await chromium.launch({ headless: false });
const context = await getHumanizedContext(browser);
const page = await context.newPage();

try {
  // Navigate to Google
  await page.goto('https://www.google.com');
  await randomDelay(1000, 2000);

  // Search with human behavior
  await humanClick(page, 'textarea[name="q"]');
  await humanType(page, 'textarea[name="q"]', 'my search');
  await page.keyboard.press('Enter');

  // Wait and scroll
  await page.waitForLoadState('networkidle');
  await randomDelay(1500, 2500);
  await humanScroll(page, { scrollCount: 3 });

  // Simulate reading
  await simulateReading(page, 5000);

  // Extract results
  const results = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('div.g')).map(el => ({
      title: el.querySelector('h3')?.innerText,
      url: el.querySelector('a')?.href
    }));
  });
  console.log(`Found ${results.length} results`);
} finally {
  await page.close();
  await context.close();
  await browser.close();
}
```
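The "Realistic Mouse Movements" feature and the `steps` option on `humanMouseMove` both point at curved paths rather than straight lines. As a sketch of that idea, here is a quadratic bezier path through a randomly offset control point; the library's actual curve generation may well differ.

```javascript
// Sketch of the bezier-path idea behind humanMouseMove: a quadratic curve from
// start to end through a randomly jittered midpoint control, sampled in `steps`
// increments. Illustrative only; not the library's real implementation.

function bezierPath(start, end, steps = 25, jitter = 100) {
  const control = {
    x: (start.x + end.x) / 2 + (Math.random() - 0.5) * jitter,
    y: (start.y + end.y) / 2 + (Math.random() - 0.5) * jitter
  };
  const path = [];
  for (let i = 0; i <= steps; i++) {
    const t = i / steps;
    const inv = 1 - t;
    path.push({
      x: inv * inv * start.x + 2 * inv * t * control.x + t * t * end.x,
      y: inv * inv * start.y + 2 * inv * t * control.y + t * t * end.y
    });
  }
  return path; // feed each point to page.mouse.move(x, y) with a short stepDelay
}

console.log(bezierPath({ x: 0, y: 0 }, { x: 500, y: 300 }, 5));
```

Because the control point is re-randomized on every call, two moves between the same pair of coordinates never trace exactly the same path.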
## Validation Report Format

The validation tool generates JSON reports with the following structure:

```json
{
  "total": 5,
  "successful": 4,
  "failed": 1,
  "successRate": 80,
  "results": [
    {
      "name": "MacBook Repair - Ontario",
      "query": "\"macbook repair\" Toronto",
      "success": true,
      "resultCount": 15,
      "stats": "About 1,234 results (0.45 seconds)",
      "results": [...]
    }
  ]
}
```

## Best Practices

### 1. Rate Limiting

Always add delays between requests to avoid rate limiting:

```javascript
// Wait 5-10 seconds between searches
await randomDelay(5000, 10000);
```

### 2. Randomization

Use randomization to make behavior less predictable:

```javascript
// Randomize the viewport
const context = await getHumanizedContext(browser); // picks a random viewport
```

```bash
# Randomize test order
node scripts/validate-scraping.js docs/google-alerts.md --max 5
```

### 3. Headless Mode

For production, use headless mode:

```javascript
const browser = await chromium.launch({
  headless: true,
  args: ['--disable-blink-features=AutomationControlled']
});
```

### 4. Error Handling

Always wrap scraping in try/catch blocks:

```javascript
try {
  const result = await scrapeWebsite(browser, url);
} catch (error) {
  console.error('Scraping failed:', error.message);
  // Implement retry logic or alerting
}
```
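When the catch block decides to retry rather than alert, a common shape is exponential backoff with jitter, so failed requests back off progressively and retries from parallel runs don't synchronize. A minimal sketch, where `withRetries` is a hypothetical helper rather than something shipped in this repo:

```javascript
// Sketch: retry an async scraping call with exponential backoff plus random
// jitter. `withRetries` is a hypothetical helper, not part of this toolkit.

async function withRetries(fn, { attempts = 3, baseDelay = 2000 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === attempts) throw error; // out of retries: surface the error
      // Double the base each attempt, scaled by a random factor in [0.5, 1.5)
      const delay = baseDelay * 2 ** (attempt - 1) * (0.5 + Math.random());
      console.error(`Attempt ${attempt} failed: ${error.message}; retrying in ${Math.round(delay)}ms`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

// Usage (assuming a scrapeWebsite(browser, url) function as in the example above):
// const result = await withRetries(() => scrapeWebsite(browser, url));
```

For rate-limit or CAPTCHA responses specifically, larger base delays (tens of seconds) are usually more effective than more attempts.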
### 5. Respect robots.txt

Always check and respect a website's robots.txt file:

```bash
curl https://example.com/robots.txt
```

## Troubleshooting

### "Element not found" errors

- Increase wait times in the config
- Use `page.waitForSelector()` before actions
- Check whether selectors have changed

### Rate limiting / CAPTCHA

- Increase delays between requests
- Use different IP addresses (proxies)
- Reduce request frequency
- Add more randomization to behavior

### Tests timing out

- Increase the timeout in the Playwright config
- Check network connectivity
- Verify selectors are correct

## Advanced Features

### Custom Selectors

Override the default selectors in the config:

```javascript
const config = {
  targets: {
    google: {
      resultSelector: 'div.g',
      titleSelector: 'h3',
      // ... custom selectors
    }
  }
};
```

### Proxy Support

Add proxy configuration:

```javascript
const context = await browser.newContext({
  proxy: {
    server: 'http://proxy.example.com:8080',
    username: 'user',
    password: 'pass'
  }
});
```

### Screenshot on Error

Capture screenshots for debugging:

```javascript
try {
  await humanClick(page, 'button.submit');
} catch (error) {
  await page.screenshot({ path: 'error.png', fullPage: true });
  throw error;
}
```

## Legal & Ethical Considerations

⚠️ **Important**: Always make sure your scraping activities comply with:

1. Website Terms of Service
2. robots.txt directives
3. Local laws and regulations
4. Rate limiting and server-load considerations

Use these tools responsibly and ethically.

## Contributing

To add new behaviors or improve existing ones:

1. Add the function to `human-behavior.js`
2. Add configuration to `scraper-config.js`
3. Add tests to `human-behavior.test.js`
4. Update this documentation

## License

See the main project LICENSE file.
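As a supplement to the robots.txt best practice above, here is a deliberately naive allow/deny check for a fetched robots.txt body. It handles only literal `Disallow` prefixes for a matching `User-agent` group (or `*`), ignoring wildcards, `Allow` rules, and `Crawl-delay`; use a dedicated robots.txt parser for anything beyond a quick sanity check.

```javascript
// Naive robots.txt check: collect Disallow prefixes that apply to the given
// user-agent (or '*') and test a path against them. Ignores wildcards, Allow
// rules, and crawl-delay; illustration only, not a compliant parser.

function isPathAllowed(robotsTxt, path, userAgent = '*') {
  const disallowed = [];
  let applies = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.split('#')[0].trim(); // strip comments and whitespace
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();   // rejoin in case the value held a ':'
    if (/^user-agent$/i.test(key)) {
      applies = value === '*' || value.toLowerCase() === userAgent.toLowerCase();
    } else if (/^disallow$/i.test(key) && applies && value) {
      disallowed.push(value);
    }
  }
  return !disallowed.some(prefix => path.startsWith(prefix));
}

const robots = 'User-agent: *\nDisallow: /private/\nDisallow: /tmp/';
console.log(isPathAllowed(robots, '/private/page')); // false
console.log(isPathAllowed(robots, '/public/page'));  // true
```

Running a check like this before each `page.goto` is a cheap way to honor item 2 of the legal considerations above without pulling in a full parser.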