# Playwright Scraping Quick Start

Get up and running with Playwright scraping in 5 minutes.

## Installation

### 1. Install Node.js

If you don't have Node.js installed:

**macOS (using Homebrew):**

```bash
brew install node
```

**Ubuntu/Debian:**

```bash
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs
```

**Windows:** Download from [nodejs.org](https://nodejs.org/)

### 2. Install Dependencies

```bash
cd /Users/computer/dev/rss-feedmonitor
npm install
npx playwright install chromium
```

This installs:

- The Playwright test framework
- The Chromium browser
- All necessary dependencies

## Basic Usage

### Test a Single Query

Search Google with human-like behavior:

```bash
node scripts/playwright-scraper.js '"macbook repair" Toronto'
```

The output shows:

- The number of results found
- The first 5 result titles and URLs
- Result statistics from Google

### Scrape a Specific Website

```bash
node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"
```

### Validate Multiple Alerts

Test queries from your markdown files:

```bash
# Test 5 random alerts
node scripts/validate-scraping.js docs/google-alerts-broad.md

# Test 3 alerts with a 10-second delay between each
node scripts/validate-scraping.js docs/google-alerts.md --max 3 --delay 10000

# Run in headless mode (no visible browser)
node scripts/validate-scraping.js docs/google-alerts-broad.md --headless
```

This generates a JSON report with:

- Success/failure status for each query
- Result counts
- Google's result statistics
- Full result details

### Run Examples

See demonstrations of different scraping scenarios:

```bash
# Run all examples
node scripts/example-usage.js

# Run a specific example
node scripts/example-usage.js 1  # Google search
node scripts/example-usage.js 2  # Reddit scraping
node scripts/example-usage.js 3  # Multi-step navigation
node scripts/example-usage.js 4  # Mouse patterns
```

### Run Tests

Execute the test suite:

```bash
# Run with a visible browser (see what's happening)
npm run test:headed

# Run in headless mode (faster)
npm test
```

## What Makes It "Human-like"?

The scraper includes several anti-detection features:

### 1. Realistic Mouse Movements

- Smooth bezier curves instead of straight lines
- Occasional overshooting (15% chance)
- Random speeds and accelerations

### 2. Natural Scrolling

- Random amounts (100-400 pixels)
- Variable delays (0.5-2 seconds)
- Occasionally scrolls up instead of down

### 3. Human-like Typing

- Variable delay between keystrokes (50-150 ms)
- Occasional typos that get corrected (2% chance)
- Longer pauses after spaces and punctuation

### 4. Randomized Fingerprints

- Random viewport sizes (1366x768, 1920x1080, etc.)
- Rotated user agents
- Realistic browser headers
- Geolocation set to Toronto

### 5. Reading Simulation

- Random mouse movements while "reading"
- Occasional scrolling
- Natural pauses

## Configuration

Edit `scripts/scraper-config.js` to customize:

```javascript
export const config = {
  humanBehavior: {
    mouse: {
      overshootChance: 0.15,  // Chance of overshooting target
      overshootDistance: 20,  // Pixels to overshoot
    },
    scroll: {
      minAmount: 100,         // Min scroll distance
      maxAmount: 400,         // Max scroll distance
      minDelay: 500,          // Min delay between scrolls
      maxDelay: 2000,         // Max delay between scrolls
    },
    typing: {
      minDelay: 50,           // Min ms between keys
      maxDelay: 150,          // Max ms between keys
      mistakeChance: 0.02,    // 2% typo rate
    }
  }
};
```

## Common Issues & Solutions

### "Browser not found" error

Run:

```bash
npx playwright install chromium
```

### Rate limiting / CAPTCHA

Increase the delays between requests:

```bash
node scripts/validate-scraping.js docs/google-alerts.md --delay 15000
```

Or add delays in your code:

```javascript
await randomDelay(10000, 15000); // 10-15 second delay
```

### Element not found errors

Increase wait times or add explicit waits:

```javascript
await page.waitForSelector('div.g', { timeout: 30000 });
```

### Tests timeout

Increase the timeout in `playwright.config.js`:
```javascript
timeout: 120 * 1000, // 2 minutes
```

## Best Practices

### 1. Always Add Delays

```javascript
// Wait between searches
await randomDelay(5000, 10000);
```

### 2. Use Headless Mode in Production

```javascript
const browser = await chromium.launch({ headless: true });
```

### 3. Handle Errors Gracefully

```javascript
try {
  const result = await validateQuery(browser, query);
} catch (error) {
  console.error('Failed:', error.message);
  // Continue or retry
}
```

### 4. Respect Rate Limits

- Don't exceed 10 requests per minute
- Add longer delays for production use
- Consider using proxies for high volume

### 5. Check robots.txt

Before scraping any site:

```bash
curl https://example.com/robots.txt
```

## Next Steps

1. **Read the full documentation**: See `docs/PLAYWRIGHT_SCRAPING.md`
2. **Customize behaviors**: Edit `scripts/scraper-config.js`
3. **Write custom scripts**: Use the human-behavior library in your own scripts
4. **Run tests**: Validate your Google Alert queries

## Example: Custom Script

```javascript
import { chromium } from 'playwright';
import {
  getHumanizedContext,
  humanClick,
  humanType,
  humanScroll
} from './scripts/human-behavior.js';

const browser = await chromium.launch({ headless: false });
const context = await getHumanizedContext(browser);
const page = await context.newPage();

// Your scraping logic here
await page.goto('https://example.com');
await humanScroll(page, { scrollCount: 3 });
await humanClick(page, 'button.submit');

await browser.close();
```

## Getting Help

- Full API documentation: `docs/PLAYWRIGHT_SCRAPING.md`
- Example code: `scripts/example-usage.js`
- Test examples: `tests/human-behavior.test.js`

Happy scraping! 🚀
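
## Appendix: Sketching the Bezier Mouse Paths

The "smooth bezier curves" described under *Realistic Mouse Movements* can be illustrated with a small standalone helper. This is a simplified sketch, not the actual implementation in `scripts/human-behavior.js` (the real code may use a different curve order, randomization, or overshoot logic); `bezierPath` is a hypothetical name:

```javascript
// Generate points along a quadratic bezier curve from start to end.
// A randomly offset control point bends the path so the cursor never
// travels in a perfectly straight line.
function bezierPath(start, end, steps = 20) {
  const control = {
    x: (start.x + end.x) / 2 + (Math.random() - 0.5) * 100,
    y: (start.y + end.y) / 2 + (Math.random() - 0.5) * 100,
  };
  const points = [];
  for (let i = 0; i <= steps; i++) {
    const t = i / steps;
    // Quadratic bezier: (1-t)^2*P0 + 2(1-t)t*P1 + t^2*P2
    points.push({
      x: (1 - t) ** 2 * start.x + 2 * (1 - t) * t * control.x + t ** 2 * end.x,
      y: (1 - t) ** 2 * start.y + 2 * (1 - t) * t * control.y + t ** 2 * end.y,
    });
  }
  return points;
}

const path = bezierPath({ x: 0, y: 0 }, { x: 300, y: 200 });
console.log(path.length); // 21 points; first and last hit start/end exactly
```

In a real script you would iterate over `path` and call `await page.mouse.move(p.x, p.y)` for each point, with a few milliseconds of random jitter between steps to approximate a human hand.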
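
## Appendix: Sketching the Typing Delays

The variable keystroke timing described under *Human-like Typing* (50-150 ms between keys, longer pauses after spaces and punctuation) can be sketched as follows. Both helpers are hypothetical illustrations under assumed names; the pause multiplier is an assumption, and the real logic lives in `scripts/human-behavior.js`:

```javascript
// Pick a delay for the next keystroke, mirroring the documented
// behavior: 50-150 ms normally, longer after spaces/punctuation.
function keystrokeDelay(prevChar, minDelay = 50, maxDelay = 150) {
  let delay = minDelay + Math.random() * (maxDelay - minDelay);
  if (prevChar === ' ' || /[.,!?;:]/.test(prevChar)) {
    delay *= 1.5 + Math.random(); // assumed pause multiplier
  }
  return Math.round(delay);
}

// Type text one character at a time using Playwright's keyboard API,
// waiting a human-like interval before each keystroke.
async function humanTypeSketch(page, selector, text) {
  await page.click(selector);
  let prev = '';
  for (const ch of text) {
    await page.waitForTimeout(keystrokeDelay(prev));
    await page.keyboard.type(ch);
    prev = ch;
  }
}
```

Note that Playwright's built-in `keyboard.type(text, { delay })` uses a *fixed* delay per keystroke, which is easier to fingerprint; drawing each delay from a distribution, as above, is the point of the custom helper.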