Playwright Scraping Quick Start
Get up and running with Playwright scraping in 5 minutes.
Installation
1. Install Node.js
If you don't have Node.js installed:
macOS (using Homebrew):
brew install node
Ubuntu/Debian:
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs
Windows: Download from nodejs.org
2. Install Dependencies
cd /Users/computer/dev/rss-feedmonitor
npm install
npx playwright install chromium
This will install:
- Playwright test framework
- Chromium browser
- All necessary dependencies
Basic Usage
Test a Single Query
Search Google with human-like behavior:
node scripts/playwright-scraper.js '"macbook repair" Toronto'
Output will show:
- Number of results found
- First 5 result titles and URLs
- Result statistics from Google
Scrape a Specific Website
node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"
Validate Multiple Alerts
Test queries from your markdown files:
# Test 5 random alerts
node scripts/validate-scraping.js docs/google-alerts-broad.md
# Test 3 alerts with 10 second delay between each
node scripts/validate-scraping.js docs/google-alerts.md --max 3 --delay 10000
# Run in headless mode (no visible browser)
node scripts/validate-scraping.js docs/google-alerts-broad.md --headless
This generates a JSON report with:
- Success/failure for each query
- Result counts
- Google's result statistics
- Full result details
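For reference, a report entry might look roughly like this (the field names below are illustrative assumptions, not the actual schema; inspect a generated report for the real structure):
[
  {
    "query": "\"macbook repair\" Toronto",
    "success": true,
    "resultCount": 8,
    "resultStats": "About 1,230 results",
    "results": [
      { "title": "...", "url": "..." }
    ]
  }
]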
Run Examples
See demonstrations of different scraping scenarios:
# Run all examples
node scripts/example-usage.js
# Run specific example
node scripts/example-usage.js 1 # Google search
node scripts/example-usage.js 2 # Reddit scraping
node scripts/example-usage.js 3 # Multi-step navigation
node scripts/example-usage.js 4 # Mouse patterns
Run Tests
Execute the test suite:
# Run with visible browser (see what's happening)
npm run test:headed
# Run in headless mode (faster)
npm test
What Makes It "Human-like"?
The scraper includes several anti-detection features:
1. Realistic Mouse Movements
- Smooth Bézier curves instead of straight lines
- Occasional overshooting (15% chance)
- Random speeds and accelerations
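A simplified sketch of the idea (the real implementation lives in scripts/human-behavior.js and may differ; these helper names are illustrative):
function bezierPoint(p0, p1, p2, t) {
  // Quadratic Bézier: interpolate between start, control, and end points
  const u = 1 - t;
  return {
    x: u * u * p0.x + 2 * u * t * p1.x + t * t * p2.x,
    y: u * u * p0.y + 2 * u * t * p1.y + t * t * p2.y,
  };
}

async function moveMouseHumanly(page, from, to) {
  // A random control point bends the path away from a straight line
  const control = {
    x: (from.x + to.x) / 2 + (Math.random() - 0.5) * 200,
    y: (from.y + to.y) / 2 + (Math.random() - 0.5) * 200,
  };
  const steps = 20 + Math.floor(Math.random() * 20); // random speed
  for (let i = 1; i <= steps; i++) {
    const { x, y } = bezierPoint(from, control, to, i / steps);
    await page.mouse.move(x, y);
  }
  if (Math.random() < 0.15) {
    // Occasionally overshoot by ~20px, then settle on the target
    await page.mouse.move(to.x + 20, to.y + 20);
    await page.mouse.move(to.x, to.y);
  }
}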
2. Natural Scrolling
- Random amounts (100-400 pixels)
- Variable delays (0.5-2 seconds)
- Occasionally scrolls up instead of down
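One scroll step could look like this (again a sketch, not the project's actual code):
async function humanScrollStep(page) {
  const direction = Math.random() < 0.9 ? 1 : -1; // mostly down, sometimes up
  const amount = 100 + Math.random() * 300;       // 100-400 px
  await page.mouse.wheel(0, direction * amount);
  await page.waitForTimeout(500 + Math.random() * 1500); // 0.5-2 s pause
}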
3. Human-like Typing
- Variable delay between keystrokes (50-150ms)
- Occasional typos that get corrected (2% chance)
- Longer pauses after spaces and punctuation
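Sketched with Playwright's keyboard API (the helper name is illustrative; the shipped humanType may differ):
async function typeHumanly(page, selector, text) {
  await page.click(selector);
  for (const char of text) {
    if (Math.random() < 0.02) {
      // Rare typo: hit a wrong key, notice, and correct it
      await page.keyboard.type('x');
      await page.waitForTimeout(150);
      await page.keyboard.press('Backspace');
    }
    await page.keyboard.type(char);
    const base = 50 + Math.random() * 100;         // 50-150 ms per key
    const extra = /[\s.,!?]/.test(char) ? 200 : 0; // longer after spaces/punctuation
    await page.waitForTimeout(base + extra);
  }
}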
4. Randomized Fingerprints
- Random viewport sizes (1366x768, 1920x1080, etc.)
- Rotated user agents
- Realistic browser headers
- Geolocation set to Toronto
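These map onto Playwright's browser.newContext() options; a sketch of what getHumanizedContext plausibly does (the locale, user agent pool, and exact values here are assumptions):
const viewports = [
  { width: 1366, height: 768 },
  { width: 1920, height: 1080 },
  { width: 1536, height: 864 },
];
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  // ...more recent desktop user agents
];

const context = await browser.newContext({
  viewport: viewports[Math.floor(Math.random() * viewports.length)],
  userAgent: userAgents[Math.floor(Math.random() * userAgents.length)],
  locale: 'en-CA',
  timezoneId: 'America/Toronto',
  geolocation: { latitude: 43.65, longitude: -79.38 }, // Toronto
  permissions: ['geolocation'], // let pages read the spoofed location
});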
5. Reading Simulation
- Random mouse movements while "reading"
- Occasional scrolling
- Natural pauses
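Conceptually this just composes the mouse and scroll behaviors above, e.g.:
async function simulateReading(page) {
  const glances = 3 + Math.floor(Math.random() * 3);
  for (let i = 0; i < glances; i++) {
    // Drift the cursor as if following text, sometimes nudge the page down
    await page.mouse.move(200 + Math.random() * 800, 150 + Math.random() * 450);
    if (Math.random() < 0.4) await page.mouse.wheel(0, 120);
    await page.waitForTimeout(800 + Math.random() * 1500); // natural pause
  }
}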
Configuration
Edit scripts/scraper-config.js to customize:
export const config = {
  humanBehavior: {
    mouse: {
      overshootChance: 0.15,   // Chance of overshooting the target
      overshootDistance: 20,   // Pixels to overshoot by
    },
    scroll: {
      minAmount: 100,          // Min scroll distance (px)
      maxAmount: 400,          // Max scroll distance (px)
      minDelay: 500,           // Min delay between scrolls (ms)
      maxDelay: 2000,          // Max delay between scrolls (ms)
    },
    typing: {
      minDelay: 50,            // Min ms between keys
      maxDelay: 150,           // Max ms between keys
      mistakeChance: 0.02,     // 2% typo rate
    }
  }
};
Common Issues & Solutions
"Browser not found" error
Run:
npx playwright install chromium
Rate limiting / CAPTCHA
Increase delays between requests:
node scripts/validate-scraping.js docs/google-alerts.md --delay 15000
Or add delays in your code:
await randomDelay(10000, 15000); // 10-15 second delay
Element not found errors
Increase wait times or add explicit waits:
await page.waitForSelector('div.g', { timeout: 30000 });
Tests timeout
Increase timeout in playwright.config.js:
timeout: 120 * 1000, // 2 minutes
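In context, assuming a standard Playwright config file (merge into your existing settings):
// playwright.config.js
import { defineConfig } from '@playwright/test';

export default defineConfig({
  timeout: 120 * 1000, // 2 minutes per test
});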
Best Practices
1. Always Add Delays
// Wait between searches
await randomDelay(5000, 10000);
2. Use Headless Mode in Production
const browser = await chromium.launch({ headless: true });
3. Handle Errors Gracefully
try {
  const result = await validateQuery(browser, query);
} catch (error) {
  console.error('Failed:', error.message);
  // Continue or retry
}
4. Respect Rate Limits
- Don't exceed 10 requests per minute
- Add longer delays for production use
- Consider using proxies for high volume
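For example, spacing queries out with the same randomDelay helper shown above keeps you comfortably under that limit:
// queries: your list of alert query strings
for (const query of queries) {
  await validateQuery(browser, query);
  await randomDelay(8000, 15000); // ~8-15 s apart, well under 10 req/min
}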
5. Check robots.txt
Before scraping any site:
curl https://example.com/robots.txt
Next Steps
- Read Full Documentation: See docs/PLAYWRIGHT_SCRAPING.md
- Customize Behaviors: Edit scripts/scraper-config.js
- Write Custom Scripts: Use the human-behavior library in your own scripts
- Run Tests: Validate your Google Alert queries
Example: Custom Script
import { chromium } from 'playwright';
import {
  getHumanizedContext,
  humanClick,
  humanType,
  humanScroll
} from './scripts/human-behavior.js';
const browser = await chromium.launch({ headless: false });
const context = await getHumanizedContext(browser);
const page = await context.newPage();
// Your scraping logic here
await page.goto('https://example.com');
await humanScroll(page, { scrollCount: 3 });
await humanClick(page, 'button.submit');
await browser.close();
Getting Help
- Full API documentation: docs/PLAYWRIGHT_SCRAPING.md
- Example code: scripts/example-usage.js
- Test examples: tests/human-behavior.test.js
Happy scraping! 🚀