# Playwright Scraping with Human-like Behavior
The `scripts/` directory contains Playwright-based scraping and validation tools with built-in human-like behaviors that help avoid bot detection.
## Features
### 🤖 Anti-Detection Behaviors
- **Realistic Mouse Movements**: Smooth Bézier-curve paths with occasional overshooting
- **Natural Scrolling**: Random intervals and amounts with occasional direction changes
- **Human Timing**: Variable delays between actions mimicking real user behavior
- **Typing Simulation**: Realistic keystroke timing with occasional typos and corrections
- **Reading Simulation**: Random mouse movements and scrolling to mimic content reading
- **Browser Fingerprinting**: Randomized viewports, user agents, and device settings
### 📦 Components
1. **human-behavior.js** - Core library with all human-like behavior utilities
2. **playwright-scraper.js** - Main scraper for Google searches and website scraping
3. **validate-scraping.js** - Batch validation tool for Google Alert queries
4. **scraper-config.js** - Configuration file for fine-tuning behaviors
5. **human-behavior.test.js** - Example tests demonstrating usage
## Installation
```bash
npm install
npx playwright install chromium
```
## Usage
### 1. Basic Google Search Validation
Test a single Google Alert query:
```bash
node scripts/playwright-scraper.js '"macbook repair" Toronto'
```
### 2. Scrape a Specific Website
```bash
node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"
```
### 3. Batch Validate Google Alerts
Validate multiple alerts from your markdown files:
```bash
# Test 5 random alerts from the file
node scripts/validate-scraping.js docs/google-alerts-broad.md

# Test a specific number of alerts with a custom delay
node scripts/validate-scraping.js docs/google-alerts.md --max 3 --delay 8000

# Run in headless mode
node scripts/validate-scraping.js docs/google-alerts-broad.md --headless
```
### 4. Run Tests
```bash
# Run all tests (headed mode)
npm run test:headed

# Run a specific test file
npx playwright test tests/human-behavior.test.js --headed

# Run in headless mode
npm test
```
## Human Behavior Library API
### Mouse Movement
```javascript
import { humanMouseMove, randomMouseMovements } from './scripts/human-behavior.js';
// Move mouse to specific coordinates with natural path
await humanMouseMove(page, { x: 500, y: 300 }, {
  overshootChance: 0.15,  // 15% chance to overshoot
  overshootDistance: 20,  // pixels to overshoot
  steps: 25,              // Bézier curve steps
  stepDelay: 10           // ms between steps
});
// Random mouse movements (simulating reading)
await randomMouseMovements(page, 3); // 3 random movements
```
### Scrolling
```javascript
import { humanScroll, scrollToElement } from './scripts/human-behavior.js';
// Natural scrolling with random patterns
await humanScroll(page, {
  direction: 'down',      // 'down' or 'up'
  scrollCount: 3,         // number of scroll actions
  minScroll: 100,         // min pixels per scroll
  maxScroll: 400,         // max pixels per scroll
  minDelay: 500,          // min delay between scrolls (ms)
  maxDelay: 2000,         // max delay between scrolls (ms)
  randomDirection: true   // occasionally scroll the opposite way
});
// Scroll to specific element
await scrollToElement(page, 'h1.title');
```
### Clicking
```javascript
import { humanClick } from './scripts/human-behavior.js';
// Click with human-like behavior
await humanClick(page, 'button.submit', {
  moveToElement: true,     // move mouse to the element first
  doubleClickChance: 0.02  // 2% chance of an accidental double-click
});
```
### Typing
```javascript
import { humanType } from './scripts/human-behavior.js';
// Type with realistic timing and occasional mistakes
await humanType(page, 'input[name="search"]', 'my search query', {
  minDelay: 50,    // min ms between keystrokes
  maxDelay: 150,   // max ms between keystrokes
  mistakes: 0.02   // 2% chance of a typo
});
```
### Reading Simulation
```javascript
import { simulateReading } from './scripts/human-behavior.js';
// Simulate reading behavior (scrolling + mouse movements + pauses)
await simulateReading(page, 5000); // for 5 seconds
```
### Browser Context
```javascript
import { getHumanizedContext } from './scripts/human-behavior.js';
// Create browser context with randomized fingerprint
const context = await getHumanizedContext(browser, {
  locale: 'en-CA',
  timezone: 'America/Toronto',
  viewport: { width: 1920, height: 1080 } // or null for a random viewport
});
const page = await context.newPage();
```
### Delays
```javascript
import { randomDelay } from './scripts/human-behavior.js';
// Random delay between actions
await randomDelay(500, 1500); // 500-1500ms
```
## Configuration
Edit `scripts/scraper-config.js` to customize behavior parameters:
```javascript
export const config = {
  humanBehavior: {
    mouse: {
      overshootChance: 0.15,
      overshootDistance: 20,
      // ... more options
    },
    scroll: {
      minAmount: 100,
      maxAmount: 400,
      // ... more options
    },
    typing: {
      minDelay: 50,
      maxDelay: 150,
      mistakeChance: 0.02,
      // ... more options
    }
  }
};
```
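As a rough illustration of how these values map onto the per-call options shown above, here is a minimal sketch; the exact wiring between `scraper-config.js` and the scraper is an assumption, including the mapping from the config key `mistakeChance` to the `mistakes` option of `humanType`.
```javascript
// Minimal sketch (assumed wiring): read values from scraper-config.js and
// pass them to the behavior helpers documented above.
import { config } from './scripts/scraper-config.js';
import { humanType } from './scripts/human-behavior.js';

const { typing } = config.humanBehavior;
await humanType(page, 'input[name="search"]', 'my search query', {
  minDelay: typing.minDelay,
  maxDelay: typing.maxDelay,
  mistakes: typing.mistakeChance // assumed: config `mistakeChance` feeds the `mistakes` option
});
```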
## Example: Complete Scraping Workflow
```javascript
import { chromium } from 'playwright';
import {
  getHumanizedContext,
  humanClick,
  humanType,
  humanScroll,
  simulateReading,
  randomDelay
} from './scripts/human-behavior.js';

const browser = await chromium.launch({ headless: false });
const context = await getHumanizedContext(browser);
const page = await context.newPage();

try {
  // Navigate to Google
  await page.goto('https://www.google.com');
  await randomDelay(1000, 2000);

  // Search with human behavior
  await humanClick(page, 'textarea[name="q"]');
  await humanType(page, 'textarea[name="q"]', 'my search');
  await page.keyboard.press('Enter');

  // Wait and scroll
  await page.waitForLoadState('networkidle');
  await randomDelay(1500, 2500);
  await humanScroll(page, { scrollCount: 3 });

  // Simulate reading
  await simulateReading(page, 5000);

  // Extract results
  const results = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('div.g')).map(el => ({
      title: el.querySelector('h3')?.innerText,
      url: el.querySelector('a')?.href
    }));
  });
  console.log(`Found ${results.length} results`);
} finally {
  await page.close();
  await context.close();
  await browser.close();
}
```
## Validation Report Format
The validation tool generates JSON reports with the following structure:
```json
{
  "total": 5,
  "successful": 4,
  "failed": 1,
  "successRate": 80,
  "results": [
    {
      "name": "MacBook Repair - Ontario",
      "query": "\"macbook repair\" Toronto",
      "success": true,
      "resultCount": 15,
      "stats": "About 1,234 results (0.45 seconds)",
      "results": [...]
    }
  ]
}
```
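A small sketch for post-processing a saved report is below; the filename `validation-report.json` is an assumption, so use whatever path the validation tool actually writes.
```javascript
// Minimal sketch: load a saved report and list the queries that failed.
import { readFile } from 'node:fs/promises';

const report = JSON.parse(await readFile('validation-report.json', 'utf8'));
console.log(`Success rate: ${report.successRate}%`);
for (const r of report.results.filter(item => !item.success)) {
  console.log(`FAILED: ${r.name} -> ${r.query}`);
}
```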
## Best Practices
### 1. Rate Limiting
Always add delays between requests to avoid rate limiting:
```javascript
// Wait 5-10 seconds between searches
await randomDelay(5000, 10000);
```
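For batch runs, the delay belongs at the end of each iteration; a minimal sketch follows (the query list is a placeholder, and the commented-out search step stands in for the helpers shown above).
```javascript
import { randomDelay } from './scripts/human-behavior.js';

const queries = ['"macbook repair" Toronto', '"iphone screen" Mississauga'];
for (const query of queries) {
  // ... run one search for `query` with the human-behavior helpers ...
  console.log(`Finished: ${query}`);
  await randomDelay(5000, 10000); // pause 5-10 seconds before the next request
}
```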
### 2. Randomization
Use randomization to make behavior less predictable:
```javascript
// Randomize viewport (getHumanizedContext picks a random viewport)
const context = await getHumanizedContext(browser);
```
```bash
# Randomize which alerts are tested (--max samples random alerts from the file)
node scripts/validate-scraping.js docs/google-alerts.md --max 5
```
### 3. Headless Mode
For production, use headless mode:
```javascript
const browser = await chromium.launch({
  headless: true,
  args: ['--disable-blink-features=AutomationControlled']
});
```
### 4. Error Handling
Always wrap scraping in try-catch blocks:
```javascript
try {
  const result = await scrapeWebsite(browser, url);
} catch (error) {
  console.error('Scraping failed:', error.message);
  // Implement retry logic or alerting
}
```
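A minimal retry wrapper sketch around the `scrapeWebsite` call from the example above; the attempt count and backoff values are assumptions.
```javascript
// Minimal sketch: retry a scrape a few times with a growing pause between attempts.
async function scrapeWithRetry(browser, url, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await scrapeWebsite(browser, url);
    } catch (error) {
      console.error(`Attempt ${attempt} failed:`, error.message);
      if (attempt === attempts) throw error; // give up after the last attempt
      await new Promise(resolve => setTimeout(resolve, 5000 * attempt)); // linear backoff
    }
  }
}
```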
### 5. Respect robots.txt
Always check and respect website robots.txt files:
```bash
curl https://example.com/robots.txt
```
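If you want to check programmatically, a naive sketch is below; it only reads `Disallow:` lines and is not a full robots.txt parser, so treat it as a starting point (requires Node 18+ for the global `fetch`).
```javascript
// Minimal sketch: fetch robots.txt and list disallowed paths (naive, single pass).
const res = await fetch('https://example.com/robots.txt');
const robots = await res.text();
const disallowed = robots
  .split('\n')
  .filter(line => line.trim().toLowerCase().startsWith('disallow:'))
  .map(line => line.split(':')[1].trim())
  .filter(Boolean);
console.log('Disallowed paths:', disallowed);
```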
## Troubleshooting
### "Element not found" errors
- Increase wait times in config
- Use `page.waitForSelector()` before actions (see the sketch below)
- Check if selectors have changed
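A minimal sketch of the wait-then-act pattern; the selector and timeout are assumptions to adjust per site.
```javascript
// Wait for the results container before interacting with it.
await page.waitForSelector('div.g', { timeout: 15000 });
await humanClick(page, 'div.g h3');
```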
### Rate limiting / CAPTCHA
- Increase delays between requests
- Use different IP addresses (proxies)
- Reduce request frequency
- Add more randomization to behavior
### Tests timing out
- Increase the timeout in the Playwright config (see the sketch below)
- Check network connectivity
- Verify selectors are correct
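A minimal `playwright.config.js` sketch, assuming the tests run under `@playwright/test`; the values are illustrative.
```javascript
// playwright.config.js -- raise the per-test and per-action timeouts.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  timeout: 60_000,                 // per-test timeout
  expect: { timeout: 10_000 },     // assertion timeout
  use: { actionTimeout: 15_000 }   // per-action timeout (click, type, ...)
});
```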
## Advanced Features
### Custom Selectors
Override default selectors in config:
```javascript
const config = {
  targets: {
    google: {
      resultSelector: 'div.g',
      titleSelector: 'h3',
      // ... custom selectors
    }
  }
};
```
### Proxy Support
Add proxy configuration:
```javascript
const context = await browser.newContext({
  proxy: {
    server: 'http://proxy.example.com:8080',
    username: 'user',
    password: 'pass'
  }
});
```
### Screenshot on Error
Capture screenshots for debugging:
```javascript
try {
  await humanClick(page, 'button.submit');
} catch (error) {
  await page.screenshot({ path: 'error.png', fullPage: true });
  throw error;
}
```
## Legal & Ethical Considerations
⚠️ **Important**: Always ensure your scraping activities comply with:
1. Website Terms of Service
2. robots.txt directives
3. Local laws and regulations
4. Rate limiting and server load considerations
Use these tools responsibly and ethically.
## Contributing
To add new behaviors or improve existing ones:
1. Add the function to `human-behavior.js` (see the sketch below)
2. Add configuration to `scraper-config.js`
3. Add tests to `human-behavior.test.js`
4. Update this documentation
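As a rough pattern to follow, here is a sketch of what a new behavior could look like; `humanHover` is a hypothetical example, not an existing export.
```javascript
// Hypothetical new behavior for human-behavior.js: hover over an element
// with a natural mouse path, then pause as a reader would.
export async function humanHover(page, selector, options = {}) {
  const { minPause = 300, maxPause = 1200 } = options;
  const box = await page.locator(selector).boundingBox();
  if (!box) throw new Error(`Element not found: ${selector}`);
  await humanMouseMove(page, {
    x: box.x + box.width / 2,
    y: box.y + box.height / 2
  });
  await randomDelay(minPause, maxPause);
}
```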
## License
See main project LICENSE file.