# Playwright Scraping with Human-like Behavior

This directory contains Playwright-based scraping and validation tools with built-in human-like behaviors to avoid bot detection.

## Features

### 🤖 Anti-Detection Behaviors

- **Realistic Mouse Movements**: Smooth Bezier-curve paths with occasional overshooting
- **Natural Scrolling**: Random intervals and amounts with occasional direction changes
- **Human Timing**: Variable delays between actions, mimicking real user behavior
- **Typing Simulation**: Realistic keystroke timing with occasional typos and corrections
- **Reading Simulation**: Random mouse movements and scrolling to mimic content reading
- **Browser Fingerprinting**: Randomized viewports, user agents, and device settings

### 📦 Components

1. **human-behavior.js** - Core library with all human-like behavior utilities
2. **playwright-scraper.js** - Main scraper for Google searches and website scraping
3. **validate-scraping.js** - Batch validation tool for Google Alert queries
4. **scraper-config.js** - Configuration file for fine-tuning behaviors
5. **human-behavior.test.js** - Example tests demonstrating usage

## Installation

```bash
npm install
npx playwright install chromium
```

## Usage

### 1. Basic Google Search Validation

Test a single Google Alert query:

```bash
node scripts/playwright-scraper.js '"macbook repair" Toronto'
```

### 2. Scrape a Specific Website

```bash
node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"
```

### 3. Batch Validate Google Alerts

Validate multiple alerts from your markdown files:

```bash
# Test 5 random alerts from the file
node scripts/validate-scraping.js docs/google-alerts-broad.md

# Test a specific number of alerts with a custom delay
node scripts/validate-scraping.js docs/google-alerts.md --max 3 --delay 8000

# Run in headless mode
node scripts/validate-scraping.js docs/google-alerts-broad.md --headless
```

### 4. Run Tests

```bash
# Run all tests (headed mode)
npm run test:headed

# Run a specific test file
npx playwright test tests/human-behavior.test.js --headed

# Run in headless mode
npm test
```

## Human Behavior Library API

### Mouse Movement

```javascript
import { humanMouseMove, randomMouseMovements } from './scripts/human-behavior.js';

// Move the mouse to specific coordinates along a natural path
await humanMouseMove(page, { x: 500, y: 300 }, {
  overshootChance: 0.15,  // 15% chance to overshoot
  overshootDistance: 20,  // pixels to overshoot
  steps: 25,              // Bezier curve steps
  stepDelay: 10           // ms between steps
});

// Random mouse movements (simulating reading)
await randomMouseMovements(page, 3); // 3 random movements
```

### Scrolling

```javascript
import { humanScroll, scrollToElement } from './scripts/human-behavior.js';

// Natural scrolling with random patterns
await humanScroll(page, {
  direction: 'down',      // 'down' or 'up'
  scrollCount: 3,         // number of scroll actions
  minScroll: 100,         // min pixels per scroll
  maxScroll: 400,         // max pixels per scroll
  minDelay: 500,          // min delay between scrolls (ms)
  maxDelay: 2000,         // max delay between scrolls (ms)
  randomDirection: true   // occasionally scroll the opposite way
});

// Scroll to a specific element
await scrollToElement(page, 'h1.title');
```

### Clicking

```javascript
import { humanClick } from './scripts/human-behavior.js';

// Click with human-like behavior
await humanClick(page, 'button.submit', {
  moveToElement: true,     // move the mouse to the element first
  doubleClickChance: 0.02  // 2% chance of accidental double-click
});
```

### Typing

```javascript
import { humanType } from './scripts/human-behavior.js';

// Type with realistic timing and occasional mistakes
await humanType(page, 'input[name="search"]', 'my search query', {
  minDelay: 50,    // min ms between keystrokes
  maxDelay: 150,   // max ms between keystrokes
  mistakes: 0.02   // 2% chance of a typo
});
```

### Reading Simulation

```javascript
import { simulateReading } from './scripts/human-behavior.js';

// Simulate reading behavior (scrolling + mouse movements + pauses)
await simulateReading(page, 5000); // for 5 seconds
```

### Browser Context

```javascript
import { getHumanizedContext } from './scripts/human-behavior.js';

// Create a browser context with a randomized fingerprint
const context = await getHumanizedContext(browser, {
  locale: 'en-CA',
  timezone: 'America/Toronto',
  viewport: { width: 1920, height: 1080 } // or null for a random viewport
});

const page = await context.newPage();
```

### Delays

```javascript
import { randomDelay } from './scripts/human-behavior.js';

// Random delay between actions
await randomDelay(500, 1500); // 500-1500 ms
```

## Configuration

Edit `scripts/scraper-config.js` to customize behavior parameters:

```javascript
export const config = {
  humanBehavior: {
    mouse: {
      overshootChance: 0.15,
      overshootDistance: 20,
      // ... more options
    },
    scroll: {
      minAmount: 100,
      maxAmount: 400,
      // ... more options
    },
    typing: {
      minDelay: 50,
      maxDelay: 150,
      mistakeChance: 0.02,
      // ... more options
    }
  }
};
```

## Example: Complete Scraping Workflow

```javascript
import { chromium } from 'playwright';
import {
  getHumanizedContext,
  humanClick,
  humanType,
  humanScroll,
  simulateReading,
  randomDelay
} from './scripts/human-behavior.js';

const browser = await chromium.launch({ headless: false });
const context = await getHumanizedContext(browser);
const page = await context.newPage();

try {
  // Navigate to Google
  await page.goto('https://www.google.com');
  await randomDelay(1000, 2000);

  // Search with human-like behavior
  await humanClick(page, 'textarea[name="q"]');
  await humanType(page, 'textarea[name="q"]', 'my search');
  await page.keyboard.press('Enter');

  // Wait for results, then scroll
  await page.waitForLoadState('networkidle');
  await randomDelay(1500, 2500);
  await humanScroll(page, { scrollCount: 3 });

  // Simulate reading
  await simulateReading(page, 5000);

  // Extract results
  const results = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('div.g')).map(el => ({
      title: el.querySelector('h3')?.innerText,
      url: el.querySelector('a')?.href
    }));
  });

  console.log(`Found ${results.length} results`);
} finally {
  await page.close();
  await context.close();
  await browser.close();
}
```

## Validation Report Format

The validation tool generates JSON reports with the following structure:

```json
{
  "total": 5,
  "successful": 4,
  "failed": 1,
  "successRate": 80,
  "results": [
    {
      "name": "MacBook Repair - Ontario",
      "query": "\"macbook repair\" Toronto",
      "success": true,
      "resultCount": 15,
      "stats": "About 1,234 results (0.45 seconds)",
      "results": [...]
    }
  ]
}
```
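
A quick way to act on a report, assuming it was saved to disk (the `validation-report.json` filename is illustrative; use whatever path the tool wrote):

```javascript
// Load a saved report and list the failing queries
import { readFile } from 'node:fs/promises';

const report = JSON.parse(await readFile('validation-report.json', 'utf8'));
console.log(`Success rate: ${report.successRate}%`);
for (const r of report.results.filter(r => !r.success)) {
  console.log(`FAILED: ${r.name} (${r.query})`);
}
```
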
## Best Practices

### 1. Rate Limiting

Always add delays between requests to avoid rate limiting:

```javascript
// Wait 5-10 seconds between searches
await randomDelay(5000, 10000);
```


### 2. Randomization

Use randomization to make behavior less predictable:

```javascript
// Randomize the viewport
const context = await getHumanizedContext(browser); // picks a random viewport
```

```bash
# Randomize test order by sampling alerts
node scripts/validate-scraping.js docs/google-alerts.md --max 5
```

### 3. Headless Mode

For production, use headless mode:

```javascript
const browser = await chromium.launch({
  headless: true,
  args: ['--disable-blink-features=AutomationControlled']
});
```


### 4. Error Handling

Always wrap scraping in try-catch blocks:

```javascript
try {
  const result = await scrapeWebsite(browser, url);
} catch (error) {
  console.error('Scraping failed:', error.message);
  // Implement retry logic or alerting here
}
```


### 5. Respect robots.txt

Always check and respect a site's robots.txt file before scraping:

```bash
curl https://example.com/robots.txt
```
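
In code, a rough pre-flight check might look like this (a minimal sketch that only detects a blanket `Disallow: /`; a real check should use a proper robots.txt parser and honor per-path rules):

```javascript
// Skip a site entirely if its robots.txt disallows all crawling
const res = await fetch('https://example.com/robots.txt');
const robots = res.ok ? await res.text() : '';
if (/^Disallow:\s*\/\s*$/m.test(robots)) {
  console.warn('Site disallows all crawling; skipping');
}
```
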
## Troubleshooting

### "Element not found" errors

- Increase wait times in the config
- Use `page.waitForSelector()` before actions (see the sketch below)
- Check whether the site's selectors have changed
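
For example, waiting explicitly before interacting (the selector and timeout are illustrative):

```javascript
// Wait for the results to exist before acting on them
await page.waitForSelector('div.g', { timeout: 15000 });
await humanClick(page, 'div.g h3');
```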

### Rate limiting / CAPTCHA

- Increase delays between requests (see the backoff sketch after this list)
- Use different IP addresses (proxies)
- Reduce request frequency
- Add more randomization to behavior
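
One pattern for the first point is retrying with jittered exponential backoff (a sketch; the `withBackoff` helper is hypothetical, not part of the library):

```javascript
// Retry an async operation, waiting longer after each failure
async function withBackoff(fn, retries = 3) {
  let lastError;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < retries - 1) {
        const waitMs = 5000 * 2 ** attempt + Math.random() * 2000; // jittered backoff
        await new Promise(resolve => setTimeout(resolve, waitMs));
      }
    }
  }
  throw lastError;
}

// Usage: const result = await withBackoff(() => scrapeWebsite(browser, url));
```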

### Tests timing out

- Increase the timeout in the Playwright config (see the example below)
- Check network connectivity
- Verify that selectors are correct
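
For example, raising the per-test timeout (the value is illustrative):

```javascript
// playwright.config.js
import { defineConfig } from '@playwright/test';

export default defineConfig({
  timeout: 60_000 // 60 s per test instead of the 30 s default
});
```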

## Advanced Features

### Custom Selectors

Override the default selectors in the config:

```javascript
const config = {
  targets: {
    google: {
      resultSelector: 'div.g',
      titleSelector: 'h3',
      // ... custom selectors
    }
  }
};
```


### Proxy Support

Add a proxy configuration:

```javascript
const context = await browser.newContext({
  proxy: {
    server: 'http://proxy.example.com:8080',
    username: 'user',
    password: 'pass'
  }
});
```


### Screenshot on Error

Capture screenshots for debugging:

```javascript
try {
  await humanClick(page, 'button.submit');
} catch (error) {
  await page.screenshot({ path: 'error.png', fullPage: true });
  throw error;
}
```

## Legal & Ethical Considerations

⚠️ **Important**: Always ensure your scraping activities comply with:

1. Website Terms of Service
2. robots.txt directives
3. Local laws and regulations
4. Rate limiting and server-load considerations

Use these tools responsibly and ethically.
## Contributing

To add new behaviors or improve existing ones:

1. Add the function to `human-behavior.js` (see the example below)
2. Add its configuration to `scraper-config.js`
3. Add tests to `human-behavior.test.js`
4. Update this documentation
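
A hypothetical new behavior following the library's existing conventions (the `humanHover` name and its defaults are illustrative, not part of the library):

```javascript
// Hypothetical addition to human-behavior.js, reusing helpers
// already defined in that module (humanMouseMove, randomDelay).
export async function humanHover(page, selector, options = {}) {
  const { minPause = 300, maxPause = 1200 } = options;
  const element = await page.waitForSelector(selector);
  const box = await element.boundingBox();
  if (!box) throw new Error(`No bounding box for ${selector}`);
  // Move along a natural path to the element's center, then pause
  await humanMouseMove(page, {
    x: box.x + box.width / 2,
    y: box.y + box.height / 2
  });
  await randomDelay(minPause, maxPause);
}
```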

## License

See the main project LICENSE file.