rss-feedmonitor/docs/QUICKSTART_PLAYWRIGHT.md
# Playwright Scraping Quick Start
Get up and running with Playwright scraping in 5 minutes.
## Installation
### 1. Install Node.js
If you don't have Node.js installed:
**macOS (using Homebrew):**
```bash
brew install node
```
**Ubuntu/Debian:**
```bash
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs
```
**Windows:**
Download from [nodejs.org](https://nodejs.org/)
### 2. Install Dependencies
```bash
cd rss-feedmonitor   # path to your local clone of the repo
npm install
npx playwright install chromium
```
This will install:
- Playwright test framework
- Chromium browser
- All necessary dependencies
## Basic Usage
### Test a Single Query
Search Google with human-like behavior:
```bash
node scripts/playwright-scraper.js '"macbook repair" Toronto'
```
Output will show:
- Number of results found
- First 5 result titles and URLs
- Result statistics from Google
### Scrape a Specific Website
```bash
node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"
```
### Validate Multiple Alerts
Test queries from your markdown files:
```bash
# Test 5 random alerts
node scripts/validate-scraping.js docs/google-alerts-broad.md
# Test 3 alerts with 10 second delay between each
node scripts/validate-scraping.js docs/google-alerts.md --max 3 --delay 10000
# Run in headless mode (no visible browser)
node scripts/validate-scraping.js docs/google-alerts-broad.md --headless
```
This generates a JSON report with:
- Success/failure for each query
- Result counts
- Google's result statistics
- Full result details
### Run Examples
See demonstrations of different scraping scenarios:
```bash
# Run all examples
node scripts/example-usage.js
# Run specific example
node scripts/example-usage.js 1 # Google search
node scripts/example-usage.js 2 # Reddit scraping
node scripts/example-usage.js 3 # Multi-step navigation
node scripts/example-usage.js 4 # Mouse patterns
```
### Run Tests
Execute the test suite:
```bash
# Run with visible browser (see what's happening)
npm run test:headed
# Run in headless mode (faster)
npm test
```
## What Makes It "Human-like"?
The scraper includes several anti-detection features:
### 1. Realistic Mouse Movements
- Smooth bezier curves instead of straight lines
- Occasional overshooting (15% chance)
- Random speeds and accelerations
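The curved paths above can be sketched as a cubic Bézier interpolation between the start and end points, with jittered control points so no two paths are identical. This is a hypothetical standalone helper for illustration; the repo's actual implementation lives in `scripts/human-behavior.js`:

```javascript
// Sketch: generate points along a cubic Bezier curve between two positions.
// Control points are randomly jittered so each path is unique.
function bezierPath(start, end, steps = 25) {
  // Control points roughly 30% and 70% along the line, with random jitter
  const cp1 = {
    x: start.x + (end.x - start.x) * 0.3 + (Math.random() - 0.5) * 100,
    y: start.y + (end.y - start.y) * 0.3 + (Math.random() - 0.5) * 100,
  };
  const cp2 = {
    x: start.x + (end.x - start.x) * 0.7 + (Math.random() - 0.5) * 100,
    y: start.y + (end.y - start.y) * 0.7 + (Math.random() - 0.5) * 100,
  };
  const points = [];
  for (let i = 0; i <= steps; i++) {
    const t = i / steps;
    const u = 1 - t;
    points.push({
      x: u ** 3 * start.x + 3 * u ** 2 * t * cp1.x + 3 * u * t ** 2 * cp2.x + t ** 3 * end.x,
      y: u ** 3 * start.y + 3 * u ** 2 * t * cp1.y + 3 * u * t ** 2 * cp2.y + t ** 3 * end.y,
    });
  }
  return points;
}

// Usage with Playwright's mouse API (sketch):
// for (const p of bezierPath({ x: 0, y: 0 }, target)) await page.mouse.move(p.x, p.y);
```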
### 2. Natural Scrolling
- Random amounts (100-400 pixels)
- Variable delays (0.5-2 seconds)
- Occasionally scrolls up instead of down
### 3. Human-like Typing
- Variable delay between keystrokes (50-150ms)
- Occasional typos that get corrected (2% chance)
- Longer pauses after spaces and punctuation
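The per-keystroke timing can be sketched as a sampler that matches the numbers above (50-150 ms, plus a longer pause after spaces and punctuation). This helper is illustrative, not the library's API:

```javascript
// Sketch: sample a delay for the next keystroke, given the previous character.
// Base delay is uniform in [minDelay, maxDelay]; word/sentence boundaries
// get an extra 100-300 ms pause.
function keystrokeDelay(prevChar, minDelay = 50, maxDelay = 150) {
  let delay = minDelay + Math.random() * (maxDelay - minDelay);
  if (prevChar === ' ' || '.,!?'.includes(prevChar)) {
    delay += 100 + Math.random() * 200; // brief pause after a word or sentence
  }
  return delay;
}

// Usage with Playwright's keyboard API (sketch):
// for (const ch of text) {
//   await page.keyboard.type(ch);
//   await new Promise(r => setTimeout(r, keystrokeDelay(ch)));
// }
```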
### 4. Randomized Fingerprints
- Random viewport sizes (1366x768, 1920x1080, etc.)
- Rotated user agents
- Realistic browser headers
- Geolocation set to Toronto
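Fingerprint randomization amounts to picking Playwright context options from small pools. The viewport sizes and user agents below are illustrative placeholders; the real pools live in `scripts/scraper-config.js`:

```javascript
// Sketch: build randomized browser-context options. Pools here are examples.
const VIEWPORTS = [
  { width: 1366, height: 768 },
  { width: 1920, height: 1080 },
  { width: 1536, height: 864 },
];
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
];

function randomContextOptions() {
  const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];
  return {
    viewport: pick(VIEWPORTS),
    userAgent: pick(USER_AGENTS),
    locale: 'en-CA',
    geolocation: { latitude: 43.6532, longitude: -79.3832 }, // Toronto
    permissions: ['geolocation'],
  };
}

// const context = await browser.newContext(randomContextOptions());
```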
### 5. Reading Simulation
- Random mouse movements while "reading"
- Occasional scrolling
- Natural pauses
## Configuration
Edit `scripts/scraper-config.js` to customize:
```javascript
export const config = {
  humanBehavior: {
    mouse: {
      overshootChance: 0.15,  // Chance of overshooting target
      overshootDistance: 20,  // Pixels to overshoot
    },
    scroll: {
      minAmount: 100,         // Min scroll distance (px)
      maxAmount: 400,         // Max scroll distance (px)
      minDelay: 500,          // Min delay between scrolls (ms)
      maxDelay: 2000,         // Max delay between scrolls (ms)
    },
    typing: {
      minDelay: 50,           // Min ms between keys
      maxDelay: 150,          // Max ms between keys
      mistakeChance: 0.02,    // 2% typo rate
    },
  },
};
```
## Common Issues & Solutions
### "Browser not found" error
Run:
```bash
npx playwright install chromium
```
### Rate limiting / CAPTCHA
Increase delays between requests:
```bash
node scripts/validate-scraping.js docs/google-alerts.md --delay 15000
```
Or add delays in your code:
```javascript
await randomDelay(10000, 15000); // 10-15 second delay
```
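If you're not importing the library's `randomDelay` helper, a minimal equivalent can be sketched as a promise that resolves after a uniformly random pause (an assumed signature matching the calls shown in this guide):

```javascript
// Sketch: uniform random pause between minMs and maxMs.
function sampleDelay(minMs, maxMs) {
  return minMs + Math.random() * (maxMs - minMs);
}

function randomDelay(minMs, maxMs) {
  return new Promise((resolve) => setTimeout(resolve, sampleDelay(minMs, maxMs)));
}
```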
### Element not found errors
Increase wait times or add explicit waits:
```javascript
await page.waitForSelector('div.g', { timeout: 30000 });
```
### Tests timeout
Increase timeout in `playwright.config.js`:
```javascript
timeout: 120 * 1000, // 2 minutes
```
## Best Practices
### 1. Always Add Delays
```javascript
// Wait between searches
await randomDelay(5000, 10000);
```
### 2. Use Headless Mode in Production
```javascript
const browser = await chromium.launch({ headless: true });
```
### 3. Handle Errors Gracefully
```javascript
try {
  const result = await validateQuery(browser, query);
} catch (error) {
  console.error('Failed:', error.message);
  // Continue or retry
}
```
### 4. Respect Rate Limits
- Don't exceed 10 requests per minute
- Add longer delays for production use
- Consider using proxies for high volume
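The 10-requests-per-minute guideline can be enforced with a minimum-interval limiter: given when the last request fired, compute how long to wait before the next one. A minimal sketch (the helper name is hypothetical):

```javascript
// Sketch: minimum-interval rate limiting. At 10 requests/minute the floor
// between requests is 6 seconds.
function waitBeforeNext(lastRequestMs, nowMs, maxPerMinute = 10) {
  const minIntervalMs = 60000 / maxPerMinute;
  const elapsed = nowMs - lastRequestMs;
  return Math.max(0, minIntervalMs - elapsed);
}

// Usage (sketch):
// await new Promise(r => setTimeout(r, waitBeforeNext(lastRequest, Date.now())));
```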
### 5. Check robots.txt
Before scraping any site:
```bash
curl https://example.com/robots.txt
```
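If you want to act on what you fetched, a naive check can collect the `Disallow` rules under `User-agent: *` and match a path by prefix. Real robots.txt matching has more rules (`Allow`, wildcards, longest-match precedence), so treat this only as a sketch:

```javascript
// Sketch: naive robots.txt check. Collects Disallow rules for "User-agent: *"
// and tests a path by prefix. Not a full RFC 9309 matcher.
function isDisallowed(robotsTxt, path) {
  let applies = false;
  const rules = [];
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.trim();
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key)) applies = value === '*';
    else if (applies && /^disallow$/i.test(key) && value) rules.push(value);
  }
  return rules.some((prefix) => path.startsWith(prefix));
}
```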
## Next Steps
1. **Read Full Documentation**: See `docs/PLAYWRIGHT_SCRAPING.md`
2. **Customize Behaviors**: Edit `scripts/scraper-config.js`
3. **Write Custom Scripts**: Use the human-behavior library in your own scripts
4. **Run Tests**: Validate your Google Alert queries
## Example: Custom Script
```javascript
import { chromium } from 'playwright';
import {
  getHumanizedContext,
  humanClick,
  humanType,
  humanScroll
} from './scripts/human-behavior.js';
const browser = await chromium.launch({ headless: false });
const context = await getHumanizedContext(browser);
const page = await context.newPage();
// Your scraping logic here
await page.goto('https://example.com');
await humanScroll(page, { scrollCount: 3 });
await humanClick(page, 'button.submit');
await browser.close();
```
## Getting Help
- Full API documentation: `docs/PLAYWRIGHT_SCRAPING.md`
- Example code: `scripts/example-usage.js`
- Test examples: `tests/human-behavior.test.js`
Happy scraping! 🚀