# Playwright Scraping Quick Start

Get up and running with Playwright scraping in 5 minutes.

## Installation

### 1. Install Node.js

If you don't have Node.js installed:

**macOS (using Homebrew):**

```bash
brew install node
```

**Ubuntu/Debian:**

```bash
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs
```

**Windows:**

Download the installer from [nodejs.org](https://nodejs.org/).

### 2. Install Dependencies

```bash
cd /Users/computer/dev/rss-feedmonitor
npm install
npx playwright install chromium
```

This will install:

- The Playwright test framework
- The Chromium browser
- All necessary dependencies

## Basic Usage

### Test a Single Query

Search Google with human-like behavior:

```bash
node scripts/playwright-scraper.js '"macbook repair" Toronto'
```

Output will show:
- Number of results found
- First 5 result titles and URLs
- Result statistics from Google

### Scrape a Specific Website

```bash
node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"
```

### Validate Multiple Alerts

Test queries from your markdown files:

```bash
# Test 5 random alerts
node scripts/validate-scraping.js docs/google-alerts-broad.md

# Test 3 alerts with a 10-second delay between each
node scripts/validate-scraping.js docs/google-alerts.md --max 3 --delay 10000

# Run in headless mode (no visible browser)
node scripts/validate-scraping.js docs/google-alerts-broad.md --headless
```

This generates a JSON report with:
- Success/failure for each query
- Result counts
- Google's result statistics
- Full result details

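If you want to post-process that report, note that the exact JSON shape is defined by `validate-scraping.js`; assuming each entry carries hypothetical `success` and `resultCount` fields, a summary pass could look like:

```javascript
// Summarize a validation report. The field names (`success`,
// `resultCount`) are assumptions -- check the actual JSON output
// of validate-scraping.js before relying on them.
function summarizeReport(entries) {
  const passed = entries.filter((e) => e.success);
  const totalResults = passed.reduce((sum, e) => sum + e.resultCount, 0);
  return {
    passed: passed.length,
    failed: entries.length - passed.length,
    totalResults,
  };
}

const report = [
  { query: '"macbook repair" Toronto', success: true, resultCount: 8 },
  { query: 'laptop screen fix', success: false, resultCount: 0 },
];
console.log(summarizeReport(report)); // { passed: 1, failed: 1, totalResults: 8 }
```
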
### Run Examples

See demonstrations of different scraping scenarios:

```bash
# Run all examples
node scripts/example-usage.js

# Run a specific example
node scripts/example-usage.js 1  # Google search
node scripts/example-usage.js 2  # Reddit scraping
node scripts/example-usage.js 3  # Multi-step navigation
node scripts/example-usage.js 4  # Mouse patterns
```

### Run Tests

Execute the test suite:

```bash
# Run with a visible browser (see what's happening)
npm run test:headed

# Run in headless mode (faster)
npm test
```

## What Makes It "Human-like"?

The scraper includes several anti-detection features:

### 1. Realistic Mouse Movements

- Smooth Bezier curves instead of straight lines
- Occasional overshooting (15% chance)
- Random speeds and accelerations

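As an illustration of the Bezier idea (a sketch, not the library's actual code in `scripts/human-behavior.js`), a curved mouse path between two points can be sampled like this:

```javascript
// Sample `steps + 1` points along a cubic Bezier curve from `from` to `to`.
// The two control points are jittered so no two paths are identical.
// Illustrative sketch only -- the real implementation lives in
// scripts/human-behavior.js.
function bezierPath(from, to, steps = 25) {
  const jitter = () => (Math.random() - 0.5) * 100;
  const c1 = { x: from.x + (to.x - from.x) / 3 + jitter(), y: from.y + jitter() };
  const c2 = { x: from.x + (2 * (to.x - from.x)) / 3 + jitter(), y: to.y + jitter() };
  const points = [];
  for (let i = 0; i <= steps; i++) {
    const t = i / steps;
    const u = 1 - t;
    points.push({
      x: u ** 3 * from.x + 3 * u ** 2 * t * c1.x + 3 * u * t ** 2 * c2.x + t ** 3 * to.x,
      y: u ** 3 * from.y + 3 * u ** 2 * t * c1.y + 3 * u * t ** 2 * c2.y + t ** 3 * to.y,
    });
  }
  return points;
}
```

Feeding these points to `page.mouse.move()` one at a time, with small random delays, produces a curved movement instead of a robotic straight line.
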
### 2. Natural Scrolling

- Random amounts (100-400 pixels)
- Variable delays (0.5-2 seconds)
- Occasionally scrolls up instead of down

### 3. Human-like Typing

- Variable delay between keystrokes (50-150 ms)
- Occasional typos that get corrected (2% chance)
- Longer pauses after spaces and punctuation

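A sketch of how per-keystroke delays with those properties could be generated (the numbers mirror the defaults above; the real logic lives in `scripts/human-behavior.js`):

```javascript
// Compute a delay (ms) to wait after typing `char`.
// Pauses are longer after spaces and punctuation; the multipliers
// are illustrative assumptions, not the library's exact values.
function keystrokeDelay(char, minDelay = 50, maxDelay = 150) {
  const base = minDelay + Math.random() * (maxDelay - minDelay);
  if (char === ' ') return base * 2;            // breather between words
  if ('.,!?;:'.includes(char)) return base * 3; // longer pause at punctuation
  return base;
}
```
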
### 4. Randomized Fingerprints

- Random viewport sizes (1366x768, 1920x1080, etc.)
- Rotated user agents
- Realistic browser headers
- Geolocation set to Toronto

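A minimal sketch of fingerprint randomization, assuming illustrative viewport and user-agent lists (the repo's `getHumanizedContext` is the real source of truth):

```javascript
// Pick a random fingerprint for a new browser context.
// The viewport and user-agent lists are illustrative samples.
const VIEWPORTS = [
  { width: 1366, height: 768 },
  { width: 1920, height: 1080 },
  { width: 1536, height: 864 },
];
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];

function randomFingerprint() {
  const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];
  return {
    viewport: pick(VIEWPORTS),
    userAgent: pick(USER_AGENTS),
    locale: 'en-CA',
    geolocation: { latitude: 43.6532, longitude: -79.3832 }, // Toronto
    permissions: ['geolocation'],
  };
}

// Usage: const context = await browser.newContext(randomFingerprint());
```
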
### 5. Reading Simulation

- Random mouse movements while "reading"
- Occasional scrolling
- Natural pauses

## Configuration

Edit `scripts/scraper-config.js` to customize:

```javascript
export const config = {
  humanBehavior: {
    mouse: {
      overshootChance: 0.15,  // Chance of overshooting target
      overshootDistance: 20,  // Pixels to overshoot
    },
    scroll: {
      minAmount: 100,         // Min scroll distance
      maxAmount: 400,         // Max scroll distance
      minDelay: 500,          // Min delay between scrolls
      maxDelay: 2000,         // Max delay between scrolls
    },
    typing: {
      minDelay: 50,           // Min ms between keys
      maxDelay: 150,          // Max ms between keys
      mistakeChance: 0.02,    // 2% typo rate
    }
  }
};
```

## Common Issues & Solutions

### "Browser not found" error

Run:

```bash
npx playwright install chromium
```

### Rate limiting / CAPTCHA

Increase delays between requests:

```bash
node scripts/validate-scraping.js docs/google-alerts.md --delay 15000
```

Or add delays in your code:

```javascript
await randomDelay(10000, 15000); // 10-15 second delay
```

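`randomDelay` comes from the human-behavior library; if you need a standalone equivalent in your own script, a minimal sketch is:

```javascript
// Resolve after a random delay between `min` and `max` milliseconds.
// Minimal stand-in for the library's randomDelay helper.
function randomDelay(min, max) {
  const ms = min + Math.random() * (max - min);
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```
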
### Element not found errors

Increase wait times or add explicit waits:

```javascript
await page.waitForSelector('div.g', { timeout: 30000 });
```

### Tests timeout

Increase the timeout in `playwright.config.js`:

```javascript
timeout: 120 * 1000, // 2 minutes
```

## Best Practices

### 1. Always Add Delays

```javascript
// Wait between searches
await randomDelay(5000, 10000);
```

### 2. Use Headless Mode in Production

```javascript
const browser = await chromium.launch({ headless: true });
```

### 3. Handle Errors Gracefully

```javascript
try {
  const result = await validateQuery(browser, query);
} catch (error) {
  console.error('Failed:', error.message);
  // Continue or retry
}
```

### 4. Respect Rate Limits

- Don't exceed 10 requests per minute
- Add longer delays for production use
- Consider using proxies for high volume

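One way to stay under a 10-requests-per-minute ceiling is a small sliding-window limiter. This is a sketch, not part of the repo:

```javascript
// Sliding-window rate limiter: waitTime() returns how long (ms) the
// caller should sleep so the next request stays within `limit`
// requests per `windowMs`.
class RateLimiter {
  constructor(limit = 10, windowMs = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.timestamps = [];
  }

  waitTime(now = Date.now()) {
    // Drop requests that have left the window.
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length < this.limit) return 0;
    return this.timestamps[0] + this.windowMs - now;
  }

  record(now = Date.now()) {
    this.timestamps.push(now);
  }
}
```

Call `waitTime()` before each request, sleep that long if it is nonzero, then `record()` the request.
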
### 5. Check robots.txt

Before scraping any site:

```bash
curl https://example.com/robots.txt
```

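For a rough programmatic check of that file (simplified; real robots.txt matching also handles `Allow`, wildcards, and per-agent groups):

```javascript
// Very simplified robots.txt check: returns true if `path` is blocked
// by a Disallow rule in the `User-agent: *` group. Real parsers also
// handle Allow rules, wildcards, and agent-specific groups.
function isDisallowed(robotsTxt, path) {
  let inStarGroup = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.split('#')[0].trim();
    if (/^user-agent:/i.test(line)) {
      inStarGroup = line.slice(line.indexOf(':') + 1).trim() === '*';
    } else if (inStarGroup && /^disallow:/i.test(line)) {
      const rule = line.slice(line.indexOf(':') + 1).trim();
      if (rule && path.startsWith(rule)) return true;
    }
  }
  return false;
}
```
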
## Next Steps

1. **Read Full Documentation**: See `docs/PLAYWRIGHT_SCRAPING.md`
2. **Customize Behaviors**: Edit `scripts/scraper-config.js`
3. **Write Custom Scripts**: Use the human-behavior library in your own scripts
4. **Run Tests**: Validate your Google Alert queries

## Example: Custom Script

```javascript
import { chromium } from 'playwright';
import {
  getHumanizedContext,
  humanClick,
  humanType,
  humanScroll
} from './scripts/human-behavior.js';

const browser = await chromium.launch({ headless: false });
const context = await getHumanizedContext(browser);
const page = await context.newPage();

// Your scraping logic here
await page.goto('https://example.com');
await humanScroll(page, { scrollCount: 3 });
await humanClick(page, 'button.submit');

await browser.close();
```

## Getting Help

- Full API documentation: `docs/PLAYWRIGHT_SCRAPING.md`
- Example code: `scripts/example-usage.js`
- Test examples: `tests/human-behavior.test.js`

Happy scraping! 🚀