rss-feedmonitor/docs/QUICKSTART_PLAYWRIGHT.md
# Playwright Scraping Quick Start
Get up and running with Playwright scraping in 5 minutes.
## Installation
### 1. Install Node.js
If you don't have Node.js installed:
**macOS (using Homebrew):**
```bash
brew install node
```
**Ubuntu/Debian:**
```bash
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs
```
**Windows:**
Download from [nodejs.org](https://nodejs.org/)
### 2. Install Dependencies
```bash
cd rss-feedmonitor   # path to your local clone of the repo
npm install
npx playwright install chromium
```
This will install:
- Playwright test framework
- Chromium browser
- All necessary dependencies
## Basic Usage
### Test a Single Query
Search Google with human-like behavior:
```bash
node scripts/playwright-scraper.js '"macbook repair" Toronto'
```
Output will show:
- Number of results found
- First 5 result titles and URLs
- Result statistics from Google
### Scrape a Specific Website
```bash
node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"
```
### Validate Multiple Alerts
Test queries from your markdown files:
```bash
# Test 5 random alerts
node scripts/validate-scraping.js docs/google-alerts-broad.md
# Test 3 alerts with 10 second delay between each
node scripts/validate-scraping.js docs/google-alerts.md --max 3 --delay 10000
# Run in headless mode (no visible browser)
node scripts/validate-scraping.js docs/google-alerts-broad.md --headless
```
This generates a JSON report with:
- Success/failure for each query
- Result counts
- Google's result statistics
- Full result details
### Run Examples
See demonstrations of different scraping scenarios:
```bash
# Run all examples
node scripts/example-usage.js
# Run specific example
node scripts/example-usage.js 1 # Google search
node scripts/example-usage.js 2 # Reddit scraping
node scripts/example-usage.js 3 # Multi-step navigation
node scripts/example-usage.js 4 # Mouse patterns
```
### Run Tests
Execute the test suite:
```bash
# Run with visible browser (see what's happening)
npm run test:headed
# Run in headless mode (faster)
npm test
```
## What Makes It "Human-like"?
The scraper includes several anti-detection features:
### 1. Realistic Mouse Movements
- Smooth bezier curves instead of straight lines
- Occasional overshooting (15% chance)
- Random speeds and accelerations
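The curved paths above can be sketched as a cubic Bézier interpolation between the start and end points, with jittered control points so no two paths are identical. This is a hypothetical standalone helper for illustration; the repo's actual implementation lives in `scripts/human-behavior.js`:

```javascript
// Sketch: generate points along a cubic Bezier curve between two positions.
// Control points are randomly jittered so each path is unique.
function bezierPath(start, end, steps = 25) {
  // Control points roughly 30% and 70% along the line, with random jitter
  const cp1 = {
    x: start.x + (end.x - start.x) * 0.3 + (Math.random() - 0.5) * 100,
    y: start.y + (end.y - start.y) * 0.3 + (Math.random() - 0.5) * 100,
  };
  const cp2 = {
    x: start.x + (end.x - start.x) * 0.7 + (Math.random() - 0.5) * 100,
    y: start.y + (end.y - start.y) * 0.7 + (Math.random() - 0.5) * 100,
  };
  const points = [];
  for (let i = 0; i <= steps; i++) {
    const t = i / steps;
    const u = 1 - t;
    points.push({
      x: u ** 3 * start.x + 3 * u ** 2 * t * cp1.x + 3 * u * t ** 2 * cp2.x + t ** 3 * end.x,
      y: u ** 3 * start.y + 3 * u ** 2 * t * cp1.y + 3 * u * t ** 2 * cp2.y + t ** 3 * end.y,
    });
  }
  return points;
}

// Usage with Playwright's mouse API (sketch):
// for (const p of bezierPath({ x: 0, y: 0 }, target)) await page.mouse.move(p.x, p.y);
```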
### 2. Natural Scrolling
- Random amounts (100-400 pixels)
- Variable delays (0.5-2 seconds)
- Occasionally scrolls up instead of down
### 3. Human-like Typing
- Variable delay between keystrokes (50-150ms)
- Occasional typos that get corrected (2% chance)
- Longer pauses after spaces and punctuation
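The per-keystroke timing can be sketched as a sampler that matches the numbers above (50-150 ms, plus a longer pause after spaces and punctuation). This helper is illustrative, not the library's API:

```javascript
// Sketch: sample a delay for the next keystroke, given the previous character.
// Base delay is uniform in [minDelay, maxDelay]; word/sentence boundaries
// get an extra 100-300 ms pause.
function keystrokeDelay(prevChar, minDelay = 50, maxDelay = 150) {
  let delay = minDelay + Math.random() * (maxDelay - minDelay);
  if (prevChar === ' ' || '.,!?'.includes(prevChar)) {
    delay += 100 + Math.random() * 200; // brief pause after a word or sentence
  }
  return delay;
}

// Usage with Playwright's keyboard API (sketch):
// for (const ch of text) {
//   await page.keyboard.type(ch);
//   await new Promise(r => setTimeout(r, keystrokeDelay(ch)));
// }
```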
### 4. Randomized Fingerprints
- Random viewport sizes (1366x768, 1920x1080, etc.)
- Rotated user agents
- Realistic browser headers
- Geolocation set to Toronto
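Fingerprint randomization amounts to picking Playwright context options from small pools. The viewport sizes and user agents below are illustrative placeholders; the real pools live in `scripts/scraper-config.js`:

```javascript
// Sketch: build randomized browser-context options. Pools here are examples.
const VIEWPORTS = [
  { width: 1366, height: 768 },
  { width: 1920, height: 1080 },
  { width: 1536, height: 864 },
];
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
];

function randomContextOptions() {
  const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];
  return {
    viewport: pick(VIEWPORTS),
    userAgent: pick(USER_AGENTS),
    locale: 'en-CA',
    geolocation: { latitude: 43.6532, longitude: -79.3832 }, // Toronto
    permissions: ['geolocation'],
  };
}

// const context = await browser.newContext(randomContextOptions());
```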
### 5. Reading Simulation
- Random mouse movements while "reading"
- Occasional scrolling
- Natural pauses
## Configuration
Edit `scripts/scraper-config.js` to customize:
```javascript
export const config = {
  humanBehavior: {
    mouse: {
      overshootChance: 0.15,  // Chance of overshooting target
      overshootDistance: 20,  // Pixels to overshoot
    },
    scroll: {
      minAmount: 100,         // Min scroll distance (px)
      maxAmount: 400,         // Max scroll distance (px)
      minDelay: 500,          // Min delay between scrolls (ms)
      maxDelay: 2000,         // Max delay between scrolls (ms)
    },
    typing: {
      minDelay: 50,           // Min ms between keys
      maxDelay: 150,          // Max ms between keys
      mistakeChance: 0.02,    // 2% typo rate
    },
  },
};
```
## Common Issues & Solutions
### "Browser not found" error
Run:
```bash
npx playwright install chromium
```
### Rate limiting / CAPTCHA
Increase delays between requests:
```bash
node scripts/validate-scraping.js docs/google-alerts.md --delay 15000
```
Or add delays in your code:
```javascript
await randomDelay(10000, 15000); // 10-15 second delay
```
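If you're not importing the library's `randomDelay` helper, a minimal equivalent can be sketched as a promise that resolves after a uniformly random pause (an assumed signature matching the calls shown in this guide):

```javascript
// Sketch: uniform random pause between minMs and maxMs.
function sampleDelay(minMs, maxMs) {
  return minMs + Math.random() * (maxMs - minMs);
}

function randomDelay(minMs, maxMs) {
  return new Promise((resolve) => setTimeout(resolve, sampleDelay(minMs, maxMs)));
}
```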
### Element not found errors
Increase wait times or add explicit waits:
```javascript
await page.waitForSelector('div.g', { timeout: 30000 });
```
### Tests timeout
Increase timeout in `playwright.config.js`:
```javascript
timeout: 120 * 1000, // 2 minutes
```
## Best Practices
### 1. Always Add Delays
```javascript
// Wait between searches
await randomDelay(5000, 10000);
```
### 2. Use Headless Mode in Production
```javascript
const browser = await chromium.launch({ headless: true });
```
### 3. Handle Errors Gracefully
```javascript
try {
  const result = await validateQuery(browser, query);
} catch (error) {
  console.error('Failed:', error.message);
  // Continue or retry
}
```
### 4. Respect Rate Limits
- Don't exceed 10 requests per minute
- Add longer delays for production use
- Consider using proxies for high volume
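The 10-requests-per-minute guideline can be enforced with a minimum-interval limiter: given when the last request fired, compute how long to wait before the next one. A minimal sketch (the helper name is hypothetical):

```javascript
// Sketch: minimum-interval rate limiting. At 10 requests/minute the floor
// between requests is 6 seconds.
function waitBeforeNext(lastRequestMs, nowMs, maxPerMinute = 10) {
  const minIntervalMs = 60000 / maxPerMinute;
  const elapsed = nowMs - lastRequestMs;
  return Math.max(0, minIntervalMs - elapsed);
}

// Usage (sketch):
// await new Promise(r => setTimeout(r, waitBeforeNext(lastRequest, Date.now())));
```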
### 5. Check robots.txt
Before scraping any site:
```bash
curl https://example.com/robots.txt
```
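If you want to act on what you fetched, a naive check can collect the `Disallow` rules under `User-agent: *` and match a path by prefix. Real robots.txt matching has more rules (`Allow`, wildcards, longest-match precedence), so treat this only as a sketch:

```javascript
// Sketch: naive robots.txt check. Collects Disallow rules for "User-agent: *"
// and tests a path by prefix. Not a full RFC 9309 matcher.
function isDisallowed(robotsTxt, path) {
  let applies = false;
  const rules = [];
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.trim();
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key)) applies = value === '*';
    else if (applies && /^disallow$/i.test(key) && value) rules.push(value);
  }
  return rules.some((prefix) => path.startsWith(prefix));
}
```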
## Next Steps
1. **Read Full Documentation**: See `docs/PLAYWRIGHT_SCRAPING.md`
2. **Customize Behaviors**: Edit `scripts/scraper-config.js`
3. **Write Custom Scripts**: Use the human-behavior library in your own scripts
4. **Run Tests**: Validate your Google Alert queries
## Example: Custom Script
```javascript
import { chromium } from 'playwright';
import {
  getHumanizedContext,
  humanClick,
  humanType,
  humanScroll
} from './scripts/human-behavior.js';
const browser = await chromium.launch({ headless: false });
const context = await getHumanizedContext(browser);
const page = await context.newPage();
// Your scraping logic here
await page.goto('https://example.com');
await humanScroll(page, { scrollCount: 3 });
await humanClick(page, 'button.submit');
await browser.close();
```
## Getting Help
- Full API documentation: `docs/PLAYWRIGHT_SCRAPING.md`
- Example code: `scripts/example-usage.js`
- Test examples: `tests/human-behavior.test.js`
Happy scraping! 🚀