✅ Playwright Setup Complete
Your RSS Feed Monitor now has full Playwright scraping capabilities with advanced bot detection avoidance!
📦 What Was Created
Core Library
- `scripts/human-behavior.js` (395 lines) - Complete human-like behavior simulation library
  - Bezier curve mouse movements with overshooting
  - Natural scrolling with random intervals
  - Realistic typing with typos and corrections
  - Browser fingerprint randomization
  - Reading simulation utilities
Main Scripts
- `scripts/playwright-scraper.js` (250 lines) - Google search validation with human behavior
  - Website scraping with natural interactions
  - Result extraction and analysis
  - CLI interface for easy usage
- `scripts/validate-scraping.js` (180 lines) - Batch validation of Google Alert queries
  - Markdown file parsing
  - Automatic report generation
  - Configurable delays and limits
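As a rough sketch of that batch flow (the real logic lives in `scripts/validate-scraping.js`; the backtick-based query extraction, the `searchGoogle` callback, and the report fields below are illustrative assumptions, not the script's actual code):

```js
// Hypothetical outline of the batch-validation flow; not validate-scraping.js verbatim.
import { readFile, writeFile } from 'node:fs/promises';

async function validateAlerts(mdPath, searchGoogle, { max = 5, delayMs = 5000 } = {}) {
  const markdown = await readFile(mdPath, 'utf8');
  // Assumption: each alert query appears in the markdown wrapped in backticks.
  const queries = [...markdown.matchAll(/`([^`]+)`/g)].map((m) => m[1]).slice(0, max);

  const results = [];
  for (const query of queries) {
    const hits = await searchGoogle(query);            // caller-supplied search function
    results.push({ query, count: hits.length, ok: hits.length > 0 });
    await new Promise((r) => setTimeout(r, delayMs));  // honor the configured delay
  }

  const successful = results.filter((r) => r.ok).length;
  const report = { total: results.length, successful, failed: results.length - successful, results };
  await writeFile('validation-report.json', JSON.stringify(report, null, 2));
  return report;
}
```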
Configuration & Examples
- `scripts/scraper-config.js` - Centralized configuration for all behavior parameters
  - Easy customization of timing, movements, and patterns
- `scripts/example-usage.js` (300 lines) - 4 complete working examples
  - Google search demo
  - Reddit scraping demo
  - Multi-step navigation demo
  - Mouse pattern demonstrations
Testing
- `tests/human-behavior.test.js` (200 lines) - Comprehensive test suite
  - Examples for all major features
  - Google Alert validation tests
  - Playwright Test framework integration
Documentation
- `docs/PLAYWRIGHT_SCRAPING.md` (550 lines) - Complete API documentation
  - Usage examples for every feature
  - Configuration guide
  - Best practices and troubleshooting
- `docs/QUICKSTART_PLAYWRIGHT.md` (250 lines) - 5-minute setup guide
  - Common use cases
  - Quick reference
Project Files
- `package.json` - Node.js dependencies
- `playwright.config.js` - Playwright test configuration
- `.gitignore` - Excludes node_modules, reports, etc.
- Updated `README.md` - Added Playwright section
🚀 Quick Start
```bash
# 1. Install dependencies
npm install
npx playwright install chromium

# 2. Test a query
node scripts/playwright-scraper.js '"macbook repair" Toronto'

# 3. Validate alerts
node scripts/validate-scraping.js docs/google-alerts-broad.md --max 3

# 4. Run examples
node scripts/example-usage.js 1
```
🤖 Anti-Detection Features
Mouse Movements
- ✅ Smooth bezier curves (not straight lines)
- ✅ Occasional overshooting (15% chance)
- ✅ Variable speeds and acceleration
- ✅ Random pause durations
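The core idea, sketched with Playwright's real `page.mouse.move()` API (the quadratic Bezier math, step count, and 15% overshoot below are illustrative of the behavior listed above, not the library's exact code):

```js
// Sketch: drive the cursor along a curved Bezier path instead of a straight line.
async function curvedMouseMove(page, from, to, steps = 25) {
  // A randomized control point bends the path.
  const control = {
    x: (from.x + to.x) / 2 + (Math.random() - 0.5) * 200,
    y: (from.y + to.y) / 2 + (Math.random() - 0.5) * 200,
  };
  for (let i = 1; i <= steps; i++) {
    const t = i / steps;
    const x = (1 - t) ** 2 * from.x + 2 * (1 - t) * t * control.x + t ** 2 * to.x;
    const y = (1 - t) ** 2 * from.y + 2 * (1 - t) * t * control.y + t ** 2 * to.y;
    await page.mouse.move(x, y);
    await page.waitForTimeout(5 + Math.random() * 15);  // uneven speed between steps
  }
  if (Math.random() < 0.15) {
    await page.mouse.move(to.x + 20, to.y + 10);        // overshoot the target...
    await page.mouse.move(to.x, to.y, { steps: 5 });    // ...then settle back on it
  }
}
```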
Scrolling
- ✅ Random amounts (100-400px)
- ✅ Variable delays (0.5-2s)
- ✅ Occasionally reverses direction
- ✅ Smooth incremental scrolling
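A minimal sketch of that pattern using Playwright's `page.mouse.wheel()` (the loop mirrors the ranges above but is illustrative, not the library's code):

```js
// Sketch: scroll in random increments with pauses, occasionally reversing direction.
async function naturalScroll(page, rounds = 5) {
  for (let i = 0; i < rounds; i++) {
    const amount = 100 + Math.random() * 300;               // 100-400 px per step
    const direction = Math.random() < 0.15 ? -1 : 1;        // 15% chance to scroll back up
    await page.mouse.wheel(0, direction * amount);
    await page.waitForTimeout(500 + Math.random() * 1500);  // 0.5-2 s between scrolls
  }
}
```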
Typing
- ✅ Variable keystroke timing (50-150ms)
- ✅ Occasional typos with corrections (2%)
- ✅ Longer pauses after spaces/punctuation
- ✅ Natural rhythm variations
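Sketched against Playwright's keyboard API (the 2% typo rate and delay ranges follow the list above; the function itself is illustrative, not `human-behavior.js` verbatim):

```js
// Sketch: type character by character with jittered delays and occasional corrected typos.
async function typeLikeHuman(page, selector, text) {
  await page.click(selector);
  for (const char of text) {
    if (Math.random() < 0.02) {
      await page.keyboard.type('x');             // hit a wrong key...
      await page.waitForTimeout(150);
      await page.keyboard.press('Backspace');    // ...and correct it
    }
    await page.keyboard.type(char);
    let pause = 50 + Math.random() * 100;                    // 50-150 ms per keystroke
    if (char === ' ' || /[.,!?]/.test(char)) pause += 200;   // linger after spaces/punctuation
    await page.waitForTimeout(pause);
  }
}
```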
Browser Fingerprinting
- ✅ Randomized viewports (5 common sizes)
- ✅ Rotated user agents (5 realistic UAs)
- ✅ Realistic HTTP headers
- ✅ Geolocation (Toronto by default)
- ✅ Random device scale factors
- ✅ Removes webdriver detection
- ✅ Injects realistic navigator properties
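In Playwright terms, most of this maps onto `newContext()` options plus `context.addInitScript()`; the value pools below are illustrative placeholders, not the library's actual lists:

```js
// Sketch: randomized context fingerprint plus removal of the webdriver flag.
import { chromium } from 'playwright';

const viewports = [{ width: 1366, height: 768 }, { width: 1920, height: 1080 }];
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];
const pick = (arr) => arr[Math.floor(Math.random() * arr.length)];

const browser = await chromium.launch();
const context = await browser.newContext({
  viewport: pick(viewports),
  userAgent: pick(userAgents),
  deviceScaleFactor: pick([1, 1.25, 2]),
  geolocation: { latitude: 43.6532, longitude: -79.3832 },  // Toronto
  permissions: ['geolocation'],
  locale: 'en-CA',
});
// Hide the most common automation giveaway before any page script runs.
await context.addInitScript(() => {
  Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
});
```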
Behavior Patterns
- ✅ Reading simulation (random scrolls + mouse moves)
- ✅ Random observation pauses
- ✅ Natural page load waiting
- ✅ Occasional "accidental" double-clicks (2%)
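Reading simulation is essentially a composition of the primitives above, roughly like this (durations and coordinates are illustrative):

```js
// Sketch: "read" a page for a while by mixing idle pauses, small scrolls, and mouse drift.
async function simulateReading(page, durationMs = 8000) {
  const end = Date.now() + durationMs;
  while (Date.now() < end) {
    await page.mouse.move(100 + Math.random() * 800, 150 + Math.random() * 450);
    await page.mouse.wheel(0, 80 + Math.random() * 200);
    await page.waitForTimeout(1000 + Math.random() * 2000);  // pause as if reading a paragraph
  }
}
```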
📊 Usage Statistics
File Count: 10 new files
- 5 JavaScript modules (1,325 lines)
- 2 Documentation files (800 lines)
- 2 Configuration files
- 1 Test suite (200 lines)
Total Lines of Code: ~2,300 lines
Features Implemented:
- 10+ human behavior simulation functions
- 5 randomized viewport configurations
- 5 realistic user agents
- 4 complete example demonstrations
- 6 comprehensive test cases
- Full API documentation
- CLI tools for validation and scraping
🎯 Use Cases
1. Validate Google Alert Queries

Test if your alert queries actually return results:

```bash
node scripts/validate-scraping.js docs/google-alerts-broad.md
```

2. Scrape Search Results

Get actual search results with full details:

```bash
node scripts/playwright-scraper.js '"laptop repair" Toronto'
```

3. Monitor Reddit

Scrape Reddit with human-like behavior:

```bash
node scripts/playwright-scraper.js --url "https://www.reddit.com/r/toronto"
```

4. Custom Scraping

Use the library in your own scripts:

```js
import { humanClick, humanType, humanScroll } from './scripts/human-behavior.js';
```
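A hypothetical end-to-end flow might look like this (the `(page, selector, ...)` call signatures are assumptions about `human-behavior.js`; check the actual exports before copying):

```js
import { chromium } from 'playwright';
import { humanClick, humanType, humanScroll } from './scripts/human-behavior.js';

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://www.google.com');

await humanClick(page, 'textarea[name="q"]');                             // assumed signature
await humanType(page, 'textarea[name="q"]', '"macbook repair" Toronto');  // assumed signature
await page.keyboard.press('Enter');
await humanScroll(page);                                                   // browse results naturally
await browser.close();
```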
📝 Example Output
Single Query Validation
```
🔍 Searching Google for: "macbook repair" Toronto

📊 Results Summary:
Stats: About 1,234 results (0.45 seconds)
Found: 15 results

✅ Query returned results:
1. MacBook Repair Toronto - Apple Certified
   https://example.com/macbook-repair
   Professional MacBook repair services in Toronto...
```
Batch Validation Report
```json
{
  "total": 5,
  "successful": 4,
  "failed": 1,
  "successRate": 80,
  "results": [...]
}
```
🔧 Customization
All behavior parameters are configurable in `scripts/scraper-config.js`:

```js
mouse: {
  overshootChance: 0.15,        // 15% chance to overshoot
  overshootDistance: 20,        // pixels
  pathSteps: 25,                // bezier curve resolution
},
scroll: {
  minAmount: 100,               // minimum scroll, in pixels
  maxAmount: 400,               // maximum scroll, in pixels
  randomDirectionChance: 0.15,  // 15% chance to reverse
},
typing: {
  minDelay: 50,                 // fastest keystroke, in ms
  maxDelay: 150,                // slowest keystroke, in ms
  mistakeChance: 0.02,          // 2% typo rate
},
```
🧪 Testing
Run the comprehensive test suite:
```bash
# With visible browser (recommended for learning)
npm run test:headed

# Headless (faster)
npm test

# Specific test file
npx playwright test tests/human-behavior.test.js --headed
```
📚 Documentation Structure
```
docs/
├── ALERT_STRATEGY.md          # Existing Google Alerts strategy
├── PLAYWRIGHT_SCRAPING.md     # NEW: Complete API docs (550 lines)
└── QUICKSTART_PLAYWRIGHT.md   # NEW: Quick start guide (250 lines)

scripts/
├── human-behavior.js          # NEW: Core library (395 lines)
├── playwright-scraper.js      # NEW: Main scraper (250 lines)
├── validate-scraping.js       # NEW: Batch validator (180 lines)
├── scraper-config.js          # NEW: Configuration (120 lines)
└── example-usage.js           # NEW: Examples (300 lines)

tests/
└── human-behavior.test.js     # NEW: Test suite (200 lines)
```
⚠️ Important Notes
Rate Limiting
- Default delay: 5 seconds between requests
- Recommended: 10-15 seconds for production
- Google may still show CAPTCHAs with heavy usage
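A simple way to enforce this is a jittered pause between requests, for example:

```js
// Sketch: jittered delay so requests never land on a fixed cadence.
// The 10-15 s default follows the production recommendation above.
function politePause(minMs = 10_000, maxMs = 15_000) {
  return new Promise((resolve) => setTimeout(resolve, minMs + Math.random() * (maxMs - minMs)));
}

// Between scraping calls:
// await searchGoogle(query);   // placeholder for your own scrape/search call
// await politePause();
```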
Legal & Ethical Use
- Always respect robots.txt
- Follow website Terms of Service
- Use reasonable rate limits
- Don't overload servers
Best Practices
- Start with `--headless false` to see behavior
- Increase delays between requests
- Test queries in small batches first
- Monitor for CAPTCHAs or rate limiting
- Use different IP addresses for high volume
🎓 Learning Resources
- Start Here: `docs/QUICKSTART_PLAYWRIGHT.md`
- Full API: `docs/PLAYWRIGHT_SCRAPING.md`
- Examples: `scripts/example-usage.js`
- Tests: `tests/human-behavior.test.js`
- Config: `scripts/scraper-config.js`
🔜 Next Steps
- ✅ Install dependencies: `npm install`
- ✅ Install browser: `npx playwright install chromium`
- 🎯 Try example: `node scripts/example-usage.js 1`
- 🧪 Run tests: `npm run test:headed`
- ✅ Validate alerts: `node scripts/validate-scraping.js docs/google-alerts-broad.md`
- 🚀 Start scraping with confidence!
💡 Tips
- Headed mode (visible browser) is great for development
- Headless mode is faster for production
- Use `--max 3` when testing to limit requests
- Increase `--delay` if you encounter rate limiting
- Check console output for detailed behavior logs
🎉 You're Ready!
Your Playwright setup is complete with state-of-the-art bot detection avoidance. All the tools, examples, and documentation you need are in place.
Happy scraping! 🚀
Need Help?
- Read the docs: `docs/PLAYWRIGHT_SCRAPING.md`
- Check examples: `scripts/example-usage.js`
- Run tests: `npm run test:headed`