# RSS Feed Monitor - Google Alerts
This repository contains validated Google Alert queries for monitoring repair-related discussions across Canadian platforms.
## ⚠️ START HERE
**✨ NEW: Production-Ready Reddit Alerts Available!**
Use `docs/google-alerts-reddit-tuned.md` for **validated, high-performance alerts** that produce regular, relevant results.
**Read `REDDIT_ALERTS_COMPLETE.md`** for test results showing 100% success rate and 10/10 relevant results.
## Files
### Documentation
- **`docs/google-alerts-reddit-tuned.md`** - ✨ **START HERE** - 25 production-ready alerts (100% validated)
- **`REDDIT_ALERTS_COMPLETE.md`** - ✨ **READ SECOND** - Complete test results and setup guide
- `docs/REDDIT_KEYWORDS.md` - Consumer language keyword conversion table
- `docs/google-alerts-broad.md` - Original 84 alerts (needs tuning)
- `docs/google-alerts.md` - Regional Reddit queries (61 alerts, low volume)
- `docs/PLAYWRIGHT_SCRAPING.md` - Guide to Playwright scraping with anti-detection
- `docs/PLAYWRIGHT_RECORDING.md` - Guide to recording alert setup with codegen
### Python Tools
- `scripts/validate_alerts.py` - Validator tool that checks queries and generates fixes
- `scripts/generate_broad_queries.py` - Generates location-based broad queries
### Playwright Tools (NEW)
- `scripts/human-behavior.js` - Human-like behavior library for bot detection avoidance
- `scripts/playwright-scraper.js` - Main scraper with Google search validation
- `scripts/validate-scraping.js` - Batch validator for testing multiple alerts
- `scripts/example-usage.js` - Usage examples and demonstrations
- `scripts/scraper-config.js` - Configuration for behavior fine-tuning
- `tests/alert-setup.spec.js` - Test documenting alert setup process
## Quick Start
### 1. Test Before You Create
**Copy this query and test in Google Search (NOT Alerts):**
```
"macbook repair" ("Toronto" OR "Mississauga" OR "Kitchener")
```
If you see 50+ results → the broad approach works ✅
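
If you test many queries, building the search URL in a script avoids copy-and-paste quoting mistakes. The sketch below is an illustration only: `search_url` is a hypothetical helper, not part of this repository's scripts, and it assumes the standard `https://www.google.com/search?q=` URL form.

```python
from urllib.parse import urlencode


def search_url(query: str) -> str:
    """Return a Google Search URL for manually testing an alert query."""
    # urlencode percent-encodes quotes and parentheses and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})


query = '"macbook repair" ("Toronto" OR "Mississauga" OR "Kitchener")'
print(search_url(query))
```

Open the printed URL in a browser and check the result count as described above.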
### 2. Choose Your Strategy
- **Want results now?** Use `docs/google-alerts-broad.md` (recommended)
- **Want Reddit-only?** Use `docs/google-alerts.md` (may have low volume)
- **Not sure?** Read `docs/ALERT_STRATEGY.md`
### 3. Set Up Alerts
1. Open the file you chose
2. Find an alert (e.g., "Data Recovery - Ontario")
3. Copy the query block (everything inside ` ``` `)
4. Go to [Google Alerts](https://www.google.com/alerts)
5. Paste the query, then set `As-it-happens` as the frequency and `RSS feed` as the delivery method
6. Click `Create Alert`
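
Once an alert delivers to a feed, a script can read it. The following is a minimal sketch, assuming the feed is served as Atom XML; `parse_entries` and the `SAMPLE` document are hypothetical and made up for illustration, not taken from this repository or from a real alert.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds (assumption: the alert feed is Atom).
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written sample feed, for illustration only.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample alert</title>
  <entry>
    <title>MacBook repair question</title>
    <link href="https://example.com/post"/>
    <published>2024-01-01T00:00:00Z</published>
  </entry>
</feed>"""


def parse_entries(feed_xml: str) -> list[dict]:
    """Extract title, link, and publication time from each Atom entry."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        link = entry.find("atom:link", ATOM)
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM),
            "link": link.get("href", "") if link is not None else "",
            "published": entry.findtext("atom:published", default="", namespaces=ATOM),
        })
    return entries


for item in parse_entries(SAMPLE):
    print(item["published"], item["title"], item["link"])
```

To use it against a real alert, fetch the feed URL that Google Alerts shows for the alert and pass the response body to `parse_entries`.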
### Validating Queries
#### Python Validator (Static Analysis)
Run the validator to check query structure and limits:
```bash
python3 scripts/validate_alerts.py docs/google-alerts.md
```
To regenerate working queries from a broken file:
```bash
python3 scripts/validate_alerts.py docs/google-alerts.md --fix > docs/google-alerts-fixed.md
```
#### Playwright Validator (Live Testing) - NEW! 🚀
Test queries by actually searching Google with human-like behavior to avoid bot detection:
```bash
# Install dependencies first
npm install
# Test a single query
node scripts/playwright-scraper.js '"macbook repair" Toronto'
# Batch test multiple alerts from markdown file
node scripts/validate-scraping.js docs/google-alerts-broad.md --max 5
# Run example demonstrations
node scripts/example-usage.js 1
```
**Features:**
- 🤖 Realistic mouse movements with bezier curves and occasional overshooting
- 📜 Natural scrolling patterns with random intervals
- ⌨️ Human-like typing with variable speeds and occasional typos
- ⏱️ Random delays mimicking real user behavior
- 🎭 Randomized browser fingerprints to avoid detection
See `docs/PLAYWRIGHT_SCRAPING.md` for full documentation.
#### Recording Alert Setup Process 🎬
Use Playwright's codegen to record and document the alert setup workflow:
```bash
# Record a new alert setup process
npm run record:alert-setup
```
This opens an interactive browser where you perform the alert setup steps while Playwright generates the matching test code, giving you an exact record of the process for future reference.
See `docs/PLAYWRIGHT_RECORDING.md` for full documentation.
## Query Design
All queries follow these limits to ensure Google Alerts fires reliably:
- **≤8 site filters** per alert
- **≤18 OR terms** per keyword block
- **≤500 characters** total length
- **≤4 exclusion terms** (`-job -entertainment -movie -music`)
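
The limits above can be checked mechanically. The sketch below illustrates one way to do that; it is not the implementation in `scripts/validate_alerts.py`, and `check_limits` and its regular expressions are assumptions about how queries are laid out (OR terms grouped in parentheses, exclusions written as `-term`).

```python
import re

# Limits copied from the list above.
MAX_SITE_FILTERS = 8
MAX_OR_TERMS = 18
MAX_LENGTH = 500
MAX_EXCLUSIONS = 4


def check_limits(query: str) -> list[str]:
    """Return human-readable limit violations (an empty list means none)."""
    problems = []

    sites = re.findall(r"\bsite:\S+", query)
    if len(sites) > MAX_SITE_FILTERS:
        problems.append(f"{len(sites)} site filters (max {MAX_SITE_FILTERS})")

    # An exclusion is a '-' at the start of a token, followed by a word
    # or a quoted phrase; hyphens inside words are not counted.
    exclusions = re.findall(r'(?:^|\s)-(?:"[^"]+"|\w+)', query)
    if len(exclusions) > MAX_EXCLUSIONS:
        problems.append(f"{len(exclusions)} exclusion terms (max {MAX_EXCLUSIONS})")

    # Count OR terms per parenthesised block, skipping site-filter blocks.
    for block in re.findall(r"\(([^()]*)\)", query):
        terms = [t.strip() for t in re.split(r"\s+OR\s+", block) if t.strip()]
        if any(t.startswith("site:") for t in terms):
            continue
        if len(terms) > MAX_OR_TERMS:
            problems.append(f"{len(terms)} OR terms in one block (max {MAX_OR_TERMS})")

    if len(query) > MAX_LENGTH:
        problems.append(f"{len(query)} characters (max {MAX_LENGTH})")

    return problems


print(check_limits('"macbook repair" ("Toronto" OR "Mississauga") -job'))
```

An empty list means the query stays within all four limits.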
## Regional Structure
Reddit-based alerts are split into 5 regions to stay within limits:
1. **Ontario-GTA**: kitchener, waterloo, CambridgeON, guelph, toronto, mississauga, brampton
2. **Ontario-Other**: ontario, londonontario, HamiltonOntario, niagara, ottawa
3. **Western**: vancouver, VictoriaBC, Calgary, Edmonton
4. **Prairies**: saskatoon, regina, winnipeg
5. **Eastern**: montreal, quebeccity, halifax, newfoundland
Each service type (Data Recovery, Laptop Repair, Console Repair, etc.) has 5 regional alerts.
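
As an illustration of how one regional alert could be assembled from these lists, here is a minimal sketch. `build_query` is a hypothetical helper, and the ordering of the parts (site filters, then keywords, then exclusions) is an assumption rather than the layout used in the `docs/` files. The subreddit names come from the regions listed above, and the `site:reddit.com/r/subreddit` form follows this README's Technical Notes.

```python
# Subreddits per region, copied from the list above.
REGIONS = {
    "Ontario-GTA": ["kitchener", "waterloo", "CambridgeON", "guelph",
                    "toronto", "mississauga", "brampton"],
    "Ontario-Other": ["ontario", "londonontario", "HamiltonOntario",
                      "niagara", "ottawa"],
    "Western": ["vancouver", "VictoriaBC", "Calgary", "Edmonton"],
    "Prairies": ["saskatoon", "regina", "winnipeg"],
    "Eastern": ["montreal", "quebeccity", "halifax", "newfoundland"],
}


def build_query(keywords, region, exclusions=()):
    """Combine site filters, exact-phrase keywords, and exclusions."""
    sites = " OR ".join(f"site:reddit.com/r/{sub}" for sub in REGIONS[region])
    terms = " OR ".join(f'"{kw}"' for kw in keywords)
    parts = [f"({sites})", f"({terms})"] + [f"-{word}" for word in exclusions]
    return " ".join(parts)


# One query per region for a single service type.
for region in REGIONS:
    print(region, "->", build_query(["data recovery"], region, ["job"]))
```

The largest region has seven subreddits, so every generated query stays under the eight-site-filter limit.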
## Alert Categories
### Data Recovery (15 alerts)
- General data recovery
- HDD/SSD specialty recovery
- SD card/USB recovery
### Device Repair (25 alerts)
- Laptop/MacBook logic board repair
- GPU/Desktop board repair
- Console repair & refurbishment
- Smartphone repair
- iPad repair
- Connector (FPC) replacement
### Specialized Services (10 alerts)
- Key fob repair
- Microsolder/diagnostics
- Device refurbishment & trade-ins
### Non-Reddit Platforms (11 alerts)
- Kijiji/Used.ca classifieds
- Facebook Marketplace
- Craigslist
- Tech forums
- Discord communities
- Bulk/auction sourcing
## Troubleshooting
**No results coming through?**
1. Test the query in Google Search first (not in Alerts)
2. If Google Search shows results, the alert should work
3. If no results exist, the keywords may be too specific
4. Run `python3 scripts/validate_alerts.py docs/google-alerts.md` (or the file you are using) to check for limit violations
**Alert stopped working?**
Re-run validation and regenerate:
```bash
python3 scripts/validate_alerts.py docs/google-alerts.md --fix > docs/google-alerts-new.md
```
## Technical Notes
- Queries use exact-phrase matching (`"keyword"`) for precision
- The `-"ALERT_NAME:..."` marker was removed from all queries (it caused false negatives)
- Exclusions are limited to high-noise terms only
- Site filters use `site:reddit.com/r/subreddit` format (not full URLs)