## Roadmap (post v0.0.1)
Prioritized from easiest/low-risk to more involved work. Check off as we ship.
### Quick wins (target v0.0.2)
- [x] Add crawl metadata (startedAt, finishedAt, durationMs)
- [x] Include run parameters in report (maxDepth, concurrency, timeout, userAgent, sameHostOnly)
- [x] Status histogram (2xx/3xx/4xx/5xx totals) in summary
- [x] Normalize and dedupe trailing `/.` URL variants in output
- [x] Add compact `reportSummary` text block to JSON
- [x] Top external domains with counts
- [x] Broken links sample (first N) + per-domain broken counts
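For reference, the status histogram item above could be computed roughly as follows. This is a minimal sketch, assuming the crawler is written in Go (as the project name suggests); `statusClass` and `statusHistogram` are hypothetical names, not functions taken from the codebase.

```go
package main

import "fmt"

// statusClass maps an HTTP status code to its class bucket, e.g. 404 -> "4xx".
// Codes outside 100-599 (such as 0 for a failed request) fall into "other".
// Hypothetical helper for illustration only.
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "other"
	}
	return fmt.Sprintf("%dxx", code/100)
}

// statusHistogram counts how many responses fall into each status class.
func statusHistogram(codes []int) map[string]int {
	hist := make(map[string]int)
	for _, c := range codes {
		hist[statusClass(c)]++
	}
	return hist
}

func main() {
	// Sample codes are made up for the example.
	fmt.Println(statusHistogram([]int{200, 204, 301, 404, 500, 503}))
}
```

Keeping an explicit "other" bucket is one way to avoid silently dropping requests that never produced a valid status code.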
### Core additions (default, no flags)
- [ ] Robots.txt summary (present, fetchedAt)
- [ ] Sitemap extras (index → child sitemaps, fetch errors)
- [ ] Per-page response time (responseTimeMs) and content length (basic)
- [ ] Basic page metadata: `<title>`
- [ ] Depth distribution (count of pages by depth)
- [ ] Redirect map summary (from → to domain counts)
### Outputs and UX
- [ ] CSV exports: pages.csv, links.csv
- [ ] NDJSON export option for streaming pipelines
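One possible shape for the NDJSON export is sketched below, assuming a Go implementation. The `pageRecord` type and its fields are placeholders chosen for illustration and do not reflect the real report schema. The sketch relies on `json.Encoder.Encode` writing one JSON value followed by a newline, which matches the NDJSON line format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// pageRecord is a hypothetical per-page row; the real schema may differ.
type pageRecord struct {
	URL    string `json:"url"`
	Status int    `json:"status"`
	Depth  int    `json:"depth"`
}

// writeNDJSON streams one JSON object per line to w.
func writeNDJSON(w io.Writer, pages []pageRecord) error {
	enc := json.NewEncoder(w)
	for _, p := range pages {
		if err := enc.Encode(p); err != nil {
			return err
		}
	}
	return nil
}

// ndjsonString renders the records to a string, mainly useful in tests.
func ndjsonString(pages []pageRecord) (string, error) {
	var sb strings.Builder
	if err := writeNDJSON(&sb, pages); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Sample records are made up for the example.
	pages := []pageRecord{
		{URL: "https://example.com/", Status: 200, Depth: 0},
		{URL: "https://example.com/about", Status: 404, Depth: 1},
	}
	if err := writeNDJSON(os.Stdout, pages); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Because each record is encoded and written independently, a downstream consumer can start processing before the crawl finishes.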
### Notes
- All report metrics must be gathered by default with zero flags required.
- Keep JSON stable and sorted; update `reports/REPORT_SCHEMA.md` when fields change.
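As a sketch of what "stable and sorted" might look like in practice, assuming Go: `encoding/json` already emits map keys in sorted order and struct fields in declaration order, so the main source of nondeterminism is usually slices built by iterating over maps. The `domainCount` type and `sortedDomainCounts` function below are hypothetical names used only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sort"
)

// domainCount is a hypothetical report entry; the real schema may differ.
type domainCount struct {
	Domain string `json:"domain"`
	Count  int    `json:"count"`
}

// sortedDomainCounts turns a map into a slice with a deterministic order:
// highest count first, ties broken alphabetically by domain.
func sortedDomainCounts(m map[string]int) []domainCount {
	out := make([]domainCount, 0, len(m))
	for d, c := range m {
		out = append(out, domainCount{Domain: d, Count: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Domain < out[j].Domain
	})
	return out
}

func main() {
	// Sample counts are made up for the example.
	counts := map[string]int{"b.example": 2, "a.example": 2, "c.example": 5}
	b, err := json.Marshal(sortedDomainCounts(counts))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(string(b))
}
```

The explicit tie-breaker matters: without it, entries with equal counts could swap places between runs and produce noisy diffs.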