## Roadmap (post v0.0.2)

Items are ordered from easiest and lowest-risk to most involved. Check each one off as it ships.

### Shipped in v0.0.2

- [x] Add crawl metadata (startedAt, finishedAt, durationMs)
- [x] Include run parameters in report (maxDepth, concurrency, timeout, userAgent, sameHostOnly)
- [x] Status histogram (2xx/3xx/4xx/5xx totals) in summary
- [x] Normalize and dedupe trailing `/.` URL variants in output
- [x] Add compact `reportSummary` text block to JSON
- [x] Top external domains with counts
- [x] Broken links sample (first N) + per-domain broken counts
- [x] Robots.txt summary (present, fetchedAt)
- [x] Sitemap extras (index → child sitemaps, fetch errors)
- [x] Per-page response time (responseTimeMs) and content length (basic)
- [x] Basic page metadata: `<title>`
- [x] Depth distribution (count of pages by depth)
- [x] Redirect map summary (from → to domain counts)
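
As an illustration of the status histogram item above, here is a minimal sketch of bucketing HTTP status codes into class totals. The names `statusClass` and `statusHistogram` are hypothetical and not taken from the project, and the `other` bucket for out-of-range codes is an assumption.

```go
package main

import "fmt"

// statusClass maps an HTTP status code to its class label ("2xx", "3xx", ...).
// Codes outside 100-599 (for example 0 for a failed fetch) are grouped under
// "other"; that bucket is an assumption made for this sketch.
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "other"
	}
	return fmt.Sprintf("%dxx", code/100)
}

// statusHistogram counts how many codes fall into each class.
func statusHistogram(codes []int) map[string]int {
	hist := make(map[string]int)
	for _, c := range codes {
		hist[statusClass(c)]++
	}
	return hist
}

func main() {
	hist := statusHistogram([]int{200, 204, 301, 404, 404, 500, 0})
	fmt.Println(hist)
}
```

Returning a map keeps the summary easy to serialize, since `encoding/json` writes map keys in sorted order.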

### Next (target v0.0.3)

- [x] CSV exports: pages.csv, links.csv
- [x] NDJSON export option for streaming pipelines
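
For the NDJSON export, one possible shape is a single JSON object per line, which `json.Encoder` produces directly because `Encode` appends a newline after each value. The `Page` struct and the `writeNDJSON` and `ndjsonString` helpers below are hypothetical; the field names are borrowed from the roadmap items and may not match the real report schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Page is a hypothetical per-page record, not the project's actual type.
type Page struct {
	URL            string `json:"url"`
	Status         int    `json:"status"`
	Depth          int    `json:"depth"`
	ResponseTimeMs int64  `json:"responseTimeMs"`
	Title          string `json:"title,omitempty"`
}

// writeNDJSON streams one JSON object per line to w.
// Encoder.Encode terminates each value with '\n'.
func writeNDJSON(w io.Writer, pages []Page) error {
	enc := json.NewEncoder(w)
	for _, p := range pages {
		if err := enc.Encode(p); err != nil {
			return err
		}
	}
	return nil
}

// ndjsonString is a convenience wrapper that collects the output in memory.
func ndjsonString(pages []Page) (string, error) {
	var sb strings.Builder
	if err := writeNDJSON(&sb, pages); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := ndjsonString([]Page{
		{URL: "https://example.com/", Status: 200, Depth: 0, ResponseTimeMs: 12, Title: "Home"},
		{URL: "https://example.com/a", Status: 404, Depth: 1, ResponseTimeMs: 7},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Writing to an `io.Writer` lets the same function target a file, stdout, or a pipe, so records can be emitted as pages finish rather than after the whole crawl.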

### Notes

- All report metrics must be gathered by default, with no flags required.
- Keep JSON stable and sorted; update `reports/REPORT_SCHEMA.md` when fields change.
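
One way to keep JSON output stable, sketched under assumptions: Go map iteration order is randomized, so any list derived from a map needs an explicit sort before it is serialized. The `DomainCount` type and `sortedDomainCounts` function are hypothetical, and the ordering rule (count descending, then domain ascending) is only one reasonable choice.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// DomainCount is a hypothetical entry for a "top external domains" list.
type DomainCount struct {
	Domain string `json:"domain"`
	Count  int    `json:"count"`
}

// sortedDomainCounts turns a map into a deterministically ordered slice:
// highest count first, ties broken alphabetically by domain.
func sortedDomainCounts(counts map[string]int) []DomainCount {
	out := make([]DomainCount, 0, len(counts))
	for d, c := range counts {
		out = append(out, DomainCount{Domain: d, Count: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Domain < out[j].Domain
	})
	return out
}

func main() {
	counts := map[string]int{"b.example": 3, "a.example": 3, "c.example": 7}
	b, err := json.Marshal(sortedDomainCounts(counts))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

The explicit tie-breaker matters: without it, domains with equal counts could swap places between runs and produce noisy diffs in saved reports.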