gosint-sitecrawl/TODO.md

Roadmap (post v0.0.2)

Items are ordered from easiest and lowest-risk to most involved. Check them off as they ship.

Shipped in v0.0.2

  • Add crawl metadata (startedAt, finishedAt, durationMs)
  • Include run parameters in report (maxDepth, concurrency, timeout, userAgent, sameHostOnly)
  • Status histogram (2xx/3xx/4xx/5xx totals) in summary
  • Normalize URLs and dedupe trailing-slash (/) variants in output
  • Add compact reportSummary text block to JSON
  • Top external domains with counts
  • Broken links sample (first N) + per-domain broken counts
  • Robots.txt summary (present, fetchedAt)
  • Sitemap extras (index → child sitemaps, fetch errors)
  • Per-page response time (responseTimeMs) and content length (basic)
  • Basic page metadata: <title>
  • Depth distribution (count of pages by depth)
  • Redirect map summary (from → to domain counts)
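
As an illustration of two of the shipped items (the status histogram and the trailing-slash dedupe), here is a minimal Go sketch. The function names `statusHistogram`, `normalizeURL`, and `dedupeURLs` are hypothetical and not taken from the crawler's code. The normalization rule is also an assumption: it strips trailing slashes from the path and treats an empty path as `/`, which may differ from what the tool actually does.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// statusClass maps an HTTP status code to a bucket such as "2xx".
// Codes outside 100-599 (for example 0 for a failed fetch) go to "other".
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "other"
	}
	return fmt.Sprintf("%dxx", code/100)
}

// statusHistogram counts how many status codes fall into each class.
func statusHistogram(codes []int) map[string]int {
	hist := make(map[string]int)
	for _, c := range codes {
		hist[statusClass(c)]++
	}
	return hist
}

// normalizeURL trims trailing slashes from the path so that
// "/docs/" and "/docs" compare equal. An empty path becomes "/".
// Assumption: both variants serve the same resource.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	p := strings.TrimRight(u.Path, "/")
	if p == "" {
		p = "/"
	}
	u.Path = p
	u.RawPath = "" // let String() re-encode the modified path
	return u.String(), nil
}

// dedupeURLs returns normalized URLs in first-seen order without duplicates.
func dedupeURLs(raw []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, r := range raw {
		key, err := normalizeURL(r)
		if err != nil {
			key = r // keep unparseable input as-is rather than dropping it
		}
		if !seen[key] {
			seen[key] = true
			out = append(out, key)
		}
	}
	return out
}

func main() {
	fmt.Println(statusHistogram([]int{200, 204, 301, 404, 500}))
	for _, u := range dedupeURLs([]string{
		"https://example.com/docs/",
		"https://example.com/docs",
	}) {
		fmt.Println(u)
	}
}
```

Keeping first-seen order in `dedupeURLs` is a design choice made here so the output stays stable across runs with the same input.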

Next (target v0.0.3)

  • CSV exports: pages.csv, links.csv
  • NDJSON export option for streaming pipelines
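
The two planned exports could look roughly like the following Go sketch, which uses only the standard library. The `Page` struct, its field names, and the CSV column order are assumptions made for illustration; the real `pages.csv` layout is not yet defined in this roadmap.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"io"
	"os"
	"strconv"
	"strings"
)

// Page is a hypothetical per-page record; the field names loosely
// follow the report fields listed in this roadmap.
type Page struct {
	URL            string `json:"url"`
	Status         int    `json:"status"`
	Depth          int    `json:"depth"`
	ResponseTimeMs int64  `json:"responseTimeMs"`
	Title          string `json:"title"`
}

// writePagesCSV writes a header row followed by one row per page.
func writePagesCSV(w io.Writer, pages []Page) error {
	cw := csv.NewWriter(w)
	if err := cw.Write([]string{"url", "status", "depth", "responseTimeMs", "title"}); err != nil {
		return err
	}
	for _, p := range pages {
		rec := []string{
			p.URL,
			strconv.Itoa(p.Status),
			strconv.Itoa(p.Depth),
			strconv.FormatInt(p.ResponseTimeMs, 10),
			p.Title,
		}
		if err := cw.Write(rec); err != nil {
			return err
		}
	}
	cw.Flush()
	return cw.Error()
}

// writePagesNDJSON writes one JSON object per line. json.Encoder
// appends a newline after each value, which is the NDJSON framing.
func writePagesNDJSON(w io.Writer, pages []Page) error {
	enc := json.NewEncoder(w)
	enc.SetEscapeHTML(false) // keep characters like & readable in titles
	for _, p := range pages {
		if err := enc.Encode(p); err != nil {
			return err
		}
	}
	return nil
}

// render runs an exporter against an in-memory buffer; handy for tests.
func render(export func(io.Writer, []Page) error, pages []Page) (string, error) {
	var sb strings.Builder
	err := export(&sb, pages)
	return sb.String(), err
}

func main() {
	pages := []Page{{URL: "https://example.com/", Status: 200, ResponseTimeMs: 12, Title: "Home"}}
	if err := writePagesCSV(os.Stdout, pages); err != nil {
		panic(err)
	}
	if err := writePagesNDJSON(os.Stdout, pages); err != nil {
		panic(err)
	}
}
```

Both writers take an `io.Writer`, so the same code can target a file, stdout, or a pipe. The NDJSON writer emits each page as soon as it is encoded, which is what makes it suitable for streaming pipelines.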

Notes

  • All report metrics must be gathered by default with zero flags required.
  • Keep JSON stable and sorted; update reports/REPORT_SCHEMA.md when fields change.
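
One way to meet the "stable and sorted" note in Go is sketched below. It relies on two documented `encoding/json` behaviors: struct fields are emitted in declaration order, and map keys are emitted in sorted order. Slices are not reordered, so they need an explicit sort with a tie-breaker. The `Summary` and `DomainCount` types and their JSON field names are hypothetical, not the project's actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// DomainCount is a hypothetical entry for the "top external domains" list.
type DomainCount struct {
	Domain string `json:"domain"`
	Count  int    `json:"count"`
}

// Summary is a hypothetical slice of the report used for this sketch.
type Summary struct {
	StatusHistogram    map[string]int `json:"statusHistogram"`
	TopExternalDomains []DomainCount  `json:"topExternalDomains"`
}

// sortedDomainCounts turns a map into a slice ordered by count
// (descending), then by domain (ascending) so ties are deterministic.
func sortedDomainCounts(counts map[string]int) []DomainCount {
	out := make([]DomainCount, 0, len(counts))
	for d, c := range counts {
		out = append(out, DomainCount{Domain: d, Count: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Domain < out[j].Domain
	})
	return out
}

// marshalSummary encodes the summary; map keys come out sorted and
// struct fields follow declaration order, so repeated calls match.
func marshalSummary(s Summary) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

func main() {
	s := Summary{
		StatusHistogram:    map[string]int{"5xx": 1, "2xx": 3},
		TopExternalDomains: sortedDomainCounts(map[string]int{"b.example": 2, "a.example": 2, "c.example": 5}),
	}
	out, err := marshalSummary(s)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Without the domain tie-breaker, entries with equal counts could swap places between runs because Go map iteration order is randomized.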