Roadmap (post v0.0.1)

Prioritized from easiest/low-risk to more involved work. Check off as we ship.

Quick wins (target v0.0.2)

  • Add crawl metadata (startedAt, finishedAt, durationMs)
  • Include run parameters in report (maxDepth, concurrency, timeout, userAgent, sameHostOnly)
  • Status histogram (2xx/3xx/4xx/5xx totals) in summary
  • Normalize and dedupe trailing-slash URL variants (e.g. /docs vs /docs/) in output
  • Add compact reportSummary text block to JSON
  • Top external domains with counts
  • Broken links sample (first N) + per-domain broken counts
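
A rough sketch of two of the quick wins above, the status histogram and trailing-slash normalization. The function names (statusHistogram, normalizeURL) and the "other" bucket are assumptions for illustration, not existing code in this repo.

```go
package main

import (
	"net/url"
	"strings"
)

// statusHistogram buckets HTTP status codes into 2xx/3xx/4xx/5xx totals.
// Codes outside 200-599 land in an "other" bucket (an assumption).
func statusHistogram(codes []int) map[string]int {
	hist := map[string]int{}
	for _, c := range codes {
		switch {
		case c >= 200 && c < 300:
			hist["2xx"]++
		case c >= 300 && c < 400:
			hist["3xx"]++
		case c >= 400 && c < 500:
			hist["4xx"]++
		case c >= 500 && c < 600:
			hist["5xx"]++
		default:
			hist["other"]++
		}
	}
	return hist
}

// normalizeURL strips trailing slashes from non-root paths so that
// "/docs" and "/docs/" collapse to the same key. An empty path becomes "/".
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if len(u.Path) > 1 {
		u.Path = strings.TrimRight(u.Path, "/")
	}
	if u.Path == "" {
		u.Path = "/"
	}
	return u.String(), nil
}
```

Deduping then amounts to keying a map on the normalized string. Paths containing escaped slashes (%2F) would need extra care that this sketch does not attempt.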

Moderate scope

  • Robots.txt summary (present, fetchedAt, sample disallow rules)
  • Sitemap extras (index → child sitemaps, fetch errors)
  • Per-page response time (responseTimeMs) and content length
  • Basic page metadata: <title>, canonical (if present)
  • Depth distribution (count of pages by depth)
  • Duplicate title/canonical detection (lists of URLs)
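
One possible shape for the depth distribution and duplicate-title items above. The Page struct and both function names are hypothetical; the real crawler types may look different.

```go
package main

import "sort"

// Page is a hypothetical per-page record; field names are assumptions.
type Page struct {
	URL   string
	Title string
	Depth int
}

// depthDistribution counts pages by crawl depth.
func depthDistribution(pages []Page) map[int]int {
	dist := map[int]int{}
	for _, p := range pages {
		dist[p.Depth]++
	}
	return dist
}

// duplicateTitles returns only the titles shared by two or more URLs,
// each mapped to its sorted URL list. Pages with an empty title are skipped.
func duplicateTitles(pages []Page) map[string][]string {
	byTitle := map[string][]string{}
	for _, p := range pages {
		if p.Title == "" {
			continue
		}
		byTitle[p.Title] = append(byTitle[p.Title], p.URL)
	}
	for title, urls := range byTitle {
		if len(urls) < 2 {
			delete(byTitle, title)
			continue
		}
		sort.Strings(urls) // keep output order stable across runs
	}
	return byTitle
}
```

The same grouping would work for canonical URLs by keying on a canonical field instead of Title.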

Content/asset analysis

  • Extract assets (images/css/js) per page with status/type/size
  • Mixed-content detection (http assets on https pages)
  • Image accessibility metric (alt present ratio)
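
A minimal sketch of the mixed-content check above, assuming asset references are resolved against the page URL first. isMixedContent is a made-up helper name.

```go
package main

import "net/url"

// isMixedContent reports whether an asset referenced from an https page
// resolves to a plain http URL. Relative and protocol-relative references
// are resolved against the page URL before the schemes are compared.
func isMixedContent(pageURL, assetRef string) (bool, error) {
	page, err := url.Parse(pageURL)
	if err != nil {
		return false, err
	}
	ref, err := url.Parse(assetRef)
	if err != nil {
		return false, err
	}
	asset := page.ResolveReference(ref)
	return page.Scheme == "https" && asset.Scheme == "http", nil
}
```

Protocol-relative references such as //cdn.example.com/app.js inherit the page scheme, so they are not flagged on https pages.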

Security and quality signals

  • Security headers by host (HSTS, CSP, X-Frame-Options, Referrer-Policy)
  • Insecure forms (http action on https page)
  • Large pages and slow pages (p95 thresholds) summary
  • Redirect map (from → to, hops; count summary)
  • Indegree/outdegree stats; small graph summary
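
For the security headers item above, a per-response check could look roughly like this. missingSecurityHeaders is a hypothetical name, and the list covers only the four headers named in the bullet.

```go
package main

import "net/http"

// securityHeaders lists the headers named in the roadmap item, in report order.
var securityHeaders = []string{
	"Strict-Transport-Security", // HSTS
	"Content-Security-Policy",   // CSP
	"X-Frame-Options",
	"Referrer-Policy",
}

// missingSecurityHeaders returns the headers from securityHeaders that are
// absent or empty in h, preserving the order above.
func missingSecurityHeaders(h http.Header) []string {
	var missing []string
	for _, name := range securityHeaders {
		if h.Get(name) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}
```

Aggregating by host would then be a matter of merging these results across the responses seen for each hostname.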

Outputs and UX

  • CSV exports: pages.csv, links.csv, assets.csv
  • NDJSON export option for streaming pipelines
  • Optional: include file/line anchors in JSON for large outputs
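
The NDJSON export above could be as small as one JSON object per line. A sketch, assuming a hypothetical PageRecord type; the actual record fields are not decided here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
)

// PageRecord is a hypothetical export row; field names are assumptions.
type PageRecord struct {
	URL    string `json:"url"`
	Status int    `json:"status"`
}

// writeNDJSON writes one JSON object per line. json.Encoder.Encode appends
// a newline after each value, which is what NDJSON consumers expect.
func writeNDJSON(w io.Writer, records []PageRecord) error {
	enc := json.NewEncoder(w)
	for _, rec := range records {
		if err := enc.Encode(rec); err != nil {
			return err
		}
	}
	return nil
}

// ndjsonString is a convenience wrapper for tests and small outputs.
func ndjsonString(records []PageRecord) (string, error) {
	var buf bytes.Buffer
	if err := writeNDJSON(&buf, records); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```

Because writeNDJSON takes an io.Writer, the same function can target a file or stdout, and records can be emitted as pages finish instead of after the whole crawl.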

Notes

  • Keep JSON stable and sorted; avoid breaking changes. If we change fields, bump minor version and document in reports/REPORT_SCHEMA.md.
  • Favor opt-in flags for heavier analyses (assets, headers) to keep default runs fast.
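
On keeping JSON stable and sorted: encoding/json already emits struct fields in declaration order and map keys in sorted order, so the main thing left is sorting slices before marshaling. A sketch with hypothetical Report and PageEntry types:

```go
package main

import (
	"encoding/json"
	"sort"
)

// PageEntry and Report are hypothetical shapes; the real schema lives in
// reports/REPORT_SCHEMA.md.
type PageEntry struct {
	URL    string `json:"url"`
	Status int    `json:"status"`
}

type Report struct {
	Pages        []PageEntry    `json:"pages"`
	StatusCounts map[string]int `json:"statusCounts"`
}

// marshalStable sorts pages by URL on a copy, so the caller's slice is not
// reordered, then marshals with indentation. Map keys are sorted by
// encoding/json itself.
func marshalStable(r Report) ([]byte, error) {
	pages := make([]PageEntry, len(r.Pages))
	copy(pages, r.Pages)
	sort.Slice(pages, func(i, j int) bool { return pages[i].URL < pages[j].URL })
	r.Pages = pages
	return json.MarshalIndent(r, "", "  ")
}
```

With this, two runs that discover the same pages in a different order produce byte-identical reports, which keeps diffs between runs meaningful.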