
# URLCrawler Report JSON Schema

This document describes the structure of the JSON reports produced by `urlcrawler` when run with `-output json`.

## Top-level object

```json
{
  "target": "https://example.com",
  "crawledUrls": ["https://example.com", "https://example.com/about"],
  "sitemapUrls": ["https://example.com", "https://example.com/about"],
  "crawlErrors": {"https://bad.example": "error string"},
  "linkStatuses": [
    {"url": "https://example.com", "statusCode": 200, "ok": true},
    {"url": "https://other.example/broken", "statusCode": 404, "ok": false, "error": "..."}
  ],
  "pageOutlinks": {
    "https://example.com": ["https://example.com/about", "https://other.example/"]
  },
  "linkSources": {
    "https://example.com/about": ["https://example.com"]
  },
  "missingInSitemap": ["https://example.com/page-not-in-sitemap"],
  "inSitemapNotCrawled": ["https://example.com/deferred"],
  "metadata": {
    "startedAt": "2025-08-31T12:34:56Z",
    "finishedAt": "2025-08-31T12:35:57Z",
    "durationMs": 61000
  },
  "params": {
    "maxDepth": 1,
    "concurrency": 5,
    "timeoutMs": 5000,
    "userAgent": "urlcrawler/1.0",
    "sameHostOnly": true
  },
  "stats": {
    "ok": 12,
    "broken": 1,
    "status2xx": 12,
    "status3xx": 0,
    "status4xx": 1,
    "status5xx": 0,
    "statusOther": 0
  },
  "reportSummary": "crawled=2 sitemap=2 links=5 ok=4 broken=1",
  "topExternalDomains": [{"domain": "example-cdn.com", "count": 2}],
  "brokenSample": [{"url": "https://other.example/broken", "statusCode": 404, "ok": false}],
  "brokenByDomain": [{"domain": "other.example", "count": 1}],
  "pages": {
    "https://example.com": {"title": "Home — Example", "responseTimeMs": 42, "contentLength": 5123, "depth": 0}
  },
  "depthDistribution": {"0": 1, "1": 3},
  "robots": {"present": true, "fetchedAt": "2025-08-31T12:34:59Z"}
}
```
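
For programmatic consumption, a Go program can decode this structure with types like the following. This is a minimal sketch: the Go type and field names are illustrative choices made for this document, not taken from the urlcrawler source; only the JSON tags follow the schema described below.

```go
// report_types.go — illustrative Go types for decoding a urlcrawler JSON
// report. Type and field names are assumptions for this sketch; only the
// JSON tags come from the schema in this document.
package main

import "time"

type LinkStatus struct {
	URL        string `json:"url"`
	StatusCode int    `json:"statusCode"`
	OK         bool   `json:"ok"`
	Error      string `json:"error,omitempty"`
}

type DomainCount struct {
	Domain string `json:"domain"`
	Count  int    `json:"count"`
}

type PageMeta struct {
	Title          string `json:"title"`
	ResponseTimeMs int64  `json:"responseTimeMs"`
	ContentLength  int64  `json:"contentLength"`
	Depth          int    `json:"depth"`
}

type Metadata struct {
	StartedAt  time.Time `json:"startedAt"`
	FinishedAt time.Time `json:"finishedAt"`
	DurationMs int64     `json:"durationMs"`
}

type Params struct {
	MaxDepth     int    `json:"maxDepth"`
	Concurrency  int    `json:"concurrency"`
	TimeoutMs    int64  `json:"timeoutMs"`
	UserAgent    string `json:"userAgent"`
	SameHostOnly bool   `json:"sameHostOnly"`
}

type Stats struct {
	OK          int `json:"ok"`
	Broken      int `json:"broken"`
	Status2xx   int `json:"status2xx"`
	Status3xx   int `json:"status3xx"`
	Status4xx   int `json:"status4xx"`
	Status5xx   int `json:"status5xx"`
	StatusOther int `json:"statusOther"`
}

type Robots struct {
	Present   bool       `json:"present"`
	FetchedAt *time.Time `json:"fetchedAt,omitempty"`
}

type Report struct {
	Target              string              `json:"target"`
	CrawledURLs         []string            `json:"crawledUrls"`
	SitemapURLs         []string            `json:"sitemapUrls,omitempty"`
	CrawlErrors         map[string]string   `json:"crawlErrors,omitempty"`
	LinkStatuses        []LinkStatus        `json:"linkStatuses"`
	PageOutlinks        map[string][]string `json:"pageOutlinks"`
	LinkSources         map[string][]string `json:"linkSources"`
	MissingInSitemap    []string            `json:"missingInSitemap,omitempty"`
	InSitemapNotCrawled []string            `json:"inSitemapNotCrawled,omitempty"`
	Metadata            Metadata            `json:"metadata"`
	Params              Params              `json:"params"`
	Stats               Stats               `json:"stats"`
	ReportSummary       string              `json:"reportSummary"`
	TopExternalDomains  []DomainCount       `json:"topExternalDomains"`
	BrokenSample        []LinkStatus        `json:"brokenSample"`
	BrokenByDomain      []DomainCount       `json:"brokenByDomain"`
	Pages               map[string]PageMeta `json:"pages"`
	DepthDistribution   map[string]int      `json:"depthDistribution"`
	Robots              Robots              `json:"robots"`
}
```

The pointer and `omitempty` choices above are just one way to model the optional fields; any decoder that tolerates missing keys will work.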

## Fields

- `target` (string): Normalized start URL used for the crawl.

- `crawledUrls` (string[]): Unique URLs that were visited during crawling. Sorted for stability.

- `sitemapUrls` (string[]; optional): All URLs discovered via `sitemap.xml` (and nested sitemaps). Omitted when no sitemap is found.

- `crawlErrors` (object map<string,string>; optional): Maps URL → error message for requests that failed (e.g., network/TLS/timeouts). Only set when errors occurred.

- `linkStatuses` (LinkStatus[]): Results of HTTP status checks for all unique links discovered (including the crawled pages themselves). A sketch that recomputes broken-link counts from this array appears after this list.
  - `url` (string): The checked URL.
  - `statusCode` (number): HTTP status code (0 if the request failed before a response was received).
  - `ok` (boolean): Convenience flag; true when 200 ≤ statusCode < 400 and no error occurred.
  - `error` (string; optional): Error message set when the request failed or another client-side error occurred.

- `pageOutlinks` (object map<string,string[]>): For each crawled page URL, the list of normalized outgoing links (internal and external).

- `linkSources` (object map<string,string[]>): Inverse index: for each discovered link URL, the list of page URLs where it appeared. A sketch rebuilding this map from `pageOutlinks` appears after this list.

- `missingInSitemap` (string[]; optional): URLs that were crawled but not present in the sitemap.

- `inSitemapNotCrawled` (string[]; optional): URLs present in the sitemap that were not crawled (e.g., due to depth limits or off-host rules).

- `metadata` (object): Crawl timing information.
  - `startedAt` (string, RFC3339)
  - `finishedAt` (string, RFC3339)
  - `durationMs` (number)

- `params` (object): Parameters used for the run.
  - `maxDepth` (number)
  - `concurrency` (number)
  - `timeoutMs` (number)
  - `userAgent` (string)
  - `sameHostOnly` (boolean)

- `stats` (object): Summary of link status results.
  - `ok` (number)
  - `broken` (number)
  - `status2xx` (number)
  - `status3xx` (number)
  - `status4xx` (number)
  - `status5xx` (number)
  - `statusOther` (number)

- `reportSummary` (string): Compact summary string such as `crawled=7 sitemap=7 links=26 ok=26 broken=0`.

- `topExternalDomains` (DomainCount[]): Top external domains referenced by links.

- `brokenSample` (LinkStatus[]): Up to 10 example broken links.

- `brokenByDomain` (DomainCount[]): Broken-link counts grouped by domain.

- `pages` (object map<string,PageMeta>): Per-page metrics.
  - `title` (string): The page `<title>` text.
  - `responseTimeMs` (number): Time to fetch the document, in milliseconds.
  - `contentLength` (number): Size of the fetched body in bytes (best effort).
  - `depth` (number): Crawl depth from the start URL.

- `depthDistribution` (object map<string,number>): Count of pages by depth. Keys are depth values serialized as strings (e.g., "0", "1"), as shown in the example above; the sketch after this list parses them back to integers.

- `robots` (object): `robots.txt` summary.
  - `present` (boolean): True if `robots.txt` exists and returned HTTP 200.
  - `fetchedAt` (string, RFC3339; optional): Fetch time, set when `robots.txt` is present.
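
The `stats`, `brokenByDomain`, and `depthDistribution` fields can also be re-derived from the raw arrays and maps above. The following consumer-side sketch does exactly that; it assumes the illustrative `Report`/`LinkStatus` types from the earlier example and a report saved as `report.json` (a hypothetical filename), and is not part of urlcrawler itself.

```go
// inspect.go — a consumer-side sketch, not part of urlcrawler. It assumes the
// illustrative Report/LinkStatus types from the earlier example live in the
// same package, and that a report was saved as report.json (hypothetical name).
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"os"
	"sort"
	"strconv"
)

func main() {
	raw, err := os.ReadFile("report.json")
	if err != nil {
		panic(err)
	}
	var rep Report
	if err := json.Unmarshal(raw, &rep); err != nil {
		panic(err)
	}

	// Recompute broken-link counts per domain from linkStatuses; the result
	// should agree with the brokenByDomain summary in the report.
	brokenByDomain := map[string]int{}
	for _, ls := range rep.LinkStatuses {
		if ls.OK {
			continue
		}
		if u, err := url.Parse(ls.URL); err == nil {
			brokenByDomain[u.Hostname()]++
		}
	}
	fmt.Println("broken by domain:", brokenByDomain)

	// depthDistribution keys are depths serialized as strings ("0", "1", ...);
	// parse them back to integers before sorting.
	depths := make([]int, 0, len(rep.DepthDistribution))
	for k := range rep.DepthDistribution {
		if d, err := strconv.Atoi(k); err == nil {
			depths = append(depths, d)
		}
	}
	sort.Ints(depths)
	for _, d := range depths {
		fmt.Printf("depth %d: %d page(s)\n", d, rep.DepthDistribution[strconv.Itoa(d)])
	}
}
```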
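
Similarly, `linkSources` is simply the inverse index of `pageOutlinks`. The self-contained sketch below (again an illustration, not urlcrawler code) rebuilds such a map from a `pageOutlinks`-shaped input.

```go
// invert.go — standalone illustration of the pageOutlinks → linkSources
// relationship; the function and sample data are examples, not urlcrawler code.
package main

import "fmt"

// invertOutlinks maps each outgoing link back to the pages that referenced it,
// which is the shape of the linkSources field.
func invertOutlinks(pageOutlinks map[string][]string) map[string][]string {
	sources := map[string][]string{}
	for page, links := range pageOutlinks {
		for _, link := range links {
			sources[link] = append(sources[link], page)
		}
	}
	return sources
}

func main() {
	outlinks := map[string][]string{
		"https://example.com": {"https://example.com/about", "https://other.example/"},
	}
	fmt.Println(invertOutlinks(outlinks))
}
```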

## Notes

- URLs are normalized and deduplicated during the crawl. Minor variants such as a trailing `/` are normalized in the output; one plausible normalization is sketched below.
- All metrics described here are included by default; no extra flags are required.
- Content-type filtering: only `text/html` pages are parsed for outlinks.
- Sitemap fetching is best-effort; a missing sitemap is not treated as an error.
- JSON lists are sorted to produce stable output across runs.
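
For illustration only, here is one plausible way such trailing-slash and case normalization could be implemented in Go. The exact rules urlcrawler applies may differ, and `normalizeURL` is a hypothetical helper, not part of the tool.

```go
// normalize.go — a hypothetical normalization helper shown to illustrate the
// kind of canonicalization described above; NOT taken from the urlcrawler
// source, and its exact rules are assumptions.
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeURL lower-cases the host, drops the fragment, and trims a trailing
// slash from non-root paths so that minor variants compare equal.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	if u.Path != "/" {
		u.Path = strings.TrimSuffix(u.Path, "/")
	}
	return u.String(), nil
}

func main() {
	n, _ := normalizeURL("https://Example.com/about/#team")
	fmt.Println(n) // https://example.com/about
}
```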