## URLCrawler Report JSON Schema

This document describes the structure of the JSON reports produced by `urlcrawler` when run with `-output json`.

### Top-level object

```json
{
  "target": "https://example.com",
  "crawledUrls": ["https://example.com", "https://example.com/about"],
  "sitemapUrls": ["https://example.com", "https://example.com/about"],
  "crawlErrors": {"https://bad.example": "error string"},
  "linkStatuses": [
    {"url": "https://example.com", "statusCode": 200, "ok": true},
    {"url": "https://other.example/broken", "statusCode": 404, "ok": false, "error": "..."}
  ],
  "pageOutlinks": {
    "https://example.com": ["https://example.com/about", "https://other.example/"]
  },
  "linkSources": {
    "https://example.com/about": ["https://example.com"]
  },
  "missingInSitemap": ["https://example.com/page-not-in-sitemap"],
  "inSitemapNotCrawled": ["https://example.com/deferred"],
  "metadata": {
    "startedAt": "2025-08-31T12:34:56Z",
    "finishedAt": "2025-08-31T12:35:57Z",
    "durationMs": 61000
  },
  "params": {
    "maxDepth": 1,
    "concurrency": 5,
    "timeoutMs": 5000,
    "userAgent": "urlcrawler/1.0",
    "sameHostOnly": true
  },
  "stats": {
    "ok": 12,
    "broken": 1,
    "status2xx": 12,
    "status3xx": 0,
    "status4xx": 1,
    "status5xx": 0,
    "statusOther": 0
  },
  "reportSummary": "crawled=2 sitemap=2 links=5 ok=4 broken=1",
  "topExternalDomains": [{"domain": "example-cdn.com", "count": 2}],
  "brokenSample": [{"url": "https://other.example/broken", "statusCode": 404, "ok": false}],
  "brokenByDomain": [{"domain": "other.example", "count": 1}],
  "pages": {
    "https://example.com": {"title": "Home — Example", "responseTimeMs": 42, "contentLength": 5123, "depth": 0}
  },
  "depthDistribution": {"0": 1, "1": 3},
  "robots": {"present": true, "fetchedAt": "2025-08-31T12:34:59Z"}
}
```

### Fields

- **target** (string): Normalized start URL used for the crawl.
- **crawledUrls** (string[]): Unique URLs that were visited during crawling. Sorted for stability.
- **sitemapUrls** (string[]; optional): All URLs discovered via `sitemap.xml` (and nested sitemaps). Omitted when no sitemap is found.
- **crawlErrors** (object map; optional): Maps URL → error message for requests that failed (e.g., network/TLS/timeouts). Only set when errors occurred.
- **linkStatuses** (LinkStatus[]): Result of HTTP status checks for all unique links discovered (including the pages themselves).
  - **url** (string): The checked URL.
  - **statusCode** (number): HTTP status code (0 if the request failed before a response was received).
  - **ok** (boolean): Convenience flag, true when `200 ≤ statusCode < 400` and no error occurred.
  - **error** (string; optional): Error string when the request failed or another client-side error occurred.
- **pageOutlinks** (object map): For each crawled page URL, the list of normalized outgoing links (internal and external).
- **linkSources** (object map): Inverse index: for each discovered link URL, the list of page URLs where it appeared.
- **missingInSitemap** (string[]; optional): URLs that were crawled but are not present in the sitemap.
- **inSitemapNotCrawled** (string[]; optional): URLs present in the sitemap that were not crawled (e.g., due to depth limits or off-host rules).
- **metadata** (object): Crawl timing information.
  - **startedAt** (string, RFC3339)
  - **finishedAt** (string, RFC3339)
  - **durationMs** (number)
- **params** (object): Parameters used for the run.
  - **maxDepth** (number)
  - **concurrency** (number)
  - **timeoutMs** (number)
  - **userAgent** (string)
  - **sameHostOnly** (boolean)
- **stats** (object): Summary of link status results.
  - **ok** (number)
  - **broken** (number)
  - **status2xx** (number)
  - **status3xx** (number)
  - **status4xx** (number)
  - **status5xx** (number)
  - **statusOther** (number)
- **reportSummary** (string): Compact summary string like `crawled=7 sitemap=7 links=26 ok=26 broken=0`.
- **topExternalDomains** (DomainCount[]): Top external domains referenced by links.
- **brokenSample** (LinkStatus[]): Up to 10 example broken links.
- **brokenByDomain** (DomainCount[]): Broken link counts grouped by domain.
- **pages** (object map): Per-page metrics, keyed by page URL.
  - **title** (string): The page `<title>` text.
  - **responseTimeMs** (number): Time to fetch the document.
  - **contentLength** (number): Size of the fetched body in bytes (best effort).
  - **depth** (number): Crawl depth from the start URL.
- **depthDistribution** (object map<number,number>): Count of pages by depth. In the JSON output the depth keys are serialized as strings (e.g., `"0"`).
- **robots** (object): robots.txt summary.
  - **present** (boolean): True if `robots.txt` exists and returned 200.
  - **fetchedAt** (string, RFC3339; optional): Fetch time when present.

### Notes

- URLs are normalized and deduplicated during crawl. Minor variants like trailing `/.` are normalized in output.
- All metrics described here are included by default; no extra flags are required.
- Content-type filtering: only `text/html` pages are parsed for outlinks.
- Sitemap fetching is best-effort; absence is not treated as an error.
- The JSON lists are sorted to produce stable outputs across runs.
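
### Example: checking a report on the consumer side

Several fields above are described as derived from others: `ok` from `statusCode` and `error`, `linkSources` as the inverse of `pageOutlinks`, and the two sitemap lists as differences between `crawledUrls` and `sitemapUrls`. The sketch below recomputes them from a small hand-written report, which can be a useful sanity check when consuming the JSON. It is not `urlcrawler`'s own code: the helper names (`is_ok`, `invert_outlinks`, `sitemap_diff`, `status_buckets`) and `SAMPLE_REPORT` are hypothetical, and placing any status code outside 200-599 into `statusOther` is an assumption, since the schema does not define that bucket.

```python
import json
from collections import defaultdict

# Hand-written sample that follows the field names in this document.
SAMPLE_REPORT = json.loads("""
{
  "crawledUrls": ["https://example.com", "https://example.com/about"],
  "sitemapUrls": ["https://example.com", "https://example.com/deferred"],
  "linkStatuses": [
    {"url": "https://example.com", "statusCode": 200, "ok": true},
    {"url": "https://other.example/broken", "statusCode": 404, "ok": false, "error": "..."},
    {"url": "https://bad.example", "statusCode": 0, "ok": false, "error": "timeout"}
  ],
  "pageOutlinks": {
    "https://example.com": ["https://example.com/about", "https://other.example/broken"],
    "https://example.com/about": ["https://other.example/broken"]
  }
}
""")


def is_ok(status):
    """Documented rule: 200 <= statusCode < 400 and no error string."""
    return 200 <= status.get("statusCode", 0) < 400 and not status.get("error")


def invert_outlinks(page_outlinks):
    """Build a linkSources-style inverse index: link URL -> sorted source pages."""
    sources = defaultdict(set)
    for page, links in page_outlinks.items():
        for link in links:
            sources[link].add(page)
    return {link: sorted(pages) for link, pages in sorted(sources.items())}


def sitemap_diff(report):
    """Return (missingInSitemap, inSitemapNotCrawled) as sorted lists."""
    crawled = set(report.get("crawledUrls", []))
    sitemap = set(report.get("sitemapUrls", []))  # optional field
    return sorted(crawled - sitemap), sorted(sitemap - crawled)


def status_buckets(link_statuses):
    """Count statuses per class; codes outside 200-599 go to statusOther (assumption)."""
    buckets = {"status2xx": 0, "status3xx": 0, "status4xx": 0,
               "status5xx": 0, "statusOther": 0}
    for status in link_statuses:
        code = status.get("statusCode", 0)
        key = f"status{code // 100}xx" if 200 <= code < 600 else "statusOther"
        buckets[key] += 1
    return buckets


if __name__ == "__main__":
    missing, not_crawled = sitemap_diff(SAMPLE_REPORT)
    print("missingInSitemap:", missing)
    print("inSitemapNotCrawled:", not_crawled)
    print("linkSources:", invert_outlinks(SAMPLE_REPORT["pageOutlinks"]))
    print("buckets:", status_buckets(SAMPLE_REPORT["linkStatuses"]))
```

To check a real report, load it with `json.load` in place of `SAMPLE_REPORT` and compare the recomputed values against the corresponding fields. A mismatch may simply mean the tool derives a field differently from what this sketch assumes.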