URLCrawler Report JSON Schema
This document describes the structure of the JSON reports produced by urlcrawler when run with -output json.
Top-level object
{
  "target": "https://example.com",
  "crawledUrls": ["https://example.com", "https://example.com/about"],
  "sitemapUrls": ["https://example.com", "https://example.com/about"],
  "crawlErrors": {"https://bad.example": "error string"},
  "linkStatuses": [
    {"url": "https://example.com", "statusCode": 200, "ok": true},
    {"url": "https://other.example/broken", "statusCode": 404, "ok": false, "error": "..."}
  ],
  "pageOutlinks": {
    "https://example.com": ["https://example.com/about", "https://other.example/"]
  },
  "linkSources": {
    "https://example.com/about": ["https://example.com"]
  },
  "missingInSitemap": ["https://example.com/page-not-in-sitemap"],
  "inSitemapNotCrawled": ["https://example.com/deferred"]
}
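A report like this can be read with any JSON parser. The following is a minimal Python sketch, not part of urlcrawler itself: the helper name `parse_report` is hypothetical, and filling omitted optional fields with empty values is a convenience assumed here for illustration.

```python
import json
from typing import Any, Callable

# Optional top-level fields and the empty value to substitute when one is
# omitted. Filling them in is a convenience chosen for this sketch.
OPTIONAL_FIELDS: dict[str, Callable[[], Any]] = {
    "sitemapUrls": list,
    "crawlErrors": dict,
    "missingInSitemap": list,
    "inSitemapNotCrawled": list,
}


def parse_report(text: str) -> dict[str, Any]:
    """Parse a report from JSON text and default any omitted optional fields."""
    report = json.loads(text)
    for name, make_empty in OPTIONAL_FIELDS.items():
        report.setdefault(name, make_empty())
    return report


if __name__ == "__main__":
    # A deliberately small report with no sitemap data and no crawl errors.
    sample = """
    {
      "target": "https://example.com",
      "crawledUrls": ["https://example.com"],
      "linkStatuses": [{"url": "https://example.com", "statusCode": 200, "ok": true}],
      "pageOutlinks": {"https://example.com": []},
      "linkSources": {}
    }
    """
    report = parse_report(sample)
    print(report["target"], len(report["crawledUrls"]), report["sitemapUrls"])
```

Defaulting the optional fields up front lets later code iterate over them without checking for their presence each time.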
Fields
- target (string): Normalized start URL used for the crawl.
- crawledUrls (string[]): Unique URLs that were visited during crawling. Sorted for stability.
- sitemapUrls (string[]; optional): All URLs discovered via sitemap.xml (and nested sitemaps). Present unless the sitemap is not found.
- crawlErrors (object map<string,string>; optional): Maps URL → error message for requests that failed (e.g., network/TLS/timeouts). Only set when errors occurred.
- linkStatuses (LinkStatus[]): Result of HTTP status checks for all unique links discovered (including the pages themselves).
  - url (string): The checked URL.
  - statusCode (number): HTTP status code (0 if request failed before a response was received).
  - ok (boolean): Convenience flag, true when 200 ≤ statusCode < 400 and no error occurred.
  - error (string; optional): Error string when a request failed or there was another client error.
- pageOutlinks (object map<string,string[]>): For each crawled page URL, the list of normalized outgoing links (internal and external).
- linkSources (object map<string,string[]>): Inverse index: for each discovered link URL, the list of page URLs where it appeared.
- missingInSitemap (string[]; optional): URLs that were crawled but not present in the sitemap.
- inSitemapNotCrawled (string[]; optional): URLs present in the sitemap that were not crawled (e.g., due to depth limits or off-host rules).
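To illustrate how linkStatuses and linkSources fit together, the Python sketch below recomputes the ok flag from the rule given for that field and groups failing links by the pages that reference them. It is an illustration, not urlcrawler's own code: the function names `is_ok` and `broken_links_by_source` are hypothetical, and treating an absent or empty error field as "no error occurred" is an assumption.

```python
from typing import Any


def is_ok(status: dict[str, Any]) -> bool:
    """Apply the documented rule: 200 <= statusCode < 400 and no error."""
    # Assumption: an absent or empty "error" field means no error occurred.
    return 200 <= status["statusCode"] < 400 and not status.get("error")


def broken_links_by_source(report: dict[str, Any]) -> dict[str, list[str]]:
    """Map each page URL to the failing links that were found on it."""
    sources = report.get("linkSources", {})
    by_page: dict[str, list[str]] = {}
    for status in report.get("linkStatuses", []):
        if is_ok(status):
            continue
        # A link with no recorded source is simply skipped in this sketch.
        for page in sources.get(status["url"], []):
            by_page.setdefault(page, []).append(status["url"])
    return by_page


if __name__ == "__main__":
    # Illustrative data shaped like the report, not real crawl output.
    report = {
        "linkStatuses": [
            {"url": "https://example.com", "statusCode": 200, "ok": True},
            {"url": "https://other.example/broken", "statusCode": 404, "ok": False},
            {"url": "https://bad.example", "statusCode": 0, "ok": False, "error": "timeout"},
        ],
        "linkSources": {
            "https://other.example/broken": ["https://example.com"],
            "https://bad.example": ["https://example.com", "https://example.com/about"],
        },
    }
    for page, links in broken_links_by_source(report).items():
        print(page, links)
```

Grouping by source page rather than by link tends to be more useful when fixing a site, since each entry points at a page that needs editing.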
Notes
- URLs are normalized and deduplicated during crawl.
- Content-type filtering: only text/html pages are parsed for outlinks.
- Sitemap fetching is best-effort; absence is not treated as an error.
- The JSON lists are sorted to produce stable outputs across runs.
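The missingInSitemap and inSitemapNotCrawled fields can be read as set differences between crawledUrls and sitemapUrls. The Python sketch below recomputes them under that assumption. The function name `sitemap_diff` is hypothetical, urlcrawler may apply additional rules that this does not model, and Python's default string ordering is assumed to be an acceptable stand-in for whatever sort order the tool uses.

```python
def sitemap_diff(crawled: list[str], sitemap: list[str]) -> tuple[list[str], list[str]]:
    """Return (missing_in_sitemap, in_sitemap_not_crawled) as sorted lists."""
    crawled_set, sitemap_set = set(crawled), set(sitemap)
    # Crawled pages that the sitemap does not list.
    missing_in_sitemap = sorted(crawled_set - sitemap_set)
    # Sitemap entries that the crawl never visited.
    in_sitemap_not_crawled = sorted(sitemap_set - crawled_set)
    return missing_in_sitemap, in_sitemap_not_crawled


if __name__ == "__main__":
    # Illustrative URLs only, not real crawl output.
    crawled = ["https://example.com", "https://example.com/page-not-in-sitemap"]
    sitemap = ["https://example.com", "https://example.com/deferred"]
    missing, not_crawled = sitemap_diff(crawled, sitemap)
    print(missing)
    print(not_crawled)
```

Sorting the results mirrors the stable-output note above, so two runs over the same inputs can be compared with a plain text diff.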