## URLCrawler Report JSON Schema
This document describes the structure of the JSON reports produced by `urlcrawler` when run with `-output json`.
### Top-level object
```json
{
  "target": "https://example.com",
  "crawledUrls": ["https://example.com", "https://example.com/about"],
  "sitemapUrls": ["https://example.com", "https://example.com/about"],
  "crawlErrors": {"https://bad.example": "error string"},
  "linkStatuses": [
    {"url": "https://example.com", "statusCode": 200, "ok": true},
    {"url": "https://other.example/broken", "statusCode": 404, "ok": false, "error": "..."}
  ],
  "pageOutlinks": {
    "https://example.com": ["https://example.com/about", "https://other.example/"]
  },
  "linkSources": {
    "https://example.com/about": ["https://example.com"]
  },
  "missingInSitemap": ["https://example.com/page-not-in-sitemap"],
  "inSitemapNotCrawled": ["https://example.com/deferred"],
  "metadata": {
    "startedAt": "2025-08-31T12:34:56Z",
    "finishedAt": "2025-08-31T12:35:57Z",
    "durationMs": 61000
  },
  "params": {
    "maxDepth": 1,
    "concurrency": 5,
    "timeoutMs": 5000,
    "userAgent": "urlcrawler/1.0",
    "sameHostOnly": true
  },
  "stats": {
    "ok": 12,
    "broken": 1,
    "status2xx": 12,
    "status3xx": 0,
    "status4xx": 1,
    "status5xx": 0,
    "statusOther": 0
  },
  "reportSummary": "crawled=2 sitemap=2 links=5 ok=4 broken=1",
  "topExternalDomains": [{"domain": "example-cdn.com", "count": 2}],
  "brokenSample": [{"url": "https://other.example/broken", "statusCode": 404, "ok": false}],
  "brokenByDomain": [{"domain": "other.example", "count": 1}],
  "pages": {
    "https://example.com": {"title": "Home — Example", "responseTimeMs": 42, "contentLength": 5123, "depth": 0}
  },
  "depthDistribution": {"0": 1, "1": 3},
  "robots": {"present": true, "fetchedAt": "2025-08-31T12:34:59Z"}
}
```
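
A consumer written in Go could decode the report with types along the following lines. This is a minimal sketch, not the tool's own type definitions: the names `Report`, `LinkStatus`, `PageMeta`, and `parseReport` are assumptions chosen for illustration, and only a subset of the fields is mapped (`encoding/json` ignores keys that have no matching struct field).

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// LinkStatus mirrors one entry of "linkStatuses" or "brokenSample".
// (Hypothetical type name; field tags follow the example above.)
type LinkStatus struct {
	URL        string `json:"url"`
	StatusCode int    `json:"statusCode"`
	OK         bool   `json:"ok"`
	Error      string `json:"error,omitempty"`
}

// PageMeta mirrors one value of the "pages" map.
type PageMeta struct {
	Title          string `json:"title"`
	ResponseTimeMs int64  `json:"responseTimeMs"`
	ContentLength  int64  `json:"contentLength"`
	Depth          int    `json:"depth"`
}

// Report maps a subset of the top-level object.
type Report struct {
	Target       string              `json:"target"`
	CrawledURLs  []string            `json:"crawledUrls"`
	SitemapURLs  []string            `json:"sitemapUrls,omitempty"`
	CrawlErrors  map[string]string   `json:"crawlErrors,omitempty"`
	LinkStatuses []LinkStatus        `json:"linkStatuses"`
	PageOutlinks map[string][]string `json:"pageOutlinks"`
	LinkSources  map[string][]string `json:"linkSources"`
	Metadata     struct {
		// time.Time decodes RFC 3339 strings directly.
		StartedAt  time.Time `json:"startedAt"`
		FinishedAt time.Time `json:"finishedAt"`
		DurationMs int64     `json:"durationMs"`
	} `json:"metadata"`
	Pages map[string]PageMeta `json:"pages"`
	// JSON object keys are strings, so depths arrive as "0", "1", ...
	DepthDistribution map[string]int `json:"depthDistribution"`
}

// parseReport decodes a JSON report into a Report value.
func parseReport(data []byte) (*Report, error) {
	var r Report
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	return &r, nil
}

func main() {
	sample := []byte(`{
	  "target": "https://example.com",
	  "linkStatuses": [{"url": "https://example.com", "statusCode": 200, "ok": true}],
	  "metadata": {"startedAt": "2025-08-31T12:34:56Z", "finishedAt": "2025-08-31T12:35:57Z", "durationMs": 61000},
	  "depthDistribution": {"0": 1, "1": 3}
	}`)
	r, err := parseReport(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(r.Target, len(r.LinkStatuses), r.Metadata.FinishedAt.Sub(r.Metadata.StartedAt))
}
```

Optional fields such as `sitemapUrls` and `crawlErrors` simply decode to `nil` when absent, so callers should check length rather than assume presence.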
### Fields
- **target** (string): Normalized start URL used for the crawl.
- **crawledUrls** (string[]): Unique URLs that were visited during crawling. Sorted for stability.
- **sitemapUrls** (string[]; optional): All URLs discovered via `sitemap.xml` (and nested sitemaps). Omitted when no sitemap is found.
- **crawlErrors** (object map<string,string>; optional): Maps URL → error message for requests that failed (e.g., network/TLS/timeouts). Only set when errors occurred.
- **linkStatuses** (LinkStatus[]): Result of HTTP status checks for all unique links discovered (including the pages themselves).
  - **url** (string): The checked URL.
  - **statusCode** (number): HTTP status code (0 if the request failed before a response was received).
  - **ok** (boolean): Convenience flag, true when `200 ≤ statusCode < 400` and no error occurred.
  - **error** (string; optional): Error message, set when the request failed or another client-side error occurred.
- **pageOutlinks** (object map<string,string[]>): For each crawled page URL, the list of normalized outgoing links (internal and external).
- **linkSources** (object map<string,string[]>): Inverse index: for each discovered link URL, the list of page URLs where it appeared.
- **missingInSitemap** (string[]; optional): URLs that were crawled but not present in the sitemap.
- **inSitemapNotCrawled** (string[]; optional): URLs present in the sitemap that were not crawled (e.g., due to depth limits or off-host rules).
- **metadata** (object): Crawl timing information.
  - **startedAt** (string, RFC3339)
  - **finishedAt** (string, RFC3339)
  - **durationMs** (number)
- **params** (object): Parameters used for the run.
  - **maxDepth** (number)
  - **concurrency** (number)
  - **timeoutMs** (number)
  - **userAgent** (string)
  - **sameHostOnly** (boolean)
- **stats** (object): Summary of link status results.
  - **ok** (number)
  - **broken** (number)
  - **status2xx** (number)
  - **status3xx** (number)
  - **status4xx** (number)
  - **status5xx** (number)
  - **statusOther** (number)
- **reportSummary** (string): Compact summary string like `crawled=7 sitemap=7 links=26 ok=26 broken=0`.
- **topExternalDomains** (DomainCount[]): Top external domains referenced by links.
- **brokenSample** (LinkStatus[]): Up to 10 example broken links.
- **brokenByDomain** (DomainCount[]): Broken link counts grouped by domain.
- **pages** (object map<string,PageMeta>): Per-page metrics.
  - **title** (string): The page `<title>` text.
  - **responseTimeMs** (number): Time to fetch the document.
  - **contentLength** (number): Size of the fetched body in bytes (best effort).
  - **depth** (number): Crawl depth from the start URL.
- **depthDistribution** (object map<string,number>): Count of pages by depth. Keys are depth values serialized as JSON strings (e.g., `"0"`).
- **robots** (object): robots.txt summary.
  - **present** (boolean): True if `robots.txt` exists and returned 200.
  - **fetchedAt** (string, RFC3339; optional): Fetch time when present.
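
Several of the fields above are derived from others, and the relationships can be sketched in Go. The helpers below (`isOK`, `statusBucket`, `invertOutlinks`, `difference`) are hypothetical names, not functions from `urlcrawler`. The `ok` rule is taken from the field description; the assumption that anything outside 2xx through 5xx (including `0`) counts toward `statusOther` is a guess about the bucketing, not a documented rule.

```go
package main

import (
	"fmt"
	"sort"
)

// isOK restates the documented rule for the "ok" flag:
// 200 <= statusCode < 400 and no error.
func isOK(statusCode int, errMsg string) bool {
	return errMsg == "" && statusCode >= 200 && statusCode < 400
}

// statusBucket names the "stats" counter a status code would fall into.
// Assumption: every code outside 200..599 lands in "statusOther".
func statusBucket(statusCode int) string {
	switch {
	case statusCode >= 200 && statusCode < 300:
		return "status2xx"
	case statusCode >= 300 && statusCode < 400:
		return "status3xx"
	case statusCode >= 400 && statusCode < 500:
		return "status4xx"
	case statusCode >= 500 && statusCode < 600:
		return "status5xx"
	default:
		return "statusOther"
	}
}

// invertOutlinks builds a linkSources-style inverse index from a
// pageOutlinks-style map: link URL -> sorted, unique source pages.
func invertOutlinks(outlinks map[string][]string) map[string][]string {
	seen := make(map[string]map[string]struct{})
	for page, links := range outlinks {
		for _, link := range links {
			if seen[link] == nil {
				seen[link] = make(map[string]struct{})
			}
			seen[link][page] = struct{}{}
		}
	}
	sources := make(map[string][]string, len(seen))
	for link, pages := range seen {
		for page := range pages {
			sources[link] = append(sources[link], page)
		}
		sort.Strings(sources[link])
	}
	return sources
}

// difference returns the sorted elements of a that are absent from b.
// difference(crawled, sitemap) corresponds to missingInSitemap;
// difference(sitemap, crawled) corresponds to inSitemapNotCrawled.
func difference(a, b []string) []string {
	inB := make(map[string]struct{}, len(b))
	for _, s := range b {
		inB[s] = struct{}{}
	}
	var out []string
	for _, s := range a {
		if _, ok := inB[s]; !ok {
			out = append(out, s)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	outlinks := map[string][]string{
		"https://example.com": {"https://example.com/about", "https://other.example/"},
	}
	fmt.Println(invertOutlinks(outlinks))
	fmt.Println(isOK(404, ""), statusBucket(404))
	fmt.Println(difference(
		[]string{"https://example.com", "https://example.com/page-not-in-sitemap"},
		[]string{"https://example.com"},
	))
}
```

A report consumer could use the same helpers as a sanity check, for example by confirming that inverting `pageOutlinks` reproduces `linkSources`.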
### Notes
- URLs are normalized and deduplicated during crawl. Minor variants like trailing `/.` are normalized in output.
- All metrics described here are included by default; no extra flags are required.
- Content-type filtering: only `text/html` pages are parsed for outlinks.
- Sitemap fetching is best-effort; absence is not treated as an error.
- The JSON lists are sorted to produce stable outputs across runs.
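
Two of these notes lend themselves to a short Go illustration: the `text/html` filter and the sorted, deduplicated lists. The sketch below is an assumption about how such checks might look, not the crawler's actual code; `isHTML` and `sortedUnique` are hypothetical helper names, and the filter is assumed to key off the `Content-Type` header.

```go
package main

import (
	"fmt"
	"mime"
	"sort"
)

// isHTML reports whether a Content-Type header value denotes text/html.
// mime.ParseMediaType strips parameters such as charset and lowercases
// the media type; unparsable or empty values are treated as non-HTML.
func isHTML(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	return mediaType == "text/html"
}

// sortedUnique returns a sorted copy of in with duplicates removed,
// which keeps list output stable from run to run.
func sortedUnique(in []string) []string {
	out := append([]string(nil), in...)
	sort.Strings(out)
	n := 0
	for i, s := range out {
		if i == 0 || s != out[n-1] {
			out[n] = s
			n++
		}
	}
	return out[:n]
}

func main() {
	fmt.Println(isHTML("text/html; charset=utf-8"), isHTML("application/json"))
	fmt.Println(sortedUnique([]string{
		"https://example.com/about",
		"https://example.com",
		"https://example.com/about",
	}))
}
```

Sorting a copy rather than the input slice keeps the helper free of side effects for callers that still need the original discovery order.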