## URLCrawler Report JSON Schema

This document describes the structure of the JSON reports produced by `urlcrawler` when run with `-output json`.

### Top-level object

```json
{
  "target": "https://example.com",
  "crawledUrls": ["https://example.com", "https://example.com/about"],
  "sitemapUrls": ["https://example.com", "https://example.com/about"],
  "crawlErrors": {"https://bad.example": "error string"},
  "linkStatuses": [
    {"url": "https://example.com", "statusCode": 200, "ok": true},
    {"url": "https://other.example/broken", "statusCode": 404, "ok": false, "error": "..."}
  ],
  "pageOutlinks": {
    "https://example.com": ["https://example.com/about", "https://other.example/"]
  },
  "linkSources": {
    "https://example.com/about": ["https://example.com"]
  },
  "missingInSitemap": ["https://example.com/page-not-in-sitemap"],
  "inSitemapNotCrawled": ["https://example.com/deferred"],
  "metadata": {
    "startedAt": "2025-08-31T12:34:56Z",
    "finishedAt": "2025-08-31T12:35:57Z",
    "durationMs": 61000
  },
  "params": {
    "maxDepth": 1,
    "concurrency": 5,
    "timeoutMs": 5000,
    "userAgent": "urlcrawler/1.0",
    "sameHostOnly": true
  },
  "stats": {
    "ok": 12,
    "broken": 1,
    "status2xx": 12,
    "status3xx": 0,
    "status4xx": 1,
    "status5xx": 0,
    "statusOther": 0
  }
}
```
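
As a rough illustration of consuming a report in this shape, the following Python sketch parses a trimmed copy of the example and reads a few fields. The helper names `broken_links` and `optional_list` are hypothetical and not part of `urlcrawler`; the inline `SAMPLE` string stands in for a real report file.

```python
import json

# Trimmed copy of the example report; a real report would be read from a file.
SAMPLE = """
{
  "target": "https://example.com",
  "crawledUrls": ["https://example.com", "https://example.com/about"],
  "linkStatuses": [
    {"url": "https://example.com", "statusCode": 200, "ok": true},
    {"url": "https://other.example/broken", "statusCode": 404, "ok": false, "error": "..."}
  ]
}
"""


def broken_links(report: dict) -> list:
    """Return the URLs of link checks whose `ok` flag is false (hypothetical helper)."""
    return [s["url"] for s in report.get("linkStatuses", []) if not s.get("ok", False)]


def optional_list(report: dict, key: str) -> list:
    """Read an optional string[] field, treating an absent key as an empty list."""
    return report.get(key) or []


report = json.loads(SAMPLE)
print(report["target"])
print(broken_links(report))
print(optional_list(report, "sitemapUrls"))
```

Reading optional fields through a helper like `optional_list` avoids a `KeyError` when a field is left out of a given report.
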

### Fields

- **target** (string): Normalized start URL used for the crawl.

- **crawledUrls** (string[]): Unique URLs visited during the crawl, sorted so the output is stable across runs.

- **sitemapUrls** (string[]; optional): All URLs discovered via `sitemap.xml` (and nested sitemaps). Omitted when no sitemap is found.

- **crawlErrors** (object map<string,string>; optional): Maps URL → error message for requests that failed (e.g., network/TLS/timeouts). Only set when errors occurred.

- **linkStatuses** (LinkStatus[]): Results of the HTTP status checks for all unique links discovered (including the pages themselves).
  - **url** (string): The checked URL.
  - **statusCode** (number): HTTP status code (0 if the request failed before a response was received).
  - **ok** (boolean): Convenience flag, true when `200 ≤ statusCode < 400` and no error occurred.
  - **error** (string; optional): Error string when a request failed or there was another client error.

- **pageOutlinks** (object map<string,string[]>): For each crawled page URL, the list of normalized outgoing links (internal and external).

- **linkSources** (object map<string,string[]>): Inverse index: for each discovered link URL, the list of page URLs where it appeared.

- **missingInSitemap** (string[]; optional): URLs that were crawled but not present in the sitemap.

- **inSitemapNotCrawled** (string[]; optional): URLs present in the sitemap that were not crawled (e.g., due to depth limits or off-host rules).

- **metadata** (object): Crawl timing information.
  - **startedAt** (string, RFC3339)
  - **finishedAt** (string, RFC3339)
  - **durationMs** (number)

- **params** (object): Parameters used for the run.
  - **maxDepth** (number)
  - **concurrency** (number)
  - **timeoutMs** (number)
  - **userAgent** (string)
  - **sameHostOnly** (boolean)

- **stats** (object): Summary of link status results.
  - **ok** (number)
  - **broken** (number)
  - **status2xx** (number)
  - **status3xx** (number)
  - **status4xx** (number)
  - **status5xx** (number)
  - **statusOther** (number)

- **reportSummary** (string): Compact summary string like `crawled=7 sitemap=7 links=26 ok=26 broken=0`.
- **topExternalDomains** (DomainCount[]): Top external domains referenced by links.
- **brokenSample** (LinkStatus[]): Up to 10 example broken links.
- **brokenByDomain** (DomainCount[]): Broken link counts grouped by domain.
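
Several of the fields above are derived from others, and the relationships can be sketched in a few lines of Python. This is a minimal illustration of the documented definitions, not the tool's actual implementation: all function names are hypothetical, and the bucketing in `stats_bucket` (in particular, that `statusOther` covers everything outside 200 to 599, including the failure code 0) is an assumption.

```python
from collections import defaultdict
from typing import Optional


def is_ok(status_code: int, error: Optional[str] = None) -> bool:
    """Mirror the documented rule: 200 <= statusCode < 400 and no error."""
    return not error and 200 <= status_code < 400


def invert_outlinks(page_outlinks: dict) -> dict:
    """Build a linkSources-style inverse index from a pageOutlinks map."""
    sources = defaultdict(set)
    for page, links in page_outlinks.items():
        for link in links:
            sources[link].add(page)
    # Sort keys and values so the result is stable across runs.
    return {link: sorted(pages) for link, pages in sorted(sources.items())}


def sitemap_diff(crawled: list, sitemap: list) -> tuple:
    """Return (missingInSitemap, inSitemapNotCrawled) as sorted lists."""
    crawled_set, sitemap_set = set(crawled), set(sitemap)
    return sorted(crawled_set - sitemap_set), sorted(sitemap_set - crawled_set)


def stats_bucket(status_code: int) -> str:
    """Map a status code to a stats counter name (assumed bucketing)."""
    for low, name in ((200, "status2xx"), (300, "status3xx"),
                      (400, "status4xx"), (500, "status5xx")):
        if low <= status_code < low + 100:
            return name
    return "statusOther"  # assumption: includes 0 for failed requests


outlinks = {"https://example.com": ["https://example.com/about", "https://other.example/"]}
print(invert_outlinks(outlinks))
print(sitemap_diff(["https://example.com", "https://example.com/new"],
                   ["https://example.com", "https://example.com/deferred"]))
print(is_ok(404), stats_bucket(404))
```

Recomputing these values from `linkStatuses`, `pageOutlinks`, `crawledUrls`, and `sitemapUrls` can serve as a consistency check on a report, provided the assumptions above match how the tool counts.
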

### Notes

- URLs are normalized and deduplicated during crawl. Minor variants like trailing `/.` are normalized in output.
- All metrics described here are included by default; no extra flags are required.
- Content-type filtering: only `text/html` pages are parsed for outlinks.
- Sitemap fetching is best-effort; absence is not treated as an error.
- The JSON lists are sorted to produce stable outputs across runs.
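
Because lists are sorted and URLs deduplicated, two reports can be compared directly. The sketch below shows one way to do that in Python. The helper names are hypothetical, and the exact sort order the tool uses is not specified here, so `is_sorted` assumes plain string comparison.

```python
def is_sorted(values: list) -> bool:
    """True when values are in ascending order under Python's string comparison.

    Assumption: the report uses the same ordering; this is not confirmed.
    """
    return all(a <= b for a, b in zip(values, values[1:]))


def diff_runs(old: dict, new: dict, key: str = "crawledUrls") -> dict:
    """List the URLs added to or removed from one list field between two reports."""
    before, after = set(old.get(key, [])), set(new.get(key, []))
    return {"added": sorted(after - before), "removed": sorted(before - after)}


# Two hypothetical runs against the same target.
run_a = {"crawledUrls": ["https://example.com", "https://example.com/about"]}
run_b = {"crawledUrls": ["https://example.com", "https://example.com/contact"]}

print(is_sorted(run_a["crawledUrls"]))
print(diff_runs(run_a, run_b))
```

`diff_runs` works on sets, so it does not depend on the ordering assumption; only `is_sorted` does.
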