gosint-sitecrawl/reports/REPORT_SCHEMA.md

## URLCrawler Report JSON Schema

This document describes the structure of the JSON reports produced by `urlcrawler` when run with `-output json`.

### Top-level object

```json
{
  "target": "https://example.com",
  "crawledUrls": ["https://example.com", "https://example.com/about", "https://example.com/page-not-in-sitemap"],
  "sitemapUrls": ["https://example.com", "https://example.com/about", "https://example.com/deferred"],
  "crawlErrors": {"https://bad.example": "error string"},
  "linkStatuses": [
    {"url": "https://example.com", "statusCode": 200, "ok": true},
    {"url": "https://other.example/broken", "statusCode": 404, "ok": false, "error": "..."}
  ],
  "pageOutlinks": {
    "https://example.com": ["https://example.com/about", "https://other.example/"]
  },
  "linkSources": {
    "https://example.com/about": ["https://example.com"]
  },
  "missingInSitemap": ["https://example.com/page-not-in-sitemap"],
  "inSitemapNotCrawled": ["https://example.com/deferred"]
}
```
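
As a minimal sketch of consuming such a report, the Python snippet below parses data shaped like the example above and lists the links whose `ok` flag is false. Python is an assumption here (`urlcrawler` does not prescribe a consumer language), the helper name `broken_links` is hypothetical, and the embedded JSON is trimmed sample data, not real crawl output.

```python
import json

# Trimmed sample data in the shape of the example report; not real crawl output.
REPORT_JSON = """
{
  "target": "https://example.com",
  "crawledUrls": ["https://example.com", "https://example.com/about"],
  "linkStatuses": [
    {"url": "https://example.com", "statusCode": 200, "ok": true},
    {"url": "https://other.example/broken", "statusCode": 404, "ok": false, "error": "..."}
  ]
}
"""


def broken_links(report: dict) -> list[tuple[str, int, str]]:
    """Return (url, statusCode, error) for every link whose `ok` flag is false."""
    result = []
    for status in report.get("linkStatuses", []):
        if not status["ok"]:
            # `error` is optional, so fall back to an empty string.
            result.append((status["url"], status["statusCode"], status.get("error", "")))
    return result


report = json.loads(REPORT_JSON)
for url, code, error in broken_links(report):
    print(f"{code} {url} {error}".rstrip())
```

Reading optional keys with `.get()` keeps the consumer working when fields such as `error` or `sitemapUrls` are absent.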

### Fields

- **target** (string): Normalized start URL used for the crawl.

- **crawledUrls** (string[]): Unique URLs that were visited during crawling. Sorted for stability.

- **sitemapUrls** (string[]; optional): All URLs discovered via `sitemap.xml` (and nested sitemaps). Omitted when no sitemap is found.

- **crawlErrors** (object map<string,string>; optional): Maps URL → error message for requests that failed (e.g., network/TLS/timeouts). Present only when at least one error occurred.

- **linkStatuses** (LinkStatus[]): Result of HTTP status checks for all unique links discovered (including the pages themselves).
  - **url** (string): The checked URL.
  - **statusCode** (number): HTTP status code (0 if request failed before a response was received).
  - **ok** (boolean): Convenience flag, true when `200 ≤ statusCode < 400` and no error occurred.
  - **error** (string; optional): Error message, set when the request failed or another client-side error occurred.

- **pageOutlinks** (object map<string,string[]>): For each crawled page URL, the list of normalized outgoing links (internal and external).

- **linkSources** (object map<string,string[]>): Inverse index: for each discovered link URL, the list of page URLs where it appeared.

- **missingInSitemap** (string[]; optional): URLs that were crawled but not present in the sitemap.

- **inSitemapNotCrawled** (string[]; optional): URLs present in the sitemap that were not crawled (e.g., due to depth limits or off-host rules).
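
The relationships between these fields can be checked from the consumer side. The Python sketch below rebuilds the inverse index from `pageOutlinks` and recomputes the two sitemap difference lists with set operations. It assumes the definitions above hold exactly (for instance, that `linkSources` is the precise inverse of `pageOutlinks`); the function names and the sample data are hypothetical.

```python
def invert_outlinks(page_outlinks: dict[str, list[str]]) -> dict[str, list[str]]:
    """Build link -> sorted list of source pages from page -> outgoing links."""
    sources: dict[str, set[str]] = {}
    for page, links in page_outlinks.items():
        for link in links:
            sources.setdefault(link, set()).add(page)
    return {link: sorted(pages) for link, pages in sorted(sources.items())}


def sitemap_diffs(crawled: list[str], sitemap: list[str]) -> tuple[list[str], list[str]]:
    """Return (missingInSitemap, inSitemapNotCrawled) as sorted lists."""
    crawled_set, sitemap_set = set(crawled), set(sitemap)
    return sorted(crawled_set - sitemap_set), sorted(sitemap_set - crawled_set)


# Hypothetical sample data shaped like the report fields.
page_outlinks = {
    "https://example.com": ["https://example.com/about", "https://other.example/"],
    "https://example.com/about": ["https://other.example/"],
}
crawled = ["https://example.com", "https://example.com/about"]
sitemap = ["https://example.com", "https://example.com/deferred"]

link_sources = invert_outlinks(page_outlinks)
missing, not_crawled = sitemap_diffs(crawled, sitemap)
print(link_sources)
print(missing, not_crawled)
```

Comparing these recomputed values against the `linkSources`, `missingInSitemap`, and `inSitemapNotCrawled` fields of a real report is one way to sanity-check a consumer's reading of the schema.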

### Notes

- URLs are normalized and deduplicated during the crawl.
- Content-type filtering: only `text/html` pages are parsed for outlinks.
- Sitemap fetching is best-effort; absence is not treated as an error.
- The JSON lists are sorted to produce stable outputs across runs.
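
The sorting note can be spot-checked with a few lines of Python. This is only a sketch: it assumes the top-level string lists are sorted in plain lexicographic (code point) order, which this document does not state, and the function name `unsorted_lists` and the sample report are hypothetical.

```python
# Top-level fields that hold lists of URL strings, per the field list above.
STRING_LIST_FIELDS = ["crawledUrls", "sitemapUrls", "missingInSitemap", "inSitemapNotCrawled"]


def unsorted_lists(report: dict) -> list[str]:
    """Return the names of present string-list fields that are not in ascending order."""
    return [
        name
        for name in STRING_LIST_FIELDS
        # Optional fields may be absent; skip them instead of failing.
        if name in report and report[name] != sorted(report[name])
    ]


# Hypothetical report fragment with one deliberately unsorted list.
sample = {
    "crawledUrls": ["https://example.com", "https://example.com/about"],
    "missingInSitemap": ["https://example.com/b", "https://example.com/a"],
}
print(unsorted_lists(sample))
```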