Configuration

The config is a JSON file. Load it with ripperoni --config ripperoni.config.json. Schema source of truth: src/schemas/internal/RipperConfigSchema.ts.

Copy ripperoni.config.example.json as a starting point. The working copy without the .example suffix (ripperoni.config.json) is gitignored.

Top-level shape

ts
{
  output:    OutputConfig;                      // required
  targets?:  { [name: string]: TargetConfig };  // HTML scrape targets
  mediawiki?: { [name: string]: WikiConfig };   // MediaWiki scrape targets
  crawlers?: { [name: string]: CrawlerConfig }; // link-crawler configs
}

output

Global output settings:

| Key | Type | Required | Notes |
| --- | --- | --- | --- |
| basePath | string | yes | Base directory for all written output files. |
| format | "json" \| "html" \| "text" | no | Output file format. json is the default and what downstream tools (like Squashage) expect. |
| pretty | boolean | no | Pretty-print JSON output. Default false. |
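
A minimal output block might look like the following; the basePath value is illustrative, and format and pretty simply restate the defaults:

json
"output": {
  "basePath": "./output",
  "format":   "json",
  "pretty":   false
}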

targets (HTML scrape)

Each key is a target name (e.g. "aonprd"). Value is a target config.

Required

| Key | Type | Notes |
| --- | --- | --- |
| baseUrl | URI | Base URL for the target. All fetched URLs are resolved against this. |
| pipeline | string[] | Ordered task names. Minimum one. |

Optional

| Key | Type | Default | Notes |
| --- | --- | --- | --- |
| rateLimitMs | integer ≥ 0 | | Minimum milliseconds between requests. |
| jitterMs | integer ≥ 0 | | Random jitter added on top of rateLimitMs. Applied per request. |
| maxRetries | integer 0–10 | | Retry attempts on transient errors. |
| retryBaseDelayMs | integer ≥ 100 | | Base delay for retry backoff. |
| retryMaxDelayMs | integer ≥ 1000 | | Backoff ceiling. |
| concurrency | integer 1–32 | 1 | Parallel fetch/process slots. |
| maxPages | integer ≥ 0 | | Stop after processing this many pages. |
| headers | object | | Additional HTTP headers. Include User-Agent. |
| outputSchema | string | | Path to a JSON Schema file. Records that fail validation are handled per onSchemaError. |
| onSchemaError | "halt" \| "skip" \| "warn" | | What to do when a record fails schema validation. |
| includeRawContent | boolean | true | When false, raw content is not populated on state.page._raw at all during the pipeline run. No raw file is written to raw/. See Output folder layout below. |
| mapping | object | | Field-rename map applied after plugin output. |
| cache | CacheConfig | | See Cache. |
| crawler | CrawlerConfig | | Inline crawler config for this target. |

Concurrency bound rationale: Concurrency is clamped to 1–32 to prevent runaway parallelism. At concurrency 32, you can have 32 HTTP requests in flight simultaneously. This is usually enough to saturate downstream bandwidth and quickly hit many servers' rate limits. Beyond 32, the marginal benefit drops and the risk of getting blocked increases. If you need more parallelism, run multiple Ripperoni instances.

Validation timing: When validation errors surface depends on your schema. If your outputSchema has required fields and your plugin sets output: {}, the validation fails when json:write tries to serialize the record. The error is handled per onSchemaError: "halt" throws and stops the run, "skip" logs a warning and skips the file, "warn" logs and writes anyway.
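
For example, a target that validates each record and skips failures might add these two keys (the schema path is illustrative):

json
"targets": {
  "aonprd": {
    "baseUrl":       "https://2e.aonprd.com",
    "pipeline":      ["html:fetch", "aonprd:parse", "json:write"],
    "outputSchema":  "./schemas/spell.schema.json",
    "onSchemaError": "skip"
  }
}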

Retry × concurrency worst-case: If every fetch in a batch of concurrency tasks hits a transient error and backs off all the way to retryMaxDelayMs (30 seconds in the example below), that batch can take 30+ seconds. Worst-case total run time is ceil(N / concurrency) * maxRetryTime. For 1000 URLs with concurrency 10: ceil(1000/10) * 30s = 100 * 30s = 50 minutes in the absolute worst case (every fetch fails and retries the maximum number of times). In practice, cache hits and successful first attempts keep this much lower.
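
A back-of-the-envelope version of that arithmetic, assuming every batch takes the full worst-case retry time (purely illustrative, not part of Ripperoni):

ts
// Illustrative worst-case estimate: assumes every fetch in every batch
// exhausts its retries and takes roughly maxRetryTimeMs to fail or succeed.
function worstCaseRunTimeMs(urlCount: number, concurrency: number, maxRetryTimeMs: number): number {
  const batches = Math.ceil(urlCount / concurrency);
  return batches * maxRetryTimeMs;
}

// 1000 URLs, concurrency 10, ~30 s per batch => 3,000,000 ms (50 minutes)
console.log(worstCaseRunTimeMs(1000, 10, 30_000));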

Field mapping worked example: After your plugin extracts a record, mapping renames fields without touching your code:

json
"targets": {
  "aonprd": {
    "pipeline": ["html:fetch", "aonprd:parse", "json:write"],
    "mapping": {
      "name": "title",
      "description": "desc"
    }
  }
}

If your plugin sets state.output = { name: "Fireball", description: "Conjures..." }, the written file gets { title: "Fireball", desc: "Conjures...", ... }. The original fields are gone; only the mapped names appear in the JSON.
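
For illustration, the rename behaves like this small helper (a sketch, not Ripperoni's actual implementation; it assumes mapping is a flat old-name-to-new-name record):

ts
// Illustrative only: rename top-level output fields according to a mapping table.
// Keys not listed in the mapping pass through under their original names.
function applyMapping(
  output: Record<string, unknown>,
  mapping: Record<string, string>
): Record<string, unknown> {
  const renamed: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(output)) {
    renamed[mapping[key] ?? key] = value;
  }
  return renamed;
}

// { name: "Fireball", description: "..." } becomes { title: "Fireball", desc: "..." }
applyMapping({ name: "Fireball", description: "Conjures..." }, { name: "title", description: "desc" });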

Cache and retry interaction: The cache sits upstream of retry logic. A cache hit means no retry-executor is invoked (no exponential backoff). A cache miss triggers the full HTTP stack: rate limiter, retry executor with backoff, then cache write on success. The first fetch of a URL can take up to maxRetryTime; the second hit takes microseconds (cache read).
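
Conceptually, the per-URL flow looks something like the sketch below; the function and parameter names are illustrative and do not reflect Ripperoni's internal API:

ts
// Illustrative flow only: cache check first, then rate limit + retry on a miss.
async function fetchThroughCache(
  url: string,
  cache: {
    read: (url: string) => Promise<string | null>;
    write: (url: string, body: string) => Promise<void>;
  },
  waitForRateLimit: () => Promise<void>,               // enforces rateLimitMs + jitterMs
  fetchWithRetries: (url: string) => Promise<string>   // backoff up to retryMaxDelayMs
): Promise<string> {
  const hit = await cache.read(url);
  if (hit !== null) return hit;          // cache hit: no retry executor, no backoff

  await waitForRateLimit();              // cache miss: full HTTP stack
  const body = await fetchWithRetries(url);
  await cache.write(url, body);          // the next run reads this entry instead
  return body;
}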

Validation errors surface at the first write. If your plugin produces invalid output, the first file write detects it. All subsequent pages from the same target go through the same validator, so the full picture of validation failures emerges quickly; you do not have to run all 1000 pages to discover that your schema is wrong.

Example

json
{
  "targets": {
    "aonprd": {
      "baseUrl":          "https://2e.aonprd.com",
      "rateLimitMs":      1000,
      "jitterMs":         250,
      "maxRetries":       3,
      "retryBaseDelayMs": 500,
      "retryMaxDelayMs":  30000,
      "headers": {
        "User-Agent": "ripperoni/2.0 (+https://github.com/Studnicky/PathRipper)"
      },
      "pipeline": ["html:fetch", "aonprd:parse", "json:write"],
      "cache": {
        "dir": "./output/.cache/aonprd",
        "mode": "read-write"
      }
    }
  }
}

Raw content output

Every output record carries a _raw field by default. This field is injected just before the record is written to disk and holds the raw fetched bytes alongside the parsed fields. Downstream consumers can re-parse historical Ripperoni output without depending on Ripperoni's cache infrastructure.

Default behaviour

Raw content is always written. No configuration is required to get _raw in output. Parsing and enrichment are additive layers on top — plugins set state.output fields that appear alongside _raw, not instead of it.

A pipeline with no plugin step (["html:fetch", "json:write"]) is a valid and complete pipeline: it produces a raw dump per page with no further extraction. This is useful for archiving, debugging, or when you want to defer parsing to a downstream tool.

Shape

json
{
  "_raw": {
    "contentType": "text/html",
    "content":     "<html>...</html>",
    "fetchedAt":   "2026-05-07T04:00:00.000Z"
  }
}

| Field | Type | Notes |
| --- | --- | --- |
| contentType | string | MIME type of the response (text/html for HTML targets). |
| content | string | Full raw response body, byte-for-byte. |
| fetchedAt | ISO-8601 string | Timestamp at which the content was fetched. |
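
Expressed as a TypeScript type, the field could be described roughly as follows (an illustrative shape, not copied from RipperConfigSchema.ts):

ts
// Illustrative shape of the _raw field on each written record.
interface RawContent {
  contentType: string; // MIME type of the response, e.g. "text/html"
  content: string;     // full raw response body, byte-for-byte
  fetchedAt: string;   // ISO-8601 timestamp of the fetch
}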

Opting out (storage savings)

Set includeRawContent: false to strip _raw from output. Use this for production scrapes where storage is a concern. Rough estimate: 15,000 AONPRD records × 80 KB of HTML = roughly 1.2 GB of additional output. If you do not need to re-parse output offline, opt out to keep file sizes small.

json
{
  "targets": {
    "aonprd": {
      "baseUrl":           "https://2e.aonprd.com",
      "pipeline":          ["html:fetch", "aonprd:parse", "json:write"],
      "includeRawContent": false,
      "cache": { "dir": "./output/.cache/aonprd", "mode": "read-write" }
    }
  }
}

Raw-dump-only pipeline (no plugin)

A pipeline without a plugin task is fully supported and produces one JSON file per page:

json
{
  "targets": {
    "archive": {
      "baseUrl":  "https://example.com",
      "pipeline": ["html:fetch", "json:write"]
    }
  }
}

Output shape per record (the output object is empty because no plugin ran; _raw carries the content):

json
{
  "_raw": {
    "contentType": "text/html",
    "content":     "<html>...</html>",
    "fetchedAt":   "2026-05-07T04:00:00.000Z"
  }
}

Plugin contract

Plugins must not read or write the _raw field. It is set by html:fetch and consumed by json:write / jsonl:append. Plugins interact with state.page.html and state.output as usual; _raw is injected transparently into the serialized file just before the disk write.
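
For illustration only, a plugin that respects this contract touches state.page.html and state.output and nothing else; the types below are a sketch, not Ripperoni's actual plugin API:

ts
// Sketch of a well-behaved plugin step (illustrative types, not the real API).
interface PipelineState {
  page: { html: string };           // populated by html:fetch
  output: Record<string, unknown>;  // serialized by json:write / jsonl:append
  // _raw is deliberately absent: plugins never read or write it.
}

function extractTitle(state: PipelineState): void {
  const match = state.page.html.match(/<title>(.*?)<\/title>/i);
  state.output.title = match?.[1] ?? ""; // only state.output is written
}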


mediawiki

Same rate-limit, retry, concurrency, and cache options as targets. MediaWiki-specific additions:

| Key | Type | Required | Notes |
| --- | --- | --- | --- |
| apiUrl | URI | yes | MediaWiki API endpoint (e.g. https://en.wikipedia.org/w/api.php). |
| batchSize | integer 1–50 | no | Pages per batch request. MediaWiki API maximum is 50. |
| categories | string[] | no | Category names to enumerate. When present, overrides full-site enumeration. |

See MediaWiki for the three enumeration modes.
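
A minimal mediawiki block might look like the following; the block name, category, and cache directory are illustrative:

json
"mediawiki": {
  "wikipedia": {
    "apiUrl":      "https://en.wikipedia.org/w/api.php",
    "batchSize":   50,
    "categories":  ["Physics"],
    "rateLimitMs": 1000,
    "cache": { "dir": "./output/.cache/wikipedia", "mode": "read-write" }
  }
}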


crawlers

Top-level crawlers define link-harvesting jobs independent of scrape targets.

| Key | Type | Required | Notes |
| --- | --- | --- | --- |
| startUrls | URI[] | yes | Entry points for the crawl. |
| domain | regex string | yes | Links must match to be considered. Bounds the crawl to one site. |
| target | regex string | yes | Links matching delimiter AND this are collected as results. |
| delimiter | regex string | yes | Links matching this are traversed (followed). Others are ignored. |
| rateLimitMs | integer ≥ 0 | no | Gap between requests. |
| jitterMs | integer ≥ 0 | no | Jitter on top of rate limit. |
| maxPages | integer ≥ 1 | no | Traversal ceiling. |

See Crawler for how the three regexes interact.
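
Putting those keys together, a crawler block might look like this sketch (the crawler name, start URL, and regex patterns are illustrative, not a tested configuration):

json
"crawlers": {
  "aonprd-spells": {
    "startUrls":   ["https://2e.aonprd.com/Spells.aspx"],
    "domain":      "^https://2e\\.aonprd\\.com/",
    "delimiter":   "Spells\\.aspx",
    "target":      "Spells\\.aspx\\?ID=\\d+",
    "rateLimitMs": 1000,
    "maxPages":    500
  }
}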


cache config (shared shape)

Both targets and mediawiki blocks accept the same cache shape:

json
"cache": {
  "dir":   "./output/.cache/aonprd",
  "mode":  "read-write",
  "ttlMs": 86400000
}

| Key | Type | Required | Notes |
| --- | --- | --- | --- |
| dir | string | yes | Directory for cache meta files. |
| mode | enum | yes | read-write, read-only, write-only, or off. |
| ttlMs | integer ≥ 0 | no | Entries older than this (in ms) are treated as misses. |

See Cache for sharding, eviction, and TTL behavior.

