Configuration

The config is a JSON file. Load it with ripperoni --config ripperoni.config.json. Schema source of truth: src/schemas/internal/RipperConfigSchema.ts.

Copy ripperoni.config.example.json as a starting point. The working copy without the .example suffix (ripperoni.config.json) is gitignored.

Top-level shape

ts
{
  output:    OutputConfig;                      // required
  targets?:  { [name: string]: TargetConfig };  // HTML scrape targets
  mediawiki?: { [name: string]: WikiConfig };   // MediaWiki scrape targets
  crawlers?: { [name: string]: CrawlerConfig }; // link-crawler configs
}

output

Global output settings:

| Key | Type | Required | Notes |
| --- | --- | --- | --- |
| basePath | string | yes | Base directory for all written output files. |
| format | "json" \| "html" \| "text" | no | Output file format. json is the default and what downstream tools (like Squashage) expect. |
| pretty | boolean | no | Pretty-print JSON output. Default false. |
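
A minimal output block might look like the following; the basePath value is illustrative, and format and pretty simply restate the defaults:

json
"output": {
  "basePath": "./output",
  "format":   "json",
  "pretty":   false
}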

targets (HTML scrape)

Each key is a target name (e.g. "aonprd"). Value is a target config.

Required

| Key | Type | Notes |
| --- | --- | --- |
| baseUrl | URI | Base URL for the target. All fetched URLs are resolved against this. |
| pipeline | string[] | Ordered task names. Minimum one. |

Optional

| Key | Type | Default | Notes |
| --- | --- | --- | --- |
| rateLimitMs | integer ≥ 0 | | Minimum milliseconds between requests. |
| jitterMs | integer ≥ 0 | | Random jitter added on top of rateLimitMs. Applied per request. |
| maxRetries | integer 0–10 | | Retry attempts on transient errors. |
| retryBaseDelayMs | integer ≥ 100 | | Base delay for retry backoff. |
| retryMaxDelayMs | integer ≥ 1000 | | Backoff ceiling. |
| concurrency | integer 1–32 | 1 | Parallel fetch/process slots. |
| maxPages | integer ≥ 0 | | Stop after processing this many pages. |
| headers | object | | Additional HTTP headers. Include User-Agent. |
| outputSchema | string | | Path to a JSON Schema file. Records that fail validation are handled per onSchemaError. |
| onSchemaError | "halt" \| "skip" \| "warn" | | What to do when a record fails schema validation. |
| includeRawContent | boolean | true | When false, raw content is not populated on state.page._raw at all during the pipeline run. No raw file is written to raw/. See Output folder layout below. |
| mapping | object | | Field-rename map applied after plugin output. |
| cache | CacheConfig | | See Cache. |
| crawler | CrawlerConfig | | Inline crawler config for this target. |

Concurrency bound rationale: Concurrency is clamped to 1–32 to prevent runaway parallelism. At concurrency 32, you can have 32 HTTP requests in flight simultaneously. This is usually enough to saturate downstream bandwidth and quickly hit many servers' rate limits. Beyond 32, the marginal benefit drops and the risk of getting blocked increases. If you need more parallelism, run multiple Ripperoni instances.

Validation timing: When validation errors surface depends on your schema. If your outputSchema has required fields and your plugin sets output: {}, the validation fails when json:write tries to serialize the record. The error is handled per onSchemaError: "halt" throws and stops the run, "skip" logs a warning and skips the file, "warn" logs and writes anyway.
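
For example, a target that validates each record and skips failures might add these two keys (the schema path is illustrative):

json
"targets": {
  "aonprd": {
    "baseUrl":       "https://2e.aonprd.com",
    "pipeline":      ["html:fetch", "aonprd:parse", "json:write"],
    "outputSchema":  "./schemas/spell.schema.json",
    "onSchemaError": "skip"
  }
}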

Retry × concurrency worst-case: If every fetch in a batch of concurrency tasks hits a transient error and backs off all the way to retryMaxDelayMs (30 seconds in the example below), that batch can take 30+ seconds. Worst-case total run time is ceil(N / concurrency) * maxRetryTime. For 1000 URLs with concurrency 10: ceil(1000/10) * 30s = 100 * 30s = 50 minutes in the absolute worst case (every fetch fails and retries the maximum number of times). In practice, cache hits and successful first attempts keep this much lower.
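
A back-of-the-envelope version of that arithmetic, assuming every batch takes the full worst-case retry time (purely illustrative, not part of Ripperoni):

ts
// Illustrative worst-case estimate: assumes every fetch in every batch
// exhausts its retries and takes roughly maxRetryTimeMs to fail or succeed.
function worstCaseRunTimeMs(urlCount: number, concurrency: number, maxRetryTimeMs: number): number {
  const batches = Math.ceil(urlCount / concurrency);
  return batches * maxRetryTimeMs;
}

// 1000 URLs, concurrency 10, ~30 s per batch => 3,000,000 ms (50 minutes)
console.log(worstCaseRunTimeMs(1000, 10, 30_000));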

Field mapping worked example: After your plugin extracts a record, mapping renames fields without touching your code:

json
"targets": {
  "aonprd": {
    "pipeline": ["html:fetch", "aonprd:parse", "json:write"],
    "mapping": {
      "name": "title",
      "description": "desc"
    }
  }
}

If your plugin sets state.output = { name: "Fireball", description: "Conjures..." }, the written file gets { title: "Fireball", desc: "Conjures...", ... }. The original fields are gone; only the mapped names appear in the JSON.
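
For illustration, the rename behaves like this small helper (a sketch, not Ripperoni's actual implementation; it assumes mapping is a flat old-name-to-new-name record):

ts
// Illustrative only: rename top-level output fields according to a mapping table.
// Keys not listed in the mapping pass through under their original names.
function applyMapping(
  output: Record<string, unknown>,
  mapping: Record<string, string>
): Record<string, unknown> {
  const renamed: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(output)) {
    renamed[mapping[key] ?? key] = value;
  }
  return renamed;
}

// { name: "Fireball", description: "..." } becomes { title: "Fireball", desc: "..." }
applyMapping({ name: "Fireball", description: "Conjures..." }, { name: "title", description: "desc" });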

Cache and retry interaction: The cache sits upstream of retry logic. A cache hit means no retry-executor is invoked (no exponential backoff). A cache miss triggers the full HTTP stack: rate limiter, retry executor with backoff, then cache write on success. The first fetch of a URL can take up to maxRetryTime; the second hit takes microseconds (cache read).
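
Conceptually, the per-URL flow looks something like the sketch below; the function and parameter names are illustrative and do not reflect Ripperoni's internal API:

ts
// Illustrative flow only: cache check first, then rate limit + retry on a miss.
async function fetchThroughCache(
  url: string,
  cache: {
    read: (url: string) => Promise<string | null>;
    write: (url: string, body: string) => Promise<void>;
  },
  waitForRateLimit: () => Promise<void>,               // enforces rateLimitMs + jitterMs
  fetchWithRetries: (url: string) => Promise<string>   // backoff up to retryMaxDelayMs
): Promise<string> {
  const hit = await cache.read(url);
  if (hit !== null) return hit;          // cache hit: no retry executor, no backoff

  await waitForRateLimit();              // cache miss: full HTTP stack
  const body = await fetchWithRetries(url);
  await cache.write(url, body);          // the next run reads this entry instead
  return body;
}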

Validation errors surface at the first write. If your plugin produces invalid output, the first file write detects it. All subsequent pages from the same target go through the same validator, so the full picture of validation failures emerges quickly; you do not have to run all 1000 pages to discover that your schema is wrong.

Example

json
{
  "targets": {
    "aonprd": {
      "baseUrl":          "https://2e.aonprd.com",
      "rateLimitMs":      1000,
      "jitterMs":         250,
      "maxRetries":       3,
      "retryBaseDelayMs": 500,
      "retryMaxDelayMs":  30000,
      "headers": {
        "User-Agent": "ripperoni/2.0 (+https://github.com/Studnicky/PathRipper)"
      },
      "pipeline": ["html:fetch", "aonprd:parse", "json:write"],
      "cache": {
        "dir": "./output/.cache/aonprd",
        "mode": "read-write"
      }
    }
  }
}

Raw content output

Every output record carries a _raw field by default. This field is injected just before the record is written to disk and holds the raw fetched bytes alongside the parsed fields. Downstream consumers can re-parse historical Ripperoni output without depending on Ripperoni's cache infrastructure.

Default behaviour

Raw content is always written. No configuration is required to get _raw in output. Parsing and enrichment are additive layers on top — plugins set state.output fields that appear alongside _raw, not instead of it.

A pipeline with no plugin step (["html:fetch", "json:write"]) is a valid and complete pipeline: it produces a raw dump per page with no further extraction. This is useful for archiving, debugging, or when you want to defer parsing to a downstream tool.

Shape

json
{
  "_raw": {
    "contentType": "text/html",
    "content":     "<html>...</html>",
    "fetchedAt":   "2026-05-07T04:00:00.000Z"
  }
}

| Field | Type | Notes |
| --- | --- | --- |
| contentType | string | MIME type of the response (text/html for HTML targets). |
| content | string | Full raw response body, byte-for-byte. |
| fetchedAt | ISO-8601 string | Timestamp at which the content was fetched. |
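
Expressed as a TypeScript type, the field could be described roughly as follows (an illustrative shape, not copied from RipperConfigSchema.ts):

ts
// Illustrative shape of the _raw field on each written record.
interface RawContent {
  contentType: string; // MIME type of the response, e.g. "text/html"
  content: string;     // full raw response body, byte-for-byte
  fetchedAt: string;   // ISO-8601 timestamp of the fetch
}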

Opting out (storage savings)

Set includeRawContent: false to strip _raw from output. Use this for production scrapes where storage is a concern. Rough estimate: 15,000 AONPRD records × 80 KB of HTML = roughly 1.2 GB of additional output. If you do not need to re-parse output offline, opt out to keep file sizes small.

json
{
  "targets": {
    "aonprd": {
      "baseUrl":           "https://2e.aonprd.com",
      "pipeline":          ["html:fetch", "aonprd:parse", "json:write"],
      "includeRawContent": false,
      "cache": { "dir": "./output/.cache/aonprd", "mode": "read-write" }
    }
  }
}

Raw-dump-only pipeline (no plugin)

A pipeline without a plugin task is fully supported and produces one JSON file per page:

json
{
  "targets": {
    "archive": {
      "baseUrl":  "https://example.com",
      "pipeline": ["html:fetch", "json:write"]
    }
  }
}

Output shape per record (the output object is empty because no plugin ran; _raw carries the content):

json
{
  "_raw": {
    "contentType": "text/html",
    "content":     "<html>...</html>",
    "fetchedAt":   "2026-05-07T04:00:00.000Z"
  }
}

Plugin contract

Plugins must not read or write the _raw field. It is set by html:fetch and consumed by json:write / jsonl:append. Plugins interact with state.page.html and state.output as usual; _raw is injected transparently into the serialized file just before the disk write.
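
For illustration only, a plugin that respects this contract touches state.page.html and state.output and nothing else; the types below are a sketch, not Ripperoni's actual plugin API:

ts
// Sketch of a well-behaved plugin step (illustrative types, not the real API).
interface PipelineState {
  page: { html: string };           // populated by html:fetch
  output: Record<string, unknown>;  // serialized by json:write / jsonl:append
  // _raw is deliberately absent: plugins never read or write it.
}

function extractTitle(state: PipelineState): void {
  const match = state.page.html.match(/<title>(.*?)<\/title>/i);
  state.output.title = match?.[1] ?? ""; // only state.output is written
}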


mediawiki

Same rate-limit, retry, concurrency, and cache options as targets. MediaWiki-specific additions:

| Key | Type | Required | Notes |
| --- | --- | --- | --- |
| apiUrl | URI | yes | MediaWiki API endpoint (e.g. https://en.wikipedia.org/w/api.php). |
| batchSize | integer 1–50 | no | Pages per batch request. MediaWiki API maximum is 50. |
| categories | string[] | no | Category names to enumerate. When present, overrides full-site enumeration. |

See MediaWiki for the three enumeration modes.
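
A minimal mediawiki block might look like the following; the block name, category, and cache directory are illustrative:

json
"mediawiki": {
  "wikipedia": {
    "apiUrl":      "https://en.wikipedia.org/w/api.php",
    "batchSize":   50,
    "categories":  ["Physics"],
    "rateLimitMs": 1000,
    "cache": { "dir": "./output/.cache/wikipedia", "mode": "read-write" }
  }
}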


crawlers

Top-level crawlers define link-harvesting jobs independent of scrape targets.

| Key | Type | Required | Notes |
| --- | --- | --- | --- |
| startUrls | URI[] | yes | Entry points for the crawl. |
| domain | regex string | yes | Links must match to be considered. Bounds the crawl to one site. |
| target | regex string | yes | Links matching delimiter AND this are collected as results. |
| delimiter | regex string | yes | Links matching this are traversed (followed). Others are ignored. |
| rateLimitMs | integer ≥ 0 | no | Gap between requests. |
| jitterMs | integer ≥ 0 | no | Jitter on top of rate limit. |
| maxPages | integer ≥ 1 | no | Traversal ceiling. |

See Crawler for how the three regexes interact.
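
Putting those keys together, a crawler block might look like this sketch (the crawler name, start URL, and regex patterns are illustrative, not a tested configuration):

json
"crawlers": {
  "aonprd-spells": {
    "startUrls":   ["https://2e.aonprd.com/Spells.aspx"],
    "domain":      "^https://2e\\.aonprd\\.com/",
    "delimiter":   "Spells\\.aspx",
    "target":      "Spells\\.aspx\\?ID=\\d+",
    "rateLimitMs": 1000,
    "maxPages":    500
  }
}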


cache config (shared shape)

Both targets and mediawiki blocks accept the same cache shape:

json
"cache": {
  "dir":   "./output/.cache/aonprd",
  "mode":  "read-write",
  "ttlMs": 86400000
}

| Key | Type | Required | Notes |
| --- | --- | --- | --- |
| dir | string | yes | Directory for cache meta files. |
| mode | enum | yes | read-write, read-only, write-only, or off. |
| ttlMs | integer ≥ 0 | no | Entries older than this (in ms) are treated as misses. |

See Cache for sharding, eviction, and TTL behavior.

