Point it at a wiki, a site, or a list of URLs. It slices through everything, one page at a time, and hands you the meat. The domain-specific bits are your problem — write a plugin, register it, and Ripperoni will run it against every page it finds.

The core is a typed middleware pipeline: small task functions chain with next(), each one narrowing the raw page into structured output. Retry logic, rate limiting, and error classification are built in. The engine doesn't know what you're scraping and it doesn't care. That's the point.

Grew out of PathRipper (2019). HTTP machinery ported from TORUS (Topological Orchestration Runtime for Unified Streaming), a streaming DAG orchestration tool currently under development. The rest is new. Now available with 100% more salami iconography.

Features

Typed Pipeline

Middleware task queue with async (next, state) => void signature. Add, compose, and reorder tasks without touching anything else. State is your generic — the pipeline doesn't impose a shape.
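
As a rough sketch of the pattern (illustrative types and names, not Ripperoni's actual Pipeline class):

```typescript
// Illustrative sketch of the middleware pattern -- not Ripperoni's actual
// Pipeline class. Each task receives next() and a shared, generically
// typed state object.
type Task<S> = (next: () => Promise<void>, state: S) => Promise<void>;

class MiniPipeline<S> {
  private tasks: Task<S>[] = [];

  addTask(task: Task<S>): this {
    this.tasks.push(task);
    return this;
  }

  async execute(state: S): Promise<void> {
    // Each task decides when (or whether) to hand control to the next one.
    const dispatch = async (i: number): Promise<void> => {
      if (i >= this.tasks.length) return;
      await this.tasks[i](() => dispatch(i + 1), state);
    };
    await dispatch(0);
  }
}

// Usage: a task that narrows raw HTML into a structured field.
interface State { html: string; title?: string }

const pipeline = new MiniPipeline<State>().addTask(async (next, state) => {
  state.title = /<title>(.*?)<\/title>/.exec(state.html)?.[1];
  await next();
});
```

Because the state type is a generic parameter, adding or reordering tasks never forces changes elsewhere in the chain.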

HTML Scraper

Native fetch + cheerio. No JSDOM, no headless browser unless you need one. Returns a CheerioAPI handle so you work with familiar selectors. Configurable per-target base URL and headers.
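
The per-target resolution can be pictured like this (a hypothetical buildRequest helper; Ripperoni's internals may differ):

```typescript
// Sketch of per-target request building: relative paths resolve against the
// target's baseUrl, and target-level headers apply to every request.
// buildRequest is a hypothetical helper, not Ripperoni's API.
interface TargetConfig {
  baseUrl: string;
  headers?: Record<string, string>;
}

function buildRequest(
  target: TargetConfig,
  path: string,
): { url: string; headers: Record<string, string> } {
  return {
    // WHATWG URL resolution: "/page/1" + "https://example.com" -> absolute URL
    url: new URL(path, target.baseUrl).toString(),
    headers: { "user-agent": "ripperoni-example", ...target.headers },
  };
}
```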

MediaWiki Scraper

Native fetch against the MediaWiki JSON API. Three modes: single category, categories array, or full-wiki enumeration via allpages. Batch wikitext fetch, redirect resolution, wtf_wikipedia infobox extraction.
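
The single-category mode boils down to a categorymembers query against the standard MediaWiki action API. The helper below is an illustrative sketch, not Ripperoni's code:

```typescript
// Build a MediaWiki API URL for one page of category members.
// list=categorymembers is standard MediaWiki; the wrapper is illustrative.
function categoryMembersUrl(apiUrl: string, category: string, cont?: string): string {
  const params = new URLSearchParams({
    action: "query",
    list: "categorymembers",
    cmtitle: `Category:${category}`,
    cmlimit: "500",
    format: "json",
  });
  // Pagination cursor returned by the previous response, if any.
  if (cont) params.set("cmcontinue", cont);
  return `${apiUrl}?${params}`;
}
```

Full-wiki enumeration swaps list=categorymembers for list=allpages and pages the same way.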

Link Crawler

Modernized LinkLister from PathRipper. Provide domain, target, and delimiter regexes; it recursively crawls pages, deduplicates, and returns all matching links in natural sort order. Respects the rate limiter.
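
The dedupe-and-sort step might look like this (illustrative; Intl.Collator with numeric: true is what gives "natural" ordering):

```typescript
// Deduplicate crawled links and sort them naturally, so "page2" comes
// before "page10" -- plain lexicographic sort would reverse them.
// Illustrative sketch, not LinkLister's actual code.
function dedupeAndSortNaturally(links: string[]): string[] {
  const collator = new Intl.Collator(undefined, { numeric: true, sensitivity: "base" });
  return [...new Set(links)].sort(collator.compare);
}
```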

Retry + Backoff

Ported from TORUS's RetryPolicyNode. Exponential backoff with ±10% decorrelated jitter. Respects Retry-After headers. Configurable max attempts, base delay, multiplier, and ceiling.

Error Classification

Ported from TORUS's ErrorClassifier. Classifies errors as NETWORK / THROTTLED / TIMEOUT / TRANSIENT / PERMANENT / VALIDATION / RESOURCE. Only retryable categories trigger a retry.
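
A sketch of the idea, using the categories above but hypothetical mapping heuristics (not TORUS's actual rules):

```typescript
// Error categories from the README; the status-code mapping below is an
// illustrative guess, not ErrorClassifier's real logic.
type ErrorCategory =
  | "NETWORK" | "THROTTLED" | "TIMEOUT" | "TRANSIENT"
  | "PERMANENT" | "VALIDATION" | "RESOURCE";

// Only these categories trigger a retry.
const RETRYABLE = new Set<ErrorCategory>(["NETWORK", "THROTTLED", "TIMEOUT", "TRANSIENT"]);

// Map an HTTP status to a category. Transport-level failures (DNS errors,
// reset sockets) would be classified NETWORK before a status ever exists.
function classifyStatus(status: number): ErrorCategory {
  if (status === 429) return "THROTTLED";
  if (status === 408 || status === 504) return "TIMEOUT";
  if (status === 404 || status === 410) return "RESOURCE";
  if (status === 400 || status === 422) return "VALIDATION";
  if (status >= 500) return "TRANSIENT";
  return "PERMANENT";
}

const shouldRetry = (status: number): boolean => RETRYABLE.has(classifyStatus(status));
```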

Rate Limiter

bottleneck-backed token bucket. RateLimiter.perSecond(n) or withDelay(ms). Every scraper and crawler runs through a limiter — respecting remote servers isn't optional.
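
The minimum-spacing behavior behind withDelay(ms) can be modeled as a pure scheduling function (illustrative; the real limiter delegates to bottleneck):

```typescript
// Given request arrival times, return dispatch times spaced at least
// minGapMs apart -- the invariant a withDelay(ms)-style limiter enforces.
// Pure-function sketch, not the bottleneck-backed implementation.
function schedule(arrivalsMs: number[], minGapMs: number): number[] {
  let last = -Infinity;
  return arrivalsMs.map((t) => {
    // Dispatch immediately if the gap has already elapsed; otherwise wait.
    last = Math.max(t, last + minGapMs);
    return last;
  });
}
```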

Structured Logger

Ported from Torreya's logger. Logger.forComponent(name) emits JSON lines to stdout/stderr. LOG_LEVEL env gate. Every request, every retry, every file write — all attributable.
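
A minimal sketch of the shape (makeLogger is hypothetical; the real Logger.forComponent API is richer):

```typescript
// JSON-lines component logger gated by LOG_LEVEL. Illustrative sketch,
// not the ported Torreya logger.
const LEVELS = ["debug", "info", "warn", "error"] as const;
type Level = (typeof LEVELS)[number];

function makeLogger(component: string, sink: (line: string) => void = console.log) {
  // Messages below the LOG_LEVEL threshold (default "info") are dropped.
  const threshold = LEVELS.indexOf((process.env.LOG_LEVEL as Level) ?? "info");
  return (level: Level, msg: string, fields: Record<string, unknown> = {}) => {
    if (LEVELS.indexOf(level) < threshold) return;
    sink(JSON.stringify({ ts: new Date().toISOString(), level, component, msg, ...fields }));
  };
}
```

One JSON object per line keeps the output greppable and trivially parseable by log shippers.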

Quickstart

Install

npm install
npm run build

Create a config

{
  "output": { "basePath": "./output" },
  "mediawiki": {
    "<your-wiki-target>": {
      "apiUrl":      "https://wiki.example/w/api.php",
      "rateLimitMs": 1000,
      "categories":  ["Category A", "Category B"],
      "pipeline":    ["./plugins/your-target/parse.task.js"]
    }
  },
  "targets": {
    "<your-html-target>": {
      "baseUrl":     "https://example.com",
      "rateLimitMs": 500,
      "pipeline":    ["./plugins/your-target/parse.task.js"]
    }
  }
}

Copy ripperoni.config.example.json to ripperoni.config.json and edit; the latter is gitignored.

Scrape a MediaWiki target

ripperoni scrape \
  --target <your-wiki-target> \
  --category "Example Category Name" \
  --config ripperoni.config.json

Omit --category to use the categories array from the config; if no categories are configured, every article in the wiki is enumerated via the allpages API. Writes one .json file per page under ./output/<your-wiki-target>/. The output shape is controlled by the parse plugin registered as <targetId>:parse.

Crawl a site for links

ripperoni crawl \
  --starts "https://example.com/index" \
  --domain "example\.com" \
  --target "\?id=" \
  --delimiter "category" \
  --rate 100

Scrape HTML pages

ripperoni scrape \
  --target <your-html-target> \
  --paths "/page/1" "/page/2" \
  --config ripperoni.config.json

Config reference

All options live in a single JSON file. Pass with --config <path> (default: ./ripperoni.config.json).

output

Field     Default   Description
basePath  ./output  Root directory for all scraped output files.
format    json      json · html · text
pretty    true      JSON pretty-print with 2-space indent.

targets (HTML scraper)

Field        Default   Description
baseUrl      required  Base URL prepended to relative paths.
rateLimitMs  250       Minimum ms between requests to this target.
jitterMs     0         Random per-request jitter added on top of rateLimitMs.
maxRetries   3         Max retry attempts on retryable errors.
headers      {}        HTTP headers sent with every request to this target.
pipeline     (none)    Paths to parse plugin .js files. Each plugin registers a task as <targetId>:parse.

mediawiki targets

Field        Default   Description
apiUrl       required  MediaWiki API endpoint URL.
rateLimitMs  1000      Minimum ms between API requests. Most wiki policies allow at most 1 req/s; check yours.
jitterMs     0         Random per-request jitter added on top of rateLimitMs. Makes request spacing less robotic.
categories   (none)    Optional list of category names to scrape. Omit to enumerate all articles via the allpages API.
pipeline     (none)    Paths to parse plugin .js files. Each plugin registers a task as <targetId>:parse.

Set LOG_LEVEL=debug to see every request, retry, and file write. Default level is info.

Programmatic use

Classes are exported via subpath imports. Use them directly without the CLI:

import { Pipeline } from 'ripperoni/Pipeline';
import { MediaWikiScraper } from 'ripperoni/MediaWikiScraper';
import { WikitextParser } from 'ripperoni/WikitextParser';
import { TaskRegistry } from 'ripperoni/registry/TaskRegistry';
import { PipelineState } from 'ripperoni/registry/PipelineState';

const scraper = await MediaWikiScraper.create({
  apiUrl:      'https://wiki.example/w/api.php',
  rateLimitMs: 1000,
});

const pages = await scraper.scrapeCategory('Example Category Name');

const pipeline = new Pipeline({ name: 'my-job' });
pipeline.addTask(async (next, state) => {
  state.output = WikitextParser.parse(state.page.title, state.page.wikitext ?? '');
  await next();
});

for (const page of pages) {
  await pipeline.execute(PipelineState.fromWikiPage('my-target', page));
}

Write a parse plugin

Plugins are plain .js files loaded at runtime from the pipeline arrays in the config. Each plugin registers itself under <targetId>:parse:

// plugins/my-target/parse.task.js
import { TaskRegistry } from '../../dist/registry/TaskRegistry.js';

TaskRegistry.register('my-target:parse', async (next, state) => {
  // state.page.wikitext or state.page.html is available here
  state.output = {
    title: state.page.title,
    // ... your structured fields
  };
  await next();
});

Build TypeScript plugins with npm run build:plugins. The pipeline runs <targetId>:parse for each page before writing the output file.