Point it at a wiki, a site, or a list of URLs. It slices through everything, one page at a time, and hands you the meat. The domain-specific bits are your problem — write a plugin, register it, and Ripperoni will run it against every page it finds.
The core is a typed middleware pipeline: small task functions chain with next(), each one narrowing the raw page into structured output. Retry logic, rate limiting, and error classification are built in. The engine doesn't know what you're scraping and it doesn't care. That's the point.
Grew out of PathRipper (2019). HTTP machinery ported from TORUS (Topological Orchestration Runtime for Unified Streaming), a streaming DAG orchestration tool currently under development. The rest is new. Now available with 100% more salami iconography.
Features
Typed Pipeline
Middleware task queue with async (next, state) => void signature. Add, compose, and reorder tasks without touching anything else. State is your generic — the pipeline doesn't impose a shape.
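To make that shape concrete, here is a rough sketch of a task; the `Task` alias and the state type below are illustrative, not Ripperoni's exported names:

```ts
// Illustrative only: a middleware task receives next() and a caller-defined state.
type Task<S> = (next: () => Promise<void>, state: S) => Promise<void>;

interface PageState { html: string; title?: string }

const extractTitle: Task<PageState> = async (next, state) => {
  state.title = /<title>(.*?)<\/title>/i.exec(state.html)?.[1]; // narrow raw HTML into a field
  await next();                                                 // hand off to the next task
};
```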
HTML Scraper
Native fetch + cheerio. No JSDOM, no headless browser unless you need one. Returns a CheerioAPI handle so you work with familiar selectors. Configurable per-target base URL and headers.
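For reference, plain cheerio usage looks like this; the scraper hands your tasks the same kind of `$` handle (the markup here is made up):

```ts
import * as cheerio from 'cheerio';

const $ = cheerio.load('<ul class="nav"><li>Home</li><li>Docs</li></ul>');
const labels = $('ul.nav li').map((_, el) => $(el).text()).get(); // ['Home', 'Docs']
```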
MediaWiki Scraper
Native fetch against the MediaWiki JSON API. Three modes: single category, categories array, or full-wiki enumeration via allpages. Batch wikitext fetch, redirect resolution, wtf_wikipedia infobox extraction.
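The infobox step leans on wtf_wikipedia; a standalone sketch of that library (not Ripperoni's internal code) looks roughly like:

```ts
import wtf from 'wtf_wikipedia';

const doc = wtf('{{Infobox software|name=Ripperoni|genre=scraper}}\nBody text.');
const infobox = doc.infobox();   // first infobox on the page, or null
console.log(infobox?.json());    // plain object keyed by infobox field names
```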
Link Crawler
Modernized LinkLister from PathRipper. Provide domain, target, and delimiter regexes; it recursively crawls pages, deduplicates, and returns all matching links in natural sort order. Respects the configured rate limit.
Retry + Backoff
Ported from TORUS's RetryPolicyNode. Exponential backoff with ±10% decorrelated jitter. Respects Retry-After headers. Configurable max attempts, base delay, multiplier, and ceiling.
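As a rough illustration of that schedule (the exact formula inside RetryPolicyNode is an assumption beyond "exponential with ±10% jitter"):

```ts
// attempt 0, 1, 2, ... -> baseMs, baseMs*multiplier, ... capped at maxMs, then ±10% jitter.
function backoffDelayMs(attempt: number, baseMs = 500, multiplier = 2, maxMs = 30_000): number {
  const exponential = Math.min(baseMs * multiplier ** attempt, maxMs);
  const jitter = exponential * 0.1 * (2 * Math.random() - 1); // anywhere within ±10%
  return Math.round(exponential + jitter);
}
```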
Error Classification
Ported from TORUS's ErrorClassifier. Classifies errors as NETWORK / THROTTLED / TIMEOUT / TRANSIENT / PERMANENT / VALIDATION / RESOURCE. Only retryable categories trigger a retry.
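In TypeScript terms the categories amount to something like the sketch below; which of them count as retryable here is an assumption, not a statement about the actual ErrorClassifier:

```ts
type ErrorCategory =
  | 'NETWORK' | 'THROTTLED' | 'TIMEOUT' | 'TRANSIENT'
  | 'PERMANENT' | 'VALIDATION' | 'RESOURCE';

// Assumed split: transient-looking failures retry, everything else fails fast.
const RETRYABLE: ReadonlySet<ErrorCategory> =
  new Set(['NETWORK', 'THROTTLED', 'TIMEOUT', 'TRANSIENT']);
```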
Rate Limiter
bottleneck-backed token bucket. RateLimiter.perSecond(n) or withDelay(ms). Every scraper and crawler runs through a limiter — respecting remote servers isn't optional.
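Under the hood this is ordinary bottleneck usage; the mapping below from perSecond(2) to minTime is an assumption about the wrapper, but the bottleneck calls themselves are real:

```ts
import Bottleneck from 'bottleneck';

const limiter = new Bottleneck({ minTime: 500 }); // at most one job every 500 ms, roughly perSecond(2)
const res = await limiter.schedule(() => fetch('https://example.com/robots.txt'));
```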
Structured Logger
Ported from Torreya's logger. Logger.forComponent(name) emits JSON lines to stdout/stderr. LOG_LEVEL env gate. Every request, every retry, every file write — all attributable.
Quickstart
Install
```bash
npm install
npm run build
```
Create a config
```json
{
  "output": { "basePath": "./output" },
  "mediawiki": {
    "<your-wiki-target>": {
      "apiUrl": "https://wiki.example/w/api.php",
      "rateLimitMs": 1000,
      "categories": ["Category A", "Category B"],
      "pipeline": ["./plugins/your-target/parse.task.js"]
    }
  },
  "targets": {
    "<your-html-target>": {
      "baseUrl": "https://example.com",
      "rateLimitMs": 500,
      "pipeline": ["./plugins/your-target/parse.task.js"]
    }
  }
}
```
Copy ripperoni.config.example.json to ripperoni.config.json and edit. The real config (the one without .example) is gitignored.
Scrape a MediaWiki target
```bash
ripperoni scrape \
  --target <your-wiki-target> \
  --category "Example Category Name" \
  --config ripperoni.config.json
```
Omit --category to use the categories array from config; if that is absent too, every article in the wiki is enumerated via the allpages API. Writes one .json per page under ./output/<your-wiki-target>/. Output shape is controlled by the parse plugin registered as <targetId>:parse.
Crawl a site for links
```bash
ripperoni crawl \
  --starts "https://example.com/index" \
  --domain "example\.com" \
  --target "\?id=" \
  --delimiter "category" \
  --rate 100
```
Scrape HTML pages
```bash
ripperoni scrape \
  --target <your-html-target> \
  --paths "/page/1" "/page/2" \
  --config ripperoni.config.json
```
Config reference
All options live in a single JSON file. Pass with --config <path> (default: ./ripperoni.config.json).
output
| Field | Default | Description |
|---|---|---|
| basePath | ./output | Root directory for all scraped output files. |
| format | json | json · html · text |
| pretty | true | JSON pretty-print with 2-space indent. |
mediawiki targets
| Field | Default | Description |
|---|---|---|
| apiUrl | required | MediaWiki API endpoint URL. |
| rateLimitMs | 1000 | Minimum ms between API requests. Most wiki API policies allow at most about 1 request per second — check yours. |
| jitterMs | 0 | Random per-request jitter added on top of rateLimitMs. Makes request spacing less robotic. |
| categories | – | Optional list of category names to scrape. Omit to enumerate all articles via the allpages API. |
| pipeline | – | Paths to parse plugin .js files. Each plugin registers a task as <targetId>:parse. |
targets (HTML scraper)
| Field | Default | Description |
|---|---|---|
| baseUrl | required | Base URL prepended to relative paths. |
| rateLimitMs | 250 | Minimum ms between requests to this target. |
| jitterMs | 0 | Random per-request jitter added on top of rateLimitMs. |
| maxRetries | 3 | Max retry attempts on retryable errors. |
| headers | {} | HTTP headers sent with every request to this target (see the example below). |
| tasks | – | Paths to parse plugin .js files. |
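A headers block is a natural place for an identifying User-Agent; the value below is only an example:

```json
{
  "targets": {
    "<your-html-target>": {
      "baseUrl": "https://example.com",
      "headers": { "User-Agent": "ripperoni/1.0 (you@example.com)" }
    }
  }
}
```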
Set LOG_LEVEL=debug to see every request, retry, and file write. The default level is info.
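For example:

```bash
LOG_LEVEL=debug ripperoni scrape --target <your-wiki-target> --config ripperoni.config.json
```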
Programmatic use
Classes are exported via subpath imports. Use them directly without the CLI:
```ts
import { Pipeline } from 'ripperoni/Pipeline';
import { MediaWikiScraper } from 'ripperoni/MediaWikiScraper';
import { WikitextParser } from 'ripperoni/WikitextParser';
import { TaskRegistry } from 'ripperoni/registry/TaskRegistry';
import { PipelineState } from 'ripperoni/registry/PipelineState';

const scraper = await MediaWikiScraper.create({
  apiUrl: 'https://wiki.example/w/api.php',
  rateLimitMs: 1000,
});

const pages = await scraper.scrapeCategory('Example Category Name');

const pipeline = new Pipeline({ name: 'my-job' });
pipeline.addTask(async (next, state) => {
  state.output = WikitextParser.parse(state.page.title, state.page.wikitext ?? '');
  await next();
});

for (const page of pages) {
  await pipeline.execute(PipelineState.fromWikiPage('my-target', page));
}
```
Write a parse plugin
Plugins are plain .js files loaded at runtime from the paths listed in the target's config. Each plugin registers itself under <targetId>:parse:
```js
// plugins/my-target/parse.task.js
import { TaskRegistry } from '../../dist/registry/TaskRegistry.js';

TaskRegistry.register('my-target:parse', async (next, state) => {
  // state.page.wikitext or state.page.html is available here
  state.output = {
    title: state.page.title,
    // ... your structured fields
  };
  await next();
});
```
Build TypeScript plugins with npm run build:plugins. The pipeline runs <targetId>:parse for each page before writing the output file.
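For example, if the plugin above only set the title field, the written file for a page titled "Example Page" would contain something like this (with the default pretty-printing; the filename convention isn't shown here):

```json
{
  "title": "Example Page"
}
```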