
# Ripperoni
Web ingestion engine: slices wikis, sites, and URL lists into JSON records, one page at a time. Point it at a domain, hand it a plugin, and Ripperoni fetches each page, runs your plugin against it, and drops structured JSON records on disk. It hands you the meat.
- Typed pipeline. Middleware task queue with an `async (next, state) => void` signature. Add, compose, and reorder tasks without touching anything else.
- HTML scraper. Native fetch + cheerio. No JSDOM, no headless browser. Returns a `CheerioAPI` handle so you work with familiar selectors.
- MediaWiki scraper. Native fetch against the MediaWiki JSON API. Three modes: single category, categories array, or full-wiki enumeration. Batch wikitext fetch, redirect resolution, `wtf_wikipedia` infobox extraction.
- Link crawler. Recursively crawls pages matching domain/target/delimiter regexes. Deduplicates, sorts naturally, respects the rate limit.
- Retry + backoff. Exponential backoff with decorrelated jitter. Respects `Retry-After` headers. Classifies errors as `NETWORK`, `THROTTLED`, `TIMEOUT`, `TRANSIENT`, or `PERMANENT`.

Minimal sketches of each of these follow.
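The pipeline signature implies Koa-style onion composition: each task can run work before and after the rest of the queue. A minimal sketch under that assumption; the `State` shape, task name, and `compose` helper below are illustrative, not Ripperoni's actual exports.

```ts
// Sketch of a task matching the async (next, state) => void signature.
// `State` and `stampTimestamp` are hypothetical; the real state shape may differ.
type Next = () => Promise<void>;

interface State {
  url: string;
  record?: Record<string, unknown>;
}

type Task = (next: Next, state: State) => Promise<void>;

// A task that lets downstream tasks build the record, then stamps it.
const stampTimestamp: Task = async (next, state) => {
  await next(); // defer to the rest of the queue first
  if (state.record) {
    state.record.fetchedAt = new Date().toISOString();
  }
};

// Koa-style composition: each task wraps everything queued after it.
function compose(tasks: Task[]): (state: State) => Promise<void> {
  return (state) =>
    tasks.reduceRight<Next>(
      (next, task) => () => task(next, state),
      async () => {},
    )();
}
```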
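The fetch + cheerio pattern in isolation, assuming your plugin works against the `CheerioAPI` handle the scraper hands over; the URL and selectors here are placeholders, and Ripperoni's real plugin interface may differ.

```ts
import * as cheerio from 'cheerio';

// Fetch a page and load it with cheerio: the same native-fetch + cheerio
// pattern the scraper uses. No JSDOM, no headless browser.
async function scrape(url: string): Promise<Record<string, string>> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  const $ = cheerio.load(await res.text()); // $: CheerioAPI

  // Plain jQuery-style selectors; these fields are examples only.
  return {
    title: $('h1').first().text().trim(),
    description: $('meta[name="description"]').attr('content') ?? '',
  };
}
```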
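Single-category mode maps onto the standard MediaWiki `list=categorymembers` query. A self-contained sketch of that call with `cmcontinue` pagination; this is the public MediaWiki API, not Ripperoni's internal scraper, and the endpoint and category arguments are examples.

```ts
// Enumerate one category via the MediaWiki JSON API (list=categorymembers),
// following cmcontinue tokens until the listing is exhausted.
async function categoryMembers(api: string, category: string): Promise<string[]> {
  const titles: string[] = [];
  let cmcontinue: string | undefined;

  do {
    const params = new URLSearchParams({
      action: 'query',
      list: 'categorymembers',
      cmtitle: `Category:${category}`,
      cmlimit: '500',
      format: 'json',
    });
    if (cmcontinue) params.set('cmcontinue', cmcontinue);

    const res = await fetch(`${api}?${params}`);
    const data = await res.json();
    for (const m of data.query.categorymembers) titles.push(m.title);
    cmcontinue = data.continue?.cmcontinue;
  } while (cmcontinue);

  return titles;
}

// e.g. categoryMembers('https://en.wikipedia.org/w/api.php', 'Physics')
```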
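Natural sorting here presumably means numeric-aware ordering (page2 before page10), which `localeCompare` gives you directly; a sketch of the crawler's dedupe-and-sort step, independent of the crawl logic itself.

```ts
// Deduplicate discovered links, then natural-sort so page2 precedes page10.
function normalizeLinks(links: string[]): string[] {
  return [...new Set(links)].sort((a, b) =>
    a.localeCompare(b, undefined, { numeric: true, sensitivity: 'base' }),
  );
}

// normalizeLinks(['/wiki/page10', '/wiki/page2', '/wiki/page2'])
// -> ['/wiki/page2', '/wiki/page10']
```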
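Decorrelated jitter usually refers to the formula `sleep = min(cap, rand(base, prev * 3))`. A sketch of that policy combined with `Retry-After` handling; the constants are illustrative, and the five-way error taxonomy above is reduced here to a simple retriable-or-not check.

```ts
// Decorrelated-jitter backoff: delay = min(cap, random(base, prev * 3)).
// Honors Retry-After (in seconds) when the server sends one.
const BASE_MS = 250;
const CAP_MS = 30_000;

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function fetchWithRetry(url: string, attempts = 5): Promise<Response> {
  let delay = BASE_MS;

  for (let attempt = 1; ; attempt++) {
    const res = await fetch(url).catch(() => undefined); // network error -> retry
    if (res?.ok) return res;

    const retriable = !res || res.status === 429 || res.status >= 500;
    if (!retriable || attempt >= attempts) {
      throw new Error(`giving up on ${url} after ${attempt} attempts`);
    }

    // Prefer the server's Retry-After over our own backoff schedule.
    const retryAfter = Number(res?.headers.get('retry-after'));
    delay = retryAfter
      ? retryAfter * 1000
      : Math.min(CAP_MS, BASE_MS + Math.random() * (delay * 3 - BASE_MS));
    await sleep(delay);
  }
}
```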
## Quick install

```bash
git clone https://github.com/Studnicky/Ripperoni.git
cd Ripperoni && npm install && npm run build
```

## Where to look next
- Walk-through: end-to-end example with a real URL, config, plugin, and output record
- Getting started: install and first run
- Architecture: pipeline phases, package boundaries, extension points