
# Ripperoni
Web ingestion engine: slices wikis, sites, and URL lists into JSON records, one page at a time. Point it at a domain, hand it a plugin, and Ripperoni fetches each page, runs your plugin against it, and drops structured JSON records on disk. It hands you the meat.
- Typed pipeline. Middleware task queue with an `async (next, state) => void` signature. Add, compose, and reorder tasks without touching anything else.
- HTML scraper. Native fetch + cheerio. No JSDOM, no headless browser. Returns a `CheerioAPI` handle so you work with familiar selectors.
- MediaWiki scraper. Native fetch against the MediaWiki JSON API. Three modes: single category, categories array, or full-wiki enumeration. Batch wikitext fetch, redirect resolution, `wtf_wikipedia` infobox extraction.
- Link crawler. Recursively crawls pages matching domain/target/delimiter regexes. Deduplicates, sorts naturally, respects the rate limit.
- Retry + backoff. Exponential backoff with decorrelated jitter. Respects `Retry-After` headers. Classifies errors as `NETWORK`, `THROTTLED`, `TIMEOUT`, `TRANSIENT`, or `PERMANENT`.

Minimal sketches of each of these follow.
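The pipeline signature implies Koa-style onion composition: each task can run work before and after the rest of the queue. A minimal sketch under that assumption; the `State` shape, task name, and `compose` helper below are illustrative, not Ripperoni's actual exports.

```ts
// Sketch of a task matching the async (next, state) => void signature.
// `State` and `stampTimestamp` are hypothetical; the real state shape may differ.
type Next = () => Promise<void>;

interface State {
  url: string;
  record?: Record<string, unknown>;
}

type Task = (next: Next, state: State) => Promise<void>;

// A task that lets downstream tasks build the record, then stamps it.
const stampTimestamp: Task = async (next, state) => {
  await next(); // defer to the rest of the queue first
  if (state.record) {
    state.record.fetchedAt = new Date().toISOString();
  }
};

// Koa-style composition: each task wraps everything queued after it.
function compose(tasks: Task[]): (state: State) => Promise<void> {
  return (state) =>
    tasks.reduceRight<Next>(
      (next, task) => () => task(next, state),
      async () => {},
    )();
}
```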
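The fetch + cheerio pattern in isolation, assuming your plugin works against the `CheerioAPI` handle the scraper hands over; the URL and selectors here are placeholders, and Ripperoni's real plugin interface may differ.

```ts
import * as cheerio from 'cheerio';

// Fetch a page and load it with cheerio: the same native-fetch + cheerio
// pattern the scraper uses. No JSDOM, no headless browser.
async function scrape(url: string): Promise<Record<string, string>> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  const $ = cheerio.load(await res.text()); // $: CheerioAPI

  // Plain jQuery-style selectors; these fields are examples only.
  return {
    title: $('h1').first().text().trim(),
    description: $('meta[name="description"]').attr('content') ?? '',
  };
}
```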
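Single-category mode maps onto the standard MediaWiki `list=categorymembers` query. A self-contained sketch of that call with `cmcontinue` pagination; this is the public MediaWiki API, not Ripperoni's internal scraper, and the endpoint and category arguments are examples.

```ts
// Enumerate one category via the MediaWiki JSON API (list=categorymembers),
// following cmcontinue tokens until the listing is exhausted.
async function categoryMembers(api: string, category: string): Promise<string[]> {
  const titles: string[] = [];
  let cmcontinue: string | undefined;

  do {
    const params = new URLSearchParams({
      action: 'query',
      list: 'categorymembers',
      cmtitle: `Category:${category}`,
      cmlimit: '500',
      format: 'json',
    });
    if (cmcontinue) params.set('cmcontinue', cmcontinue);

    const res = await fetch(`${api}?${params}`);
    const data = await res.json();
    for (const m of data.query.categorymembers) titles.push(m.title);
    cmcontinue = data.continue?.cmcontinue;
  } while (cmcontinue);

  return titles;
}

// e.g. categoryMembers('https://en.wikipedia.org/w/api.php', 'Physics')
```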
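Natural sorting here presumably means numeric-aware ordering (page2 before page10), which `localeCompare` gives you directly; a sketch of the crawler's dedupe-and-sort step, independent of the crawl logic itself.

```ts
// Deduplicate discovered links, then natural-sort so page2 precedes page10.
function normalizeLinks(links: string[]): string[] {
  return [...new Set(links)].sort((a, b) =>
    a.localeCompare(b, undefined, { numeric: true, sensitivity: 'base' }),
  );
}

// normalizeLinks(['/wiki/page10', '/wiki/page2', '/wiki/page2'])
// -> ['/wiki/page2', '/wiki/page10']
```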
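Decorrelated jitter usually refers to the formula `sleep = min(cap, rand(base, prev * 3))`. A sketch of that policy combined with `Retry-After` handling; the constants are illustrative, and the five-way error taxonomy above is reduced here to a simple retriable-or-not check.

```ts
// Decorrelated-jitter backoff: delay = min(cap, random(base, prev * 3)).
// Honors Retry-After (in seconds) when the server sends one.
const BASE_MS = 250;
const CAP_MS = 30_000;

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function fetchWithRetry(url: string, attempts = 5): Promise<Response> {
  let delay = BASE_MS;

  for (let attempt = 1; ; attempt++) {
    const res = await fetch(url).catch(() => undefined); // network error -> retry
    if (res?.ok) return res;

    const retriable = !res || res.status === 429 || res.status >= 500;
    if (!retriable || attempt >= attempts) {
      throw new Error(`giving up on ${url} after ${attempt} attempts`);
    }

    // Prefer the server's Retry-After over our own backoff schedule.
    const retryAfter = Number(res?.headers.get('retry-after'));
    delay = retryAfter
      ? retryAfter * 1000
      : Math.min(CAP_MS, BASE_MS + Math.random() * (delay * 3 - BASE_MS));
    await sleep(delay);
  }
}
```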
## Quick install

```bash
git clone https://github.com/Studnicky/Ripperoni.git
cd Ripperoni && npm install && npm run build
```

## Where to look next
- Walk-through: end-to-end example with a real URL, config, plugin, and output record
- Getting started: install and first run
- Architecture: pipeline phases, package boundaries, extension points