Ripperoni

Web ingestion engine: slices wikis, sites, and URL lists into JSON records, one page at a time. Point it at a wiki, a site, or a list of URLs and it hands you the meat.

Point it at a domain. Hand it a plugin. It fetches pages, runs your plugin against each one, and drops structured JSON records on disk.
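
As a rough illustration of that fetch, plugin, write flow, the sketch below models each step as an async task that receives `(next, state)` and calls `next()` to hand off to the following task. Every name here (`State`, `Task`, `runPipeline`, and the three stand-in tasks) is hypothetical and chosen for the example; none of it is Ripperoni's actual API. The fetch and disk write are replaced by in-memory stand-ins so the sketch runs offline.

```typescript
// Hypothetical types for illustration only; not Ripperoni's real API.
type State = { url: string; html?: string; record?: Record<string, unknown> };
type Next = () => Promise<void>;
type Task = (next: Next, state: State) => Promise<void>;

// Run tasks in order. Each task decides when (or whether) to call next().
async function runPipeline(tasks: Task[], state: State): Promise<State> {
  const dispatch = async (index: number): Promise<void> => {
    const task = tasks[index];
    if (!task) return; // past the last task: nothing left to run
    await task(() => dispatch(index + 1), state);
  };
  await dispatch(0);
  return state;
}

// Stand-in for a real page fetch.
const fetchTask: Task = async (next, state) => {
  state.html = `<title>Page at ${state.url}</title>`;
  await next();
};

// Stand-in for a plugin: extract one field from the page.
const pluginTask: Task = async (next, state) => {
  const match = /<title>(.*?)<\/title>/.exec(state.html ?? "");
  state.record = { url: state.url, title: match ? match[1] : null };
  await next();
};

// Stand-in for writing to disk: collect serialized records in memory.
const written: string[] = [];
const writeTask: Task = async (next, state) => {
  written.push(JSON.stringify(state.record));
  await next();
};
```

Calling `runPipeline([fetchTask, pluginTask, writeTask], { url: "https://example.org/a" })` runs the three tasks in sequence. Because each task only sees `next` and `state`, reordering or inserting a task means changing the array and nothing else.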

  • Typed pipeline. Middleware task queue in which every task has an async (next, state) => void signature. Add, compose, and reorder tasks without touching anything else.
  • HTML scraper. Native fetch + cheerio. No JSDOM, no headless browser. Returns a CheerioAPI handle so you work with familiar selectors.
  • MediaWiki scraper. Native fetch against the MediaWiki JSON API. Three modes: single category, categories array, or full-wiki enumeration. Batch wikitext fetch, redirect resolution, wtf_wikipedia infobox extraction.
  • Link crawler. Recursively crawls pages matching domain/target/delimiter regexes. Deduplicates, sorts naturally, and respects the rate limit.
  • Retry + backoff. Exponential backoff with decorrelated jitter. Respects Retry-After headers. Classifies errors as NETWORK / THROTTLED / TIMEOUT / TRANSIENT / PERMANENT.
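
To make the retry behaviour listed above more concrete, here is a minimal sketch of decorrelated-jitter backoff combined with Retry-After handling. The function names, the status-to-class mapping, and the parameters are assumptions made for this example and may not match Ripperoni's own classifier or defaults. The jitter rule used is the common formulation in which the next delay is drawn uniformly between a base delay and three times the previous delay, then capped.

```typescript
// Hypothetical helpers for illustration only; not Ripperoni's real API.
type ErrorClass = "NETWORK" | "THROTTLED" | "TIMEOUT" | "TRANSIENT" | "PERMANENT";

// Assumed mapping from HTTP status to error class.
// NETWORK would come from a thrown fetch error rather than a status code.
function classifyStatus(status: number): ErrorClass {
  if (status === 429) return "THROTTLED";
  if (status === 408 || status === 504) return "TIMEOUT";
  if (status >= 500) return "TRANSIENT";
  return "PERMANENT";
}

// Decorrelated jitter: uniform in [base, previous * 3], capped at capMs.
function nextDelayMs(
  previousMs: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  const upper = Math.max(baseMs, previousMs * 3);
  return Math.min(capMs, baseMs + random() * (upper - baseMs));
}

// Retry-After is either delta-seconds or an HTTP date.
// Returns milliseconds to wait, or null if the header is absent or unparseable.
function parseRetryAfterMs(header: string | null, nowMs: number = Date.now()): number | null {
  if (header === null) return null;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - nowMs);
}

// A server-supplied Retry-After takes priority over the jittered delay.
function waitBeforeRetryMs(
  previousMs: number,
  baseMs: number,
  capMs: number,
  retryAfter: string | null,
  random: () => number = Math.random,
): number {
  return parseRetryAfterMs(retryAfter) ?? nextDelayMs(previousMs, baseMs, capMs, random);
}
```

A caller would typically retry only when the class is not PERMANENT, feeding each computed delay back in as `previousMs` for the following attempt. Passing `random` as a parameter keeps the delay calculation deterministic under test.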

Quick install

bash
git clone https://github.com/Studnicky/Ripperoni.git
cd Ripperoni && npm install && npm run build

Released under the MIT License.