v2.0.0 is a ground-up rewrite of the 2019 PathRipper. The core pipeline, HTML scraper, MediaWiki scraper, and link crawler are live.

Shipped (v2.0.0)

TypeScript rewrite

Rewritten from scratch in fully strict TypeScript. The tsconfig and ESLint config are ported directly from TORUS: exactOptionalPropertyTypes, noUncheckedIndexedAccess, and the flat ESLint config format.

Pipeline (Transformer modernized)

PathRipper's callback-based Transformer becomes a typed Pipeline<TState>. Same middleware pattern, fully typed generic state, no more arguments juggling.
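
A minimal sketch of what a typed middleware pipeline of this shape can look like; the names and internals here are illustrative, not ripperoni's actual implementation:

```typescript
// Hypothetical sketch of a typed middleware pipeline (Koa-style chaining).
// The real Pipeline<TState> API may differ.
type Middleware<TState> = (state: TState, next: () => Promise<void>) => Promise<void>;

class Pipeline<TState> {
  private steps: Middleware<TState>[] = [];

  use(step: Middleware<TState>): this {
    this.steps.push(step);
    return this;
  }

  async run(state: TState): Promise<TState> {
    // Fold the middleware list into one chained call.
    const dispatch = async (i: number): Promise<void> => {
      const step = this.steps[i];
      if (step) await step(state, () => dispatch(i + 1));
    };
    await dispatch(0);
    return state;
  }
}

// State is fully typed: no more arguments juggling, no any.
interface ScrapeState { url: string; html?: string; title?: string; }

const demo = new Pipeline<ScrapeState>()
  .use(async (s, next) => { s.html = `<h1>${s.url}</h1>`; await next(); })
  .use(async (s, next) => { s.title = s.html?.slice(4, -5); await next(); });
```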

HTML scraper

JSDOM replaced with native fetch + cheerio. Configurable base URL, headers, and rate limit. Returns a live CheerioAPI handle.
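
One way the configurable base URL and default headers might be applied before each fetch; the option and function names here are assumptions, and the cheerio step is omitted to keep the sketch dependency-free:

```typescript
// Illustrative sketch: resolve a relative path against the configured
// base URL and merge default headers. Not ripperoni's actual API.
interface HtmlScraperOptions {
  baseUrl: string;
  headers?: Record<string, string>;
  rateLimitMs?: number; // minimum delay between requests
}

function buildRequest(
  opts: HtmlScraperOptions,
  path: string,
): { url: string; headers: Record<string, string> } {
  // Relative paths resolve against the configured base URL.
  const url = new URL(path, opts.baseUrl).toString();
  // Caller-supplied headers override the defaults.
  return { url, headers: { "user-agent": "ripperoni", ...opts.headers } };
}
```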

MediaWiki scraper

Backed by mwn. Category listing with full pagination, 50-page batch wikitext fetches, and wtf_wikipedia infobox parsing. A single call scrapes an entire wiki category.
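
The 50-page batching can be sketched as simple chunking of category members before each wikitext request; 50 matches the MediaWiki API's page limit for non-bot clients. The helper below is illustrative, not the scraper's actual code:

```typescript
// Split a list of page titles into fixed-size batches for bulk
// wikitext fetches (50 per request for ordinary API clients).
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```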

LinkLister crawler

PathRipper's recursive crawler rewritten. cheerio replaces JSDOM for link extraction. Concurrent traversals with Promise.all. Numeric-aware sort. Set-based deduplication.
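
The dedup and ordering steps can be sketched in a few lines; this is an illustration of the technique, not the crawler's actual code. localeCompare with the numeric option is what makes "page2" sort before "page10":

```typescript
// Set-based deduplication plus a numeric-aware sort of extracted links.
function dedupeAndSort(links: string[]): string[] {
  return [...new Set(links)].sort((a, b) =>
    a.localeCompare(b, undefined, { numeric: true }),
  );
}
```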

HTTP machinery

ErrorClassifier + RetryExecutor ported from TORUS (Topological Orchestration Runtime for Unified Streaming). RateLimiter wrapping bottleneck. Retry-After header respected. Seven error categories. Exponential backoff with jitter.
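
A sketch of how the delay calculation might combine Retry-After with exponential backoff and full jitter; the parameter names and defaults are assumptions, and the real RetryExecutor's policy may differ:

```typescript
// Compute the next retry delay. A numeric Retry-After header wins
// outright; otherwise use capped exponential backoff with full jitter.
// (HTTP-date Retry-After values are not handled in this sketch.)
function backoffMs(
  attempt: number,
  retryAfterHeader?: string,
  baseMs = 500,
  capMs = 30_000,
): number {
  if (retryAfterHeader !== undefined) {
    const seconds = Number(retryAfterHeader);
    if (!Number.isNaN(seconds)) return seconds * 1000; // server knows best
  }
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * exp; // full jitter: uniform in [0, exp)
}
```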

Structured logger

Ported from Torreya's @torreya/logger. Logger.forComponent(name), JSON lines, LOG_LEVEL gate, component + operation attribution on every entry.
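
A sketch of the logger's shape; the class below is hypothetical, not the @torreya/logger API. For testability the gate level is passed in explicitly, where the real logger reads it from the LOG_LEVEL environment variable:

```typescript
// JSON-lines logger sketch with component attribution and a level gate.
const LEVELS = ["debug", "info", "warn", "error"] as const;
type Level = (typeof LEVELS)[number];

class Logger {
  private constructor(private component: string) {}

  static forComponent(name: string): Logger {
    return new Logger(name);
  }

  // Returns one JSON line, or null if gated out. The real logger would
  // read `gate` from process.env.LOG_LEVEL rather than a parameter.
  format(level: Level, operation: string, message: string, gate: Level = "info"): string | null {
    if (LEVELS.indexOf(level) < LEVELS.indexOf(gate)) return null;
    return JSON.stringify({ level, component: this.component, operation, message });
  }

  info(operation: string, message: string): void {
    const line = this.format("info", operation, message);
    if (line !== null) console.log(line);
  }
}
```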

JSON config

All targets, URLs, rate limits, and output paths live in ripperoni.config.json. Nothing hardcoded. RipperConfig.load(path) validates and returns a typed interface.
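
A hypothetical shape for ripperoni.config.json, assembled from the features described in these notes; the exact field names are assumptions:

```json
{
  "targets": [
    {
      "name": "my-target",
      "baseUrl": "https://example.com/wiki/",
      "rateLimitMs": 1000,
      "concurrency": 4,
      "outputDir": "out/my-target",
      "pipeline": ["my-target:parse"]
    }
  ]
}
```

The concurrency and pipeline keys mirror the concurrent-pipeline and task-registry features; nothing here is hardcoded in source.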

Concurrent pipeline

ConcurrentPipeline.create(pipeline, concurrency) fans N pages through the same pipeline simultaneously with a semaphore cap. Set concurrency in target config. Cache and scraper instances are shared across all concurrent executions.
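
The semaphore cap can be sketched as a counting semaphore plus Promise.all; this is a minimal illustration of the technique, not ConcurrentPipeline's actual code:

```typescript
// Counting semaphore: at most `limit` callers hold a slot at once.
class Semaphore {
  private queue: (() => void)[] = [];
  constructor(private available: number) {}

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  release(): void {
    const next = this.queue.shift();
    if (next) next(); // hand the slot straight to a waiter
    else this.available++;
  }
}

// Fan all items through `fn`, capped at `limit` concurrent executions.
async function mapConcurrent<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const sem = new Semaphore(limit);
  return Promise.all(items.map(async (item) => {
    await sem.acquire();
    try { return await fn(item); }
    finally { sem.release(); }
  }));
}
```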

Task registry

TaskRegistry.register(name, fn) + dynamic plugin loading via pipeline: ["my-target:parse"] in config. Plugins are .js files loaded at runtime — no TypeScript entry point per job required.
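
A sketch of a registry keyed by "target:task" names; the resolution and plugin-loading details here are assumptions about how such a registry could work:

```typescript
// Hypothetical task registry sketch. Plugin .js files would call
// TaskRegistry.register at import time, e.g. via a runtime
//   await import(pathToFileURL(pluginPath).href);
type Task = (state: unknown) => Promise<unknown> | unknown;

class TaskRegistry {
  private static tasks = new Map<string, Task>();

  static register(name: string, fn: Task): void {
    TaskRegistry.tasks.set(name, fn);
  }

  static resolve(name: string): Task {
    const task = TaskRegistry.tasks.get(name);
    if (!task) throw new Error(`Unknown task: ${name}`);
    return task;
  }
}

// What a plugin registration might look like:
TaskRegistry.register("my-target:parse", (state) => ({ ...(state as object), parsed: true }));
```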

Checkpoint + resume

Already-written slugs are detected at run start and skipped. Failed pages are written to failures.json; pass --resume-failures to retry only those on the next run.
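
The skip step reduces to a Set lookup over slugs already on disk; a minimal illustration, with names of my own choosing:

```typescript
// Given every slug a run would produce and the slugs already written
// (scanned from the output directory at startup), return what remains.
function pendingSlugs(all: string[], alreadyWritten: Iterable<string>): string[] {
  const done = new Set(alreadyWritten);
  return all.filter((slug) => !done.has(slug));
}
```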

Config schema validation

AJV validates the config at load time. RipperConfig.load(path) throws with the exact field path on any violation — malformed configs fail fast and loudly.
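
AJV (v8) reports each violation with an instancePath like "/targets/0/rateLimitMs". Turning that into a readable field path for the thrown error might look like this hypothetical helper:

```typescript
// Minimal shape of an AJV v8 error object, as far as this sketch needs.
interface AjvErrorLike { instancePath: string; message?: string; }

// Convert "/targets/0/rateLimitMs" into "targets.0.rateLimitMs: <message>".
function formatViolation(err: AjvErrorLike): string {
  const path = err.instancePath.split("/").filter(Boolean).join(".") || "(root)";
  return `${path}: ${err.message ?? "invalid"}`;
}
```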

Planned

JSDOM fallback mode

Some pages require JavaScript execution to render their content. A configurable jsdom mode in HtmlScraper would handle these without needing a full headless browser.

HTML → Markdown conversion

Output mode that converts scraped HTML to clean Markdown. Useful for feeding scraped content into LLM pipelines without sending raw HTML. Likely via turndown or similar.