v2.0.0 is a ground-up rewrite of the 2019 PathRipper. The core pipeline, HTML scraper, MediaWiki scraper, and link crawler are live.
Shipped (v2.0.0)
TypeScript rewrite
Full strict TypeScript from scratch. TORUS (Topological Orchestration Runtime for Unified Streaming) tsconfig + ESLint config ported directly. exactOptionalPropertyTypes, noUncheckedIndexedAccess, flat ESLint config.
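For reference, a minimal tsconfig sketch with the two strictness flags named above; the actual ported TORUS config includes more than this, and the module settings here are placeholders:

```json
{
  "compilerOptions": {
    "strict": true,
    "exactOptionalPropertyTypes": true,
    "noUncheckedIndexedAccess": true,
    "target": "es2022",
    "module": "node16",
    "moduleResolution": "node16"
  }
}
```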
Pipeline (Transformer modernized)
PathRipper's callback-based Transformer becomes a typed Pipeline<TState>. Same middleware pattern, fully typed generic state, no more untyped argument juggling.
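A minimal sketch of the shape, assuming a Koa-style next() continuation; names and signatures here are illustrative, not the actual API:

```ts
// Illustrative middleware pipeline over a shared typed state object.
type Middleware<TState> = (state: TState, next: () => Promise<void>) => Promise<void>;

class Pipeline<TState> {
  private readonly steps: Middleware<TState>[] = [];

  use(step: Middleware<TState>): this {
    this.steps.push(step);
    return this;
  }

  // Run steps in order; each step decides whether to call next().
  async run(state: TState): Promise<TState> {
    const dispatch = async (i: number): Promise<void> => {
      const step = this.steps[i]; // Middleware | undefined under noUncheckedIndexedAccess
      if (step) await step(state, () => dispatch(i + 1));
    };
    await dispatch(0);
    return state;
  }
}
```

The typed state is the point: a step that writes `state.html` and a later step that reads it are checked against the same TState, where the old Transformer passed positional callback arguments.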
HTML scraper
JSDOM replaced with native fetch + cheerio. Configurable base URL, headers, rate limit. Returns live CheerioAPI handle.
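Roughly, assuming option names that may not match the real ones:

```ts
import * as cheerio from "cheerio";

// Hypothetical option shape mirroring the description above.
interface HtmlScraperOptions {
  baseUrl: string;
  headers?: Record<string, string>;
  minDelayMs?: number; // naive stand-in for the real rate limiter
}

class HtmlScraper {
  private lastRequest = 0;

  constructor(private readonly opts: HtmlScraperOptions) {}

  // Fetch a path relative to baseUrl and return a live CheerioAPI handle.
  async scrape(path: string): Promise<cheerio.CheerioAPI> {
    const wait = this.lastRequest + (this.opts.minDelayMs ?? 0) - Date.now();
    if (wait > 0) await new Promise((r) => setTimeout(r, wait));
    this.lastRequest = Date.now();

    const res = await fetch(new URL(path, this.opts.baseUrl), { headers: this.opts.headers });
    if (!res.ok) throw new Error(`HTTP ${res.status} for ${path}`);
    return cheerio.load(await res.text());
  }
}
```

Callers get the usual cheerio surface: `const $ = await scraper.scrape("/wiki/Foo"); $("h1").text();`.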
MediaWiki scraper
mwn-backed. Category listing with full pagination, 50-page batch wikitext fetches, wtf_wikipedia infobox parsing. Single call to scrape an entire wiki category.
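A rough sketch of what that single call can look like with mwn and wtf_wikipedia; the exact mwn calls and response shapes may differ from what the real scraper uses:

```ts
import { Mwn } from "mwn";
import wtf from "wtf_wikipedia";

// Hypothetical one-shot category scrape: list members, batch-read wikitext,
// parse infoboxes. Function name and return shape are assumptions.
async function scrapeCategory(apiUrl: string, category: string) {
  const bot = new Mwn({ apiUrl });

  // List every category member, following API continuation automatically.
  const titles: string[] = [];
  for await (const json of bot.continuedQueryGen({
    action: "query",
    list: "categorymembers",
    cmtitle: `Category:${category}`,
    cmlimit: "max",
  })) {
    for (const m of json.query?.categorymembers ?? []) titles.push(m.title);
  }

  // Fetch wikitext 50 pages at a time and pull infoboxes with wtf_wikipedia.
  const results: { title: string; infobox: unknown }[] = [];
  for (let i = 0; i < titles.length; i += 50) {
    for (const page of await bot.read(titles.slice(i, i + 50))) {
      const wikitext = page.revisions?.[0]?.content ?? "";
      results.push({ title: page.title, infobox: wtf(wikitext).infobox()?.json() });
    }
  }
  return results;
}
```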
LinkLister crawler
PathRipper's recursive crawler rewritten. cheerio replaces JSDOM for link extraction. Concurrent traversals with Promise.all. Numeric-aware sort. Set-based deduplication.
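The core of it, sketched (function names and the depth cap are illustrative):

```ts
import * as cheerio from "cheerio";

// Set-based dedup, concurrent child traversals via Promise.all,
// numeric-aware sort on the way out. A real crawler would also
// scope links to the target site.
async function crawl(startUrl: string, maxDepth: number): Promise<string[]> {
  const seen = new Set<string>([startUrl]);

  async function visit(url: string, depth: number): Promise<void> {
    if (depth >= maxDepth) return;
    const $ = cheerio.load(await (await fetch(url)).text());
    const links = $("a[href]")
      .map((_, el) => new URL($(el).attr("href")!, url).href)
      .get()
      .filter((href) => !seen.has(href));
    for (const href of links) seen.add(href);
    await Promise.all(links.map((href) => visit(href, depth + 1)));
  }

  await visit(startUrl, 0);
  // Numeric collation: "page2" sorts before "page10".
  return [...seen].sort((a, b) => a.localeCompare(b, undefined, { numeric: true }));
}
```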
HTTP machinery
ErrorClassifier + RetryExecutor ported from TORUS. RateLimiter wrapping bottleneck. Retry-After header respected. Seven error categories. Exponential backoff with jitter.
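The retry loop, boiled down (the ported classes are more elaborate; this is just the idea):

```ts
// Exponential backoff with full jitter; a Retry-After header, when present,
// overrides the computed delay. Status-code triage here is simplified from
// the seven-category classifier.
async function fetchWithRetry(url: string, maxAttempts = 5): Promise<Response> {
  for (let attempt = 1; ; attempt++) {
    const res = await fetch(url).catch(() => null); // network errors are retryable
    if (res?.ok) return res;
    if (res && res.status < 500 && res.status !== 429) return res; // don't retry client errors
    if (attempt >= maxAttempts) throw new Error(`giving up on ${url} after ${attempt} attempts`);

    const retryAfter = Number(res?.headers.get("retry-after")); // seconds, per RFC 9110
    const delayMs = Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter * 1000
      : Math.random() * Math.min(30_000, 1000 * 2 ** attempt);
    await new Promise((r) => setTimeout(r, delayMs));
  }
}
```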
Structured logger
Ported from Torreya's @torreya/logger. Logger.forComponent(name), JSON lines, LOG_LEVEL gate, component + operation attribution on every entry.
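A minimal sketch of that surface; the real port surely differs in detail:

```ts
// JSON-lines logger with a LOG_LEVEL gate and per-component attribution.
const LEVELS = ["debug", "info", "warn", "error"] as const;
type Level = (typeof LEVELS)[number];

class Logger {
  private constructor(private readonly component: string) {}

  static forComponent(component: string): Logger {
    return new Logger(component);
  }

  info(message: string, fields: Record<string, unknown> = {}): void {
    const threshold = (process.env.LOG_LEVEL ?? "info") as Level;
    if (LEVELS.indexOf("info") < LEVELS.indexOf(threshold)) return; // LOG_LEVEL gate
    // One JSON object per line; fields carry e.g. { operation: "scrape" }.
    console.log(JSON.stringify({
      ts: new Date().toISOString(),
      level: "info",
      component: this.component,
      message,
      ...fields,
    }));
  }
}

Logger.forComponent("html-scraper").info("page fetched", { operation: "scrape" });
```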
JSON config
All targets, URLs, rate limits, and output paths live in ripperoni.config.json. Nothing hardcoded. RipperConfig.load(path) validates and returns a typed interface.
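A plausible shape, with every field name an assumption (see the schema section below for how violations are reported):

```json
{
  "targets": {
    "my-wiki": {
      "baseUrl": "https://wiki.example.org",
      "rateLimit": { "requestsPerSecond": 2 },
      "concurrency": 4,
      "outputDir": "./out/my-wiki",
      "pipeline": ["my-target:parse"]
    }
  }
}
```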
Concurrent pipeline
ConcurrentPipeline.create(pipeline, concurrency) fans N pages through the same pipeline simultaneously with a semaphore cap. Set concurrency in target config. Cache and scraper instances are shared across all concurrent executions.
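The semaphore idea, sketched independently of the real class:

```ts
// Fan inputs through one worker with at most `concurrency` in flight.
// ConcurrentPipeline.create wires this up around a Pipeline; the names
// here are illustrative.
async function runConcurrently<T, R>(
  inputs: T[],
  worker: (input: T) => Promise<R>,
  concurrency: number,
): Promise<R[]> {
  const results: R[] = new Array(inputs.length);
  let next = 0;

  // Each lane pulls the next index until inputs run out; index claiming
  // is synchronous, so no two lanes grab the same item.
  async function lane(): Promise<void> {
    while (next < inputs.length) {
      const i = next++;
      results[i] = await worker(inputs[i]!);
    }
  }

  await Promise.all(Array.from({ length: concurrency }, lane));
  return results;
}
```

Sharing cache and scraper instances across lanes is what makes the rate limiter meaningful: all concurrent executions funnel through the same limiter instead of each getting its own budget.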
Task registry
TaskRegistry.register(name, fn) + dynamic plugin loading via pipeline: ["my-target:parse"] in config. Plugins are .js files loaded at runtime — no TypeScript entry point per job required.
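One way the registry side can look; the register() signature, task-state shape, and default-export plugin convention are all assumptions:

```ts
type Task<TState> = (state: TState) => Promise<void>;

class TaskRegistry<TState> {
  private readonly tasks = new Map<string, Task<TState>>();

  register(name: string, fn: Task<TState>): void {
    this.tasks.set(name, fn);
  }

  // Resolve a config entry like "my-target:parse" into a pipeline step.
  get(name: string): Task<TState> {
    const task = this.tasks.get(name);
    if (!task) throw new Error(`unknown task: ${name}`);
    return task;
  }

  // Dynamic plugin load: a .js file whose export receives the registry.
  async loadPlugin(path: string): Promise<void> {
    const mod = await import(path);
    (mod.default ?? mod)(this);
  }
}
```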
Checkpoint + resume
Already-written slugs are detected at run start and skipped. Failed pages are written to failures.json; pass --resume-failures to retry only those on the next run.
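In outline, assuming one output file per slug and a flat array in failures.json (both assumptions, including the .html extension):

```ts
import { existsSync, readFileSync } from "node:fs";

// Pick the pages a run should process: on --resume-failures, retry only
// what the last run recorded; otherwise skip slugs already on disk.
function selectPages(all: string[], outDir: string, resumeFailures: boolean): string[] {
  if (resumeFailures && existsSync("failures.json")) {
    return JSON.parse(readFileSync("failures.json", "utf8")) as string[];
  }
  return all.filter((slug) => !existsSync(`${outDir}/${slug}.html`));
}
```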
Config schema validation
AJV validates the config at load time. RipperConfig.load(path) throws with the exact field path on any violation — malformed configs fail fast and loudly.
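The mechanism, reduced to its core (the real schema is larger; the error format is an assumption):

```ts
import Ajv from "ajv";

const ajv = new Ajv();
const validate = ajv.compile({
  type: "object",
  required: ["targets"],
  properties: { targets: { type: "object" } },
});

function assertValidConfig(raw: unknown): void {
  if (!validate(raw)) {
    const [err] = validate.errors ?? [];
    // AJV's instancePath points at the offending field, e.g.
    // "config invalid at /targets/my-wiki/concurrency: must be integer"
    throw new Error(`config invalid at ${err?.instancePath}: ${err?.message}`);
  }
}
```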
Planned
JSDOM fallback mode
Some pages require JavaScript execution to render their content. A configurable jsdom mode in HtmlScraper would handle these without needing a full headless browser.
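One possible shape, if it lands (the option names are jsdom's real ones, but how they get wired in is speculative):

```ts
import { JSDOM } from "jsdom";

// Execute page scripts in jsdom, then hand the serialized DOM to cheerio
// as usual. A real implementation would await readiness rather than sleep.
async function renderWithJsdom(url: string): Promise<string> {
  const html = await (await fetch(url)).text();
  const dom = new JSDOM(html, { url, runScripts: "dangerously", resources: "usable" });
  await new Promise((r) => setTimeout(r, 1000));
  const rendered = dom.serialize();
  dom.window.close();
  return rendered;
}
```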
HTML → Markdown conversion
Output mode that converts scraped HTML to clean Markdown. Useful for feeding scraped content into LLM pipelines without sending raw HTML. Likely via turndown or similar.
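If turndown is the pick, the whole mode is close to a one-liner:

```ts
import TurndownService from "turndown";

// Sketch only; converter choice and options are still open.
const turndown = new TurndownService({ headingStyle: "atx", codeBlockStyle: "fenced" });

export function htmlToMarkdown(html: string): string {
  return turndown.turndown(html);
}
```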