Roadmap
v2.0.0 is a ground-up rewrite of the 2019 PathRipper. The core pipeline, HTML scraper, MediaWiki scraper, and link crawler are live.
Shipped (v2.x)
| Feature | Status | Details |
|---|---|---|
| TypeScript rewrite | live | Full strict TypeScript from scratch. exactOptionalPropertyTypes, noUncheckedIndexedAccess, flat ESLint config. |
| Pipeline (Transformer modernized) | live | PathRipper's callback-based Transformer becomes a typed Pipeline<TState>. Same middleware pattern, fully typed generic state. |
| HTML scraper | live | JSDOM replaced with native fetch + cheerio. Configurable base URL, headers, rate limit. Returns a live CheerioAPI handle. |
| MediaWiki scraper | live | Native fetch() to the MediaWiki JSON API. Category listing with full pagination, 50-page batch wikitext fetches, wtf_wikipedia infobox parsing. |
| LinkLister crawler | live | PathRipper's recursive crawler rewritten. cheerio replaces JSDOM for link extraction. Concurrent traversals with Promise.all. Numeric-aware sort. Set-based deduplication. |
| HTTP machinery | live | ErrorClassifier + RetryExecutor ported from TORUS. RateLimiter wrapping bottleneck. Retry-After header respected. Seven error categories. Exponential + jitter backoff. |
| Structured logger | live | Ported from Torreya's @torreya/logger. Logger.forComponent(name), JSON lines, LOG_LEVEL gate, component + operation attribution on every entry. |
| JSON config | live | All targets, URLs, rate limits, and output paths live in ripperoni.config.json. Nothing hardcoded. RipperConfig.load(path) validates and returns a typed interface. |
| Concurrent pipeline | live | ConcurrentPipeline.create(pipeline, concurrency) fans N pages through the same pipeline simultaneously with a semaphore cap. |
| Task registry | live | TaskRegistry.register(name, fn) + dynamic plugin loading via pipeline: ["my-target:parse"] in config. Plugins are .js files loaded at runtime. |
| Checkpoint + resume | live | Already-written slugs are detected at run start and skipped. Failed pages are written to failures.json; pass --resume-failures to retry only those. |
| Config schema validation | live | AJV validates the config at load time. RipperConfig.load(path) throws with the exact field path on any violation; malformed configs fail fast and loudly. |
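
The "Pipeline" row describes a middleware chain over a typed generic state. A minimal sketch of that pattern follows. The `use`/`run` method names, the `Middleware` signature, and the `PageState` shape are assumptions made for illustration and may not match the project's actual API.

```typescript
// Hypothetical sketch of a typed middleware pipeline; not the project's real class.
type Next = () => Promise<void>;
type Middleware<TState> = (state: TState, next: Next) => Promise<void> | void;

class Pipeline<TState> {
  private readonly steps: Middleware<TState>[] = [];

  // Register a step; returns `this` so calls can be chained.
  use(step: Middleware<TState>): this {
    this.steps.push(step);
    return this;
  }

  // Run every step in registration order against one shared state object.
  async run(state: TState): Promise<TState> {
    const dispatch = async (index: number): Promise<void> => {
      const step = this.steps[index];
      if (step === undefined) return; // end of the chain
      await step(state, () => dispatch(index + 1));
    };
    await dispatch(0);
    return state;
  }
}

// Illustrative state shape (assumed, not taken from the project).
interface PageState {
  url: string;
  html?: string;
  title?: string;
}

const demoPipeline = new Pipeline<PageState>()
  .use(async (state, next) => {
    state.html = "<title>Example</title>"; // stand-in for a real fetch
    await next();
  })
  .use((state, next) => {
    state.title = /<title>(.*?)<\/title>/.exec(state.html ?? "")?.[1] ?? "";
    return next();
  });
```

Because the state type is a generic parameter, a step that reads or writes a field not declared on `PageState` fails at compile time rather than at run time.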
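
The "HTTP machinery" row mentions exponential backoff with jitter and respect for the Retry-After header. The sketch below shows one common way to combine the two. The function name, option names, and the choice of "full jitter" are assumptions; the project's RetryExecutor may use a different variant. Only the delta-seconds form of Retry-After is handled here; the HTTP-date form is ignored.

```typescript
// Hypothetical helper; names and defaults are illustrative assumptions.
interface BackoffOptions {
  baseMs: number;          // delay ceiling for the first retry
  maxMs: number;           // upper bound on the computed ceiling
  random?: () => number;   // injectable for tests; defaults to Math.random
}

function retryDelayMs(
  attempt: number,           // 0 for the first retry
  retryAfter: string | null, // raw Retry-After header value, if present
  opts: BackoffOptions,
): number {
  // A server-supplied Retry-After (in seconds) takes precedence over backoff.
  if (retryAfter !== null && /^\d+$/.test(retryAfter.trim())) {
    return Number(retryAfter.trim()) * 1000;
  }
  const random = opts.random ?? Math.random;
  // Exponential growth, capped at maxMs.
  const ceiling = Math.min(opts.maxMs, opts.baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, ceiling).
  return Math.floor(random() * ceiling);
}
```

Jitter spreads retries from many concurrent requests over time, so they do not all hit the server again at the same instant.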
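
The "Concurrent pipeline" row describes fanning pages through a pipeline under a semaphore cap. A minimal sketch of that idea, using a hand-rolled counting semaphore, is shown below. `Semaphore` and `mapWithCap` are hypothetical names and are not the project's ConcurrentPipeline implementation. The sketch assumes a cap of at least 1.

```typescript
// Hypothetical counting semaphore; illustrative only.
class Semaphore {
  private readonly waiters: Array<() => void> = [];
  private available: number;

  constructor(permits: number) {
    this.available = permits;
  }

  // Resolves once a permit is held by the caller.
  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  // Hands the permit directly to the oldest waiter, or returns it to the pool.
  release(): void {
    const next = this.waiters.shift();
    if (next) next();
    else this.available++;
  }
}

// Runs `worker` over every item with at most `cap` calls in flight.
// Results keep the input order.
async function mapWithCap<T, R>(
  items: readonly T[],
  cap: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const sem = new Semaphore(cap);
  return Promise.all(
    items.map(async (item) => {
      await sem.acquire();
      try {
        return await worker(item);
      } finally {
        sem.release(); // release even if the worker throws
      }
    }),
  );
}
```

A call such as `mapWithCap(urls, 4, (url) => pipeline.run({ url }))` would process the list four pages at a time, assuming `pipeline.run` returns a promise.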
Planned
| Feature | Details |
|---|---|
| JSDOM fallback mode | Some pages require JavaScript execution to render their content. A configurable jsdom mode in HtmlScraper would handle these without needing a full headless browser. |
| HTML → Markdown conversion | Output mode that converts scraped HTML to clean Markdown. Useful for feeding scraped content into LLM pipelines without sending raw HTML. Likely via turndown or similar. |
