Roadmap

v2.0.0 is a ground-up rewrite of the 2019 PathRipper. The core pipeline, HTML scraper, MediaWiki scraper, and link crawler are live.

Shipped (v2.x)

| Feature | Status | Details |
| --- | --- | --- |
| TypeScript rewrite | live | Full strict TypeScript from scratch. exactOptionalPropertyTypes, noUncheckedIndexedAccess, flat ESLint config. |
| Pipeline (Transformer modernized) | live | PathRipper's callback-based Transformer becomes a typed Pipeline<TState>. Same middleware pattern, fully typed generic state. |
| HTML scraper | live | JSDOM replaced with native fetch + cheerio. Configurable base URL, headers, rate limit. Returns live CheerioAPI handle. |
| MediaWiki scraper | live | Native fetch() to the MediaWiki JSON API. Category listing with full pagination, 50-page batch wikitext fetches, wtf_wikipedia infobox parsing. |
| LinkLister crawler | live | PathRipper's recursive crawler rewritten. cheerio replaces JSDOM for link extraction. Concurrent traversals with Promise.all. Numeric-aware sort. Set-based deduplication. |
| HTTP machinery | live | ErrorClassifier + RetryExecutor ported from TORUS. RateLimiter wrapping bottleneck. Retry-After header respected. Seven error categories. Exponential + jitter backoff. |
| Structured logger | live | Ported from Torreya's @torreya/logger. Logger.forComponent(name), JSON lines, LOG_LEVEL gate, component + operation attribution on every entry. |
| JSON config | live | All targets, URLs, rate limits, and output paths live in ripperoni.config.json. Nothing hardcoded. RipperConfig.load(path) validates and returns a typed interface. |
| Concurrent pipeline | live | ConcurrentPipeline.create(pipeline, concurrency) fans N pages through the same pipeline simultaneously with a semaphore cap. |
| Task registry | live | TaskRegistry.register(name, fn) + dynamic plugin loading via pipeline: ["my-target:parse"] in config. Plugins are .js files loaded at runtime. |
| Checkpoint + resume | live | Already-written slugs are detected at run start and skipped. Failed pages are written to failures.json; pass --resume-failures to retry only those. |
| Config schema validation | live | AJV validates the config at load time. RipperConfig.load(path) throws with the exact field path on any violation; malformed configs fail fast and loudly. |
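
To make the Pipeline and Concurrent pipeline entries concrete, here is a minimal sketch of the general pattern: typed steps threaded over a generic state, then fanned out over many inputs with a cap on in-flight work. Every name in it (Step, composePipeline, runWithLimit, PageState) is hypothetical and chosen for illustration; it is not the project's actual Pipeline<TState> or ConcurrentPipeline API. The cap is implemented as a fixed pool of workers pulling from a shared index, which bounds concurrency in the same way a semaphore would but is not necessarily how the project does it.

```typescript
// Hypothetical sketch: none of these names come from the project's real API.

// A step receives the current state and returns the next one (sync or async).
type Step<TState> = (state: TState) => Promise<TState> | TState;

// Compose steps into one function that threads state through them in order.
function composePipeline<TState>(
  steps: readonly Step<TState>[],
): (initial: TState) => Promise<TState> {
  return async (initial) => {
    let state = initial;
    for (const step of steps) {
      state = await step(state);
    }
    return state;
  };
}

// Run `run` over every input with at most `limit` calls in flight.
// Results are returned in input order, regardless of completion order.
async function runWithLimit<TIn, TOut>(
  inputs: readonly TIn[],
  limit: number,
  run: (input: TIn) => Promise<TOut>,
): Promise<TOut[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results = new Array<TOut>(inputs.length);
  let next = 0; // shared cursor; safe because JS runs callbacks on one thread

  const worker = async (): Promise<void> => {
    while (next < inputs.length) {
      const index = next++;
      results[index] = await run(inputs[index] as TIn);
    }
  };

  const workerCount = Math.min(limit, inputs.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Illustrative state shape and steps (assumptions, not the real scraper state).
interface PageState {
  slug: string;
  steps: string[];
}

const examplePipeline = composePipeline<PageState>([
  (s) => ({ ...s, steps: [...s.steps, "fetch"] }),
  async (s) => ({ ...s, steps: [...s.steps, "parse"] }),
]);

// Fan three slugs through the same pipeline, two at a time.
async function exampleRun(): Promise<PageState[]> {
  return runWithLimit(["alpha", "beta", "gamma"], 2, (slug) =>
    examplePipeline({ slug, steps: [] }),
  );
}
```

The worker-pool form keeps results in input order and never holds more than `limit` promises open, which is the property the semaphore cap in the table is described as providing.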

Planned

| Feature | Details |
| --- | --- |
| JSDOM fallback mode | Some pages require JavaScript execution to render their content. A configurable jsdom mode in HtmlScraper would handle these without needing a full headless browser. |
| HTML → Markdown conversion | Output mode that converts scraped HTML to clean Markdown. Useful for feeding scraped content into LLM pipelines without sending raw HTML. Likely via turndown or similar. |

Released under the MIT License.