Three independent concerns — pipeline, HTTP machinery, and scrapers — compose to produce a scraping job. Nothing in the pipeline knows about HTTP. Nothing in the HTTP layer knows about MediaWiki. The scraper classes are pure data accessors that return typed results.

Module graph

```mermaid
graph TD
  CLI[cli/cli.ts] --> Pipeline
  CLI --> HtmlScraper
  CLI --> MediaWikiScraper
  CLI --> LinkLister
  CLI --> RipperConfig
  Pipeline[pipeline/Pipeline] --> Logger
  HtmlScraper[scrapers/HtmlScraper] --> RateLimiter
  HtmlScraper --> RetryExecutor
  HtmlScraper --> Logger
  MediaWikiScraper[scrapers/MediaWikiScraper] --> RateLimiter
  MediaWikiScraper --> Logger
  WikitextParser[scrapers/WikitextParser] -.uses.-> wtf_wikipedia
  LinkLister[crawlers/LinkLister] --> RateLimiter
  LinkLister --> RetryExecutor
  LinkLister --> Logger
  RetryExecutor[modules/http/RetryExecutor] --> ErrorClassifier
  RateLimiter[modules/http/RateLimiter] -.wraps.-> bottleneck
  style Pipeline fill:#180808,stroke:#f05870
  style ErrorClassifier fill:#180808,stroke:#f05870
  style RetryExecutor fill:#180808,stroke:#f05870
  style RateLimiter fill:#180808,stroke:#f05870
```

Pipeline pattern

Typed async middleware chain where every task receives (next, state) and advances the queue by calling next().

The core architecture is a typed middleware chain inherited from PathRipper's Transformer class, rewritten in TypeScript. Every task receives (next, state). Calling next() advances to the next task; not calling it terminates the chain.

```mermaid
sequenceDiagram
  participant Caller
  participant Pipeline
  participant ParseTask as <targetId>:parse (plugin)
  participant WriteTask as write-to-disk (orchestrator)
  Caller->>Pipeline: execute(PipelineState)
  Pipeline->>ParseTask: (next, state)
  ParseTask-->>Pipeline: state.output set
  Pipeline->>WriteTask: (next, state)
  WriteTask-->>Pipeline: file written
  Pipeline-->>Caller: state
```

The ScrapeOrchestrator builds the pipeline per page. A user-registered <targetId>:parse task (from a plugin file declared in config) runs first and sets state.output. If no parse task is registered the orchestrator falls back to raw WikitextParser output. A write-to-disk task is always added last by the orchestrator.

```ts
import { writeFile } from 'node:fs/promises';

// outputPath, fallback, and page come from the orchestrator's surrounding scope.
const pipeline = new Pipeline<PipelineStateInterface>({ name: 'my-target' });

// A plugin-registered parse task runs first, if one exists for this target.
if (TaskRegistry.has('my-target:parse')) {
  pipeline.addTask(TaskRegistry.get('my-target:parse'));
}

// The write-to-disk task is appended last; awaiting next() first lets any
// remaining tasks run (none here) before the output is serialized.
pipeline.addTask(async (next, state) => {
  await next();
  await writeFile(outputPath, JSON.stringify(state.output ?? fallback));
});

await pipeline.execute(PipelineState.fromWikiPage('my-target', page));
```

Task signature

```ts
type TaskFnType<TState> = (next: NextFnType, state: TState) => Promise<void>;
```

TState must extend Record<string, unknown>. Tasks mutate state directly — the pipeline passes the same reference through the chain.
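
As a concrete illustration, a task that derives a field and passes control on might look like this (the DemoState shape is hypothetical; real plugins receive the project's PipelineStateInterface):

```ts
// Hypothetical state shape for illustration only.
interface DemoState extends Record<string, unknown> {
  wikitext?: string;
  output?: { length: number };
}

// A task mutates state in place, then calls next() to advance the chain.
const measureTask: TaskFnType<DemoState> = async (next, state) => {
  state.output = { length: state.wikitext?.length ?? 0 };
  await next(); // omit this call to terminate the pipeline here
};
```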

HTTP machinery

Three composable classes — RateLimiter, RetryExecutor, and ErrorClassifier — form the HTTP stack, each injected independently.

Three classes form the HTTP stack, ported from TORUS (Topological Orchestration Runtime for Unified Streaming), a streaming DAG orchestration tool currently under development. They're independent of each other and compose by injection.

```mermaid
graph LR
  Request[fetch call] --> RateLimiter
  RateLimiter --> RetryExecutor
  RetryExecutor --> ErrorClassifier
  ErrorClassifier -->|retryable| RetryExecutor
  ErrorClassifier -->|permanent| Throw[throw]
  RetryExecutor -->|max attempts| Throw
  RetryExecutor -->|success| Response
```
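
A sketch of that composition, assuming a schedule() method on RateLimiter and an execute() method on RetryExecutor (both method names are assumptions, not confirmed by this document):

```ts
// The limiter gates when each request starts; the retry loop wraps the attempt
// and consults the ErrorClassifier before deciding whether to try again.
const limiter = RateLimiter.perSecond(5);
const retry = new RetryExecutor({ maxAttempts: 3, baseDelayMs: 500 });

const response = await limiter.schedule(() =>
  retry.execute(async () => {
    const res = await fetch('https://example.org/api'); // placeholder URL
    if (!res.ok) {
      // surface the status so the classifier can pick THROTTLED/TRANSIENT/PERMANENT
      throw Object.assign(new Error(`HTTP ${res.status}`), { status: res.status });
    }
    return res;
  }),
);
```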

ErrorClassifier

Classifies errors into seven categories; only NETWORK, TIMEOUT, THROTTLED, and TRANSIENT are retryable. PERMANENT 4xx errors throw immediately. For THROTTLED errors it reads the Retry-After header as a back-off hint.

| Category | Retryable | Trigger |
| --- | --- | --- |
| NETWORK | yes | ECONNREFUSED, ECONNRESET, ENOTFOUND |
| TIMEOUT | yes | ETIMEDOUT, ESOCKETTIMEDOUT |
| THROTTLED | yes | HTTP 429 · reads Retry-After |
| TRANSIENT | yes | HTTP 5xx |
| PERMANENT | no | HTTP 4xx (except 429) |
| VALIDATION | no | TypeError, SyntaxError, ValidationError |
| RESOURCE | no | ENOMEM, ENOSPC |
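
The table translates into a small decision function. This sketch assumes Node-style code fields and an HTTP status field on thrown errors; the real class shape may differ:

```ts
type Category =
  | 'NETWORK' | 'TIMEOUT' | 'THROTTLED' | 'TRANSIENT'
  | 'PERMANENT' | 'VALIDATION' | 'RESOURCE';

// Sketch of the table above; the err.code / err.status fields are assumptions.
function classify(err: Error & { code?: string; status?: number }): Category {
  const code = err.code ?? '';
  if (['ECONNREFUSED', 'ECONNRESET', 'ENOTFOUND'].includes(code)) return 'NETWORK';
  if (['ETIMEDOUT', 'ESOCKETTIMEDOUT'].includes(code)) return 'TIMEOUT';
  if (['ENOMEM', 'ENOSPC'].includes(code)) return 'RESOURCE';
  if (err.status === 429) return 'THROTTLED';
  if (err.status !== undefined && err.status >= 500) return 'TRANSIENT';
  if (err.status !== undefined && err.status >= 400) return 'PERMANENT';
  if (err instanceof TypeError || err instanceof SyntaxError) return 'VALIDATION';
  return 'TRANSIENT'; // fallback for unknown errors is an assumption, not documented
}
```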

RetryExecutor

Wraps any async function. On a retryable error it waits and retries, up to maxAttempts total attempts. The delay follows exponential backoff with ±10% jitter to avoid a thundering herd.

| Option | Default | Description |
| --- | --- | --- |
| maxAttempts | 3 | Total attempts before throwing (includes the first try). |
| baseDelayMs | 500 | Base delay for attempt 1. |
| multiplier | 2 | Delay multiplier per attempt. |
| maxDelayMs | 30000 | Delay ceiling. |
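
Put together, the options describe this delay schedule (the helper name and rounding are illustrative, not the library's internals):

```ts
// delay(attempt) = min(baseDelayMs * multiplier^(attempt - 1), maxDelayMs), jittered ±10%.
function retryDelayMs(
  attempt: number, // 1-based: the attempt that just failed
  { baseDelayMs = 500, multiplier = 2, maxDelayMs = 30_000 } = {},
): number {
  const backoff = Math.min(baseDelayMs * multiplier ** (attempt - 1), maxDelayMs);
  const jitter = 1 + (Math.random() * 0.2 - 0.1); // uniform in [0.9, 1.1)
  return Math.round(backoff * jitter);
}
// attempt 1 → ~500 ms, attempt 2 → ~1000 ms, attempt 3 → ~2000 ms, capped at 30 s.
```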

RateLimiter

Token-bucket backed by bottleneck. Factory methods: RateLimiter.perSecond(n) for throughput-based limits, RateLimiter.withDelay(ms) for fixed-gap limits. Used by every scraper and crawler.
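
Both factories in use; only construction is shown, since the limiter's call surface isn't described in this section:

```ts
// Throughput-based: at most 5 requests started per second.
const apiLimiter = RateLimiter.perSecond(5);

// Fixed-gap: at least 1500 ms between consecutive requests.
const politeLimiter = RateLimiter.withDelay(1500);
```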

Scrapers

Pure data accessors for HTML (via cheerio) and MediaWiki (via native fetch) that return typed results without coupling to the pipeline.

HtmlScraper

Native fetch + cheerio. Returns ScrapedPageInterface { url, $, html }. The $ field is a live CheerioAPI handle — use it exactly as you'd use jQuery on a DOM. No browser engine, no JavaScript execution. For JS-rendered pages, swap the fetch call for a headless driver (Playwright, Puppeteer) and feed the HTML to cheerio.load().
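
A usage sketch; the scrape() method name and the constructor injection shown here are assumptions:

```ts
const scraper = new HtmlScraper({ rateLimiter: RateLimiter.perSecond(2) }); // injection assumed
const { url, $, html } = await scraper.scrape('https://example.org/items'); // method name assumed

// $ is a live CheerioAPI handle: query it like jQuery against a static DOM.
const titles = $('h2.item-title')
  .map((_, el) => $(el).text().trim())
  .get();
```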

MediaWikiScraper

Direct fetch() calls to the MediaWiki JSON API — no mwn or axios layer. Four operations:

- fetchPage(title) — single page wikitext
- fetchPagesBatch(titles) — up to 50 pages per API request
- fetchCategory(name) — paginated category members list
- fetchAllPages() — enumerates every article in the main namespace via action=query&list=allpages
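
A usage sketch of the four operations (the constructor options are an assumption; the method names come from the list above):

```ts
const wiki = new MediaWikiScraper({ apiUrl: 'https://wiki.example.org/api.php' }); // options assumed

const page = await wiki.fetchPage('Main_Page');            // single page wikitext
const batch = await wiki.fetchPagesBatch(['A', 'B', 'C']); // up to 50 titles per request
const members = await wiki.fetchCategory('Weapons');       // paginated category members
const everything = await wiki.fetchAllPages();             // main-namespace enumeration
```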

The ScrapeOrchestrator selects from three modes: explicit --category flag → single category; categories[] in config → iterate and deduplicate; no categories → fetchAllPages(). Rate limiting and jitter are applied per request.
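
Continuing with the wiki instance from the sketch above, the mode selection reduces to something like this (return types and shapes are assumptions):

```ts
// Illustrative three-mode selection mirroring the prose above.
async function resolveTitles(
  flagCategory: string | undefined,        // from --category
  configCategories: string[] | undefined,  // from categories[] in config
): Promise<string[]> {
  if (flagCategory) return wiki.fetchCategory(flagCategory);
  if (configCategories?.length) {
    const lists = await Promise.all(configCategories.map((c) => wiki.fetchCategory(c)));
    return [...new Set(lists.flat())]; // iterate and deduplicate
  }
  return wiki.fetchAllPages();
}
```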

WikitextParser

Wraps wtf_wikipedia. WikitextParser.parse(title, wikitext) returns a ParsedPageInterface with infobox (flat key→value record), sections (title + raw wikitext), and categories. Helper methods infoboxField and infoboxNumber pull typed values without null-checks at call site.
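
A sketch of that surface; where the two helpers live (on the parser versus the parsed page) isn't stated here, so the static form below is an assumption:

```ts
const parsed = WikitextParser.parse('Iron_Sword', wikitext); // ParsedPageInterface

parsed.infobox;    // flat key/value record, e.g. { damage: '12', weight: '3.0' }
parsed.sections;   // [{ title, wikitext }, ...] per the description above
parsed.categories; // ['Weapons', ...]

// Helper placement and signatures are assumptions:
const damage = WikitextParser.infoboxNumber(parsed, 'damage'); // typed, no null check needed
const name = WikitextParser.infoboxField(parsed, 'name');
```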

Link crawler

Recursive link crawler controlled by three regexes (domain, delimiter, target) that bound traversal and collect matching URLs.

Modernized from PathRipper's LinkLister. Three regexes control behavior:

| Regex | Purpose |
| --- | --- |
| domain | Links must match to be considered at all. Keeps the crawler inside the target site. |
| delimiter | Links that match are traversed (followed). Links that don't are ignored entirely. |
| target | Links that match the delimiter AND this pattern are collected as results. Others are traversed but not returned. |

Visited URLs are tracked in a Set. All traversals run concurrently via Promise.all at each level. Results are deduplicated and sorted with a numeric-aware collator — so Item-10 sorts after Item-9, not between Item-1 and Item-2.
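
That numeric-aware ordering is what Intl.Collator provides with numeric: true; whether LinkLister uses exactly this is not stated, but a minimal sketch of the behavior:

```ts
// Digit runs compare as numbers, so Item-10 sorts after Item-9.
const collator = new Intl.Collator('en', { numeric: true });

const found = ['Item-10', 'Item-2', 'Item-1', 'Item-9', 'Item-2'];
const results = [...new Set(found)].sort(collator.compare); // dedupe, then sort
// → ['Item-1', 'Item-2', 'Item-9', 'Item-10']
```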

Source map

Complete index of every source file, its exported symbols, and its origin: the PathRipper, TORUS, or Torreya module it was ported from, or New for fresh code.

| File | Exports | Ported from |
| --- | --- | --- |
| `src/pipeline/Pipeline.ts` | `Pipeline<TState>` | PathRipper Transformer |
| `src/modules/http/ErrorClassifier.ts` | `ErrorClassifier`, `ErrorCategory` | TORUS errorClassifier.ts |
| `src/modules/http/RetryExecutor.ts` | `RetryExecutor` | TORUS RetryPolicyNode |
| `src/modules/http/RateLimiter.ts` | `RateLimiter` | New — wraps bottleneck |
| `src/modules/logger/Logger.ts` | `Logger` | Torreya @torreya/logger |
| `src/scrapers/HtmlScraper.ts` | `HtmlScraper` | PathRipper fetchPage, cheerio replaces JSDOM |
| `src/scrapers/MediaWikiScraper.ts` | `MediaWikiScraper` | New — native fetch() to MediaWiki JSON API |
| `src/scrapers/WikitextParser.ts` | `WikitextParser` | New — wtf_wikipedia |
| `src/crawlers/LinkLister.ts` | `LinkLister` | PathRipper linkLister/index.js |
| `src/orchestrators/ScrapeOrchestrator.ts` | `ScrapeOrchestrator` | New — pipeline orchestration, three-mode wiki scrape |
| `src/registry/TaskRegistry.ts` | `TaskRegistry` | New — plugin registration and dynamic loading |
| `src/registry/PipelineState.ts` | `PipelineState` | New — typed state bridge between scrapers and plugins |
| `src/config/RipperConfig.ts` | `RipperConfig` | New — replaces hardcoded config.js |
| `src/cli/cli.ts` | ripperoni CLI | New — commander |