Phase 08 · Checkpoint + resume

The compose / validate loop in The Archivist is the most expensive segment — multiple LLM calls per attempt. If the visitor's session times out mid-loop, the dispatcher records the cursor (crl-compose-response or crl-validate-response), the partial draft, and the attempt counter. A later process recalls the checkpoint and finishes the response without paying for the upstream scouts again.

The ArchivistState makes this possible by overriding snapshotData() and restoreData() — the two methods NodeStateBase calls during Checkpoint.from and Checkpoint.recall.

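Flow

The dispatcher stops on cancellation or timeout and returns a cursor; Checkpoint.from and Checkpoint.persist store the state under a key; a later process calls Checkpoint.recall and then dispatcher.resume from that cursor.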

Code

State snapshot round-trip

The #snapshot-restore region covers snapshotData() and restoreData(), the two methods that serialize and rehydrate the domain fields (query, intent, terms, candidates, shortlist, draft, approved, attempts, failureCause, recalledContext, memoryDigest):

ts
protected override snapshotData(): JsonObject {
  return {
    "query":      this.query,
    "intent":     this.intent,
    "terms":      [...this.terms],
    "candidates": this.candidates.map((candidate) => ({
      "book":   { ...candidate.book, "authors": [...candidate.book.authors] },
      "score":  candidate.score,
      "source": candidate.source,
    })) as unknown as JsonObject[],
    "shortlist":  this.shortlist.map((candidate) => ({
      "book":   { ...candidate.book, "authors": [...candidate.book.authors] },
      "score":  candidate.score,
      "source": candidate.source,
    })) as unknown as JsonObject[],
    "draft":        this.draft,
    "approved":     this.approved,
    "attempts":     { ...this.attempts },
    "failureCause": this.failureCause,
    "recalledContext": {
      "priorIntents":        this.recalledContext.priorIntents as unknown as JsonObject[],
      "recentCandidates":    this.recalledContext.recentCandidates.map((c) => ({
        "book":   { ...c.book, "authors": [...c.book.authors] },
        "score":  c.score,
        "source": c.source,
      })) as unknown as JsonObject[],
      "similarPriorQueries": this.recalledContext.similarPriorQueries as unknown as JsonObject[],
      "summary":             this.recalledContext.summary,
    },
    "memoryDigest": {
      "bookCount":       this.memoryDigest.bookCount,
      "queryCount":      this.memoryDigest.queryCount,
      "recentBooks":     this.memoryDigest.recentBooks as unknown as JsonObject[],
      "intentBreakdown": this.memoryDigest.intentBreakdown as unknown as JsonObject[],
      "summary":         this.memoryDigest.summary,
    },
  };
}

protected override restoreData(snap: JsonObject): void {
  if (typeof snap['query']  === 'string')  this.query  = snap['query'];
  if (typeof snap['intent'] === 'string')  this.intent = snap['intent'] as ArchivistIntent;
  if (typeof snap['draft']        === 'string')  this.draft  = snap['draft'];
  if (typeof snap['approved']    === 'boolean') this.approved = snap['approved'];
  if (typeof snap['failureCause'] === 'string') this.failureCause = snap['failureCause'];
  if (Array.isArray(snap['terms']))      this.terms      = snap['terms'] as string[];
  if (Array.isArray(snap['candidates'])) this.candidates = snap['candidates'] as unknown as Candidate[];
  if (Array.isArray(snap['shortlist']))  this.shortlist  = snap['shortlist'] as unknown as Candidate[];
  if (snap['attempts'] && typeof snap['attempts'] === 'object') {
    this.attempts = { ...snap['attempts'] as Record<string, number> };
  }
  const rc = snap['recalledContext'];
  if (rc !== null && rc !== undefined && typeof rc === 'object' && !Array.isArray(rc)) {
    const rcObj = rc as Record<string, unknown>;
    this.recalledContext = {
      'priorIntents':        Array.isArray(rcObj['priorIntents'])        ? rcObj['priorIntents'] as RecalledContext['priorIntents']        : [],
      'recentCandidates':    Array.isArray(rcObj['recentCandidates'])    ? rcObj['recentCandidates'] as RecalledContext['recentCandidates'] : [],
      'similarPriorQueries': Array.isArray(rcObj['similarPriorQueries']) ? rcObj['similarPriorQueries'] as RecalledContext['similarPriorQueries'] : [],
      'summary':             typeof rcObj['summary'] === 'string'        ? rcObj['summary']  : '',
    };
  }
  const md = snap['memoryDigest'];
  if (md !== null && md !== undefined && typeof md === 'object' && !Array.isArray(md)) {
    const mdObj = md as Record<string, unknown>;
    this.memoryDigest = {
      'bookCount':       typeof mdObj['bookCount']  === 'number' ? mdObj['bookCount']  : 0,
      'queryCount':      typeof mdObj['queryCount'] === 'number' ? mdObj['queryCount'] : 0,
      'recentBooks':     Array.isArray(mdObj['recentBooks'])     ? mdObj['recentBooks']     as MemoryDigest['recentBooks']     : [],
      'intentBreakdown': Array.isArray(mdObj['intentBreakdown']) ? mdObj['intentBreakdown'] as MemoryDigest['intentBreakdown'] : [],
      'summary':         typeof mdObj['summary'] === 'string'    ? mdObj['summary'] : '',
    };
  }
}
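The round-trip can be exercised in isolation. The sketch below is a minimal stand-in, not the library's code: MiniStateBase, MiniArchivistState, snapshot() and restoreFrom() are hypothetical names, and only the shape of the snapshotData() / restoreData() pair is taken from the region above. It shows why the snapshot copies arrays and objects, and why every field is guarded on the way back in.

```typescript
// Hypothetical stand-ins; not the library's NodeStateBase or the real ArchivistState.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

abstract class MiniStateBase {
  protected abstract snapshotData(): JsonObject;
  protected abstract restoreData(snap: JsonObject): void;

  // Public entry point: collect the subclass's domain fields.
  snapshot(): JsonObject {
    return this.snapshotData();
  }

  // Public entry point: hand a parsed snapshot back to the subclass.
  restoreFrom(snap: JsonObject): this {
    this.restoreData(snap);
    return this;
  }
}

class MiniArchivistState extends MiniStateBase {
  query = '';
  draft = '';
  terms: string[] = [];
  attempts: Record<string, number> = {};

  protected override snapshotData(): JsonObject {
    // Copy the array and the object so later mutations of live state
    // cannot change a snapshot that has already been taken.
    return {
      query: this.query,
      draft: this.draft,
      terms: [...this.terms],
      attempts: { ...this.attempts },
    };
  }

  protected override restoreData(snap: JsonObject): void {
    // Guard every field: a stored snapshot may be older than this class.
    if (typeof snap['query'] === 'string') this.query = snap['query'];
    if (typeof snap['draft'] === 'string') this.draft = snap['draft'];
    const terms = snap['terms'];
    if (Array.isArray(terms)) {
      this.terms = terms.filter((t): t is string => typeof t === 'string');
    }
    const attempts = snap['attempts'];
    if (attempts !== null && typeof attempts === 'object' && !Array.isArray(attempts)) {
      const restored: Record<string, number> = {};
      for (const [node, count] of Object.entries(attempts)) {
        if (typeof count === 'number') restored[node] = count;
      }
      this.attempts = restored;
    }
  }
}

const before = new MiniArchivistState();
before.query = "What's a book about a labyrinth?";
before.draft = 'partial draft';
before.terms = ['labyrinth'];
before.attempts = { compose: 2 };

const snap = before.snapshot();
before.terms.push('maze');          // mutate live state after the snapshot
const wire = JSON.stringify(snap);  // what a store would actually hold

const after = new MiniArchivistState().restoreFrom(JSON.parse(wire) as JsonObject);
```

Here the restored state keeps the single term that existed at snapshot time, and the compose counter comes back as 2. Unlike the real restoreData() above, this sketch also filters the restored arrays and counters element by element; whether that extra validation is worth it depends on how much the stored data is trusted.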

Cancellation → checkpoint → resume

The #cancellation-run region in the runner shows the execute call with signal and deadlineMs, how to read the lifecycle kind, and the cursor check after cancellation:

ts
// Caller-driven cancellation — the visitor closes the page.
const controller = new AbortController();
// Simulate visitor abandoning 800 ms in.
setTimeout(() => controller.abort('visitor closed page'), 800);

const cancelVisitor = new ArchivistState();
cancelVisitor.query = "What's a book about a labyrinth?";

const cancelResult = await dispatcher.execute('the-archivist', cancelVisitor, {
  'signal':     controller.signal,
  'deadlineMs': 5000,              // hard 5s ceiling regardless of signal
});

const lc = cancelResult.state.lifecycle;
switch (lc.kind) {
  case 'completed':
    logger.result(`responded: ${cancelResult.state.draft}`);
    break;
  case 'cancelled':
    logger.result(`visitor abandoned at: ${lc.reason}`);
    break;
  case 'timed_out':
    logger.result(`hit deadline at: ${lc.finishedAt}`);
    break;
}

// cancelResult.cursor is the next node that would have run. Pass the whole
// result to Checkpoint.from to persist it and resume in a later process.
if (cancelResult.cursor !== null) {
  logger.result(`stopped at ${cancelResult.cursor} — resumable`);
}
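To make the cursor semantics concrete, here is a small self-contained sketch of a runner that checks a signal and a deadline before each node. It is an assumption about how such a loop could work, not the dispatcher's implementation: runNodes, MiniNode, MiniResult, the injectable now clock, and the scout-catalog node name are all hypothetical, and the nodes run synchronously for brevity.

```typescript
// Hypothetical miniature of a dispatcher loop; not the library's implementation.
type Lifecycle =
  | { kind: 'completed' }
  | { kind: 'cancelled'; reason: string }
  | { kind: 'timed_out' };

interface MiniNode {
  name: string;
  run: () => void;
}

interface MiniResult {
  cursor: string | null; // next node that would have run, or null when finished
  lifecycle: Lifecycle;
  visited: string[];
}

interface MiniOptions {
  signal?: AbortSignal;
  deadlineMs?: number;
  now?: () => number; // injectable clock, so the deadline can be tested
}

function runNodes(
  nodes: readonly MiniNode[],
  startAt: string | null,
  options: MiniOptions = {},
): MiniResult {
  const now = options.now ?? Date.now;
  const startedAt = now();
  const first = startAt === null ? 0 : nodes.findIndex((n) => n.name === startAt);
  if (first < 0) throw new Error(`unknown cursor: ${startAt}`);

  const visited: string[] = [];
  for (const node of nodes.slice(first)) {
    // Check before running: the node we are about to skip becomes the cursor.
    if (options.signal?.aborted) {
      const reason = String(options.signal.reason ?? 'aborted');
      return { cursor: node.name, lifecycle: { kind: 'cancelled', reason }, visited };
    }
    if (options.deadlineMs !== undefined && now() - startedAt >= options.deadlineMs) {
      return { cursor: node.name, lifecycle: { kind: 'timed_out' }, visited };
    }
    node.run();
    visited.push(node.name);
  }
  return { cursor: null, lifecycle: { kind: 'completed' }, visited };
}

const controller = new AbortController();
const nodes: MiniNode[] = [
  { name: 'scout-catalog', run: () => {} },
  // Simulate the visitor closing the page while the draft is being composed.
  { name: 'crl-compose-response', run: () => controller.abort('visitor closed page') },
  { name: 'crl-validate-response', run: () => {} },
];

const firstRun = runNodes(nodes, null, { signal: controller.signal });
// Resuming starts at the stored cursor, so the scout is not paid for again.
const secondRun = runNodes(nodes, firstRun.cursor);
```

In this sketch the first run stops with its cursor on crl-validate-response and a cancelled lifecycle; the second run visits only that node and finishes with a null cursor.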

Persist and resume (illustrative)

The persist and resume calls below use the standard Checkpoint API with MemoryCheckpointStore — swap to any CheckpointStore implementation (Postgres, Redis, S3) without changing the calling code:

ts
// illustrative — runtime equivalent in examples/the-archivist/runArchivist.ts
import { Checkpoint, MemoryCheckpointStore } from '@noocodex/dagonizer/checkpoint';

const store = new MemoryCheckpointStore();

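// After a cancelled/timed-out execute call; `result` is the value returned
// by dispatcher.execute:
if (result.cursor !== null) {
  const data = Checkpoint.from('the-archivist', result);
  await Checkpoint.persist(store, `archivist:${result.state.query}`, data);
}

// In a later process; `visitor` holds the same query that was used for the key: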
const recalled = await Checkpoint.recall(
  store,
  `archivist:${visitor.query}`,
  (snap) => ArchivistState.restore(snap),  // rehydrates via restoreData()
);

if (recalled !== null) {
  const final = await dispatcher.resume(
    recalled.dagName,
    recalled.state,
    recalled.cursor,                       // e.g. 'crl-validate-response'
  );
  console.log(final.state.draft);          // validated response
  console.log(final.state.lifecycle.kind); // 'completed'
}
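The split between codec and store can be sketched without the library. Everything in the block below is hypothetical: MiniCheckpointStore, MapStore, persist, and recall are illustrative names, and the real CheckpointStore contract may differ. The store is also synchronous here for brevity, whereas the calls above are awaited, so a real adapter would presumably return promises.

```typescript
// Hypothetical shapes; the real CheckpointStore contract may differ.
interface MiniCheckpoint {
  dagName: string;
  cursor: string;
  state: Record<string, unknown>;
}

// The store only moves opaque strings; it knows nothing about DAGs or state.
interface MiniCheckpointStore {
  save(key: string, payload: string): void;
  load(key: string): string | null;
}

class MapStore implements MiniCheckpointStore {
  private readonly rows = new Map<string, string>();

  save(key: string, payload: string): void {
    this.rows.set(key, payload);
  }

  load(key: string): string | null {
    return this.rows.get(key) ?? null;
  }
}

// Codec + store in one call: encode the checkpoint, then write it.
function persist(store: MiniCheckpointStore, key: string, data: MiniCheckpoint): void {
  store.save(key, JSON.stringify(data));
}

// Codec + store in one call: read, decode, and rebuild state through a factory.
function recall<S>(
  store: MiniCheckpointStore,
  key: string,
  factory: (snap: Record<string, unknown>) => S,
): { dagName: string; cursor: string; state: S } | null {
  const payload = store.load(key);
  if (payload === null) return null; // nothing stored under this key
  const parsed = JSON.parse(payload) as MiniCheckpoint;
  return { dagName: parsed.dagName, cursor: parsed.cursor, state: factory(parsed.state) };
}

const miniStore = new MapStore();
persist(miniStore, 'archivist:labyrinth', {
  dagName: 'the-archivist',
  cursor: 'crl-validate-response',
  state: { draft: 'partial draft', attempts: { compose: 1 } },
});

const hit = recall(miniStore, 'archivist:labyrinth', (snap) => ({ draft: String(snap['draft']) }));
const miss = recall(miniStore, 'archivist:unknown', (snap) => snap);
```

Because the store interface deals only in keys and strings, replacing MapStore with a database-backed class would leave persist, recall, and their callers unchanged, which is the property the paragraph above relies on.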

What it demonstrates

  • ArchivistState.snapshotData() / restoreData() — domain-specific serialization. NodeStateBase calls snapshotData during Checkpoint.from and restoreData during Checkpoint.recall. The lifecycle resets to pending on restore; the resumed execution is a fresh lifecycle run on the recovered state data.
  • Checkpoint.from(dagName, result) — produces a CheckpointData record only when result.cursor !== null (an in-progress flow). A completed flow produces no cursor.
  • CheckpointStore adapter contract: MemoryCheckpointStore is the test-time implementation. Swap to Postgres / Redis / S3 without touching the dispatcher or state.
  • Checkpoint.persist / Checkpoint.recall — codec + store in one call per side. Checkpoint.recall returns null when nothing is stored under the key.
  • dispatcher.resume(dagName, state, cursor) — starts from the recalled cursor instead of the DAG's entrypoint. The compose/validate retry counter (state.attempts.compose) survives the round-trip so the loop is still bounded.
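The last point, that the retry bound holds across processes, can be illustrated with a small sketch. It assumes a bound of three compose attempts and a budget parameter that simulates the interruption; MAX_COMPOSE_ATTEMPTS, LoopState, and composeValidateLoop are hypothetical names, not part of the Archivist code.

```typescript
// Hypothetical bound and loop; the real Archivist limit is not stated in this section.
const MAX_COMPOSE_ATTEMPTS = 3;

interface LoopState {
  draft: string;
  approved: boolean;
  attempts: { compose: number };
}

// Runs compose/validate until approved, out of attempts, or out of budget.
// `budget` is how many attempts this process gets before it is interrupted.
function composeValidateLoop(
  state: LoopState,
  validate: (draft: string) => boolean,
  budget: number,
): 'done' | 'interrupted' {
  let ranHere = 0;
  while (!state.approved && state.attempts.compose < MAX_COMPOSE_ATTEMPTS) {
    if (ranHere >= budget) return 'interrupted';
    state.attempts.compose += 1;
    ranHere += 1;
    state.draft = `draft #${state.attempts.compose}`;
    state.approved = validate(state.draft);
  }
  return 'done';
}

const rejectAll = (_draft: string): boolean => false;

// First process: interrupted after two attempts.
const live: LoopState = { draft: '', approved: false, attempts: { compose: 0 } };
const firstOutcome = composeValidateLoop(live, rejectAll, 2);

// Checkpoint round-trip: the counter travels with the state.
const resumed = JSON.parse(JSON.stringify(live)) as LoopState;

// Second process: only one attempt remains before the bound is reached.
const secondOutcome = composeValidateLoop(resumed, rejectAll, Number.POSITIVE_INFINITY);
```

With a validator that rejects everything, the first process stops after two attempts and the resumed one makes exactly one more, for three in total. Had the counter been reset to zero on restore, the resumed process would have made three further attempts and the bound would no longer hold.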

See this in action in the Archivist live demo.
