# Client Stage Taxonomy — Source-of-Truth Spec

**Author:** Loom
**Date:** 2026-05-25
**Status:** DESIGN — pending Larry review before Forge/Cord hand-off
**Sponsor:** Jimmie Needles

---

## TL;DR (Read This First)

1. **One primary stage per client, eight stages total.** Mutually exclusive. Plus a separate `flags` field for parallel work that doesn't dominate the client's current state.
2. **Hybrid transitions — mostly auto, manual only at lifecycle boundaries.** Bookkeepers touch stage ~2–3 times per week across the whole book, not 40 times.
3. **Dashboard leads with exceptions, not a roster.** Stuck clients, missing flags, OOO owners surface at the top. The full list is below the fold.
4. **Source of truth = Atlas `clients` table + `stage_history` audit log.** Updated via API, cron jobs, Slack commands, and agent calls.
5. **Agent-operable from day one.** Jimmie's strategic goal is running most of J2 with agents (see [`feedback_agent_operable_by_default`](../../memory/feedback_agent_operable_by_default.md)). Every API endpoint and update channel is designed for agents as first-class operators, not bolted on later. Phase 1 ships with humans + cron driving most updates; Phase 2 wires agents into the same endpoints with scoped permissions. **Same schema, same audit trail — agents and humans use the identical interface.**

---

## 1. Stage Taxonomy

### Primary Stages (mutually exclusive — a client is in exactly one)

**Decision 2026-05-25 (Jimmie):** Advisory delivery is a FLAG, not a stage. Advisory-tier clients stay in `weekly` (or whatever their bookkeeping cadence dictates) with `advisory_due` set when a package is in flight. If lived experience shows we need a dedicated stage later, adding one is a ~30-min Forge migration. Start simple.

| Stage | Definition | Typical Duration |
|-------|------------|------------------|
| `onboarding` | New client, not yet in recurring rhythm. Chart of accounts, opening balances, app stack, kickoff. | 2–4 weeks |
| `cleanup` | Historical cleanup project. Billable scope distinct from recurring work. Not yet (or temporarily not) in monthly rhythm. | 4–12 weeks |
| `weekly` | Default steady state. Weekly roundup cadence, categorize-as-you-go, ad-hoc bill pay. Advisory deliverables in flight ride here with `advisory_due` flag. | Most of the month |
| `eom_close` | Month-end close in progress for the just-finished period. Recs, accruals, prepaids, payroll JE. | Days 1–8 of month |
| `eom_review` | Close done by bookkeeper, awaiting senior review + client package delivery. | Days 6–10 of month |
| `paused_client` | Client-side hold. Owner non-responsive, payment lapse, mid-dispute, awaiting their decision. Goes to bookkeeper for relationship follow-up. | Variable |
| `paused_internal` | J2-side hold. Capacity, intentional defer, internal scope question. Goes to Reid as a capacity signal. | Variable |
| `offboarding` | Wind-down. Final invoice, file transfer, app-stack revocation. | 2–4 weeks |

### Flags (parallel state — zero or more per client, independent of primary stage)

| Flag | When Set | Why It's a Flag, Not a Stage |
|------|----------|------------------------------|
| `sales_tax_due` | Filing window open this period | Most clients have it; doesn't dominate the primary cadence |
| `1099_prep` | Jan–Feb only | Seasonal overlay on top of normal cadence |
| `year_end` | Dec close period | Overlay on `eom_close`, not a replacement |
| `tax_prep_active` | Rex / external preparer has the books | Bookkeeping continues in parallel |
| `advisory_due` | Monthly package owed this period | Predictable for advisory-tier clients |
| `stuck` | Set automatically when `stage_entered_at` exceeds threshold | Drives exception surfacing — not a manual state |
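| `client_blocking` | Waiting on client for docs/answers | Distinguishes "we're slow" from "they're slow" |
| `chronic_late` | Stuck more than 3 cycles in a row (see §3) | Marks a lateness pattern for Reid's tier/scope review without changing the current stage |
| `recurring_active` | Primary stage is `cleanup` while weekly work continues in parallel (see §3) | Keeps `cleanup` as the primary stage without hiding the recurring work |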

### Why this shape

- **Mutually exclusive primary stage** lets the dashboard answer "what is the *one* thing happening with this client right now?" in one cell. That's what Jimmie needs at a glance.
- **Flags handle the messy reality** — sales tax, year-end, tax prep, advisory work happen *in parallel* with the bookkeeping cadence. Folding them into the primary stage would either explode the taxonomy or force false choices.
- **`stuck` as a flag, not a stage** — preserves the *true* state (still in close, just late) while flagging the exception. If we made "stuck" a stage, we'd lose the info about where they're actually stuck.
- **`client_blocking` flag** — critical for J2 dashboarding. A client stuck in `eom_close` because the bookkeeper hasn't gotten to it is a J2 problem. A client stuck in `eom_close` because they haven't sent bank statements is a client problem. Different escalation paths.
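
As a rough illustration of this shape, the sketch below models one primary stage plus a set of flags. The class names (`Stage`, `Flag`, `ClientState`) and the sample client "Example Co" are hypothetical; the enum values mirror the stage table above and the `flags` enum in the §4 schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(str, Enum):
    ONBOARDING = "onboarding"
    CLEANUP = "cleanup"
    WEEKLY = "weekly"
    EOM_CLOSE = "eom_close"
    EOM_REVIEW = "eom_review"
    PAUSED_CLIENT = "paused_client"
    PAUSED_INTERNAL = "paused_internal"
    OFFBOARDING = "offboarding"


class Flag(str, Enum):
    SALES_TAX_DUE = "sales_tax_due"
    PREP_1099 = "1099_prep"
    YEAR_END = "year_end"
    TAX_PREP_ACTIVE = "tax_prep_active"
    ADVISORY_DUE = "advisory_due"
    STUCK = "stuck"
    CLIENT_BLOCKING = "client_blocking"
    CHRONIC_LATE = "chronic_late"
    RECURRING_ACTIVE = "recurring_active"


@dataclass
class ClientState:
    name: str
    stage: Stage  # exactly one primary stage
    flags: set[Flag] = field(default_factory=set)  # zero or more overlays

    def summary(self) -> str:
        flag_list = ", ".join(sorted(f.value for f in self.flags)) or "none"
        return f"{self.name}: {self.stage.value} (flags: {flag_list})"


# A close that is late because the client has not sent documents:
# the stage still says where the client is, the flags say why it is an exception.
late_close = ClientState("Example Co", Stage.EOM_CLOSE, {Flag.STUCK, Flag.CLIENT_BLOCKING})
print(late_close.summary())
```

The point of the split is visible in the last two lines: `stuck` and `client_blocking` annotate the close without replacing `eom_close`.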

---

## 2. Transition Rules

### Hybrid model — defaulting to automatic, manual only at lifecycle boundaries.

| Transition | Trigger Type | Rule |
|------------|--------------|------|
| `weekly` → `eom_close` | **Auto / time-based** | Last business day of month, 6:00 AM CT — all clients in `weekly` flip to `eom_close` |
| `eom_close` → `eom_review` | **Auto / event-based** | When all bank + CC accounts for the period are reconciled in QBO AND close checklist complete in Atlas |
| `eom_close` → `eom_review` (manual override) | **Manual** | Bookkeeper can flip via Slack `/stage` if event detection misses |
| `eom_review` → `weekly` | **Manual** | A user with `senior_reviewer` permission signs off. Per-client `eligible_reviewers` override allowed. As of 2026-05-25: Jimmie is the only senior reviewer; designed to expand later (Ledger or specific bookkeepers per client tier). For advisory-tier clients, `advisory_due` flag is set at this point and clears when the package is acknowledged or on day-16 auto-clear. |
| `advisory_due` flag set | **Auto / event-based** | On `eom_review` → `weekly` transition for advisory-tier clients |
| `advisory_due` flag cleared | **Auto / time-based** | Day 16 of month — assumes delivery acknowledged (manual override available) |
| Any → `paused_client` | **Manual** | Bookkeeper marks when client goes unresponsive / payment lapses / dispute. Requires reason. |
| Any → `paused_internal` | **Manual** | Jimmie or senior marks when J2 capacity / internal hold. Requires reason. Pings Reid. |
| `paused_client` / `paused_internal` → `weekly` / `cleanup` | **Manual** | Resume to prior cadence (uses `returning_to_stage` field). |
| Any → `offboarding` | **Manual** | Jimmie only |
| `onboarding` → `weekly` or `cleanup` | **Manual** | After first clean close OR if cleanup scope identified |
| `cleanup` → `weekly` | **Manual** | When cleanup scope formally closed |
| `stuck` flag set | **Auto** | When `now() - stage_entered_at` exceeds stage-specific SLA (see below) |

### Stuck thresholds (per stage)

| Stage | SLA (days in stage) | After SLA |
|-------|---------------------|-----------|
| `eom_close` | 10 (so day 11 of month) | `stuck` flag set |
| `eom_review` | 4 from entry | `stuck` flag |
| `advisory_due` flag (not a stage) | 6 days since flag set | `stuck` flag |
| `onboarding` | 30 | `stuck` flag + Jimmie alert |
| `cleanup` | client-specific budget × 1.25 | `stuck` flag + Jimmie alert |
| `paused_client` | 30 | `stuck` flag + check-in prompt to bookkeeper |
| `paused_internal` | 14 | `stuck` flag + alert to Reid (capacity signal) |
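
One way the stage-level check could look, as a minimal sketch rather than Forge's implementation. `STAGE_SLA_DAYS`, `is_stuck`, and `cleanup_budget_days` are hypothetical names; the sketch assumes the cleanup budget is expressed in days and leaves out the `advisory_due` flag-age rule, which would need the flag's own timestamp.

```python
from datetime import datetime, timedelta
from typing import Optional

# SLA in days per stage, copied from the table above.
# cleanup is handled separately: its SLA is a client-specific budget x 1.25.
STAGE_SLA_DAYS = {
    "eom_close": 10,
    "eom_review": 4,
    "onboarding": 30,
    "paused_client": 30,
    "paused_internal": 14,
}


def is_stuck(stage: str, stage_entered_at: datetime, now: datetime,
             cleanup_budget_days: Optional[float] = None) -> bool:
    """Return True when time in stage exceeds the stage-specific SLA."""
    if stage == "cleanup":
        if cleanup_budget_days is None:
            return False  # no budget on record, nothing to compare against
        sla_days = cleanup_budget_days * 1.25
    else:
        sla_days = STAGE_SLA_DAYS.get(stage)
        if sla_days is None:
            return False  # weekly and offboarding have no SLA in the table
    return now - stage_entered_at > timedelta(days=sla_days)
```

Because the function only reads `stage_entered_at` and the clock, running it hourly is naturally idempotent: a client that is already stuck simply evaluates to `True` again.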

### Why hybrid, not pure-anything

- **Pure manual** → bookkeepers forget, data rots, dashboard becomes a lie. Failure mode of every workflow tool J2 has tried before.
- **Pure auto** → can't capture "this client is being weird, hold the close" or "they paused us last Friday." Brittle.
- **Hybrid with auto defaults** → the *system* drives the calendar (month-end roll on the last business day, stuck check on day 11); bookkeepers only touch a stage when something exceptional happens. Estimated **2–3 manual updates per week across 40 clients**, not 40+.

---

## 3. Edge Cases

| Case | Resolution |
|------|------------|
| **Client has both active cleanup AND ongoing weekly work** (Afton, Tri-County Tire) | Primary stage = `cleanup` if cleanup is the active billable focus. Flag `recurring_active` to indicate weekly work continuing in parallel. Cord shows this as a sub-row on the dashboard. |
| **New onboard needs cleanup before going recurring** | `onboarding` → `cleanup` → `weekly`. Two-hop is fine. |
| **Multi-tier client (recurring + advisory + tax)** | Primary stage tracks bookkeeping cadence. Advisory work surfaces via `advisory_due` flag (no dedicated stage). Tax prep surfaces via `tax_prep_active` flag. |
| **December close** | Primary = `eom_close`, flag = `year_end`. The flag triggers different checklist items in Atlas. |
| **Sales tax due dates vary by client** | Per-client `sales_tax_schedule` field on client record. Cron job sets `sales_tax_due` flag in the appropriate window. |
| **Client pauses mid-close** | `eom_close` → `paused_client` (or `paused_internal` if J2's the blocker). When resumed, returns to `eom_close` (not `weekly`) if the close is incomplete. Atlas tracks via `returning_to_stage`. |
| **Bookkeeper goes OOO** | Stage doesn't change. `stage_owner` field gets reassigned (manual or auto-rebalance from Atlas). Dashboard surfaces "owner OOO" as an exception. |
| **Two bookkeepers on one client** | `stage_owner` is the primary; add `stage_collaborators` array. Dashboard shows the primary. |
| **Client off-cycle (e.g., quarterly close, not monthly)** | Per-client `close_cadence` field (`monthly`, `quarterly`, `annual`). Auto-transitions respect cadence. Quarterly clients only flip to `eom_close` on Mar/Jun/Sep/Dec last-business-day. |
| **Client in chronic "stuck" — recurring lateness** | If stuck > 3 cycles in a row → flag `chronic_late`. Distinct surface in dashboard. Goes to Reid for client-tier or scope review. |

---

## 4. Schema Change Brief — to Forge

**Goal:** Make Atlas the source of truth for client stage. No spreadsheets, no Notion, no Slack scrollback.

### Additions to `clients` table

```
stage              ENUM not null   -- onboarding, cleanup, weekly, eom_close,
                                   -- eom_review, paused_client, paused_internal, offboarding
stage_entered_at   TIMESTAMP       -- when current stage began
stage_owner        FK -> users     -- primary bookkeeper for this stage
stage_collaborators ARRAY<FK>      -- secondary bookkeepers
stage_notes        TEXT            -- optional context (why paused, what cleanup scope, etc.)
flags              ARRAY<ENUM>     -- sales_tax_due, 1099_prep, year_end, tax_prep_active,
                                   -- advisory_due, stuck, client_blocking, chronic_late,
                                   -- recurring_active
close_cadence      ENUM            -- monthly, quarterly, annual (default monthly)
sales_tax_schedule ENUM            -- none, monthly, quarterly, annual
service_tier       ENUM            -- recurring_only, recurring_advisory, fractional_cfo
returning_to_stage ENUM nullable   -- for resume-from-paused semantics
eligible_reviewers ARRAY<FK> null  -- per-client override of who can sign off eom_review;
                                   -- null = fall back to global senior_reviewers list
version            INTEGER         -- optimistic-lock counter referenced in §6.7 (type assumed)
```

### Permissions / global config

```
users.senior_reviewer    BOOLEAN     -- can sign off eom_review by default for any client
                                     -- where eligible_reviewers is null. Initial seed:
                                     -- Jimmie = true. Adding Ledger or others later is a
                                     -- one-field update, no schema change.
```

### New table — `stage_history`

```
id                 PK
client_id          FK -> clients
from_stage         ENUM
to_stage           ENUM
flag_added         ENUM nullable   -- if this row is a flag change, which flag was set
flag_removed       ENUM nullable   -- if this row is a flag change, which flag was cleared
transitioned_at    TIMESTAMP
actor_type         ENUM            -- human | agent | system
actor_id           STRING          -- 'user:jimmie', 'agent:tally', 'system:cron'
channel            ENUM            -- api | slack | nlp_relay | cron | event_webhook
on_behalf_of       STRING nullable -- e.g., 'user:jimmie' when Kade relays a verbal command
triggered_by       STRING nullable -- e.g., 'qbo_reconciliation_complete', 'echo_no_reply_7d'
trigger_type       ENUM            -- auto_time, auto_event, manual, manual_override, agent_action, rollback
reason             TEXT nullable
```

### Permissions table — `agent_permissions`

Permissions are **data, not code**. Adding a new agent or expanding scope is a row insert, not a deploy.

```
id                 PK
actor_id           STRING          -- 'agent:tally', 'agent:echo', 'user:jimmie', etc.
permission         ENUM            -- see permission enum below
scope              JSON nullable   -- optional constraints (e.g., {"from": ["eom_close"],
                                   -- "to": ["eom_review"]})
granted_at         TIMESTAMP
granted_by         STRING
revoked_at         TIMESTAMP nullable
notes              TEXT
```

**Permission enum (atomic, composable):**

```
can_read_clients              -- baseline: see the table
can_read_stage_history        -- can see audit trail
can_set_stage                 -- can change primary stage at all (subject to scope)
can_set_flag                  -- can add/remove flags at all (subject to scope)
can_sign_off_eom_review       -- equivalent to old senior_reviewer flag
can_offboard                  -- can move any client to offboarding
can_admin                     -- Forge / Larry only — full table write
```

Most agent permissions will be `can_set_flag` with a `scope` JSON restricting WHICH flags (e.g., Tally's scope is `{"flags": ["client_blocking", "ap_status"]}`; see §6.2). This keeps the enum small and the scoping flexible.

### API endpoints needed

**Auth model:** every endpoint requires a bearer token. Tokens are issued per actor (`user:jimmie`, `agent:tally`, `agent:echo`, etc.). On each call, Atlas:
1. Resolves the token → actor_id
2. Checks `agent_permissions` for the required permission + scope
3. If allowed: writes the change, stamps `stage_history` with `actor_id` and `channel`
4. If denied: returns 403 with the specific permission that's missing (no silent failure)

**Write endpoints:**

- `PATCH /clients/:id/stage` — requires `can_set_stage` (with scope check on `from`/`to`). Body: `{to_stage, reason, triggered_by?, on_behalf_of?}`. Logs to history.
- `POST /clients/:id/flags/:flag` — requires `can_set_flag` (with scope check on which flag). Body: `{reason, triggered_by?}`. Logs to history.
- `DELETE /clients/:id/flags/:flag` — same permission as above.
- `POST /clients/:id/stage_history/:row_id/rollback` — requires `can_admin`. Doesn't delete the history row; appends a new compensating row marked `trigger_type=rollback` and restores prior state. Preserves audit trail.
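
To make the rollback semantics concrete, here is a small in-memory sketch of a compensating row. The `rollback` function and the dict-based history are illustrative stand-ins, not the Atlas implementation, and the sketch assumes the row being rolled back is the client's most recent stage change.

```python
from datetime import datetime, timezone


def rollback(history: list[dict], row_id: int, actor_id: str) -> dict:
    """Append a compensating row that reverses stage_history row `row_id`."""
    target = next((row for row in history if row["id"] == row_id), None)
    if target is None:
        raise KeyError(f"no stage_history row with id {row_id}")
    compensating = {
        "id": max(row["id"] for row in history) + 1,
        "client_id": target["client_id"],
        "from_stage": target["to_stage"],  # direction is swapped
        "to_stage": target["from_stage"],
        "actor_id": actor_id,
        "trigger_type": "rollback",
        "reason": f"rollback of stage_history row {row_id}",
        "transitioned_at": datetime.now(timezone.utc),
    }
    history.append(compensating)  # the original row is never deleted or edited
    return compensating
```

After the call the log holds both rows, so a reader can see the mistaken transition and its reversal side by side.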

**Read endpoints (for agents):**

- `GET /clients?stage=X&owner=Y&flag=Z&stuck=true&service_tier=A` — composable filters. Ledger queries `?stage=eom_close&owner=agent:ledger`. Echo queries `?flag=client_blocking&age_days>7`. Cord queries the full set for the dashboard.
- `GET /clients/:id` — single client snapshot including all flags, owner, last `stage_history` entry.
- `GET /clients/:id/stage_history?limit=20` — chronological audit drill-down, for any agent that needs context before acting.
- `GET /agents/:actor_id/queue` — derived view: "what's on this agent's plate right now," based on owner + relevant flags + permissions. Used by Tally, Ledger, Echo to drive their own work.

### Cron jobs needed

- **`stage_autoroll_eom_open`** — last business day of month, 6:00 AM CT. All `weekly` clients (respecting `close_cadence`) → `eom_close`.
- **`flag_autoclear_advisory_due`** — day 16, 6:00 AM CT. Clear `advisory_due` flag from any client where it's still set.
- **`stage_stuck_detector`** — hourly. Sets `stuck` flag on any client exceeding SLA (including `advisory_due` flag age).
- **`flag_sales_tax_due`** — daily. Reads each client's `sales_tax_schedule` + filing calendar; sets/clears `sales_tax_due`.
- **`flag_seasonal`** — daily. Sets `1099_prep` Jan 1–Feb 15; sets `year_end` Dec 1–Jan 15.
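
The seasonal job is a good place to show what "idempotent" means here. In this sketch (function names are hypothetical) the flags are recomputed from the date each run, so running the job twice on the same day changes nothing.

```python
from datetime import date

SEASONAL_FLAGS = {"1099_prep", "year_end"}


def seasonal_flags(today: date) -> set[str]:
    """Seasonal flags that should be set on `today`, per the windows above."""
    flags = set()
    month_day = (today.month, today.day)
    if (1, 1) <= month_day <= (2, 15):  # Jan 1 through Feb 15
        flags.add("1099_prep")
    if month_day >= (12, 1) or month_day <= (1, 15):  # Dec 1 through Jan 15
        flags.add("year_end")
    return flags


def apply_seasonal(current_flags: set[str], today: date) -> set[str]:
    """Replace only the seasonal flags; every other flag is left untouched."""
    return (current_flags - SEASONAL_FLAGS) | seasonal_flags(today)
```

The `year_end` window wraps around the new year, which is why its test is an `or` rather than a simple range check.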

### Slack command

- `/stage <client_short_name> <new_stage> [reason]` — manual transition from Slack. Logs to `stage_history` with `actor_id = user:<name>` and `channel = slack`.
- `/stage <client_short_name>` (no args) — read-only: shows current stage, owner, flags, time-in-stage.
- `/flag <client> +/-<flag>` — flag add/remove.

### Acceptance criteria for Forge

- All 40 active clients seeded with a starting stage (Larry + Jimmie review the seed before go-live).
- All transitions logged in `stage_history` — no silent flips.
- Slack command latency < 2s.
- Cron jobs idempotent (re-running doesn't double-flip).
- Dashboard query (`GET /clients?stage=...&stuck=true`) under 500ms for the full client book.

### ROI estimate

| Cost | Estimate |
|------|----------|
| Forge build | 2–3 days |
| Cord dashboard build | 1–2 days |
| Bookkeeper onboarding to new flow | 1 hour × team |
| **Ongoing maintenance** | ~30 min/month |

| Benefit | Estimate |
|---------|----------|
| Jimmie status-check time saved | 30 min/week × 50 = 25 hr/yr |
| Stuck-client surprises eliminated | 1 hr/month × 12 = 12 hr/yr |
| Bookkeeper "where are we on X" Slack pings reduced | 2 hr/week × 50 = 100 hr/yr |
| **Total** | **~137 hr/yr saved** |

At ~137 hr/yr (about 34 hr/quarter), this clears the 10-hr/quarter threshold by roughly 3.4x. Greenlit.

---

## 5. Dashboard Read Brief — to Cord

**Goal:** Two-tab Sheet (or Quadratic) — Projects on one tab, Clients on another. Client tab leads with exceptions; full roster below.

### Source

- Atlas API: `GET /clients?include=stage,flags,owner,stage_history`
- Refresh: every 15 minutes, OR webhook from Atlas on any stage/flag change (Forge can fire webhooks if Cord prefers push)

### Client tab — top section: EXCEPTIONS (read-first zone)

| Column | Source |
|--------|--------|
| Client | `clients.name` |
| Stage | `clients.stage` |
| Days in stage | `now() - stage_entered_at` |
| Why flagged | derived: `stuck`, `chronic_late`, `client_blocking`, OOO owner, etc. |
| Owner | `stage_owner.name` |
| Action | derived: "Nudge client" / "Reassign" / "Review with Jimmie" |

Filter: a client appears here if it has the `stuck` or `chronic_late` flag, has been in `paused_client` for more than 30 days, is in `paused_internal`, or has an OOO owner.
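
The filter can be read as a single predicate. The sketch below is one possible rendering; `days_in_stage` and `owner_ooo` are hypothetical derived fields that Cord would compute from `stage_entered_at` and the owner's availability.

```python
def is_exception(client: dict) -> bool:
    """True when the client belongs in the dashboard's exceptions section."""
    flags = set(client.get("flags", []))
    if flags & {"stuck", "chronic_late"}:
        return True
    if client["stage"] == "paused_internal":
        return True
    if client["stage"] == "paused_client" and client["days_in_stage"] > 30:
        return True
    return bool(client.get("owner_ooo", False))


book = [
    {"name": "A", "stage": "weekly", "flags": [], "days_in_stage": 12},
    {"name": "B", "stage": "eom_close", "flags": ["stuck"], "days_in_stage": 11},
    {"name": "C", "stage": "paused_client", "flags": [], "days_in_stage": 9},
    {"name": "D", "stage": "paused_internal", "flags": [], "days_in_stage": 2},
]
exceptions = [c["name"] for c in book if is_exception(c)]
print(exceptions)
```

On a healthy day the resulting list is empty, which matches the acceptance criterion for this section.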

### Client tab — bottom section: full roster

| Column | Source |
|--------|--------|
| Client | `clients.name` |
| Service tier | `clients.service_tier` |
| Stage | `clients.stage` |
| Days in stage | derived |
| Flags | `clients.flags` (rendered as chips/emoji) |
| Owner | `stage_owner.name` |
| Close cadence | `clients.close_cadence` |
| **Last update** | derived from most recent `stage_history` row: `actor_id` + `transitioned_at`. Render with an icon distinguishing 👤 human / 🤖 agent / ⚙️ system. |
| Notes | `clients.stage_notes` |

Default sort: stage, then days-in-stage descending.
Conditional formatting: red if `stuck`, amber if approaching SLA, green otherwise.

**Why surface the actor:** as agents take over more updates, Jimmie needs to glance at the dashboard and instantly know "an agent moved this" vs. "a human moved this." Builds trust during the transition and surfaces agent errors fast.

### Project tab

Source: `atlas.projects` (whatever Forge's projects table has — Cord and Forge align on columns). Out of scope for this spec.

### Acceptance criteria for Cord

- Exceptions section shows zero entries on a healthy day (and that's the signal).
- Stage colors readable at a glance — no decoding needed.
- Stale-data indicator: timestamp of last refresh visible top-right.
- Mobile-readable (Jimmie checks phone constantly).

---

## 6. Phase 2 — Agent Integration

**Principle:** Phase 1 ships with humans + cron driving most updates. Phase 2 wires PKA agents into the *same endpoints*, with scoped permissions stored as data. Adding a new agent is a row in `agent_permissions`, not a code change. Per [`feedback_agent_operable_by_default`](../../memory/feedback_agent_operable_by_default.md), agents are first-class operators, not bolted-on automation.

### 6.1 Agent write-path matrix

| Agent | Stage writes allowed | Flag writes allowed | Reads allowed | Notes |
|-------|---------------------|---------------------|---------------|-------|
| **Tally** | None | set/clear `client_blocking` (when AP capture incomplete), set `ap_status` (new flag if needed) | own queue + clients with AP work | AP completion is event-based; Tally flips flags when WellyBox/Ramp pipeline state changes |
| **Ledger** | `eom_close` → `eom_review` only (event-based on QBO recs done) | set `chronic_late` (after pattern detection), set/clear `client_blocking` for accounting-data-missing | full client list, full history | Largest agent write surface — Ledger is the closest agent to bookkeeper role |
| **Echo** | None | set/clear `client_blocking` (based on client email response patterns), set `chronic_late` on chronic non-response | own queue + email-tied clients | Cannot touch primary stage — judgment too high-stakes for email-pattern inference |
| **Cord** | `eom_close` → `eom_review` (event-based, alternate path if Coefficient detects all recs done before Ledger does) | None directly; fires events that may drive transitions | read-only across full table | Cord's role is detection + event emission, not direct stage manipulation |
| **Kade** | Any stage change (when Jimmie verbally instructs) | Any flag (when Jimmie verbally instructs) | full client list | Acts with `on_behalf_of: user:jimmie`. Inherits Jimmie's permissions for relayed commands. |
| **Reid** | `paused_internal` → `weekly`/`cleanup` (resolution), `weekly` → `paused_internal` (when capacity-flagging) | set `chronic_late`, set/clear capacity-related flags | full client list + capacity views | Reid's lane is the J2-side workload signal |
| **Tax-prep agent (future: Rex companion)** | None | set/clear `tax_prep_active` | tax-tier clients only | When Rex's prep workflow has the books, flag flips |
| **Riv** | None on clients table directly | None directly | full read for monitoring | Owns the *plumbing* — webhooks, NLP relay, agent auth tokens. Should never need direct stage writes. |
| **Forge** | All | All | Full | Admin owner of the table. Used for migrations and emergency overrides. |
| **Loom** | None | None | Read-only including `stage_history` | Reads cycle-time data for VSM analysis. Never writes. |
| **Larry (orchestrator)** | All (with `on_behalf_of: user:jimmie` stamp) | All (same) | Full | Reflects Jimmie's authority when Jimmie is in conversation with Larry. |
| **Pixel / Vox / Wren / Vega / Corvus / Ash / Pax / Nolan** | None | None | None on clients table | Out of scope — not client-ops agents |

### 6.2 Permission scoping model

**Where permissions live:** dedicated `agent_permissions` table (see §4 schema above). NOT hardcoded enum / config file.

**Why data, not code:**
- Adding a new agent is a row insert, reviewable in audit logs
- Revoking permission mid-flight is a single UPDATE (set `revoked_at`)
- Scope changes (e.g., expand Echo from `client_blocking` to also `chronic_late`) don't require a deploy
- Larry can grant temporary permissions for one-off operations

**How the `scope` JSON works (examples):**

```jsonc
// Tally — can only touch client_blocking and ap_status flags
{
  "permission": "can_set_flag",
  "scope": { "flags": ["client_blocking", "ap_status"] }
}

// Ledger — can flip eom_close → eom_review but no other stage transitions
{
  "permission": "can_set_stage",
  "scope": { "from": ["eom_close"], "to": ["eom_review"] }
}

// Echo — can set client_blocking and chronic_late, but can clear only client_blocking
{
  "permission": "can_set_flag",
  "scope": { "flags_set": ["client_blocking", "chronic_late"], "flags_remove": ["client_blocking"] }
}
```

**Permission check happens server-side at every write.** No client-side trust.
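
A minimal sketch of how the server-side scope check could evaluate the three example scopes above. The function names are hypothetical, and two behaviours are assumptions rather than decisions made in this spec: a null `scope` means unrestricted, and a scope that lacks the relevant list fails closed.

```python
from typing import Optional


def stage_allowed(scope: Optional[dict], from_stage: str, to_stage: str) -> bool:
    """Check a can_set_stage scope against a requested transition."""
    if scope is None:
        return True  # assumption: null scope means no constraints
    return from_stage in scope.get("from", []) and to_stage in scope.get("to", [])


def flag_allowed(scope: Optional[dict], flag: str, action: str) -> bool:
    """Check a can_set_flag scope; `action` is 'set' or 'remove'."""
    if scope is None:
        return True  # assumption: null scope means no constraints
    specific = "flags_set" if action == "set" else "flags_remove"
    # Prefer the action-specific list, fall back to the generic "flags" list,
    # and deny when neither is present.
    allowed = scope.get(specific, scope.get("flags", []))
    return flag in allowed


ledger_scope = {"from": ["eom_close"], "to": ["eom_review"]}
echo_scope = {"flags_set": ["client_blocking", "chronic_late"],
              "flags_remove": ["client_blocking"]}
tally_scope = {"flags": ["client_blocking", "ap_status"]}
```

With these scopes, Ledger can move `eom_close` to `eom_review` and nothing else, and Echo can set `chronic_late` but not clear it.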

### 6.3 Three update channels — all first-class

| Channel | Used By | Auth | Latency Target |
|---------|---------|------|----------------|
| **REST API** | Tally, Ledger, Cord, Riv-orchestrated workflows, Forge admin scripts | Bearer token, one per actor | < 200ms write |
| **Slack `/stage` command** | Human bookkeepers, Jimmie | Slack user → mapped to Atlas user via OAuth | < 2s end-to-end |
| **Conversational NLP** | Kade (relaying Jimmie's verbal commands), future Slack DMs to other agents | Inherits Jimmie's permissions when Kade is the relay; otherwise uses agent's own token | < 3s end-to-end |

All three channels write through the SAME underlying API. Slack and NLP are thin shims that translate to REST calls. No second source of truth.

### 6.4 NLP relay layer — where does it live?

**Recommendation: relay lives in each agent, not as a separate service.**

Reasoning:
- Each agent already has its own LLM persona and conversational context
- A central NLP service would become a brittle dependency every agent has to share
- Agents already know their own permission scope; they can translate "mark Afton as paused" into the correct API call themselves
- If Jimmie says "Kade, mark Afton as paused" → Kade interprets intent → Kade calls `PATCH /clients/afton/stage` with `actor_id=agent:kade, on_behalf_of=user:jimmie`

**What Riv builds for this:**
- A shared library / SDK that each agent imports: `atlas.clients.set_stage(client_id, to_stage, reason, on_behalf_of="user:jimmie")`
- Standardized error handling so a 403 from the API surfaces as a clean message to Jimmie in Slack
- Token rotation + secrets management

**What stays out of Riv's scope:**
- The NLP itself — each agent owns its own intent parsing

### 6.5 Audit trail — what we capture, why

Every write produces a `stage_history` row with:

| Field | Why it matters |
|-------|----------------|
| `actor_type` (human/agent/system) | Lets us measure agent-share of operations over time |
| `actor_id` | Pinpoints exactly which agent made the call — critical for debugging agent errors |
| `channel` (api/slack/nlp_relay/cron/event_webhook) | Distinguishes "Kade got a Slack DM and relayed" from "Tally called the API directly" |
| `on_behalf_of` | When Kade or Larry acts on Jimmie's verbal command, this preserves the chain of authority |
| `triggered_by` | The event/condition that caused the agent to act (`qbo_reconciliation_complete`, `echo_no_reply_7d`, etc.) — invaluable for tuning |
| `trigger_type` includes `agent_action` | Lets dashboards filter "show me everything agents did this week" |

**Agent-share KPI:** monthly scorecard tracks `% of stage_history rows where actor_type=agent`. Goal: starts near 0% in Phase 1, climbs toward 70%+ as agents take over.
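
The KPI is a simple ratio over `stage_history`. A sketch of the calculation, using plain dicts in place of database rows (the `agent_share` name is hypothetical):

```python
def agent_share(rows: list[dict]) -> float:
    """Percentage of stage_history rows written by agents."""
    if not rows:
        return 0.0  # avoid dividing by zero in a month with no activity
    agent_rows = sum(1 for row in rows if row["actor_type"] == "agent")
    return 100.0 * agent_rows / len(rows)


month = [
    {"actor_type": "human"},
    {"actor_type": "system"},
    {"actor_type": "agent"},
    {"actor_type": "system"},
]
print(agent_share(month))
```

Note that cron-driven rows count as `system`, so they dilute the agent share; whether the scorecard should exclude them from the denominator is a choice this sketch does not make.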

### 6.6 Read paths for agents

Every agent needs to *query* before acting. Standardized read patterns:

| Query | Used By | Returns |
|-------|---------|---------|
| `GET /agents/agent:ledger/queue` | Ledger | Clients in `eom_close` owned by Ledger, sorted by stuck-risk |
| `GET /agents/agent:tally/queue` | Tally | Clients with AP work pending (derived from flags + AP pipeline state) |
| `GET /clients?flag=client_blocking&age_days>7` | Echo | Clients Echo should escalate via email |
| `GET /clients?stage=paused_internal` | Reid | Capacity backlog |
| `GET /clients/:id/stage_history?limit=20` | Any agent before acting | Context — has someone else already moved this? |
| `GET /clients?owner=user:jimmie&flag=stuck` | Kade | What Jimmie needs to look at when he opens Slack |

### 6.7 Failure modes

| Failure | Resolution |
|---------|------------|
| **Two agents update simultaneously** | Optimistic locking via `clients.version` column. Each write includes the version it read; if version moved, write rejects with 409 and the agent re-reads + retries. Last-write-wins is NOT acceptable for stage. |
| **Agent makes a wrong call** | Any user with `can_admin` can hit `POST /clients/:id/stage_history/:row_id/rollback`. This appends a compensating history row (not a delete), restoring prior state. Audit preserved. |
| **Agent permission revoked mid-flight** | In-flight calls complete (we don't kill mid-transaction). Next call returns 403 with the missing permission named. No silent failure. |
| **Agent outage / unavailable** | Cron jobs continue. Manual Slack updates continue. Other agents continue. *No agent is a single point of failure.* If Tally is down, flag updates queue in Tally's own backlog and replay on restart — they're not lost, but the dashboard reflects "Tally is behind." |
| **Agent goes rogue / loops** | Rate limit per actor_id at the API gateway: max N writes/minute. Exceeding it auto-pauses the agent's token and pings Forge + Jimmie. |
| **NLP misinterprets intent** | Kade (or whichever agent relays) echoes the action back in plain English before calling the API: "Setting Afton to paused_client, reason 'pending owner reply'. Confirm?" Confirmation is REQUIRED for stage changes, `offboarding` above all; for routine flag changes it can be skipped. |
| **Permission table corruption** | Permissions are append-only with `revoked_at`. Forge keeps daily snapshots. |
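
The optimistic-locking row is the least obvious of these, so here is a self-contained sketch. `InMemoryClients` is a toy stand-in for the `clients` table and `VersionConflict` stands in for an HTTP 409; both are hypothetical. The retry budget (three automatic retries, then surface the failure) follows the acceptance criteria in §6.9.

```python
class VersionConflict(Exception):
    """Stands in for an HTTP 409 from the API."""


class InMemoryClients:
    """Toy store standing in for the Atlas clients table."""

    def __init__(self):
        self.rows = {}

    def read(self, client_id):
        row = self.rows[client_id]
        return row["stage"], row["version"]

    def write_stage(self, client_id, to_stage, expected_version):
        row = self.rows[client_id]
        if row["version"] != expected_version:
            raise VersionConflict(client_id)  # someone else wrote first
        row["stage"] = to_stage
        row["version"] += 1


def set_stage_with_retry(store, client_id, to_stage, max_retries=3):
    """Write a stage change; returns the number of retries that were needed."""
    for attempt in range(max_retries + 1):  # first try plus up to 3 retries
        _, version = store.read(client_id)  # always re-read before writing
        try:
            store.write_stage(client_id, to_stage, version)
            return attempt
        except VersionConflict:
            continue
    raise RuntimeError(f"gave up on {client_id} after {max_retries + 1} conflicts")
```

A real agent would also re-check, after each re-read, that its transition still makes sense: if another actor already moved the client, retrying the same write may no longer be the right call.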

### 6.8 ROI revisited with agent operation

Phase 1 ROI: **~137 hr/yr saved** (humans + cron, manual stage updates).

Phase 2 layered impact (cumulative, after agents are wired in):

| Agent contribution | Hours saved/yr |
|--------------------|----------------|
| **Ledger** auto-flips `eom_close` → `eom_review` on QBO event | 40 hr/yr (5 min × 40 clients × 12 months) |
| **Tally** auto-clears `client_blocking` on receipt capture | 25 hr/yr |
| **Echo** auto-flags `client_blocking` from email patterns | 30 hr/yr (catches things humans miss) |
| **Kade** lets Jimmie update from Slack DM in 3 seconds | 15 hr/yr (vs. opening dashboard, finding client, clicking) |
| **Reid** auto-resolves `paused_internal` on capacity events | 10 hr/yr |
| **Cord** detects QBO state changes faster than humans poll | 15 hr/yr |
| **Phase 2 incremental** | **~135 hr/yr** |
| **Phase 1 + 2 combined** | **~272 hr/yr** |

At a loaded rate of $75/hr (conservative): **~$20,400/yr saved** once Phase 2 is fully wired. At ~272 hr/yr (68 hr/quarter), that clears the 10-hr/quarter threshold by roughly 7x.
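
The totals can be re-derived from the line items in the two ROI tables. The dictionary labels below are shorthand for the table rows, not new estimates.

```python
# Hours saved per year, taken from the Phase 1 and Phase 2 tables.
PHASE_1_HOURS = {"status checks": 25, "stuck surprises": 12, "Slack pings": 100}
PHASE_2_HOURS = {"Ledger": 40, "Tally": 25, "Echo": 30,
                 "Kade": 15, "Reid": 10, "Cord": 15}
LOADED_RATE_USD = 75

phase_1 = sum(PHASE_1_HOURS.values())
phase_2 = sum(PHASE_2_HOURS.values())
combined = phase_1 + phase_2
dollars = combined * LOADED_RATE_USD

print(phase_1, phase_2, combined, dollars)  # 137 135 272 20400
```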

Plus the strategic compounding: every system designed agent-operable from day one accelerates Jimmie's long-term goal of agent-run operations. The dashboard for THIS workflow becomes the template for every J2 workflow that follows.

### 6.9 Phase 2 brief — to Riv (new)

**Goal:** Build the agent-write infrastructure that turns Atlas's REST API into a first-class agent interface.

**Inputs:**
- Forge's completed Atlas API (Phase 1 endpoints)
- `agent_permissions` table populated with initial seed (Forge + Loom + Larry align on seed rows)
- Each PKA agent's existing runtime (Tally, Ledger, Echo, Kade, Reid, Cord)

**Deliverables:**

1. **`atlas-agent-sdk`** — shared Python/JS library every PKA agent imports. Functions:
   - `atlas.clients.set_stage(client_id, to_stage, reason, on_behalf_of=None)`
   - `atlas.clients.set_flag(client_id, flag, reason, on_behalf_of=None)`
   - `atlas.clients.clear_flag(client_id, flag, reason, on_behalf_of=None)`
   - `atlas.clients.get(client_id)` / `atlas.clients.query(...)`
   - `atlas.agents.get_queue(actor_id)`
   - Each method handles auth, permission errors with clear messages, optimistic-lock retries.

2. **Agent auth token issuance + rotation** — each PKA agent gets a long-lived token bound to its actor_id. Tokens rotate quarterly. Secrets live in `.env.local` per [`reference_atlas_secrets_location`](../../memory/reference_atlas_secrets_location.md).

3. **Webhook plumbing for event-based transitions:**
   - QBO `reconciliation_complete` → Atlas event → Ledger picks up → flip `eom_close` → `eom_review`
   - WellyBox `capture_complete` → Atlas event → Tally clears `client_blocking`
   - Email `client_no_reply_7d` (from Echo's monitoring) → Echo sets `client_blocking`

4. **Rate limiter at the API gateway** — per actor_id, max writes/minute configurable. Auto-pause + alert on breach.

5. **Slack `/stage` Kade companion** — when Jimmie DMs Kade in natural language ("mark Afton as paused"), Kade calls the SDK with `on_behalf_of=user:jimmie`. Confirmation echo required for stage changes; optional for flag-only changes.
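The three event-based transitions in deliverable 3 can be sketched as a routing table. This is an illustration only: the `Route` shape, the source keys (`"qbo"`, `"wellybox"`, `"email"`), the lowercase agent ids, and `route_event` are all hypothetical names, not part of the Atlas API. Only the event names, agents, stages, and flags come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    agent: str                        # PKA agent that picks up the event
    action: str                       # "set_stage", "set_flag", or "clear_flag"
    target: str                       # stage or flag the action applies to
    from_stage: Optional[str] = None  # expected current stage (stage moves only)


# Hypothetical (source, event) keys mapped to the transitions named in the spec.
ROUTES = {
    ("qbo", "reconciliation_complete"): Route(
        "ledger", "set_stage", "eom_review", from_stage="eom_close"
    ),
    ("wellybox", "capture_complete"): Route("tally", "clear_flag", "client_blocking"),
    ("email", "client_no_reply_7d"): Route("echo", "set_flag", "client_blocking"),
}


def route_event(source: str, event: str) -> Optional[Route]:
    """Return the route for an incoming webhook, or None if it is not mapped."""
    return ROUTES.get((source, event))
```

Keeping the mapping as data rather than branching logic would let a new event-based transition be added as one row, with unmapped events falling through to `None` instead of triggering a write.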

**Acceptance criteria:**
- An agent can make a permitted write in < 200ms (excluding NLP parsing)
- A permission-denied write returns a clear English error to the calling agent within 300ms
- Optimistic-lock conflicts retry automatically up to 3 times; if the third retry also fails (the 4th failed attempt overall), the conflict surfaces to the operator
- Audit trail in `stage_history` shows correct `actor_id`, `channel`, `on_behalf_of` for every agent call
- Rate-limit breach triggers Forge + Jimmie alert within 60s
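The retry criterion above can be sketched as a small wrapper. This is a sketch under assumptions, not the SDK's implementation: `OptimisticLockConflict`, `EscalateToOperator`, and `write_with_retry` are hypothetical names, and the real SDK may add backoff between attempts.

```python
class OptimisticLockConflict(Exception):
    """Hypothetical error: the row changed between read and write."""


class EscalateToOperator(Exception):
    """Hypothetical error: automatic retries are exhausted."""


MAX_RETRIES = 3  # from the acceptance criteria


def write_with_retry(attempt_write, max_retries=MAX_RETRIES):
    """Call attempt_write(); retry on lock conflicts, escalate on the 4th failure."""
    failures = 0
    while True:
        try:
            return attempt_write()
        except OptimisticLockConflict as exc:
            failures += 1
            if failures > max_retries:
                raise EscalateToOperator(
                    f"write failed {failures} times on lock conflicts"
                ) from exc


# Usage: a stand-in write that conflicts twice, then succeeds on the third attempt.
calls = {"n": 0}


def flaky_write():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise OptimisticLockConflict()
    return "ok"


result = write_with_retry(flaky_write)
```

With `max_retries=3`, a write that always conflicts is attempted four times in total (one initial attempt plus three retries) before the escalation is raised.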

**ROI:** see §6.8. **~135 hr/yr saved** once all six agents are wired. Riv build estimated 3–5 days post-Forge.

**Owner:** Riv. **Reviewers:** Forge (API alignment), Loom (cycle-time tracking).

---

## 7. What This Spec Does NOT Cover

- **Project taxonomy** — projects tab is Cord's lane once Forge confirms what's in the projects table. Different spec.
- **The actual close checklist contents** — Ledger owns checklist content per stage. Hand off separately.
- **Bookkeeper assignment / load balancing** — Reid + Nolan question (who owns which clients), not Loom.
- **Capacity planning via VSM** — once stage data flows for 60 days, Loom does the cycle-time analysis. Phase 3.
- **Agent NLP intent parsing details** — each agent owns its own; not centralized in this spec.
- **Permission seed values** — initial rows for `agent_permissions` get drafted by Loom + reviewed with Larry/Jimmie at Forge hand-off, not pre-locked here.

---

## 8. Open Questions for Larry / Jimmie

Per the one-question-at-a-time rule, these go to Larry to ask Jimmie in sequence:

**Phase 1 questions (RESOLVED):**

1. ~~**Q1 (blocking):** Is `advisory_delivery` a real distinct stage, or do advisory-tier clients just have an `advisory_due` flag while staying in `weekly`?~~ **RESOLVED 2026-05-25 — flag-only. Add stage later if needed.**
2. ~~**Q2:** `paused` — one stage or two?~~ **RESOLVED 2026-05-25 — split into `paused_client` and `paused_internal`. `paused_internal` (J2-paused) alerts Reid as a capacity signal.**
3. ~~**Q3:** Who is the senior reviewer in `eom_review`?~~ **RESOLVED 2026-05-25 — Jimmie only for now. The schema is designed to be extensible: a `users.senior_reviewer` flag plus a per-client `eligible_reviewers` override. Adding Ledger or others later is a permission flip, not a migration.**
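The Q3 resolution can be illustrated with a short lookup. This sketch assumes a per-client `eligible_reviewers` override replaces the global `senior_reviewer` flag rather than adding to it, which the spec does not state. The dict shapes and the function name `reviewers_for` are hypothetical.

```python
def reviewers_for(client, users):
    """Return the user ids allowed to review this client in eom_review."""
    override = client.get("eligible_reviewers")
    if override:
        # Assumption: a non-empty per-client override wins over the global flag.
        return sorted(override)
    return sorted(u["id"] for u in users if u.get("senior_reviewer"))


# Usage: Jimmie is the only senior reviewer today.
users = [
    {"id": "jimmie", "senior_reviewer": True},
    {"id": "ledger", "senior_reviewer": False},
]
default_reviewers = reviewers_for({"id": "client-a"}, users)

# Adding Ledger later is a flag flip on the user row, with no schema change.
users[1]["senior_reviewer"] = True
after_flip = reviewers_for({"id": "client-a"}, users)
```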

**Phase 2 questions (open — not blocking Phase 1):**

4. **Q4:** Should Phase 2 ship with all six agents wired at once, or roll out one agent at a time (Ledger first, then Echo, etc.)? Trade-off: all-at-once = faster ROI but bigger blast radius if something's wrong; phased = safer but slower.
5. **Q5:** When Kade relays Jimmie's command via NLP, should EVERY change require confirmation echo, or only stage changes (flags can fire without confirm)? Defaults proposed: confirm on stages, skip confirm on flags.
6. **Q6:** Rate-limit defaults per agent — what's the upper bound on writes/minute before we auto-pause? Suggestion: 30/min for Tally and Echo, 10/min for Ledger and Reid (which touch primary stage).
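To make Q6 concrete, here is a minimal sliding-window sketch of a per-actor write limiter with auto-pause. It uses the suggested (not yet approved) defaults from Q6. The class name, the plain lowercase actor ids, and the in-memory storage are assumptions for illustration; a real gateway would persist state and send the alert.

```python
import time
from collections import defaultdict, deque

# Suggested writes/minute from Q6; pending Jimmie's answer.
SUGGESTED_LIMITS = {"tally": 30, "echo": 30, "ledger": 10, "reid": 10}


class WriteRateLimiter:
    def __init__(self, limits, window_s=60.0, clock=time.monotonic):
        self.limits = limits
        self.window_s = window_s
        self.clock = clock
        self.recent = defaultdict(deque)  # actor_id -> timestamps of recent writes
        self.paused = set()               # actors auto-paused after a breach

    def allow(self, actor_id):
        """Return True if the write may proceed; auto-pause the actor on breach."""
        if actor_id in self.paused:
            return False
        now = self.clock()
        window = self.recent[actor_id]
        # Drop timestamps that have aged out of the window.
        while window and now - window[0] >= self.window_s:
            window.popleft()
        if len(window) >= self.limits[actor_id]:  # KeyError for unknown actors
            self.paused.add(actor_id)  # a real gateway would also raise an alert
            return False
        window.append(now)
        return True


# Usage with a frozen clock: Ledger's 11th write inside one minute is refused.
limiter = WriteRateLimiter(SUGGESTED_LIMITS, clock=lambda: 0.0)
ledger_results = [limiter.allow("ledger") for _ in range(11)]
```

Once an actor is in `paused`, every later write is refused until something removes it from the set. How an actor gets un-paused is left open here, since the spec does not define it.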

**Phase 1 is locked and ready for Forge regardless of Phase 2 answers.** Phase 2 work begins after Forge ships and Cord's dashboard validates the read path.

---

## 9. Next Actions

| Who | What | By When |
|-----|------|---------|
| Larry | Greenlight Phase 1 hand-off to Forge | Now |
| Forge | Implement Phase 1 schema (`clients` + `stage_history` + `agent_permissions` table, even if empty) + endpoints + cron + Slack command | 2–3 days |
| Forge + Loom + Larry | Seed `agent_permissions` table with initial rows | Day of Forge ship |
| Cord | Build dashboard against Atlas API (including `Last update` actor column) | 1–2 days after Forge ships |
| Larry | Surface Q4–Q6 to Jimmie before Phase 2 kickoff | Before Riv brief lands |
| Riv | Build `atlas-agent-sdk`, auth tokens, webhook plumbing, rate limiter, Kade NLP relay | 3–5 days post-Forge |
| Each agent owner (Forge/Ledger/Echo/Tally/Reid/Cord/Kade) | Wire their agent into the SDK; all at once or one agent at a time, per Q4 answer | Rolling, post-Riv |
| Loom | Monthly agent-share scorecard (% of writes by agents) + cycle-time VSM at 60 days | 2026-07-25 |