# Client-Stage Phase 1 — Audit + Implementation Plan

**To:** Larry
**From:** Forge
**Date:** 2026-05-25
**Status:** Awaiting Larry sign-off before migration 010 lands.
**Spec reference:** `Team Inbox/Loom/client_stage_taxonomy_spec.md`

---

## 0. TL;DR

- Atlas reality is **better than the spec assumed** in some places (clients table exists, audit-log pattern is already proven in `kade_tasks_log`), and **different from the spec** in others (no `users` table, single shared bearer token, no in-process cron).
- Three small spec-vs-reality renegotiations needed before I write code. None are dealbreakers. **All three preserve Phase 2 agent-operable intent.**
- ETA: **3 working days of focused work, 4 if Phase 2 token plumbing is tougher than I think.** Spec said 2–3.
- One real blocker: I have **52 clients** in the table, spec assumed **40 active**. Need decision on seed scope before launch.

---

## 1. Current Atlas state (what's there now)

### Stack reality
- Next.js 15 + React 19, raw `pg` (no Prisma), raw SQL migrations as `scripts/migrate_NNN_slug.mjs`
- Tracked in `atlas_migrations(id text, applied_at, note)` — latest is `009_kade_tasks` from 2026-05-24. **Mine will be `010_client_stage`.**
- Auth: single shared `ATLAS_INGEST_TOKEN` bearer (`lib/kade-meds.ts → requireAuth`). No users table. "actor" is a self-reported text field in request bodies.
- Cron: standalone `.mjs` scripts run by Windows Task Scheduler (same pattern as `watch-ingest-gdrive.mjs`, `qbo_worker.py`, KadeBot). No in-process cron framework.
- Slack: Atlas has **no native Slack integration**. The Kade Slack bot lives in `C:\PKA\Team\Riv\kade_slack\` (Python) and calls Atlas via REST. **The spec's `/stage` and `/flag` Slack commands are Riv's lane, not mine.** I build the REST endpoints; Riv adapts.

### Table-by-table audit (against spec §4)

| Spec asks for | Atlas reality | Gap |
|---------------|---------------|-----|
| `clients` table additions | **Exists** with `id, name, slug, notes, gdrive_folder`. 52 rows. | All 11 new fields are net additions. Clean ALTER. |
| `users` table | **Does not exist.** | Spec assumed it; Atlas convention is text actor strings (proven by `kade_tasks_log.actor`). **Renegotiation #1 below.** |
| `stage_history` table | Does not exist. | Net new. Model on `kade_tasks_log` (same pattern: `actor text, channel text, payload jsonb, external_id, occurred_at`). |
| `agent_permissions` table | Does not exist. | Net new. Per the brief, scaffold even if mostly empty in Phase 1. |
| `users.senior_reviewer` boolean | N/A (no users table) | **Renegotiation #2 below** — encode as a permission row, not a boolean. |
| Per-actor bearer tokens | Single shared `ATLAS_INGEST_TOKEN` only. | **Renegotiation #3 below** — add an `actor_tokens` mapping table. |

### What's NOT a gap (Atlas already supports it)

- The `actor`/`channel`/`payload`/`external_id` audit pattern is established and battle-tested in `kade_tasks_log`. Stage history piggybacks on a known-good shape.
- Free-form text actors are the Atlas convention. Agents like Tally/Echo can stamp themselves without a FK migration.
- Optimistic locking via row-level `FOR UPDATE` (kade_tasks pattern) is already in use. I'll use the same approach plus a `version` column on `clients` for the dashboard read-then-write case.
- Project/client FK pattern is established (`kade_tasks.project_id`).

---

## 2. Three spec renegotiations (recommend approving)

### Renegotiation #1 — No `users` table; actors are text strings

**Spec assumes:** `stage_owner FK -> users`, `stage_collaborators ARRAY<FK>`, `eligible_reviewers ARRAY<FK> users`.

**Atlas reality:** No users table. `kade_tasks_log.actor` is `text`. Building a users table now is a big lift Atlas hasn't needed for any other feature and would force backfill/sync with Slack identities, Kade bot identifiers, agent names, etc. — all of which currently work fine as strings.

**Proposed change:**
- `stage_owner text` (e.g., `'user:jimmie'`, `'user:ledger-bookkeeper-1'`)
- `stage_collaborators text[]`
- `eligible_reviewers text[] nullable`
- `actor_id` everywhere is a string like `'user:jimmie'` or `'agent:tally'`

**Why this is still Phase-2 safe:** the `agent_permissions` table keys on `actor_id text` exactly as spec'd — works whether it's a human or an agent. Adding users-as-rows later is still possible if we need profile data, but unnecessary for the stage system to function.

### Renegotiation #2 — Senior reviewer is a permission, not a boolean

**Spec assumes:** `users.senior_reviewer BOOLEAN`.

**Without a users table:** this lives in `agent_permissions` as a row like `(actor_id='user:jimmie', permission='can_sign_off_eom_review', scope=null)`. Adding Ledger later is `INSERT INTO agent_permissions ...`, same mechanism the spec already endorses for agents.

**Why this is better:** ONE permission system, not two. Larry+Jimmie reasoning about "who can do what" is unified.

### Renegotiation #3 — Per-actor tokens via an `actor_tokens` mapping table

**Spec assumes:** "every endpoint requires a bearer token… tokens are issued per actor."

**Atlas reality:** today there's one shared token. Replacing it with per-actor tokens out of the gate would force every existing caller (KadeBot, QBO worker, ingest scripts, the chat UI) to migrate simultaneously — way outside this build's scope.

**Proposed change:**
- New `actor_tokens(token_hash text PRIMARY KEY, actor_id text NOT NULL, created_at, revoked_at)` table.
- `requireAuth` checks the incoming bearer against `actor_tokens` first. If matched, resolves `actor_id` automatically. If not matched, falls back to the existing `ATLAS_INGEST_TOKEN` env var and stamps `actor_id = 'system:legacy'`.
- Phase 1 seed: one row for `'user:jimmie'` (his personal token), plus the legacy fallback for backward compat.
- Phase 2: Riv issues `'agent:tally'`, `'agent:ledger'`, etc. as additional rows.

**Why this is still Phase-2 safe:** the auth model is already actor-aware on day one. Every write to `stage_history` already stamps the resolved `actor_id`. Riv's Phase 2 work is purely additive — no refactor of the auth code.

---

## 3. Implementation plan

### Migration `010_client_stage_taxonomy`

**ALTER `clients`:**
```sql
ALTER TABLE clients
  ADD COLUMN stage text NOT NULL DEFAULT 'weekly',
  ADD COLUMN stage_entered_at timestamptz NOT NULL DEFAULT now(),
  ADD COLUMN stage_owner text,
  ADD COLUMN stage_collaborators text[] NOT NULL DEFAULT '{}',
  ADD COLUMN stage_notes text,
  ADD COLUMN flags text[] NOT NULL DEFAULT '{}',
  ADD COLUMN close_cadence text NOT NULL DEFAULT 'monthly',
  ADD COLUMN sales_tax_schedule text NOT NULL DEFAULT 'none',
  ADD COLUMN service_tier text NOT NULL DEFAULT 'recurring_only',
  ADD COLUMN returning_to_stage text,
  ADD COLUMN eligible_reviewers text[],
  ADD COLUMN active boolean NOT NULL DEFAULT true,
  ADD COLUMN version bigint NOT NULL DEFAULT 0;

-- Check constraints enforce enum semantics without ENUM types (easier to evolve)
ALTER TABLE clients ADD CONSTRAINT clients_stage_check
  CHECK (stage IN ('onboarding','cleanup','weekly','eom_close','eom_review',
                   'paused_client','paused_internal','offboarding'));
-- (analogous CHECK constraints for close_cadence, sales_tax_schedule, service_tier)
```

**New tables:**
```sql
CREATE TABLE stage_history (
  id bigserial PRIMARY KEY,
  client_id bigint NOT NULL REFERENCES clients(id),
  from_stage text,
  to_stage text,
  flag_added text,
  flag_removed text,
  occurred_at timestamptz NOT NULL DEFAULT now(),
  actor_type text NOT NULL,          -- 'human' | 'agent' | 'system'
  actor_id text NOT NULL,            -- 'user:jimmie', 'agent:tally', 'system:cron:stage_autoroll'
  channel text NOT NULL,             -- 'api' | 'slack' | 'nlp_relay' | 'cron' | 'event_webhook'
  on_behalf_of text,                 -- nullable
  triggered_by text,                 -- nullable, e.g. 'qbo_reconciliation_complete'
  trigger_type text NOT NULL,        -- 'auto_time' | 'auto_event' | 'manual' | 'manual_override' | 'agent_action' | 'rollback'
  reason text,
  payload jsonb NOT NULL DEFAULT '{}'::jsonb,
  external_id text                   -- for idempotency; UNIQUE partial index below
);
CREATE INDEX stage_history_client_idx ON stage_history (client_id, occurred_at DESC);
CREATE UNIQUE INDEX stage_history_external_id_uniq ON stage_history (external_id) WHERE external_id IS NOT NULL;

CREATE TABLE agent_permissions (
  id bigserial PRIMARY KEY,
  actor_id text NOT NULL,
  permission text NOT NULL,          -- 'can_read_clients', 'can_set_stage', 'can_set_flag',
                                     -- 'can_sign_off_eom_review', 'can_offboard', 'can_admin'
  scope jsonb,                       -- e.g. {"flags": ["client_blocking"]} or {"from":["eom_close"],"to":["eom_review"]}
  granted_at timestamptz NOT NULL DEFAULT now(),
  granted_by text NOT NULL,
  revoked_at timestamptz,
  notes text
);
CREATE INDEX agent_permissions_actor_idx ON agent_permissions (actor_id) WHERE revoked_at IS NULL;

CREATE TABLE actor_tokens (
  token_hash text PRIMARY KEY,       -- sha256 of the actual token; never store raw
  actor_id text NOT NULL,
  description text,
  created_at timestamptz NOT NULL DEFAULT now(),
  created_by text NOT NULL,
  revoked_at timestamptz,
  last_used_at timestamptz
);
CREATE INDEX actor_tokens_actor_idx ON actor_tokens (actor_id) WHERE revoked_at IS NULL;
```

**Phase 1 seed rows (in the same migration):**
```sql
INSERT INTO agent_permissions (actor_id, permission, granted_by) VALUES
  ('user:jimmie', 'can_admin', 'system:install'),
  ('user:jimmie', 'can_sign_off_eom_review', 'system:install');
-- Other bookkeepers (added by Larry/Jimmie at huddle): can_set_stage, can_set_flag
-- Agents (added by Riv in Phase 2): scoped permissions per the Phase 2 matrix
```

### API endpoints (TypeScript, Next.js route handlers)

| Endpoint | Required permission | Notes |
|----------|---------------------|-------|
| `GET /api/clients` (extend existing) | `can_read_clients` (implicit for all authed actors) | Add composable filters: `?stage=&flag=&owner=&age_days_gt=&stuck=&service_tier=` |
| `GET /api/clients/:id` (new) | `can_read_clients` | Returns client + most recent `stage_history` row |
| `PATCH /api/clients/:id/stage` (new) | `can_set_stage` (with scope check) | Body: `{to_stage, reason, triggered_by?, on_behalf_of?, expected_version}`. Optimistic lock via `version`. |
| `POST /api/clients/:id/flags/:flag` (new) | `can_set_flag` (with scope check) | Body: `{reason, triggered_by?, on_behalf_of?}` |
| `DELETE /api/clients/:id/flags/:flag` (new) | `can_set_flag` (with scope check) | Same |
| `GET /api/clients/:id/stage_history?limit=20` (new) | `can_read_clients` | Chronological |
| `POST /api/clients/:id/stage_history/:row_id/rollback` (new) | `can_admin` | Appends compensating row |
| `GET /api/agents/:actor_id/queue` (new) | `can_read_clients` | Derived view per spec §6.6 |

A shared `lib/stage-auth.ts` will:
1. Pull bearer from `Authorization: Bearer <token>` header
2. SHA-256 the token, look up in `actor_tokens` → resolve `actor_id`
3. Fall back to env `ATLAS_INGEST_TOKEN` → `actor_id = 'system:legacy'`
4. For the requested permission, scan `agent_permissions` for matching active rows
5. Apply scope JSON checks against the request (e.g., for `can_set_flag`, check that the targeted flag is in scope.flags if scope.flags is set)
6. 403 with the specific missing permission named; no silent failure

### Cron jobs (Node .mjs scripts, run by Task Scheduler)

- `scripts/cron_stage_autoroll_eom_open.mjs` — last business day, 6:00 AM CT
- `scripts/cron_flag_autoclear_advisory_due.mjs` — day 16 of month
- `scripts/cron_stage_stuck_detector.mjs` — hourly
- `scripts/cron_flag_sales_tax_due.mjs` — daily
- `scripts/cron_flag_seasonal.mjs` — daily

Each script:
- Uses `_pg.mjs` makeClient pattern (already established)
- Stamps `stage_history` with `actor_id='system:cron:<job_name>'`, `channel='cron'`, `trigger_type='auto_time'` or `'auto_event'`
- Idempotent (re-runnable safely)
- Logged to `cron-<job>.log` next to existing `auto-tag.log`

**Task Scheduler XML** for each job goes in `Team Inbox/Forge/cron-schedules/` for Riv (or me) to import.

### Slack `/stage` and `/flag` commands

**Not in this brief.** The Atlas REST endpoints are the deliverable; Riv adapts the existing `Team/Riv/kade_slack/` Python pattern to add the new commands. I'll write a `kade-style API contract doc` for Riv when the endpoints are live, matching the format I already used for Kade v0.2.

---

## 4. ETA breakdown

| Task | Estimate |
|------|----------|
| Migration 010 (schema + seed) | 4–6 hours |
| Auth/permission middleware (`lib/stage-auth.ts`) | 4 hours |
| REST endpoints (8 of them) | 8–10 hours |
| Cron scripts (5 of them) | 4–6 hours |
| Task Scheduler XML + runbook in `Team Inbox/Forge/` | 1–2 hours |
| Smoke tests (curl scripts in the runbook) | 2–3 hours |
| API contract doc for Riv (for Slack adapter) | 1 hour |
| **Total focused** | **24–32 hours** |

**Realistic calendar:** **3 working days** if I run uninterrupted; **4** with one round of Larry/Loom feedback. Spec said 2–3. I'm flagging that the auth + per-actor token work added work that wasn't visible in the spec.

---

## 5. Blockers requiring Larry decisions before I start

| # | Blocker | What I need |
|---|---------|-------------|
| 1 | **52 clients in the table, spec assumed 40** | Decision: (a) add `active` boolean (default true) and let Jimmie+Ledger mark inactives at the huddle; (b) only seed stage on a curated list of 40; (c) something else? **I lean (a)** — it's cheapest and reversible. I've assumed (a) in the schema above. |
| 2 | **Renegotiation #1, #2, #3 above** | Larry's ack on all three. Spec compatibility preserved. If any are denied, ETA grows. |
| 3 | **Initial seed beyond Jimmie** | The migration only seeds Jimmie's permissions. Who else needs `can_set_stage` / `can_set_flag` on day one? Bookkeeper names? I can also leave the migration Jimmie-only and let Larry add others via direct SQL or a small admin endpoint. |
| 4 | **Per-client `eligible_reviewers` initial values** | Spec says null = fall back to global `can_sign_off_eom_review` (Jimmie). I'll seed all clients with NULL — that means Jimmie signs off everywhere. Confirm? |

---

## 6. No other blockers I can see

- Atlas database is live and healthy (PM2 process up, migrations table consistent).
- No conflicts with in-flight work — Kade and Morning Rounds are settled; no overlapping schema changes pending.
- The `clients` table is read by 3 existing places (`api/clients/route.ts`, `api/clients/[slug]/...`, `api/triage/...`). None will break from additive columns. I'll skim them once more before merging migration 010.
- ATLAS_INGEST_TOKEN stays the way it is; existing callers (Kade bot, ingest scripts) continue to work via the `system:legacy` fallback. No breaking changes to other systems.

---

## 7. Recommended sequence once Larry signs off

1. Migration 010 lands (schema + Jimmie seed only).
2. I run a quick smoke test via psql + node REPL — confirm constraints, optimistic-lock, audit-log writes.
3. Build `lib/stage-auth.ts` + endpoints.
4. Build cron scripts.
5. Write API contract doc for Riv (`Team Inbox/Forge/client-stage-api-contract.md`).
6. Hand off to Larry for the seed huddle (Jimmie + Ledger assign starting stages to all active clients) and for the Cord brief to start the dashboard.
7. Hand off to Riv for Phase 2 (SDK, per-agent tokens, Slack `/stage` adapter, NLP relay through Kade bot).

---

## 8. Open question for Loom (not blocking — can answer during build)

The spec's stuck-detector SLAs are time-based (e.g., `eom_close > 10 days = stuck`). Should the detector also consider `client_blocking=true` and *suppress* the `stuck` flag in that case? "Stuck because we're slow" deserves different surfacing than "stuck because client won't respond." Cheap to add now; awkward to retrofit. Loom call.

---

**Awaiting Larry sign-off on blockers 1–4. Will not write migration code until acks land.**
