00 TL;DR
Eight things to know; the first four are the new measured-cost picture.
- Prompt size: ~14.2K input / 88 output tokens per model call; ~2 calls per inbound message. ~11–13K of the input is tool schemas (the agent gets Family::all() via attach_domain_tools); persona + goals + core ≈ 1K.
- Cache hit ratio: 67.6% aggregate on agent turns, 92–99.5% on warm calls. The tool + system prefix caches under one ephemeral breakpoint; history is small, so it isn't the lever.
- Cost: ≈ $0.02–0.05 per inbound message (cache-dependent). The agent feature spent $0.18 true over the window; the assistant is 93% of recorded spend ($4.46 true).
- Cost accounting: the cost calc and credit billing charge cached reads at the full $3/1M and drop cache-creation tokens, so the books overstate provider cost ~2.4× (agent) / ~1.5× (assistant).
- Learning: now on by default. The active version is summarized to ≤10 bullets, injected into the user turn (KV-safe seam), and pinned per thread. Never in the cached prefix.
- Memory: typed reconciling blocks, read and written on demand via tools. Agent memory (update_my_memory) and per-contact memory (update_contact_memory) are routed by the platform-core contract. Not injected.
- Identity: still zero injection of user name, role, or org name. The only runtime identity resolved is the contact's. "Act as the assigned user" is a net-new layer (§8–9).
- Caching design: static→dynamic ordering is correct; 1 of 4 breakpoints used; 5-min ephemeral TTL. Levers: trim the toolset, longer TTL, fix cost accounting.
01 The call path
One agent turn, end to end. Everything is reconstructed from the DB each turn — the API is stateless and there is no message store (schema frozen in Phase 1).
Two flavors share this executor
A domain agent (Text Reply, web chat, follow-up drafter) uses the assembled persona/goals prompt below. The assistant flavor (the per-user Loquent Assistant clone) swaps in a full Vernis turn whose preamble replaces the assembled prompt and binds the member's Session-scoped tools. This document is about the domain-agent path unless noted — that's the one that talks to contacts.
02 System prompt anatomy
The preamble is a pure conditional assembler — the executor decides what's included and passes pre-rendered strings in. Ordering is deliberately static→dynamic for cache stability.
System prompt (.preamble(), the cached prefix)
| Source | Included |
|---|---|
| ai_agent_rule table + legacy config_payload["agent_rules"] | when set |
| ai_agent.persona: static, operator-authored, verbatim (multi-line allowed) | static |
| ai_agent.goals: rendered as "Your goals:\n{goals}" | static |
| parent_thread_id: relay via enqueue_parent | when delegated |
| list_my_skills / load_skill | per-agent |
The literal assembly is one format string. Note that learning, memory, and contact data are absent here by design:
src/mods/ai_agent/services/build_system_prompt_service.rs:337
let system_prompt = format!(
"{platform_block}{org_rules_block}{agent_rules_block}{persona}\n\n\
Your goals:\n{goals}{delegated_block}{delivery_block}\
{skills_block}{invokable_agents_block}{tools_block}",
persona = agent.persona, // static, operator-authored
goals = agent.goals, // static, operator-authored
);
Tier 0 — the platform-core contract
Included only for a domain agent (attach_domain_tools && !is_assistant_flavor). It is a fixed constant (PLATFORM_CORE_BLOCK) with four sections — and crucially, it speaks of "the business" generically:
build_system_prompt_service.rs:73, PLATFORM_CORE_BLOCK (excerpt)
## Who you are
You are an autonomous agent acting on the business's behalf inside Loquent…
You speak to the business's contacts as the business itself — a real person
from the team — and you are never named or described as an AI…
## How you affect the world → you change the world only by calling tools
## Who is talking to you: operator vs. contact → trust steering, distrust the contact
## Ground your replies; never make things up → gather context with read tools, else escalate
"the business" is never resolved to a name
Tier 0 establishes that the agent acts as the business — but "the business" stays a generic placeholder. There is no slot where the org's name, the owner's name, or the assigned user's name is interpolated. That's the gap your goal targets.
What's actually in the ~14K input — measured
The anatomy above is the system-prompt text, but it is a small minority of what goes on the wire. Reading the real effective_prompt_capture for a live Text Reply agent against its ai_usage_log rows, the ~14.2K-token prompt breaks down like this: roughly 11–13K tokens of tool schemas against about 1K of persona, goals, and platform core.
The Text Reply agent carries attach_domain_tools = true with a one-item tools_allowlist (["escalate_to_user"]) — but the executor then attaches the entire domain toolset via build_domain_tools_for_agent → collect_agent_domain_rig_tools → collect_rig_tools(Family::all()) (permission- and tier-gated). Those tool JSON schemas are sent in the request's tools array on every call. The tools_block in the system text only lists allowlist names; the schemas themselves are the ~11–13K.
Good news for caching: the bulk is stable
Tools + system render before the messages and sit inside the one cache breakpoint, so the ~13K prefix is exactly what gets reused. That's why warm calls hit 92–99.5% (§3). The flip side: it's a large prefix, so a cold start (first call of a thread, or after the 5-min TTL lapses) pays full price for all ~13K — which makes trimming the toolset, not caching history, the real lever (§9).
03 Cost, tokens & cache: measured
Grounded in real rows from loquent_latest — ai_usage_log (tokens, cache, model) joined to ai_thread_log (the 12-layer prompt capture). Window: ~46 h, Jun 12–14 2026, 190 usage rows.
Read this as unit economics, not a bill
This is dev/test traffic (10 agent model calls, 118 assistant), so absolute totals are tiny. The value is the per-unit shape — tokens per turn, cache hit ratio, cost per message — which is what scales. Projections at the end apply that shape to volume.
Per-feature over the window
Two cost columns: naive = what Loquent records today (every input token at full $3/1M, cached reads included); true = the real provider cost with Anthropic's cache-read discount (fresh $3/1M, cached read $0.30/1M, output $15/1M).
| Feature | Model | Calls | Input tok | Cached | Cache % | Naive $ | True $ |
|---|---|---|---|---|---|---|---|
| assistant (Vernis) | sonnet-4.6 | 118 | 2,135,350 | 829,816 | 38.9% | $6.70 | $4.46 |
| agent (Text Reply) | sonnet-4.6 | 10 | 142,277 | 96,177 | 67.6% | $0.44 | $0.18 |
| extract_tasks_from_messages | deepseek-v3.2 | 25 | 31,045 | 7,665 | 24.7% | ~$0.009 (deepseek) | |
| update_contact_memory_from_messages | deepseek-v3.2 | 25 | 30,354 | 8,880 | 29.3% | ~$0.009 (deepseek) | |
| agent_learning_digest | deepseek-v3.2 | 2 | 3,189 | — | — | ~$0.001 (deepseek) | |
| dashboard_briefing | gemini-3-flash | 2 | 3,304 | 0 | 0% | $0 — unpriced | |
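The two cost columns can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 880 output tokens used in the check (10 calls × ~88) are an assumption, since the table does not list output tokens.

```rust
/// Sonnet prices quoted above, in USD per 1M tokens.
const FRESH_INPUT_PER_1M: f64 = 3.00;
const CACHED_READ_PER_1M: f64 = 0.30;
const OUTPUT_PER_1M: f64 = 15.00;

/// Naive column: every input token at the full rate, cached reads included.
pub fn naive_cost(input_tokens: u64, output_tokens: u64) -> f64 {
    (input_tokens as f64 * FRESH_INPUT_PER_1M + output_tokens as f64 * OUTPUT_PER_1M)
        / 1_000_000.0
}

/// True column: `input_tokens` is inclusive of the cached read, so the
/// cached subset is split out and charged at the discounted rate.
/// Cache-write premiums are ignored here, as in the table.
pub fn true_cost(input_tokens: u64, cached_tokens: u64, output_tokens: u64) -> f64 {
    let fresh = input_tokens.saturating_sub(cached_tokens);
    (fresh as f64 * FRESH_INPUT_PER_1M
        + cached_tokens as f64 * CACHED_READ_PER_1M
        + output_tokens as f64 * OUTPUT_PER_1M)
        / 1_000_000.0
}
```

With the agent row (142,277 input, 96,177 cached, an assumed 880 output) this gives about $0.44 naive and $0.18 true, matching the table.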
The assistant, not the agent, dominates spend today
The internal Loquent Assistant is ~93% of recorded $ and has a worse cache ratio (38.9% vs 67.6%) — longer, more varied threads with page context. The contact-facing agent is cheap per turn; its cost is what scales with contact volume, the assistant's with member activity. Different problems, same executor.
Per-turn cache behavior — the cold/warm pattern
The 10 agent calls show the mechanism plainly. Calls with a cache read carry 92–99.5% of the prompt from cache; cold calls (first call of a thread, or after the prefix is invalidated / the TTL lapses) carry none and pay full price for the whole ~14K.
The aggregate 67.6% is the blend of 3 cold and 7 warm calls. The architectural fact that drives it: with_prompt_caching() uses an ephemeral (5-minute TTL) breakpoint. Within an agentic loop (tool turn → text turn, seconds apart) and within a burst of replies (<5 min) the prefix is warm; once replies space out past 5 minutes — the norm for SMS — the next message starts cold.
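The cold/warm pattern can be modelled as a small classifier over call times. This is a simplified sketch, not the executor's logic: `CacheState` and `classify_calls` are hypothetical names, the prefix is assumed byte-identical across calls, and each call (hit or write) is assumed to restart the TTL window.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum CacheState {
    Cold,
    Warm,
}

/// Classifies each call as cold or warm. `call_times` are offsets from the
/// start of the thread, in ascending order.
pub fn classify_calls(call_times: &[Duration], ttl: Duration) -> Vec<CacheState> {
    let mut states = Vec::with_capacity(call_times.len());
    let mut last_call: Option<Duration> = None;
    for &t in call_times {
        let state = match last_call {
            // Warm only if the previous call happened within the TTL.
            Some(prev) if t.saturating_sub(prev) <= ttl => CacheState::Warm,
            // First call of the thread, or the TTL has lapsed.
            _ => CacheState::Cold,
        };
        states.push(state);
        // Assumption: every call restarts the TTL (a hit refreshes the
        // entry, a miss rewrites it).
        last_call = Some(t);
    }
    states
}
```

For a thread with calls at 0 s, 4 s, 120 s, 600 s, and 603 s under a 5-minute TTL, the first and fourth calls are cold and the rest are warm: the agentic loop stays warm, the spaced-out reply does not.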
The cost-accounting gap
Three code facts compound so that the cache's real savings never reach Loquent's books:
- input_tokens as logged is inclusive of the cached read (it's the total prompt, ~14K), and cached_tokens is the read subset (verified against the data: 14,305 total / 13,737 cached).
- In ai_pricing_type.rs, anthropic/claude-sonnet-4.6 falls through to ..ModelPricing::ZERO → cached_per_1m = 0. So calculate_cost = input × $3 + cached × $0 = the whole prompt at full price.
- Cache-creation (write) tokens are never stored: from_rig_usage logs only cached_input_tokens (read), and the table has no write column, so the 1.25× write premium on cold turns is invisible too.
src/mods/ai/types/ai_pricing_type.rs:47, no cache rate for Sonnet
(_, "anthropic/claude-sonnet-4.6") => ModelPricing {
input_per_1m: 3.00,
output_per_1m: 15.00,
..ModelPricing::ZERO // cached_per_1m = 0.0 ← cache discount not modelled
}
Books overstate provider cost — and credit-billing is cache-blind too
Recorded cost vs true provider cost: agent $0.44 → $0.18 (2.4×), assistant $6.70 → $4.46 (1.5×). And the customer-facing credit path (record_billing_for_usage, tokens = input + output) is likewise cache-blind — customers are metered as if no cache existed. Conservative for margin, but inaccurate: the real money caching saves is neither shown in the cost tab nor passed through as a discount. (Plus gemini-3-flash-preview is unpriced → its rows record $0.)
Projected spend at scale
Applying the measured unit economics. Per inbound message ≈ 2 model calls; true provider cost depends entirely on whether the cache is warm:
| Regime | When | True $/inbound msg | 1K msg/day | 10K msg/day |
|---|---|---|---|---|
| warm | replies <5 min apart (in-burst) | ~$0.014 | ~$420/mo | ~$4,200/mo |
| typical | measured blend (67.6% cache) | ~$0.036 | ~$1,080/mo | ~$10,800/mo |
| cold | replies >5 min apart (real SMS cadence) | ~$0.05 | ~$1,500/mo | ~$15,000/mo |
| naive | what the books would record | ~$0.088 | ~$2,640/mo | ~$26,400/mo |
Two readings: (1) real SMS cadence pushes most messages toward the cold regime, so a longer cache TTL or a smaller prefix is worth real money at volume; (2) the gap between the typical and naive rows is the cache savings the accounting currently throws away. Output is negligible (~88 tok/call) — this is an input-bound workload, which is exactly the case prompt caching is built for.
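The warm and cold rows can be approximated from the per-call numbers. The sketch below is a back-of-envelope model under stated assumptions, not the method used to build the table: a warm call is assumed to read 96% of its prompt from cache (13,737 / 14,305 from the sample call), a cold message is assumed to be one cold call followed by one warm call, cache-write premiums are ignored, and a month is 30 days. All names are hypothetical.

```rust
const INPUT_TOKENS_PER_CALL: f64 = 14_200.0;
const OUTPUT_TOKENS_PER_CALL: f64 = 88.0;
/// Assumption: share of the prompt read from cache on a warm call.
const WARM_CACHED_FRACTION: f64 = 0.96;

/// True provider cost of one model call at Sonnet rates (USD).
fn call_cost(cached_fraction: f64) -> f64 {
    let cached = INPUT_TOKENS_PER_CALL * cached_fraction;
    let fresh = INPUT_TOKENS_PER_CALL - cached;
    (fresh * 3.00 + cached * 0.30 + OUTPUT_TOKENS_PER_CALL * 15.00) / 1_000_000.0
}

/// Both calls of the message hit the cache.
pub fn warm_message_cost() -> f64 {
    2.0 * call_cost(WARM_CACHED_FRACTION)
}

/// First call pays full price; the follow-up call in the same loop is warm.
pub fn cold_message_cost() -> f64 {
    call_cost(0.0) + call_cost(WARM_CACHED_FRACTION)
}

/// Monthly spend for a given per-message cost and daily volume.
pub fn monthly_spend(cost_per_message: f64, messages_per_day: f64) -> f64 {
    cost_per_message * messages_per_day * 30.0
}
```

Under these assumptions a warm message comes to about $0.014 and a cold one to about $0.051, and `monthly_spend(0.036, 10_000.0)` reproduces the ~$10,800/mo figure of the typical row.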
04 What this worktree changed since v1 of this doc
The first version of this artifact predates the memory/learning commits that landed on this branch. Here's what shipped and — the part that matters for cost — which layer each one touches: the cached prefix, the (uncached) user turn, or the toolset.
| Change | What it does | Prompt layer it touches |
|---|---|---|
| Learning on by default | Every agent now resolves & injects its active learning version (was opt-in). | user turn — KV-safe, pinned per thread |
| Dynamic few-shot from corrections | Top-k past operator draft-edits for this (agent, contact) pair, rendered as examples. | user turn — uncached by design |
| Typed reconciling memory blocks | Replaced the legacy memory blob with labelled JSONB blocks (PersonaBelief / BusinessFacts / Preferences). | tool-loaded — never injected |
| Memory routing (agent vs contact) | Facts about one person → update_contact_memory; persona/business/global rules → update_my_memory. Platform core now states the routing rule. | cached prefix (core) + tools |
| "Remember people" + contact memory live for all agents | Adds the contact-memory read/write tools to the contact-facing toolset. | tools — adds to the cached ~13K prefix |
| contact_id surfaced in the user-turn frame | The framed inbound now carries the contact id so memory tools target the right person. | user turn — tiny |
| De-identified learning digest | The digest prompt is PII-scrubbed before the LLM proposes a new version. | digest LLM call (deepseek) — not the agent turn |
| Soft-cap memory blocks | Bounds memory growth (≤12 contact blocks / capped agent blocks) so on-demand loads stay small. | tool-loaded |
The architecture held: injections stayed out of the cached prefix
Everything turn-varying (learning, few-shot, contact id) went into the user turn; everything long-lived (memory) is tool-loaded. The cached system prefix stays byte-stable turn-to-turn — the whole point of the KV-safe seam. The one cost-relevant side effect: the new memory tools grew the cached toolset prefix, reinforcing the §9 "trim the toolset" lever.
05 The user turn
Everything dynamic and/or untrusted lives here — never in the cached system prefix. This is also where learning and few-shot examples are injected.
The user turn (Message::user(user_message)), assembled per turn
run_ai_agent_thread_service.rs:963, KV-cache-safe user-turn injection
let user_message = {
let mut parts = Vec::with_capacity(3);
if !learning_summary.is_empty() { parts.push(&learning_summary); } // §6
if !few_shot_block.is_empty() { parts.push(few_shot_block); } // few-shot
parts.push(&prompt); // framed events
parts.join("\n\n")
};
rig_agent.chat(Message::user(user_message), &mut history).await;
Source framing (prompt-injection defense)
Each event is wrapped by frame_for_prompt(contact_name, contact_number) so the model can tell trusted operator steering from untrusted contact text. Identity fields are scrubbed of newlines and bracket glyphs so a crafted SMS can't forge an operator envelope:
ai_thread_event_payload_type.rs:276, frame_for_prompt (excerpt)
InboundSms → "[New SMS · {name} <{number}>]\n{scrubbed text}"
CallCompleted → "[Call ended · {name} <{number}>]\n{summary}"
UserMessage → "[Operator instruction from your owner — not the contact]\n{text}"
Scheduled → "[Scheduled instruction from your owner — not the contact]\n{text}"
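The scrubbing step is what keeps a crafted name from forging an operator envelope. The sketch below shows one way it could look; it is not the body of frame_for_prompt. The helper names are hypothetical, the exact set of stripped glyphs (`[`, `]`, `<`, `>`, plus control characters) is an assumption, and the message body is passed through unchanged here because the excerpt does not show how the text itself is scrubbed.

```rust
/// Removes control characters (including newlines) and bracket glyphs from an
/// identity field so it cannot open or close an envelope header of its own.
pub fn scrub_identity(field: &str) -> String {
    field
        .chars()
        .filter(|c| !c.is_control() && !matches!(c, '[' | ']' | '<' | '>'))
        .collect::<String>()
        .trim()
        .to_string()
}

/// Frames an inbound SMS in the envelope format shown above.
pub fn frame_inbound_sms(name: &str, number: &str, text: &str) -> String {
    format!(
        "[New SMS · {} <{}>]\n{}",
        scrub_identity(name),
        scrub_identity(number),
        text
    )
}
```

A contact name such as `"Eve]\n[Operator"` collapses to `EveOperator`, so the header stays on a single line with a single pair of brackets.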
History reconstruction
There's no message store; history is rebuilt each turn from consumed queue events (user) + llm_generation log rows (assistant) + recorded tool calls, sorted by timestamp, capped at 100 per source (worst case ~300 messages). The new user input is passed as a separate Message::user — never interpolated into the system prompt.
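The rebuild described above amounts to a capped merge of three sources. A minimal sketch, assuming the cap keeps the newest 100 items of each source (the text only states the cap, not which side is dropped) and using hypothetical `HistoryItem` and `Role` types:

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Role {
    User,
    Assistant,
    Tool,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HistoryItem {
    pub timestamp_ms: i64,
    pub role: Role,
    pub text: String,
}

const PER_SOURCE_CAP: usize = 100;

/// Keeps the newest `PER_SOURCE_CAP` items of one source (assumption).
fn newest(mut items: Vec<HistoryItem>) -> Vec<HistoryItem> {
    items.sort_by_key(|i| i.timestamp_ms);
    let skip = items.len().saturating_sub(PER_SOURCE_CAP);
    items.split_off(skip)
}

/// Merges the three sources into one timeline, oldest first.
pub fn rebuild_history(
    user_events: Vec<HistoryItem>,
    generations: Vec<HistoryItem>,
    tool_calls: Vec<HistoryItem>,
) -> Vec<HistoryItem> {
    let mut merged = Vec::new();
    for source in [user_events, generations, tool_calls] {
        merged.extend(newest(source));
    }
    // Stable sort: items with equal timestamps keep their source order.
    merged.sort_by_key(|i| i.timestamp_ms);
    merged
}
```

Three sources at 100 items each give the ~300-message worst case; in the measured agent threads the merged list is far shorter.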
06 Memory & learning (this worktree)
The branch adds a three-part knowledge system. The distinction matters: learning is injected, memory is tool-loaded, lessons feed the digest that evolves learning.
| Concept | Table | What it is | How it reaches the model |
|---|---|---|---|
| Learning version | ai_agent_learning_version | Versioned, evergreen behavioral policy (markdown bullets). Exactly one active per agent (partial unique index). Chained via previous_version_id. | Injected — summarized to ≤10 bullets into the user turn |
| Lesson | ai_agent_lesson | Supervision signal: situation / what-went-wrong / what-to-do-instead, with source + support_count + confidence. | Not injected — consumed by the digest |
| Memory block | ai_agent_memory (JSONB blocks) | Long-term facts, typed by label: PersonaBelief · BusinessFacts · Preferences. read_only blocks are owner-pinned. | On-demand — via memory tools only |
| Draft correction | ai_draft_correction | Operator edit of a draft before sending — the "gold signal." | Few-shot examples in the user turn (per contact) |
How learning gets in (and why it's in the user turn)
On the first turn of a thread the executor resolves the agent's active learning version and pins its id to ai_thread.pinned_learning_version_id; every later turn reads the pinned version. This freezes the policy for the life of a conversation — a digest activating a new version mid-thread can't swap it underneath. The pinned body is summarized and prepended to the user message:
run_ai_agent_thread_service.rs:615, learning resolution (condensed)
let learning_summary = if agent.enable_learning {
match thread.pinned_learning_version_id {
Some(pinned) => summarize_learning_version(&get_learning_version_body(pinned)?), // later turns
None => { pin_version_id = Some(v.id); summarize_learning_version(&v.body_markdown) } // first turn → pin
}
} else { String::new() };
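The ≤10-bullet cap could be enforced deterministically. The sketch below is one plausible reading, not the body of summarize_learning_version, whose implementation is not shown here: it assumes the version body is markdown with `- ` or `* ` bullets and simply keeps the first ten. The function name is hypothetical.

```rust
const MAX_BULLETS: usize = 10;

/// Keeps at most `MAX_BULLETS` bullet lines from a markdown body and drops
/// everything else (headings, prose, blank lines).
pub fn summarize_bullets(body_markdown: &str) -> String {
    body_markdown
        .lines()
        .map(str::trim)
        .filter(|line| line.starts_with("- ") || line.starts_with("* "))
        .take(MAX_BULLETS)
        .collect::<Vec<_>>()
        .join("\n")
}
```

A pure function like this returns the same summary for the same pinned version on every turn, which is what per-thread pinning relies on.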
Why user-turn, not system prompt
Putting the (turn-varying) learning summary in the system prefix would invalidate the KV cache every time learning changed. Keeping it in the user turn lets the system prefix stay byte-stable. The summary is still mirrored into capture.learning for the debug timeline — observability only, not the injection mechanism.
Memory is tool-loaded, never injected
read_my_memory renders the typed blocks as labelled markdown on demand; update_my_memory applies a diff of add/update/delete ops (honoring read_only). There is also contact-scoped memory (read_contact_memory / update_contact_memory). The system-prompt builder explicitly leaves memory_snapshot: None — memory never enters the prompt automatically.
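The diff-of-ops model with read_only protection can be sketched as follows. This is an illustration under assumptions, not the reconcile service: blocks are keyed by a hypothetical `id`, ops that cannot be applied are collected and returned rather than aborting the batch, and the soft cap is left out.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MemoryBlock {
    pub id: String, // hypothetical key
    pub value: String,
    pub read_only: bool, // owner-pinned blocks cannot be changed by the agent
}

pub enum MemoryOp {
    Add { id: String, value: String },
    Update { id: String, value: String },
    Delete { id: String },
}

#[derive(Debug, PartialEq, Eq)]
pub enum OpError {
    ReadOnly(String),
    NotFound(String),
    AlreadyExists(String),
}

/// Applies ops in order and returns the ones that were rejected.
pub fn apply_ops(blocks: &mut Vec<MemoryBlock>, ops: Vec<MemoryOp>) -> Vec<OpError> {
    let mut rejected = Vec::new();
    for op in ops {
        match op {
            MemoryOp::Add { id, value } => {
                if blocks.iter().any(|b| b.id == id) {
                    rejected.push(OpError::AlreadyExists(id));
                } else {
                    blocks.push(MemoryBlock { id, value, read_only: false });
                }
            }
            MemoryOp::Update { id, value } => {
                match blocks.iter_mut().find(|b| b.id == id) {
                    Some(b) if b.read_only => rejected.push(OpError::ReadOnly(id)),
                    Some(b) => b.value = value,
                    None => rejected.push(OpError::NotFound(id)),
                }
            }
            MemoryOp::Delete { id } => {
                let pos = blocks.iter().position(|b| b.id == id);
                match pos {
                    Some(i) if blocks[i].read_only => rejected.push(OpError::ReadOnly(id)),
                    Some(i) => {
                        blocks.remove(i);
                    }
                    None => rejected.push(OpError::NotFound(id)),
                }
            }
        }
    }
    rejected
}
```

Returning the rejected ops gives the tool something concrete to report back to the model, so a blocked write to an owner-pinned block does not fail silently.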
Evolution / digest loop
Apply mode is org-level: auto activates immediately; approval lands a pending version an owner approves. Risk is bounded by the ≤20-version cap, the one-active-per-agent invariant, and the reward-hacking / hallucination guards. Shadow replay + regression fixtures exist as the validation surface.
Retrieval observability (#1566)
A metadata-only retrieval_context records which learning version, how many learning bullets, and how many few-shot examples informed a turn — stored in the audit capture, never injected, and carrying ids/counts only (no verbatim cross-contact example text).
07 Prompt caching vs Anthropic guidance
Caching is enabled, genuinely cache-aware, and the §3 data confirms it works (67.6% / 92–99.5% warm). One correction from v1 of this doc: for the agent, the cached prefix is the toolset, not history — so the lever is its size and TTL, not "cache the history."
What's wired
src/mods/ai/rig/client.rs:38
pub fn completion_model(client, model_id) -> openrouter::CompletionModel {
client.completion_model(model_id).with_prompt_caching()
// attaches cache_control: {"type":"ephemeral"} (5-min TTL) to the SYSTEM PROMPT
}
One breakpoint, on the system prompt. Anthropic's render order is tools → system → messages, and a breakpoint caches everything up to and including it — so the marker on the system prompt caches tools + system as one prefix (the ~13K from §2), but not the message tail. Provider path is OpenRouter → Anthropic; the marker is forwarded. Hits are observable — the streaming layer reconciles cached_input_tokens and cache_creation_input_tokens, and the read side is logged to ai_usage_log.
| Anthropic best practice | Status here | Notes (with measured evidence) |
|---|---|---|
| Stable content first, volatile last | Followed | Static→dynamic tiers; the only per-turn system section (delivery directive) sits after the static identity |
| No silent invalidators in the prefix | Followed | System prompt is deterministic; learning & few-shot live in the user turn — confirmed by 92–99.5% warm hits |
| Frozen system prompt; inject dynamic context later | Followed | The user-turn-injection design for learning (#1560) + few-shot (#1562) |
| Deterministic tool set (tools→system→messages) | Followed | The ~13K toolset caches inside the system breakpoint — that's the bulk of the hit |
| Clears the minimum cacheable prefix (~1–2K) | Confirmed | The ~14K prompt is far above the floor — cache_creation fires and reads hit, per the data |
| Breakpoint on the latest turn for incremental multi-turn cache | Missing | Low value for the agent (history is tiny, ≤7 events). Matters more for the assistant (longer threads, 38.9% cache) |
| Long-TTL cache for slow conversations | Missing | Only the 5-min ephemeral cache is used. SMS replies space out past 5 min → cold starts (§3) |
| Use up to 4 breakpoints | 1 of 4 | Room for a tools/system split or a history breakpoint |
Correction from v1: the agent's lever is the toolset + TTL, not history
The first version of this doc called replayed history "the biggest token-cost lever." The data says otherwise for the agent: threads are short (≤7 events), so history is a rounding error and the cached prefix is dominated by the ~13K tool schemas. The two real levers are (1) shrink that prefix — the agent gets Family::all() but uses a handful of tools — so cold starts are cheaper, and (2) a longer cache TTL so the prefix survives realistic SMS cadence instead of going cold every message. History caching is a genuine win for the assistant path, not this one.
A "Who you represent" identity block would help, not hurt, caching
Because it varies per agent (not per turn), an identity block (§9) lands in the stable prefix and is cached on every warm call — near-zero marginal cost once warm, paid once at full price on a cold start. It adds to the same prefix the toolset already dominates, so trimming the toolset and adding identity are complementary, not competing.
08 User & org data: the core question
Your original question: do we inject the current user (name, roles) and the organization's data so the agent can act as that person? For contact-facing agents the answer is still no — unchanged by this worktree. Here's the evidence and the one exception.
| Data point | Injected into the domain-agent prompt? | Where it lives / why not |
|---|---|---|
| Current/owner user name | No | ai_agent.user_id exists but is never read into the prompt path |
| User roles / permissions | No | ABAC roles drive API auth (and the domain-tool gate), never reach prompt assembly |
| Organization name / profile | No | org_rules is wired but ships empty (confirmed in the real captures); no org profile fields are read |
| Business hours / timezone / signature | No | Not modeled into the prompt at all |
| Assigned user for a given phone line | No | Phone binds to an agent (phone_number.ai_agent_id), not surfaced as a person in the prompt |
| Contact name + the line they texted | Yes | Resolved per turn, framed into the user turn (the audience, not the actor) |
| Persona & goals (generic "the business") | Yes | Static operator-authored text in ai_agent.persona / .goals |
The grep that confirms it — over the assembler and the executor:
verified on this branch
$ grep -n 'user.name|member.name|owner_name|organization.name|role|assigned_user|first_name' \
build_system_prompt_service.rs run_ai_agent_thread_service.rs
(no matches)
The seed personas make the same point — they reference the business relationship but carry no identity. From the Follow-up Drafter seed:
migration/.../seed_followup_drafter_agent.rs, persona (excerpt)
You are the Follow-up Drafter… You draft short, personalized SMS follow-up
messages sent to a contact on the user's behalf — the contact sees the
message as coming from the user's business, and you are never mentioned.
"the user's business" is a role, not a value — no name is ever substituted in.
The one exception: the assistant flavor
The internal Loquent Assistant (the in-app helper, not a contact-facing agent) does personalize. Its turn is assembled by assemble_assistant_turn, which rebuilds the owning member's Session and loads get_member_personalization(...). So the capability to thread a member's identity into a prompt already exists in the codebase — just not on the path that talks to contacts.
Domain agent (talks to contacts)
Static persona/goals + generic platform core. No member, org, or assigned-user identity. Knows only the contact it's replying to.
Assistant flavor (talks to the member)
Rebuilds the member Session, loads personalization, binds Session-gated CRM tools. Already "acts as" the member — a template to borrow from.
The closest thing to "from-identity" today
resolve_envelope_identity() resolves the contact's name and the from-line they texted so the reply lands on the right thread "as coming from the business." That references the business phone line, but still no business name or human identity. It's used only to frame the user turn, never the system prompt.
09 Insights & levers
What the data says to do, in rough priority. The first three are cost/accuracy levers the telemetry surfaced; the rest are the identity/personalization directions for your original goal. Options, not a committed plan.
Trim the contact-facing toolset (biggest cold-start lever)
The Text Reply agent gets the entire domain toolset via collect_rig_tools(Family::all()) — ~11–13K tokens of schemas — but a reply agent realistically uses a handful: contact read/find, the memory tools, the channel send/draft tool, escalate, maybe schedule. Cutting Family::all() to a focused reply family would shrink the cached prefix, and since the cold start pays full price for the whole prefix (and real SMS cadence makes most messages cold — §3), this directly cuts the dominant cost. Warm calls also get cheaper (smaller cached read). No behavior change beyond removing tools the agent shouldn't call anyway.
Longer cache TTL to survive SMS cadence
The 5-min ephemeral breakpoint goes cold between spaced-out replies, which is the common case for SMS. Anthropic offers a 1-hour cache (higher write multiplier, far cheaper reads across the window). Converting a cold message to a warm one cuts its cost roughly 3.5× (~$0.05 cold vs ~$0.014 warm in the §3 projection). Needs a check that rig 0.38's OpenRouter path can request the extended-TTL cache_control.
Make cost accounting cache-aware
Three small fixes so the books match reality (§3): (a) set cached_per_1m for Sonnet (and subtract cached from the full-rate input, since input_tokens is inclusive) in ai_pricing_type.rs; (b) capture cache_creation_input_tokens (add a column + log it) so the 1.25× write premium is visible; (c) price google/gemini-3-flash-preview (currently unpriced → $0 rows). Decide separately whether credit-billing should pass the cache discount through to customers or keep it as margin.
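Fixes (a) and (b) together could look like the sketch below. It is not the existing ModelPricing or calculate_cost: the `cache_write_per_1m` field and the `cache_creation_tokens` column are hypothetical additions, the $3.75 write rate is the 1.25× premium applied to $3, and `input_tokens` is assumed to be inclusive of both cached reads and cache writes (the doc confirms this only for reads).

```rust
#[derive(Debug, Clone, Copy)]
pub struct ModelPricing {
    pub input_per_1m: f64,
    pub cached_per_1m: f64,
    pub cache_write_per_1m: f64, // hypothetical field
    pub output_per_1m: f64,
}

pub struct Usage {
    pub input_tokens: u64,          // total prompt, inclusive of cache traffic
    pub cached_tokens: u64,         // cache-read subset
    pub cache_creation_tokens: u64, // hypothetical new column
    pub output_tokens: u64,
}

pub const SONNET: ModelPricing = ModelPricing {
    input_per_1m: 3.00,
    cached_per_1m: 0.30,
    cache_write_per_1m: 3.75, // 1.25 x the fresh input rate
    output_per_1m: 15.00,
};

/// Splits the inclusive input count into fresh, cached-read, and cache-write
/// tokens, then prices each bucket at its own rate.
pub fn calculate_cost(p: &ModelPricing, u: &Usage) -> f64 {
    let cached = u.cached_tokens.min(u.input_tokens);
    let written = u.cache_creation_tokens.min(u.input_tokens - cached);
    let fresh = u.input_tokens - cached - written;
    (fresh as f64 * p.input_per_1m
        + cached as f64 * p.cached_per_1m
        + written as f64 * p.cache_write_per_1m
        + u.output_tokens as f64 * p.output_per_1m)
        / 1_000_000.0
}
```

For the sample warm call (14,305 input, 13,737 cached, 88 output) this yields about $0.0071. A cold call that writes an illustrative 13,700-token prefix costs about $0.054, which is the premium the current books cannot see.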
Add a "Who you represent" identity block (Tier 0.7)
A new static section after Tier 0.5, before persona — interpolating business name, owner/assigned-user display name, role/title, signature, hours, timezone, locale. It varies per agent, not per turn, so it lands in the cache-stable prefix at near-zero marginal cost and actually helps cacheability (more stable prefix bytes). This is the most direct answer to "act as the current user."
## Who you represent
You are messaging on behalf of {business_name}.
Your point of contact on the team is {owner_name} ({owner_title}).
Business hours: {hours} ({timezone}). Sign off as {signature} when appropriate.
Source fields from ai_agent.organization_id (org profile) and ai_agent.user_id (owner). The capability to resolve a member already exists in assemble_assistant_turn / get_member_personalization — reuse it on the domain path.
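A renderer for this block might look like the following sketch. The `Identity` struct and `render_identity_block` are hypothetical, the optional fields are an assumption (an org may not have hours or a signature set), and each sentence is put on its own line for simplicity. The function is pure so the same agent always produces the same bytes, which is what keeps the block inside the cached prefix.

```rust
pub struct Identity<'a> {
    pub business_name: &'a str,
    pub owner: Option<(&'a str, &'a str)>, // (display name, title)
    pub hours: Option<(&'a str, &'a str)>, // (hours, timezone)
    pub signature: Option<&'a str>,
}

/// Renders the static "Who you represent" section. Missing fields are
/// skipped rather than rendered as empty placeholders.
pub fn render_identity_block(id: &Identity<'_>) -> String {
    let mut out = String::from("## Who you represent\n");
    out.push_str(&format!(
        "You are messaging on behalf of {}.\n",
        id.business_name
    ));
    if let Some((name, title)) = id.owner {
        out.push_str(&format!(
            "Your point of contact on the team is {name} ({title}).\n"
        ));
    }
    if let Some((hours, tz)) = id.hours {
        out.push_str(&format!("Business hours: {hours} ({tz}).\n"));
    }
    if let Some(sig) = id.signature {
        out.push_str(&format!("Sign off as {sig} when appropriate.\n"));
    }
    out
}
```

Because identity here is trusted operator content, the values would come from the org and owner records, never from anything the contact supplied.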
Per-phone assigned user — static vs per-turn
"The user assigned to a given phone" is subtler. Phone numbers bind to an agent today (phone_number.ai_agent_id), and a thread can receive events across lines. Two shapes:
- Per-agent default (recommended first) — resolve one owning/assigned user for the agent and put it in the Tier 0.7 static block. Cache-safe, simplest, covers the common one-line-per-agent case.
- Per-line override — if a single agent fronts multiple lines with different assigned users, resolve the assigned user from the inbound line and inject it in the user turn (alongside the envelope), keeping the system prefix stable. More precise, slightly more plumbing.
Populate the dormant org-rules hook
Tier 0.5 org_rules is wired through the assembler but ships empty at the executor seam (confirmed empty in the real captures). If org-wide identity/voice/policy belongs anywhere shared, this is the seam that already exists — no schema change to the prompt path, just a loader. It lands in the cached prefix, so it's free once warm.
Cache history — for the assistant, not the agent
A breakpoint on the latest turn lets replayed history accrue incremental cache hits. The data shows this is wasted on the agent (history ≤7 events) but promising for the assistant, whose lower 38.9% cache ratio comes from longer, page-context-varying threads. Scope it to that path. Gate on a rig 0.38 capability check for message-level cache_control placement through OpenRouter.
Two guardrails to respect for any identity injection
1. Trust boundary. Owner/org identity is trusted operator content (like persona/goals) and belongs in the system prefix; never let contact-supplied data masquerade as identity, and keep the source-framing discipline.
2. Cache stability. Anything that varies per turn (per-line assigned user, time-of-day greeting) must go in the user turn, not the prefix, or it defeats the whole static-prefix design.
10 File reference map
Where each piece lives, for the implementation session.
| Concern | File · symbol |
|---|---|
| System prompt assembler + Tier 0 constant | ai_agent/services/build_system_prompt_service.rs · build_system_prompt, PLATFORM_CORE_BLOCK (line 73), format string (line 337) |
| Per-turn executor (all wiring) | ai_agent/services/run_ai_agent_thread_service.rs · run_ai_agent_thread (learning 615, user turn 963, model 822, chat 978, envelope 1643) |
| Tier 1 delivery directives | same file · CAPABILITY_*_BLOCK / OBSERVE_*_BLOCK consts + render_delivery_directive |
| Tier 0.5 rules loader | ai_agent/services/load_tier_0_5_rules_service.rs |
| Source framing of the user turn | ai_agent/types/ai_thread_event_payload_type.rs · frame_for_prompt (line 276) |
| Prompt caching switch | ai/rig/client.rs · completion_model().with_prompt_caching() (line 38) |
| Usage / cache token reconciliation | ai/rig/streaming.rs (241) · ai/services/log_ai_usage_service.rs |
| Learning version resolve/summarize/pin | get_active_learning_version_service.rs · summarize_learning_version_service.rs |
| Digest / evolution loop | run_learning_digest_service.rs · jobs/learning_digest_poller_job.rs |
| Memory tools + reconcile | tools/{read_my,update_my,read_contact,update_contact}_memory_tool.rs · reconcile_memory_blocks_service.rs |
| Member personalization (assistant flavor — template to reuse) | assistant/services/assemble_assistant_turn_service.rs · get_member_personalization |