Loquent · AI Agent Prompt System Research · feat/agent-memory-learning-evolution-1559
Research + telemetry · read-only

The agent prompt system & what it costs

How mods/ai_agent assembles its prompt, what this worktree's memory & learning layer added, and — grounded in real production usage from loquent_latest — where the tokens go, how well the cache hits, what a turn actually costs, and where the spend is at scale.

Scope · contact-facing domain agents (Text Reply, web chat, follow-up) Model · anthropic/claude-sonnet-4.6 via OpenRouter + rig 0.38 Data · ai_usage_log + ai_thread_log, ~46 h window (Jun 12–14)
↳ Headline numbers (measured, not modelled):
  • A Text Reply turn sends ~14.2K input tokens / ~88 output per model call, ~2 calls per inbound message — and ~11–13K of that is the domain tool schemas, not the persona. The system prompt text is barely ~1K tokens.
  • Cache works: 67.6% aggregate hit on agent turns (92–99.5% on warm calls). The tool+system prefix caches; history is tiny, so the old "cache the history" lever is not the agent's problem — the toolset size and the 5-min cache TTL vs SMS cadence are.
  • A cost-accounting gap: Loquent's books charge cached reads at full input price (cached_per_1m = 0 for Sonnet), so recorded cost overstates true provider cost ~2.4× for the agent. True spend ≈ $0.02–0.05 per inbound message.

00TL;DR

Eight things to know — the first row is the new measured-cost picture.

01 · WHERE TOKENS GO

~14.2K input / 88 output per model call; ~2 calls per inbound message. ~11–13K is the tool schemas (the agent gets Family::all() via attach_domain_tools); persona+goals+core ≈ 1K.

02 · CACHE HIT

67.6% aggregate on agent turns, 92–99.5% on warm calls. The tool+system prefix caches under one ephemeral breakpoint; history is small so it isn't the lever.

03 · TRUE COST

≈ $0.02–0.05 per inbound message (cache-dependent). The agent feature spent $0.18 true over the window; the assistant is 93% of recorded spend ($4.46 true).

04 · BILLING GAP

Cost calc + credit billing charge cached reads at full $3/1M and drop cache-creation tokens — books overstate provider cost ~2.4× (agent) / ~1.5× (assistant).

05 · LEARNING

Now on by default. Active version summarized to ≤10 bullets, injected into the user turn (KV-safe seam), pinned per thread. Never in the cached prefix.

06 · MEMORY

Typed reconciling blocks read/written on-demand via tools — agent memory (update_my_memory) vs per-contact memory (update_contact_memory), routed by the platform-core contract. Not injected.

07 · NO IDENTITY

Still zero injection of user name, role, or org name. The only runtime identity resolved is the contact's. "Act as the assigned user" is a net-new layer (§8–9).

08 · CACHING vs ANTHROPIC

Static→dynamic ordering is correct; 1 of 4 breakpoints used; 5-min ephemeral TTL. Levers: trim the toolset, longer TTL, fix cost accounting.

01The call path

One agent turn, end to end. Everything is reconstructed from the DB each turn — the API is stateless and there is no message store (schema frozen in Phase 1).

enqueue event ──▶ claim active_idle→active_running ──▶ run_worker_loop │ ▼ run_ai_agent_thread() (one turn) src/mods/ai_agent/services/run_ai_agent_thread_service.rs ├─ load thread + agent + skills + invokable agents ├─ drain pending queue events ──────────────▶ the NEW user input for this turn ├─ derive runtime layers: │ • include_platform_core = attach_domain_tools && !assistant_flavor │ • org_rules / agent_rules = load_tier_0_5_rules(org, agent) + legacy config_payload │ • delivery_directive = f(send_mode, reply_ctx, open_escalations) ← only turn-varying tier │ • learning_summary = summarize(active|pinned learning version) ← user-turn layer │ • contact few-shot = top-k past draft corrections for (agent, contact) ├─ build_system_prompt(...) ───────────────▶ the CACHED static preamble ├─ build_chat_history(thread_id) ──────────▶ replayed prior turns (≤100 / source) ├─ user_message = [learning] + [few-shot] + [source-framed events] ├─ resolve model + openrouter_client() + completion_model().with_prompt_caching() ├─ rig_agent.chat( Message::user(user_message), &mut history ) ── AgentTurnModel records usage └─ ONE transaction: ai_usage_log + ai_thread_log + consume events + pin learning version
i

Two flavors share this executor

A domain agent (Text Reply, web chat, follow-up drafter) uses the assembled persona/goals prompt below. The assistant flavor (the per-user Loquent Assistant clone) swaps in a full Vernis turn whose preamble replaces the assembled prompt and binds the member's Session-scoped tools. This document is about the domain-agent path unless noted — that's the one that talks to contacts.

02System prompt anatomy

The preamble is a pure conditional assembler — the executor decides what's included and passes pre-rendered strings in. Ordering is deliberately static→dynamic for cache stability.

System prompt (= rig .preamble(), the cached prefix)
TIER 0Platform core identity · tool-action contract · operator-vs-contact source rules · "ground your replies, never make things up"conditional
TIER 0.5Organization rules operator-authored, org-wide — always empty today at the executor seam (forward-compat hook)when set
TIER 0.5Agent rules from ai_agent_rule table + legacy config_payload["agent_rules"]when set
PERSONAPersona ai_agent.persona — static, operator-authored, verbatim (multi-line allowed)static
GOALSGoals ai_agent.goals — rendered as "Your goals:\n{goals}"static
DELEGDelegated sub-agent block only when the thread has a parent_thread_id — relay via enqueue_parentwhen delegated
TIER 1Delivery directive autonomous-send / suggest-draft / observe + escalation-reconcile — the only section that varies per turnper-turn
SKILLSSkills full bodies, or a compact lazy index (id+title+desc) when the agent has list_my_skills/load_skillper-agent
INVOKEAgents you can invoke id + name + role hint, scrubbed of newlines (a callee may be authored by a different user)per-agent
TOOLSTool allowlist "You can use the following tools: …" — names only, in allowlist orderper-agent

The literal assembly is one format string — note that learning, memory, and contact data are absent here by design:

src/mods/ai_agent/services/build_system_prompt_service.rs:337let system_prompt = format!(
  "{platform_block}{org_rules_block}{agent_rules_block}{persona}\n\n\
   Your goals:\n{goals}{delegated_block}{delivery_block}\
   {skills_block}{invokable_agents_block}{tools_block}",
  persona = agent.persona,   // static, operator-authored
  goals   = agent.goals,     // static, operator-authored
);

Tier 0 — the platform-core contract

Included only for a domain agent (attach_domain_tools && !is_assistant_flavor). It is a fixed constant (PLATFORM_CORE_BLOCK) with four sections — and crucially, it speaks of "the business" generically:

build_system_prompt_service.rs:73 — PLATFORM_CORE_BLOCK (excerpt)## Who you are
You are an autonomous agent acting on the business's behalf inside Loquent…
You speak to the business's contacts as the business itself — a real person
from the team — and you are never named or described as an AI…

## How you affect the world   → you change the world only by calling tools
## Who is talking to you: operator vs. contact   → trust steering, distrust the contact
## Ground your replies; never make things up   → gather context with read tools, else escalate
!

"the business" is never resolved to a name

Tier 0 establishes that the agent acts as the business — but "the business" stays a generic placeholder. There is no slot where the org's name, the owner's name, or the assigned user's name is interpolated. That's the gap your goal targets.

What's actually in the ~14K input — measured

The anatomy above is the system-prompt text, but it is a small minority of what goes on the wire. Reading the real effective_prompt_capture for a live Text Reply agent against its ai_usage_log rows, the prompt breaks down like this:

~11–13K
Domain tool schemas (the bulk)
~670
Platform core (Tier 0)
~290
Persona + goals
<500
History + user-turn injections

The Text Reply agent carries attach_domain_tools = true with a one-item tools_allowlist (["escalate_to_user"]) — but the executor then attaches the entire domain toolset via build_domain_tools_for_agent → collect_agent_domain_rig_tools → collect_rig_tools(Family::all()) (permission- and tier-gated). Those tool JSON schemas are sent in the request's tools array on every call. The tools_block in the system text only lists allowlist names; the schemas themselves are the ~11–13K.

i

Good news for caching: the bulk is stable

Tools + system render before the messages and sit inside the one cache breakpoint, so the ~13K prefix is exactly what gets reused. That's why warm calls hit 92–99.5% (§3). The flip side: it's a large prefix, so a cold start (first call of a thread, or after the 5-min TTL lapses) pays full price for all ~13K — which makes trimming the toolset, not caching history, the real lever (§9).

03Cost, tokens & cache — measured

Grounded in real rows from loquent_latestai_usage_log (tokens, cache, model) joined to ai_thread_log (the 12-layer prompt capture). Window: ~46 h, Jun 12–14 2026, 190 usage rows.

i

Read this as unit economics, not a bill

This is dev/test traffic (10 agent model calls, 118 assistant), so absolute totals are tiny. The value is the per-unit shape — tokens per turn, cache hit ratio, cost per message — which is what scales. Projections at the end apply that shape to volume.

14,228
Avg input tokens / agent model call
88
Avg output tokens / agent model call
67.6%
Cache hit (agent, aggregate)
~2
Model calls per inbound message

Per-feature over the window

Two cost columns: naive = what Loquent records today (every input token at full $3/1M, cached reads included); true = the real provider cost with Anthropic's cache-read discount (fresh $3/1M, cached read $0.30/1M, output $15/1M).

FeatureModelCallsInput tokCachedCache %Naive $True $
assistant (Vernis)sonnet-4.61182,135,350829,81638.9%$6.70$4.46
agent (Text Reply)sonnet-4.610142,27796,17767.6%$0.44$0.18
extract_tasks_from_messagesdeepseek-v3.22531,0457,66524.7%~$0.009 (deepseek)
update_contact_memory_from_messagesdeepseek-v3.22530,3548,88029.3%~$0.009 (deepseek)
agent_learning_digestdeepseek-v3.223,189~$0.001 (deepseek)
dashboard_briefinggemini-3-flash23,30400%$0 — unpriced
!

The assistant, not the agent, dominates spend today

The internal Loquent Assistant is ~93% of recorded $ and has a worse cache ratio (38.9% vs 67.6%) — longer, more varied threads with page context. The contact-facing agent is cheap per turn; its cost is what scales with contact volume, the assistant's with member activity. Different problems, same executor.

Per-turn cache behavior — the cold/warm pattern

The 10 agent calls show the mechanism plainly. Calls with a cache read carry 92–99.5% of the prompt from cache; cold calls (first call of a thread, or after the prefix is invalidated / the TTL lapses) carry none and pay full price for the whole ~14K:

thread e9f7eb41 (3 inbound messages, 8 model calls) 09:47:00 call₁ in 14,083 cached — ← COLD (thread opens: writes the cache) 09:47:00 call₂ in 14,305 cached 13,737 96.0% ← warm (tool turn → text turn, same loop) 09:50:20 call… in ~14,200 cached ~13,800 95–99% ← warm (within the 5-min TTL) ×4 09:50:57 call₁ in 14,507 cached — ← cold again (prefix boundary moved) 09:50:57 call₂ in 14,733 cached 13,626 92.5% ← warm thread 6022b9f6 (clean burst) 10:09:43 call₁ in 13,844 cached — ← COLD 10:10:04 call₁ in 13,923 cached 13,506 97.0% ← warm (21 s later)

The aggregate 67.6% is the blend of 3 cold and 7 warm calls. The architectural fact that drives it: with_prompt_caching() uses an ephemeral (5-minute TTL) breakpoint. Within an agentic loop (tool turn → text turn, seconds apart) and within a burst of replies (<5 min) the prefix is warm; once replies space out past 5 minutes — the norm for SMS — the next message starts cold.

The cost-accounting gap

Three code facts compound so that the cache's real savings never reach Loquent's books:

  1. input_tokens logged is inclusive of the cached read (it's the total prompt, ~14K), and cached_tokens is the read subset (verified against the data: 14,305 total / 13,737 cached).
  2. In ai_pricing_type.rs, anthropic/claude-sonnet-4.6 falls through ..ModelPricing::ZEROcached_per_1m = 0. So calculate_cost = input × $3 + cached × $0 = the whole prompt at full price.
  3. Cache-creation (write) tokens are never stored — from_rig_usage logs only cached_input_tokens (read), and the table has no write column — so the 1.25× write premium on cold turns is invisible too.
src/mods/ai/types/ai_pricing_type.rs:47 — no cache rate for Sonnet(_, "anthropic/claude-sonnet-4.6") => ModelPricing {
  input_per_1m:  3.00,
  output_per_1m: 15.00,
  ..ModelPricing::ZERO        // cached_per_1m = 0.0  ← cache discount not modelled
}

Books overstate provider cost — and credit-billing is cache-blind too

Recorded cost vs true provider cost: agent $0.44 → $0.18 (2.4×), assistant $6.70 → $4.46 (1.5×). And the customer-facing credit path (record_billing_for_usage, tokens = input + output) is likewise cache-blind — customers are metered as if no cache existed. Conservative for margin, but inaccurate: the real money caching saves is neither shown in the cost tab nor passed through as a discount. (Plus gemini-3-flash-preview is unpriced → its rows record $0.)

Projected spend at scale

Applying the measured unit economics. Per inbound message ≈ 2 model calls; true provider cost depends entirely on whether the cache is warm:

RegimeWhenTrue $/inbound msg1K msg/day10K msg/day
warmreplies <5 min apart (in-burst)~$0.014~$420/mo~$4,200/mo
typicalmeasured blend (67.6% cache)~$0.036~$1,080/mo~$10,800/mo
coldreplies >5 min apart (real SMS cadence)~$0.05~$1,500/mo~$15,000/mo
naivewhat the books would record~$0.088~$2,640/mo~$26,400/mo

Two readings: (1) real SMS cadence pushes most messages toward the cold regime, so a longer cache TTL or a smaller prefix is worth real money at volume; (2) the gap between the typical and naive rows is the cache savings the accounting currently throws away. Output is negligible (~88 tok/call) — this is an input-bound workload, which is exactly the case prompt caching is built for.

04What this worktree changed since v1 of this doc

The first version of this artifact predates the memory/learning commits that landed on this branch. Here's what shipped and — the part that matters for cost — which layer each one touches: the cached prefix, the (uncached) user turn, or the toolset.

ChangeWhat it doesPrompt layer it touches
Learning on by defaultEvery agent now resolves & injects its active learning version (was opt-in).user turn — KV-safe, pinned per thread
Dynamic few-shot from correctionsTop-k past operator draft-edits for this (agent, contact) pair, rendered as examples.user turn — uncached by design
Typed reconciling memory blocksReplaced the legacy memory blob with labelled JSONB blocks (PersonaBelief / BusinessFacts / Preferences).tool-loaded — never injected
Memory routing (agent vs contact)Facts about one person → update_contact_memory; persona/business/global rules → update_my_memory. Platform core now states the routing rule.cached prefix (core) + tools
"Remember people" + contact memory live for all agentsAdds the contact-memory read/write tools to the contact-facing toolset.tools — adds to the cached ~13K prefix
contact_id surfaced in the user-turn frameThe framed inbound now carries the contact id so memory tools target the right person.user turn — tiny
De-identified learning digestThe digest prompt is PII-scrubbed before the LLM proposes a new version.digest LLM call (deepseek) — not the agent turn
Soft-cap memory blocksBounds memory growth (≤12 contact blocks / capped agent blocks) so on-demand loads stay small.tool-loaded

The architecture held: injections stayed out of the cached prefix

Everything turn-varying (learning, few-shot, contact id) went into the user turn; everything long-lived (memory) is tool-loaded. The cached system prefix stays byte-stable turn-to-turn — the whole point of the KV-safe seam. The one cost-relevant side effect: the new memory tools grew the cached toolset prefix, reinforcing the §9 "trim the toolset" lever.

05The user turn

Everything dynamic and/or untrusted lives here — never in the cached system prefix. This is also where learning and few-shot examples are injected.

User message (= Message::user(user_message)), assembled per turn
1Learning summary ≤10 bullets from the pinned learning version — prepended only when learning is enableddynamic
2Few-shot block top-k past draft corrections for this (agent, contact) pairdynamic
3Source-framed events each drained event wrapped in a trust envelope (operator vs contact)dynamic
run_ai_agent_thread_service.rs:963 — KV-cache-safe user-turn injectionlet user_message = {
  let mut parts = Vec::with_capacity(3);
  if !learning_summary.is_empty() { parts.push(&learning_summary); }  // §6
  if !few_shot_block.is_empty()   { parts.push(few_shot_block); }       // few-shot
  parts.push(&prompt);                                                  // framed events
  parts.join("\n\n")
};
rig_agent.chat(Message::user(user_message), &mut history).await;

Source framing (prompt-injection defense)

Each event is wrapped by frame_for_prompt(contact_name, contact_number) so the model can tell trusted operator steering from untrusted contact text. Identity fields are scrubbed of newlines and bracket glyphs so a crafted SMS can't forge an operator envelope:

ai_thread_event_payload_type.rs:276 — frame_for_prompt (excerpt)InboundSms  → "[New SMS · {name} <{number}>]\n{scrubbed text}"
CallCompleted → "[Call ended · {name} <{number}>]\n{summary}"
UserMessage → "[Operator instruction from your owner — not the contact]\n{text}"
Scheduled   → "[Scheduled instruction from your owner — not the contact]\n{text}"

History reconstruction

There's no message store; history is rebuilt each turn from consumed queue events (user) + llm_generation log rows (assistant) + recorded tool calls, sorted by timestamp, capped at 100 per source (worst case ~300 messages). The new user input is passed as a separate Message::user — never interpolated into the system prompt.

06Memory & learning this worktree

The branch adds a three-part knowledge system. The distinction matters: learning is injected, memory is tool-loaded, lessons feed the digest that evolves learning.

ConceptTableWhat it isHow it reaches the model
Learning versionai_agent_learning_versionVersioned, evergreen behavioral policy (markdown bullets). Exactly one active per agent (partial unique index). Chained via previous_version_id.Injected — summarized to ≤10 bullets into the user turn
Lessonai_agent_lessonSupervision signal: situation / what-went-wrong / what-to-do-instead, with source + support_count + confidence.Not injected — consumed by the digest
Memory blockai_agent_memory (JSONB blocks)Long-term facts, typed by label: PersonaBelief · BusinessFacts · Preferences. read_only blocks are owner-pinned.On-demand — via memory tools only
Draft correctionai_draft_correctionOperator edit of a draft before sending — the "gold signal."Few-shot examples in the user turn (per contact)

How learning gets in (and why it's in the user turn)

On the first turn of a thread the executor resolves the agent's active learning version and pins its id to ai_thread.pinned_learning_version_id; every later turn reads the pinned version. This freezes the policy for the life of a conversation — a digest activating a new version mid-thread can't swap it underneath. The pinned body is summarized and prepended to the user message:

run_ai_agent_thread_service.rs:615 — learning resolution (condensed)let learning_summary = if agent.enable_learning {
  match thread.pinned_learning_version_id {
    Some(pinned) => summarize_learning_version(&get_learning_version_body(pinned)?),  // later turns
    None => { pin_version_id = Some(v.id); summarize_learning_version(&v.body_markdown) } // first turn → pin
  }
} else { String::new() };

Why user-turn, not system prompt

Putting the (turn-varying) learning summary in the system prefix would invalidate the KV cache every time learning changed. Keeping it in the user turn lets the system prefix stay byte-stable. The summary is still mirrored into capture.learning for the debug timeline — observability only, not the injection mechanism.

Memory is tool-loaded, never injected

read_my_memory renders the typed blocks as labelled markdown on demand; update_my_memory applies a diff of add/update/delete ops (honoring read_only). There is also contact-scoped memory (read_contact_memory / update_contact_memory). The system-prompt builder explicitly leaves memory_snapshot: None — memory never enters the prompt automatically.

Evolution / digest loop

operator edits a draft / answers an escalation / turn dies │ (capture is UNCONDITIONAL — even when enable_learning is off) ▼ ai_agent_lesson (deduped, support_count++) │ learning_digest_poller_job (every minute) ▼ run_learning_digest() GATED on enable_learning ├─ load undigested lessons (≤100) + recent corrections (≤50) ├─ frequency gate: support_count ≥ 2 OR confidence ≥ 0.4 → promotable ├─ LLM proposes new body_markdown + change_summary + folded_lesson_ids ├─ guardrails: drop hallucinated ids · reject empty/unchanged body └─ insert new version (auto → active, or approval → pending) + prune to ≤20

Apply mode is org-level: auto activates immediately; approval lands a pending version an owner approves. Risk is bounded by the ≤20-version cap, the one-active-per-agent invariant, and the reward-hacking / hallucination guards. Shadow replay + regression fixtures exist as the validation surface.

Retrieval observability (#1566)

A metadata-only retrieval_context records which learning version, how many learning bullets, and how many few-shot examples informed a turn — stored in the audit capture, never injected, and carrying ids/counts only (no verbatim cross-contact example text).

07Prompt caching vs Anthropic guidance

Caching is enabled, genuinely cache-aware, and the §3 data confirms it works (67.6% / 92–99.5% warm). One correction from v1 of this doc: for the agent, the cached prefix is the toolset, not history — so the lever is its size and TTL, not "cache the history."

What's wired

src/mods/ai/rig/client.rs:38pub fn completion_model(client, model_id) -> openrouter::CompletionModel {
  client.completion_model(model_id).with_prompt_caching()
  // attaches cache_control: {"type":"ephemeral"} (5-min TTL) to the SYSTEM PROMPT
}

One breakpoint, on the system prompt. Anthropic's render order is tools → system → messages, and a breakpoint caches everything up to and including it — so the marker on the system prompt caches tools + system as one prefix (the ~13K from §2), but not the message tail. Provider path is OpenRouter → Anthropic; the marker is forwarded. Hits are observable — the streaming layer reconciles cached_input_tokens and cache_creation_input_tokens, and the read side is logged to ai_usage_log.

Anthropic best practiceStatus hereNotes (with measured evidence)
Stable content first, volatile lastFollowedStatic→dynamic tiers; the only per-turn system section (delivery directive) sits after the static identity
No silent invalidators in the prefixFollowedSystem prompt is deterministic; learning & few-shot live in the user turn — confirmed by 92–99.5% warm hits
Frozen system prompt; inject dynamic context laterFollowedThe user-turn-injection design for learning (#1560) + few-shot (#1562)
Deterministic tool set (tools→system→messages)FollowedThe ~13K toolset caches inside the system breakpoint — that's the bulk of the hit
Clears the minimum cacheable prefix (~1–2K)ConfirmedThe ~14K prompt is far above the floor — cache_creation fires and reads hit, per the data
Breakpoint on the latest turn for incremental multi-turn cacheMissingLow value for the agent (history is tiny, ≤7 events). Matters more for the assistant (longer threads, 38.9% cache)
Long-TTL cache for slow conversationsMissingOnly the 5-min ephemeral cache is used. SMS replies space out past 5 min → cold starts (§3)
Use up to 4 breakpoints1 of 4Room for a tools/system split or a history breakpoint
!

Correction from v1: the agent's lever is the toolset + TTL, not history

The first version of this doc called replayed history "the biggest token-cost lever." The data says otherwise for the agent: threads are short (≤7 events), so history is a rounding error and the cached prefix is dominated by the ~13K tool schemas. The two real levers are (1) shrink that prefix — the agent gets Family::all() but uses a handful of tools — so cold starts are cheaper, and (2) a longer cache TTL so the prefix survives realistic SMS cadence instead of going cold every message. History caching is a genuine win for the assistant path, not this one.

i

A "Who you represent" identity block would help, not hurt, caching

Because it varies per agent (not per turn), an identity block (§9) lands in the stable prefix and is cached on every warm call — near-zero marginal cost once warm, paid once at full price on a cold start. It adds to the same prefix the toolset already dominates, so trimming the toolset and adding identity are complementary, not competing.

08User & org data — the core question

Your original question: do we inject the current user (name, roles) and the organization's data so the agent can act as that person? For contact-facing agents the answer is still no — unchanged by this worktree. Here's the evidence and the one exception.

Data pointInjected into the domain-agent prompt?Where it lives / why not
Current/owner user nameNoai_agent.user_id exists but is never read into the prompt path
User roles / permissionsNoABAC roles drive API auth (and the domain-tool gate), never reach prompt assembly
Organization name / profileNoorg_rules is wired but ships empty (confirmed in the real captures); no org profile fields are read
Business hours / timezone / signatureNoNot modeled into the prompt at all
Assigned user for a given phone lineNoPhone binds to an agent (phone_number.ai_agent_id), not surfaced as a person in the prompt
Contact name + the line they textedYesResolved per turn, framed into the user turn (the audience, not the actor)
Persona & goals (generic "the business")YesStatic operator-authored text in ai_agent.persona / .goals

The grep that confirms it — over the assembler and the executor:

verified on this branch$ grep -n 'user.name|member.name|owner_name|organization.name|role|assigned_user|first_name' \
      build_system_prompt_service.rs run_ai_agent_thread_service.rs
(no matches)

The seed personas make the same point — they reference the business relationship but carry no identity. From the Follow-up Drafter seed:

migration/.../seed_followup_drafter_agent.rs — persona (excerpt)You are the Follow-up Drafter… You draft short, personalized SMS follow-up
messages sent to a contact on the user's behalf — the contact sees the
message as coming from the user's business, and you are never mentioned.

"the user's business" is a role, not a value — no name is ever substituted in.

The one exception: the assistant flavor

The internal Loquent Assistant (the in-app helper, not a contact-facing agent) does personalize. Its turn is assembled by assemble_assistant_turn, which rebuilds the owning member's Session and loads get_member_personalization(...). So the capability to thread a member's identity into a prompt already exists in the codebase — just not on the path that talks to contacts.

Domain agent (talks to contacts)

Static persona/goals + generic platform core. No member, org, or assigned-user identity. Knows only the contact it's replying to.

Assistant flavor (talks to the member)

Rebuilds the member Session, loads personalization, binds Session-gated CRM tools. Already "acts as" the member — a template to borrow from.

i

The closest thing to "from-identity" today

resolve_envelope_identity() resolves the contact's name and the from-line they texted so the reply lands on the right thread "as coming from the business." That references the business phone line, but still no business name or human identity. It's used only to frame the user turn, never the system prompt.

09Insights & levers

What the data says to do, in rough priority. The first three are cost/accuracy levers the telemetry surfaced; the rest are the identity/personalization directions for your original goal. Options, not a committed plan.

LEVER-1 · cost

Trim the contact-facing toolset (biggest cold-start lever)

The Text Reply agent gets the entire domain toolset via collect_rig_tools(Family::all()) — ~11–13K tokens of schemas — but a reply agent realistically uses a handful: contact read/find, the memory tools, the channel send/draft tool, escalate, maybe schedule. Cutting Family::all() to a focused reply family would shrink the cached prefix, and since the cold start pays full price for the whole prefix (and real SMS cadence makes most messages cold — §3), this directly cuts the dominant cost. Warm calls also get cheaper (smaller cached read). No behavior change beyond removing tools the agent shouldn't call anyway.

LEVER-2 · cost

Longer cache TTL to survive SMS cadence

The 5-min ephemeral breakpoint goes cold between spaced-out replies — the common case for SMS. Anthropic offers a 1-hour cache (higher write multiplier, far cheaper reads across the window). Converting cold messages to warm is worth ~3× per message at volume (the coldwarm rows in §3). Needs a check that rig 0.38's OpenRouter path can request the extended-TTL cache_control.

LEVER-3 · accuracy

Make cost accounting cache-aware

Three small fixes so the books match reality (§3): (a) set cached_per_1m for Sonnet (and subtract cached from the full-rate input, since input_tokens is inclusive) in ai_pricing_type.rs; (b) capture cache_creation_input_tokens (add a column + log it) so the 1.25× write premium is visible; (c) price google/gemini-3-flash-preview (currently unpriced → $0 rows). Decide separately whether credit-billing should pass the cache discount through to customers or keep it as margin.

LEVER-4 · identity

Add a "Who you represent" identity block (Tier 0.7)

A new static section after Tier 0.5, before persona — interpolating business name, owner/assigned-user display name, role/title, signature, hours, timezone, locale. It varies per agent, not per turn, so it lands in the cache-stable prefix at near-zero marginal cost and actually helps cacheability (more stable prefix bytes). This is the most direct answer to "act as the current user."

## Who you represent
You are messaging on behalf of {business_name}.
Your point of contact on the team is {owner_name} ({owner_title}).
Business hours: {hours} ({timezone}). Sign off as {signature} when appropriate.

Source fields from ai_agent.organization_id (org profile) and ai_agent.user_id (owner). The capability to resolve a member already exists in assemble_assistant_turn / get_member_personalization — reuse it on the domain path.

LEVER-5 · identity

Per-phone assigned user — static vs per-turn

"The user assigned to a given phone" is subtler. Phone numbers bind to an agent today (phone_number.ai_agent_id), and a thread can receive events across lines. Two shapes:

  • Per-agent default (recommended first) — resolve one owning/assigned user for the agent and put it in the Tier 0.7 static block. Cache-safe, simplest, covers the common one-line-per-agent case.
  • Per-line override — if a single agent fronts multiple lines with different assigned users, resolve the assigned user from the inbound line and inject it in the user turn (alongside the envelope), keeping the system prefix stable. More precise, slightly more plumbing.
LEVER-6 · identity

Populate the dormant org-rules hook

Tier 0.5 org_rules is wired through the assembler but ships empty at the executor seam (confirmed empty in the real captures). If org-wide identity/voice/policy belongs anywhere shared, this is the seam that already exists — no schema change to the prompt path, just a loader. It lands in the cached prefix, so it's free once warm.

LEVER-7 · assistant

Cache history — for the assistant, not the agent

A breakpoint on the latest turn lets replayed history accrue incremental cache hits. The data shows this is wasted on the agent (history ≤7 events) but promising for the assistant, whose lower 38.9% cache ratio comes from longer, page-context-varying threads. Scope it to that path. Gate on a rig 0.38 capability check for message-level cache_control placement through OpenRouter.

Two guardrails to respect for any identity injection

1. Trust boundary. Owner/org identity is trusted operator content (like persona/goals) and belongs in the system prefix; never let contact-supplied data masquerade as identity — keep the source-framing discipline. 2. Cache stability. Anything that varies per turn (per-line assigned user, time-of-day greeting) must go in the user turn, not the prefix, or it defeats the whole static-prefix design.

10File reference map

Where each piece lives, for the implementation session.

ConcernFile · symbol
System prompt assembler + Tier 0 constantai_agent/services/build_system_prompt_service.rs · build_system_prompt, PLATFORM_CORE_BLOCK (line 73), format string (line 337)
Per-turn executor (all wiring)ai_agent/services/run_ai_agent_thread_service.rs · run_ai_agent_thread (learning 615, user turn 963, model 822, chat 978, envelope 1643)
Tier 1 delivery directivessame file · CAPABILITY_*_BLOCK / OBSERVE_*_BLOCK consts + render_delivery_directive
Tier 0.5 rules loaderai_agent/services/load_tier_0_5_rules_service.rs
Source framing of the user turnai_agent/types/ai_thread_event_payload_type.rs · frame_for_prompt (line 276)
Prompt caching switchai/rig/client.rs · completion_model().with_prompt_caching() (line 38)
Usage / cache token reconciliationai/rig/streaming.rs (241) · ai/services/log_ai_usage_service.rs
Learning version resolve/summarize/pinget_active_learning_version_service.rs · summarize_learning_version_service.rs
Digest / evolution looprun_learning_digest_service.rs · jobs/learning_digest_poller_job.rs
Memory tools + reconciletools/{read_my,update_my,read_contact,update_contact}_memory_tool.rs · reconcile_memory_blocks_service.rs
Member personalization (assistant flavor — template to reuse)assistant/services/assemble_assistant_turn_service.rs · get_member_personalization