Building fast and scalable LLM interactions with FSM-inspired prompt engineering

24 Jul 2025

This post is for teams building real-time systems such as support assistants, sales qualifiers, or voice bots, where first-token latency in the 350–700 ms range and stable behavior are non-negotiable.

The challenge: Real-time dialogue without losing context

The core problem is to maintain a long-running, adaptive dialogue while:

  • Responding quickly enough for interactive UX and speech turn-taking
  • Preserving relevant context without paying to resend entire histories
  • Maintaining guardrails so the model stays on task and doesn’t drift

Most approaches struggle on at least one dimension:

  • Monolithic prompts that include the entire history inflate tokens, increase latency, and often degrade into repetition or hallucinations.
  • Generic agent frameworks can be flexible but add orchestration overhead, latency from planning loops, and harder-to-debug behavior.
  • Memory heuristics and RAG everywhere help, but naive retrieval each turn still bloats prompts and can pull in irrelevant context.

We needed something lightweight, explicit, and fast under real-world load.

Our solution: FSM-inspired prompt architecture

We structure the interaction as a finite set of states with explicit transitions, then design small, focused prompt templates per state. The FSM provides the backbone; the LLM provides local reasoning within a state. The orchestrator controls which state the conversation is in and composes only the minimal context needed for that state.

Key ideas:

  • The FSM governs flow. Allowed actions and next states are explicit and observable.
  • Each state has a prompt template dynamically filled with just-in-time context: user input, validated slots, tool results, policy notes, and meta-instructions.
  • The LLM reasons within a bounded frame and produces both user-visible text and machine-readable signals for the orchestrator.

Example state machine (customer support triage)

stateDiagram-v2
[*] --> Greet
Greet --> GatherIssue
GatherIssue --> Disambiguate: ambiguous or missing info
GatherIssue --> RetrievePolicy: enough info
Disambiguate --> GatherIssue
RetrievePolicy --> ProposeResolution
ProposeResolution --> ExecuteTool: needs account action
ProposeResolution --> Summarize: resolved in text
ExecuteTool --> Summarize: tool success
ExecuteTool --> Escalate: tool failure or policy block
Summarize --> [*]
Escalate --> [*]
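The diagram above can be encoded directly as a transition table; here is a minimal Python sketch (the event names such as `ambiguous` and `needs_action` are illustrative labels for the edges, not part of any fixed API):

```python
# Hypothetical encoding of the triage flow as a transition table.
# Keys are states; values map transition labels to next states.
# A value of None marks a terminal state.
TRANSITIONS = {
    "Greet": {"default": "GatherIssue"},
    "GatherIssue": {"ambiguous": "Disambiguate", "enough_info": "RetrievePolicy"},
    "Disambiguate": {"default": "GatherIssue"},
    "RetrievePolicy": {"default": "ProposeResolution"},
    "ProposeResolution": {"needs_action": "ExecuteTool", "resolved": "Summarize"},
    "ExecuteTool": {"success": "Summarize", "failure": "Escalate"},
    "Summarize": {"default": None},
    "Escalate": {"default": None},
}

def next_state(current: str, event: str):
    """Resolve the next state, falling back to the state's default edge."""
    edges = TRANSITIONS[current]
    return edges.get(event, edges.get("default"))
```

Because the table is plain data, the orchestrator can enforce it mechanically: any suggestion from the model that is not a listed edge is simply rejected.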

State schema

States are defined as data. This lets you add or change flows without rewriting code.

{
  "name": "GatherIssue",
  "description": "Elicit the user's issue and key details.",
  "allowedTransitions": ["Disambiguate", "RetrievePolicy"],
  "guard": "validated_slots.issue && (validated_slots.product || needs_disambiguation)",
  "promptTemplate": "system: You are a concise support triage assistant.\nconstraints: Ask one targeted question if details are missing. Never propose resolutions yet.\ncontext: {{policy_snippet}}\nslots: {{validated_slots}}\nuser: {{last_user_utterance}}\noutput_format: JSON with fields {\"assistant_text\", \"slots_to_update\", \"next_state_suggestion\"}",
  "tools": ["extract_slots"]
}
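As a sketch of what "states as data" can look like on the code side, here is a hypothetical loader that parses such a JSON definition into a typed record (guard evaluation is deliberately omitted; the field names follow the JSON above):

```python
import json
from dataclasses import dataclass, field

@dataclass
class StateDef:
    """Typed view of a state definition loaded from JSON."""
    name: str
    description: str
    allowed_transitions: list
    prompt_template: str
    tools: list = field(default_factory=list)

def parse_state_def(raw: str) -> StateDef:
    """Parse one JSON state definition into a StateDef record."""
    d = json.loads(raw)
    return StateDef(
        name=d["name"],
        description=d["description"],
        allowed_transitions=d["allowedTransitions"],
        prompt_template=d["promptTemplate"],
        tools=d.get("tools", []),
    )
```

A registry of these records is what the orchestrator indexes by state name, so adding a flow means adding JSON, not code.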

Prompt templates per state

Keep them short, explicit, and machine-readable. Two examples:

GatherIssue (chat)

system: You are a concise support triage assistant. Ask only what is needed to triage.
rules:
- Do not propose resolutions.
- Ask at most one question per turn.
context: {{policy_snippet}}
known_slots: {{validated_slots}}
user: {{last_user_utterance}}

Return JSON:
{
  "assistant_text": "string",
  "slots_to_update": {"issue": "...", "product": "...", "severity": "..."},
  "next_state_suggestion": "Disambiguate|RetrievePolicy"
}
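A minimal renderer for the `{{name}}` placeholders used in these templates could look like the following sketch (a real system might use a proper template engine such as Jinja2; this version just JSON-encodes non-string values like the slots dict):

```python
import json
import re

def render(template: str, **ctx) -> str:
    """Fill {{name}} placeholders from ctx; dicts/lists are JSON-encoded,
    and unknown placeholders render as empty strings."""
    def substitute(match):
        value = ctx.get(match.group(1), "")
        return value if isinstance(value, str) else json.dumps(value)
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)
```

For example, rendering `known_slots: {{validated_slots}}` with `validated_slots={"issue": "billing_duplicate"}` yields the JSON-encoded slot record inline in the prompt.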

ExecuteTool (action)

system: You are orchestrating a single account action.
allowed_actions: {{allowed_actions}}
required_inputs: {{required_inputs}}
validated_slots: {{validated_slots}}
tool_result: {{tool_result}}

Respond with JSON only:
{
  "assistant_text": "string for user",
  "action_status": "success|failed|blocked",
  "next_state_suggestion": "Summarize|Escalate"
}

Data flow for real-time voice

The same FSM pattern works for voice with streaming I/O.

sequenceDiagram
participant Mic
participant VAD as Voice Activity Detection
participant STT as Speech-to-Text
participant Orchestrator
participant LLM
participant TTS as Text-to-Speech
participant Speaker

Mic->>VAD: PCM stream
VAD-->>Mic: speech segments
VAD->>STT: segment
STT-->>Orchestrator: partial + final transcripts
Orchestrator->>LLM: state prompt + minimal context (streaming)
LLM-->>Orchestrator: JSON control + assistant text (stream)
Orchestrator-->>TTS: assistant text chunks
TTS-->>Speaker: audio stream
Orchestrator-->>StateStore: slots, state, tool results, traces

Implementation notes:

  • Use streaming everywhere: WebSockets or Server-Sent Events for transcript and token streams.
  • Run STT and TTS in parallel with careful barge-in handling.
  • Keep a state store for slots, state transitions, and tool outcomes so you never resend long histories.
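The state store can start very small; this in-memory sketch holds exactly the compact per-session record described above (a production deployment would back it with Redis or a database, but the shape is the same):

```python
from dataclasses import dataclass, field

@dataclass
class SessionState:
    """Compact per-session record: no transcripts, only committed facts."""
    current_state: str = "Greet"
    slots: dict = field(default_factory=dict)
    tool_results: list = field(default_factory=list)

# Minimal in-memory store keyed by session id.
_STORE = {}

def load_state(session_id: str) -> SessionState:
    """Fetch the session record, creating a fresh one at the Greet state."""
    return _STORE.setdefault(session_id, SessionState())

def persist_state(session_id: str, state: SessionState) -> None:
    """Write the updated record back after each turn."""
    _STORE[session_id] = state
```

Because every turn reads and writes this record instead of replaying history, the prompt size stays flat as conversations grow.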

Example walkthrough: a three-turn dialogue

  • State: Greet
  • User: Hi, I need help with my plan.
  • Assistant: Happy to help. Which product are you on, and what seems to be the issue?
  • Transition: Greet -> GatherIssue
  • State: GatherIssue
  • User: It’s the Pro plan. I was double charged.
  • Assistant: Thanks. Was the duplicate charge this month, and do you see two separate invoices?
  • Slots update: product=Pro, issue=billing_duplicate
  • Transition suggestion: Disambiguate (missing timeframe)
  • State: Disambiguate
  • User: Yes, two invoices for March.
  • Assistant: Got it. I can request a refund for the duplicate March charge and correct your account.
  • Transition: RetrievePolicy (enough info) -> ProposeResolution
  • State: ProposeResolution
  • Assistant: I will issue a refund for the duplicate March invoice. Shall I proceed?
  • User: Yes, please.
  • Transition: ExecuteTool
  • State: ExecuteTool
  • Tool: refund_api. Result: success
  • Assistant: The duplicate March charge has been refunded. You’ll see it in 3–5 business days.
  • Transition: Summarize -> End

Across these turns, prompts stay small. The orchestrator injects only the validated slots, the latest user utterance, and minimal policy snippets relevant to billing refunds.

Implementation details

Orchestrator loop (pseudocode)

def handle_turn(session_id, user_input):
    s = load_state(session_id)
    state_def = registry[s.current_state]

    # 1) Validate and enrich context
    slots = validate_slots(s.slots, user_input)
    tools_ctx = fetch_tools_context(state_def, slots)

    # 2) Render prompt for current state
    prompt = render(state_def.prompt_template,
                    slots=slots,
                    user=user_input,
                    tools=tools_ctx,
                    policy=select_policy_snippet(state_def))

    # 3) Call LLM with streaming and schema enforcement
    llm_resp = call_llm(prompt, response_format="json", stream=True)
    control = parse_and_validate(llm_resp.json, schema=state_def.output_schema)

    # 4) Apply side effects and decide next state
    if control.action:
        result = execute_tool(control.action, timeout=state_def.tool_timeout)
        s.tool_results.append(result)

    next_state = choose_next_state(control.next_state_suggestion,
                                   state_def.allowed_transitions)
    s.slots.update(control.slots_to_update)
    s.current_state = next_state

    # 5) Persist and stream response
    persist_state(session_id, s)
    stream_to_client(control.assistant_text)
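The `choose_next_state` step is where the FSM overrules the model: the suggestion is advisory, the transition table is authoritative. A minimal sketch (the `Escalate` fallback is an assumption for illustration, not dictated by the pseudocode above):

```python
def choose_next_state(suggestion: str, allowed: list, fallback: str = "Escalate") -> str:
    """Accept the model's suggested next state only if the FSM allows it;
    otherwise route to a safe fallback state."""
    return suggestion if suggestion in allowed else fallback
```

This one-line guard is what keeps a hallucinated `next_state_suggestion` from ever steering the conversation off the graph.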

JSON schemas and validation

Use strict schemas for every state’s machine output to keep the LLM honest. Reject and reprompt when validation fails, with a small corrective system message such as “Output was not valid JSON. Return only the JSON object that conforms to the schema.”
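One way to implement the reject-and-reprompt loop, sketched here with a stdlib-only check for required fields (a production system would likely validate against a full JSON Schema instead; `call_llm` is a stand-in for your model client):

```python
import json

CORRECTION = ("Output was not valid JSON. Return only the JSON object "
              "that conforms to the schema.")

def parse_and_validate(text: str, required: set):
    """Return the parsed object if it is a JSON object containing all
    required fields; return None on any failure."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not required <= obj.keys():
        return None
    return obj

def call_with_retries(call_llm, prompt, required, max_retries=2):
    """Call the model; on invalid output, append a small corrective
    message and retry up to max_retries times."""
    messages = [prompt]
    for _ in range(max_retries + 1):
        reply = call_llm(messages)
        parsed = parse_and_validate(reply, required)
        if parsed is not None:
            return parsed
        messages.append(CORRECTION)
    raise ValueError("LLM output failed schema validation after retries")
```

Keeping the corrective message tiny matters for latency: the retry reuses the original prompt plus one short instruction, not a rebuilt context.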

Context injection

  • Policies and knowledge: Retrieve only the snippets tagged for the active state and the detected issue type.
  • Slots and state: Keep a compact, typed record. Avoid resending raw transcripts; commit normalized facts.
  • Memory: For long tasks, store intermediate artifacts as structured records rather than tokens.
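As an illustration of committing normalized facts rather than raw transcripts, here is a hypothetical typed slot record for the billing flow (field names are examples, not a fixed schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BillingSlots:
    """Compact, typed record of validated facts for a billing session."""
    issue: Optional[str] = None    # e.g. "billing_duplicate"
    product: Optional[str] = None  # e.g. "Pro"
    period: Optional[str] = None   # e.g. "2025-03"

def merge_slots(current: BillingSlots, update: dict) -> BillingSlots:
    """Apply only recognized, non-empty updates from the model's
    slots_to_update output; silently drop unknown keys."""
    for key, value in update.items():
        if hasattr(current, key) and value:
            setattr(current, key, value)
    return current
```

Dropping unknown keys at merge time is a cheap guardrail: the model can suggest anything, but only fields the schema knows about ever reach the state store.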

Real-time voice considerations

  • VAD: Use low-latency VAD to segment reliably and reduce premature cut-offs.
  • STT: Prefer streaming STT with partial hypotheses and a short endpointing delay.
  • TTS: Stream TTS as soon as the model starts emitting; tune barge-in rules to interrupt gracefully.
  • Scheduling: Prioritize first-token time by assembling minimal prompts and deferring expensive retrieval to post-response enrichment when possible.

Use cases and business impact

  • Customer support triage and deflection: Faster first responses, fewer escalations, and clearer handoffs to agents.
  • Sales qualification and booking flows: Deterministic slot-filling increases completion rates and data quality.
  • Internal ops copilots: Guided procedures for refunds, returns, approvals, and IT tickets with clear audit trails.
  • Real-time voice assistants and IVR modernization: Natural turn-taking with streaming ASR/TTS, policy-aware flows, and lower abandonment.
  • Healthcare intake and scheduling: Structured anamnesis and insurance capture with strict guardrails and auditability.

Business levers you unlock:

  • Lower inference spend via smaller prompts and more consistent token usage
  • Better reliability through state-scoped constraints and observable transitions
  • Faster time to iterate: add or adjust states without rewriting the whole agent
  • Clearer observability and compliance: every step and decision has a state, schema, and trace

Under the hood: From research to production

Our approach is informed by the ideas in the FSM prompting paper and adapted for real-world constraints:

  • We keep the state discipline and explicit transitions.
  • We add flexible, per-state prompts with JSON outputs the orchestrator can validate.
  • We emphasize streaming I/O and strict schema enforcement to meet real-time latency.
  • We isolate context by state rather than replaying whole histories, reducing both cost and drift.

When running in production, we also:

  • Pin model versions and test for conformance on golden dialogues before any rollout
  • Capture traces per turn for replay and red-team analysis
  • Separate the tool layer so that slow APIs don’t stall the conversation loop

References and further reading

The FSM prompting paradigm: FSM

State-graph orchestration in agents: LangGraph by LangChain, a related concept for graph-based agent control

Real-time multimodal APIs: OpenAI Realtime API and Anthropic tool use

Speech pipeline building blocks: WebRTC for low-latency media and VAD research overview

Conclusion

FSM-inspired prompt engineering brings order and speed to multi-turn LLM interactions by combining explicit state control with small, purpose-built prompts. Teams building real-time chat or voice experiences can achieve predictable flows, lower token usage, and first-token latency suitable for natural conversation, without sacrificing adaptability where it counts.

If you are exploring how to turn AI ideas into scalable production systems, our team can help design and build practical LLM architectures tailored to your product and operational constraints. If you want to discuss whether this approach fits your roadmap, let’s talk.