Building fast and scalable LLM interactions with FSM-inspired prompt engineering

24 Jul 2025

This post is for teams building real-time systems such as support assistants, sales qualifiers, or voice bots, where first-token latency in the 350–700 ms range and stable behavior are non-negotiable.

The challenge: Real-time dialogue without losing context

The core problem is to maintain a long-running, adaptive dialogue while:

  • Responding quickly enough for interactive UX and speech turn-taking
  • Preserving relevant context without paying to resend entire histories
  • Maintaining guardrails so the model stays on task and doesn’t drift

Most approaches struggle on at least one dimension:

  • Monolithic prompts that include the entire history inflate tokens, increase latency, and often degrade into repetition or hallucinations.
  • Generic agent frameworks can be flexible but add orchestration overhead, latency from planning loops, and harder-to-debug behavior.
  • Memory heuristics and RAG everywhere help, but naive retrieval each turn still bloats prompts and can pull in irrelevant context.

We needed something lightweight, explicit, and fast under real-world load.

Our solution: FSM-inspired prompt architecture

We structure the interaction as a finite set of states with explicit transitions, then design small, focused prompt templates per state. The FSM provides the backbone; the LLM provides local reasoning within a state. The orchestrator controls which state the conversation is in and composes only the minimal context needed for that state.

Key ideas:

  • The FSM governs flow. Allowed actions and next states are explicit and observable.
  • Each state has a prompt template dynamically filled with just-in-time context: user input, validated slots, tool results, policy notes, and meta-instructions.
  • The LLM reasons within a bounded frame and produces both user-visible text and machine-readable signals for the orchestrator.

Example state machine (customer support triage)

stateDiagram-v2
[*] --> Greet
Greet --> GatherIssue
GatherIssue --> Disambiguate: ambiguous or missing info
GatherIssue --> RetrievePolicy: enough info
Disambiguate --> GatherIssue
RetrievePolicy --> ProposeResolution
ProposeResolution --> ExecuteTool: needs account action
ProposeResolution --> Summarize: resolved in text
ExecuteTool --> Summarize: tool success
ExecuteTool --> Escalate: tool failure or policy block
Summarize --> [*]
Escalate --> [*]
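The diagram above can be encoded directly as a transition table; here is a minimal Python sketch (the event names such as `ambiguous` and `needs_action` are illustrative labels for the edges, not part of any fixed API):

```python
# Hypothetical encoding of the triage flow as a transition table.
# Keys are states; values map transition labels to next states.
# A value of None marks a terminal state.
TRANSITIONS = {
    "Greet": {"default": "GatherIssue"},
    "GatherIssue": {"ambiguous": "Disambiguate", "enough_info": "RetrievePolicy"},
    "Disambiguate": {"default": "GatherIssue"},
    "RetrievePolicy": {"default": "ProposeResolution"},
    "ProposeResolution": {"needs_action": "ExecuteTool", "resolved": "Summarize"},
    "ExecuteTool": {"success": "Summarize", "failure": "Escalate"},
    "Summarize": {"default": None},
    "Escalate": {"default": None},
}

def next_state(current: str, event: str):
    """Resolve the next state, falling back to the state's default edge."""
    edges = TRANSITIONS[current]
    return edges.get(event, edges.get("default"))
```

Because the table is plain data, the orchestrator can enforce it mechanically: any suggestion from the model that is not a listed edge is simply rejected.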

State schema

States are defined as data. This lets you add or change flows without rewriting code.

{
  "name": "GatherIssue",
  "description": "Elicit the user's issue and key details.",
  "allowedTransitions": ["Disambiguate", "RetrievePolicy"],
  "guard": "validated_slots.issue && (validated_slots.product || needs_disambiguation)",
  "promptTemplate": "system: You are a concise support triage assistant.\nconstraints: Ask one targeted question if details are missing. Never propose resolutions yet.\ncontext: {{policy_snippet}}\nslots: {{validated_slots}}\nuser: {{last_user_utterance}}\noutput_format: JSON with fields {\"assistant_text\", \"slots_to_update\", \"next_state_suggestion\"}",
  "tools": ["extract_slots"]
}
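As a sketch of what "states as data" can look like on the code side, here is a hypothetical loader that parses such a JSON definition into a typed record (guard evaluation is deliberately omitted; the field names follow the JSON above):

```python
import json
from dataclasses import dataclass, field

@dataclass
class StateDef:
    """Typed view of a state definition loaded from JSON."""
    name: str
    description: str
    allowed_transitions: list
    prompt_template: str
    tools: list = field(default_factory=list)

def parse_state_def(raw: str) -> StateDef:
    """Parse one JSON state definition into a StateDef record."""
    d = json.loads(raw)
    return StateDef(
        name=d["name"],
        description=d["description"],
        allowed_transitions=d["allowedTransitions"],
        prompt_template=d["promptTemplate"],
        tools=d.get("tools", []),
    )
```

A registry of these records is what the orchestrator indexes by state name, so adding a flow means adding JSON, not code.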

Prompt templates per state

Keep them short, explicit, and machine-readable. Two examples:

GatherIssue (chat)

system: You are a concise support triage assistant. Ask only what is needed to triage.
rules:
- Do not propose resolutions.
- Ask at most one question per turn.
context: {{policy_snippet}}
known_slots: {{validated_slots}}
user: {{last_user_utterance}}

Return JSON:
{
  "assistant_text": "string",
  "slots_to_update": {"issue": "...", "product": "...", "severity": "..."},
  "next_state_suggestion": "Disambiguate|RetrievePolicy"
}
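A minimal renderer for the `{{name}}` placeholders used in these templates could look like the following sketch (a real system might use a proper template engine such as Jinja2; this version just JSON-encodes non-string values like the slots dict):

```python
import json
import re

def render(template: str, **ctx) -> str:
    """Fill {{name}} placeholders from ctx; dicts/lists are JSON-encoded,
    and unknown placeholders render as empty strings."""
    def substitute(match):
        value = ctx.get(match.group(1), "")
        return value if isinstance(value, str) else json.dumps(value)
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)
```

For example, rendering `known_slots: {{validated_slots}}` with `validated_slots={"issue": "billing_duplicate"}` yields the JSON-encoded slot record inline in the prompt.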

ExecuteTool (action)

system: You are orchestrating a single account action.
allowed_actions: {{allowed_actions}}
required_inputs: {{required_inputs}}
validated_slots: {{validated_slots}}
tool_result: {{tool_result}}

Respond with JSON only:
{
  "assistant_text": "string for user",
  "action_status": "success|failed|blocked",
  "next_state_suggestion": "Summarize|Escalate"
}

Data flow for real-time voice

The same FSM pattern works for voice with streaming I/O.

sequenceDiagram
participant Mic
participant VAD as Voice Activity Detection
participant STT as Speech-to-Text
participant Orchestrator
participant LLM
participant TTS as Text-to-Speech
participant Speaker

Mic->>VAD: PCM stream
VAD-->>Mic: speech segments
VAD->>STT: segment
STT-->>Orchestrator: partial + final transcripts
Orchestrator->>LLM: state prompt + minimal context (streaming)
LLM-->>Orchestrator: JSON control + assistant text (stream)
Orchestrator-->>TTS: assistant text chunks
TTS-->>Speaker: audio stream
Orchestrator-->>StateStore: slots, state, tool results, traces

Implementation notes:

  • Use streaming everywhere: WebSockets or Server-Sent Events for transcript and token streams.
  • Run STT and TTS in parallel with careful barge-in handling.
  • Keep a state store for slots, state transitions, and tool outcomes so you never resend long histories.
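The state store can start very small; this in-memory sketch holds exactly the compact per-session record described above (a production deployment would back it with Redis or a database, but the shape is the same):

```python
from dataclasses import dataclass, field

@dataclass
class SessionState:
    """Compact per-session record: no transcripts, only committed facts."""
    current_state: str = "Greet"
    slots: dict = field(default_factory=dict)
    tool_results: list = field(default_factory=list)

# Minimal in-memory store keyed by session id.
_STORE = {}

def load_state(session_id: str) -> SessionState:
    """Fetch the session record, creating a fresh one at the Greet state."""
    return _STORE.setdefault(session_id, SessionState())

def persist_state(session_id: str, state: SessionState) -> None:
    """Write the updated record back after each turn."""
    _STORE[session_id] = state
```

Because every turn reads and writes this record instead of replaying history, the prompt size stays flat as conversations grow.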

Example walkthrough: a three-turn dialogue

  • State: Greet
  • User: Hi, I need help with my plan.
  • Assistant: Happy to help. Which product are you on, and what seems to be the issue?
  • Transition: Greet -> GatherIssue
  • State: GatherIssue
  • User: It’s the Pro plan. I was double charged.
  • Assistant: Thanks. Was the duplicate charge this month, and do you see two separate invoices?
  • Slots update: product=Pro, issue=billing_duplicate
  • Transition suggestion: Disambiguate (missing timeframe)
  • State: Disambiguate
  • User: Yes, two invoices for March.
  • Assistant: Got it. I can request a refund for the duplicate March charge and correct your account.
  • Transition: RetrievePolicy (enough info) -> ProposeResolution
  • State: ProposeResolution
  • Assistant: I will issue a refund for the duplicate March invoice. Shall I proceed?
  • User: Yes, please.
  • Transition: ExecuteTool
  • State: ExecuteTool
  • Tool: refund_api. Result: success
  • Assistant: The duplicate March charge has been refunded. You’ll see it in 3–5 business days.
  • Transition: Summarize -> End

Across these turns, prompts stay small. The orchestrator injects only the validated slots, the latest user utterance, and minimal policy snippets relevant to billing refunds.

Implementation details

Orchestrator loop (pseudocode)

def handle_turn(session_id, user_input):
    s = load_state(session_id)
    state_def = registry[s.current_state]

    # 1) Validate and enrich context
    slots = validate_slots(s.slots, user_input)
    tools_ctx = fetch_tools_context(state_def, slots)

    # 2) Render prompt for current state
    prompt = render(state_def.prompt_template,
                    slots=slots,
                    user=user_input,
                    tools=tools_ctx,
                    policy=select_policy_snippet(state_def))

    # 3) Call LLM with streaming and schema enforcement
    llm_resp = call_llm(prompt, response_format="json", stream=True)
    control = parse_and_validate(llm_resp.json, schema=state_def.output_schema)

    # 4) Apply side effects and decide next state
    if control.action:
        result = execute_tool(control.action, timeout=state_def.tool_timeout)
        s.tool_results.append(result)

    next_state = choose_next_state(control.next_state_suggestion,
                                   state_def.allowed_transitions)
    s.slots.update(control.slots_to_update)
    s.current_state = next_state

    # 5) Persist and stream response
    persist_state(session_id, s)
    stream_to_client(control.assistant_text)
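The `choose_next_state` step is where the FSM overrules the model: the suggestion is advisory, the transition table is authoritative. A minimal sketch (the `Escalate` fallback is an assumption for illustration, not dictated by the pseudocode above):

```python
def choose_next_state(suggestion: str, allowed: list, fallback: str = "Escalate") -> str:
    """Accept the model's suggested next state only if the FSM allows it;
    otherwise route to a safe fallback state."""
    return suggestion if suggestion in allowed else fallback
```

This one-line guard is what keeps a hallucinated `next_state_suggestion` from ever steering the conversation off the graph.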

JSON schemas and validation

Use strict schemas for every state’s machine output to keep the LLM honest. Reject and reprompt when validation fails, with a small corrective system message such as “Output was not valid JSON. Return only the JSON object that conforms to the schema.”
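One way to implement the reject-and-reprompt loop, sketched here with a stdlib-only check for required fields (a production system would likely validate against a full JSON Schema instead; `call_llm` is a stand-in for your model client):

```python
import json

CORRECTION = ("Output was not valid JSON. Return only the JSON object "
              "that conforms to the schema.")

def parse_and_validate(text: str, required: set):
    """Return the parsed object if it is a JSON object containing all
    required fields; return None on any failure."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not required <= obj.keys():
        return None
    return obj

def call_with_retries(call_llm, prompt, required, max_retries=2):
    """Call the model; on invalid output, append a small corrective
    message and retry up to max_retries times."""
    messages = [prompt]
    for _ in range(max_retries + 1):
        reply = call_llm(messages)
        parsed = parse_and_validate(reply, required)
        if parsed is not None:
            return parsed
        messages.append(CORRECTION)
    raise ValueError("LLM output failed schema validation after retries")
```

Keeping the corrective message tiny matters for latency: the retry reuses the original prompt plus one short instruction, not a rebuilt context.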

Context injection

  • Policies and knowledge: Retrieve only the snippets tagged for the active state and the detected issue type.
  • Slots and state: Keep a compact, typed record. Avoid resending raw transcripts; commit normalized facts.
  • Memory: For long tasks, store intermediate artifacts as structured records rather than tokens.
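As an illustration of committing normalized facts rather than raw transcripts, here is a hypothetical typed slot record for the billing flow (field names are examples, not a fixed schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BillingSlots:
    """Compact, typed record of validated facts for a billing session."""
    issue: Optional[str] = None    # e.g. "billing_duplicate"
    product: Optional[str] = None  # e.g. "Pro"
    period: Optional[str] = None   # e.g. "2025-03"

def merge_slots(current: BillingSlots, update: dict) -> BillingSlots:
    """Apply only recognized, non-empty updates from the model's
    slots_to_update output; silently drop unknown keys."""
    for key, value in update.items():
        if hasattr(current, key) and value:
            setattr(current, key, value)
    return current
```

Dropping unknown keys at merge time is a cheap guardrail: the model can suggest anything, but only fields the schema knows about ever reach the state store.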

Real-time voice considerations

  • VAD: Use low-latency VAD to segment reliably and reduce premature cut-offs.
  • STT: Prefer streaming STT with partial hypotheses and a short endpointing delay.
  • TTS: Stream TTS as soon as the model starts emitting; tune barge-in rules to interrupt gracefully.
  • Scheduling: Prioritize first-token time by assembling minimal prompts and deferring expensive retrieval to post-response enrichment when possible.

Use cases and business impact

  • Customer support triage and deflection: Faster first responses, fewer escalations, and clearer handoffs to agents.
  • Sales qualification and booking flows: Deterministic slot-filling increases completion rates and data quality.
  • Internal ops copilots: Guided procedures for refunds, returns, approvals, and IT tickets with clear audit trails.
  • Real-time voice assistants and IVR modernization: Natural turn-taking with streaming ASR/TTS, policy-aware flows, and lower abandonment.
  • Healthcare intake and scheduling: Structured anamnesis and insurance capture with strict guardrails and auditability.

Business levers you unlock:

  • Lower inference spend via smaller prompts and more consistent token usage
  • Better reliability through state-scoped constraints and observable transitions
  • Faster time to iterate: add or adjust states without rewriting the whole agent
  • Clearer observability and compliance: every step and decision has a state, schema, and trace

Under the hood: From research to production

Our approach is informed by the ideas in the FSM prompting paper and adapted for real-world constraints:

  • We keep the state discipline and explicit transitions.
  • We add flexible, per-state prompts with JSON outputs the orchestrator can validate.
  • We emphasize streaming I/O and strict schema enforcement to meet real-time latency.
  • We isolate context by state rather than replaying whole histories, reducing both cost and drift.

When running in production, we also:

  • Pin model versions and test for conformance on golden dialogues before any rollout
  • Capture traces per turn for replay and red-team analysis
  • Separate the tool layer so that slow APIs don’t stall the conversation loop

References and further reading

The FSM prompting paradigm: FSM

State-graph orchestration in agents: LangGraph by LangChain, a related concept for graph-based agent control

Real-time multimodal APIs: OpenAI Realtime API and Anthropic tool use

Speech pipeline building blocks: WebRTC for low-latency media and VAD research overview

Conclusion

FSM-inspired prompt engineering brings order and speed to multi-turn LLM interactions by combining explicit state control with small, purpose-built prompts. Teams building real-time chat or voice experiences can achieve predictable flows, lower token usage, and first-token latency suitable for natural conversation, without sacrificing adaptability where it counts.

If you are exploring how to turn AI ideas into scalable production systems, our team can help design and build practical LLM architectures tailored to your product and operational constraints. If you want to discuss whether this approach fits your roadmap, let’s talk.