System prompts are a function of user and application state
Your system prompts need to be built and managed like a React app, evolving with user intent and data, rather than like a static HTML web page.
You can think of your system prompt as a function of the application state — it needs to be dynamic and evolve based on the progression of the user journey. It’s not even a piece-wise function made up of two or three static prompts. You need to modify or entirely replace the prompt based on the evolution of the conversation, metadata from chain-of-thought workflows, summarization, personal data from the user, etc. You want to include or omit parts of it at any given user state for a better outcome. This blog post on prompt design from Character.ai is a great resource.
In short, think of prompts as a dynamic set of instructions that need to be maintained to control your user experience, more like the UI elements visible to the user in a given screen of your app, rather than as a one-time set of instructions to be locked at the start of the user journey.
Opt for deterministic outcomes, especially in the early user journey
With most online products, you finely control your user’s “day zero” experience with an intricately built onboarding flow, then you unleash them onto a magical blank canvas to do whatever they want. With an AI chat product, you probably want to keep the same philosophy and build deterministic chat outcomes for your users, especially in their first few days. But then what?
Should the AI bring up a certain topic or suggestion within the first five messages, or be prompted to a certain action on their second day? Should the AI change the topic at certain times to keep the user engaged? Is there a conversational ramp for the activation moment? Do you want to extract some info from your user during onboarding using a chat format to personalize the experience?
The answer to all of the above is most likely yes if you’re building a consumer product.
Use model blending
Results improve when you route messages in the same thread to two to six models with orthogonal capabilities instead of always going to the same model. Let’s say you have model A, which is good at prose and role playing, and model B, which is good at reasoning. If you just route every other message between A and B, the outcome over a multi-turn conversation ends up being dramatically better.
Besides running split tests for advanced prompting, this is the easiest win with a huge impact. But choose the models wisely.
Use scripted responses
As amazing as LLMs are, they’re better deployed in a controlled manner for chat rather than as a magical talking box. You can use a smaller model to infer some semantics about the user input, and route to a pre-written response a lot of the time. This will save you a ton of money while actually leading to a better user experience.
If you can build a simple decision tree with some semantic reasoning for routing to serve a common user journey, you’ll probably end up with a better product than having every single response generated being from an inference.
Craft amazing conversation starters
We built an entirely separate inference system from our core dialog system to use summaries of previous chats, previous memories, their recent actions in app, and some random seeds for the AI characters to initiate good conversations. If you don’t do this, your AI will produce some version of “Hi! How can I assist you today?” more often than you want.
The quality of AI-to-AI chats degrades quickly
During user testing, we repeatedly saw the blank canvas problem — users didn’t know what to type to chat. We added a “magic wand” to offer three AI-generated messages in the user’s voice. While it solved a short term user friction, users who used the magic wand churned much faster. When we studied the chatlogs, we found that AI-to-AI chat degrades into a loop of nonsense within a few turns.
Have a clear metric to judge AI output
If you just prompt and test your chatbot yourself for a few messages, and call it good enough… trust me, it won’t be good enough. Your AI outputs need to maintain quality after a 100-turn conversation, across several sessions, and for different user personas.
You need to try many different variants and build a clear feedback loop, using something like a Likert score or simple ELO score, to choose between variants to see what your users find engaging or useful in chat.
We found that using another inference with a general purpose LLM to judge the output (e.g., a prompt like “On scale of 1 to 5, how entertaining is this conversation?” running with GPT4o as the judge) produced poor results that were out of sync with users’ feedback.
All in all, the days of vibing your way to a system prompt and calling it a day are long gone. As obvious as it may sound, if your product is AI, then the AI better be great. This will be the number one factor determining your success. The AI novelty era is over. You will need a clear framework and lots of experimentation to delight your users and deliver them value. Good luck!
SOURCE: Freydoonnejad, Siamak. ''How to build better AI chatbots'' 10/12/2024. Infoworld.com. (https://www.infoworld.com/article/3616618/how-to-build-better-ai-chatbots.html).