The Problem with AI Chat Memory in Bubble Apps
Building an AI chatbot in Bubble that actually remembers what a user said earlier in the conversation requires deliberate architectural choices. The AI model itself has no persistent memory between separate API calls. Everything it knows about the conversation must be passed back in with each new request. How you handle that accumulating context determines your app's cost, scalability, and quality of responses as conversations grow.
There are three main approaches: Context Window, Follow-up Prompt, and RAG using Pinecone. Each involves real trade-offs.
Method 1: Context Window
The simplest approach is to pass the full conversation history into every OpenAI or Anthropic API call. With modern models offering context windows of 200,000 tokens or more, this seems like a non-issue. In practice, several problems emerge.
First, research has documented a middle-of-context dip, often called the "lost in the middle" effect: AI models tend to pay less attention to information positioned in the middle of a very long input, so filling the context window does not guarantee accurate recall. Second, the cost compounds quickly. Each new user message adds to the history, and you pay for every token of that history on every call. By the sixth exchange in a conversation, you are sending twelve messages' worth of tokens; by the twentieth, the cost per reply is many times that of the first. Third, some API providers impose hard limits on the number of messages per call, separate from token count, which can cause unexpected errors in long conversations.
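To make the cost growth concrete, here is a minimal Python sketch of how the message array grows under the Context Window method. The 4-characters-per-token heuristic and the message contents are illustrative assumptions, not real API values:

```python
# Illustrative sketch: under the Context Window method, the FULL history
# rides along on every API call, so billed input tokens grow every turn.

def estimate_tokens(text: str) -> int:
    """Very rough heuristic: ~4 characters per token (an assumption)."""
    return max(1, len(text) // 4)

def build_request_messages(history: list[dict], new_user_message: str) -> list[dict]:
    """Each request = the entire prior history plus the new user message."""
    return history + [{"role": "user", "content": new_user_message}]

history: list[dict] = []
for turn in range(1, 7):
    messages = build_request_messages(history, f"User question number {turn}")
    billed = sum(estimate_tokens(m["content"]) for m in messages)
    print(f"turn {turn}: {len(messages)} messages, ~{billed} input tokens billed")
    # Pretend the model replied; both sides join the permanent history.
    history = messages + [{"role": "assistant", "content": f"Answer to question {turn}"}]
```

Each turn adds two messages to the history, so the input tokens you pay for grow roughly quadratically over the life of the conversation.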
Method 2: Follow-up Prompt (Distillation)
This approach decouples the expensive answer-generation model from the memory management task.
When a user sends a message, the app uses a high-quality, higher-cost model to compose the full response. Once that response is saved to the Bubble database, a second API call runs in the background using a much cheaper model. That second call takes the full conversation history and distils it into a structured learner profile: the user's role, what they are building, what topics they have struggled with, what decisions they have made, and what questions remain open.
This profile, typically a few hundred tokens, replaces the full conversation history in subsequent requests. Instead of sending every previous message, you send a concise and contextually rich summary. The cost of the distillation call is negligible, the profile stays compact regardless of conversation length, and the quality of personalised responses often improves because the profile captures intent more precisely than raw message history.
For an AI assistant in a learning platform, the structure recommended by frameworks such as task-state compression or dialogue-to-knowledge distillation includes fields for the user's current project, recent questions, open blockers, and suggested next steps. Every conversation updates that profile rather than extending an ever-growing message list.
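A sketch of the distillation step in Python. The profile fields mirror the ones above, but the model names, the `call_model` helper, and the exact prompt wording are illustrative assumptions, not a fixed API:

```python
import json

def call_model(model: str, system: str, user: str) -> str:
    """Hypothetical stand-in for an API Connector call to an LLM provider."""
    raise NotImplementedError  # wired up via Bubble's API Connector in practice

DISTILL_INSTRUCTIONS = (
    "Summarise this conversation into a JSON learner profile with keys: "
    "role, current_project, recent_questions, open_blockers, decisions_made, "
    "suggested_next_steps. Be concise; a few hundred tokens at most."
)

def distill_profile(full_history: list[dict]) -> dict:
    """Background step: a cheap model compresses the history into a profile."""
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in full_history)
    raw = call_model("cheap-model", DISTILL_INSTRUCTIONS, transcript)
    return json.loads(raw)

def build_next_request(profile: dict, new_user_message: str) -> list[dict]:
    """Subsequent calls send the compact profile instead of the raw history."""
    return [
        {"role": "system", "content": "Learner profile:\n" + json.dumps(profile)},
        {"role": "user", "content": new_user_message},
    ]
```

Note that `build_next_request` always produces exactly two messages, regardless of how long the conversation has run, which is where the cost savings come from.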
Method 3: RAG with Pinecone
Retrieval-Augmented Generation (RAG) is the most sophisticated approach and the right choice when you have large volumes of conversation history or external knowledge that needs to be searched semantically.
Each conversation exchange gets stored as a vector in Pinecone. When a user sends a new message, the query is converted to embeddings, compared against the stored vectors, and the most semantically relevant past exchanges are retrieved and included in the AI prompt. Instead of the entire conversation, only the most relevant portions are sent.
Two lessons from production use of this approach:
The first is about embedding models. Using OpenAI's text embedding models works, but it requires two separate API calls: one to OpenAI to convert text to embeddings, then one to Pinecone with those vectors. If you use Pinecone's own integrated embedding models instead, you can send plain text directly to Pinecone and skip the separate embedding step. This simplifies the architecture and removes an API dependency.
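With integrated embeddings, upserts carry plain text rather than vectors. A sketch of the two record shapes (the field names follow Pinecone's documented record formats, but treat the exact SDK calls and the configurable text field name as assumptions to verify against current docs):

```python
# With a third-party embedding model you must first call the embedding API
# and upsert raw vectors; with Pinecone's integrated embeddings the index
# embeds the text field for you server-side.

def record_for_integrated_index(exchange_id: str, text: str) -> dict:
    # Plain text; the field name ("chunk_text" here) is set when the
    # index is created for a hosted model.
    return {"_id": exchange_id, "chunk_text": text}

def record_for_external_embeddings(exchange_id: str, vector: list[float], text: str) -> dict:
    # You supply the vector yourself, after a separate embedding API call.
    return {"id": exchange_id, "values": vector, "metadata": {"text": text}}

# With integrated embeddings, a single call along the lines of
#   index.upsert_records("conversations", [record_for_integrated_index(...)])
# replaces the embed-then-upsert pair of calls.
```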
The second lesson is about hybrid search. Pure semantic search, which finds content based on meaning, can underperform on proper nouns and product names that may not be well represented in the embedding model's training data. Pinecone's hybrid search combines semantic search with lexical keyword matching to handle these cases, but hybrid search is only available when using Pinecone's integrated embedding models. If you start with a third-party embedding model, you lose access to hybrid search and would need a separate lexical index to replicate the functionality.
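To see why lexical matching helps, here is a toy illustration (not Pinecone's actual fusion algorithm, and "Zylo" is a made-up product name) of blending a semantic score with a keyword-overlap score, so an exact product name still surfaces even when its embedding is weak:

```python
def keyword_overlap(query: str, text: str) -> float:
    """Fraction of query words that appear verbatim in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_score(semantic: float, lexical: float, alpha: float = 0.7) -> float:
    """Toy weighted blend; real hybrid search fuses dense and sparse scores server-side."""
    return alpha * semantic + (1 - alpha) * lexical

# A proper noun the embedding model handles poorly gets a weak semantic
# score, but exact keyword overlap rescues its ranking over a generic match:
weak_semantic = hybrid_score(
    semantic=0.30,
    lexical=keyword_overlap("Zylo pricing", "Zylo pricing tiers explained"),
)
generic_match = hybrid_score(semantic=0.55, lexical=0.0)
# weak_semantic ends up ranked above generic_match despite the lower semantic score.
```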
Choosing the Right Approach
For most Bubble apps building AI assistants with individual user conversations, the Follow-up Prompt method offers the best balance of cost, quality, and implementation simplicity. Context Window works for short conversations with infrequent users. RAG with Pinecone is the right choice when you need semantic search across large conversation histories or want to search a document corpus alongside conversation data.
All three can be implemented using Bubble's API Connector and backend workflows, with the distillation and vector storage steps running asynchronously after each user message so they do not delay the main response.