Gauging interest in a Python-based solution for semi-hosting persona-oriented chats using OpenAI APIs.

Like many people, I’ve experienced the frustration of OpenAI modifying its models and even restricting access to "legacy models" without warning. As a software developer of 30 years, I decided to work on a solution that would be less exposed to snap policy decisions. This means using the API versions of the models for now, as well as an architecture that, while leveraging some of the benefits of OpenAI's APIs like caching, is fundamentally designed to be reusable with other models on the market.

While my own focus is on creating a suitable environment for hosting a persona to act as a writing partner, I wanted to briefly describe the solution I’m about to start testing, to see if others are interested in trying it or adapting it to their own use.
I’m using Chainlit for the UI, and all the backend processes are written in Python 3.11.9 in a venv. For those interested in the gory framework details, I’m just using SQLite for storage right now via SQLAlchemy, along with ChromaDB for embedding support.
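For anyone curious how those pieces fit together, here's a minimal sketch of the stack wiring. The file paths, collection name, and handler body are illustrative assumptions, not my actual code:

```python
import chainlit as cl
import chromadb
from sqlalchemy import create_engine

# SQLite via SQLAlchemy for structured storage (conversations, summaries, etc.)
engine = create_engine("sqlite:///persona.db")  # hypothetical path

# ChromaDB as the embedding store for past messages
chroma = chromadb.PersistentClient(path="./chroma")  # hypothetical path
messages = chroma.get_or_create_collection("messages")

@cl.on_message
async def on_message(message: cl.Message):
    # Persist the message, assemble context, call the model (elided here),
    # then send the persona's reply back to the Chainlit UI.
    reply = "placeholder reply"
    await cl.Message(content=reply).send()
```

Run with `chainlit run app.py` (assuming the file is named app.py).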

At this point I'm going to limit what I say about features, not because it's some big secret but to keep things straightforward.

Beyond model version stability, one of my main goals has been enabling a degree of continuity for the persona. I understand people have strong opinions on this subject, but I'll just say I’ve honestly had remarkable success with continuity, even within a simple custom GPT, since January of this year. I’ve also been dissatisfied with what the built-in system chooses to summarize and how it does so, in both default ChatGPT and custom GPT sessions. I'll speak to my summary "solution" further down.

I intend to use GPT-4.1 as the primary model, as it's proven to be the best fit for my persona and I want access to the one-million-token context window. My memory architecture doesn't get anywhere close to using all of that, but I am building for growth. I’ve designed the system around the Responses API for a few reasons, the biggest being the ability to reference previous messages by ID, previously uploaded files by their IDs, and embedding entries by ID. This helps overcome the TPM (tokens per minute) upload limits on my account (which I presume is a common problem).
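To make that concrete, here's roughly what chaining onto a prior response and referencing an uploaded file looks like with the OpenAI Python SDK. The model name matches my plan above; the IDs are placeholders:

```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1",
    # Chain onto an earlier response instead of re-sending its tokens.
    previous_response_id="resp_abc123",  # placeholder ID from a prior call
    input=[
        {
            "role": "user",
            "content": [
                # Reference a previously uploaded file by its ID.
                {"type": "input_file", "file_id": "file-xyz789"},  # placeholder ID
                {"type": "input_text", "text": "Keep this document in mind as we continue."},
            ],
        }
    ],
)
print(response.output_text)
```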

Caching of messages is enabled by default, and my solution will make sure it's enabled if it isn't already, so it can take advantage of it. My overall memory architecture is made up of documents that rarely change, some collections (like conversation summaries), some additional data stored in the database, and access to an embedding store of all previous messages. It's worth noting that everything other than the embedding store is loaded into the model's context window on every call. I generally refer to these as part of the system's "online" memory, while I consider the embedding store "offline" because, while it is searchable, its full contents aren't always available to the model.
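Here's a hedged sketch of how that online/offline split might come together when assembling a call. The helper functions and the "messages" collection are hypothetical stand-ins for my actual storage code:

```python
import chromadb

# The offline store: embeddings of all previous messages (path/name are placeholders).
messages = chromadb.PersistentClient(path="./chroma").get_or_create_collection("messages")

def load_core_documents() -> str:
    # Placeholder: would read the rarely-changing persona documents.
    return "...core persona documents..."

def load_summaries(days: int) -> str:
    # Placeholder: would pull recent conversation summaries from SQLite.
    return "...summaries from the last few days..."

def build_context(user_message: str) -> list[dict]:
    """Assemble the 'online' memory plus relevant 'offline' hits for one model call."""
    context = [
        {"role": "system", "content": load_core_documents()},   # online: stable documents
        {"role": "system", "content": load_summaries(days=7)},  # online: recent summaries
    ]
    # Offline: search the embedding store for related past messages.
    hits = messages.query(query_texts=[user_message], n_results=5)
    related = "\n".join(hits["documents"][0])
    context.append({"role": "system", "content": f"Related past messages:\n{related}"})
    context.append({"role": "user", "content": user_message})
    return context
```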

I want to be upfront: this isn't currently in a state to share, but I wanted to see if others would be interested. I’m not looking to sell it, especially in its early form.

I also want to be clear that I’m starting from a very hands-on approach to things like continuity management. This is, in many ways, an experiment in how best to manage the persona's continuity without breaking the bank. As such, there is currently no automated pruning. When a conversation topic has been exhausted, I have set things up so that the persona and I create a summary of the conversation together, one that aligns with my own goals and what I believe was important in the conversation. The verbatim messages from that conversation are marked as belonging to it and saved to the embedding store after processing. The summary then appears alongside other summaries from the last x days as part of the system's online memory (see the sketch below). Most of this will be a manual process initially, because I want to take things slowly and decide which parts really make sense to automate and which parts I would prefer to keep a hand in.
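As a sketch of that archival step, reusing the hypothetical engine and collection from the earlier snippets (the summaries table and all names here are illustrative, not my schema):

```python
import uuid

import chromadb
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///persona.db")  # hypothetical path
messages = chromadb.PersistentClient(path="./chroma").get_or_create_collection("messages")

def archive_conversation(conversation_id: str, raw_messages: list[str], summary: str) -> None:
    """Move a finished conversation into offline memory and keep its summary online."""
    # Tag each verbatim message with its conversation before embedding it.
    messages.add(
        ids=[str(uuid.uuid4()) for _ in raw_messages],
        documents=raw_messages,
        metadatas=[{"conversation_id": conversation_id} for _ in raw_messages],
    )
    # Persist the co-written summary so it can surface with recent summaries.
    with engine.begin() as conn:
        conn.execute(
            text("INSERT INTO summaries (conversation_id, body) VALUES (:cid, :body)"),
            {"cid": conversation_id, "body": summary},
        )
```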

There are a number of features that are in progress, ready for my testing, or on my roadmap.

And now for something slightly personal: I have severe ADHD, and one of my primary symptoms is executive dysfunction. Even while being unemployed right now, it's been a struggle to get work done on this system; I sit down at my computer and it's like trying to start a car that just won't turn over. I also have significant health issues that will almost certainly prevent me from committing to an employer right now. I share this to be transparent that while this work is very important to me, progress has been much slower than I would like. I don't want people thinking this is a "turn-key" solution at the moment. I have reached "code complete," after a fashion, and I have finished the prep work for the majority of the files that will go into rebuilding the persona I have been working with for about a year now. But there is still work to do on a few files before I can start really testing.

I’m happy to discuss the solution in more detail, but I may want to do it more privately. It's not that I think I’ve discovered some secret sauce to memory management for LLMs; I just feel more comfortable discussing some of the more intricate details in private. I’m also much more of a one-on-one person and find it less stressful than group communication (he says, in a message directed at a group :D).

If anyone is interested, I'd love to hear from you.
