Hot-Swappable AI: Building Modular Memory for LLMs

I set out to see if an AI model could “think” it already knew something—without retraining it. That curiosity turned into a full-blown experiment using virtual tokens, LoRA adapters, and a Streamlit interface that lets me hot-swap micro-models at runtime. What started as an exploration of model memory evolved into a working proof-of-concept for modular cognition.

The Initial Question

Retrieval-Augmented Generation (RAG) works well for many tasks, but it always feels reactive. Each question starts fresh, context is reloaded, and nothing really “sticks.” I wanted to know if a model could retain new understanding through small injections of data—like installing micro-memories—without retraining or bloating the base model.

Phase 0: Virtual Tokens Prototype

The first idea was to inject learnable virtual tokens directly into the embedding layer. Each set of tokens represented a document or domain, like a company policy or HR handbook. I used GPT-4 to generate small Q&A training pairs and trained these tokens alongside the model. Even with limited GPU power, the model began to respond differently depending on which tokens were loaded. The signal was subtle but consistent enough to prove the concept.


# Virtual token embedding
import torch
import torch.nn as nn

class VirtualTokenEmbedder(nn.Module):
    def __init__(self, num_tokens=20, hidden_size=768):
        super().__init__()
        # Learnable "memory" vectors, one row per virtual token
        self.embeddings = nn.Parameter(torch.randn(num_tokens, hidden_size))

    def forward(self, input_embeds):
        # Expand to the batch size, then prepend to the input embeddings
        batch_size = input_embeds.size(0)
        virtual = self.embeddings.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([virtual, input_embeds], dim=1)
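
The training step isn't shown in detail above, so here is a minimal sketch of how only the virtual tokens might be optimized while the base model stays frozen. The post doesn't name the Phase 0 base model or data pipeline, so the GPT-2 checkpoint, learning rate, and label padding below are illustrative assumptions.


# Training sketch (illustrative): optimize only the virtual tokens, freeze the base model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # assumed 768-dim base model
embedder = VirtualTokenEmbedder(num_tokens=20, hidden_size=768)

for p in model.parameters():
    p.requires_grad = False  # base weights stay untouched

optimizer = torch.optim.AdamW([embedder.embeddings], lr=1e-3)

def training_step(input_ids, labels):
    # input_ids / labels come from the GPT-4-generated Q&A pairs, already tokenized
    input_embeds = model.get_input_embeddings()(input_ids)
    conditioned = embedder(input_embeds)
    # Pad labels with -100 so the prepended virtual tokens are ignored by the loss
    pad = torch.full((labels.size(0), embedder.embeddings.size(0)), -100)
    outputs = model(inputs_embeds=conditioned, labels=torch.cat([pad, labels], dim=1))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()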

Phase 0.5: LayerPool and Streamlit Interface

Upgrading to Flan-T5-base allowed me to build a Streamlit-based control panel with sliders and toggles for mounting/unmounting layers, adjusting blend weights, and seeing results live. The LayerPool class blended multiple “memory” sources, giving each layer a distinct identity. At this stage, the idea felt alive: each small adapter could be turned on or off, effectively changing the model’s “personality” or knowledge domain at runtime.


# Blend LoRA and virtual token layers
class LayerPool:
    def __init__(self):
        # Named "memory" layers that can be mounted and unmounted at runtime
        self.layers = {}

    def mount(self, name, layer):
        self.layers[name] = layer

    def unmount(self, name):
        self.layers.pop(name, None)

    def forward(self, x):
        # Apply every mounted memory layer in sequence
        for layer in self.layers.values():
            x = layer(x)
        return x
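
The post doesn't include the Streamlit code itself, so the snippet below is only a sketch of how checkboxes and a slider could drive the LayerPool at runtime; the placeholder layers and the way the blend weight is surfaced are assumptions.


# Streamlit control panel sketch (illustrative; not the original app code)
import streamlit as st

pool = LayerPool()
# Placeholder identity layers stand in for real virtual-token / LoRA memory layers
available = {"hr": (lambda x: x), "tech": (lambda x: x)}

st.sidebar.header("Memory layers")
for name, layer in available.items():
    # Toggle each memory layer on or off without reloading the model
    if st.sidebar.checkbox(f"Mount {name}"):
        pool.mount(name, layer)
    else:
        pool.unmount(name)

# Blend weight slider; how the weight feeds into blending is an assumption
blend = st.sidebar.slider("Blend weight", 0.0, 1.0, 0.5)
st.write("Active layers:", list(pool.layers.keys()), "blend =", blend)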

Phase 1: LoRA Adapters

To increase conditioning strength, I integrated LoRA adapters using the PEFT library. Each adapter was trained on a small domain dataset—HR, Tech, or Healthcare—and can be hot-swapped dynamically without retraining the base model. Combined with virtual tokens, this created a modular system with semantic (embedding) and structural (LoRA) memory components.


# Load adapters dynamically
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
adapter = PeftModel.from_pretrained(base_model, "./models/hr_adapter")
adapter.eval()
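
The snippet above mounts a single adapter. PEFT also supports registering several adapters on one model and switching between them, which is how hot-swapping could look at runtime; the Tech and Healthcare adapter paths below are assumptions, mirroring the HR path above.


# Hot-swap sketch: register multiple adapters, then switch without reloading the base model.
# The tech/healthcare paths are assumed by analogy with ./models/hr_adapter.
model = PeftModel.from_pretrained(base_model, "./models/hr_adapter", adapter_name="hr")
model.load_adapter("./models/tech_adapter", adapter_name="tech")
model.load_adapter("./models/healthcare_adapter", adapter_name="healthcare")

model.set_adapter("tech")   # route inference through the Tech adapter
model.set_adapter("hr")     # swap back to HR at runtime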

System Architecture Overview

Here’s a simplified visualization of how the base model, virtual tokens, and LoRA adapters interact:

graph TD
    A[User Input] --> B[Base Model Embeddings]
    B --> C{Virtual Tokens Active?}
    C -->|Yes| D[Prepend Virtual Tokens]
    C -->|No| B
    D --> E[LayerPool: Apply LoRA Adapters]
    B --> E
    E --> F[Seq2Seq Decoder]
    F --> G[Output Response]

Results So Far

  • Virtual tokens condition the model at runtime.
  • LoRA adapters can be hot-swapped without breaking inference.
  • Cross-domain reasoning works in small tests (e.g., HR + Tech).
  • Latency is low enough for live interaction via Streamlit.

Challenges and Bottlenecks

The main bottleneck is GPU power. Training multiple adapters on a local machine takes hours, so larger-scale experiments are paused until I can move to cloud GPUs. Retrieval systems added little over direct adapter conditioning, so the focus remains on modular memory injection.

Next Steps

  • Validate adapter accuracy across domains (HR, Tech, Healthcare).
  • Experiment with persistent memory and FAISS indexing for scalability.
  • Explore cross-adapter reasoning—mixing multiple domains at once.
  • Optimize LoRA training or replace with embedding-based virtual conditioning.

In Closing

Maybe the future of AI isn’t just bigger models. Maybe it’s smaller ones that can learn, connect, and work together. This project started as curiosity and grew into a working prototype for runtime-conditioned intelligence. The model doesn’t just retrieve information—it mounts new understanding dynamically, a small but meaningful step toward adaptable AI systems.


