Hot-Swappable AI: Building Modular Memory for LLMs
I set out to see if an AI model could “think” it already knew something—without retraining it. That curiosity turned into a full-blown experiment using virtual tokens, LoRA adapters, and a Streamlit interface that lets me hot-swap micro-models at runtime. What started as an exploration of model memory evolved into a working proof-of-concept for modular cognition.
The Initial Question
Retrieval-Augmented Generation (RAG) works well for many tasks, but it always feels reactive. Each question starts fresh, context is reloaded, and nothing really “sticks.” I wanted to know if a model could retain new understanding through small injections of data—like installing micro-memories—without retraining or bloating the base model.
Phase 0: Virtual Tokens Prototype
The first idea was to inject learnable virtual tokens directly into the embedding layer. Each set of tokens represented a document or domain, like a company policy or HR handbook. I used GPT-4 to generate small Q&A training pairs and trained these tokens alongside the model. Even with limited GPU power, the model began to respond differently depending on which tokens were loaded. The signal was subtle but consistent enough to prove the concept.
# Virtual token embedding: learnable prefix vectors prepended to the input embeddings
import torch
import torch.nn as nn

class VirtualTokenEmbedder(nn.Module):
    def __init__(self, num_tokens=20, hidden_size=768):
        super().__init__()
        self.embeddings = nn.Parameter(torch.randn(num_tokens, hidden_size))  # one trainable row per virtual token

    def forward(self, input_embeds):
        # Expand to the batch size, then prepend along the sequence dimension
        prefix = self.embeddings.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)
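Training these is cheap because only the new embeddings get gradients. Here's a minimal training sketch that keeps the base model frozen, in line with the no-retraining goal; it's illustrated with Flan-T5-base (the model from the later phases, since the Phase 0 base model isn't named here), and qa_pairs stands in for the GPT-4-generated question/answer pairs.
# Train only the virtual tokens; every base weight stays frozen
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.requires_grad_(False)  # freeze the base model

embedder = VirtualTokenEmbedder(num_tokens=20, hidden_size=model.config.d_model)
optimizer = torch.optim.AdamW(embedder.parameters(), lr=1e-3)

for question, answer in qa_pairs:  # list of (question, answer) strings
    enc = tokenizer(question, return_tensors="pt")
    labels = tokenizer(answer, return_tensors="pt").input_ids
    inputs_embeds = embedder(model.get_input_embeddings()(enc.input_ids))
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()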
Phase 0.5: LayerPool and Streamlit Interface
Upgrading to Flan-T5-base allowed me to build a Streamlit-based control panel with sliders and toggles for mounting/unmounting layers, adjusting blend weights, and seeing results live. The LayerPool class blended multiple “memory” sources, giving each layer a distinct identity. At this stage, the idea felt alive: each small adapter could be turned on or off, effectively changing the model’s “personality” or knowledge domain at runtime.
# Pool of mounted memory layers (LoRA or virtual-token), applied in sequence
class LayerPool:
    def __init__(self):
        self.layers = {}  # name -> layer module

    def mount(self, name, layer):
        self.layers[name] = layer

    def unmount(self, name):
        self.layers.pop(name, None)

    def forward(self, x):
        # Each mounted layer transforms the running representation in turn
        for layer in self.layers.values():
            x = layer(x)
        return x
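The Streamlit side is mostly widget plumbing. Here's a stripped-down sketch of the control panel, where hr_layer, tech_layer, and the answer() helper are placeholders for the pre-built memory layers and the inference call:
# Streamlit control panel sketch: mount/unmount memory layers from the sidebar
import streamlit as st

pool = LayerPool()
modules = {"HR": hr_layer, "Tech": tech_layer}  # placeholder pre-built memory layers

st.sidebar.title("Memory modules")
for name, layer in modules.items():
    if st.sidebar.checkbox(f"Mount {name}"):
        pool.mount(name, layer)
    else:
        pool.unmount(name)
    # A per-module blend-weight slider could look like:
    # weight = st.sidebar.slider(f"{name} weight", 0.0, 1.0, 1.0)

prompt = st.text_input("Ask a question")
if prompt:
    st.write(answer(prompt, pool))  # placeholder helper that runs inference through the pool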
Phase 1: LoRA Adapters
To increase conditioning strength, I integrated LoRA adapters using the PEFT library. Each adapter was trained on a small domain dataset—HR, Tech, or Healthcare—and can be hot-swapped dynamically without retraining the base model. Combined with virtual tokens, this created a modular system with semantic (embedding) and structural (LoRA) memory components.
# Load adapters dynamically
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

# The base model stays frozen; the LoRA adapter is mounted on top of it
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
adapter = PeftModel.from_pretrained(base_model, "./models/hr_adapter")
adapter.eval()
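Because the base weights never change, swapping domains is just loading another adapter and activating it by name. A quick sketch using PEFT's multi-adapter support (the Tech adapter path is hypothetical, mirroring the HR one above):
# Hot-swap sketch: load a second adapter onto the same wrapped model and switch by name
adapter.load_adapter("./models/tech_adapter", adapter_name="tech")

adapter.set_adapter("tech")      # answer with the Tech "memory" active
adapter.set_adapter("default")   # switch back to the HR adapter loaded above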
System Architecture Overview
Here’s a simplified view of how the base model, virtual tokens, and LoRA adapters interact at inference time:
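In code terms, the path looks roughly like this, reusing the embedder and adapter objects from the snippets above (the question is just an example):
# Inference path sketch: prompt -> token embeddings -> virtual-token prefix -> LoRA-adapted generation
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
inputs = tokenizer("What does the handbook say about remote work?", return_tensors="pt")

token_embeds = adapter.get_input_embeddings()(inputs.input_ids)  # frozen base embeddings
conditioned = embedder(token_embeds)                             # prepend the active virtual tokens

output_ids = adapter.generate(inputs_embeds=conditioned, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))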
Results So Far
- Virtual tokens condition the model at runtime.
- LoRA adapters can be hot-swapped without breaking inference.
- Cross-domain reasoning works in small tests (e.g., HR + Tech).
- Latency is low enough for live interaction via Streamlit.
Challenges and Bottlenecks
The main bottleneck is GPU power. Training multiple adapters on a local machine takes hours, so larger-scale experiments are paused until I can move to cloud GPUs. Retrieval systems added little over direct adapter conditioning, so the focus remains on modular memory injection.
Next Steps
- Validate adapter accuracy across domains (HR, Tech, Healthcare).
- Experiment with persistent memory and FAISS indexing for scalability (a rough sketch follows this list).
- Explore cross-adapter reasoning—mixing multiple domains at once.
- Optimize LoRA training or replace with embedding-based virtual conditioning.
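For the FAISS point above, the idea would be to embed stored memories (documents or Q&A pairs) once and look them up by similarity at query time. A minimal sketch, assuming memory_vectors and query_vec are float32 NumPy arrays from whatever embedding model ends up being used:
# FAISS sketch: index memory embeddings, then retrieve the nearest ones for a query
import faiss

dim = memory_vectors.shape[1]
index = faiss.IndexFlatL2(dim)             # exact L2 search; fine at small scale
index.add(memory_vectors)                  # shape (num_memories, dim), dtype float32

distances, ids = index.search(query_vec.reshape(1, -1), k=3)
print(ids[0])                              # indices of the closest stored memories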
In Closing
Maybe the future of AI isn’t just bigger models. Maybe it’s smaller ones that can learn, connect, and work together. This project started as curiosity and grew into a working prototype for runtime-conditioned intelligence. The model doesn’t just retrieve information—it mounts new understanding dynamically, a small but meaningful step toward adaptable AI systems.