Architecture
Offline RAG & hybrid retrieval
How NuraSafe grounds the on-device LLM with offline knowledge from KnowledgeBase.json using hybrid retrieval (lexical BM25 + dense E5 embeddings), then injects the top passages into the answer prompt when retrieval runs. The descriptions below mirror the iOS implementation (RAGEngine.swift, ChatEngine, EmbeddingService, etc.).
What ObjectBox is not doing here
ObjectBox is not the VectorStore. It holds a text mirror of chunks for persistence and possible future use, not the hot retrieval path.

1. Knowledge source (corpus)
| Item | Role |
|---|---|
| NuraSafe/Resources/KnowledgeBase.json | Single source of truth: an array of chunks with id, scenario, title, content. |
| KnowledgeChunk (RAGEngine.swift) | Codable mirror of each JSON row. |
At startup / first use, RAGEngine.buildIndex() loads this file into allChunks (in RAM). All candidate selection and BM25 “haystacks” are built from this array.
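The load step is simple enough to sketch. The following Python sketch (the app does this in Swift via Codable) shows the shape of the decode, assuming a flat JSON array with the four fields from the table above; the sample rows are hypothetical:

```python
import json

# Hypothetical two-chunk KnowledgeBase.json payload; the field names
# (id, scenario, title, content) come from the schema described above.
raw = """[
  {"id": "fire-001", "scenario": "fire", "title": "Evacuation basics",
   "content": "Leave immediately; do not use elevators."},
  {"id": "gen-001", "scenario": "general", "title": "Emergency numbers",
   "content": "In the UAE, call 999 for police."}
]"""

def load_chunks(text):
    """Parse the KB array, dropping any row missing a required field."""
    required = {"id", "scenario", "title", "content"}
    return [c for c in json.loads(text) if required <= c.keys()]

chunks = load_chunks(raw)
print(len(chunks))  # 2
```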
2. Index build (buildIndex → buildSemanticIndex)
Trigger: ChatEngine.loadModel() → ragEngine.buildIndex() — typically once per app session until the index is ready.
- Load JSON → allChunks.
- EmbeddingService.reloadFromBundleIfNeeded() — ensure the Core ML E5 model is loaded (multilingual-e5-small.mlpackage/.mlmodelc).
- If E5 is missing: the index can still be marked ready; ObjectBoxKnowledgeStore.replaceAll runs; retrieval is BM25-only (no dense vectors).
- If E5 is present:
  - Cache hit: KnowledgeIndexStore matches the KB and chunks.json has embeddings for every chunk → load embeddings → vectorStore.insertAll → ObjectBox sync → done.
  - Cache miss (KB hash change, tokenizer version bump, etc.): re-embed every chunk on a background Task.detached, each via embedPassage(title:content:) → KnowledgeChunkEntity with embeddingData.
- vectorStore.insertAll with (id, embedding).
- KnowledgeIndexStore.saveChunks — persists JSON under Application Support at NuraSafe/KnowledgeIndex/chunks.json.
- KnowledgeIndexStore.markUpToDate — stores a KB fingerprint in UserDefaults.
- ObjectBoxKnowledgeStore.replaceAll — one ObjectBox row per chunk (text fields on KnowledgeVectorEntity).
Embeddings on disk: KnowledgeChunkEntity.embeddingData — 384 floats per chunk (KnowledgeChunkEntity.swift).
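As a sketch of that on-disk layout, assuming the 384 floats are stored as contiguous little-endian Float32 bytes (the actual byte order used by KnowledgeChunkEntity.swift is not specified here), the round trip looks like:

```python
import struct

DIM = 384  # embedding dimension from the E5 model

def floats_to_bytes(vec):
    # Pack as little-endian float32: 384 * 4 = 1536 bytes per chunk.
    return struct.pack(f"<{len(vec)}f", *vec)

def bytes_to_floats(blob):
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

blob = floats_to_bytes([0.5] * DIM)
print(len(blob))  # 1536
```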
3. Embedding model (E5 + Core ML)
| Component | Detail |
|---|---|
| Model | intfloat/multilingual-e5-small exported to Core ML; 384-dim vectors, L2-normalized after mean pooling. |
| Inputs | Fixed length 128 tokens: input_ids, attention_mask. |
| Passages (index) | embedPassage builds a string passage: {title}: {content}, then embed(...). |
| Queries | embedQuery builds query: {retrievalQuery}, then embed(...). |
| Tokenizer | E5Tokenizer loads tokenizer.json from the bundle; Unigram Viterbi segmentation aligned with Hugging Face. |
| Inference | MLModel.prediction → hidden states → mean pool over masked positions → L2 normalize (EmbeddingService.swift). |
If the tokenizer fails to load, embed returns nil → no vectors for that item.
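The pooling and normalization step from the table above can be written out directly. This is a pure-Python sketch of the post-processing described for EmbeddingService.swift (masked mean pool, then L2 normalize); the 2-dim toy input is illustrative only:

```python
import math

def pool_and_normalize(hidden, mask):
    """Mean-pool hidden states over positions where mask == 1,
    then L2-normalize the pooled vector."""
    dim = len(hidden[0])
    count = sum(mask)
    pooled = [
        sum(h[d] for h, m in zip(hidden, mask) if m == 1) / count
        for d in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]

# Two real tokens plus one padding position (mask 0) that must be ignored:
# mean of (1,0) and (3,0) is (2,0), which normalizes to (1,0).
print(pool_and_normalize([[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]], [1, 1, 0]))
```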
4. Runtime vector search (VectorStore)
- Storage: in-memory array of VectorEntry (id, embedding).
- Search: brute-force dot product over all entries; because vectors are unit length, the dot product equals cosine similarity. Implemented with Accelerate vDSP_dotpr.
- Parameters: topK comes from RAGEngine (up to all candidates after filtering). Threshold = 0.25 — matches below that similarity are dropped. (Logs may still show many "above threshold" scores if the model assigns high similarity broadly.)
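The search itself is a few lines. A plain-Python equivalent of the brute-force pass (the app uses Accelerate vDSP_dotpr for the dot products; the 2-dim vectors here are toy data):

```python
def search(query, entries, top_k, threshold=0.25):
    """Brute-force search over (id, embedding) pairs.
    Dot product == cosine similarity because all vectors are unit length."""
    scored = [
        (cid, sum(q * v for q, v in zip(query, vec)))
        for cid, vec in entries
    ]
    scored = [(cid, s) for cid, s in scored if s >= threshold]  # 0.25 cutoff
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:top_k]

entries = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.6, 0.8])]
print(search([1.0, 0.0], entries, top_k=2))  # [('a', 1.0), ('c', 0.6)]
```

Note that "b" scores 0.0 here and is removed by the 0.25 threshold before the top-k cut.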
No HNSW in the hot path: with ~120 chunks, exhaustive dot-product search is cheap.
5. Hybrid retrieval (RAGEngine.retrieve)
Runs only when ChatEngine has a non-nil retrieval query (see §7 Orchestration).
| Parameter | Use |
|---|---|
| query | Embedded with E5 as query: ... (from query generator + fusion). |
| userMessageForSignals | Verbatim user message for BM25 (and UAE tweak). If empty, falls back to query. |
| scenario | Optional EmergencyScenario → filters candidate chunks. |
5.1 Candidate set
```swift
private func candidates(for scenario: EmergencyScenario?) -> [KnowledgeChunk] {
    if let sc = scenario {
        return allChunks.filter {
            $0.scenario == sc.rawValue || $0.scenario == "general"
        }
    }
    return allChunks
}
```

- No emergency mode: all chunks (~120).
- Mode active: only chunks where scenario matches the mode or scenario == "general".
5.2 BM25 (always, on candidates)
Okapi BM25 with k1 = 1.5, b = 0.75.

- Query terms: lowercased, split on non-alphanumeric, length ≥ 3, stopwords removed.
- Document ("haystack"): title + content. Document length = token count after the same tokenization.
- Bonuses: extra score if a term appears in the title; extra if it appears in the scenario or chunk id string.
- Normalization: scores are divided by the max BM25 across candidates → normBM25 ∈ [0, 1] per chunk id.
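A minimal Python sketch of this scoring, using the k1/b values and tokenization rules above. The IDF variant, the stopword list, and the sample documents are assumptions (the title/scenario/id bonuses are omitted for brevity):

```python
import math
import re

K1, B = 1.5, 0.75
STOPWORDS = {"the", "and", "for", "you"}  # illustrative subset only

def terms(text):
    # Lowercase, split on non-alphanumeric, keep length >= 3, drop stopwords.
    toks = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in toks if len(t) >= 3 and t not in STOPWORDS]

def bm25_scores(query, docs):
    """Okapi BM25 over title+content haystacks, normalized by the max score."""
    toks = {d: terms(text) for d, text in docs.items()}
    n = len(docs)
    avgdl = sum(len(t) for t in toks.values()) / n
    scores = {}
    for d, dt in toks.items():
        s = 0.0
        for q in terms(query):
            tf = dt.count(q)
            if tf == 0:
                continue
            df = sum(1 for t in toks.values() if q in t)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            s += idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * len(dt) / avgdl))
        scores[d] = s
    mx = max(scores.values()) or 1.0
    return {d: s / mx for d, s in scores.items()}  # normBM25 in [0, 1]

docs = {"fire": "fire evacuation leave building fast",
        "flood": "flood water rising move uphill"}
print(bm25_scores("fire evacuation", docs))  # fire -> 1.0, flood -> 0.0
```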
5.3 E5 (when model + vectors exist)
Embed the query string only (not necessarily the raw user message, unless fusion makes them identical). vectorStore.search with topK = candidates.count, then restrict to ids still in the scenario-filtered set.
Degenerate detection: on the top ~20 filtered similarities, if min > 0.88 and spread < 0.06, E5 is treated as degenerate → semantic weight 0 (lexical-only fallback).
Otherwise: semantic similarities are min–max normalized per query to normSemantic.
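Both steps reduce to a few lines each. A Python sketch using the thresholds stated above (min > 0.88, spread < 0.06 over the top ~20 filtered similarities):

```python
def is_degenerate(sims, min_floor=0.88, spread_floor=0.06, window=20):
    """If even the lowest of the top-N similarities is very high and the
    spread is tiny, E5 is not discriminating: zero its weight."""
    top = sorted(sims, reverse=True)[:window]
    return min(top) > min_floor and (max(top) - min(top)) < spread_floor

def min_max_normalize(sims):
    """Per-query min-max normalization of id -> similarity."""
    lo, hi = min(sims.values()), max(sims.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in sims.items()}

print(is_degenerate([0.91, 0.90, 0.89]))      # True: flat, uniformly high
print(is_degenerate([0.91, 0.60, 0.30]))      # False: healthy spread
print(min_max_normalize({"a": 0.9, "b": 0.5}))
```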
5.4 Score fusion
- Healthy E5: 0.60 × normSemantic + 0.40 × normBM25 + UAE adjustment (below).
- Degenerate E5: BM25 only (semantic weight 0).
5.5 UAE lexical adjustment
If user signal text looks UAE / official-sources-related, small ± adjustments nudge chunks that mention NCEMA, UAE, etc., and slightly penalize tactical scenario chunks with no UAE content (uaeOfficialSourcesAdjustment).
5.6 Output
Sort by combined score descending. Take topK = 3 chunks. Return [KnowledgeChunk] — full text for prompt formatting.
6. ObjectBox vs JSON cache vs RAM
| Store | What it holds | Used in retrieval? |
|---|---|---|
| allChunks | Full KB from JSON | Yes — source of chunks returned to the LLM. |
| VectorStore | id → embedding | Yes — cosine search. |
| KnowledgeIndexStore | chunks.json with embeddings + KB hash | Warm-start — avoids re-embedding; not queried each turn except to reload vectors. |
| ObjectBoxKnowledgeStore | KnowledgeVectorEntity: chunkId, scenario, title, content | No for ranking — persistence / sync on index build; text mirror for future use. |
7. Orchestration: when RAG runs (ChatEngine)
Per user message, order of operations:
- Rebuild memory from transcript.
- IntentRouter.route — tool style / urgency (it does not solely gate retrieval vs direct answer; RAG gating is separate).
- generateRetrievalQuery — first LLM call (small, low temperature):
  - General chat, no emergency mode: the system prompt asks for either <retrieval_query>…</retrieval_query> or <retrieval_skip/>. Parsed by RAGQueryGeneration.parseTaggedQuery. Skip → sentinel __RAG_SKIP__ → ChatEngine returns nil → no retrieve(), no chunks.
  - Emergency mode active: different system prompt — always a real query; if the model still outputs skip, override to [mode] + user message (literal user text).
- If retrievalQuery != nil: ragEngine.retrieve(query:userMessageForSignals:scenario:) with userMessageForSignals = raw user text.
- PromptService.buildPrompt — if ragChunks is empty, no KB block; else RAGEngine.formatContext wraps the chunks with labels.
- Second LLM call — main answer, streaming.
Fusion after query LLM: fusedRetrievalQuery can replace the LLM phrase with the raw user message when overlap is too low (anti-drift).
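The tag-parsing step can be approximated in a few lines. This is a Python sketch of the behavior attributed to RAGQueryGeneration.parseTaggedQuery above (the tag names come from the text; the real Swift parser may differ in details):

```python
import re

def parse_tagged_query(output):
    """Extract the query between <retrieval_query> tags; map
    <retrieval_skip/> (or no tag at all) to None, i.e. no retrieve() call."""
    if "<retrieval_skip/>" in output:
        return None
    m = re.search(r"<retrieval_query>(.*?)</retrieval_query>", output, re.DOTALL)
    return m.group(1).strip() if m else None

print(parse_tagged_query("<retrieval_query>flood safety uae</retrieval_query>"))
print(parse_tagged_query("<retrieval_skip/>"))  # None
```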
8. Prompt injection (formatContext + PromptService)
Retrieved chunks are formatted as numbered [1] title: content with instructions that [1] is most relevant, and to preserve verbatim codes / numbers. In the user turn they sit above the current user message, inside a “Reference knowledge from NuraSafe database” wrapper (see PromptService.swift).
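The layout described above can be sketched as follows; the wrapper wording here paraphrases PromptService.swift rather than quoting it, and the sample chunk is hypothetical:

```python
def format_context(chunks):
    """Number retrieved chunks as '[i] title: content', most relevant first."""
    lines = ["Reference knowledge from NuraSafe database "
             "([1] is most relevant; preserve verbatim codes / numbers):"]
    for i, c in enumerate(chunks, start=1):
        lines.append(f"[{i}] {c['title']}: {c['content']}")
    return "\n".join(lines)

out = format_context([{"title": "Emergency numbers", "content": "Call 999."}])
print(out.splitlines()[1])  # [1] Emergency numbers: Call 999.
```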
9. Key constants
| Constant | Value / behavior |
|---|---|
| topK | 3 chunks |
| Vector similarity threshold | 0.25 |
| E5 + BM25 weights | 0.6 / 0.4 when E5 healthy; else BM25 1.0 |
| Degenerate E5 | min top-20 sim > 0.88 and spread < 0.06 |
| Embedding dim | 384 |
| Max sequence | 128 tokens |
10. End-to-end flow
KnowledgeBase.json
│
▼
buildIndex → allChunks (RAM)
│
├─ E5 OK? ──► embedPassage per chunk → VectorStore + KnowledgeIndexStore (disk)
│
└─► ObjectBox.replaceAll (text mirror)
Per message:
Query LLM → parse (skip | query string) → [optional] fusedRetrievalQuery
│
├─ nil query? ──► skip retrieve → answer LLM without KB
│
└─ retrieve: scenario filter → BM25(signalText) + E5(query)
→ hybrid + UAE → top 3 chunks
│
▼
Prompt with history + optional RAG block → Answer LLM (stream)

In short: corpus → index (E5 + cache + ObjectBox text mirror) → gated query LLM → hybrid BM25 + E5 on filtered candidates → top-3 injection → answer LLM.
11. Developer
NuraSafe is built by developers who care about offline-first AI and safety-critical UX. For professional background, open-source interests, or collaboration, you can connect on LinkedIn.
LinkedIn: linkedin.com/in/hsamichg