Architecture
Offline RAG & hybrid retrieval
How NuraSafe grounds the on-device LLM with offline knowledge from KnowledgeBase.json using hybrid retrieval (lexical BM25 + dense E5 embeddings), then injects the top passages into the answer prompt when retrieval runs. The descriptions below mirror the iOS implementation (RAGEngine.swift, ChatEngine, EmbeddingService, etc.).
What ObjectBox is not doing here
ObjectBox is not the VectorStore. It holds a text mirror of chunks for persistence and possible future use, not the hot retrieval path.

1. Knowledge source (corpus)
| Item | Role |
|---|---|
| NuraSafe/Resources/KnowledgeBase.json | Single source of truth: an array of chunks with id, scenario, title, content. |
| KnowledgeChunk (RAGEngine.swift) | Codable mirror of each JSON row. |
At startup / first use, RAGEngine.buildIndex() loads this file into allChunks (in RAM). All candidate selection and BM25 “haystacks” are built from this array.
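The load step is simple enough to sketch. The following Python sketch (the app does this in Swift via Codable) shows the shape of the decode, assuming a flat JSON array with the four fields from the table above; the sample rows are hypothetical:

```python
import json

# Hypothetical two-chunk KnowledgeBase.json payload; the field names
# (id, scenario, title, content) come from the schema described above.
raw = """[
  {"id": "fire-001", "scenario": "fire", "title": "Evacuation basics",
   "content": "Leave immediately; do not use elevators."},
  {"id": "gen-001", "scenario": "general", "title": "Emergency numbers",
   "content": "In the UAE, call 999 for police."}
]"""

def load_chunks(text):
    """Parse the KB array, dropping any row missing a required field."""
    required = {"id", "scenario", "title", "content"}
    return [c for c in json.loads(text) if required <= c.keys()]

chunks = load_chunks(raw)
print(len(chunks))  # 2
```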
2. Index build (buildIndex → buildSemanticIndex)
Trigger: ChatEngine.loadModel() → ragEngine.buildIndex() — typically once per app session until the index is ready.
- Load JSON → allChunks.
- EmbeddingService.reloadFromBundleIfNeeded() — ensure the Core ML E5 model is loaded (multilingual-e5-small.mlpackage/.mlmodelc).
- If E5 is missing: the index can still be marked ready; ObjectBoxKnowledgeStore.replaceAll runs; retrieval is BM25-only (no dense vectors).
- If E5 is present:
  - Cache hit: KnowledgeIndexStore matches the KB and chunks.json has embeddings for every chunk → load embeddings → vectorStore.insertAll → ObjectBox sync → done.
  - Cache miss (KB hash change, tokenizer version bump, etc.): re-embed every chunk on a background Task.detached, each via embedPassage(title:content:) → KnowledgeChunkEntity with embeddingData.
- vectorStore.insertAll with (id, embedding).
- KnowledgeIndexStore.saveChunks — persists JSON under Application Support at NuraSafe/KnowledgeIndex/chunks.json.
- KnowledgeIndexStore.markUpToDate — stores a KB fingerprint in UserDefaults.
- ObjectBoxKnowledgeStore.replaceAll — one ObjectBox row per chunk (text fields on KnowledgeVectorEntity).
Embeddings on disk: KnowledgeChunkEntity.embeddingData — 384 floats per chunk (KnowledgeChunkEntity.swift).
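As a sketch of that on-disk layout, assuming the 384 floats are stored as contiguous little-endian Float32 bytes (the actual byte order used by KnowledgeChunkEntity.swift is not specified here), the round trip looks like:

```python
import struct

DIM = 384  # embedding dimension from the E5 model

def floats_to_bytes(vec):
    # Pack as little-endian float32: 384 * 4 = 1536 bytes per chunk.
    return struct.pack(f"<{len(vec)}f", *vec)

def bytes_to_floats(blob):
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

blob = floats_to_bytes([0.5] * DIM)
print(len(blob))  # 1536
```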
3. Embedding model (E5 + Core ML)
| Component | Detail |
|---|---|
| Model | intfloat/multilingual-e5-small exported to Core ML; 384-dim vectors, L2-normalized after mean pooling. |
| Inputs | Fixed length 128 tokens: input_ids, attention_mask. |
| Passages (index) | embedPassage builds a string passage: {title}: {content}, then embed(...). |
| Queries | embedQuery builds query: {retrievalQuery}, then embed(...). |
| Tokenizer | E5Tokenizer loads tokenizer.json from the bundle; Unigram Viterbi segmentation aligned with Hugging Face. |
| Inference | MLModel.prediction → hidden states → mean pool over masked positions → L2 normalize (EmbeddingService.swift). |
If the tokenizer fails to load, embed returns nil → no vectors for that item.
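The pooling and normalization step from the table above can be written out directly. This is a pure-Python sketch of the post-processing described for EmbeddingService.swift (masked mean pool, then L2 normalize); the 2-dim toy input is illustrative only:

```python
import math

def pool_and_normalize(hidden, mask):
    """Mean-pool hidden states over positions where mask == 1,
    then L2-normalize the pooled vector."""
    dim = len(hidden[0])
    count = sum(mask)
    pooled = [
        sum(h[d] for h, m in zip(hidden, mask) if m == 1) / count
        for d in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]

# Two real tokens plus one padding position (mask 0) that must be ignored:
# mean of (1,0) and (3,0) is (2,0), which normalizes to (1,0).
print(pool_and_normalize([[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]], [1, 1, 0]))
```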
4. Runtime vector search (VectorStore)
- Storage: in-memory array of VectorEntry (id, embedding).
- Search: brute-force dot product over all entries; because vectors are unit length, the dot product equals cosine similarity. Implemented with Accelerate vDSP_dotpr.
- Parameters: topK comes from RAGEngine (up to all candidates after filtering). Threshold = 0.25 — matches below that similarity are dropped. (Logs may still show many "above threshold" scores if the model assigns high similarity broadly.)
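The search itself is a few lines. A plain-Python equivalent of the brute-force pass (the app uses Accelerate vDSP_dotpr for the dot products; the 2-dim vectors here are toy data):

```python
def search(query, entries, top_k, threshold=0.25):
    """Brute-force search over (id, embedding) pairs.
    Dot product == cosine similarity because all vectors are unit length."""
    scored = [
        (cid, sum(q * v for q, v in zip(query, vec)))
        for cid, vec in entries
    ]
    scored = [(cid, s) for cid, s in scored if s >= threshold]  # 0.25 cutoff
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:top_k]

entries = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.6, 0.8])]
print(search([1.0, 0.0], entries, top_k=2))  # [('a', 1.0), ('c', 0.6)]
```

Note that "b" scores 0.0 here and is removed by the 0.25 threshold before the top-k cut.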
No HNSW in the hot path: with ~120 chunks, exhaustive dot-product search is cheap.
5. Hybrid retrieval (RAGEngine.retrieve)
Runs only when ChatEngine has a non-nil retrieval query (see §7 Orchestration).
| Parameter | Use |
|---|---|
| query | Embedded with E5 as query: ... (from query generator + fusion). |
| userMessageForSignals | Verbatim user message for BM25 (and UAE tweak). If empty, falls back to query. |
| scenario | Optional EmergencyScenario → filters candidate chunks. |
5.1 Candidate set
```swift
private func candidates(for scenario: EmergencyScenario?) -> [KnowledgeChunk] {
    if let sc = scenario {
        return allChunks.filter {
            $0.scenario == sc.rawValue || $0.scenario == "general"
        }
    }
    return allChunks
}
```

- No emergency mode: all chunks (~120).
- Mode active: only chunks where scenario matches the mode or scenario == "general".
5.2 BM25 (always, on candidates)
Okapi BM25 with k1 = 1.5, b = 0.75.

- Query terms: lowercased, split on non-alphanumeric, length ≥ 3, stopwords removed.
- Document ("haystack"): title + content. Document length = token count after the same tokenization.
- Bonuses: extra score if a term appears in the title; extra if it appears in the scenario or chunk id string.
- Normalization: scores are divided by the max BM25 across candidates → normBM25 ∈ [0, 1] per chunk id.
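A minimal Python sketch of this scoring, using the k1/b values and tokenization rules above. The IDF variant, the stopword list, and the sample documents are assumptions (the title/scenario/id bonuses are omitted for brevity):

```python
import math
import re

K1, B = 1.5, 0.75
STOPWORDS = {"the", "and", "for", "you"}  # illustrative subset only

def terms(text):
    # Lowercase, split on non-alphanumeric, keep length >= 3, drop stopwords.
    toks = re.split(r"[^a-z0-9]+", text.lower())
    return [t for t in toks if len(t) >= 3 and t not in STOPWORDS]

def bm25_scores(query, docs):
    """Okapi BM25 over title+content haystacks, normalized by the max score."""
    toks = {d: terms(text) for d, text in docs.items()}
    n = len(docs)
    avgdl = sum(len(t) for t in toks.values()) / n
    scores = {}
    for d, dt in toks.items():
        s = 0.0
        for q in terms(query):
            tf = dt.count(q)
            if tf == 0:
                continue
            df = sum(1 for t in toks.values() if q in t)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            s += idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * len(dt) / avgdl))
        scores[d] = s
    mx = max(scores.values()) or 1.0
    return {d: s / mx for d, s in scores.items()}  # normBM25 in [0, 1]

docs = {"fire": "fire evacuation leave building fast",
        "flood": "flood water rising move uphill"}
print(bm25_scores("fire evacuation", docs))  # fire -> 1.0, flood -> 0.0
```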
5.3 E5 (when model + vectors exist)
Embed the query string only (not necessarily the raw user message, unless fusion makes them identical). vectorStore.search with topK = candidates.count, then restrict to ids still in the scenario-filtered set.
Degenerate detection: on the top ~20 filtered similarities, if min > 0.88 and spread < 0.06, E5 is treated as degenerate → semantic weight 0 (lexical-only fallback).
Otherwise: semantic similarities are min–max normalized per query to normSemantic.
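Both steps reduce to a few lines each. A Python sketch using the thresholds stated above (min > 0.88, spread < 0.06 over the top ~20 filtered similarities):

```python
def is_degenerate(sims, min_floor=0.88, spread_floor=0.06, window=20):
    """If even the lowest of the top-N similarities is very high and the
    spread is tiny, E5 is not discriminating: zero its weight."""
    top = sorted(sims, reverse=True)[:window]
    return min(top) > min_floor and (max(top) - min(top)) < spread_floor

def min_max_normalize(sims):
    """Per-query min-max normalization of id -> similarity."""
    lo, hi = min(sims.values()), max(sims.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in sims.items()}

print(is_degenerate([0.91, 0.90, 0.89]))      # True: flat, uniformly high
print(is_degenerate([0.91, 0.60, 0.30]))      # False: healthy spread
print(min_max_normalize({"a": 0.9, "b": 0.5}))
```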
5.4 Score fusion
- Healthy E5: 0.60 × normSemantic + 0.40 × normBM25 + UAE adjustment (below).
- Degenerate E5: BM25 only (semantic weight 0).
5.5 UAE lexical adjustment
If user signal text looks UAE / official-sources-related, small ± adjustments nudge chunks that mention NCEMA, UAE, etc., and slightly penalize tactical scenario chunks with no UAE content (uaeOfficialSourcesAdjustment).
5.6 Output
Sort by combined score descending. Take topK = 3 chunks. Return [KnowledgeChunk] — full text for prompt formatting.
6. ObjectBox vs JSON cache vs RAM
| Store | What it holds | Used in retrieval? |
|---|---|---|
| allChunks | Full KB from JSON | Yes — source of chunks returned to the LLM. |
| VectorStore | id → embedding | Yes — cosine search. |
| KnowledgeIndexStore | chunks.json with embeddings + KB hash | Warm-start — avoids re-embedding; not queried each turn except to reload vectors. |
| ObjectBoxKnowledgeStore | KnowledgeVectorEntity: chunkId, scenario, title, content | No for ranking — persistence / sync on index build; text mirror for future use. |
7. Orchestration: when RAG runs (ChatEngine)
Per user message, order of operations:
- Rebuild memory from transcript.
- IntentRouter.route — tool style / urgency (it does not solely gate retrieval vs direct answer; RAG gating is separate).
- generateRetrievalQuery — first LLM call (small, low temperature):
  - General chat, no emergency mode: the system prompt asks for either <retrieval_query>…</retrieval_query> or <retrieval_skip/>. Parsed by RAGQueryGeneration.parseTaggedQuery. Skip → sentinel __RAG_SKIP__ → ChatEngine returns nil → no retrieve(), no chunks.
  - Emergency mode active: different system prompt — always a real query; if the model still outputs skip, override to [mode] + user message (literal user text).
- If retrievalQuery != nil: ragEngine.retrieve(query:userMessageForSignals:scenario:) with userMessageForSignals = raw user text.
- PromptService.buildPrompt — if ragChunks is empty, no KB block; else RAGEngine.formatContext wraps the chunks with labels.
- Second LLM call — main answer, streaming.
Fusion after query LLM: fusedRetrievalQuery can replace the LLM phrase with the raw user message when overlap is too low (anti-drift).
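The tag-parsing step can be approximated in a few lines. This is a Python sketch of the behavior attributed to RAGQueryGeneration.parseTaggedQuery above (the tag names come from the text; the real Swift parser may differ in details):

```python
import re

def parse_tagged_query(output):
    """Extract the query between <retrieval_query> tags; map
    <retrieval_skip/> (or no tag at all) to None, i.e. no retrieve() call."""
    if "<retrieval_skip/>" in output:
        return None
    m = re.search(r"<retrieval_query>(.*?)</retrieval_query>", output, re.DOTALL)
    return m.group(1).strip() if m else None

print(parse_tagged_query("<retrieval_query>flood safety uae</retrieval_query>"))
print(parse_tagged_query("<retrieval_skip/>"))  # None
```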
8. Prompt injection (formatContext + PromptService)
Retrieved chunks are formatted as numbered [1] title: content with instructions that [1] is most relevant, and to preserve verbatim codes / numbers. In the user turn they sit above the current user message, inside a “Reference knowledge from NuraSafe database” wrapper (see PromptService.swift).
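The layout described above can be sketched as follows; the wrapper wording here paraphrases PromptService.swift rather than quoting it, and the sample chunk is hypothetical:

```python
def format_context(chunks):
    """Number retrieved chunks as '[i] title: content', most relevant first."""
    lines = ["Reference knowledge from NuraSafe database "
             "([1] is most relevant; preserve verbatim codes / numbers):"]
    for i, c in enumerate(chunks, start=1):
        lines.append(f"[{i}] {c['title']}: {c['content']}")
    return "\n".join(lines)

out = format_context([{"title": "Emergency numbers", "content": "Call 999."}])
print(out.splitlines()[1])  # [1] Emergency numbers: Call 999.
```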
9. Key constants
| Constant | Value / behavior |
|---|---|
| topK | 3 chunks |
| Vector similarity threshold | 0.25 |
| E5 + BM25 weights | 0.6 / 0.4 when E5 healthy; else BM25 1.0 |
| Degenerate E5 | min top-20 sim > 0.88 and spread < 0.06 |
| Embedding dim | 384 |
| Max sequence | 128 tokens |
10. End-to-end flow
KnowledgeBase.json
│
▼
buildIndex → allChunks (RAM)
│
├─ E5 OK? ──► embedPassage per chunk → VectorStore + KnowledgeIndexStore (disk)
│
└─► ObjectBox.replaceAll (text mirror)
Per message:
Query LLM → parse (skip | query string) → [optional] fusedRetrievalQuery
│
├─ nil query? ──► skip retrieve → answer LLM without KB
│
└─ retrieve: scenario filter → BM25(signalText) + E5(query)
→ hybrid + UAE → top 3 chunks
│
▼
Prompt with history + optional RAG block → Answer LLM (stream)

In short: corpus → index (E5 + cache + ObjectBox text mirror) → gated query LLM → hybrid BM25 + E5 on filtered candidates → top-3 injection → answer LLM.
11. Developer
NuraSafe is built by developers who care about offline-first AI and safety-critical UX. For professional background, open-source interests, or collaboration, you can connect on LinkedIn.
LinkedIn: linkedin.com/in/hsamichg