India-First Agentic RAG · v1.0 · Parts 1–5

Multilingual
Agentic RAG
for India

Hindi in → Hindi out. Zero translation. Zero loss.

A complete engineering breakdown of SmartDocs — 22 Indian languages, 11-node LangGraph agent, 18 non-negotiable laws, 10 production bugs fixed. Built from first principles for India's actual linguistic reality.

22 Languages · 11 LangGraph Nodes · 18 Laws · 10 Bugs Fixed · 0.91 Cross-Lang Sim
smartdocs — embedding_validator.py
$ python embeddings/embedding_validator.py
Loading multilingual-e5-large...
✓ Model loaded on CUDA (RTX 3050)
─────────────────────────────────
Test 1: Land acquisition (Hindi ↔ EN)
similarity: 0.9069 ✓ PASSED
Test 2: GST invoice (Hindi ↔ EN)
similarity: 0.8925 ✓ PASSED
Test 3: Insurance policy (Hindi ↔ EN)
similarity: 0.9155 ✓ PASSED
─────────────────────────────────
Average: 0.9050 (threshold: 0.85)
ALL 3 PAIRS PASSED. Proceed to Part 2.
$
Stack
Sarvam-30B LangGraph 0.3 FastAPI+SSE pgvector+RLS
Built by
Sahil Alaknur
Agentic AI Engineer · India
System Architecture

Three Pipelines.
One India-First System.

Every component chosen to handle India's linguistic reality — Devanagari Unicode, Hinglish queries, bilingual documents. Not English RAG with Hindi support bolted on.

📥
PIPELINE 01
Ingestion Pipeline
PDF → 5-step Indic preprocessing → parent-child chunking (Devanagari 400t / Latin 500t) → multilingual-e5-large embedding → pgvector RLS storage. 10 stages, 0 compromises.
indic-nlp · pdfplumber · presidio · pgvector
🔍
PIPELINE 02
Query Pipeline
7-step language detection → multi-query expansion via asyncio.gather → hybrid dense+BM25+RRF retrieval → FlashRank reranking → CRAG Tavily fallback. PROCEED / CRAG / INSUFFICIENT routing.
langdetect · FlashRank · Tavily · aioredis
🤖
PIPELINE 03
LangGraph Agent
11 nodes · 3 conditional edges · self-critique retry cycle (max 2x) · output guardrails · Sarvam-30B streaming generation. This is not a pipeline — it is a stateful agent.
LangGraph · Sarvam-30B · LangSmith · RLS
⚠ The Foundational Principle
multilingual-e5-large is the FOUNDATION, not the LLM call.
If embeddings do not place Hindi and English text about the same concept close together in vector space, retrieval fails before Sarvam ever sees the query. Cross-language cosine similarity > 0.85 must pass before any ingestion code is written.
Land Acquisition: 0.9069 ✓
GST Terms: 0.8925 ✓
Insurance: 0.9155 ✓
Average: 0.9050 (> 0.85)
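
A minimal sketch of the gate described above, assuming each Hindi/English pair must clear 0.85 on its own (the validator output reports per-pair passes). The function names and the injected `embed` callable are hypothetical; the real validator loads multilingual-e5-large, which is left out here so the sketch runs without a GPU.

```python
import math
from typing import Callable, List, Sequence, Tuple

THRESHOLD = 0.85  # gate value quoted in the text


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def gate(similarities: Sequence[float], threshold: float = THRESHOLD) -> bool:
    """True only if every pair clears the threshold (assumed per-pair rule)."""
    return all(s > threshold for s in similarities)


def validate_pairs(
    pairs: Sequence[Tuple[str, str]],
    embed: Callable[[str], Sequence[float]],  # hypothetical: text -> vector
    threshold: float = THRESHOLD,
) -> Tuple[List[float], bool]:
    """Embed each (Hindi, English) pair and apply the gate."""
    sims = [cosine_similarity(embed(hi), embed(en)) for hi, en in pairs]
    return sims, gate(sims, threshold)


# The three similarities reported by the validator run.
reported = [0.9069, 0.8925, 0.9155]
average = sum(reported) / len(reported)
```

Feeding the three reported similarities through `gate` passes, and their mean rounds to the 0.9050 shown in the validator output.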
Parts 1 & 2 — Ingestion Pipeline

10 Stages.
India-First.

From raw PDF to indexed vectors. Every stage handles an India-specific concern.
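
The script-aware parent-child chunking named in the pipeline summary (Devanagari 400 tokens, Latin 500 tokens) might look roughly like this. This is a sketch under stated assumptions: whitespace tokens stand in for the real tokenizer, the 50% dominance rule and `PARENT_FACTOR` are hypothetical, and none of these function names come from SmartDocs.

```python
from typing import List, Tuple

DEVANAGARI_CHILD_TOKENS = 400  # from the pipeline description
LATIN_CHILD_TOKENS = 500       # from the pipeline description
PARENT_FACTOR = 4              # hypothetical: one parent spans 4 child windows


def is_devanagari_dominant(text: str) -> bool:
    """True if more than half of the non-space characters are Devanagari."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    deva = sum(1 for ch in chars if "\u0900" <= ch <= "\u097F")
    return deva / len(chars) > 0.5


def child_size(text: str) -> int:
    """Pick the child chunk size from the dominant script."""
    if is_devanagari_dominant(text):
        return DEVANAGARI_CHILD_TOKENS
    return LATIN_CHILD_TOKENS


def parent_child_chunks(text: str) -> List[Tuple[str, List[str]]]:
    """Split text into parents, each carrying its list of child chunks."""
    tokens = text.split()  # assumption: whitespace tokens, not model tokens
    size = child_size(text)
    parent_size = size * PARENT_FACTOR
    chunks = []
    for p in range(0, len(tokens), parent_size):
        parent_tokens = tokens[p:p + parent_size]
        children = [
            " ".join(parent_tokens[c:c + size])
            for c in range(0, len(parent_tokens), size)
        ]
        chunks.append((" ".join(parent_tokens), children))
    return chunks
```

In a parent-child scheme the small child chunks are typically what gets embedded and matched, while the parent supplies wider context to the generator.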

Part 3 — Query Pipeline

From Query to
Top-5 Reranked Chunks.

Language detection, query expansion, hybrid retrieval, and routing decisions. Everything runs async in parallel.
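
A sketch of how parallel retrieval and Reciprocal Rank Fusion (RRF) can fit together. The retrievers here are stubs returning fixed rankings; `RRF_K = 60` is the conventional constant and an assumption, not a value taken from SmartDocs, and all function names are hypothetical.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Dict, List

RRF_K = 60  # conventional RRF constant (assumption)

Retriever = Callable[[str], Awaitable[List[str]]]


def rrf_fuse(rankings: List[List[str]], k: int = RRF_K) -> List[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; doc id breaks ties so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


async def retrieve_all(queries: List[str], retrievers: List[Retriever]) -> List[str]:
    """Run every (query, retriever) pair concurrently, then fuse."""
    tasks = [retriever(q) for q in queries for retriever in retrievers]
    rankings = await asyncio.gather(*tasks)
    return rrf_fuse(list(rankings))


# Hypothetical stand-ins for the dense and BM25 retrievers.
async def dense_stub(query: str) -> List[str]:
    return ["a", "b", "c"]


async def bm25_stub(query: str) -> List[str]:
    return ["b", "c", "d"]


fused = asyncio.run(retrieve_all(["what is GST"], [dense_stub, bm25_stub]))
```

Documents that appear in both lists rise to the top, which is why RRF needs no score normalisation between dense and BM25 results.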

Part 4 — LangGraph Agentic Loop

11 Nodes. 3 Edges.
One Self-Critiquing Agent.

Not a pipeline. A stateful agent. Conditional edges, retry cycles, graceful failures.

EDGE 1 — preprocess_node
cache_hit=True → serve_cache → END
cache_hit=False → transform_node
EDGE 2 — rerank_node
score > 0.7 → assemble_context
0.3–0.7 → crag → assemble
< 0.3 → insufficient → END
EDGE 3 — critique_node
passed=True → guardrail → END
retry & count < 2 → retrieve
count ≥ 2 → insufficient → END
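
The three edges above can be sketched as plain routing functions. The state field names (`cache_hit`, `top_score`, `critique_passed`, `retry_count`) are assumptions for illustration; the returned strings follow the node names in the edge table.

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    # Hypothetical field names; the real state schema is not shown here.
    cache_hit: bool
    top_score: float
    critique_passed: bool
    retry_count: int


def route_after_preprocess(state: AgentState) -> str:
    """EDGE 1: serve from cache when possible."""
    return "serve_cache" if state.get("cache_hit") else "transform"


def route_after_rerank(state: AgentState) -> str:
    """EDGE 2: PROCEED / CRAG / INSUFFICIENT by top rerank score."""
    score = state.get("top_score", 0.0)
    if score > 0.7:
        return "assemble_context"
    if score >= 0.3:
        return "crag"
    return "insufficient"


def route_after_critique(state: AgentState) -> str:
    """EDGE 3: pass, retry (at most twice), or give up gracefully."""
    if state.get("critique_passed"):
        return "guardrail"
    if state.get("retry_count", 0) < 2:
        return "retrieve"
    return "insufficient"
```

In LangGraph, functions of this shape are typically registered with `StateGraph.add_conditional_edges`, which maps the returned string to the next node.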

11 TOTAL NODES · 3 COND. EDGES
Part 3 — language_detector.py

7-Step
Deterministic Detection.

langdetect classified the Hinglish query "transformer kya hai" as Norwegian. It is statistically unstable on Hinglish and mixed scripts. This 7-step deterministic logic tree fixes it permanently.

Python language_detector.py
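# PRODUCTION FIXES APPLIED:
#   DetectorFactory.seed = 0 → deterministic langdetect output
#   Script detection runs BEFORE langdetect
#   Hinglish lexicon: 130 words
#   ASCII dominance: 85% threshold
from langdetect import DetectorFactory, detect_langs
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # pin langdetect's random seed

# LangResult, default_english, hi_result, en_result and the _detect_* /
# _check_indic_fallback helpers are assumed to be defined elsewhere in this module.

def detect_language(text: str) -> LangResult:
    if not text.strip():                    # 1. empty input
        return default_english("empty")
    script = _detect_script(text)           # 2. Unicode script check
    if script:
        return LangResult(code=script)
    if _detect_hinglish(text):              # 3. Hinglish lexicon
        return hi_result
    if _is_ascii_dominant(text):            # 4. ASCII dominance
        return en_result
    try:                                    # 5. statistical detection
        res = detect_langs(text)[0]
    except LangDetectException:             # raised when no features are found
        return _check_indic_fallback(text)
    if res.prob >= 0.85:                    # 6. confidence gate
        return LangResult(code=res.lang)    # wrap instead of returning langdetect's object
    return _check_indic_fallback(text)      # 7. Indic fallback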
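
The deterministic checks that run before langdetect could be sketched as follows. These are illustrative stand-ins, not the SmartDocs helpers: the function names, the four script ranges, the 50% script share, and the tiny lexicon are all assumptions (the text states the real lexicon holds 130 words). Only the 85% ASCII threshold comes from the source.

```python
import re
from typing import Optional

# Unicode block ranges; Devanagari is mapped to "hi" as a simplification,
# although the same script is also used for Marathi and Nepali.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "bn": (0x0980, 0x09FF),  # Bengali
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
}

# Illustrative subset of a romanised-Hindi lexicon.
HINGLISH_WORDS = {"kya", "hai", "kaise", "nahi", "kyun", "mujhe", "batao", "mein"}

ASCII_THRESHOLD = 0.85  # from the production notes


def detect_script(text: str, min_share: float = 0.5) -> Optional[str]:
    """Return a language code if one Indic script dominates, else None."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return None
    for code, (lo, hi) in SCRIPT_RANGES.items():
        share = sum(lo <= ord(ch) <= hi for ch in chars) / len(chars)
        if share > min_share:
            return code
    return None


def looks_hinglish(text: str, min_hits: int = 1) -> bool:
    """True if enough Latin-script words appear in the Hinglish lexicon."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in HINGLISH_WORDS for w in words) >= min_hits


def is_ascii_dominant(text: str, threshold: float = ASCII_THRESHOLD) -> bool:
    """True if at least `threshold` of non-space characters are ASCII."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    return sum(ch.isascii() for ch in chars) / len(chars) >= threshold
```

With these stand-ins, "transformer kya hai" is caught by the lexicon check before langdetect is ever called, which is the failure case the section opens with.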
Part 5A — pgvector + Ingestion Audit

10 Production Bugs.
All Fixed.

Audited before writing a single API route. Silent data corruption, runtime crashes, security bypasses. Ranked by damage.

Research Foundation

10 Papers.
Implemented in Production.

Every core architectural decision in SmartDocs traces back to peer-reviewed research.

Engineering Laws

18 Laws.
Non-Negotiable.

Every architectural decision traces to one of these laws. They encode hard-learned lessons about multilingual RAG in production.

Evaluation Gate — LAW 16

The Numbers That
Define Success.

Hindi faithfulness / English faithfulness > 0.97 — or it is not done. These metrics block deployment. Aggregate accuracy hiding Hindi failures is not accepted.

The India Benchmark
Hindi / English Faithfulness > 0.97
If SmartDocs scores 90% English faithfulness and 65% Hindi faithfulness, it is not an India-first product. It is an English PDF tool with Hindi support theater. This ratio is the single defining metric of the entire project.
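
The ratio gate reduces to a few lines. This sketch assumes both faithfulness scores are fractions in (0, 1]; the function names are hypothetical, and only the 0.97 threshold comes from the text.

```python
GATE = 0.97  # Hindi / English faithfulness threshold from LAW 16


def faithfulness_ratio(hindi: float, english: float) -> float:
    """Hindi faithfulness relative to English faithfulness."""
    if english <= 0:
        raise ValueError("English faithfulness must be positive")
    return hindi / english


def passes_gate(hindi: float, english: float, gate: float = GATE) -> bool:
    """True only when the ratio strictly exceeds the gate."""
    return faithfulness_ratio(hindi, english) > gate


# The failing example from the text: 90% English, 65% Hindi.
example_ratio = faithfulness_ratio(0.65, 0.90)
```

For the example above the ratio is about 0.72, far below 0.97, so deployment would be blocked even though the English score alone looks healthy.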