India-First Agentic RAG · v1.0 · Parts 1–5

Multilingual
Agentic RAG
for India

Hindi in → Hindi out. Zero translation. Zero loss.

A complete engineering breakdown of SmartDocs — 22 Indian languages, 11-node LangGraph agent, 18 non-negotiable laws, 10 production bugs fixed. Built from first principles for India's actual linguistic reality.


22 Languages · 11 LangGraph Nodes · 18 Laws · 10 Bugs Fixed · 0.91 Cross-Language Similarity

smartdocs — embedding_validator.py

$ python embeddings/embedding_validator.py
Loading multilingual-e5-large...
✓ Model loaded on CUDA (RTX 3050)
─────────────────────────────────
Test 1: Land acquisition (Hindi ↔ EN)
similarity: 0.9069 ✓ PASSED
Test 2: GST invoice (Hindi ↔ EN)
similarity: 0.8925 ✓ PASSED
Test 3: Insurance policy (Hindi ↔ EN)
similarity: 0.9155 ✓ PASSED
─────────────────────────────────
Average: 0.9050 (threshold: 0.85)
ALL 3 PAIRS PASSED. Proceed to Part 2.

Stack

Sarvam-30B · LangGraph 0.3 · FastAPI + SSE · pgvector + RLS

Built by

Sahil Alaknur

Agentic AI Engineer · India

System Architecture

Three Pipelines.
One India-First System.

Every component chosen to handle India's linguistic reality — Devanagari Unicode, Hinglish queries, bilingual documents. Not English RAG with Hindi support bolted on.

📥

PIPELINE 01

Ingestion Pipeline

PDF → 5-step Indic preprocessing → parent-child chunking (400 tokens for Devanagari, 500 for Latin) → multilingual-e5-large embedding → pgvector storage with row-level security (RLS). 10 stages, 0 compromises.

indic-nlp · pdfplumber · presidio · pgvector


🔍

PIPELINE 02

Query Pipeline

7-step language detection → multi-query expansion via asyncio.gather → hybrid dense+BM25+RRF retrieval → FlashRank reranking → CRAG Tavily fallback. PROCEED / CRAG / INSUFFICIENT routing.

langdetect · FlashRank · Tavily · aioredis


🤖

PIPELINE 03

LangGraph Agent

11 nodes · 3 conditional edges · self-critique retry cycle (max 2x) · output guardrails · Sarvam-30B streaming generation. This is not a pipeline — it is a stateful agent.

LangGraph · Sarvam-30B · LangSmith · RLS


⚠ The Foundational Principle

multilingual-e5-large is the FOUNDATION, not the LLM call.

If embeddings do not place Hindi and English text about the same concept close together in vector space, retrieval fails before Sarvam ever sees the query. Cross-language cosine similarity > 0.85 must pass before any ingestion code is written.

Land Acquisition: 0.9069 ✓
GST Terms: 0.8925 ✓
Insurance: 0.9155 ✓
Average: 0.9050 (> 0.85)
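The gate described above can be sketched in a few lines. This is a minimal illustration, not the project's embedding_validator.py: the function names, the `embed` callable, and the per-pair pass rule are assumptions. In a real run `embed` would wrap multilingual-e5-large; here it is left injectable so the check itself stays testable without a GPU.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def validate_pairs(
    pairs: Dict[str, Tuple[str, str]],
    embed: Callable[[str], Vector],
    threshold: float = 0.85,  # threshold taken from the text above
) -> Tuple[Dict[str, float], float, bool]:
    """Score each (Hindi, English) pair; pass only if every pair beats the threshold."""
    if not pairs:
        raise ValueError("at least one test pair is required")
    scores = {
        name: cosine_similarity(embed(hi_text), embed(en_text))
        for name, (hi_text, en_text) in pairs.items()
    }
    average = sum(scores.values()) / len(scores)
    passed = all(score > threshold for score in scores.values())
    return scores, average, passed
```

When wiring in the real model, note that the E5 model card recommends prefixing inputs with "query: " or "passage: "; whether SmartDocs does so is not stated here.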

Parts 1 & 2 — Ingestion Pipeline

10 Stages.
India-First.

From raw PDF to indexed vectors. Every stage handles an India-specific concern.
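As one illustration of an India-specific stage, script-aware parent-child chunking might look like the sketch below. The 400 and 500 token budgets come from the pipeline summary; everything else is an assumption made for the example: whitespace tokenization, treating the budget as the parent size, four children per parent, and all names.

```python
from dataclasses import dataclass
from typing import List

DEVANAGARI_BUDGET = 400  # tokens, from the pipeline summary
LATIN_BUDGET = 500       # tokens, from the pipeline summary


@dataclass
class ParentChunk:
    text: str
    children: List[str]


def chunk_budget(text: str) -> int:
    """Pick the token budget by comparing Devanagari and ASCII letter counts."""
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097F")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return DEVANAGARI_BUDGET if devanagari > latin else LATIN_BUDGET


def parent_child_chunks(text: str, children_per_parent: int = 4) -> List[ParentChunk]:
    """Split text into parent windows, each subdivided into smaller child chunks."""
    tokens = text.split()  # assumption: whitespace tokens stand in for model tokens
    parent_size = chunk_budget(text)
    child_size = max(1, parent_size // children_per_parent)
    parents = []
    for start in range(0, len(tokens), parent_size):
        window = tokens[start:start + parent_size]
        children = [
            " ".join(window[i:i + child_size])
            for i in range(0, len(window), child_size)
        ]
        parents.append(ParentChunk(text=" ".join(window), children=children))
    return parents
```

Children are what would be embedded and searched; the parent text is what would be handed to the generator for context.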

Part 3 — Query Pipeline

From Query to
Top-5 Reranked Chunks.

Language detection, query expansion, hybrid retrieval, and routing decisions. Everything runs async in parallel.
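The parallel retrieval and fusion step can be sketched as follows. This is an illustrative outline under stated assumptions, not the SmartDocs code: `dense_search` and `bm25_search` are hypothetical async callables that return ranked chunk IDs, and k = 60 is the commonly used Reciprocal Rank Fusion constant rather than a value confirmed by the project. Reranking with FlashRank would follow and is not shown.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Dict, List, Sequence

SearchFn = Callable[[str], Awaitable[List[str]]]


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 5
) -> List[str]:
    """Fuse ranked lists: each appearance adds 1 / (k + rank) to a chunk's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


async def hybrid_retrieve(
    queries: Sequence[str],
    dense_search: SearchFn,
    bm25_search: SearchFn,
    top_n: int = 5,
) -> List[str]:
    """Run dense and BM25 search for every expanded query concurrently, then fuse."""
    tasks = [search(q) for q in queries for search in (dense_search, bm25_search)]
    rankings = await asyncio.gather(*tasks)
    return reciprocal_rank_fusion(rankings, top_n=top_n)
```

Because RRF uses only ranks, dense cosine scores and BM25 scores never need to be normalized against each other.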

Part 4 — LangGraph Agentic Loop

11 Nodes. 3 Edges.
One Self-Critiquing Agent.

Not a pipeline. A stateful agent. Conditional edges, retry cycles, graceful failures.

EDGE 1 — preprocess_node

cache_hit=True → serve_cache → END

cache_hit=False → transform_node

EDGE 2 — rerank_node

score > 0.7 → assemble_context

0.3–0.7 → crag → assemble

< 0.3 → insufficient → END

EDGE 3 — critique_node

passed=True → guardrail → END

retry & count < 2 → retrieve

count ≥ 2 → insufficient → END
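The three conditional edges above reduce to three small routing functions. The sketch below is plain Python with hypothetical state keys (`cache_hit`, `rerank_score`, `critique_passed`, `retry_count`) and hypothetical return labels; in LangGraph, functions of this shape are the kind of callable passed to `add_conditional_edges`, but the actual SmartDocs wiring is not shown in this document.

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    cache_hit: bool
    rerank_score: float
    critique_passed: bool
    retry_count: int


def route_after_preprocess(state: AgentState) -> str:
    """Edge 1: serve from cache when possible, otherwise continue to transform."""
    return "serve_cache" if state.get("cache_hit") else "transform"


def route_after_rerank(state: AgentState) -> str:
    """Edge 2: PROCEED above 0.7, CRAG from 0.3 to 0.7, INSUFFICIENT below 0.3."""
    score = state.get("rerank_score", 0.0)
    if score > 0.7:
        return "assemble_context"
    if score >= 0.3:
        return "crag"
    return "insufficient"


def route_after_critique(state: AgentState) -> str:
    """Edge 3: finish on a passed critique, retry at most twice, then give up."""
    if state.get("critique_passed"):
        return "guardrail"
    if state.get("retry_count", 0) < 2:
        return "retrieve"
    return "insufficient"
```

Keeping the routers free of side effects makes each edge unit-testable without building the graph.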


Part 3 — language_detector.py

7-Step
Deterministic Detection.

langdetect classified the Hinglish query "transformer kya hai" as Norwegian. It is statistically unstable on Hinglish and mixed-script text. This 7-step deterministic decision tree runs script and lexicon checks first and consults langdetect only after they fail to decide.


Python language_detector.py

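# PRODUCTION FIXES APPLIED:
# DetectorFactory.seed = 0 → determinism
# Script detection BEFORE langdetect
# Hinglish lexicon: 130 words
# ASCII dominance: 85% threshold
# (LangResult, default_english, hi_result, en_result and the _helpers live elsewhere in this module)

from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make langdetect deterministic across calls

def detect_language(text: str) -> LangResult:
    # Empty or whitespace-only input defaults to English
    if not text.strip():
        return default_english("empty")
    # Unicode script check runs before langdetect is ever consulted
    script = _detect_script(text)
    if script:
        return LangResult(code=script)
    # Romanized Hindi is caught by the Hinglish lexicon
    if _detect_hinglish(text):
        return hi_result
    # Remaining ASCII-dominant text is treated as English
    if _is_ascii_dominant(text):
        return en_result
    # langdetect is trusted only at probability >= 0.85
    res = detect_langs(text)[0]
    if res.prob >= 0.85:
        return LangResult(code=res.lang)  # wrap so the return type stays LangResult
    # Low confidence: fall back to the Indic check
    return _check_indic_fallback(text)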
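The three deterministic checks that run before langdetect could be sketched like this. These are hypothetical stand-ins, not the project's helpers: the script table covers only four Unicode blocks, the lexicon is a small illustrative subset of the 130-word list mentioned above, and the two-hit rule for Hinglish is an assumption.

```python
import re
from typing import Optional

# Unicode block ranges mapped to a default language code (illustrative subset).
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari; also used by Marathi and Nepali
    "bn": (0x0980, 0x09FF),  # Bengali
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
}

# Illustrative subset only; the real lexicon is described as 130 words.
HINGLISH_WORDS = frozenset({"kya", "hai", "nahi", "kaise", "kyun", "mera", "aap", "hain"})


def detect_script(text: str) -> Optional[str]:
    """Return the code of the most frequent known Indic script, or None."""
    counts = {code: 0 for code in SCRIPT_RANGES}
    for ch in text:
        codepoint = ord(ch)
        for code, (start, end) in SCRIPT_RANGES.items():
            if start <= codepoint <= end:
                counts[code] += 1
                break
    best = max(counts, key=lambda code: counts[code])
    return best if counts[best] > 0 else None


def detect_hinglish(text: str, min_hits: int = 2) -> bool:
    """True when enough romanized Hindi marker words appear in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word in HINGLISH_WORDS) >= min_hits


def is_ascii_dominant(text: str, threshold: float = 0.85) -> bool:
    """True when at least `threshold` of the non-space characters are ASCII."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return False
    return sum(1 for ch in chars if ord(ch) < 128) / len(chars) >= threshold
```

With these checks, "transformer kya hai" is resolved by the lexicon and never reaches langdetect, which is the failure case the section opens with.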


Part 5A — pgvector + Ingestion Audit

10 Production Bugs.
All Fixed.

Audited before writing a single API route. Silent data corruption, runtime crashes, security bypasses. Ranked by damage.

Research Foundation

10 Papers.
Implemented in Production.

Every core architectural decision in SmartDocs traces back to peer-reviewed research.

Engineering Laws

18 Laws.
Non-Negotiable.

Every architectural decision traces to one of these laws. They encode hard-learned lessons about multilingual RAG in production.

Evaluation Gate — LAW 16

The Numbers That
Define Success.

The ratio of Hindi faithfulness to English faithfulness must exceed 0.97, or it is not done. These metrics block deployment. An aggregate accuracy score that hides Hindi failures is not accepted.

The India Benchmark

Hindi / English Faithfulness > 0.97

If SmartDocs scores 90% English faithfulness and 65% Hindi faithfulness (a ratio of 0.72), it is not an India-first product. It is an English PDF tool with Hindi support theater. This ratio is the single defining metric of the entire project.
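The benchmark is a one-line ratio check. A minimal sketch, with hypothetical function names and faithfulness scores assumed to be fractions between 0 and 1:

```python
def faithfulness_ratio(hindi: float, english: float) -> float:
    """Hindi faithfulness divided by English faithfulness."""
    if english <= 0:
        raise ValueError("English faithfulness must be positive")
    return hindi / english


def passes_india_benchmark(hindi: float, english: float, threshold: float = 0.97) -> bool:
    """Deployment gate: the Hindi/English ratio must exceed the threshold."""
    return faithfulness_ratio(hindi, english) > threshold


# The failing example from the text: 65% Hindi against 90% English.
print(round(faithfulness_ratio(0.65, 0.90), 2))  # 0.72
print(passes_india_benchmark(0.65, 0.90))        # False
```

Gating on the ratio rather than on an average means a strong English score cannot mask a weak Hindi one.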
