Proxy-Pointer RAG: 100% Accuracy on 10-K Filings

§ 01

Summary (what this paper is saying)

Proxy-Pointer RAG is a structure-aware vector retrieval architecture that addresses a fundamental flaw in standard RAG: when documents are shredded into flat chunks, the index loses all knowledge of where each chunk lives in the document. The result is retrieval that surfaces plausible but mislocated text, forcing the synthesizer LLM to work from fragmented, context-less inputs.

Proxy-Pointer fixes this without abandoning the vector RAG stack. It parses document headings into a hierarchical skeleton tree, prepends full structural breadcrumbs to every chunk before embedding (e.g. "AMD > Financial Statements > Cash Flows"), and uses a two-stage retrieval pipeline: FAISS for broad semantic recall (top 200, shortlisted to 50), followed by an LLM re-ranker that scores candidates on structural relevance rather than embedding similarity. Retrieved chunks are used as pointers to load complete, unbroken document sections — not truncated passages — for the synthesizer.
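A minimal sketch of the breadcrumb-prepending step, assuming each section has already been extracted along with its breadcrumb path. The function name `chunk_section`, the bracketed prefix format, and the 1,200-character limit are illustrative assumptions, not details taken from the article.

```python
def chunk_section(breadcrumb, section_text, max_chars=1200):
    """Split ONE section into chunks and prepend its breadcrumb to each chunk.

    Chunks are built paragraph by paragraph and never cross a section
    boundary, because the function only ever sees a single section's text.
    A single paragraph longer than max_chars is kept whole in this sketch.
    """
    paragraphs = [p.strip() for p in section_text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    # The breadcrumb becomes part of the text that is later embedded.
    return [f"[{breadcrumb}]\n{chunk}" for chunk in chunks]


if __name__ == "__main__":
    crumb = "AMD > Financial Statements > Cash Flows"
    for chunk in chunk_section(crumb, "First paragraph.\n\nSecond paragraph.", max_chars=20):
        print(repr(chunk))
```

Because the breadcrumb is embedded together with the chunk text, two chunks with similar wording but different section paths end up with different vectors, which is the effect the paragraph above describes.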

Benchmarked on 66 questions across four FY2022 Fortune 500 10-K filings (AMD, American Express, Boeing, PepsiCo), the system achieved 100% accuracy at k=5 and 93.9% at k=3, where k is the number of re-ranked sections passed to the synthesizer. The benchmark included multi-hop numerical reasoning, adversarial queries, cross-statement reconciliation, and counterintuitive financial metrics. The full pipeline is open-sourced under the MIT licence.

§ 02

Core Argument

Why do enterprises need both RAG and knowledge graphs?

This article is not framed around knowledge graphs, but its core argument maps directly to the RAG/KG tension: standard vector RAG discards document structure, which is itself a form of implicit knowledge graph. The heading hierarchy of a 10-K (Company > Financial Statements > Cash Flow Statement > Operating Activities) encodes meaning about relationships between sections that flat chunking destroys. Proxy-Pointer reconstructs that structure without building an explicit KG — by injecting it into the embedding itself.

The implicit argument: for structured documents, the document's own hierarchy is the knowledge graph. You don't need to build one externally; you need to preserve the one that already exists.

§ 03

RAG Side (Strengths & Limits)

§ 04

Strengths:

Full vector RAG scalability — standard embedding model, FAISS index, no LLM calls during indexing (except a one-time noise filter per document)

Works on existing document corpora without schema design or entity extraction

Degrades gracefully on unstructured documents — falls back to standard chunking

Transparent and auditable — every response cites the structural breadcrumb of its source section

§ 05

Weaknesses:

Requires documents to have structural headings to unlock the core value; flat documents get standard vector RAG performance

The two-stage retrieval adds latency — 200 candidate chunks, shortlisted to 50, re-ranked by LLM before final selection

LLM re-ranker is an added cost per query — though the article uses the most cost-efficient available model

PDF-to-Markdown extraction quality (via LlamaParse) is a prerequisite dependency — poor extraction produces a broken skeleton tree

§ 06

Knowledge Graph Side (Strengths & Limits)

§ 07

Strengths:

The skeleton tree is a lightweight, implicit knowledge graph of the document's section hierarchy — built in milliseconds with pure Python, no LLM needed

Breadcrumb injection encodes structural relationships directly into the vector space — structural proximity influences retrieval without a separate graph query layer

The re-ranker operates on the breadcrumb paths (root-to-section paths through the tree), not raw text, so structural relevance is a first-class retrieval signal
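The skeleton tree described in this list can be approximated in a few lines of standard-library Python. This is a sketch under assumptions: `Node`, `build_skeleton`, and `breadcrumbs` are hypothetical names, only `#`-style Markdown headings are recognised, and `#` lines inside fenced code would be misread as headings.

```python
import re
from dataclasses import dataclass, field

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Node:
    node_id: int
    title: str
    level: int
    children: list = field(default_factory=list)  # child Node objects
    lines: list = field(default_factory=list)     # body text lines of this section


def build_skeleton(markdown, root_title):
    """Parse Markdown headings into a tree rooted at a synthetic document node."""
    root = Node(0, root_title, 0)
    stack = [root]  # path from the root to the section currently being read
    next_id = 1
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Climb until the top of the stack is a strict ancestor level.
            while stack[-1].level >= level:
                stack.pop()
            node = Node(next_id, match.group(2), level)
            next_id += 1
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].lines.append(line)
    return root


def breadcrumbs(node, prefix=()):
    """Yield (node_id, 'A > B > C') for every node, depth first."""
    path = prefix + (node.title,)
    yield node.node_id, " > ".join(path)
    for child in node.children:
        yield from breadcrumbs(child, path)


if __name__ == "__main__":
    sample = "# Financial Statements\n## Cash Flows\nNet cash provided...\n# Risk Factors\n"
    for node_id, crumb in breadcrumbs(build_skeleton(sample, "AMD")):
        print(node_id, crumb)
```

The tree is scoped to one document, which matches the weakness noted for this design: nothing here links a node in one filing to a node in another.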

§ 08

Weaknesses:

The "graph" is document-scoped, not corpus-scoped — cross-document reasoning requires the synthesizer to do the joining, not the retrieval layer

No explicit entity extraction — the system knows where sections live, not what entities they contain or how those entities relate across documents

The skeleton tree cannot represent cross-references within a document (e.g. a footnote in section 4 that qualifies a figure in section 12)

§ 09

Key Insight (the "why both" claim)

The key insight is that document structure is a knowledge graph that authors have already built — and standard RAG throws it away. Proxy-Pointer's claim is that for the majority of high-value enterprise documents (technical manuals, legal contracts, financial filings, compliance reports), the section hierarchy encodes the semantic relationships that matter for retrieval. You don't need a separate KG layer; you need to stop destroying the implicit graph that already exists.

The two-stage retrieval operationalises this: Stage 1 (FAISS) finds semantically similar text. Stage 2 (LLM re-ranker on breadcrumbs) promotes candidates that are structurally correct — "AMD > Financial Statements > Cash Flows" over a paragraph that merely mentions cash flow. The synthesis step then loads the complete section, not a chunk, so the LLM reasons over full context.
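Stage 2 can be sketched as a sort over breadcrumb paths with a pluggable scorer. In the article the scorer is an LLM; here `toy_path_overlap` is a deliberately crude word-overlap stand-in so the example runs offline. All function names are hypothetical.

```python
def rerank_by_breadcrumb(query, candidates, score_fn, k=5):
    """Return the top-k (node_id, breadcrumb) pairs by structural relevance.

    candidates: list of (node_id, breadcrumb) from the first-stage shortlist.
    score_fn:   callable (query, breadcrumb) -> float. In a real system this
                would wrap an LLM call; that wiring is outside this sketch.
    """
    scored = [(score_fn(query, crumb), node_id, crumb) for node_id, crumb in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(node_id, crumb) for _, node_id, crumb in scored[:k]]


def toy_path_overlap(query, breadcrumb):
    """Toy scorer: fraction of query words that occur in the breadcrumb path."""
    query_words = set(query.lower().split())
    path_words = set(breadcrumb.lower().replace(">", " ").split())
    return len(query_words & path_words) / len(query_words) if query_words else 0.0


if __name__ == "__main__":
    shortlist = [
        (7, "AMD > Risk Factors > Liquidity"),
        (3, "AMD > Financial Statements > Cash Flows"),
    ]
    print(rerank_by_breadcrumb("AMD operating cash flows", shortlist, toy_path_overlap, k=1))
```

Note that the scorer never sees chunk text, only the path. That is the design choice the paragraph above describes: the second stage judges where a candidate lives, not how it is worded.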

§ 010

Mental Model (how to think about it)

Think of a 500-page financial filing as a filing cabinet with labelled drawers, folders, and sub-folders. Standard vector RAG takes every page out of the cabinet, shreds it into paragraphs, loses all the labels, and searches by embedding similarity alone. Proxy-Pointer keeps the cabinet structure: every paragraph is re-labelled with its full folder path before it goes into the index. When you search, the system finds paragraphs by similarity but ranks them by whether their folder path matches what the query is asking about. And when it retrieves an answer, it pulls the entire folder, not just the matching paragraph, so the LLM reads complete context.

§ 011

Enterprise Implications

For any enterprise handling structured documents at scale — legal, compliance, financial analysis, regulatory reporting, technical documentation — this architecture is directly deployable today without GPU infrastructure or complex pipelines.

The 100% accuracy benchmark on adversarial financial queries is a meaningful signal for high-stakes document QA use cases where hallucination or misattribution carries legal or financial consequence.

Source grounding via structural breadcrumbs creates an audit trail. Every answer cites the exact document section it came from. For regulated industries, this traceability is a compliance property, not a convenience.

The adversarial robustness results are noteworthy for enterprise risk: the system correctly returned "no evidence" for a non-existent data point (crypto revenue at AMEX) and correctly explained why a metric was undefined (Boeing's debt/equity with negative equity) rather than hallucinating a figure. These are the failure modes that create liability in enterprise document QA.
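The negative-equity case can be made concrete with a small guard. This is an illustrative sketch only: the function name is hypothetical, the figures are made up rather than taken from any filing, and treating zero or negative equity as "no meaningful ratio" is the convention assumed here.

```python
from typing import Optional


def debt_to_equity(total_debt: float, shareholders_equity: float) -> Optional[float]:
    """Return debt/equity, or None when equity is zero or negative.

    With non-positive equity the quotient is either undefined (division by
    zero) or a negative number that cannot be read as leverage, so this
    sketch reports "no meaningful value" instead of forcing a figure.
    """
    if shareholders_equity <= 0:
        return None
    return total_debt / shareholders_equity


if __name__ == "__main__":
    print(debt_to_equity(30.0, 60.0))   # ordinary case
    print(debt_to_equity(50.0, -10.0))  # negative equity: no ratio reported
```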

The architecture unifies two retrieval pathways that many enterprises currently maintain separately — one for structured, high-value documents and one for routine knowledge bases. A single pipeline handles both.

§ 012

Technical Mapping

RAG → two-stage retrieval: FAISS semantic recall (top 200 → shortlist 50) followed by an LLM re-ranker on structural breadcrumbs (top 5); the synthesizer receives full document sections, not truncated chunks

Graph → skeleton tree: pure-Python hierarchical parse of Markdown headings into a JSON tree; breadcrumbs injected into every chunk before embedding

How they connect:

Offline (indexing): PDF → Markdown (LlamaParse) → skeleton tree built from headings → LLM noise filter removes TOC/glossary/boilerplate nodes → chunks split within section boundaries, never across → each chunk prepended with full breadcrumb path → embedded and stored in FAISS

Online (retrieval): query → FAISS top-200 → deduplicate by (doc_id, node_id) → shortlist 50 unique candidate nodes → LLM re-ranker scores breadcrumb paths against query for structural relevance → top 5 nodes selected → full document sections loaded as synthesizer context → LLM generates answer with source citations
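The deduplication and pointer steps of the online path can be sketched as follows, assuming the vector index has already returned hits as `(doc_id, node_id, score)` tuples sorted best first. The function names and the dictionary-backed `section_store` are hypothetical; the article's own data structures may differ.

```python
def shortlist_nodes(hits, limit=50):
    """Collapse chunk-level hits into unique (doc_id, node_id) candidates.

    hits: iterable of (doc_id, node_id, score), already sorted best first.
    Several chunks can point at the same section, so only the first
    (best-scoring) occurrence of each node is kept.
    """
    seen = set()
    shortlist = []
    for doc_id, node_id, _score in hits:
        key = (doc_id, node_id)
        if key in seen:
            continue
        seen.add(key)
        shortlist.append(key)
        if len(shortlist) == limit:
            break
    return shortlist


def load_sections(selected, section_store):
    """Follow each pointer to the full, unbroken section text."""
    return [section_store[key] for key in selected]


if __name__ == "__main__":
    hits = [("amd", 12, 0.91), ("amd", 12, 0.88), ("ba", 4, 0.80), ("amd", 3, 0.79)]
    store = {("amd", 12): "full cash flow section", ("ba", 4): "full balance sheet section"}
    nodes = shortlist_nodes(hits, limit=2)
    print(nodes)
    print(load_sections(nodes, store))
```

The chunk is used only to locate a node; what reaches the synthesizer is the stored section, which is why the approach is described as a pointer.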

§ 013

My Critique

The benchmark is author-generated and self-evaluated. FinanceBench is an established dataset, but the 40-question "Comprehensive Stress Test" was created by the same team that built the system. This is a methodological limitation: adversarial queries designed by the system's authors may not represent the full distribution of failure modes a different evaluator would find.

100% accuracy on 66 questions is a strong result but a small sample. Enterprise deployment at scale will encounter edge cases, document formats, and query types not represented in four FY2022 10-Ks. The 93.9% at k=3 is more informative about realistic failure modes.
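The sample-size point can be quantified with a short calculation. This is a sketch under a strong assumption that the article does not make: that the 66 questions behave like independent draws from the queries the system would see in production. The function names are hypothetical.

```python
def correct_count(accuracy, n):
    """Recover the number of correct answers from a reported accuracy."""
    return round(accuracy * n)


def zero_failure_upper_bound(n, confidence=0.95):
    """One-sided upper bound on the true error rate after n trials with 0 errors.

    Solves (1 - p) ** n = 1 - confidence for p, the exact binomial bound
    for the zero-failure case.
    """
    return 1 - (1 - confidence) ** (1 / n)


if __name__ == "__main__":
    n = 66
    hits_at_3 = correct_count(0.939, n)
    print(f"k=3: {hits_at_3} correct, {n - hits_at_3} missed")
    print(f"k=5: error rate could still be up to {zero_failure_upper_bound(n):.1%}")
```

Under that assumption, 93.9% at k=3 corresponds to 62 of 66 correct (4 misses), and a perfect 66 of 66 at k=5 is still compatible with a true error rate of about 4.4% at 95% confidence.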

The skeleton tree depends entirely on consistent Markdown heading structure post-extraction. In practice, PDF-to-Markdown conversion quality is highly variable — especially for scanned documents, complex table layouts, or PDFs with non-standard formatting. The system's accuracy is upper-bounded by extraction quality, which is not evaluated in this article.

Cross-document retrieval is not tested. The corpus spans four companies but queries are company-scoped. Multi-document reasoning — "compare Boeing's cash flow quality to PepsiCo's" — is not evaluated and would stress the architecture differently.

The LLM re-ranker's structural judgment is the key differentiator, but it is also a single point of failure: if it misranks the 50 candidates, the synthesizer receives the wrong sections. The article shows this happened in the k=3 runs but does not fully analyse re-ranker error rates at k=5.

§ 014

When this fails

Documents without structural headings — scanned PDFs, legacy formats, OCR-extracted text with inconsistent heading detection

Cross-document queries requiring entity-level joins that the skeleton tree cannot represent

Documents where the heading structure is misleading or inconsistent — a heading called "Summary" that contains critical numerical data a noise filter might discard

Very long sections where loading the full section for synthesis exceeds the LLM's context window

Queries requiring temporal reasoning across multiple filings (e.g. "how has this ratio changed over five years") where the corpus spans multiple document instances
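The long-section failure mode in this list is easy to detect before synthesis. The check below is a hypothetical sketch, not something the article describes: it estimates tokens at roughly four characters each, which is a coarse rule of thumb rather than a real tokenizer, and `max_tokens` is whatever budget the chosen model allows.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; a real tokenizer would give different counts."""
    return -(-len(text) // chars_per_token)  # ceiling division


def sections_fit(sections, max_tokens, chars_per_token=4):
    """Return True if all full sections together stay within the token budget."""
    total = sum(estimate_tokens(text, chars_per_token) for text in sections)
    return total <= max_tokens


if __name__ == "__main__":
    selected = ["a" * 400, "b" * 400]
    print(sections_fit(selected, max_tokens=250))
    print(sections_fit(["c" * 4000], max_tokens=500))
```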

§ 015

Key Takeaways for the CIO

Watchlist priority: High — deployable today. Particularly relevant for enterprises with large structured document corpora.

Most enterprise RAG deployments are underperforming on the documents that matter most — legal contracts, compliance filings, technical manuals, financial reports — because standard chunking destroys the structural context that makes those documents meaningful. Proxy-Pointer is the most practically grounded solution to this problem published to date, with a reproducible open-source implementation and a rigorous adversarial benchmark.

The practical case for CIOs:

100% accuracy on adversarial financial queries is a production-relevant signal. The benchmark was designed to break naive retrieval — counterintuitive metrics, multi-hop calculations, queries that presuppose incorrect facts. If your document QA system is hallucinating answers or misattributing figures in high-stakes contexts, structural retrieval is the most likely fix.

This requires no new infrastructure. No GPU, no graph database, no complex pipeline. A single Gemini API key and a FAISS index. The architecture slots into existing vector RAG infrastructure — it replaces the chunking and retrieval logic, not the stack.

The audit trail is a compliance property. Every answer cites the structural path of its source section. In regulated industries — financial services, legal, healthcare, insurance — this traceability is not optional. Proxy-Pointer provides it natively.

The adversarial robustness results reduce a specific category of AI liability. A system that returns "no evidence" instead of hallucinating a non-existent figure, and explains why a metric is mathematically undefined instead of forcing a number, is categorically safer for enterprise deployment than one that generates plausible-sounding answers regardless of evidence.

The open-source implementation makes evaluation low-risk. Clone, configure a single API key, run against your own document corpus. Evaluation cost is minimal. The question is whether your documents have consistent heading structure — if they do, this is worth a pilot.

Recommended action: Identify your highest-stakes document QA use case — the one where a wrong answer carries legal, financial, or regulatory consequence. Run Proxy-Pointer against a sample of those documents. Compare accuracy and source traceability against your current RAG setup. The benchmark scripts are included in the repository.

Reference

https://towardsdatascience.com/proxy-pointer-rag-structure-meets-scale-100-accuracy-with-smarter-retrieval/

Linked May 1, 2026

Proxy-Pointer RAG: Structure Meets Scale at 100% Accuracy with Smarter Retrieval
