§ 01
The decision in one sentence
Microsoft sells the publish button as the finish line. On a real document set it is the starting line — and almost everything left to do between "published" and "trustworthy" is segmentation work the demo never shows.
Your AI assistant answers from your own documents. Its accuracy ceiling was set before the model was ever called — by how those documents were split, structured, retrieved, and dated.
This is not a Microsoft problem. Every document-Q&A product carries the same gap between the live demo and a dependable rollout, because all of them inherit the same dependency: the quality of the units the system retrieves before the model writes a word. Microsoft is simply the most useful example, because most enterprises already own the tool and have already been told it works out of the box.
§ 02
Why your Employee Self-Service Agent won't go live well out of the box
Microsoft 365 Copilot ships an HR and IT assistant — the Employee Self-Service Agent — that answers policy questions, grounded on your SharePoint content, surfaced inside Copilot and Teams. It reached general availability and it is genuinely impressive in a demo. Connect a SharePoint site, publish, and an employee can ask "how many leave days do I carry over?" and get an answer.
The mechanical promise is real. The quality promise is the one that breaks. Microsoft's own documentation is unusually candid about this: the agent is described as a starting point that admins are expected to extend and customize, not a finished answer engine. The deployment guidance then lists the work that comes after "publish" — connect the right SharePoint sources, rewrite topics into your organisation's language, tune the system and behaviour instructions, and curate which content the agent is even allowed to see. None of that is optional polish. It is the difference between an agent that demos well on a clean FAQ and one that answers correctly on a 90-page policy PDF.
The gap is not that the product is weak. It is that the hard part — turning a messy, real-world document estate into clean retrievable knowledge — was never going to be solved by a publish button, and was quietly left to the customer to discover.
§ 03
The story — what enterprises actually fix
When an organisation says the agent "didn't work until we fixed it," they almost never mean they touched the model. They mean they built the segmentation and retrieval layer the product assumed was already there. The fixes cluster into four:
A worked example. An insurer points the agent at a SharePoint library of scanned policy documents. A benefit table spans a page break: half the limits land in one retrieved unit and half in the next, and a two-column layout interleaves an unrelated exclusions clause. The agent answers a coverage question with a limit that does not exist in that product. The model performed perfectly. The failure happened at ingestion, the moment nobody scoped, and no dashboard flagged it.
The HR version of the same failure is quieter still: an employee asks about parental leave, the agent retrieves last quarter's superseded policy because it reads as the better match, and answers with an entitlement that no longer applies. The answer is fluent and confident, and carries no signal that the source was retired. Version-sensitive questions are exactly where naive retrieval is weakest; published benchmarks put accuracy on them in the low 60s (percent) without a version-aware layer.
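One way to picture a version-aware layer is as a filter applied to retrieved units before the model sees them. The sketch below is illustrative only: the `Chunk` fields, the `current_only` helper, the document IDs, and the sample text are all hypothetical, and it assumes each document carries effective and superseded dates as metadata, which a real SharePoint library may not provide without extra work.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float                          # similarity score from the retriever (hypothetical)
    effective_from: date                  # assumed metadata: when the document came into force
    superseded_on: Optional[date] = None  # None means the document is still in force


def current_only(chunks: List[Chunk], as_of: date) -> List[Chunk]:
    """Keep chunks whose source document is in force on `as_of`, best score first."""
    live = [
        c for c in chunks
        if c.effective_from <= as_of
        and (c.superseded_on is None or c.superseded_on > as_of)
    ]
    return sorted(live, key=lambda c: c.score, reverse=True)


# Made-up example: the retired policy scores higher on similarity alone.
retrieved = [
    Chunk("leave-policy-old", "Parental leave: previous entitlement.", 0.91,
          date(2023, 1, 1), date(2024, 7, 1)),
    Chunk("leave-policy-new", "Parental leave: current entitlement.", 0.87,
          date(2024, 7, 1)),
]

top = current_only(retrieved, as_of=date(2025, 1, 15))
print(top[0].doc_id)
```

Filtering before ranking means a retired document cannot win on similarity alone, at the cost of requiring someone to maintain the dates.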
There is also a measurement trap. The out-of-the-box agent gives no real way to test retrieval quality; practitioners report that evaluating whether grounding actually works is manual and cumbersome, which is a second reason teams graduate to a custom Azure pipeline where they can measure it.
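A retrieval check does not need to be elaborate to be useful. The following is a minimal sketch of a hit-rate-at-k measurement, assuming you can label a set of real questions with the unit IDs that should be retrieved. The `hit_rate_at_k` function, the toy corpus, and the keyword-overlap retriever are hypothetical stand-ins; in practice `retrieve` would wrap whatever search your pipeline actually uses.

```python
from typing import Callable, List, Set, Tuple


def hit_rate_at_k(
    retrieve: Callable[[str], List[str]],
    labelled_questions: List[Tuple[str, Set[str]]],
    k: int = 5,
) -> float:
    """Fraction of questions for which at least one expected unit is in the top k."""
    hits = 0
    for question, expected_ids in labelled_questions:
        returned = retrieve(question)[:k]
        if any(chunk_id in expected_ids for chunk_id in returned):
            hits += 1
    return hits / len(labelled_questions)


# Toy corpus and retriever, for illustration only.
CORPUS = {
    "leave#carryover": "employees may carry over unused leave days into the next year",
    "leave#parental": "parental leave entitlement and eligibility",
    "it#password": "how to reset a password for your account",
}


def toy_retrieve(question: str) -> List[str]:
    """Rank unit IDs by the number of words shared with the question."""
    words = set(question.lower().split())
    return sorted(
        CORPUS,
        key=lambda cid: len(words & set(CORPUS[cid].split())),
        reverse=True,
    )


labelled = [
    ("how many leave days do I carry over", {"leave#carryover"}),
    ("reset my password", {"it#password"}),
    ("parental leave eligibility", {"leave#parental"}),
]

score = hit_rate_at_k(toy_retrieve, labelled, k=1)
print(f"hit rate at 1: {score:.2f}")
```

The value of a harness like this comes from the labelled questions, which have to be drawn from your own corpus; the metric itself is deliberately simple.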
§ 04
The lever you control — segmentation methods
Everything above turns on one set of choices: how documents are split into the units that get retrieved. There is no universally best method; the right one depends on your documents and your questions. Independent testing through 2025 repeatedly found that simple fixed-size chunking matched or beat fashionable embedding-based "semantic" chunking on real tasks, at a fraction of the cost — so this is a decision to test against your own corpus, not to take from a vendor benchmark.
| Method | How it splits | Strength | Where it falls short |
|---|---|---|---|
| Fixed-size | Every N tokens, with overlap | Trivial, cheap, a strong baseline | Cuts across ideas; ignores document structure |
| Recursive | Splits on a hierarchy of separators (paragraph, sentence) | Reliable default; avoids mid-thought cuts | Still blind to meaning and layout |
| Structure-aware (declared) | Splits on headings the format already declares (Word styles, Markdown, HTML tags) | Biggest easy win; near-free when the structure is explicit | Only works where structure is declared; an untagged PDF declares none |
| Structure-aware (reconstructed) | Infers structure a PDF doesn't declare — headings from font size and spacing, sections from layout, reading order from column geometry | Recovers real structure from the hard formats where the value sits | Higher engineering effort, and the inference is fallible — it must be tested against ground truth, not trusted |
| Semantic | Ignores layout entirely; embeds sentences and cuts where the meaning shifts | Helps on tangled, multi-topic prose with no usable structure | High cost (an embedding per sentence); frequently fails to beat fixed-size |
| Hierarchical / parent-child | Retrieves small units, returns the larger section | Precision and context together | More moving parts to build and maintain |
| LLM-based / agentic | A model proposes the boundaries | Highest accuracy on small, high-value sets | Does not scale cheaply |
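To make the first and third rows of the table concrete, here is a rough sketch of a fixed-size splitter with overlap and a splitter that cuts on declared Markdown headings. Both functions are hypothetical illustrations, not any library's API, and the fixed-size version counts whitespace-separated words as a stand-in for model tokens.

```python
import re
from typing import List, Optional, Tuple


def fixed_size_chunks(text: str, size: int = 8, overlap: int = 2) -> List[str]:
    """Windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    tokens = text.split()  # assumption: whitespace words approximate tokens
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def heading_chunks(markdown: str) -> List[Tuple[Optional[str], str]]:
    """Split at declared Markdown headings, keeping each heading with its body."""
    sections: list = []  # each entry is [title, body_lines]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append([match.group(1).strip(), []])
        elif sections:
            sections[-1][1].append(line)
        elif line.strip():
            sections.append([None, [line]])  # text before the first heading
    return [(title, "\n".join(body).strip()) for title, body in sections]


windows = fixed_size_chunks("a b c d e f g h i j", size=4, overlap=1)
print(windows)

doc = "# Leave\nUnused days may carry over.\n\n## Parental leave\nSee the current policy.\n"
sections = heading_chunks(doc)
print(sections)
```

The fixed-size version will cut wherever the window ends, while the heading version keeps each section whole but depends entirely on the headings being declared in the source format.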
Format decides difficulty more than content does. Word and Markdown are easy — headings are declared, tables are native objects. PDFs are the hard case, and most valuable enterprise documents are PDFs: there are no declared headings, only larger text; no declared sections, only visual grouping; and reading order on a two-column page can fuse two unrelated topics into fluent nonsense. The structure is there — it is just implied by layout rather than declared, so it has to be reconstructed from font size, spacing, and column geometry. That is buildable and it is what layout-aware extractors do; it is simply more engineering and carries a testing burden the declared formats avoid. Scanned PDFs have no text layer at all and must be read by OCR or a vision model before any structure exists. Tables and images are where the most decision-critical data lives and where extraction fails most invisibly — a flattened table produces confident, wrong answers.
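As a rough illustration of reconstructing headings from font size, the sketch below treats the most common font size as body text and promotes noticeably larger lines to headings. It assumes a PDF extractor has already produced `(text, font_size)` pairs in correct reading order, which is itself a hard step; the `infer_sections` function, the `ratio` threshold, and the sample lines are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def infer_sections(
    lines: List[Tuple[str, float]], ratio: float = 1.15
) -> List[Tuple[Optional[str], List[str]]]:
    """Group lines under inferred headings.

    A line counts as a heading when its font size is at least `ratio` times
    the body size, where body size is the size covering the most characters.
    """
    if not lines:
        return []
    weight: Counter = Counter()
    for text, size in lines:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]

    sections: List[Tuple[Optional[str], List[str]]] = [(None, [])]
    for text, size in lines:
        if size >= body_size * ratio:
            sections.append((text, []))    # larger text starts a new section
        else:
            sections[-1][1].append(text)   # body text joins the current section
    # Drop the leading placeholder if nothing came before the first heading.
    return [s for s in sections if s[0] is not None or s[1]]


# Made-up extractor output for a single-column page.
page = [
    ("Coverage limits", 16.0),
    ("Annual limit applies per member.", 10.0),
    ("Claims above the limit are declined.", 10.0),
    ("Exclusions", 16.0),
    ("Cosmetic procedures are excluded.", 10.0),
]

for heading, body in infer_sections(page):
    print(heading, "->", body)
```

A heuristic like this will misfire on pull quotes, large table captions, and documents with inconsistent styling, which is why its output needs checking against ground truth before it is trusted.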
A handful of open-source tools give a head start: recursive and structure-aware splitters in LangChain and LlamaIndex for the easy formats, and a layout-aware extractor such as Docling — which reconstructs reading order and table structure and runs on your own infrastructure — for the PDF-and-table cases that do the damage.
§ 05
Decision logic for the Copilot Rollout Team
Treat the publish button as the start of the work, not the end of it — and never scale a document agent whose retrieval you have not measured on your own documents.
§ 06
Takeaway for leaders
Fund segmentation and retrieval as a named workstream with an owner, not as an assumed feature of whichever assistant you licensed. Require a corpus-grounded evaluation harness as a gate before any agent reaches employees. Treat provenance and reading-order validation as compliance controls, because in a regulated answer they are. Budget for the gap between "published" and "trustworthy," because the vendor's pricing and demo will not.
The five wrong moves: