Why Microsoft Copilot's HR Agent Fails Out of the Box: A Segmentation Problem

§ 01

The decision in one sentence

Microsoft sells the publish button as the finish line. On a real document set it is the starting line — and almost everything left to do between "published" and "trustworthy" is segmentation work the demo never shows.

Your AI assistant answers from your own documents. Its accuracy ceiling was set before the model was ever called — by how those documents were split, structured, retrieved, and dated.

This is not a Microsoft problem. Every document-Q&A product carries the same gap between the live demo and a dependable rollout, because all of them inherit the same dependency: the quality of the units the system retrieves before the model writes a word. Microsoft is simply the most useful example, because most enterprises already own the tool and have already been told it works out of the box.

§ 02

Why your Employee Self-Service Agent won't go live well out of the box

Microsoft 365 Copilot ships an HR and IT assistant — the Employee Self-Service Agent — that answers policy questions, grounded on your SharePoint content, surfaced inside Copilot and Teams. It reached general availability and it is genuinely impressive in a demo. Connect a SharePoint site, publish, and an employee can ask "how many leave days do I carry over?" and get an answer.

The mechanical promise is real. The quality promise is the one that breaks. Microsoft's own documentation is unusually candid about this: the agent is described as a starting point that admins are expected to extend and customize, not a finished answer engine. The deployment guidance then lists the work that comes after "publish" — connect the right SharePoint sources, rewrite topics into your organisation's language, tune the system and behaviour instructions, and curate which content the agent is even allowed to see. None of that is optional polish. It is the difference between an agent that demos well on a clean FAQ and one that answers correctly on a 90-page policy PDF.

The gap is not that the product is weak. It is that the hard part — turning a messy, real-world document estate into clean retrievable knowledge — was never going to be solved by a publish button, and was quietly left to the customer to discover.

§ 03

The story — what enterprises actually fix

When an organisation says the agent "didn't work until we fixed it," they almost never mean they touched the model. They mean they built the segmentation and retrieval layer the product assumed was already there. The fixes cluster into four:

Restructure the source documents. Microsoft explicitly instructs content owners to write scannable, topic-focused pages with clear headings, and to centralise official content into a single source of truth, so the agent can ground answers reliably. Read plainly: the tool cannot recover structure a document never had, so humans must add it first. This is the whole thesis of segmentation, conceded by the vendor.

Fix metadata, permissions, and document lifecycle. Retrieval quality depends on consistent labelling and tagging; without it, results are noisy and irrelevant. Loose permissions are worse: they let wrong or sensitive content surface in answers, so rollouts begin by restricting the agent to a curated allowlist of approved sites. The subtler problem is versioning. Copilot's grounding ranks results for relevance and uses a freshness signal, but it does not do version-aware retrieval; it has no concept that one policy supersedes another. When last year's leave policy and this year's both sit in SharePoint, the older one can still be ranked and quoted as current, often because it has accumulated more matching language over time. Microsoft's own remediation for this is not a setting that resolves versions for you. It is governance: archive or clearly mark superseded documents so the stale one is never a candidate. If the index handled versions, that housekeeping would not be the advice.

Turn on and tune better retrieval. Grounding quality rides on the semantic index and newer retrieval layers that pull more context with more precision. These are configuration decisions, not defaults to assume.

For complex corpora, build a real pipeline behind the agent. When the document set is genuinely hard — long PDFs, dense tables, mixed formats — teams leave the declarative agent and build a custom retrieval pipeline in Azure: documents chunked, tagged, and ranked through Azure AI Search, with Azure Document Intelligence handling layout and tables, and Copilot reduced to the front door employees see. Microsoft's own internal "Licensing Navigator," answering questions over a notoriously complex licensing corpus, was built exactly this way — the out-of-the-box experience could not carry it.

A worked example. An insurer points the agent at a SharePoint library of scanned policy documents. A benefit table spans a page break, so half the limits land in one retrieved unit and half in the next, and a two-column layout interleaves an unrelated exclusions clause. The agent answers a coverage question with a limit that does not exist in that product. The model performed perfectly. The failure happened at ingestion, the step nobody scoped, and no dashboard flagged it. The HR version of the same failure is quieter still: an employee asks about parental leave, the agent retrieves last quarter's superseded policy because it reads as the better match, and answers with an entitlement that no longer applies, fluently and confidently, with no signal that the source was retired. Version-sensitive questions are exactly where naive retrieval is weakest; published benchmarks put accuracy on them in the low 60s (percent) without a version-aware layer.
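The HR failure above disappears if a superseded version can never become a retrieval candidate. The following is a minimal sketch of that idea, not a Copilot or SharePoint feature: it assumes each document carries a stable policy key, an effective date, and an archived flag. The `PolicyDoc` fields, the `current_versions` helper, and the sample documents are all hypothetical and would need mapping to whatever metadata your library actually holds.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PolicyDoc:
    # Hypothetical metadata fields, not a SharePoint or Copilot schema.
    doc_id: str
    policy_key: str        # stable identifier shared by every version of a policy
    effective_from: date
    archived: bool = False


def current_versions(docs: list[PolicyDoc], today: date) -> list[PolicyDoc]:
    """Keep, per policy_key, only the newest non-archived version already in force."""
    latest: dict[str, PolicyDoc] = {}
    for doc in docs:
        if doc.archived or doc.effective_from > today:
            continue  # retired, or not yet in force
        best = latest.get(doc.policy_key)
        if best is None or doc.effective_from > best.effective_from:
            latest[doc.policy_key] = doc
    # Sort by id only to make the result deterministic.
    return sorted(latest.values(), key=lambda d: d.doc_id)


# Invented sample library: two versions of each policy.
docs = [
    PolicyDoc("leave-2023", "leave", date(2023, 1, 1)),
    PolicyDoc("leave-2024", "leave", date(2024, 1, 1)),
    PolicyDoc("travel-2022", "travel", date(2022, 6, 1)),
    PolicyDoc("travel-2021", "travel", date(2021, 6, 1), archived=True),
]
eligible = current_versions(docs, today=date(2024, 9, 1))
print([d.doc_id for d in eligible])  # ['leave-2024', 'travel-2022']
```

Only the eligible set would be indexed or passed to the retriever. The filter is only as good as the metadata behind it, which is why the lifecycle hygiene comes first.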

There is also a measurement trap. The out-of-the-box agent gives no real way to test retrieval quality; practitioners report that evaluating whether grounding actually works is manual and cumbersome, which is a second reason teams graduate to a custom Azure pipeline where they can measure it.
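Measuring retrieval does not need heavy tooling to get started. As a rough sketch, assuming you can call your retriever as a function that returns ranked document ids, a hit rate at k over your own question set looks like this. The `hit_rate_at_k` helper, the canned retriever, and the questions are hypothetical stand-ins for a real pipeline and a real labelled set.

```python
from typing import Callable


def hit_rate_at_k(
    question_set: list[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int,
) -> float:
    """Fraction of questions with at least one acceptable document in the top k."""
    if not question_set:
        raise ValueError("question_set is empty")
    hits = 0
    for question, acceptable_ids in question_set:
        top_k = retrieve(question)[:k]
        if any(doc_id in acceptable_ids for doc_id in top_k):
            hits += 1
    return hits / len(question_set)


# Stand-in retriever returning canned rankings; replace with a call
# into your real retrieval layer.
canned = {
    "How many leave days carry over?": ["leave-2023", "leave-2024", "travel-2022"],
    "What is the hotel cap?": ["travel-2022", "leave-2024"],
    "Who approves parental leave?": ["leave-2024", "travel-2022"],
}


def fake_retrieve(question: str) -> list[str]:
    return canned.get(question, [])


# Each question is paired with the ids a human judged acceptable.
question_set = [
    ("How many leave days carry over?", {"leave-2024"}),
    ("What is the hotel cap?", {"travel-2022"}),
    ("Who approves parental leave?", {"parental-2024"}),
]

print(f"hit@1 = {hit_rate_at_k(question_set, fake_retrieve, 1):.2f}")  # 0.33
print(f"hit@2 = {hit_rate_at_k(question_set, fake_retrieve, 2):.2f}")  # 0.67
```

Hit rate is a deliberately coarse metric. It says whether the right document was retrievable at all, which is the question to settle before judging answer quality.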

§ 04

The lever you control — segmentation methods

Everything above turns on one set of choices: how documents are split into the units that get retrieved. There is no universally best method; the right one depends on your documents and your questions. Independent testing through 2025 repeatedly found that simple fixed-size chunking matched or beat fashionable embedding-based "semantic" chunking on real tasks, at a fraction of the cost — so this is a decision to test against your own corpus, not to take from a vendor benchmark.

Method	How it splits	Strength	Where it falls short
Fixed-size	Every N tokens, with overlap	Trivial, cheap, a strong baseline	Cuts across ideas; ignores document structure
Recursive	Splits on a hierarchy of separators (paragraph, sentence)	Reliable default; avoids mid-thought cuts	Still blind to meaning and layout
Structure-aware (declared)	Splits on headings the format already declares (Word styles, Markdown, HTML tags)	Biggest easy win; near-free when the structure is explicit	Only works where structure is declared — a PDF declares none
Structure-aware (reconstructed)	Infers structure a PDF doesn't declare — headings from font size and spacing, sections from layout, reading order from column geometry	Recovers real structure from the hard formats where the value sits	Higher engineering effort, and the inference is fallible — it must be tested against ground truth, not trusted
Semantic	Ignores layout entirely; embeds sentences and cuts where the meaning shifts	Helps on tangled, multi-topic prose with no usable structure	High cost (an embedding per sentence); frequently fails to beat fixed-size
Hierarchical / parent-child	Retrieves small units, returns the larger section	Precision and context together	More moving parts to build and maintain
LLM-based / agentic	A model proposes the boundaries	Highest accuracy on small, high-value sets	Does not scale cheaply
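To make the first and third rows of the table concrete, here is a minimal sketch of fixed-size chunking with overlap next to a splitter that uses declared Markdown headings. It uses whitespace-separated words as a stand-in for tokens, and the function names and sample text are invented for illustration; a production splitter would use a real tokenizer and handle nested headings.

```python
import re


def fixed_size_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def heading_chunks(markdown: str) -> list[tuple[str, str]]:
    """Split on declared Markdown headings; return (heading, body) pairs.

    Headings with an empty body are dropped.
    """
    sections = []
    heading, body = "(preamble)", []
    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            if body_text := "\n".join(body).strip():
                sections.append((heading, body_text))
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    if body_text := "\n".join(body).strip():
        sections.append((heading, body_text))
    return sections


# Invented sample policy text.
sample = """# Annual leave
Employees accrue 25 days per year.

# Carry-over
Up to 5 unused days carry over to the next year.
"""

print(fixed_size_chunks("a b c d e f g", size=4, overlap=1))  # ['a b c d', 'd e f g']
for heading, body in heading_chunks(sample):
    print(heading, "->", body)
```

The fixed-size splitter knows nothing about the document and will cut wherever the count runs out. The heading splitter keeps each topic whole, but only because the format declared where the topics start.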

Format decides difficulty more than content does. Word and Markdown are easy — headings are declared, tables are native objects. PDFs are the hard case, and most valuable enterprise documents are PDFs: there are no declared headings, only larger text; no declared sections, only visual grouping; and reading order on a two-column page can fuse two unrelated topics into fluent nonsense. The structure is there — it is just implied by layout rather than declared, so it has to be reconstructed from font size, spacing, and column geometry. That is buildable and it is what layout-aware extractors do; it is simply more engineering and carries a testing burden the declared formats avoid. Scanned PDFs have no text layer at all and must be read by OCR or a vision model before any structure exists. Tables and images are where the most decision-critical data lives and where extraction fails most invisibly — a flattened table produces confident, wrong answers.
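The two-column problem can be shown in a few lines. This sketch assumes a PDF extractor has already produced text blocks with bounding boxes, and that the page has exactly two columns with no full-width headings; the `Block` type, the coordinates, and the sample text are invented. Real layout-aware extractors handle far more cases than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    # Hypothetical output of a PDF text extractor.
    text: str
    x0: float   # left edge
    x1: float   # right edge
    top: float  # distance from the top of the page


def naive_order(blocks: list[Block]) -> list[str]:
    """Read strictly top to bottom, then left to right: interleaves columns."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.top, b.x0))]


def column_order(blocks: list[Block], page_width: float) -> list[str]:
    """Assign each block to the left or right column, then read each column down."""
    midline = page_width / 2

    def key(b: Block) -> tuple[int, float]:
        column = 0 if (b.x0 + b.x1) / 2 < midline else 1
        return (column, b.top)

    return [b.text for b in sorted(blocks, key=key)]


# Invented two-column page, 600 units wide.
blocks = [
    Block("Exclusions:", 320, 560, 100),
    Block("Benefit limits:", 40, 280, 100),
    Block("Pre-existing conditions.", 320, 560, 120),
    Block("Dental cover applies.", 40, 280, 120),
]

print(naive_order(blocks))
print(column_order(blocks, page_width=600))
```

The naive order places "Exclusions:" between the benefit heading and its body, which is the fused-topics failure described above. The column-aware order keeps each topic contiguous, and it is itself an inference that should be checked against ground truth.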

A handful of open-source tools give a head start: recursive and structure-aware splitters in LangChain and LlamaIndex for the easy formats, and a layout-aware extractor such as Docling — which reconstructs reading order and table structure and runs on your own infrastructure — for the PDF-and-table cases that do the damage.

§ 05

Decision logic for the Copilot Rollout Team

What is the dominant format the agent will read? If mostly Word, Markdown, or HTML, extraction is cheap; spend on retrieval tuning and evaluation. If mostly PDF, fund layout-aware extraction first; it is the binding constraint.

Are scanned documents in scope? If yes, budget explicitly for OCR or a vision model; no structure exists until they are read.

Do tables carry decision-critical facts? If yes, table extraction is a primary requirement and must be tested in isolation, not assumed.

Can the rollout show retrieval accuracy on your own question set? If no, accuracy is unknown regardless of how the demo looked. Require this gate before scaling.

Do your documents get superseded — and is version or effective-date metadata attached? Policies, rate cards, and procedures change. If nothing tags which version is current, the agent cannot prefer it, and lifecycle hygiene plus a recency-and-authority rerank become requirements, not refinements.

Must answers be auditable to source? In regulated settings, yes — make provenance a build requirement, not a retrofit.

Treat the publish button as the start of the work, not the end of it — and never scale a document agent whose retrieval you have not measured on your own documents.
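The recency-and-authority rerank named in the version question above can be sketched as a score adjustment applied after retrieval. This is one possible formulation under stated assumptions, not a Copilot setting: the `Candidate` fields are hypothetical, the half-life decay is a design choice, and the weights are illustrative values that would need tuning against your own question set.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    # Hypothetical fields; relevance is assumed to be normalised to [0, 1].
    doc_id: str
    relevance: float
    effective_from: date
    authoritative: bool  # e.g. lives in the approved policy library


def rerank(
    candidates: list[Candidate],
    today: date,
    half_life_days: float = 365.0,  # illustrative: recency bonus halves every year
    w_recency: float = 0.3,         # illustrative weight
    w_authority: float = 0.2,       # illustrative weight
) -> list[Candidate]:
    """Order candidates by relevance plus recency and authority bonuses."""

    def score(c: Candidate) -> float:
        age_days = max((today - c.effective_from).days, 0)
        recency = 0.5 ** (age_days / half_life_days)
        authority = 1.0 if c.authoritative else 0.0
        return c.relevance + w_recency * recency + w_authority * authority

    return sorted(candidates, key=score, reverse=True)


# Invented example: the older policy matches the query slightly better.
candidates = [
    Candidate("leave-2023", 0.82, date(2023, 1, 1), authoritative=False),
    Candidate("leave-2024", 0.74, date(2024, 1, 1), authoritative=True),
]

ranked = rerank(candidates, today=date(2024, 7, 1))
print([c.doc_id for c in ranked])  # ['leave-2024', 'leave-2023']
```

On relevance alone the 2023 policy would win. A rerank like this only softens the problem, since a sufficiently strong match can still outrank the current version, so it complements lifecycle hygiene rather than replacing it.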

§ 06

Takeaway for leaders

Fund segmentation and retrieval as a named workstream with an owner, not as an assumed feature of whichever assistant you licensed. Require a corpus-grounded evaluation harness as a gate before any agent reaches employees. Treat provenance and reading-order validation as compliance controls, because in a regulated answer they are. Budget for the gap between "published" and "trustworthy," because the vendor's pricing and demo will not.

The six wrong moves:

Believing the publish button is the finish line. It is the start of the segmentation and retrieval work, not a substitute for it.

Blaming the model for ingestion failures. The accuracy ceiling was set upstream; a bigger model does not raise it.

Assuming semantic chunking is the upgrade. It is a hypothesis to test against your corpus, often a cost without a return.

Letting tables flatten into text. This produces confident, wrong answers on your most important data.

Trusting the agent to know which version is current. It ranks for relevance, not validity; without lifecycle hygiene and version metadata, it will quote a retired policy as if it still applies.

Going live without measuring retrieval on your own documents. A clean demo on easy content tells you nothing about your policies and contracts.
