Microsoft Researchers Highlight Limitations of Current LLMs in Long-Running Workplace Tasks




A new preprint paper from Microsoft researchers underscores a key challenge for AI adoption in professional settings: today's frontier large language models (LLMs) still struggle with reliable, extended document editing and delegated workflows.

The study, titled "LLMs Corrupt Your Documents When You Delegate," tested 19 models, including leading frontier systems such as OpenAI’s GPT-5.4, Anthropic’s Claude 4.6 Opus, and Google’s Gemini 3.1 Pro, across 52 professional domains using a new benchmark called DELEGATE-52. The benchmark simulates realistic, multi-step workflows in which an AI repeatedly edits work documents (e.g., reports, code, analyses) over a long chain of interactions.

Key Findings
- **Frontier models** corrupted or lost an **average of 25%** of document content after 20 delegated interactions. Weaker and older models performed worse, with degradation of around 50% on average.
- Errors compounded over time. In frontier models, this often appeared as subtle **content corruption**; in weaker ones, outright **deletions**.
- Models were deemed "ready" for delegation in a domain only if they maintained ≥98% accuracy after 20 steps. **Python coding** was the clear outlier, where many models succeeded. In the vast majority of domains, they fell short—often severely.
- Performance degraded further with larger documents, longer interaction chains, or added complexity (like distractor files).
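
To get a feel for how small per-step errors compound into the figures above, here is a rough back-of-the-envelope sketch. It assumes a constant, independent loss rate at every interaction and treats both the 25% figure and the 98% threshold as "fraction of content still intact." Both are simplifications made for illustration, not the paper's methodology, and the function names are hypothetical.

```python
def per_step_retention(final_fraction: float, steps: int) -> float:
    """Constant per-step retention that compounds to `final_fraction` after `steps` steps.

    Assumes the same fraction of content survives every interaction
    (an illustrative simplification, not the paper's model).
    """
    return final_fraction ** (1.0 / steps)


def retained_after(per_step: float, steps: int) -> float:
    """Fraction of content still intact after `steps` interactions."""
    return per_step ** steps


STEPS = 20

# Frontier average reported in the article: 25% lost, so 75% retained after 20 steps.
frontier_rate = per_step_retention(0.75, STEPS)

# Readiness bar reported in the article: at least 98% intact after 20 steps.
ready_rate = per_step_retention(0.98, STEPS)

print(f"Frontier average implies ~{(1 - frontier_rate) * 100:.2f}% lost per step")
print(f"Readiness bar allows only ~{(1 - ready_rate) * 100:.2f}% lost per step")
```

Under these assumptions, losing roughly 1.4% of a document per interaction is enough to reach 25% loss after 20 steps, while the 98% bar tolerates only about 0.1% per interaction. That gap is one way to see why sparse errors matter so much in long workflows.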

The researchers conclude that current LLMs are **not ready for fully delegated workflows** in most professional domains. They introduce sparse but severe errors that can silently accumulate, undermining trust in autonomous AI handling of important documents.

Context and Caveats
This is a **preprint** (not yet peer-reviewed), and results reflect simulated multi-turn editing rather than single-shot prompts or well-scoped agent setups with strong human oversight. AI capabilities are advancing quickly—the paper itself notes progress in the GPT family over 16 months. Domains like coding show more promise than highly specialized or natural-language-heavy ones.

Notably, the paper did not test Microsoft’s own Copilot. Real-world outcomes also depend heavily on **human-in-the-loop** practices, prompt engineering, retrieval-augmented generation (RAG), version control, and validation steps.
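
As one illustration of what a validation step might look like, the sketch below compares a document before and after a delegated edit and rejects the edit if too many of the original lines failed to survive unchanged. It uses Python's standard `difflib` module. The function names and the 10% default threshold are hypothetical choices for this example, not anything specified in the paper.

```python
import difflib


def changed_fraction(before: str, after: str) -> float:
    """Fraction of the original lines that do not survive unchanged in the edit."""
    before_lines = before.splitlines()
    after_lines = after.splitlines()
    if not before_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=before_lines, b=after_lines, autojunk=False)
    # Matching blocks are runs of lines present, in order, in both versions.
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - kept / len(before_lines)


def accept_edit(before: str, after: str, max_changed: float = 0.10) -> bool:
    """Accept the edit only if it stays within the change budget (hypothetical threshold)."""
    return changed_fraction(before, after) <= max_changed


# Usage: a one-line tweak passes, a silent truncation is flagged for human review.
original = "\n".join(f"line {i}" for i in range(10))
small_edit = original.replace("line 3", "line three")
truncated = "\n".join(original.splitlines()[:5])

print(accept_edit(original, small_edit, max_changed=0.2))
print(accept_edit(original, truncated, max_changed=0.2))
```

A check like this is better at catching outright deletions and large rewrites than the subtle corruption the study attributes to frontier models; a single altered figure inside one line would fall well under the budget. It complements human review and version history rather than replacing them.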

Implications
This research adds to evidence that while AI excels at many assistive tasks—drafting, summarizing, brainstorming, and code suggestions—**blind delegation** of complex, long-running knowledge work remains risky in 2026. It can produce "workslop": plausible but flawed output that requires human cleanup.

For organizations rushing to replace workers with AI outright, this is a reminder that **augmentation** (AI + skilled humans) is currently far more reliable than pure automation for most high-stakes domains. Oversight, verification, and domain expertise still matter a great deal.

The paper is available on arXiv for those interested in the methodology and detailed results. It provides a useful benchmark for tracking future progress toward truly reliable AI delegates.