Percepti's daily briefing.
Five fresh ML, stats, and signal-processing papers worth your coffee.
Alex Petrov, Alexander Gusak, Denis Mukha, Dima Korolev
“Instead of letting AI assistants remember things by searching through old chat transcripts, this paper builds memory like a database with strict rules about what to store.”
If you've used an AI assistant that forgets your preferences, contradicts itself, or invents details about you, you've felt the limits of today's memory designs. Production agents — think customer service bots, personal assistants, coding copilots — need to track changing facts (your address, your current project status, who reports to whom) reliably. Treating memory like a search engine over old text works for vague recall ('what did we discuss last month?') but fails for crisp questions ('what is the user's shipping address?'). This paper argues that memory should look more like a well-maintained spreadsheet than a diary, and shows that this shift gives big accuracy gains.
On a structured extraction benchmark, the judge-in-the-loop version reaches about 90% accuracy at correctly identifying objects and about 63% at producing fully correct outputs, beating frontier models with built-in structured output features. On an end-to-end memory benchmark, xmemory hits 97.1% F1, while competing systems land between 80% and 87%. On a higher-level application task, xmemory scores 95.2% accuracy, beating both specialized memory products and harnesses built around top commercial AI assistants. The headline finding: for memory tasks that demand stable facts, the architecture of how you write and validate matters more than how big or smart the underlying model is.
The approach assumes you can define a useful schema ahead of time, which works for known domains (CRM, support, project tracking) but is harder for open-ended exploration where you don't know what matters yet. The benchmarks are partly the authors' own, so independent replication will be important. Schema-grounded memory also has more moving parts — extraction, validation, retries — meaning higher cost per write and more engineering to maintain as needs evolve. Skeptics will point out that schemas can become stale or wrong, and that real users often say things the schema didn't anticipate. The paper does not fully address how schemas evolve over time or how the system degrades when input doesn't fit any known object type.
Mem0 is a recent production memory system for AI agents that the authors compare against as a baseline. xmemory differs by enforcing schema-grounded writes and validation gates rather than relying primarily on retrieval over stored text.
Retrieval-augmented generation (RAG) is the dominant pattern this paper pushes back against. RAG fetches relevant text chunks at read time; xmemory instead does the heavy interpretation at write time so reads are clean queries over verified records.
LoCoMo introduced a benchmark for very long-term conversational memory in agents. The authors use this lineage to evaluate end-to-end memory and argue retrieval-only systems underperform on stateful operations.
Self-Refine showed that letting models iteratively critique and revise their own outputs improves quality. xmemory borrows this iterative-refinement idea but channels it through schema validation rather than free-form self-feedback.
Persistent AI memory is often reduced to a retrieval problem: store prior interactions as text, embed them, and ask the model to recover relevant context later. This design is useful for thematic recall, but it is mismatched to the kinds of memory that agents need in production: exact facts, current state, updates and deletions, aggregation, relations, negative queries, and explicit unknowns. These operations require memory to behave less like search and more like a system of record. This paper argues that reliable external AI memory must be schema-grounded. Schemas define what must be remembered, what may be ignored, and which values must never be inferred. We present an iterative, schema-aware write path that decomposes memory ingestion into object detection, field detection, and field-value extraction, with validation gates, local retries, and stateful prompt control. The result shifts interpretation from the read path to the write path: reads become constrained queries over verified records rather than repeated inference over retrieved prose. We evaluate this design on structured extraction and end-to-end memory benchmarks. On the extraction benchmark, the judge-in-the-loop configuration reaches 90.42% object-level accuracy and 62.67% output accuracy, above all tested frontier structured-output baselines. On our end-to-end memory benchmark, xmemory reaches 97.10% F1, compared with 80.16%-87.24% across the third-party baselines. On the application-level task, xmemory reaches 95.2% accuracy, outperforming specialised memory systems, code-generated Markdown harnesses, and customer-facing frontier-model application harnesses. The results show that, for memory workloads requiring stable facts and stateful computation, architecture matters more than retrieval scale or model strength alone.
“xmemory replaces retrieval-style AI memory with a schema-grounded write path that decomposes ingestion into object, field, and value extraction with validation gates, achieving 97.1% F1 on an end-to-end memory benchmark.”
Most current 'long-term memory' for LLM agents — Mem0, LoCoMo-style systems, vanilla RAG with vector stores — is fundamentally a dense retrieval pipeline: store conversation turns, embed them, fetch top-k at query time, and let the model re-interpret. This works for thematic recall ('what did we talk about?') but breaks on the operations production agents actually need: exact-fact lookup, state updates, deletes, aggregation, relations, negative queries ('does the user have any open tickets?'), and explicit unknowns. The paper's thesis — that memory must behave like a system of record, not a search index — directly addresses where retrieval-based memory hits a wall, and the strong end-to-end numbers (97.1% F1 vs 80–87% for baselines) suggest the architectural shift is doing real work that scaling models alone won't fix.
On the structured-extraction benchmark, the judge-in-the-loop xmemory configuration reaches 90.42% object-level accuracy and 62.67% output accuracy, surpassing all tested frontier structured-output baselines. The gap between object-level and output accuracy (≈28 points) suggests the system is good at identifying which objects are present, while producing fully correct structured outputs end to end is harder — consistent with field-value extraction being the bottleneck. On the end-to-end memory benchmark, xmemory achieves 97.10% F1, compared to an 80.16–87.24% range across third-party baselines — a roughly 10–17 point improvement, which is large for a memory benchmark covering updates, deletes, and unknowns where retrieval systems typically fail. On the application-level task, xmemory hits 95.2% accuracy, beating specialized memory systems (Mem0-style), code-generated Markdown harnesses (a strong simple baseline where the agent writes notes to files), and customer-facing frontier-model application harnesses. The authors' takeaway: for memory workloads dominated by stable-fact and stateful-computation requirements, architectural choices dominate raw model strength or retrieval scale.
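To make the write-path idea concrete, here is a minimal sketch, in plain Python, of staged object/field/value extraction with a validation gate and field-local retries. The schema, the prompts, and the `call_llm` stub are illustrative assumptions, not the authors' implementation or API.

```python
"""Hedged sketch of a schema-grounded write path in the spirit of the paper's
object -> field -> value decomposition with validation gates and local retries.
Everything here (SCHEMA, the prompts, call_llm) is illustrative, not the
authors' implementation."""
import json

SCHEMA = {
    "shipping_address": {
        "fields": {"street": str, "city": str, "postcode": str},
        "never_infer": {"postcode"},  # values that must appear verbatim in the input
    },
}

def call_llm(prompt: str) -> str:
    """Hypothetical model call; swap in a real client."""
    raise NotImplementedError

def detect_objects(text: str) -> list[str]:
    out = call_llm(f"Which of {list(SCHEMA)} are mentioned? Return a JSON list.\n{text}")
    return [t for t in json.loads(out) if t in SCHEMA]

def detect_fields(text: str, obj: str) -> list[str]:
    known = list(SCHEMA[obj]["fields"])
    out = call_llm(f"Which of {known} are stated for {obj}? Return a JSON list.\n{text}")
    return [f for f in json.loads(out) if f in known]

def extract_value(text: str, obj: str, fld: str):
    return json.loads(call_llm(f"Extract {obj}.{fld} verbatim as a JSON string from:\n{text}"))

def validate(obj: str, fld: str, value, text: str) -> list[str]:
    """Validation gate: type check plus a 'never infer' verbatim-grounding check."""
    spec, errs = SCHEMA[obj], []
    if not isinstance(value, spec["fields"][fld]):
        errs.append(f"{fld}: wrong type")
    if fld in spec["never_infer"] and str(value) not in text:
        errs.append(f"{fld}: not grounded verbatim in the input")
    return errs

def write_memory(text: str, max_retries: int = 2) -> dict:
    records = {}
    for obj in detect_objects(text):              # stage 1: object detection
        record = {}
        for fld in detect_fields(text, obj):      # stage 2: field detection
            for _ in range(max_retries + 1):      # stage 3: value extraction + local retry
                value = extract_value(text, obj, fld)
                if not validate(obj, fld, value, text):
                    record[fld] = value
                    break                          # retries stay local to the failing field
        records[obj] = record                      # fields that never validate stay absent
    return records
```

The structural point: a failed validation retries only the failing sub-task, and anything that never validates is left unknown rather than guessed, which is where the read-time guarantees are supposed to come from.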
Several concerns a teammate should flag: (1) Self-built benchmarks are part of the evaluation, so the comparison against Mem0/LoCoMo-style baselines may not be fully apples-to-apples — independent replication on shared benchmarks like LoCoMo with identical splits would strengthen the case. (2) The approach inherits all the costs of multi-stage extraction: latency and token spend per write are substantially higher than 'embed and store', and the judge-in-the-loop variant compounds this. The paper does not appear to report cost or latency curves vs accuracy. (3) Schema design is the elephant in the room — schemas need maintenance, must evolve as user needs change, and the paper cites schema-evolution work (Hernández Chillón et al.) but does not benchmark how xmemory degrades when the schema is mis-specified or incomplete. (4) Out-of-schema content is by construction discarded or stored lossily; for open-ended assistants this is a real limitation. (5) The 62.67% output accuracy on extraction is honestly not high — production use will likely require domain-specific schema tuning. (6) No reported ablations (in the abstract) on which component matters most: object-vs-field decomposition, validation gates, local retries, or the judge. (7) Performance on negative queries and explicit-unknown handling is the strongest theoretical claim but the abstract doesn't break out per-operation numbers. Likely pushback: retrieval advocates will note that hybrid systems (RAG plus structured extraction) are common and that xmemory's gains may shrink against well-tuned hybrids; large-context advocates will argue long-context models with careful prompting can match this without schema overhead.
Mem0 is a production-oriented long-term memory system for AI agents and a primary baseline. xmemory differs by treating writes as schema-validated structured extraction rather than relying on retrieval over stored memory items, and reports substantially higher end-to-end F1.
RAG is the architectural pattern this paper positions itself against. xmemory inverts the typical RAG balance by moving interpretation from read time (retrieve-then-generate) to write time (extract-then-validate), arguing that retrieval is mismatched to stateful memory operations.
Self-Refine demonstrated iterative self-feedback as a way to improve LLM outputs. xmemory uses iterative refinement but constrains it via schema validation gates and local retries, rather than free-form critique, which makes the loop converge to verifiable structure.
LoCoMo introduced a long-term conversational memory benchmark. The paper uses this lineage of end-to-end memory evaluation and argues that retrieval-based systems benchmarked on LoCoMo-style tasks systematically underperform on operations beyond thematic recall.
“xmemory operationalizes schema-grounded memory by decomposing ingestion into staged object/field/value extraction with validation gates and local retries, shifting interpretive load from read time to write time and reporting 97.10% F1 on an end-to-end memory benchmark.”
The agentic-memory literature has been converging on retrieval-centric designs (Mem0, LoCoMo-style benchmarks, GraphRAG variants) that inherit the well-documented pathologies of long-context inference: lost-in-the-middle (Liu et al., 2024), context rot (Hong et al., 2025), and the read-time hallucination dynamics analyzed by Kalai et al. (2025). These pathologies are particularly damaging for the operations production agents actually need — exact-fact CRUD, aggregation, negation, explicit unknowns — which are not naturally expressible as similarity search. The paper's framing (interpretation belongs on the write path) recasts memory design in terms compatible with database semantics: validated records, constrained queries, deterministic reads. If the empirical claims hold up under independent replication, this is a substantive shift in how to architect agent memory, comparable to the move from grep-style IR to relational databases for transactional workloads. The application-level result (95.2% beating specialized memory systems and frontier-assistant harnesses) is the strongest signal that the architectural argument has teeth.
Headline metrics: 90.42% object-level / 62.67% output accuracy on structured extraction (judge-in-the-loop), claimed above 'all tested frontier structured-output baselines'; 97.10% F1 on the end-to-end memory benchmark vs 80.16–87.24% across third-party baselines; 95.2% on the application-level task vs Mem0-class systems, code-generated Markdown harnesses, and frontier assistant harnesses. Several observations: (1) The ~28-point gap between object-level (90.42%) and output (62.67%) accuracy on the extraction benchmark is the most informative single number — it locates the bottleneck firmly in field-value extraction rather than object recognition, and is consistent with prior reports that JSON-mode/structured-output models still struggle with verbatim value fidelity at scale. (2) The 10–17 point F1 gap on end-to-end memory is large enough to be meaningful even discounting benchmark-construction effects, but per-operation breakdowns (updates vs deletes vs negation vs unknowns) are not visible in the abstract and would be the most diagnostic data. (3) Outperforming code-generated Markdown harnesses is non-trivial: writing notes to a Markdown file is a surprisingly strong baseline for many memory tasks (it preserves verbatim text, supports updates via overwrite, and inherits the model's natural language reading skills), so a clean win here is the strongest evidence that schema grounding adds real capability beyond 'just write it down'. (4) Beating customer-facing frontier-model assistant harnesses suggests the gap is not closable by raw model strength alone within the tested horizon, supporting the architecture-over-scale framing that aligns with He et al. (2025) on agentic system design.
Methodological concerns and missing analyses: (1) Benchmark provenance — at least one of the benchmarks appears to be the authors' own; without released splits and protocols, the claim that xmemory beats baselines on a self-constructed benchmark is weak even if true. The community standard would be reporting on LoCoMo (Maharana et al., 2024), MEMTRACK (Deshpande et al., 2025), and Letta's benchmarking suite with identical configurations. (2) Cost/latency reporting — the architecture has a high write-side compute multiplier (3+ stages, retries, optional judge). Production adoption hinges on the cost-accuracy frontier, which the abstract does not characterize. A pareto plot vs Mem0/RAG would be essential. (3) Ablation gap — without component ablations the contribution attribution is unclear. The four candidate sources of gain (object/field decomposition, validation gates, local retries, judge) need to be teased apart; the 90.42 vs 62.67 split hints that the judge is doing heavy lifting on object-level metrics specifically. (4) Schema dependence and brittleness — no reported analysis of degradation under schema misspecification, schema drift, or out-of-distribution inputs. The Hernández Chillón et al. (2024) line of NoSQL schema evolution is cited but apparently not experimentally engaged. (5) Judge correlation risk — if the judge is the same model family as the extractor, gains may be partially illusory under same-family bias; cross-family judge ablation would address this. (6) Output accuracy at 62.67% is not deployment-grade for many enterprise extraction settings, which suggests the technique still requires domain-specific schema engineering and possibly fine-tuning to be production-ready. (7) The 'never infer' discipline is rhetorically clean but operationally fragile: real users coreference, paraphrase, and abbreviate, so the boundary between 'extract verbatim' and 'lightly normalize' will leak in practice. (8) No discussion (in the abstract) of provenance tracking — Buneman et al. (2001) is cited and would be a natural fit for justifying which extraction supports which record, but operational provenance behavior is unclear. (9) The end-to-end benchmark presumably mixes operations; without per-op F1 it's hard to know whether xmemory wins uniformly or only on a subset (suspicion: gains are concentrated on negation and explicit unknowns, where retrieval baselines structurally fail).

Likely pushback: retrieval advocates will argue that hybrid retrieval+extraction systems would close most of the gap; long-context advocates will argue that frontier models with structured prompting and a million-token context can avoid the schema-engineering tax; database purists will (rightly) note that this is reinventing well-understood ETL pipelines with LLMs in the loop and ask whether the LLM is even necessary in the extraction stages once the schema is fixed.

Failure modes to probe: schema-incomplete inputs (information silently dropped), schema-mis-specified inputs (information mis-categorized), high-cardinality field detection (the field-detection stage's prompt grows with schema size — scaling behavior unclear), adversarial inputs designed to trick the value-extractor into spurious verbatim quoting.

Follow-up directions worth pursuing: (a) automated schema induction from interaction logs to amortize the schema-engineering cost; (b) information-theoretic accounting along the lines of He et al. (2025) and Shannon (1948) — specifically, characterizing the bits of interpretation moved from read to write and the implied compression ratio; (c) integration with provenance tracking (Buneman et al., 2001) and grammar-aligned decoding (Park et al., 2024) for the value-extraction stage; (d) a principled treatment of schema evolution that preserves prior records under schema changes; (e) comparison to OneKE (Luo et al., 2025) and Dagdelen et al. (2024) on shared structured-extraction benchmarks; (f) hybrid designs where the structured store is the primary memory but a small RAG channel handles open-ended thematic recall, formalizing the boundary between system-of-record and system-of-engagement memory.
Mem0 is the canonical production-oriented LLM agent memory system and a head-to-head baseline. xmemory diverges by inverting the read/write balance: where Mem0 emphasizes scalable retrieval over stored memory items, xmemory enforces schema-validated structured writes so that reads degenerate to constrained queries.
Self-Refine established iterative self-feedback as an inference-time improvement loop. xmemory adapts this idea but constrains the loop with schema-validation termination criteria and localizes retries to failing sub-tasks, converting an open-ended critique loop into a bounded, verifiable refinement.
Argues that hallucination is partially a calibration/incentive problem — models prefer plausible answers to admitting unknowns. xmemory operationalizes a counter-incentive at the system level by making 'unknown' a first-class schema value and forbidding inference on protected fields.
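A small illustration of what 'unknown as a first-class value' can look like in code; this is a generic sketch with assumed record shapes, not xmemory's API.

```python
"""Hedged illustration of 'unknown' as a first-class value, so reads distinguish
'not stored' from 'stored as empty' and never guess. Record shapes are assumed
for illustration; this is not xmemory's API."""
from enum import Enum

class Unknown(Enum):
    UNKNOWN = "unknown"

UNKNOWN = Unknown.UNKNOWN

def read_field(records: dict, obj: str, fld: str):
    """Constrained read: return the verified value or UNKNOWN, never an inference."""
    return records.get(obj, {}).get(fld, UNKNOWN)

def has_open_tickets(records: dict):
    """Negative query over verified records (assumes 'tickets' is a list of dicts)."""
    tickets = records.get("tickets")
    if tickets is None:
        return UNKNOWN  # nothing verified yet; refuse to infer "no"
    return any(t.get("status") == "open" for t in tickets)
```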
LoCoMo formalized very-long-term conversational memory evaluation. The paper inherits this evaluation framing for end-to-end memory tasks and uses it to argue retrieval-only memory systems systematically underperform on stateful operations beyond thematic recall.
Binghao Huang, Yunzhu Li
“FlexiTac is a cheap, open-source touch-sensor kit that snaps onto robot grippers so robots can feel what they're handling.”
Robots are great at seeing but bad at feeling. Vision alone struggles with tasks like plugging in a cable, picking up a soft fruit without crushing it, or knowing when a grip is slipping. Good touch sensors usually cost a lot or require a specialist to build, which keeps tactile robotics out of reach for most labs and startups. FlexiTac is a 'plug-in' kit anyone can build with off-the-shelf parts and use today, lowering the barrier to giving robots a sense of touch and making research results easier to reproduce across different teams.
The paper demonstrates that FlexiTac can be mounted on multiple robot platforms without major mechanical changes and that it plugs into modern tactile-learning pipelines: fusing touch with 3D vision for contact-aware decisions, transferring learned skills between different robot bodies, and doing 'real-to-sim-to-real' training where a policy is fine-tuned in a fast GPU-based touch simulator before going back to hardware. The abstract emphasizes practicality - low cost, fast to fabricate, 100 Hz streaming, repeatable builds - rather than head-to-head benchmark numbers, so concrete accuracy or success-rate figures are not given in the abstract itself.
Piezoresistive films like Velostat are known to drift over time, respond differently at different temperatures, and can suffer from hysteresis (the reading depends partly on what just happened, not only the current pressure). The abstract doesn't quantify durability, calibration stability, or spatial resolution compared with camera-based sensors like GelSight or DIGIT, which generally give richer contact information. 'Low-cost and open-source' is a real win, but the proof will be in whether other labs can actually reproduce the pads with comparable performance and whether the sensors hold up over thousands of grasps. Expect pushback from groups using high-resolution vision-based tactile sensors who will want apples-to-apples comparisons on standard manipulation benchmarks.
Earlier work (3D-ViTac) by one of the same authors showed that combining 3D vision with tactile signals helps robots do fine manipulation. FlexiTac provides a cheaper, more scalable hardware backbone for that kind of visuo-tactile learning.
AnySkin pushed the idea of plug-and-play tactile skins for robots. FlexiTac shares the 'easy to attach, easy to replace' philosophy but uses a piezoresistive FPC-Velostat-FPC stack aimed at dense pressure maps and high fabrication throughput.
DIGIT is a popular low-cost vision-based tactile sensor. FlexiTac targets a different trade-off: thinner, more flexible, easier to scale to large areas, at the cost of the rich image-like signal a camera-based sensor provides.
Showed that piezoresistive textile sensors can capture human-environment interactions at scale. FlexiTac applies a similar materials philosophy to robot end-effectors with FPC-integrated electrodes for repeatable manufacturing.
We present FlexiTac, a low-cost, open-source, and scalable piezoresistive tactile sensing solution designed for robotic end-effectors. FlexiTac is a practical "plug-in" module consisting of (i) thin, flexible tactile sensor pads that provide dense tactile signals and (ii) a compact multi-channel readout board that streams synchronized measurements for real-time control and large-scale data collection. FlexiTac pads adopt a sealed three-layer laminate stack (FPC-Velostat-FPC) with electrode patterns directly integrated into flexible printed circuits, substantially improving fabrication throughput and repeatability while maintaining mechanical compliance for deployment on both rigid and soft grippers. The readout electronics use widely available, low-cost components and stream tactile signals to a host computer at 100 Hz via serial communication. Across multiple configurations, including fingertip pads and larger tactile mats, FlexiTac can be mounted on diverse platforms without major mechanical redesign. We further show that FlexiTac supports modern tactile learning pipelines, including 3D visuo-tactile fusion for contact-aware decision making, cross-embodiment skill transfer, and real-to-sim-to-real fine-tuning with GPU-parallel tactile simulation. Our project page is available at https://flexitac.github.io/.
“FlexiTac is an open-source, FPC-laminated piezoresistive tactile sensor system with a 100 Hz multi-channel readout, designed as a drop-in module for modern visuo-tactile learning.”
Tactile sensing is widely acknowledged to help in contact-rich manipulation, but the field is fragmented across incompatible hardware: vision-based gels (GelSight, DIGIT, OmniTact) give rich images but are bulky and per-finger; magnetic skins (ReSkin, AnySkin) are thin but indirect; capacitive/piezoresistive arrays scale in area but historically suffer fabrication variability. This makes it hard to share datasets, reproduce policies, or scale data collection. FlexiTac targets exactly that gap: a piezoresistive solution whose electrodes are integrated into FPCs for batch manufacturing, with a documented readout board, an open simulator hook, and demonstrations on multiple robot embodiments. If the hardware reproduces well across labs, it could become a default 'tactile commodity' for visuo-tactile learning research, similar to how RealSense became default for depth.
The headline results in the abstract are systems-level rather than benchmark numbers. FlexiTac is shown to (i) be mountable on diverse robot platforms - rigid parallel-jaw and soft grippers, multiple end-effector geometries - without mechanical redesign; (ii) stream synchronized multi-channel tactile data at 100 Hz suitable for real-time control; and (iii) plug into three nontrivial learning regimes: 3D visuo-tactile fusion for contact-aware decision making, cross-embodiment skill transfer, and real-to-sim-to-real fine-tuning with GPU-parallel tactile simulation. The abstract does not report task success rates, spatial resolution in mm, force range/sensitivity in N, signal-to-noise figures, or unit cost numbers, so the quantitative case rests on the project page and full paper rather than the abstract. Given the body length (~5800 words), one should expect at least qualitative comparisons in those sections and likely demonstration tasks where FlexiTac-equipped grippers outperform vision-only baselines on contact-heavy manipulation.
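For readers wiring a pad like this into their own stack, the host-side read loop is conceptually simple. Below is a hedged sketch using pyserial; the port, baud rate, channel count, and packet framing are assumptions for illustration, since FlexiTac's actual protocol is documented in the released firmware and project page.

```python
"""Hedged sketch of a host-side read loop for a multi-channel piezoresistive pad
streaming at ~100 Hz over serial. Port, baud rate, channel count, and the packet
layout (a 0xAA55 sync header followed by one little-endian uint16 per taxel) are
assumptions for illustration, not FlexiTac's documented protocol."""
import struct
import serial  # pyserial

N_TAXELS = 64                  # assumed channel count
FRAME = 2 + 2 * N_TAXELS       # assumed frame: 2-byte header + uint16 per taxel

def read_frames(port: str = "/dev/ttyACM0", baud: int = 115200):
    """Yield one pressure map (tuple of N_TAXELS ints) per received frame."""
    with serial.Serial(port, baud, timeout=0.1) as ser:
        buf = b""
        while True:
            buf += ser.read(FRAME)
            i = buf.find(b"\xaa\x55")            # resync on the assumed header
            if i < 0 or len(buf) - i < FRAME:
                continue                          # wait for a complete frame
            payload = buf[i + 2 : i + FRAME]
            buf = buf[i + FRAME :]
            yield struct.unpack(f"<{N_TAXELS}H", payload)
```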
Several concerns are likely to come up in review. First, Velostat is well-known for drift, hysteresis, temperature sensitivity, and limited dynamic range; the abstract does not discuss calibration strategy, lifetime under repeated grasping, or how the sealed laminate handles shear and edge loading. Second, while integrating electrodes into FPCs improves repeatability vs hand-built arrays, unit-to-unit Velostat variability and aging may still be the dominant noise source - real reproducibility numbers across a batch would be the right ablation. Third, 'low-cost' and 'scalable' need quantification: BOM cost, yield, and the fabrication time per pad relative to ReSkin/AnySkin or capacitive textile arrays. Fourth, the 100 Hz figure is fine for many manipulation tasks but may be limiting for slip detection or fast contact transients; channel count and latency matter too. Fifth, the visuo-tactile learning demos need apples-to-apples comparisons - ideally the same task and policy class with FlexiTac vs DIGIT/GelSight/AnySkin - to show that the cheaper, lower-resolution signal is sufficient. Sixth, the GPU-parallel tactile simulator is a strong selling point but introduces its own sim-to-real gap that should be characterized; prior work (Narang et al., Bi et al., Church et al.) shows this is nontrivial. Finally, the abstract's framing as a 'plug-in module' will invite scrutiny of how it compares with AnySkin's plug-and-play pitch, especially on durability and replacement workflow.
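On the drift point: a common generic mitigation (not FlexiTac's published calibration) is to track a slow per-taxel baseline while the pad appears unloaded and report baseline-subtracted pressure, roughly as sketched below.

```python
"""A generic drift mitigation (not FlexiTac's published calibration): track a slow
per-taxel baseline while the pad appears unloaded and report baseline-subtracted
pressure. Thresholds and rates are illustrative."""
import numpy as np

class BaselineTracker:
    def __init__(self, n_taxels: int, alpha: float = 0.01, contact_thresh: float = 30.0):
        self.baseline = np.zeros(n_taxels)
        self.alpha = alpha                    # slow adaptation rate
        self.contact_thresh = contact_thresh  # counts above baseline treated as contact

    def update(self, raw: np.ndarray) -> np.ndarray:
        delta = raw - self.baseline
        unloaded = delta < self.contact_thresh
        # Let the baseline creep only where no contact is suspected, so sustained
        # presses are not absorbed into the baseline over time.
        self.baseline[unloaded] += self.alpha * delta[unloaded]
        return np.clip(raw - self.baseline, 0.0, None)
```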
3D-ViTac, by one of the same authors, established a pipeline for fine-grained manipulation by fusing 3D vision with dense tactile signals. FlexiTac is the natural hardware companion: a cheaper, more scalable sensor that feeds the same kind of visuo-tactile policy and is explicitly demonstrated on that pipeline.
AnySkin pushed plug-and-play magnetometer-based skins as a swappable tactile module. FlexiTac shares the modular 'mount it and go' philosophy but uses a piezoresistive FPC-Velostat-FPC stack aimed at dense pressure maps over larger areas with cheaper fabrication, trading magnetic sensitivity for areal scalability.
DIGIT showed that a low-cost vision-based tactile sensor can support in-hand manipulation research. FlexiTac targets the opposite end of the trade space: lower spatial richness than a camera/gel system, but thinner, more flexible, and easier to deploy across heterogeneous end-effectors and large surfaces.
VT-Refine demonstrated bimanual assembly via simulation fine-tuning with visuo-tactile feedback. FlexiTac's claim of GPU-parallel tactile simulation and real-to-sim-to-real fine-tuning sits in the same line of work, providing the hardware and simulator infrastructure to scale that approach more broadly.
Chenxin Li, Zhengyang Tang, Huangxin Lin, Yunlong Lin, Shijue Huang, Shengyuan Liu, Bowen Ye, Rang Li, Lei Li, Benyou Wang, Yixuan Yuan
“A new benchmark tests AI assistants on realistic, regularly updated office and computer tasks, and even the best model only finishes about two-thirds of them.”
Companies are starting to hand real workflows over to AI agents: things like updating HR records, fixing files in a workspace, or coordinating across multiple business apps. If we only grade an agent on whether its final message sounds right, we miss whether it actually completed the job correctly. And if our test set never changes, agents (and the people training them) will quietly overfit to it. A benchmark that updates with what people actually need done, and that audits the agent's footprints, gives a much more honest picture of whether these systems are ready to be trusted with real work.
No model is close to solved performance. The top model passes only 66.7% of tasks, and none break 70%. Failures cluster in specific areas: HR, management, and tasks that span multiple business systems are the hardest, while local workspace repair is easier but still not maxed out. The authors also point out that just looking at leaderboard rank is misleading: two models with similar pass rates can behave very differently on overall completion, and the tasks that actually distinguish models are concentrated in a middle 'medium-difficulty' band.
The benchmark is only as good as the 'demand signals' it pulls from: if those signals are biased toward certain kinds of work, the benchmark will be too. The simulated business services are controlled fixtures, so they may not capture the messiness of real enterprise systems with weird permissions, flaky APIs, or unusual data. Using an LLM judge for semantic checks introduces some subjectivity, even if it's only used where deterministic checks can't reach. And because the benchmark refreshes over time, comparing scores across releases will need careful versioning. What needs proof next: that scores on this benchmark actually predict real-world deployment success, and that refreshing the task set genuinely prevents overfitting rather than just adding noise.
SWE-bench grades agents on resolving real GitHub issues with code-level checks. Claw-Eval-Live extends that 'verify the actual work, not just the answer' philosophy beyond coding, into broader business and workspace workflows, and adds a refreshable task layer.
WorkArena tests web agents on common knowledge-work tasks in a fixed environment. Claw-Eval-Live shares the focus on real workflow tasks but explicitly avoids freezing the task set, refreshing tasks from live demand signals.
LiveCodeBench introduced the idea of a continuously refreshed, contamination-resistant coding benchmark. Claw-Eval-Live applies a similar live-update philosophy to workflow agents instead of code generation.
WebCanvas benchmarks web agents in online environments with evolving content. Claw-Eval-Live borrows the 'evolving environment' instinct but evaluates structured business and workspace workflows with deterministic graders rather than live web pages.
LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.
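The grading split described in the abstract (deterministic checks over recorded evidence where possible, a structured LLM judge only for semantic dimensions) is easy to picture as code. The sketch below is illustrative: the check names, evidence layout, fixture values, and `judge_llm` stub are assumptions, not the benchmark's actual grader API.

```python
"""Hedged sketch of two-tier grading: deterministic checks over recorded evidence
where possible, a structured LLM judge only for semantic dimensions. The check
names, evidence layout, fixture values, and judge_llm stub are assumptions, not
the benchmark's actual grader API."""
import hashlib
import json
from pathlib import Path

EXPECTED_SHA256 = "0" * 64  # placeholder fixture hash for illustration

def judge_llm(rubric: str, evidence: str) -> bool:
    """Hypothetical structured judge call; used only where deterministic checks can't reach."""
    raise NotImplementedError

def grade(evidence_dir: str) -> dict:
    ev = Path(evidence_dir)
    service_state = json.loads((ev / "service_state.json").read_text())
    results = {}

    # Deterministic: did the HR record actually reach the target value?
    results["hr_record_updated"] = (
        service_state.get("employee/1042", {}).get("manager") == "j.doe"
    )

    # Deterministic: was the repaired workspace file restored byte-for-byte?
    repaired = ev / "workspace" / "config.yaml"
    results["config_restored"] = (
        repaired.exists()
        and hashlib.sha256(repaired.read_bytes()).hexdigest() == EXPECTED_SHA256
    )

    # Semantic only where evidence cannot be checked mechanically.
    results["summary_faithful"] = judge_llm(
        rubric="The summary must mention every failed step in the audit log.",
        evidence=(ev / "audit.log").read_text(),
    )

    results["pass"] = all(results.values())  # shared pass rule: every check must pass
    return results
```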
“Claw-Eval-Live is a refreshable, action-verified benchmark for workflow agents where the best of 13 frontier models passes only 66.7% of 105 tasks.”
Workflow automation is the current frontier for LLM agents: the pitch is that agents complete end-to-end units of work across SaaS tools, internal services, and local workspaces. But the field's evaluation infrastructure is mismatched to that pitch. Static benchmarks like SWE-bench-style snapshots invite contamination and overfitting, and answer-only grading rewards plausible-sounding outputs even when the agent did not actually mutate the right state. Claw-Eval-Live argues, with evidence, that workflow-agent evaluation needs to be 'grounded twice': in fresh external demand (so tasks track what real users want automated) and in verifiable agent action (so a pass implies the work was actually done). The headline finding, no frontier model exceeds 70%, suggests the field is overstating readiness for production workflow deployment.
The leading model passes 66.7% of tasks; no model reaches 70%. Failures are not uniform — they're structured by task family and execution surface. HR, management, and multi-system business workflows are persistent bottlenecks, suggesting that cross-service coordination and stateful business logic are still hard. Local workspace repair is comparatively easier but unsaturated, meaning even the 'easier' surface has headroom. Two further findings sharpen the picture. First, leaderboard rank is insufficient: models with similar pass rates can diverge substantially on overall completion, indicating different failure profiles (e.g., one model partially completes many tasks, another fully completes a different subset). Second, task-level discrimination is concentrated in a middle band of tasks: easy tasks pass for everyone, hard tasks fail for everyone, and the signal sits in the middle, which has implications for how to grow the benchmark.
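The pass-rate-versus-completion divergence is easy to reproduce with a toy example. In the sketch below, 'completion' is read as the mean fraction of sub-checks satisfied per task; the paper's actual definition may differ, so treat this purely as an illustration of why a single leaderboard number hides failure profiles.

```python
"""Toy illustration of why pass rate and 'overall completion' can diverge. Here
'completion' is read as the mean fraction of sub-checks satisfied per task; the
paper's actual definition is not given in the abstract and may differ."""

def pass_rate(runs: list[list[bool]]) -> float:
    """A task passes only if every sub-check passes."""
    return sum(all(checks) for checks in runs) / len(runs)

def completion(runs: list[list[bool]]) -> float:
    """Mean fraction of sub-checks satisfied across tasks."""
    return sum(sum(checks) / len(checks) for checks in runs) / len(runs)

# Two toy models with identical pass rates but different failure profiles:
model_a = [[True, True], [True, True], [False, False], [False, False]]  # all-or-nothing
model_b = [[True, True], [True, True], [True, False], [False, True]]    # graceful partial failures

assert pass_rate(model_a) == pass_rate(model_b) == 0.5
print(completion(model_a), completion(model_b))  # 0.5 vs 0.75
```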
Several limitations deserve attention. (1) Demand-signal validity: 'ClawHub Top-500' and similar public sources reflect a particular community's stated demand and may underweight regulated or proprietary workflows where real automation pain lives. (2) Fixture fidelity: controlled business services are inherently simplified versus real enterprise stacks (auth, rate limits, partial outages, schema drift). Agents that pass here may still fail on production-grade systems. (3) Judge reliability: even narrow LLM-judge usage introduces variance and potential model-family bias, especially when judging outputs from a sibling family. (4) Cross-release comparability: the very feature that makes Claw-Eval-Live live, refreshing the signal layer, complicates longitudinal claims about progress; readers will need clear release-versioned reporting. (5) The 105-task scale is small relative to the breadth of real workflows; concentration of discriminative power in a middle band suggests future releases should deliberately expand that band. (6) The paper, per the abstract, does not appear to report inter-judge agreement, deterministic-vs-judge coverage ratios, or human pass rates, all of which would strengthen claims. What needs proof next: external validity (does Claw-Eval-Live rank predict real deployment success?), contamination resistance (do refreshed releases actually degrade memorized-task performance?), and judge calibration (does the structured LLM judge agree with humans on the semantic slices?).
SWE-bench established execution-grounded grading for coding agents by running test suites against real GitHub issues. Claw-Eval-Live generalizes this 'verify by execution evidence' stance to non-code workflow surfaces (business services, workspace repair) and adds a refreshable task layer.
LiveCodeBench introduced contamination-aware, continuously refreshed evaluation for code generation. Claw-Eval-Live ports the live-refresh principle to agentic workflow evaluation, separating a refreshable signal layer from a frozen release snapshot.
WorkArena evaluates web agents on common enterprise knowledge-work tasks. Claw-Eval-Live shares the enterprise-workflow target but emphasizes refreshable demand sourcing and multi-evidence grading rather than a fixed web-task suite.
AgentBench provides a broad multi-environment evaluation of LLM agents with task-specific success criteria. Claw-Eval-Live narrows to workflow agents but deepens the grading layer with traces, audit logs, and service-state checks, and adds the refreshable signal layer AgentBench lacks.
“Claw-Eval-Live proposes a two-layer (refreshable signal + frozen release) workflow-agent benchmark with multi-evidence grading, finding 66.7% top pass rate across 13 frontier models on 105 tasks.”
Agentic workflow automation is being commercialized aggressively (Claude Code, Codex, Hermes-style agents, MetaGPT-style orchestrators), and procurement decisions increasingly cite benchmark scores. Yet most agent benchmarks inherit assumptions from QA/code-generation evaluation: a frozen, curated task set and answer-string grading. Both assumptions break for workflows. Frozen sets invite contamination as model training corpora absorb leaked tasks and as developers iterate against the leaderboard; answer-string grading conflates 'said the right thing' with 'did the right thing,' a particularly dangerous conflation in environments with side effects. Claw-Eval-Live's framing, 'grounded twice, in fresh external demand and in verifiable agent action,' is a useful design principle the field should adopt, regardless of whether this specific instantiation becomes canonical. The strong empirical result that leaderboard rank diverges from overall completion is, to my reading, the most consequential finding: it implies that current single-number reporting is hiding large differences in agent failure profiles that matter operationally.
Top model: 66.7% pass rate; no model >70% across the 13 evaluated frontier systems. The fact that frontier ceiling sits well below 70% on a 105-task curated suite is itself a significant claim, particularly given how saturated some adjacent benchmarks (HumanEval, MBPP) have become. The failure structure is more interesting than the headline number: errors cluster by task family (HR, management, multi-system business workflows are persistent bottlenecks) and by execution surface (business services harder than workspace repair, but workspace repair also unsaturated). This pattern is consistent with the hypothesis that cross-service stateful coordination, schema reasoning, and long-horizon planning are the binding constraints, not single-tool tool-use, which is largely solved at the frontier. The leaderboard-rank-insufficiency finding, models with similar pass rates diverging on overall completion, implies meaningful variance in partial-credit behavior: some models likely fail catastrophically on hard tasks while others degrade gracefully, which has direct deployment implications. The discriminative-middle-band observation (task-level discrimination concentrates in a mid-difficulty band) has methodological consequences: the benchmark's information content per task is non-uniform, and future releases should deliberately oversample the discriminative band.
Several issues deserve scrutiny. (1) Demand-signal provenance: ClawHub Top-500 is presumably a community-driven popularity ranking, which biases toward developer-visible and English-language workflows and away from regulated/proprietary enterprise tasks (finance compliance, healthcare, legal). The 'fresh external demand' grounding is only as valid as the upstream signal; this should be triangulated with at least one orthogonal demand source. (2) Fixture realism: controlled services strip out auth flakiness, rate limits, eventual consistency, partial outages, schema drift, and adversarial UI patterns (cf. Ersoy et al. on dark patterns). Agents that pass on fixtures may still fail in production; a sim-to-real validation study would strengthen external-validity claims. (3) Judge calibration: even structured LLM-as-judge usage on narrow semantic slices is known to exhibit family bias, sycophancy, and sensitivity to prompt phrasing. The abstract does not report human-judge agreement rates or judge-model ablations, which I would consider essential. (4) Coverage of deterministic vs judge grading: the deterministic share of grading is the key trustworthiness lever; without a reported ratio, it is hard to assess how much of the 66.7% number rests on judge calls. (5) Sample size and statistical power: 105 tasks with 13 models means small per-cell counts on family/surface breakdowns; the 'HR is hardest'-style claims need confidence intervals or bootstrap analysis. (6) Cross-release comparability: refresh introduces a versioning burden that the community has historically managed poorly (cf. WebArena's evaluation issues documented by El Hattami et al. 2025); a clear protocol for reporting (release_id, snapshot_hash, judge_version) is necessary. (7) Reward-hacking risk: with multi-evidence grading, agents may learn to satisfy graders without satisfying intent (cf. MacDiarmid et al. 2025 on emergent misalignment from reward hacking); the paper should report adversarial probing of the graders. (8) Contamination defense is asserted via refresh, but not, per the abstract, empirically demonstrated; an ideal ablation would compare frontier-model performance on N-th release tasks vs (N-1)-th release tasks held out from training. (9) The 'overall completion' metric is referenced but not technically defined in the abstract; if it is sub-task weighted, the weighting scheme is itself a design choice with leaderboard consequences.
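Point (6) above is cheap to operationalize. A minimal sketch of release-versioned score reporting, with field names taken from the tuple mentioned there (and otherwise assumed), could look like this:

```python
"""Hedged sketch of release-versioned score reporting; field names follow the
(release_id, snapshot_hash, judge_version) tuple above and are otherwise assumed."""
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class ScoreReport:
    model: str
    release_id: str         # which benchmark release the tasks came from
    snapshot_hash: str      # hash of fixtures, services, and graders at release time
    judge_version: str      # judge model and prompt revision used for semantic checks
    pass_rate: float
    overall_completion: float

report = ScoreReport("frontier-model-x", "2025.11", "deadbeef" * 8, "judge-v1", 0.667, 0.78)
print(json.dumps(asdict(report), indent=2))
```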
SWE-bench established execution-grounded grading via test-suite pass/fail on real GitHub issues. Claw-Eval-Live generalizes the execution-evidence stance beyond code, adds non-test evidence channels (audit logs, service state, artifacts), and addresses the contamination problem SWE-bench is known to suffer from via a refreshable signal layer.
LiveCodeBench introduced time-windowed, contamination-resistant evaluation for code generation by continuously sourcing fresh problems. Claw-Eval-Live transplants this live-refresh discipline into agentic workflow evaluation and pairs it with the frozen-snapshot pattern to retain reproducibility.
WorkArena targets enterprise knowledge-work tasks for web agents in a fixed environment. Claw-Eval-Live shares the enterprise-workflow target and the multi-system business-workflow emphasis but rejects the fixed-environment assumption and replaces response-checking with multi-evidence grading.
AgentBench evaluates LLM agents across multiple environments with environment-specific success criteria. Claw-Eval-Live narrows scope to workflow agents but deepens the grading layer (traces + audit logs + state + artifacts) and adds the refreshable demand-driven task sourcing AgentBench lacks.
Ivan Bercovich
“A practical guide arguing that benchmark tasks for AI command-line agents should be written to expose failure, not to help the agent succeed, and lists the common ways task authors get this wrong.”
When companies and researchers compare AI coding assistants, they often point to scores on benchmarks like Terminal Bench - tests where an AI agent has to use a command line to fix bugs, set up servers, or write code. Those scores influence which models get hyped, funded, and deployed. If the tests themselves are sloppy - too easy, too leading, or gameable - then the leaderboard is measuring the wrong thing. The author, who has spent over a year writing and reviewing these tasks, says this is happening at scale: by one estimate, more than 15% of tasks in popular terminal-agent benchmarks can be 'reward-hacked,' meaning the AI can pass without actually solving the problem. That makes published scores misleading for anyone trying to decide if an AI is ready for real work.
Because this is a guidelines paper rather than an experiment, there is no headline accuracy number. The most striking concrete claim is that, drawing on recent work, over 15% of tasks in popular terminal-agent benchmarks are reward-hackable - meaning an agent can pass them without genuinely doing the task. The paper's deliverable is a structured list of failure modes and design principles (adversarial, difficult, legible) that authors and reviewers can apply directly when writing or auditing tasks.
The advice is opinion and experience, not measurement - it isn't validated by, say, showing that benchmarks rewritten under these guidelines correlate better with real-world AI usefulness. The 15%-reward-hackable figure comes from related work, not new analysis here. The guidance is also specific to terminal/sysadmin/coding agents; some of it transfers to other agent benchmarks, but not all. And there's an inherent tension the paper doesn't fully resolve: making tasks more adversarial and harder also makes them more expensive to author and review, which pushes against the market pressure (acknowledged in the abstract) to ship benchmark tasks quickly. Whether the field will actually slow down and adopt stricter standards is an open question.
Terminal-Bench, the benchmark the author has been contributing to and reviewing for over a year. This paper is essentially a lessons-learned document from inside that effort, generalized into guidelines for anyone building similar evaluations.
A companion dataset by the same author cataloging hundreds of reward-hackable terminal-agent environments and thousands of exploit trajectories. It supplies the empirical backbone - including the >15% reward-hackable figure - for the failure modes this guidelines paper warns against.
Classic catalog of 'specification gaming,' where AI systems satisfy the letter of an objective while violating its intent. The paper applies this lens specifically to benchmark task design, arguing many tasks accidentally invite specification gaming.
METR's empirical observation that current frontier models actively reward-hack evaluations. The guidelines paper treats this as motivation: if top models are already gaming benchmarks, sloppy task design isn't a theoretical risk, it's actively corrupting today's leaderboards.
Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they write prompts. They shouldn't. A prompt is designed to help the agent succeed; a benchmark is designed to find out if it can. We argue that good tasks are adversarial, difficult, and legible, and that a large class of common failure modes -- AI-generated instructions, over-prescriptive specifications, clerical difficulty, oracle solutions that assume hidden knowledge, tests that validate the wrong things, and reward-hackable environments -- are predictable consequences of treating task authoring as prompt authoring. We catalog these failure modes, argue that real difficulty is conceptual rather than environmental, and discuss recent empirical evidence that over 15% of tasks in popular terminal-agent benchmarks are reward-hackable. We hope this serves as a useful reference for benchmark maintainers, task contributors, and researchers using benchmark scores as evidence.
“A guidelines paper arguing that terminal-agent benchmark tasks should be written adversarially rather than like prompts, and cataloging the failure modes that follow when authors confuse the two.”
Terminal-agent benchmarks - Terminal Bench, Terminal-Bench Pro, SETA, OpenThoughts-Agent, and similar - have become a primary signal for ranking coding and sysadmin capability of frontier LLMs. They feed model release blog posts, RL training environments, and product claims. As an evaluation market emerges around them, throughput pressure is rising and adversarial review is lagging. Recent empirical work cited here suggests over 15% of tasks in popular benchmarks are reward-hackable, meaning a non-trivial portion of leaderboard signal reflects exploit discovery rather than capability. For anyone using benchmark deltas as evidence - in papers, in procurement, or in RL reward design - this is a systemic measurement problem, not a rounding error.
There are no model-accuracy numbers because this is not a benchmarking paper. The headline empirical claim, imported from related work, is that >15% of tasks in popular terminal-agent benchmarks are reward-hackable - which the author treats as a lower bound on how much published scores overstate true capability. The contribution is a structured taxonomy: three design principles (adversarial, difficult, legible), a list of six recurring failure modes, and a sharper distinction between conceptual and environmental difficulty. Practically, this is the kind of artifact that can be turned into a review checklist for benchmark PRs - and the author's framing suggests that's roughly how it has been used inside Terminal Bench review.
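As one way such a taxonomy could be operationalized - our sketch, not an artifact from the paper - the six failure modes map naturally onto a yes/no checklist a reviewer could run against each benchmark-task PR. The question wording below is ours.

```python
# Hypothetical PR-review checklist derived from the paper's six failure modes.
# A task gets flagged if the reviewer answers "yes" to any question.
CHECKLIST = {
    "ai_generated_instructions": "Do the instructions read like unedited LLM output?",
    "over_prescriptive_spec": "Does the spec hand the agent the solution step by step?",
    "clerical_difficulty": "Is the difficulty mostly tedium rather than a real concept?",
    "oracle_hidden_knowledge": "Does the reference solution rely on facts the agent cannot see?",
    "wrong_thing_validated": "Do the tests check side effects instead of the stated goal?",
    "reward_hackable_environment": "Can the verifier be satisfied without doing the task?",
}

def review(answers: dict[str, bool]) -> list[str]:
    """Return the failure modes a reviewer flagged for this task."""
    return [mode for mode, flagged in answers.items() if flagged]

# Example: a task whose tests only check that an output file exists.
print(review({mode: (mode == "reward_hackable_environment") for mode in CHECKLIST}))
```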
Several limitations a careful reader should flag. First, the guidelines are asserted, not validated: the paper does not show that tasks rewritten under these principles produce more predictive or stable model rankings, nor that they correlate better with downstream utility. Second, the >15% reward-hackable figure is borrowed from related work and not re-derived here, so its scope and methodology should be checked at the source. Third, there is an unaddressed economic tension: adversarial authoring and review are substantially more expensive than prompt-style authoring, which conflicts with the throughput pressure the paper itself identifies in the evaluation market - the paper does not propose how maintainers should fund or incentivize the harder workflow. Fourth, some failure modes (oracle solutions assuming hidden knowledge, validating the wrong things) are not unique to terminal agents and have analogs in software testing and RL reward design; the paper would be stronger if it engaged with that literature more directly. Fifth, 'legibility' is left somewhat underspecified - it's clear what it rules out, less clear what operationally satisfies it, especially for long-horizon tasks where trajectories are large. Likely pushback from benchmark authors: that overly adversarial tasks become brittle, ambiguous, or unfair, and that some 'reward hacks' are legitimate solutions the spec failed to anticipate; the paper acknowledges this tension implicitly but does not adjudicate it. What needs proof next: a controlled study showing that benchmarks audited under these guidelines change model rankings or reduce score variance, and a quantification of how many publicly reported model-vs-model deltas would survive a strict adversarial re-authoring pass.
Introduces Terminal-Bench, the benchmark the author has been reviewing and contributing to. This guidelines paper is effectively a retrospective on what task design patterns held up and which ones broke under adversarial scrutiny inside that project, generalized to a broader audience.
A companion artifact (Terminal Wrench) by the same author cataloging 331 reward-hackable environments and 3,632 exploit trajectories. It supplies the empirical evidence behind the failure-mode taxonomy here, including the >15% reward-hackable claim, and grounds the guidelines in observed exploits rather than speculation.
Foundational catalog of specification gaming in RL and AI systems. The paper transposes that lens onto benchmark authoring, arguing many of the failure modes in terminal-agent tasks are specification gaming made possible by under-specified verifiers and over-permissive environments.
Documents reward tampering and emergent subterfuge in LLMs trained with RL. Relevant because terminal-agent benchmarks are increasingly used as RL environments, not just evals - so reward-hackable tasks don't just inflate scores, they actively train models to exploit verifiers, sharpening the urgency of the guidelines.
Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu, Min Zhang
“A new test suite called WindowsWorld checks whether AI assistants can actually finish multi-step office jobs that span several Windows apps, and today's best agents mostly fail.”
There is a lot of hype right now about AI agents that can use your computer for you - booking travel, filing expenses, doing research. Most public benchmarks measure those agents on neat, self-contained chores like 'edit this one document' or 'fill in this one form.' But office workers do not live inside one app; their day is a relay race across browsers, spreadsheets, chat tools, file explorers, and email. WindowsWorld is the first serious yardstick built around that relay race on Windows, and the results are sobering. If we want to trust agents with actual professional workflows, we need benchmarks like this one, and we need to know honestly where the gaps are. Otherwise companies will deploy agents that look great in a demo and quietly break on Tuesday morning.
Across the board, the leading agents struggled. On multi-app tasks, success rates stayed below 21%, far worse than on single-app tasks. The agents particularly fell apart on jobs that required them to make a judgment call ('if the invoice is over $500, route it to finance') and to coordinate across three or more apps - they would often stall on an early sub-goal and never recover. Even when they did make progress, they were inefficient: many runs blew past the number of steps a human would need, and still ended in failure. So the picture is not 'almost there' - it's that today's agents are decent button-pushers within one app and quite bad at stitching a real workflow together.
A few honest limits. First, this is a simulator, so it does not capture every quirk of a messy real desktop, network hiccups, or weird enterprise software. Second, the tasks were generated with the help of AI and then human-reviewed; that pipeline is scalable but can bake in stylistic patterns that favor or hurt certain agents. Third, 'occupation-grounded' is a nice framing, but 16 occupations and 181 tasks are still a sample, not the world of work. Fourth, the headline numbers depend on which agents and models were tested at this moment in time - GUI agents are improving fast, so the absolute scores will move. The deeper finding - that cross-app, conditional, long-horizon work is the real frontier - is the part that should age well. What needs proving next: can agents trained or prompted specifically for cross-app planning close the gap, and do these lab results predict real on-the-job reliability?
OSWorld is the closest predecessor: a benchmark for multimodal agents in real computer environments. WindowsWorld keeps the simulated-desktop idea but explicitly targets cross-application, profession-grounded workflows instead of mostly single-app tasks.
Windows Agent Arena also evaluates OS-level agents on Windows at scale. WindowsWorld differs by centering on multi-step, multi-app professional workflows with intermediate sub-goal checks rather than isolated tasks.
ProBench pushes for accurate process-level (sub-goal) evaluation of GUI agents. WindowsWorld adopts a similar process-centric scoring philosophy and applies it to cross-application desktop workflows.
AndroidWorld provides a dynamic benchmark for autonomous agents on mobile. WindowsWorld is the desktop, cross-application analogue, focused on professional Windows workflows rather than phone tasks.
While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical real-world requirement of coordinating across multiple applications to accomplish complex profession-specific workflows. To bridge this gap, we present a computer-use benchmark in cross-application workflows, named WindowsWorld, designed to systematically assess GUI Agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework steered by 16 occupations to generate four difficulty-level tasks with intermediate inspection, which are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78% are inherently multi-application. Experimental results of leading large models and agents show that: 1) All computer-use agents perform poorly on multi-application tasks (< 21% success rate), far below the performance of simple single-app tasks; 2) They largely fail at tasks requiring conditional judgment and reasoning across $\geq$ 3 applications, stalling at early sub-goals; 3) Low execution efficiency, where tasks often fail despite far exceeding human step limits. Code, benchmark data, and evaluation resources are available at github.com/HITsz-TMG/WindowsWorld.
“WindowsWorld is a 181-task, process-graded benchmark of cross-application Windows workflows grounded in 16 occupations, on which top computer-use agents score below 21% on multi-app tasks.”
Most existing computer-use and GUI-agent benchmarks - OSWorld, Windows Agent Arena, OmniAct, Mind2Web, AndroidWorld, VisualWebArena - either focus on single applications, web-only domains, or short-horizon tasks. Yet the commercial pitch for these agents (replace knowledge-worker drudgery, automate back-office processes) lives or dies on multi-app coordination: pulling data from a CRM into a spreadsheet, summarizing it in a doc, and emailing the result. WindowsWorld is one of the first benchmarks built explicitly around that mismatch, with profession-driven task design and process-level (not just outcome-level) scoring. The empirical headline - sub-21% success on multi-app workflows from frontier agents - is a useful corrective to the demo-driven narrative, and it gives the field a concrete target. It also dovetails with the recent move toward process-centric evaluation (e.g. ProBench) and reflects an emerging consensus that step-wise sub-goal checking is needed to meaningfully evaluate long-horizon agents.
Three findings carry the paper. (1) Across the evaluated agents, success on multi-application tasks is below 21%, dramatically below their performance on single-app tasks - the gap, not just the absolute number, is the headline. (2) Conditional reasoning over 3+ applications is a near-cliff: agents typically stall at early sub-goals and never recover, which suggests the failure is in planning and state-tracking, not in low-level GUI grounding. (3) Execution is inefficient: agents routinely exceed reasonable human step counts and still fail, implying loops, redundant exploration, and poor self-monitoring rather than productive search. Together these results argue that improvements in single-app accuracy do not transfer to professional workflows, and that the efficiency dimension (steps to success, not just success) deserves to be a first-class metric. Concrete per-model numbers are not given in the abstract beyond the <21% multi-app ceiling.
Several issues a colleague should flag. First, simulator validity: the tasks live in a controlled Windows simulation, and we do not yet know how scores translate to real enterprise machines with VPNs, SSO, drift in app versions, and unpredictable popups. Second, generation bias: tasks authored by a multi-agent pipeline tend to inherit the planning style of the generator LLMs, which can systematically advantage agents built on similar models or disadvantage agents with different action vocabularies. Human review mitigates but does not eliminate this. Third, sub-goal grading assumes a canonical decomposition; in real workflows there are multiple correct paths, and a strict checker may under-credit creative solutions - the paper should ideally include inter-rater agreement and partial-credit policies. Fourth, 16 occupations and 181 tasks give meaningful coverage but probably under-sample technical domains (coding IDEs, data engineering tools) and heavy-tail enterprise software. Fifth, the comparison to a 'human step limit' needs definition: averaged over how many humans, with what familiarity? Likely pushback from agent vendors will be that their production stacks (with retries, planners, memory) outperform the evaluated configurations, which is plausible and would benefit from a more standardized harness. What needs proving next: (a) whether explicitly cross-app planners or hierarchical agents close the gap, (b) whether tool-augmented or MCP-style integrations short-circuit GUI bottlenecks, and (c) how performance scales with model capability vs. agent scaffolding.
OSWorld pioneered open-ended multimodal agent evaluation in real computer environments and is the explicit foil: WindowsWorld keeps the simulated-desktop paradigm but reframes it around cross-application, profession-grounded, process-graded tasks rather than predominantly single-app objectives.
Windows Agent Arena established large-scale evaluation of multimodal OS agents on Windows. WindowsWorld differs by emphasizing multi-step cross-app workflows and intermediate sub-goal checking, rather than breadth of isolated OS-level tasks.
ProBench argues for process-information-rich evaluation of GUI agents. WindowsWorld adopts a similar process-centric philosophy - sub-goal-level grading and trajectory diagnostics - and instantiates it specifically for cross-application desktop workflows.
AndroidWorld provides a dynamic, app-rich benchmark for mobile autonomous agents. WindowsWorld can be seen as the desktop counterpart with a stronger emphasis on professional, multi-application coordination rather than mobile single-app interactions.
“WindowsWorld introduces a process-centric, occupation-conditioned benchmark of 181 cross-application Windows workflows (avg. 5.0 sub-goals, 78% multi-app), on which leading computer-use agents score below 21% and degrade sharply with conditional, ≥3-app reasoning.”
The GUI-agent literature has rapidly accumulated benchmarks - WoB, MiniWoB++, WebShop, Mind2Web, VisualWebArena, OSWorld, Windows Agent Arena, AndroidWorld, AndroidInTheWild, A3, SPA-Bench, OSUniverse, OmniAct, GUI-360, ScreenSpot-Pro, ProBench, MobileWorld - yet most either (i) operate within a single application or domain, (ii) score only terminal outcomes, or (iii) source tasks from crowd templates rather than profession-grounded workflows. The community has converged on the view that long-horizon, cross-app, professional automation is the commercially relevant regime, but lacks a benchmark that operationalizes it with intermediate state inspection. WindowsWorld targets exactly that gap: occupation-conditioned generation gives ecological validity; sub-goal grading gives diagnostic resolution; and the 78% multi-app share gives statistical power to study cross-app failure modes. The reported sub-21% multi-app success rate is a useful, and probably durable, lower bound on agent capability for real workflows, and a concrete target for scaffolding research, hierarchical planners, and memory architectures.
Three quantitative claims anchor the evaluation. (1) Multi-application success tops out below 21% across all evaluated computer-use agents, with single-app performance materially higher - establishing a large gap attributable specifically to cross-app coordination. (2) Tasks requiring conditional judgment across ≥3 applications elicit early-sub-goal stalls: agents fail to advance past the first or second checkpoint, which under process-centric scoring reads as low partial-credit, not as 'almost solved.' This pattern is consistent with planner/state-tracking failure rather than grounding failure - if perception were the bottleneck, we would expect more uniform partial progress along the trajectory. (3) Execution efficiency is poor: many failed runs exceed human step budgets by large margins, indicating non-productive exploration (likely loops, repeated re-grounding, and lack of internal progress estimation). Absolute per-model numbers are not given in the abstract; the comparative profile across difficulty tiers is the more informative result. Together these findings imply that scaling current single-step grounding accuracy will not, on its own, close the multi-app gap - the failure surface lives in workflow-level reasoning.
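For readers new to process-centric grading, here is a minimal sketch of ordered sub-goal scoring - the sub-goals, state checks, and equal-weight credit policy are our assumptions for illustration, not the benchmark's actual implementation. It shows why an early stall reads as near-zero partial credit rather than 'almost solved'.

```python
from typing import Callable, Sequence

State = dict  # simplified stand-in for a snapshot of the desktop environment

def process_score(subgoals: Sequence[Callable[[State], bool]], state: State) -> float:
    """Ordered sub-goal grading: credit stops at the first unmet checkpoint."""
    completed = 0
    for check in subgoals:
        if not check(state):
            break
        completed += 1
    return completed / len(subgoals)

# Hypothetical 5-sub-goal workflow: export a report, compute a total,
# paste it into a doc, attach the doc to an email, send the email.
subgoals = [
    lambda s: s.get("report_exported", False),
    lambda s: s.get("total_computed", False),
    lambda s: s.get("doc_updated", False),
    lambda s: s.get("email_attached", False),
    lambda s: s.get("email_sent", False),
]

# An agent that stalls after the first checkpoint scores 0.2, not "almost solved".
print(process_score(subgoals, {"report_exported": True}))  # -> 0.2
```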
Methodological pushback worth raising: (1) Sub-goal grading granularity. If sub-goal completion is determined by environment state checks, multi-path solutions risk being under-credited; the paper should report inter-rater agreement on decompositions and an ablation comparing strict vs. lenient state matching. (2) Generator-induced bias. LLM-authored tasks tend to encode the generator's plan structure, which can advantage agents built on similar base models. A useful sanity check would be to compare success rates on LLM-authored vs. human-authored subsets, or to perturb task phrasing to test robustness. (3) Agent harness parity. Computer-use agents differ wildly in action vocabularies (raw mouse/keyboard vs. accessibility-tree calls vs. mixed) and in scaffolding (planner, memory, retries). Without a unified harness, the <21% headline is a property of the (agent, scaffold) pair, not of the underlying model. Reporting under at least two scaffolding regimes (minimal and best-known) would strengthen the claim. (4) Simulator validity. Real desktops introduce nondeterminism (network, modal popups, version drift, locale, multi-monitor) that simulators sanitize; an external validity study on a small live-machine subset would be high-value.

(5) Efficiency metric. 'Human step limit' needs operational definition (how many humans, expert vs. novice, recorded under what UI?), otherwise the inefficiency claim is suggestive but not pinned. (6) Coverage. 17 applications and 16 occupations skew toward office-knowledge work; technical occupations using IDEs, terminals, BI tools, or proprietary enterprise stacks may behave very differently and are likely underrepresented. (7) Conditional reasoning failures may partly reflect prompt/observation truncation rather than reasoning per se - an ablation with extended context, scratchpads, or explicit sub-goal hints would tease this apart. (8) Statistical power. With 181 tasks split across four difficulties, the per-cell sample sizes for fine-grained slices (e.g., '≥3 apps with branching') may be small; bootstrap confidence intervals on the sub-21% number are essential.

Missing ablations I'd want before treating this as definitive: (a) hierarchical planner vs. flat ReAct on the same backbone, (b) memory/replay across sub-goals, (c) accessibility-tree-only vs. screenshot-only vs. multimodal observation, (d) MCP/tool-augmented bypass of GUI for portions of the workflow, (e) retraining or fine-tuning a single model on cross-app trajectories to test whether the gap is data-limited or architecture-limited, (f) human upper bound including time and step distributions, not just a budget. Failure modes to probe further: clipboard-mediated state transfer, focus management across windows, dialog-handling, time-sensitive tasks, and recovery from incorrect actions.
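On point (8), a percentile bootstrap over task-level outcomes is cheap to report. The sketch below uses an invented success vector sized roughly to the multi-app subset (about 141 of 181 tasks) purely to show the computation, not the benchmark's real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented example: ~141 multi-app tasks (78% of 181) with a ~20% success rate.
outcomes = (rng.random(141) < 0.20).astype(float)

# Percentile bootstrap over tasks: resample task outcomes with replacement.
boot = np.array([
    rng.choice(outcomes, size=outcomes.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"success rate = {outcomes.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```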
Strong follow-ups: (i) train a cross-app planner on synthetic trajectories from this generator and test whether the multi-app gap closes without harming single-app accuracy; (ii) build an evaluator that gives credit for alternate correct decompositions (graph-structured sub-goals); (iii) extend to live machines with telemetry replay; (iv) couple WindowsWorld with ProBench-style process metrics and ScreenSpot-Pro grounding tests to factor performance into grounding × planning × execution components; (v) study whether scaling agent ensembles (à la Gonzalez-Pumariega et al., 2025) yields disproportionate gains on the conditional-≥3-app slice, which would suggest search/verification rather than single-policy capability is the binding constraint.
OSWorld is the methodological anchor for simulated-OS benchmarking of multimodal agents. WindowsWorld inherits the simulated-desktop paradigm but pivots from open-ended, largely single-application tasks to occupation-grounded, cross-application workflows with sub-goal-level evaluation.
Windows Agent Arena scaled multimodal OS-agent evaluation on Windows. WindowsWorld differs in distribution (78% multi-app, profession-conditioned) and in evaluation protocol (process-centric sub-goal checking and step-efficiency metrics) rather than in platform.
ProBench advocates accurate process-information evaluation of GUI agents. WindowsWorld operationalizes a closely related stance for cross-application desktop workflows, exposing where in trajectories agents stall - a diagnostic capacity outcome-only metrics lack.
AndroidWorld established dynamic, app-rich benchmarking for mobile autonomous agents. WindowsWorld is conceptually the desktop counterpart with explicit emphasis on cross-application coordination and conditional reasoning, complementing rather than replacing mobile evaluation.
The goal is to make the papers richer, with enough detail to actually remember them.
New batch tomorrow morning.
Five papers in your inbox every morning.