OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics
Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair,
arXiv · 1 day ago
An Agency-Transferring Model-Free Policy Enhancement Technique
Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems
arXiv · 1 day ago
PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws
Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically power
arXiv · 1 day ago
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing wo
arXiv · 1 day ago
Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting
AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare
arXiv · 1 day ago
Topological Neural Operators
We introduce Topological Neural Operators (TNOs), a principled framework for operator learning on cell complexes that lifts neural operators (NOs) from functions on points and/or edges to topological
arXiv · 1 day ago
Bandits for Efficient Experimentation: Adapting to Control Group, Preferences, and Context Drifts
We consider a variant of the linear contextual stochastic multi-armed bandits, where the learner must provide recommendations to a group of users, each having its personalized preference vector, and i
arXiv · 1 day ago
FASE: Fast Adaptive Semantic Entropy for Code Quality
Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM
arXiv · 1 day ago
Who Earns the Safety? Intervention-Aware Quantum Predictive Control with Safety Attribution
Hard safety filters are increasingly placed downstream of learned controllers to guarantee constraint satisfaction at run time. Yet a filtered controller that never violates a constraint may still hav
arXiv · 1 day ago
SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation
Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simula
arXiv · 1 day ago
Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan
Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this s
arXiv · 1 day ago
Preserving Plasticity in Continual Learning via Dynamical Isometry
Continual training of deep neural networks under non-stationarity often leads to a progressive loss of plasticity, eventually limiting further learning. We relate plasticity to the empirical Neural Ta
arXiv · 1 day ago
Difference-Aware Retrieval Policies for Imitation Learning
Parametric imitation learning via behavior cloning can suffer from poor generalization to out-of-distribution states due to compounding errors during deployment. We show that reusing the training data
arXiv · 1 day ago
Collaborative Human-Agent Protocol (CHAP)
Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibili
arXiv · 1 day ago
Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a
arXiv · 1 day ago