THE DISPATCH

Vol. 1, No. 1 · Monday, February 23, 2026 · All the News Fit to Print
In Progress (Targeting ICML 2026)
New Benchmark Reveals Coordination as Fundamental Challenge for Multi-Agent AI Systems

Khatua, A.¹, Zhu, H.¹, Tran, P.², Prabhudesai, A.², Yu, X.¹, Sadrieh, F.², Lieberwirth, J. K.², Fu, Y.¹, Ryan, M. J.¹, Pei, J.¹, & Yang, D.¹ (¹Stanford University, ²SAP)

Initial results from Cotomata, a multi-agent collaboration benchmark, show that LLMs coordinate poorly in version-controlled programming environments. Models drop from 70-75% accuracy when working alone (single-agent) to 20-30% when working together (multi-agent), with the majority of failures attributed to mismatched assumptions about shared state. The benchmark includes multi-agent interaction protocols and a failure-analysis pipeline.

Paper:

The Curse of Coordination: Why Current AI Cannot be Your Teammates

Preprint (2026)

TherapyGym Trains Small Models to Match Frontier Therapists

Huang, F., Chbeir, S., Khatua, A., Wang, S., Tan, S., Ye, K., Bailey, L., Daniel, M., Louie, R., Koyejo, S., & Adeli, E.

TherapyGym, a clinical evaluation and training framework for therapy chatbots, launched at Stanford. By decomposing therapeutic competence into measurable clinical skills, the framework trained a small open-source model (Qwen3-4B) with reinforcement learning to increase clinical skill scores 6x while reducing safety violations by 47%, matching the performance of much larger frontier models.
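The RL recipe above implies a reward that rises with decomposed clinical-skill scores and falls with safety violations. A minimal sketch of such a reward follows; the skill names, equal weighting, and penalty size are illustrative assumptions, not the paper's actual objective.

```python
def therapy_reward(skill_scores: dict, safety_violations: int,
                   penalty: float = 2.0) -> float:
    """Hypothetical scalar reward: mean of per-skill clinical scores,
    minus a fixed penalty per safety violation (illustrative only)."""
    skill_term = sum(skill_scores.values()) / len(skill_scores)
    return skill_term - penalty * safety_violations

# A turn scoring well on decomposed skills, with no violations:
r = therapy_reward({"empathy": 0.9, "reflection": 0.8, "safety_check": 1.0}, 0)
```

Decomposing competence into named skills is what makes a reward like this measurable at all; a single holistic "good therapy" score would be far noisier to optimize.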

Paper:

TherapyGym: A Clinical Evaluation and Training Framework for Therapy Chatbots

Preprint (2026)

HumanLM Simulates Users via State Alignment

Wu, S., Choi, E., Khatua, A., Wang, Z., He-Yueya, J., Weerasooriya, T. C., Wei, W., Yang, D., Leskovec, J., & Zou, J.

HumanLM builds user simulators by generating natural-language latent states aligned with ground-truth responses, then synthesizing responses from those aligned states.

Paper:

HumanLM: Simulating Users with State Alignment Beats Response Imitation

arXiv (2026)

CooperBench Benchmarks Why Coding Agents Aren’t Teammates Yet

Khatua, A., Zhu, H., Tran, P., Prabhudesai, A., Sadrieh, F., Lieberwirth, J. K., Yu, X., Fu, Y., Ryan, M. J., Pei, J., & Yang, D.

CooperBench introduces a benchmark of cooperative coding tasks in real open-source repositories to quantify the curse of coordination in multi-agent coding.

Paper:

CooperBench: Why Coding Agents Cannot be Your Teammates Yet

In Progress (Targeting ICML 2026)

Multilingual SWE-smith Scales Bug Generation Across Programming Languages

Khatua, A.*, Li, X.*, Shethia, P.*, Li, Z.*, & Yang, J.

Extended SWE-smith by automating test-environment construction for JavaScript/TypeScript, Java, Rust, and C++. The system scales procedural bug generation to any repository, producing mid-training data that improves Multi-SWE-Bench performance across multiple programming languages.

Paper:

Multilingual SWE-smith

EMNLP 2025

Study Reveals 3.3% of Wikipedia Facts Contradict Each Other

Semnani, S. J., Burapacheep, J., Khatua, A., Atchariyachanvanit, T., Wang, Z., & Lam, M. S.

Stanford researchers have discovered that at least 3.3 percent of English Wikipedia facts contradict other information in the corpus. This amounts to millions of inconsistent statements across the encyclopedia. The team introduced CLAIRE, an AI system that helps Wikipedia editors identify 64.7 percent more inconsistencies than traditional methods.
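Corpus-level inconsistency detection reduces to judging whether pairs of claims contradict each other. The sketch below shows that scan with a pluggable pairwise judge; the toy judge is purely illustrative — CLAIRE uses an LLM judge with retrieval to avoid comparing all pairs.

```python
from itertools import combinations

def find_inconsistencies(claims, contradicts):
    """Scan every claim pair with a contradiction judge and return the
    contradicting pairs. `contradicts` is any (claim, claim) -> bool judge."""
    return [(a, b) for a, b in combinations(claims, 2) if contradicts(a, b)]

# Toy judge: same subject (first token) but different statements.
toy_judge = lambda a, b: a.split()[0] == b.split()[0] and a != b

pairs = find_inconsistencies(
    ["Everest height 8848m", "Everest height 8849m", "K2 height 8611m"],
    toy_judge)
```

The all-pairs scan is quadratic, which is why narrowing candidate pairs before invoking an expensive judge matters at Wikipedia scale.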

Paper:

Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models

NeurIPS 2025 Workshop (Oral)

Video Understanding System Achieves 3% Improvement Without Architecture Changes

Durante, Z., Singh, S., Khatua, A., Agarwal, S., Tan, R., Lee, Y. J., Gao, J., Adeli, E., & Fei-Fei, L.

A data-centric approach for efficient video understanding achieved 3 percent absolute improvement on VideoMME benchmarks under identical compute constraints, without requiring architectural modifications. The method splices short captioned videos into synthetic long-context training samples.
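The splicing step can be sketched as packing short captioned clips into one synthetic long-context sample under a frame budget. The field layout and greedy packing policy here are assumptions, not the paper's exact recipe.

```python
def weave(clips, target_len: int):
    """Pack short (frames, caption) clips into one synthetic long-context
    training sample until the frame budget `target_len` would be exceeded.
    Illustrative sketch of the data-centric splicing idea."""
    frames, captions = [], []
    for clip_frames, caption in clips:
        if len(frames) + len(clip_frames) > target_len:
            break
        frames += clip_frames
        captions.append(caption)
    return frames, " ".join(captions)

sample = weave([([1, 2], "a dog runs"), ([3, 4], "it jumps"),
                ([5, 6], "it rests")], target_len=4)
```

Because only the training data changes, the same model and compute budget can be reused unchanged — which is the point of the reported architecture-free improvement.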

Paper:

VideoWeave: A Data-Centric Approach for Efficient Video Understanding

Panasonic

Multi-Agent Video QA System Improves Zero-Shot Performance by 6%

Kugo, N., Li, X., Li, Z., Gupta, A., Khatua, A., Jain, N., Patel, C., Kyuragi, Y., Ishii, Y., Tanabiki, M., Kozuka, K., & Adeli, E.

A multi-agent framework for video question answering with role specialization improved zero-shot performance by up to 6 percent over prior state-of-the-art methods through collaborative reasoning and information aggregation.
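The aggregation step of such a framework can be sketched as combining answers from role-specialized agents. A bare majority vote is the simplest version; the role names are assumptions, and the actual framework aggregates through collaborative reasoning rather than a vote.

```python
from collections import Counter

def aggregate(answers) -> str:
    """Combine per-agent answers by majority vote (simplest possible
    aggregation; stands in for the framework's collaborative step)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical role-specialized agents answering one video question:
vote = aggregate({"caption_agent": "B", "temporal_agent": "B",
                  "object_agent": "C"}.values())
```

Role specialization matters because each agent sees the question through a different modality (captions, temporal structure, objects), so their errors are less correlated than a single model's.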

Paper:

VideoMultiAgents: A Multi-Agent Framework for Video Question Answering

Largest Academic Graph Dataset Created: 162× Larger Than Prior Benchmarks

Khatua, A., Mailthody, V., Taleka, B., Song, X., Ma, T., Bigaj, P., & Hwu, W.

Researchers created the Illinois Graph Benchmark (IGB), the largest academic GNN dataset, with 260 million nodes, 4 billion edges, and 220 million labeled nodes. The dataset is 162 times larger than prior public benchmarks and has been adopted as an MLPerf industry standard for GNN benchmarking.

Paper:

IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research

KDD 2023

CLASSIFIED AD

GPU

WANTED

More GPU hours. Will trade sanity. Contact: desperate@stanford.edu

Seeking: A100s, H100s, or any GPU that doesn't crash during training. Willing to negotiate: firstborn child, coffee supply, or eternal gratitude.

Current situation: Running experiments on a potato. Results may vary. Desperation level: Critical.

References available upon request. Previous GPU owners: please don't ask.

·· ·—·· ·· ·· · / · ·· ·· / · ··

RESEARCH MARKET
COFFEE ↑ 247% · SLEEP ↓ 62% · REJECTIONS: 12 · GPU HOURS: ∞ · BUGS FIXED: 47 · BUGS CREATED: 52 · TO READ PILE ↑ 340% · REVIEWER 2: N/A