THE DISPATCH

Vol. 1, No. 1 · Saturday, December 6, 2025 · All the News Fit to Print
In Progress (Targeting ICML 2026)
New Benchmark Reveals Coordination as Fundamental Challenge for Multi-Agent AI Systems

Khatua, A.¹, Zhu, H.¹, Tran, P.², Prabhudesai, A.², Yu, X.¹, Sadrieh, F.², Lieberwirth, J. K.², Fu, Y.¹, Ryan, M. J.¹, Pei, J.¹, & Yang, D.¹ (¹Stanford University, ²SAP)

Initial results from Cotomata, a multi-agent collaboration benchmark, show that LLM coordination in version-controlled programming environments degrades sharply: models drop from 70-75% accuracy (single-agent) to 20-30% (multi-agent), with the majority of failures attributed to mismatched shared-state assumptions. The benchmark includes multi-agent interaction protocols and a failure-analysis pipeline.

Paper:

The Curse of Coordination: Why Current AI Cannot be Your Teammates
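
The shared-state failure mode is easy to see in miniature. Below is a toy sketch (hypothetical names throughout, not the Cotomata harness): two agents each plan an edit against their own snapshot of a shared repository, and the second agent's assumption goes stale the moment the first agent's edit lands.

```python
# Toy illustration of a mismatched shared-state assumption between two
# agents editing one "repository". Hypothetical sketch, not Cotomata code.

repo = {"config.py": "TIMEOUT = 30"}

class Agent:
    def __init__(self, name, repo_snapshot):
        self.name = name
        # Each agent plans against its own frozen snapshot of shared state.
        self.view = dict(repo_snapshot)

    def plan_edit(self, path, new_text):
        # The plan assumes the file still matches the snapshot.
        return {"path": path, "expected": self.view[path], "new": new_text}

def apply_edit(repo, edit):
    actual = repo[edit["path"]]
    if actual != edit["expected"]:
        # Coordination failure: the world changed under this agent.
        print(f"CONFLICT: expected {edit['expected']!r}, found {actual!r}")
        return False
    repo[edit["path"]] = edit["new"]
    return True

a = Agent("alice", repo)
b = Agent("bob", repo)          # both snapshot the same initial state

apply_edit(repo, a.plan_edit("config.py", "TIMEOUT = 60"))  # succeeds
apply_edit(repo, b.plan_edit("config.py", "TIMEOUT = 10"))  # stale view, conflicts
```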

In Progress (Targeting ICML 2026)

Hierarchical RL User Simulation Improves Alignment by 36%

Khatua, A.*, Wu, S.*, Choi, E.*, Wang, H., He-Yueva, J., Weerasooriya, C., Wei, W., Yang, D., Leskovec, J., & Zou, J.*

HumanLM introduces hierarchical RL modules for user simulation, with initial results showing improved alignment with real user responses. The system raises LLM-judge similarity by 36% over SFT baselines through multi-level generation training and carefully designed stance and style reward functions.

Paper:

HumanLM: Building Digital Humans from Large Language Models
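
As a rough illustration of the reward design, here is a minimal sketch of a composite stance-plus-style reward. The weights, the exact-match stance scorer, and the token-overlap style proxy are all assumptions for illustration, not HumanLM's actual reward functions.

```python
# Hypothetical sketch of a composite reward mixing stance agreement and
# style similarity, in the spirit of HumanLM's reward design.

def stance_reward(pred_stance: str, gold_stance: str) -> float:
    # 1.0 if the generated reply takes the same stance as the real user.
    return 1.0 if pred_stance == gold_stance else 0.0

def style_reward(pred_text: str, gold_text: str) -> float:
    # Crude proxy: token-level Jaccard overlap standing in for a learned
    # style scorer.
    a, b = set(pred_text.lower().split()), set(gold_text.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def total_reward(pred_text, pred_stance, gold_text, gold_stance,
                 w_stance=0.6, w_style=0.4):
    # Weighted sum of the two signals; weights are illustrative.
    return (w_stance * stance_reward(pred_stance, gold_stance)
            + w_style * style_reward(pred_text, gold_text))

print(total_reward("I strongly agree, great point!", "agree",
                   "Totally agree, great point.", "agree"))
```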

In Progress (Targeting ICML 2026)

Multilingual SWE-smith Scales Bug Generation Across Programming Languages

Khatua, A.*, Li, X.*, Shethia, P.*, Li, Z.*, & Yang, J.

Extended SWE-smith by automating test-environment construction for JavaScript/TypeScript, Java, Rust, and C++. The system scales procedural bug generation to arbitrary repositories, producing mid-training data that improves Multi-SWE-Bench performance across programming languages.

Paper:

Multilingual SWE-smith
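
To make "procedural bug generation" concrete, here is a minimal sketch of one mutation operator: flipping a comparison operator so that previously passing tests should fail. It is written in Python for brevity; SWE-smith's actual mutators and the JavaScript/TypeScript, Java, Rust, and C++ tooling are more involved.

```python
# Minimal sketch of procedural bug generation: mutate a comparison
# operator in parsed source so existing tests should catch the bug.

import ast

class FlipComparison(ast.NodeTransformer):
    SWAP = {ast.Lt: ast.LtE, ast.LtE: ast.Lt, ast.Gt: ast.GtE,
            ast.GtE: ast.Gt, ast.Eq: ast.NotEq, ast.NotEq: ast.Eq}

    def visit_Compare(self, node):
        # Replace the first comparison operator with its off-by-one twin.
        op_type = type(node.ops[0])
        if op_type in self.SWAP:
            node.ops[0] = self.SWAP[op_type]()
        return node

src = "def in_range(x, lo, hi):\n    return lo <= x < hi\n"
tree = FlipComparison().visit(ast.parse(src))
print(ast.unparse(tree))   # lo < x < hi -- a plausible off-by-one bug
```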

EMNLP 2025

Study Reveals 3.3% of Wikipedia Facts Contradict Each Other

Semnani, S. J., Burapacheep, J., Khatua, A., Atchariyachanvanit, T., Wang, Z., & Lam, M. S.

Stanford researchers have discovered that at least 3.3 percent of English Wikipedia facts contradict other information in the corpus. This amounts to millions of inconsistent statements across the encyclopedia. The team introduced CLAIRE, an AI system that helps Wikipedia editors identify 64.7 percent more inconsistencies than traditional methods.

Paper:

Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models
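
For intuition, a toy stand-in for corpus-level inconsistency detection: extract simple (entity, attribute, value) facts and flag pairs that disagree. CLAIRE itself uses LLM-based retrieval and verification; the regex extraction and sample sentences below are illustrative assumptions.

```python
# Toy sketch of corpus-level inconsistency detection: collect simple
# numeric facts and flag entities with conflicting values.

import re
from collections import defaultdict

sentences = [
    "The Eiffel Tower is 330 metres tall.",
    "The Eiffel Tower is 324 metres tall.",
    "Mount Fuji is 3776 metres tall.",
]

facts = defaultdict(set)
for s in sentences:
    m = re.match(r"(.+?) is (\d+) metres tall", s)
    if m:
        entity, value = m.group(1), int(m.group(2))
        facts[(entity, "height_m")].add(value)

for (entity, attr), values in facts.items():
    if len(values) > 1:
        # Two statements assert different values for the same attribute.
        print(f"Inconsistency: {entity} {attr} -> {sorted(values)}")
```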

NeurIPS 2025 Workshop (Oral)

Video Understanding System Achieves 3% Improvement Without Architecture Changes

Durante, Z., Singh, S., Khatua, A., Agarwal, S., Tan, R., Lee, Y. J., Gao, J., Adeli, E., & Fei-Fei, L.

A data-centric approach for efficient video understanding achieved a 3 percent absolute improvement on the VideoMME benchmark under identical compute constraints, without requiring architectural modifications. The method splices short captioned videos into synthetic long-context training samples.

Paper:

VideoWeave: A Data-Centric Approach for Efficient Video Understanding
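
The splicing idea can be sketched in a few lines: concatenate short captioned clips into one synthetic long-video sample, shifting each caption's timestamps by the running offset. Field names below are hypothetical, not VideoWeave's actual data format.

```python
# Sketch of splicing short captioned clips into one synthetic
# long-context training sample.

clips = [
    {"frames": 120, "fps": 30, "caption": "A dog fetches a ball."},
    {"frames": 240, "fps": 30, "caption": "A chef chops onions."},
    {"frames": 180, "fps": 30, "caption": "A train leaves the station."},
]

def weave(clips):
    sample, offset = {"segments": []}, 0.0
    for clip in clips:
        dur = clip["frames"] / clip["fps"]
        sample["segments"].append({
            "start": round(offset, 2),        # caption now carries its
            "end": round(offset + dur, 2),    # position in the long video
            "caption": clip["caption"],
        })
        offset += dur
    sample["duration"] = round(offset, 2)
    return sample

print(weave(clips))
```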

Panasonic

Multi-Agent Video QA System Improves Zero-Shot Performance by 6%

Kugo, N., Li, X., Li, Z., Gupta, A., Khatua, A., Jain, N., Patel, C., Kyuragi, Y., Ishii, Y., Tanabiki, M., Kozuka, K., & Adeli, E.

A multi-agent framework for video question answering with role specialization improved zero-shot performance by up to 6 percent over prior state-of-the-art methods through collaborative reasoning and information aggregation.

Paper:

VideoMultiAgents: A Multi-Agent Framework for Video Question Answering
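
A minimal sketch of the role-specialization idea, with stub agents and a majority-vote aggregator standing in for the paper's collaborative reasoning (the roles, the canned evidence, and the vote rule are illustrative assumptions):

```python
# Role-specialized agents answer a video question; an aggregator
# combines their votes. Illustrative stubs only.

from collections import Counter

def caption_agent(question, evidence):
    return evidence.get("captions_answer")

def temporal_agent(question, evidence):
    return evidence.get("temporal_answer")

def object_agent(question, evidence):
    return evidence.get("objects_answer")

def aggregate(question, evidence, agents):
    votes = [a(question, evidence) for a in agents]
    answer, count = Counter(v for v in votes if v).most_common(1)[0]
    return answer, count / len(votes)

evidence = {"captions_answer": "B", "temporal_answer": "B",
            "objects_answer": "C"}
print(aggregate("What happens after the goal?", evidence,
                [caption_agent, temporal_agent, object_agent]))
# -> ('B', 0.67): the majority answer with its vote share
```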

KDD 2023

Largest Academic Graph Dataset Created: 162× Larger Than Prior Benchmarks

Khatua, A., Mailthody, V., Taleka, B., Song, X., Ma, T., Bigaj, P. & Hwu, W.

Researchers created the Illinois Graph Benchmark (IGB), the largest academic GNN dataset with 260 million nodes, 4 billion edges, and 220 million labels. The dataset is 162 times larger than prior datasets and has been adopted as an MLPerf industry standard for GNN benchmarking.

Paper:

IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research
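
For a sense of scale, a back-of-envelope sketch using the numbers above (the int64 edge-list storage estimate is an assumption, not a figure from the paper):

```python
# Back-of-envelope view of IGB's scale, from the counts in the summary.

nodes = 260_000_000
edges = 4_000_000_000
labels = 220_000_000

avg_degree = edges / nodes           # ~15.4 edges per node
labeled_frac = labels / nodes        # ~85% of nodes carry labels
edge_list_gb = edges * 2 * 8 / 1e9   # (src, dst) as int64 -> ~64 GB

print(f"avg degree {avg_degree:.1f}, labeled {labeled_frac:.0%}, "
      f"edge list ~{edge_list_gb:.0f} GB before features")
```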

CLASSIFIED AD

GPU WANTED

More GPU hours. Will trade sanity. Contact: desperate@stanford.edu

Seeking: A100s, H100s, or any GPU that doesn't crash during training. Willing to negotiate: firstborn child, coffee supply, or eternal gratitude.

Current situation: Running experiments on a potato. Results may vary. Desperation level: Critical.

References available upon request. Previous GPU owners: please don't ask.

·· ·—·· ·· ·· · / · ·· ·· / · ··

RESEARCH MARKET
COFFEE ↑ 247% · SLEEP ↓ 62% · REJECTIONS: 12 · GPU HOURS: ∞ · BUGS FIXED: 47 · BUGS CREATED: 52 · TO READ PILE ↑ 340% · REVIEWER 2: N/A